# ML model for Keyword Classification - Tech Notebook
This notebook introduces (1) how to explore, prepare and preprocess the datasets; (2) how to train and evaluate the ML model; and (3) how to use this trained ML model, for technical audiences.
## Problem Description
The AODN catalogue $C=\{M, K, P\}$ serves as a platform for storing datasets and their associated metadata. $M=\{m_1,m_2,\ldots, m_x\}$ is the set of metadata records that describe the datasets in the catalogue $C$. $K=\{k_1, k_2, \ldots, k_y\}$ is a set of predefined keywords used to categorise datasets. In the catalogue, a subset of metadata records, $M_t \subseteq M$, has not yet been categorised with keywords: $K_i = \emptyset$ for all $m_i \in M_t$. Another subset, $M_s \subseteq M$, contains records that have already been categorised with keywords (i.e., $K_i \neq \emptyset$ for all $m_i \in M_s$). The research question is as follows:

How can we design and develop a machine learning model, denoted as $MM_{keywords}$, that automatically labels the uncategorised metadata records $M_t$ using a predefined set of keywords $K$? Specifically, the model should be trained to learn a mapping rule $d_i \mapsto K_i$ based on the observed patterns from the sample set $M_s$, where each description $d_i$ of a metadata record $m_i \in M_s$ is associated with a set of keywords $K_i$. Once trained, the model should be able to apply this learned mapping to accurately categorise the records in $M_t$ by assigning their corresponding keywords based on the records' descriptions.

To simplify the task, we restrict the scope of keywords to those in the primary AODN vocabularies:
- AODN Instrument Vocabulary
- AODN Discovery Parameter Vocabulary
- AODN Platform Vocabulary

Only keywords $k_j \in K_i$ that are part of the listed AODN vocabularies will be considered. Any keyword not belonging to these vocabularies will be excluded from $K_i$ for all metadata records in the categorised metadata set $M_s$.

### Formal Definitions
- **Definition 1: A metadata record $m_i=(d_i, K_i), m_i \in M$** is a record describing a dataset. Specifically, $i$ is the unique identifier of the record. $d_i$ is a textual abstract that serves as the description of the dataset. $K_i \subseteq K$ is a subset of keywords used to label the dataset.
- **Definition 2: An abstract $d_i$** is a piece of textual information used to describe the dataset. The embedding $\mathbf{d_i}$ is a vector representation of the textual description $d_i$, calculated using the "bert-base-uncased" model. The embedding vectors of all abstracts share the same dimensionality, denoted as $dim=|\mathbf{d_i}|$. A feature matrix $\mathbf{X}$ of shape $|M_s| \times dim$ aggregates the embeddings of the abstracts of all samples in $M_s$, where $|M_s|$ is the total number of metadata records in $M_s$.
- **Definition 3: A keyword $k_j$** is a predefined label used for categorising datasets. Each metadata record $m_i$ is associated with a set of keywords $K_i \subseteq K$, where $K$ is the complete set of predefined keywords. The keywords $K_i$ of a metadata record $m_i$ are mathematically represented as a binary vector $y_i$ of size $|K|$, where each element indicates the presence or absence of a specific label. A value of 1 at position $j$ denotes that the label $k_j \in K$ is present in the metadata record $m_i$ (that is, $k_j \in K_i$), while a value of 0 indicates its absence. A target matrix $\mathbf{Y}$ is a $|M_s| \times |K|$ binary matrix, where $|M_s|$ is the size of the metadata record set $M_s$ and $|K|$ is the size of the keyword set $K=\{k_1, k_2, \ldots, k_y\}$. Each entry $\mathbf{Y}[i, j]$ is 1 if metadata record $m_i$ is associated with keyword $k_j$, and 0 otherwise.
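
To make Definition 3 concrete, the following is a minimal sketch of how keyword sets $K_i$ could be turned into the binary target matrix $\mathbf{Y}$. The function name `build_target_matrix` and the toy keywords are hypothetical and only serve as illustration; the notebook itself relies on `preprocessor.prepare_X_Y` for this step.

```python
def build_target_matrix(keyword_sets, all_keywords):
    """Return a |M_s| x |K| binary matrix as a list of rows."""
    # position j of every row corresponds to keyword k_j
    index = {k: j for j, k in enumerate(all_keywords)}
    Y = []
    for K_i in keyword_sets:
        row = [0] * len(all_keywords)
        for k in K_i:
            row[index[k]] = 1  # k_j is present in K_i
        Y.append(row)
    return Y


# hypothetical toy data: three predefined keywords, two labelled records
K = ["temperature", "salinity", "mooring"]
M_s_keywords = [{"temperature", "mooring"}, {"salinity"}]

Y = build_target_matrix(M_s_keywords, K)
print(Y)
```

Each row of `Y` is the vector $y_i$ of one record, and each column corresponds to one keyword $k_j$.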



In [1]:
# add module path for notebook to use
import sys
import os

module_path = os.path.abspath(os.path.join('..'))
# register each package folder separately; os.path.join keeps the paths portable across operating systems
for sub_dir in ("utils", "model", "common"):
    package_path = os.path.join(module_path, "data_discovery_ai", sub_dir)
    if package_path not in sys.path:
        sys.path.append(package_path)

current_path = os.getcwd()

# import modules
import preprocessor
import keywordModel
import constants
import es_connector






As shown in the [framework](data-discovery-ai-framework.drawio.png), three distinct but connected modules work cooperatively as the keyword classifier pipeline. This notebook will go through the functions in these modules to show how we preprocess data, train the ML model, and make predictions.
## Data Preprocessing
The data preprocessing module is used to prepare data for training and testing models. Key features include: getting raw data, preparing sample data, converting textual data to numeric representations, resampling, and preparing input and output matrices.
### Getting Raw Data
Raw data refers to all metadata records $M$ stored in Elasticsearch. An Elasticsearch configuration file `esManager.ini` must be created in the folder `data_discovery_ai/common`, with two required fields: `end_point` and `api_key`. For more information, please refer to [README](../README.md#file-structure). We first fetch raw data from Elasticsearch.
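
As an illustration of the configuration file described above, the sketch below parses an in-memory example and checks that both required fields are present. Only the field names `end_point` and `api_key` come from this notebook; the section name `elasticsearch` and both values are placeholders assumed for the example, so please check the README for the actual layout.

```python
import configparser

# Hypothetical contents of esManager.ini: the section name and the values
# are assumptions, only the two field names are taken from this notebook.
EXAMPLE_INI = """
[elasticsearch]
end_point = https://example.invalid:9243
api_key = REPLACE_WITH_YOUR_KEY
"""

es_config_example = configparser.ConfigParser()
es_config_example.read_string(EXAMPLE_INI)

# verify that the two required fields exist before trying to connect
section = es_config_example["elasticsearch"]
missing = [field for field in ("end_point", "api_key") if field not in section]
print(missing)
```

An empty `missing` list means the file has both required fields.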

In [2]:
# load Elasticsearch configuration
import configparser
from pathlib import Path

def load_es_config() -> configparser.ConfigParser:
    elasticsearch_config_file_path = f"../data_discovery_ai/common/{constants.ELASTICSEARCH_CONFIG}"
    esConfig = configparser.ConfigParser()
    esConfig.read(elasticsearch_config_file_path)
    return esConfig

In [3]:
# connect and query Elasticsearch
esConfig = load_es_config()
client = es_connector.connect_es(esConfig)
index = os.getenv("ES_INDEX_NAME", default=constants.ES_INDEX_NAME)
raw_data = es_connector.search_es(client=client, index=index, batch_size=constants.BATCH_SIZE, sleep_time=constants.SLEEP_TIME)

searching elasticsearch: 100%|██████████| 129/129 [11:49<00:00,  5.50s/it]


In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12952 entries, 0 to 12951
Data columns (total 42 columns):
 #   Column                                             Non-Null Count  Dtype 
---  ------                                             --------------  ----- 
 0   _index                                             12952 non-null  object
 1   _id                                                12952 non-null  object
 2   _score                                             0 non-null      object
 3   sort                                               12952 non-null  object
 4   _source.title                                      12952 non-null  object
 5   _source.description                                12952 non-null  object
 6   _source.extent.bbox                                12952 non-null  object
 7   _source.extent.temporal                            12952 non-null  object
 8   _source.summaries.score                            12952 non-null  int64 
 9   _source.summaries

There are **12952** metadata records in the staging environment. We can also check that **1721** items have no keyword information.

In [5]:
no_keyword_items = raw_data[raw_data['_source.themes'].apply(lambda x: x == [])]
no_keyword_items_count = no_keyword_items.shape[0]
no_keyword_items_count

1721

### Identify Samples
The sample set is a subset of the raw dataset: a set $M_s$ of metadata records whose keywords include terms from particular AODN vocabularies. We first identify samples from the raw data, and then preprocess the sample set.

In [6]:
# get predefined vocabs
def load_keyword_config() -> configparser.ConfigParser:
    keyword_config_file_path = f"../data_discovery_ai/common/{constants.KEYWORD_CONFIG}"
    keywordConfig = configparser.ConfigParser()
    keywordConfig.read(keyword_config_file_path)
    return keywordConfig
keywordConfig = load_keyword_config()
vocabs = keywordConfig["preprocessor"]["vocabs"].split(", ")
vocabs

['AODN Discovery Parameter Vocabulary', 'AODN Platform Vocabulary']

The identified sample labels have the following format:

In [7]:
# identify samples with predefined vocabs
identified_sampleSet = preprocessor.identify_km_sample(raw_data, vocabs)
identified_sampleSet.iloc[0]["keywords"]

[{'concepts': [{'id': 'Oceans | Ocean Temperature | Water Temperature',
    'url': None},
   {'id': 'Oceans | Ocean Optics | Photosynthetically Active Radiation',
    'url': None},
   {'id': 'Oceans | Ocean Optics | Turbidity', 'url': None},
   {'id': 'Atmosphere | Precipitation | Rain', 'url': None},
   {'id': 'Oceans | Ocean Chemistry | Chlorophyll', 'url': None},
   {'id': 'Oceans | Salinity/density | Salinity', 'url': None}],
  'scheme': 'theme',
  'description': 'GCMD',
  'title': 'NASA/Global Change Master Directory Earth Science Keywords Version 5.3.8'},
 {'concepts': [{'id': 'Buoys | Moored Buoys', 'url': None},
   {'id': 'Fluorometers', 'url': None},
   {'id': 'CTD (Conductivity-Temperature-Depth Profilers)', 'url': None}],
  'scheme': '',
  'description': 'MCP',
  'title': 'Marine Community Profile of ISO19115 v1.4 Collection Methods Vocabulary (Annex C.1.3)'},
 {'concepts': [{'id': 'IMOS Platform | NRSDAR | Darwin National Reference Station Mooring',
    'url': None},
   {'i

The keywords are in a nested JSON format, so we need to flatten them and remove keywords that are not in the target vocabularies.
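
The flattening step can be sketched roughly as follows. This is not the implementation of `preprocessor.sample_preprocessor`: the function name `flatten_keywords` is hypothetical, and matching a theme to a vocabulary by its `title` field is an assumption made for this example. The input and output shapes follow the nested themes and the flat `vocab_type`/`value`/`url` records shown in this notebook.

```python
def flatten_keywords(themes, vocabs):
    """Flatten nested themes into a list of keyword dicts, keeping only target vocabularies."""
    flat = []
    for theme in themes:
        # assumption: the theme title identifies the vocabulary it belongs to
        if theme.get("title") not in vocabs:
            continue
        for concept in theme.get("concepts", []):
            flat.append(
                {
                    "vocab_type": theme["title"],
                    "value": concept["id"],
                    "url": concept.get("url"),
                }
            )
    return flat


# hypothetical toy record with one out-of-scope theme and one target theme
toy_themes = [
    {
        "concepts": [{"id": "Oceans | Ocean Optics | Turbidity", "url": None}],
        "title": "NASA/Global Change Master Directory Earth Science Keywords Version 5.3.8",
    },
    {
        "concepts": [{"id": "mooring", "url": "http://example.invalid/platform/1"}],
        "title": "AODN Platform Vocabulary",
    },
]
toy_vocabs = ["AODN Discovery Parameter Vocabulary", "AODN Platform Vocabulary"]

flat_keywords = flatten_keywords(toy_themes, toy_vocabs)
print(flat_keywords)
```

Records whose themes all fall outside the target vocabularies end up with an empty list.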

In [8]:
preprocessed_SampleSet = preprocessor.sample_preprocessor(identified_sampleSet, vocabs)
preprocessed_SampleSet

Unnamed: 0,id,title,description,keywords,information
12,006bb7dc-860b-4b89-bf4c-6bd930bd35b7,IMOS - ANMN National Reference Stations - Darw...,This collection includes observations transmit...,"[{'vocab_type': 'AODN Platform Vocabulary', 'v...",IMOS - ANMN National Reference Stations - Darw...
16,0094682a-e438-41e8-a39b-19cf2093025d,Thursday Island Wind From 08 Feb 2012,This data set was collected by weather sensors...,[{'vocab_type': 'AODN Discovery Parameter Voca...,Thursday Island Wind From 08 Feb 2012 [SEP] Th...
20,00a64d43-86a8-4f2b-89e6-40f1abf288f6,Cumulative Pressures on the Distinctive Values...,A report was developed by the Western Australi...,[],Cumulative Pressures on the Distinctive Values...
28,00fee0c8-6203-4271-8d46-f36c075fa6cf,IMOS SOOP Underway Data from AIMS Vessel RV So...,'Ships of Opportunity' (SOOP) is a facility of...,[{'vocab_type': 'AODN Discovery Parameter Voca...,IMOS SOOP Underway Data from AIMS Vessel RV So...
34,0145df96-3847-474b-8b63-a66f0e03ff54,Statewide Marine Habitat Map 2023,The Statewide Marine Habitat Map 2023 was deve...,"[{'vocab_type': 'AODN Platform Vocabulary', 'v...",Statewide Marine Habitat Map 2023 [SEP] The St...
...,...,...,...,...,...
12823,ff50ae2f-0f79-4eaa-806c-8954ab0e545b,One Tree Island Air Pressure From 18 Nov 2008 ...,The 'Wireless Sensor Networks Facility' (forme...,[{'vocab_type': 'AODN Discovery Parameter Voca...,One Tree Island Air Pressure From 18 Nov 2008 ...
12833,ffb04265-eb2a-4eea-943f-ef4cd2dd9531,Chemical microenvironment within complex multi...,-- Layton et al. Chemical microenvironments wi...,[{'vocab_type': 'AODN Discovery Parameter Voca...,Chemical microenvironment within complex multi...
12838,ffd235e6-814e-477e-b324-60b44ef8ea11,IMOS SOOP Underway Data from AIMS Vessel RV So...,'Ships of Opportunity' (SOOP) is a facility of...,[{'vocab_type': 'AODN Discovery Parameter Voca...,IMOS SOOP Underway Data from AIMS Vessel RV So...
12840,ffe3c79d-0b1a-49cc-9995-5057dc1eb8f5,IMOS SOOP Underway Data from AIMS Vessel RV So...,'Ships of Opportunity' (SOOP) is a facility of...,[{'vocab_type': 'AODN Discovery Parameter Voca...,IMOS SOOP Underway Data from AIMS Vessel RV So...


Next we clean the sample set. For instance, the row at index `20` has an empty keywords field (`[]`). After removing records with empty keywords, **1785** labelled samples remain.

In [10]:
filtered_sampleSet = preprocessed_SampleSet[preprocessed_SampleSet["keywords"].apply(lambda x: x != [])]
filtered_sampleSet.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1785 entries, 12 to 12841
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           1785 non-null   object
 1   title        1785 non-null   object
 2   description  1785 non-null   object
 3   keywords     1785 non-null   object
 4   information  1785 non-null   object
dtypes: object(5)
memory usage: 83.7+ KB


Then we calculate embeddings of the `information` column, which concatenates each record's title and description. These embeddings form the input feature matrix.

In [11]:
finalSampleSet = preprocessor.calculate_embedding(filtered_sampleSet)







100%|██████████| 1785/1785 [4:25:16<00:00,  8.92s/it]  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ds["embedding"] = ds["information"].progress_apply(


In [13]:
# save as local file to reduce debugging/experimental time
# preprocessor.save_to_file(finalSampleSet, "keyword_raw_data.pkl")
final_data = preprocessor.load_from_file("keyword_raw_data.pkl")
final_data

INFO:preprocessor:Load from keyword_raw_data.pkl


Unnamed: 0,id,title,description,keywords,information,embedding
12,006bb7dc-860b-4b89-bf4c-6bd930bd35b7,IMOS - ANMN National Reference Stations - Darw...,This collection includes observations transmit...,"[{'vocab_type': 'AODN Platform Vocabulary', 'v...",IMOS - ANMN National Reference Stations - Darw...,"[-0.7780039, 0.1889344, -0.009652436, -0.32900..."
16,0094682a-e438-41e8-a39b-19cf2093025d,Thursday Island Wind From 08 Feb 2012,This data set was collected by weather sensors...,[{'vocab_type': 'AODN Discovery Parameter Voca...,Thursday Island Wind From 08 Feb 2012 [SEP] Th...,"[-0.52610874, -0.32091716, 0.4605962, -0.10483..."
28,00fee0c8-6203-4271-8d46-f36c075fa6cf,IMOS SOOP Underway Data from AIMS Vessel RV So...,'Ships of Opportunity' (SOOP) is a facility of...,[{'vocab_type': 'AODN Discovery Parameter Voca...,IMOS SOOP Underway Data from AIMS Vessel RV So...,"[-0.6509674, -0.503706, 0.15677778, -0.0664816..."
34,0145df96-3847-474b-8b63-a66f0e03ff54,Statewide Marine Habitat Map 2023,The Statewide Marine Habitat Map 2023 was deve...,"[{'vocab_type': 'AODN Platform Vocabulary', 'v...",Statewide Marine Habitat Map 2023 [SEP] The St...,"[-1.3588586, -0.389985, -0.017488703, -0.20076..."
37,0155375c-8070-4662-9c93-b593ee4891b0,Davies Reef Water Temperature From 18 Oct 1991,The 'Wireless Sensor Networks Facility' (forme...,[{'vocab_type': 'AODN Discovery Parameter Voca...,Davies Reef Water Temperature From 18 Oct 1991...,"[-0.56771225, -0.14031741, 0.060430992, -0.031..."
...,...,...,...,...,...,...
12823,ff50ae2f-0f79-4eaa-806c-8954ab0e545b,One Tree Island Air Pressure From 18 Nov 2008 ...,The 'Wireless Sensor Networks Facility' (forme...,[{'vocab_type': 'AODN Discovery Parameter Voca...,One Tree Island Air Pressure From 18 Nov 2008 ...,"[-0.37659982, -0.27009085, 0.12564549, -0.1331..."
12833,ffb04265-eb2a-4eea-943f-ef4cd2dd9531,Chemical microenvironment within complex multi...,-- Layton et al. Chemical microenvironments wi...,[{'vocab_type': 'AODN Discovery Parameter Voca...,Chemical microenvironment within complex multi...,"[-0.5530796, -0.5692871, -0.36294222, 0.006198..."
12838,ffd235e6-814e-477e-b324-60b44ef8ea11,IMOS SOOP Underway Data from AIMS Vessel RV So...,'Ships of Opportunity' (SOOP) is a facility of...,[{'vocab_type': 'AODN Discovery Parameter Voca...,IMOS SOOP Underway Data from AIMS Vessel RV So...,"[-0.59414643, -0.5531101, 0.17934471, -0.06969..."
12840,ffe3c79d-0b1a-49cc-9995-5057dc1eb8f5,IMOS SOOP Underway Data from AIMS Vessel RV So...,'Ships of Opportunity' (SOOP) is a facility of...,[{'vocab_type': 'AODN Discovery Parameter Voca...,IMOS SOOP Underway Data from AIMS Vessel RV So...,"[-0.66024554, -0.5237442, 0.20254543, -0.09198..."


In [18]:
finalSampleSet.describe()

Unnamed: 0,id,title,description,keywords,information,embedding
count,1785,1785,1785,1785,1785,1785
unique,1785,1776,1570,300,1783,1785
top,006bb7dc-860b-4b89-bf4c-6bd930bd35b7,IMOS SOOP Underway Data from AIMS Vessel RV Ca...,The 'Wireless Sensor Networks Facility' (forme...,[{'vocab_type': 'AODN Discovery Parameter Voca...,IMOS SOOP Underway Data from AIMS Vessel RV So...,"[-0.7780039, 0.1889344, -0.009652436, -0.32900..."
freq,1,2,42,598,2,1


### Prepare Train and Test Sets
We now have the sample set with the extra embedding information. We split it into train and test sets by first preparing the input feature matrix $X$ and the output target matrix $Y$. $X$ is built from the embedding column, and $Y$ is the binary representation of the keywords column.

In [19]:
X, Y, Y_df, labelMap = preprocessor.prepare_X_Y(finalSampleSet)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ds["keywordsMap"] = ds["keywords"].apply(


We have prepared the input feature matrix `X` and the output target matrix `Y`. Additionally, we have `Y_df`, which includes column names for the `Y` matrix, and `labelMap`, which represents the set of predefined keywords. In `labelMap`, the key is an encoded number corresponding to a column name in `Y_df`, and the value is a Concept object. The sample set therefore contains **262** unique keywords. We can inspect a Concept object through its `to_json()` method.

In [28]:
Y_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,252,253,254,255,256,257,258,259,260,261
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1780,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1781,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1782,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1783,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
labelMap.get(261).to_json()

{'vocab_type': 'AODN Discovery Parameter Vocabulary',
 'value': 'directional variance spectral density of waves on the water body',
 'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/39'}

By defining the constant `RARE_LABEL_THRESHOLD`, we can customise which labels count as rare: labels that appear fewer than the threshold number of times across all records in the sample set. Here we set `RARE_LABEL_THRESHOLD=3`, which gives 133 labels that appear fewer than 3 times.

In [23]:
rare_label_index = preprocessor.identify_rare_labels(Y_df, constants.RARE_LABEL_THRESHOLD, list(labelMap.keys()))
len(rare_label_index)

133
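
The idea behind rare-label identification can be sketched as counting how often each label column is set and keeping the columns below the threshold. The function `find_rare_labels` and the toy matrix are hypothetical and are not the actual `preprocessor.identify_rare_labels` code.

```python
def find_rare_labels(Y, threshold):
    """Return the column indices of labels that occur fewer than `threshold` times."""
    n_labels = len(Y[0]) if Y else 0
    # number of records carrying each label (column sums of the binary matrix)
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    return [j for j, count in enumerate(counts) if count < threshold]


# hypothetical toy target matrix: 4 records, 3 labels
Y_toy = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
]

rare = find_rare_labels(Y_toy, threshold=3)
print(rare)
```

In this toy matrix, label 0 occurs four times while labels 1 and 2 occur twice and once, so only the last two are below a threshold of 3.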

Of the 262 unique keywords, 133 appear fewer than `RARE_LABEL_THRESHOLD` times. So we first duplicate the records that carry these rare labels, using a custom resampling strategy.

In [30]:
X_oversampled, Y_oversampled = preprocessor.resampling(
            X_train=X, Y_train=Y, strategy="custom", rare_keyword_index=rare_label_index
        )

INFO:preprocessor:Total samples: 2055
INFO:preprocessor:Dimension: 768
INFO:preprocessor:No. of labels: 262
INFO:preprocessor:X resampled set size: 2055
INFO:preprocessor:Y resampled set size: 2055
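
A minimal sketch of what duplicating rare-label records might look like is given below. The helper `duplicate_rare_records` and its `copies` parameter are assumptions for illustration; the exact duplication rule used by the `"custom"` strategy of `preprocessor.resampling` is not shown in this notebook.

```python
def duplicate_rare_records(X, Y, rare_label_index, copies=1):
    """Append `copies` extra copies of every record that carries at least one rare label."""
    rare = set(rare_label_index)
    X_out, Y_out = list(X), list(Y)
    for x, y in zip(X, Y):
        if any(y[j] == 1 for j in rare):
            X_out.extend([x] * copies)
            Y_out.extend([y] * copies)
    return X_out, Y_out


# hypothetical toy data: 3 records, 2-dimensional features, 3 labels
X_toy = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
Y_toy = [[1, 0, 1], [1, 0, 0], [1, 1, 0]]

X_res, Y_res = duplicate_rare_records(X_toy, Y_toy, rare_label_index=[1, 2])
print(len(X_res), len(Y_res))
```

Records 0 and 2 carry a rare label and are copied once, so the toy set grows from 3 to 5 records while the feature and target rows stay aligned.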


The sample size has now increased from 1785 to 2055, as records with rare labels have been duplicated. We can now split the sample set into train and test sets, following an 80%-20% split.

In [31]:
dim, n_labels, X_train, Y_train, X_test, Y_test = (
            preprocessor.prepare_train_test(X_oversampled, Y_oversampled, keywordConfig)
        )

INFO:preprocessor:Total samples: 2055
INFO:preprocessor:Dimension: 768
INFO:preprocessor:No. of labels: 262
INFO:preprocessor:Train set size: 1637 (79.66%)
INFO:preprocessor:Test set size: 418 (20.34%)


Next, we perform oversampling only on the training set, as we want to avoid introducing training samples into the test set. This ensures the model does not encounter training data during testing.

In [32]:
X_train_oversampled, Y_train_oversampled = preprocessor.resampling(
            X_train=X_train, Y_train=Y_train, strategy="ROS", rare_keyword_index=None
        )

INFO:preprocessor:Total samples: 120736
INFO:preprocessor:Dimension: 768
INFO:preprocessor:No. of labels: 262
INFO:preprocessor:X resampled set size: 120736
INFO:preprocessor:Y resampled set size: 120736


Then we calculate class weights to apply during model training, assigning lower weights to majority classes and higher weights to minority classes.

In [34]:
label_weight_dict = keywordModel.get_class_weights(Y_train)
label_weight_dict

{0: 0.023255813412655504,
 1: 0.199999960000008,
 2: 0.04347825897920613,
 3: 0.06249999609375024,
 4: 0.031249999023437534,
 5: 0.1428571224489825,
 6: 0.0238095232426304,
 7: 0.026315788781163457,
 8: 0.01886792417230332,
 9: 0.034482757431629055,
 10: 0.035714284438775556,
 11: 0.023255813412655504,
 12: 0.02173912996219283,
 13: 0.199999960000008,
 14: 0.08333332638888948,
 15: 0.03999999840000006,
 16: 0.11111109876543349,
 17: 0.020833332899305567,
 18: 0.07692307100591762,
 19: 0.023255813412655504,
 20: 0.09090908264462885,
 21: 0.023255813412655504,
 22: 0.33333322222225925,
 23: 0.12499998437500197,
 24: 0.07692307100591762,
 25: 0.08333332638888948,
 26: 0.499999750000125,
 27: 0.055555552469135974,
 28: 0.034482757431629055,
 29: 0.08333332638888948,
 30: 0.1428571224489825,
 31: 0.05882352595155729,
 32: 0.01886792417230332,
 33: 0.03846153698224858,
 34: 0.16666663888889352,
 35: 0.499999750000125,
 36: 0.055555552469135974,
 37: 0.023255813412655504,
 38: 0.0232558134126
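
The printed weights look consistent with an inverse-frequency rule of the form $w_j = 1 / (n_j + \epsilon)$, where $n_j$ is the number of training records carrying label $j$ and $\epsilon = 10^{-6}$ guards against division by zero. For example, `0.199999960000008` matches $1 / (5 + 10^{-6})$. This is inferred from the numbers only, not from the source of `keywordModel.get_class_weights`, so the sketch below should be read as an assumption.

```python
def inverse_frequency_weights(Y, epsilon=1e-6):
    """Weight each label by the inverse of its frequency (assumed rule, see lead-in)."""
    n_labels = len(Y[0])
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    # frequent labels get small weights, rare labels get large weights
    return {j: 1.0 / (count + epsilon) for j, count in enumerate(counts)}


# hypothetical toy target matrix: label 0 appears 5 times, label 1 twice
Y_toy = [[1, 0], [1, 0], [1, 1], [1, 0], [1, 1]]

weights = inverse_frequency_weights(Y_toy)
print(weights)
```

The minority label receives the larger weight, so its errors contribute more to the loss during training.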

Now, we have prepared all the data we need for training a keyword classification model. Let's move on to the next stage.

## Training and Evaluation of Model
A model name is required for training a model. As mentioned in [README.md](../README.md), available options are: `development`, `experimental`, `staging`, `production`, `benchmark`.

In [35]:
model_name = "development"

In [36]:
trained_model, history, model_name = keywordModel.keyword_model(
            model_name=model_name,
            X_train=X_train,
            Y_train=Y_train,
            X_test=X_test,
            Y_test=Y_test,
            class_weight=label_weight_dict,
            dim=dim,
            n_labels=n_labels,
            params=keywordConfig,
        )

Epoch 1/100
41/41 ━━━━━━━━━━━━━━━━━━━━ 17s 256ms/step - accuracy: 0.0973 - loss: 6.8160e-04 - precision: 0.0265 - recall: 0.5445 - val_accuracy: 0.2744 - val_loss: 0.0751 - val_precision: 0.4185 - val_recall: 0.0702 - learning_rate: 0.0010
Epoch 2/100
41/41 ━━━━━━━━━━━━━━━━━━━━ 9s 202ms/step - accuracy: 0.3281 - loss: 1.5844e-04 - precision: 0.4243 - recall: 0.4985 - val_accuracy: 0.2256 - val_loss: 0.0436 - val_precision: 0.4595 - val_recall: 0.0702 - learning_rate: 0.0010
Epoch 3/100
41/41 ━━━━━━━━━━━━━━━━━━━━ 8s 180ms/step - accuracy: 0.3954 - loss: 1.1015e-04 - precision: 0.5899 - recall: 0.5201 - val_accuracy: 0.2409 - val_loss: 0.0381 - val_precision: 0.4805 - val_recall: 0.0605 - learning_rate: 0.0010
Epoch 4/100
41/41 ━━━━━━━━━━━━━━━━━━━━ 8s 186ms/step - accuracy: 0.4504 - loss: 6.6754e-05 - precision: 0.7092 - recall: 0.6103 - val_accuracy: 0.

Then, we evaluate the trained model.

In [37]:
confidence = keywordConfig.getfloat("keywordModel", "confidence")
top_N = keywordConfig.getint("keywordModel", "top_N")
predicted_labels = keywordModel.prediction(
    X_test, trained_model, confidence, top_N
)
# use a descriptive name to avoid shadowing the built-in eval()
evaluation_result = keywordModel.evaluation(
    Y_test=Y_test, predictions=predicted_labels
)
evaluation_result

14/14 ━━━━━━━━━━━━━━━━━━━━ 2s 115ms/step


{'precision': '0.8363',
 'recall': '0.8293',
 'f1': '0.8328',
 'hammingloss': '0.0070',
 'Jaccard Index': '0.7639',
 'accuracy': '0.6172'}

The model reaches a precision of 0.8363, a recall of 0.8293 and an F1 score of 0.8328 on the test set (about 84%, 83% and 83% respectively), which is a reasonable starting point. We can still try different hyperparameters to improve model performance. Please refer to [README.md](../README.md) for hyperparameter descriptions. To adjust them, edit the file `data_discovery_ai/common/keyword_classification_parameters.ini`.
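
For readers unfamiliar with multi-label metrics, here is a small pure-Python sketch of how the reported quantities can be computed from binary matrices. The averaging choices (micro-averaged precision, recall and F1, sample-averaged Jaccard index, exact-match accuracy) are assumptions for this example; `keywordModel.evaluation` may use different settings.

```python
def multilabel_metrics(Y_true, Y_pred):
    """Compute common multi-label metrics from two binary matrices of equal shape."""
    tp = fp = fn = exact = 0
    jaccard_sum = 0.0
    for t_row, p_row in zip(Y_true, Y_pred):
        row_tp = sum(1 for t, p in zip(t_row, p_row) if t == 1 and p == 1)
        row_fp = sum(1 for t, p in zip(t_row, p_row) if t == 0 and p == 1)
        row_fn = sum(1 for t, p in zip(t_row, p_row) if t == 1 and p == 0)
        tp, fp, fn = tp + row_tp, fp + row_fp, fn + row_fn
        union = row_tp + row_fp + row_fn
        # a record with no true and no predicted labels counts as a perfect match
        jaccard_sum += row_tp / union if union else 1.0
        exact += int(row_fp == 0 and row_fn == 0)
    n_rows = len(Y_true)
    n_cells = n_rows * len(Y_true[0])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hammingloss": (fp + fn) / n_cells,  # fraction of wrongly predicted cells
        "Jaccard Index": jaccard_sum / n_rows,
        "accuracy": exact / n_rows,  # share of records predicted exactly right
    }


# hypothetical toy example: 2 records, 4 labels
Y_true_toy = [[1, 0, 1, 0], [0, 1, 0, 0]]
Y_pred_toy = [[1, 0, 0, 0], [0, 1, 1, 0]]

metrics = multilabel_metrics(Y_true_toy, Y_pred_toy)
print(metrics)
```

Exact-match accuracy is the strictest of these measures, because a record only counts as correct when every one of its labels is predicted correctly. That is one reason it can sit well below the F1 score.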

## Make Prediction

Now that we have a trained model, let's use it to make a prediction. Suppose we have an unlabelled item titled *"Corals and coral communities of Lord Howe Island, Australia"*, whose abstract describes ecological and taxonomic surveys of hermatypic scleractinian corals carried out at approximately 100 sites around Lord Howe Island. The full abstract is given in the next cell.

In [41]:
item_title = "Corals and coral communities of Lord Howe Island, Australia"
item_abstract = """Ecological and taxonomic surveys of hermatypic scleractinian corals were carried out at approximately 100 sites around Lord Howe Island. Sixty-six of these sites were located on reefs in the lagoon, which extends for two-thirds of the length of the island on the western side. Each survey site consisted of a section of reef surface, which appeared to be topographically and faunistically homogeneous. The dimensions of the sites surveyed were generally of the order of 20m by 20m. Where possible, sites were arranged contiguously along a band up the reef slope and across the flat. The cover of each species was graded on a five-point scale of percentage relative cover. Other site attributes recorded were depth (minimum and maximum corrected to datum), slope (estimated), substrate type, total estimated cover of soft coral and algae (macroscopic and encrusting coralline). Coral data from the lagoon and its reef (66 sites) were used to define a small number of site groups which characterize most of this area.Throughout the survey, corals of taxonomic interest or difficulty were collected, and an extensive photographic record was made to augment survey data. A collection of the full range of form of all coral species was made during the survey and an identified reference series was deposited in the Australian Museum.In addition, less detailed descriptive data pertaining to coral communities and topography were recorded on 12 reconnaissance transects, the authors recording changes seen while being towed behind a boat.
 The purpose of this study was to describe the corals of Lord Howe Island (the southernmost Indo-Pacific reef) at species and community level using methods that would allow differentiation of community types and allow comparisons with coral communities in other geographic locations."""
description = f"{item_title} [SEP] {item_abstract}"

We first prepare input feature matrix X, which is the embedding of this description.

In [42]:
description_embedding = preprocessor.get_description_embedding(description)
dimension = description_embedding.shape[0]
target_X = description_embedding.reshape(1, dimension)
target_X

array([[-7.62841582e-01, -2.93963730e-01, -6.75775528e-01,
        -2.72130594e-02,  1.61575258e-01,  1.36871755e-01,
         3.51685137e-01,  3.32611710e-01,  2.18054429e-01,
        -2.60674685e-01, -6.53157413e-01, -2.17851877e-01,
        -1.56790227e-01,  7.64973581e-01,  6.43893182e-02,
         5.33772051e-01,  7.23699629e-02,  3.34060133e-01,
         3.63771886e-01,  4.27426696e-01, -4.14635211e-01,
        -3.91425282e-01,  2.89899439e-01,  3.73263061e-01,
        -4.92710680e-01, -4.38877106e-01, -1.66064978e-01,
         3.38278264e-01, -1.69556335e-01, -4.20632660e-01,
        -2.66675174e-01,  6.02981925e-01, -1.06652343e+00,
        -9.07420516e-01,  5.48748486e-02,  3.40701640e-01,
        -6.33432209e-01,  1.60902783e-01,  1.45107090e-01,
        -2.19583973e-01, -3.29156995e-01,  2.94659197e-01,
         3.98330539e-01, -2.37225473e-01,  3.84339452e-01,
         4.00018394e-01, -4.21667480e+00,  1.51392281e-01,
        -3.23978841e-01, -2.95730978e-02,  3.39358032e-0

The ML model is probabilistic: for each label it outputs the probability that the label applies to an item, given the embedding of its title and abstract. We can check the output by loading the saved model and printing its predictions.

In [43]:
pretrained_model = keywordModel.load_saved_model(model_name)
pretrained_model

<Sequential name=sequential, built=True>

In [44]:
pretrained_model.predict(target_X)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 566ms/step


array([[0.0141251 , 0.0148331 , 0.01123162, 0.00763609, 0.01477294,
        0.01086779, 0.00581298, 0.14629266, 0.00630754, 0.0082457 ,
        0.43196604, 0.02452799, 0.01131908, 0.00454361, 0.00731042,
        0.01040885, 0.00773691, 0.04058121, 0.00879939, 0.01726866,
        0.03757109, 0.01942537, 0.00745127, 0.03129204, 0.01508618,
        0.00907429, 0.00993416, 0.00768015, 0.00769586, 0.01268324,
        0.01587276, 0.00709243, 0.01370739, 0.03063177, 0.002096  ,
        0.05275223, 0.01075495, 0.01641993, 0.03722664, 0.022268  ,
        0.06002471, 0.06569523, 0.5783683 , 0.07615415, 0.05850924,
        0.01469431, 0.29270202, 0.05475901, 0.03624   , 0.00912722,
        0.00590769, 0.00979623, 0.00815977, 0.09944522, 0.01855502,
        0.05258691, 0.07633297, 0.01577087, 0.01319886, 0.0139059 ,
        0.00344483, 0.0034726 , 0.00692552, 0.01625231, 0.00977275,
        0.00781164, 0.00797892, 0.05724322, 0.00762262, 0.03694672,
        0.11583592, 0.0303256 , 0.01351729, 0.00

Global parameters `confidence` and `top_N` are assigned in the `data_discovery_ai/common/keyword_classification_parameters.ini` configuration file.

- The `confidence` parameter specifies the probability threshold. Probabilities exceeding this value indicate that the keyword is considered present in the item; otherwise, it is not.
- The `top_N` parameter is used to select predicted keywords when no probability exceeds the confidence threshold. In this case, the top N keywords are selected and considered to appear in the item record.
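
The interplay of the two parameters can be sketched as a threshold followed by a top-N fallback. The function `select_labels` is a hypothetical stand-in written for this example, not the body of `keywordModel.prediction`, and the toy probabilities are made up.

```python
def select_labels(probabilities, confidence, top_n):
    """Turn label probabilities into a binary vector using a threshold with a top-N fallback."""
    # a label is selected when its probability exceeds the confidence threshold
    selected = [1 if p > confidence else 0 for p in probabilities]
    if not any(selected):
        # no probability exceeds the threshold: fall back to the top N labels
        ranked = sorted(
            range(len(probabilities)), key=lambda j: probabilities[j], reverse=True
        )
        for j in ranked[:top_n]:
            selected[j] = 1
    return selected


# hypothetical probabilities for four labels
probs = [0.01, 0.43, 0.58, 0.29]

above_threshold = select_labels(probs, confidence=0.5, top_n=2)
fallback = select_labels(probs, confidence=0.9, top_n=2)
print(above_threshold, fallback)
```

With a threshold of 0.5 only the third label is kept. With a threshold of 0.9 nothing passes, so the two most probable labels are selected instead.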

Then we use the trained model and `target_X` to make a prediction.

In [45]:
target_predicted_labels = keywordModel.prediction(
        target_X,
        trained_model,
        keywordConfig.getfloat("keywordModel", "confidence"),
        keywordConfig.getint("keywordModel", "top_N"),
    )
target_predicted_labels

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 393ms/step


array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

The output is a binary array: a value of 1 at a given index means the keyword at that index is predicted to be present in the item. We convert this binary array to a readable format.

In [46]:
prediction = keywordModel.get_predicted_keywords(target_predicted_labels, labelMap)
prediction

[{'vocab_type': 'AODN Discovery Parameter Vocabulary',
  'value': 'abundance of biota',
  'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/488'},
 {'vocab_type': 'AODN Discovery Parameter Vocabulary',
  'value': 'biotic taxonomic identification',
  'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/489'}]
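
The decoding step amounts to looking up every index whose value is 1 in the label map. The sketch below uses plain dictionaries as a stand-in for the Concept objects, and the function name `decode_labels`, the toy label map and its placeholder URLs are assumptions for illustration only.

```python
def decode_labels(binary_row, label_map):
    """Return the label_map entries whose position is set to 1 in binary_row."""
    return [label_map[j] for j, flag in enumerate(binary_row) if flag == 1]


# hypothetical label map: column index -> keyword description
toy_label_map = {
    0: {"value": "keyword a", "url": "http://example.invalid/a"},
    1: {"value": "keyword b", "url": "http://example.invalid/b"},
    2: {"value": "keyword c", "url": "http://example.invalid/c"},
}

decoded = decode_labels([0, 1, 1], toy_label_map)
print([keyword["value"] for keyword in decoded])
```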

So the most likely keywords for this item are `abundance of biota` and `biotic taxonomic identification`, both from the AODN Discovery Parameter Vocabulary.