# ML model for Keyword Classification - Tech Notebook
This notebook introduces (1) how to explore, prepare and preprocess the datasets; (2) how to train and evaluate the ML model; and (3) how to use this trained ML model, for technical audiences.
## Problem Description
The AODN catalogue $C=\{M, K, P\}$ serves as a platform for storing datasets and their associated metadata. $M=\{m_1,m_2,\ldots, m_x\}$ is the set of metadata records that describe the datasets in the AODN catalogue $C$. $K=\{k_1, k_2, \ldots, k_y\}$ is a set of predefined keywords used to categorise datasets. In the catalogue $C$, a subset of metadata records, $M_t \subseteq M$, has not yet been categorised with keywords, i.e., $K_i = \emptyset$ for all $m_i \in M_t$. Another subset of metadata records, $M_s \subseteq M$, has already been categorised with keywords, i.e., $K_i \neq \emptyset$ for all $m_i \in M_s$. The research question is as follows:

How to design and develop a machine learning model, denoted as $MM_{keywords}$, that can automatically label the uncategorised metadata records $M_t$ using a predefined set of keywords $K$. Specifically, the model should be trained to learn a mapping rule $d_i \mapsto K_i$ based on the observed patterns from the sample set $M_s$, where each description $d_i$ of a metadata record $m_i \in M_s$ is associated with a set of keywords $K_i$. Once trained, the model should be able to apply this learned mapping to accurately categorise the records in $M_t$ by assigning their corresponding keywords based on the records' descriptions.

To simplify the task, we restrict the scope of keywords to those falling within the primary AODN vocabulary:
- AODN Instrument Vocabulary
- AODN Discovery Parameter Vocabulary
- AODN Platform Vocabulary

Only keywords $k_j \in K_i$ that are part of the listed AODN vocabularies will be considered. Any keyword not belonging to these vocabularies will be excluded from $K_i$ for all metadata records in the categorised metadata set $M_s$.

### Formal Definitions
- **Definition 1: A metadata record $m_i=(d_i, K_i), m_i \in M$** is a record describing a dataset. Specifically, $i$ is the unique identifier of the record. $d_i$ is a textual abstract that serves as the description of the dataset. $K_i \subseteq K$ is a subset of keywords used to label the dataset.
- **Definition 2: An abstract $d_i$** is a piece of textual information used to describe the dataset. The embedding $\mathbf{d_i}$ is a vector representation of the textual description $d_i$, calculated using the "bert-base-uncased" model. The embedding vectors of all abstracts share the same dimensionality, denoted as $dim=|\mathbf{d_i}|$. A feature matrix $\mathbf{X}$ of shape $|M_s| \times dim$ aggregates the embeddings of the abstracts of all samples in $M_s$, where $|M_s|$ is the total number of metadata records in $M_s$.
- **Definition 3: A keyword $k_j$** is a predefined label used for categorising datasets. Each metadata record $m_i$ is associated with a set of keywords $K_i \subseteq K$, where $K$ is the complete set of predefined keywords. The keyword set $K_i$ of a metadata record $m_i$ is mathematically represented as a binary vector $y_i$ of size $|K|$, where each element indicates the presence or absence of a specific label. A value of 1 at position $j$ denotes that the label $k_j \in K$ is present in the metadata record $m_i$ (that is, $k_j \in K_i$), while a value of 0 indicates its absence. A target matrix $\mathbf{Y}$ is a $|M_s| \times |K|$ binary matrix, where $|M_s|$ is the number of metadata records in the sample set $M_s$, and $|K|$ is the size of the keyword set $K=\{k_1, k_2, \ldots, k_y\}$. Each entry $\mathbf{Y}[i, j]$ is 1 if metadata record $m_i$ is associated with keyword $k_j$, and 0 otherwise.
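
To make Definition 3 concrete, the following minimal sketch builds a target matrix $\mathbf{Y}$ from keyword sets. The keywords and records are made-up toy values, and `to_multi_hot` is a hypothetical helper, not a function of the `preprocessor` module.

```python
# Hypothetical toy data: three keywords and two records (not taken from the AODN catalogue).
keywords = ["temperature", "salinity", "mooring"]             # K, so |K| = 3
record_keywords = [{"temperature", "mooring"}, {"salinity"}]  # K_1 and K_2


def to_multi_hot(record_keywords, keywords):
    """Build the binary target matrix Y with shape |M_s| x |K|."""
    return [[1 if k in k_i else 0 for k in keywords] for k_i in record_keywords]


Y = to_multi_hot(record_keywords, keywords)
print(Y)  # [[1, 0, 1], [0, 1, 0]]
```

Row $i$ of the result is the binary vector $y_i$, and column $j$ corresponds to keyword $k_j$.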



In [1]:
# add module path for notebook to use
import sys
import os

module_path = os.path.abspath(os.path.join('..'))
# add each package sub-folder to sys.path; os.path.join keeps the paths portable across operating systems
for sub_dir in ("utils", "model", "common"):
    sub_path = os.path.join(module_path, "data_discovery_ai", sub_dir)
    if sub_path not in sys.path:
        sys.path.append(sub_path)

current_path = os.getcwd()

# import modules
import preprocessor
import keywordModel
import constants
import es_connector



As shown in the [framework](data-discovery-ai-framework.drawio.png), three distinct but connected modules work cooperatively as the keyword classifier pipeline. This notebook will go through the functions in these modules to show how we preprocess data, train the ML model, and make predictions.
## Data Preprocessing
The data preprocessing module is used to prepare data for training and testing models. Key features include: getting raw data, preparing sample data, converting textual data to numeric representations, resampling, and preparing input and output matrices.
### Getting Raw Data
Raw data refers to all metadata records $M$ stored in Elasticsearch. An Elasticsearch configuration file `esManager.ini` must be created in the folder `data_discovery_ai/common`, with two required fields: `end_point` and `api_key`. For more information, please refer to the [README](../README.md#file-structure). We first fetch raw data from Elasticsearch.

In [2]:
# load Elasticsearch configuration
import configparser
from pathlib import Path

def load_es_config() -> configparser.ConfigParser:
    elasticsearch_config_file_path = f"../data_discovery_ai/common/{constants.ELASTICSEARCH_CONFIG}"
    esConfig = configparser.ConfigParser()
    esConfig.read(elasticsearch_config_file_path)
    return esConfig
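
As a rough illustration of the configuration file described above, the sketch below parses an in-memory example and checks that the two required fields exist. The section name `elasticsearch`, both values, and the helper `check_es_config` are assumptions made for this example; the real `esManager.ini` layout is documented in the README.

```python
import configparser

# Hypothetical file content: the section name and both values are placeholders.
EXAMPLE_INI = """
[elasticsearch]
end_point = https://example.invalid:9243
api_key = REPLACE_WITH_YOUR_KEY
"""


def check_es_config(ini_text, section="elasticsearch"):
    """Parse the text and return (end_point, api_key), or raise if a field is missing."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    required = ("end_point", "api_key")
    missing = [field for field in required if not config.has_option(section, field)]
    if missing:
        raise KeyError(f"missing fields in [{section}]: {missing}")
    return config[section]["end_point"], config[section]["api_key"]


end_point, api_key = check_es_config(EXAMPLE_INI)
```

A check like this fails early with a clear message, rather than later when the Elasticsearch client is created.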

In [3]:
# connect and query Elasticsearch
esConfig = load_es_config()
client = es_connector.connect_es(esConfig)
index = os.getenv("ES_INDEX_NAME", default=constants.ES_INDEX_NAME)
raw_data = es_connector.search_es(client=client, index=index, batch_size=constants.BATCH_SIZE, sleep_time=constants.SLEEP_TIME)

searching elasticsearch: 100%|██████████| 129/129 [11:32<00:00,  5.37s/it]


In [12]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12943 entries, 0 to 12942
Data columns (total 41 columns):
 #   Column                                             Non-Null Count  Dtype 
---  ------                                             --------------  ----- 
 0   _index                                             12943 non-null  object
 1   _id                                                12943 non-null  object
 2   _score                                             0 non-null      object
 3   sort                                               12943 non-null  object
 4   _source.title                                      12943 non-null  object
 5   _source.description                                12943 non-null  object
 6   _source.extent.bbox                                12943 non-null  object
 7   _source.extent.temporal                            12943 non-null  object
 8   _source.summaries.score                            12943 non-null  int64 
 9   _source.summaries

There are **12943** metadata records in the staging environment. We can also check that **1721** of them have no keyword information.

In [16]:
no_keyword_items = raw_data[raw_data['_source.themes'].apply(lambda x: x == [])]
no_keyword_items_count = no_keyword_items.shape[0]
no_keyword_items_count

1721

### Identify Samples
The sample set is a subset of the raw dataset. A sample set $M_s$ is the set of metadata records whose keywords include terms from the target AODN vocabularies. We first identify samples from the raw data, and then preprocess the sample set.

In [4]:
# get predefined vocabs
def load_keyword_config() -> configparser.ConfigParser:
    keyword_config_file_path = f"../data_discovery_ai/common/{constants.KEYWORD_CONFIG}"
    keywordConfig = configparser.ConfigParser()
    keywordConfig.read(keyword_config_file_path)
    return keywordConfig
keywordConfig = load_keyword_config()
vocabs = keywordConfig["preprocessor"]["vocabs"].split(", ")
vocabs

['AODN Instrument Vocabulary',
 'AODN Discovery Parameter Vocabulary',
 'AODN Platform Vocabulary']

The identified sample labels have the following format:

In [17]:
# identify samples with predefined vocabs
identified_sampleSet = preprocessor.identify_km_sample(raw_data, vocabs)
identified_sampleSet.iloc[0]["keywords"]

[{'concepts': [{'id': 'Oceans | Ocean Temperature | Water Temperature',
    'url': None},
   {'id': 'Oceans | Ocean Optics | Photosynthetically Active Radiation',
    'url': None},
   {'id': 'Oceans | Ocean Optics | Turbidity', 'url': None},
   {'id': 'Atmosphere | Precipitation | Rain', 'url': None},
   {'id': 'Oceans | Ocean Chemistry | Chlorophyll', 'url': None},
   {'id': 'Oceans | Salinity/density | Salinity', 'url': None}],
  'scheme': 'theme',
  'description': 'GCMD',
  'title': 'NASA/Global Change Master Directory Earth Science Keywords Version 5.3.8'},
 {'concepts': [{'id': 'Buoys | Moored Buoys', 'url': None},
   {'id': 'Fluorometers', 'url': None},
   {'id': 'CTD (Conductivity-Temperature-Depth Profilers)', 'url': None}],
  'scheme': '',
  'description': 'MCP',
  'title': 'Marine Community Profile of ISO19115 v1.4 Collection Methods Vocabulary (Annex C.1.3)'},
 {'concepts': [{'id': 'IMOS Platform | NRSDAR | Darwin National Reference Station Mooring',
    'url': None},
   {'i

The keywords are in a nested JSON format, so we need to flatten them and remove keywords that are not in the target vocabularies.
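
The flattening step can be sketched as follows. This is an illustration only, not the implementation of `preprocessor.sample_preprocessor`: it assumes that the `title` of each theme holds the vocabulary name, and the sample input (including its URL) is invented for the example.

```python
def flatten_keywords(themes, vocabs):
    """Flatten nested themes into flat items, keeping only the target vocabularies."""
    flat = []
    for theme in themes:
        vocab_type = theme.get("title")
        if vocab_type not in vocabs:
            continue  # drop themes outside the target vocabularies
        for concept in theme.get("concepts", []):
            flat.append(
                {
                    "vocab_type": vocab_type,
                    "value": concept.get("id"),
                    "url": concept.get("url"),
                }
            )
    return flat


# Hypothetical input that mirrors the nested structure shown above.
sample_themes = [
    {"concepts": [{"id": "Fluorometers", "url": None}], "title": "Some Other Vocabulary"},
    {
        "concepts": [{"id": "mooring", "url": "http://example.invalid/1"}],
        "title": "AODN Platform Vocabulary",
    },
]
flat_keywords = flatten_keywords(sample_themes, ["AODN Platform Vocabulary"])
```

Each surviving concept becomes one flat item with `vocab_type`, `value` and `url` keys, which matches the shape of the preprocessed keywords shown in the next output.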

In [6]:
preprocessed_SampleSet = preprocessor.sample_preprocessor(identified_sampleSet, vocabs)
preprocessed_SampleSet

| | id | title | description | keywords |
|---|---|---|---|---|
| 12 | 006bb7dc-860b-4b89-bf4c-6bd930bd35b7 | IMOS - ANMN National Reference Stations - Darw... | This collection includes observations transmit... | [{'vocab_type': 'AODN Platform Vocabulary', 'v... |
| 16 | 0094682a-e438-41e8-a39b-19cf2093025d | Thursday Island Wind From 08 Feb 2012 | This data set was collected by weather sensors... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 20 | 00a64d43-86a8-4f2b-89e6-40f1abf288f6 | Cumulative Pressures on the Distinctive Values... | A report was developed by the Western Australi... | [] |
| 28 | 00fee0c8-6203-4271-8d46-f36c075fa6cf | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Ships of Opportunity' (SOOP) is a facility of... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 34 | 0145df96-3847-474b-8b63-a66f0e03ff54 | Statewide Marine Habitat Map 2023 | The Statewide Marine Habitat Map 2023 was deve... | [{'vocab_type': 'AODN Platform Vocabulary', 'v... |
| ... | ... | ... | ... | ... |
| 12814 | ff50ae2f-0f79-4eaa-806c-8954ab0e545b | One Tree Island Air Pressure From 18 Nov 2008 ... | The 'Wireless Sensor Networks Facility' (forme... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 12824 | ffb04265-eb2a-4eea-943f-ef4cd2dd9531 | Chemical microenvironment within complex multi... | -- Layton et al. Chemical microenvironments wi... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 12829 | ffd235e6-814e-477e-b324-60b44ef8ea11 | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Ships of Opportunity' (SOOP) is a facility of... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 12831 | ffe3c79d-0b1a-49cc-9995-5057dc1eb8f5 | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Ships of Opportunity' (SOOP) is a facility of... | [{'vocab_type': 'AODN Discovery Parameter Voca... |


Next, we clean the sample set. For instance, the row at index `20` has an empty keyword field (`[]`).

In [7]:
filtered_sampleSet = preprocessed_SampleSet[preprocessed_SampleSet["keywords"].apply(lambda x: x != [])]
filtered_sampleSet

| | id | title | description | keywords |
|---|---|---|---|---|
| 12 | 006bb7dc-860b-4b89-bf4c-6bd930bd35b7 | IMOS - ANMN National Reference Stations - Darw... | This collection includes observations transmit... | [{'vocab_type': 'AODN Platform Vocabulary', 'v... |
| 16 | 0094682a-e438-41e8-a39b-19cf2093025d | Thursday Island Wind From 08 Feb 2012 | This data set was collected by weather sensors... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 28 | 00fee0c8-6203-4271-8d46-f36c075fa6cf | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Ships of Opportunity' (SOOP) is a facility of... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 34 | 0145df96-3847-474b-8b63-a66f0e03ff54 | Statewide Marine Habitat Map 2023 | The Statewide Marine Habitat Map 2023 was deve... | [{'vocab_type': 'AODN Platform Vocabulary', 'v... |
| 37 | 0155375c-8070-4662-9c93-b593ee4891b0 | Davies Reef Water Temperature From 18 Oct 1991 | The 'Wireless Sensor Networks Facility' (forme... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| ... | ... | ... | ... | ... |
| 12814 | ff50ae2f-0f79-4eaa-806c-8954ab0e545b | One Tree Island Air Pressure From 18 Nov 2008 ... | The 'Wireless Sensor Networks Facility' (forme... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 12824 | ffb04265-eb2a-4eea-943f-ef4cd2dd9531 | Chemical microenvironment within complex multi... | -- Layton et al. Chemical microenvironments wi... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 12829 | ffd235e6-814e-477e-b324-60b44ef8ea11 | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Ships of Opportunity' (SOOP) is a facility of... | [{'vocab_type': 'AODN Discovery Parameter Voca... |
| 12831 | ffe3c79d-0b1a-49cc-9995-5057dc1eb8f5 | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Ships of Opportunity' (SOOP) is a facility of... | [{'vocab_type': 'AODN Discovery Parameter Voca... |


Then we calculate embeddings for the combined title and description fields, which are used to build the input feature matrix.

In [8]:
finalSampleSet = preprocessor.calculate_embedding(filtered_sampleSet)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ds["information"] = ds["title"] + ": " + ds["description"]
100%|██████████| 1858/1858 [51:11<00:00,  1.65s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ds["embedding"] = ds["information"].progress_apply(


In [9]:
finalSampleSet

| | id | title | description | keywords | information | embedding |
|---|---|---|---|---|---|---|
| 12 | 006bb7dc-860b-4b89-bf4c-6bd930bd35b7 | IMOS - ANMN National Reference Stations - Darw... | This collection includes observations transmit... | [{'vocab_type': 'AODN Platform Vocabulary', 'v... | IMOS - ANMN National Reference Stations - Darw... | [-0.80773824, 0.19351113, 0.02368839, -0.34996... |
| 16 | 0094682a-e438-41e8-a39b-19cf2093025d | Thursday Island Wind From 08 Feb 2012 | This data set was collected by weather sensors... | [{'vocab_type': 'AODN Discovery Parameter Voca... | Thursday Island Wind From 08 Feb 2012: This da... | [-0.6913842, -0.48216444, 0.51189363, 0.038675... |
| 28 | 00fee0c8-6203-4271-8d46-f36c075fa6cf | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Ships of Opportunity' (SOOP) is a facility of... | [{'vocab_type': 'AODN Discovery Parameter Voca... | IMOS SOOP Underway Data from AIMS Vessel RV So... | [-0.7061849, -0.5095051, 0.4899681, -0.0899401... |
| 34 | 0145df96-3847-474b-8b63-a66f0e03ff54 | Statewide Marine Habitat Map 2023 | The Statewide Marine Habitat Map 2023 was deve... | [{'vocab_type': 'AODN Platform Vocabulary', 'v... | Statewide Marine Habitat Map 2023: The Statewi... | [-1.2934202, -0.40619895, -0.014477678, -0.177... |
| 37 | 0155375c-8070-4662-9c93-b593ee4891b0 | Davies Reef Water Temperature From 18 Oct 1991 | The 'Wireless Sensor Networks Facility' (forme... | [{'vocab_type': 'AODN Discovery Parameter Voca... | Davies Reef Water Temperature From 18 Oct 1991... | [-0.83987963, -0.3394436, 0.35684508, -0.11953... |
| ... | ... | ... | ... | ... | ... | ... |
| 12814 | ff50ae2f-0f79-4eaa-806c-8954ab0e545b | One Tree Island Air Pressure From 18 Nov 2008 ... | The 'Wireless Sensor Networks Facility' (forme... | [{'vocab_type': 'AODN Discovery Parameter Voca... | One Tree Island Air Pressure From 18 Nov 2008 ... | [-0.61762625, -0.36514324, 0.28105518, -0.1280... |
| 12824 | ffb04265-eb2a-4eea-943f-ef4cd2dd9531 | Chemical microenvironment within complex multi... | -- Layton et al. Chemical microenvironments wi... | [{'vocab_type': 'AODN Discovery Parameter Voca... | Chemical microenvironment within complex multi... | [-0.5318709, -0.6527285, -0.36633912, 0.094197... |
| 12829 | ffd235e6-814e-477e-b324-60b44ef8ea11 | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Ships of Opportunity' (SOOP) is a facility of... | [{'vocab_type': 'AODN Discovery Parameter Voca... | IMOS SOOP Underway Data from AIMS Vessel RV So... | [-0.64320445, -0.5581553, 0.46219164, -0.06278... |
| 12831 | ffe3c79d-0b1a-49cc-9995-5057dc1eb8f5 | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Ships of Opportunity' (SOOP) is a facility of... | [{'vocab_type': 'AODN Discovery Parameter Voca... | IMOS SOOP Underway Data from AIMS Vessel RV So... | [-0.7151627, -0.52895564, 0.4912181, -0.107228... |


### Prepare Train and Test Sets
We now have the sample set with extra embedding information. We are going to split the sample set into train and test sets by preparing the input feature matrix $X$ and the output target matrix $Y$. The input feature matrix $X$ is based on the embedding column, and the output $Y$ is the mathematical representation of the keywords column.

In [18]:
X, Y, Y_df, labelMap = preprocessor.prepare_X_Y(finalSampleSet)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ds["keywordsMap"] = ds["keywords"].apply(


We have prepared the input feature matrix `X` and the output target matrix `Y`. Additionally, we have `Y_df`, which includes column names for the `Y` matrix, and `labelMap`, which represents the keyword set of predefined keywords. In `labelMap`, the key is an encoded number corresponding to a column name in `Y_df`, and the value is a Concept object. We can review the details of a Concept object by its `to_json()` function.

In [19]:
Y_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,516,517,518,519,520,521,522,523,524,525
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1853,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1854,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1855,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1856,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
labelMap.get(0).to_json()

{'vocab_type': 'AODN Discovery Parameter Vocabulary',
 'value': 'northward current velocity in the water body',
 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/LCNSZZ01'}

In [21]:
rare_label_index = preprocessor.identify_rare_labels(Y_df, constants.RARE_LABEL_THRESHOLD, list(labelMap.keys()))
len(rare_label_index)

332

We found that among the 526 unique keywords, 332 appear fewer times than `RARE_LABEL_THRESHOLD`. So we first duplicate the records that carry these rare labels, using a customised resampling strategy.
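
The idea behind this step can be sketched in plain Python. The helpers `find_rare_labels` and `duplicate_rare_records`, the threshold value and the toy matrices are all assumptions made for illustration; the actual logic lives in `preprocessor.identify_rare_labels` and `preprocessor.resampling` and may differ.

```python
def find_rare_labels(Y, threshold):
    """Return the column indices of labels that occur fewer than `threshold` times."""
    counts = [sum(column) for column in zip(*Y)]
    return [j for j, count in enumerate(counts) if count < threshold]


def duplicate_rare_records(X, Y, rare_index, copies=1):
    """Append `copies` extra copies of every record that carries at least one rare label."""
    X_out, Y_out = list(X), list(Y)
    for x, y in zip(X, Y):
        if any(y[j] == 1 for j in rare_index):
            X_out.extend([x] * copies)
            Y_out.extend([y] * copies)
    return X_out, Y_out


# Toy data: label 0 occurs three times, label 1 only once.
X_toy = [[0.1], [0.2], [0.3]]
Y_toy = [[1, 0], [1, 0], [1, 1]]
rare_toy = find_rare_labels(Y_toy, threshold=2)                       # [1]
X_dup, Y_dup = duplicate_rare_records(X_toy, Y_toy, rare_toy)
print(len(X_toy), "->", len(X_dup))  # 3 -> 4
```

Only the third record carries the rare label, so it is the only one duplicated.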

In [22]:
X_oversampled, Y_oversampled = preprocessor.resampling(
            X_train=X, Y_train=Y, strategy="custom", rare_keyword_index=rare_label_index
        )

Total samples: 3468
Dimension: 768
No. of labels: 526
X resampled set size: 3468
Y resampled set size: 3468


The sample size has now increased from 1858 to 3468, because the records with rare labels were duplicated. We can now split the sample set into train and test sets following an 80%-20% split.

In [23]:
dim, n_labels, X_train, Y_train, X_test, Y_test = (
            preprocessor.prepare_train_test(X_oversampled, Y_oversampled, keywordConfig)
        )

INFO:preprocessor:Total samples: 3468
INFO:preprocessor:Dimension: 768
INFO:preprocessor:No. of labels: 526
INFO:preprocessor:Train set size: 2765 (79.73%)
INFO:preprocessor:Test set size: 703 (20.27%)


Next, we perform oversampling only on the training set, as we want to avoid introducing training samples into the test set. This ensures the model does not encounter training data during testing.

In [24]:
X_train_oversampled, Y_train_oversampled = preprocessor.resampling(
            X_train=X_train, Y_train=Y_train, strategy="ROS", rare_keyword_index=None
        )

Total samples: 150975
Dimension: 768
No. of labels: 526
X resampled set size: 150975
Y resampled set size: 150975


Then, we calculate class weights so that, during model training, majority classes receive lower weights and minority classes receive higher weights.
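
One common way to derive such weights is inverse label frequency, sketched below. This is an assumption for illustration: `inverse_frequency_weights` is a hypothetical helper, and `keywordModel.get_class_weights` may use a different formula.

```python
def inverse_frequency_weights(Y):
    """Map each label index to n_samples / label_count (1.0 for labels that never occur)."""
    n_samples = len(Y)
    counts = [sum(column) for column in zip(*Y)]
    return {j: (n_samples / count if count > 0 else 1.0) for j, count in enumerate(counts)}


# Toy target matrix: label 0 appears in all four records, label 1 in only one.
Y_weights_toy = [[1, 0], [1, 0], [1, 1], [1, 0]]
toy_weights = inverse_frequency_weights(Y_weights_toy)
print(toy_weights)  # {0: 1.0, 1: 4.0}
```

The rarer label receives the larger weight, so its errors contribute more to the loss.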

In [25]:
label_weight_dict = keywordModel.get_class_weights(Y_train)

Now, we have prepared all the data we need for training a keyword classification model. Let's move on to the next stage.

## Training and Evaluation of Model
A model name is required for training a model. As mentioned in [README.md](../README.md), the available options are: `development`, `experimental`, `staging`, `production`, `benchmark`.

In [26]:
model_name = "development"

In [27]:
trained_model, history, model_name = keywordModel.keyword_model(
            model_name=model_name,
            X_train=X_train,
            Y_train=Y_train,
            X_test=X_test,
            Y_test=Y_test,
            class_weight=label_weight_dict,
            dim=dim,
            n_labels=n_labels,
            params=keywordConfig,
        )

Epoch 1/100
70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step - accuracy: 0.0198 - loss: 7.6368e-04 - precision: 0.0241 - recall: 0.3065 - val_accuracy: 0.0289 - val_loss: 0.0170 - val_precision: 0.3563 - val_recall: 0.0460 - learning_rate: 0.0010
Epoch 2/100
70/70 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.0923 - loss: 1.6681e-04 - precision: 0.4890 - recall: 0.1954 - val_accuracy: 0.0000e+00 - val_loss: 0.0149 - val_precision: 0.7624 - val_recall: 0.1674 - learning_rate: 0.0010
Epoch 3/100
70/70 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.0980 - loss: 1.4274e-04 - precision: 0.6441 - recall: 0.3160 - val_accuracy: 0.0000e+00 - val_loss: 0.0126 - val_precision: 0.8934 - val_recall: 0.3478 - learning_rate: 0.0010
Epoch 4/100
70/70 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.1164 - loss: 1.1164e-04 - precision: 0.7453 - recall: 0.4684 - val_accuracy: 0.0

Then, we evaluate the trained model.

In [28]:
confidence = keywordConfig.getfloat("keywordModel", "confidence")
top_N = keywordConfig.getint("keywordModel", "top_N")
predicted_labels = keywordModel.prediction(
    X_test, trained_model, confidence, top_N
)
# use a name that does not shadow Python's built-in eval()
eval_result = keywordModel.evaluation(
    Y_test=Y_test, predictions=predicted_labels
)
eval_result

22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step


{'precision': '0.9487',
 'recall': '0.9265',
 'f1': '0.9375',
 'hammingloss': '0.0022',
 'Jaccard Index': '0.7977',
 'accuracy': '0.6942'}

We obtained about 95% precision, 93% recall, and a 94% F1 score, which is a reasonable result. We can still try different hyperparameters to improve model performance. Please refer to [README.md](../README.md) for hyperparameter descriptions. To adjust the model hyperparameters, edit the file `data_discovery_ai/common/keyword_classification_parameters.ini` and try different values.
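
To show how metrics of this kind relate to each other, here is a minimal pure-Python sketch on toy data. It assumes micro-averaging and treats accuracy as exact-match (subset) accuracy; `multilabel_metrics` is a hypothetical helper, and `keywordModel.evaluation` may compute its scores differently.

```python
def multilabel_metrics(Y_true, Y_pred):
    """Micro-averaged precision, recall and F1, plus Hamming loss and exact-match accuracy."""
    tp = fp = fn = exact = n_cells = 0
    for true_row, pred_row in zip(Y_true, Y_pred):
        exact += int(list(true_row) == list(pred_row))
        for t, p in zip(true_row, pred_row):
            n_cells += 1
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hamming_loss": (fp + fn) / n_cells,
        "exact_match_accuracy": exact / len(Y_true),
    }


# Toy example: one of the three true labels is missed.
Y_true_toy = [[1, 0, 1], [0, 1, 0]]
Y_pred_toy = [[1, 0, 0], [0, 1, 0]]
toy_metrics = multilabel_metrics(Y_true_toy, Y_pred_toy)
```

In this toy case a single missed label leaves F1 at 0.8 but halves the exact-match accuracy, because a record only counts as correct when every one of its labels is right. If the reported accuracy is an exact-match score, this would explain why it is noticeably lower than the F1 score.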

## Make Prediction

Now that we have the trained model, let's use it to make a prediction. Assume we have an unlabelled item titled *"Corals and coral communities of Lord Howe Island, Australia"*, whose abstract describes ecological and taxonomic surveys of hermatypic scleractinian corals at approximately 100 sites around Lord Howe Island. The full abstract is given in the code cell below.

In [29]:
item_title = "Corals and coral communities of Lord Howe Island, Australia"
item_abstract = """Ecological and taxonomic surveys of hermatypic scleractinian corals were carried out at approximately 100 sites around Lord Howe Island. Sixty-six of these sites were located on reefs in the lagoon, which extends for two-thirds of the length of the island on the western side. Each survey site consisted of a section of reef surface, which appeared to be topographically and faunistically homogeneous. The dimensions of the sites surveyed were generally of the order of 20m by 20m. Where possible, sites were arranged contiguously along a band up the reef slope and across the flat. The cover of each species was graded on a five-point scale of percentage relative cover. Other site attributes recorded were depth (minimum and maximum corrected to datum), slope (estimated), substrate type, total estimated cover of soft coral and algae (macroscopic and encrusting coralline). Coral data from the lagoon and its reef (66 sites) were used to define a small number of site groups which characterize most of this area.Throughout the survey, corals of taxonomic interest or difficulty were collected, and an extensive photographic record was made to augment survey data. A collection of the full range of form of all coral species was made during the survey and an identified reference series was deposited in the Australian Museum.In addition, less detailed descriptive data pertaining to coral communities and topography were recorded on 12 reconnaissance transects, the authors recording changes seen while being towed behind a boat.
 The purpose of this study was to describe the corals of Lord Howe Island (the southernmost Indo-Pacific reef) at species and community level using methods that would allow differentiation of community types and allow comparisons with coral communities in other geographic locations."""
description = f"{item_title}: {item_abstract}"

We first prepare input feature matrix X, which is the embedding of this description.

In [30]:
description_embedding = preprocessor.get_description_embedding(description)
dimension = description_embedding.shape[0]
target_X = description_embedding.reshape(1, dimension)
target_X

array([[-7.76477098e-01, -3.33351672e-01, -7.07494497e-01,
         2.77853608e-02,  1.66572064e-01,  9.25325900e-02,
         3.53269756e-01,  3.13511342e-01,  3.27710301e-01,
        -2.67760873e-01, -6.13279700e-01, -2.74785250e-01,
        -2.21668690e-01,  7.12784171e-01,  8.41400586e-03,
         5.41278899e-01,  1.35883421e-01,  1.88588262e-01,
         2.41953388e-01,  3.85780513e-01, -4.87768143e-01,
        -3.06539148e-01,  2.47475669e-01,  3.70917350e-01,
        -4.72536981e-01, -3.74548137e-01, -1.55186579e-01,
         2.62748361e-01, -8.84262845e-02, -3.37665975e-01,
        -1.81324035e-01,  5.80801129e-01, -1.15090990e+00,
        -1.01637912e+00, -4.25794758e-02,  3.44661593e-01,
        -6.89785540e-01,  1.17133960e-01,  1.27914354e-01,
        -2.11383685e-01, -4.57495719e-01,  3.49888206e-01,
         3.89928788e-01, -2.88182795e-01,  2.89381206e-01,
         4.38351631e-01, -4.48247337e+00,  1.56454593e-01,
        -2.23044991e-01,  1.47479400e-02,  4.61551875e-0

The ML model is a probabilistic model. Its outputs are the probabilities that each label is present in an item, given the embedding of its title and abstract. We can check the output by loading the pretrained model and printing its predictions.

In [34]:
pretrained_model = keywordModel.load_saved_model(model_name)
pretrained_model

<Sequential name=sequential, built=True>

In [35]:
pretrained_model.predict(target_X)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 62ms/step


array([[0.02190415, 0.00446595, 0.10778404, 0.03010673, 0.0267204 ,
        0.01581634, 0.0016995 , 0.00384111, 0.00272711, 0.01360901,
        0.00367227, 0.00116701, 0.00290348, 0.00121695, 0.00285149,
        0.00269797, 0.05284765, 0.00507949, 0.00672343, 0.02559042,
        0.17217676, 0.00195777, 0.1340263 , 0.00723547, 0.00809273,
        0.02203846, 0.01938279, 0.01003943, 0.00606263, 0.02213104,
        0.01808908, 0.00969874, 0.00221914, 0.02558189, 0.00397264,
        0.00253744, 0.02732949, 0.03110578, 0.0109174 , 0.04362976,
        0.00227865, 0.00450206, 0.02374408, 0.03578586, 0.0370452 ,
        0.00232738, 0.00282683, 0.00353674, 0.00284026, 0.00910962,
        0.0036987 , 0.00859565, 0.04059619, 0.0926408 , 0.00668889,
        0.01174666, 0.00666557, 0.00448337, 0.0123541 , 0.00272207,
        0.00631364, 0.00300497, 0.20000635, 0.00692031, 0.01048738,
        0.00239681, 0.45698577, 0.05836551, 0.00154252, 0.01776574,
        0.0195182 , 0.02331006, 0.028092  , 0.07

Global parameters `confidence` and `top_N` are assigned in the `data_discovery_ai/common/keyword_classification_parameters.ini` configuration file.

- The `confidence` parameter specifies the probability threshold. Probabilities exceeding this value indicate that the keyword is considered present in the item; otherwise, it is not.
- The `top_N` parameter is used to select predicted keywords when no probability exceeds the confidence threshold. In this case, the top N keywords are selected and considered to appear in the item record.
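
The two rules above can be sketched as follows. `to_binary_labels` is a hypothetical helper written from the description of the two parameters, not the body of `keywordModel.prediction`, and the probabilities are toy values.

```python
def to_binary_labels(probabilities, confidence, top_n):
    """Mark labels above `confidence`; if none qualify, mark the `top_n` most probable ones."""
    labels = [1 if p > confidence else 0 for p in probabilities]
    if not any(labels):
        ranked = sorted(range(len(probabilities)), key=lambda j: probabilities[j], reverse=True)
        for j in ranked[:top_n]:
            labels[j] = 1
    return labels


above_threshold = to_binary_labels([0.1, 0.7, 0.2], confidence=0.5, top_n=2)
fallback = to_binary_labels([0.1, 0.3, 0.2], confidence=0.5, top_n=2)
print(above_threshold)  # [0, 1, 0]
print(fallback)         # [0, 1, 1]
```

The fallback guarantees that every record receives at least one keyword, at the cost of possibly assigning low-probability labels.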

Then we use the trained model and X to make a prediction.

In [31]:
target_predicted_labels = keywordModel.prediction(
        target_X,
        trained_model,
        keywordConfig.getfloat("keywordModel", "confidence"),
        keywordConfig.getint("keywordModel", "top_N"),
    )
target_predicted_labels

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step


array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

The output is a binary array: an index whose value is 1 means that the keyword at that index is predicted to be present in the item. So, we convert this binary array to a readable format.
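
Conceptually, this conversion is a lookup of every index whose value is 1 in the label map. The sketch below uses a toy label map of plain dictionaries; in the notebook the map values are Concept objects, and `decode_labels` is a hypothetical stand-in for `keywordModel.get_predicted_keywords`.

```python
def decode_labels(binary_row, label_map):
    """Return the label-map entries for every position whose value is 1."""
    return [label_map[j] for j, flag in enumerate(binary_row) if flag == 1]


# Toy label map: index -> keyword description (values are invented for illustration).
toy_label_map = {
    0: {"vocab_type": "Toy Vocabulary", "value": "temperature"},
    1: {"vocab_type": "Toy Vocabulary", "value": "salinity"},
    2: {"vocab_type": "Toy Vocabulary", "value": "mooring"},
}
decoded = decode_labels([0, 1, 1], toy_label_map)
print([item["value"] for item in decoded])  # ['salinity', 'mooring']
```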

In [32]:
prediction = keywordModel.get_predicted_keywords(target_predicted_labels, labelMap)
prediction

[{'vocab_type': 'AODN Discovery Parameter Vocabulary',
  'value': 'abundance of biota',
  'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/488'},
 {'vocab_type': 'AODN Discovery Parameter Vocabulary',
  'value': 'biotic taxonomic identification',
  'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/489'}]

So the most likely keywords for this item are `abundance of biota` and `biotic taxonomic identification`, both from the AODN Discovery Parameter Vocabulary.