# ML model for Keyword Classification - Tech Notebook
This notebook introduces (1) how to explore, prepare and preprocess the datasets; (2) how to train and evaluate the ML model; and (3) how to use this trained ML model, for technical audiences.
## Problem Description
The AODN catalogue $C=\{M, K, P\}$ serves as a platform for storing datasets and their associated metadata. $M=\{m_1,m_2,\ldots, m_x\}$ is the set of metadata records that describe the datasets in the AODN catalogue $C$. $K=\{k_1, k_2, \ldots, k_y\}$ is a set of predefined keywords used to categorise datasets. A subset of metadata records, $M_t \subseteq M$, has not yet been categorised with keywords, that is, $K_i = \emptyset$ for all $m_i \in M_t$. Another subset, $M_s \subseteq M$, contains records that have already been categorised with keywords (i.e., $K_i \neq \emptyset$ for all $m_i \in M_s$). The research question is as follows:

How can we design and develop a machine learning model, denoted as $MM_{keywords}$, that automatically labels the uncategorised metadata records $M_t$ using a predefined set of keywords $K$? Specifically, the model should be trained to learn a mapping rule $d_i \mapsto K_i$ based on the observed patterns from the sample set $M_s$, where each description $d_i$ of a metadata record $m_i \in M_s$ is associated with a set of keywords $K_i$. Once trained, the model should be able to apply this learned mapping to accurately categorise the records in $M_t$ by assigning their corresponding keywords based on the records' descriptions.

To simplify the task, we restrict the scope of keywords to those in the primary AODN vocabularies:
- AODN Instrument Vocabulary
- AODN Discovery Parameter Vocabulary
- AODN Platform Vocabulary

Only keywords $k_j \in K_i$ that are part of the listed AODN vocabularies will be considered. Any keyword not belonging to these vocabularies will be excluded from $K_i$ for all metadata records in the categorised metadata set $M_s$.

### Formal Definitions
- **Definition 1: A metadata record $m_i=(d_i, K_i), m_i \in M$** is a record describing a dataset. Specifically, $i$ is the unique identifier of the record. $d_i$ is a textual abstract that serves as the description of the dataset. $K_i \subseteq K$ is a subset of keywords used to label the dataset.
- **Definition 2: An abstract $d_i$** is a piece of text that describes the dataset. The embedding $\mathbf{d_i}$ is a vector representation of the textual description $d_i$, calculated using the "bert-base-uncased" model. Every embedding vector $\mathbf{d_i}$ has the same dimensionality, denoted as $dim=|\mathbf{d_i}|$. A feature matrix $\mathbf{X}$ of shape $|M_s| \times dim$ aggregates the embeddings of the abstracts of all samples in $M_s$, where $|M_s|$ is the total number of metadata records in $M_s$.
- **Definition 3: A keyword $k_j$** is a predefined label used for categorising datasets. Each metadata record $m_i$ is associated with a set of keywords $K_i \subseteq K$, where $K$ is the complete set of predefined keywords. The keyword set $K_i$ of a metadata record $m_i$ is mathematically represented as a binary vector $y_i$ of size $|K|$, where each element indicates the presence or absence of a specific label. A value of 1 at position $j$ denotes that the label $k_j \in K$ is present in the metadata record $m_i$ (that is, $k_j \in K_i$), while a value of 0 indicates its absence. A target matrix $\mathbf{Y}$ is a $|M_s| \times |K|$ binary matrix, where $|M_s|$ is the size of the sample set $M_s$ and $|K|$ is the size of the keyword set $K=\{k_1, k_2, \ldots, k_y\}$. Each entry $\mathbf{Y}[i, j]$ is 1 if metadata record $m_i$ is associated with keyword $k_j$, and 0 otherwise.
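
As a minimal sketch of Definition 3, the following builds a binary target matrix from per-record keyword sets. The keyword names and records are made-up illustrations, and `build_target_matrix` is a hypothetical helper, not a function of the project's `preprocessor` module.

```python
from typing import Dict, List, Sequence, Set


def build_target_matrix(
    keyword_sets: Sequence[Set[str]], all_keywords: Sequence[str]
) -> List[List[int]]:
    """Encode each keyword set K_i as a binary row y_i of length |K|."""
    # Map every keyword k_j to its column index j.
    column_of: Dict[str, int] = {k: j for j, k in enumerate(all_keywords)}
    matrix = []
    for keywords in keyword_sets:
        row = [0] * len(all_keywords)
        for keyword in keywords:
            row[column_of[keyword]] = 1
        matrix.append(row)
    return matrix


# Illustrative data only: three keywords and two categorised records.
K_example = ["temperature", "salinity", "CTD"]
K_sets_example = [{"temperature", "CTD"}, {"salinity"}]
Y_example = build_target_matrix(K_sets_example, K_example)
print(Y_example)  # [[1, 0, 1], [0, 1, 0]]
```

Each row corresponds to one record in $M_s$ and each column to one keyword in $K$, so the result has shape $|M_s| \times |K|$.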



In [1]:
# add the project module paths so the notebook can import them
import sys
import os

module_path = os.path.abspath(os.path.join('..'))
# build the paths with os.path.join so they work on any operating system
for sub_package in ("utils", "model", "common"):
    package_path = os.path.join(module_path, "data_discovery_ai", sub_package)
    if package_path not in sys.path:
        sys.path.append(package_path)

current_path = os.getcwd()

# import modules
import preprocessor
import keywordModel
import constants
import es_connector



As shown in the [framework](data-discovery-ai-framework.drawio.png), three distinct but connected modules work cooperatively as the keyword classifier pipeline. This notebook will go through the functions in these modules to show how we preprocess data, train the ML model, and make predictions.
## Data Preprocessing
The data preprocessing module is used to prepare data for training and testing models. Key features include: getting raw data, preparing sample data, converting textual data to numeric representations, resampling, and preparing input and output matrices.
### Getting Raw Data
Raw data refers to all metadata records $M$ stored in Elasticsearch. An Elasticsearch configuration file, `esManager.ini`, must be created in the folder `data_discovery_ai/common`, with two required fields: `end_point` and `api_key`. For more information, please refer to the [README](../README.md#file-structure). We first fetch raw data from Elasticsearch.

In [None]:
# load Elasticsearch configuration
import configparser
from pathlib import Path

def load_es_config() -> configparser.ConfigParser:
    elasticsearch_config_file_path = f"../data_discovery_ai/common/{constants.ELASTICSEARCH_CONFIG}"
    esConfig = configparser.ConfigParser()
    esConfig.read(elasticsearch_config_file_path)
    return esConfig
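
For orientation, here is a sketch of what such a configuration file might contain and how `configparser` reads it. Only the field names `end_point` and `api_key` come from this notebook; the section name `elasticsearch` and both values are placeholders assumed for illustration, so check the README for the actual layout.

```python
import configparser

# Hypothetical file content: the section name and values are placeholders.
EXAMPLE_INI = """
[elasticsearch]
end_point = https://example.invalid:9243
api_key = REPLACE_WITH_YOUR_KEY
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_INI)

# Fail early if a required field is missing.
required_fields = ("end_point", "api_key")
missing = [f for f in required_fields if f not in example_config["elasticsearch"]]
if missing:
    raise KeyError(f"Missing fields in esManager.ini: {missing}")

print(example_config["elasticsearch"]["end_point"])
```

Validating the required fields up front gives a clearer error than a failed connection attempt later on.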

In [None]:
# connect and query Elasticsearch
esConfig = load_es_config()
client = es_connector.connect_es(esConfig)
index = os.getenv("ES_INDEX_NAME", default=constants.ES_INDEX_NAME)
raw_data = es_connector.search_es(client=client, index=index, batch_size=constants.BATCH_SIZE, sleep_time=constants.SLEEP_TIME)

In [None]:
raw_data.info()

There are **12943** metadata records in the staging environment. We can also check that **1721** of them have no keyword information.

In [None]:
no_keyword_items = raw_data[raw_data['_source.themes'].apply(lambda x: x == [])]
no_keyword_items_count = no_keyword_items.shape[0]
no_keyword_items_count

### Identify Samples
The sample set $M_s$ is a subset of the raw dataset: the metadata records whose keywords include terms from the specified AODN vocabularies. We first identify samples from the raw data, and then preprocess the sample set.

In [None]:
# get predefined vocabs
def load_keyword_config() -> configparser.ConfigParser:
    keyword_config_file_path = f"../data_discovery_ai/common/{constants.KEYWORD_CONFIG}"
    keywordConfig = configparser.ConfigParser()
    keywordConfig.read(keyword_config_file_path)
    return keywordConfig
keywordConfig = load_keyword_config()
vocabs = keywordConfig["preprocessor"]["vocabs"].split(", ")
vocabs

The identified sample labels have the following format:

In [None]:
# identify samples with predefined vocabs
identified_sampleSet = preprocessor.identify_km_sample(raw_data, vocabs)
identified_sampleSet.iloc[0]["keywords"]

The keywords are in a nested JSON format, so we need to flatten them and remove keywords that are not in the target vocabularies.

In [None]:
preprocessed_SampleSet = preprocessor.sample_preprocessor(identified_sampleSet, vocabs)
preprocessed_SampleSet
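
The flattening and filtering idea can be sketched as follows. The nested input shape (a list of themes, each with a `title` and a list of `concepts`) is an assumption made for illustration, and `flatten_keywords` is a hypothetical stand-in rather than the actual `preprocessor.sample_preprocessor` logic. The output keys mirror the keyword format printed at the end of this notebook.

```python
from typing import Any, Dict, List


def flatten_keywords(
    themes: List[Dict[str, Any]], vocabs: List[str]
) -> List[Dict[str, str]]:
    """Flatten nested themes and keep only concepts from the target vocabularies."""
    flat = []
    for theme in themes:
        vocab_type = theme.get("title", "")
        if vocab_type not in vocabs:
            continue  # drop keywords from non-target vocabularies
        for concept in theme.get("concepts", []):
            flat.append(
                {"vocab_type": vocab_type, "value": concept["id"], "url": concept["url"]}
            )
    return flat


# Assumed input shape; the second theme is a made-up non-target vocabulary.
example_themes = [
    {
        "title": "AODN Discovery Parameter Vocabulary",
        "concepts": [
            {
                "id": "abundance of biota",
                "url": "http://vocab.aodn.org.au/def/discovery_parameter/entity/488",
            }
        ],
    },
    {
        "title": "Some Other Vocabulary",
        "concepts": [{"id": "oceans", "url": "http://example.invalid/oceans"}],
    },
]
example_vocabs = ["AODN Discovery Parameter Vocabulary"]
flat_keywords = flatten_keywords(example_themes, example_vocabs)
print(flat_keywords)
```

Only the concept from the target vocabulary survives; the other theme is dropped entirely.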

Next, we clean the sample set. For instance, the row at index `20` has an empty keyword field (`[]`).

In [None]:
filtered_sampleSet = preprocessed_SampleSet[preprocessed_SampleSet["keywords"].apply(lambda x: x != [])]
filtered_sampleSet

Then we calculate embeddings for the title and description fields, which are used as the input feature matrix.

In [None]:
finalSampleSet = preprocessor.calculate_embedding(filtered_sampleSet)

In [None]:
finalSampleSet

### Prepare Train and Test Sets
We now have the sample set with extra embedding information. We are going to split the sample set into train and test sets by preparing the input feature matrix $X$ and the output target matrix $Y$. The input feature matrix $X$ is built from the embedding column, and the output $Y$ is the mathematical representation of the keyword column.

In [None]:
X, Y, Y_df, labelMap = preprocessor.prepare_X_Y(finalSampleSet)

We have prepared the input feature matrix `X` and the output target matrix `Y`. Additionally, we have `Y_df`, which adds column names to the `Y` matrix, and `labelMap`, which describes the set of predefined keywords. In `labelMap`, each key is an encoded number corresponding to a column name in `Y_df`, and each value is a Concept object. We can inspect a Concept object with its `to_json()` function.

In [None]:
Y_df

In [None]:
labelMap.get(0).to_json()

In [None]:
rare_label_index = preprocessor.identify_rare_labels(Y_df, constants.RARE_LABEL_THRESHOLD, list(labelMap.keys()))
len(rare_label_index)

We found that among 525 unique keywords, 332 appear fewer times than the `RARE_LABEL_THRESHOLD`. So we first duplicate the records that carry these rare labels, using a customised resampling strategy.

In [None]:
X_oversampled, Y_oversampled = preprocessor.resampling(
            X_train=X, Y_train=Y, strategy="custom", rare_keyword_index=rare_label_index
        )
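
To make the two steps above concrete, here is a small sketch of finding rare labels by column count and duplicating the records that carry them. The helper names, the threshold of 2, and the single extra copy are assumptions for illustration; the project's `identify_rare_labels` and `resampling` functions may behave differently.

```python
from typing import List, Sequence, Tuple


def find_rare_labels(Y: Sequence[Sequence[int]], threshold: int) -> List[int]:
    """Return column indices of labels that occur fewer than `threshold` times."""
    n_labels = len(Y[0])
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    return [j for j, count in enumerate(counts) if count < threshold]


def duplicate_rare_records(
    X: Sequence, Y: Sequence[Sequence[int]], rare_labels: Sequence[int], copies: int = 1
) -> Tuple[list, list]:
    """Append `copies` extra copies of every record that carries a rare label."""
    X_out, Y_out = list(X), list(Y)
    for x, y in zip(X, Y):
        if any(y[j] == 1 for j in rare_labels):
            X_out.extend([x] * copies)
            Y_out.extend([y] * copies)
    return X_out, Y_out


# Toy data: label 0 appears three times, label 1 only once.
X_toy = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
Y_toy = [[1, 0], [1, 0], [1, 1]]
rare = find_rare_labels(Y_toy, threshold=2)
X_dup, Y_dup = duplicate_rare_records(X_toy, Y_toy, rare)
print(rare, len(X_dup))  # [1] 4
```

Only the third record carries the rare label, so it is the only one duplicated.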

Now the sample size has increased from 647 to 1677, because the records with rare labels have been duplicated. We can now split the sample set into train and test sets following an 80%-20% split.

In [None]:
dim, n_labels, X_train, Y_train, X_test, Y_test = (
            preprocessor.prepare_train_test(X_oversampled, Y_oversampled, keywordConfig)
        )
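
An 80%-20% split can be sketched in plain Python as below. This is an illustrative shuffle-based split under an assumed fixed seed, and `split_train_test` is a hypothetical helper; `preprocessor.prepare_train_test` reads its settings from the configuration and may split differently.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(
    X: Sequence, Y: Sequence, test_size: float = 0.2, seed: int = 42
) -> Tuple[List, List, List, List]:
    """Shuffle record indices, then hold out `test_size` of them for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_tr = [X[i] for i in train_idx]
    Y_tr = [Y[i] for i in train_idx]
    X_te = [X[i] for i in test_idx]
    Y_te = [Y[i] for i in test_idx]
    return X_tr, Y_tr, X_te, Y_te


# Ten toy records, each identified by its feature value.
X_records = [[float(i)] for i in range(10)]
Y_records = [[i % 2] for i in range(10)]
X_tr, Y_tr, X_te, Y_te = split_train_test(X_records, Y_records)
print(len(X_tr), len(X_te))  # 8 2
```

Shuffling indices rather than the data keeps each feature row aligned with its label row.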

Next, we perform oversampling only on the training set, so that no oversampled training samples are introduced into the test set. This ensures the model does not encounter training data during testing.

In [None]:
X_train_oversampled, Y_train_oversampled = preprocessor.resampling(
            X_train=X_train, Y_train=Y_train, strategy="ROS", rare_keyword_index=None
        )

Then, we calculate the class weights, which are applied during model training so that majority classes receive lower weights and minority classes receive higher weights.

In [None]:
label_weight_dict = keywordModel.get_class_weights(Y_train)
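
One common way to obtain such weights is inverse label frequency, sketched below. The formula $w_j = n / (|K| \cdot c_j)$, where $c_j$ is the number of records carrying label $j$, is an assumption chosen for illustration; `keywordModel.get_class_weights` may use a different scheme.

```python
from typing import Dict, Sequence


def inverse_frequency_weights(Y: Sequence[Sequence[int]]) -> Dict[int, float]:
    """Weight each label inversely to how often it occurs in Y."""
    n_samples = len(Y)
    n_labels = len(Y[0])
    weights = {}
    for j in range(n_labels):
        count = sum(row[j] for row in Y)
        # Guard against division by zero for labels with no positive example.
        weights[j] = n_samples / (n_labels * max(count, 1))
    return weights


# Toy data: label 0 is present in all four records, label 1 in only one.
Y_weights_toy = [[1, 0], [1, 0], [1, 1], [1, 0]]
toy_weights = inverse_frequency_weights(Y_weights_toy)
print(toy_weights)  # {0: 0.5, 1: 2.0}
```

The frequent label receives a weight below 1 and the rare label a weight above 1, which matches the intent described above.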

Now, we have prepared all the data we need for training a keyword classification model. Let's move on to the next stage.

## Training and Evaluation of Model
A model name is required for training a model. As mentioned in [README.md](../README.md), available options are: `development`, `experimental`, `staging`, `production`, `benchmark`.

In [None]:
model_name = "development"

In [None]:
trained_model, history, model_name = keywordModel.keyword_model(
            model_name=model_name,
            X_train=X_train,
            Y_train=Y_train,
            X_test=X_test,
            Y_test=Y_test,
            class_weight=label_weight_dict,
            dim=dim,
            n_labels=n_labels,
            params=keywordConfig,
        )

Then, we evaluate the trained model.

In [None]:
confidence = keywordConfig.getfloat("keywordModel", "confidence")
top_N = keywordConfig.getint("keywordModel", "top_N")
predicted_labels = keywordModel.prediction(
    X_test, trained_model, confidence, top_N
)
# use a descriptive name to avoid shadowing the built-in `eval`
evaluation_results = keywordModel.evaluation(
    Y_test=Y_test, predictions=predicted_labels
)
evaluation_results

We obtained 94% precision, 92% recall, and a 93% F1 score, which is a reasonable result, but we can still try different hyperparameters to improve model performance. Please refer to [README.md](../README.md) for the hyperparameter descriptions. To adjust the model hyperparameters, edit the file `data_discovery_ai/common/keyword_classification_parameters.ini`.
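
For readers who want to see how precision, recall, and F1 relate to the binary matrices, here is a small sketch that computes micro-averaged scores over all label positions. Micro-averaging is an assumption made for this example; `keywordModel.evaluation` may average differently, and `micro_scores` is a hypothetical helper.

```python
from typing import Dict, Sequence


def micro_scores(
    Y_true: Sequence[Sequence[int]], Y_pred: Sequence[Sequence[int]]
) -> Dict[str, float]:
    """Compute micro-averaged precision, recall and F1 for binary label matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(Y_true, Y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1  # predicted and present
            elif p == 1 and t == 0:
                fp += 1  # predicted but absent
            elif p == 0 and t == 1:
                fn += 1  # present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy matrices: two correct labels, one false positive, one missed label.
Y_true_toy = [[1, 0, 1], [0, 1, 0]]
Y_pred_toy = [[1, 0, 0], [0, 1, 1]]
scores = micro_scores(Y_true_toy, Y_pred_toy)
print(scores)
```

In this toy case precision, recall, and F1 all equal 2/3, because there are two true positives, one false positive, and one false negative.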

## Make Prediction

Now that we have the trained model, let's use it to make a prediction. Let's assume we have an item entitled *"Corals and coral communities of Lord Howe Island, Australia"* with an abstract *"Ecological and taxonomic surveys of hermatypic scleractinian corals were carried out at approximately 100 sites around Lord Howe Island. Sixty-six of these sites were located on reefs in the lagoon, which extends for two-thirds of the length of the island on the western side. Each survey site consisted of a section of reef surface, which appeared to be topographically and faunistically homogeneous. The dimensions of the sites surveyed were generally of the order of 20m by 20m. Where possible, sites were arranged contiguously along a band up the reef slope and across the flat. The cover of each species was graded on a five-point scale of percentage relative cover. Other site attributes recorded were depth (minimum and maximum corrected to datum), slope (estimated), substrate type, total estimated cover of soft coral and algae (macroscopic and encrusting coralline). Coral data from the lagoon and its reef (66 sites) were used to define a small number of site groups which characterize most of this area.Throughout the survey, corals of taxonomic interest or difficulty were collected, and an extensive photographic record was made to augment survey data. A collection of the full range of form of all coral species was made during the survey and an identified reference series was deposited in the Australian Museum.In addition, less detailed descriptive data pertaining to coral communities and topography were recorded on 12 reconnaissance transects, the authors recording changes seen while being towed behind a boat.
 The purpose of this study was to describe the corals of Lord Howe Island (the southernmost Indo-Pacific reef) at species and community level using methods that would allow differentiation of community types and allow comparisons with coral communities in other geographic locations."* that is unlabelled.

In [None]:
item_title = "Corals and coral communities of Lord Howe Island, Australia"
item_abstract = """Ecological and taxonomic surveys of hermatypic scleractinian corals were carried out at approximately 100 sites around Lord Howe Island. Sixty-six of these sites were located on reefs in the lagoon, which extends for two-thirds of the length of the island on the western side. Each survey site consisted of a section of reef surface, which appeared to be topographically and faunistically homogeneous. The dimensions of the sites surveyed were generally of the order of 20m by 20m. Where possible, sites were arranged contiguously along a band up the reef slope and across the flat. The cover of each species was graded on a five-point scale of percentage relative cover. Other site attributes recorded were depth (minimum and maximum corrected to datum), slope (estimated), substrate type, total estimated cover of soft coral and algae (macroscopic and encrusting coralline). Coral data from the lagoon and its reef (66 sites) were used to define a small number of site groups which characterize most of this area.Throughout the survey, corals of taxonomic interest or difficulty were collected, and an extensive photographic record was made to augment survey data. A collection of the full range of form of all coral species was made during the survey and an identified reference series was deposited in the Australian Museum.In addition, less detailed descriptive data pertaining to coral communities and topography were recorded on 12 reconnaissance transects, the authors recording changes seen while being towed behind a boat.
 The purpose of this study was to describe the corals of Lord Howe Island (the southernmost Indo-Pacific reef) at species and community level using methods that would allow differentiation of community types and allow comparisons with coral communities in other geographic locations."""
description = f"{item_title}: {item_abstract}"

We first prepare the input feature matrix X, which is the embedding of this description.

In [None]:
description_embedding = preprocessor.get_description_embedding(description)
dimension = description_embedding.shape[0]
target_X = description_embedding.reshape(1, dimension)
target_X

The ML model is a probabilistic model. Its outputs are the probabilities that each label applies to an item, given the embedding of its title and abstract. We can check the output by loading the pretrained model and printing its predictions.

In [None]:
pretrained_model = keywordModel.load_saved_model(model_name)
pretrained_model

In [None]:
pretrained_model.predict(target_X)

Global parameters `confidence` and `top_N` are assigned in the `data_discovery_ai/common/keyword_classification_parameters.ini` configuration file.

- The `confidence` parameter specifies the probability threshold. Probabilities exceeding this value indicate that the keyword is considered present in the item; otherwise, it is not.
- The `top_N` parameter is used to select predicted keywords when no probability exceeds the confidence threshold. In this case, the top N keywords are selected and considered to appear in the item record.

Then we use the trained model and `target_X` to make a prediction.

In [None]:
target_predicted_labels = keywordModel.prediction(
        target_X,
        trained_model,
        keywordConfig.getfloat("keywordModel", "confidence"),
        keywordConfig.getint("keywordModel", "top_N"),
    )
target_predicted_labels
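
The selection rule described by `confidence` and `top_N` can be sketched for a single item as follows. `select_labels` is a hypothetical helper written from the description above; the actual `keywordModel.prediction` implementation may differ in details such as tie handling.

```python
from typing import List, Sequence


def select_labels(
    probabilities: Sequence[float], confidence: float, top_n: int
) -> List[int]:
    """Turn label probabilities into a binary vector using a threshold with fallback."""
    selected = [1 if p > confidence else 0 for p in probabilities]
    if not any(selected):
        # Fallback: no label passed the threshold, so keep the top_n most probable.
        ranked = sorted(
            range(len(probabilities)), key=lambda j: probabilities[j], reverse=True
        )
        for j in ranked[:top_n]:
            selected[j] = 1
    return selected


# Two labels exceed the threshold, so the threshold alone decides.
print(select_labels([0.1, 0.8, 0.6], confidence=0.5, top_n=2))  # [0, 1, 1]
# No label exceeds the threshold, so the two most probable labels are kept.
print(select_labels([0.4, 0.1, 0.3], confidence=0.5, top_n=2))  # [1, 0, 1]
```

The fallback guarantees that every item receives at least one predicted keyword, even when the model is not confident about any label.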

The output is in a binary format: a value of 1 at a given index means that the keyword at that index is likely to apply to the item. So, we convert this binary array to a readable format.

In [None]:
prediction = keywordModel.get_predicted_keywords(target_predicted_labels, labelMap)
prediction
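
Conceptually, this conversion looks up every index whose value is 1 in the label map. The sketch below uses a plain dictionary of strings in place of `labelMap` and its Concept objects, and the third keyword is made up for illustration; `decode_labels` is a hypothetical helper, not the project's `get_predicted_keywords`.

```python
from typing import Dict, List, Sequence


def decode_labels(binary_row: Sequence[int], label_map: Dict[int, str]) -> List[str]:
    """Return the keywords whose positions are set to 1 in the binary row."""
    return [label_map[j] for j, flag in enumerate(binary_row) if flag == 1]


# Simplified label map: column index -> keyword value.
toy_label_map = {
    0: "abundance of biota",
    1: "biotic taxonomic identification",
    2: "temperature",
}
decoded = decode_labels([1, 1, 0], toy_label_map)
print(decoded)  # ['abundance of biota', 'biotic taxonomic identification']
```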

So the most likely keywords for this item are `[{'vocab_type': 'AODN Discovery Parameter Vocabulary',
  'value': 'abundance of biota',
  'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/488'},
 {'vocab_type': 'AODN Discovery Parameter Vocabulary',
  'value': 'biotic taxonomic identification',
  'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/489'}]`