# ML Model for Keyword Classification
This notebook introduces (1) how we explore, prepare and preprocess the datasets; (2) how we train and evaluate the ML model; and (3) how we use this trained ML model.
## Problem Description
In the catalogue $C = \{M, K, P\}$, a subset of metadata records, $M_t \subseteq M$, has not yet been categorised with keywords; that is, $K_i = \emptyset$ for all $m_i \in M_t$. Another subset of metadata records, $M_s \subseteq M$, has already been categorised with keywords (i.e., $K_i \neq \emptyset$ for all $m_i \in M_s$). The research question is as follows:

How can we design and develop a machine learning model, denoted as $MM_{keywords}$, that automatically labels the uncategorised metadata records $M_t$ using a predefined set of keywords $K$? Specifically, the model should learn a mapping rule $d_i \mapsto K_i$ from the patterns observed in the labelled metadata records $M_s$, where each description $d_i$ of a metadata record $m_i \in M_s$ is associated with a set of keywords $K_i$. Once trained, the model should apply this learned mapping to the records in $M_t$, assigning keywords to each record based on its description.

To simplify the task, we restrict the scope of keywords to those falling within the primary AODN vocabulary:
- AODN Discovery Parameter Vocabulary

Only keywords $k_j \in K_i$ that belong to the vocabulary listed above are considered. Any keyword outside this vocabulary is excluded from $K_i$ for every metadata record in the categorised set $M_s$.

## Explore Sample Set $M_s$

In [None]:
import utils.preprocessor as preprocessor
sampleDS = preprocessor.load_sample()
sampleDS.describe()

The keywords $K_i$ of a metadata record $m_i$ are stored in JSON format, for example:

In [None]:
sampleDS.loc[0]['keywords']

## Feature Engineering
We preprocessed the keywords field for better readability, converting the JSON format to a list of keywords.

In [None]:
sampleDS = preprocessor.extract_labels(sampleDS)
sampleDS.loc[0]['keywords']
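
The JSON layout consumed by `preprocessor.extract_labels` is not shown in this notebook. As a minimal sketch of the idea, assume each keywords field is a JSON array of groups, each with a hypothetical `vocabulary` title and a `concepts` list of `{"id": ...}` entries. The function name `extract_keyword_list` and both field names are illustrative assumptions, not the project's API.

```python
import json

# Vocabulary title taken from the problem description above.
ALLOWED_VOCABULARIES = {"AODN Discovery Parameter Vocabulary"}


def extract_keyword_list(keywords_json, allowed=ALLOWED_VOCABULARIES):
    """Flatten a JSON keywords field into a de-duplicated list of labels.

    Hypothetical layout:
        [{"vocabulary": str, "concepts": [{"id": str}, ...]}, ...]
    """
    labels = []
    for group in json.loads(keywords_json):
        if group.get("vocabulary") not in allowed:
            continue  # drop keywords from vocabularies outside the scope
        for concept in group.get("concepts", []):
            label = concept.get("id")
            if label and label not in labels:
                labels.append(label)
    return labels


raw = json.dumps([
    {"vocabulary": "AODN Discovery Parameter Vocabulary",
     "concepts": [{"id": "Temperature of the water body"}]},
    {"vocabulary": "Some other vocabulary",
     "concepts": [{"id": "ignored keyword"}]},
])
print(extract_keyword_list(raw))  # ['Temperature of the water body']
```

Filtering by vocabulary during extraction means a record whose keywords all come from other vocabularies ends up with an empty list, which is why the cleaning step that follows is needed.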

Clean the samples by removing records whose keyword list is empty (see, for example, [4673208b-def5-4340-9818-8419496e4863](https://geonetwork-edge.edge.aodn.org.au/geonetwork/srv/eng/catalog.search#/metadata/4673208b-def5-4340-9818-8419496e4863), [f55a53db-09fc-480d-aa9e-2aa6bb304b8c](f55a53db-09fc-480d-aa9e-2aa6bb304b8c), and [d265307c-5a6a-4a52-b352-35ad904fca52](https://geonetwork-edge.edge.aodn.org.au/geonetwork/srv/eng/catalog.search#/metadata/d265307c-5a6a-4a52-b352-35ad904fca52)).

In [None]:
# Count the keywords of each record and collect the index labels of empty ones.
list_lengths = sampleDS['keywords'].apply(len)
empty_keywords_records_index = list_lengths[list_lengths == 0].index.tolist()
print(f'Indices of records with an empty keywords field: {empty_keywords_records_index}')

In [None]:
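# Collect the ids of the records whose keyword list is empty.
# The indices gathered above are index labels, so use .loc rather than the positional .iloc.
empty_keywords_records = sampleDS.loc[empty_keywords_records_index, 'id'].tolist()
# Keep only the records that have at least one keyword.
sampleDS_cleaned = sampleDS[~sampleDS['id'].isin(empty_keywords_records)]
sampleDS_cleaned.info()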

We use the embeddings of the metadata records as the input (features) $X$ and the keywords (labels) as the output $Y$. To do so, we convert the labels into a numerical representation: a binary matrix with one row per record and one column per keyword.

In [None]:
Y = preprocessor.prepare_Y_matrix(sampleDS_cleaned)
labels = Y.columns
Y
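
How `prepare_Y_matrix` builds the matrix is not shown here. The sketch below illustrates the general multi-hot encoding it presumably produces; `to_binary_matrix` and the toy keyword lists are hypothetical names and data chosen for illustration.

```python
def to_binary_matrix(keyword_lists):
    """Encode lists of keywords as a multi-hot (binary) matrix.

    Returns the sorted label names and one row per record, where
    entry [i][j] is 1 if record i carries label j and 0 otherwise.
    """
    label_names = sorted({k for keywords in keyword_lists for k in keywords})
    column = {name: j for j, name in enumerate(label_names)}
    matrix = [[0] * len(label_names) for _ in keyword_lists]
    for i, keywords in enumerate(keyword_lists):
        for k in keywords:
            matrix[i][column[k]] = 1
    return label_names, matrix


demo_records = [["Salinity", "Temperature"], ["Temperature"], ["Chlorophyll"]]
demo_labels, demo_Y = to_binary_matrix(demo_records)
print(demo_labels)  # ['Chlorophyll', 'Salinity', 'Temperature']
print(demo_Y)       # [[0, 1, 1], [0, 0, 1], [1, 0, 0]]
```

scikit-learn's `MultiLabelBinarizer` performs the same encoding and is a common choice when the data is already in a list-of-lists form.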

### Label Distribution
By plotting the label distribution, we can have a deeper understanding of the sample dataset.

In [None]:
from matplotlib import pyplot as plt

category_distribution = Y.copy()
category_distribution = category_distribution.sum()

category_distribution.sort_values()

# plt.figure(figsize=(15,60))
# category_distribution.sort_values().plot(kind='barh', color='skyblue', edgecolor='black')
# plt.title("Keywords Distribution")
# plt.ylabel("Keywords")
# plt.xlabel("Count of Related Metadata Records")
# plt.xticks(fontsize=12)
# plt.yticks(fontsize=10)
# plt.tight_layout()
# plt.show()

In [None]:
count_K = Y.copy()
count_K['Label Count'] = Y.sum(axis=1)

print(f"Average number of labels each record has: {count_K['Label Count'].mean()}")
print(f"Maximum number of labels a record has: {count_K['Label Count'].max()}")
print(f"Minimum number of labels a record has: {count_K['Label Count'].min()}")

Based on this statistical analysis, we identified two key challenges in this multi-label classification task:

- **Global Label Imbalance**: The keyword distribution is highly imbalanced. For example, the keyword 'Temperature of the water body' appears in over 1,000 records, while 118 keywords appear only once. This imbalance causes the model to favor predicting common keywords like 'Temperature of the water body' over rare keywords, leading to biased and less accurate predictions for minority classes.

- **Internal Label Imbalance**: On average, each record is associated with 4.14 keywords. In other words, a typical record has about 4.14 positive labels (keywords present), while all remaining labels are negative (keywords absent). As a result, the model tends to predict negative labels more often, making it biased towards predicting the absence of keywords.
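
Both kinds of imbalance can be read directly off the binary label matrix. The following is a minimal sketch on toy data; `imbalance_summary` and `toy_Y` are hypothetical names, and the numbers are not taken from the sample set.

```python
def imbalance_summary(Y):
    """Summarise global and internal label imbalance of a binary matrix."""
    n_records, n_labels = len(Y), len(Y[0])
    # Global view: how many records carry each label (column sums).
    label_counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    # Internal view: how many labels each record carries (row sums).
    labels_per_record = [sum(row) for row in Y]
    total_positives = sum(labels_per_record)
    return {
        "max_label_count": max(label_counts),
        "min_label_count": min(label_counts),
        "singleton_labels": sum(1 for c in label_counts if c == 1),
        "mean_labels_per_record": total_positives / n_records,
        # Share of matrix cells that are positive.
        "positive_rate": total_positives / (n_records * n_labels),
    }


toy_Y = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
print(imbalance_summary(toy_Y))
```

A large gap between `max_label_count` and `min_label_count` signals global imbalance, while a low `positive_rate` signals internal imbalance.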

To mitigate the label distribution issue, we first oversample the rare classes, that is, the labels that appear only once or twice across all sample records.
### First Round Resampling

In [None]:
# Identify rare labels: those that appear in only one or two records.
category_distribution_df = category_distribution.to_frame(name='count')
# Select with a single boolean mask; adding two filtered frames would turn the counts into NaN.
rare_category = category_distribution_df[category_distribution_df['count'].isin([1, 2])]
print(f'Number of rare labels (appearing in one or two records): {len(rare_category)}')

In [10]:
import numpy as np
from utils.preprocessor import resampling

X = np.array(sampleDS_cleaned['embedding'].tolist())
X_oversampled, Y_oversampled = resampling(X_train=X, Y_train=Y.to_numpy(), strategy='ROS')

Again, we can check the label distribution for the oversampled data.

In [None]:
import pandas as pd
K_oversampled = pd.DataFrame(Y_oversampled, columns=labels)

category_distribution = K_oversampled.copy()
category_distribution = category_distribution.sum()

category_distribution_df = category_distribution.to_frame(name='count')
# Re-check how many labels still appear in only one or two records.
rare_category = category_distribution_df[category_distribution_df['count'].isin([1, 2])]
print(f'Number of rare labels (appearing in one or two records): {len(rare_category)}')

category_distribution_df['count'].min()

In this way, records with rare labels are duplicated to increase their frequency. After oversampling, the minimum label count is 539, meaning every label appears in at least 539 records. This mitigates the global label imbalance described above.
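
The internals of the `resampling` helper with `strategy='ROS'` are not shown in this notebook. As a rough illustration of oversampling by duplication, the sketch below copies randomly chosen records that carry a rare label until every label reaches a target count. The function `oversample_rare_labels`, its `min_count` parameter, and the toy data are assumptions made for this example.

```python
import random


def oversample_rare_labels(X, Y, min_count, seed=0):
    """Duplicate records until every label occurs at least `min_count` times."""
    rng = random.Random(seed)
    X_out = list(X)
    Y_out = [list(row) for row in Y]
    for j in range(len(Y[0])):
        # Original records that carry label j.
        carriers = [i for i, row in enumerate(Y) if row[j] == 1]
        if not carriers:
            continue  # nothing to duplicate for a label that never occurs
        # Count includes duplicates already added for earlier labels.
        count = sum(row[j] for row in Y_out)
        while count < min_count:
            i = rng.choice(carriers)
            X_out.append(X[i])
            Y_out.append(list(Y[i]))
            count += 1
    return X_out, Y_out


toy_X = [[0.1], [0.2], [0.3]]
toy_Y = [[1, 0], [1, 0], [0, 1]]
X_over, Y_over = oversample_rare_labels(toy_X, toy_Y, min_count=2)
print(len(X_over))  # 4
```

Because a duplicated record brings all of its labels with it, frequent labels that co-occur with rare ones also grow, so the resulting distribution is flatter but not uniform.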

## Prepare Train and Test Sets
We use the description embeddings as the input X and the keyword vectors as the output Y. We split the data into train and test sets in an 80%-20% proportion. Note that we do not create a separate validation set at this step: the validation set is split off when fitting the model, by setting the parameter `validation_split=0.1`.

In [None]:
from utils.preprocessor import prepare_train_validation_test

dimension, n_labels, X_train, Y_train, X_test, Y_test = prepare_train_validation_test(X_oversampled, Y_oversampled)
print(Y_train.shape)
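
The implementation of `prepare_train_validation_test` is not shown here. A shuffled 80%-20% split can be sketched as follows; `split_train_test`, its parameters, and the toy data are hypothetical.

```python
import random


def split_train_test(X, Y, test_fraction=0.2, seed=42):
    """Shuffle record indices and hold out `test_fraction` of them for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and Y with the same positions so features and labels stay aligned.
    X_tr = [X[i] for i in train_idx]
    Y_tr = [Y[i] for i in train_idx]
    X_te = [X[i] for i in test_idx]
    Y_te = [Y[i] for i in test_idx]
    return X_tr, Y_tr, X_te, Y_te


toy_X = [[float(i), float(i) / 2] for i in range(10)]
toy_Y = [[i % 2, 1 - i % 2] for i in range(10)]
X_tr, Y_tr, X_te, Y_te = split_train_test(toy_X, toy_Y)
print(len(X_tr), len(X_te))  # 8 2
```

One design point to keep in mind: when records are duplicated before the split, copies of the same record can land in both the train and the test set unless the split helper guards against it, which can make test scores look better than they are. Splitting first and oversampling only the training portion is a common alternative.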

### Resampling for Minority Class

In [13]:
X_train_resampled, Y_train_resampled = resampling(X_train=X_train, Y_train=Y_train, strategy='SMOTE')
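
Standard SMOTE creates a synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. How the `resampling` helper adapts this to the multi-label setting is not shown in this notebook, so the sketch below covers only the core interpolation step, simplified to a single nearest neighbour. The names `nearest_neighbor` and `smote_like_sample` are hypothetical.

```python
import random


def nearest_neighbor(x, candidates):
    """Return the candidate closest to x (Euclidean), excluding x itself."""
    def squared_distance(other):
        return sum((a - b) ** 2 for a, b in zip(x, other))
    return min((c for c in candidates if c is not x), key=squared_distance)


def smote_like_sample(x, neighbor, rng):
    """Pick a random point on the line segment between x and its neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


minority = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
x = minority[0]
neighbor = nearest_neighbor(x, minority)
synthetic = smote_like_sample(x, neighbor, random.Random(0))
print(neighbor)   # [1.0, 1.0]
print(synthetic)  # a point between [0.0, 0.0] and [1.0, 1.0]
```

Unlike duplication, interpolation produces new feature vectors, so it is typically applied to the training set only, as in the cell above.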

## Train Model

In [None]:
import model.keywordModel as km
label_weight_dict = {}
model, history = km.keyword_model(X_train, Y_train, X_test, Y_test, label_weight_dict, dimension, n_labels)

## Evaluate Model

In [None]:
from model.keywordModel import evaluation, prediction

confidence = 0.5
predicted_labels = prediction(X_test, model, confidence)
# Use a name other than `eval` to avoid shadowing the Python built-in.
eval_results = evaluation(Y_test=Y_test, predictions=predicted_labels)
print(eval_results)
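
The metrics reported by `evaluation` and the exact thresholding rule inside `prediction` are not shown in this notebook. As a sketch under those caveats, the example below turns per-label probabilities into binary predictions with a confidence threshold (assumed here to be inclusive) and computes micro-averaged precision, recall, and F1. All function names and numbers are illustrative.

```python
def apply_threshold(probabilities, confidence=0.5):
    """Mark a label as predicted when its probability reaches the threshold."""
    return [[1 if p >= confidence else 0 for p in row] for row in probabilities]


def micro_precision_recall_f1(y_true, y_pred):
    """Pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


toy_probs = [[0.9, 0.2, 0.6], [0.4, 0.8, 0.1]]
toy_true = [[1, 0, 0], [1, 1, 0]]
toy_pred = apply_threshold(toy_probs, confidence=0.5)
print(toy_pred)  # [[1, 0, 1], [0, 1, 0]]
print(micro_precision_recall_f1(toy_true, toy_pred))
```

Raising the threshold generally trades recall for precision, which matters here because the internal label imbalance already pushes the model towards predicting absence.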

## Make Prediction

We first load the target set $M_t$.

In [None]:
from utils.preprocessor import load_target
targetDS = load_target()
targetDS.info()

In [None]:
from model.keywordModel import prediction
target_X = np.array(targetDS['embedding'].tolist())
target_predicted_labels = prediction(target_X, model, confidence)

In [18]:
from model.keywordModel import get_predicted_keywords
get_predicted_keywords(target_predicted_labels, labels, targetDS)
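
The return format of `get_predicted_keywords` is not shown here. Conceptually, this last step maps each binary prediction row back to keyword names using the column order of the label matrix. A minimal sketch, with the hypothetical helper `decode_predictions` and toy inputs:

```python
def decode_predictions(predicted_rows, label_names):
    """Translate binary prediction rows into lists of keyword names."""
    return [
        [name for name, flag in zip(label_names, row) if flag == 1]
        for row in predicted_rows
    ]


demo_label_names = ["Chlorophyll", "Salinity", "Temperature"]
demo_predictions = [[0, 1, 1], [0, 0, 0]]
print(decode_predictions(demo_predictions, demo_label_names))
# [['Salinity', 'Temperature'], []]
```

An empty list means no label reached the confidence threshold for that record, so such records remain uncategorised.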