# ML model for Keyword Classification
This notebook introduces (1) how we prepare and preprocess the datasets; (2) how we train and evaluate the ML model; and (3) how we use this trained ML model.

## 1. Prepare Datasets

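Query the records from Elasticsearch with the following request. Make sure `size` is larger than the actual number of records so that all records are returned. Note that Elasticsearch caps `from + size` at `index.max_result_window` (10,000 by default), so a `size` above that limit only works if the setting has been raised on the index.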
```
    POST /es-indexer-edge/_search
    {
    "size": 11000,
    "query": {
        "match_all": {}
    }
    }
```
To get the IMOS records only:
```
    POST /es-indexer-edge/_search
    {
    "size": 800,
    "query": {
        "bool": {
        "must": [
            {
            "match": {
                "providers.name": "IMOS"
            }
            }
        ]
        }
    }
    }
```
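
The search response nests each record under `hits.hits`. As a minimal sketch, the snippet below shows how such a response could be flattened into a tab-separated layout with the '_id', '_source.title', and '_source.description' columns used later in this notebook. The sample response and the helper name `hits_to_rows` are illustrative assumptions, not part of the original pipeline.

```python
import csv
import io

def hits_to_rows(response):
    """Flatten an Elasticsearch search response into dicts with dotted column names."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        rows.append({
            "_id": hit["_id"],
            "_source.title": source.get("title", ""),
            "_source.description": source.get("description", ""),
        })
    return rows

# Hypothetical response following the standard hits.hits layout.
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "a1", "_source": {"title": "Mooring data", "description": "Temperature records."}},
            {"_id": "b2", "_source": {"title": "Glider data", "description": "Salinity profiles."}},
        ],
    }
}

rows = hits_to_rows(sample_response)

# Compare the reported total with the number of returned hits to confirm
# that `size` was large enough to return every record.
all_returned = sample_response["hits"]["total"]["value"] == len(rows)

# Write the rows as TSV (to an in-memory buffer here; a file handle works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(all_returned)
print(tsv_text)
```

Comparing `hits.total.value` with the number of returned hits is one way to verify that the chosen `size` did not silently truncate the result.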

Step 1: Import necessary libraries

In [None]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, hamming_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.utils.class_weight import compute_class_weight
import pandas as pd
import ast
import pickle
import numpy as np
from tqdm import tqdm
import os
# TF_USE_LEGACY_KERAS is read when TensorFlow loads Keras, so set it before importing tensorflow.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

import logging
from matplotlib import pyplot as plt
from datetime import datetime

Step 2: Feature Extraction

In [2]:
DATASET = "./output/AODN_description.tsv"
KEYWORDS_DS = "./output/AODN_parameter_vocabs.tsv"
TARGET_DS = "./output/keywords_target.tsv"
VOCABS = ['AODN Discovery Parameter Vocabulary']

DATASET is a subset of the original source dataset that contains only the '_id', '_source.title', and '_source.description' columns. We keep these columns because '_source.description' is the feature X for the classification task: each description is converted to an embedding. Because computing embeddings is time-consuming, we save the processed dataset to a file and load it in later steps instead of recomputing it.

In [None]:
ds = pd.read_csv(DATASET, sep='\t')
ds.info()

In [None]:
ds = pd.read_csv(KEYWORDS_DS, sep='\t')
ds.describe()

In [5]:
ds = pd.read_csv(DATASET, sep='\t')
ds.describe()

def get_description_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
    
    with torch.no_grad():
        outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :] 
    return cls_embedding.squeeze().numpy()

def calculate_embedding(ds):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=False)
    model = BertModel.from_pretrained('bert-base-uncased')
    tqdm.pandas()
    ds['embedding'] = ds['_source.description'].progress_apply(lambda x: get_description_embedding(x, tokenizer, model))
    return ds

# saved_ds = calculate_embedding(ds)
# save_to_file(ds, './output/AODN.pkl')
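
The commented-out lines above compute the embeddings once and persist them. The implementations of `save_to_file` and `load_from_file` in `utils.preprocessor` are not shown in this notebook, so the sketch below uses hypothetical pickle-based stand-ins (`save_to_file_sketch`, `load_from_file_sketch`) and a hypothetical `load_or_compute` helper to illustrate the compute-once, load-afterwards pattern.

```python
import os
import pickle
import tempfile

def save_to_file_sketch(obj, path):
    """Assumed stand-in for utils.preprocessor.save_to_file: pickle obj to path."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_from_file_sketch(path):
    """Assumed stand-in for utils.preprocessor.load_from_file: unpickle from path."""
    with open(path, "rb") as f:
        return pickle.load(f)

def load_or_compute(path, compute_fn):
    """Return the cached object at path, computing and saving it first if it is missing."""
    if os.path.exists(path):
        return load_from_file_sketch(path)
    result = compute_fn()
    save_to_file_sketch(result, path)
    return result

calls = []

def expensive_embedding_step():
    # Placeholder for calculate_embedding(ds); records how often it runs.
    calls.append(1)
    return {"record-1": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "embeddings.pkl")
    first = load_or_compute(cache_path, expensive_embedding_step)   # computes and saves
    second = load_or_compute(cache_path, expensive_embedding_step)  # loads from disk

print(first == second, len(calls))
```

The expensive step runs only on the first call; the second call reads the pickle instead.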

Step 3: Prepare Target set

The target set consists of the metadata records for which we want the trained ML model to predict keywords, i.e. all non-categorised records. We attach the precomputed embeddings to these records.

In [None]:
from utils.preprocessor import load_from_file, save_to_file
def get_target_ds():
    target = pd.read_csv(TARGET_DS, sep='\t')
    aodn = load_from_file('./output/AODN.pkl')
    aodn.columns = ['id', 'title', 'description', 'embedding']
    merged_df = target.merge(aodn, on=['id', 'title','description'], how='left')
    return merged_df

target = get_target_ds()
print(target.head())

We can check that the keywords in the target dataset are all empty:

In [None]:
all_nan = target['keywords'].isnull().all()
all_nan

Step 4: Prepare train and test sets

We prepare the train and test sets from KEYWORDS_DS, the subset of the AODN dataset whose keywords use AODN vocabularies. We can check that the keywords in this dataset are all non-empty.

In [None]:
keyword_ds = pd.read_csv(KEYWORDS_DS, sep='\t')
all_not_nan = keyword_ds['keywords'].notnull().all()
all_not_nan

In [None]:
keyword_ds.describe()

In [None]:
keyword_ds.head()

We reformat the keywords field to make it easier to read.

In [None]:
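def keywords_formatter(text):
    # The keywords cell holds a stringified list of dicts; parse it safely.
    keywords = ast.literal_eval(text)
    k_list = []
    for keyword in keywords:
        # Keep only concepts from the vocabularies listed in VOCABS.
        if keyword['title'] not in VOCABS:
            continue
        for concept in keyword['concepts']:
            concept_str = keyword['title'] + ':' + concept['id']
            k_list.append(concept_str)
    return k_list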

def extract_labels(ds):
    ds['keywords'] = ds['keywords'].apply(lambda x: keywords_formatter(x))
    return ds

formatted_keywords_ds = extract_labels(keyword_ds)
print(formatted_keywords_ds['keywords'].iloc[0])
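
To make the transformation concrete, here is a small worked example of the formatter on a single raw value. The structure (a stringified list of dicts with 'title' and 'concepts', each concept carrying an 'id') follows the code above; the concept ids and the second vocabulary title are made up for illustration.

```python
import ast

VOCABS = ['AODN Discovery Parameter Vocabulary']

def keywords_formatter(text):
    """Parse the stringified keywords and keep 'vocabulary title:concept id' labels."""
    keywords = ast.literal_eval(text)
    k_list = []
    for keyword in keywords:
        if keyword['title'] not in VOCABS:
            continue
        for concept in keyword['concepts']:
            k_list.append(keyword['title'] + ':' + concept['id'])
    return k_list

# Hypothetical raw cell value, stored as a string in the TSV file.
raw = str([
    {
        'title': 'AODN Discovery Parameter Vocabulary',
        'concepts': [
            {'id': 'Temperature of the water body'},
            {'id': 'Practical salinity of the water body'},
        ],
    },
    {'title': 'Some Other Vocabulary', 'concepts': [{'id': 'ignored concept'}]},
])

labels = keywords_formatter(raw)
print(labels)
```

Concepts from vocabularies that are not listed in VOCABS are dropped, so a record can end up with an empty label list if none of its keywords come from the selected vocabulary.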

Then we attach the embedding column:

In [12]:
aodn = load_from_file('./output/AODN.pkl')
aodn.columns = ['id', 'title', 'description', 'embedding']
X_df = formatted_keywords_ds.merge(aodn, on=['id', 'title','description'], how='left')

# save for further use
save_to_file(X_df, './output/keyword_train.pkl')
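
Because the merge uses `how='left'`, any record without an exact match on 'id', 'title', and 'description' ends up with a missing embedding, which would later break the conversion to a NumPy array. A possible sanity check, sketched on toy data (the frames and values below are invented), uses the `indicator` option of `DataFrame.merge`:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the labelled records and the embedded AODN records.
labelled = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "title": ["T1", "T2", "T3"],
    "description": ["D1", "D2", "D3"],
})
embedded = pd.DataFrame({
    "id": ["a1", "b2"],
    "title": ["T1", "T2"],
    "description": ["D1", "D2"],
    "embedding": [np.zeros(4), np.ones(4)],
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = labelled.merge(embedded, on=["id", "title", "description"], how="left", indicator=True)

missing = merged[merged["_merge"] == "left_only"]
print(f"Rows without an embedding: {len(missing)}")

# Only rows found in both frames can be stacked into a feature matrix.
usable = merged[merged["_merge"] == "both"].drop(columns="_merge")
X = np.array(usable["embedding"].tolist())
print(X.shape)
```

If such rows exist and are dropped, the corresponding rows of the label matrix Y would need to be dropped as well so that X and Y stay aligned.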

In [None]:
formatted_keywords_ds.describe()

We only want the keywords field as the output Y, so we transform the values in keywords from lists of labels into a binary matrix.

In [14]:
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(formatted_keywords_ds['keywords'])
Y_df = pd.DataFrame(Y, columns=mlb.classes_)
save_to_file(Y_df, './output/AODN_vocabs_label.pkl')
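
As a small illustration of what `MultiLabelBinarizer` produces, the sketch below encodes three toy label lists (the label names are invented) and decodes them again. Each column corresponds to one entry of `mlb.classes_`, which is sorted.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists; real labels look like 'vocabulary title:concept id'.
keyword_lists = [
    ["vocab:salinity", "vocab:temperature"],
    ["vocab:temperature"],
    ["vocab:chlorophyll"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(keyword_lists)

print(list(mlb.classes_))  # column order of the binary matrix
print(Y)                   # one row per record, one column per label

# inverse_transform maps a binary matrix back to tuples of labels.
decoded = mlb.inverse_transform(Y)
print(decoded)
```

The fitted `mlb` (or at least its `classes_`) has to be kept alongside the model, since it defines which output unit corresponds to which keyword.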

In [None]:
Y_df.describe()

We can check that every row contains at least one cell with value 1. This is a sanity check on the transform and ensures that every item in Y has at least one positive label.

In [None]:
rows_with_ones = (Y_df == 1).any(axis=1)
print(f'Any rows without a positive label? {(~rows_with_ones).any()}')

In [17]:
# save for further use
save_to_file(Y_df, './output/keyword_target.pkl')

Step 5: Split data

In [None]:
X_df = load_from_file('./output/keyword_train.pkl')

def split_data(ds):
    print(f' ----------- \n Shape: {ds.shape} \n Columns: {ds.columns} \n ----------- ')

    X = np.array(ds['embedding'].tolist())
    Y = load_from_file('./output/AODN_vocabs_label.pkl')
    Y_labels = Y.columns.tolist()

    Y = Y.to_numpy()
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)    

    return X_train, Y_train, X_test, Y_test, Y_labels

X_train, Y_train, X_test, Y_test, Y_labels = split_data(X_df)

Step 6: Train Model

In [19]:
current_time = datetime.now().strftime('%Y%m%d%H%M%S')
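INPUT_DIM = 768  # hidden size of bert-base-uncased, i.e. the embedding length
N_LABELS = 393   # number of label columns; must equal len(Y_labels)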

In [20]:
def keyword_model(X_train, Y_train, X_test, Y_test):
    current_time = datetime.now().strftime('%Y%m%d%H%M%S')
    model = Sequential([
        Input(shape=(INPUT_DIM,)),
        Dense(128, activation='relu'),
        Dropout(0.3),
        Dense(64, activation='relu'),
        Dropout(0.3),
        Dense(N_LABELS, activation='sigmoid')
    ])

    
    # Adam(learning_rate=1e-3)
    model.compile(optimizer=Adam(learning_rate=1e-3), loss='binary_crossentropy', metrics=['accuracy', tf.keras.metrics.Precision()])

    epoch = 100
    batch_size = 32

    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=3, min_lr=1e-6)

    history = model.fit(X_train, Y_train, epochs=epoch, batch_size=batch_size, validation_data=(X_test, Y_test), callbacks=[early_stopping, reduce_lr])

    # history = model.fit(X_train, Y_train, epochs=epoch, batch_size=batch_size, class_weight=class_weights, validation_data=(X_test, Y_test))

    model.save(f"./output/saved/{current_time}-trained-keyword-epoch{epoch}-batch{batch_size}.keras")

    test_loss, test_accuracy, test_precision = model.evaluate(X_test, Y_test)
    print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}, Test Precision: {test_precision}")
    return model, history

In [None]:
model, history = keyword_model(X_train, Y_train, X_test, Y_test)
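
The training function passes the test set as `validation_data`, so early stopping and learning-rate reduction are driven by the same data that is later used for evaluation. One common alternative, sketched here on random toy data with hypothetical variable names, is to carve a separate validation set out of the training portion and keep the test set untouched until the end:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random toy data standing in for the embeddings and the label matrix.
rng = np.random.default_rng(42)
X_all = rng.normal(size=(100, 768))
Y_all = rng.integers(0, 2, size=(100, 5))

# First split off 20% as the held-out test set.
X_rest, X_hold, Y_rest, Y_hold = train_test_split(X_all, Y_all, test_size=0.2, random_state=42)

# Then split the remainder 75/25, giving 60% train and 20% validation overall.
X_fit, X_val, Y_fit, Y_val = train_test_split(X_rest, Y_rest, test_size=0.25, random_state=42)

print(X_fit.shape, X_val.shape, X_hold.shape)

# Training would then use the validation split for the callbacks, e.g.
#   model.fit(X_fit, Y_fit, validation_data=(X_val, Y_val), callbacks=[...])
# and the held-out split only for the final model.evaluate(X_hold, Y_hold).
```

With this arrangement the reported test metrics are not influenced by the stopping criterion, at the cost of training on fewer samples.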

Step 7: Predict on test set

In [None]:
confidence = 0.4
predictions = model.predict(X_test)
predicted_labels = (predictions > confidence).astype(int)
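
The thresholding step turns each row of sigmoid outputs into a multi-hot vector. A toy example (probabilities and label names are invented) shows the effect of the strict `>` comparison and how rows can end up with no predicted label at all:

```python
import numpy as np

Y_labels = ["vocab:chlorophyll", "vocab:salinity", "vocab:temperature"]  # hypothetical labels

# Toy sigmoid outputs: one row per record, one column per label.
predictions = np.array([
    [0.10, 0.45, 0.90],
    [0.39, 0.40, 0.05],
])

confidence = 0.4
predicted_labels = (predictions > confidence).astype(int)
print(predicted_labels)

# Map each multi-hot row back to label names.
decoded = [[Y_labels[j] for j in np.flatnonzero(row)] for row in predicted_labels]
print(decoded)
```

The second record receives no keyword because none of its scores is strictly above 0.4. Lowering the threshold trades precision for recall, so the value is best chosen on validation data.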

Step 8: Evaluation

In [23]:
def evaluation(Y_test, predictions):
    accuracy = accuracy_score(Y_test, predictions)
    hammingloss = hamming_loss(Y_test, predictions)
    precision = precision_score(Y_test, predictions, average='micro')
    recall = recall_score(Y_test, predictions, average='micro')
    f1 = f1_score(Y_test, predictions, average='micro')

    return {
        'accuracy': accuracy,
        'hammingloss': hammingloss,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

In [None]:
eval_trained_model = evaluation(Y_test=Y_test, predictions=predicted_labels)
print(eval_trained_model)
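
To help interpret these numbers: for multi-label indicator input, `accuracy_score` is the subset accuracy (a record counts only if every label matches), while `average='micro'` pools true positives, false positives, and false negatives over all labels before computing the score. The hand calculation below reproduces these definitions on a toy example with NumPy only.

```python
import numpy as np

# Toy ground truth and predictions: 2 records, 3 labels.
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])

# Pool the counts over every cell of the matrix (micro averaging).
tp = int(np.sum((Y_true == 1) & (Y_pred == 1)))
fp = int(np.sum((Y_true == 0) & (Y_pred == 1)))
fn = int(np.sum((Y_true == 1) & (Y_pred == 0)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Hamming loss: fraction of individual label cells that are wrong.
hamming = float(np.mean(Y_true != Y_pred))

# Subset accuracy: fraction of records whose whole label row is correct.
subset_accuracy = float(np.mean(np.all(Y_true == Y_pred, axis=1)))

print(tp, fp, fn)
print(precision, recall, f1, hamming, subset_accuracy)
```

Here two of three predicted labels are correct, yet the subset accuracy is zero because neither record is predicted exactly. With several hundred labels, subset accuracy is therefore a very strict measure, and the micro-averaged scores and Hamming loss are usually more informative.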

In [None]:
for i in range(5):
    predicted_keywords = [Y_labels[j] for j in range(len(predicted_labels[i])) if predicted_labels[i][j] == 1]
    true_keywords = [Y_labels[j] for j in range(len(Y_test[i])) if Y_test[i][j] == 1]

    print(f"Predicted Labels: {predicted_keywords}")
    print(f"True Labels: {true_keywords}")
    print("----------------------")