# ML model for Keyword Classification
This notebook introduces (1) how we prepare and preprocess the datasets; (2) how we train and evaluate the ML model; and (3) how we use this trained ML model.

## 1. Prepare Datasets

Query the records from Elasticsearch with the following request. Make sure `size` is larger than the actual number of records so that all records are returned.
```
POST /es-indexer-edge/_search
{
  "size": 11000,
  "query": {
    "match_all": {}
  }
}
```
To get the IMOS records only:
```
POST /es-indexer-edge/_search
{
  "size": 800,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "providers.name": "IMOS"
          }
        }
      ]
    }
  }
}
```
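
The two requests differ only in the query clause and in `size`. As a minimal sketch, the request bodies could be built in Python so that `size` is derived from a known record count instead of being typed by hand. The helper name `build_search_body`, the `margin` parameter, and the record counts 10900 and 700 are hypothetical and only chosen to reproduce the sizes above. Note also that Elasticsearch rejects requests where `from + size` exceeds `index.max_result_window` (10,000 by default), so a `size` of 11000 assumes that setting has been raised on the index.

```python
import json


def build_search_body(record_count, provider=None, margin=100):
    """Build an Elasticsearch _search body whose size exceeds record_count."""
    if provider is None:
        # No provider filter: return every document in the index.
        query = {"match_all": {}}
    else:
        # Restrict the results to one provider, e.g. "IMOS".
        query = {"bool": {"must": [{"match": {"providers.name": provider}}]}}
    return {"size": record_count + margin, "query": query}


# Illustrative record counts (assumptions, not measured values).
all_records = build_search_body(10900)
imos_only = build_search_body(700, provider="IMOS")
print(json.dumps(imos_only, indent=2))
```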

Step 1: Import necessary libraries

In [1]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, hamming_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.utils.class_weight import compute_class_weight
import pandas as pd
import ast
import pickle
import numpy as np
from tqdm import tqdm
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

import logging
from matplotlib import pyplot as plt
from datetime import datetime

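import os
# Note: TF_USE_LEGACY_KERAS is only honored if it is set before TensorFlow is
# imported; here it is set after the TensorFlow imports above.
os.environ["TF_USE_LEGACY_KERAS"] = "1"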



Step 2: Feature Extraction

In [2]:
DATASET = "./output/AODN_description.tsv"
KEYWORDS_DS = "./output/AODN_parameter_vocabs.tsv"
TARGET_DS = "./output/keywords_target.tsv"
VOCABS = ['AODN Discovery Parameter Vocabulary']

DATASET is a subset of the original source dataset that keeps only the '_id', '_source.title', and '_source.description' columns. We use '_source.description' as the feature X for the classification task, so we compute an embedding for each description. Because computing embeddings is time-consuming, we save the processed dataset to a file and load it in later steps instead of recomputing it.

In [3]:
ds = pd.read_csv(DATASET, sep='\t')
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9856 entries, 0 to 9855
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   _id                  9856 non-null   object
 1   _source.title        9856 non-null   object
 2   _source.description  9856 non-null   object
dtypes: object(3)
memory usage: 231.1+ KB


In [4]:
ds = pd.read_csv(KEYWORDS_DS, sep='\t')
ds.describe()

| | id | title | description | keywords |
|---|---|---|---|---|
| count | 1588 | 1588 | 1588 | 1588 |
| unique | 1588 | 1581 | 1343 | 457 |
| top | 52b58d9a-a0b4-4396-be8e-a9e5e2b493f0 | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Australian National Moorings Network' (ANMN) ... | [{'concepts': [{'id': 'Practical salinity of t... |
| freq | 1 | 2 | 44 | 463 |


In [5]:
ds = pd.read_csv(DATASET, sep='\t')
ds.describe()

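def get_description_embedding(text, tokenizer, model):
    # Tokenize the description, truncating or padding it to 512 tokens.
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the hidden state of the first ([CLS]) token as the description embedding.
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    return cls_embedding.squeeze().numpy()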

def calculate_embedding(ds):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', clean_up_tokenization_spaces=False)
    model = BertModel.from_pretrained('bert-base-uncased')
    tqdm.pandas()
    ds['embedding'] = ds['_source.description'].progress_apply(lambda x: get_description_embedding(x, tokenizer, model))
    return ds

# saved_ds = calculate_embedding(ds)
# save_to_file(ds, './output/AODN.pkl')
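
The commented-out lines above reflect the compute-once, reuse-later pattern described earlier. The `save_to_file` and `load_from_file` helpers come from the project's `utils.preprocessor` module, whose implementation is not shown here, so the following is only a sketch of the same idea using plain `pickle`. The names `load_or_compute` and `expensive` are hypothetical, and the dictionary stands in for the real DataFrame with embeddings.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return the pickled object at path, computing and saving it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive():
    # Stand-in for calculate_embedding(ds); records how often it actually runs.
    calls.append(1)
    return {"embedding": [0.1, 0.2, 0.3]}


cache_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
first = load_or_compute(cache_path, expensive)   # computes and writes the file
second = load_or_compute(cache_path, expensive)  # reads the file, no recompute
print(len(calls))
```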

Step 3: Prepare Target set

The target set consists of the metadata records for which we want the trained ML model to predict keywords, that is, all non-categorised records. We attach the precomputed embeddings to these records.

In [6]:
from utils.preprocessor import load_from_file, save_to_file
def get_target_ds():
    target = pd.read_csv(TARGET_DS, sep='\t')
    aodn = load_from_file('./output/AODN.pkl')
    aodn.columns = ['id', 'title', 'description', 'embedding']
    merged_df = target.merge(aodn, on=['id', 'title','description'], how='left')
    return merged_df

target = get_target_ds()
print(target.head)

<bound method NDFrame.head of                                         id   
0     52bd4235-7461-47eb-a607-11cdbf93cd9f  \
1     52e8f4a0-4000-4650-9372-fd97de9e7725   
2     531cc7a0-6548-485d-b3b2-d49b89e04a40   
3     5335dd35-0a9a-453f-a1a9-95677f75bd8b   
4     53556c47-38bb-4073-b67f-576e4e3c1903   
...                                    ...   
1070  fe70b360-2208-4a9b-8de6-bbd9c7f652bb   
1071  ff34b8fc-8a52-4270-82b3-f6bebde4aa10   
1072  fd2d6481-cc16-42af-af76-92ad0cdc166c   
1073  fd4c6c5b-99da-4e77-adf5-ac04f54af393   
1074  fffbf4b5-e860-407e-a4d8-6d81f93157fb   

                                                  title   
0     Corals and coral communities of Lord Howe Isla...  \
1     Predictive toxinology: calculated molecular de...   
2     Benthic processes in the intertidal mudflats o...   
3     Rapid Ecological Assessment (REA) of fringing ...   
4     Impacts of individual aromatics on larvae of t...   
...                                                 ...   
1070
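
Because `get_target_ds` uses a left merge, every target row is kept even when no matching row exists in the embeddings file, in which case its `embedding` is NaN. A quick sanity check, sketched here on toy frames (the ids, titles, and vectors are made up), is to pass `indicator=True` to `merge` and count the `left_only` rows:

```python
import pandas as pd

# Toy stand-ins for the target set and the embeddings file.
target = pd.DataFrame({"id": ["a", "b", "c"], "title": ["t1", "t2", "t3"]})
aodn = pd.DataFrame({
    "id": ["a", "b"],
    "title": ["t1", "t2"],
    "embedding": [[0.1, 0.2], [0.3, 0.4]],
})

# indicator=True adds a "_merge" column telling where each row came from.
merged = target.merge(aodn, on=["id", "title"], how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} target rows have no embedding")
```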

We can check that the keywords for the target dataset are all empty:

In [7]:
all_nan = target['keywords'].isnull().all()
all_nan

True

Step 4: Prepare train and test sets

We prepare the train and test sets from KEYWORDS_DS, the subset of the AODN dataset whose keywords use AODN vocabularies. We can check that the keywords in this dataset are all non-empty:

In [8]:
keyword_ds = pd.read_csv(KEYWORDS_DS, sep='\t')
all_not_nan = keyword_ds['keywords'].notnull().all()
all_not_nan

True

In [9]:
keyword_ds.describe()

| | id | title | description | keywords |
|---|---|---|---|---|
| count | 1588 | 1588 | 1588 | 1588 |
| unique | 1588 | 1581 | 1343 | 457 |
| top | 52b58d9a-a0b4-4396-be8e-a9e5e2b493f0 | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Australian National Moorings Network' (ANMN) ... | [{'concepts': [{'id': 'Practical salinity of t... |
| freq | 1 | 2 | 44 | 463 |


In [10]:
keyword_ds.head

<bound method NDFrame.head of                                         id   
0     52b58d9a-a0b4-4396-be8e-a9e5e2b493f0  \
1     52c92036-cea9-4b1a-b4f0-cc94b8b5df98   
2     52e8e882-5108-4295-b336-e97c11af7ad4   
3     52f09a23-63a2-4c14-8b3b-1fc7c8167281   
4     533bba87-bd26-4bb6-91a4-613d104ae310   
...                                    ...   
1583  ff50ae2f-0f79-4eaa-806c-8954ab0e545b   
1584  fed52ea8-bde9-4126-aa3d-69431fce5694   
1585  fcd7a039-2134-4761-ad08-ec42b8e05610   
1586  fbe4dbce-3435-48df-a054-f0e399886e2b   
1587  fe669c1d-6b14-467e-8d8e-f1bf192841aa   

                                                  title   
0     IMOS SOOP Underway Data from AIMS Vessel RV So...  \
1     IMOS - SRS - SST - L3C - NOAA 19 - 3 day - day...   
2     Sea Water Temperature Logger Data at Taure Ree...   
3     IMOS - ACORN - Turquoise Coast HF ocean radar ...   
4            Square Rocks Air Pressure From 19 Dec 2009   
...                                                 ...   
1583

We format the keywords field to make it easier to read.

In [11]:
def keywords_formatter(text):
    keywords = ast.literal_eval(text)
    k_list = []
    for keyword in keywords:
        for concept in keyword['concepts']:
            if keyword['title'] in VOCABS:
                concept_str = keyword['title'] + ':' + concept['id']
                k_list.append(concept_str)
    return k_list

def extract_labels(ds):
    ds['keywords'] = ds['keywords'].apply(lambda x: keywords_formatter(x))
    return ds

formatted_keywords_ds = extract_labels(keyword_ds)
print(formatted_keywords_ds['keywords'].iloc[0])

['AODN Discovery Parameter Vocabulary:Practical salinity of the water body', 'AODN Discovery Parameter Vocabulary:Fluorescence of the water body', 'AODN Discovery Parameter Vocabulary:Temperature of the water body', 'AODN Discovery Parameter Vocabulary:Turbidity of the water body']
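
To make the expected input shape explicit, here is a small self-contained sketch of the same formatting logic applied to a hand-written raw value. The structure of `raw` is inferred from the fields the formatter accesses (`title`, `concepts`, `id`), and the second vocabulary entry is hypothetical. This variant checks the vocabulary title once per keyword rather than once per concept, which gives the same result.

```python
import ast

VOCABS = ["AODN Discovery Parameter Vocabulary"]


def keywords_formatter(text):
    """Parse the stringified keyword list and keep concepts from VOCABS only."""
    k_list = []
    for keyword in ast.literal_eval(text):
        if keyword["title"] not in VOCABS:
            continue  # skip vocabularies we are not interested in
        for concept in keyword["concepts"]:
            k_list.append(keyword["title"] + ":" + concept["id"])
    return k_list


# Hand-written example of a raw 'keywords' cell (stored as a string in the TSV).
raw = str([
    {"title": "AODN Discovery Parameter Vocabulary",
     "concepts": [{"id": "Temperature of the water body"}]},
    {"title": "Some Other Vocabulary",  # hypothetical vocabulary, filtered out
     "concepts": [{"id": "ignored"}]},
])
formatted = keywords_formatter(raw)
print(formatted)
```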


Then we attach the embedding column:

In [12]:
aodn = load_from_file('./output/AODN.pkl')
aodn.columns = ['id', 'title', 'description', 'embedding']
X_df = formatted_keywords_ds.merge(aodn, on=['id', 'title','description'], how='left')

# save for further use
save_to_file(X_df, './output/keyword_train.pkl')

In [13]:
formatted_keywords_ds.describe()

| | id | title | description | keywords |
|---|---|---|---|---|
| count | 1588 | 1588 | 1588 | 1588 |
| unique | 1588 | 1581 | 1343 | 237 |
| top | 52b58d9a-a0b4-4396-be8e-a9e5e2b493f0 | IMOS SOOP Underway Data from AIMS Vessel RV So... | 'Australian National Moorings Network' (ANMN) ... | [AODN Discovery Parameter Vocabulary:Practical... |
| freq | 1 | 2 | 44 | 463 |


We only want the keywords field as the output Y, so we transform the keyword lists into a binary matrix with one column per label.

In [14]:
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(formatted_keywords_ds['keywords'])
Y_df = pd.DataFrame(Y, columns=mlb.classes_)
save_to_file(Y_df, './output/AODN_vocabs_label.pkl')
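
As a small illustration of what `MultiLabelBinarizer` produces, the following sketch encodes three toy label lists (the label names are shortened, made-up stand-ins for the vocabulary terms). Each distinct label becomes one column, ordered alphabetically in `classes_`, and `inverse_transform` maps rows back to label tuples.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists; one list of keywords per record.
labels = [["salinity", "temperature"], ["temperature"], ["turbidity"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

print(list(mlb.classes_))  # column order of the binary matrix
print(Y)                   # one row per record, 1 where the label applies

# Map the binary rows back to label tuples.
recovered = mlb.inverse_transform(Y)
print(recovered)
```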

In [15]:
Y_df.describe()

Unnamed: 0,AODN Discovery Parameter Vocabulary:,AODN Discovery Parameter Vocabulary:Abundance of biota,AODN Discovery Parameter Vocabulary:Accelerometer data,AODN Discovery Parameter Vocabulary:Acoustic signal return amplitude from the water body,AODN Discovery Parameter Vocabulary:Aluminium Bioaccumulation,AODN Discovery Parameter Vocabulary:Aluminium Dissolved Water Quality,AODN Discovery Parameter Vocabulary:Aluminium Total Water Quality,AODN Discovery Parameter Vocabulary:Ammonia-N Physicochemistry,AODN Discovery Parameter Vocabulary:Amplicon,AODN Discovery Parameter Vocabulary:Animal-borne video,...,AODN Discovery Parameter Vocabulary:net_downward_shortwave_flux_in_air,AODN Discovery Parameter Vocabulary:pH,AODN Discovery Parameter Vocabulary:pH (total scale) of the water body,AODN Discovery Parameter Vocabulary:particulate iron data quality flag,AODN Discovery Parameter Vocabulary:potential temperature,AODN Discovery Parameter Vocabulary:surface_albedo,AODN Discovery Parameter Vocabulary:the maximum potential quantum efficiency of Photosystem II,AODN Discovery Parameter Vocabulary:upwelling_longwave_flux_in_air,AODN Discovery Parameter Vocabulary:upwelling_shortwave_flux_in_air,AODN Discovery Parameter Vocabulary:voltage
count,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0,...,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0,1588.0
mean,0.001889,0.09068,0.00063,0.002519,0.020151,0.010076,0.010705,0.020151,0.001889,0.00063,...,0.001259,0.010076,0.006927,0.00063,0.00063,0.001259,0.00063,0.001259,0.001259,0.00063
std,0.043437,0.287244,0.025094,0.050141,0.140561,0.099902,0.102943,0.140561,0.043437,0.025094,...,0.035477,0.099902,0.082966,0.025094,0.025094,0.035477,0.025094,0.035477,0.035477,0.025094
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
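
The first column in the table above, `AODN Discovery Parameter Vocabulary:`, has nothing after the colon, which suggests that a few records carry a concept with an empty `id` (a mean of 0.001889 over 1588 rows corresponds to about 3 records). Whether such a label should be kept is a project decision. The sketch below, on a made-up two-column frame, only shows one way to detect such columns:

```python
import pandas as pd

# Toy label matrix with one label whose concept part is empty.
Y_demo = pd.DataFrame({
    "AODN Discovery Parameter Vocabulary:": [1, 0, 0],
    "AODN Discovery Parameter Vocabulary:pH": [0, 1, 1],
})

# A label is "empty" when nothing follows the first colon.
empty_labels = [c for c in Y_demo.columns if c.split(":", 1)[1].strip() == ""]
empty_counts = Y_demo[empty_labels].sum().to_dict()
print(empty_labels, empty_counts)
```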


We can check that every row contains at least one cell with value 1. This confirms that the transform worked as expected and that every item in Y has at least one positive label.

In [16]:
rows_with_ones = (Y_df == 1).any(axis=1)
print(f'Exist rows has no one values?:{(~rows_with_ones).any()}')

Exist rows has no one values?:False


In [17]:
# save for further use
save_to_file(Y_df, './output/keyword_target.pkl')

Step 5: Split data

In [18]:
X_df = load_from_file('./output/keyword_train.pkl')

def split_data(ds):
    print(f' ----------- \n Shape: {ds.shape} \n Columns{ds.columns} \n ----------- ')

    X = np.array(ds['embedding'].tolist())
    Y = load_from_file('./output/AODN_vocabs_label.pkl')
    Y_labels = Y.columns.tolist()

    Y = Y.to_numpy()
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)    

    return X_train, Y_train, X_test, Y_test, Y_labels

X_train, Y_train, X_test, Y_test, Y_labels = split_data(X_df)

 ----------- 
 Shape: (1588, 5) 
 ColumnsIndex(['id', 'title', 'description', 'keywords', 'embedding'], dtype='object') 
 ----------- 
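
The split above produces a train set and a test set only, and the training step later passes the test split as `validation_data` together with early stopping, so the reported test metrics are not fully independent of training. One possible alternative, sketched here on random toy arrays (the shapes and the 60/20/20 proportions are assumptions), is to carve out a separate validation set with a second call to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for the embedding matrix X and the label matrix Y.
X = np.random.default_rng(0).normal(size=(100, 8))
Y = (np.random.default_rng(1).random((100, 3)) > 0.5).astype(int)

# First split off the held-out test set (20%).
X_trainval, X_test, Y_trainval, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

# Then split the remainder into train (60% overall) and validation (20% overall).
X_train, X_val, Y_train, Y_val = train_test_split(
    X_trainval, Y_trainval, test_size=0.25, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)
```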


Step 6: Train Model

In [19]:
current_time = datetime.now().strftime('%Y%m%d%H%M%S')
INPUT_DIM = 768
N_LABELS = 393
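
`INPUT_DIM` matches the 768-dimensional hidden size of `bert-base-uncased`, and `N_LABELS` has to equal the number of columns in the label matrix. Hard-coded values need updating whenever the vocabulary or the embedding model changes, so one option is to read both from the array shapes. The sketch below uses zero-filled placeholder arrays with the shapes assumed from the constants above:

```python
import numpy as np

# Placeholder arrays with the same column counts as the constants above.
X_train = np.zeros((4, 768), dtype=np.float32)
Y_train = np.zeros((4, 393), dtype=np.int64)

# Derive the model dimensions from the data instead of hard-coding them.
INPUT_DIM = X_train.shape[1]
N_LABELS = Y_train.shape[1]
print(INPUT_DIM, N_LABELS)
```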

In [20]:
def keyword_model(X_train, Y_train, X_test, Y_test):
    current_time = datetime.now().strftime('%Y%m%d%H%M%S')
    model = Sequential([
        Input(shape=(INPUT_DIM,)),
        Dense(128, activation='relu'),
        Dropout(0.3),
        Dense(64, activation='relu'),
        Dropout(0.3),
        Dense(N_LABELS, activation='sigmoid')
    ])

    
    # Adam(learning_rate=1e-3)
    model.compile(optimizer=Adam(learning_rate=1e-3), loss='binary_crossentropy', metrics=['accuracy', tf.keras.metrics.Precision()])

    epoch = 100
    batch_size = 32

    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=3, min_lr=1e-6)

    history = model.fit(X_train, Y_train, epochs=epoch, batch_size=batch_size, validation_data=(X_test, Y_test), callbacks=[early_stopping, reduce_lr])

    # history = model.fit(X_train, Y_train, epochs=epoch, batch_size=batch_size, class_weight=class_weights, validation_data=(X_test, Y_test))

    model.save(f"./output/saved/{current_time}-trained-keyword-epoch{epoch}-batch{batch_size}.keras")

    test_loss, test_accuracy, test_precision = model.evaluate(X_test, Y_test)
    print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}, Test Precision: {test_precision}")
    return model, history

In [21]:
model, history = keyword_model(X_train, Y_train, X_test, Y_test)

Epoch 1/100
40/40 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.0033 - loss: 0.4881 - precision: 0.0144 - val_accuracy: 0.0818 - val_loss: 0.0581 - val_precision: 0.6578 - learning_rate: 0.0010
Epoch 2/100
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.1026 - loss: 0.0535 - precision: 0.4360 - val_accuracy: 0.0818 - val_loss: 0.0371 - val_precision: 0.7865 - learning_rate: 0.0010
Epoch 3/100
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.0801 - loss: 0.0397 - precision: 0.5923 - val_accuracy: 0.0818 - val_loss: 0.0325 - val_precision: 0.7569 - learning_rate: 0.0010
Epoch 4/100
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.1268 - loss: 0.0331 - precision: 0.7106 - val_accuracy: 0.0818 - val_loss: 0.0293 - val_precision: 0.8200 - learning_rate: 0.0010
Epoch 5/100
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Step 7: Predict on test set
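
This step turns the sigmoid outputs of the model into binary label predictions by comparing each probability with a confidence threshold. A minimal sketch on made-up probabilities (the array values are illustrative, not model output) shows the effect of the strict `>` comparison with the 0.4 threshold used in this notebook: a probability of exactly 0.40 is not selected. Lowering the threshold generally trades precision for recall.

```python
import numpy as np

confidence = 0.4

# Made-up per-label probabilities for two records and three labels.
predictions = np.array([[0.91, 0.40, 0.05],
                        [0.39, 0.41, 0.72]])

# A label is predicted only when its probability is strictly above the threshold.
predicted_labels = (predictions > confidence).astype(int)
print(predicted_labels)
```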

In [22]:
confidence = 0.4
predictions = model.predict(X_test)
predicted_labels = (predictions > confidence).astype(int)

10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step


Step 8: Evaluation

In [23]:
def evaluation(Y_test, predictions):
    accuracy = accuracy_score(Y_test, predictions)
    hammingloss = hamming_loss(Y_test, predictions)
    precision = precision_score(Y_test, predictions, average='micro')
    recall = recall_score(Y_test, predictions, average='micro')
    f1 = f1_score(Y_test, predictions, average='micro')

    return {
        'accuracy': accuracy,
        'hammingloss': hammingloss,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

In [24]:
eval_trained_model = evaluation(Y_test=Y_test, predictions=predicted_labels)
print(eval_trained_model)

{'accuracy': 0.5566037735849056, 'hammingloss': 0.006121273224830765, 'precision': 0.8267543859649122, 'recall': 0.5540044085231447, 'f1': 0.6634403871535416}
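
The gap between the accuracy (about 0.56) and the very low Hamming loss (about 0.006) follows from how the two metrics are defined for multilabel data: `accuracy_score` counts a record as correct only if every label matches exactly, while `hamming_loss` is the fraction of individual label cells that are wrong. The toy example below (hand-made matrices, not notebook data) shows the difference when a single label is missed:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, hamming_loss,
                             precision_score, recall_score)

Y_true = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 1]])
# The third record misses one of its two true labels.
Y_pred = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0]])

subset_accuracy = accuracy_score(Y_true, Y_pred)   # exact-match rows: 2 of 3
cell_error = hamming_loss(Y_true, Y_pred)          # wrong cells: 1 of 12
precision = precision_score(Y_true, Y_pred, average="micro")  # 4 TP, 0 FP
recall = recall_score(Y_true, Y_pred, average="micro")        # 4 TP, 1 FN
f1 = f1_score(Y_true, Y_pred, average="micro")

print(subset_accuracy, cell_error, precision, recall, f1)
```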


In [25]:
for i in range(5):
    predicted_keywords = [Y_labels[j] for j in range(len(predicted_labels[i])) if predicted_labels[i][j] == 1]
    true_keywords = [Y_labels[j] for j in range(len(Y_test[i])) if Y_test[i][j] == 1]

    print(f"Predicted Labels: {predicted_keywords}")
    print(f"True Labels: {true_keywords}")
    print("----------------------")

Predicted Labels: ['AODN Discovery Parameter Vocabulary:Fluorescence of the water body', 'AODN Discovery Parameter Vocabulary:Practical salinity of the water body', 'AODN Discovery Parameter Vocabulary:Temperature of the water body', 'AODN Discovery Parameter Vocabulary:Turbidity of the water body']
True Labels: ['AODN Discovery Parameter Vocabulary:Fluorescence of the water body', 'AODN Discovery Parameter Vocabulary:Practical salinity of the water body', 'AODN Discovery Parameter Vocabulary:Temperature of the water body', 'AODN Discovery Parameter Vocabulary:Turbidity of the water body']
----------------------
Predicted Labels: ['AODN Discovery Parameter Vocabulary:Abundance of biota']
True Labels: ['AODN Discovery Parameter Vocabulary:Abundance of biota']
----------------------
Predicted Labels: ['AODN Discovery Parameter Vocabulary:Abundance of biota']
True Labels: ['AODN Discovery Parameter Vocabulary:Abundance of biota', 'AODN Discovery Parameter Vocabulary:Biotic taxonomic ident