# An ML Model for the Keyword Classification Task: Technical Notebook
This notebook describes how we define, design, and develop an ML model for the keyword classification task, from a technical perspective.

### Preliminaries
The AODN catalogue $C=\{M, K, P\}$ serves as a platform for storing datasets and their associated metadata. $M=\{m_1,m_2,\ldots, m_x\}$ is the set of metadata records that describe the datasets in the AODN catalogue $C$. $K=\{k_1, k_2, \ldots, k_y\}$ is the set of pre-defined keywords used to categorise datasets. $P=\{p_1, p_2, \ldots, p_n\}$ is the set of pre-defined parameters used to describe the attributes of the raw data.

- **Definition 1: A metadata record $m_i=(d_i, K_i, P_i), m_i \in M$** describes a dataset. Specifically, $i$ is the unique identifier of the record, $d_i$ is a textual abstract that serves as the description of the dataset, $K_i \subseteq K$ is the subset of keywords used to label the dataset, and $P_i \subseteq P$ is the subset of parameters used to describe the attributes of the raw data in the dataset.

- **Definition 2: A description $d_i$** is the textual abstract of a metadata record, which describes the dataset in plain text. $\mathbf{d_i}$ is the embedding representation of the textual description $d_i$. We use BERT to calculate the description embedding $\mathbf{d_i}$ for each description $d_i$.

- **Definition 3: A keyword matrix $\mathbf{K}$** records which pre-defined keywords (the textual labels used to categorise datasets) are attached to which metadata records. It is an $X \times Y$ binary matrix, where $X=|M|$ is the size of the metadata record set $M=\{m_1,m_2,\ldots, m_x\}$ and $Y=|K|$ is the size of the keyword set $K=\{k_1, k_2, \ldots, k_y\}$. Each entry $\mathbf{K}[i, j]$ is 1 if metadata record $m_i$ is associated with keyword $k_j$, and 0 otherwise.
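
As a small illustration of Definition 3, the sketch below builds a binary keyword matrix from toy data. The function name `build_keyword_matrix` and the example keywords are hypothetical and chosen only for illustration; they are not taken from the AODN vocabularies or from this notebook's utilities.

```python
from typing import List, Set


def build_keyword_matrix(record_keywords: List[Set[str]], keywords: List[str]) -> List[List[int]]:
    """Return the X-by-Y binary matrix K described in Definition 3.

    record_keywords[i] plays the role of K_i, and keywords is the ordered set K.
    """
    # Map each keyword k_j to its column index j.
    column = {k: j for j, k in enumerate(keywords)}
    # Start from an all-zero matrix with one row per record.
    matrix = [[0] * len(keywords) for _ in record_keywords]
    for i, k_i in enumerate(record_keywords):
        for k in k_i:
            matrix[i][column[k]] = 1
    return matrix


# Toy data (hypothetical): three records and three keywords.
keywords = ["temperature", "salinity", "mooring"]
record_keywords = [{"temperature", "mooring"}, set(), {"salinity"}]
matrix = build_keyword_matrix(record_keywords, keywords)
```

The all-zero second row corresponds to a record with $K_i = \emptyset$, that is, a record that has not yet been categorised.
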
### Problem Description
In the catalogue $C = \{M, K, P\}$, a subset of metadata records, $M_t \subseteq M$, has not yet been categorised with keywords: $K_i = \emptyset$ for all $m_i \in M_t$. Another subset of metadata records, $M_s \subseteq M$, has already been categorised with keywords (i.e., $K_i \neq \emptyset$ for all $m_i \in M_s$). The research question is as follows:

How can we design and develop a machine learning model, denoted as $MM_{keywords}$, that automatically labels the uncategorised metadata records $M_t$ using the predefined set of keywords $K$? Specifically, the model should be trained to learn a mapping rule $d_i \mapsto K_i$ from the patterns observed in the labelled metadata records $M_s$, where each description $d_i$ of a metadata record $m_i \in M_s$ is associated with a set of keywords $K_i$. Once trained, the model should apply this learned mapping to the records in $M_t$, assigning keywords to each record based on its description.
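
One common way to make this mapping concrete, offered here as an assumption rather than as the design this notebook settles on, is a multi-label classifier $f_\theta$ that takes the description embedding $\mathbf{d_i}$ and outputs one score per keyword. The symbols $f_\theta$, $h$ (the embedding dimension), and $\tau$ (a decision threshold) are introduced only for this illustration:

$$
f_\theta : \mathbb{R}^{h} \to [0, 1]^{Y}, \qquad
\hat{K}_i = \{\, k_j \in K : f_\theta(\mathbf{d_i})_j \ge \tau \,\}
$$

Under this formulation, the rows of the keyword matrix $\mathbf{K}$ for records in $M_s$ serve as training targets, and $\hat{K}_i$ is the predicted keyword set for a record $m_i \in M_t$.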

To simplify the task, we restrict the scope of keywords to those falling within the primary AODN vocabularies:

- AODN Organisation Vocabulary
- AODN Instrument Vocabulary
- AODN Discovery Parameter Vocabulary
- AODN Platform Vocabulary
- AODN Parameter Category Vocabulary

Only keywords $k_j \in K_i$ that are part of the listed AODN vocabularies will be considered. Any keyword not belonging to these vocabularies will be excluded from $K_i$ for all metadata records in the categorised metadata set $M_s$.
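
The restriction above can be sketched as a simple filter. The `concepts` and `id` keys follow the keyword structure that appears in this notebook's output; the `title` key used to name the vocabulary is an assumption made for illustration, as is the function name `filter_keywords`. The notebook's own `identify_sample` helper may work differently.

```python
AODN_VOCABS = {
    "AODN Organisation Vocabulary",
    "AODN Instrument Vocabulary",
    "AODN Discovery Parameter Vocabulary",
    "AODN Platform Vocabulary",
    "AODN Parameter Category Vocabulary",
}


def filter_keywords(keyword_groups, vocabs=AODN_VOCABS):
    """Return concept ids from keyword groups whose vocabulary is in `vocabs`."""
    kept = []
    for group in keyword_groups:
        # ASSUMPTION: the vocabulary name is stored under a "title" key.
        if group.get("title") in vocabs:
            kept.extend(concept["id"] for concept in group.get("concepts", []))
    return kept


# Toy record (hypothetical values): one group inside scope, one outside.
groups = [
    {"title": "AODN Platform Vocabulary", "concepts": [{"id": "mooring"}]},
    {"title": "Some Other Thesaurus", "concepts": [{"id": "diver"}]},
]
in_scope = filter_keywords(groups)
```

Here `in_scope` keeps only the concept from the AODN Platform Vocabulary; the other group is dropped from $K_i$.
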

## Connecting Datasets
The metadata records are fetched by querying ElasticSearch with the following code:
```json
POST /es-indexer-edge/_search
{
    "size": 11000,
    "query": {
        "match_all": {}
    }
}
```
Programmatically, we can fetch the data by connecting to ElasticSearch.
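
The column names in the TSV file loaded in the next section (`_source.title`, `_source.extent.bbox`, and so on) suggest that each search hit is flattened into dotted keys before being saved. The sketch below shows one way to do that flattening on an already-fetched response body. It is not the project's script: the helper names `flatten` and `hits_to_rows` and the toy response are hypothetical, and the HTTP call itself is left out. Note also that Elasticsearch rejects a single request whose `size` exceeds `index.max_result_window` (10,000 by default) unless that setting has been raised, so a real script may need to page through results.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, e.g. _source.extent.bbox.
            flat.update(flatten(value, name))
        else:
            # Lists and scalars are kept as-is.
            flat[name] = value
    return flat


def hits_to_rows(response):
    """Turn a _search response body into one flat row per hit."""
    return [flatten(hit) for hit in response["hits"]["hits"]]


# Toy response (hypothetical values) in the shape returned by _search.
response = {
    "hits": {
        "hits": [
            {
                "_index": "es-indexer-edge",
                "_id": "a1",
                "_score": 1.0,
                "_source": {
                    "title": "T",
                    "description": "D",
                    "extent": {"bbox": [[1, 2, 3, 4]]},
                },
            }
        ]
    }
}
rows = hits_to_rows(response)
```

Each row can then be written out as one line of a tab-separated file, or passed to `pandas.DataFrame(rows)` for further processing.
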

TODO: add script

## Identify Samples

```python
import pandas as pd

ds = pd.read_csv("output/AODN.tsv", sep="\t")
ds.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9856 entries, 0 to 9855
Data columns (total 40 columns):
 #   Column                                               Non-Null Count  Dtype 
---  ------                                               --------------  ----- 
 0   _index                                               9856 non-null   object
 1   _id                                                  9856 non-null   object
 2   _score                                               9856 non-null   int64 
 3   _ignored                                             5600 non-null   object
 4   _source.title                                        9856 non-null   object
 5   _source.description                                  9856 non-null   object
 6   _source.extent.bbox                                  9856 non-null   object
 7   _source.extent.temporal                              9856 non-null   object
 8   _source.summaries.score                              9856 non-null   int64 
 9
```

```python
from utils.preprocessor import identify_sample

vocabs = ['AODN Organisation Vocabulary']

sampleDS = identify_sample(ds=ds, vocabs=vocabs)
```

```
(200, 4)
```


## Calculate embeddings

```python
# from utils.preprocessor import calculate_embedding
# sampleDS = calculate_embedding(sampleDS)
```

```python
from utils.preprocessor import load_from_file, save_to_file
dataset = load_from_file('./output/AODN.pkl')
dataset.columns = ['id', 'title', 'description', 'embedding']

keywordDS = pd.read_csv('./output/keywords_sample.tsv', sep='\t')
keywordDS = keywordDS.merge(dataset, on=['id', 'title', 'description'])
save_to_file(keywordDS, './output/keywords_sample.pkl')
```


```python
from utils.preprocessor import save_to_file
save_to_file(sampleDS, './output/keywords_sample.pkl')
```

```python
from utils.preprocessor import load_from_file
sampleDS = load_from_file('./output/keywords_sample.pkl')
sampleDS.info()
```

```python
sampleDS['keywords']
```

```
5       [{'concepts': [{'id': 'Oceans | Ocean Temperat...
9       [{'concepts': [{'id': 'Oceans | Ocean Circulat...
169     [{'concepts': [{'id': 'diver'}], 'scheme': 'di...
219     [{'concepts': [{'id': 'Oceans | Ocean Circulat...
262     [{'concepts': [{'id': 'Oceans | Ocean Temperat...
                              ...                        
9617    [{'concepts': [{'id': 'Oceans | Ocean Temperat...
9643    [{'concepts': [{'id': 'Oceans | Ocean Optics |...
9667    [{'concepts': [{'id': 'Southern Ocean Time Ser...
9755    [{'concepts': [{'id': 'Oceans | Ocean Chemistr...
9830    [{'concepts': [{'id': 'Oceans | Ocean Temperat...
Name: keywords, Length: 200, dtype: object
```

```python
from utils.preprocessor import extract_labels
```