# ML model for Keyword Classification - Non-tech Notebook
This notebook shows non-technical audiences how to (1) explore, prepare and preprocess the datasets; (2) train and evaluate the ML model; and (3) use the trained ML model.
## Problem Description
In the catalogue $C = \{M, K, P\}$, a subset of metadata records, $M_t \subseteq M$, has not yet been categorised with keywords, so $K_i = \emptyset$ for all $m_i \in M_t$. Another subset of metadata records, $M_s \subseteq M$, has already been categorised with keywords (i.e., $K_i \neq \emptyset$ for all $m_i \in M_s$). The research question is as follows:

How can we design and develop a machine learning model, denoted as $MM_{keywords}$, that automatically labels the uncategorised metadata records $M_t$ using a predefined set of keywords $K$? Specifically, the model should learn a mapping rule $d_i \mapsto K_i$ from the patterns observed in the labelled metadata records $M_s$, where each description $d_i$ of a metadata record $m_i \in M_s$ is associated with a set of keywords $K_i$. Once trained, the model should apply this learned mapping to the records in $M_t$, assigning keywords to each record based on its description.
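
To make the mapping $d_i \mapsto K_i$ concrete, the sketch below shows one common way to frame such a task: each keyword set $K_i$ is encoded as a 0/1 vector over the keyword set $K$. This is an illustration under assumptions, not necessarily how `keywordModel` encodes labels; the records, descriptions and keywords are invented for the example.

```python
# Toy illustration only: these records are invented, not taken from the catalogue.
# `labelled` plays the role of M_s: every record has a description d_i
# and a non-empty keyword set K_i.
labelled = [
    {"description": "Sea surface temperature from moorings",
     "keywords": {"temperature", "mooring"}},
    {"description": "Salinity profiles from gliders",
     "keywords": {"salinity", "glider"}},
]

# The keyword set K, sorted so every record is encoded in the same order.
K = sorted({k for record in labelled for k in record["keywords"]})


def to_multi_hot(keywords, all_keywords):
    """Encode a keyword set K_i as a 0/1 vector over all_keywords."""
    return [1 if k in keywords else 0 for k in all_keywords]


# One row per record; a model would learn to predict these rows from descriptions.
Y = [to_multi_hot(record["keywords"], K) for record in labelled]

print(K)
print(Y)
```

Under this framing, a record in $M_t$ would receive a predicted 0/1 vector, and the positions set to 1 are read back as its assigned keywords.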

To simplify the task, we restrict the scope of keywords to those falling within the primary AODN vocabularies:
- AODN Discovery Parameter Vocabulary
- AODN Instrument Vocabulary
- AODN Platform Vocabulary

Only keywords $k_j \in K_i$ that are part of the listed AODN vocabularies will be considered. Any keyword not belonging to these vocabularies will be excluded from $K_i$ for all metadata records in the categorised metadata set $M_s$.
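
The filtering rule above can be sketched as follows. This is a minimal example under assumptions: the keyword structure (a dict with hypothetical `vocabulary` and `value` fields) and the helper `filter_keywords` are made up for illustration and are not the `preprocessor` module's actual interface.

```python
# The AODN vocabularies named in this notebook.
ALLOWED_VOCABULARIES = {
    "AODN Discovery Parameter Vocabulary",
    "AODN Instrument Vocabulary",
    "AODN Platform Vocabulary",
}


def filter_keywords(keywords, allowed=ALLOWED_VOCABULARIES):
    """Keep only the keywords whose vocabulary is in the allowed set."""
    return [k for k in keywords if k.get("vocabulary") in allowed]


# Hypothetical keyword entries for one record (field names are assumptions).
record_keywords = [
    {"vocabulary": "AODN Discovery Parameter Vocabulary",
     "value": "Temperature of the water body"},
    {"vocabulary": "Some Other Thesaurus", "value": "oceans"},
]

kept = filter_keywords(record_keywords)
print(kept)
```

Here the second keyword is dropped from $K_i$ because its vocabulary is not one of the listed AODN vocabularies.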

## Prepare Dataset
This section shows how to load the target and sample sets, explore them with short snippets, and prepare the training and test sets.

In [3]:
# add module path for notebook to use
import sys
import os

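# parent directory of the notebook's working directory
module_path = os.path.abspath(os.path.join('..'))
# add the utils and model folders to the import path, using OS-independent separators
for sub_dir in ("utils", "model"):
    package_path = os.path.join(module_path, "data_discovery_ai", sub_dir)
    if package_path not in sys.path:
        sys.path.append(package_path)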

current_path = os.getcwd()

# import modules
import preprocessor
import keywordModel


We first load the labelled dataset.

In [6]:
# build the path from module_path so the file is found regardless of the working directory
labelledSet = preprocessor.load_from_file(
    os.path.join(module_path, "data_discovery_ai", "input", "keyword_sample.pkl")
)

In [5]:
labelledSet.shape

(1631, 5)

Then we prepare the model input X and target Y from the labelled set.