# Data Discovery AI – Technical Tutorial

**Author**: Yuxuan Hu

---

## Introduction

This tutorial introduces an AI framework developed to support the [new portal](https://portal.staging.aodn.org.au/). Its current focus is **metadata record inference**, which helps improve the quality and consistency of metadata across datasets. This notebook is aimed at **technical** audiences: it mainly explores the design and development of our ML and AI models.

## Keyword Classification
### Problem Description
The AODN catalogue $C=\{M, K, P\}$ serves as a platform for storing datasets and their associated metadata. $M=\{m_1,m_2,\ldots, m_x\}$ is the set of metadata records that describe the datasets in the catalogue $C$. $K=\{k_1, k_2, \ldots, k_y\}$ is a set of predefined keywords used to categorise datasets. In the catalogue $C$, a subset of metadata records, $M_t \subseteq M$, has not yet been categorised with keywords: $K_i = \emptyset$ for all $m_i \in M_t$. Another subset, $M_s \subseteq M$, contains records that have already been categorised with keywords (i.e., $K_i \neq \emptyset$ for all $m_i \in M_s$). The research question is as follows:

How can we design and develop a machine learning model, denoted as $MM_{keywords}$, that automatically labels the uncategorised metadata records $M_t$ using the predefined set of keywords $K$? Specifically, the model should be trained to learn a mapping rule $d_i \mapsto K_i$ from the patterns observed in the sample set $M_s$, where each description $d_i$ of a metadata record $m_i \in M_s$ is associated with a set of keywords $K_i$. Once trained, the model should apply this learned mapping to categorise the records in $M_t$, assigning keywords to each record based on its description.
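One way to write this objective down, as a sketch rather than the project's stated formulation, is as a multi-label scoring function over the embedded description. The symbols $f_\theta$ (the trained model with parameters $\theta$), $\hat{K}_i$ (the predicted keyword set) and $\tau$ (a decision threshold) are assumptions introduced here for illustration; $\mathbf{d_i}$ and $dim$ are the embedding and its dimensionality from the formal definitions in this section.

$$
f_\theta : \mathbb{R}^{dim} \to [0,1]^{|K|}, \qquad
\hat{K}_i = \{\, k_j \in K \;:\; f_\theta(\mathbf{d_i})_j \ge \tau \,\} \quad \text{for each } m_i \in M_t
$$

Under this reading, training fits $\theta$ on the pairs $(\mathbf{d_i}, K_i)$ from $M_s$, and inference assigns to each uncategorised record every keyword whose score reaches the threshold.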

To simplify the task, we restrict the scope of keywords to those falling within the primary AODN vocabularies:
- AODN Discovery Parameter Vocabulary
- AODN Platform Vocabulary

Only keywords $k_j \in K_i$ that are part of the listed AODN vocabularies will be considered. Any keyword not belonging to these vocabularies will be excluded from $K_i$ for all metadata records in the categorised metadata set $M_s$.
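The restriction above can be sketched as a simple filter. This is a minimal illustration, not the project's implementation: the record layout, the `vocabulary` and `label` keys, and the toy keyword labels are all hypothetical. Dropping records whose keyword set becomes empty is also an assumption, made so that the result still satisfies $K_i \neq \emptyset$ for every record in $M_s$.

```python
# Vocabulary names come from the list above; everything else here is a
# hypothetical layout used only for illustration.
ALLOWED_VOCABS = {
    "AODN Discovery Parameter Vocabulary",
    "AODN Platform Vocabulary",
}


def filter_keywords(keywords):
    """Keep only keywords that belong to one of the allowed AODN vocabularies."""
    return [kw for kw in keywords if kw.get("vocabulary") in ALLOWED_VOCABS]


def restrict_records(records):
    """Filter K_i for every record; drop records left without any keyword (assumption)."""
    restricted = []
    for record in records:
        kept = filter_keywords(record["keywords"])
        if kept:
            restricted.append({**record, "keywords": kept})
    return restricted


# Toy records with made-up labels.
records = [
    {
        "id": "a",
        "keywords": [
            {"label": "temperature", "vocabulary": "AODN Discovery Parameter Vocabulary"},
            {"label": "oceans", "vocabulary": "Some Other Thesaurus"},
        ],
    },
    {
        "id": "b",
        "keywords": [{"label": "ships", "vocabulary": "Some Other Thesaurus"}],
    },
]

restricted = restrict_records(records)
```

Here record `a` keeps only its AODN keyword, and record `b` is dropped because none of its keywords belong to the listed vocabularies.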

#### Formal Definitions
- **Definition 1: A metadata record $m_i=(d_i, K_i), m_i \in M$** is a record describing a dataset. Specifically, $i$ is the unique identifier of the record. $d_i$ is a textual abstract that serves as the description of the dataset. $K_i \subseteq K$ is a subset of keywords used to label the dataset.
- **Definition 2: An abstract $d_i$** is a piece of text that describes the dataset. The embedding $\mathbf{d_i}$ is a vector representation of the textual description $d_i$, calculated using the "bert-base-uncased" model. The embedding vectors of all abstracts share the same dimensionality, denoted as $dim=|\mathbf{d_i}|$. A feature matrix $\mathbf{X}$ of shape $|M_s| \times dim$ aggregates the embeddings of the abstracts of all samples in $M_s$, where $|M_s|$ is the total number of metadata records in $M_s$.
- **Definition 3: A keyword $k_j$** is a predefined label used for categorising datasets. Each metadata record $m_i$ is associated with a set of keywords $K_i \subseteq K$, where $K$ is the complete set of predefined keywords. The keyword set $K_i$ of a metadata record $m_i$ is mathematically represented as a binary vector $y_i$ of size $|K|$, where each element indicates the presence or absence of a specific label: a value of 1 at position $j$ denotes that the label $k_j \in K$ is present in the metadata record $m_i$ (that is, $k_j \in K_i$), while a value of 0 indicates its absence. The target matrix $\mathbf{Y}$ is a $|M_s| \times |K|$ binary matrix, where $|M_s|$ is the number of records in the sample set $M_s$ and $|K|$ is the size of the keyword set $K=\{k_1, k_2, \ldots, k_y\}$. Each entry $\mathbf{Y}[i, j]$ is 1 if metadata record $m_i$ is associated with keyword $k_j$, and 0 otherwise.
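To make Definition 3 concrete, the following sketch builds the binary target matrix $\mathbf{Y}$ from a list of keyword sets using only the standard library. The function name `build_target_matrix` and the toy keywords `k1`, `k2`, `k3` are hypothetical. scikit-learn's `MultiLabelBinarizer` provides an equivalent encoding, although this section does not state which approach the project uses.

```python
def build_target_matrix(keyword_sets, vocabulary):
    """Return the |M_s| x |K| binary matrix Y as a list of rows.

    Y[i][j] is 1 if record i is labelled with vocabulary[j], else 0.
    """
    column = {keyword: j for j, keyword in enumerate(vocabulary)}
    Y = [[0] * len(vocabulary) for _ in keyword_sets]
    for i, K_i in enumerate(keyword_sets):
        for keyword in K_i:
            Y[i][column[keyword]] = 1
    return Y


# Toy example: |K| = 3 keywords and |M_s| = 2 records.
K = ["k1", "k2", "k3"]
keyword_sets = [{"k1", "k3"}, {"k2"}]
Y = build_target_matrix(keyword_sets, K)
print(Y)  # [[1, 0, 1], [0, 1, 0]]
```

Each row of `Y` is the vector $y_i$ of one record, so the number of rows equals $|M_s|$ and the number of columns equals $|K|$.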

### Model Lifecycle
As shown below, three distinct but connected modules work together to form the training lifecycle of the keyword classification model.
![keyword-classification-lifecycle](keyword-classification-model.png)
#### Data Preprocessing
The data preprocessing module is used to prepare data for training and testing models. Key features include: getting raw data, preparing sample data, converting textual data to numeric representations, resampling, and preparing input and output matrices.
#### Getting Raw Data
Raw data refers to all metadata records $M$ stored in Elasticsearch. To access it, set two fields, `end_point` and `api_key`, as environment variables. For more information, please refer to the [README](../README.md). We first fetch the raw data from Elasticsearch.
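Before fetching, it can help to confirm that both settings are visible to the Python process. The check below is a minimal sketch: the helper `missing_settings` is hypothetical, and the exact variable names, their casing, and how the project loads them are assumptions, so the README remains the authority.

```python
import os

# Field names are taken from the text above; their exact casing is an assumption.
REQUIRED_FIELDS = ("end_point", "api_key")


def missing_settings(env=os.environ):
    """Return the required fields that are unset or empty in the given mapping."""
    return [name for name in REQUIRED_FIELDS if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.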

The preprocessing is implemented by an instance of the `KeywordPreprocessor` class.

In [1]:
from data_discovery_ai.ml.preprocessor import KeywordPreprocessor, KMData
from data_discovery_ai.utils.agent_tools import load_from_file

preprocessor = KeywordPreprocessor()


In [2]:
# raw_data = preprocessor.fetch_raw_data()
# raw_data.info()

In [3]:
# ds = preprocessor.calculate_embedding(preprocessor.filter_raw_data(raw_data), seperator=preprocessor.trainer_config["separator"])
ds = load_from_file("data_discovery_ai/resources/KeywordClassifier/preprocessed_data.pkl")

In [6]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1781 entries, 12 to 12818
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             1781 non-null   object
 1   title          1781 non-null   object
 2   abstract       1781 non-null   object
 3   keywords       1781 non-null   object
 4   combined_text  1781 non-null   object
 5   embedding      1781 non-null   object
 6   keywordsMap    1781 non-null   object
dtypes: object(7)
memory usage: 111.3+ KB


In [5]:
preprocessor.prepare_train_test_set(raw_data=ds)

The preprocessor splits the data into training and test sets for the classification model.

In [9]:
preprocessor.train_test_data.X_train.shape

(1535, 768)
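The shape `(1535, 768)` means that the training feature matrix holds 1535 rows, each a 768-dimensional embedding, which matches the hidden size of "bert-base-uncased". Because the preprocessing module also lists resampling among its features, the number of training rows need not be a fixed fraction of the 1781 records. The sketch below shows only the basic idea of a shuffled index split. It is not the logic of `prepare_train_test_set`, and the function name `split_indices`, the `test_fraction` value and the `seed` are hypothetical.

```python
import random


def split_indices(n_samples, test_fraction, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Toy example with 10 rows and a hypothetical 20% test fraction.
train_idx, test_idx = split_indices(n_samples=10, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 8 2
```

The same index lists would be used to select rows from both the feature matrix $\mathbf{X}$ and the target matrix $\mathbf{Y}$, so that each embedding stays paired with its keyword vector.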