# ML model for Data Delivery Mode Filter - Tech Notebook
This notebook, written for a technical audience, covers how to (1) identify the research question, (2) justify the selection of the ML model, (3) explore, prepare and preprocess the datasets, (4) train and evaluate the ML model, and (5) use the trained ML model.
## Problem Description
The AODN catalogue $C=\{M, K, F_D \}$ serves as a platform for discovering datasets and their associated metadata. In particular:
- $M=\{m_1,m_2,\ldots, m_x\}$ is the set of metadata records that describe the datasets in the AODN catalogue $C$.
- $K$ represents the pre-defined keywords used to categorise these datasets.
- $F_D$ (see Definition 3) is the data delivery mode filter, which is used to filter datasets by their data delivery mode.

### Formal Definitions

- **Definition 1: A Metadata Record $m_i \in M$** is defined as a tuple $m_i = (t_i, d_i, K_i, s_i, u_i)$, where:  
  - $i$: A unique identifier for the metadata record.  
  - $t_i$: The **title**, a textual summary of the dataset.  
  - $d_i$: The **abstract**, a textual description of the dataset.  
  - $K_i \subseteq K$: A set of **keywords** associated with the dataset (see **Definition 2**).  
  - $s_i$: A textual statement representing the **lineage** of the dataset.  
  - $u_i$: The **status** of the dataset, as defined in **Definition 4**.  

The vector representation $\mathbf{m_i}$ of a metadata record $m_i$ is calculated as the embedding of its combined textual fields. Depending on the task:  
- For **keyword classification**, only the title $t_i$ and abstract $d_i$ are used.  
- For **data delivery mode classification**, the title $t_i$, abstract $d_i$, and lineage $s_i$ are considered.  
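
The field selection above can be sketched as a small helper that joins the relevant text fields before they are embedded. This is a minimal sketch, not the project's implementation: the function name `combine_fields` and the task labels are hypothetical, and the ` [SEP] ` separator is an assumption based on the `information` column shown in the preprocessed data later in this notebook.

```python
from typing import Optional


def combine_fields(title: str, abstract: str, lineage: Optional[str] = None,
                   task: str = "delivery_mode") -> str:
    """Join the text fields used for a task into one string (hypothetical helper)."""
    if task == "keyword":
        # keyword classification: title and abstract only
        fields = [title, abstract]
    elif task == "delivery_mode":
        # data delivery mode classification: title, abstract and lineage
        fields = [title, abstract, lineage]
    else:
        raise ValueError(f"unknown task: {task}")
    # Skip missing or empty fields so that no dangling separator is produced.
    return " [SEP] ".join(f.strip() for f in fields if f and f.strip())
```

For example, `combine_fields("T", "A", "L")` yields `"T [SEP] A [SEP] L"`, while the same call with `task="keyword"` drops the lineage.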

The embeddings are generated using the "bert-base-uncased" model, so every record embedding has the same dimensionality, denoted as $\text{dim} = |\mathbf{m_i}|$.  

A feature matrix $\mathbf{X} $ is the input of the classification models. $\mathbf{X} \in \mathbb{R}^{|M_s| \times \text{dim}}$ aggregates the embeddings of all records in $M_s$, where $|M_s|$ is the total number of metadata records. 

- **Definition 2: A Keyword $k_j$** is a predefined label used for categorising datasets. Each metadata record $m_i$ is associated with a set of keywords $K_i \subseteq K$, where $K$ is the complete set of predefined keywords. The keywords $K_i$ of a metadata record $m_i$ are mathematically represented as a binary vector $y_i$ of size $|K|$, where each element indicates the presence or absence of a specific label. A value of 1 at position $j$ denotes that the label $k_j \in K$ is present in the metadata record $m_i$ (that is, $k_j \in K_i$), while a value of 0 indicates its absence. A target matrix $\mathbf{Y}$ is a $|M_s| \times |K|$ binary matrix, where $|M_s|$ is the size of the metadata record set $M_s=\{m_1,m_2,\ldots, m_x\}$ and $|K|$ is the size of the keyword set $K=\{k_1, k_2, \ldots, k_y\}$. Each entry $\mathbf{Y}[i, j]$ is 1 if metadata record $m_i$ is associated with keyword $k_j$, and 0 otherwise.

The target matrix $\mathbf{Y} $ is the output of the *keyword classification model*. $\mathbf{Y} \in \{0, 1\}^{|M_s| \times |K|}$ aggregates all binary vectors for the metadata set $M_s$, where:  
- $|M_s|$: Total number of metadata records.  
- $|K|$: Total number of keywords.  

Each entry $\mathbf{Y}[i, j]$ is $1$ if record $m_i$ is associated with keyword $k_j$, and $0$ otherwise.
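
As an illustration of the definition above, the following sketch builds a small target matrix from keyword sets. The helper name `build_target_matrix`, the three-keyword vocabulary and the three records are made up for the example and do not come from the AODN catalogue.

```python
from typing import List, Sequence, Set


def build_target_matrix(record_keywords: Sequence[Set[str]],
                        vocabulary: Sequence[str]) -> List[List[int]]:
    """Return Y where Y[i][j] is 1 if record i carries keyword vocabulary[j], else 0."""
    return [[1 if k in kws else 0 for k in vocabulary] for kws in record_keywords]


# Illustrative keyword set K and keyword sets K_i for three records.
K = ["temperature", "salinity", "wind"]
records = [{"temperature", "wind"}, set(), {"salinity"}]

Y = build_target_matrix(records, K)
for row in Y:
    print(row)
```

Each row is the binary vector $y_i$ of one record, and the matrix has $|M_s|$ rows and $|K|$ columns.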

- **Definition 3: Data Delivery Mode Filter $F_D$** is a predefined set of four modes: $F_D = \{\text{Completed, Real-Time, Delayed, Other}\}$.
The mapping of a dataset's status to a delivery mode follows these rules:  
1. If the status is "Completed," it is mapped to **Completed**.  
2. If the status is neither "Completed" nor "Ongoing," it is mapped to **Other**.  
3. If the status is "Ongoing," further analysis determines the mapping as either **Real-Time** or **Delayed**.  

- **Definition 4: The Status $u_i$** is textual information denoting the state of a dataset. It can take one of three possible values: $u_i \in \{\text{Completed, Ongoing, Others}\}$.
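
The three mapping rules in Definition 3 can be sketched as a plain function. This is only an illustration under assumptions: the name `map_status_to_mode` is hypothetical, the comparison is made case-insensitive because the status appears as `onGoing` in the data shown later, and `None` is used here as a placeholder for "needs further analysis".

```python
from typing import Optional


def map_status_to_mode(status: str) -> Optional[str]:
    """Map a record status to a delivery mode following the rules of Definition 3.

    Returns None for ongoing records, whose mode (Real-Time or Delayed)
    cannot be decided from the status alone.
    """
    normalised = status.strip().lower()
    if normalised == "completed":
        return "Completed"
    if normalised == "ongoing":
        return None  # placeholder: further analysis decides Real-Time vs Delayed
    return "Other"
```

With this sketch, `map_status_to_mode("onGoing")` returns `None`, which is exactly the case the ML model in this notebook is meant to resolve.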

### Problem Description:
In the catalogue $C=\{M, K, F_D \}$, the **data delivery mode filter $F_D$** is used to search for and filter datasets based on one of four delivery modes: $\{\text{Completed, Real-Time, Delayed, Other}\}$. Normally, the "Completed" and "Other" modes are straightforward to identify because they follow directly from the metadata record's status field. However, for records marked with an "Ongoing" status, classification becomes ambiguous, especially for non-IMOS datasets. For IMOS datasets, a clear rule exists: titles contain keywords like "Real-Time" (or its variants) and "Delayed" (or its variants), which can be used directly for classification.

For non-IMOS datasets with an "Ongoing" status, the classification logic cannot rely on such title-based rules. Current approaches, such as if-else decision trees, fail to generalise due to the volume of ambiguous records (as discussed in the [GitHub issue](https://github.com/aodn/backlog/issues/6148)).

To address this, the task is framed as a **text classification** task focused on non-IMOS datasets with an "Ongoing" status. The input is the combined textual features of a metadata record: its title $t_i$, abstract $d_i$, and lineage $s_i$. The output is the data delivery mode, either "Real-Time" or "Delayed". The goal is to develop a machine learning (ML) model capable of learning a mapping rule between the textual content of metadata records and the appropriate data delivery mode.

In [1]:
# add module path for notebook to use
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(project_root)

# import customised modules
import data_discovery_ai.utils.preprocessor as preprocessor
import data_discovery_ai.common.constants as constants
import data_discovery_ai.utils.es_connector as es_connector


## DEVELOPMENT

-----------

In [2]:
ddm_resource_folder = constants.FILTER_FOLDER
ddm_resource_file = constants.FILTER_PREPROCESSED_FILE
filter_preprocessed_data = preprocessor.load_from_file(f"../data_discovery_ai/resources/{ddm_resource_folder}/{ddm_resource_file}")
filter_preprocessed_data

Unnamed: 0,id,title,abstract,lineage,status,information,embedding
1,0024b456-4636-42cd-b097-388f6d39a835,WAMSI Node 4.2.3 - Fisheries dependent data an...,"From 1990, continuous (15-minute, 30-minute or...","a) 1990 to 1994: Wesdata models 886, 187, 389;...",onGoing,WAMSI Node 4.2.3 - Fisheries dependent data an...,"[-1.2472235, -0.52754223, 0.08079154, -0.50992..."
12,006bb7dc-860b-4b89-bf4c-6bd930bd35b7,IMOS - ANMN National Reference Stations - Darw...,This collection includes observations transmit...,NATIONAL REFERENCE STATIONS The IMOS national ...,onGoing,IMOS - ANMN National Reference Stations - Darw...,"[-0.91694874, -0.2440666, 0.26462442, -0.03149..."
16,0094682a-e438-41e8-a39b-19cf2093025d,Thursday Island Wind From 08 Feb 2012,This data set was collected by weather sensors...,Data from AIMS weather stations are subjected ...,onGoing,Thursday Island Wind From 08 Feb 2012 [SEP] Th...,"[-1.0509518, -0.55623543, 0.2895106, -0.310500..."
24,00d34cd4-24fe-4361-b667-1782c919e870,Coastal Infrastructure Points,Spatial representation of Department of Transp...,Features sourced from DOT Asset Management dat...,onGoing,Coastal Infrastructure Points [SEP] Spatial re...,"[-0.96098024, -0.38567236, 0.054584935, -0.344..."
37,0155375c-8070-4662-9c93-b593ee4891b0,Davies Reef Water Temperature From 18 Oct 1991,The 'Wireless Sensor Networks Facility' (forme...,All sensors are factory calibrated and then ca...,onGoing,Davies Reef Water Temperature From 18 Oct 1991...,"[-0.8229743, -0.26159173, 0.58714104, -0.04611..."
...,...,...,...,...,...,...,...
12891,svenner_penguin_gis,Svenner Islands penguin GIS dataset,Aerial photography (Linhof) of penguin colonie...,TOPIC: Planimetric accuracy of penguin colony ...,onGoing,Svenner Islands penguin GIS dataset [SEP] Aeri...,"[-1.1401156, -0.40232602, -0.1859744, -0.21416..."
12900,underway_ship_data,Underway voyage data collected from Australian...,Australian Antarctic Division chartered vessel...,See the child records for further information.,onGoing,Underway voyage data collected from Australian...,"[-0.9423639, -0.3985203, 0.38082838, -0.003341..."
12908,voyage_202122050,"RSV Nuyina Voyage 5 2021-22 Voyage Data, South...",This dataset contains the Voyage Data from voy...,The quality for the science data produced duri...,onGoing,"RSV Nuyina Voyage 5 2021-22 Voyage Data, South...","[-0.9219631, -0.30171615, 0.2630324, -0.072466..."
12914,voyage_202324050,RSV Nuyina Voyage Data 2023-24 V5,Voyage 5 of the 2023/2024 season onboard RSV N...,Before using any data collected on a voyage pl...,onGoing,RSV Nuyina Voyage Data 2023-24 V5 [SEP] Voyage...,"[-1.119304, -0.7627072, 0.2805291, 0.03869628,..."


In [3]:
temp = filter_preprocessed_data.copy()

In [4]:
import pandas as pd

# Find rows whose title contains 'real time' or one of its variants (case-insensitive).
real_time_variants = ['real time', 'real-time', 'realtime']
# .copy() gives an independent DataFrame, so adding a column does not raise SettingWithCopyWarning.
real_time_data = temp[temp['title'].str.contains('|'.join(real_time_variants), case=False)].copy()
real_time_data['mode'] = 'Real-Time'

# Do the same for 'delayed' and its variants.
delayed_variants = ['delayed', 'delay', 'delaying']
delayed_data = temp[temp['title'].str.contains('|'.join(delayed_variants), case=False)].copy()
delayed_data['mode'] = 'Delayed'

# Concatenate the Real-Time and Delayed records.
real_time_delayed_data = pd.concat([real_time_data, delayed_data])
real_time_delayed_data.head()


Unnamed: 0,id,title,abstract,lineage,status,information,embedding,mode
12,006bb7dc-860b-4b89-bf4c-6bd930bd35b7,IMOS - ANMN National Reference Stations - Darw...,This collection includes observations transmit...,NATIONAL REFERENCE STATIONS The IMOS national ...,onGoing,IMOS - ANMN National Reference Stations - Darw...,"[-0.91694874, -0.2440666, 0.26462442, -0.03149...",Real-Time
551,0c9eb39c-9cbe-4c6a-8a10-5867087e703a,IMOS - OceanCurrent - Gridded sea level anomal...,"Gridded (adjusted) sea level anomaly (GSLA), g...",Since 28/03/2023 the near real time (NRT) coll...,onGoing,IMOS - OceanCurrent - Gridded sea level anomal...,"[-0.91473687, -0.55327815, 0.44929105, -0.0817...",Real-Time
599,0dd3832a-cf67-4068-a446-a9c91c77273e,IMOS - ACORN - Coral Coast HF ocean radar site...,The Coral Coast (CORL) HF ocean radar system c...,This site was offline from November 2022 - Mar...,onGoing,IMOS - ACORN - Coral Coast HF ocean radar site...,"[-0.9607874, -0.41031778, 0.49067673, -0.33070...",Real-Time
2353,35234913-aa3c-48ec-b9a4-77f822f66ef8,IMOS - SOOP Expendable Bathythermographs (XBT)...,XBT real-time data is available through the IM...,XBT real-time data contains only RAW data. The...,onGoing,IMOS - SOOP Expendable Bathythermographs (XBT)...,"[-0.7841503, -0.3210754, 0.39499345, -0.144484...",Real-Time
3205,4d3d4aca-472e-4616-88a5-df0f5ab401ba,IMOS - ANMN Acidification Moorings (AM) Sub-Fa...,This collection delivers in near real-time mea...,Currently the instrumentation is as follows: Y...,onGoing,IMOS - ANMN Acidification Moorings (AM) Sub-Fa...,"[-1.1529939, -0.31419733, 0.26247346, -0.20037...",Real-Time


In [5]:
filter_data = filter_preprocessed_data.join(real_time_delayed_data['mode'])

In [6]:
# check how many NaN values in the mode column
print(f"Number of records with unknown mode: {filter_data['mode'].isnull().sum()}")
#  check how many Real-Time records in the mode column
print(f"Number of Real-Time records: {filter_data['mode'].str.contains('Real-Time').sum()}")
#  check how many Delayed records in the mode column
print(f"Number of Delayed records: {filter_data['mode'].str.contains('Delayed').sum()}")

Number of records with unknown mode: 1600
Number of Real-Time records: 19
Number of Delayed records: 22


-------
## ML model selection
The counts above show that, of all **1641** "onGoing" records, **1600** have an unknown data delivery mode, **19** are explicitly identified as *"Real-Time"*, and **22** as *"Delayed"*. Only a small amount of labelled data is therefore available for training a machine learning (ML) model, so traditional supervised learning approaches are not suitable in this scenario.
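
The share of labelled records can be checked with a few lines of Python. The counts are copied from the output printed above; the variable names are only for illustration.

```python
# Counts taken from the output of the previous cell.
counts = {"Real-Time": 19, "Delayed": 22, "unknown": 1600}

total = sum(counts.values())
labelled = counts["Real-Time"] + counts["Delayed"]
labelled_fraction = labelled / total

print(f"{labelled} of {total} records are labelled ({labelled_fraction:.1%})")
```

Roughly one record in forty carries a label, which is why the methods below try to exploit the unlabelled records as well.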

To address this limitation, we need to leverage the unlabelled data, potentially in conjunction with the known categorised data, to develop an appropriate ML model. Two potential solutions are as follows:
1. Unsupervised learning: This approach does not require labelled data and could be applied to uncover patterns within the unlabelled dataset.
2. Semi-supervised learning: This method combines both unlabelled and labelled data, enabling the model to learn a mapping function by utilising the strengths of both data types.

### Method 1: Clustering (unsupervised learning)
We know that there should be only two classes: "Real-Time" and "Delayed". We therefore use KMeans, a widely accepted clustering algorithm, with two clusters ($k=2$) to investigate the feasibility of unsupervised learning for this task.


In [7]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
# get embeddings
embeddings = filter_data["embedding"].tolist()
id = filter_data["id"].tolist()
embedding_map = dict(zip(id, embeddings))
kmeans.fit(embeddings)

--------

In [8]:
# assign cluster labels to each record
clustered_data = filter_data.copy()
clustered_data["cluster"] = kmeans.labels_
clustered_data


Unnamed: 0,id,title,abstract,lineage,status,information,embedding,mode,cluster
1,0024b456-4636-42cd-b097-388f6d39a835,WAMSI Node 4.2.3 - Fisheries dependent data an...,"From 1990, continuous (15-minute, 30-minute or...","a) 1990 to 1994: Wesdata models 886, 187, 389;...",onGoing,WAMSI Node 4.2.3 - Fisheries dependent data an...,"[-1.2472235, -0.52754223, 0.08079154, -0.50992...",,0
12,006bb7dc-860b-4b89-bf4c-6bd930bd35b7,IMOS - ANMN National Reference Stations - Darw...,This collection includes observations transmit...,NATIONAL REFERENCE STATIONS The IMOS national ...,onGoing,IMOS - ANMN National Reference Stations - Darw...,"[-0.91694874, -0.2440666, 0.26462442, -0.03149...",Real-Time,0
16,0094682a-e438-41e8-a39b-19cf2093025d,Thursday Island Wind From 08 Feb 2012,This data set was collected by weather sensors...,Data from AIMS weather stations are subjected ...,onGoing,Thursday Island Wind From 08 Feb 2012 [SEP] Th...,"[-1.0509518, -0.55623543, 0.2895106, -0.310500...",,0
24,00d34cd4-24fe-4361-b667-1782c919e870,Coastal Infrastructure Points,Spatial representation of Department of Transp...,Features sourced from DOT Asset Management dat...,onGoing,Coastal Infrastructure Points [SEP] Spatial re...,"[-0.96098024, -0.38567236, 0.054584935, -0.344...",,1
37,0155375c-8070-4662-9c93-b593ee4891b0,Davies Reef Water Temperature From 18 Oct 1991,The 'Wireless Sensor Networks Facility' (forme...,All sensors are factory calibrated and then ca...,onGoing,Davies Reef Water Temperature From 18 Oct 1991...,"[-0.8229743, -0.26159173, 0.58714104, -0.04611...",,0
...,...,...,...,...,...,...,...,...,...
12891,svenner_penguin_gis,Svenner Islands penguin GIS dataset,Aerial photography (Linhof) of penguin colonie...,TOPIC: Planimetric accuracy of penguin colony ...,onGoing,Svenner Islands penguin GIS dataset [SEP] Aeri...,"[-1.1401156, -0.40232602, -0.1859744, -0.21416...",,0
12900,underway_ship_data,Underway voyage data collected from Australian...,Australian Antarctic Division chartered vessel...,See the child records for further information.,onGoing,Underway voyage data collected from Australian...,"[-0.9423639, -0.3985203, 0.38082838, -0.003341...",,0
12908,voyage_202122050,"RSV Nuyina Voyage 5 2021-22 Voyage Data, South...",This dataset contains the Voyage Data from voy...,The quality for the science data produced duri...,onGoing,"RSV Nuyina Voyage 5 2021-22 Voyage Data, South...","[-0.9219631, -0.30171615, 0.2630324, -0.072466...",,1
12914,voyage_202324050,RSV Nuyina Voyage Data 2023-24 V5,Voyage 5 of the 2023/2024 season onboard RSV N...,Before using any data collected on a voyage pl...,onGoing,RSV Nuyina Voyage Data 2023-24 V5 [SEP] Voyage...,"[-1.119304, -0.7627072, 0.2805291, 0.03869628,...",,0


In [9]:
import numpy as np
from collections import Counter

# based on labelled dataset, calculate majority vote for each cluster
# .copy() avoids SettingWithCopyWarning when the mode column is overwritten below
labelled_data = clustered_data[clustered_data["mode"].notnull()].copy()
# replace mode with 1 for Delayed and 0 for Real-Time {0:Real-Time, 1:Delayed}
labelled_data["mode"] = labelled_data["mode"].apply(lambda x: 1 if x == "Delayed" else 0)

cluster_assignments = np.array(labelled_data['cluster'])  # Clustering results
true_labels = np.array(labelled_data['mode'])          # Ground truth labels

# Map clusters to their dominant class
clusters = {}
for cluster in np.unique(cluster_assignments):
    indices = np.where(cluster_assignments == cluster)[0]
    cluster_labels = true_labels[indices]
    dominant_class = Counter(cluster_labels).most_common(1)[0][0]
    clusters[cluster] = dominant_class

print("Cluster-to-Class Mapping:", clusters)

# assign majority vote to each cluster
clustered_data["predicted_mode"] = clustered_data["cluster"].map(clusters)


Cluster-to-Class Mapping: {0: 0, 1: 1}


The Cluster-to-Class Mapping: $\{0: 0, 1: 1\}$ suggests cluster 0 is related to class 0 (Real-Time) and cluster 1 is related to class 1 (Delayed). Let's convert them to semantic representations.

In [10]:
clustered_data["predicted_mode"] = clustered_data["predicted_mode"].apply(lambda x: "Delayed" if x == 1 else "Real-Time")
clustered_data

Unnamed: 0,id,title,abstract,lineage,status,information,embedding,mode,cluster,predicted_mode
1,0024b456-4636-42cd-b097-388f6d39a835,WAMSI Node 4.2.3 - Fisheries dependent data an...,"From 1990, continuous (15-minute, 30-minute or...","a) 1990 to 1994: Wesdata models 886, 187, 389;...",onGoing,WAMSI Node 4.2.3 - Fisheries dependent data an...,"[-1.2472235, -0.52754223, 0.08079154, -0.50992...",,0,Real-Time
12,006bb7dc-860b-4b89-bf4c-6bd930bd35b7,IMOS - ANMN National Reference Stations - Darw...,This collection includes observations transmit...,NATIONAL REFERENCE STATIONS The IMOS national ...,onGoing,IMOS - ANMN National Reference Stations - Darw...,"[-0.91694874, -0.2440666, 0.26462442, -0.03149...",Real-Time,0,Real-Time
16,0094682a-e438-41e8-a39b-19cf2093025d,Thursday Island Wind From 08 Feb 2012,This data set was collected by weather sensors...,Data from AIMS weather stations are subjected ...,onGoing,Thursday Island Wind From 08 Feb 2012 [SEP] Th...,"[-1.0509518, -0.55623543, 0.2895106, -0.310500...",,0,Real-Time
24,00d34cd4-24fe-4361-b667-1782c919e870,Coastal Infrastructure Points,Spatial representation of Department of Transp...,Features sourced from DOT Asset Management dat...,onGoing,Coastal Infrastructure Points [SEP] Spatial re...,"[-0.96098024, -0.38567236, 0.054584935, -0.344...",,1,Delayed
37,0155375c-8070-4662-9c93-b593ee4891b0,Davies Reef Water Temperature From 18 Oct 1991,The 'Wireless Sensor Networks Facility' (forme...,All sensors are factory calibrated and then ca...,onGoing,Davies Reef Water Temperature From 18 Oct 1991...,"[-0.8229743, -0.26159173, 0.58714104, -0.04611...",,0,Real-Time
...,...,...,...,...,...,...,...,...,...,...
12891,svenner_penguin_gis,Svenner Islands penguin GIS dataset,Aerial photography (Linhof) of penguin colonie...,TOPIC: Planimetric accuracy of penguin colony ...,onGoing,Svenner Islands penguin GIS dataset [SEP] Aeri...,"[-1.1401156, -0.40232602, -0.1859744, -0.21416...",,0,Real-Time
12900,underway_ship_data,Underway voyage data collected from Australian...,Australian Antarctic Division chartered vessel...,See the child records for further information.,onGoing,Underway voyage data collected from Australian...,"[-0.9423639, -0.3985203, 0.38082838, -0.003341...",,0,Real-Time
12908,voyage_202122050,"RSV Nuyina Voyage 5 2021-22 Voyage Data, South...",This dataset contains the Voyage Data from voy...,The quality for the science data produced duri...,onGoing,"RSV Nuyina Voyage 5 2021-22 Voyage Data, South...","[-0.9219631, -0.30171615, 0.2630324, -0.072466...",,1,Delayed
12914,voyage_202324050,RSV Nuyina Voyage Data 2023-24 V5,Voyage 5 of the 2023/2024 season onboard RSV N...,Before using any data collected on a voyage pl...,onGoing,RSV Nuyina Voyage Data 2023-24 V5 [SEP] Voyage...,"[-1.119304, -0.7627072, 0.2805291, 0.03869628,...",,0,Real-Time


In [11]:
# filter for non-NaN mode
labelled_data = clustered_data[clustered_data["mode"].notnull()]
labelled_data.head()

Unnamed: 0,id,title,abstract,lineage,status,information,embedding,mode,cluster,predicted_mode
12,006bb7dc-860b-4b89-bf4c-6bd930bd35b7,IMOS - ANMN National Reference Stations - Darw...,This collection includes observations transmit...,NATIONAL REFERENCE STATIONS The IMOS national ...,onGoing,IMOS - ANMN National Reference Stations - Darw...,"[-0.91694874, -0.2440666, 0.26462442, -0.03149...",Real-Time,0,Real-Time
77,02640f4e-08d0-4f3a-956b-7f9b58966ccc,IMOS - SOOP-Temperate Merchant Vessel (TMV) su...,Enhancement of Measurements on Ships of Opport...,Overview of SOOP data 12 hour - Pt Melbourne (...,onGoing,IMOS - SOOP-Temperate Merchant Vessel (TMV) su...,"[-1.1845901, -0.33011627, 0.5757241, -0.254327...",Delayed,0,Real-Time
86,028b9801-279f-427c-964b-0ffcdf310b59,IMOS - ACORN - Rottnest Shelf HF ocean radar s...,The Rottnest Shelf (ROT) HF ocean radar system...,,onGoing,IMOS - ACORN - Rottnest Shelf HF ocean radar s...,"[-1.1958747, -0.033913232, 0.14824693, -0.3890...",Delayed,1,Delayed
207,055342fc-f970-4be7-a764-8903220d42fb,IMOS - ACORN - Turquoise Coast HF ocean radar ...,The Turquoise Coast (TURQ) HF ocean radar syst...,,onGoing,IMOS - ACORN - Turquoise Coast HF ocean radar ...,"[-0.96425456, -0.22488132, 0.29475918, -0.2436...",Delayed,0,Real-Time
551,0c9eb39c-9cbe-4c6a-8a10-5867087e703a,IMOS - OceanCurrent - Gridded sea level anomal...,"Gridded (adjusted) sea level anomaly (GSLA), g...",Since 28/03/2023 the near real time (NRT) coll...,onGoing,IMOS - OceanCurrent - Gridded sea level anomal...,"[-0.91473687, -0.55327815, 0.44929105, -0.0817...",Real-Time,0,Real-Time


We have labelled data, so let's use them as ground truth to evaluate this method.

In [12]:
# check how many records are correctly classified
correctly_classified = labelled_data[labelled_data["mode"] == labelled_data["predicted_mode"]]
# calculate accuracy
accuracy = len(correctly_classified) / len(labelled_data)
accuracy

0.5609756097560976

An accuracy of about 0.56 on a two-class problem is close to chance, so clustering is not adequate here. We therefore turn to the other solution: a semi-supervised learning approach.
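
To put the accuracy above in context, it helps to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. The sketch below uses the 19 Real-Time and 22 Delayed counts reported earlier; the helper name `majority_baseline` is hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Label counts taken from the labelled subset reported earlier in the notebook.
labels = ["Real-Time"] * 19 + ["Delayed"] * 22
baseline = majority_baseline(labels)

print(f"majority-class baseline: {baseline:.3f}")
```

The baseline is 22/41, so the clustering accuracy of about 0.561 is only marginally better than always answering "Delayed".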

### Method 2: Semi-Supervised Learning
Two approaches are investigated in this section: label propagation and pseudo-labelling. The scikit-learn example on [semi-supervised classification of a text dataset](https://scikit-learn.org/1.5/auto_examples/semi_supervised/plot_semi_supervised_newsgroups.html#sphx-glr-auto-examples-semi-supervised-plot-semi-supervised-newsgroups-py) serves as a reference.
#### Label Propagation approach
Reference: [LabelSpreading](https://scikit-learn.org/1.5/modules/generated/sklearn.semi_supervised.LabelSpreading.html#sklearn.semi_supervised.LabelSpreading) in scikit-learn.
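
Before applying it to the metadata embeddings, the idea can be illustrated on synthetic data. The sketch below is an assumption-laden toy example, not the notebook's model: the two Gaussian blobs stand in for embeddings, and the `kernel`, `n_neighbors` and `alpha` values are arbitrary choices. As in the cells that follow, unlabelled samples are marked with `-1`.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(42)

# Synthetic stand-in for embeddings: two well-separated Gaussian blobs in 4 dimensions.
X_demo = np.vstack([
    rng.normal(-2.0, 0.5, size=(50, 4)),
    rng.normal(2.0, 0.5, size=(50, 4)),
])
y_true = np.array([0] * 50 + [1] * 50)

# Keep only three labels per class; -1 marks unlabelled samples.
y_partial = np.full_like(y_true, -1)
y_partial[:3] = 0
y_partial[50:53] = 1

# Spread the few known labels through a k-nearest-neighbour graph.
model = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)
model.fit(X_demo, y_partial)

# transduction_ holds the label inferred for every training sample.
inferred = model.transduction_
print("agreement with the hidden labels:", (inferred == y_true).mean())
```

On real embeddings the classes are unlikely to be this cleanly separated, so the result here says nothing about how well the approach works on the catalogue data.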

In [14]:
# prepare sample data
sample_data = filter_data.copy()
label_map = {"Real-Time": 0, "Delayed": 1, np.nan: -1}
sample_data["mode"] = sample_data["mode"].map(label_map)
sample_data

Unnamed: 0,id,title,abstract,lineage,status,information,embedding,mode
1,0024b456-4636-42cd-b097-388f6d39a835,WAMSI Node 4.2.3 - Fisheries dependent data an...,"From 1990, continuous (15-minute, 30-minute or...","a) 1990 to 1994: Wesdata models 886, 187, 389;...",onGoing,WAMSI Node 4.2.3 - Fisheries dependent data an...,"[-1.2472235, -0.52754223, 0.08079154, -0.50992...",-1
12,006bb7dc-860b-4b89-bf4c-6bd930bd35b7,IMOS - ANMN National Reference Stations - Darw...,This collection includes observations transmit...,NATIONAL REFERENCE STATIONS The IMOS national ...,onGoing,IMOS - ANMN National Reference Stations - Darw...,"[-0.91694874, -0.2440666, 0.26462442, -0.03149...",0
16,0094682a-e438-41e8-a39b-19cf2093025d,Thursday Island Wind From 08 Feb 2012,This data set was collected by weather sensors...,Data from AIMS weather stations are subjected ...,onGoing,Thursday Island Wind From 08 Feb 2012 [SEP] Th...,"[-1.0509518, -0.55623543, 0.2895106, -0.310500...",-1
24,00d34cd4-24fe-4361-b667-1782c919e870,Coastal Infrastructure Points,Spatial representation of Department of Transp...,Features sourced from DOT Asset Management dat...,onGoing,Coastal Infrastructure Points [SEP] Spatial re...,"[-0.96098024, -0.38567236, 0.054584935, -0.344...",-1
37,0155375c-8070-4662-9c93-b593ee4891b0,Davies Reef Water Temperature From 18 Oct 1991,The 'Wireless Sensor Networks Facility' (forme...,All sensors are factory calibrated and then ca...,onGoing,Davies Reef Water Temperature From 18 Oct 1991...,"[-0.8229743, -0.26159173, 0.58714104, -0.04611...",-1
...,...,...,...,...,...,...,...,...
12891,svenner_penguin_gis,Svenner Islands penguin GIS dataset,Aerial photography (Linhof) of penguin colonie...,TOPIC: Planimetric accuracy of penguin colony ...,onGoing,Svenner Islands penguin GIS dataset [SEP] Aeri...,"[-1.1401156, -0.40232602, -0.1859744, -0.21416...",-1
12900,underway_ship_data,Underway voyage data collected from Australian...,Australian Antarctic Division chartered vessel...,See the child records for further information.,onGoing,Underway voyage data collected from Australian...,"[-0.9423639, -0.3985203, 0.38082838, -0.003341...",-1
12908,voyage_202122050,"RSV Nuyina Voyage 5 2021-22 Voyage Data, South...",This dataset contains the Voyage Data from voy...,The quality for the science data produced duri...,onGoing,"RSV Nuyina Voyage 5 2021-22 Voyage Data, South...","[-0.9219631, -0.30171615, 0.2630324, -0.072466...",-1
12914,voyage_202324050,RSV Nuyina Voyage Data 2023-24 V5,Voyage 5 of the 2023/2024 season onboard RSV N...,Before using any data collected on a voyage pl...,onGoing,RSV Nuyina Voyage Data 2023-24 V5 [SEP] Voyage...,"[-1.119304, -0.7627072, 0.2805291, 0.03869628,...",-1


In [15]:
# split into training and testing data
from sklearn.model_selection import train_test_split
X = np.array(sample_data["embedding"].tolist())
y = np.array(sample_data["mode"].tolist())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)