# ML model for Keyword Classification - Non-tech Notebook
This notebook shows non-technical audiences how to use the trained ML model. For how to preprocess raw data and the details of training a keyword classification model, please refer to [keywordClassificationTechNotebook.ipynb](KeywordClassificationTechNotebook.ipynb). Note that the methods in that notebook are broken down step by step. To use the code as a single pipeline, this notebook uses the functions provided in `pipeline.py`.

First, let's define an unlabelled item with its title and abstract. We will use it as an example to show how the pipeline classifies keywords under different parameter settings.

In [8]:
item_title = "Corals and coral communities of Lord Howe Island, Australia"
item_abstract = """Ecological and taxonomic surveys of hermatypic scleractinian corals were carried out at approximately 100 sites around Lord Howe Island. Sixty-six of these sites were located on reefs in the lagoon, which extends for two-thirds of the length of the island on the western side. Each survey site consisted of a section of reef surface, which appeared to be topographically and faunistically homogeneous. The dimensions of the sites surveyed were generally of the order of 20m by 20m. Where possible, sites were arranged contiguously along a band up the reef slope and across the flat. The cover of each species was graded on a five-point scale of percentage relative cover. Other site attributes recorded were depth (minimum and maximum corrected to datum), slope (estimated), substrate type, total estimated cover of soft coral and algae (macroscopic and encrusting coralline). Coral data from the lagoon and its reef (66 sites) were used to define a small number of site groups which characterize most of this area.Throughout the survey, corals of taxonomic interest or difficulty were collected, and an extensive photographic record was made to augment survey data. A collection of the full range of form of all coral species was made during the survey and an identified reference series was deposited in the Australian Museum.In addition, less detailed descriptive data pertaining to coral communities and topography were recorded on 12 reconnaissance transects, the authors recording changes seen while being towed behind a boat.
 The purpose of this study was to describe the corals of Lord Howe Island (the southernmost Indo-Pacific reef) at species and community level using methods that would allow differentiation of community types and allow comparisons with coral communities in other geographic locations."""
item_description = f"{item_title}: {item_abstract}"
item_description

'Corals and coral communities of Lord Howe Island, Australia: Ecological and taxonomic surveys of hermatypic scleractinian corals were carried out at approximately 100 sites around Lord Howe Island. Sixty-six of these sites were located on reefs in the lagoon, which extends for two-thirds of the length of the island on the western side. Each survey site consisted of a section of reef surface, which appeared to be topographically and faunistically homogeneous. The dimensions of the sites surveyed were generally of the order of 20m by 20m. Where possible, sites were arranged contiguously along a band up the reef slope and across the flat. The cover of each species was graded on a five-point scale of percentage relative cover. Other site attributes recorded were depth (minimum and maximum corrected to datum), slope (estimated), substrate type, total estimated cover of soft coral and algae (macroscopic and encrusting coralline). Coral data from the lagoon and its reef (66 sites) were used 

In [None]:
# add module path for notebook to use
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(project_root)

# import customised modules
from data_discovery_ai.pipeline import pipeline

The `pipeline()` function in the `pipeline` module takes four parameters:

1. `isDataChanged`: bool. Whether to run the data preprocessing module.
2. `usePretrainedModel`: bool. Whether to use a pretrained model instead of training one.
3. `description`: The item description used for prediction. It should be the title and abstract in a single string, formatted as `item_title: item_abstract`, to match the format of the training set.
4. `selected_model`: The name of the selected model. It must be one of these options: `development`, `experimental`, `staging`, `production`, `benchmark`.

| Option | Purpose | Typical Use |
| ---- | ---- | ---- |
| `development` | Dedicated to active model development, testing, and iteration. | Building and refining new model versions, features, or datasets. |
| `experimental` | Supports exploratory work for new techniques or fine-tuning. | Experimenting with new architectures, features, or hyperparameter tuning. |
| `staging` | Prepares the model for production with real-use evaluations. | Conducting final testing in a production-like environment to verify stability and performance. |
| `production` | Deployment environment for live model usage in real-world scenarios. | Running and monitoring models in active use by API. |
| `benchmark` | Baseline model used to assess improvements or changes. | Comparing performance metrics against new models. |
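
Before calling `pipeline()`, it can help to check the two string arguments. The sketch below is illustrative only: `build_description` and `check_selected_model` are hypothetical helper names, not functions from `pipeline.py`, and the list of allowed names is simply copied from the table above.

```python
# Hypothetical helpers for illustration; they are NOT part of data_discovery_ai.
# The allowed names are copied from the options table in this notebook.
ALLOWED_MODELS = ("development", "experimental", "staging", "production", "benchmark")


def build_description(title: str, abstract: str) -> str:
    """Join title and abstract as 'title: abstract', the format described above."""
    return f"{title}: {abstract}"


def check_selected_model(name: str) -> str:
    """Return the name unchanged if it is an allowed option, else raise ValueError."""
    if name not in ALLOWED_MODELS:
        raise ValueError(f"selected_model must be one of {ALLOWED_MODELS}, got {name!r}")
    return name


print(build_description("A title", "An abstract."))
print(check_selected_model("experimental"))
```

Checking the name up front gives a clear error message for a typo such as `"prod"`, instead of a failure somewhere later in the run.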

With fixed `description` and `selected_model`, there are three combinations of `isDataChanged` and `usePretrainedModel` that cover three typical scenarios:

1. `isDataChanged=True` and `usePretrainedModel=False`

    This setup is commonly used when a new environment is established, such as during the initial deployment of an ML system. It is also appropriate when the raw data has undergone significant changes, requiring the pipeline to be reapplied. However, use caution when applying this pipeline for end-users, as it can be time-consuming.

2. `isDataChanged=False` and `usePretrainedModel=False`

    This setup retrains the model on the data that has already been preprocessed. It is suitable when we adjust the model through hyperparameter tuning without changing the raw data, or when the raw data has not undergone significant changes. It should only be used by developers conducting experiments.

3. `isDataChanged=False` and `usePretrainedModel=True`

    This setup uses a pretrained model for prediction. It is designed to provide a keyword classification suggestion service for end-users, enabling efficient and reliable functionality.
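
The flag combinations can be summarised as a lookup from the two flags to the stages that run. This is only a sketch that follows what the runs in this notebook actually do: `stages_for` is a hypothetical name, the stage labels are informal, and the combination `isDataChanged=True` with `usePretrainedModel=True` is not described here, so the sketch rejects it rather than guessing.

```python
from typing import List

# Hypothetical summary of which stages run for each flag combination,
# based on the runs shown in this notebook. Stage names are informal labels.


def stages_for(is_data_changed: bool, use_pretrained_model: bool) -> List[str]:
    if is_data_changed and not use_pretrained_model:
        # Full run: fetch and preprocess data, train, then predict.
        return ["preprocessing", "training", "prediction"]
    if not is_data_changed and not use_pretrained_model:
        # Reuse the saved preprocessed data, retrain, then predict.
        return ["training", "prediction"]
    if not is_data_changed and use_pretrained_model:
        # Load an existing model and predict only.
        return ["prediction"]
    # isDataChanged=True with usePretrainedModel=True is not covered in this notebook.
    raise ValueError("combination not described in this notebook")


for flags in [(True, False), (False, False), (False, True)]:
    print(flags, "->", stages_for(*flags))
```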

In [10]:
pipeline(
    isDataChanged=True,
    usePretrainedModel=False,
    description=item_description,
    selected_model="experimental",
)

searching elasticsearch: 100%|██████████| 67/67 [06:06<00:00,  5.47s/it]
100%|██████████| 1055/1055 [24:21<00:00,  1.39s/it]
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\AppData\Local\Temp\tmpzknlp5xo\keyword_sample.pkl
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\OneDrive - University of Tasmania\IMOS\DataDiscovery\data-discovery-ai\data_discovery_ai\resources\keyword_sample.pkl
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\AppData\Local\Temp\tmpzknlp5xo\keyword_label.pkl


Total samples: 2793
Dimension: 768
No. of labels: 456
X resampled set size: 2793
Y resampled set size: 2793


INFO:data_discovery_ai.utils.preprocessor:Total samples: 2793
INFO:data_discovery_ai.utils.preprocessor:Dimension: 768
INFO:data_discovery_ai.utils.preprocessor:No. of labels: 456
INFO:data_discovery_ai.utils.preprocessor:Train set size: 2240 (80.20%)
INFO:data_discovery_ai.utils.preprocessor:Test set size: 553 (19.80%)
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\OneDrive - University of Tasmania\IMOS\DataDiscovery\data-discovery-ai\data_discovery_ai\resources\keyword_label.pkl


Total samples: 74277
Dimension: 768
No. of labels: 456
X resampled set size: 74277
Y resampled set size: 74277


Epoch 1/100
1857/1857 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.1507 - loss: 4.2403e-04 - precision: 0.3715 - recall: 0.4629 - val_accuracy: 0.0000e+00 - val_loss: 0.0284 - val_precision: 0.5911 - val_recall: 0.1440 - learning_rate: 0.0010
Epoch 2/100
1857/1857 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.3633 - loss: 3.7228e-05 - precision: 0.9092 - recall: 0.9495 - val_accuracy: 0.0000e+00 - val_loss: 0.0157 - val_precision: 0.6533 - val_recall: 0.4812 - learning_rate: 0.0010
Epoch 3/100
1857/1857 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.3852 - loss: 1.8292e-05 - precision: 0.9446 - recall: 0.9792 - val_accuracy: 0.0000e+00 - val_loss: 0.0127 - val_precision: 0.7065 - val_recall: 0.5794 - learning_rate: 0.0010
Epoch 4/100
1857/1857 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.4001 - loss: 1.3349e-05 - precision: 0.9573 - recall: 0.9872

This run goes through the whole pipeline: data preprocessing, training, and prediction. Once the data has been fetched and has changed only minimally, we can train or retrain the model directly:

In [13]:
pipeline(
    isDataChanged=False,
    usePretrainedModel=False,
    description=item_description,
    selected_model="development",
)

INFO:data_discovery_ai.utils.preprocessor:Load from C:\Users\yhu12\OneDrive - University of Tasmania\IMOS\DataDiscovery\data-discovery-ai\data_discovery_ai\resources\keyword_sample.pkl
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\AppData\Local\Temp\tmp2vvpgtjw\keyword_label.pkl


Total samples: 2793
Dimension: 768
No. of labels: 456
X resampled set size: 2793
Y resampled set size: 2793


INFO:data_discovery_ai.utils.preprocessor:Total samples: 2793
INFO:data_discovery_ai.utils.preprocessor:Dimension: 768
INFO:data_discovery_ai.utils.preprocessor:No. of labels: 456
INFO:data_discovery_ai.utils.preprocessor:Train set size: 2240 (80.20%)
INFO:data_discovery_ai.utils.preprocessor:Test set size: 553 (19.80%)
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\OneDrive - University of Tasmania\IMOS\DataDiscovery\data-discovery-ai\data_discovery_ai\resources\keyword_label.pkl


Total samples: 74277
Dimension: 768
No. of labels: 456
X resampled set size: 74277
Y resampled set size: 74277


Epoch 1/100
1857/1857 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.1508 - loss: 4.3564e-04 - precision: 0.3593 - recall: 0.4486 - val_accuracy: 0.0000e+00 - val_loss: 0.0243 - val_precision: 0.5680 - val_recall: 0.1397 - learning_rate: 0.0010
Epoch 2/100
1857/1857 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.3670 - loss: 3.7261e-05 - precision: 0.9061 - recall: 0.9488 - val_accuracy: 0.0000e+00 - val_loss: 0.0189 - val_precision: 0.6976 - val_recall: 0.4343 - learning_rate: 0.0010
Epoch 3/100
1857/1857 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.3840 - loss: 1.9359e-05 - precision: 0.9447 - recall: 0.9779 - val_accuracy: 0.0242 - val_loss: 0.0162 - val_precision: 0.7495 - val_recall: 0.6477 - learning_rate: 0.0010
Epoch 4/100
1857/1857 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.3957 - loss: 1.2444e-05 - precision: 0.9615 - recall: 0.9877 - v



18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
{'precision': '0.9930', 'recall': '0.9856', 'f1': '0.9893', 'hammingloss': '0.0009', 'Jaccard Index': '0.9219', 'accuracy': '0.8770'}
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
[{'vocab_type': 'AODN Platform Vocabulary', 'value': 'research vessel', 'url': 'http://vocab.nerc.ac.uk/collection/L06/current/31'}, {'vocab_type': 'AODN Discovery Parameter Vocabulary', 'value': 'abundance of biota', 'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/488'}, {'vocab_type': 'AODN Discovery Parameter Vocabulary', 'value': 'biotic taxonomic identification', 'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/489'}]
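
As a quick sanity check on the metrics printed above, the F1 score is conventionally the harmonic mean of precision and recall. Assuming that is how the reported `f1` is computed (the notebook does not say which averaging is used), the three values are consistent with each other:

```python
# Values copied from the metrics dictionary printed above (they are strings there).
metrics = {"precision": "0.9930", "recall": "0.9856", "f1": "0.9893"}

precision = float(metrics["precision"])
recall = float(metrics["recall"])

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # matches the reported f1 when rounded to four decimals
```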


Finally, since we now have two pretrained models named `experimental` and `development`, we can use them directly for prediction.

In [14]:
pipeline(
    isDataChanged=False,
    usePretrainedModel=True,
    description=item_description,
    selected_model="development",
)

INFO:data_discovery_ai.utils.preprocessor:Load from C:\Users\yhu12\OneDrive - University of Tasmania\IMOS\DataDiscovery\data-discovery-ai\data_discovery_ai\resources\keyword_sample.pkl
INFO:data_discovery_ai.utils.preprocessor:Load from C:\Users\yhu12\OneDrive - University of Tasmania\IMOS\DataDiscovery\data-discovery-ai\data_discovery_ai\resources\keyword_label.pkl


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
[{'vocab_type': 'AODN Platform Vocabulary', 'value': 'research vessel', 'url': 'http://vocab.nerc.ac.uk/collection/L06/current/31'}, {'vocab_type': 'AODN Discovery Parameter Vocabulary', 'value': 'abundance of biota', 'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/488'}, {'vocab_type': 'AODN Discovery Parameter Vocabulary', 'value': 'biotic taxonomic identification', 'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/489'}]
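
The prediction is a list of dictionaries, each with a `vocab_type`, a `value`, and a `url`. A minimal sketch of grouping the suggested keywords by vocabulary is shown below. The list is copied from the output above; whether `pipeline()` returns this list or only prints it is not shown in this notebook, so treat the `predictions` variable as an assumption.

```python
from collections import defaultdict

# Copied from the prediction printed above. Assumption: the same structure is
# available as a Python object; check pipeline.py before relying on this.
predictions = [
    {"vocab_type": "AODN Platform Vocabulary", "value": "research vessel",
     "url": "http://vocab.nerc.ac.uk/collection/L06/current/31"},
    {"vocab_type": "AODN Discovery Parameter Vocabulary", "value": "abundance of biota",
     "url": "http://vocab.aodn.org.au/def/discovery_parameter/entity/488"},
    {"vocab_type": "AODN Discovery Parameter Vocabulary", "value": "biotic taxonomic identification",
     "url": "http://vocab.aodn.org.au/def/discovery_parameter/entity/489"},
]

# Group the keyword values under their vocabulary name.
by_vocab = defaultdict(list)
for item in predictions:
    by_vocab[item["vocab_type"]].append(item["value"])

for vocab_type, values in by_vocab.items():
    print(f"{vocab_type}: {', '.join(values)}")
```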
