# ML model for Keyword Classification - Non-tech Notebook
This notebook shows non-technical audiences how to use the trained ML model. For how to preprocess raw data, and for the details of training a keyword classification model, please refer to [keywordClassificationTechNotebook.ipynb](KeywordClassificationTechNotebook.ipynb). Please note that the methods in [keywordClassificationTechNotebook.ipynb](KeywordClassificationTechNotebook.ipynb) are broken down step by step. To use the code as a single pipeline instead, we use the functions provided in `keyword_classifier_pipeline.py`.

First, let's define an unlabelled item with its title and abstract. We will use it as an example to show how the pipeline classifies keywords under different parameter settings.

In [1]:
item_title = "Corals and coral communities of Lord Howe Island, Australia"
item_abstract = """Ecological and taxonomic surveys of hermatypic scleractinian corals were carried out at approximately 100 sites around Lord Howe Island. Sixty-six of these sites were located on reefs in the lagoon, which extends for two-thirds of the length of the island on the western side. Each survey site consisted of a section of reef surface, which appeared to be topographically and faunistically homogeneous. The dimensions of the sites surveyed were generally of the order of 20m by 20m. Where possible, sites were arranged contiguously along a band up the reef slope and across the flat. The cover of each species was graded on a five-point scale of percentage relative cover. Other site attributes recorded were depth (minimum and maximum corrected to datum), slope (estimated), substrate type, total estimated cover of soft coral and algae (macroscopic and encrusting coralline). Coral data from the lagoon and its reef (66 sites) were used to define a small number of site groups which characterize most of this area.Throughout the survey, corals of taxonomic interest or difficulty were collected, and an extensive photographic record was made to augment survey data. A collection of the full range of form of all coral species was made during the survey and an identified reference series was deposited in the Australian Museum.In addition, less detailed descriptive data pertaining to coral communities and topography were recorded on 12 reconnaissance transects, the authors recording changes seen while being towed behind a boat.
 The purpose of this study was to describe the corals of Lord Howe Island (the southernmost Indo-Pacific reef) at species and community level using methods that would allow differentiation of community types and allow comparisons with coral communities in other geographic locations."""
item_description = f"{item_title}: {item_abstract}"
item_description

'Corals and coral communities of Lord Howe Island, Australia: Ecological and taxonomic surveys of hermatypic scleractinian corals were carried out at approximately 100 sites around Lord Howe Island. Sixty-six of these sites were located on reefs in the lagoon, which extends for two-thirds of the length of the island on the western side. Each survey site consisted of a section of reef surface, which appeared to be topographically and faunistically homogeneous. The dimensions of the sites surveyed were generally of the order of 20m by 20m. Where possible, sites were arranged contiguously along a band up the reef slope and across the flat. The cover of each species was graded on a five-point scale of percentage relative cover. Other site attributes recorded were depth (minimum and maximum corrected to datum), slope (estimated), substrate type, total estimated cover of soft coral and algae (macroscopic and encrusting coralline). Coral data from the lagoon and its reef (66 sites) were used 

In [2]:
# add module path for notebook to use
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(project_root)

# import customised modules
from data_discovery_ai.pipeline.pipeline import KeywordClassifierPipeline



When initialising a `KeywordClassifierPipeline` object, three parameters are essential:
1. `isDataChanged`: bool. Whether to run the data preprocessing module.
2. `usePretrainedModel`: bool. Whether to use a pretrained model instead of training one.
3. `model_name`: str. The name of a trained model. The options are strictly limited to the following: `development`, `experimental`, `staging`, `production`, `benchmark`. If `usePretrainedModel` is set to `True`, the model named by `model_name` must already be trained. This implies that a corresponding `<model_name>.keras` file must exist in the `resources` folder.

| Option | Purpose | Typical Use |
| ---- | ---- | ---- |
| `development` | Dedicated to active model development, testing, and iteration. | Building and refining new model versions, features, or datasets. |
| `experimental` | Supports exploratory work for new techniques or fine-tuning. | Experimenting with new architectures, features, or hyperparameter tuning. |
| `staging` | Prepares the model for production with real-use evaluations. | Conducting final testing in a production-like environment to verify stability and performance. |
| `production` | Deployment environment for live model usage in real-world scenarios. | Running and monitoring models in active use by API. |
| `benchmark` | Baseline model used to assess improvements or changes. | Comparing performance metrics against new models. |
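
The constraints on `model_name` can be checked before a pipeline is built. The following is a minimal sketch, not part of the project code: `check_model_choice` and its `resources_dir` argument are hypothetical names, and the `<model_name>.keras` file pattern is inferred from the description above.

```python
from pathlib import Path

# The five model names listed in the table above.
VALID_MODEL_NAMES = {"development", "experimental", "staging", "production", "benchmark"}


def check_model_choice(model_name: str, use_pretrained: bool, resources_dir: Path) -> Path:
    """Validate a model name and, if a pretrained model is requested, its file.

    `resources_dir` is a hypothetical argument: point it at the folder that
    holds the `.keras` files in your own checkout.
    """
    if model_name not in VALID_MODEL_NAMES:
        raise ValueError(
            f"Unknown model name {model_name!r}; expected one of {sorted(VALID_MODEL_NAMES)}"
        )
    # Assumed file naming: one `<model_name>.keras` file per trained model.
    model_file = resources_dir / f"{model_name}.keras"
    if use_pretrained and not model_file.is_file():
        raise FileNotFoundError(f"{model_file} not found; train this model first")
    return model_file
```

Calling `check_model_choice("development", True, Path("resources"))` either returns the expected file path or fails early with a clear message, before any pipeline work starts.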

Once a `KeywordClassifierPipeline` object is initialised, we can use its `pipeline(description)` function to make a prediction. The parameter `description` is a string composed of the title and abstract of a metadata record. It should be constructed as `item_title: item_abstract` to match the format of the training set.
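
As a small illustration of this format, a helper can build the string from its two parts. `build_description` is a hypothetical name, and stripping surrounding whitespace is an assumption of this sketch; the notebook itself simply uses an f-string.

```python
def build_description(title: str, abstract: str) -> str:
    """Join a title and an abstract as `title: abstract`.

    Leading and trailing whitespace is stripped here as a convenience;
    that step is an assumption, not something the pipeline requires.
    """
    return f"{title.strip()}: {abstract.strip()}"


# Example with a shortened, made-up abstract.
print(build_description("Corals of Lord Howe Island", "Surveys of corals at 100 sites. "))
```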




With `description` and `model_name` fixed, there are three combinations of `isDataChanged` and `usePretrainedModel` that cover three typical scenarios:

1. `isDataChanged=True` and `usePretrainedModel=False`

    This setup is commonly used when a new environment is established, such as during the initial deployment of an ML system. It is also appropriate when the raw data has undergone significant changes, requiring the pipeline to be reapplied. However, use caution when applying this pipeline for end-users, as it can be time-consuming.

2. `isDataChanged=False` and `usePretrainedModel=False`

    This setup is suitable when we adjust the model through hyperparameter tuning without changing the raw data, or when the raw data has not undergone significant changes. It should only be used by developers conducting experiments.

3. `isDataChanged=False` and `usePretrainedModel=True`

    This setup uses a pretrained model for prediction. It is designed to provide a keyword classification suggestion service for end-users, enabling efficient and reliable functionality.
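
The mapping from the two flags to what the pipeline does can be summarised as a lookup table. This is only a sketch that follows the three runs in this notebook; `describe_scenario` is a hypothetical helper, and the fourth combination (`True`, `True`) is rejected here simply because the notebook does not cover it.

```python
# Keys are (isDataChanged, usePretrainedModel), matching the three runs in this notebook.
SCENARIOS = {
    (True, False): "preprocess data, train the model, then predict",
    (False, False): "reuse preprocessed data, train the model, then predict",
    (False, True): "load the pretrained model and predict",
}


def describe_scenario(is_data_changed: bool, use_pretrained_model: bool) -> str:
    """Return a one-line summary of what a flag combination does."""
    key = (is_data_changed, use_pretrained_model)
    if key not in SCENARIOS:
        raise ValueError(f"Combination {key} is not covered in this notebook")
    return SCENARIOS[key]


print(describe_scenario(False, True))
```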

In [8]:
scenario1_pipeline = KeywordClassifierPipeline(
    isDataChanged=True,
    usePretrainedModel=False,
    model_name="experimental",
)
scenario1_pipeline.pipeline(description=item_description)

searching elasticsearch: 100%|██████████| 129/129 [11:27<00:00,  5.33s/it]
100%|██████████| 1860/1860 [50:43<00:00,  1.64s/it] 
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\AppData\Local\Temp\tmppgnleynw\keyword_sample.pkl
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\OneDrive - University of Tasmania\IMOS\DataDiscovery\data-discovery-ai\data_discovery_ai\resources\KeywordClassifier\keyword_sample.pkl
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\AppData\Local\Temp\tmppgnleynw\keyword_label.pkl


Total samples: 3468
Dimension: 768
No. of labels: 526
X resampled set size: 3468
Y resampled set size: 3468


INFO:data_discovery_ai.utils.preprocessor:Total samples: 3468
INFO:data_discovery_ai.utils.preprocessor:Dimension: 768
INFO:data_discovery_ai.utils.preprocessor:No. of labels: 526
INFO:data_discovery_ai.utils.preprocessor:Train set size: 2767 (79.79%)
INFO:data_discovery_ai.utils.preprocessor:Test set size: 701 (20.21%)
INFO:data_discovery_ai.utils.preprocessor:Saved to C:\Users\yhu12\OneDrive - University of Tasmania\IMOS\DataDiscovery\data-discovery-ai\data_discovery_ai\resources\KeywordClassifier\keyword_label.pkl


Total samples: 151525
Dimension: 768
No. of labels: 526
X resampled set size: 151525
Y resampled set size: 151525


Epoch 1/100
3789/3789 ━━━━━━━━━━━━━━━━━━━━ 16s 4ms/step - accuracy: 0.1749 - loss: 2.6597e-04 - precision: 0.4649 - recall: 0.5400 - val_accuracy: 0.0163 - val_loss: 0.0210 - val_precision: 0.6305 - val_recall: 0.3027 - learning_rate: 0.0010
Epoch 2/100
3789/3789 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.3290 - loss: 2.6967e-05 - precision: 0.8999 - recall: 0.9569 - val_accuracy: 0.0000e+00 - val_loss: 0.0163 - val_precision: 0.7148 - val_recall: 0.5286 - learning_rate: 0.0010
Epoch 3/100
3789/3789 ━━━━━━━━━━━━━━━━━━━━ 14s 4ms/step - accuracy: 0.3380 - loss: 1.7389e-05 - precision: 0.9319 - recall: 0.9770 - val_accuracy: 0.0163 - val_loss: 0.0142 - val_precision: 0.7660 - val_recall: 0.6079 - learning_rate: 0.0010
Epoch 4/100
3789/3789 ━━━━━━━━━━━━━━━━━━━━ 19s 3ms/step - accuracy: 0.3544 - loss: 1.1600e-05 - precision: 0.9507 - recall: 0.9856 - v

This run goes through the whole pipeline: data preprocessing, training, and prediction. Once the data has been fetched and has changed only minimally, we can train or retrain the model directly:

In [3]:
scenario2_pipeline = KeywordClassifierPipeline(
    isDataChanged=False,
    usePretrainedModel=False,
    model_name="development",
)
scenario2_pipeline.pipeline(description=item_description)

Total samples: 3468
Dimension: 768
No. of labels: 526
X resampled set size: 3468
Y resampled set size: 3468
Total samples: 150150
Dimension: 768
No. of labels: 526
X resampled set size: 150150
Y resampled set size: 150150


Epoch 1/100
3754/3754 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - accuracy: 0.1682 - loss: 3.4296e-04 - precision: 0.5254 - recall: 0.5778 - val_accuracy: 0.0000e+00 - val_loss: 0.0192 - val_precision: 0.6009 - val_recall: 0.2793 - learning_rate: 0.0010
Epoch 2/100
3754/3754 ━━━━━━━━━━━━━━━━━━━━ 15s 4ms/step - accuracy: 0.3233 - loss: 3.4273e-05 - precision: 0.9205 - recall: 0.9653 - val_accuracy: 0.0000e+00 - val_loss: 0.0123 - val_precision: 0.7212 - val_recall: 0.4675 - learning_rate: 0.0010
Epoch 3/100
3754/3754 ━━━━━━━━━━━━━━━━━━━━ 14s 4ms/step - accuracy: 0.3396 - loss: 2.0940e-05 - precision: 0.9457 - recall: 0.9811 - val_accuracy: 0.0015 - val_loss: 0.0109 - val_precision: 0.7827 - val_recall: 0.5360 - learning_rate: 0.0010
Epoch 4/100
3754/3754 ━━━━━━━━━━━━━━━━━━━━ 14s 4ms/step - accuracy: 0.3529 - loss: 1.4593e-05 - precision: 0.9597 - recall: 0.9877

Finally, since we already have two pretrained models named `experimental` and `development`, we can use them directly for prediction.

In [4]:
scenario3_pipeline = KeywordClassifierPipeline(
    isDataChanged=False,
    usePretrainedModel=True,
    model_name="development",
)
scenario3_pipeline.pipeline(description=item_description)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step
[{'vocab_type': 'AODN Discovery Parameter Vocabulary', 'value': 'abundance of biota', 'url': 'http://vocab.aodn.org.au/def/discovery_parameter/entity/488'}, {'vocab_type': 'AODN Discovery Parameter Vocabulary', 'value': 'biotic taxonomic identification', 'url': '16a5b7ad-7c1e-4ede-b8d5-c5487b7e57d4:concept:489'}]
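
The result above is a list of dictionaries, each with `vocab_type`, `value`, and `url` keys. A minimal sketch of how the suggested keywords might be read out, assuming `pipeline()` returns this list rather than only printing it; `keywords_by_vocab` is a hypothetical helper, and the sample data is copied from the output above.

```python
from collections import defaultdict

# Sample copied from the prediction output shown above.
sample_predictions = [
    {
        "vocab_type": "AODN Discovery Parameter Vocabulary",
        "value": "abundance of biota",
        "url": "http://vocab.aodn.org.au/def/discovery_parameter/entity/488",
    },
    {
        "vocab_type": "AODN Discovery Parameter Vocabulary",
        "value": "biotic taxonomic identification",
        "url": "16a5b7ad-7c1e-4ede-b8d5-c5487b7e57d4:concept:489",
    },
]


def keywords_by_vocab(predictions):
    """Group predicted keyword values by the vocabulary they belong to."""
    grouped = defaultdict(list)
    for prediction in predictions:
        grouped[prediction["vocab_type"]].append(prediction["value"])
    return dict(grouped)


for vocab, values in keywords_by_vocab(sample_predictions).items():
    print(f"{vocab}: {', '.join(values)}")
```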
