## Evaluate Models

This notebook evaluates the `base_model` and the `modified_model` on the test set. This is the final evaluation on unseen data, mirroring what would happen in production.
The published versions of the models are used to ensure that the metrics are computed with the same wrapper that would be used in production. This avoids evaluating the data with an experimentation pipeline that could differ from the production code.

### Tasks:
 - [ ] Load test dataset.
 - [ ] Load models:
     - [ ] Base model.
     - [ ] Modified model (`clothing`).
 - [ ] Evaluate models on the test set.
     - [ ] Generate confusion matrix.
     - [ ] Check metrics on clothing category.

## Libraries and Configurations

In [1]:
import pandas as pd

import mlflow
from mlflow.tracking import MlflowClient
from sklearn.preprocessing import LabelEncoder  # public import path instead of the private `_label` module

import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import HTML, display  # public module; `display` is used in the cells below

from application.code.core.configurations import configs
from application.code.adapters.storage import read_dataset
from application.code.core.model_evaluation import (compute_multiclass_classification_metrics,
                                                    generate_feature_importance_report,
                                                    generate_confusion_matrix,
                                                    plot_folds_metrics)

from application.code.adapters.mlflow_adapter import (get_mlflow_artifact_content,
                                                      get_published_model,
                                                      extract_internal_model)
from application.code.core.feature_engineering import (format_string_columns, 
                                                       standardize_labels)

sns.set_style("whitegrid")

## MLflow Settings

In [2]:
mlflow.set_tracking_uri(configs.mlflow.uri)
mlflow.set_experiment(configs.mlflow.experiment_name);

## Load Dataset

The `test` dataset is loaded to perform the final evaluation.

In [3]:
df = read_dataset(base_path=configs.datasets.base_path, stage='raw', file_name='test')

display(HTML('<h4>Dataset</h4>'))
print(f'Records: {len(df)}')

df = df.drop_duplicates()
display(HTML('<h4>Deduplicated Dataset</h4>'))
print(f'Records: {len(df)}')

Records: 1011


Records: 1004


To create the model and perform experiments, only the `training` dataset is used. During experimentation, evaluation is performed on time-oriented `validation` datasets built with the same methodology used to create the `test` dataset.

Three pairs of `training` and `validation` sets are created, each representing a fold. This yields an efficacy measurement together with a notion of its variance.
Using the `validation` sets avoids evaluating on the `test` set several times. Ideally, the `test` set is used only once, for the final assessment.
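
The time-oriented folds described above can be sketched as follows. This is only an illustration under assumptions: the notebook does not state how the splits are actually built, so the expanding-window scheme, the function name `time_oriented_folds`, and the `validation_size` parameter are all hypothetical.

```python
from datetime import date, timedelta


def time_oriented_folds(timestamps, n_folds=3, validation_size=4):
    """Return (training_positions, validation_positions) pairs.

    Hypothetical expanding-window scheme: every validation block is a
    contiguous run of records, and its training set contains only the
    records that come before that block in time.
    """
    # Positions of the records sorted from oldest to newest.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    if validation_size <= 0 or validation_size * n_folds >= len(order):
        raise ValueError("Not enough records for the requested folds.")

    folds = []
    for fold in range(n_folds):
        # The first fold ends at the oldest cutoff, the last fold at the newest record.
        validation_end = len(order) - (n_folds - 1 - fold) * validation_size
        validation_start = validation_end - validation_size
        folds.append((order[:validation_start], order[validation_start:validation_end]))
    return folds


# Toy data: 20 daily records (not taken from the real dataset).
toy_timestamps = [date(2021, 1, 1) + timedelta(days=d) for d in range(20)]
folds = time_oriented_folds(toy_timestamps, n_folds=3, validation_size=4)
fold_sizes = [(len(training), len(validation)) for training, validation in folds]
print(fold_sizes)
```

With a DataFrame, the returned positions could be applied with `df.iloc[training_positions]` and `df.iloc[validation_positions]`. Because each training set ends strictly before its validation block, no fold trains on records newer than the ones it is evaluated on.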

## Load Models

The models are retrieved from the MLflow server so that they are used as they would be in production.

In [4]:
base_model = get_published_model(model_name=configs.mlflow.base_model_name,
                                 stage="Staging")

As MLflow only exposes the `predict` function, the internal model must be extracted to access all the developed functions. These functions are needed for some low-level operations used to evaluate the model.

In [5]:
internal_model = extract_internal_model(base_model)
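
The project helpers `get_published_model` and `extract_internal_model` are not shown in this notebook. As a rough sketch of what such helpers might do, the snippet below uses MLflow's generic pyfunc flavor. The function names here are hypothetical, `"base_model"` is a placeholder name, and the unwrap step assumes the model was logged as a custom `mlflow.pyfunc.PythonModel` with an MLflow version that provides `PyFuncModel.unwrap_python_model()`.

```python
def registry_model_uri(model_name: str, stage: str) -> str:
    """Build the `models:/<name>/<stage>` URI used by the MLflow Model Registry."""
    return f"models:/{model_name}/{stage}"


def load_published_model(model_name: str, stage: str = "Staging"):
    """Load a registered model through the generic pyfunc flavor (hypothetical helper)."""
    import mlflow.pyfunc  # imported lazily so the URI helper works without MLflow installed

    return mlflow.pyfunc.load_model(registry_model_uri(model_name, stage))


def unwrap_internal_model(pyfunc_model):
    """Return the user-defined PythonModel behind the pyfunc wrapper.

    Assumption: the wrapper exposes `unwrap_python_model()`, which is only
    available for custom pyfunc models in recent MLflow releases.
    """
    return pyfunc_model.unwrap_python_model()


# Placeholder model name, not the value stored in `configs.mlflow.base_model_name`.
example_uri = registry_model_uri("base_model", "Staging")
print(example_uri)
```

Loading through the registry URI rather than a run artifact path is what keeps the evaluation tied to the published version of the model.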

## Evaluate Model

Compute predictions and encode them so that they can be compared with the ground-truth labels when computing metrics.

In [6]:
predictions = base_model.predict(df)

print(f'Predictions: {len(predictions)}')
print(f'Sample: {", ".join(predictions[:5])}')

encoded_predictions = internal_model.encode_labels(predictions)

print(f'\nEncoded Predictions: {len(encoded_predictions)}')
print(f'Sample: {", ".join(map(str, encoded_predictions[:5]))}')

Predictions: 1004
Sample: compra online, serviço, artigos eletro, artigos eletro, serviço

Encoded Predictions: 1004
Sample: 5, 16, 2, 2, 16


Preprocess and encode the raw labels so that they can be compared with the model-generated labels.

In [7]:
labels = df['grupo_estabelecimento'].to_list()

print(f'Labels: {len(labels)}')
print(f'Sample: {", ".join(map(str, labels[:5]))}')

encoded_labels = internal_model.encode_labels(labels)

print(f'\nEncoded Labels: {len(encoded_labels)}')
print(f'Sample: {", ".join(map(str, encoded_labels[:5]))}')

Labels: 1004
Sample: ARTIGOS ELETRO, M.O.T.O., M.O.T.O., ARTIGOS ELETRO, SERVIO

Encoded Labels: 1004
Sample: 2, 5, 5, 2, 16
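
The outputs above show raw labels such as `ARTIGOS ELETRO` and predictions such as `artigos eletro` receiving the same code, which suggests that `encode_labels` standardizes text before encoding. The sketch below illustrates that idea only. `standardize_label`, `SimpleLabelEncoder`, and the alias mapping are hypothetical and are not the project's `standardize_labels` or `encode_labels` implementation. The `M.O.T.O.` to `compra online` alias is inferred from both receiving code 5 in the outputs.

```python
def standardize_label(label, aliases=None):
    """Lower-case and trim a raw label, then apply an optional alias mapping."""
    cleaned = str(label).strip().lower()
    return (aliases or {}).get(cleaned, cleaned)


class SimpleLabelEncoder:
    """Map standardized labels to integer codes in sorted order (illustrative only)."""

    def fit(self, labels, aliases=None):
        self.aliases = aliases or {}
        self.classes_ = sorted({standardize_label(label, self.aliases) for label in labels})
        self._codes = {name: code for code, name in enumerate(self.classes_)}
        return self

    def encode(self, labels):
        # Raises KeyError for a label that was not seen during fit.
        return [self._codes[standardize_label(label, self.aliases)] for label in labels]


# Alias inferred from the notebook outputs; treat it as an assumption.
aliases = {"m.o.t.o.": "compra online"}
encoder = SimpleLabelEncoder().fit(["artigos eletro", "compra online", "serviço"], aliases)

encoded_raw = encoder.encode(["ARTIGOS ELETRO", "M.O.T.O.", " Serviço "])
encoded_model_output = encoder.encode(["artigos eletro", "compra online", "serviço"])
print(encoded_raw, encoded_model_output)
```

Encoding both the ground truth and the predictions with the same encoder matters: the integer codes are only comparable when both sides share one class-to-code mapping.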


Compute the metrics from the encoded labels and predictions.

In [8]:
metrics = compute_multiclass_classification_metrics(encoded_labels, encoded_predictions)

In [9]:
metrics_df = (
    pd
    .DataFrame([metrics])
    .T
    .reset_index()
    .set_axis(['metric', 'value'], axis=1)
)
metrics_df

| | metric | value |
|---|---|---|
| 0 | macro_precision | 0.204147 |
| 1 | macro_recall | 0.173658 |
| 2 | macro_f1 | 0.171323 |
| 3 | micro_precision | 0.39741 |
| 4 | micro_recall | 0.39741 |
| 5 | micro_f1 | 0.39741 |
| 6 | weighted_precision | 0.379709 |
| 7 | weighted_recall | 0.39741 |
| 8 | weighted_f1 | 0.381886 |
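
The three micro-averaged values above are identical, and the macro scores are far below the weighted ones. A dependency-free sketch of the averaging schemes helps explain why. It is not the project's `compute_multiclass_classification_metrics`, whose implementation is not shown here, and the toy labels are made up for illustration.

```python
from collections import Counter


def averaged_f1_scores(y_true, y_pred):
    """Compute macro, weighted, and micro F1 for single-label multiclass data."""
    classes = sorted(set(y_true) | set(y_pred))
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # how often each class was predicted
    support = Counter(y_true)    # how often each class really occurs

    per_class_f1 = {}
    for c in classes:
        precision = true_positives[c] / predicted[c] if predicted[c] else 0.0
        recall = true_positives[c] / support[c] if support[c] else 0.0
        denominator = precision + recall
        per_class_f1[c] = 2 * precision * recall / denominator if denominator else 0.0

    total = len(y_true)
    return {
        # Every class counts equally, so rare classes weigh as much as frequent ones.
        "macro_f1": sum(per_class_f1.values()) / len(classes),
        # Every class counts in proportion to its support.
        "weighted_f1": sum(per_class_f1[c] * support[c] for c in classes) / total,
        # With one label per record, micro precision, recall, and F1 all equal accuracy.
        "micro_f1": sum(true_positives.values()) / total,
    }


# Toy labels, not taken from the test set.
toy_true = [0, 0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 0, 1, 1, 2, 2]
scores = averaged_f1_scores(toy_true, toy_pred)
print(scores)
```

Read this way, the equal micro values in the table are simply the accuracy (about 0.397), while a macro F1 of about 0.171 against a weighted F1 of about 0.382 points to weaker performance on the less frequent classes. In practice, `sklearn.metrics.precision_recall_fscore_support` with its `average` argument covers the same averaging schemes.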


## Concluding Remarks
