## Evaluate Models

This notebook evaluates the `base_model` and the `modified_model` on the test set. This is the final evaluation on unseen data, as would happen in production.
The published versions of the models are used to ensure the metrics are computed with the same wrapper that would be used in production. This avoids evaluating the data with an experimentation pipeline that could differ from the production code.

### Tasks:
 - [X] Load test dataset.
 - [X] Load models:
     - [X] Base model.
     - [X] Modified model (`clothing`).
 - [X] Evaluate models on the test set.
     - [X] Generate confusion matrix.
     - [X] Check metrics on clothing category.
 - [X] Update MLflow with model metrics.

## Libraries and Configurations

In [1]:
from operator import itemgetter

import pandas as pd

import mlflow
from mlflow.tracking import MlflowClient
from sklearn.preprocessing import LabelEncoder
from IPython.display import HTML, display

from application.code.core.configurations import configs
from application.code.adapters.storage import read_dataset
from application.code.core.model_evaluation import (compute_multiclass_classification_metrics,
                                                    generate_classification_report)

from application.code.adapters.mlflow_adapter import (get_mlflow_artifact_content,
                                                      get_published_model,
                                                      extract_internal_model,
                                                      set_active_run,
                                                      end_run,
                                                      log_dataframe_artifact,
                                                      log_metrics
                                                     )
from application.code.core.feature_engineering import standardize_label

## MLflow Settings

In [2]:
mlflow.set_tracking_uri(configs.mlflow.uri)
mlflow.set_experiment(configs.mlflow.experiment_name);

## Load Dataset

The `test` dataset is loaded to perform the final evaluation.

In [3]:
df = read_dataset(base_path=configs.datasets.base_path, stage='raw', file_name='test')

display(HTML('<h4>Dataset</h4>'))
print(f'Records: {len(df)}')

df = df.drop_duplicates()
display(HTML('<h4>Deduplicated Dataset</h4>'))
print(f'Records: {len(df)}')

Records: 1011


Records: 1004


To create the model and perform experiments, only the `training` dataset is used. The evaluation is performed by creating time-oriented `validation` datasets with the same methodology used to create the `test` dataset.

Three pairs of `training` and `validation` sets are created, each representing a fold. In the end, this gives an efficacy measurement together with a notion of its variance.
It is important to use the `validation` sets to avoid using the `test` set several times. Ideally, the `test` set should be used only once, for the final assessment.
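
The fold construction described above can be illustrated with a minimal sketch. It assumes an expanding time window and a hypothetical `date` field on each record; the methodology actually used to build the `test` set is not shown in this notebook, so the function below is only one plausible reading of it.

```python
from datetime import date


def time_ordered_folds(records, n_folds=3, date_key="date"):
    """Yield (training, validation) pairs using an expanding time window.

    The records are sorted by date and cut into n_folds + 1 contiguous
    blocks. Fold i trains on the first i blocks and validates on the next
    block, so validation data is always newer than training data.
    """
    ordered = sorted(records, key=lambda record: record[date_key])
    block_size = len(ordered) // (n_folds + 1)
    if block_size == 0:
        raise ValueError("Not enough records for the requested number of folds.")

    for fold in range(1, n_folds + 1):
        train_end = fold * block_size
        # The last fold absorbs the remainder so no record is dropped.
        valid_end = len(ordered) if fold == n_folds else train_end + block_size
        yield ordered[:train_end], ordered[train_end:valid_end]


# Hypothetical data: one record per day over 12 days.
records = [{"date": date(2021, 1, day), "value": day} for day in range(1, 13)]
folds = list(time_ordered_folds(records, n_folds=3))

for training, validation in folds:
    # Every training record must be older than every validation record.
    print(len(training), len(validation),
          training[-1]["date"] < validation[0]["date"])
```

Because each fold validates on a later period than it trains on, the spread of the metric across the three folds gives a rough idea of how stable the model is over time.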

## Load Models

The models are retrieved from the MLflow server so that they are used exactly as they would be in production.

In [4]:
base_model = get_published_model(model_name=configs.mlflow.base_model_name,
                                 stage="Staging")

modified_model = get_published_model(model_name=configs.mlflow.modified_model_name,
                                     stage="Staging")

As MLflow only exposes the `predict` function, the internal model must be extracted to give access to all of the developed functions. These functions are needed for some low-level operations during the evaluation.

To keep these models together with the additional information that will be computed, a nested dictionary is created.

In [5]:
models = dict()

models['base_model'] = {'model': extract_internal_model(base_model),
                        'wrapped_model': base_model}
models['modified_model'] = {'model': extract_internal_model(modified_model),
                            'wrapped_model': modified_model}

## Evaluate Models

Compute predictions and encode labels to be able to compare predictions with ground truth labels and compute metrics.

In [6]:
for name, model_registry in models.items():
    display(HTML(f'<h3>{name}</h3>'))
    model = model_registry['model']
   
    predictions = model.predict(df)
    print(f' - Predictions: {len(predictions)}')
    print(f' - Sample: {", ".join(predictions[:5])}')

    encoded_predictions = model.encode_labels(predictions)
    print(f'\n - Encoded Predictions: {len(encoded_predictions)}')
    print(f' - Sample: {", ".join(map(str, encoded_predictions[:5]))}')
    
    models[name]['predictions'] = predictions
    models[name]['encoded_predictions'] = encoded_predictions    

 - Predictions: 1004
 - Sample: compra online, serviço, artigos eletro, artigos eletro, serviço

 - Encoded Predictions: 1004
 - Sample: 5, 16, 2, 2, 16


 - Predictions: 1004
 - Sample: compra online, serviço, artigos eletro, artigos eletro, serviço

 - Encoded Predictions: 1004
 - Sample: 5, 16, 2, 2, 16


Preprocess and encode raw labels to be able to compare with the model generated labels.

In [7]:
for name, model_registry in models.items():
    display(HTML(f'<h3>{name}</h3>'))
    model = model_registry['model']

    raw_labels = df['grupo_estabelecimento'].tolist()
    labels = [standardize_label(label) for label in raw_labels]

    print(f'Labels: {len(labels)}')
    print(f'Sample: {", ".join(map(str, labels[:5]))}')

    encoded_labels = model.encode_labels(labels)

    print(f'\nEncoded Labels: {len(encoded_labels)}')
    print(f'Sample: {", ".join(map(str, encoded_labels[:5]))}')
    
    models[name]['labels'] = labels
    models[name]['encoded_labels'] = encoded_labels
    models[name]['classes'] =  model.label_encoder.classes_    

Labels: 1004
Sample: artigos eletro, compra online, compra online, artigos eletro, serviço

Encoded Labels: 1004
Sample: 2, 5, 5, 2, 16


Labels: 1004
Sample: artigos eletro, compra online, compra online, artigos eletro, serviço

Encoded Labels: 1004
Sample: 2, 5, 5, 2, 16
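
Ground-truth labels and predictions are only comparable when both are encoded with the same class-to-integer mapping, which is why `model.encode_labels` is applied to both. The sketch below illustrates the idea in plain Python. The helper names and the class list are hypothetical, and the sorted ordering mirrors how scikit-learn's `LabelEncoder` assigns integers; the project's own encoder may differ in detail.

```python
def fit_label_index(classes):
    """Map each distinct class name to an integer, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(classes)))}


def encode_labels(labels, label_index):
    """Encode labels with a shared index; unknown labels raise KeyError."""
    return [label_index[label] for label in labels]


# Hypothetical class list; the real one comes from the trained model.
known_classes = ["serviço", "artigos eletro", "compra online", "varejo"]
label_index = fit_label_index(known_classes)

ground_truth = ["artigos eletro", "compra online", "serviço"]
predictions = ["compra online", "serviço", "serviço"]

# Both lists go through the same mapping, so equal integers mean equal classes.
encoded_truth = encode_labels(ground_truth, label_index)
encoded_predictions = encode_labels(predictions, label_index)
print(encoded_truth, encoded_predictions)
```

Raising on unknown labels is a deliberate choice in this sketch: a label that the model never saw during training would otherwise be silently mapped to a wrong class.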


Compute the overall and per-class metrics for each model.

In [8]:
for name, model_registry in models.items():
    formatted_name = name.replace("_", " ").capitalize()
    display(HTML(f'<h3>{formatted_name}</h3>'))
    model = model_registry['model']

    encoded_labels, encoded_predictions, classes = itemgetter('encoded_labels', 'encoded_predictions', 'classes')(model_registry)
    metrics = compute_multiclass_classification_metrics(encoded_labels, encoded_predictions)
    models[name]['metrics'] = metrics

    display(HTML('<h4>Metrics</h4>'))
    metrics_df = (
        pd
        .DataFrame([metrics])
        .T
        .reset_index()
        .set_axis(['metric', 'value'], axis=1)
    )
    display(metrics_df)    

    display(HTML('<h4>Class Metrics</h4>'))
    classification_report_df = generate_classification_report(encoded_labels, encoded_predictions, classes)
    formatted_classification_report_df = (
        classification_report_df
        .astype({'support': int})
        .sort_values(by='support', ascending=False)
        .reset_index()
        .rename(columns={'index': 'class'})
        .style
        .applymap(lambda value: 'background-color:#7fb3d5' if value  > .5 else '',
                  subset=['f1-score', 'precision', 'recall'])
        .applymap(lambda value: 'background-color:#d2b4de' if value  == 'vestuário' else '',
                 subset=['class'])
        .applymap(lambda value: 'font-weight:bold', subset=['class'])
    )
    models[name]['classification_report'] = formatted_classification_report_df
    display(formatted_classification_report_df)
    print('\n')

Base model: metrics

| metric | value |
|---|---|
| macro_precision | 0.204147 |
| macro_recall | 0.173658 |
| macro_f1 | 0.171323 |
| micro_precision | 0.39741 |
| micro_recall | 0.39741 |
| micro_f1 | 0.39741 |
| weighted_precision | 0.379709 |
| weighted_recall | 0.39741 |
| weighted_f1 | 0.381886 |

Base model: class metrics

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| serviço | 0.589235 | 0.64 | 0.613569 | 325 |
| restaurante | 0.364486 | 0.478528 | 0.413793 | 163 |
| varejo | 0.317829 | 0.273333 | 0.293907 | 150 |
| supermercados | 0.155172 | 0.2 | 0.174757 | 90 |
| farmácias | 0.176471 | 0.052632 | 0.081081 | 57 |
| compra online | 0.366667 | 0.44898 | 0.40367 | 49 |
| posto de gás | 0.363636 | 0.195122 | 0.253968 | 41 |
| vestuário | 0.166667 | 0.153846 | 0.16 | 39 |
| artigos eletro | 0.3125 | 0.30303 | 0.307692 | 33 |
| loja de departamento | 0.0 | 0.0 | 0.0 | 16 |






Modified model: metrics

| metric | value |
|---|---|
| macro_precision | 0.169641 |
| macro_recall | 0.163294 |
| macro_f1 | 0.159068 |
| micro_precision | 0.393426 |
| micro_recall | 0.393426 |
| micro_f1 | 0.393426 |
| weighted_precision | 0.365613 |
| weighted_recall | 0.393426 |
| weighted_f1 | 0.372496 |

Modified model: class metrics

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| serviço | 0.568306 | 0.64 | 0.602026 | 325 |
| restaurante | 0.357466 | 0.484663 | 0.411458 | 163 |
| varejo | 0.309524 | 0.26 | 0.282609 | 150 |
| supermercados | 0.158879 | 0.188889 | 0.172589 | 90 |
| farmácias | 0.153846 | 0.035088 | 0.057143 | 57 |
| compra online | 0.40625 | 0.530612 | 0.460177 | 49 |
| posto de gás | 0.304348 | 0.170732 | 0.21875 | 41 |
| vestuário | 0.114286 | 0.102564 | 0.108108 | 39 |
| artigos eletro | 0.275862 | 0.242424 | 0.258065 | 33 |
| loja de departamento | 0.25 | 0.0625 | 0.1 | 16 |
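
In the results above, micro precision, recall and F1 are identical for each model, and the macro scores are far below the weighted ones. The sketch below shows why, using the standard definitions of the three averages on a small hypothetical example. It assumes that `compute_multiclass_classification_metrics` follows these standard definitions, which this notebook does not show.

```python
from collections import Counter


def f1_summary(y_true, y_pred):
    """Return macro, micro and weighted F1 for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predicted examples per class
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    per_class_f1 = {}
    for cls in classes:
        tp = true_positives[cls]
        precision = tp / predicted[cls] if predicted[cls] else 0.0
        recall = tp / support[cls] if support[cls] else 0.0
        denominator = precision + recall
        per_class_f1[cls] = 2 * precision * recall / denominator if denominator else 0.0

    total = len(y_true)
    return {
        # Every class counts the same, so weak minority classes pull it down.
        "macro_f1": sum(per_class_f1.values()) / len(classes),
        # Each class is weighted by its support, so large classes dominate.
        "weighted_f1": sum(per_class_f1[c] * support[c] for c in classes) / total,
        # With one label per example, micro F1 equals plain accuracy.
        "micro_f1": sum(true_positives.values()) / total,
    }


# Hypothetical imbalanced example: class 0 is frequent, class 2 is rare.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 1]
summary = f1_summary(y_true, y_pred)
print(summary)
```

In single-label classification every wrong prediction is one false positive and one false negative, so micro precision, micro recall and micro F1 all reduce to accuracy. The gap between macro and weighted F1 reflects how poorly the small classes score compared with `serviço`.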






## Register Model Metrics in MLflow

The final metrics and reports are recorded in MLflow for each model.

In [9]:
for name, model_registry in models.items():
    wrapped_model = model_registry['wrapped_model']
    run_id = wrapped_model.metadata.to_dict()['run_id']

    set_active_run(run_id)
    log_metrics(model_registry['metrics'])
    log_dataframe_artifact(model_registry['classification_report'], 'main model', 'test_classification_report')
    end_run()

## Concluding Remarks

This notebook ends the project by loading the published versions of the `base_model` and the `modified_model` and evaluating both of them on an unseen dataset (`test`).

Overall, the only class that could be suitable for use (though still with modest performance) is `serviço`. This class has a sample size large enough to allow training and evaluating a model.

The other classes do not have a representative sample size. When the concentration of information over time is also considered, it is not possible to properly evaluate the data.

The issues in the dataset also seem to have caused a misleading performance increase for the `modified_model` during validation. Its F1 score on the `vestuário` class dropped from `0.19` on the `validation` sets to `0.11` on the `test` set. Even the `base_model` performed better, with an F1 of `0.16`. Considering this scenario, a better approach to improve the class results with less variance would be to adjust the class probabilities of the model.
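
The notebook does not specify how the class probabilities would be adjusted. One possible reading is to rescale the predicted probability of a target class before picking the most likely one. The sketch below illustrates that idea only; the function name, the probabilities and the weight of `1.5` are hypothetical, and in practice the weight would have to be tuned on the `validation` sets rather than on the `test` set.

```python
def predict_with_class_weights(probabilities, classes, class_weights):
    """Pick the class with the highest weighted probability for each row.

    Classes missing from class_weights keep a weight of 1.0.
    """
    weights = [class_weights.get(cls, 1.0) for cls in classes]
    predictions = []
    for row in probabilities:
        weighted = [p * w for p, w in zip(row, weights)]
        best_index = max(range(len(classes)), key=weighted.__getitem__)
        predictions.append(classes[best_index])
    return predictions


# Hypothetical predicted probabilities for three transactions.
classes = ["serviço", "varejo", "vestuário"]
probabilities = [
    [0.50, 0.20, 0.30],
    [0.40, 0.25, 0.35],
    [0.10, 0.60, 0.30],
]

# Without weights this is a plain argmax over the probabilities.
plain = predict_with_class_weights(probabilities, classes, {})
# Boosting one class flips only the rows where it was a close second.
boosted = predict_with_class_weights(probabilities, classes, {"vestuário": 1.5})
print(plain)
print(boosted)
```

A single scalar per class leaves the trained model untouched, which is why such an adjustment would be expected to vary less between datasets than retraining a modified model.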

One of the main causes of the poor metrics might be the time-based split of the dataset. This split, however, provides a sound assessment methodology: it ensures that the evaluation is not optimistic and does not leak information to the model. That is especially necessary after the pandemic and all the behavioral changes that happened as a consequence.