## Train Base Model -- Development of a Transaction Categorization Model

This notebook builds the base model for categorizing transactions. In this scenario, all categories are treated as equally important, so class weights are used to compensate for differences in label frequency.

### Tasks:
 - [X] Load training dataset.
 - [X] Create k folds for experiments.
 - [X] Generate feature vector.
 - [X] Train model.
     - [X] Adjust weights;
     - [X] Compute metrics based on folds;
     - [X] Train final model;
 - [ ] Submit model, parameters and metrics to MLflow.

## Libraries and Configurations

In [1]:
import pandas as pd
from IPython.display import HTML, display
from lightgbm.sklearn import LGBMClassifier
# import mlflow
# import mlflow.sklearn
# from mlflow.tracking import MlflowClient
# from mlflow.models.signature import infer_signature

from application.code.core.configurations import configs
from application.code.adapters.storage import read_dataset
from application.code.core.dataset_split_service import generate_folds, describe_datasets
from application.code.core.feature_engineering import engineer_features
from application.code.core.model_training import (clean_data,
                                                  vectorize_folds, 
                                                  compute_weights, 
                                                  generate_encoders,
                                                  vectorize_dataset)
from application.code.core.model_evaluation import (compute_multiclass_classification_metrics, 
                                                    generate_feature_importance_report)

## Constants

In [2]:
TARGET_COLUMN = 'grupo_estabelecimento'

CATEGORICAL_COLUMNS = ['cidade', 'estado', 'sexo', 'data',
                       'cidade_estabelecimento','pais_estabelecimento']

HIGH_CARDINALITY_CATEGORICAL_COLUMNS = [
    'cidade', 'estado', 
    'cidade_estabelecimento', 'pais_estabelecimento', 'estado_estabelecimento',
]

BINARY_COLUMNS = ['sexo',  'dia_util',
                  'cidade_diferente', 'estado_diferente', 'pais_diferente',]

NUMERIC_COLUMNS = ['idade',
                   'limite_total', 'limite_disp', 'valor', 
                   'dia_semana', 'dia_mes', 'mes',
                   'valor_relativo_total', 'valor_relativo_disponivel',
                  ]

COLUMNS_SELECTION  = (
    HIGH_CARDINALITY_CATEGORICAL_COLUMNS +
    BINARY_COLUMNS + 
    NUMERIC_COLUMNS
)

## Load Dataset

The `training` dataset is loaded to create the model and perform experiments.

In [3]:
df = read_dataset(base_path=configs.datasets.base_path, stage='raw', file_name='train')

display(HTML('<h4>Dataset</h4>'))
print(f'Records: {len(df)}')
print('\nSample:')
display(df.head(3).T)

Records: 3944

Sample:


| | 0 | 1 | 2 |
|---|---|---|---|
| id | 4,53E+11 | 4,53E+11 | 4,53E+11 |
| safra_abertura | 201405 | 201405 | 201405 |
| cidade | CAMPO LIMPO PAULISTA | CAMPO LIMPO PAULISTA | CAMPO LIMPO PAULISTA |
| estado | SP | SP | SP |
| idade | 37 | 37 | 37 |
| sexo | F | F | F |
| limite_total | 4700 | 4700 | 4700 |
| limite_disp | 5605 | 5343 | 2829 |
| data | 4.12.2019 | 9.11.2019 | 6.05.2019 |
| valor | 31 | 15001 | 50 |


To create the model and perform experiments, only the `training` dataset will be used. The evaluation will be performed by creating some time-oriented `validation` datasets using the same methodology used to create the `test` dataset.

Several pairs of `training` and `validation` sets are created, each pair representing a fold; the count comes from `configs.model_training.folds`, and the run below produced 5 folds. This yields an efficacy measurement together with a notion of its variance.
Using `validation` sets avoids evaluating on the `test` set several times. Ideally, the `test` set is used only once, for the final assessment.

In [4]:
folds = generate_folds(df, 
                       n_folds=configs.model_training.folds, 
                       min_validation_size=configs.model_training.min_validation_size)
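
`generate_folds` is defined in the project package and is not shown here. The idea of time-oriented folds can be sketched as follows, assuming an expanding window that walks backwards from the most recent day. The helper name `expanding_window_folds` and the fixed `validation_days` parameter are hypothetical; the real function takes a `min_validation_size` setting instead.

```python
from datetime import date, timedelta


def expanding_window_folds(records, n_folds, validation_days):
    """Return (train, valid) pairs, most recent fold first.

    `records` is a list of (day, payload) tuples. Each fold validates on a
    window of `validation_days` days and trains on everything before it.
    """
    last_day = max(day for day, _ in records)
    folds = []
    for i in range(n_folds):
        valid_end = last_day - timedelta(days=i * validation_days)
        valid_start = valid_end - timedelta(days=validation_days - 1)
        train = [r for r in records if r[0] < valid_start]
        valid = [r for r in records if valid_start <= r[0] <= valid_end]
        folds.append((train, valid))
    return folds


# Toy data: one record per day for 100 days.
start = date(2019, 1, 1)
records = [(start + timedelta(days=d), d) for d in range(100)]
folds = expanding_window_folds(records, n_folds=3, validation_days=10)

for train, valid in folds:
    train_days = {day for day, _ in train}
    valid_days = {day for day, _ in valid}
    assert not (train_days & valid_days), 'Training and validation share dates.'
    print(len(train), len(valid))
```

Each successive fold moves the split date further into the past, so older folds have less training data. The same pattern is visible in the fold summaries of this notebook.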

Summarize and validate folds (`training` and `validation` sets should not share records).

In [5]:
for ix, (train_df, valid_df) in enumerate(folds):

    display(HTML(f'<strong>Fold <code>{ix}</code></strong>'))
    describe_datasets(train_df, valid_df, TARGET_COLUMN)
    print()

    train_periods = set(train_df["period"].tolist())
    valid_periods = set(valid_df["period"].tolist())

    assert len(train_periods & valid_periods) == 0, \
    'Training and Validation share dates.'

 - Split Period: 2019-12-31
 - Training:
	 - Size: 3533
	 - Days: 274
	 - Labels: 21
 - Assessment:
	 - Size: 27
	 - Days: 27
	 - Labels: 18
 - Assessment Relative Size: 10.42%



 - Split Period: 2019-12-03
 - Training:
	 - Size: 3108
	 - Days: 246
	 - Labels: 21
 - Assessment:
	 - Size: 28
	 - Days: 28
	 - Labels: 19
 - Assessment Relative Size: 12.03%



 - Split Period: 2019-11-06
 - Training:
	 - Size: 2679
	 - Days: 219
	 - Labels: 21
 - Assessment:
	 - Size: 27
	 - Days: 27
	 - Labels: 18
 - Assessment Relative Size: 13.80%



 - Split Period: 2019-10-04
 - Training:
	 - Size: 2243
	 - Days: 186
	 - Labels: 20
 - Assessment:
	 - Size: 33
	 - Days: 33
	 - Labels: 17
 - Assessment Relative Size: 16.27%



 - Split Period: 2019-08-27
 - Training:
	 - Size: 1826
	 - Days: 148
	 - Labels: 20
 - Assessment:
	 - Size: 38
	 - Days: 38
	 - Labels: 18
 - Assessment Relative Size: 18.59%



## Dataset Preprocessing and Feature Vectorization

Vectorization of each fold based on the following strategy:
 - For each fold:
   - Perform basic cleaning:
       - Remove duplicated records.
       - Format column names.
       - Cast column types.
       - Standardize string values.
   - Create new features based on the original features.
   - Use the `training` set to create encoders (label and categorical columns):
       - `LabelEncoder` represents target labels as numbers.
       - `CountEncoder` represents high-cardinality categorical data as numbers. This method handles missing values and avoids creating multiple columns per value, which reduces the sparsity of the feature vector.
   - Transform binary columns into `0` or `1`.
   
The values are not scaled because LightGBM is a tree-based algorithm and is largely insensitive to feature scale, although [some evidence suggests otherwise](https://arxiv.org/pdf/1611.04561.pdf).
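
The count-encoding idea can be illustrated with the standard library alone. This is a minimal sketch, not the `CountEncoder` implementation used by the project: the helper names are hypothetical, and treating a missing value as its own category and mapping unseen values to `0` are assumptions.

```python
from collections import Counter


def fit_count_encoder(values):
    """Learn value -> frequency from the training column only."""
    return Counter(values)


def transform_counts(counts, values, unseen=0):
    """Replace each value with its training frequency.

    Values never seen during fitting map to `unseen`.
    """
    return [counts.get(value, unseen) for value in values]


# `None` stands in for a missing value and is counted like any other category.
train_cities = ['SAO PAULO', 'SAO PAULO', 'CAMPINAS', None, 'SAO PAULO']
valid_cities = ['CAMPINAS', 'SANTOS', None]

counts = fit_count_encoder(train_cities)
encoded_train = transform_counts(counts, train_cities)
encoded_valid = transform_counts(counts, valid_cities)

print(encoded_train)  # [3, 3, 1, 1, 3]
print(encoded_valid)  # [1, 0, 1]
```

Fitting on the `training` set only keeps information from the `validation` set out of the encoding, and each categorical column stays a single numeric column.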


In [6]:
vectorized_folds = vectorize_folds(folds,
                                   columns_selection=COLUMNS_SELECTION,
                                   categorical_columns=CATEGORICAL_COLUMNS,
                                   high_cardinality_categorical_columns=HIGH_CARDINALITY_CATEGORICAL_COLUMNS,
                                   binary_columns=BINARY_COLUMNS,
                                   target_column=TARGET_COLUMN,
                                  )
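
The feature engineering step runs inside the project package and is not shown in this notebook. As an illustration only, the calendar features listed in the constants (`dia_semana`, `dia_mes`, `mes`, `dia_util`) could be derived from the `data` column roughly as follows. The day.month.year format, the Monday=0 weekday convention, and defining `dia_util` as "not a weekend" are assumptions, and the helper name `date_features` is hypothetical.

```python
from datetime import datetime


def date_features(raw_date):
    """Derive calendar features from a string such as '4.12.2019'.

    Assumes day.month.year ordering.
    """
    parsed = datetime.strptime(raw_date, '%d.%m.%Y')
    weekday = parsed.weekday()  # Monday=0 ... Sunday=6
    return {
        'dia_semana': weekday,
        'dia_mes': parsed.day,
        'mes': parsed.month,
        # Weekends only; public holidays are ignored in this sketch.
        'dia_util': int(weekday < 5),
    }


features = date_features('4.12.2019')
print(features)
```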

## Model Training on Folds

As stated before, LightGBM is the algorithm used to train the model. It was chosen because it:
 - Is tree-based and does not require feature scaling.
 - Has a strong track record in industry and in machine learning competitions.
 - Handles missing values.
 - Is efficient and supports large volumes of data (using GPU or distributed computing).
 - Provides an API compatible with scikit-learn.
 - Has good [documentation](https://lightgbm.readthedocs.io/en/v3.3.2/) and community content (e.g., blogs and forums).
 - Is supported by different ML tools (e.g., Optuna, ONNX, Dask, and Spark).

### Training and Evaluation on Folds

In [7]:
%%time

model_params = {'objective': 'multiclass', 
                'metric': 'multi_error',                 
                'verbosity': -1, 
                'n_estimators': 500,
                'random_state': configs.model_training.random_seed,
               }

iterations_tracking = []

for ix, ((train_X, train_y), (valid_X, valid_y)) in enumerate(vectorized_folds):

    class_weights = compute_weights(train_y)
    
    model_params.update({'class_weight': class_weights,
                         'num_class': len(set(train_y))})

    model = LGBMClassifier(**model_params)
    model.fit(train_X, train_y)

    preds = model.predict(valid_X)
    eval_metrics = compute_multiclass_classification_metrics(valid_y, 
                                                             preds.round(), 
                                                             average='macro')

    iteration_tracking = {**{'Fold': ix,
                             'training_size': train_X.shape[0],
                             'validation_size': valid_X.shape[0],},
                          **eval_metrics}
    iterations_tracking.append(iteration_tracking)    

CPU times: user 4min 40s, sys: 2.24 s, total: 4min 42s
Wall time: 39.7 s
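
`compute_weights` is defined outside this notebook. A minimal sketch of what such a helper might do, assuming the common "balanced" heuristic `n_samples / (n_classes * class_count)`; the function name `balanced_class_weights` is hypothetical and the project's implementation may differ.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    With these weights, every class contributes the same total weight.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy labels: class 0 is frequent, class 2 is rare.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items()):
    print(label, round(weight, 3))
```

A dictionary of this shape can be passed as `class_weight` to `LGBMClassifier`, which also accepts the string `'balanced'` for the same heuristic.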


### Folds Metrics

Because the labels are imbalanced, the metrics should not be dominated by the most frequent classes. Macro-averaged `Precision`, `Recall`, and `F1` are good choices because they are computed per class and then averaged with equal weight. Plain `Accuracy` is less suitable because it is dominated by the majority classes, although there are `balanced` versions of it that could be used.
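
To make the difference concrete, the sketch below computes macro-averaged metrics by hand on toy data where a model always predicts the majority class. It is an illustration in plain Python, not the project's `compute_multiclass_classification_metrics`; the function name and the choice to score undefined precision or recall as `0.0` are assumptions.

```python
def macro_precision_recall_f1(y_true, y_pred):
    """Compute per-class precision/recall/F1, then average with equal weight."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    totals = {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        totals['precision'] += precision
        totals['recall'] += recall
        totals['f1'] += f1
    return {name: value / len(labels) for name, value in totals.items()}


# Eight samples of class 0, one each of classes 1 and 2;
# the "model" always predicts class 0.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = macro_precision_recall_f1(y_true, y_pred)

print(f'accuracy={accuracy:.2f}')
print({name: round(value, 3) for name, value in metrics.items()})
```

Accuracy is 0.80 here even though two of the three classes are never predicted, while macro recall is only about 0.33.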

In [8]:
folds_evaluation_df = pd.DataFrame(iterations_tracking)

display(HTML('<strong>Individual Fold Metrics</strong>'))
display(folds_evaluation_df)

display(HTML('<strong>Summarized Fold Metrics</strong>'))
display(folds_evaluation_df
        .drop(columns=['Fold', 'training_size', 'validation_size'])
        .agg(['mean', 'std'])
        .T
       )

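**Individual Fold Metrics**

| | Fold | training_size | validation_size | precision | recall | f1 |
|---|---|---|---|---|---|---|
| 0 | 0 | 3531 | 411 | 0.168378 | 0.146911 | 0.152202 |
| 1 | 1 | 3106 | 425 | 0.340539 | 0.297991 | 0.300029 |
| 2 | 2 | 2677 | 429 | 0.29568 | 0.308083 | 0.29191 |
| 3 | 3 | 2241 | 436 | 0.262015 | 0.260289 | 0.251683 |
| 4 | 4 | 1824 | 417 | 0.266909 | 0.271522 | 0.245534 |

**Summarized Fold Metrics**

| | mean | std |
|---|---|---|
| precision | 0.266704 | 0.063208 |
| recall | 0.256959 | 0.064481 |
| f1 | 0.248272 | 0.058804 |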


## Final Training

This section creates the final model, using all the `training` set. The evaluation will be performed in another notebook, to avoid reusing the `test` set. 

In [9]:
%%time

clean_df = (
    df
    .pipe(clean_data, CATEGORICAL_COLUMNS + [TARGET_COLUMN])
    .pipe(engineer_features)
)

labels = clean_df[TARGET_COLUMN].unique().tolist()

label_encoder, categorical_encoder = generate_encoders(
    clean_df[COLUMNS_SELECTION],
    labels,
    HIGH_CARDINALITY_CATEGORICAL_COLUMNS,
)

X_training, y_training = vectorize_dataset(df,
                                           label_encoder, 
                                           categorical_encoder,
                                           columns_selection=COLUMNS_SELECTION,
                                           categorical_columns=CATEGORICAL_COLUMNS,
                                           binary_columns=BINARY_COLUMNS,
                                           target_column=TARGET_COLUMN,
                                          )

class_weights = compute_weights(y_training)

model_params.update({'class_weight': class_weights,
                     'num_class': len(set(y_training))})

model = LGBMClassifier(**model_params)
model.fit(X_training, y_training);

CPU times: user 1min 10s, sys: 386 ms, total: 1min 10s
Wall time: 9.85 s


### Feature Importance

In [10]:
generate_feature_importance_report(model, COLUMNS_SELECTION)

| | feature | absolute_importance | relative_importance |
|---|---|---|---|
| 13 | valor | 28003 | 17.01% |
| 12 | limite_disp | 23099 | 14.03% |
| 17 | valor_relativo_total | 22885 | 13.90% |
| 18 | valor_relativo_disponivel | 22197 | 13.49% |
| 15 | dia_mes | 18154 | 11.03% |
| 16 | mes | 10252 | 6.23% |
| 2 | cidade_estabelecimento | 9024 | 5.48% |
| 10 | idade | 8603 | 5.23% |
| 14 | dia_semana | 7887 | 4.79% |
| 11 | limite_total | 6951 | 4.22% |


## Concluding Remarks
 - Efficacy:
     - F1 on the `validation` folds is low (mean of about 0.25). The main reason is probably the class imbalance and the low frequency of some categories.
     - The worst performance was on the most recent fold, which has the largest amount of training data.
     - The best performance was on the second fold.
 - Features:
     - Features based on `valor` and `limite` are the most important for the algorithm.
     - Among the newly created features, `dia_mes` and `mes` were the most relevant.
     - There are 7 features with less than 1% relative importance.


Some alternative approaches are worth trying:
 - Perform hyperparameters tuning using [Optuna](https://optuna.org/).
 - Make older data less relevant by decreasing the weights of each record based on time.
 - Use [alternative encoders](https://contrib.scikit-learn.org/category_encoders) for high cardinality categories. `Catboost` and `LeaveOneOut` are some of the notable candidates.
 - Apply alternative algorithms to improve results.
 - Perform adversarial validation to check drift between `training` and `assessment` datasets.
 - Use SHAP to compute feature importance in a more reliable way.
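
As an illustration of the time-based weighting idea from the list above, the sketch below assigns exponentially decaying weights by record age. The exponential form, the `half_life_days` parameter, and the helper name `time_decay_weights` are assumptions chosen for the example, not something evaluated in this notebook.

```python
from datetime import date


def time_decay_weights(days, half_life_days):
    """Give the most recent day a weight of 1.0.

    The weight halves for every `half_life_days` of age.
    """
    latest = max(days)
    return [0.5 ** ((latest - day).days / half_life_days) for day in days]


days = [date(2019, 12, 31), date(2019, 12, 1), date(2019, 11, 1)]
weights = time_decay_weights(days, half_life_days=30)

for day, weight in zip(days, weights):
    print(day.isoformat(), round(weight, 3))
```

Weights like these could be passed through the `sample_weight` argument of `LGBMClassifier.fit`; how they should be combined with the class weights would need to be checked on the validation folds.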