# Modeling Notebook

- **Creation Date**: June 13, 2025  
- **Author**: Corentin Vasseur — [vasseur.corentin@gmail.com](mailto:vasseur.corentin@gmail.com)

---

In this notebook, we present a baseline model along with several variations that were tested. We include the performance metrics used to evaluate them, as well as the details of hyperparameter tuning.

Finally, we propose several improvements to enhance model performance based on our findings.


## 1. Imports section

In [57]:
import pandas as pd
import numpy as np
import re

from sklearn.model_selection import RandomizedSearchCV

from skrub import GapEncoder
from skrub import Cleaner, TableReport
from skrub import StringEncoder, MinHashEncoder, TableVectorizer, TextEncoder

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD

from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import confusion_matrix, accuracy_score

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

from sklearn.feature_extraction.text import TfidfVectorizer

import sys
sys.path.append('../src/')
from qto_categorizer_ml.io.datasets import CSVReader
from qto_categorizer_ml.io.services import LoggerService
from qto_categorizer_ml.core.models import BaselineModel, SKLearnPipelineModel
from matplotlib import pyplot as plt

logger = LoggerService().logger()

## 2. Load datasets 

In this section we use the [CSVReader](https://github.com/data-corentinv/qto-categorizer-ml) object from the `qto-categorizer-ml` package. The idea is to separate and manage the different ways of reading data (local files, S3 buckets, Delta Lake, etc.) across environments (local, dev, preprod, production).
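To illustrate the idea of hiding the data source behind a common reader, here is a minimal sketch using only the standard library. The names `Reader`, `LocalCSVReader`, `InMemoryCSVReader` and `load_rows` are hypothetical and are not taken from the `qto-categorizer-ml` package; the actual `CSVReader` may be organized differently.

```python
import csv
import io
from abc import ABC, abstractmethod


class Reader(ABC):
    """Hypothetical common interface: every source exposes the same read()."""

    @abstractmethod
    def read(self) -> list[dict]:
        ...


class LocalCSVReader(Reader):
    """Reads rows from a CSV file on the local filesystem."""

    def __init__(self, path: str):
        self.path = path

    def read(self) -> list[dict]:
        with open(self.path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))


class InMemoryCSVReader(Reader):
    """Stand-in for a remote source (e.g. an object store) in this sketch."""

    def __init__(self, content: str):
        self.content = content

    def read(self) -> list[dict]:
        return list(csv.DictReader(io.StringIO(self.content)))


def load_rows(reader: Reader) -> list[dict]:
    # Downstream code depends only on the Reader interface, not on the source.
    return reader.read()


rows = load_rows(InMemoryCSVReader("MERCHANT_NAME,AMOUNT\nacme,12.5\n"))
print(rows)
```

With this kind of design, switching environment only means passing a different reader to the same downstream code.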

In [58]:
dtypes = {
    'TRANSACTION_ID': str,
    'AMOUNT': float,
    'TYPE_OF_PAYMENT': str,
    'MERCHANT_NAME': str,
    'DESCRIPTION': str,
    'SIDE':  int,
    'CATEGORY': str,
}

parse_dates = ['DATE_EMITTED']

path = "../data/data-products.csv"
df = CSVReader(path=path, dtypes=dtypes, parse_dates=parse_dates).read()

In [59]:
# Fill missing values in MERCHANT_NAME, DESCRIPTION, TYPE_OF_PAYMENT

df['MERCHANT_NAME'] = df.MERCHANT_NAME.fillna("")
df['DESCRIPTION'] = df.DESCRIPTION.fillna("")
df['TYPE_OF_PAYMENT'] = df.TYPE_OF_PAYMENT.fillna("")

# Features selection
features = ['AMOUNT', 'TYPE_OF_PAYMENT', 'MERCHANT_NAME', 'DESCRIPTION']
target = 'CATEGORY'

## 3. Create train and test datasets

In this section, we drop duplicated rows, separate the target from the features, and split the data into train and test sets. The split is stratified on the target (`CATEGORY`) so that both sets preserve its class distribution.
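The effect of the `stratify` argument of `train_test_split` can be checked on a toy example. The sketch below uses a made-up, deliberately imbalanced target (the names `X_toy` and `y_toy` are illustrative) and counts the classes on each side of the split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced target: 80 % "a" and 20 % "b".
y_toy = ["a"] * 80 + ["b"] * 20
X_toy = list(range(len(y_toy)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both sides should keep the 80/20 class ratio of the full dataset.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Without stratification, rare categories could end up under-represented (or absent) in the test set, which would make per-class metrics unreliable.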


In [60]:
X = df[features+[target]].drop_duplicates()
y = X.pop(target)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

logger.info(f"Dim of train set {X_train.shape}, dim of test set {X_test.shape}")

2025-06-13 17:44:32.719 | INFO     | __main__:<module>:8 - Dim of train set (268729, 4), dim of test set (67183, 4)


## 4. Baseline

As shown in the EDA notebook, a simple (but solid!) `baseline` is to assign each merchant their most frequent category. 

This rule-based approach serves as a reference point to evaluate whether our model can outperform a naive strategy that already achieves surprisingly good accuracy.

In the `qto-categorizer-ml` Python module, we propose an implementation in `qto_categorizer_ml.core.models.BaselineModel`.
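The rule itself fits in a few lines. The following is a minimal standard-library sketch of a "most frequent category per merchant" rule, not the actual `BaselineModel` code; the function names, the toy data, and the fallback used for unseen merchants (the most frequent category overall) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_most_frequent(merchants, categories):
    """Map each merchant to the category it appears with most often."""
    counts = defaultdict(Counter)
    for merchant, category in zip(merchants, categories):
        counts[merchant][category] += 1
    return {m: c.most_common(1)[0][0] for m, c in counts.items()}


def predict_most_frequent(mapping, merchants, default):
    """Look up the learned category; unseen merchants get `default`."""
    return [mapping.get(m, default) for m in merchants]


train_merchants = ["acme", "acme", "acme", "globex", "globex"]
train_categories = ["Office Supplies", "Office Supplies", "Travel", "Travel", "Travel"]

mapping = fit_most_frequent(train_merchants, train_categories)
# Fallback for unseen merchants: the most frequent category overall (an assumption).
default = Counter(train_categories).most_common(1)[0][0]
predictions = predict_most_frequent(mapping, ["acme", "globex", "initech"], default)
print(predictions)
```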

In [61]:
baseline = BaselineModel()
baseline.fit(inputs=X_train, targets=y_train)

y_baseline_pred = baseline.predict(inputs=X_test)
logger.info(f"Baseline prediction example (test set): {y_baseline_pred[:5]}...")

2025-06-13 17:44:33.523 | INFO     | __main__:<module>:5 - Baseline prediction example (test set): ['Operational Expenses: Office Supplies'
 'Operational Expenses: Production Costs'
 'Operational Expenses: Production Costs'
 'Operational Expenses: Office Supplies'
 'Operational Expenses: Office Supplies']...


## 5. Create pipeline

This pipeline is designed to classify transactions using a mix of numerical (e.g. `AMOUNT`), categorical (e.g. `TYPE_OF_PAYMENT`), and textual (e.g. `DESCRIPTION`) features.

It includes preprocessing steps tailored to each type of data, followed by a `RandomForestClassifier`.

### Details of the pipeline

**Preprocessing (via ColumnTransformer):**
- Numerical feature:
    * `AMOUNT`: Scaled using `StandardScaler` to normalize values (mean 0, std 1).
- Categorical feature:
    * `TYPE_OF_PAYMENT`: Encoded using `OneHotEncoder`, with unknown categories ignored at inference time.
- Text features:
    * `DESCRIPTION`:
        * Transformed using `TF-IDF` (max 1000 features).
        * Reduced to 50 dimensions using `TruncatedSVD` (a PCA-like reduction that works directly on sparse matrices).
    * `MERCHANT_NAME`:
        * Similar processing with `TF-IDF` (max 500 features).
        * Dimensionality reduced to 30 components with `TruncatedSVD`.

**Modeling:**

- Classifier: A `RandomForestClassifier` with:
  * 200 trees (`n_estimators`=200)
  * Maximum depth of 30 per tree (`max_depth`=30)
  * Parallel processing (`n_jobs`=-1)
  * Fixed randomness (`random_state`=42) for reproducibility
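The description above can be sketched with plain scikit-learn objects. This is an illustration under assumptions, not the actual `SKLearnPipelineModel` code: the function `build_pipeline`, the step names, and the toy data are hypothetical. The step name `classifier` is chosen so that parameters can be addressed as `classifier__n_estimators`. The SVD sizes are exposed as arguments because `TruncatedSVD` cannot produce more components than there are TF-IDF features, which matters on tiny datasets.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(desc_components=50, merchant_components=30, n_estimators=200):
    """Assemble the preprocessing and classifier described above (a sketch)."""

    def text_branch(max_features, n_components):
        # TF-IDF followed by a truncated SVD to obtain a small dense representation.
        return Pipeline([
            ("tfidf", TfidfVectorizer(max_features=max_features)),
            ("svd", TruncatedSVD(n_components=n_components, random_state=42)),
        ])

    preprocessor = ColumnTransformer([
        ("amount", StandardScaler(), ["AMOUNT"]),
        ("payment", OneHotEncoder(handle_unknown="ignore"), ["TYPE_OF_PAYMENT"]),
        # Text columns are passed as a single name so the vectorizer gets 1-D input.
        ("description", text_branch(1000, desc_components), "DESCRIPTION"),
        ("merchant", text_branch(500, merchant_components), "MERCHANT_NAME"),
    ])
    classifier = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=30, n_jobs=-1, random_state=42
    )
    return Pipeline([("preprocessor", preprocessor), ("classifier", classifier)])


# Tiny made-up dataset: the SVD sizes are reduced because the vocabulary is small.
toy = pd.DataFrame({
    "AMOUNT": [12.0, 250.0, 13.5, 240.0, 11.0, 260.0],
    "TYPE_OF_PAYMENT": ["card", "transfer", "card", "transfer", "card", "transfer"],
    "DESCRIPTION": ["paper and pens", "steel raw material", "printer paper ink",
                    "raw material delivery", "pens and ink", "steel delivery order"],
    "MERCHANT_NAME": ["office depot", "steel works", "office depot",
                      "steel works", "paper shop", "metal supply"],
})
labels = ["supplies", "production", "supplies", "production", "supplies", "production"]

model = build_pipeline(desc_components=2, merchant_components=2, n_estimators=10)
model.fit(toy, labels)
toy_predictions = model.predict(toy)
print(toy_predictions)
```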

In [62]:
pipeline = SKLearnPipelineModel()
pipeline.pipeline

## 6. Find best hyperparameters

Choosing the right hyperparameters can significantly improve model performance. scikit-learn offers several methods for automated hyperparameter tuning: [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), [HalvingGridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html), and [HalvingRandomSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingRandomSearchCV.html).

The best practices I followed are:
- Always combine these methods with cross-validation (`cv` parameter).
- Use `n_jobs=-1` to parallelize the search.
- Choose the scoring metric based on the problem to solve (e.g. `accuracy`, `f1_macro`, `roc_auc`).

For the categorizer, I selected `RandomizedSearchCV` because it explores a wide hyperparameter space with fewer model fits than an exhaustive grid search, which makes it well suited to time-constrained searches.

In [63]:
if False:  # change to True to run the hyperparameter tuning job
    # Search space for the random forest step of the pipeline
    param_grid = {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__max_depth': [10, 20, 30, None],
    }

    grid_search = RandomizedSearchCV(
        pipeline,
        param_grid,
        n_iter=10,
        cv=3,
        scoring="accuracy",
        n_jobs=-1,
    )

    grid_search.fit(X_train, y_train)

    logger.info(f"Best params {grid_search.best_params_}, best score {grid_search.best_score_}")
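The search budget of the tuning cell can be checked with a short calculation. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers on the same search space to count candidates and cross-validation fits (the final refit on the full training set is not counted). The variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the tuning cell.
search_space = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [10, 20, 30, None],
}
cv = 3
n_iter = 10

n_grid_candidates = len(ParameterGrid(search_space))  # every combination
sampled = list(ParameterSampler(search_space, n_iter=n_iter, random_state=0))

grid_fits = n_grid_candidates * cv
random_fits = len(sampled) * cv
print(f"grid search: {n_grid_candidates} candidates, {grid_fits} fits")
print(f"randomized search: {len(sampled)} candidates, {random_fits} fits")
```

On such a small space the saving is modest (30 fits instead of 36). The benefit of a fixed `n_iter` grows as more hyperparameters or values are added, since the grid size multiplies while the randomized budget stays constant.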

## 7. Train the pipeline
We train the pipeline with the hyperparameters selected by the tuning search.

In [None]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

logger.info(f"Pipeline prediction example (test set): {y_pred[:5]}...")
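For reference, here is one way the result of a search can be carried over to the final model with plain scikit-learn objects. It is a sketch on synthetic data with a hypothetical `demo_pipeline`; whether `SKLearnPipelineModel` exposes `set_params` in the same way is an assumption that would need to be checked against the package.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook would use X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_classes=3, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

search = RandomizedSearchCV(
    demo_pipeline,
    {"classifier__n_estimators": [10, 20], "classifier__max_depth": [3, None]},
    n_iter=3,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_demo, y_demo)

# Option 1: reuse the estimator that the search already refit on all the data.
best_model = search.best_estimator_

# Option 2: copy the winning values onto a fresh pipeline and fit it explicitly.
final_model = clone(demo_pipeline).set_params(**search.best_params_)
final_model.fit(X_demo, y_demo)
print(search.best_params_)
```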

## 8. Estimate performance

We are facing a multiclass classification problem. We propose using the accuracy score as a global estimate of performance.
To get detailed performance for each class, we can use the classical binary metrics (precision, recall, and F1-score), computed per class.
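To make these definitions concrete, the sketch below computes accuracy and the per-class precision, recall and F1-score by hand on a made-up example. The function name and the toy labels are illustrative; in practice scikit-learn's `classification_report` provides the same per-class quantities.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return accuracy and a {label: (precision, recall, f1)} dictionary."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class receives a false positive
            fn[truth] += 1  # the true class receives a false negative

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)

    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, metrics


y_true_toy = ["rent", "rent", "travel", "travel", "travel", "tax"]
y_pred_toy = ["rent", "travel", "travel", "travel", "tax", "tax"]

accuracy, metrics = per_class_metrics(y_true_toy, y_pred_toy)
# Macro F1: unweighted mean of the per-class F1-scores.
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```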

### Accuracy score

In [None]:
# Encode string labels with each model's own target encoder before scoring
y_test_enc = pipeline.pipeline._encode_target(y_test)
y_pred_enc = pipeline.pipeline._encode_target(y_pred)
y_test_enc_baseline = baseline._encode_target(y_test)
y_baseline_pred_enc = baseline._encode_target(y_baseline_pred)

cm = confusion_matrix(y_test_enc, y_pred_enc)
cm_baseline = confusion_matrix(y_test_enc_baseline, y_baseline_pred_enc)

acc = accuracy_score(y_test_enc, y_pred_enc)
acc_baseline = accuracy_score(y_test_enc_baseline, y_baseline_pred_enc)
acc, acc_baseline
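Beyond the global accuracy, the off-diagonal cells of a confusion matrix show which categories are mistaken for one another. The following standard-library sketch counts the most frequent (true, predicted) error pairs on made-up labels. The function name and the data are illustrative, and the same idea could be applied to `y_test` and `y_pred`.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top=3):
    """Count (true, predicted) pairs for misclassified samples only."""
    errors = Counter(
        (truth, pred) for truth, pred in zip(y_true, y_pred) if truth != pred
    )
    return errors.most_common(top)


y_true_toy = ["rent", "rent", "rent", "travel", "travel", "tax"]
y_pred_toy = ["travel", "travel", "rent", "travel", "tax", "tax"]

confused = most_confused_pairs(y_true_toy, y_pred_toy)
for (truth, pred), count in confused:
    print(f"{truth!r} predicted as {pred!r}: {count} time(s)")
```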

In [None]:
# Baseline: per-class precision, recall and F1-score
print(classification_report(y_test, y_baseline_pred, zero_division=0))

In [None]:
# Pipeline: per-class precision, recall and F1-score
print(classification_report(y_test, y_pred, zero_division=0))