# Modeling Notebook

- **Creation Date**: June 13, 2025  
- **Author**: Corentin Vasseur — [vasseur.corentin@gmail.com](mailto:vasseur.corentin@gmail.com)

---

In this notebook, we present a baseline model along with several variations that were tested. We include the performance metrics used to evaluate them, as well as the details of hyperparameter tuning.

Finally, we propose several improvements to enhance model performance based on our findings.


## 1. Imports section

In [1]:
import sys
sys.path.append('../src/')

import pandas as pd
import numpy as np
import re

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from matplotlib import pyplot as plt

from qto_categorizer_ml.io.datasets import CSVReader
from qto_categorizer_ml.io.services import LoggerService
from qto_categorizer_ml.core.models import BaselineModel, SKLearnPipelineModel

logger = LoggerService().logger()

## 2. Load datasets 

In this section we use the [CSVReader](https://github.com/data-corentinv/qto-categorizer-ml) object from the `qto-categorizer-ml` package. The idea is to separate and manage the different ways of reading data (local files, S3 bucket, Delta Lake, etc.) across environments (local, dev, preprod, production).
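
A minimal sketch of this separation, assuming a hypothetical `Reader` interface. `LocalCSVReader`, `InMemoryReader` and `load_transactions` are illustrative names, not the package's actual classes, and the real `CSVReader` may be organized differently. The calling code depends only on `read()`, so the data source can change from one environment to another.

```python
import csv
import io
from abc import ABC, abstractmethod


class Reader(ABC):
    """Hypothetical common interface: every data source exposes read()."""

    @abstractmethod
    def read(self) -> list[dict]:
        """Return the dataset as a list of row dictionaries."""


class LocalCSVReader(Reader):
    """Reads rows from a CSV file on the local filesystem."""

    def __init__(self, path: str):
        self.path = path

    def read(self) -> list[dict]:
        with open(self.path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))


class InMemoryReader(Reader):
    """Reads rows from CSV text held in memory (handy for tests)."""

    def __init__(self, text: str):
        self.text = text

    def read(self) -> list[dict]:
        return list(csv.DictReader(io.StringIO(self.text)))


def load_transactions(reader: Reader) -> list[dict]:
    # The caller does not need to know where the rows come from.
    return reader.read()


rows = load_transactions(InMemoryReader("MERCHANT_NAME,AMOUNT\nACME,12.5\n"))
print(rows)
```

Swapping `InMemoryReader` for `LocalCSVReader` (or a reader for another backend) would not require any change in `load_transactions`.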

In [2]:
dtypes = {
    'TRANSACTION_ID': str,
    'AMOUNT': float,
    'TYPE_OF_PAYMENT': str,
    'MERCHANT_NAME': str,
    'DESCRIPTION': str,
    'SIDE':  int,
    'CATEGORY': str,
}

parse_dates = ['DATE_EMITTED']

path = "../data/data-products.csv"
df = CSVReader(path=path, dtypes=dtypes, parse_dates=parse_dates).read()

In [3]:
# Fill missing values in MERCHANT_NAME, DESCRIPTION, TYPE_OF_PAYMENT

df['MERCHANT_NAME'] = df.MERCHANT_NAME.fillna("")
df['DESCRIPTION'] = df.DESCRIPTION.fillna("")
df['TYPE_OF_PAYMENT'] = df.TYPE_OF_PAYMENT.fillna("")

# Features selection
features = ['AMOUNT', 'TYPE_OF_PAYMENT', 'MERCHANT_NAME', 'DESCRIPTION']
target = 'CATEGORY'

## 3. Create train and test datasets

In this section we separate the target from the features and split both into train and test sets, stratifying on the target (`CATEGORY`) so that its distribution is preserved in both datasets.
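
To make the effect of stratification concrete, here is a minimal pure-Python sketch of a stratified split. The helper `stratified_split` and the toy labels are hypothetical; the notebook itself relies on `train_test_split(..., stratify=y)`, whose internals differ.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then cut the same fraction from each class
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced target: 80% "A", 20% "B"
labels = ["A"] * 80 + ["B"] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both subsets keep the 80/20 class ratio of the full toy target, which is the property the `stratify` argument is meant to guarantee.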

In [4]:
X = df[features+[target]].drop_duplicates()
y = X.pop(target)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

logger.info(f"Dim of train set {X_train.shape}, dim of test set {X_test.shape}")

2025-06-13 21:38:12.974 | INFO     | __main__:<module>:8 - Dim of train set (268729, 4), dim of test set (67183, 4)


## 4. Modeling
### 4.1. Baseline

As shown in the EDA notebook, a simple (but solid!) `baseline` is to assign each merchant their most frequent category. 

This rule-based approach serves as a reference point to evaluate whether our model can outperform a naive strategy that already achieves surprisingly good accuracy.

In the `qto-categorizer-ml` Python package, we propose an implementation in `qto_categorizer_ml.core.models.BaselineModel`.

In [5]:
baseline = BaselineModel()
baseline.fit(inputs=X_train, targets=y_train)

y_baseline_pred = baseline.predict(inputs=X_test)
logger.info(f"Baseline prediction example (test set): {y_baseline_pred[:5]}...")

2025-06-13 21:38:13.120 | INFO     | __main__:<module>:5 - Baseline prediction example (test set): ['Operational Expenses: Office Supplies'
 'Operational Expenses: Production Costs'
 'Operational Expenses: Production Costs'
 'Operational Expenses: Office Supplies'
 'Operational Expenses: Office Supplies']...
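
The rule behind the baseline can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the code of `BaselineModel`: the class name `MostFrequentCategoryBaseline` is hypothetical, and falling back to the globally most frequent category for merchants unseen during training is an assumed choice.

```python
from collections import Counter, defaultdict


class MostFrequentCategoryBaseline:
    """Hypothetical rule-based classifier: one category per merchant."""

    def fit(self, merchants, categories):
        counts = defaultdict(Counter)
        for merchant, category in zip(merchants, categories):
            counts[merchant][category] += 1
        # Most frequent category observed for each merchant
        self.mapping_ = {m: c.most_common(1)[0][0] for m, c in counts.items()}
        # Assumed fallback for unseen merchants: the global majority class
        self.default_ = Counter(categories).most_common(1)[0][0]
        return self

    def predict(self, merchants):
        return [self.mapping_.get(m, self.default_) for m in merchants]


model = MostFrequentCategoryBaseline().fit(
    merchants=["ACME", "ACME", "ACME", "UBER", "UBER"],
    categories=["Office Supplies", "Office Supplies", "Travel", "Travel", "Travel"],
)
predictions = model.predict(["ACME", "UBER", "UNKNOWN SHOP"])
print(predictions)
```

Because the rule only memorizes merchant names seen at training time, its quality on new merchants depends entirely on the fallback.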


### 4.2. Create a sklearn pipeline

This pipeline is designed to classify transactions using a mix of numerical (e.g. `AMOUNT`), categorical (e.g. `TYPE_OF_PAYMENT`), and textual (e.g. `DESCRIPTION`) features.

It includes preprocessing steps tailored to each type of data, followed by a `RandomForestClassifier`.

#### Details of the pipeline

**Preprocessing (via `ColumnTransformer`):**
- Numerical feature:
    * `AMOUNT`: scaled using `StandardScaler` to normalize values (mean 0, std 1).
- Categorical feature:
    * `TYPE_OF_PAYMENT`: encoded using `OneHotEncoder`, with unknown categories ignored at inference time.
- Text features:
    * `DESCRIPTION`:
        * Transformed using `TF-IDF` (max 1000 features).
        * Reduced to 50 dimensions using `TruncatedSVD` (similar to PCA, but usable on sparse matrices because it does not center the data).
    * `MERCHANT_NAME`:
        * Same processing with `TF-IDF` (max 500 features).
        * Reduced to 30 dimensions using `TruncatedSVD`.

**Modeling:**

- Classifier: a `RandomForestClassifier` with:
  * 200 trees (`n_estimators=200`)
  * Maximum depth of 30 per tree (`max_depth=30`)
  * Parallel processing (`n_jobs=-1`)
  * Fixed randomness (`random_state=42`) for reproducibility

[Note: Alternatively, a more advanced approach for handling textual features would be to use embeddings from a pretrained language model (e.g. downloaded from the `HuggingFace Hub`), such as `all-MiniLM-L6-v2` or `all-mpnet-base-v2`.]
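
The description above can be sketched with plain scikit-learn as follows. This is an approximation under assumptions, not the code of `SKLearnPipelineModel`: the function `build_pipeline`, the helper `text_branch`, the step names `desc`, `merchant`, `tfidf` and `svd`, and the toy data are all hypothetical. The number of SVD components is exposed as a parameter because it cannot exceed the vocabulary size, which is tiny in the toy example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def text_branch(max_features, n_components):
    """TF-IDF followed by dimensionality reduction for one text column."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(max_features=max_features)),
        ("svd", TruncatedSVD(n_components=n_components, random_state=42)),
    ])


def build_pipeline(desc_components=50, merchant_components=30):
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), ["AMOUNT"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["TYPE_OF_PAYMENT"]),
        # Text columns are passed as a single name so the vectorizer gets 1-D input
        ("desc", text_branch(1000, desc_components), "DESCRIPTION"),
        ("merchant", text_branch(500, merchant_components), "MERCHANT_NAME"),
    ])
    classifier = RandomForestClassifier(
        n_estimators=200, max_depth=30, n_jobs=-1, random_state=42
    )
    return Pipeline([("preprocessor", preprocessor), ("classifier", classifier)])


# Toy data (hypothetical) just to show that the pipeline fits and predicts
toy = pd.DataFrame({
    "AMOUNT": [12.5, 300.0, 45.0, 7.9, 120.0, 60.0],
    "TYPE_OF_PAYMENT": ["card", "transfer", "card", "card", "transfer", "card"],
    "MERCHANT_NAME": ["acme paper", "law firm", "acme ink",
                      "acme paper", "law office", "acme ink"],
    "DESCRIPTION": ["printer paper order", "legal advice invoice",
                    "ink cartridges order", "paper ream order",
                    "legal contract review", "ink refill order"],
})
labels = ["Office Supplies", "Legal Fees", "Office Supplies",
          "Office Supplies", "Legal Fees", "Office Supplies"]

model = build_pipeline(desc_components=2, merchant_components=2)
model.fit(toy, labels)
predictions = model.predict(toy)
print(predictions)
```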

In [6]:
pipeline = SKLearnPipelineModel()
pipeline.pipeline

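**Pipeline**

| Parameter | Value |
|---|---|
| steps | `[('preprocessor', ...), ('classifier', ...)]` |
| transform_input | |
| memory | |
| verbose | `False` |

**ColumnTransformer**

| Parameter | Value |
|---|---|
| transformers | `[('num', ...), ('cat', ...), ...]` |
| remainder | `'drop'` |
| sparse_threshold | `0.3` |
| n_jobs | |
| transformer_weights | |
| verbose | `False` |
| verbose_feature_names_out | `True` |
| force_int_remainder_cols | `'deprecated'` |

**StandardScaler**

| Parameter | Value |
|---|---|
| copy | `True` |
| with_mean | `True` |
| with_std | `True` |

**OneHotEncoder**

| Parameter | Value |
|---|---|
| categories | `'auto'` |
| drop | |
| sparse_output | `True` |
| dtype | `<class 'numpy.float64'>` |
| handle_unknown | `'ignore'` |
| min_frequency | |
| max_categories | |
| feature_name_combiner | `'concat'` |

**TfidfVectorizer (first text feature)**

| Parameter | Value |
|---|---|
| input | `'content'` |
| encoding | `'utf-8'` |
| decode_error | `'strict'` |
| strip_accents | |
| lowercase | `True` |
| preprocessor | |
| tokenizer | |
| analyzer | `'word'` |
| stop_words | |
| token_pattern | `'(?u)\\b\\w\\w+\\b'` |

**TruncatedSVD (50 components)**

| Parameter | Value |
|---|---|
| n_components | `50` |
| algorithm | `'randomized'` |
| n_iter | `5` |
| n_oversamples | `10` |
| power_iteration_normalizer | `'auto'` |
| random_state | `42` |
| tol | `0.0` |

**TfidfVectorizer (second text feature)**

| Parameter | Value |
|---|---|
| input | `'content'` |
| encoding | `'utf-8'` |
| decode_error | `'strict'` |
| strip_accents | |
| lowercase | `True` |
| preprocessor | |
| tokenizer | |
| analyzer | `'word'` |
| stop_words | |
| token_pattern | `'(?u)\\b\\w\\w+\\b'` |

**TruncatedSVD (30 components)**

| Parameter | Value |
|---|---|
| n_components | `30` |
| algorithm | `'randomized'` |
| n_iter | `5` |
| n_oversamples | `10` |
| power_iteration_normalizer | `'auto'` |
| random_state | `42` |
| tol | `0.0` |

**RandomForestClassifier**

| Parameter | Value |
|---|---|
| n_estimators | `200` |
| criterion | `'gini'` |
| max_depth | `30` |
| min_samples_split | `2` |
| min_samples_leaf | `1` |
| min_weight_fraction_leaf | `0.0` |
| max_features | `'sqrt'` |
| max_leaf_nodes | |
| min_impurity_decrease | `0.0` |
| bootstrap | `True` |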


## 5. Find the best hyperparameters

Choosing the right hyperparameters can significantly improve model performance. scikit-learn offers several methods for automated hyperparameter tuning: [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), [HalvingGridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html), and [HalvingRandomSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingRandomSearchCV.html).

The best practices I followed are:
- Always combine these methods with cross-validation (`cv` parameter).
- Use `n_jobs=-1` to parallelize the search.
- Choose the scoring metric based on the problem to solve (e.g. `accuracy`, `f1_macro`, `roc_auc`).

For the categorizer, I selected `RandomizedSearchCV` because it explores a wide hyperparameter space with fewer model fits than an exhaustive grid, which makes it well suited to time-constrained searches.
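
To see where the savings come from, the sketch below enumerates a small discrete grid and draws a fixed budget of candidates from it. The helper `sample_candidates` is hypothetical and only imitates the sampling step (no model is trained); the grid values and the 3-fold setting are illustrative. To my knowledge, scikit-learn also samples without replacement when every parameter is given as a list.

```python
import itertools
import random


def sample_candidates(param_grid, n_iter, seed=42):
    """Draw n_iter distinct parameter combinations from a discrete grid."""
    names = sorted(param_grid)
    all_combinations = [
        dict(zip(names, values))
        for values in itertools.product(*(param_grid[name] for name in names))
    ]
    rng = random.Random(seed)
    # The budget can never exceed the number of available combinations
    budget = min(n_iter, len(all_combinations))
    return all_combinations, rng.sample(all_combinations, budget)


param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [10, 20, 30, None],
}
all_combinations, candidates = sample_candidates(param_grid, n_iter=10)

cv_folds = 3  # each candidate is fitted once per fold
print(f"Exhaustive grid: {len(all_combinations) * cv_folds} fits")
print(f"Random search:   {len(candidates) * cv_folds} fits")
```

On a grid this small the saving is modest; the benefit grows quickly as parameters or values are added, since the budget `n_iter` stays fixed while the grid size multiplies.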

In [13]:
RUN_HPTUNING = False  # set to True to run the hyperparameter tuning job

if RUN_HPTUNING:
    # Search space: 3 x 4 = 12 combinations, of which n_iter=10 are sampled
    param_distributions = {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__max_depth': [10, 20, 30, None],
    }

    search = RandomizedSearchCV(
        pipeline.pipeline,
        param_distributions,
        n_iter=10,
        cv=3,
        scoring="accuracy",
        n_jobs=-1,
        random_state=42,  # make the sampled combinations reproducible
    )

    search.fit(X_train, y_train)

    logger.info(f"Best params {search.best_params_}, best score {search.best_score_}")

# Hard-coded result, logged so that it stays visible when the search is skipped
logger.info("Best params {'classifier__n_estimators': 200, 'classifier__max_depth': 30}, best score 0.6")

2025-06-13 22:03:40.882 | INFO     | __main__:<module>:19 - Best params {'classifier__n_estimators': 200, 'classifier__max_depth': 30}, best score 0.6


## 6. Train the pipeline
(Based on hyperparameter tuning search results.)

In [9]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

logger.info(f"Pipeline prediction example (test set): {y_pred[:5]}...")

2025-06-13 21:51:03.212 | INFO     | __main__:<module>:4 - Pipeline prediction example (test set): ['Operational Expenses: Office Supplies'
 'Operational Expenses: Production Costs'
 'Operational Expenses: Production Costs'
 'Operational Expenses: Office Supplies'
 'Operational Expenses: Office Supplies']...


## 7. Estimate performance
We are facing a multiclass classification problem: each transaction must be assigned exactly one `CATEGORY` among many. We use the accuracy score as a global estimate of performance.
To get a detailed view of the performance on each class, we also use the classical metrics precision, recall, and F1-score, computed per class.
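
As a reminder of what these metrics measure, here is a minimal pure-Python sketch on toy labels. The helpers `accuracy` and `per_class_report` are hypothetical; the notebook uses `accuracy_score` and `classification_report` from scikit-learn. One assumed difference: undefined ratios are set to 0.0 here, whereas the cells in this section pass `zero_division=np.nan`.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_report(y_true, y_pred):
    """Precision, recall and F1 for each class, treated one-vs-rest."""
    actual, predicted = Counter(y_true), Counter(y_pred)
    true_positive = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = true_positive[label]
        # Precision: among samples predicted as `label`, how many are correct
        precision = tp / predicted[label] if predicted[label] else 0.0
        # Recall: among samples that truly are `label`, how many were found
        recall = tp / actual[label] if actual[label] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "C", "C"]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
report = per_class_report(y_true, y_pred)
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Accuracy summarizes all classes in one number, so a model can score well overall while performing poorly on rare classes; the per-class report makes that visible.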

### Accuracy score

In [10]:
acc, acc_baseline = \
    accuracy_score(pipeline._encode_target(y_test), pipeline._encode_target(y_pred)), \
    accuracy_score(baseline._encode_target(y_test), baseline._encode_target(y_baseline_pred))

acc_train, acc_baseline_train = \
    accuracy_score(pipeline._encode_target(y_train), pipeline._encode_target(pipeline.predict(X_train))), \
    accuracy_score(baseline._encode_target(y_train), baseline._encode_target(baseline.predict(X_train)))

logger.info(f"Accuracy sklearn pipeline {round(acc,2)}, accuracy baseline: {round(acc_baseline,2)}")
logger.info(f"Accuracy sklearn pipeline {round(acc_train,2)}, accuracy baseline: {round(acc_baseline_train,2)}")

2025-06-13 21:51:07.681 | INFO     | __main__:<module>:9 - Accuracy sklearn pipeline 0.61, accuracy baseline: 0.69
2025-06-13 21:51:07.681 | INFO     | __main__:<module>:10 - Accuracy sklearn pipeline 0.8, accuracy baseline: 0.79

The first line reports accuracy on the test set and the second line reports accuracy on the train set.


### Precision, recall, F1-score

In [11]:
# baseline
logger.info("\n"+
    classification_report(
        baseline._encode_target(y_test), 
        baseline._encode_target(y_baseline_pred), 
        target_names= pipeline.le.classes_, 
        zero_division=np.nan,
    )
)

2025-06-13 21:51:07.727 | INFO     | __main__:<module>:2 - 
                                                        precision    recall  f1-score   support

              Administrative Expenses: Accounting Fees       0.80      0.80      0.80       686
                      Administrative Expenses: HR Fees       0.70      0.73      0.71       453
               Administrative Expenses: Insurance Fees       0.82      0.71      0.76       369
                   Administrative Expenses: Legal Fees       0.83      0.69      0.75      1187
Administrative Expenses: Other Administrative Expenses       0.81      0.61      0.70      1590
            Administrative Expenses: Service Providers       0.71      0.62      0.66      3123
                  Bank Fees & Charges: Loan Repayments       0.62      0.62      0.62        63
               Bank Fees & Charges: Other Bank Charges       0.86      0.80      0.83       351
                Bank Fees &

In [12]:
# pipeline
logger.info("\n"+
    classification_report(
        pipeline._encode_target(y_test), 
        pipeline._encode_target(y_pred), 
        target_names= pipeline.le.classes_, 
        zero_division=np.nan,
    )
)

2025-06-13 21:51:07.771 | INFO     | __main__:<module>:2 - 
                                                        precision    recall  f1-score   support

              Administrative Expenses: Accounting Fees       0.81      0.54      0.65       686
                      Administrative Expenses: HR Fees       0.68      0.46      0.55       453
               Administrative Expenses: Insurance Fees       0.81      0.42      0.55       369
                   Administrative Expenses: Legal Fees       0.79      0.58      0.67      1187
Administrative Expenses: Other Administrative Expenses       0.81      0.58      0.68      1590
            Administrative Expenses: Service Providers       0.54      0.43      0.47      3123
                  Bank Fees & Charges: Loan Repayments       0.75      0.57      0.65        63
               Bank Fees & Charges: Other Bank Charges       0.86      0.80      0.83       351
                Bank Fees &

### 7.1. Conclusion

The comparison between the **baseline model** (which assigns the most frequent category per `merchant`) and the `sklearn` pipeline (based on a `RandomForestClassifier` with standard preprocessing: `StandardScaler`, `OneHotEncoder`, `TF-IDF + SVD`) highlights a key insight: **the pipeline underperforms the baseline in terms of overall accuracy (0.61 vs. 0.69)**.

Despite using more advanced methods, the model fails to capture the complexity of the data, particularly in the text fields **DESCRIPTION** and **MERCHANT_NAME**. This shows in the low F1-scores of several classes and in the wide variation in performance across classes, which points to difficulties with class imbalance and ambiguous categories. The gap between train and test accuracy (0.80 vs. 0.61) also suggests that the pipeline overfits.


### 7.2. Next steps

There are several promising directions to improve the model’s performance, especially regarding the handling of text features:

1. **Richer Text Encoding with Skrub**  
   Replace or complement TF-IDF with encoders from the [`skrub`](https://github.com/skrub-data/skrub) library, which are designed for semi-structured text data:
   - `GapEncoder`: can be understood as a continuous encoding on a set of latent categories estimated from the data.
   - `MinHashEncoder`: encodes strings by applying the MinHash method to their n-gram decompositions.
   - `TextEncoder`: embeddings from a pretrained language model such as `all-MiniLM-L6-v2` or `all-mpnet-base-v2` (via HuggingFace).

   These are particularly useful for columns like **MERCHANT_NAME**, which often contain inconsistencies, typos, or variations in formatting.

2. **Model Upgrade**  
   While `RandomForest` is robust, it may not be optimal for handling dense, high-dimensional text representations. Alternatives like **LightGBM**, **XGBoost**, or even a **lightweight neural network** might handle text embeddings more effectively.
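
To illustrate the n-gram idea behind the `MinHashEncoder` mentioned in item 1, here is a simplified pure-Python sketch. It is not skrub's implementation: the functions `char_ngrams`, `minhash_encode` and `similarity`, the use of SHA-256 as the hash family, and the merchant strings are all assumptions made for the example.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_encode(text, n_components=64, n=3):
    """For each seeded hash function, keep the minimum hash over the n-grams."""
    grams = char_ngrams(text, n)
    signature = []
    for seed in range(n_components):
        hashes = (
            int.from_bytes(
                hashlib.sha256(f"{seed}:{gram}".encode()).digest()[:8], "big"
            )
            for gram in grams
        )
        # default=0 covers strings shorter than n characters (no n-grams)
        signature.append(min(hashes, default=0))
    return signature


def similarity(a, b):
    """Share of equal components: an estimate of n-gram Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


amazon = minhash_encode("AMAZON EU")
amazon_variant = minhash_encode("Amazon EU SARL")
uber = minhash_encode("UBER TRIP")

print(similarity(amazon, amazon_variant))
print(similarity(amazon, uber))
```

Strings that share many n-grams end up with many identical components, so spelling variants of the same merchant stay close in the encoded space, which a one-hot or exact-match encoding cannot provide.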
