# Classification Model Comparison

This notebook trains a simple classifier to predict the label of a transaction using:
1. The raw `description`
2. The tokenized `description` from `replace_tokens()`

The goal is to demonstrate how preprocessing affects model generalization and performance.


In [8]:
import sys
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

sys.path.append(os.path.abspath('..'))


In [13]:
from src.preprocess import replace_tokens

df_raw = pd.read_csv("../data/transactions.csv")
df_tokenized = df_raw.copy()
df_tokenized['description'] = df_tokenized.apply(lambda row: replace_tokens(row['description'], row['label']), axis=1)


In [18]:
def train_model(df, text_col, label_col='label'):
    X = df[text_col]
    y = df[label_col]

    # Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


    # Vectorize
    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Train
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_vec, y_train)

    # Predict
    y_pred = clf.predict(X_test_vec)
    print(f"Results using: {text_col}")
    print(classification_report(y_test, y_pred, zero_division=0))


In [19]:
train_model(df_raw, 'description')



Results using: description
                            precision    recall  f1-score   support

   ATM and Cash Withdrawal       1.00      1.00      1.00         1
                   Alcohol       0.00      0.00      0.00         2
            Auto Insurance       1.00      1.00      1.00         1
     Bank Charges and Fees       0.78      1.00      0.88        14
             Bank Transfer       0.50      0.50      0.50         2
    Cellular Data Purchase       1.00      1.00      1.00         3
                  Clothing       0.00      0.00      0.00         2
                    Coffee       1.00      1.00      1.00         2
               Debit Order       1.00      1.00      1.00         2
                  Donation       0.00      0.00      0.00         1
                Eating Out       0.55      0.92      0.69        12
Electronics and Appliances       0.00      0.00      0.00         1
      Fast Food & Takeouts       1.00      0.50      0.67         2
                    

In [20]:
train_model(df_tokenized, 'description')


Results using: description
                            precision    recall  f1-score   support

   ATM and Cash Withdrawal       1.00      1.00      1.00         1
                   Alcohol       0.00      0.00      0.00         2
            Auto Insurance       1.00      1.00      1.00         1
     Bank Charges and Fees       0.78      1.00      0.88        14
             Bank Transfer       0.50      0.50      0.50         2
    Cellular Data Purchase       1.00      1.00      1.00         3
                  Clothing       0.00      0.00      0.00         2
                    Coffee       1.00      1.00      1.00         2
               Debit Order       1.00      1.00      1.00         2
                  Donation       0.00      0.00      0.00         1
                Eating Out       1.00      1.00      1.00        12
Electronics and Appliances       0.00      0.00      0.00         1
      Fast Food & Takeouts       1.00      0.50      0.67         2
                    

## Classification Model Results Summary

### Table 1: Overall Model Performance (Raw vs. Tokenized)

| Metric                | Raw Descriptions | Tokenized Descriptions | Improvement |
|-----------------------|------------------|------------------------|-------------|
| **Accuracy**          | 0.75             | 0.84                   | **+0.09**  |
| **Macro F1-Score**    | 0.52             | 0.60                   | **+0.08**  |
| **Weighted F1-Score** | 0.69             | 0.79                   | **+0.10**  |

---

### Table 2: Class-Level F1 Score Improvement (Selected Labels)

| Label                 | F1 (Raw) | F1 (Tokenized) | Improvement |
|-----------------------|----------|----------------|-------------|
| **Eating Out**         | 0.69     | 1.00           | **+0.31**  |
| **Groceries**          | 0.71     | 0.91           | **+0.20**  |
| **Fuel**               | 0.00     | 1.00           | **+1.00**  |
| **General Purchases**  | 0.40     | 0.67           | **+0.27**  |

 These classes benefited directly from semantic token replacement, where specific vendor/location strings were replaced with learnable class tokens.

---

### Table 3: Token Replacements and Their Purpose

| Token         | Replaces Example Strings           | Purpose                              |
|---------------|------------------------------------|--------------------------------------|
| `[grocers]`   | Woolworths, Checkers, PnP, etc.    | Reduces vendor name sparsity         |
| `[restaurant]`| Uber Eats, Steers, KFC, etc.       | Generalizes food vendors             |
| `[garage]`    | Engen, Sasol, Shell, etc.          | Standardizes fuel-related terms      |
| `[location]`  | Craighall, Mitchell Park, etc.     | Removes low-frequency location noise |
| `[date]`      | `* 23 Sep`, `* 14 Oct`, etc.       | Removes one-off date tokens          |

---

### Conclusion

The token replacement strategy led to clear improvements in model performance, especially for categories most affected by sparse, vendor-specific naming.

By abstracting vendors, dates, and locations into general semantic tokens, the model was able to generalize better, particularly on labels like `Eating Out`, `Groceries`, and `Fuel`.

These results validate that thoughtful preprocessing can significantly enhance classification accuracy on noisy transactional text.
