<a href="https://colab.research.google.com/github/dzastin96/product-category-classifier/blob/main/notebooks/model_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 Model Training & Evaluation
### Author: Dzastin Januzi

## 🎯 Goal
Compare five ML algorithms for product category classification, evaluate their performance, and save the best model for deployment.

### Models tested
- Logistic Regression
- Linear SVC
- Random Forest
- Decision Tree
- Multinomial Naive Bayes

### Evaluation metrics
- Accuracy
- Macro F1 score
- Classification report
- Confusion matrix

## 📥 1. Load Data

We load the preprocessed product listing dataset from a serialized `.pkl` file using `joblib`.

In [41]:
import joblib

df = joblib.load("../data/final_product_data.pkl")
df.head()

|   | product_id | product_title | merchant_id | category_label | product_code | number_of_views | merchant_rating | listing_date | days_since_listing | views_per_day | popularity_score |
|---|------------|---------------|-------------|----------------|--------------|-----------------|-----------------|--------------|--------------------|---------------|------------------|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | mobile phones | QA-2276-XC | 860.0 | 2.5 | 2024-05-10 | 569 | 1.51 | 2150.0 |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | mobile phones | KA-2501-QO | 3772.0 | 4.8 | 2024-12-31 | 334 | 11.29 | 18105.6 |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | mobile phones | FP-8086-IE | 3092.0 | 3.9 | 2024-11-10 | 385 | 8.03 | 12058.8 |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | mobile phones | YI-0086-US | 466.0 | 3.4 | 2022-05-02 | 1308 | 0.36 | 1584.4 |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | mobile phones | NZ-3586-WP | 4426.0 | 1.6 | 2023-04-12 | 963 | 4.6 | 7081.6 |


## ✂️ 2. Train/Test Split

Split the cleaned dataset into training and testing sets using stratified sampling to preserve class distribution. This ensures fair evaluation across all product categories.

In [42]:
from sklearn.model_selection import train_test_split

print(df['category_label'].value_counts())

use_text_only = True  # True: product_title only; False: product_title plus numeric features

if use_text_only:
    X = df[['product_title']]
else:
    X = df[['product_title', 'views_per_day', 'popularity_score', 'merchant_rating']]
    
y = df['category_label'].astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

category_label
fridge freezers     5424
mobile phones       4023
washing machines    3971
cpus                3792
fridges             3524
tvs                 3502
dishwashers         3374
digital cameras     2661
microwaves          2307
freezers            2182
Name: count, dtype: int64
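
The effect of `stratify=y` can be checked directly by comparing class shares before and after the split. The snippet below is a minimal sketch on a hypothetical toy dataset with a deliberate 3:1 imbalance (the `toy` frame and its labels are made up for illustration, not taken from the real data); the same comparison should work on `y`, `y_train`, and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 3:1 class imbalance (not the real dataset)
toy = pd.DataFrame({
    "product_title": [f"item {i}" for i in range(200)],
    "category_label": ["tvs"] * 150 + ["cpus"] * 50,
})

X_toy = toy[["product_title"]]
y_toy = toy["category_label"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Class shares in the full data and in each split
full_share = y_toy.value_counts(normalize=True).sort_index()
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()

# With stratification the three columns should agree
print(pd.DataFrame({"full": full_share, "train": train_share, "test": test_share}))
```

Without `stratify`, the shares in the test set would vary from seed to seed, which matters most for the smaller categories.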


## 3. Preprocessing

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# We use MinMaxScaler instead of StandardScaler because:
# - MultinomialNB requires non-negative inputs
# - MinMaxScaler maps each numeric feature's training range into [0, 1], so the fitted data stays non-negative
# - StandardScaler centers each feature at zero, producing negative values that MultinomialNB rejects

if use_text_only:
    preprocess = ColumnTransformer([
        ('title', TfidfVectorizer(), 'product_title')
    ])
else:
    preprocess = ColumnTransformer([
        ('title', TfidfVectorizer(), 'product_title'),
        ('num', MinMaxScaler(), ['views_per_day', 'popularity_score', 'merchant_rating'])
    ])
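
The scaler choice can be illustrated in isolation. The sketch below uses hypothetical numeric values (not rows from the dataset) to show that `MultinomialNB` fits on min-max scaled input but raises a `ValueError` on standardized input. It also illustrates one caveat: a value below the training minimum still maps below zero at prediction time unless the scaler is created with `clip=True` (assuming scikit-learn 0.24 or newer, where that parameter is available).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric features and labels, for illustration only
X_num = np.array([[1.5, 2.5], [11.3, 4.8], [8.0, 3.9], [0.4, 3.4]])
y_toy = np.array(["a", "b", "b", "a"])

X_minmax = MinMaxScaler().fit_transform(X_num)
X_standard = StandardScaler().fit_transform(X_num)

# Min-max scaled training data is non-negative, so this fit succeeds
MultinomialNB().fit(X_minmax, y_toy)

# Standardized data contains negative values, which MultinomialNB rejects
try:
    MultinomialNB().fit(X_standard, y_toy)
    standard_ok = True
except ValueError as exc:
    standard_ok = False
    print("MultinomialNB rejected standardized input:", exc)

# Caveat: a row below the training minimum maps below zero unless clipped
below_min = np.array([[0.1, 1.0]])
unclipped = MinMaxScaler().fit(X_num).transform(below_min)
clipped = MinMaxScaler(clip=True).fit(X_num).transform(below_min)
print(unclipped, clipped)
```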

## 4. Candidate Models

In [44]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Define candidate models (random_state is fixed for the tree-based models so runs are repeatable)
models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Support Vector Machine": LinearSVC(max_iter=5000)
}

## 5. Training, Prediction, Evaluation

In [45]:
for name, model in models.items():
    print(f"\n=== {name} ===")
    
    pipeline = Pipeline([
        ("preprocessing", preprocess),
        ("classifier", model)
    ])
    
    # Train
    pipeline.fit(X_train, y_train)
    
    # Predict
    y_pred = pipeline.predict(X_test)
    
    # Classification report
    print(classification_report(y_test, y_pred))



=== Logistic Regression ===
                  precision    recall  f1-score   support

            cpus       1.00      1.00      1.00       758
 digital cameras       1.00      0.99      0.99       532
     dishwashers       0.90      0.97      0.94       675
        freezers       0.99      0.91      0.95       436
 fridge freezers       0.95      0.94      0.95      1085
         fridges       0.91      0.90      0.91       705
      microwaves       1.00      0.95      0.97       461
   mobile phones       0.97      0.99      0.98       805
             tvs       0.97      0.98      0.97       701
washing machines       0.94      0.95      0.94       794

        accuracy                           0.96      6952
       macro avg       0.96      0.96      0.96      6952
    weighted avg       0.96      0.96      0.96      6952


=== Naive Bayes ===
                  precision    recall  f1-score   support

            cpus       1.00      1.00      1.00       758
 digital cameras  

## 6. Select the Best Model

In this step, we evaluate all trained models (Logistic Regression, Naive Bayes, Decision Tree, Random Forest, and Support Vector Machine) using key performance metrics:

- **Accuracy**: overall proportion of correct predictions  
- **Macro Avg. F1 Score**: unweighted mean of the per-class F1 scores, so every class counts equally regardless of size  
- **Weighted Avg. F1 Score**: mean of the per-class F1 scores, weighted by class support (the number of true samples per class)
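
The three metrics listed above, plus the confusion matrix, can be computed with `sklearn.metrics`. The following is a minimal sketch on hypothetical labels (six made-up predictions, not output from the models in this notebook), small enough to check by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true and predicted labels, for illustration only
y_true = ["tvs", "tvs", "tvs", "tvs", "cpus", "cpus"]
y_hat = ["tvs", "tvs", "tvs", "cpus", "cpus", "cpus"]
labels = ["cpus", "tvs"]

acc = accuracy_score(y_true, y_hat)

# One F1 score per class, in the order given by `labels`
per_class_f1 = f1_score(y_true, y_hat, labels=labels, average=None)

# Macro: plain mean of per-class F1; weighted: mean weighted by class support
macro_f1 = f1_score(y_true, y_hat, average="macro")
weighted_f1 = f1_score(y_true, y_hat, average="weighted")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=labels)

print(f"accuracy={acc:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
print(cm)
```

Here one `tvs` item is predicted as `cpus`, so the per-class F1 scores are 0.8 for `cpus` and 6/7 for `tvs`. The weighted average sits closer to the `tvs` score because that class has more samples.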

### Results (Text Only: `product_title`)

| Model                  | Accuracy | Macro Avg. F1 | Weighted Avg. F1 | Comments                                                   |
|------------------------|----------|----------------|------------------|------------------------------------------------------------|
| Logistic Regression     | 0.96     | 0.96           | 0.96             | Very consistent across all classes                         |
| Naive Bayes             | 0.94     | 0.93           | 0.94             | Good overall, but recall collapses for freezers and fridge freezers |
| Decision Tree           | 0.94     | 0.94           | 0.94             | Solid, but weaker on fridges/freezers                      |
| Random Forest           | 0.95     | 0.96           | 0.95             | Strong, balanced, slightly below Logistic Regression       |
| **Support Vector Machine** | **0.97** | **0.97**     | **0.97**         | 🏆 Best overall: highest accuracy & balanced across classes |

### Results (`product_title` + Numeric Features)

| Model                  | Accuracy | Macro Avg. F1 | Weighted Avg. F1 | Comments                                                   |
|------------------------|----------|----------------|------------------|------------------------------------------------------------|
| Logistic Regression     | 0.96     | 0.96           | 0.96             | Very consistent across all classes                         |
| Naive Bayes             | 0.92     | 0.91           | 0.92             | Struggles on freezers & fridge freezers              |
| Decision Tree           | 0.93     | 0.93           | 0.93             | Solid, but weaker on fridges/freezers                      |
| Random Forest           | 0.95     | 0.95           | 0.95             | Strong, balanced, slightly below Logistic Regression       |
| **Support Vector Machine** | **0.97** | **0.97**     | **0.97**         | 🏆 Best overall: highest accuracy & balanced across classes |



From the models above, we select the one with the **highest accuracy and the highest macro and weighted F1 scores**.  
Here that is the **Support Vector Machine (SVM)**, implemented with scikit-learn's `LinearSVC`, which scores 0.97 on all three metrics in both feature setups.
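
The selection rule can be written down explicitly. The sketch below copies the scores from the text-only results table into a dictionary and picks the top model; ranking by accuracy first, then macro F1, then weighted F1 is an assumption made here to break ties, not a rule stated in the notebook.

```python
# Scores copied from the text-only results table
results = {
    "Logistic Regression": {"accuracy": 0.96, "macro_f1": 0.96, "weighted_f1": 0.96},
    "Naive Bayes": {"accuracy": 0.94, "macro_f1": 0.93, "weighted_f1": 0.94},
    "Decision Tree": {"accuracy": 0.94, "macro_f1": 0.94, "weighted_f1": 0.94},
    "Random Forest": {"accuracy": 0.95, "macro_f1": 0.96, "weighted_f1": 0.95},
    "Support Vector Machine": {"accuracy": 0.97, "macro_f1": 0.97, "weighted_f1": 0.97},
}


def ranking_key(name):
    """Sort key: accuracy, then macro F1, then weighted F1 (assumed tie-break order)."""
    scores = results[name]
    return (scores["accuracy"], scores["macro_f1"], scores["weighted_f1"])


best_name = max(results, key=ranking_key)
print(best_name, results[best_name])
```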

Product titles alone are enough to solve this task with high accuracy: with `product_title` as the only input, the SVM reaches 97% accuracy on the held-out test set.  
Adding the numeric features does not improve any model at two-decimal precision, and it lowers accuracy for Naive Bayes (0.94 to 0.92) and Decision Tree (0.94 to 0.93).  
Therefore, the final production model requires only `product_title` for both training and prediction, which simplifies the pipeline and reduces the inputs needed at prediction time.
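
Saving the selected model for deployment could look like the sketch below. It assumes the text-only pipeline (TF-IDF on `product_title` followed by `LinearSVC`), trains on a few hypothetical titles standing in for `X_train` and `y_train`, and writes to a temporary directory. The file name `product_category_model.pkl` is a placeholder, not a path used by this project.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy titles and labels standing in for X_train / y_train
train_df = pd.DataFrame({"product_title": [
    "apple iphone 8 plus 64gb silver",
    "samsung galaxy s9 64gb black",
    "bosch serie 4 freestanding dishwasher",
    "beko slimline dishwasher white",
]})
train_labels = ["mobile phones", "mobile phones", "dishwashers", "dishwashers"]

# Same structure as the text-only pipeline: TF-IDF on the title, then LinearSVC
final_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("title", TfidfVectorizer(), "product_title")
    ])),
    ("classifier", LinearSVC(max_iter=5000)),
])
final_pipeline.fit(train_df, train_labels)

new_titles = pd.DataFrame({
    "product_title": ["apple iphone 8 plus 64gb gold", "bosch dishwasher"]
})
original_pred = final_pipeline.predict(new_titles)

# Save the whole pipeline so the vectorizer and classifier stay in sync
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "product_category_model.pkl"  # placeholder name
    joblib.dump(final_pipeline, model_path)
    reloaded = joblib.load(model_path)

reloaded_pred = reloaded.predict(new_titles)
print(reloaded_pred)
```

Persisting the full `Pipeline` rather than the classifier alone means the deployed model accepts raw titles in a `product_title` column and applies the same fitted vocabulary it was trained with.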

