
# 🤖 Model Training & Evaluation
### Author: Dzastin Januzi

## 🎯 Goal

Build and evaluate a robust product category classification pipeline using multiple machine learning algorithms.  
The objective is to identify the best-performing model and feature setup for deployment in e-commerce or inventory systems.

We compare five classifiers:
- Logistic Regression
- Naive Bayes
- Decision Tree
- Random Forest
- Support Vector Machine (SVM)

Each model is evaluated under two feature configurations:
- 📝 `product_title` only (text-based)
- 🧮 `product_title` + numeric features (`num_words`, `num_chars`, `has_digits_or_special`, `has_uppercase_terms`, `longest_word_len`)

📊 Evaluation metrics:
- Accuracy
- Macro F1 Score
- Weighted F1 Score
- Classification Report

## 📥 1. Load Data

We load the preprocessed product listing dataset from a serialized `.pkl` file using `joblib`.

In [4]:
import joblib

df = joblib.load("../data/final_product_data.pkl")
df.head()

| | product_title | category_label | num_words | num_chars | has_digits_or_special | has_uppercase_terms | longest_word_len |
|---|---|---|---|---|---|---|---|
| 0 | apple iphone 8 plus 64gb silver | mobile phones | 6 | 31 | 1 | 1 | 6 |
| 1 | apple iphone 8 plus 64 gb spacegrau | mobile phones | 7 | 35 | 1 | 1 | 9 |
| 2 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | mobile phones | 13 | 70 | 1 | 1 | 10 |
| 3 | apple iphone 8 plus 64gb space grey | mobile phones | 7 | 35 | 1 | 1 | 6 |
| 4 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | mobile phones | 11 | 54 | 1 | 1 | 8 |


## ✂️ 2. Train/Test Split

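We split the cleaned dataset 80/20 into training and test sets, stratifying on `category_label` so that both splits preserve the class distribution. This matters because the classes are imbalanced (`fridge freezers` has 11,130 rows, `microwaves` only 2,307), and an unstratified split could under-represent the smaller categories in the test set.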

In [5]:
from sklearn.model_selection import train_test_split

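# Inspect the class distribution before splitting
print(df['category_label'].value_counts())

# Features: the raw title plus the engineered numeric columns
X = df[['product_title', 'num_words', 'num_chars', 'has_digits_or_special', 'has_uppercase_terms', 'longest_word_len']]
y = df['category_label'].astype(str)

# Stratified 80/20 split with a fixed seed so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)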

category_label
fridge freezers     11130
mobile phones        4023
washing machines     3971
cpus                 3792
tvs                  3502
dishwashers          3374
digital cameras      2661
microwaves           2307
Name: count, dtype: int64
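
To illustrate what stratification does, here is a minimal sketch on toy labels. The data, class shares, and the `shares` table are made up for illustration and are not taken from the real dataset; only the `train_test_split` arguments mirror the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with a deliberate imbalance (hypothetical data, not the real dataset)
y = pd.Series(["fridge freezers"] * 60 + ["mobile phones"] * 30 + ["microwaves"] * 10)
X = pd.DataFrame({"product_title": [f"item {i}" for i in range(len(y))]})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Compare class shares in the full data and in each split
shares = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(shares)
```

With `stratify=y`, the three columns should show (nearly) the same proportions; dropping the argument lets the test-set shares drift, which hurts most for the smallest class.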


## 🧼 3. Preprocessing

We prepare the input features with a `ColumnTransformer` that applies TF-IDF vectorization to `product_title` and min-max scaling to the numeric columns.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# We use MinMaxScaler instead of StandardScaler because:
# - MultinomialNB requires non-negative inputs
# - MinMaxScaler maps each numeric feature into [0, 1] on the training data, so values stay non-negative
# - StandardScaler centers each feature at zero, producing negative values that MultinomialNB rejects


print(df.head(5))
preprocess = ColumnTransformer([
    ('title', TfidfVectorizer(), 'product_title'),
    ('length', MinMaxScaler(), [
        "longest_word_len",
        "num_words",
        "num_chars"
    ]),
    ('binary', MinMaxScaler(), [
        "has_digits_or_special",
        "has_uppercase_terms"
    ]),
])
    
    

                                       product_title category_label  \
0                    apple iphone 8 plus 64gb silver  mobile phones   
1                apple iphone 8 plus 64 gb spacegrau  mobile phones   
2  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...  mobile phones   
3                apple iphone 8 plus 64gb space grey  mobile phones   
4  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...  mobile phones   

   num_words  num_chars  has_digits_or_special  has_uppercase_terms  \
0          6         31                      1                    1   
1          7         35                      1                    1   
2         13         70                      1                    1   
3          7         35                      1                    1   
4         11         54                      1                    1   

   longest_word_len  
0                 6  
1                 9  
2                10  
3                 6  
4                 8  
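
The scaler choice explained in the code comments above can be checked with a small sketch. The four rows below are hypothetical length features, and the labels `"a"`/`"b"` are placeholders; the point is only to show how `MultinomialNB` reacts to each scaler.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical (num_words, num_chars) values for four titles
X = np.array([[6, 31], [7, 35], [13, 70], [11, 54]], dtype=float)
y = np.array(["a", "a", "b", "b"])

X_minmax = MinMaxScaler().fit_transform(X)
X_standard = StandardScaler().fit_transform(X)

print("MinMaxScaler range:", X_minmax.min(), X_minmax.max())
print("StandardScaler minimum:", X_standard.min())  # below zero for values under the column mean

# Non-negative input is accepted
MultinomialNB().fit(X_minmax, y)

# Negative input is rejected with a ValueError
try:
    MultinomialNB().fit(X_standard, y)
    standard_ok = True
except ValueError as err:
    standard_ok = False
    print("StandardScaler + MultinomialNB failed:", err)
```

One caveat: the [0, 1] range is only guaranteed for the data the scaler was fitted on. Test rows outside the training range can land outside it; `MinMaxScaler(clip=True)` is one way to bound them.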


## 📦 4. Candidate Models

We define a set of candidate classification models to evaluate performance across different algorithmic approaches.
All models are wrapped in a pipeline that includes preprocessing (TF-IDF + optional scaling) and classification.  
This modular setup allows us to benchmark each model consistently across both feature configurations (`product_title` only vs `product_title` + numeric features).

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Define candidate models
models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": LinearSVC(max_iter=5000)
}
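
The notebook defines a single `preprocess` object, so the two feature configurations are not spelled out in code. One possible way to build them is sketched below. `make_preprocessor`, `NUMERIC_FEATURES`, `feature_configs`, and the toy frame are hypothetical names and data introduced for illustration, and all numeric columns share one scaler here for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

NUMERIC_FEATURES = [
    "num_words", "num_chars", "has_digits_or_special",
    "has_uppercase_terms", "longest_word_len",
]

def make_preprocessor(use_numeric: bool) -> ColumnTransformer:
    """Hypothetical helper: TF-IDF on the title, optionally plus scaled numeric columns."""
    steps = [("title", TfidfVectorizer(), "product_title")]
    if use_numeric:
        steps.append(("numeric", MinMaxScaler(), NUMERIC_FEATURES))
    # Columns not listed are dropped (ColumnTransformer's default remainder="drop")
    return ColumnTransformer(steps)

feature_configs = {
    "title only": make_preprocessor(use_numeric=False),
    "title + numeric": make_preprocessor(use_numeric=True),
}

# Tiny toy frame with made-up values, just to exercise both configurations
toy = pd.DataFrame({
    "product_title": ["apple iphone 8 plus 64gb", "samsung galaxy s9 64gb",
                      "bosch dishwasher 60cm", "beko dishwasher white"],
    "num_words": [5, 4, 3, 3],
    "num_chars": [24, 22, 21, 21],
    "has_digits_or_special": [1, 1, 1, 0],
    "has_uppercase_terms": [0, 0, 0, 0],
    "longest_word_len": [6, 7, 10, 10],
})
toy_labels = ["mobile phones", "mobile phones", "dishwashers", "dishwashers"]

widths = {}
for config_name, preprocessor in feature_configs.items():
    pipeline = Pipeline([("preprocessing", preprocessor), ("classifier", LinearSVC())])
    pipeline.fit(toy, toy_labels)
    # Number of columns the classifier actually sees in this configuration
    widths[config_name] = pipeline.named_steps["preprocessing"].transform(toy).shape[1]
    print(config_name, widths[config_name])
```

Because the same input frame feeds both pipelines, switching configurations only changes the preprocessor, which keeps the comparison between models like-for-like.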

## 📊 5. Training, Prediction, Evaluation

We train and evaluate each candidate model using a consistent pipeline:
- Each model is wrapped in a `Pipeline` that includes:
  - **Preprocessing**: TF-IDF vectorization (and optional MinMax scaling if numeric features are used)
  - **Classifier**: one of the selected models

🔁 For each model:
- We **fit** the pipeline on the training set (`X_train`, `y_train`)
- We **predict** on the test set (`X_test`)
- We generate a **classification report** showing precision, recall, and F1-score per class

In [8]:
for name, model in models.items():
    print(f"\n=== {name} ===")
    
    pipeline = Pipeline([
        ("preprocessing", preprocess),
        ("classifier", model)
    ])
    
    # Train
    pipeline.fit(X_train, y_train)
    
    # Predict
    y_pred = pipeline.predict(X_test)
    
    # Classification report
    print(classification_report(y_test, y_pred))


=== Logistic Regression ===


ValueError: A given column is not a column of the dataframe
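
`ColumnTransformer` raises this error when one of its column specifications names a column that the input frame lacks. Since the traceback is truncated, the cause here is not certain. One plausible explanation (an assumption) is that the `preprocess` object in memory was built from an earlier definition, given that the preprocessing cell is shown as `In [None]`. A small check like the one below can surface the mismatch before fitting. `missing_columns` is a hypothetical helper that only handles column specs given as a name or a list of names, and `views_per_day` stands in for any stale column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

def missing_columns(transformer: ColumnTransformer, frame: pd.DataFrame) -> list:
    """Hypothetical helper: list column names the transformer expects but the frame lacks."""
    missing = []
    for _name, _estimator, columns in transformer.transformers:
        # A column spec may be a single name or a list of names
        names = [columns] if isinstance(columns, str) else list(columns)
        missing.extend(c for c in names if c not in frame.columns)
    return missing

frame = pd.DataFrame({"product_title": ["apple iphone 8"], "num_words": [3]})

stale = ColumnTransformer([
    ("title", TfidfVectorizer(), "product_title"),
    ("numeric", MinMaxScaler(), ["num_words", "views_per_day"]),  # hypothetical stale column
])

print(missing_columns(stale, frame))  # ['views_per_day']
```

Running such a check against `X_train` and re-executing the preprocessing cell before the training loop would be a reasonable first step in diagnosing the error.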

## 🏆 6. Select the Best Model

In this step, we evaluate all trained models (Logistic Regression, Naive Bayes, Decision Tree, Random Forest, and Support Vector Machine) using key performance metrics:

- **Accuracy**: overall proportion of correct predictions  
- **Macro Avg. F1 Score**: unweighted mean of the per-class F1 scores (each F1 is the harmonic mean of that class's precision and recall), so every class counts equally  
- **Weighted Avg. F1 Score**: mean of the per-class F1 scores, weighted by class support
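
The difference between the two averages is easiest to see on an imbalanced toy example. The labels below are made up for illustration: class `a` has 8 samples, class `b` has 2, and one `b` sample is misclassified as `a`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 8 samples of class "a", 2 of class "b"
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]  # one "b" is predicted as "a"

accuracy = accuracy_score(y_true, y_pred)
per_class = f1_score(y_true, y_pred, average=None, labels=["a", "b"])
macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")

# Per-class F1: a = 16/17 (precision 8/9, recall 1), b = 2/3 (precision 1, recall 1/2)
print("accuracy:", round(accuracy, 4))      # 0.9
print("per-class F1:", per_class.round(4))
print("macro F1:", round(macro, 4))         # (16/17 + 2/3) / 2
print("weighted F1:", round(weighted, 4))   # 0.8 * 16/17 + 0.2 * 2/3
```

The macro average is pulled down by the weak minority class, while the weighted average stays close to the majority-class score. A gap between the two in the results tables is therefore a hint that smaller categories are being classified less well.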

### Results (Text Only: `product_title`)

| Model                  | Accuracy | Macro Avg. F1 | Weighted Avg. F1 | Comments                                                   |
|------------------------|----------|----------------|------------------|------------------------------------------------------------|
| Logistic Regression     | 0.96     | 0.96           | 0.96             | Very consistent across all classes                         |
| Naive Bayes             | 0.94     | 0.93           | 0.94             | Good overall, but recall collapses for freezers and fridge freezers |
| Decision Tree           | 0.94     | 0.94           | 0.94             | Solid, but weaker on fridges/freezers                      |
| Random Forest           | 0.95     | 0.96           | 0.95             | Strong, balanced, slightly below Logistic Regression       |
| **Support Vector Machine** | **0.97** | **0.97**     | **0.97**         | 🏆 Best overall: highest accuracy & balanced across classes |

### Results (`product_title` + Numeric Features)

| Model                  | Accuracy | Macro Avg. F1 | Weighted Avg. F1 | Comments                                                   |
|------------------------|----------|----------------|------------------|------------------------------------------------------------|
| Logistic Regression     | 0.96     | 0.96           | 0.96             | Very consistent across all classes                         |
| Naive Bayes             | 0.92     | 0.91           | 0.92             | Struggles on freezers & fridge freezers              |
| Decision Tree           | 0.93     | 0.93           | 0.93             | Solid, but weaker on fridges/freezers                      |
| Random Forest           | 0.95     | 0.95           | 0.95             | Strong, balanced, slightly below Logistic Regression       |
| **Support Vector Machine** | **0.97** | **0.97**     | **0.97**         | 🏆 Best overall: highest accuracy & balanced across classes |

### 📌 Final Decision

After comparing both setups, we select:

- ✅ **Best Model:** Support Vector Machine (SVM, `LinearSVC`)  
- 📝 **Best Feature Setup:** `product_title` only  

📈 Text alone reaches 0.97 accuracy and 0.97 macro F1 with the SVM.  
📉 Numeric features leave the Logistic Regression and SVM scores unchanged and lower accuracy for Naive Bayes (0.94 to 0.92) and Decision Tree (0.94 to 0.93).  
🎯 Therefore, the final production pipeline uses **only `product_title`**, which keeps it simple, fast, and easy to reproduce.
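
A minimal sketch of what such a title-only pipeline could look like, including a save/load round trip with `joblib`. The training titles, the step names, and the file name `product_category_model.pkl` are placeholders chosen for illustration; the real pipeline would be fitted on `X_train['product_title']` and `y_train`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training titles and labels standing in for the real training split
titles = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus 64gb space grey",
    "bosch serie 4 dishwasher 60cm white",
    "beko freestanding dishwasher 13 place",
]
labels = ["mobile phones", "mobile phones", "dishwashers", "dishwashers"]

# Title-only pipeline: TF-IDF features fed into a linear SVM
final_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LinearSVC(max_iter=5000)),
])
final_pipeline.fit(titles, labels)

# Persist and restore the fitted pipeline (temporary directory used for the demo)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "product_category_model.pkl")  # hypothetical file name
    joblib.dump(final_pipeline, model_path)
    restored = joblib.load(model_path)

prediction = restored.predict(["apple iphone 8 plus 128gb gold"])
print(prediction)
```

Saving the whole `Pipeline` rather than the classifier alone keeps the fitted TF-IDF vocabulary together with the model, so raw titles can be passed straight to `predict` at inference time.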

