<a href="https://colab.research.google.com/github/arpanpathak/DataScienceNotebooks/blob/main/BERT_For_Product_Categorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Fine-Tuning SLMs for Enterprise Product Categorization



## The Business Case: Why Small Language Models (SLMs)?

> In real-world production systems (e.g., e-commerce search, inventory tagging), latency and cost are the primary constraints. LLMs (Llama-3 70B, GPT-4) are capable, but they are inefficient for high-throughput classification.

| Metric | Large Language Model (LLM) | Small Language Model (DistilBERT) |
| :--- | :--- | :--- |
| Parameters | 7B - 175B | 66M |
| Inference Time | 500ms - 2s (GPU) | ~15ms (CPU) |
| Cost | High (Requires A100s) | Low (Runs on T4 or CPU) |
| Task Suitability | Creative Generation, Reasoning | Pattern Recognition, Classification |

Conclusion: For a defined taxonomy (e.g., Amazon > Hardware > Laptops), a fine-tuned SLM can often achieve >95% accuracy at roughly 1/100th the compute cost of an LLM.

## Mathematical Intuition: The Transformer Architecture

> DistilBERT is a distilled version of BERT. To understand it, we must look at the mathematical operations inside a Transformer encoder.

### A. Input Embedding & Positional Encoding

Transformers have no inherent sense of order (unlike RNNs). We must inject position information mathematically.

Given a token sequence $X = [x_1, \dots, x_n]$, we map each token to a learned embedding $E(x_{pos}) \in \mathbb{R}^{d_{model}}$ and add a positional vector $P(pos)$:

$$Input(pos) = E(x_{pos}) + P(pos)$$

In the original Transformer, $P$ is fixed and defined by sine/cosine frequencies (BERT and DistilBERT instead learn $P$ as a position-embedding table, but the additive form is the same):

$$P_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$P_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
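
The sine/cosine definition can be checked numerically. The following is a minimal sketch in plain Python; the function name `sinusoidal_position` and the toy dimension `d_model = 8` are assumptions made for illustration and are not part of the notebook.

```python
import math

def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    vec = [0.0] * d_model
    for even_idx in range(0, d_model, 2):
        # even_idx plays the role of 2i in the formula
        angle = pos / (10000 ** (even_idx / d_model))
        vec[even_idx] = math.sin(angle)          # P(pos, 2i)
        if even_idx + 1 < d_model:
            vec[even_idx + 1] = math.cos(angle)  # P(pos, 2i+1)
    return vec

pe0 = sinusoidal_position(0, 8)  # position 0: every sine is 0, every cosine is 1
pe1 = sinusoidal_position(1, 8)
print(pe0)
print([round(v, 4) for v in pe1])
```

Low dimensions oscillate quickly with position and high dimensions slowly, which gives each position a distinct vector.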

### B. Scaled Dot-Product Attention

This is the "brain" of the model. It calculates how much focus (attention) token $i$ should give to token $j$.

For a specific head, we project the input $X$ into Query ($Q$), Key ($K$), and Value ($V$) matrices:


$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

The attention score is calculated as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

$QK^T$: The dot product measures semantic similarity between tokens.

$\sqrt{d_k}$: Scaling factor to stabilize gradients (prevents softmax saturation).

$V$: The actual information content we want to aggregate based on the attention scores.
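
To make the formula concrete, here is a small single-head sketch in plain Python (no masking, no batching). The helper names and the toy $Q$, $K$, $V$ matrices are assumptions chosen so the result is easy to read; real models use tensor libraries for this.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major lists of lists."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy example: two tokens, d_k = 2 (hypothetical values)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights)
print(out)
```

Each query attends most strongly to the key it matches, so each output row is dominated by the corresponding value row.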

## Why DistilBERT? (Knowledge Distillation)

DistilBERT is trained to mimic a larger "Teacher" model (BERT-Base). The training objective minimizes a triple loss function to transfer knowledge:

$$\mathcal{L} = \alpha \mathcal{L}_{ce} + \beta \mathcal{L}_{cos} + \gamma \mathcal{L}_{mlm}$$

$\mathcal{L}_{ce}$ (Distillation Loss): KL-Divergence between the Teacher's soft probabilities ($t_i$) and the Student's soft probabilities ($s_i$). Both are computed with a temperature-scaled softmax, $p_i = \exp(z_i / T) / \sum_j \exp(z_j / T)$, where a higher temperature $T$ softens the distribution so that the relationships between non-target classes are also transferred.

$$\mathcal{L}_{ce} = \sum_i t_i \log \left( \frac{t_i}{s_i} \right)$$

$\mathcal{L}_{cos}$ (Cosine Embedding Loss): Aligns the directions of the Teacher's and Student's hidden-state vectors.

$\mathcal{L}_{mlm}$ (Masked Language Modeling): Standard BERT training (predicting masked words).
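
The distillation term can be illustrated with a few lines of plain Python. This is only a sketch of the $\mathcal{L}_{ce}$ formula above: the logits are made-up numbers, the temperature $T = 2$ is an arbitrary choice, and the function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(t, s):
    """sum_i t_i * log(t_i / s_i), matching the L_ce formula."""
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0.0)

teacher_logits = [4.0, 1.0, 0.5]  # hypothetical values
student_logits = [3.0, 1.5, 0.2]  # hypothetical values

T = 2.0
t = softmax_with_temperature(teacher_logits, T)
s = softmax_with_temperature(student_logits, T)
loss_ce = kl_divergence(t, s)

# For comparison: the teacher distribution without softening
sharp_teacher = softmax_with_temperature(teacher_logits, 1.0)
print(loss_ce, t, sharp_teacher)
```

Raising $T$ lowers the peak probability and spreads mass over the other classes, which is what lets the Student learn from the Teacher's relative preferences.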

## The Fine-Tuning Algorithm

We adapt the pre-trained DistilBERT for our specific Product Category Classification task.

### Step 1: Contextual Representation

We pass our product description string $x$ through the DistilBERT model. We extract the hidden state of the special classification token [CLS] from the last layer.

$$h_{[CLS]} = \text{DistilBERT}(x) \in \mathbb{R}^{768}$$

### Step 2: The Classification Head

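We project this 768-dimensional vector down to our $K$ product categories using a learned weight matrix $W \in \mathbb{R}^{K \times 768}$ and bias $b \in \mathbb{R}^{K}$. (The Hugging Face `DistilBertForSequenceClassification` head used below also inserts a 768-to-768 `pre_classifier` layer with ReLU and dropout before this projection; the final projection is the same.)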

$$z = W h_{[CLS]} + b$$

### Step 3: Probability Distribution (Softmax)

We convert the logits $z$ into probabilities $\hat{y}$ for each class $k$:

$$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
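
Steps 2 and 3 can be traced by hand on a toy example. The sketch below assumes a 4-dimensional hidden vector instead of 768 and $K = 3$ classes; all numbers and helper names are made up for illustration.

```python
import math

def linear(W, h, b):
    """z = W h + b for a list-of-lists W, vector h and bias b."""
    return [sum(w * x for w, x in zip(row, h)) + b_i for row, b_i in zip(W, b)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

h_cls = [0.5, -1.0, 0.25, 2.0]   # stand-in for the [CLS] hidden state
W = [[0.1, 0.0, 0.0, 0.5],       # one row per class (K = 3)
     [0.0, 0.2, 0.0, 0.0],
     [0.3, 0.0, 0.4, 0.0]]
b = [0.0, 0.1, -0.1]

z = linear(W, h_cls, b)          # logits
y_hat = softmax(z)               # class probabilities
pred = max(range(len(y_hat)), key=y_hat.__getitem__)  # argmax
print(z, y_hat, pred)
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large logits.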

### Step 4: Optimization Objective (Cross-Entropy Loss)

We minimize the difference between our predicted distribution $\hat{y}$ and the true one-hot label $y$ (where $y_c=1$ for the correct category).

$$\mathcal{L}(\theta) = -\sum_{c=1}^{K} y_c \log(\hat{y}_c)$$
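
Because $y$ is one-hot, the sum collapses to $-\log(\hat{y}_c)$ for the correct class $c$. A minimal sketch in plain Python follows; the function name is hypothetical and the logits are made up.

```python
import math

def cross_entropy_from_logits(z, target):
    """-log(softmax(z)[target]), computed stably via log-sum-exp."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[target]

K = 8
# An untrained classifier that spreads probability evenly over K classes
uniform_loss = cross_entropy_from_logits([0.0] * K, target=3)
# A confident, correct prediction
confident_loss = cross_entropy_from_logits(
    [0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0], target=3
)
print(uniform_loss, math.log(K), confident_loss)
```

A uniform prediction over 8 classes gives a loss of $\log 8 \approx 2.079$, which is a useful sanity check: the first-epoch losses of about 2.1 in the sample run at the end of this notebook are consistent with a classifier that has barely started learning.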

### Step 5: Weight Update (AdamW)

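We update all model parameters $\theta$ (both the DistilBERT weights and the classification-head weights) using gradients from backpropagation. AdamW combines Adam's bias-corrected moment estimates $\hat{m}_t$ and $\hat{v}_t$ with a decoupled weight-decay term $\lambda \theta_t$:

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$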
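
A single AdamW update for one scalar parameter can be sketched as follows. This is an illustrative reimplementation, not the optimizer code the notebook runs: the function name is hypothetical, and the defaults for `beta1`, `beta2`, `eps`, and `weight_decay` are assumed to mirror the usual PyTorch defaults. The `weight_decay * theta` term is the decoupled decay that distinguishes AdamW from plain Adam.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=2e-5,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Adam step plus decoupled weight decay
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, grad=1.0, m=m, v=v, t=1)
print(theta)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.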

In [None]:
# Install dependencies (pandas and scikit-learn are used by the data-processing cell)
!pip install torch transformers tqdm pandas scikit-learn

# Data Models and Datasets (BYOD - Bring Your Own Dataset)

## Sample Dataset
| Description | Category |
| :--- | :--- |
| Dell XPS 13 Laptop 16GB RAM 512GB SSD | Electronics > Computers > Laptops |
| Nike Air Max Running Shoes Size 10 | Clothing > Men > Shoes |
| Sony WH-1000XM4 Noise Canceling Headphones | Electronics > Audio > Headphones |
| Apple MacBook Pro M1 Chip 13-inch | Electronics > Computers > Laptops |
| Adidas Ultraboost 21 Sneakers | Clothing > Men > Shoes |
| Bose QuietComfort 45 Bluetooth Headphones | Electronics > Audio > Headphones |
| Lenovo ThinkPad X1 Carbon Gen 9 | Electronics > Computers > Laptops |
| Puma T-Shirt Cotton Black | Clothing > Men > Tops |
| JBL Flip 5 Waterproof Portable Bluetooth Speaker | Electronics > Audio > Speakers |
| Harry Potter and the Sorcerer's Stone Hardcover | Books > Fiction > Fantasy |
| The Great Gatsby Paperback | Books > Fiction > Classics |
| Python Crash Course 2nd Edition | Books > Non-Fiction > Education |
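
Since the dataset file is not shipped with the notebook, one way to get started is to write the sample rows above to a CSV. The sketch below uses only the standard library; the function name `write_sample_csv` is hypothetical, and the lowercase `description` and `category` headers are chosen to match the column names that `DataProcessor` reads.

```python
import csv
import os

SAMPLE_ROWS = [
    ("Dell XPS 13 Laptop 16GB RAM 512GB SSD", "Electronics > Computers > Laptops"),
    ("Nike Air Max Running Shoes Size 10", "Clothing > Men > Shoes"),
    ("Sony WH-1000XM4 Noise Canceling Headphones", "Electronics > Audio > Headphones"),
    ("Apple MacBook Pro M1 Chip 13-inch", "Electronics > Computers > Laptops"),
    ("Adidas Ultraboost 21 Sneakers", "Clothing > Men > Shoes"),
    ("Bose QuietComfort 45 Bluetooth Headphones", "Electronics > Audio > Headphones"),
    ("Lenovo ThinkPad X1 Carbon Gen 9", "Electronics > Computers > Laptops"),
    ("Puma T-Shirt Cotton Black", "Clothing > Men > Tops"),
    ("JBL Flip 5 Waterproof Portable Bluetooth Speaker", "Electronics > Audio > Speakers"),
    ("Harry Potter and the Sorcerer's Stone Hardcover", "Books > Fiction > Fantasy"),
    ("The Great Gatsby Paperback", "Books > Fiction > Classics"),
    ("Python Crash Course 2nd Edition", "Books > Non-Fiction > Education"),
]

def write_sample_csv(path: str) -> int:
    """Write the sample rows to `path` with lowercase headers; return the row count."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["description", "category"])
        writer.writerows(SAMPLE_ROWS)
    return len(SAMPLE_ROWS)
```

Calling `write_sample_csv('data/sample_products.csv')` would create the file at the path that `CSV_FILE` points to in the main script.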

In [4]:
import torch
from torch.utils.data import Dataset
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from transformers import PreTrainedTokenizer

class ProductDataset(Dataset):
    """
    Custom PyTorch Dataset for Product Classification.
    """
    def __init__(
        self,
        descriptions: list[str],
        categories: list[int],
        tokenizer: PreTrainedTokenizer,
        max_len: int = 128
    ):
        self.descriptions = descriptions
        self.categories = categories
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.descriptions)

    def __getitem__(self, item_idx):
        text = str(self.descriptions[item_idx])
        label = self.categories[item_idx]

        # Call the tokenizer directly; `encode_plus` is deprecated in favor of `__call__`
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'description_text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

class DataProcessor:
    """
    Handles loading CSV, encoding labels, and splitting data.
    """
    def __init__(self, csv_path: str):
        self.df = pd.read_csv(csv_path)
        self.label_encoder = LabelEncoder()

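        # Expects lowercase 'description' and 'category' columns in the CSV.
        # Encode string categories to integers (ids follow the sorted order of class names)
        self.df['label_encoded'] = self.label_encoder.fit_transform(self.df['category'])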

        self.num_classes = len(self.label_encoder.classes_)
        self.id2label = {i: label for i, label in enumerate(self.label_encoder.classes_)}
        self.label2id = {label: i for i, label in enumerate(self.label_encoder.classes_)}

    def get_data(self):
        return self.df['description'].values, self.df['label_encoded'].values

    def decode_label(self, label_id: int) -> str:
        return self.label_encoder.inverse_transform([label_id])[0]
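
To see what the label encoding produces, here is a dependency-free sketch of the same mapping. It relies on the fact that scikit-learn's `LabelEncoder` assigns ids in sorted order of the unique class names; the variable names are illustrative, and the categories are the ones from the sample table.

```python
categories = [
    "Electronics > Computers > Laptops",
    "Clothing > Men > Shoes",
    "Electronics > Audio > Headphones",
    "Electronics > Computers > Laptops",
    "Clothing > Men > Shoes",
    "Electronics > Audio > Headphones",
    "Electronics > Computers > Laptops",
    "Clothing > Men > Tops",
    "Electronics > Audio > Speakers",
    "Books > Fiction > Fantasy",
    "Books > Fiction > Classics",
    "Books > Non-Fiction > Education",
]

classes = sorted(set(categories))                      # unique names, sorted
label2id = {label: i for i, label in enumerate(classes)}
id2label = {i: label for label, i in label2id.items()}
encoded = [label2id[c] for c in categories]            # integer labels per row

print(len(classes))
print(encoded)
```

With the sample table this yields 8 classes, and the laptop category receives the highest id because it sorts last, matching the mapping printed by the main script.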

# Training Engine

In [9]:
import torch
import torch.nn as nn
from tqdm import tqdm
import numpy as np

def train_fn(data_loader, model, optimizer, device, scheduler):
    """
    Training loop for one epoch.
    """
    model.train()
    final_loss = 0
    correct_predictions = 0
    total_examples = 0

    # TQDM progress bar for live monitoring
    progress_bar = tqdm(data_loader, desc="  Training", leave=False)

    for data in progress_bar:
        input_ids = data['input_ids'].to(device)
        attention_mask = data['attention_mask'].to(device)
        labels = data['labels'].to(device)

        optimizer.zero_grad()

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        loss = outputs.loss
        logits = outputs.logits

        _, preds = torch.max(logits, dim=1)
        correct_predictions += torch.sum(preds == labels)
        total_examples += labels.size(0)

        loss.backward()
        optimizer.step()
        scheduler.step()

        final_loss += loss.item()

        # Update progress bar with current loss
        progress_bar.set_postfix(loss=loss.item())

    avg_loss = final_loss / len(data_loader)
    accuracy = correct_predictions.double() / total_examples

    return accuracy.item(), avg_loss

def eval_fn(data_loader, model, device):
    """
    Evaluation loop for validation.
    """
    model.eval()
    final_loss = 0
    correct_predictions = 0
    total_examples = 0

    progress_bar = tqdm(data_loader, desc="  Validating", leave=False)

    with torch.no_grad():
        for data in progress_bar:
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            labels = data['labels'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

            loss = outputs.loss
            logits = outputs.logits

            _, preds = torch.max(logits, dim=1)
            correct_predictions += torch.sum(preds == labels)
            total_examples += labels.size(0)

            final_loss += loss.item()

    avg_loss = final_loss / len(data_loader)
    accuracy = correct_predictions.double() / total_examples

    return accuracy.item(), avg_loss
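
Note that `train_fn` calls `scheduler.step()` once per batch, so the schedule is indexed by optimizer steps rather than epochs. The sketch below shows the multiplier that a linear warmup-then-decay schedule applies to the base learning rate, which is how the `get_linear_schedule_with_warmup` helper used in the main script is understood to behave. The function name is hypothetical, and the 15 total steps assume the sample run's 9 training rows, batch size 4 (3 batches per epoch), and 5 epochs.

```python
def linear_schedule_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 15  # 3 batches per epoch * 5 epochs (sample run)
lrs = [base_lr * linear_schedule_factor(s, 0, total_steps) for s in range(total_steps + 1)]
print(lrs[0], lrs[7], lrs[-1])
```

With zero warmup steps the learning rate starts at its full value and decays linearly to zero by the last step; a small nonzero warmup is a common alternative when training is unstable early on.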

# Main Script

In [10]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

# --- Configuration ---
# Note: the dataset is not included. Supply your own CSV with 'description' and 'category' columns.
CSV_FILE = 'data/sample_products.csv'
MAX_LEN = 64  # Short length for sample; increase to 128+ for real data
BATCH_SIZE = 4 # Small batch size for sample; increase to 16/32
EPOCHS = 5
LEARNING_RATE = 2e-5
MODEL_NAME = 'distilbert-base-uncased'

def main():
    # 1. Device Setup
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # 2. Data Preparation
    print("Loading and processing data...")
    processor = DataProcessor(CSV_FILE)
    descriptions, labels = processor.get_data()

    # Split data
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        descriptions, labels, test_size=0.2, random_state=42
    )

    print(f"Training samples: {len(train_texts)}")
    print(f"Validation samples: {len(val_texts)}")
    print(f"Number of classes: {processor.num_classes}")
    print(f"Classes: {processor.id2label}")

    # 3. Tokenizer & Datasets
    tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

    train_dataset = ProductDataset(train_texts, train_labels, tokenizer, MAX_LEN)
    val_dataset = ProductDataset(val_texts, val_labels, tokenizer, MAX_LEN)

    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

    # 4. Model Initialization
    # We pass id2label and label2id so the model config saves the mapping for inference later
    model = DistilBertForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=processor.num_classes,
        id2label=processor.id2label,
        label2id=processor.label2id
    )
    model.to(device)

    # 5. Optimizer & Scheduler
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    total_steps = len(train_loader) * EPOCHS
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=total_steps
    )

    # 6. Training Loop
    best_accuracy = 0

    print("\nStarting training...")
    for epoch in range(EPOCHS):
        print(f"\nEpoch {epoch + 1}/{EPOCHS}")
        print("-" * 30)

        train_acc, train_loss = train_fn(train_loader, model, optimizer, device, scheduler)
        print(f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f}")

        val_acc, val_loss = eval_fn(val_loader, model, device)
        print(f"Val Loss:   {val_loss:.4f} | Val Acc:   {val_acc:.4f}")

        if val_acc > best_accuracy:
            best_accuracy = val_acc
            torch.save(model.state_dict(), 'best_model.bin')
            print("-> Best model saved!")

    print("\nTraining complete.")

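    # Optional: Quick Inference Check (uses the final-epoch weights, not the saved best checkpoint)
    print("\n--- Inference Check ---")
    model.eval()  # make sure dropout is disabled for inference
    sample_text = "Lenovo laptop with 16GB RAM"
    inputs = tokenizer(sample_text, return_tensors="pt").to(device)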
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class_id = logits.argmax().item()
    print(f"Input: {sample_text}")
    print(f"Predicted: {processor.decode_label(predicted_class_id)}")

if __name__ == "__main__":
    main()

Using device: cpu
Loading and processing data...
Training samples: 9
Validation samples: 3
Number of classes: 8
Classes: {0: 'Books > Fiction > Classics', 1: 'Books > Fiction > Fantasy', 2: 'Books > Non-Fiction > Education', 3: 'Clothing > Men > Shoes', 4: 'Clothing > Men > Tops', 5: 'Electronics > Audio > Headphones', 6: 'Electronics > Audio > Speakers', 7: 'Electronics > Computers > Laptops'}


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Starting training...

Epoch 1/5
------------------------------
Train Loss: 2.1073 | Train Acc: 0.1111
Val Loss:   2.0769 | Val Acc:   0.0000

Epoch 2/5
------------------------------
Train Loss: 2.0886 | Train Acc: 0.2222
Val Loss:   2.0739 | Val Acc:   0.3333
-> Best model saved!

Epoch 3/5
------------------------------
Train Loss: 2.0078 | Train Acc: 0.4444
Val Loss:   2.0777 | Val Acc:   0.3333

Epoch 4/5
------------------------------
Train Loss: 1.9374 | Train Acc: 0.4444
Val Loss:   2.0800 | Val Acc:   0.3333

Epoch 5/5
------------------------------
Train Loss: 2.0035 | Train Acc: 0.6667
Val Loss:   2.0809 | Val Acc:   0.3333

Training complete.

--- Inference Check ---
Input: Lenovo laptop with 16GB RAM
Predicted: Electronics > Computers > Laptops


