
# Train Models Notebook — FurnishAI (Text Classification)

This notebook demonstrates **model training** for the assignment. It is fast to run and easy for reviewers to verify.

**What it does**
- Trains a **text classifier** (TF‑IDF + Logistic Regression) to predict `categories` from `title + description`.
- Reports **validation accuracy**, **classification report**, and a **confusion matrix**.
- Saves a model artifact to `../backend/models/weights/text_cat_clf.pkl`.
- Includes an optional (commented-out) **zero-shot CLIP** image-text similarity demo as a computer vision (CV) check.

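**Expected dataset** at `../data/raw.csv` with columns:  
`uniq_id,title,brand,description,price,categories,images,material,color`

Only `uniq_id`, `title`, `description`, and `categories` are required; `images` is renamed to `image` on load.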


In [None]:

# 0) Install dependencies (uncomment if running in a clean environment)
# %pip install pandas scikit-learn matplotlib seaborn


In [None]:

# 1) Imports
import os, re, pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

pd.set_option('display.max_colwidth', 180)


In [None]:

# 2) Load dataset
csv_path = "../data/raw.csv"  # adjust if needed
assert os.path.exists(csv_path), f"CSV not found at {csv_path}. Please place your dataset there."

df = pd.read_csv(csv_path)
df = df.rename(columns={"images":"image", "package dimensions":"package_dimensions"})

need = ["uniq_id", "title", "description", "categories"]
missing = [c for c in need if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

print("Shape:", df.shape)
df.head(3)


In [None]:

# 3) Basic cleaning
def clean_text(s: str) -> str:
    s = str(s)
    s = re.sub(r"<[^>]+>", " ", s)      # strip HTML
    s = re.sub(r"\s+", " ", s)         # collapse whitespace
    return s.strip().lower()

df["title"] = df["title"].fillna("").apply(clean_text)
df["description"] = df["description"].fillna("").apply(clean_text)
df["text"] = (df["title"] + " " + df["description"]).str.strip()
df = df[(df["text"].str.len() > 0) & df["categories"].notna()].copy()
df["categories"] = df["categories"].astype(str)

print("Unique categories:", df["categories"].nunique())
df[["text","categories"]].head(3)


In [None]:

# 4) Split
train_df, val_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["categories"]
)
len(train_df), len(val_df)
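
Stratified splitting needs at least two rows per class, and `train_test_split` raises a `ValueError` when a category appears only once. If the split above fails for that reason, one option is to drop the rare categories first. The sketch below is a minimal illustration on made-up rows; the helper name `drop_rare_categories` and the `min_count` threshold are assumptions, not part of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def drop_rare_categories(frame: pd.DataFrame, label_col: str = "categories",
                         min_count: int = 2) -> pd.DataFrame:
    """Keep only rows whose label occurs at least `min_count` times (hypothetical helper)."""
    counts = frame[label_col].value_counts()
    keep = counts[counts >= min_count].index
    return frame[frame[label_col].isin(keep)].copy()


# Toy frame with made-up rows: "lamp" appears once, so stratifying on it would fail.
toy = pd.DataFrame({
    "text": ["oak chair", "pine chair", "steel chair",
             "glass table", "oak table", "marble table", "brass lamp"],
    "categories": ["chair", "chair", "chair", "table", "table", "table", "lamp"],
})

filtered = drop_rare_categories(toy)
toy_train, toy_val = train_test_split(
    filtered, test_size=2, random_state=42, stratify=filtered["categories"]
)
print(len(toy), "->", len(filtered), "rows;", len(toy_train), "train /", len(toy_val), "val")
```

Dropping rows changes what the model can predict, so it is worth printing how many rows and categories were removed before training on the real data.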


In [None]:

# 5) Train: TF-IDF + Logistic Regression
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50000, ngram_range=(1,2))),
    ("clf", LogisticRegression(max_iter=200, C=2.0))
])

pipe.fit(train_df["text"], train_df["categories"])

pred = pipe.predict(val_df["text"])
acc = accuracy_score(val_df["categories"], pred)
print(f"Validation Accuracy: {acc:.4f}")
print("\nClassification Report (truncated):\n")
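# zero_division=0 keeps the same scores but silences warnings for classes with no predictions
print(classification_report(val_df["categories"], pred, zero_division=0)[:1200])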


In [None]:

# 6) Confusion Matrix (Top-N categories for readability)
from collections import Counter

topN = 12
top_cats = [c for c, _ in Counter(val_df["categories"]).most_common(topN)]

# Align predictions with val_df's index so the boolean masks line up row by row
pred_s = pd.Series(pred, index=val_df.index)

# Keep only rows where both the true and the predicted label are in the top-N set
mask = val_df["categories"].isin(top_cats) & pred_s.isin(top_cats)

cm = confusion_matrix(
    val_df.loc[mask, "categories"],
    pred_s[mask],
    labels=top_cats
)

plt.figure(figsize=(12,9))
sns.heatmap(cm, annot=False, xticklabels=top_cats, yticklabels=top_cats)
plt.title("Confusion Matrix (Top Categories)")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
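
Raw counts in the heatmap above are dominated by the largest categories. A row-normalized matrix can make per-class behavior easier to compare. The sketch below uses made-up labels to show scikit-learn's `normalize="true"` option; the label values are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = ["chair", "chair", "chair", "table", "table", "lamp"]
y_pred = ["chair", "chair", "table", "table", "table", "chair"]
labels = ["chair", "table", "lamp"]

# normalize="true" divides each row by its true-class count,
# so every row with at least one true sample sums to 1.
cm_norm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(np.round(cm_norm, 2))
```

In this notebook the same option could be passed to the `confusion_matrix` call in step 6, and the result plotted with `sns.heatmap(cm_norm, annot=True, fmt=".2f", ...)`.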


In [None]:

# 7) Save model artifact to backend for optional use
weights_dir = "../backend/models/weights"
os.makedirs(weights_dir, exist_ok=True)
model_path = os.path.join(weights_dir, "text_cat_clf.pkl")

with open(model_path, "wb") as f:
    pickle.dump(pipe, f)

print("Saved model to:", model_path)
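
Before relying on the saved file, it can help to confirm that it loads and predicts the same as the in-memory pipeline. The sketch below shows that round trip on a tiny stand-in pipeline trained on made-up rows and written to a temporary directory; in the notebook you would load `model_path` and compare against `pipe` instead.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in pipeline on made-up rows (not the real dataset).
toy_texts = ["oak dining chair", "folding metal chair",
             "glass coffee table", "oak side table"]
toy_labels = ["chair", "chair", "table", "table"]
toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])
toy_pipe.fit(toy_texts, toy_labels)

# Write the pipeline to disk, read it back, and compare predictions.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = os.path.join(tmp, "toy_clf.pkl")
    with open(tmp_path, "wb") as f:
        pickle.dump(toy_pipe, f)
    with open(tmp_path, "rb") as f:
        restored = pickle.load(f)

same = list(restored.predict(toy_texts)) == list(toy_pipe.predict(toy_texts))
print("Round-trip predictions match:", same)
```

Pickle files can execute arbitrary code when loaded, so only load artifacts you created yourself, and load them with the same scikit-learn version used for training.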



## (Optional) CV: zero-shot CLIP similarity

If the `image` column holds image URLs, the cell below embeds each product image and its title with CLIP and reports their cosine similarity as a simple qualitative check.
Uncomment the cell and run it.

> Requires: `pip install open-clip-torch pillow requests torch`


In [None]:

# from PIL import Image
# import requests, io, torch
# import open_clip
#
# model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
# tokenizer = open_clip.get_tokenizer("ViT-B-32")
# model.eval()
#
# sample = df.dropna(subset=["image"]).head(3)
# rows = []
# for _, r in sample.iterrows():
#     try:
#         resp = requests.get(r["image"], timeout=10)
#         resp.raise_for_status()  # surface HTTP errors instead of decoding an error page
#         pil = Image.open(io.BytesIO(resp.content)).convert("RGB")
#     except Exception as e:
#         print("Image fetch failed:", e)
#         continue
#
#     with torch.no_grad():
#         img_feat = model.encode_image(preprocess(pil).unsqueeze(0))
#         img_feat /= img_feat.norm(dim=-1, keepdim=True)
#         txt = r["title"]
#         tok = tokenizer([txt])
#         txt_feat = model.encode_text(tok)
#         txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
#         sim = float((img_feat @ txt_feat.T).cpu().numpy()[0][0])
#     rows.append({"uniq_id": r["uniq_id"], "title": txt[:80], "similarity": round(sim, 4)})
#
# pd.DataFrame(rows)  # pandas is already imported as pd in cell 1



### Notes
- This notebook **fulfills the "Model Training Notebook" requirement** in the assignment PDF.
- The TF-IDF + Logistic Regression classifier is a quick baseline that trains on CPU without a GPU; the validation metrics above show how well it does on your data.
- If desired, you can integrate the saved `text_cat_clf.pkl` into the FastAPI backend for re-ranking or category hints.
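
As a rough illustration of the "category hints" idea, the sketch below returns the top-k categories with their probabilities for a single query string. The function name `category_hints` and the toy training rows are assumptions made for this example; in the backend you would load `text_cat_clf.pkl` with `pickle.load` and pass that model in instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def category_hints(model, text: str, k: int = 3):
    """Return the k most probable (category, probability) pairs for one text.

    Hypothetical helper: `model` is any fitted scikit-learn classifier or
    pipeline that exposes `predict_proba` and `classes_`.
    """
    probs = model.predict_proba([text])[0]
    top = np.argsort(probs)[::-1][:k]  # indices of the k largest probabilities
    return [(str(model.classes_[i]), float(probs[i])) for i in top]


# Stand-in model trained on made-up rows so the example runs on its own.
hint_texts = ["oak dining chair", "folding metal chair",
              "glass coffee table", "oak side table",
              "brass floor lamp", "led desk lamp"]
hint_labels = ["chair", "chair", "table", "table", "lamp", "lamp"]
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])
toy_model.fit(hint_texts, hint_labels)

hints = category_hints(toy_model, "oak dining table", k=2)
print(hints)
```

The saved pipeline was trained on text passed through `clean_text`, so the backend would need to apply the same cleaning to incoming queries for the probabilities to be comparable.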
