
# Model Training Notebook (Final Submission)

This notebook demonstrates the reasoning, clarity, and modular design principles followed for the internship assignment.
Each section is commented to explain the ML flow, from data preparation through embedding generation to evaluation.


Run the cells in order. The embeddings are saved to `models/text_emb.npy`, so later runs can reload them instead of re-encoding the dataset.

# 02 - Model Training & Recommender (Beginner friendly)

This notebook demonstrates how to create text embeddings, build a simple content-based recommender (Nearest Neighbors), and save model artifacts. All code is explained for a beginner.


In [None]:
# Load the cleaned dataset into a DataFrame.
import pandas as pd
from pathlib import Path

p = Path('data/cleaned_dataset.csv')
df = pd.read_csv(p)
df.shape  # (number of rows, number of columns)


In [None]:
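# Install note: this notebook expects sentence-transformers to be installed in the environment.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Use 'text_blob' if present, else join all columns per row. Fill NaN before
# casting to str so missing values do not become the literal text 'nan'.
if 'text_blob' in df.columns:
    texts = df['text_blob'].fillna('').astype(str).tolist()
else:
    texts = df.fillna('').astype(str).agg(' '.join, axis=1).tolist()
emb = model.encode(texts[:10])  # encode a small sample to check the shape
print('Sample embedding shape:', emb.shape)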

In [None]:
# Build a NearestNeighbors index (scikit-learn)
import os
import joblib
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Encode every row; this is the slow step on large datasets.
emb_full = model.encode(texts, show_progress_bar=True)
# Cosine distance = 1 - cosine similarity, so smaller means more similar.
nn = NearestNeighbors(n_neighbors=min(20, len(emb_full)), metric='cosine')
nn.fit(emb_full)

# Save embeddings and index (example paths)
os.makedirs('models', exist_ok=True)
np.save('models/text_emb.npy', emb_full)
joblib.dump(nn, 'models/nn.joblib')
print('Saved embeddings and NN model in models/')
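
The cell above saves the embeddings but always recomputes them first. A minimal sketch of reusing the saved file follows. The helper name `load_or_compute_embeddings` and its `encode_fn` parameter are hypothetical, not part of the original notebook, and the row-count comparison is only a rough staleness check (delete the file if the text content changes).

```python
from pathlib import Path
import numpy as np

def load_or_compute_embeddings(texts, encode_fn, cache_path='models/text_emb.npy'):
    """Return embeddings for texts, reusing a cached .npy file when it matches.

    encode_fn is any callable mapping a list of strings to a 2-D array,
    for example: lambda t: model.encode(t, show_progress_bar=True)
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        cached = np.load(cache_path)
        # Rough staleness check (assumption): one cached row per input text.
        if cached.shape[0] == len(texts):
            return cached
    # Cache missing or out of date: encode and store for the next run.
    emb = np.asarray(encode_fn(texts))
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_path, emb)
    return emb
```

In this notebook it could be called as `emb_full = load_or_compute_embeddings(texts, model.encode)` before fitting the index.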

In [None]:
# Example: recommend function
def recommend(query, k=5):
    """Return the k items most similar to a free-text query."""
    qemb = model.encode([query])[0]
    dists, idxs = nn.kneighbors([qemb], n_neighbors=min(k, len(emb_full)))
    results = []
    for dist, idx in zip(dists[0], idxs[0]):
        row = df.iloc[idx]
        # Convert cosine distance back to cosine similarity for the score.
        results.append({'title': row.get('title',''), 'brand': row.get('brand',''), 'score': float(1-dist)})
    return results

print(recommend('wooden chair', k=5))
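
To see where the `score` values come from, here is a small brute-force sketch of cosine search on toy 2-D vectors using only NumPy. The function `cosine_top_k` and the toy data are illustrative assumptions. It is not how scikit-learn implements the search, and it assumes no vector is all zeros.

```python
import numpy as np

def cosine_top_k(query_vec, item_vecs, k=5):
    """Brute-force cosine search: returns (indices, similarities), best first."""
    items = np.asarray(item_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    # Normalise to unit length so a dot product equals cosine similarity.
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    sims = items @ q
    # Sort by descending similarity and keep the first k.
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy items: same direction, 45 degrees off, orthogonal, and opposite.
toy_items = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
idx, sims = cosine_top_k([1.0, 0.0], toy_items, k=3)
print(idx, np.round(sims, 3))
```

The item pointing the same way as the query scores 1.0, the one at 45 degrees about 0.707, and the orthogonal one 0.0. These are the values that `1 - dist` recovers in `recommend`.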

**Evaluation & Notes:**
- If the dataset has category labels, you can estimate Precision@k as the fraction of the top-k retrieved items that share a category with the query item.
- For image model training, use transfer learning (e.g. a pretrained ResNet); put that code in a separate cell if images are present.
- Save the `models/` artifacts and either include them with the submission or add instructions for regenerating them to the README.
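
The Precision@k idea in the first note can be sketched in plain Python. The function names are hypothetical. The sketch assumes one category label per item, and it divides by the number of items actually retrieved when fewer than k are available.

```python
def precision_at_k(query_category, retrieved_categories, k=5):
    """Fraction of the top-k retrieved items whose category matches the query's."""
    top = list(retrieved_categories)[:k]
    if not top:
        return 0.0
    hits = sum(1 for c in top if c == query_category)
    return hits / len(top)

def mean_precision_at_k(categories, neighbor_indices, k=5):
    """Average Precision@k over all items, skipping each item itself.

    categories[i] is the label of item i; neighbor_indices[i] lists the
    indices retrieved for item i, best first.
    """
    scores = []
    for i, neighbors in enumerate(neighbor_indices):
        # An item is trivially its own nearest neighbour, so drop it.
        others = [j for j in neighbors if j != i]
        retrieved = [categories[j] for j in others]
        scores.append(precision_at_k(categories[i], retrieved, k))
    return sum(scores) / len(scores) if scores else 0.0
```

With the index built earlier, the neighbour lists could come from `nn.kneighbors(emb_full, n_neighbors=k + 1)[1]`, and the labels from a category column in `df` if one exists (the column name is an assumption, since the dataset schema is not shown here).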
