# 🎉 Smart Recommender: Sentence-Transformer (SBERT)

In the previous notebook we built a TF-IDF baseline.  
Now we'll replace the **bag-of-words** representation with **dense semantic embeddings** from a pretrained sentence-transformer.  
The workflow is exactly the same as before, only the vectorisation step changes:

1️⃣ Install `sentence-transformers`.  
2️⃣ Load a lightweight model (`all-MiniLM-L6-v2`).  
3️⃣ Encode every movie's combined text into a 384-dim vector.  
4️⃣ At query time we encode the user's sentence and compute cosine similarity against all movie vectors.  
5️⃣ Return the top-N most similar movies.

Because the model has already learned **semantic relationships** (e.g., *dream* ≈ *sleep*, *heist* ≈ *robbery*), the recommendations are usually **more relevant** than the TF-IDF baseline, especially for queries that share few exact words with the movie descriptions.
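
To see why a purely lexical representation misses such pairs, here is a minimal sketch that uses raw word counts as a stand-in for TF-IDF. The example phrases are made up for illustration, and `bow_cosine` is a hypothetical helper, not part of the previous notebook. With no shared tokens, the cosine similarity is exactly zero, however close the meanings are.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw word counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only words present in both texts contribute to the dot product
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

no_overlap = bow_cosine("daring casino heist", "bold bank robbery")
some_overlap = bow_cosine("daring bank heist", "bold bank robbery")
print(no_overlap)    # 0.0, because the phrases share no words
print(some_overlap)  # about 0.33, because only "bank" is shared
```

A dense embedding model is not limited to exact token overlap, which is the gap this notebook aims to close.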

In [None]:
# 1️⃣ Install the library (run once)
!pip install -q sentence-transformers

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

print("✅ sentence-transformers installed and imports ready!")

## 📂 Load the **cleaned** TMDb data

We reuse the `movies_clean` DataFrame you created in the TF-IDF notebook.  
If you are running this notebook **after** the previous one in the same session, the variable already exists.  
Otherwise, just re-run the loading cells from `01_tmdb_exploration.ipynb` (they are tiny).

In [None]:
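import ast

# Load the CSV we saved at the end of the TF-IDF notebook
movies_clean = pd.read_csv('tmdb_clean.csv')

def to_list(value):
    """Return a list of strings from a list column.

    Assumes the CSV stores lists as their Python repr, e.g. "['Action', 'Drama']".
    """
    if isinstance(value, list):
        return [str(v) for v in value]
    if isinstance(value, str) and value.strip().startswith('['):
        return [str(v) for v in ast.literal_eval(value)]
    return []  # NaN or an unexpected format

# Re-create the combined text column (just in case we started a fresh kernel)
def create_combined_text(row):
    # Skip missing overviews instead of emitting the literal string 'nan'
    parts = [row['overview']] if isinstance(row['overview'], str) else []
    parts.extend(to_list(row['genres_list']))
    parts.extend(to_list(row['keywords_list']))
    return ' '.join(parts)

movies_clean['combined_text'] = movies_clean.apply(create_combined_text, axis=1)

print(f"✅ Loaded {len(movies_clean)} movies with combined text ready.")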

## 🤖 2️⃣ Load a pretrained sentence-transformer

We pick **`all-MiniLM-L6-v2`**: 384-dim output, a download of roughly 80-90 MB, and it works well on both CPUs and GPUs.  
If you have a GPU you'll see a speed boost, but the model runs fine on a regular Colab CPU runtime.

In [None]:
# Load the model (downloads roughly 80-90 MB on first run)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Report the embedding width (384 for this model)
print("✅ Model loaded! Embedding dimension:", model.get_sentence_embedding_dimension())

## 📦 3️⃣ Encode **all** movies once (this is the heavy step)

We turn every `combined_text` into a 384-dim dense vector and store the results in a NumPy array.  
The resulting matrix has shape `(num_movies, 384)`, so it has far fewer columns than the TF-IDF matrix, although every entry is stored because the matrix is dense rather than sparse.
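
As a rough size check, the sketch below allocates a placeholder matrix of the same shape. The 5,000-row count is an assumption based on the "~5k rows" figure used in this notebook, and `float32` is assumed as the dtype of the encoded vectors.

```python
import numpy as np

num_movies = 5_000   # assumed catalogue size
dim = 384            # embedding width of all-MiniLM-L6-v2

# Placeholder with the same shape and assumed dtype as the real embeddings
placeholder = np.zeros((num_movies, dim), dtype=np.float32)
size_mb = placeholder.nbytes / 1e6
print(f"{placeholder.shape} -> {size_mb:.2f} MB")
```

Under these assumptions the matrix takes about 7.7 MB, which fits comfortably in memory on any runtime.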

In [None]:
# Encode all movies; this may take roughly 30-45 seconds for ~5k rows, depending on hardware
movie_embeddings = model.encode(
    movies_clean['combined_text'].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f"✅ Encoded {movie_embeddings.shape[0]} movies → {movie_embeddings.shape[1]}-dim vectors")

## 🔎 4️⃣ Recommendation function (query → top-N movies)

The function does three things:

1️⃣ Encode the user query with the same model.  
2️⃣ Compute **cosine similarity** against the pre-computed movie embeddings.  
3️⃣ Return the `top_n` titles with their similarity scores (cosine values lie between -1 and 1; closer matches score higher).

Because the embeddings are dense and low-dimensional, cosine similarity against every movie reduces to a **single matrix multiplication** once the vectors are length-normalised, which is fast for a catalogue of this size.
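
The matrix-multiplication view holds once every vector is scaled to unit length. A minimal sketch with random stand-in vectors (not real embeddings; `l2_normalize` is a hypothetical helper) compares the normalised dot product with the textbook cosine formula:

```python
import numpy as np

rng = np.random.default_rng(0)
movie_vecs = rng.normal(size=(6, 384)).astype(np.float32)  # stand-in for movie embeddings
query_vec = rng.normal(size=(384,)).astype(np.float32)     # stand-in for an encoded query

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Normalise once, then similarity is a single matrix-vector product
sims_fast = l2_normalize(movie_vecs) @ l2_normalize(query_vec)

# Reference: the textbook cosine formula, one row at a time
sims_ref = np.array([
    row @ query_vec / (np.linalg.norm(row) * np.linalg.norm(query_vec))
    for row in movie_vecs
])

print(np.allclose(sims_fast, sims_ref, atol=1e-5))
```

If the movie matrix is normalised once after encoding, only the query vector needs normalising at request time.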

In [None]:
def recommend_by_embedding(query: str, top_n: int = 5):
    """Return the most similar movies for a free-text query.

    Parameters
    ----------
    query : str
        User's natural-language request (e.g., "mind-bending sci-fi with dreams").
    top_n : int, optional
        Number of movies to return (default 5).

    Returns
    -------
    list of (str, float)
        (title, cosine similarity) pairs, best match first.
    """
    # 1️⃣ Encode the query
    q_vec = model.encode([query], convert_to_numpy=True)

    # 2️⃣ Cosine similarity against all movie vectors
    sims = cosine_similarity(q_vec, movie_embeddings).flatten()

    # 3️⃣ Get top-N indices (largest similarity first)
    top_idx = sims.argsort()[::-1][:top_n]

    # 4️⃣ Build a friendly output list
    results = []
    for i in top_idx:
        title = movies_clean.iloc[i]['title']
        score = round(float(sims[i]), 4)  # plain Python float prints cleanly
        results.append((title, score))
    return results

# Quick sanity check
print("🔎 Test query → results:")
for t, s in recommend_by_embedding('mind-bending sci-fi with dreams'):
    print(f"  • {t} (score={s})")
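
For larger catalogues, a full sort does more work than needed. As an optional variation (not what the cell above does), `np.argpartition` can pick out the `top_n` largest scores first so that only those are sorted. The `top_k_indices` helper name and the toy scores are hypothetical.

```python
import numpy as np

def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, ordered from highest to lowest."""
    k = max(0, min(k, scores.size))
    if k == 0:
        return np.array([], dtype=int)
    if k == scores.size:
        candidates = np.arange(scores.size)
    else:
        # Unordered indices of the k largest values
        candidates = np.argpartition(scores, -k)[-k:]
    # Sort only the k candidates, descending by score
    return candidates[np.argsort(scores[candidates])[::-1]]

scores = np.array([0.12, 0.87, 0.45, 0.91, 0.33])
print(top_k_indices(scores, 3))  # [3 1 2]
```

For a few thousand movies the difference is negligible, so the plain `argsort` is a reasonable default.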

## 📊 5️⃣ Quick evaluation: try a few different queries

Run the cell below and replace the strings with any natural-language request you like.  
For the first query you would expect **semantically relevant** titles such as *Inception*, *Paprika*, *The Matrix*, or *Interstellar*, provided they are in your dataset; exact results and scores depend on the data and the model version.

In [None]:
queries = [
    "mind-bending sci-fi with dreams",
    "family friendly animated adventure",
    "dark thriller about a detective",
    "movie about time travel and paradoxes",
    "light romantic comedy set in Paris"
]

for q in queries:
    print(f"\n🔎 Query: '{q}'")
    for title, score in recommend_by_embedding(q, top_n=5):
        print(f"  • {title} (score={score})")

## 💾 6️⃣ Persist the embeddings (optional but handy for production)

If you plan to serve the recommender from a web service, you don't want to re-encode all movies on every start-up.  
Save the dense matrix and the title list to disk so they can be loaded quickly later.

```python
# Save
np.save('movie_embeddings.npy', movie_embeddings)
movies_clean[['title']].to_csv('movie_titles.csv', index=False)

# Load (in a new session)
movie_embeddings = np.load('movie_embeddings.npy')
titles = pd.read_csv('movie_titles.csv')['title'].tolist()
```
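
Since titles and vectors live in separate files, a cheap alignment check on load can catch a mismatched pair. The following is a minimal sketch using a temporary directory and toy data. The file names mirror the ones above, the `load_artifacts` helper is hypothetical, and the standard `csv` module stands in for pandas to keep the example dependency-light.

```python
import csv
import tempfile
from pathlib import Path

import numpy as np

def load_artifacts(folder: Path):
    """Load embeddings and titles, failing loudly if their row counts differ."""
    embeddings = np.load(folder / "movie_embeddings.npy")
    with open(folder / "movie_titles.csv", newline="", encoding="utf-8") as fh:
        titles = [row["title"] for row in csv.DictReader(fh)]
    if len(titles) != embeddings.shape[0]:
        raise ValueError(
            f"{len(titles)} titles but {embeddings.shape[0]} embedding rows"
        )
    return embeddings, titles

# Toy round trip: 3 fake movies with 384-dim vectors
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    toy_titles = ["Movie A", "Movie B", "Movie C"]
    toy_vectors = np.random.default_rng(0).normal(size=(3, 384)).astype(np.float32)

    np.save(folder / "movie_embeddings.npy", toy_vectors)
    with open(folder / "movie_titles.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["title"])
        writer.writerows([t] for t in toy_titles)

    loaded_vectors, loaded_titles = load_artifacts(folder)

print(loaded_vectors.shape, loaded_titles)
```

The same length check can be run once at service start-up, before any request is answered.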

## 🚀 7️⃣ Next step: expose the recommender as a tiny API (FastAPI)

Below is a **minimal FastAPI** server you can copy into `services/recommender-service/main.py`.  
It loads the saved embeddings and answers `POST /recommend` with JSON.

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

app = FastAPI()

class RecRequest(BaseModel):
    query: str
    top_n: int = 5

# Load model & data once at startup
model = SentenceTransformer('all-MiniLM-L6-v2')
movie_embeddings = np.load('movie_embeddings.npy')
titles = pd.read_csv('movie_titles.csv')['title'].tolist()

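@app.post('/recommend')
def recommend(req: RecRequest):
    # Reject empty queries and non-positive result counts
    if not req.query.strip() or req.top_n < 1:
        raise HTTPException(status_code=400, detail="query must be non-empty and top_n must be >= 1")
    q_vec = model.encode([req.query], convert_to_numpy=True)
    sims = cosine_similarity(q_vec, movie_embeddings).flatten()
    top_idx = sims.argsort()[::-1][:req.top_n]
    results = [{"title": titles[i], "score": round(float(sims[i]), 4)} for i in top_idx]
    return {"query": req.query, "results": results}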
```

Run the service with Docker (or `uvicorn main:app --host 0.0.0.0 --port 8000`).  
Your frontend can now call `POST /recommend` and get semantically aware suggestions without the catalogue being re-encoded on each request.
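
For a quick check without a frontend, a request can be built with the standard library. This is a sketch only: `build_recommend_request` is a hypothetical helper, and the base URL assumes the `uvicorn` command above running locally on port 8000.

```python
import json
import urllib.request

def build_recommend_request(query: str, top_n: int = 5,
                            base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST /recommend request with a JSON body."""
    payload = json.dumps({"query": query, "top_n": top_n}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/recommend",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_recommend_request("dark thriller about a detective", top_n=3)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires the service to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The JSON body uses the same `query` and `top_n` fields as the `RecRequest` model.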

---
### 🎉 You're done!

You now have:
1. **A TF-IDF baseline** (already built).
2. **A smart SBERT-based recommender** that understands meaning.
3. **Persistence** of embeddings for fast start-up.
4. **A minimal FastAPI wrapper** ready to be containerised.

Feel free to experiment:
- Try a larger model (`all-distilroberta-v1`) for even richer semantics.
- Fine-tune the sentence-transformer on your own movie-specific corpus (optional).
- Add genre/keyword weighting in the final similarity (e.g., `0.6*embed + 0.4*tfidf`).
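
The weighting idea in the last bullet could be sketched as below. The `blend_scores` helper and the toy numbers are hypothetical; in practice both arrays would hold one score per movie for the same query, in the same movie order.

```python
import numpy as np

def blend_scores(embed_sims: np.ndarray, tfidf_sims: np.ndarray,
                 embed_weight: float = 0.6) -> np.ndarray:
    """Weighted average of two similarity arrays computed over the same movies."""
    if embed_sims.shape != tfidf_sims.shape:
        raise ValueError("score arrays must have the same shape")
    if not 0.0 <= embed_weight <= 1.0:
        raise ValueError("embed_weight must be between 0 and 1")
    return embed_weight * embed_sims + (1.0 - embed_weight) * tfidf_sims

# Toy scores for four movies (made-up numbers, not real model output)
embed_sims = np.array([0.50, 0.20, 0.80, 0.10])
tfidf_sims = np.array([0.00, 0.90, 0.30, 0.10])

blended = blend_scores(embed_sims, tfidf_sims)
ranking = np.argsort(blended)[::-1]
print(blended.round(2))  # approximately 0.30, 0.48, 0.60, 0.10
print(ranking)           # [2 1 0 3]
```

Scores from the two representations are not necessarily on comparable scales, so the 0.6/0.4 split is a starting point to tune rather than a recommended setting.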

Enjoy building the **Movie Discovery Assistant**! 🚀