<a href="https://colab.research.google.com/github/dennismathewjose/Movie-Recommender/blob/main/MovieRecommeder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Creation

Some comments that may be helpful as you continue your project:

This is a great application of what we have learned so far. One of the hardest parts of iteratively improving an existing system (such as a recommender) is that if the new model recommends movies using entirely new data, it is difficult to compare against other models, so I am happy to see that you are comparing it to traditional methods such as TF-IDF and collaborative filtering. Your discussion section should walk through why relying on user ratings alone, without paying attention to movie content such as summaries, inherently limits predictive power.
Your feature engineering section says you will encode genres and normalize release year, but it does not mention how these features will be included in the model. If you intend to concatenate metadata (genre, director, etc.) with text information, your final similarity model will not know how to weight each of these categories or avoid multicollinearity (for example, if "action" is mentioned constantly in the summary and the genre is also "action", those two columns would be correlated and would be overweighted in the final model). I would recommend performing some type of dimensionality reduction (such as PCA) if you intend to concatenate the embeddings.
For your evaluation metrics, everything you mentioned is sufficient, but please conduct some qualitative analysis and spot-checking of the results to ensure they make sense.
The BERT model you intend to use was trained on very short paragraphs, so if the maximum sequence length of the model tokenizer is shorter than your typical paragraph length, you may need to augment or swap your embedding model.
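
To make the PCA suggestion above concrete, here is a minimal sketch under stated assumptions: the data is synthetic (random stand-ins for 384-dimensional text embeddings, five one-hot genre columns, and a release-year column), and `standardize` and `pca` are hypothetical helpers written for illustration, not part of this notebook's pipeline. Each block is scaled before concatenation so that no single block dominates by magnitude, and the projection yields decorrelated components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions, not the notebook's data)
n_movies = 200
text_emb = rng.normal(size=(n_movies, 384))                        # text embeddings
genres = rng.integers(0, 2, size=(n_movies, 5)).astype(float)      # one-hot genres
years = rng.integers(1980, 2026, size=(n_movies, 1)).astype(float) # release year

def standardize(block):
    """Zero-mean, unit-variance scaling per column."""
    std = block.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (block - block.mean(axis=0)) / std

# Scale each block, then concatenate into one feature matrix
features = np.hstack([standardize(text_emb), standardize(genres), standardize(years)])

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

reduced, explained_ratio = pca(features, n_components=50)
print(reduced.shape)                         # (200, 50)
print(round(float(explained_ratio.sum()), 3))  # share of variance kept
```

The number of components (50 here) is arbitrary; in practice it would be chosen by inspecting the cumulative explained variance.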

In [2]:
import requests
import pandas as pd
from tqdm import tqdm
import time

import os

# Read your TMDb API key from the environment rather than hardcoding it
TMDB_API_KEY = os.environ.get("TMDB_API_KEY", "")

# TMDb endpoints
base_url = "https://api.themoviedb.org/3"


In [3]:
def get_movies_from_tmdb(page):
    url = f"{base_url}/movie/popular?api_key={TMDB_API_KEY}&language=en-US&page={page}"
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()['results']
    else:
        print(f"Failed to fetch page {page}: {response.status_code}")
        return []

def get_movie_details(movie_id):
    url = f"{base_url}/movie/{movie_id}?api_key={TMDB_API_KEY}&language=en-US"
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    else:
        return {}
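
The two helpers above give up on any non-200 response, which silently drops movies when the API rate-limits a request. A possible refinement is to retry with exponential backoff. The sketch below is an assumption-laden illustration: `fetch_with_retry` is a hypothetical helper, and `fetch` is any zero-argument callable returning `(status_code, payload)`, so the logic can be exercised without network access.

```python
import time

def fetch_with_retry(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out.

    Retries on HTTP 429 and 5xx with exponential backoff.
    Returns the payload on 200, otherwise None.
    """
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status == 200:
            return payload
        retryable = status == 429 or 500 <= status < 600
        if not retryable:
            return None  # e.g. 404: retrying will not help
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return None

# Demo with a fake fetcher: two rate-limit responses, then success
responses = iter([(429, None), (429, None), (200, {"results": ["movie"]})])
delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_retry(lambda: next(responses), sleep=delays.append)
print(result)  # {'results': ['movie']}
print(delays)  # [0.5, 1.0]
```

With `requests`, the callable could wrap `requests.get(url)` and return the status code together with `response.json()` when the status is 200.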


In [4]:
movies_data = []
seen_ids = set()

print("Fetching movies...")

for page in tqdm(range(1, 60)):  # pages 1-59 at ~20 movies/page ≈ 1,180 before filtering
    movies = get_movies_from_tmdb(page)
    for movie in movies:
        movie_id = movie.get('id')
        if movie_id not in seen_ids:
            details = get_movie_details(movie_id)
            if details.get('overview'):  # Keep only movies with plot summary
                movies_data.append({
                    'id': movie_id,
                    'title': details.get('title'),
                    'overview': details.get('overview'),
                    'genres': [genre['name'] for genre in details.get('genres', [])],
                    'release_date': details.get('release_date'),
                    'vote_average': details.get('vote_average'),
                    'runtime': details.get('runtime'),
                    'director': None,  # we'll add this in the next step
                    'cast': None       # we'll add this in the next step
                })
                seen_ids.add(movie_id)
    time.sleep(0.3)  # Be nice to the API

print(f"Total movies collected: {len(movies_data)}")

df_movies = pd.DataFrame(movies_data)
df_movies.head()


Fetching movies...


100%|██████████| 59/59 [01:47<00:00,  1.83s/it]

Total movies collected: 1100





Unnamed: 0,id,title,overview,genres,release_date,vote_average,runtime,director,cast
0,950387,A Minecraft Movie,Four misfits find themselves struggling with o...,"[Family, Comedy, Adventure, Fantasy]",2025-03-31,6.066,101,,
1,324544,In the Lost Lands,A queen sends the powerful and feared sorceres...,"[Fantasy, Adventure, Action]",2025-02-27,5.799,102,,
2,1195506,Novocaine,"When the girl of his dreams is kidnapped, ever...","[Action, Comedy, Thriller]",2025-03-12,6.9,110,,
3,1195430,Deva,"Dev Ambre, a ruthless cop, loses his memory in...","[Action, Thriller, Mystery, Crime]",2025-01-31,5.432,155,,
4,1045938,G20,After the G20 Summit is overtaken by terrorist...,"[Action, Mystery, Drama]",2025-04-09,6.333,108,,


In [7]:
def get_credits(movie_id):
    url = f"{base_url}/movie/{movie_id}/credits?api_key={TMDB_API_KEY}&language=en-US"
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    else:
        return {}

# Fill in director and cast info
print("Fetching credits for all movies...")

for movie in tqdm(movies_data):
    credits = get_credits(movie['id'])

    # Extract director
    crew = credits.get('crew', [])
    director = next((member.get('name') for member in crew if member.get('job') == 'Director'), None)

    # Extract top 5 cast members
    cast = credits.get('cast', [])[:5]
    cast_names = [member['name'] for member in cast]

    movie['director'] = director
    movie['cast'] = cast_names

    time.sleep(0.25)  # respect the rate limit


Fetching credits for all movies...


100%|██████████| 1100/1100 [06:18<00:00,  2.90it/s]


In [8]:
# Convert updated movie data to DataFrame
df_movies = pd.DataFrame(movies_data)

# Save to CSV
df_movies.to_csv("tmdb_1000_movies.csv", index=False)
print("Saved as tmdb_1000_movies.csv")

# Preview the result
df_movies.head()


Saved as tmdb_1000_movies.csv


Unnamed: 0,id,title,overview,genres,release_date,vote_average,runtime,director,cast
0,950387,A Minecraft Movie,Four misfits find themselves struggling with o...,"[Family, Comedy, Adventure, Fantasy]",2025-03-31,6.066,101,Jared Hess,"[Jason Momoa, Jack Black, Sebastian Eugene Han..."
1,324544,In the Lost Lands,A queen sends the powerful and feared sorceres...,"[Fantasy, Adventure, Action]",2025-02-27,5.799,102,Paul W. S. Anderson,"[Milla Jovovich, Dave Bautista, Arly Jover, Am..."
2,1195506,Novocaine,"When the girl of his dreams is kidnapped, ever...","[Action, Comedy, Thriller]",2025-03-12,6.9,110,Dan Berk,"[Jack Quaid, Amber Midthunder, Ray Nicholson, ..."
3,1195430,Deva,"Dev Ambre, a ruthless cop, loses his memory in...","[Action, Thriller, Mystery, Crime]",2025-01-31,5.432,155,Rosshan Andrrews,"[Shahid Kapoor, Pooja Hegde, Pavail Gulati, Pr..."
4,1045938,G20,After the G20 Summit is overtaken by terrorist...,"[Action, Mystery, Drama]",2025-04-09,6.333,108,Patricia Riggen,"[Viola Davis, Anthony Anderson, Ramón Rodrígue..."


# Data Preprocessing & Cleaning

### This step includes:

- Cleaning plot summaries (overviews)

- Normalizing text (lowercase, removing stopwords, lemmatizing)

- Handling missing values

- Tokenizing text for embedding

In [17]:

#Import the necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
import string
import re

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [19]:

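from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # corpus required by WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # used by clean_text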

In [88]:
def clean_text(text):
    if pd.isnull(text):
        return ""
    text = re.sub(r'<[^>]+>', '', text)  # Remove HTML tags
    text = re.sub(r'\n|\r', ' ', text)   # Remove line breaks
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = text.lower()  # Lowercase
    tokens = nltk.word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and word not in string.punctuation]
    return " ".join(tokens)
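
One side effect of the non-alphabetic filter in `clean_text` is that digits are removed, so a title-like token such as "G20" collapses to "g". The sketch below isolates that regex step; `strip_punctuation_keep_digits` is a hypothetical alternative shown only for comparison, not something the notebook uses.

```python
import re

def strip_non_letters(text):
    """Same pattern as in clean_text: keep only letters and whitespace."""
    return re.sub(r'[^a-zA-Z\s]', '', text)

def strip_punctuation_keep_digits(text):
    """Hypothetical alternative: also keep digits so tokens like 'G20' survive."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

sample = "After the G20 Summit is overtaken"
print(strip_non_letters(sample).lower())              # after the g summit is overtaken
print(strip_punctuation_keep_digits(sample).lower())  # after the g20 summit is overtaken
```

Whether keeping digits helps depends on the downstream model, so this is worth spot-checking rather than assuming.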


In [89]:
# Clean plot summaries (overview column)
tqdm.pandas()
df_movies['cleaned_overview'] = df_movies['overview'].progress_apply(clean_text)

# Drop rows with empty cleaned summaries and reset the index
df_movies = df_movies[df_movies['cleaned_overview'].str.strip() != ""].reset_index(drop=True)

df_movies[['title', 'cleaned_overview']].head()


100%|██████████| 1100/1100 [00:00<00:00, 3004.64it/s]


Unnamed: 0,title,cleaned_overview
0,A Minecraft Movie,four misfit find struggling ordinary problem s...
1,In the Lost Lands,queen sends powerful feared sorceress gray aly...
2,Novocaine,girl dream kidnapped everyman nate turn inabil...
3,Deva,dev ambre ruthless cop loses memory accident f...
4,G20,g summit overtaken terrorist president daniell...


# Data Exploration

In [124]:
import plotly.express as px
from collections import Counter

# Flatten and count all genres
genre_list = [genre for genres in df_movies['genres'] for genre in genres]
genre_counts = Counter(genre_list)

# Convert to DataFrame for plotly
genre_df = pd.DataFrame({
    'Genre': list(genre_counts.keys()),
    'Count': list(genre_counts.values())
})

# Plot using plotly
fig = px.bar(
    genre_df.sort_values(by='Count', ascending=False),
    x='Genre',
    y='Count',
    title='Number of Movies per Genre',
    text='Count'
)

fig.update_traces(textposition='outside')
fig.update_layout(xaxis_tickangle=-45)
fig.show()


In [125]:
import plotly.express as px

# Ensure 'release_year' is extracted
df_movies['release_year'] = pd.to_datetime(df_movies['release_date'], errors='coerce').dt.year

# Count number of movies per year
year_counts = df_movies['release_year'].value_counts().sort_index()
year_df = pd.DataFrame({'Year': year_counts.index, 'Count': year_counts.values})

# Plot using plotly
fig = px.line(
    year_df,
    x='Year',
    y='Count',
    markers=True,
    title='Movies Released per Year'
)

fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Number of Movies',
    hovermode='x unified'
)

fig.show()


In [126]:
import plotly.express as px

# Calculate average rating per genre
genre_ratings = {}

for genre in genre_counts.keys():
    mask = df_movies['genres'].apply(lambda g: genre in g)
    avg_rating = df_movies[mask]['vote_average'].mean()
    genre_ratings[genre] = avg_rating

# Convert to DataFrame
genre_rating_df = pd.DataFrame({
    'Genre': list(genre_ratings.keys()),
    'Average Rating': list(genre_ratings.values())
})

# Plot
fig = px.bar(
    genre_rating_df.sort_values(by='Average Rating', ascending=False),
    x='Genre',
    y='Average Rating',
    title='Average Vote Rating per Genre',
    text='Average Rating'
)

fig.update_traces(textposition='outside')
fig.update_layout(
    xaxis_title='Genre',
    yaxis_title='Average Rating',
    xaxis_tickangle=-45,
    yaxis_range=[0, 10]
)

fig.show()


# Generating Embeddings

In [90]:
from sentence_transformers import SentenceTransformer

# Load pre-trained SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')
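
The reviewer comment about context length can be checked directly by counting how many overviews exceed the model's maximum sequence length. The sketch below is a rough illustration: `truncation_report` is a hypothetical helper, and the whitespace tokenizer is a stand-in so the example runs without downloading a model. With sentence-transformers, `model.tokenizer.tokenize` and `model.max_seq_length` could be passed instead; the limit is reportedly 256 word pieces for `all-MiniLM-L6-v2`, but that is worth confirming from the loaded model.

```python
def truncation_report(texts, tokenize, max_tokens):
    """Count texts whose token length exceeds max_tokens."""
    lengths = [len(tokenize(t)) for t in texts]
    too_long = sum(1 for n in lengths if n > max_tokens)
    return {
        "n_texts": len(texts),
        "n_truncated": too_long,
        "max_len": max(lengths, default=0),
    }

# Stand-in inputs (assumptions): a short text and an artificially long one
overviews = ["a short plot", "word " * 300]
report = truncation_report(overviews, str.split, max_tokens=256)
print(report)  # {'n_texts': 2, 'n_truncated': 1, 'max_len': 300}
```

Subword tokenizers usually produce more tokens than a whitespace split, and special tokens also count toward the limit, so the real figure may be higher than this approximation suggests.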

In [91]:
# Encode cleaned plot summaries into embeddings
print("Generating embeddings...")

embeddings = model.encode(
    df_movies['cleaned_overview'].tolist(),
    show_progress_bar=True,
    convert_to_numpy=True,
    batch_size=64  # tweak for memory/speed tradeoff
)

Generating embeddings...



In [92]:
# Store embeddings as a new column (optional preview)
df_movies['embedding'] = list(embeddings)

# Show shape of embedding for verification
print(f"Embedding dimension: {embeddings.shape[1]}")
df_movies[['title', 'embedding']].head(2)

Embedding dimension: 384


Unnamed: 0,title,embedding
0,A Minecraft Movie,"[0.048558515, -0.03745185, 0.030841416, 0.0250..."
1,In the Lost Lands,"[-0.001374634, 0.010979088, -0.04676828, 0.028..."


In [93]:
%pip install faiss-cpu



In [94]:
import faiss
import numpy as np

# Recreate embedding matrix from DataFrame
plot_embeddings = np.vstack(df_movies['embedding'].values).astype('float32')

# Normalize for cosine similarity
faiss.normalize_L2(plot_embeddings)

# Store back to DataFrame (optional verification)
df_movies['normalized_embedding'] = list(plot_embeddings)

print(f"Plot embeddings matrix shape: {plot_embeddings.shape}")


Plot embeddings matrix shape: (1100, 384)


In [95]:
# Create FAISS index for cosine similarity
index = faiss.IndexFlatIP(plot_embeddings.shape[1])  # IP = inner product ≈ cosine after normalization
index.add(plot_embeddings)

print(f"FAISS index created with {index.ntotal} movie vectors.")

FAISS index created with 1100 movie vectors.
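
The index above relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity, since cos(a, b) = a·b / (‖a‖‖b‖) and both norms become 1 after normalization. A small NumPy check on random vectors (synthetic data, used only to illustrate the identity) is sketched below.

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(5, 8)).astype("float32")
query = rng.normal(size=(8,)).astype("float32")

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# L2-normalize, which is the role faiss.normalize_L2 plays in the cell above
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)

inner_products = unit_vectors @ unit_query
cosines = np.array([cosine(v, query) for v in vectors])

print(np.allclose(inner_products, cosines, atol=1e-5))  # True
```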


# Embedding Visualization

In [118]:
import plotly.express as px
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import numpy as np



In [119]:
# Sample 500 movies for clarity
sample_df = df_movies.sample(n=500, random_state=42)
sample_embeddings = np.vstack(sample_df['embedding'].values)

# Reduce dimensions using t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_proj = tsne.fit_transform(sample_embeddings)

# Add TSNE results to sample_df
sample_df['tsne_x'] = tsne_proj[:, 0]
sample_df['tsne_y'] = tsne_proj[:, 1]

In [120]:
k = 10
kmeans = KMeans(n_clusters=k, random_state=42)
sample_df['cluster'] = kmeans.fit_predict(sample_embeddings)

In [121]:
fig = px.scatter(
    sample_df,
    x='tsne_x',
    y='tsne_y',
    color=sample_df['cluster'].astype(str),  # convert to string for legend
    hover_data=['title', 'genres'],
    title='t-SNE Projection of SBERT Plot Embeddings (Clustered)',
    labels={'color': 'Cluster'}
)

fig.update_layout(legend_title_text='Cluster ID')
fig.show()


In [122]:
for i in range(k):
    print(f"\nCluster {i} sample movies:")
    print(sample_df[sample_df['cluster'] == i]['title'].head(5).tolist())


Cluster 0 sample movies:
['The Autopsy of Jane Doe', 'Back in Action', 'The Matrix Resurrections', 'Taken 2', 'Venom: Let There Be Carnage']

Cluster 1 sample movies:
['The Dad Quest', 'Elevation', 'The Curse of the Necklace', 'Final Destination Bloodlines', 'Little Man']

Cluster 2 sample movies:
['The Tomorrow War', 'Arrival', 'The Meg', 'Snowpiercer', 'Devara : Part 1']

Cluster 3 sample movies:
['Princess Mononoke', 'Anastasia', 'How to Train Your Dragon', 'Ben-Hur', 'Beauty and the Beast']

Cluster 4 sample movies:
['Whiplash', '10 Lives', 'Delinquent in Drag', 'All Things Fair', 'In Time']

Cluster 5 sample movies:
['xXx', 'Star Wars: The Force Awakens', 'Land of Bad', 'Train to Busan', 'The Gorge']

Cluster 6 sample movies:
['Tarot', 'Eye for an Eye 2', 'Talk to Me', 'Wallace & Gromit: Vengeance Most Fowl', 'Deadpool 2']

Cluster 7 sample movies:
['Come Play with Me', 'A Woman and Seven Husbands', 'Push! Push!', 'The House', 'Palm Springs']

Cluster 8 sample movies:
['Zootopia'

### These clusters come from k-means on the SBERT plot embeddings, so they reflect semantic plot similarity rather than explicit genres

# Baseline Recommendation Using TF-IDF + Cosine Similarity

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [40]:
# Create TF-IDF matrix from cleaned plot summaries
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df_movies['cleaned_overview'])

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

TF-IDF matrix shape: (1100, 5000)
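
To see why the TF-IDF baseline rewards literal word overlap, it helps to compute the weights by hand. The sketch below is a toy re-implementation on three invented documents, following what is understood to be scikit-learn's default weighting (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization). The function names are hypothetical and the code is not a substitute for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration)
docs = [
    "thief steal dream idea",
    "dentist invite patient dream",
    "chocolate maker idea",
]

def build_idf(tokenized_docs):
    """Smoothed inverse document frequency per term."""
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """L2-normalized TF-IDF weights; out-of-vocabulary terms are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(u, v):
    """Dot product of unit-length sparse vectors, i.e. their cosine."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

tokenized = [d.split() for d in docs]
idf = build_idf(tokenized)
doc_vecs = [tfidf_vector(toks, idf) for toks in tokenized]
query_vec = tfidf_vector("thief enter dream steal idea".split(), idf)

scores = [round(cosine(query_vec, dv), 3) for dv in doc_vecs]
print(scores)  # first document scores highest; the others score only via one shared word
```

A document that merely shares "dream" or "idea" with the query still gets a non-zero score, which is consistent with the unrelated titles that appear in the baseline results.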


In [41]:
def recommend_movies_tfidf(query, top_k=5):
    # Clean the query text
    cleaned_query = clean_text(query)

    # Transform to TF-IDF vector
    query_vec = tfidf_vectorizer.transform([cleaned_query])

    # Compute cosine similarity
    similarity_scores = cosine_similarity(query_vec, tfidf_matrix).flatten()

    # Get top-k indices
    top_indices = similarity_scores.argsort()[::-1][:top_k]

    # Prepare results
    results = []
    for i in top_indices:
        results.append({
            'title': df_movies.iloc[i]['title'],
            'similarity_score': round(float(similarity_scores[i]), 3),
            'overview': df_movies.iloc[i]['overview']
        })

    return pd.DataFrame(results)


In [42]:
recommend_movies_tfidf("A thief who enters people's dreams to steal ideas")


Unnamed: 0,title,similarity_score,overview
0,Night Nurse,0.18,A dentist invites a elderly patient to stay at...
1,Inception,0.165,"Cobb, a skilled thief who commits corporate es..."
2,Sidelined: The QB and Me,0.157,"Dallas, a burdened but headstrong dancer, is d..."
3,Wonka,0.147,Willy Wonka – chock-full of ideas and determin...
4,Final Twist,0.146,Monica has a podcast and Alex has a dark secre...


In [None]:
# Example test cases (movie title → relevant movie titles)
evaluation_set = [
    {
        "query": "The batman",
        "relevant_titles": ["The dark Knight", "The Batman", "The dark knight rises"]
    },
    {
        "query": "A thief who enters people's dreams to steal ideas",
        "relevant_titles": ["Inception", "The prestige", "Memento"]
    },
    {
        "query":"An elderly man reads to a woman with dementia the story of two young lovers whose romance is threatened by the difference in their respective social classes",
        "relevant_titles" : ["The notebook","Pride & Prejudice","La La Land"]
    }
]


In [None]:
def precision_at_k(recommended, relevant, k):
    recommended_titles = recommended['title'].head(k).str.lower().tolist()
    relevant_set = set(title.lower() for title in relevant)
    hits = [title for title in recommended_titles if title in relevant_set]
    return len(hits) / k

def recall_at_k(recommended, relevant, k):
    recommended_titles = recommended['title'].head(k).str.lower().tolist()
    relevant_set = set(title.lower() for title in relevant)
    hits = [title for title in recommended_titles if title in relevant_set]
    return len(hits) / len(relevant_set) if relevant_set else 0

def mrr(recommended, relevant):
    recommended_titles = recommended['title'].str.lower().tolist()
    for i, title in enumerate(recommended_titles):
        if title in (t.lower() for t in relevant):
            return 1 / (i + 1)
    return 0
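
As a sanity check on the three metrics, the sketch below recomputes them by hand for the TF-IDF ranking shown earlier for the dream-thief query. `precision_recall_mrr` is a hypothetical compact helper that follows the same definitions as the functions above but works on plain lists of titles.

```python
def precision_recall_mrr(recommended, relevant, k):
    """Precision@k, recall@k, and reciprocal rank on lowercased titles."""
    recommended = [t.lower() for t in recommended]
    relevant = {t.lower() for t in relevant}
    hits = [t for t in recommended[:k] if t in relevant]
    precision = len(hits) / k
    recall = len(hits) / len(relevant) if relevant else 0.0
    # Reciprocal rank of the first relevant title, or 0 if none appears
    rr = next((1 / (i + 1) for i, t in enumerate(recommended) if t in relevant), 0.0)
    return precision, recall, rr

# Ranking copied from the TF-IDF output for this query
ranked = ["Night Nurse", "Inception", "Sidelined: The QB and Me", "Wonka", "Final Twist"]
relevant = ["Inception", "The prestige", "Memento"]

p, r, rr = precision_recall_mrr(ranked, relevant, k=5)
print(round(p, 3), round(r, 3), round(rr, 3))  # 0.2 0.333 0.5
```

One hit in five gives precision 0.2, one of three relevant titles gives recall 0.333, and the hit at rank 2 gives a reciprocal rank of 0.5.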

In [72]:
results_tfidf = []

for entry in evaluation_set:
    query = entry["query"]
    relevant = entry["relevant_titles"]

    recommendations = recommend_movies_tfidf(query, top_k=5)

    precision = precision_at_k(recommendations, relevant, k=5)
    recall = recall_at_k(recommendations, relevant, k=5)
    reciprocal_rank = mrr(recommendations, relevant)

    results_tfidf.append({
        "query": query,
        "precision@5": round(precision, 3),
        "recall@5": round(recall, 3),
        "MRR": round(reciprocal_rank, 3)
    })

pd.DataFrame(results_tfidf)


Unnamed: 0,query,precision@5,recall@5,MRR
0,The batman,0.4,0.667,0.333
1,A thief who enters people's dreams to steal ideas,0.2,0.333,0.5
2,An elderly man reads to a woman with dementia ...,0.2,0.333,1.0


# Recommendation System using SBERT + FAISS

In [96]:
def recommend_movies_by_plot(plot_text, top_k=5):
    # Clean and encode query plot
    cleaned_plot = clean_text(plot_text)
    query_vec = model.encode([cleaned_plot]).astype('float32')
    faiss.normalize_L2(query_vec)

    # Search for top-k similar movies
    D, I = index.search(query_vec, top_k)

    results = []
    for i, score in zip(I[0], D[0]):
        results.append({
            'title': df_movies.iloc[i]['title'],
            'similarity_score': round(float(score), 3),
            'overview': df_movies.iloc[i]['overview']
        })

    return pd.DataFrame(results)


In [97]:
test_plot = "A thief who enters people's dreams to steal ideas"
recommend_movies_by_plot(test_plot, top_k=5)


Unnamed: 0,title,similarity_score,overview
0,Inception,0.605,"Cobb, a skilled thief who commits corporate es..."
1,Heat,0.42,Obsessive master thief Neil McCauley leads a t...
2,Insidious,0.398,A family discovers that dark spirits have inva...
3,Novocaine,0.398,"When the girl of his dreams is kidnapped, ever..."
4,Abigail,0.391,A group of criminals kidnap a teenage ballet d...


In [98]:
def unified_smart_recommend(user_input, top_k=5):
    user_input_cleaned = user_input.strip().lower()

    # 1. Title Match
    match = df_movies[df_movies['title'].str.lower() == user_input_cleaned]
    if not match.empty:
        print(f"Exact title match found: '{match.iloc[0]['title']}'")
        plot = match.iloc[0]['overview']
        print(f"over view of {user_input} : {plot}")
        print("Finding similar movies: ")
        return recommend_movies_by_plot(plot, top_k)

    # 2. Genre Match
    all_genres = set(g.lower() for sublist in df_movies['genres'] for g in sublist)
    if user_input_cleaned in all_genres:
        print(f"Interpreted as Genre: '{user_input}'")
        genre_mask = df_movies['genres'].apply(lambda g: user_input_cleaned in [x.lower() for x in g])
        return df_movies[genre_mask][['title', 'overview', 'genres']].head(top_k)

    # 3. Director Match (regex=False so the input is matched as a literal substring)
    if df_movies['director'].dropna().str.lower().str.contains(user_input_cleaned, regex=False).any():
        print(f"Interpreted as Director: '{user_input}'")
        director_mask = df_movies['director'].fillna('').str.lower().str.contains(user_input_cleaned, regex=False)
        return df_movies[director_mask][['title', 'overview', 'director']].head(top_k)

    # 4. Fallback to plot-based search
    print("Interpreted as Plot Description.")
    return recommend_movies_by_plot(user_input, top_k)
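
A detail worth keeping in mind for the director branch: pandas `str.contains` interprets its pattern as a regular expression unless `regex=False` is passed, so free-text input containing characters such as `.` or `(` can match unexpectedly or raise an error. The sketch below reproduces the difference with the standard `re` module on director names from the dataset preview; `regex_match` and `literal_match` are hypothetical helpers for illustration.

```python
import re

directors = ["Paul W. S. Anderson", "Jared Hess", "Patricia Riggen"]

def regex_match(query, names):
    """Query interpreted as a regular expression."""
    return [n for n in names if re.search(query.lower(), n.lower())]

def literal_match(query, names):
    """Plain substring test, comparable to str.contains(..., regex=False)."""
    return [n for n in names if query.lower() in n.lower()]

print(regex_match("j.red", directors))    # ['Jared Hess'] because '.' is a wildcard
print(literal_match("j.red", directors))  # []

try:
    regex_match("heist (2001", directors)  # unbalanced parenthesis
except re.error:
    print("regex error")
print(literal_match("heist (2001", directors))  # []
```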


In [123]:
unified_smart_recommend("Pride & Prejudice")

Exact title match found: 'Pride & Prejudice'
over view of Pride & Prejudice : A story of love and life among the landed English gentry during the Georgian era. Mr. Bennet is a gentleman living in Hertfordshire with his overbearing wife and five daughters, but if he dies their house will be inherited by a distant cousin whom they have never met, so the family's future happiness and security is dependent on the daughters making good marriages.
Finding similar movies: 


Unnamed: 0,title,similarity_score,overview
0,Pride & Prejudice,1.0,A story of love and life among the landed Engl...
1,The Notebook,0.574,An epic love story centered around an older ma...
2,Claudia,0.55,Eager to escape her high-society life among th...
3,Come Play with Me,0.539,Two alluring young ladies live with their beau...
4,Corpse Bride,0.496,"In a 19th-century European village, a young ma..."


In [100]:
unified_smart_recommend("An elderly man reads to a woman with dementia the story of two young lovers whose romance is threatened by the difference in their respective social classes")

Interpreted as Plot Description.


Unnamed: 0,title,similarity_score,overview
0,The Notebook,0.708,An epic love story centered around an older ma...
1,Tatsulok,0.571,Story about a happily married couple visited b...
2,Pride & Prejudice,0.477,A story of love and life among the landed Engl...
3,Apartment Wife: Scent of a Woman,0.45,A female high school teacher suffers from amne...
4,Below Her Mouth,0.448,An unexpected affair quickly escalates into a ...


# Evaluation of SBERT + FAISS

In [103]:
results = []

for entry in evaluation_set:
    query = entry["query"]
    relevant = entry["relevant_titles"]

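    # Note: 10 results are retrieved here (the TF-IDF baseline used top_k=5), so MRR is computed over a longer list
    recommendations = unified_smart_recommend(query, top_k=10)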

    precision = precision_at_k(recommendations, relevant, k=5)
    recall = recall_at_k(recommendations, relevant, k=5)
    reciprocal_rank = mrr(recommendations, relevant)

    results.append({
        "query": query,
        "precision@5": round(precision, 3),
        "recall@5": round(recall, 3),
        "MRR": round(reciprocal_rank, 3)
    })

pd.DataFrame(results)


Exact title match found: 'The Batman'
over view of The batman : In his second year of fighting crime, Batman uncovers corruption in Gotham City that connects to his own family while facing a serial killer known as the Riddler.
Finding similar movies: 
Interpreted as Plot Description.
Interpreted as Plot Description.


Unnamed: 0,query,precision@5,recall@5,MRR
0,The batman,0.6,1.0,1.0
1,A thief who enters people's dreams to steal ideas,0.2,0.333,1.0
2,An elderly man reads to a woman with dementia ...,0.4,0.667,1.0
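
For a side-by-side view, the per-query scores from the two result tables can be macro-averaged. The sketch below copies the reported numbers by hand; `summarize` is a hypothetical helper. The comparison should be read with caution: there are only three queries, the SBERT run retrieved 10 results while the TF-IDF run retrieved 5, and for the title query the matched movie itself is returned at rank 1 and also listed as relevant.

```python
from statistics import mean

# Per-query scores copied from the two result tables in this notebook
tfidf = {
    "precision@5": [0.4, 0.2, 0.2],
    "recall@5": [0.667, 0.333, 0.333],
    "MRR": [0.333, 0.5, 1.0],
}
sbert = {
    "precision@5": [0.6, 0.2, 0.4],
    "recall@5": [1.0, 0.333, 0.667],
    "MRR": [1.0, 1.0, 1.0],
}

def summarize(scores):
    """Macro-average each metric over the queries."""
    return {metric: round(mean(values), 3) for metric, values in scores.items()}

print("TF-IDF:", summarize(tfidf))  # precision@5 0.267, recall@5 0.444, MRR 0.611
print("SBERT :", summarize(sbert))  # precision@5 0.4, recall@5 0.667, MRR 1.0
```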
