<a href="https://colab.research.google.com/github/fjiang316/lumaa-spring-2025-ai-ml_Fiona_Jiang/blob/main/Lumaa_coding_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommender Coding Challenge
Author: Fiona Jiang

In [5]:
import pandas as pd
import numpy as np

## Dataset Loading
In this section we download the dataset from Kaggle. The dataset I chose is The Movies Dataset by rounakbanik; specifically, we use movies_metadata.csv. There are two options:

1. Go directly to https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv, download the CSV file, and upload it to the same directory as this notebook.

2. Run the code cells below to download it from Kaggle directly (this works best in Google Colab). Note that this requires a Kaggle API token (kaggle.json), which you can generate from your Kaggle account.

In [1]:
from google.colab import files
files.upload()  # Upload kaggle.json

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [2]:
!mkdir -p ~/.kaggle && mv kaggle.json ~/.kaggle/  # create the config directory before moving the token
!chmod 600 ~/.kaggle/kaggle.json  # restrict permissions on the API token


In [6]:
!kaggle datasets download -d rounakbanik/the-movies-dataset --unzip --file movies_metadata.csv

Dataset URL: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
License(s): CC0-1.0
movies_metadata.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


In [7]:
import zipfile

with zipfile.ZipFile("movies_metadata.csv.zip", 'r') as zip_ref:
    zip_ref.extractall(".")

Now that the download is finished, the CSV file should be in the same directory as this notebook. We load the data, drop rows with missing values, randomly sample 500 rows, and keep only the movie title and overview columns for the purpose of this coding challenge.

In [23]:
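sample_size = 500
# low_memory=False makes pandas infer each column's dtype from the whole file
movie_df = pd.read_csv("movies_metadata.csv", low_memory=False)
movie_df = movie_df.dropna()  # drop rows with a missing value in any column
# seeded sample so the same 500 rows are drawn on every run
movie_df = movie_df.sample(n=sample_size, random_state=42).reset_index(drop=True)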


In [24]:
movie_df = movie_df[['original_title', 'overview']]
movie_df.head()

|   | original_title | overview |
|---|---|---|
| 0 | Death Race 2 | In the world's most dangerous prison, a new ga... |
| 1 | Jack Reacher: Never Go Back | Jack Reacher must uncover the truth behind a m... |
| 2 | Twilight | When Bella Swan moves to a small town in the P... |
| 3 | Saw 3D | As a deadly battle rages over Jigsaw's brutal ... |
| 4 | Despicable Me | Villainous Gru lives up to his reputation as a... |


For reproducibility, this version of the dataset is saved to a CSV file that is uploaded to the GitHub repo. You should also be able to regenerate the same file by following the steps above, because the sampling is seeded.

In [26]:
movie_df.to_csv('movies_500.csv', index=False)

## Content Similarity Based Approach (with TF-IDF vectorization and cosine similarity)
This is the standard content-based approach: movies are ranked by the cosine similarity between the TF-IDF vector of each movie's overview and the TF-IDF vector of the user's prompt.

To do this, we vectorize the overviews and the prompt, then compute the cosine similarity scores, using the corresponding functions from the sklearn package. The general idea is that TF-IDF uses the bag-of-words concept to build a vector for each text, weighting each word by how relevant it is to that text. Cosine similarity then compares the word usage of two texts: if the two use similar words in similar proportions, their cosine similarity is close to 1.
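As a rough illustration of that last point, the sketch below compares raw word-count vectors (no IDF weighting) with cosine similarity. The function name `bow_cosine` and the sample phrases are made up for this example and are not part of the recommender.

```python
from collections import Counter
import math

def bow_cosine(text_a, text_b):
    """Cosine similarity between the raw word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Same words in the same proportions (the second text is the first one doubled)
same_proportions = bow_cosine("space action space", "space action space space action space")
# One shared word out of three
partial_overlap = bow_cosine("space action comedy", "space romance drama")
# No shared words
no_overlap = bow_cosine("space action", "romance drama")

print(same_proportions, partial_overlap, no_overlap)
```

The first score is approximately 1 because document length does not matter, only the proportions of words; the last is 0 because the texts share no words.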

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [28]:
# vectorize all rows at once to save computation and reduce latency.
vectorizer = TfidfVectorizer()
tfidf_vectors = vectorizer.fit_transform(movie_df["overview"])
movie_df["tfidf_vector"] = list(tfidf_vectors.toarray())

In [41]:
def compute_similarity(df, vectorizer, user_query, top_n=5):
    """Print and return the top_n movies whose overviews best match user_query."""
    user_vec = vectorizer.transform([user_query])
    # stack the stored row vectors into one matrix and score every movie in a single call
    movie_matrix = np.vstack(df['tfidf_vector'].tolist())
    sim_scores = cosine_similarity(user_vec, movie_matrix)[0]
    top_indices = sim_scores.argsort()[-top_n:][::-1]  # indices of the top N matches, best first
    top_movie = df.iloc[top_indices][['original_title', 'overview']]
    top_sim = sim_scores[top_indices]
    for i in range(len(top_indices)):
        print(f"The top {i+1} match is '{top_movie.iloc[i]['original_title']}' with similarity score {top_sim[i]}. \
        Overview: {top_movie.iloc[i]['overview']}.")

    return top_movie['original_title'].tolist(), top_sim

## Testing

In [42]:
# Testing with prompt
user = "I love thrilling action movies set in space, with a comedic twist."
compute_similarity(movie_df, vectorizer, user)

The top 1 match is '[REC]³ Génesis' with similarity score 0.16465183003201372.       Overview: The action now takes place miles away from the original location and partly in broad daylight, giving the film an entirely fresh yet disturbing new reality. The infection has left the building. In a clever twist that draws together the plots of the first two movies, this third part of the saga also works as a decoder to uncover information hidden in the first two films and leaves the door open for the final installment, the future '[REC] 4 Apocalypse.'.
The top 2 match is 'Iron Sky' with similarity score 0.14181209339408896.       Overview: In the last moments of World War II, a secret Nazi space program evaded destruction by fleeing to the Dark Side of the Moon. During 70 years of utter secrecy, the Nazis construct a gigantic space fortress with a massive armada of flying saucers..
The top 3 match is 'You Only Live Twice' with similarity score 0.13413114995790165.       Overview: A mysteriou

(['[REC]³ Génesis',
  'Iron Sky',
  'You Only Live Twice',
  'Moonraker',
  'Camp Rock'],
 array([0.16465183, 0.14181209, 0.13413115, 0.11349949, 0.10965886]))

In [52]:
user = "I love cartoon comedies with magical elements."
recommendation, scores = compute_similarity(movie_df, vectorizer, user)

The top 1 match is '2046' with similarity score 0.13921836364561335.       Overview: 2046 is the sequel to Wong Kar-Wais’ successful box-office hit In The Mood For Love. A film about affairs, ending relationships, and a shared love for Kung-Fu novels as the main character, Chow, writes his own novel and reflects back on his favorite love Su..
The top 2 match is '劇場版ポケットモンスター 幻のポケモン ルギア爆誕' with similarity score 0.11984922615066444.       Overview: Ash Ketchum must put his skill to the test when he attempts to save the world from destruction. The Greedy Pokemon collector Lawrence III throws the universe into chaos after disrupting the balance of nature by capturing one of the Pokemon birds that rule the elements of fire, lightning and ice. Will Ash have what it takes to save the world?.
The top 3 match is 'Tangled' with similarity score 0.09243273131577245.       Overview: When the kingdom's most wanted-and most charming-bandit Flynn Rider hides out in a mysterious tower, he's taken host

In [57]:
import json
output_sample = {'user_prompt': user, 'recommended_movies': recommendation, "similarity_scores": scores.tolist()}
with open("sample_output.json", "w") as file:
  json.dump(output_sample, file, indent=4)

## Miscellaneous: Specification of TF-IDF vectorization and cosine similarity

In the previous section we imported functions from sklearn for these two tasks. Here we implement both from scratch to illustrate the principles behind them.

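Formula:
$$\text{TF-IDF}(t,d)=\text{TF}(t,d) \times \text{IDF}(t),$$
where
$$\text{TF}(t,d)= \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$
and
$$\text{IDF}(t)=\log\left(\frac{N}{\text{DF}(t)}\right),$$
with $N$ the total number of documents and $\text{DF}(t)$ the number of documents that contain term $t$.

And for cosine similarity:
$$\text{cosine similarity}=\frac{A \cdot B}{\lVert A \rVert \lVert B \rVert},$$
where $\lVert A \rVert$ is the norm of $A$.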
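To make the formulas concrete, here is a small worked example on a hypothetical four-document corpus. The documents and the helper names `tf`, `idf`, `tfidf_vector`, and `cosine` are assumptions for illustration only. It applies the unsmoothed definitions exactly as written above.

```python
import math

# Hypothetical toy corpus (not taken from the movie dataset), already tokenized
docs = [
    ["space", "action", "space"],
    ["space", "comedy"],
    ["romance", "drama"],
    ["romance", "comedy"],
]
N = len(docs)

def tf(term, doc):
    """Equation 2: occurrences of term in doc divided by the length of doc."""
    return doc.count(term) / len(doc)

def idf(term):
    """Equation 3 (unsmoothed). Only safe for terms that occur in the corpus."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tfidf_vector(doc, vocab):
    """Equation 1, evaluated for every term in a fixed vocabulary order."""
    return [tf(t, doc) * idf(t) for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vocab = sorted({t for doc in docs for t in doc})
vectors = [tfidf_vector(doc, vocab) for doc in docs]

# "space" appears 2 times out of 3 words in docs[0] and in 2 of the 4 documents,
# so its weight is (2/3) * log(4/2)
space_in_doc0 = tf("space", docs[0]) * idf("space")

sim_01 = cosine(vectors[0], vectors[1])  # the two documents share "space"
sim_02 = cosine(vectors[0], vectors[2])  # the two documents share no words

print(space_in_doc0, sim_01, sim_02)
```

Documents that share a term get a positive score, while documents with disjoint vocabularies score exactly 0.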



In [46]:
# Example of vectorization
from collections import Counter
import math

def get_tfidf_vector(corpus):
    """Return one {word: tf-idf weight} dict per document in corpus."""
    tokenized = [overview.split() for overview in corpus]

    # TF part (equation 2): word count divided by document length
    tf_list = []
    for words in tokenized:
        word_count = Counter(words)  # unique words and their occurrences in this overview
        total_words = len(words)
        tf_list.append({word: count / total_words for word, count in word_count.items()})

    # IDF part (equation 3), with smoothing
    total_documents = len(tokenized)  # size of the corpus passed in, not the global movie_df
    # number of documents each word appears in (each document counted once per word)
    doc_freq = Counter(word for words in tokenized for word in set(words))
    # the +1 in the denominator and the +1 after the log are smoothing terms added to equation 3
    idf = {word: math.log(total_documents / (df + 1)) + 1 for word, df in doc_freq.items()}

    # TF-IDF vector for each row (equation 1)
    return [{word: tf[word] * idf[word] for word in tf} for tf in tf_list]

In [47]:
# Now experiment with cosine similarity
def compute_cosine_similarity(vec1, vec2):
    # Compute dot product of two vectors
    dot_product = sum(vec1.get(word, 0) * vec2.get(word, 0) for word in set(vec1) | set(vec2))

    # Compute norm of the vectors
    norm1 = math.sqrt(sum(val ** 2 for val in vec1.values()))
    norm2 = math.sqrt(sum(val ** 2 for val in vec2.values()))

    # special cases when magnitude of either vector is 0, this means similarity between the two has to be 0.
    if norm1 == 0 or norm2 == 0:
        return 0
    return dot_product / (norm1 * norm2)

In [48]:
# Example test similarity
user = "I like action movies with superheros."

# getting vectors
tfidf_list = get_tfidf_vector(movie_df['overview'])
query_tfidf = get_tfidf_vector([user])

# compute similarity with each
sim_scores = [compute_cosine_similarity(query_tfidf[0], movie) for movie in tfidf_list]

# get top 5 movies
top_indices = np.argsort(sim_scores)[-5:][::-1]  # Get top N indices (sorted)
top_movie = [movie_df['original_title'][i] for i in top_indices]
top_sim = [sim_scores[i] for i in top_indices]

# Display the results
for i in range(5):
    print(f"The top {i+1} match is '{top_movie[i]}' with similarity score {top_sim[i]}.")

The top 1 match is '[REC]²' with similarity score 0.15567289851709365.
The top 2 match is 'Mission: Impossible III' with similarity score 0.09027384094029696.
The top 3 match is 'Vampires Suck' with similarity score 0.08249387381117913.
The top 4 match is '西遊記之大鬧天宮' with similarity score 0.08100494733114658.
The top 5 match is 'Mad Max: Fury Road' with similarity score 0.0808005526636628.


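Overall, the two sets of results should be broadly similar but not identical. The sklearn version transforms the user prompt with the same vectorizer that was fitted on the movie overviews, so query terms receive corpus-level IDF weights, whereas the manual version computes TF-IDF for the prompt on its own. `TfidfVectorizer` also lowercases text and strips punctuation when tokenizing, while the manual version splits on whitespace only. As a result, the manual scores differ slightly and are likely less accurate.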

In [51]:
# Compare with using packages as in the previous section
compute_similarity(movie_df, vectorizer, user)

The top 1 match is '[REC]³ Génesis' with similarity score 0.14823087321568218.       Overview: The action now takes place miles away from the original location and partly in broad daylight, giving the film an entirely fresh yet disturbing new reality. The infection has left the building. In a clever twist that draws together the plots of the first two movies, this third part of the saga also works as a decoder to uncover information hidden in the first two films and leaves the door open for the final installment, the future '[REC] 4 Apocalypse.'.
The top 2 match is '[REC]²' with similarity score 0.13926419164696807.       Overview: The action continues from [REC], with the medical officer and a SWAT team outfitted with video cameras are sent into the sealed off apartment to control the situation..
The top 3 match is 'Think Like a Man' with similarity score 0.12693958981726322.       Overview: The balance of power in four couples’ relationships is upset when the women start using the ad

(['[REC]³ Génesis',
  '[REC]²',
  'Think Like a Man',
  '西遊記之大鬧天宮',
  'Agent Cody Banks 2: Destination London'],
 array([0.14823087, 0.13926419, 0.12693959, 0.11374865, 0.10298365]))
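The sketch below illustrates two differences between the pipelines that may explain part of the gap in the rankings: tokenization and IDF smoothing. It does not call sklearn. The regular expression mirrors the default `token_pattern` documented for `TfidfVectorizer`, and `sklearn_smooth_idf` follows the smoothed IDF formula given in the sklearn documentation. Both helper names are hypothetical and exist only for this comparison.

```python
import math
import re

query = "I like action movies with superheros."

# Whitespace splitting, as in the from-scratch get_tfidf_vector
manual_tokens = query.split()

# Approximation of TfidfVectorizer's default analyzer: lowercase the text, then
# keep runs of two or more word characters (default token_pattern)
sklearn_like_tokens = re.findall(r"(?u)\b\w\w+\b", query.lower())

def manual_idf(n_docs, doc_freq):
    """IDF as written in the from-scratch cell: log(N / (DF + 1)) + 1."""
    return math.log(n_docs / (doc_freq + 1)) + 1

def sklearn_smooth_idf(n_docs, doc_freq):
    """Smoothed IDF documented for TfidfVectorizer (smooth_idf=True, the default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(manual_tokens)        # keeps "I" and the trailing period on "superheros."
print(sklearn_like_tokens)  # drops the one-letter token and the punctuation
print(manual_idf(500, 10), sklearn_smooth_idf(500, 10))
```

With whitespace splitting, a token such as `superheros.` can only match overviews that contain the same word followed by a period, which is one plausible reason the manual ranking diverges. The two IDF variants differ only slightly for a 500-document corpus.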