# 1 Prepare Environment

```pip install pandas numpy scikit-learn sentence-transformers torch```

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util


# 2 Load Dataset

## 2.1 Introduce Dataset

The movies dataset is a comprehensive collection of information about 4,803 movies. It provides a wide range of details about each movie, including budget, genres, production companies, release date, revenue, runtime, language, popularity, and more.

https://www.kaggle.com/datasets/utkarshx27/movies-dataset?resource=download

In [2]:
# Load dataset
df = pd.read_csv("movie_dataset.csv")
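# Keep only the text columns used for matching; .copy() avoids a SettingWithCopyWarning on the later in-place dropna
df = df[['original_title', 'genres', 'keywords', 'overview']].copy()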
df.head(10)

index,original_title,genres,keywords,overview
0,Avatar,Action Adventure Fantasy Science Fiction,culture clash future space war space colony so...,"In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,Adventure Fantasy Action,ocean drug abuse exotic island east india trad...,"Captain Barbossa, long believed to be dead, ha..."
2,Spectre,Action Adventure Crime,spy based on novel secret agent sequel mi6,A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,Action Crime Drama Thriller,dc comics crime fighter terrorist secret ident...,Following the death of District Attorney Harve...
4,John Carter,Action Adventure Science Fiction,based on novel mars medallion space travel pri...,"John Carter is a war-weary, former military ca..."
5,Spider-Man 3,Fantasy Action Adventure,dual identity amnesia sandstorm love of one's ...,The seemingly invincible Spider-Man goes up ag...
6,Tangled,Animation Family,hostage magic horse fairy tale musical,When the kingdom's most wanted-and most charmi...
7,Avengers: Age of Ultron,Action Adventure Science Fiction,marvel comic sequel superhero based on comic b...,When Tony Stark tries to jumpstart a dormant p...
8,Harry Potter and the Half-Blood Prince,Adventure Fantasy Family,witch magic broom school of witchcraft wizardry,"As Harry begins his sixth year at Hogwarts, he..."
9,Batman v Superman: Dawn of Justice,Action Adventure Fantasy,dc comics vigilante superhero based on comic b...,Fearing the actions of a god-like Super Hero l...


## 2.2 Preprocess Data

Count the missing values in each column, then drop every row that contains a NaN.

In [3]:
df.isna().sum()

original_title      0
genres             28
keywords          412
overview            3
dtype: int64

In [4]:
df.dropna(inplace=True)

# 3 TF-IDF Approach

## 3.1 Build Vectors

In [5]:
# Initialize vectorizers
vectorizer_keywords = TfidfVectorizer(stop_words='english')
vectorizer_overview = TfidfVectorizer(stop_words='english')
vectorizer_genres = TfidfVectorizer(stop_words='english')

# Fit and transform text data
tfidf_keywords = vectorizer_keywords.fit_transform(df['keywords'])
tfidf_overview = vectorizer_overview.fit_transform(df['overview'])
tfidf_genres = vectorizer_genres.fit_transform(df['genres'])

user_input = "I love thrilling action movies set in space, with a comedic twist."

# Transform the user query with each of the three fitted vectorizers
user_vec_keywords = vectorizer_keywords.transform([user_input])
user_vec_overview = vectorizer_overview.transform([user_input])
user_vec_genres = vectorizer_genres.transform([user_input])
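
One property of this step is worth keeping in mind: `TfidfVectorizer.transform` only counts terms that were in the vocabulary learned during fitting, and it does no stemming by default, so a query word such as "comedic" does not match a genre token such as "comedy". The sketch below illustrates that exact-token behaviour with plain Python sets. The tokenizer, the short stop-word list, and the two sample strings are simplified stand-ins invented for illustration, not scikit-learn's own logic or rows from the dataset.

```python
import re

# Simplified stand-in for a stop-word list (assumption: the real English list is much longer).
STOP_WORDS = {"i", "in", "with", "a", "the", "of", "and"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens of 2+ letters, drop stop words."""
    tokens = re.findall(r"[a-z]{2,}", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def shared_terms(query, document):
    """Return the query tokens that also occur verbatim in the document."""
    return sorted(set(tokenize(query)) & set(tokenize(document)))

query = "I love thrilling action movies set in space, with a comedic twist."
genres = "Action Comedy Science Fiction"         # hypothetical genre string
keywords = "space mission astronaut plot twist"  # hypothetical keyword string

print(shared_terms(query, genres))    # "comedic" does not match "comedy"
print(shared_terms(query, keywords))
```

Only tokens that match exactly can contribute to the cosine similarity, which is one plausible reason a lexical method can miss movies that are relevant in meaning but worded differently.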

## 3.2 Calculate Similarity

In [None]:
WEIGHT_OVERVIEW = 0.3
WEIGHT_KEYWORDS = 0.6
WEIGHT_GENRES = 0.1

# Compute cosine similarity
similarity_keywords = cosine_similarity(user_vec_keywords, tfidf_keywords).flatten()
similarity_overview = cosine_similarity(user_vec_overview, tfidf_overview).flatten()
similarity_genres = cosine_similarity(user_vec_genres, tfidf_genres).flatten()

# Compute final weighted similarity
overall_similarity = (WEIGHT_KEYWORDS * similarity_keywords) + (WEIGHT_OVERVIEW * similarity_overview) + (WEIGHT_GENRES * similarity_genres)
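
To make the weighting concrete, here is a minimal pure-Python sketch of the same composite rule applied to made-up numbers. The two movies and their per-field similarities are invented for illustration and are not taken from the dataset.

```python
# Weights copied from the cell above.
WEIGHT_OVERVIEW = 0.3
WEIGHT_KEYWORDS = 0.6
WEIGHT_GENRES = 0.1

def composite_score(sim_keywords, sim_overview, sim_genres):
    """Weighted sum of the three per-field cosine similarities."""
    return (WEIGHT_KEYWORDS * sim_keywords
            + WEIGHT_OVERVIEW * sim_overview
            + WEIGHT_GENRES * sim_genres)

# Hypothetical similarities, in the order (keywords, overview, genres).
movie_a = composite_score(0.50, 0.10, 0.00)  # strong keyword match only
movie_b = composite_score(0.10, 0.40, 0.90)  # strong genre and overview match

print(round(movie_a, 2), round(movie_b, 2))
```

With these weights, movie A scores 0.33 and movie B scores 0.27, so a strong keyword match outranks a near-perfect genre match. Because the weights sum to 1, the composite score stays on the same scale as the individual similarities.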

## 3.3 Return Top Matches

In [7]:
TOP_N = 5

top_indices = overall_similarity.argsort()[-TOP_N:][::-1]

recommendations = df.iloc[top_indices].assign(score=overall_similarity[top_indices])

recommendations

index,original_title,genres,keywords,overview,score
761,Righteous Kill,Action Crime Drama Thriller,revenge murder plot twist dirty cop,Two veteran New York City detectives work to i...,0.204876
1937,King's Ransom,Comedy Crime,caper action,Hoping to foil his own gold-digging wife's pla...,0.200791
239,Gravity,Science Fiction Thriller Drama,space mission loss space astronaut trapped in ...,"Dr. Ryan Stone, a brilliant medical engineer o...",0.17882
658,Death Race,Action Thriller Science Fiction,car race dystopia matter of life and death pri...,"Terminal Island, New York: 2020. Overcrowding ...",0.175149
1951,Белка и Стрелка. Звёздные собаки,Family Animation,russia space mission space outer space dog,"Belka, the amazing flying dog is unexpectedly ...",0.168931
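
The `argsort()[-TOP_N:][::-1]` idiom used above sorts positions by ascending score, keeps the last `TOP_N`, and reverses them so the best match comes first. A minimal pure-Python equivalent, run on invented scores rather than the real similarity array, might look like this:

```python
def top_n_indices(scores, n):
    """Positions of the n largest scores, best first."""
    # Positions ordered by ascending score, analogous to argsort()
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    # Keep the last n positions and reverse them into descending order
    return ascending[-n:][::-1]

scores = [0.12, 0.48, 0.05, 0.33, 0.27]  # hypothetical similarity scores
print(top_n_indices(scores, 3))
```

The returned values are positions, not index labels, which is why they are paired with `df.iloc`. That distinction matters here because `dropna` leaves gaps in the label index.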


In [8]:
# Print the four text fields of each recommendation in full
labels = ["Title", "Genres", "Keywords", "Overview"]
for _, row in recommendations.iterrows():
    for label, value in zip(labels, row.iloc[:4]):
        print(label, "-", value)
    print()

Title - Righteous Kill
Genres - Action Crime Drama Thriller
Keywords - revenge murder plot twist dirty cop
Overview - Two veteran New York City detectives work to identify the possible connection between a recent murder and a case they believe they solved years ago; is there a serial killer on the loose, and did they perhaps put the wrong person behind bars?

Title - King's Ransom
Genres - Comedy Crime
Keywords - caper action
Overview - Hoping to foil his own gold-digging wife's plan, a loathsome businessman arranges his own kidnapping, only to realize that there are plenty of other people interested in his wealth as well.

Title - Gravity
Genres - Science Fiction Thriller Drama
Keywords - space mission loss space astronaut trapped in space
Overview - Dr. Ryan Stone, a brilliant medical engineer on her first Shuttle mission, with veteran astronaut Matt Kowalsky in command of his last flight before retiring. But on a seemingly routine spacewalk, disaster strikes. The Shuttle is destroye

# 4 SBERT Approach

## 4.1 Build Vectors

In [None]:
# Load pre-trained SBERT model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')

# Compute embeddings for dataset
embeddings_keywords = sbert_model.encode(df['keywords'].tolist(), convert_to_tensor=True).cpu()
embeddings_overview = sbert_model.encode(df['overview'].tolist(), convert_to_tensor=True).cpu()
embeddings_genres = sbert_model.encode(df['genres'].tolist(), convert_to_tensor=True).cpu()

user_input = "I love thrilling action movies set in space, with a comedic twist."

# Encode the user input once; the same embedding is compared against all three fields
user_embedding = sbert_model.encode(user_input, convert_to_tensor=True).cpu()
user_embedding_keywords = user_embedding
user_embedding_overview = user_embedding
user_embedding_genres = user_embedding

## 4.2 Calculate Similarity

In [10]:
# Define weight parameters
WEIGHT_OVERVIEW = 0.3
WEIGHT_KEYWORDS = 0.6
WEIGHT_GENRES = 0.1

# Compute cosine similarity
similarity_keywords = util.cos_sim(user_embedding_keywords, embeddings_keywords).squeeze().numpy()
similarity_overview = util.cos_sim(user_embedding_overview, embeddings_overview).squeeze().numpy()
similarity_genres = util.cos_sim(user_embedding_genres, embeddings_genres).squeeze().numpy()

# Compute final weighted similarity
overall_similarity = (WEIGHT_KEYWORDS * similarity_keywords) + (WEIGHT_OVERVIEW * similarity_overview) + (WEIGHT_GENRES * similarity_genres)

## 4.3 Return Top Matches

In [11]:
TOP_N = 5

# Get top matches
top_indices = overall_similarity.argsort()[-TOP_N:][::-1]

recommendations = df.iloc[top_indices].assign(score=overall_similarity[top_indices])

recommendations

index,original_title,genres,keywords,overview,score
1053,Galaxy Quest,Comedy Family Science Fiction,space battle spaceship spoof fictional tv show,The stars of a 1970s sci-fi show - now scrapin...,0.487216
1086,Aliens in the Attic,Adventure Comedy Family Fantasy Science Fiction,alien comedy duringcreditsstinger beforecredit...,"It's summer vacation, but the Pearson family k...",0.479263
1650,Wing Commander,Action Science Fiction,fight pilot outer space based on video game sp...,The Hollywood version of the popular video gam...,0.458202
3184,The Ice Pirates,Action Science Fiction Comedy,rebel space war water sci-fi comedy,"The time is the distant future, where by far t...",0.439505
3730,Cargo,Thriller Mystery Science Fiction,space colony space travel simulated reality s...,The story of CARGO takes place on rusty space-...,0.430614


In [12]:
# Print the four text fields of each recommendation in full
labels = ["Title", "Genres", "Keywords", "Overview"]
for _, row in recommendations.iterrows():
    for label, value in zip(labels, row.iloc[:4]):
        print(label, "-", value)
    print()

Title - Galaxy Quest
Genres - Comedy Family Science Fiction
Keywords - space battle spaceship spoof fictional tv show
Overview - The stars of a 1970s sci-fi show - now scraping a living through re-runs and sci-fi conventions - are beamed aboard an alien spacecraft. Believing the cast's heroic on-screen dramas are historical documents of real-life adventures, the band of aliens turn to the ailing celebrities for help in their quest to overcome the oppressive regime in their solar system.

Title - Aliens in the Attic
Genres - Adventure Comedy Family Fantasy Science Fiction
Keywords - alien comedy duringcreditsstinger beforecreditsstinger live action and animation
Overview - It's summer vacation, but the Pearson family kids are stuck at a boring lake house with their nerdy parents. That is until feisty, little, green aliens crash-land on the roof, with plans to conquer the house AND Earth! Using only their wits, courage and video game-playing skills, the youngsters must band together to d

# 5 Summary

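- I compared TF-IDF and SBERT as embedding methods for turning the text fields into vectors for the subsequent similarity calculation.
- When calculating the cosine similarities, I used a composite rule in which each movie's keywords (60%), overview (30%), and genres (10%) similarities are combined into an overall score.
- Judging by the top-5 matches returned, SBERT gave the better result: all five of its matches are Science Fiction titles and three are also tagged Comedy, whereas the top two TF-IDF matches (Righteous Kill and King's Ransom) have no space-related genres or keywords.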

# 6 Salary Expectation

- $30 per hour
- 20 hours per week
- $2400 per month