---
title: "Movie-Recommendation Model"
format:
    html: 
        code-fold: true
        code-tools: true
jupyter: python3
---

# Code

The code for this webpage can be found [here](https://github.com/dcorc7/Movie-Recommendation-Model/recommender.ipynb).

In [48]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from tabulate import tabulate
from IPython.display import display, HTML
import ipywidgets as widgets
import panel as pn

## Load the Dataset

In [49]:
# Load the cleaned movies dataframe
movies_df = pd.read_csv("../data/processed-data/movies_cleaned.csv")

pd.set_option("display.max_columns", None)
movies_df.head(3)


|  | IMDB_ID | Title | Year | Release_Date | Release_Month | Age_Rating | Overview | Keywords | Genre | Director | Actors | Runtime | Metascore_Rating | IMDB_Rating | Rotten_Tomatoes_Rating | TMDB_Rating | Average_Rating | Won_Award | Oscar_Wins | Oscar_Nominations | Budget | Budget_Normalized | Revenue | Revenue_Normalized | Return_On_Investment | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0097499 | henry v | 1989 | 1989-10-05 | October | pg-13 | gritty adaption william shakespeares play engl... | ['france kingdom theater play based on true st... | war | kenneth branagh | ['kenneth branagh derek jacobi simon shepherd'] | 137 | 8.3 | 7.5 | 9.8 | 7.2 | 8.2 | True | 1 | 0 | 9000000 | -0.873465 | 10200000 | -0.801446 | 1.133333 | 18.771 |
| 1 | tt1320253 | the expendables | 2010 | 2010-08-03 | August | r | barney ross leads band highly skilled mercenar... | ['rescue sniper island martial arts tattoo esc... | thriller | sylvester stallone | ['sylvester stallone jason statham jet li'] | 103 | 4.5 | 6.4 | 4.2 | 6.2 | 5.325 | False | 0 | 0 | 80000000 | 0.317499 | 274470394 | 0.18825 | 3.43088 | 74.573 |
| 2 | tt1025100 | gemini man | 2019 | 2019-10-02 | October | pg-13 | henry brogan elite 51 year assassin whos ready... | ['hitman clone'] | thriller | ang lee | ['will smith mary elizabeth winstead clive owen'] | 117 | 3.8 | 5.7 | 2.7 | 6.3 | 4.625 | False | 0 | 0 | 140000000 | 1.323948 | 173469516 | -0.189999 | 1.239068 | 27.266 |


## Clean the Dataset

In [50]:
# Remove brackets and apostrophes from the Actors and Keywords columns
movies_df["Actors"] = movies_df["Actors"].str.replace("[", "", regex = False).str.replace("]", "", regex = False).str.replace("'", "", regex = False)
movies_df["Keywords"] = movies_df["Keywords"].str.replace("[", "", regex = False).str.replace("]", "", regex = False).str.replace("'", "", regex = False)


# Drop columns that won't be included in the cosine similarity calculation
columns_to_drop = ["IMDB_ID", "Keywords", "Won_Award", "Release_Date", "Release_Month", "Age_Rating", "Budget", "Revenue"]
filtered_movies_df = movies_df.drop(columns = columns_to_drop)

# Preview the new dataframe
filtered_movies_df.head(3)

|  | Title | Year | Overview | Genre | Director | Actors | Runtime | Metascore_Rating | IMDB_Rating | Rotten_Tomatoes_Rating | TMDB_Rating | Average_Rating | Oscar_Wins | Oscar_Nominations | Budget_Normalized | Revenue_Normalized | Return_On_Investment | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | henry v | 1989 | gritty adaption william shakespeares play engl... | war | kenneth branagh | kenneth branagh derek jacobi simon shepherd | 137 | 8.3 | 7.5 | 9.8 | 7.2 | 8.2 | 1 | 0 | -0.873465 | -0.801446 | 1.133333 | 18.771 |
| 1 | the expendables | 2010 | barney ross leads band highly skilled mercenar... | thriller | sylvester stallone | sylvester stallone jason statham jet li | 103 | 4.5 | 6.4 | 4.2 | 6.2 | 5.325 | 0 | 0 | 0.317499 | 0.18825 | 3.43088 | 74.573 |
| 2 | gemini man | 2019 | henry brogan elite 51 year assassin whos ready... | thriller | ang lee | will smith mary elizabeth winstead clive owen | 117 | 3.8 | 5.7 | 2.7 | 6.3 | 4.625 | 0 | 0 | 1.323948 | -0.189999 | 1.239068 | 27.266 |
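
The chained `str.replace` calls used to strip brackets and apostrophes can also be written as a single regular-expression replacement. The sketch below runs on a small made-up Series rather than the project's data, and it assumes the only list residue in the cells is the characters `[`, `]`, and `'`. Like the chained version, it would also remove apostrophes that belong to a name.

```python
import pandas as pd

# Toy stand-in for the Actors column; the values are illustrative only
actors = pd.Series(["['kenneth branagh derek jacobi']", "['will smith clive owen']"])

# Remove square brackets and apostrophes in one pass using a character class
cleaned = actors.str.replace(r"[\[\]']", "", regex=True)

print(cleaned.tolist())
```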


## Compute TF-IDF and Cosine Similarity Scores for Text Data

In [51]:
# Combine each movie's text features into a single string in a new column
filtered_movies_df["combined_text_features"] = filtered_movies_df["Overview"] + " " + filtered_movies_df["Genre"] + " " + filtered_movies_df["Director"] + " " + filtered_movies_df["Actors"]

# Create a TF-IDF matrix to vectorize words for each movie's text features
vectorizer = TfidfVectorizer(max_features = 5000)
tfidf_matrix = vectorizer.fit_transform(filtered_movies_df["combined_text_features"])

# Calculate textual cosine similarity scores for each movie
text_cos_similarity = cosine_similarity(tfidf_matrix)
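
To make the text step concrete, here is a minimal sketch of TF-IDF followed by cosine similarity on three made-up descriptions (they are not rows from the dataset). The first two share the word "space", while the third shares no terms with the first, so the first pair should score higher.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three illustrative descriptions; only the first two have a word in common
docs = [
    "astronaut travels through space to save earth",
    "crew of astronauts explores deep space",
    "detective investigates a murder in the city",
]

# Each row of the TF-IDF matrix is one document; each column is one term
tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity between all rows (a 3 x 3 matrix)
sim = cosine_similarity(tfidf)

print(sim.round(3))
```

The diagonal is 1 because every document is identical to itself, and pairs with no shared terms score 0.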

## Compute Cosine Similarity Scores for Numerical Data

In [52]:
# Filter the df to only include numerical columns
numerical_features = ["Runtime", "Metascore_Rating", "IMDB_Rating", "Rotten_Tomatoes_Rating", "TMDB_Rating", "Average_Rating", 
                      "Oscar_Wins", "Return_On_Investment", "Budget_Normalized", "Revenue_Normalized", "Popularity"]

# Scale the values so that no single column dominates the cosine similarity scores
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(filtered_movies_df[numerical_features])

# Calculate numerical cosine similarity scores for each movie
numerical_cos_similarity = cosine_similarity(scaled_features)
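
The effect of scaling can be seen on a toy example. The sketch below uses made-up runtime and rating values (not the dataset) to show that, without scaling, the column with the larger magnitude drives the cosine similarity so that every pair of rows looks nearly identical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Made-up rows: [runtime in minutes, rating out of 10]
features = np.array([
    [90.0, 8.0],
    [95.0, 3.0],
    [180.0, 7.5],
    [120.0, 2.0],
])

# Unscaled: runtime is much larger than rating, so all rows point the same way
raw_sim = cosine_similarity(features)

# Min-max scaling maps each column to [0, 1] before the rows are compared
scaled = MinMaxScaler().fit_transform(features)
scaled_sim = cosine_similarity(scaled)

print("lowest unscaled similarity:", raw_sim.min().round(4))
print("lowest scaled similarity:", scaled_sim.min().round(4))
```

Because min-max-scaled features are non-negative, the resulting cosine similarities always fall between 0 and 1.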

## Determine Cosine Similarity Score Weights for Each Datatype

In [None]:
# Set the weight of each cosine similarity score to control whether text or numerical data has more influence on the recommendations
text_weight = 0.25
numerical_weight = 0.75

# Create a combined cosine similarity score that uses both text and numerical features
combined_similarity = text_weight * text_cos_similarity + numerical_weight * numerical_cos_similarity

# Function that takes a movie title and generates the top_n (default 10) movies most similar to it
def recommend_movies(movie_title, top_n = 10):    
    # Obtain the index of the given movie
    selected_movie_index = filtered_movies_df[filtered_movies_df["Title"] == movie_title].index[0]

    # Obtain the similarity scores for the selected movie and place them in a list, along with each movie's index
    sim_scores = list(enumerate(combined_similarity[selected_movie_index]))

    # Sort movies based on similarity scores
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)

    # Keep the n movies with the highest similarity scores (skipping the first entry, which is the selected movie itself)
    sim_scores = sim_scores[1:top_n + 1]

    # Get the indices and rounded similarity scores of the top-n similar movies
    movie_indices = [i[0] for i in sim_scores]
    movie_scores = [i[1].round(4) for i in sim_scores]
    
    # Create a recommendations dataframe with selected features of the top movies by matching the indices of the recommended movies
    columns_to_keep = ["IMDB_ID", "Title", "Year", "Age_Rating", "Genre", "Keywords", "Director", "Actors", "Average_Rating", "Revenue", "Budget", "Oscar_Wins"]

    # Copy the selected rows so that adding a column below does not trigger a SettingWithCopyWarning
    recommendations_df = movies_df[columns_to_keep].iloc[movie_indices].copy()
    recommendations_df["Similarity_Score"] = movie_scores

    # Return the top-n similar movies
    return recommendations_df
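
The weighting and ranking logic can be checked in isolation on small hand-written matrices. The sketch below is not the notebook's implementation: `blend` and `top_n_indices` are hypothetical helper names, the similarity values are made up, and the ranking uses `np.argsort` and removes the query item explicitly instead of assuming it sorts first.

```python
import numpy as np

# Toy similarity matrices for four items (symmetric, ones on the diagonal)
text_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.6],
    [0.3, 0.4, 0.6, 1.0],
])
num_sim = np.array([
    [1.0, 0.5, 0.9, 0.7],
    [0.5, 1.0, 0.6, 0.8],
    [0.9, 0.6, 1.0, 0.4],
    [0.7, 0.8, 0.4, 1.0],
])

def blend(text_sim, num_sim, text_weight=0.25):
    """Weighted average of two similarity matrices (the weights sum to 1)."""
    return text_weight * text_sim + (1.0 - text_weight) * num_sim

def top_n_indices(similarity, index, top_n=2):
    """Indices of the top_n most similar items, skipping `index` itself."""
    order = np.argsort(similarity[index])[::-1]  # highest score first
    order = order[order != index]                # drop the query item explicitly
    return order[:top_n].tolist()

combined = blend(text_sim, num_sim)
print(top_n_indices(combined, index=0))                            # numerical-heavy blend
print(top_n_indices(blend(text_sim, num_sim, 1.0), index=0))       # text only
```

Changing `text_weight` changes which neighbors rank highest, which is the lever the `text_weight` and `numerical_weight` variables provide.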

## Example: Interstellar

In [58]:
selected_movie = "interstellar"
recommendations = recommend_movies(selected_movie)

interstellar_df = movies_df[movies_df["Title"] == selected_movie].reset_index(drop = True)
interstellar_df = interstellar_df[["Title", "Year", "Age_Rating", "Genre", "Runtime", "Director", "Actors", "Average_Rating", "Oscar_Wins"]].reset_index(drop = True)


recommendations = recommendations[["Title", "Year", "Age_Rating", "Genre", "Director", "Average_Rating", "Similarity_Score"]].reset_index(drop = True)
recommendations.index = recommendations.index + 1

recommendations.to_csv("../website/recommendation_tables/interstellar_recommendations.csv", index = True)
interstellar_df.to_csv("../website/recommendation_tables/interstellar_metadata.csv", index = True)

recommendations.head(10)

|  | Title | Year | Age_Rating | Genre | Director | Average_Rating | Similarity_Score |
|---|---|---|---|---|---|---|---|
| 1 | contact | 1997 | pg | drama | robert zemeckis | 6.975 | 0.7687 |
| 2 | the dark knight rises | 2012 | pg-13 | action | christopher nolan | 8.175 | 0.7652 |
| 3 | tenet | 2020 | pg-13 | science fiction | christopher nolan | 7.1 | 0.7639 |
| 4 | the dark knight | 2008 | pg-13 | crime | christopher nolan | 8.825 | 0.7571 |
| 5 | inception | 2010 | pg-13 | science fiction | christopher nolan | 8.325 | 0.757 |
| 6 | the huntsman winters war | 2016 | pg-13 | drama | cedric nicolas-troyan | 4.475 | 0.757 |
| 7 | the martian | 2015 | pg-13 | science fiction | ridley scott | 8.2 | 0.7567 |
| 8 | prometheus | 2012 | r | science fiction | ridley scott | 6.825 | 0.7565 |
| 9 | mechanic resurrection | 2016 | r | thriller | dennis gansel | 4.65 | 0.7554 |
| 10 | the matrix | 1999 | r | science fiction | lana wachowski, lilly wachowski | 8.125 | 0.7548 |
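
`recommend_movies` looks up the title with an exact match, so the input has to be written the way titles are stored (for example, `"interstellar"` in lowercase). The sketch below shows one way to normalize user input and handle a missing title before the lookup. The helper names `normalize_title` and `find_title_index` are hypothetical, and the normalization rules (lowercase, drop apostrophes and colons, collapse whitespace) are an assumption inferred from the titles shown in the tables, not the project's actual cleaning code.

```python
import re
import pandas as pd

# Toy titles in the same lowercase, punctuation-light style as the tables above
titles = pd.Series(["interstellar", "the huntsman winters war", "mechanic resurrection"])

def normalize_title(raw):
    """Lowercase, drop apostrophes and colons, and collapse whitespace (assumed rules)."""
    text = raw.lower()
    text = re.sub(r"[':]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def find_title_index(raw, titles):
    """Return the position of the first exact match, or None if the title is absent."""
    matches = (titles == normalize_title(raw)).to_numpy().nonzero()[0]
    return int(matches[0]) if len(matches) else None

print(find_title_index("The Huntsman: Winter's War", titles))
print(find_title_index("Unknown Movie", titles))
```

Returning `None` lets the caller show a friendly message instead of hitting the `IndexError` that `.index[0]` raises when no row matches.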


## Example: Troy

In [59]:
selected_movie = "troy"
recommendations = recommend_movies(selected_movie)

troy_df = movies_df[movies_df["Title"] == selected_movie].reset_index(drop = True)
troy_df = troy_df[["Title", "Year", "Age_Rating", "Genre", "Runtime", "Director", "Actors", "Average_Rating", "Oscar_Wins"]].reset_index(drop = True)

recommendations = recommendations[["Title", "Year", "Age_Rating", "Genre", "Director", "Average_Rating", "Similarity_Score"]].reset_index(drop = True)
recommendations.index = recommendations.index + 1

recommendations.to_csv("../website/recommendation_tables/troy_recommendations.csv", index = True)
troy_df.to_csv("../website/recommendation_tables/troy_metadata.csv", index = True)


recommendations.head(10)

|  | Title | Year | Age_Rating | Genre | Director | Average_Rating | Similarity_Score |
|---|---|---|---|---|---|---|---|
| 1 | kingdom of heaven | 2005 | r | action | ridley scott | 6.125 | 0.7688 |
| 2 | pirates of the caribbean the curse of the blac... | 2003 | pg-13 | action | gore verbinski | 7.525 | 0.7644 |
| 3 | snow white and the huntsman | 2012 | pg-13 | drama | rupert sanders | 5.65 | 0.7592 |
| 4 | oceans twelve | 2004 | pg-13 | crime | steven soderbergh | 6.1 | 0.7576 |
| 5 | the day after tomorrow | 2004 | pg-13 | action | roland emmerich | 5.55 | 0.7555 |
| 6 | waterworld | 1995 | pg-13 | action | kevin reynolds | 5.675 | 0.7549 |
| 7 | fantastic beasts the secrets of dumbledore | 2022 | pg-13 | fantasy | david yates | 5.55 | 0.7548 |
| 8 | poseidon | 2006 | pg-13 | drama | wolfgang petersen | 4.95 | 0.7544 |
| 9 | power rangers | 2017 | pg-13 | science fiction | dean israelite | 5.425 | 0.7528 |
| 10 | robocop | 2014 | pg-13 | science fiction | josé padilha | 5.55 | 0.7528 |


## Example: Treasure Planet

In [60]:
selected_movie = "treasure planet"
recommendations = recommend_movies(selected_movie)

treasure_planet_df = movies_df[movies_df["Title"] == selected_movie].reset_index(drop = True)
treasure_planet_df = treasure_planet_df[["Title", "Year", "Age_Rating", "Genre", "Runtime", "Director", "Actors", "Average_Rating", "Oscar_Wins"]].reset_index(drop = True)

recommendations = recommendations[["Title", "Year", "Age_Rating", "Genre", "Director", "Average_Rating", "Similarity_Score"]].reset_index(drop = True)
recommendations.index = recommendations.index + 1

recommendations.to_csv("../website/recommendation_tables/treasure_planet_recommendations.csv", index = True)
treasure_planet_df.to_csv("../website/recommendation_tables/treasure_planet_metadata.csv", index = True)


recommendations.head(10)

|  | Title | Year | Age_Rating | Genre | Director | Average_Rating | Similarity_Score |
|---|---|---|---|---|---|---|---|
| 1 | the princess and the frog | 2009 | g | fantasy | ron clements, john musker | 7.55 | 0.786 |
| 2 | moana | 2016 | pg | comedy | ron clements, john musker, don hall | 8.2 | 0.7779 |
| 3 | lightyear | 2022 | pg | science fiction | angus maclane | 6.625 | 0.7764 |
| 4 | muppet treasure island | 1996 | g | music | brian henson | 6.75 | 0.7658 |
| 5 | home | 2015 | pg | fantasy | tim johnson | 6.025 | 0.7652 |
| 6 | atlantis the lost empire | 2001 | pg | science fiction | gary trousdale, kirk wise | 5.975 | 0.7614 |
| 7 | inception | 2010 | pg-13 | science fiction | christopher nolan | 8.325 | 0.7605 |
| 8 | spirit stallion of the cimarron | 2002 | g | romance | kelly asbury, lorna cook | 6.75 | 0.7603 |
| 9 | timecop | 1994 | r | crime | peter hyams | 5.225 | 0.7601 |
| 10 | wall·e | 2008 | g | science fiction | andrew stanton | 8.875 | 0.76 |
