---
title: "Movie-Recommendation Model"
format:
    html: 
        code-fold: true
---

# Code

Code for this webpage can be found [here](https://github.com/dcorc7/Movie-Recommendation-Model/recommender.ipynb).

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from tabulate import tabulate
from IPython.display import display, HTML
import ipywidgets as widgets

## Load the Dataset

In [2]:
# Load the cleaned movies dataframe
movies_df = pd.read_csv("./data/processed-data/movies_cleaned.csv")

pd.set_option("display.max_columns", None)
movies_df.head(5)


Unnamed: 0,IMDB_ID,Title,Year,Release_Date,Release_Month,Age_Rating,Overview,Keywords,Genre,Director,Actors,Runtime,Metascore_Rating,IMDB_Rating,Rotten_Tomatoes_Rating,TMDB_Rating,Average_Rating,Won_Award,Oscar_Wins,Oscar_Nominations,Budget,Budget_Normalized,Revenue,Revenue_Normalized,Return_On_Investment,Popularity
0,tt0097499,henry v,1989,1989-10-05,October,pg-13,gritty adaption william shakespeares play engl...,['france kingdom theater play based on true st...,war,kenneth branagh,['kenneth branagh derek jacobi simon shepherd'],137,8.3,7.5,9.8,7.2,8.2,True,1,0,9000000,-0.873465,10200000,-0.801446,1.133333,18.771
1,tt1320253,the expendables,2010,2010-08-03,August,r,barney ross leads band highly skilled mercenar...,['rescue sniper island martial arts tattoo esc...,thriller,sylvester stallone,['sylvester stallone jason statham jet li'],103,4.5,6.4,4.2,6.2,5.325,False,0,0,80000000,0.317499,274470394,0.18825,3.43088,74.573
2,tt1025100,gemini man,2019,2019-10-02,October,pg-13,henry brogan elite 51 year assassin whos ready...,['hitman clone'],thriller,ang lee,['will smith mary elizabeth winstead clive owen'],117,3.8,5.7,2.7,6.3,4.625,False,0,0,140000000,1.323948,173469516,-0.189999,1.239068,27.266
3,tt0473075,prince of persia the sands of time,2010,2010-05-19,May,pg-13,rogue prince reluctantly joins forces mysterio...,['persia sandstorm brother against brother arm...,action,mike newell,['jake gyllenhaal gemma arterton ben kingsley'],116,5.0,6.5,3.7,6.3,5.375,False,0,0,200000000,2.330396,336365676,0.420048,1.681828,33.199
4,tt1981115,thor the dark world,2013,2013-10-30,October,pg-13,thor fights restore order cosmos… ancient race...,['superhero based on comic hostile takeover no...,action,alan taylor,['chris hemsworth natalie portman tom hiddlest...,112,5.4,6.7,6.7,6.5,6.325,False,0,0,170000000,1.827172,644783140,1.575075,3.792842,50.246


## Clean the Dataset

In [3]:
# Remove brackets and apostrophes from the Actors and Keywords columns
movies_df["Actors"] = movies_df["Actors"].str.replace("[", "", regex = False).str.replace("]", "", regex = False).str.replace("'", "", regex = False)
movies_df["Keywords"] = movies_df["Keywords"].str.replace("[", "", regex = False).str.replace("]", "", regex = False).str.replace("'", "", regex = False)


# Drop columns that won't be included in the cosine similarity calculation
columns_to_drop = ["IMDB_ID", "Keywords", "Won_Award", "Release_Date", "Release_Month", "Age_Rating", "Budget", "Revenue"]
filtered_movies_df = movies_df.drop(columns = columns_to_drop)

# Preview the new dataframe
filtered_movies_df.head(5)

Unnamed: 0,Title,Year,Overview,Genre,Director,Actors,Runtime,Metascore_Rating,IMDB_Rating,Rotten_Tomatoes_Rating,TMDB_Rating,Average_Rating,Oscar_Wins,Oscar_Nominations,Budget_Normalized,Revenue_Normalized,Return_On_Investment,Popularity
0,henry v,1989,gritty adaption william shakespeares play engl...,war,kenneth branagh,kenneth branagh derek jacobi simon shepherd,137,8.3,7.5,9.8,7.2,8.2,1,0,-0.873465,-0.801446,1.133333,18.771
1,the expendables,2010,barney ross leads band highly skilled mercenar...,thriller,sylvester stallone,sylvester stallone jason statham jet li,103,4.5,6.4,4.2,6.2,5.325,0,0,0.317499,0.18825,3.43088,74.573
2,gemini man,2019,henry brogan elite 51 year assassin whos ready...,thriller,ang lee,will smith mary elizabeth winstead clive owen,117,3.8,5.7,2.7,6.3,4.625,0,0,1.323948,-0.189999,1.239068,27.266
3,prince of persia the sands of time,2010,rogue prince reluctantly joins forces mysterio...,action,mike newell,jake gyllenhaal gemma arterton ben kingsley,116,5.0,6.5,3.7,6.3,5.375,0,0,2.330396,0.420048,1.681828,33.199
4,thor the dark world,2013,thor fights restore order cosmos… ancient race...,action,alan taylor,chris hemsworth natalie portman tom hiddleston,112,5.4,6.7,6.7,6.5,6.325,0,0,1.827172,1.575075,3.792842,50.246


## Compute TF-IDF and Cosine Similarity Scores for Text Data

In [4]:
# Combine all text features of each movie into one value of a new column
filtered_movies_df["combined_text_features"] = filtered_movies_df["Overview"] + " " + filtered_movies_df["Genre"] + " " + filtered_movies_df["Director"] + " " + filtered_movies_df["Actors"]

# Create a TF-IDF matrix to vectorize words for each movie's text features
vectorizer = TfidfVectorizer(max_features = 5000)
tfidf_matrix = vectorizer.fit_transform(filtered_movies_df["combined_text_features"])

# Calculate textual cosine similarity scores for each movie
text_cos_similarity = cosine_similarity(tfidf_matrix)
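
To make the step above concrete, here is a minimal sketch on a made-up three-movie corpus (the strings are hypothetical, not rows from the dataset). It illustrates that `cosine_similarity` applied to a TF-IDF matrix returns a square, symmetric matrix with ones on the diagonal, and that movies sharing words score higher than movies sharing none.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for combined_text_features
toy_text = [
    "bounty hunter western quentin tarantino",
    "bounty hunter wyoming quentin tarantino",
    "dream heist christopher nolan",
]

# Vectorize the text, then compare every movie with every other movie
toy_tfidf = TfidfVectorizer().fit_transform(toy_text)
toy_similarity = cosine_similarity(toy_tfidf)

print(toy_similarity.shape)  # (3, 3)
print(np.round(toy_similarity, 2))
```

The first two rows share most of their words, so their score is well above zero, while the third row shares no words with the others and scores exactly zero against them.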

## Compute Cosine Similarity Scores for Numerical Data

In [5]:
# Filter the df to only include numerical columns
numerical_features = ["Runtime", "Metascore_Rating", "IMDB_Rating", "Rotten_Tomatoes_Rating", "TMDB_Rating", "Average_Rating", 
                      "Oscar_Wins", "Return_On_Investment", "Budget_Normalized", "Revenue_Normalized", "Popularity"]

# Scale the values to a common range so that no single column dominates the cosine similarity scores
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(filtered_movies_df[numerical_features])

# Calculate numerical cosine similarity scores for each movie
numerical_cos_similarity = cosine_similarity(scaled_features)
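
The effect of scaling can be seen in a small sketch. The three rows below are invented values for `Runtime`, `IMDB_Rating`, and `Popularity`, chosen only for illustration. Without scaling, the large-magnitude `Runtime` column dominates, so two movies with very different ratings still look almost identical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [Runtime, IMDB_Rating, Popularity]
toy_features = np.array([
    [120.0, 8.5, 20.0],
    [118.0, 3.0, 22.0],
    [95.0, 8.4, 75.0],
])

# Cosine similarity on the raw values versus on min-max scaled values
raw_similarity = cosine_similarity(toy_features)
scaled_similarity = cosine_similarity(MinMaxScaler().fit_transform(toy_features))

print(np.round(raw_similarity, 3))
print(np.round(scaled_similarity, 3))
```

In the raw matrix the first two movies score above 0.99 despite the large rating gap; after scaling, their score drops noticeably. Note that min-max scaled features are all non-negative, so the resulting cosine similarities can never fall below zero.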

## Determine Cosine Similarity Score Weights for Each Datatype

In [6]:
# Set a weight for each cosine similarity score to control whether text or numerical data has more influence on the recommendations
text_weight = 0.25
numerical_weight = 0.75

# Create a combined cosine similarity score that uses both text and numerical features
combined_similarity = text_weight * text_cos_similarity + numerical_weight * numerical_cos_similarity
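
The weighted blend can be sketched on two made-up 2x2 similarity matrices. The helper name `blend_similarities` is hypothetical and not part of the notebook; it derives the numerical weight from the text weight so that the two always sum to 1, which keeps the combined score in the same range as its inputs.

```python
import numpy as np

def blend_similarities(text_sim, numerical_sim, text_weight=0.25):
    """Blend two similarity matrices with weights that sum to 1 (hypothetical helper)."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    if text_sim.shape != numerical_sim.shape:
        raise ValueError("similarity matrices must have the same shape")
    return text_weight * text_sim + (1.0 - text_weight) * numerical_sim

# Made-up similarity matrices for two movies
toy_text_sim = np.array([[1.0, 0.2], [0.2, 1.0]])
toy_numerical_sim = np.array([[1.0, 0.8], [0.8, 1.0]])

toy_combined = blend_similarities(toy_text_sim, toy_numerical_sim, text_weight=0.25)
print(toy_combined)
```

With a text weight of 0.25, the off-diagonal entry is 0.25 * 0.2 + 0.75 * 0.8 = 0.65, so the numerical similarity contributes three times as much as the text similarity.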

## Recommend 10 Movies Based on a Selected Movie

In [7]:
# Function to take in a movie and generate the top_n movies that are most similar to it
def recommend_movies(movie_title, top_n = 10):    
    # Obtain the index of the given movie
    selected_movie_index = filtered_movies_df[filtered_movies_df["Title"] == movie_title].index[0]

    # Obtain the similarity scores for the selected movie and place them in a list, along with each movie's index
    sim_scores = list(enumerate(combined_similarity[selected_movie_index]))

    # Sort movies based on similarity scores
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)

    # Filter the list down to the top_n movies with the highest similarity scores (excluding the first entry, the selected movie itself)
    sim_scores = sim_scores[1:top_n + 1]

    # Get indices of the top-n similar movies
    movie_indices = [i[0] for i in sim_scores]
    movie_scores = [i[1].round(4) for i in sim_scores]
    
    # Create a new recommended movie df with selected features of the top movies by matching the indices of the recommended movies
    columns_to_keep = ["IMDB_ID", "Title", "Year", "Age_Rating", "Keywords", "Director", "Actors", "Average_Rating", "Revenue", "Budget", "Oscar_Wins"]

    # Copy the selected rows so that adding a column never touches movies_df and avoids a possible SettingWithCopyWarning
    recommendations_df = movies_df[columns_to_keep].iloc[movie_indices].copy()
    recommendations_df["Similarity_Score"] = movie_scores

    # Return the top-n similar movies
    return recommendations_df


#selected_movie = "django unchained"
#recommendations = recommend_movies(selected_movie)
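
As an alternative to sorting a Python list of `(index, score)` pairs, the top-n lookup can be sketched with `np.argsort`. This is an illustrative variant, not the notebook's implementation; the name `top_n_similar` and the toy matrix are hypothetical. It removes the selected movie by index rather than assuming it sorts first, which matters if another movie ties with a similarity of 1.

```python
import numpy as np

def top_n_similar(similarity_matrix, selected_index, top_n=10):
    """Return (indices, scores) of the top_n most similar rows, excluding the selected row."""
    scores = similarity_matrix[selected_index]
    # Sort indices by descending score; a stable sort keeps ties in index order
    order = np.argsort(-scores, kind="stable")
    # Drop the selected movie itself, then keep the first top_n entries
    order = order[order != selected_index][:top_n]
    return order, scores[order]

# Made-up similarity matrix for four movies
toy_similarity = np.array([
    [1.0, 0.3, 0.9, 0.5],
    [0.3, 1.0, 0.4, 0.2],
    [0.9, 0.4, 1.0, 0.7],
    [0.5, 0.2, 0.7, 1.0],
])

indices, scores = top_n_similar(toy_similarity, selected_index=0, top_n=2)
print(indices, scores)
```

For the first movie, the two closest matches are the movies at index 2 (score 0.9) and index 3 (score 0.5).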

In [8]:
#selected_movie = "django unchained"
#display(HTML(f"<h1 style='color: black;'>Movie Recommendations For: {selected_movie}</h1>"))
#display(recommendations)

Unnamed: 0,IMDB_ID,Title,Year,Age_Rating,Keywords,Director,Actors,Average_Rating,Revenue,Budget,Oscar_Wins,Similarity_Score
410,tt1663202,the revenant,2015,r,rape based on novel or book parent child relat...,alejandro g. iñárritu,leonardo dicaprio tom hardy will poulter,7.725,532950503,135000000,3,0.793
953,tt3460252,the hateful eight,2015,r,bounty hunter wyoming usa narration mountain h...,quentin tarantino,samuel l jackson kurt russell jennifer jason l...,7.45,155760117,44000000,1,0.7919
134,tt7131622,once upon a time in hollywood,2019,r,movie business male friendship cult based on t...,quentin tarantino,leonardo dicaprio brad pitt margot robbie,7.975,374300000,95000000,2,0.7844
335,tt0993846,the wolf of wall street,2013,r,corruption based on novel or book drug addicti...,martin scorsese,leonardo dicaprio jonah hill margot robbie,7.9,392000000,100000000,0,0.7744
696,tt3622592,paper towns,2015,pg-13,high school friendship based on novel or book ...,jake schreier,nat wolff cara delevingne austin abrams,5.95,85500000,12000000,0,0.7736
585,tt0110912,pulp fiction,1994,r,drug dealer boxer massage stolen money briefca...,quentin tarantino,john travolta uma thurman samuel l jackson,9.025,213928762,8500000,1,0.771
767,tt1067583,water for elephants,2011,pg-13,based on novel or book elephant clown great de...,francis lawrence,robert pattinson reese witherspoon christoph w...,6.25,117094902,38000000,0,0.7693
167,tt0407887,the departed,2006,r,police undercover boston massachusetts gangste...,martin scorsese,leonardo dicaprio matt damon jack nicholson,8.575,291465000,90000000,4,0.768
581,tt0378194,kill bill vol 2,2004,r,daughter martial arts kung fu showdown right a...,quentin tarantino,uma thurman david carradine michael madsen,8.15,152159461,30000000,0,0.7662
1541,tt1375666,inception,2010,pg-13,rescue mission dream airplane paris france vir...,christopher nolan,leonardo dicaprio joseph gordon levitt elliot ...,8.325,825532764,160000000,4,0.7652


## Find Recommended Movies

In [None]:
# Define the interactive elements
movie_input = widgets.Text(
    value = "django unchained",
    placeholder = "Enter movie title",
    description = "Movie:",
    disabled = False
)
output = widgets.Output()

def on_button_click(b):
    with output:
        output.clear_output()  # Clear previous results
        movie_title = movie_input.value
        try:
            recommendations = recommend_movies(movie_title)
            display(recommendations)
        except Exception as e:
            print(f"Error: {e}")

button = widgets.Button(description = "Recommend Movies")
button.on_click(on_button_click)

# Display the widgets
display(movie_input, button, output)
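
The lookup in `recommend_movies` requires an exact match, and the titles in the dataset appear to be lowercased with punctuation removed, so a typed title such as `Django Unchained` would raise an `IndexError` that the widget prints as an error. One possible way to soften this, sketched below with the standard-library `difflib` module, is to normalize the input and fall back to the closest known title. The helper name `resolve_title`, the cutoff value, and the short title list are assumptions made for illustration.

```python
import difflib

def resolve_title(user_input, known_titles, cutoff=0.6):
    """Map free-form input to the closest known title, or None if nothing is close."""
    normalized = user_input.strip().lower()
    if normalized in known_titles:
        return normalized
    # Fall back to the single best fuzzy match above the cutoff
    matches = difflib.get_close_matches(normalized, known_titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical subset of the Title column
toy_titles = ["django unchained", "the hateful eight", "pulp fiction"]

print(resolve_title("  Django Unchained ", toy_titles))
print(resolve_title("pulp ficton", toy_titles))
print(resolve_title("zzzz", toy_titles))
```

In the notebook, the resolved title could be passed to `recommend_movies`, with a `None` result handled by showing a "title not found" message instead of an exception.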