# Making a content-based recommendation system

This notebook builds content-based recommendations from the Netflix Prize data.

Because the original data contains only movie titles, we add an external dataset with more detailed movie information, taken from [a Kaggle discussion](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data/discussion/36670#527063). Many thanks to the author of that post for sharing it.

This is the second part of a series on building recommendation systems. The full index of the series:
1. [Non-personalized recommendations](https://www.kaggle.com/code/dungdore1312/non-personalized-recommendations)
2. Content-based recommendations

# Reading DataFrames

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_rows', 100)



In [2]:
import glob

rating_files = glob.glob('/kaggle/input/netflix-prize-data/combined_data_*.txt')
df_ratings = pd.concat([pd.read_csv(filename,
                                    header=None,
                                    names=['customer_id', 'rating', 'date'],
                                    parse_dates=['date']) for filename in rating_files])
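# Lines such as "1:" open a movie's block: they have no rating, and the id lands in customer_id
df_ratings['movie_id'] = np.where(df_ratings['rating'].isna(), df_ratings['customer_id'], np.nan)
df_ratings['movie_id'] = df_ratings['movie_id'].str.split(':').str[0]
# Propagate each movie id down to the rating rows that follow it
df_ratings['movie_id'] = df_ratings['movie_id'].ffill()
# Remove the marker lines themselves
df_ratings.dropna(subset=['rating', 'date'], inplace=True)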
df_ratings = df_ratings.astype({
    'customer_id': 'int',
    'movie_id': 'int'
})
df_ratings.drop_duplicates(inplace=True)
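
The parsing steps in this cell can be tried on a small in-memory sample. The snippet below is a minimal sketch, assuming the `combined_data_*.txt` layout of a `movie_id:` line followed by `customer_id,rating,date` lines; the names `sample`, `toy`, and `is_header` are introduced here for illustration only.

```python
import io

import pandas as pd

# Toy sample in the assumed layout: a "movie_id:" line,
# then one "customer_id,rating,date" line per rating of that movie.
sample = io.StringIO(
    "1:\n"
    "1488844,3,2005-09-06\n"
    "822109,5,2005-05-13\n"
    "2:\n"
    "2059652,4,2005-09-05\n"
)

toy = pd.read_csv(sample, header=None, names=["customer_id", "rating", "date"])

# Marker lines have no rating; their first field holds "movie_id:"
is_header = toy["rating"].isna()

# Keep the id only on marker lines, strip the colon, then fill downwards
toy["movie_id"] = toy["customer_id"].where(is_header).str.rstrip(":").ffill()

# Drop the marker lines and restore integer dtypes
toy = toy[~is_header].astype({"customer_id": int, "movie_id": int})
print(toy)
```

Each rating row ends up carrying the id of the marker line above it, which is what the forward fill in the cell above relies on.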

In [3]:
df_movies = pd.read_csv('/kaggle/input/modified-movie-titles/movies.csv',
                        parse_dates=['year_of_release'],
                        encoding='latin-1')

In [4]:
df_ratings.head()

,customer_id,rating,date,movie_id
1,1277134,1.0,2003-12-02,9211
2,2435457,2.0,2005-06-01,9211
3,2338545,3.0,2001-02-17,9211
4,2218269,1.0,2002-12-27,9211
5,441153,4.0,2002-10-11,9211


In [5]:
df_movies.head()

,movie_id,year_of_release,title
0,1,2003-01-01,Dinosaur Planet
1,2,2004-01-01,Isle of Man TT 2004 Review
2,3,1997-01-01,Character
3,4,1994-01-01,Paula Abdul's Get Up & Dance
4,5,2004-01-01,The Rise and Fall of ECW


In [6]:
df_movies_info = pd.read_csv('/kaggle/input/movies-information/cooked_movies.csv')
df_movies_info.head()

,id,year,title,Runtime,Rating,Directors,Writers,Production companies,Genres
0,1,2003.0,Dinosaur Planet,50.0,7.7,Pierre de Lespinois,Mike Carrol-Mike Carroll-Georgann Kane,,Documentary-Animation-Family
1,2,2004.0,Isle of Man TT 2004 Review,,,,,,
2,3,1997.0,Character,122.0,7.8,Mike van Diem,Ferdinand Bordewijk-Laurens Geels-Mike van Diem,First Floor Features-Almerica Film,Crime-Drama-Mystery
3,4,1994.0,Paula Abdul's Get Up & Dance,54.0,8.8,Steve Purcell,,,Family
4,5,2004.0,The Rise and Fall of ECW,360.0,8.6,Kevin Dunn,Paul Heyman,WWE Home Video,Documentary-Sport


In [7]:
df_movies_info.shape

(17769, 9)

In [8]:
# Drop movies that are missing any of the feature columns (dropna defaults to how="any")
df_movies_info = df_movies_info.dropna(subset=["Runtime", "Rating", "Directors", "Writers", "Production companies", "Genres"])
df_movies_info.shape

(10248, 9)
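
The `dropna(subset=...)` call above uses the pandas default `how="any"`, so a movie is removed as soon as one of the listed columns is missing. The toy frame below (hypothetical values, not taken from `cooked_movies.csv`) sketches the difference from `how="all"`, which would remove only rows where every listed column is empty.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: A is complete, B is empty, C is missing only Genres
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "Runtime": [50.0, np.nan, 54.0],
    "Genres": ["Documentary", np.nan, np.nan],
})
cols = ["Runtime", "Genres"]

any_missing = toy.dropna(subset=cols)             # default how="any": keeps only A
all_missing = toy.dropna(subset=cols, how="all")  # keeps A and C, drops only B

print(any_missing["title"].tolist())
print(all_missing["title"].tolist())
```

Which variant is preferable depends on whether partially described movies should still take part in the similarity computation.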

In [9]:
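# Note: joining on title pairs up every two rows that share a title, so same-named
# movies are duplicated (the result below has 10,884 rows, more than the 10,248 in df_movies_info)
df_movies = df_movies_info[["title", "Runtime", "Directors", "Writers", "Production companies", "Genres"]].merge(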
    df_movies,
    on="title",
)
df_movies.head()

,title,Runtime,Directors,Writers,Production companies,Genres,movie_id,year_of_release
0,Character,122.0,Mike van Diem,Ferdinand Bordewijk-Laurens Geels-Mike van Diem,First Floor Features-Almerica Film,Crime-Drama-Mystery,3,1997-01-01
1,The Rise and Fall of ECW,360.0,Kevin Dunn,Paul Heyman,WWE Home Video,Documentary-Sport,5,2004-01-01
2,8 Man,83.0,Yasuhiro Horiuchi,Kazumasa Hirai-Jirô Kuwata,Rim Publishing,Action-Sci-Fi,7,1992-01-01
3,What the #$*! Do We Know!?,109.0,William Arntz-Betsy Chasse-Mark Vicente,William Arntz-Betsy Chasse-Matthew Hoffman,Captured Light-Lord of the Wind,Documentary-Comedy-Drama-Fantasy-Mystery-Sci-Fi,8,2004-01-01
4,Class of Nuke 'Em High 2,90.0,Eric Louzil,Lloyd Kaufman-Carl Morano-Matt Unger,Troma Entertainment,Comedy-Horror-Sci-Fi,9,1991-01-01


In [10]:
df_movies['movie_id'] = df_movies['movie_id'].astype(int)
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10884 entries, 0 to 10883
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   title                 10884 non-null  object        
 1   Runtime               10884 non-null  float64       
 2   Directors             10884 non-null  object        
 3   Writers               10884 non-null  object        
 4   Production companies  10884 non-null  object        
 5   Genres                10884 non-null  object        
 6   movie_id              10884 non-null  int64         
 7   year_of_release       10884 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 765.3+ KB
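
The merged frame has 10,884 rows although `df_movies_info` had 10,248 after filtering, which is what a join on a non-unique key such as `title` produces. The sketch below illustrates the effect on toy data and shows one possible alternative, joining on the numeric id. It assumes that `id` in `cooked_movies.csv` corresponds to `movie_id` in `movies.csv` (the first rows shown above agree, but this is not verified here); the runtimes are made up.

```python
import pandas as pd

# Two different movies that share a title (hypothetical runtimes)
info = pd.DataFrame({"id": [1, 2],
                     "title": ["Hamlet", "Hamlet"],
                     "Runtime": [242.0, 112.0]})
movies = pd.DataFrame({"movie_id": [1, 2],
                       "title": ["Hamlet", "Hamlet"]})

# Joining on title matches every "Hamlet" with every "Hamlet": 2 x 2 = 4 rows
by_title = info.merge(movies, on="title")

# Joining on the id keeps one row per movie; validate raises if keys repeat
by_id = info.merge(movies.drop(columns="title"),
                   left_on="id", right_on="movie_id",
                   validate="one_to_one")

print(len(by_title), len(by_id))
```

The `validate="one_to_one"` argument is a cheap guard: pandas raises a `MergeError` instead of silently multiplying rows.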


# Feature Extraction

## Naive approach
We start with one-hot encoding: every director, writer, production company, and genre becomes a binary feature of a movie.

In [11]:
# Rewrite keywords that contain '-' (e.g., Sci-Fi) so they are not split later
df_movies["Genres"] = df_movies["Genres"].str.replace("Sci-Fi", "SciFi")
df_movies["Genres"].head()

0                               Crime-Drama-Mystery
1                                 Documentary-Sport
2                                      Action-SciFi
3    Documentary-Comedy-Drama-Fantasy-Mystery-SciFi
4                               Comedy-Horror-SciFi
Name: Genres, dtype: object
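
To see why the hyphen in `Sci-Fi` matters, the sketch below encodes two toy genre strings with `Series.str.get_dummies`, once as-is and once after the replacement. This is only an illustration of the encoding idea on made-up input, not the pipeline used in the rest of the notebook.

```python
import pandas as pd

raw = pd.Series(["Action-Sci-Fi", "Crime-Drama"], index=["8 Man", "Character"])

# Splitting on '-' without protection breaks "Sci-Fi" into "Sci" and "Fi"
naive = raw.str.get_dummies(sep="-")

# Protect the keyword first, then split
fixed = raw.str.replace("Sci-Fi", "SciFi", regex=False).str.get_dummies(sep="-")

print(list(naive.columns))
print(list(fixed.columns))
```

The same caveat applies to any other value that contains a hyphen, such as hyphenated personal names in the director and writer columns.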

In [12]:
cat_cols = ["Directors", "Writers", "Production companies", "Genres"]
df_movies_encoded = df_movies.copy(deep=True)

import itertools

for col in cat_cols:
    # Each column stores its entries as a single '-'-separated string
    df_movies_encoded[f"{col}_list"] = df_movies_encoded[col].str.split('-')

# Concatenate the four per-column lists into one list of tags per movie
list_cols = [f"{col}_list" for col in cat_cols]
df_movies_encoded["tags"] = df_movies_encoded[list_cols].apply(
    lambda row: list(itertools.chain.from_iterable(row)), axis=1)
df_movies_encoded = df_movies_encoded[["title", "Runtime", "tags"]]
df_movies_encoded.head()

,title,Runtime,tags
0,Character,122.0,"[Mike van Diem, Ferdinand Bordewijk, Laurens G..."
1,The Rise and Fall of ECW,360.0,"[Kevin Dunn, Paul Heyman, WWE Home Video, Docu..."
2,8 Man,83.0,"[Yasuhiro Horiuchi, Kazumasa Hirai, Jirô Kuwat..."
3,What the #$*! Do We Know!?,109.0,"[William Arntz, Betsy Chasse, Mark Vicente, Wi..."
4,Class of Nuke 'Em High 2,90.0,"[Eric Louzil, Lloyd Kaufman, Carl Morano, Matt..."


In [13]:
df_movies_encoded = df_movies_encoded.explode("tags")
df_movies_encoded

,title,Runtime,tags
0,Character,122.0,Mike van Diem
0,Character,122.0,Ferdinand Bordewijk
0,Character,122.0,Laurens Geels
0,Character,122.0,Mike van Diem
0,Character,122.0,First Floor Features
...,...,...,...
10883,Alien Hunter,92.0,Sandstorm Films
10883,Alien Hunter,92.0,Action
10883,Alien Hunter,92.0,Adventure
10883,Alien Hunter,92.0,SciFi


In [14]:
# Unique tags
df_movies_encoded["tags"].nunique()

24010

With 24,010 distinct tags, a full one-hot matrix would be very wide and sparse, so we keep only the 1,000 most common tags.

In [15]:
kept_tags = df_movies_encoded["tags"].value_counts().iloc[:1000].index
df_movies_encoded_filtered = df_movies_encoded[df_movies_encoded["tags"].isin(kept_tags)]
df_movies_encoded_filtered

,title,Runtime,tags
0,Character,122.0,Crime
0,Character,122.0,Drama
0,Character,122.0,Mystery
1,The Rise and Fall of ECW,360.0,Documentary
1,The Rise and Fall of ECW,360.0,Sport
...,...,...,...
10883,Alien Hunter,92.0,Millennium Films
10883,Alien Hunter,92.0,Action
10883,Alien Hunter,92.0,Adventure
10883,Alien Hunter,92.0,SciFi
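
When choosing the cutoff, it can help to check how many tag occurrences the kept tags account for. The sketch below does this on a toy tag column; `k`, `kept`, and `coverage` are names introduced here, and the numbers say nothing about the real data.

```python
import pandas as pd

# Toy exploded tag column (hypothetical)
tags = pd.Series(["Drama", "Drama", "Comedy", "Drama", "Comedy",
                  "Troma Entertainment"])

k = 2
counts = tags.value_counts()        # sorted from most to least frequent
kept = counts.head(k).index         # the k most common tags
coverage = counts.head(k).sum() / counts.sum()

filtered = tags[tags.isin(kept)]
print(list(kept), round(coverage, 3), len(filtered))
```

A low coverage would suggest that many movies lose most of their tags at that cutoff.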


In [16]:
df_movies_crosstab = pd.crosstab(df_movies_encoded_filtered['title'], df_movies_encoded_filtered['tags'])
df_movies_crosstab.head()

title,20th Century Fox Television,3 Arts Entertainment,40 Acres & A Mule Filmworks,A Band Apart,A&E Television Networks,A&M Films,Abel Ferrara,Action,Aditya Chopra,Adventure,...,Working Title Films,World International Network,Yash Chopra,Yash Raj Films,Yasujirô Ozu,Zentropa Entertainments,Zoetrope Studios,Zweites Deutsches Fernsehen,liang Tsai,Éric Rohmer
'N Sync: Unauthorized Biography,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
'Round Midnight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
... And God Spoke,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...And Justice for All,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...And Then Came Summer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
df_movies_crosstab.sum(axis=1).sort_values()

title
The Mouse That Roared            1
The Best of Benny Hill           1
The Best of Riverdance           1
Diary of Jack the Ripper         1
The Sex Monster                  1
                              ... 
Hamlet                         104
Peter Pan                      108
Pinocchio                      108
Anna Karenina                  125
The Hunchback of Notre Dame    138
Length: 9819, dtype: int64

No movie in the crosstab has zero tags (titles left without any kept tag were already removed by the filter above), so we can proceed.
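
Row sums as large as 138 suggest that the crosstab holds counts rather than 0/1 flags: a tag can occur more than once for the same title, as with Mike van Diem, who is listed as both director and writer of Character. Jaccard similarity is defined on sets, so one option is to binarize the table before computing distances. The snippet below is a minimal sketch on toy data, not a change to the pipeline above.

```python
import pandas as pd

# Toy exploded frame: one tag is repeated for the same title
long = pd.DataFrame({
    "title": ["Character", "Character", "Character", "8 Man"],
    "tags": ["Mike van Diem", "Mike van Diem", "Drama", "Action"],
})

counts = pd.crosstab(long["title"], long["tags"])  # cell values are counts
flags = counts.gt(0)                               # True where the tag is present

print(counts.loc["Character", "Mike van Diem"])    # repeated tag is counted twice
print(flags.sum(axis=1))                           # number of distinct tags per title
```

Passing boolean flags to a distance function avoids any dependence on how a particular implementation treats values other than 0 and 1.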

In [18]:
# CALCULATE THE JACCARD SIMILARITY
from scipy.spatial.distance import pdist, squareform

# Calculate all pairwise distances
jaccard_distances = pdist(df_movies_crosstab.values, metric='jaccard')

# Convert the distances to a square matrix
jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Wrap the array in a pandas DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, columns=df_movies_crosstab.index, index=df_movies_crosstab.index)

# Display the first 5 rows of the DataFrame
display(jaccard_similarity_df.head())

title,'N Sync: Unauthorized Biography,'Round Midnight,... And God Spoke,...And Justice for All,...And Then Came Summer,.Com for Murder,10,10 Attitudes,10 Things I Hate About You,10 to Midnight,...,Zigzag,Zion Canyon: Treasure of the Gods: IMAX,Zombie,Zombie 3,Zombie Holocaust,Zoolander,Zoot Suit,Zorba the Greek,Zubeidaa,Zulu
'N Sync: Unauthorized Biography,1.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight,0.25,1.0,0.0,0.125,0.333333,0.0,0.125,0.25,0.142857,0.142857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.142857,0.2
... And God Spoke,0.0,0.0,1.0,0.0,0.0,0.0,0.166667,0.5,0.2,0.0,...,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.333333,0.0,0.0
...And Justice for All,0.0,0.125,0.0,1.0,0.166667,0.125,0.090909,0.142857,0.1,0.375,...,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.125,0.1,0.125
...And Then Came Summer,0.0,0.333333,0.0,0.166667,1.0,0.0,0.166667,0.5,0.2,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.2,0.333333
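
As a sanity check on the numbers in this table, the Jaccard similarity of two tag sets is the size of their intersection divided by the size of their union. The sketch below computes it by hand for two made-up tag sets and compares the result with `pdist` on the equivalent boolean vectors.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Two hypothetical tag sets
a = {"Horror", "SciFi", "Action"}
b = {"Horror", "SciFi", "Action", "Adventure"}

# By hand: |A intersect B| / |A union B| = 3 / 4
by_hand = len(a & b) / len(a | b)

# Same computation via boolean indicator vectors over the shared vocabulary
vocab = sorted(a | b)
X = np.array([[tag in s for tag in vocab] for s in (a, b)], dtype=bool)
sim = 1 - squareform(pdist(X, metric="jaccard"))

print(by_hand, sim[0, 1])
```

A value such as 0.75 therefore means that three quarters of the tags carried by either movie are shared by both.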


In [19]:
movie = "Zombie 3"

# Similarities to the chosen movie, excluding the movie itself
# (other titles can also score 1.0, so it is not guaranteed to sort first)
jaccard_similarity_series = jaccard_similarity_df.loc[movie].drop(movie)

# Sort these values from highest to lowest
ordered_similarities = jaccard_similarity_series.sort_values(ascending=False)

# Display the results
print("Top 20 movies similar to {}:".format(movie))
display(ordered_similarities.head(20).to_frame())

Top 20 movies similar to Zombie 3:


title,Zombie 3
Reign in Darkness,1.0
Dead Meat,0.75
Godzilla vs. Megaguirus,0.75
Death Machine,0.75
Raptor,0.75
Starship Troopers 2: Hero of the Federation,0.75
Predator Island,0.75
The Crazies,0.75
Reptilian,0.75
Avia Vampire Hunter,0.666667
