# Netflix Recommendation System

### Loading Dataset

In [1]:
import pandas as pd
import altair as alt
import numpy as np
from fuzzywuzzy import process



The data used for this project is taken from the Netflix catalog as of May 2022 (source: https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies). The goal of this project is to use data preprocessing and machine learning methods to build a model that recommends Netflix titles based on similarities in a title's attributes, such as runtime, genre, age certification, IMDb rating, and release year.

In [2]:
titles = pd.read_csv('titles.csv')

In the output below we list the data type of each column.

In [3]:
# Checking data type of each column
titles.dtypes

id                       object
title                    object
type                     object
description              object
release_year              int64
age_certification        object
runtime                   int64
genres                   object
production_countries     object
seasons                 float64
imdb_id                  object
imdb_score              float64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score              float64
dtype: object

In [4]:
# Preview of titles data
titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,tt0075314,8.3,795222.0,27.612,8.2
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,tt0071853,8.2,530877.0,18.216,7.8
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,tt0079470,8.0,392419.0,17.505,7.8
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,tt0070047,8.1,391942.0,95.337,7.7


The table below shows the percentage of missing values in each column. The seasons column has the most missing values, which is expected because movies have no value for this column. The age certification column also has many missing values; later we will encode these as not rated for the model. For the remaining columns, the missing values are likely due to the title not being listed on IMDb or TMDB (The Movie Database), and they will be filled with zeros for the model.

In [5]:
# Checking which columns contain a lot of null values
Nullpercent = titles.isnull().sum().sort_values(ascending = False) / len(titles) * 100
null_df = pd.DataFrame({'Features': Nullpercent.index, 'Null_Value_Percentage': Nullpercent.values}).style.background_gradient(cmap='Blues')
null_df

Unnamed: 0,Features,Null_Value_Percentage
0,seasons,64.743369
1,age_certification,44.953496
2,imdb_votes,9.2835
3,imdb_score,9.007923
4,imdb_id,7.647261
5,tmdb_score,5.477093
6,tmdb_popularity,1.619015
7,description,0.310024
8,title,0.017224
9,id,0.0


### Data Cleaning For Model Making

In this section we use feature engineering and cleaning methods to turn the raw data into a dataset that can be used with K Nearest Neighbors to make recommendations. K Nearest Neighbors is a good fit for this project because its main assumption is that data points that are close to each other are alike. This should hold here: attributes such as genre, release year, IMDb score, and whether the title is a movie or a TV show should place similar titles close to one another, so recommendations can be made by selecting the titles closest to a given title.


In [6]:
# Finding all the unique genres
array = titles.genres.to_list()

array1 = [s.strip("[]") for s in array]

lst = []
for i in range(len(array1)):
    lst.append(array1[i].split(','))

result = list({x for l in lst for x in l})

result1 = [s.strip("' ") for s in result]

# function to print the unique values of a list
def unique(list1):
    # convert the list to a set to drop duplicates, then back to a list
    unique_list = list(set(list1))
    for x in unique_list:
        print(x)

unique(result1)


action
thriller
crime
animation
reality
music
comedy
drama
sport
horror
fantasy
romance
western
scifi
documentation
european
family
history
war


In the block above we clean the genres column and print every unique genre that appears in it. From this list we engineer one new column per genre, containing 1 (True) if that genre is listed for the title and 0 (False) otherwise.

In [7]:
# Creating indicator columns for each genre
titles['horror'] = titles['genres'].str.contains('horror').astype('int')
titles['history'] = titles['genres'].str.contains('history').astype('int')
titles['european'] = titles['genres'].str.contains('european').astype('int')
titles['sport'] = titles['genres'].str.contains('sport').astype('int')
titles['family'] = titles['genres'].str.contains('family').astype('int')
titles['reality'] = titles['genres'].str.contains('reality').astype('int')
titles['drama'] = titles['genres'].str.contains('drama').astype('int')
titles['animation'] = titles['genres'].str.contains('animation').astype('int')
titles['comedy'] = titles['genres'].str.contains('comedy').astype('int')
titles['scifi'] = titles['genres'].str.contains('scifi').astype('int')
titles['music'] = titles['genres'].str.contains('music').astype('int')
titles['fantasy'] = titles['genres'].str.contains('fantasy').astype('int')
titles['western'] = titles['genres'].str.contains('western').astype('int')
titles['thriller'] = titles['genres'].str.contains('thriller').astype('int')
titles['documentation'] = titles['genres'].str.contains('documentation').astype('int')
titles['crime'] = titles['genres'].str.contains('crime').astype('int')
titles['action'] = titles['genres'].str.contains('action').astype('int')
titles['war'] = titles['genres'].str.contains('war').astype('int')
titles['romance'] = titles['genres'].str.contains('romance').astype('int')
titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,...,scifi,music,fantasy,western,thriller,documentation,crime,action,war,romance
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,...,0,0,0,0,0,1,0,0,0,0
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,...,0,0,0,0,0,0,1,0,0,0
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,...,0,0,1,0,0,0,0,0,0,0
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,...,0,0,0,0,0,0,0,0,0,0
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,...,0,0,0,0,0,0,0,0,0,0
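
The same indicator columns can also be derived without listing each genre by hand. The following is only a sketch on a small made-up frame (the `toy` data and the `genre_flags` name are assumptions for illustration, not part of the notebook): it parses each genres string into a real list with `ast.literal_eval` and lets pandas build one 0/1 column per genre.

```python
import ast
import pandas as pd

# Toy stand-in for the `genres` column: each cell is a string that looks like a Python list.
toy = pd.DataFrame({
    "title": ["Taxi Driver", "Life of Brian", "Untitled"],
    "genres": ["['crime', 'drama']", "['comedy']", "[]"],
})

# Parse each string into a list, then join with "|" so str.get_dummies can split on it.
parsed = toy["genres"].apply(ast.literal_eval)
genre_flags = parsed.str.join("|").str.get_dummies(sep="|")

print(genre_flags)
```

A title with an empty genre list simply gets 0 in every column, and matching is done on whole genre names rather than substrings.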


In the block below, unnecessary columns are dropped. The type column is also replaced by movie and show indicator columns, with 1 and 0 denoting True and False.

In [8]:
# dropping unneeded columns
titles_mod1 = titles.drop(columns = ['title', 'description', 'genres', 'imdb_id', 'production_countries'])

# creating indicator columns for show or movie
titles_mod1['movie'] = titles_mod1['type'].str.contains('MOVIE').astype('int')
titles_mod1['show'] = titles_mod1['type'].str.contains('SHOW').astype('int')
titles_mod1 = titles_mod1.drop(columns = 'type')

In the block below, more indicator columns are created, one for each unique age certification level. Because the age certification column has many missing values, a missing value is marked with 1 in the not rated indicator column.

In [9]:
# creating indicator columns for age certification
unique(titles_mod1['age_certification'])

titles_mod1['not_rated'] = titles_mod1['age_certification'].isna().astype('int')
titles_mod1['TV-Y'] = titles_mod1['age_certification'].str.contains('TV-Y', na = False).astype('int')
titles_mod1['TV-PG'] = titles_mod1['age_certification'].str.contains('TV-PG', na = False).astype('int')
titles_mod1['PG'] = titles_mod1['age_certification'].str.contains('PG', na = False).astype('int')
titles_mod1['R'] = titles_mod1['age_certification'].str.contains('R', na = False).astype('int')
titles_mod1['TV-14'] = titles_mod1['age_certification'].str.contains('TV-14', na = False).astype('int')
titles_mod1['PG-13'] = titles_mod1['age_certification'].str.contains('PG-13', na = False).astype('int')
titles_mod1['TV-Y7'] = titles_mod1['age_certification'].str.contains('TV-Y7', na = False).astype('int')
titles_mod1['NC-17'] = titles_mod1['age_certification'].str.contains('NC-17', na = False).astype('int')
titles_mod1['G'] = titles_mod1['age_certification'].str.contains('G', na = False).astype('int')
titles_mod1['TV-G'] = titles_mod1['age_certification'].str.contains('TV-G', na = False).astype('int')
titles_mod1['TV-MA'] = titles_mod1['age_certification'].str.contains('TV-MA', na = False).astype('int')
titles_mod1 = titles_mod1.drop(columns = ['age_certification'])
titles_mod1.head()

nan
TV-PG
R
TV-14
TV-MA
TV-G
PG-13
NC-17
PG
G
TV-Y7
TV-Y


Unnamed: 0,id,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,horror,history,...,TV-PG,PG,R,TV-14,PG-13,TV-Y7,NC-17,G,TV-G,TV-MA
0,ts300399,1945,48,1.0,,,0.6,,0,0,...,0,0,0,0,0,0,0,0,0,1
1,tm84618,1976,113,,8.3,795222.0,27.612,8.2,0,0,...,0,0,1,0,0,0,0,0,0,0
2,tm127384,1975,91,,8.2,530877.0,18.216,7.8,0,0,...,0,1,0,0,0,0,0,1,0,0
3,tm70993,1979,94,,8.0,392419.0,17.505,7.8,0,0,...,0,0,1,0,0,0,0,0,0,0
4,tm190788,1973,133,,8.1,391942.0,95.337,7.7,1,0,...,0,0,1,0,0,0,0,0,0,0
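
One thing to keep in mind is that `str.contains` matches substrings, so a rating such as PG is also flagged in the G column; row 2 of the preview above (Monty Python and the Holy Grail, rated PG) has both PG and G set to 1. If exact matching is preferred, a possible alternative is sketched below on made-up data (the `toy` frame and variable names are assumptions, and switching to it would change the features and therefore the recommendations shown later).

```python
import pandas as pd

# Toy stand-in for the age_certification column, including one missing value.
toy = pd.DataFrame({"age_certification": ["PG", "TV-PG", "G", None, "PG-13"]})

# Substring matching: 'G' is found inside 'PG', 'TV-PG' and 'PG-13' as well.
substring_g = toy["age_certification"].str.contains("G", na=False).astype(int)

# Exact matching: one column per rating, plus an explicit flag for missing ratings.
exact = pd.get_dummies(toy["age_certification"], dtype=int)
exact["not_rated"] = toy["age_certification"].isna().astype(int)

print(substring_g.tolist())  # [1, 1, 1, 0, 1]
print(exact["G"].tolist())   # [0, 0, 1, 0, 0]
```

With exact matching every row has exactly one indicator set, which keeps the ratings from overlapping in the distance calculation.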


In the following block, missing values are filled with 0 and every column is min-max normalized to the range 0 to 1. This normalization keeps columns with large raw values (such as imdb_votes) from dominating the distance calculation in the K Nearest Neighbors model that will be used for making recommendations.

In [10]:
# Changing NA's to zeros
titles_mod1 = titles_mod1.fillna(0).drop(columns='id')
titles_mod1.head()

# Normalizing data
normalized_data = (titles_mod1 - titles_mod1.min())/(titles_mod1.max() - titles_mod1.min())
normalized_data.head()

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,horror,history,european,...,TV-PG,PG,R,TV-14,PG-13,TV-Y7,NC-17,G,TV-G,TV-MA
0,0.0,0.191235,0.02381,0.0,0.0,0.000329,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.402597,0.450199,0.0,0.864583,0.350582,0.015143,0.82,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.38961,0.36255,0.0,0.854167,0.234043,0.00999,0.78,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.441558,0.374502,0.0,0.833333,0.173002,0.0096,0.78,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.363636,0.52988,0.0,0.84375,0.172792,0.052286,0.77,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
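
To make the normalization concrete, here is a small worked sketch of the same min-max formula, (x - min) / (max - min), on made-up values (the `toy` frame is an assumption for illustration). With a year range of 1945 to 2022, the year 1976 maps to 31/77, which is about 0.4026 and is consistent with the release_year value shown for Taxi Driver above.

```python
import pandas as pd

# Toy columns on very different raw scales.
toy = pd.DataFrame({
    "release_year": [1945, 1976, 2022],
    "imdb_votes": [0.0, 795222.0, 1200.0],
    "horror": [0, 0, 1],
})

# Column-wise min-max scaling: every column ends up between 0 and 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled.round(4))
```

Note that a column whose minimum equals its maximum would produce a division by zero under this formula, so such columns would need to be dropped or handled separately.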


### Making Model

We fit the K Nearest Neighbors model below and use cosine distance (one minus cosine similarity) as the metric to determine how close titles are to each other.

In [11]:
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
knn.fit(normalized_data)
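
As a rough illustration of what the cosine metric measures, the sketch below computes cosine distance by hand with NumPy on three made-up feature rows (the `cosine_distance` helper and the vectors are assumptions for illustration, not part of the notebook). Rows that share the same indicator columns end up much closer than rows that differ in genre.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical rows with the layout [horror, comedy, scaled imdb_score].
query = [1, 0, 0.8]
same_genre = [1, 0, 0.6]
other_genre = [0, 1, 0.8]

d_same = cosine_distance(query, same_genre)
d_other = cosine_distance(query, other_genre)

print(d_same < d_other)  # True
```

Because cosine distance depends on the angle between rows rather than their length, shared 0/1 indicators tend to weigh heavily in which titles count as neighbors.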

In the blocks below we create a function that lets the user type in a title. If the input exactly matches a title in the dataset, the ten titles with the smallest cosine distance to it are printed as recommendations. If the title is not found, fuzzy string matching is used to suggest close matches, and the user is prompted to enter a title again.

In [12]:
# creating df of title names
title_names = titles.loc[ : , ['title']]
title_names['title'] = title_names['title'].str.lower()
title_names.head()

Unnamed: 0,title
0,five came back: the reference films
1,taxi driver
2,monty python and the holy grail
3,life of brian
4,the exorcist


In [13]:
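# Print the ten nearest neighbors of the title at row index n
def recomendations(n):
    # Ask for 11 neighbors because the closest one is normally the title itself
    neighbor_index = knn.kneighbors([normalized_data.loc[n]], return_distance=False, n_neighbors=11)
    neighbor_index = list(neighbor_index[0])
    # Drop the first hit (the query title) and keep the next ten
    neighbor_index = neighbor_index[1:11]
    print("Recomendations for: " + title_names.loc[n, 'title'])
    print(title_names.loc[neighbor_index])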

In [14]:
netflix_titles = list(title_names.loc[:, 'title'])

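def movie():
    title = input('Title you want recomendations for: ').lower()
    # Five best fuzzy matches as (title, score) pairs, best match first
    result = process.extract(title, netflix_titles, limit = 5)
    i = 1
    for mov in result:
        # The best match is too weak: stop and ask again
        if (i == 1 and mov[1] <= 80):
            print('No close matches found. Try again.')
            break
        # A title that differs from the input only in punctuation or spacing can
        # also score 100, so look up the matched title rather than the raw input
        if mov[1] == 100:
            index = netflix_titles.index(mov[0])
            recomendations(index)
            return
        # Otherwise list the close matches as suggestions
        if (mov[1] >= 80 and mov[1] != 100):
            if i == 1:
                print('Exact match not found for {}. Did you mean:'.format(title))
            print(str(i) + ') ' + mov[0])
            i = i+1
    # No exact match was found: prompt for a title again
    movie()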
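
For readers without fuzzywuzzy installed, the "did you mean" step can be approximated with the standard library. The sketch below uses `difflib.get_close_matches` as a stand-in; it is not the scorer used above, its similarity scores run from 0 to 1 rather than 0 to 100, and the `suggest` helper, the `catalog` list, and the 0.6 cutoff are assumptions chosen for illustration.

```python
import difflib

# Small made-up catalog of lowercased titles.
catalog = ["the conjuring", "the conjuring 2", "breaking bad", "stranger things"]

def suggest(query, choices, n=3, cutoff=0.6):
    """Return up to n titles similar to query, best match first (hypothetical helper)."""
    return difflib.get_close_matches(query.lower(), choices, n=n, cutoff=cutoff)

print(suggest("conjuring", catalog))  # ['the conjuring', 'the conjuring 2']
print(suggest("zzzz", catalog))       # []
```

An empty result plays the same role as the "No close matches found" branch, while a non-empty one can be shown to the user as suggestions.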

The following blocks show example recommendations produced by the function.

In [15]:
movie()

Recomendations for: breaking bad
                 title
1260            narcos
1824             ozark
1835        mindhunter
1826        the sinner
1840             elite
951          wentworth
2198     the mechanism
3362        undercover
4935  the chestnut man
1888       alias grace


In [16]:
movie()

Exact match not found for conjuring. Did you mean:
1) the conjuring
2) the conjuring 2
Recomendations for: the conjuring
                                title
1234                       it follows
1915              annabelle: creation
1290                             hush
1847                       the ritual
948                             creep
350                   friday the 13th
2042           may the devil take you
4582                              122
4911                       till death
151   i know what you did last summer


In [18]:
movie()

Recomendations for: stranger things
                               title
253                     supernatural
1897  chilling adventures of sabrina
3204                     locke & key
3868                            jinn
272              the vampire diaries
946                    the originals
1834                        legacies
2235                          erased
1337                       frequency
5305                           elves


In [19]:
movie()

Recomendations for: too hot to handle
                                      title
5796                            the big day
5380              too hot to handle: brazil
5426                  love is blind: brazil
3222                          love is blind
4233  the fabulous lives of bollywood wives
5408                               insiders
4856        the ultimatum: marry or move on
5131                     my unorthodox life
5479                     cooking with paris
3997             million dollar beach house


In [21]:
movie()

Recomendations for: criminal minds
                             title
918                  the blacklist
931    how to get away with murder
1321                      quantico
1833                     riverdale
783   pablo escobar: the drug lord
1870                       manhunt
2377               falsa identidad
2149                lawless lawyer
1340                      aquarius
3955                   stolen away


In [28]:
movie()

Recomendations for: inception
                     title
255      war of the worlds
155               godzilla
1336              spectral
339                the one
282   the bourne ultimatum
292      quantum of solace
721              contagion
734           total recall
722             battleship
1348             self/less

