## Netflix Recommendation System using Python

Netflix's recommendation system suggests movies and TV shows that match our interests. Because of its large user base, Netflix has a lot of data to learn from. Its recommendation system predicts a personalised catalogue for each of us based on factors like:

- our viewing history
- the viewing history of other users whose tastes and preferences are similar to ours
- the genres, categories, descriptions, and other information about the content that we watched in the past

The dataset I am using to build a Netflix recommendation system using Python comes from [Kaggle](https://www.kaggle.com/datasets/satpreetmakhija/netflix-movies-and-tv-shows-2021). It contains information about all the movies and TV shows on Netflix as of 2021.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity

data = pd.read_csv('data/netflixData.csv')
data.head()

Unnamed: 0,Show Id,Title,Description,Director,Genres,Cast,Production Country,Release Date,Rating,Duration,Imdb Score,Content Type,Date Added
0,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,2020.0,TV-MA,1 Season,6.6/10,TV Show,
1,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,2020.0,TV-MA,99 min,6.2/10,Movie,"September 8, 2020"
2,b01b73b7-81f6-47a7-86d8-acb63080d525,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Sabina Fedeli, Anna Migotto","Documentaries, International Movies","Helen Mirren, Gengher Gatti",Italy,2019.0,TV-14,95 min,6.4/10,Movie,"July 1, 2020"
3,b6611af0-f53c-4a08-9ffa-9716dc57eb9c,#blackAF,Kenya Barris and his family navigate relations...,,TV Comedies,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020.0,TV-MA,1 Season,6.6/10,TV Show,
4,7f2d4170-bab8-4d75-adc2-197f7124c070,#cats_the_mewvie,This pawesome documentary explores how our fel...,Michael Margolis,"Documentaries, International Movies",,Canada,2020.0,TV-14,90 min,5.1/10,Movie,"February 5, 2020"


At first glance, we can see that the Title column needs some preparation, because some titles of movies and TV shows begin with a #. We will get back to it. For now, let’s check whether the data contains null values:

In [2]:
print(data.isnull().sum())

Show Id                  0
Title                    0
Description              0
Director              2064
Genres                   0
Cast                   530
Production Country     559
Release Date             3
Rating                   4
Duration                 3
Imdb Score             608
Content Type             0
Date Added            1335
dtype: int64


The dataset contains null values, but before removing the null values, let’s select the columns that we can use to build a Netflix recommendation system:

In [3]:
data = data[["Title", "Description", "Content Type", "Genres"]]
data.head()

Unnamed: 0,Title,Description,Content Type,Genres
0,(Un)Well,This docuseries takes a deep dive into the luc...,TV Show,Reality TV
1,#Alive,"As a grisly virus rampages a city, a lone man ...",Movie,"Horror Movies, International Movies, Thrillers"
2,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...",Movie,"Documentaries, International Movies"
3,#blackAF,Kenya Barris and his family navigate relations...,TV Show,TV Comedies
4,#cats_the_mewvie,This pawesome documentary explores how our fel...,Movie,"Documentaries, International Movies"


As the names suggest:

1. The Title column contains the titles of the movies and TV shows on Netflix
2. The Description column describes the plot of each TV show or movie
3. The Content Type column tells us whether it’s a movie or a TV show
4. The Genres column contains all the genres of the TV show or movie

Now let’s drop the rows containing null values and move further:

In [4]:
data = data.dropna()

Now we will clean the Title column, as it needs some preparation:

In [5]:
import re
import string

import nltk
from nltk.corpus import stopwords

# The stopword list and stemmer are loaded here, but clean() below does not use them
nltk.download('stopwords')
stemmer = nltk.SnowballStemmer('english')
stopword = set(stopwords.words('english'))

def clean(text):
    # Lowercase the title and strip every '#' character
    text = str(text).lower()
    text = re.sub('#', '', text)
    return text

data['Title'] = data['Title'].apply(clean)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\azhuk\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
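
The same cleaning step can also be written with pandas string methods instead of a Python function. The snippet below is only a sketch on a few made-up sample titles (`titles` and `cleaned` are illustrative names, not part of the notebook) and assumes the goal is simply to lowercase each title and drop every `#`:

```python
import pandas as pd

# Toy titles standing in for the real Title column (illustrative only)
titles = pd.Series(["#Alive", "#blackAF", "(Un)Well", "Enola Holmes"])

# Lowercase and remove every literal '#', mirroring what clean() does
cleaned = titles.str.lower().str.replace("#", "", regex=False)

print(cleaned.tolist())
```

Passing `regex=False` treats `#` as a literal character, so no pattern escaping is needed.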


Now let’s have a look at some samples of the Titles before moving forward:

In [6]:
data.Title.sample(10)

5240                                         the terminal
3567                                oththa seruppu size 7
5148                                    the queen of flow
2849                                            lowriders
496                                                  baby
1343                  dolly kitty aur woh chamakte sitare
2065                                           hinterland
377                                          animal world
1440                         edoardo ferrario: temi caldi
3994    rolling thunder revue: a bob dylan story by ma...
Name: Title, dtype: object

Now let's use the Genres column as the feature to recommend similar content to the user. We will convert each genre string into a TF-IDF vector and compare the vectors with [cosine similarity](https://thecleverprogrammer.com/2021/02/27/cosine-similarity-in-machine-learning/), a measure of how similar two documents are:

In [7]:
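feature = data["Genres"].tolist()
# Turn each genre string into a TF-IDF vector, ignoring English stop words
tfidf = text.TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(feature)
# Cosine similarity between every pair of rows
similarity = cosine_similarity(tfidf_matrix)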
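
To get a feel for what the similarity matrix holds, here is a small sketch on four made-up genre strings. The names `toy_genres`, `toy_tfidf`, `toy_matrix`, and `toy_similarity` are illustrative assumptions and do not come from the dataset:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Four made-up genre strings (illustrative only)
toy_genres = [
    "Horror Movies, Thrillers",
    "Horror Movies, Thrillers",
    "Documentaries",
    "Horror Movies, International Movies",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_genres)

# Entry [i, j] is the cosine similarity between genre strings i and j
toy_similarity = cosine_similarity(toy_matrix)

print(np.round(toy_similarity, 2))
```

Rows 0 and 1 have identical genres, so their similarity is 1. Row 2 shares no words with row 0, so their similarity is 0. Row 3 shares only some words with row 0, so its score falls in between.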

Now I will build a Series indexed by Title so that we can find similar content by giving the title of the movie or TV show as an input:

In [8]:
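# Map each title to its row position in the similarity matrix,
# keeping only the first row when a title appears more than once
indices = pd.Series(range(len(data)), index=data['Title'])
indices = indices[~indices.index.duplicated(keep='first')]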
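
A note on repeated titles: `Series.drop_duplicates()` removes repeated values, not repeated index labels, so a lookup keyed by title has to be filtered on its index instead. The sketch below uses a made-up three-row frame to show the difference (`toy`, `by_value`, and `lookup` are illustrative names):

```python
import pandas as pd

# Made-up frame in which the title "baby" appears twice
toy = pd.DataFrame({"Title": ["baby", "canvas", "baby"]})

# drop_duplicates() compares the values 0, 1, 2, which are all unique,
# so nothing is removed and "baby" still maps to two rows
by_value = pd.Series(toy.index, index=toy["Title"]).drop_duplicates()

# Filtering on the index keeps only the first row for each title
lookup = pd.Series(range(len(toy)), index=toy["Title"])
lookup = lookup[~lookup.index.duplicated(keep="first")]

print(len(by_value["baby"]))
print(lookup["baby"])
```

With a unique index, looking up a title returns a single row position rather than a Series of positions, which is what indexing into the similarity matrix needs.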

Now here’s how to write a function to recommend movies and TV shows on Netflix:

In [9]:
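def netFlix_recommendation(title, similarity = similarity):
    # Titles were lowercased by clean(), so pass the title in lowercase
    index = indices[title]
    # Pair every row position with its similarity to the given title
    similarity_scores = list(enumerate(similarity[index]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Keep the 10 highest scores; these normally include the queried title itself
    similarity_scores = similarity_scores[0:10]
    movieindices = [i[0] for i in similarity_scores]
    return data['Title'].iloc[movieindices]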

netFlix_recommendation('enola holmes')

1503                                 enola holmes
3218                 mowgli: legend of the jungle
4938                       the karate kid part ii
5202                                the sleepover
603                                         benji
655                                   big miracle
882                                        canvas
1062                                   coin heist
1704    free rein: the twelve neighs of christmas
1705                   free rein: valentine's day
Name: Title, dtype: object
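
Notice that the queried title itself appears in the results, and that a title missing from the data would raise a `KeyError`. As one possible variation (not the function used above), the sketch below leaves out the queried title and returns an empty list for unknown titles. It runs on a made-up four-row catalogue, and `toy`, `toy_similarity`, `toy_indices`, and `recommend_similar` are hypothetical names introduced only for this illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up catalogue; the notebook would use `data` instead
toy = pd.DataFrame({
    "Title": ["alpha", "beta", "gamma", "delta"],
    "Genres": ["Comedies, Dramas", "Comedies, Dramas", "Documentaries", "Dramas"],
})
toy_similarity = cosine_similarity(
    TfidfVectorizer(stop_words="english").fit_transform(toy["Genres"])
)
toy_indices = pd.Series(range(len(toy)), index=toy["Title"])


def recommend_similar(title, top_n=2):
    """Return up to top_n titles most similar to `title`, excluding itself."""
    key = title.lower().replace("#", "")  # same normalisation as clean()
    if key not in toy_indices.index:
        return []  # unknown title: nothing to recommend
    position = toy_indices[key]
    scores = pd.Series(toy_similarity[position], index=toy["Title"])
    scores = scores.drop(key)  # do not recommend the query itself
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend_similar("Alpha"))
print(recommend_similar("not in catalogue"))
```

Normalising the input the same way as the stored titles means callers do not have to remember that the Title column was lowercased.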

So this is how we can build a Netflix Recommendation System with Python.

#### Summary
The recommendation system of Netflix predicts a personalised catalogue for the user based on factors like viewing history, the viewing history of other users with similar tastes and preferences, and the genres, categories, descriptions, and other information about the content watched.