<a href="https://colab.research.google.com/github/Yogananth-r/Netflix-Recommendation-System/blob/main/Netflix_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


Mounting Google Drive is required because the datasets are read from Drive. Files stored there persist across Colab sessions.

In [None]:
%cd /gdrive/My Drive/Colab Notebooks/Netflix Recommendation System

/gdrive/My Drive/Colab Notebooks/Netflix Recommendation System


The current directory is set to the Netflix Recommendation System folder, which contains the notebook and the datasets.

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity

data=pd.read_csv("netflixData.csv")
print(data.head())

                                Show Id                          Title  \
0  cc1b6ed9-cf9e-4057-8303-34577fb54477                       (Un)Well   
1  e2ef4e91-fb25-42ab-b485-be8e3b23dedb                         #Alive   
2  b01b73b7-81f6-47a7-86d8-acb63080d525  #AnneFrank - Parallel Stories   
3  b6611af0-f53c-4a08-9ffa-9716dc57eb9c                       #blackAF   
4  7f2d4170-bab8-4d75-adc2-197f7124c070               #cats_the_mewvie   

                                         Description  \
0  This docuseries takes a deep dive into the luc...   
1  As a grisly virus rampages a city, a lone man ...   
2  Through her diary, Anne Frank's story is retol...   
3  Kenya Barris and his family navigate relations...   
4  This pawesome documentary explores how our fel...   

                      Director  \
0                          NaN   
1                       Cho Il   
2  Sabina Fedeli, Anna Migotto   
3                          NaN   
4             Michael Margolis   

             

The modules and the data needed for the system have been imported above.

The Netflix recommendation learns from the genre and type of the content in the watch history, so we need only some of the attributes. Let us select just the ones we need.

In [None]:
data=data[["Title","Description","Content Type","Genres"]]
print(data.head())

                           Title  \
0                       (Un)Well   
1                         #Alive   
2  #AnneFrank - Parallel Stories   
3                       #blackAF   
4               #cats_the_mewvie   

                                         Description Content Type  \
0  This docuseries takes a deep dive into the luc...      TV Show   
1  As a grisly virus rampages a city, a lone man ...        Movie   
2  Through her diary, Anne Frank's story is retol...        Movie   
3  Kenya Barris and his family navigate relations...      TV Show   
4  This pawesome documentary explores how our fel...        Movie   

                                           Genres  
0                                      Reality TV  
1  Horror Movies, International Movies, Thrillers  
2             Documentaries, International Movies  
3                                     TV Comedies  
4             Documentaries, International Movies  


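Let us drop the rows with null values. (The titles in the output below already look normalized, presumably because this cell was re-run after the cleaning step further down.)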

In [None]:
data= data.dropna()
print(data.head())

                       Title  \
0                      unwel   
1                       aliv   
2  annefrank  parallel stori   
3                    blackaf   
4               catsthemewvi   

                                         Description Content Type  \
0  This docuseries takes a deep dive into the luc...      TV Show   
1  As a grisly virus rampages a city, a lone man ...        Movie   
2  Through her diary, Anne Frank's story is retol...        Movie   
3  Kenya Barris and his family navigate relations...      TV Show   
4  This pawesome documentary explores how our fel...        Movie   

                                           Genres  
0                                      Reality TV  
1  Horror Movies, International Movies, Thrillers  
2             Documentaries, International Movies  
3                                     TV Comedies  
4             Documentaries, International Movies  


The titles need to be cleaned, since they contain characters such as "#". This is an essential part of data cleaning during preprocessing.

We also remove stopwords and stem each word. Once the titles are normalized, they can be used for the recommendation.

In [None]:
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
  # Raw strings keep regex escapes such as \S and \w from being read as string escapes.
  text = str(text).lower()
  text = re.sub(r'\[.*?\]', '', text)                 # text in square brackets
  text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
  text = re.sub(r'<.*?>', '', text)                   # HTML-like tags
  text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
  text = re.sub(r'\n', '', text)                      # newlines
  text = re.sub(r'\w*\d\w*', '', text)                # tokens containing digits
  # Remove stopwords, then stem each remaining word.
  text = [word for word in text.split(' ') if word not in stopword]
  text = " ".join(text)
  text = [stemmer.stem(word) for word in text.split(' ')]
  text = " ".join(text)
  return text

data["Title"] = data["Title"].apply(clean)




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
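
Let us look at what the regex steps do on a few raw titles. This is only a sketch using the standard library, not the function above: the small `STOPWORDS` set is a hypothetical stand-in for the NLTK English list, and stemming is left out so that nothing has to be downloaded.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stopword list (assumption).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to"}

def clean_sketch(title):
    """Apply the same regex steps as `clean`, but without stemming."""
    t = str(title).lower()
    t = re.sub(r"\[.*?\]", "", t)                 # text in square brackets
    t = re.sub(r"https?://\S+|www\.\S+", "", t)   # URLs
    t = re.sub(r"<.*?>", "", t)                   # HTML-like tags
    t = re.sub("[%s]" % re.escape(string.punctuation), "", t)  # punctuation
    t = re.sub(r"\n", "", t)                      # newlines
    t = re.sub(r"\w*\d\w*", "", t)                # tokens containing digits
    words = [w for w in t.split(" ") if w not in STOPWORDS]
    return " ".join(words)

print(clean_sketch("#Alive"))            # alive
print(clean_sketch("(Un)Well"))          # unwell
print(clean_sketch("#cats_the_mewvie"))  # catsthemewvie
```

Punctuation is deleted rather than replaced by a space, which is why "#cats_the_mewvie" collapses into a single word.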


In [None]:
print(data.Title.sample(30))
print(data.Title[3655])

4922              investig british crime stori
4077                              sandi wexler
4120                    scott pilgrim vs world
5175                               rugrat movi
985                                  chopstick
2289                                 iron ladi
1061                                      coff
596                              bella bulldog
5246                                      tour
1105                                 coupl day
397                                   apaharan
3352                     narcoworld dope stori
3119                                   misaeng
1422                                     eteam
5470                                   traitor
4357                                space forc
2920                                    malang
1944                                     hamid
5837                                  wild oat
4596                          winter wind blow
600       ben platt live radio citi music hall
5726         

The Genres attribute will be used to recommend similar content. Cosine similarity will be used to measure how similar two documents (here, two genre strings) are.

In [None]:
feature = data["Genres"].tolist()
# `input` only accepts 'content', 'file' or 'filename'; the documents are passed to fit_transform.
tfidf = text.TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(feature)
similarity = cosine_similarity(tfidf_matrix)
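
To make the step above more concrete, here is a from-scratch sketch of TF-IDF weighting followed by cosine similarity on four genre strings taken from the data preview. It assumes scikit-learn's default settings (smoothed IDF and L2 normalization) and skips stop-word removal, so the numbers are illustrative and need not match the notebook's matrix exactly. The names `tokenize`, `tfidf_vector`, `cosine` and `sim` are made up for this example.

```python
import math
import re
from collections import Counter

genres = [
    "Horror Movies, International Movies, Thrillers",
    "Documentaries, International Movies",
    "TV Comedies",
    "Documentaries, International Movies",
]

def tokenize(doc):
    # Lowercase and keep word tokens of two or more characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [Counter(tokenize(g)) for g in genres]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})
doc_freq = {t: sum(1 for doc in docs if t in doc) for t in vocab}

def tfidf_vector(doc):
    # Term count times smoothed IDF, ln((1 + n) / (1 + df)) + 1, then L2-normalize.
    weights = [doc[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors = [tfidf_vector(doc) for doc in docs]

def cosine(u, v):
    # The vectors have unit length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

sim = [[cosine(u, v) for v in vectors] for u in vectors]
for row in sim:
    print([round(x, 2) for x in row])
```

Rows with identical genres get a score of 1, rows that share no genre word get 0, and partial overlaps fall in between.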

Let us set the Title attribute as the index, so that we can find similar content by giving a title as input.

In [None]:
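# Map each title to its row position (dropna() leaves gaps in the labels); keep the first row per duplicated title.
indices = pd.Series(np.arange(len(data)), index=data['Title']).loc[lambda s: ~s.index.duplicated()]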
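
Since `dropna()` keeps the original row labels, a label and a row position can differ, while a matrix such as `similarity` is addressed by position. A small sketch with made-up data (the `toy` frame and both mappings are hypothetical) shows the difference:

```python
import pandas as pd

# Toy frame (hypothetical data) with one missing genre.
toy = pd.DataFrame({
    "Title": ["aliv", "blackaf", "arrow"],
    "Genres": ["Horror Movies", None, "TV Action & Adventure"],
}).dropna()

# The row labels keep their original values, so label 1 is missing.
print(toy.index.tolist())   # [0, 2]

# Label-based mapping: "arrow" -> 2, out of range for a 2-row matrix.
by_label = pd.Series(toy.index, index=toy["Title"])

# Position-based mapping: "arrow" -> 1, the actual row in a 2-row matrix.
by_position = pd.Series(range(len(toy)), index=toy["Title"])

print(by_label["arrow"], by_position["arrow"])   # 2 1
```

Calling `reset_index(drop=True)` right after `dropna()` is another way to make labels and positions agree.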

The function below recommends movies and TV shows on Netflix.

Since the titles are normalized, the function must be given the normalized title as input. The actual movie or TV show name may work when it is identical to its normalized form, but otherwise the lookup fails.
Remember that the real Netflix recommendation system works in the backend and carries out its operation on its own.

In [None]:
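def NetFlix_Recommendation(title, similarity=similarity):
  index = indices[title]  # row of the (normalized) title
  # Pair every row with its similarity to the chosen title, highest first.
  similarity_scores = list(enumerate(similarity[index]))
  similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
  # Keep the top 10; these usually include the queried title itself (score 1.0).
  similarity_scores = similarity_scores[0:10]
  movie_indices = [i[0] for i in similarity_scores]
  return data['Title'].iloc[movie_indices]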

print(NetFlix_Recommendation("arrow"))

418                   arrow
2985       marvel daredevil
308            alter carbon
1880                 gotham
2987       marvel iron fist
2988    marvel jessica jone
2989       marvel luke cage
2990          marvel defend
3350                  narco
3351           narco mexico
Name: Title, dtype: object
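
Because the lookup needs the normalized title, it can help to normalize the user's input first and to suggest close matches when nothing is found. The sketch below is only an illustration: `find_title` is a hypothetical helper, `normalize` stands in for the `clean` function (here simply `str.lower`), and `titles` stands in for `indices.index`.

```python
import difflib

def find_title(raw_title, known_titles, normalize=str.lower, n=3):
    """Return (match, suggestions) for a raw title.

    `normalize` stands in for the notebook's `clean` function (assumption);
    `known_titles` stands in for `indices.index`.
    """
    key = normalize(raw_title)
    if key in known_titles:
        return key, []
    # Fall back to the closest normalized titles by string similarity.
    return None, difflib.get_close_matches(key, list(known_titles), n=n, cutoff=0.6)

titles = ["arrow", "gotham", "narco", "narco mexico", "marvel daredevil"]
print(find_title("Arrow", titles))    # ('arrow', [])
print(find_title("Gothem", titles))   # no exact match, so suggestions are returned
```

In the notebook this could be used as `match, suggestions = find_title(name, indices.index, normalize=clean)`, calling `NetFlix_Recommendation(match)` only when `match` is not `None`.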
