### Problem Statement: Build a content-based movie recommender system with natural language processing. The function should take a movie name as input and return the top 3 recommended movies. (Use the Netflix dataset.)

### To achieve this goal, the following steps were taken.

#### Data Gathering (Unstructured Data)
1. Select the text data from the Netflix dataset.
2. Use the columns director, cast, type (movie or TV show), listed_in, and description to measure relevance.

#### Data PreProcessing 
1. Convert to lowercase.
2. Remove trailing spaces, stopwords, line endings, and punctuation.
3. Remove URLs for now; they could be used for further mining later.
4. Remove digits and numbers.
5. Tokenize and lemmatize the text for the bag-of-words (BOW) representation.
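
The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's actual code: the function name `clean_text` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's English stop-word list), tokenization is plain whitespace splitting, and lemmatization is left out because it needs the WordNet corpus.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "the", "in", "at", "of", "for", "and", "how"}

def clean_text(text):
    """Lowercase, drop URLs, digits and punctuation, then remove stop words."""
    text = text.lower().strip()                          # step 1: lowercase, trim spaces
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # step 3: remove URLs
    text = re.sub(r"\d+", " ", text)                     # step 4: remove digits
    text = re.sub(r"[^\w\s]", " ", text)                 # step 2: remove punctuation
    tokens = text.split()                                # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]    # step 2: remove stop words

print(clean_text("A look at the 2018 film: see https://example.com for more!  "))
# prints ['look', 'film', 'see', 'more']
```

URLs are removed before punctuation so that the pattern still sees the `://` and dots it relies on. Note that this sketch replaces punctuation with a space, whereas deleting it outright would join hyphenated words such as "world-renowned" into a single token.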

#### Feature Extraction and BOW
1. Create TF-IDF vectors over unigrams and bigrams (1-grams and 2-grams) using TF-IDF vectorisation.
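
To make the unigram-plus-bigram TF-IDF step concrete, here is a small pure-Python sketch. It is not the notebook's implementation, which relies on scikit-learn's `TfidfVectorizer`; the helper names `ngrams` and `tfidf_vectors` are hypothetical, and the weighting shown (raw term count times a smoothed inverse document frequency, followed by L2 normalisation) is one common variant chosen for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (as space-joined strings) for n_min <= n <= n_max."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs):
    """Map each document to a dict {term: weight}, L2-normalised."""
    term_lists = [ngrams(doc.split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        # Smoothed idf: ln((1 + N) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

docs = ["drama independent movie", "comedy drama movie"]
vectors = tfidf_vectors(docs)
print(sorted(vectors[0]))
# prints ['drama', 'drama independent', 'independent', 'independent movie', 'movie']
```

Terms that occur in only one document, such as "independent", end up with a larger weight than terms shared by both documents, such as "drama". That is the property the recommender relies on to tell titles apart.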

#### Cosine Similarity Matrix

1. Create the cosine similarity matrix from the TF-IDF vectors.
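
As a rough illustration of what the similarity matrix contains, the sketch below computes pairwise cosine similarity for sparse vectors stored as dictionaries. The function names and the toy weights are assumptions made for this example; the notebook itself calls scikit-learn's `cosine_similarity` on the TF-IDF matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares document i with document j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Hypothetical toy weights standing in for TF-IDF rows
vectors = [
    {"drama": 1.0, "movie": 1.0},
    {"drama": 1.0, "comedy": 1.0},
    {"documentary": 2.0},
]
sim = similarity_matrix(vectors)
```

Here `sim[0][1]` is approximately 0.5 because the first two documents share one of their two terms, while `sim[0][2]` is 0 because they share none. The diagonal is approximately 1, since every document is identical to itself.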

#### Recommend content as per input

1. Take a title (movie or TV show) as input.
2. Find the index of that title in the dataset; if it is not present, ask for input again.
3. Get the rows most similar to the input from the cosine similarity matrix created above.
4. Select the indices of the top 3 cosine scores.
5. Return the titles at these indices.
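
The lookup steps above can be sketched as follows. This is a simplified, hypothetical version that works on plain Python lists: the function name `recommend`, the placeholder titles, and the similarity values are all made up for illustration and do not come from the Netflix data.

```python
def recommend(title, titles, sim, top_n=3):
    """Return the top_n titles most similar to `title`, or None if it is unknown."""
    title = title.strip().lower()
    if title not in titles:
        return None  # the caller should ask the user for another title
    idx = titles.index(title)
    # Rank every other row by its similarity to the query row, highest first
    ranked = sorted((j for j in range(len(titles)) if j != idx),
                    key=lambda j: sim[idx][j], reverse=True)
    return [titles[j] for j in ranked[:top_n]]

# Made-up titles and similarity scores, for illustration only
titles = ["film a", "film b", "film c", "film d", "film e"]
sim = [
    [1.0, 0.6, 0.2, 0.4, 0.0],
    [0.6, 1.0, 0.1, 0.3, 0.0],
    [0.2, 0.1, 1.0, 0.5, 0.0],
    [0.4, 0.3, 0.5, 1.0, 0.1],
    [0.0, 0.0, 0.0, 0.1, 1.0],
]
print(recommend("Film A", titles, sim))
# prints ['film b', 'film d', 'film c']
```

The sketch excludes the query row by index instead of skipping the first sorted entry. Both approaches give the same result when the query is the only row with a score of 1, but excluding by index also behaves predictably if another title has an identical summary.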

 


In [1]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')
nltk.download('stopwords')
# WordNetLemmatizer also needs the 'wordnet' corpus; run nltk.download('wordnet') if it is missing.

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Reading and preview of the files
data = pd.read_csv('NETFLIX TITLES.csv')
data.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81088285,Movie,The Mayo Clinic,"Ken Burns, Christopher Loren Ewers, Erik Ewers",Peter Coyote,United States,"April 19, 2019",2018,TV-14,116 min,Documentaries,A look at how a world-renowned medical institu...
1,81077597,Movie,I Am,Onir,"Juhi Chawla, Rahul Bose, Nandita Das, Sanjay S...","India, Japan","March 4, 2019",2010,TV-MA,106 min,"Dramas, Independent Movies, International Movies",Four individuals in modern India grapple with ...


In [3]:
# Copy the selected columns so later assignments do not modify a view of `data`
df = data[['type','title','director','cast','listed_in','description']].copy()
df.head()

Unnamed: 0,type,title,director,cast,listed_in,description
0,Movie,The Mayo Clinic,"Ken Burns, Christopher Loren Ewers, Erik Ewers",Peter Coyote,Documentaries,A look at how a world-renowned medical institu...
1,Movie,I Am,Onir,"Juhi Chawla, Rahul Bose, Nandita Das, Sanjay S...","Dramas, Independent Movies, International Movies",Four individuals in modern India grapple with ...
2,Movie,Love Jones,Theodore Witcher,"Larenz Tate, Nia Long, Isaiah Washington, Lisa...","Comedies, Dramas, Independent Movies","In this urban romantic comedy set in Chicago, ..."
3,Movie,Ghayal,Rajkumar Santoshi,"Sunny Deol, Meenakshi Sheshadri, Amrish Puri, ...","Action & Adventure, Dramas, International Movies","Framed for his older brother's murder, a boxer..."
4,Movie,Marriage Story,Noah Baumbach,"Scarlett Johansson, Adam Driver, Laura Dern, A...",Dramas,Academy Award-nominated filmmaker Noah Baumbac...


In [4]:
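from nltk.tokenize import word_tokenize

# Characters stripped before tokenization (same set as before, built once)
PUNCT_TABLE = {ord(c): '' for c in r"[!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]"}

def preprocessing_data(row):
    # Lowercase, strip punctuation, tokenize; a missing value (NaN) becomes the token 'nan'
    if row:
        return word_tokenize(str(row).lower().translate(PUNCT_TABLE))
    return []  # empty input yields no tokens instead of None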

In [5]:
# Tokenize every text column except the title
for col in df.columns.drop('title'):
    df[col] = df[col].apply(preprocessing_data)
df.head()

Unnamed: 0,type,title,director,cast,listed_in,description
0,[movie],The Mayo Clinic,"[ken, burns, christopher, loren, ewers, erik, ...","[peter, coyote]",[documentaries],"[a, look, at, how, a, worldrenowned, medical, ..."
1,[movie],I Am,[onir],"[juhi, chawla, rahul, bose, nandita, das, sanj...","[dramas, independent, movies, international, m...","[four, individuals, in, modern, india, grapple..."
2,[movie],Love Jones,"[theodore, witcher]","[larenz, tate, nia, long, isaiah, washington, ...","[comedies, dramas, independent, movies]","[in, this, urban, romantic, comedy, set, in, c..."
3,[movie],Ghayal,"[rajkumar, santoshi]","[sunny, deol, meenakshi, sheshadri, amrish, pu...","[action, adventure, dramas, international, mov...","[framed, for, his, older, brother, 's, murder,..."
4,[movie],Marriage Story,"[noah, baumbach]","[scarlett, johansson, adam, driver, laura, der...",[dramas],"[academy, awardnominated, filmmaker, noah, bau..."


In [7]:
lemmatizer = WordNetLemmatizer()
def lemmatize_root(row):
    # Lemmatize each token and join the tokens back into a single string
    return ' '.join(lemmatizer.lemmatize(x) for x in row)
for col in df.columns.drop('title'):
    df[col] = df[col].apply(lemmatize_root)
df.head()

Unnamed: 0,type,title,director,cast,listed_in,description
0,movie,The Mayo Clinic,ken burn christopher loren ewer erik ewer,peter coyote,documentary,a look at how a worldrenowned medical institut...
1,movie,I Am,onir,juhi chawla rahul bose nandita da sanjay suri ...,drama independent movie international movie,four individual in modern india grapple with t...
2,movie,Love Jones,theodore witcher,larenz tate nia long isaiah washington lisa ni...,comedy drama independent movie,in this urban romantic comedy set in chicago t...
3,movie,Ghayal,rajkumar santoshi,sunny deol meenakshi sheshadri amrish puri mou...,action adventure drama international movie,framed for his older brother 's murder a boxer...
4,movie,Marriage Story,noah baumbach,scarlett johansson adam driver laura dern alan...,drama,academy awardnominated filmmaker noah baumbach...


In [8]:
stop_words = set(stopwords.words('english'))
def remove_stopwords(row):
    # Keep only the tokens that are not English stop words
    return " ".join(i for i in row.split() if i not in stop_words)
for col in df.columns.drop('title'):
    df[col] = df[col].apply(remove_stopwords)
df.head()

Unnamed: 0,type,title,director,cast,listed_in,description
0,movie,The Mayo Clinic,ken burn christopher loren ewer erik ewer,peter coyote,documentary,look worldrenowned medical institution priorit...
1,movie,I Am,onir,juhi chawla rahul bose nandita da sanjay suri ...,drama independent movie international movie,four individual modern india grapple identity ...
2,movie,Love Jones,theodore witcher,larenz tate nia long isaiah washington lisa ni...,comedy drama independent movie,urban romantic comedy set chicago ups courtshi...
3,movie,Ghayal,rajkumar santoshi,sunny deol meenakshi sheshadri amrish puri mou...,action adventure drama international movie,framed older brother 's murder boxer seek viol...
4,movie,Marriage Story,noah baumbach,scarlett johansson adam driver laura dern alan...,drama,academy awardnominated filmmaker noah baumbach...


In [10]:
columns = ['type', 'director', 'cast', 'listed_in', 'description']

# Concatenate all text columns into a single field per title
df['summary'] = df['type']+' '+df['description']+' '+df['director']+' '+df['cast']+' '+df['listed_in']
df = df.drop(columns=columns)
df.head()

Unnamed: 0,title,summary
0,The Mayo Clinic,movie look worldrenowned medical institution p...
1,I Am,movie four individual modern india grapple ide...
2,Love Jones,movie urban romantic comedy set chicago ups co...
3,Ghayal,movie framed older brother 's murder boxer see...
4,Marriage Story,movie academy awardnominated filmmaker noah ba...


In [11]:
df['title'] = df['title'].apply(lambda x : str(x).lower())

In [12]:
TfidfVec = TfidfVectorizer(ngram_range=(1,2))
# Convert the text to a matrix of TF-IDF features
tfidf = TfidfVec.fit_transform(df['summary'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = TfidfVec.get_feature_names_out().tolist()
feature_names
# corpus_index = [n for n in df['summary']]
# data_matrix = pd.DataFrame(tfidf.T.todense(), index=feature_names, columns=corpus_index)
# data_matrix.head()

['007',
 '007 find',
 '007 installment',
 '009',
 '009 meet',
 '10',
 '10 country',
 '10 elite',
 '10 men',
 '10 million',
 '10 prepare',
 '10 song',
 '10 woman',
 '10 year',
 '100',
 '100 brother',
 '100 million',
 '100 space',
 '100 year',
 '1000',
 '1000 year',
 '10000',
 '10000 afghan',
 '10000 counting',
 '10000 prize',
 '100000',
 '100000 nan',
 '102yearold',
 '102yearold goblin',
 '10city',
 '10city tour',
 '10seat',
 '10seat 300aplate',
 '10story',
 '10story collection',
 '10thyear',
 '10thyear senior',
 '10week',
 '10week competition',
 '10year',
 '10year journey',
 '10yearold',
 '10yearold ash',
 '10yearold daughter',
 '10yearold dreamer',
 '10yearold go',
 '10yearold luke',
 '10yearold natalie',
 '11',
 '11 different',
 '11 los',
 '11 year',
 '1100',
 '1100 jew',
 '112mile',
 '112mile bike',
 '11day',
 '11day disappearance',
 '11yearold',
 '11yearold afghan',
 '11yearold boy',
 '11yearold daughter',
 '11yearold girl',
 '11yearold jimmy',
 '11yearold johnny',
 '11yearold stud

In [13]:
cosine_sim = cosine_similarity(tfidf, tfidf)
cosine_sim

array([[1.00000000e+00, 1.53947654e-03, 1.06211397e-03, ...,
        0.00000000e+00, 1.07242768e-03, 6.18241739e-04],
       [1.53947654e-03, 1.00000000e+00, 1.25308977e-02, ...,
        0.00000000e+00, 3.74525662e-03, 1.58104446e-03],
       [1.06211397e-03, 1.25308977e-02, 1.00000000e+00, ...,
        0.00000000e+00, 8.51602642e-03, 3.90051457e-03],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.07242768e-03, 3.74525662e-03, 8.51602642e-03, ...,
        0.00000000e+00, 1.00000000e+00, 1.10138466e-03],
       [6.18241739e-04, 1.58104446e-03, 3.90051457e-03, ...,
        0.00000000e+00, 1.10138466e-03, 1.00000000e+00]])

In [14]:
def response(title):
    content_titles = df['title']
    try:
        content_index = content_titles[content_titles == title].index[0]
    except IndexError:
        return f'No such content with title {title} on Netflix.Please try Again'
    # Sort scores in descending order; position 0 is the title itself, so take positions 1 to 3
    similarity_scores_list = pd.Series(cosine_sim[content_index]).sort_values(ascending=False)
    similar_content_indices = list(similarity_scores_list.iloc[1:4].index)
    print(f'You have watched {title},You may like:')
    recommended_content_titles = [str(df['title'][content]).upper() for content in similar_content_indices]
    return ','.join(recommended_content_titles)


In [15]:
#Keyword Matching
import random
#Greeting Inputs
GREETING_INPUTS = ["hi", "hello", "hola", "greetings", "wassup", "hey"]

#Greeting responses back to the user
GREETING_RESPONSES=["howdy", "hi", "hey", "what's good", "hello", "hey there"]

#Function to return a random greeting response to a user's greeting
def greeting(sentence):
  #if the user's input is a greeting, then return a randomly chosen greeting response
  for word in sentence.split():
    if word.lower() in GREETING_INPUTS:
        return random.choice(GREETING_RESPONSES)

#### Chatbot flow for recommended movies/tv-shows

In [16]:
flag = True
print("recommendMe:I will recommend you better content you ever watched !Please enter the movie/season you have watched")
while flag:
  user_response = input().strip().lower()
  if user_response == 'bye':
    flag = False
    print("recommendMe: Chat with you later !")
  elif user_response in ('thanks', 'thank you'):
    flag = False
    print("recommendMe: You are welcome !")
  else:
    # Call greeting() once so the check and the reply use the same result
    reply = greeting(user_response)
    if reply is not None:
      print("recommendMe: " + reply)
    else:
      print("recommendMe: " + response(user_response))

recommendMe:I will recommend you better content you ever watched !Please enter the movie/season you have watched
Hi
recommendMe: howdy
Ghayal
You have watched ghayal,You may like:
recommendMe: PUKAR,MANDI,BARSAAT
Grey's Anatomy
You have watched grey's anatomy,You may like:
recommendMe: MELODIES OF LIFE - BORN THIS WAY,13 REASONS WHY,THE OATH
The godfather
recommendMe: No such content with title the godfather on Netflix.Please try Again
XXX: State of the Union
You have watched xxx: state of the union,You may like:
recommendMe: XXX,WHAT HAPPENED TO MONDAY,THE MATRIX RELOADED
Thanks
recommendMe: You are welcome !
