# Preprocessing Netflix Movies and TV Shows Dataset

Data source: https://www.kaggle.com/datasets/shivamb/netflix-shows

In preparation for the movie recommender app we will later build, we need to clean and explore the dataset to understand the data structure. This will help us determine the optimal algorithm to be used as the core of the recommendation system. <br>

<br>
Before we explore the data, let's go over the types of recommendation systems and their data requirements:<br>

**1. Content-based Recommendation:** <br>

- Items are recommended based on their descriptions, metadata and content similarities <br>

- In other words, the user will be recommended a book similar to the ones they liked in the past <br>

 **Data Requirements:** <br>

   - Require item information, like descriptions, keywords, metadata, content summary, etc.<br>

   - Does not require user information

<br>


**2. Collaborative Filtering Recommendation:** <br>

 *2.1. Item-based Collaborative Filtering* <br>
 
  - Items are recommended based on similarities calculated from user interactions, e.g. ratings and reviews<br>

  - Example: if two items, A and B, are similar and a user rates item A highly, they are likely to rate item B highly as well<br>

 *2.2. User-based Collaborative Filtering* <br>

  - Items are recommended based on similarities between users' characteristics or behaviors <br>

 **Data Requirements:** <br>

   - Require user information such as user IDs and user ratings<br>

   - Item-based Approach: Each item needs to have at least one rating<br>

   - User-based Approach: Each user needs to rate at least one item<br>
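
To make the data requirements of collaborative filtering concrete, here is a minimal sketch on a made-up ratings table. The users, items and ratings are hypothetical and are not part of the Netflix dataset, and treating a 0 as "not rated" inside cosine similarity is a simplification chosen only to keep the example short.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated"
ratings = pd.DataFrame(
    {
        "item_A": [5, 4, 0, 1],
        "item_B": [5, 5, 0, 1],
        "item_C": [1, 0, 5, 4],
    },
    index=["user_1", "user_2", "user_3", "user_4"],
)

# Item-based: compare items by the ratings they received (transpose so items are rows)
item_similarity = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns,
    columns=ratings.columns,
)

# User-based: compare users by the ratings they gave
user_similarity = pd.DataFrame(
    cosine_similarity(ratings),
    index=ratings.index,
    columns=ratings.index,
)

print(item_similarity.round(2))
print(user_similarity.round(2))
```

Both variants need a user-by-item table like `ratings`, which is why a dataset without user IDs or ratings cannot support them.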

## Data Exploration

In [237]:
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
nltk.download('stopwords')
nltk.download('punkt')
from nltk import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\I589795\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\I589795\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [238]:
netflix_data = pd.read_csv('netflix_titles.csv', index_col=0)
netflix_data.head()

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [239]:
netflix_data.columns

Index(['type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [240]:
netflix_data.shape

(8807, 11)

In [241]:
# Print description of the first movie (positional lookup, since the index holds show_id labels)
netflix_data.description.iloc[0]

'As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.'

### Note

- We notice that the dataset has no information about user ratings or user IDs<br>

- Thus, we will use a content-based approach to build the recommendation system

## Content-based Recommendation Systems

We will build 2 content-based recommendation systems (the data cleaning steps and algorithms used for each are listed below)

**1. Recommendation System based on movie description** <br> 

- *Data Cleaning:* <br>
  + Clean description column to retain only relevant keywords using user-defined functions and NLTK library <br>
<br>

- *Algorithms:* <br>
  + Utilize ```TF-IDF``` model to represent text as vector arrays. <br>
    -> ```TF-IDF``` measures the relative importance of a word by comparing its frequency in a specific document to its occurrence across the entire corpus. <br>
    -> ```TF-IDF``` will reduce weights of common words, and increase weights of rare words that don't occur in many documents<br>
    
  + Utilize ```Cosine Similarity``` to calculate similarities between the vectorized descriptions<br>
    -> ```Cosine Similarity``` measures the similarity between a pair of vectors as the cosine of the angle between them<br>


**2. Recommendation System based on metadata (director, cast, genre, type)** <br>

- *Data Cleaning:* <br>
  + Cast column: Keep top 3 actors, lowercase all names and strip spaces between names <br>
  + Director column: Lowercase and strip spaces between names <br>
  + Create metadata soup (combining cast, director, genre, type) to feed vectorizer <br>
<br>

- *Algorithms:* <br>
  + Utilize ```TF-IDF``` model to represent text as vector arrays. <br>
    -> ```TF-IDF``` measures the relative importance of a word by comparing its frequency in a specific document to its occurrence across the entire corpus. <br>
    -> ```TF-IDF``` will reduce weights of common words, and increase weights of rare words that don't occur in many documents<br>

  + Utilize ```Cosine Similarity``` to calculate similarities between the vectorized metadata soups<br>
    -> ```Cosine Similarity``` measures the similarity between a pair of vectors as the cosine of the angle between them<br>
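
Before applying these two steps to the dataset, a small sketch on three made-up descriptions shows how they fit together. The sentences below are invented for illustration only; the two that share words should score higher than the pair with no words in common.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up descriptions, used only for illustration
docs = [
    "detective hunts killer city",
    "detective chases killer night",
    "family bakes cake holiday",
]

# Step 1: represent each description as a TF-IDF vector
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)  # shape: (3 documents, vocabulary size)

# Step 2: compare every description with every other description
similarity = cosine_similarity(doc_vectors)   # shape: (3, 3)

print(doc_vectors.shape)
print(similarity.round(2))
```

The first two descriptions share "detective" and "killer", so their similarity is positive, while the third shares no words with the first and gets a similarity of 0.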

In [242]:
netflix_data.columns

Index(['type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [243]:
# Create new DataFrame containing metadata of movies and shows 
netflix_metadata = netflix_data.copy()
netflix_metadata = netflix_metadata[['title', 'type', 'director', 'cast', 'listed_in', 'description']]

# Rename listed_in column as genre 
netflix_metadata.rename(columns={'listed_in': 'genre'}, inplace=True)
netflix_metadata.head()

show_id,title,type,director,cast,genre,description
s1,Dick Johnson Is Dead,Movie,Kirsten Johnson,,Documentaries,"As her father nears the end of his life, filmm..."
s2,Blood & Water,TV Show,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...","International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
s3,Ganglands,TV Show,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...","Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
s4,Jailbirds New Orleans,TV Show,,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
s5,Kota Factory,TV Show,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...","International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [244]:
netflix_metadata.info()

## Some titles are missing director and cast information
## All titles have a description -> no need to drop any rows

<class 'pandas.core.frame.DataFrame'>
Index: 8807 entries, s1 to s8807
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        8807 non-null   object
 1   type         8807 non-null   object
 2   director     6173 non-null   object
 3   cast         7982 non-null   object
 4   genre        8807 non-null   object
 5   description  8807 non-null   object
dtypes: object(6)
memory usage: 481.6+ KB


### Description-based Recommender

In [245]:
# Clean description

# Define function to clean description
def cleaning(s):
    s = str(s) 
    s = s.lower() # lowercase all words
    s = re.sub('[^a-zA-Z]', ' ', s)  # replace characters other than "a" to "z" & "A" to "Z" with spaces
    return s

# Apply function on description to create new column 'clean_desc'
netflix_metadata['clean_desc'] = netflix_metadata['description'].apply(cleaning)

netflix_metadata.head()

show_id,title,type,director,cast,genre,description,clean_desc
s1,Dick Johnson Is Dead,Movie,Kirsten Johnson,,Documentaries,"As her father nears the end of his life, filmm...",as her father nears the end of his life filmm...
s2,Blood & Water,TV Show,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...","International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",after crossing paths at a party a cape town t...
s3,Ganglands,TV Show,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...","Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,to protect his family from a powerful drug lor...
s4,Jailbirds New Orleans,TV Show,,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",feuds flirtations and toilet talk go down amo...
s5,Kota Factory,TV Show,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...","International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,in a city of coaching centers known to train i...


In [246]:
# Tokenize and remove English stopwords in clean_desc to retain only keywords

netflix_metadata['clean_desc'] = netflix_metadata['clean_desc'].apply(word_tokenize)

# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))
netflix_metadata['clean_desc'] = netflix_metadata['clean_desc'].apply(
    lambda x: [word for word in x if word not in stop_words]
    )

netflix_metadata['clean_desc'] = netflix_metadata['clean_desc'].apply(lambda x: ' '.join(x))  # join the remaining tokens back into a single string

netflix_metadata.head()

show_id,title,type,director,cast,genre,description,clean_desc
s1,Dick Johnson Is Dead,Movie,Kirsten Johnson,,Documentaries,"As her father nears the end of his life, filmm...",father nears end life filmmaker kirsten johnso...
s2,Blood & Water,TV Show,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...","International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",crossing paths party cape town teen sets prove...
s3,Ganglands,TV Show,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...","Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,protect family powerful drug lord skilled thie...
s4,Jailbirds New Orleans,TV Show,,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",feuds flirtations toilet talk go among incarce...
s5,Kota Factory,TV Show,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...","International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,city coaching centers known train india finest...


In [248]:
netflix_metadata.clean_desc.iloc[0]

'father nears end life filmmaker kirsten johnson stages death inventive comical ways help face inevitable'

In [249]:
# Reset index of netflix_metadata to integers
netflix_metadata = netflix_metadata.reset_index(drop=True)

In [250]:
print(netflix_metadata.index)

RangeIndex(start=0, stop=8807, step=1)


In [251]:
# Generate TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(netflix_metadata['clean_desc'])
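
As an aside, `TfidfVectorizer` can also lowercase, tokenize and drop stopwords by itself. The sketch below is a possible shortcut and not what this notebook does: it relies on scikit-learn's built-in English stopword list, which differs from the NLTK list, so the resulting vocabulary would not be identical to the one produced above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw description of the first title, taken from the dataset preview above
description = [
    "As her father nears the end of his life, filmmaker Kirsten Johnson stages "
    "his death in inventive and comical ways to help them both face the inevitable."
]

# stop_words="english" uses scikit-learn's own stopword list (not NLTK's)
vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(description)

# Terms that survive lowercasing, tokenization and stopword removal
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
```

Keeping the explicit NLTK steps, as done in this notebook, has the advantage that the cleaned text stays visible in the `clean_desc` column for inspection.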

In [252]:
tfidf_matrix.shape

## Note: The matrix has 8807 rows (one per title) and 18751 columns (one per vocabulary term found in the descriptions)
## To compare titles, we need a cosine similarity matrix whose rows and columns both represent the 8807 titles
## Each entry (i, j) then holds the similarity between title i and title j

(8807, 18751)

In [253]:
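# Calculate Cosine Similarity matrix using linear kernel
# (TfidfVectorizer L2-normalizes each row by default, so the dot product equals the cosine similarity)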

from sklearn.metrics.pairwise import linear_kernel

cosine_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_matrix.shape

(8807, 8807)
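
The use of `linear_kernel` in place of `cosine_similarity` can be checked on a toy corpus. The sketch below uses made-up sentences and assumes the default `norm='l2'` setting of `TfidfVectorizer`, under which every row has unit length and the plain dot product already is the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up descriptions, used only for illustration
docs = [
    "detective hunts killer city",
    "detective chases killer night",
    "family bakes cake holiday",
]

toy_matrix = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

# Each row should have length 1
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)

# Dot products and cosine similarities should therefore agree
dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot_products, cosines))
```

`linear_kernel` skips the normalization step that `cosine_similarity` performs, which is why it is a common choice for matrices that are already normalized.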

In [254]:
# Verify diagonal elements are 1s since each movie will be compared to itself 
# Sum of diagonal elements should equal 8807

diag = 0
for i in range(len(cosine_matrix)):
    diag+= cosine_matrix[i][i]

print(diag)

8807.0


In [None]:
# Define function to fetch n similar titles to the one user provided

## Workflow: 
# 1. User provides a title - model to locate corresponding index of the title in the original DataFrame
# 2. Use this index to retrieve array of cosine similarity scores of specified title with all other titles
# 3. Sort the similarity score in descending order to showcase most similar titles

In [255]:
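def get_recommendations(title, n):
    # Locate the row index of the requested title; fail with a clear message if it is absent
    matches = netflix_metadata.index[netflix_metadata['title'] == title]
    if len(matches) == 0:
        raise ValueError(f"Title not found: {title!r}")
    index = matches[0]

    # Rank all titles by cosine similarity to the requested title, highest first
    similar_titles = sorted(enumerate(cosine_matrix[index]), reverse=True, key=lambda x: x[1])

    # Skip position 0 (expected to be the requested title itself) and collect the next n titles
    recomm = []
    for i in similar_titles[1:n+1]:
        recomm.append(netflix_metadata.iloc[i[0]]['title'])

    return recomm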

In [256]:
get_recommendations("Dick Johnson Is Dead", 5)

['End Game',
 'The Soul',
 'Moon',
 'The Death and Life of Marsha P. Johnson',
 'The Cloverfield Paradox']

In [257]:
get_recommendations("Kota Factory", 2)

['Drishyam', 'The Creative Indians']
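
The same lookup-and-rank workflow can be written with `np.argsort` and a title-to-index dictionary. This is a sketch on a hand-made 4x4 similarity matrix; the titles, the scores and the name `top_n_similar` are hypothetical, and unique titles are assumed.

```python
import numpy as np

# Hypothetical titles and a hand-made symmetric similarity matrix
titles = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.9],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.9, 0.4, 1.0],
])

# Map each title to its row number (assumes titles are unique)
title_to_index = {title: i for i, title in enumerate(titles)}

def top_n_similar(title, n):
    """Return the n titles most similar to `title`, or [] if it is unknown."""
    if title not in title_to_index:
        return []
    idx = title_to_index[title]
    order = np.argsort(-sim[idx])      # row positions sorted by descending similarity
    order = order[order != idx][:n]    # drop the title itself, keep the top n
    return [titles[i] for i in order]

print(top_n_similar("A", 2))
```

Removing the queried row by its index, rather than by skipping the first sorted position, stays correct even when another title ties with a similarity of 1.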

### Metadata-based Recommender

In [258]:
netflix_metadata.head()

,title,type,director,cast,genre,description,clean_desc
0,Dick Johnson Is Dead,Movie,Kirsten Johnson,,Documentaries,"As her father nears the end of his life, filmm...",father nears end life filmmaker kirsten johnso...
1,Blood & Water,TV Show,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...","International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",crossing paths party cape town teen sets prove...
2,Ganglands,TV Show,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...","Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,protect family powerful drug lord skilled thie...
3,Jailbirds New Orleans,TV Show,,,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",feuds flirtations toilet talk go among incarce...
4,Kota Factory,TV Show,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...","International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,city coaching centers known train india finest...


In [259]:
# Change data type of cast, genre to list; director to string

netflix_metadata['cast'] = netflix_metadata['cast'].apply(lambda x: x.split(", ") if pd.notna(x) else [])
netflix_metadata['genre'] = netflix_metadata['genre'].apply(lambda x: x.split(", ") if pd.notna(x) else [])
# Note: astype(str) turns missing directors (NaN) into the string 'nan'; these are blanked out later in clean_director
netflix_metadata['director'] = netflix_metadata['director'].astype(str)

In [260]:
netflix_metadata.head()

,title,type,director,cast,genre,description,clean_desc
0,Dick Johnson Is Dead,Movie,Kirsten Johnson,[],[Documentaries],"As her father nears the end of his life, filmm...",father nears end life filmmaker kirsten johnso...
1,Blood & Water,TV Show,,"[Ama Qamata, Khosi Ngema, Gail Mabalane, Thaba...","[International TV Shows, TV Dramas, TV Mysteries]","After crossing paths at a party, a Cape Town t...",crossing paths party cape town teen sets prove...
2,Ganglands,TV Show,Julien Leclercq,"[Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nab...","[Crime TV Shows, International TV Shows, TV Ac...",To protect his family from a powerful drug lor...,protect family powerful drug lord skilled thie...
3,Jailbirds New Orleans,TV Show,,[],"[Docuseries, Reality TV]","Feuds, flirtations and toilet talk go down amo...",feuds flirtations toilet talk go among incarce...
4,Kota Factory,TV Show,,"[Mayur More, Jitendra Kumar, Ranjan Raj, Alam ...","[International TV Shows, Romantic TV Shows, TV...",In a city of coaching centers known to train I...,city coaching centers known train india finest...


In [261]:
## Get top 3 actors and top 3 genres for each title

# Define function to return the first 3 elements of a list, or the entire list if it has 3 or fewer
def get_list(x):
    if isinstance(x, list):  # Check if x is a list
        return x[:3]         # Slicing returns the whole list when it has fewer than 3 items
    return []  # Return empty list if input object 'x' is not a list

In [262]:
# Apply get_list on cast and genre to get top 3
features = ['cast','genre']
for feature in features:
    netflix_metadata[feature] = netflix_metadata[feature].apply(get_list)

In [263]:
netflix_metadata.head()

,title,type,director,cast,genre,description,clean_desc
0,Dick Johnson Is Dead,Movie,Kirsten Johnson,[],[Documentaries],"As her father nears the end of his life, filmm...",father nears end life filmmaker kirsten johnso...
1,Blood & Water,TV Show,,"[Ama Qamata, Khosi Ngema, Gail Mabalane]","[International TV Shows, TV Dramas, TV Mysteries]","After crossing paths at a party, a Cape Town t...",crossing paths party cape town teen sets prove...
2,Ganglands,TV Show,Julien Leclercq,"[Sami Bouajila, Tracy Gotoas, Samuel Jouy]","[Crime TV Shows, International TV Shows, TV Ac...",To protect his family from a powerful drug lor...,protect family powerful drug lord skilled thie...
3,Jailbirds New Orleans,TV Show,,[],"[Docuseries, Reality TV]","Feuds, flirtations and toilet talk go down amo...",feuds flirtations toilet talk go among incarce...
4,Kota Factory,TV Show,,"[Mayur More, Jitendra Kumar, Ranjan Raj]","[International TV Shows, Romantic TV Shows, TV...",In a city of coaching centers known to train I...,city coaching centers known train india finest...


In [264]:
# Change data type of type to str
netflix_metadata['type'] = netflix_metadata['type'].astype(str)

In [265]:
## Convert names of actors, directors, type, and genres to lowercase and remove the spaces within each name
   # (prevents the vectorizer from splitting names into separate words, so 'Johnny Depp' & 'Johnny Galecki' do not match on the shared token 'johnny')

In [266]:
# Define function to lowercase values and remove spaces within each name
def clean_data(x):
    if isinstance(x, list):
        return [i.lower().replace(" ", "") for i in x]  # cast, genre: clean every name in the list
    elif isinstance(x, str):
        return x.lower().replace(" ", "")  # type, director: clean the single string
    else:
        return x  # leave any other value unchanged
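
The reason for removing spaces inside names can be seen with a small experiment. The sketch below uses `CountVectorizer` instead of TF-IDF purely to keep the arithmetic simple, and the two names are the ones mentioned in the comment above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Names kept as two words: the vectorizer splits them and finds the shared token 'johnny'
with_spaces = ["johnny depp", "johnny galecki"]

# Names joined into one token each: nothing is shared
without_spaces = ["johnnydepp", "johnnygalecki"]

sim_with_spaces = cosine_similarity(
    CountVectorizer().fit_transform(with_spaces)
)[0, 1]
sim_without_spaces = cosine_similarity(
    CountVectorizer().fit_transform(without_spaces)
)[0, 1]

print(sim_with_spaces)     # positive: the two actors look partly alike
print(sim_without_spaces)  # zero: the two actors are treated as distinct
```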

In [267]:
# Apply clean_data on type, director, cast, genre columns
metadata = ['type', 'director', 'cast', 'genre']

for m in metadata:
    new_column_name = f'clean_{m}'
    netflix_metadata[new_column_name] = netflix_metadata[m].apply(clean_data)

In [268]:
# Replace the 'nan' strings (created by astype(str) on missing directors) in clean_director with empty strings
netflix_metadata['clean_director'] = netflix_metadata['clean_director'].replace('nan', '')

In [269]:
netflix_metadata.head()

,title,type,director,cast,genre,description,clean_desc,clean_type,clean_director,clean_cast,clean_genre
0,Dick Johnson Is Dead,Movie,Kirsten Johnson,[],[Documentaries],"As her father nears the end of his life, filmm...",father nears end life filmmaker kirsten johnso...,movie,kirstenjohnson,[],[documentaries]
1,Blood & Water,TV Show,,"[Ama Qamata, Khosi Ngema, Gail Mabalane]","[International TV Shows, TV Dramas, TV Mysteries]","After crossing paths at a party, a Cape Town t...",crossing paths party cape town teen sets prove...,tvshow,,"[amaqamata, khosingema, gailmabalane]","[internationaltvshows, tvdramas, tvmysteries]"
2,Ganglands,TV Show,Julien Leclercq,"[Sami Bouajila, Tracy Gotoas, Samuel Jouy]","[Crime TV Shows, International TV Shows, TV Ac...",To protect his family from a powerful drug lor...,protect family powerful drug lord skilled thie...,tvshow,julienleclercq,"[samibouajila, tracygotoas, samueljouy]","[crimetvshows, internationaltvshows, tvaction&..."
3,Jailbirds New Orleans,TV Show,,[],"[Docuseries, Reality TV]","Feuds, flirtations and toilet talk go down amo...",feuds flirtations toilet talk go among incarce...,tvshow,,[],"[docuseries, realitytv]"
4,Kota Factory,TV Show,,"[Mayur More, Jitendra Kumar, Ranjan Raj]","[International TV Shows, Romantic TV Shows, TV...",In a city of coaching centers known to train I...,city coaching centers known train india finest...,tvshow,,"[mayurmore, jitendrakumar, ranjanraj]","[internationaltvshows, romantictvshows, tvcome..."


In [271]:
# Generate metadata soup by merging metadata columns into single string
    
# Define create_soup() function 
def create_soup(x):
    clean_type = x['clean_type']
    clean_cast = ' '.join(x['clean_cast'])
    clean_genre = ' '.join(x['clean_genre'])
    
    # Handle 'clean_director' if it's a list of strings
    if isinstance(x['clean_director'], list):
        clean_director = ' '.join(x['clean_director'])
    else:
        clean_director = x['clean_director']
    
    return f"{clean_type} {clean_director} {clean_cast} {clean_genre}"

# Add 'soup' column to the DataFrame 
netflix_metadata['soup'] = netflix_metadata.apply(create_soup, axis=1)

In [272]:
print(netflix_metadata[['director','soup']])

             director                                               soup
0     Kirsten Johnson                movie kirstenjohnson  documentaries
1                 nan  tvshow  amaqamata khosingema gailmabalane inte...
2     Julien Leclercq  tvshow julienleclercq samibouajila tracygotoas...
3                 nan                      tvshow   docuseries realitytv
4                 nan  tvshow  mayurmore jitendrakumar ranjanraj inte...
...               ...                                                ...
8802    David Fincher  movie davidfincher markruffalo jakegyllenhaal ...
8803              nan          tvshow   kids'tv koreantvshows tvcomedies
8804  Ruben Fleischer  movie rubenfleischer jesseeisenberg woodyharre...
8805     Peter Hewitt  movie peterhewitt timallen courteneycox chevyc...
8806      Mozez Singh  movie mozezsingh vickykaushal sarah-janedias r...

[8807 rows x 2 columns]


In [273]:
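# Remove extra spaces & tvshow/movie words in soup column
# Caveat: str.replace removes substrings, so 'tvshow' and 'movie' are also stripped from inside genre tokens
# (e.g. 'koreantvshows' becomes 'koreans' in the output of the next cell)
def clean_soup(soup):
    cleaned_soup = ' '.join(soup.split())
    return cleaned_soup

netflix_metadata['soup'] = netflix_metadata['soup'].apply(clean_soup)
netflix_metadata['soup'] = netflix_metadata['soup'].str.replace('tvshow', '').str.replace('movie', '').str.strip()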

In [274]:
print(netflix_metadata.soup)

0                            kirstenjohnson documentaries
1       amaqamata khosingema gailmabalane internationa...
2       julienleclercq samibouajila tracygotoas samuel...
3                                    docuseries realitytv
4       mayurmore jitendrakumar ranjanraj internationa...
                              ...                        
8802    davidfincher markruffalo jakegyllenhaal robert...
8803                           kids'tv koreans tvcomedies
8804    rubenfleischer jesseeisenberg woodyharrelson e...
8805    peterhewitt timallen courteneycox chevychase c...
8806    mozezsingh vickykaushal sarah-janedias raaghav...
Name: soup, Length: 8807, dtype: object
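
Row 8803 above reads `koreans` where the earlier soup had `koreantvshows`, because removing the substring `tvshow` also cuts into genre tokens. The sketch below shows one possible way to remove only the standalone type tokens with a word-boundary regex. It is an alternative under test on two sample strings, not the code used to produce the outputs in this notebook.

```python
import pandas as pd

# Two sample soups in the same shape as the ones printed earlier
soup = pd.Series([
    "tvshow kids'tv koreantvshows tvcomedies",
    "movie kirstenjohnson documentaries",
])

# Substring removal (the approach used above): also alters 'koreantvshows'
substring_removed = (
    soup.str.replace("tvshow", "", regex=False)
        .str.replace("movie", "", regex=False)
        .str.strip()
)

# Whole-token removal: \b only matches where the token starts and ends
whole_token_removed = soup.str.replace(
    r"\b(?:tvshow|movie)\b", "", regex=True
).str.strip()

print(substring_removed.tolist())
print(whole_token_removed.tolist())
```

Switching to the whole-token version would change the soup text and therefore the vocabulary size reported for the TF-IDF matrix.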


In [275]:
# Generate TfIdfVectorizer matrix 
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_meta = TfidfVectorizer()
tfidf_meta_matrix = tfidf_meta.fit_transform(netflix_metadata['soup'])

In [276]:
tfidf_meta_matrix.shape

(8807, 18844)
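
One detail worth knowing about this vocabulary: the default tokenizer of `TfidfVectorizer` only keeps runs of two or more word characters, so tokens that contain punctuation are split. The sketch below inspects this with `build_analyzer()` on two tokens taken from the soup output above; the `token_pattern=r"\S+"` alternative is an assumption about what one might prefer, not something this notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two tokens that appear in the soup column printed earlier
sample = "kids'tv sarah-janedias"

# Default analyzer: lowercases, then keeps runs of 2+ word characters
default_tokens = TfidfVectorizer().build_analyzer()(sample)

# Alternative: treat every whitespace-separated chunk as one token
whole_tokens = TfidfVectorizer(token_pattern=r"\S+").build_analyzer()(sample)

print(default_tokens)
print(whole_tokens)
```

With the default settings, a hyphenated actor name contributes two separate vocabulary terms, which partly reintroduces the name-splitting that the space removal was meant to prevent.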

In [278]:
# Define function that takes keywords (genre, director name and/or actor name) and returns similar titles

def get_keywords_recommendations(keywords, n):

    # Normalize keywords the same way as the soup: lowercase, spaces removed within each name
    keywords = keywords.lower()
    if "," in keywords:
        keywords = keywords.split(",")
        keywords = " ".join(["".join(k.strip().split()) for k in keywords])
    else:
        keywords = keywords.replace(" ", "")

    # Look for titles whose soup contains the keywords as whole tokens
    regex_pattern = r'\b' + re.escape(keywords) + r'\b'
    matches = netflix_metadata[netflix_metadata['soup'].str.contains(regex_pattern, case=False, regex=True)]

    # If there are exact matches, return up to n of them
    if not matches.empty:
        recomm = matches['title'][:n].tolist()
    else:
        # If no exact matches, rank all titles by cosine similarity to the keywords
        keywords_vector = tfidf_meta.transform([keywords]) # vectorize keywords
        result = cosine_similarity(keywords_vector, tfidf_meta_matrix) # compute cosine similarity
        similar_key_movies = sorted(enumerate(result[0]), reverse=True, key=lambda x: x[1]) # most similar first
        # The query is not a row of the matrix, so keep the top result (start the slice at 0)
        recomm = [netflix_metadata.iloc[i[0]]['title'] for i in similar_key_movies[:n]]

    return recomm
    

In [280]:
get_keywords_recommendations("Christopher Nolan", 3)

['Inception']