<a href="https://colab.research.google.com/github/dileepb0503/dileepb0503/blob/main/movieRecommendation_ContentBasedFiltering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Content-based filtering

##Content-based filtering is tailored to the individual user: it takes into account the viewer's previous activity and the ratings they gave to different movies
##The goal is to look at movies the user liked and recommend similar ones based on genre, cast, crew (director, music director etc.), title, tagline etc.

###![](https://i1.wp.com/astig.ph/wp-content/uploads/2016/01/netflix-philippines-catalog.jpg)

##Importing libraries and loading dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
movies=pd.read_csv("/content/drive/MyDrive/movies_data/tmdb_5000_movies.csv/tmdb_5000_movies.csv")
credits=pd.read_csv("/content/drive/MyDrive/movies_data/tmdb_5000_credits.csv/tmdb_5000_credits.csv")
print(movies.shape,credits.shape)

(4803, 20) (4803, 4)


##Data preprocessing & visualization

In [None]:
#rename movie_id to id so that both dataframes can be merged on id
credits.rename(columns={"movie_id":"id"},inplace=True)
movies= movies.merge(credits,on='id')
print(movies.shape)

(4803, 23)


In [None]:
relevant_cols=["id","title_x","genres","popularity","vote_average","vote_count","cast","crew","keywords"]
movies=movies.loc[:,relevant_cols]
movies.shape

(4803, 9)

In [None]:
movies.rename(columns={"title_x":"title","vote_average":"rating"},inplace=True)
print(movies.columns)
movies.shape

Index(['id', 'title', 'genres', 'popularity', 'rating', 'vote_count', 'cast',
       'crew', 'keywords'],
      dtype='object')


(4803, 9)

Let's make a note of the features which are critical to examining the similarity of movies:



1)keywords

2)title

3)genre

4)cast

5)crew

First, let's convert the keywords and title to a bag of words using tokenisation and stop-word removal

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import random
#the tokenizer models and stop-word list may need a one-time download, e.g. nltk.download('punkt') and nltk.download('stopwords')
row_id=random.randrange(100)                          #sample the keywords of one of the first 100 movies
sample=movies.loc[row_id,'keywords']
#split into words
tokens = word_tokenize(sample)
#tokenisation simply splits the text into individual words and punctuation marks; it does not reduce words to their stems
#remove all tokens that are not alphabetic(eg.punctuation,numbers etc.)
words = [word for word in tokens if word.isalpha()]
words=list(set(words))
#'id' and 'name' are field names in the raw keywords string,not actual keywords
if('id' in words):
  words.remove('id')
if('name' in words):
  words.remove('name')
#Now,let's remove stopwords i.e. common words like I,am,he,it,on etc. which don't contribute much to the meaning of the sentence
stop_words=set(stopwords.words("english"))
filtered_words=[w for w in words if w not in stop_words]
print("The keywords of the movie:",movies.loc[row_id,'title'],"are",str(filtered_words))

The keywords of the movie: Batman v Superman: Dawn of Justice are ['comic', 'super', 'revenge', 'bruce', 'wayne', 'based', 'powers', 'dc', 'comics', 'book', 'extended', 'clark', 'universe', 'kent', 'vigilante', 'superhero']
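
The same tokenise, de-duplicate and filter steps can be sketched with the standard library alone, which may help when the NLTK data is not available. The regex tokenizer and the tiny `STOP_WORDS` set below are stand-ins chosen for illustration, not NLTK's `word_tokenize` or its English stop-word list, so results can differ on real data. The sample string is made up and its ids are placeholders.

```python
import re

# Hypothetical, deliberately small stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "of", "on", "the", "to", "in", "is", "it"}

def bag_of_words(text, drop=("id", "name")):
    """Lowercase the text, keep alphabetic tokens, de-duplicate, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    seen = []
    for tok in tokens:
        if tok in STOP_WORDS or tok in drop or tok in seen:
            continue
        seen.append(tok)  # keep first-seen order so the output is reproducible
    return seen

# Made-up sample in the same layout as the keywords column.
sample = '[{"id": 1, "name": "dc comics"}, {"id": 2, "name": "superhero"}]'
print(bag_of_words(sample))
```

Unlike `list(set(words))`, this sketch preserves the order in which words first appear, which makes repeated runs easier to compare.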


In [None]:
#Now,let's create a function for generating a list of key words from the keywords string
def get_keywords(keywords):
  tokens = word_tokenize(keywords.lower())
  #tokenisation simply splits the text into individual words and punctuation marks; it does not reduce words to their stems
  #remove all tokens that are not alphabetic(eg.punctuation,numbers etc.)
  words = [word for word in tokens if word.isalpha()]
  words=list(set(words))
  #'id' and 'name' are field names in the raw keywords string,not actual keywords
  if('id' in words):
    words.remove('id')
  if('name' in words):
    words.remove('name')
  #Now,let's remove stopwords i.e. common words like I,am,he,it,on etc. which don't contribute much to the meaning of the sentence
  stop_words=set(stopwords.words("english"))
  filtered_words=[w for w in words if w not in stop_words]
  return filtered_words

def get_title(title):
  tokens = word_tokenize(title.lower())
  #remove all tokens that are not alphabetic(eg.punctuation,numbers etc.)
  words = [word for word in tokens if word.isalpha()]
  words=list(set(words))
  #remove stopwords,as in get_keywords
  stop_words=set(stopwords.words("english"))
  filtered_words=[w for w in words if w not in stop_words]
  return filtered_words

#let's test these functions
row_id=int((100*random.random())//1)
print("Title:",movies.loc[row_id,"title"],'| Title(bag of words)',str(get_title(movies.loc[row_id,"title"])),'| keywords(bag of words)',str(get_keywords(movies.loc[row_id,"keywords"])))

Title: Independence Day: Resurgence | Title(bag of words) ['independence', 'day', 'resurgence'] | keywords(bag of words) ['invasion', 'history', 'alternate', 'alien']


In [None]:
#The genres,cast and crew columns are stored as strings; first,let's write functions that convert them to lists of dictionaries
def process_genre(s):
  id_str=s.split(", ")[0]
  genre_str=s.split(", ")[-1]
  genre={}
  genre["id"]=int(id_str.split(':')[-1])
  genre["name"]=(genre_str.split(':')[-1])[2:-1]
  return genre

def get_genres(s):
  if(len(s)<5):
    return []                                           #empty entry: return an empty list,matching the normal return type
  genres=s.split("}, {")
  genres[0]=genres[0][2:]
  genres[-1]=genres[-1][:-2]
  genre_dicts=[]
  for genre in genres:
    genre_dicts.append(process_genre(genre))
  return genre_dicts

def process_actor(s):
  id_str=s.split(", ")[0]
  actor_str=s.split(", ")[-2]
  actor={}
  actor["id"]=int(id_str.split(':')[-1])
  actor["name"]=(actor_str.split(':')[-1])[2:-1]
  return actor

def get_cast(s):
  if(len(s)<5):
    return []                                           #empty entry: return an empty list,matching the normal return type
  cast=s.split("}, {")
  cast[0]=cast[0][2:]
  cast[-1]=cast[-1][:-2]
  cast_dicts=[]
  for actor in cast:
    cast_dicts.append(process_actor(actor))
  return cast_dicts

def process_crew(s):
  id_str=s.split(", ")[-3]
  job_str=s.split(", ")[-2]
  name_str=s.split(", ")[-1]
  crew={}
  crew["id"]=id_str.split(':')[-1]
  crew["job"]=(job_str.split(':')[-1])[2:-1]
  crew["name"]=(name_str.split(':')[-1])[2:-1]
  return crew

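def get_crew(s):
  #every crew member is parsed here; only the director is kept later,since other roles like cameraman,producer,dubbing artist etc. seem less relevant
  if(len(s)<5):
    return []                                           #empty entry: return an empty list,matching the normal return type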
  crew=s.split("}, {")
  crew[0]=crew[0][2:]
  crew[-1]=crew[-1][:-2]
  crew_dicts=[]
  for crew_member in crew:
    crew_dicts.append(process_crew(crew_member))
  return crew_dicts

In [None]:
#now,let's build lists of genre,cast and director names

def get_genre_list(s):
  genre_dicts=get_genres(s)
  genre_list=[genre["name"] for genre in genre_dicts]
  return genre_list

def get_cast_list(s):
  cast_dicts=get_cast(s)
  cast_list=[actor["name"] for actor in cast_dicts]
  return cast_list

def get_crew_list(s):
  crew_dicts=get_crew(s)
  crew_list=[crew_member["name"] for crew_member in crew_dicts if crew_member["job"]=="Director"]
  return crew_list

get_crew_list(movies.loc[3,"crew"])

['Christopher Nolan']
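
The string-splitting helpers above work on this dataset, but the `\u00fb` escape that shows up in cast names suggests the columns hold JSON arrays. Assuming that is the case, `json.loads` from the standard library offers a shorter route that does not depend on the order of fields. The helper names `names_from_json` and `directors_from_json`, the sample strings and their ids are illustrative only.

```python
import json

def names_from_json(s):
    """Return the "name" field of every object in a JSON array string."""
    if not s:
        return []
    return [item["name"] for item in json.loads(s)]

def directors_from_json(s):
    """Return the names of crew entries whose "job" is exactly "Director"."""
    if not s:
        return []
    return [item["name"] for item in json.loads(s) if item.get("job") == "Director"]

# Made-up samples in the assumed layout; ids and the producer name are placeholders.
crew = ('[{"id": 1, "job": "Director", "name": "Christopher Nolan"}, '
        '{"id": 2, "job": "Producer", "name": "Example Producer"}]')
print(directors_from_json(crew))
print(names_from_json('[{"id": 3, "name": "Sh\\u00fb Nakajima"}]'))
```

A side effect of parsing the JSON is that escapes such as `\u00fb` are decoded into the actual character, so names compare correctly across movies.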

Now let's test all these lists for a movie

In [None]:
row_id=int((100*random.random())//1)
print("TITLE:",movies.loc[row_id,"title"])
print("KEYWORDS:",str(get_keywords(movies.loc[row_id,"keywords"])))
print("GENRES:",str(get_genre_list(movies.loc[row_id,"genres"])))
print("CAST:",str(get_cast_list(movies.loc[row_id,"cast"])))
print("DIRECTOR:",str(get_crew_list(movies.loc[row_id,"crew"])))

TITLE: 47 Ronin
KEYWORDS: ['based', 'true', 'breed', 'japan', 'story', 'half', 'sword', 'shogun', 'samurai', 'suicide', 'ronin']
GENRES: ['Drama', 'Action', 'Adventure', 'Fantasy']
CAST: ['Keanu Reeves', 'Hiroyuki Sanada', 'Kou Shibasaki', 'Tadanobu Asano', 'Min Tanaka', 'Rinko Kikuchi', 'Jin Akanishi', 'Masayoshi Haneda', 'Hiroshi Sogabe', 'Takato Yonemoto', 'Sh\\u00fb Nakajima', 'Hiroshi Yamada', 'Cary-Hiroyuki Tagawa', 'Tanroh Ishida', 'Yorick van Wageningen', 'Ron Bottitta', 'Natsuki Kunimoto', 'Togo Igawa', 'Akira Koieyama', 'Haruka Abe', 'Clyde Kusatsu', 'Junichi Kajioka', 'Masashi Fujimoto', 'Neil Fingleton']
DIRECTOR: ['Carl Rinsch']


Now, let's develop a similarity measure between any two arbitrary movies

In [None]:
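def get_sim(list1,list2):
  #fraction of the elements of list1 that also appear in list2
  if(len(list1)==0):
    return 0                                            #avoid division by zero when a feature list is empty
  ctr=0
  for ele in list1:
    if ele in list2:
      ctr+=1
  return ctr/len(list1)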

def get_similarity(i,j):
  #let's get the features of the movie at row i
  past_title=get_title(movies.loc[i,"title"])
  past_keywords=get_keywords(movies.loc[i,"keywords"])
  past_genres=get_genre_list(movies.loc[i,"genres"])
  past_cast=get_cast_list(movies.loc[i,"cast"])
  past_directors=get_crew_list(movies.loc[i,"crew"])
  #Now,let's get the features of the movie at row j
  present_title=get_title(movies.loc[j,"title"])
  present_keywords=get_keywords(movies.loc[j,"keywords"])
  present_genres=get_genre_list(movies.loc[j,"genres"])
  present_cast=get_cast_list(movies.loc[j,"cast"])
  present_directors=get_crew_list(movies.loc[j,"crew"])
  similarity=0
  weights=[5,5,3,10,2]
  similarity+=weights[0]*get_sim(past_title,present_title)
  similarity+=weights[1]*get_sim(past_keywords,present_keywords)
  similarity+=weights[2]*get_sim(past_genres,present_genres)
  similarity+=weights[3]*get_sim(past_cast,present_cast)
  similarity+=weights[4]*get_sim(past_directors,present_directors)
  return similarity/sum(weights)
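
To make the scoring concrete, here is a small standalone sketch of the same idea on made-up feature lists. `overlap` mirrors `get_sim` (the share of the first list's items found in the second) with a guard for empty lists, and `weighted_similarity` combines the per-feature scores using the weights from the cell above. The function names, the `WEIGHTS` dictionary and both sample movies are illustrative and not taken from the dataset.

```python
# Same weights as in get_similarity, keyed by feature name for readability.
WEIGHTS = {"title": 5, "keywords": 5, "genres": 3, "cast": 10, "directors": 2}

def overlap(list1, list2):
    """Fraction of list1's elements that also appear in list2 (0 if list1 is empty)."""
    if not list1:
        return 0.0
    return sum(1 for ele in list1 if ele in list2) / len(list1)

def weighted_similarity(past, present, weights=WEIGHTS):
    """Weighted average of the per-feature overlap scores."""
    total = sum(w * overlap(past[f], present[f]) for f, w in weights.items())
    return total / sum(weights.values())

# Made-up feature lists for two hypothetical movies.
movie_a = {"title": ["mission", "impossible"], "keywords": ["spy", "heist"],
           "genres": ["Action", "Thriller"], "cast": ["Actor One", "Actor Two"],
           "directors": ["Director One"]}
movie_b = {"title": ["mission", "impossible", "ii"], "keywords": ["spy"],
           "genres": ["Action"], "cast": ["Actor One"],
           "directors": ["Director Two"]}

print(weighted_similarity(movie_a, movie_b))
print(weighted_similarity(movie_b, movie_a))
```

The two printed values differ (about 0.56 and about 0.85), which shows that the measure is asymmetric: it is normalised by the length of the first movie's lists, so swapping the arguments can change the score.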

In [None]:
def get_rowId(movie):
  for row_id in range(movies.shape[0]):
    if(movie==movies.loc[row_id,"title"]):
      return row_id
  return -1

def get_recommendation(watch_history,rating_history=-1):
  #Here,watch_history is a list of previously watched movies & rating_history holds the corresponding ratings given by the user(if no ratings were given,feed -1)
  if np.isscalar(rating_history):
    rating_history=[1]*len(watch_history)               #no ratings given: weigh every watched movie equally
  similarity_scores=np.zeros((movies.shape[0],len(watch_history)))
  rowIds=[]
  for past_movie in watch_history:
    row_id=get_rowId(past_movie)
    if row_id==-1:
      raise ValueError("Movie not found in the dataset: "+str(past_movie))
    rowIds.append(row_id)
  for idx,past_movie_id in enumerate(rowIds):
    for row_id in range(movies.shape[0]):
      if row_id not in rowIds:
        similarity_scores[row_id,idx]=get_similarity(past_movie_id,row_id)
  #weigh each similarity by the rating of the watched movie and sum over the watch history
  cumulative_similarity_score=np.dot(similarity_scores,np.array(rating_history))
  #Shortlist the 20 movies with the highest cumulative similarity scores,then recommend the 10 most popular among them
  recommended_rowIds=np.flip(cumulative_similarity_score.argsort()[-20:])
  top_movies=movies.loc[recommended_rowIds,:]
  recommendations=list(top_movies.nlargest(10,'popularity').loc[:,'title'])
  print("****TOP PERSONALIZED RECOMMENDATIONS****")
  for idx,movie in enumerate(recommendations):
    print(idx+1,')',movie)

In [None]:
get_recommendation(["Mission: Impossible","Insidious"],[5,1])

****TOP PERSONALIZED RECOMMENDATIONS****
1 ) Mission: Impossible - Rogue Nation
2 ) Mission: Impossible - Ghost Protocol
3 ) Scarface
4 ) Mission: Impossible III
5 ) Mission: Impossible II
6 ) Dr. No
7 ) Tomorrow Never Dies
8 ) From Russia with Love
9 ) Ronin
10 ) Live and Let Die
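
The ranking step multiplies each candidate's similarity to every watched movie by the rating the user gave that movie and sums the results, which is what `np.dot` computes in `get_recommendation`. A minimal pure-Python sketch of that aggregation follows; the helper names, candidate labels and similarity values are placeholders, not values from the dataset.

```python
def cumulative_scores(similarity_rows, ratings):
    """Rating-weighted sum of similarities, one score per candidate row."""
    return [sum(s * r for s, r in zip(row, ratings)) for row in similarity_rows]

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

candidates = ["Candidate A", "Candidate B", "Candidate C"]
# One row per candidate, one column per watched movie (made-up similarities).
similarity_rows = [[0.8, 0.1], [0.2, 0.9], [0.5, 0.5]]
ratings = [5, 1]  # first watched movie rated 5, second rated 1

scores = cumulative_scores(similarity_rows, ratings)
print([candidates[i] for i in top_k(scores, 2)])
```

Candidates close to the highly rated movie dominate the ranking, which matches the output above. Because a rating of 1 still contributes a positive amount, a low rating only weakens a movie's influence rather than penalising titles similar to it.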


In [None]:
get_rowId("")

-1

In [None]:
print(movies.loc[:100,"title"])

0                                        Avatar
1      Pirates of the Caribbean: At World's End
2                                       Spectre
3                         The Dark Knight Rises
4                                   John Carter
                         ...                   
96                                    Inception
97                                Shin Godzilla
98            The Hobbit: An Unexpected Journey
99                     The Fast and the Furious
100         The Curious Case of Benjamin Button
Name: title, Length: 101, dtype: object
