<a href="https://colab.research.google.com/github/bryaanabraham/Movie-Recommender-System/blob/main/Movie_Recommender_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [246]:
import numpy as np
import pandas as pd

Dataset can be found at: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata/?select=tmdb_5000_movies.csv

In [247]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

In [248]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [249]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [250]:
movies = movies.merge(credits, on='title')

In [251]:
movies = movies[['id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

In [252]:
movies.head(1)

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


## Preprocessing data

Checking for missing and duplicated data



In [253]:
movies.isnull().sum()

id          0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [254]:
movies.dropna(inplace=True)

In [255]:
movies.duplicated().sum()

0

In [256]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

Each of these columns stores its values as a JSON-like string, so we simplify it to keep only the relevant names <br>
E.g. genres should be reduced to ["Action", "Adventure", "Fantasy", "Science Fiction"]

In [257]:
import ast
# ast.literal_eval parses the string into a Python list of dicts
# without it we would iterate over the raw string character by character,
# and data['name'] would raise "TypeError: string indices must be integers"
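
To make the need for parsing concrete, here is a minimal sketch on a shortened sample string in the same format as a raw `genres` cell. The variable names (`raw`, `parsed`, `names`, `raw_error`) are illustrative only and are not part of the notebook.

```python
import ast

# Shortened sample in the same format as a raw 'genres' cell (a string, not a list)
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# Iterating over the raw string yields single characters, so ['name'] fails
try:
    [item['name'] for item in raw]
    raw_error = None
except TypeError as exc:
    raw_error = exc  # "string indices must be integers ..."

# literal_eval evaluates the literal into a real list of dicts
parsed = ast.literal_eval(raw)
names = [item['name'] for item in parsed]
print(names)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for text read from a CSV file.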

In [258]:
def convert(obj):
  L = []
  for data in ast.literal_eval(obj):
    L.append(data['name'])
  return L

In [259]:
movies['genres'] = movies['genres'].apply(convert)

In [260]:
movies['keywords'] = movies['keywords'].apply(convert)

We will only use the first 6 cast members for building the model

In [261]:
def convert6(obj):
  L = []
  counter = 0
  for data in ast.literal_eval(obj):
    if counter != 6:
      L.append(data['name'])
      counter+=1
    else:
      break
  return L
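
The counter-based loop above could also be written with a slice. The sketch below is one possible equivalent; `first_names` and its `limit` parameter are hypothetical names introduced here for illustration, and the sample string is made up.

```python
import ast

def first_names(obj, limit=6):
    """Return the 'name' of at most `limit` entries (hypothetical helper)."""
    # Slicing a list never raises, even when it has fewer than `limit` items
    return [entry['name'] for entry in ast.literal_eval(obj)[:limit]]

# Made-up sample in the same format as a raw 'cast' cell
sample = '[{"name": "A"}, {"name": "B"}, {"name": "C"}]'
print(first_names(sample, limit=2))
print(first_names(sample))
```

The slice keeps the behaviour of `convert6` for movies with fewer than six cast members: it simply returns every name that is present.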

In [262]:
movies['cast'] = movies['cast'].apply(convert6)

In [263]:
movies.head()

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman, A...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


The director does not appear at a fixed position in the 'crew' list, so we cannot reuse the convert function; instead we search for the entry whose job is 'Director'

In [264]:
def director(obj):
  L =[]
  for data in ast.literal_eval(obj):
    if data['job'] == 'Director':
      L.append(data['name'])
      break
  return L

In [265]:
movies['crew'] = movies['crew'].apply(director)

We will split each 'overview' string into a list of individual words so that it can be combined with the other list-valued columns and compared for similarities

In [266]:
movies['overview'] = movies['overview'].apply(lambda x: x.split())

In [267]:
movies.head()

Unnamed: 0,id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman, A...",[Christopher Nolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...",[Andrew Stanton]


It is necessary to remove the spaces inside names and other multi-word values; otherwise two different data points that share a first or last name (e.g. Sam Worthington and Sam Mendes) would be matched on the shared word

In [268]:
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(" ","") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(" ","") for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(" ","") for i in x])
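
A small sketch of the mismatch this step prevents, using two names that appear in the data above. The variable names are illustrative only.

```python
cast_a = ["Sam Worthington"]   # cast member of one movie
crew_b = ["Sam Mendes"]        # director of a different movie

# With spaces kept, splitting into words makes both movies share the token "Sam"
tokens_a = set(" ".join(cast_a).split())
tokens_b = set(" ".join(crew_b).split())
shared_with_spaces = tokens_a & tokens_b

# With spaces removed, each full name becomes a single distinct token
joined_a = {name.replace(" ", "") for name in cast_a}
joined_b = {name.replace(" ", "") for name in crew_b}
shared_without_spaces = joined_a & joined_b

print(shared_with_spaces, shared_without_spaces)
```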

In [269]:
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [270]:
movies.head()

Unnamed: 0,id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman, Anne...",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


In [271]:
tagged_movies = movies[['id', 'title', 'tags']]

In [272]:
tagged_movies.head()

Unnamed: 0,id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [273]:
# assign() returns a new DataFrame, so we do not write into a slice of `movies`
# (writing into the slice is what triggers pandas' SettingWithCopyWarning)
tagged_movies = tagged_movies.assign(tags=tagged_movies['tags'].apply(lambda x: " ".join(x)))


In [274]:
tagged_movies['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver StephenLang MichelleRodriguez GiovanniRibisi JamesCameron'

#### Converting to lower case
If some letters are in upper case, the same word may be treated as two different words because of the capitalization mismatch

In [275]:
tagged_movies['tags'] = tagged_movies['tags'].apply(lambda x: x.lower())



In [276]:
tagged_movies['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver stephenlang michellerodriguez giovanniribisi jamescameron'

### Using the bag-of-words technique, we will convert the text of 'tags' into vectors

Process: <br>
We combine the 'tags' of all rows into one large text and find the most frequent words (here the 5000 most frequent, excluding English stop words)<br>
For each movie we fill a table row with how many times each of these words has occurred in its tags<br>
Each row has now become a vector

In [277]:
from sklearn.feature_extraction.text import CountVectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [278]:
cv = CountVectorizer(max_features=5000,stop_words='english')

In [279]:
cv.fit_transform(tagged_movies['tags']).toarray().shape

(4806, 5000)

In [280]:
vectors = cv.fit_transform(tagged_movies['tags']).toarray()

In [281]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [282]:
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

We can see that the vocabulary contains several forms of the same word, so we follow the procedure below to prune them

* To group similar words we use stemming<br>
Eg: [ ['actor', 'actors'] , ['dance', 'danced'] ]

* Stemming reduces words to their root form<br>
Eg: dance, dancing, danced -> danc
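
A quick sketch of the examples above with NLTK's Porter stemmer. `toy_ps`, `words` and `stems` are illustrative names; the word list simply repeats the examples from the text.

```python
from nltk.stem.porter import PorterStemmer

toy_ps = PorterStemmer()

# The example words from the bullets above
words = ["dance", "dancing", "danced", "actor", "actors"]
stems = [toy_ps.stem(word) for word in words]

for word, stemmed in zip(words, stems):
    print(word, "->", stemmed)
```

Note that a stem such as `danc` does not have to be a dictionary word; it only needs to be the same for every form of the word.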

In [283]:
from nltk.stem.porter import PorterStemmer

In [284]:
ps = PorterStemmer()

In [285]:
def stem(text):
  y = []
  for i in text.split():
    y.append(ps.stem(i))
  return " ".join(y)

In [286]:
tagged_movies['tags'] = tagged_movies['tags'].apply(stem)



In [287]:
vectors = cv.fit_transform(tagged_movies['tags']).toarray()

In [288]:
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

Since Euclidean distance becomes less informative in high-dimensional spaces (here we are dealing with 4806 vectors of 5000 dimensions each), we use cosine similarity, which compares the angle between two vectors rather than their magnitudes
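
A minimal numeric sketch of the difference, using two made-up count vectors that point in the same direction but differ in length (as could happen for two movies with the same word proportions but tag strings of very different sizes):

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = 10 * a  # same direction, ten times the magnitude

# Euclidean distance grows with the difference in magnitude
euclidean = np.linalg.norm(a - b)

# Cosine similarity depends only on the angle between the vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, cosine)
```

The Euclidean distance is large even though the two vectors have identical word proportions, while the cosine similarity is 1 up to floating-point rounding.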

In [289]:
from sklearn.metrics.pairwise import cosine_similarity

In [292]:
similarity = cosine_similarity(vectors)

In [293]:
similarity.shape

(4806, 4806)

We cannot simply sort a row of the matrix to find the most similar movies, because the position of each value identifies the movie it refers to<br>
We therefore use enumerate and convert the result to a list; this gives a list of (index, similarity) tuples, with indices starting at 0, so each score keeps its movie index after sorting
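
The idea can be sketched on a made-up similarity row of four movies. The `np.argsort` variant is an alternative shown for comparison, not what the notebook uses.

```python
import numpy as np

# Made-up similarity row for movie 0 (its similarity with itself is 1.0)
scores = np.array([1.0, 0.2, 0.7, 0.4])

# enumerate pairs every score with its position before sorting
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Skip the first entry (the movie itself) and keep the next two indices
top_indices = [index for index, _ in ranked[1:3]]

# Alternative: argsort returns the positions directly (negate for descending order)
alt_indices = np.argsort(-scores)[1:3].tolist()

print(top_indices, alt_indices)
```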

In [295]:
#similarity with 'Avatar'
list(enumerate(similarity[0]))

[(0, 1.0000000000000002),
 (1, 0.07897471897389846),
 (2, 0.08492077756084468),
 (3, 0.0711680683578165),
 (4, 0.18300360434867097),
 (5, 0.10604746049596296),
 (6, 0.03942082639927217),
 (7, 0.1401425507601155),
 (8, 0.055749467333806056),
 (9, 0.09342836717341034),
 (10, 0.09767725810110087),
 (11, 0.09245003270420486),
 (12, 0.08770580193070292),
 (13, 0.04358136336404089),
 (14, 0.1235415527768502),
 (15, 0.06201736729460422),
 (16, 0.07692307692307693),
 (17, 0.13592873654160803),
 (18, 0.09370791409872918),
 (19, 0.08338763984501665),
 (20, 0.05617667256463243),
 (21, 0.10390486669322622),
 (22, 0.06827887419989188),
 (23, 0.08362420100070908),
 (24, 0.05195243334661311),
 (25, 0.033757978902788886),
 (26, 0.14824986333222023),
 (27, 0.18452078282356418),
 (28, 0.10675210253672475),
 (29, 0.06419407387663695),
 (30, 0.06537204504606134),
 (31, 0.14867525836251314),
 (32, 0.08200923681047297),
 (33, 0.09078412990032037),
 (34, 0.0),
 (35, 0.09078412990032037),
 (36, 0.167093470609

In [296]:
sorted(list(enumerate(similarity[0])), reverse = True, key = lambda x:x[1])

[(0, 1.0000000000000002),
 (1216, 0.2850557338444825),
 (507, 0.25231028011870793),
 (3730, 0.2506402059138015),
 (582, 0.24194822861802104),
 (539, 0.24019223070763068),
 (2409, 0.2360960823249428),
 (61, 0.23372319715296228),
 (1194, 0.22417941532712204),
 (778, 0.22205779584216376),
 (4048, 0.2204155075111935),
 (2786, 0.21982600255746473),
 (1920, 0.21977383072747692),
 (1204, 0.2172620473133704),
 (1444, 0.20672455764868075),
 (172, 0.20483662259967567),
 (2333, 0.20380986614602725),
 (2971, 0.203362958695521),
 (322, 0.20254787341673336),
 (74, 0.20180183819889372),
 (3608, 0.20174251088960077),
 (972, 0.20131905799006777),
 (1089, 0.20073126386549828),
 (151, 0.20033416898825337),
 (4192, 0.20033416898825335),
 (577, 0.19814848097530424),
 (973, 0.19512313566832118),
 (260, 0.19223226273338137),
 (3999, 0.1921537845661046),
 (3675, 0.19096396641051544),
 (47, 0.19003715589632164),
 (3327, 0.1887128390240993),
 (4405, 0.18490006540840973),
 (27, 0.18452078282356418),
 (305, 0.184

Here we have calculated the cosine similarity between every pair of vectors and thus received a symmetric 4806x4806 matrix whose diagonal entries are all 1 (up to floating-point rounding), since every movie is perfectly similar to itself
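
These two properties can be checked on a small example. The 3x3 count matrix `toy` below is made up; the same checks could be run on the real `similarity` matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up count vectors, none of them all zeros
toy = np.array([[1, 0, 2],
                [0, 3, 1],
                [2, 1, 0]])

sim = cosine_similarity(toy)

# allclose tolerates rounding such as 1.0000000000000002
is_symmetric = np.allclose(sim, sim.T)
unit_diagonal = np.allclose(np.diag(sim), 1.0)

print(sim.shape, is_symmetric, unit_diagonal)
```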

In [301]:
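def recommend(movie):
  # Positional row of the movie; index labels have gaps after dropna(),
  # so they cannot be used to index `similarity` directly
  matches = np.flatnonzero((tagged_movies['title'] == movie).to_numpy())
  if len(matches) == 0:
    print(f"'{movie}' was not found in the dataset")
    return
  movie_index = matches[0]
  distances = similarity[movie_index]
  # Sort by similarity (highest first) and skip the top entry, normally the movie itself
  movies_list = sorted(list(enumerate(distances)), reverse = True, key = lambda x:x[1])[1:6]

  for data in movies_list:
    print (tagged_movies.iloc[data[0]].title)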

In [305]:
#example
recommend('Avatar')

Aliens vs Predator: Requiem
Independence Day
Falcon Rising
Battle: Los Angeles
Titan A.E.
