In this notebook, we build a movie recommendation system. Given a movie title, the system recommends other movies whose tags are most similar to those of the given movie. Similarity between the tags of two movies is measured with cosine similarity.
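
To make the idea of cosine similarity between tags concrete, here is a minimal sketch in plain Python. It is not the notebook's implementation: the helper `cosine_similarity` and the three sample tag strings are made up for illustration, and each tag string is treated as a simple bag of word counts.

```python
import math
from collections import Counter

def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between two whitespace-separated tag strings (hypothetical helper)."""
    counts_a = Counter(tags_a.split())
    counts_b = Counter(tags_b.split())
    # Dot product over the words that appear in both tag strings
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty tag string is similar to nothing
    return dot / (norm_a * norm_b)

# Made-up tag strings, for illustration only
sample_tags = {
    "movie_a": "action adventure space alien",
    "movie_b": "action space alien war",
    "movie_c": "romance comedy wedding",
}
sim_ab = cosine_similarity(sample_tags["movie_a"], sample_tags["movie_b"])
sim_ac = cosine_similarity(sample_tags["movie_a"], sample_tags["movie_c"])
print(sim_ab, sim_ac)
```

Movies that share many tags score close to 1, and movies with no tags in common score 0, so the recommendations for a title are the movies with the highest scores against it.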

In [42]:
import numpy as np
import pandas as pd

movies = pd.read_csv('tmdb_5000_movies.csv')
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [43]:
credits = pd.read_csv('tmdb_5000_credits.csv')
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [44]:
#Merge both data sets
movies = movies.merge(credits, on = 'title')
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')
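
One thing to keep in mind when merging on `title`: if two different films share a title, every row with that title on one side is paired with every row with that title on the other. The sketch below uses made-up toy frames (`toy_movies` and `toy_credits` are hypothetical, not the TMDB files) to compare a merge on the title with a merge on the id columns.

```python
import pandas as pd

# Toy data (hypothetical): two different films share the title "Batman"
toy_movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast of film 1", "cast of film 2", "cast of film 3"],
})

# Merging on the title pairs each "Batman" row with both "Batman" credits rows
by_title = toy_movies.merge(toy_credits, on="title")

# Merging on the id columns keeps exactly one row per film
by_id = toy_movies.merge(
    toy_credits.drop(columns="title"), left_on="id", right_on="movie_id"
)
print(len(by_title), len(by_id))
```

Here the title merge yields 5 rows for 3 films, while the id merge yields 3. Whether this matters for the recommender depends on how many titles are duplicated in the real data.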

In [45]:
#Feature Selection
movies_df = movies[['id', 'title', 'overview', 'genres', 'cast', 'crew', 'keywords']]
movies_df.head(1)

Unnamed: 0,id,title,overview,genres,cast,crew,keywords
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":..."


In [87]:
#Check the selected columns for null values
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4806 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        4806 non-null   int64 
 1   title     4806 non-null   object
 2   overview  4806 non-null   object
 3   genres    4806 non-null   object
 4   cast      4806 non-null   object
 5   crew      4806 non-null   object
 6   keywords  4806 non-null   object
dtypes: int64(1), object(6)
memory usage: 429.4+ KB


In [8]:
movies_df.isna().any()

id          False
title       False
overview     True
genres      False
cast        False
crew        False
keywords    False
dtype: bool

In [46]:
#Drop 3 rows with NULL values
movies_df.dropna(inplace = True)
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4806 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        4806 non-null   int64 
 1   title     4806 non-null   object
 2   overview  4806 non-null   object
 3   genres    4806 non-null   object
 4   cast      4806 non-null   object
 5   crew      4806 non-null   object
 6   keywords  4806 non-null   object
dtypes: int64(1), object(6)
memory usage: 300.4+ KB


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_df.dropna(inplace = True)
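
The warning above appears because `movies_df` was created by selecting columns from `movies`, so pandas cannot tell whether a later in-place change is meant for the selection or for the original frame. One common way to avoid it, sketched here on a made-up frame (`toy` and `subset` are hypothetical names), is to take an explicit copy and reassign results instead of using `inplace=True`.

```python
import pandas as pd

# Toy frame (hypothetical) with one missing overview
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "overview": ["text", None, "more text"],
    "budget": [10, 20, 30],
})

# .copy() makes the column selection an independent DataFrame
subset = toy[["id", "title", "overview"]].copy()

# Reassigning the result avoids modifying a possible view in place
subset = subset.dropna()
subset["title"] = subset["title"].str.lower()
print(subset)
```

The original frame `toy` is left untouched, and later column assignments on `subset` no longer trigger the warning.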


In [55]:
movies_df.head(1)

Unnamed: 0,id,title,overview,genres,cast,crew,keywords
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":..."


In [10]:
#Data pre-processing
#The genres and keywords columns hold strings that encode a list of dictionaries.
movies_df.genres.iloc[0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [11]:
movies_df.keywords.iloc[0]

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [47]:
#Convert each string-encoded list of dictionaries into a list of its 'name' values
import ast #Parses the string into real Python objects, see the example below

def convertdict_list(obj):
    outlst = []
    for i in ast.literal_eval(obj):
        outlst.append(i['name'])
    return outlst

In [59]:
#Python treats this value as a string, but we need it parsed as a dictionary
dictionary = "{'id': 28, 'name': 'Action'}"
print(type(dictionary))

<class 'str'>


In [60]:
#Make Python parse the string as a dictionary
import ast
dictionary = ast.literal_eval("{'a': 1, 'b': 2}")
print(type(dictionary))

<class 'dict'>
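
Since the genres and keywords strings shown earlier use double quotes, they also look like valid JSON, so the standard `json` module is a possible alternative to `ast.literal_eval`. This is a sketch under that assumption (that every row is valid JSON); `names_from_json` is a hypothetical helper, not part of the notebook.

```python
import json

def names_from_json(text):
    """Return the 'name' value of every object in a JSON-encoded list (hypothetical helper)."""
    return [item["name"] for item in json.loads(text)]

# Shortened sample in the same shape as the genres column
sample_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
genre_names = names_from_json(sample_genres)
print(genre_names)
```

One practical difference: `json.loads` accepts JSON literals such as `null` and `true`, which `ast.literal_eval` rejects because they are not Python literals.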


In [48]:
#Convert genres from a string-encoded list of dictionaries to a list of names
movies_df['genres'] = movies_df.genres.apply(convertdict_list)
movies_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_df['genres'] = movies_df.genres.apply(convertdict_list)


Unnamed: 0,id,title,overview,genres,cast,crew,keywords
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":..."


In [49]:
#Convert keywords from a string-encoded list of dictionaries to a list of names
movies_df['keywords'] = movies_df.keywords.apply(convertdict_list)
movies_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_df['keywords'] = movies_df.keywords.apply(convertdict_list)


Unnamed: 0,id,title,overview,genres,cast,crew,keywords
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[culture clash, future, space war, space colon..."


In [50]:
movies_df.genres

0       [Action, Adventure, Fantasy, Science Fiction]
1                        [Adventure, Fantasy, Action]
2                          [Action, Adventure, Crime]
3                    [Action, Crime, Drama, Thriller]
4                [Action, Adventure, Science Fiction]
                            ...                      
4804                        [Action, Crime, Thriller]
4805                                [Comedy, Romance]
4806               [Comedy, Drama, Romance, TV Movie]
4807                                               []
4808                                    [Documentary]
Name: genres, Length: 4806, dtype: object

In [51]:
movies_df.keywords

0       [culture clash, future, space war, space colon...
1       [ocean, drug abuse, exotic island, east india ...
2       [spy, based on novel, secret agent, sequel, mi...
3       [dc comics, crime fighter, terrorist, secret i...
4       [based on novel, mars, medallion, space travel...
                              ...                        
4804    [united states–mexico barrier, legs, arms, pap...
4805                                                   []
4806    [date, love at first sight, narration, investi...
4807                                                   []
4808            [obsession, camcorder, crush, dream girl]
Name: keywords, Length: 4806, dtype: object

In [52]:
#Get the first 4 names from the cast list
movies_df.cast.iloc[0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [53]:
#Function to get the first 4 names in the cast
import ast
def extract_top4(obj):
    #Slicing keeps at most the first 4 entries, in the order they are listed
    return [i['name'] for i in ast.literal_eval(obj)[:4]]
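
The function above relies on the cast already being listed in billing order. Each cast entry also carries an `order` field, so a variation could sort on it explicitly. This is only a sketch: `top_billed` is a hypothetical helper, and it assumes `order` reflects billing and is present in every entry.

```python
import ast

def top_billed(obj, n=4):
    """Return the names of the n cast members with the lowest 'order' value (hypothetical helper)."""
    cast = ast.literal_eval(obj)
    # Sort explicitly instead of trusting the order of the list
    cast.sort(key=lambda member: member["order"])
    return [member["name"] for member in cast[:n]]

# Shortened, deliberately shuffled sample in the shape of the cast column
sample_cast = (
    '[{"name": "Zoe Saldana", "order": 1}, '
    '{"name": "Sam Worthington", "order": 0}, '
    '{"name": "Sigourney Weaver", "order": 2}]'
)
billed = top_billed(sample_cast, n=2)
print(billed)
```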

In [54]:
movies_df['cast'] = movies_df.cast.apply(extract_top4)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_df['cast'] = movies_df.cast.apply(extract_top4)


In [55]:
movies_df.cast.iloc[0]

['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver', 'Stephen Lang']

In [56]:
#Extract only director from the crew list
movies_df.crew.iloc[0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [64]:
def extract_director(obj):
    #A film can have several directors, so collect every matching name
    directorList = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            directorList.append(i['name'])
    return directorList
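
As a quick check of what the director extraction returns, the sketch below restates the same logic as a list comprehension and runs it on shortened sample strings (the samples are trimmed by hand from the shape of the crew column, not read from the file).

```python
import ast

def extract_director(obj):
    # Same logic as the loop version: keep the name of every crew member whose job is 'Director'
    return [member["name"] for member in ast.literal_eval(obj) if member["job"] == "Director"]

sample_crew = (
    '[{"department": "Editing", "job": "Editor", "name": "Stephen E. Rivkin"}, '
    '{"department": "Directing", "job": "Director", "name": "James Cameron"}]'
)
directors = extract_director(sample_crew)
no_director = extract_director('[{"job": "Editor", "name": "Stephen E. Rivkin"}]')
print(directors, no_director)
```

The result is always a list: it is empty when no director is credited and holds several names when a film has more than one director.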

In [65]:
movies_df['crew'] = movies_df.crew.apply(extract_director)
movies_df.crew

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_df['crew'] = movies_df.crew.apply(extract_director)


0                                [James Cameron]
1                               [Gore Verbinski]
2                                   [Sam Mendes]
3                            [Christopher Nolan]
4                               [Andrew Stanton]
                          ...                   
4804                          [Robert Rodriguez]
4805                              [Edward Burns]
4806                               [Scott Smith]
4807                               [Daniel Hsia]
4808    [Brian Herzlinger, Jon Gunn, Brett Winn]
Name: crew, Length: 4806, dtype: object

In [70]:
#Inspect an overview before turning it into tags; stop words like 'a', 'an', 'the' are to be removed
movies_df.overview.iloc[1]

'Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But nothing is quite as it seems.'

In [71]:
#We will use lambda functions. A lambda is a small anonymous function that can be written inline where it is needed. For example:
x = lambda a : a + 10
print(x(5))

15


In [74]:
#Split each overview into a list of words
movies_df['overview'] = movies_df.overview.apply(lambda x:x.split())
movies_df.overview.iloc[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_df['overview'] = movies_df.overview.apply(lambda x:x.split())


['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.']
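
Note that splitting on whitespace leaves punctuation attached to words, as in `'century,'` and `'civilization.'` above. If that turns out to matter, one possible variation is a regular-expression tokenizer. The helper `tokenize` below is hypothetical and is only a sketch of that idea.

```python
import re

def tokenize(text):
    """Split text into words, dropping surrounding punctuation (hypothetical helper)."""
    # Keep runs of letters, digits, underscores, straight apostrophes and hyphens
    return re.findall(r"[\w'-]+", text)

tokens = tokenize("In the 22nd century, a war-weary Marine.")
print(tokens)
```

This keeps hyphenated words such as `war-weary` together while dropping the trailing comma and full stop.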

In [84]:
#Preview the data before replacing whitespace with underscores
movies_df.head(5)

Unnamed: 0,id,title,overview,genres,cast,crew,keywords
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron],"[culture clash, future, space war, space colon..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski],"[ocean, drug abuse, exotic island, east india ..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes],"[spy, based on novel, secret agent, sequel, mi..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[Christian Bale, Michael Caine, Gary Oldman, A...",[Christopher Nolan],"[dc comics, crime fighter, terrorist, secret i..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, Science Fiction]","[Taylor Kitsch, Lynn Collins, Samantha Morton,...",[Andrew Stanton],"[based on novel, mars, medallion, space travel..."


In [85]:
movies_df.genres[0]

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

In [86]:
movies_df.genres = movies_df.genres.apply(lambda x :[i.replace(' ', '_') for i in x])
movies_df.genres[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


['Action', 'Adventure', 'Fantasy', 'Science_Fiction']
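
The reason for the underscores: the tags are later joined into one string and split on whitespace again, so a multi-word name would otherwise fall apart into unrelated words. A small sketch with made-up variable names shows the difference.

```python
names = ["Science Fiction", "Sam Worthington"]

# Joining and re-splitting on whitespace breaks multi-word names apart
plain_tokens = " ".join(names).split()

# With underscores, each name survives as a single token
joined_tokens = " ".join(n.replace(" ", "_") for n in names).split()

print(plain_tokens)
print(joined_tokens)
```

Without the replacement, `Sam` and `Worthington` become separate tags, so a film could match on any actor named Sam rather than on this particular actor.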

In [88]:
#Do the same for keywords, cast, crew
movies_df.keywords = movies_df.keywords.apply(lambda x :[i.replace(' ', '_') for i in x])
movies_df.cast = movies_df.cast.apply(lambda x :[i.replace(' ', '_') for i in x])
movies_df.crew = movies_df.crew.apply(lambda x :[i.replace(' ', '_') for i in x])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_df['tags'] = movies_df['genres'] + movies_df['keywords'] + movies_df['cast'] + movies_df['crew']


['Action',
 'Adventure',
 'Fantasy',
 'Science_Fiction',
 'culture_clash',
 'future',
 'space_war',
 'space_colony',
 'society',
 'space_travel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alien_planet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'love_affair',
 'anti_war',
 'power_relations',
 'mind_and_soul',
 '3d',
 'Sam_Worthington',
 'Zoe_Saldana',
 'Sigourney_Weaver',
 'Stephen_Lang',
 'James_Cameron']

In [90]:
#Create a tags column by concatenating overview, genres, keywords, cast and crew
movies_df['tags'] = movies_df['overview'] + movies_df['genres'] + movies_df['keywords'] + movies_df['cast'] + movies_df['crew']
movies_df.tags.iloc[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_df['tags'] = movies_df['overview'] + movies_df['genres'] + movies_df['keywords'] + movies_df['cast'] + movies_df['crew']


['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.',
 'Action',
 'Adventure',
 'Fantasy',
 'Science_Fiction',
 'culture_clash',
 'future',
 'space_war',
 'space_colony',
 'society',
 'space_travel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alien_planet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'love_affair',
 'anti_war',
 'power_relations',
 'mind_and_soul',
 '3d',
 'Sam_Worthington',
 'Zoe_Saldana',
 'Sigourney_Weaver',
 'Stephen_Lang',
 'James_Cameron']

In [92]:
#Join each list of tags into a single space-separated string
movies_df.tags = movies_df.tags.apply(lambda x: ' '.join(x))
movies_df.tags.iloc[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy Science_Fiction culture_clash future space_war space_colony society space_travel futuristic romance space alien tribe alien_planet cgi marine soldier battle love_affair anti_war power_relations mind_and_soul 3d Sam_Worthington Zoe_Saldana Sigourney_Weaver Stephen_Lang James_Cameron'

In [94]:
#Keep only the columns needed for stemming, which reduces words to their root form
new_df = movies_df[['id', 'title', 'tags']]
new_df.head(1)

Unnamed: 0,id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."


In [96]:
#Before stemming convert words into lowercase
new_df.title = new_df.title.apply(lambda x: x.lower())
new_df.tags = new_df.tags.apply(lambda x: x.lower())
new_df.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


Unnamed: 0,id,title,tags
0,19995,avatar,"in the 22nd century, a paraplegic marine is di..."


In [97]:
#A for loop written on a single line
#Approach 1: as a statement
for i in range(10): print(i)

0
1
2
3
4
5
6
7
8
9


In [98]:
#Approach 2: a list comprehension, if you want the results in a list
print([i for i in range(10)])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [99]:
#Use PorterStemmer from nltk.stem.porter to stem the words
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem(text):
    #Split the text into words, stem each word, and join the stems back together
    return ' '.join(ps.stem(i) for i in text.split())

In [101]:
new_df.tags = new_df['tags'].apply(stem)
new_df.tags.iloc[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi science_fict culture_clash futur space_war space_coloni societi space_travel futurist romanc space alien tribe alien_planet cgi marin soldier battl love_affair anti_war power_rel mind_and_soul 3d sam_worthington zoe_saldana sigourney_weav stephen_lang james_cameron'