# **Movies recommender system**

**Necessary imports**

In [5]:
import numpy as np
import pandas as pd

**Extracting data**

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
movies = pd.read_csv('/content/drive/MyDrive/dataset/tmdb_5000_movies.csv')
credits = pd.read_csv('/content/drive/MyDrive/dataset/tmdb_5000_credits.csv')

**Data analyzing and preprocessing**

For this project we used two tables:


1.   **Table movies:** data about the movie itself, such as:
`id,title,overview,genres,keywords,release_date`

2.   **Table credits:** data about who made the movie:
`movie_id,title,cast,crew`



In [8]:
movies.head(4)
movies.shape

(4803, 20)

In [9]:
credits.head(4)
credits.shape

(4803, 4)

Combine both datasets into one by joining the rows where `movies.title == credits.title`.

In [10]:
movies = movies.merge(credits,on='title')
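
Because `title` is not guaranteed to be unique, a merge on it matches every pair of rows that share a title, so the result can have more rows than either input. That would be consistent with the counts in this notebook: both inputs have 4803 rows, yet the feature matrix built later has 4806 rows even after 3 rows are dropped. The sketch below uses small hypothetical frames (`movies_demo` and `credits_demo` are illustrative names, not the TMDB data) to show the effect.

```python
import pandas as pd

# Hypothetical toy data: two different films share the title "Batman".
movies_demo = pd.DataFrame({"movie_id": [1, 2, 3],
                            "title": ["Batman", "Batman", "Avatar"]})
credits_demo = pd.DataFrame({"title": ["Batman", "Batman", "Avatar"],
                             "cast": ["cast A", "cast B", "cast C"]})

merged_demo = movies_demo.merge(credits_demo, on="title")

# Each "Batman" row on the left matches both "Batman" rows on the right:
# 2 x 2 = 4 rows, plus 1 row for "Avatar".
print(len(merged_demo))                          # 5
print(movies_demo["title"].duplicated().sum())   # 1 repeated title
```

If both tables carry a unique identifier, merging on that identifier instead of the title is one way to avoid the extra rows.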

In [11]:
# Keeping important columns for recommendation
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [12]:
movies.isnull().sum() #checking for missing values

Unnamed: 0,0
movie_id,0
title,0
overview,3
genres,0
keywords,0
cast,0
crew,0


In [13]:
movies.dropna(inplace=True) #remove missing values from the dataset

In [14]:
movies.duplicated().sum()  #Count the number of duplicate rows in the DataFrame.

np.int64(0)

In [15]:
import ast  # parses stringified lists into Python objects

def convert(text):
    L = []
    for i in ast.literal_eval(text):  # parse the string into a list of dictionaries
        L.append(i['name'])
    return L
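
To make the parsing step concrete, here is a minimal sketch of what `ast.literal_eval` does with a value shaped like the `genres` column. The string `raw` is a hypothetical sample written for illustration, not a row copied from the dataset.

```python
import ast

# Hypothetical sample in the same shape as a `genres` cell.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw)               # str -> list of dicts
names = [item["name"] for item in parsed]    # keep only the names
print(names)  # ['Action', 'Adventure']
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.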

In [16]:
movies['genres'] = movies['genres'].apply(convert)   # turn each genres string into a list of genre names

In [17]:
movies['keywords'] = movies['keywords'].apply(convert)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [18]:
# keep only the first 5 cast members listed for each movie
def convert_cast(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter < 5:
            L.append(i['name'])
        counter+=1
    return L

In [19]:
movies['cast'] = movies['cast'].apply(convert_cast)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman, A...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [20]:
def fetch_director(text):  # extract the first director's name from the crew list
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L

In [21]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [22]:
# inspect the overview before converting it to a list of words

movies.iloc[0]['overview']

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [23]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())
movies.sample(4)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
1768,28355,Case 39,"[In, her, many, years, as, a, social, worker,,...","[Horror, Mystery, Thriller]","[child abuse, detective, social worker, supern...","[Renée Zellweger, Jodelle Ferland, Ian McShane...",[Christian Alvart]
2848,14191,Aquamarine,"[Two, teenage, girls, discover, that, mermaids...","[Fantasy, Romance, Family, Comedy]","[female friendship, mermaid, teenager, woman d...","[Emma Roberts, Joanna 'JoJo' Levesque, Sara Pa...",[Elizabeth Allen Rosenbaum]
960,14199,The Adventures of Sharkboy and Lavagirl,"[Everyone, always, knew, that, Max, had, a, wi...","[Adventure, Family, Science Fiction]","[imaginary friend, outcast]","[Taylor Lautner, Taylor Dooley, Cayden Boyd, D...",[Robert Rodriguez]
3084,329440,The Forest,"[Set, in, the, Aokigahara, Forest,, a, real-li...","[Horror, Thriller]","[japan, forest]","[Natalie Dormer, Taylor Kinney, Yukiyoshi Ozaw...",[Jason Zada]


In [24]:
movies.iloc[0]['overview']

['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.']

In [25]:
# Remove the spaces inside each entry, e.g. 'Sam Worthington' -> 'SamWorthington'
def remove_space(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1

**Some data preprocessing to improve vectorization**

In [26]:
movies['cast'] = movies['cast'].apply(remove_space)
movies['crew'] = movies['crew'].apply(remove_space)
movies['genres'] = movies['genres'].apply(remove_space)
movies['keywords'] = movies['keywords'].apply(remove_space)
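
The reason for removing spaces can be shown with a small sketch. If names are split on whitespace, two unrelated people who share a first name produce a common token, which would later count as similarity between their films. The helper `tokens` below is a hypothetical stand-in for the word splitting done during vectorization.

```python
def tokens(names, strip_spaces):
    """Lower-case the names and split them into whitespace-separated tokens."""
    out = set()
    for name in names:
        if strip_spaces:
            name = name.replace(" ", "")
        out.update(name.lower().split())
    return out

film_a = ["Sam Worthington"]   # a cast member
film_b = ["Sam Mendes"]        # a director

# With spaces kept, both films share the token 'sam'.
print(tokens(film_a, False) & tokens(film_b, False))  # {'sam'}
# With spaces removed, each person becomes a single distinct token.
print(tokens(film_a, True) & tokens(film_b, True))    # set()
```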

In [27]:
# Concatenation into one list per movie
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [28]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman, Anne...",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


In [29]:
# removing extra columns (take a copy so later assignments do not act on a slice of `movies`)
new_df = movies[['movie_id','title','tags']].copy()

In [30]:
# Converting list to str
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))
new_df.head()


Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [31]:
# Converting to lower case
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())


In [32]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [33]:
new_df.iloc[0]['tags']

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver stephenlang michellerodriguez jamescameron'

Next we apply **stemming**, the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, so that different forms of the same word match.

In [34]:
import nltk
from nltk.stem import PorterStemmer

In [35]:
ps = PorterStemmer()

In [36]:
def stems(text):
    T = []

    for i in text.split():
        T.append(ps.stem(i))  # apply stemming to each word

    return " ".join(T)
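
As a quick illustration of what the stemmer does to individual words, the sketch below stems three words taken from the Avatar overview. It assumes `nltk` is installed; `PorterStemmer` itself needs no extra data download.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["following", "orders", "protecting"]
stemmed = [ps.stem(w) for w in words]
print(stemmed)  # ['follow', 'order', 'protect']
```

Because the text is split on whitespace only, tokens that still carry punctuation (such as `century,`) are passed to the stemmer with the punctuation attached.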

In [37]:
new_df['tags'] = new_df['tags'].apply(stems) #tags now contains stemmed tokens


In [38]:
new_df.iloc[0]['tags']

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav stephenlang michellerodriguez jamescameron'

In this step, we transform the text in `tags` (overview, genres, keywords, cast, and director) into numeric form using `CountVectorizer`.

With `max_features=5000` and `stop_words='english'`, the vectorizer drops common English stop words, keeps the 5000 most frequent remaining terms across all movies, and represents each movie as a vector of counts over that vocabulary. We then compute the **similarity** between every pair of movies using `cosine_similarity`, which measures how closely two count vectors point in the same direction.

The outcome is a square matrix in which entry (i, j) is the similarity between movie i and movie j. This matrix is then used to suggest films that are similar to the one the user chooses.
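
To show the idea behind these two steps without any library, here is a minimal hand-rolled sketch of bag-of-words counting followed by cosine similarity. It is an illustration only, not how scikit-learn implements either function; `bag_of_words`, `cosine`, and the three tag strings are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two token-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical, already-stemmed tag strings.
x = bag_of_words("space alien battl marin")
y = bag_of_words("space alien invas")
z = bag_of_words("romanc comedi wed")

print(round(cosine(x, y), 3))  # 0.577 (two shared tokens)
print(cosine(x, z))            # 0.0   (no shared tokens)
```

Two films score higher the more tokens they share relative to the total number of tokens each one has, which is why cleaning and stemming the tags matters.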


In [39]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')


In [40]:
vector = cv.fit_transform(new_df['tags']).toarray()

In [42]:
vector.shape

(4806, 5000)

In [44]:
from sklearn.metrics.pairwise import cosine_similarity

In [45]:
similarity = cosine_similarity(vector)

In [46]:
similarity.shape

(4806, 4806)

In [47]:
new_df[new_df['title'] == 'The Lego Movie'].index[0]

np.int64(744)

**Recommendation function**

In [50]:
def recommend(movie):
    # Case-insensitive title match, returned as row positions. Positions line up with the
    # rows of `similarity`, whereas index labels can differ after dropna().
    matches = np.flatnonzero((new_df['title'].str.lower() == movie.lower()).to_numpy())
    if matches.size == 0:
        print(f"Movie '{movie}' not found in dataset.")  # the movie is not in the dataset
        return

    index = matches[0]  # position of the first matching title
    distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x: x[1])  # most similar first
    for i in distances[1:6]:  # skip the first entry, which is normally the movie itself
        print(new_df.iloc[i[0]].title)
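
The ranking step can also be written so that the query film is excluded explicitly rather than by skipping the first sorted entry. This matters when another film has identical tags, because a tie at the top could otherwise push the query film out of first place. The sketch below is one possible variant under that assumption; `top_k_similar` and the sample row are hypothetical.

```python
import heapq

def top_k_similar(sim_row, self_pos, k=5):
    """Return the positions of the k highest scores, excluding self_pos."""
    candidates = ((score, pos) for pos, score in enumerate(sim_row)
                  if pos != self_pos)
    return [pos for score, pos in heapq.nlargest(k, candidates)]

# Hypothetical similarity row for the film at position 0.
row = [1.0, 0.2, 0.7, 0.0, 0.5]
print(top_k_similar(row, self_pos=0, k=3))  # [2, 4, 1]
```

`heapq.nlargest` also avoids sorting the whole row when only a handful of results are needed.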


In [53]:
recommend('avatar')

Aliens vs Predator: Requiem
Independence Day
Falcon Rising
Battle: Los Angeles
Titan A.E.


Loading the pickled files is much faster than recomputing everything (data cleaning, vectorizing, similarity, ...) every time we restart, so we save the results to disk.

In [None]:
import pickle

In [None]:
with open('dataset/savedmodels/movie_list.pkl','wb') as f:
    pickle.dump(new_df, f)
with open('dataset/savedmodels/similarity.pkl','wb') as f:
    pickle.dump(similarity, f)
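
For completeness, here is a minimal sketch of the save and load round trip. It writes to a temporary directory and uses a small hypothetical list (`similarity_demo`) in place of the real matrix, so it can be run without the dataset.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real `similarity` matrix.
similarity_demo = [[1.0, 0.1], [0.1, 1.0]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "similarity.pkl")
    with open(path, "wb") as f:      # save
        pickle.dump(similarity_demo, f)
    with open(path, "rb") as f:      # load it back
        restored = pickle.load(f)

print(restored == similarity_demo)  # True
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.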