# **Movies recommender system**

**Necessary libraries**

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.stem import PorterStemmer

import ast

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics.pairwise import cosine_similarity

import pickle

**Extracting data**

Access the dataset from a local Jupyter notebook

In [2]:
data_path = "dataset"

movies = pd.read_csv(f"{data_path}/movies.csv")
credits = pd.read_csv(f"{data_path}/credits.csv")

From Google Colab

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
movies = pd.read_csv('/content/drive/MyDrive/dataset/tmdb_5000_movies.csv')
credits = pd.read_csv('/content/drive/MyDrive/dataset/tmdb_5000_credits.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/dataset/tmdb_5000_movies.csv'

**Data analyzing and preprocessing**

For this project we use two tables:


1.   **Table movies:** data about the movie itself, such as:
`movie_id,title,overview,genres,keywords,release_date`

2.   **Table credits:** data about who made the movie:
`title,cast,crew`



In [4]:
movies.head(4)
movies.shape

(4803, 20)

In [5]:
credits.head(4)
credits.shape

(4803, 4)

Combine both datasets into one by joining rows on the `title` column (`movies.title == credits.title`).

In [6]:
movies = pd.merge(movies, credits, on='title')
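
Because the join key is `title` rather than a unique ID, films that share a title are paired in every combination. The sketch below uses made-up frames (`left`, `right`, and all their values are hypothetical, not taken from the dataset) to show how such a join can grow the row count, and how the `validate` argument of `pd.merge` can flag it.

```python
import pandas as pd

# Hypothetical toy data: two different films share the title "Twins".
left = pd.DataFrame({"title": ["Twins", "Twins", "Solo"], "budget": [1, 2, 3]})
right = pd.DataFrame({"title": ["Twins", "Twins", "Solo"], "movie_id": [10, 20, 30]})

merged = pd.merge(left, right, on="title")
print(len(left), len(merged))  # the duplicated title is matched 2 x 2 times

# validate="one_to_one" raises MergeError when the key is not unique on both sides.
try:
    pd.merge(left, right, on="title", validate="one_to_one")
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False
print(unique_keys)
```

If both tables carry a unique identifier, joining on that identifier instead would avoid the extra rows.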

In [7]:
#keep important columns for recommendation
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [8]:
movies.isnull().sum() #checking for missing values

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [9]:
movies.dropna(inplace=True) #remove missing values from the dataset
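
`dropna` removes rows but keeps the original index labels, so labels and row positions no longer agree afterwards. A small sketch with a made-up frame (`df` and its values are hypothetical) shows the gap, with `reset_index(drop=True)` as one way to realign them:

```python
import pandas as pd

df = pd.DataFrame({"title": ["A", "B", "C"], "overview": ["x", None, "z"]})
df = df.dropna()

print(list(df.index))                 # labels keep a gap where row "B" was dropped
label = df[df["title"] == "C"].index[0]
position = df.index.get_loc(label)
print(label, position)                # the label and the row position differ

df = df.reset_index(drop=True)        # labels match positions again
print(list(df.index))
```

This matters whenever a label from the DataFrame is used to index a NumPy array such as the similarity matrix, which only understands positions.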

In [10]:
movies.duplicated().sum()  #Count the number of duplicate rows in the df

np.int64(0)

In [11]:
#from string to list
def convert(text):
    return [i['name'] for i in ast.literal_eval(text)]
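
To see what `convert` does, here is a toy run on a hand-written string with the same shape as the `genres` column. The sample value and its ids are illustrative, not copied from the dataset.

```python
import ast

def convert(text):
    return [i['name'] for i in ast.literal_eval(text)]

# Hypothetical sample: a list of dicts stored as text.
raw = '[{"id": 1, "name": "Action"}, {"id": 2, "name": "Science Fiction"}]'

parsed = ast.literal_eval(raw)   # str -> list of dicts; only literals are evaluated
print(type(parsed).__name__, len(parsed))

names = convert(raw)
print(names)
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals and will not run arbitrary expressions found in the file.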

In [12]:
movies['genres'] = movies['genres'].apply(convert)   # turn each genres string into a list of genre names

In [13]:
movies['keywords'] = movies['keywords'].apply(convert)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [14]:
def convert_cast(text):
    return [i['name'] for i in ast.literal_eval(text)[:5]]


In [15]:
movies['cast'] = movies['cast'].apply(convert_cast)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman, A...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [16]:
def get_director(text):  #extract director’s name from the crew
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            return [i['name']]
    return []
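
A quick check of `get_director` on two hand-written crew strings (both hypothetical) shows the two possible outcomes. Returning a list in both cases, even an empty one, keeps the column list-valued so it can be concatenated with the other list columns later.

```python
import ast

def get_director(text):
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            return [i['name']]
    return []

# Hypothetical crew entries, not taken from the dataset.
with_director = '[{"job": "Editor", "name": "Ann Lee"}, {"job": "Director", "name": "Sam Roe"}]'
without_director = '[{"job": "Editor", "name": "Ann Lee"}]'

found = get_director(with_director)
missing = get_director(without_director)
print(found, missing)
```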

In [17]:
movies['crew'] = movies['crew'].apply(get_director)

Convert overview from a single string to a list of words, so that it has the same list form as the other columns

In [18]:
movies.iloc[1]['overview']

'Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But nothing is quite as it seems.'

In [19]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())
movies.sample(4)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
2481,30890,Radio Days,"[The, Narrator, (Woody, Allen), tells, us, how...","[Comedy, Drama]","[beach, taxi driver, world war ii, radio, cone...","[Woody Allen, Mia Farrow, Jeff Daniels, Larry ...",[Woody Allen]
3466,10622,Mr. Nice Guy,"[A, Chinese, chef, accidentally, gets, involve...","[Crime, Action, Comedy]","[journalist, martial arts, cook, drug dealer]","[Jackie Chan, Richard Norton, Miki Lee, Gabrie...",[Sammo Hung]
541,9425,Soldier,"[Sergeant, Todd, is, a, veteran, soldier, for,...","[Action, War, Science Fiction]","[space marine, dystopia, alien planet, genetic...","[Kurt Russell, Jason Scott Lee, Jason Isaacs, ...",[Paul W.S. Anderson]
1624,9452,Bulworth,"[A, suicidally, disillusioned, liberal, politi...","[Comedy, Drama]","[mission of murder, politics, election campaig...","[Warren Beatty, Halle Berry, Sean Astin, Chris...",[Warren Beatty]


In [20]:
movies.iloc[1]['overview']

['Captain',
 'Barbossa,',
 'long',
 'believed',
 'to',
 'be',
 'dead,',
 'has',
 'come',
 'back',
 'to',
 'life',
 'and',
 'is',
 'headed',
 'to',
 'the',
 'edge',
 'of',
 'the',
 'Earth',
 'with',
 'Will',
 'Turner',
 'and',
 'Elizabeth',
 'Swann.',
 'But',
 'nothing',
 'is',
 'quite',
 'as',
 'it',
 'seems.']

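We remove the spaces inside multi-word names and keywords. Otherwise the vectorizer would split a name into separate tokens (first name and last name), and two films could look similar only because, for example, both feature someone named Johnny, which would lead to misleading recommendations.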

In [27]:
def remove_space(L):
    return [i.replace(" ", "") for i in L]
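
The effect can be illustrated without scikit-learn by applying the regular expression that `CountVectorizer` uses as its default `token_pattern` (`(?u)\b\w\w+\b`). The helper `tokens` and the two one-person cast lists are hypothetical and only serve the illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default token_pattern

def tokens(text):
    # Lowercase, then split into word tokens of two or more characters.
    return TOKEN_PATTERN.findall(text.lower())

def remove_space(L):
    return [i.replace(" ", "") for i in L]

cast_a = ["Johnny Depp"]
cast_b = ["Johnny Knoxville"]

shared_before = set(tokens(" ".join(cast_a))) & set(tokens(" ".join(cast_b)))
shared_after = (set(tokens(" ".join(remove_space(cast_a))))
                & set(tokens(" ".join(remove_space(cast_b)))))

print(shared_before)  # the first name alone creates a spurious match
print(shared_after)   # no overlap once each name is a single token
```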

**Some data preprocessing to improve vectorization**

In [30]:
movies['cast'] = movies['cast'].apply(remove_space)
movies['crew'] = movies['crew'].apply(remove_space)
movies['genres'] = movies['genres'].apply(remove_space)
movies['keywords'] = movies['keywords'].apply(remove_space)

In [33]:
# Concatenation into one list per movie to get tags
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [35]:
movies.iloc[1]['tags']

['Captain',
 'Barbossa,',
 'long',
 'believed',
 'to',
 'be',
 'dead,',
 'has',
 'come',
 'back',
 'to',
 'life',
 'and',
 'is',
 'headed',
 'to',
 'the',
 'edge',
 'of',
 'the',
 'Earth',
 'with',
 'Will',
 'Turner',
 'and',
 'Elizabeth',
 'Swann.',
 'But',
 'nothing',
 'is',
 'quite',
 'as',
 'it',
 'seems.',
 'Adventure',
 'Fantasy',
 'Action',
 'ocean',
 'drugabuse',
 'exoticisland',
 'eastindiatradingcompany',
 "loveofone'slife",
 'traitor',
 'shipwreck',
 'strongwoman',
 'ship',
 'alliance',
 'calypso',
 'afterlife',
 'fighter',
 'pirate',
 'swashbuckler',
 'aftercreditsstinger',
 'JohnnyDepp',
 'OrlandoBloom',
 'KeiraKnightley',
 'StellanSkarsgård',
 'ChowYun-fat',
 'JohnnyDepp',
 'OrlandoBloom',
 'KeiraKnightley',
 'StellanSkarsgård',
 'ChowYun-fat']

In [34]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...","[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...","[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...","[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman, Anne...","[ChristianBale, MichaelCaine, GaryOldman, Anne...","[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...","[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...","[John, Carter, is, a, war-weary,, former, mili..."


In [36]:
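# create a new dataframe with only the needed columns;
# .copy() makes it independent of `movies`, so later assignments do not trigger SettingWithCopyWarning
new_DF = movies[['movie_id','title','tags']].copy()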

In [37]:
# Converting list to str
new_DF['tags'] = new_DF['tags'].apply(lambda x: " ".join(x))
new_DF.head()


Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [38]:
#lower case - text cleaning
new_DF['tags'] = new_DF['tags'].apply(lambda x:x.lower())


In [39]:
new_DF.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [40]:
new_DF.iloc[1]['tags']

"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley stellanskarsgård chowyun-fat johnnydepp orlandobloom keiraknightley stellanskarsgård chowyun-fat"

Next we apply **stemming**, the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, so that different forms of the same word are matched as one token.

In [35]:
ps = PorterStemmer()

In [36]:
def stems(text):
    T = []

    for i in text.split():
        T.append(ps.stem(i)) # apply stemming to each word

    return " ".join(T)
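
As a small illustration of why stemming helps matching, the sketch below stems several forms of one verb and a short phrase. The word list `variants` is a hypothetical example, and `stems` is restated here in a compact form so the snippet runs on its own.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stems(text):
    # Stem every whitespace-separated word and join the results back together.
    return " ".join(ps.stem(word) for word in text.split())

variants = ["dance", "dances", "danced", "dancing"]
stemmed = [ps.stem(w) for w in variants]
print(stemmed)                       # all four forms share one stem

phrase = stems("following orders")
print(phrase)
```

Stems are not always dictionary words; they only need to be consistent so that related forms count as the same feature.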

In [None]:
new_DF['tags'] = new_DF['tags'].apply(stems) #tags now contains stemmed tokens


In [None]:
new_DF.iloc[0]['tags']

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav stephenlang michellerodriguez jamescameron'

In this step, we transform the text in `tags` (overview, genres, keywords, cast, and director) into numeric form using `CountVectorizer`.

It builds a vocabulary of the 5000 most frequent words across all tags, ignoring English stop words, and represents each movie as a vector of word counts over that vocabulary. We then compute the **similarity** between every pair of movies using `cosine_similarity`, which measures how closely two count vectors point in the same direction.

The result is a square matrix with one similarity score for each pair of films. This matrix is then used to suggest films that are similar to the one the user chooses.
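
The idea can be worked through by hand on a toy example. The sketch below does not use scikit-learn: it counts words with `collections.Counter` and applies the cosine formula, dot product divided by the product of the vector lengths. The three tag strings and the helper `cosine` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical tags for three films.
tags = {
    "film_a": "action adventure space alien",
    "film_b": "action space war alien",
    "film_c": "romance comedy wedding",
}
counts = {name: Counter(text.split()) for name, text in tags.items()}

sim_ab = cosine(counts["film_a"], counts["film_b"])  # three shared words out of four
sim_ac = cosine(counts["film_a"], counts["film_c"])  # no shared words
print(sim_ab, sim_ac)
```

Films with many shared words score close to 1, and films with no shared words score 0.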


In [None]:
cv = CountVectorizer(max_features=5000,stop_words='english')


In [None]:
vector = cv.fit_transform(new_DF['tags']).toarray()

In [42]:
vector.shape

(4806, 5000)

In [45]:
similarity = cosine_similarity(vector)

In [46]:
similarity.shape

(4806, 4806)

In [None]:
new_DF[new_DF['title'] == 'The Lego Movie'].index[0]

np.int64(744)

**Recommendation function**

In [None]:
def recommend(movie):
    # Row positions (not index labels) of case-insensitive title matches;
    # `similarity` is a NumPy array, so it must be indexed by position
    matches = np.flatnonzero((new_DF['title'].str.lower() == movie.lower()).to_numpy())
    if matches.size == 0:
        print(f"Movie '{movie}' not found in dataset.") #check if the movie exists
        return

    index = matches[0] # position of the first matching row
    distances = sorted(enumerate(similarity[index]), reverse=True, key=lambda x: x[1])  #sort by similarity, closest first
    for i in distances[1:6]: # skip the first entry, normally the movie itself
        print(new_DF.iloc[i[0]].title)
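
A variant that returns the titles instead of printing them can be easier to reuse, for instance from another script. The sketch below is not the notebook's function: `top_k_similar`, the toy `titles` list, and the toy `sim` matrix are all hypothetical. It excludes the queried film by position rather than assuming it is always ranked first.

```python
import numpy as np

def top_k_similar(sim, titles, title, k=5):
    """Return up to k titles most similar to `title`, best match first."""
    lowered = [t.lower() for t in titles]
    if title.lower() not in lowered:
        return []                            # unknown title: nothing to recommend
    pos = lowered.index(title.lower())
    order = np.argsort(-sim[pos])            # positions sorted by descending similarity
    return [titles[i] for i in order if i != pos][:k]

# Hypothetical symmetric similarity matrix for four films.
titles = ["Alpha", "Beta", "Gamma", "Delta"]
sim = np.array([
    [1.0, 0.2, 0.7, 0.4],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
])

result = top_k_similar(sim, titles, "alpha", k=2)
print(result)
```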


In [53]:
recommend('avatar')

Aliens vs Predator: Requiem
Independence Day
Falcon Rising
Battle: Los Angeles
Titan A.E.


Finally, we save the DataFrame and the similarity matrix with `pickle`. Loading the pickled files is much faster than recomputing everything (data cleaning, vectorizing, similarity, ...) every time we restart.

In [None]:
with open('dataset/savedmodels/movie_list.pkl', 'wb') as f:
    pickle.dump(new_DF, f)
with open('dataset/savedmodels/similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
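
Reading the files back in a later session mirrors the save step. The round trip below is a minimal sketch that uses a small hypothetical `payload` and a temporary file in place of the real DataFrame, similarity matrix, and paths.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the saved objects.
payload = {"titles": ["Alpha", "Beta"], "similarity": [[1.0, 0.2], [0.2, 1.0]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")
    with open(path, "wb") as f:          # save
        pickle.dump(payload, f)
    with open(path, "rb") as f:          # load, e.g. after a restart
        restored = pickle.load(f)

print(restored == payload)
```

Only unpickle files from a trusted source, since `pickle.load` can execute code embedded in the file.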