## MOVIE RECOMMENDATION SYSTEM

In today's digital landscape, finding the perfect movie to watch can be a daunting task due to the sheer volume of available options. To simplify this process, we introduce a movie recommendation system designed to help users discover films that match their preferences. Using two extensive datasets—movies and credits—our system implements two distinct recommendation models: a popularity-based model and a content-based model.

The popularity-based model uses overall ratings and vote counts to recommend highly acclaimed and trending movies, ensuring users are exposed to widely appreciated titles. The content-based model, by contrast, analyzes specific movie attributes, such as genre, director, and cast, to identify films similar to one the user has previously enjoyed. By integrating these models, our system delivers a well-rounded approach that accommodates both general trends and individual tastes, enhancing the movie discovery experience for users.

#### DATA GATHERING

Importing various libraries.

1. The NumPy and pandas libraries are essential for data manipulation and analysis in Python.
2. NumPy is used for numerical computing in Python. It provides support for arrays, matrices, and many mathematical functions.
3. Pandas is a powerful data manipulation library built on top of NumPy. It provides data structures like DataFrame and Series for efficient data manipulation.

In [82]:
import numpy as np
import pandas as pd

The movies dataset includes detailed information such as genres, budget, revenue, and ratings, while the credits dataset provides insights into the cast and crew involved in each movie. These datasets together form the foundation for building our recommendation models.

In [83]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [84]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [85]:
credits.head(2)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


We will first use the info method of pandas which provides a concise summary of a DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.

In [86]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [87]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


Renaming the 'id' column of the movies dataset to 'movie_id' so that it matches the key column of the credits dataset.

In [88]:
movies.rename(columns = {'id':'movie_id'}, inplace = True)

#### POPULARITY BASED MODEL

In the popularity-based model, we classify a movie as popular if it has at least 700 votes and an average rating of at least 7. We then take two genres as input and retrieve the 5 most popular movies that belong to both.

Keeping only those movies with a vote_count of at least 700 and a rating ('vote_average') of at least 7.

In [89]:
popular_movies = movies[(movies['vote_count']>=700) & (movies['vote_average']>=7)]

Keeping only specific columns for the popular_movies dataframe.

In [90]:
popular_movies = popular_movies[['movie_id','title','genres','popularity','vote_average','overview']]

In [91]:
popular_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 487 entries, 0 to 4773
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   movie_id      487 non-null    int64  
 1   title         487 non-null    object 
 2   genres        487 non-null    object 
 3   popularity    487 non-null    float64
 4   vote_average  487 non-null    float64
 5   overview      487 non-null    object 
dtypes: float64(2), int64(1), object(3)
memory usage: 26.6+ KB


Renaming column 'vote_average' as 'rating'.

In [92]:
popular_movies.rename(columns = {'vote_average':'rating'}, inplace = True)

Now we transform the 'genres' column to retain only the genre names as a list for each movie.

In [93]:
import json
def convert_genre(genre):
    genre = json.loads(genre)
    l = []
    for i in genre:
        l.append(i['name'])
    return l

In [94]:
popular_movies['genres'] = popular_movies['genres'].apply(convert_genre)
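
To make the transformation concrete, the sketch below applies the same parsing logic to the raw 'genres' value of the first movie in the dataset (Avatar). The helper name `genre_names` is hypothetical; it is only a compact list-comprehension variant of `convert_genre`.

```python
import json

# Raw 'genres' value of the first row (Avatar), as stored in the CSV.
raw = ('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
       '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

def genre_names(raw_genres):
    """Hypothetical helper: parse the JSON string and keep only the names."""
    return [entry['name'] for entry in json.loads(raw_genres)]

names = genre_names(raw)
print(names)  # ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
```

An empty JSON list (`'[]'`) simply yields an empty Python list, so movies without genres need no special handling.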

Resetting index of our popular_movies dataframe.

In [95]:
popular_movies = popular_movies.reset_index(drop = True)

Now, creating a sorted list of the distinct genres present in our dataset.

In [96]:
l = []
for i in range(popular_movies.shape[0]):
    for j in popular_movies['genres'].iloc[i]:
        l.append(j)
s = set(l)
l = list(s)
list_of_genres = sorted(l)

In [97]:
print(list_of_genres)

['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'Thriller', 'War', 'Western']
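
The same flatten-and-deduplicate step can be written more compactly. This is a minimal sketch on toy data: `genre_lists` is a hypothetical stand-in for the converted 'genres' column, and its values are illustrative only.

```python
# Toy stand-in for popular_movies['genres'] after conversion (illustrative values).
genre_lists = [['Action', 'Adventure'], ['Comedy', 'Animation'], ['Action', 'Comedy']]

# A set comprehension flattens the nested lists and removes duplicates in one pass;
# sorted() then returns the distinct genres as an alphabetically ordered list.
unique_genres = sorted({genre for genres in genre_lists for genre in genres})
print(unique_genres)  # ['Action', 'Adventure', 'Animation', 'Comedy']
```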


Now, creating a function that takes two genres and prints the top 5 most popular movies that belong to both of them.

In [98]:
def recommend(genre1, genre2):
    # Build the match list inside the function so that repeated calls start fresh.
    l = []
    for i in range(popular_movies.shape[0]):
        if (genre1 in popular_movies['genres'].iloc[i]) and (genre2 in popular_movies['genres'].iloc[i]):
            l.append(i)
    df = popular_movies.iloc[l]
    df = df.sort_values('popularity', ascending = False)
    df = df.reset_index()
    # head(5) avoids an IndexError when fewer than 5 movies match.
    for title in df['title'].head(5):
        print(title)

In [99]:
recommend('Comedy','Animation')

Big Hero 6
Despicable Me 2
Inside Out
Monsters, Inc.
How to Train Your Dragon 2
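
The selection logic can also be expressed without pandas, which makes the two steps (filter on both genres, then rank by popularity) easy to see. This is a sketch under stated assumptions: `toy_movies` and `top_by_genres` are hypothetical, and the titles and popularity values are made up.

```python
# Illustrative records only; titles and popularity values are made up.
toy_movies = [
    {'title': 'A', 'genres': ['Comedy', 'Animation'], 'popularity': 30.0},
    {'title': 'B', 'genres': ['Comedy'], 'popularity': 90.0},
    {'title': 'C', 'genres': ['Animation', 'Comedy', 'Family'], 'popularity': 75.0},
    {'title': 'D', 'genres': ['Action'], 'popularity': 50.0},
]

def top_by_genres(movies, genre1, genre2, n=5):
    """Return up to n titles that contain both genres, most popular first."""
    matches = [m for m in movies if genre1 in m['genres'] and genre2 in m['genres']]
    matches.sort(key=lambda m: m['popularity'], reverse=True)
    return [m['title'] for m in matches[:n]]

result = top_by_genres(toy_movies, 'Comedy', 'Animation')
print(result)  # ['C', 'A']
```

Returning a list instead of printing makes the result reusable, and slicing with `[:n]` returns fewer than `n` titles (or none) when few movies match.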


#### CONTENT BASED MODEL

Under content-based filtering, we will merge the datasets on 'movie_id' and create a consolidated 'tags' feature from genres, keywords, overviews, cast, and director information. We'll then use CountVectorizer to transform 'tags' into a numeric representation, apply cosine similarity to measure how similar movies are, and recommend the 5 movies most closely related to the user's input.

Merging both datasets on their common column, 'movie_id'.

In [100]:
new_movies = movies.merge(credits, on = 'movie_id')
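
Because both inputs also contain a 'title' column, the merged frame carries two title columns, 'title_x' and 'title_y'. The toy example below illustrates this behaviour of `DataFrame.merge`; `movies_toy` and `credits_toy` are hypothetical two-row stand-ins, and the cast values are placeholders.

```python
import pandas as pd

# Two-row stand-ins for the movies and credits tables (cast values are placeholders).
movies_toy = pd.DataFrame({'movie_id': [19995, 285],
                           'title': ['Avatar', "Pirates of the Caribbean: At World's End"]})
credits_toy = pd.DataFrame({'movie_id': [285, 19995],
                            'title': ["Pirates of the Caribbean: At World's End", 'Avatar'],
                            'cast': ['cast-b', 'cast-a']})

# The default inner join matches rows on movie_id; the overlapping 'title'
# column gets the default suffixes '_x' (left frame) and '_y' (right frame).
merged = movies_toy.merge(credits_toy, on='movie_id')
print(list(merged.columns))  # ['movie_id', 'title_x', 'title_y', 'cast']
```

Rows are paired by key rather than by position, so the differing row order of the two toy frames does not matter.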

In [101]:
new_movies.head(2)

Unnamed: 0,budget,genres,homepage,movie_id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [102]:
new_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   movie_id              4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

Keeping only the specific columns of the new_movies dataset that can help us filter movies on the basis of content.

In [103]:
new_movies = new_movies[['movie_id', 'title_x','genres','keywords','overview','cast', 'crew']]

In [104]:
new_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title_x   4803 non-null   object
 2   genres    4803 non-null   object
 3   keywords  4803 non-null   object
 4   overview  4800 non-null   object
 5   cast      4803 non-null   object
 6   crew      4803 non-null   object
dtypes: int64(1), object(6)
memory usage: 262.8+ KB


Inspecting the rows with a null 'overview' and then dropping them.

In [105]:
new_movies[new_movies['overview'].isnull()]

Unnamed: 0,movie_id,title_x,genres,keywords,overview,cast,crew
2656,370980,Chiamatemi Francesco - Il Papa della gente,"[{""id"": 18, ""name"": ""Drama""}]","[{""id"": 717, ""name"": ""pope""}, {""id"": 5565, ""na...",,"[{""cast_id"": 5, ""character"": ""Jorge Mario Berg...","[{""credit_id"": ""5660019ac3a36875f100252b"", ""de..."
4140,459488,"To Be Frank, Sinatra at 100","[{""id"": 99, ""name"": ""Documentary""}]","[{""id"": 6027, ""name"": ""music""}, {""id"": 225822,...",,"[{""cast_id"": 0, ""character"": ""Narrator"", ""cred...","[{""credit_id"": ""592b25e4c3a368783e065a2f"", ""de..."
4431,292539,Food Chains,"[{""id"": 99, ""name"": ""Documentary""}]",[],,[],"[{""credit_id"": ""5470c3b1c3a368085e000abd"", ""de..."


In [106]:
new_movies = new_movies[~new_movies['overview'].isnull()]

Renaming 'title_x' column as 'title'.

In [107]:
new_movies.rename(columns = {'title_x':'title'}, inplace = True)

Resetting index.

In [108]:
new_movies = new_movies.reset_index()

In [109]:
new_movies.drop(columns = 'index', inplace = True)

Checking whether any duplicate rows are present.

In [110]:
new_movies.duplicated().sum()

0

Now we transform the 'genres' column to retain only the genre names as a list for each movie. The names are also lowercased and stripped of spaces so that each genre becomes a single token.

In [111]:
new_movies['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [112]:
import json
def convert_genre(genre):
    genre = json.loads(genre)
    l = []
    for i in genre:
        l.append(i['name'])
    for i in range(len(l)):
        l[i] = l[i].lower().replace(' ','')
    return l

In [113]:
new_movies['genres'] = new_movies['genres'].apply(convert_genre)

Now we transform the 'keywords' column to retain only the keywords as a list for each movie.

In [114]:
new_movies['keywords']

0       [{"id": 1463, "name": "culture clash"}, {"id":...
1       [{"id": 270, "name": "ocean"}, {"id": 726, "na...
2       [{"id": 470, "name": "spy"}, {"id": 818, "name...
3       [{"id": 849, "name": "dc comics"}, {"id": 853,...
4       [{"id": 818, "name": "based on novel"}, {"id":...
                              ...                        
4795    [{"id": 5616, "name": "united states\u2013mexi...
4796                                                   []
4797    [{"id": 248, "name": "date"}, {"id": 699, "nam...
4798                                                   []
4799    [{"id": 1523, "name": "obsession"}, {"id": 224...
Name: keywords, Length: 4800, dtype: object

In [115]:
import json
def convert_keyword(keyword):
    keyword = json.loads(keyword)
    l = []
    for i in keyword:
        l.append(i['name'])
    for i in range(len(l)):
        l[i] = l[i].lower().replace(' ','')
    return l

In [116]:
new_movies['keywords'] = new_movies['keywords'].apply(convert_keyword)

Now we transform the 'cast' column to retain only the first 3 cast names as a list for each movie.

In [117]:
new_movies['cast']

0       [{"cast_id": 242, "character": "Jake Sully", "...
1       [{"cast_id": 4, "character": "Captain Jack Spa...
2       [{"cast_id": 1, "character": "James Bond", "cr...
3       [{"cast_id": 2, "character": "Bruce Wayne / Ba...
4       [{"cast_id": 5, "character": "John Carter", "c...
                              ...                        
4795    [{"cast_id": 1, "character": "El Mariachi", "c...
4796    [{"cast_id": 1, "character": "Buzzy", "credit_...
4797    [{"cast_id": 8, "character": "Oliver O\u2019To...
4798    [{"cast_id": 3, "character": "Sam", "credit_id...
4799    [{"cast_id": 3, "character": "Herself", "credi...
Name: cast, Length: 4800, dtype: object

In [118]:
def convert_cast(cast):
    cast = json.loads(cast)
    l = []
    for i in cast:
        l.append(i['name'])
        if len(l) == 3:
            break
    for i in range(len(l)):
        l[i] = l[i].lower().replace(' ','')
    return l

In [119]:
new_movies['cast'] = new_movies['cast'].apply(convert_cast)
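
The conversion functions lowercase each name and remove its spaces. One plausible reason, which the notebook does not state, is that a word-level tokenizer would otherwise split a multi-word name into separate tokens, so two different people sharing a first name would look partly alike. The sketch below shows the effect; `normalize` is a hypothetical helper and the names are made up.

```python
def normalize(name):
    """Lowercase a name and drop spaces, mirroring the step in the convert_* functions."""
    return name.lower().replace(' ', '')

# Hypothetical names, used only for illustration.
raw_tokens = 'Jane Doe'.lower().split() + 'Jane Smith'.lower().split()
joined_tokens = [normalize('Jane Doe'), normalize('Jane Smith')]

print(raw_tokens)     # ['jane', 'doe', 'jane', 'smith']
print(joined_tokens)  # ['janedoe', 'janesmith']
```

Without normalization the token 'jane' appears twice; with it, each person maps to one distinct token.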

Now we transform the 'crew' column to retain only the director's name as a list for each movie.

In [120]:
new_movies['crew']

0       [{"credit_id": "52fe48009251416c750aca23", "de...
1       [{"credit_id": "52fe4232c3a36847f800b579", "de...
2       [{"credit_id": "54805967c3a36829b5002c41", "de...
3       [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4       [{"credit_id": "52fe479ac3a36847f813eaa3", "de...
                              ...                        
4795    [{"credit_id": "52fe44eec3a36847f80b280b", "de...
4796    [{"credit_id": "52fe487dc3a368484e0fb013", "de...
4797    [{"credit_id": "52fe4df3c3a36847f8275ecf", "de...
4798    [{"credit_id": "52fe4ad9c3a368484e16a36b", "de...
4799    [{"credit_id": "58ce021b9251415a390165d9", "de...
Name: crew, Length: 4800, dtype: object

In [121]:
def convert_crew(crew):
    crew = json.loads(crew)
    l = []
    for i in crew:
        if i['job'] == 'Director':
            l.append(i['name'])
    for i in range(len(l)):
        l[i] = l[i].lower().replace(' ','')
    return l

In [122]:
new_movies['crew'] = new_movies['crew'].apply(convert_crew)
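
As a small illustration of the director filter, the sketch below runs the same logic on a hypothetical crew string. Only the 'job' and 'name' keys are assumed, since those are the keys `convert_crew` reads; the people listed are made up.

```python
import json

# Hypothetical crew string in the same JSON shape as the 'crew' column.
raw_crew = json.dumps([
    {'job': 'Editor', 'name': 'Alex Roe'},
    {'job': 'Director', 'name': 'Jane Doe'},
    {'job': 'Producer', 'name': 'Sam Poe'},
])

# Keep only crew members whose job is exactly 'Director', then normalize the name.
directors = [member['name'].lower().replace(' ', '')
             for member in json.loads(raw_crew)
             if member['job'] == 'Director']
print(directors)  # ['janedoe']
```

The result is a list rather than a single string, so a movie with several directors, or with none, is handled by the same code path.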

Renaming the 'crew' column as 'director', since we have retained only the director's name from the crew of each movie.

In [123]:
new_movies.rename(columns = {'crew':'director'}, inplace = True)

Converting overview of each movie into a list of individual words.

In [124]:
new_movies['overview'] = new_movies['overview'].str.lower().str.split(' ')

We create a new column 'tags' by combining genres, keywords, overviews, cast, and director information for each movie to facilitate content-based recommendations.

In [125]:
new_movies['tags'] = new_movies['genres'] + new_movies['keywords'] + new_movies['overview'] + new_movies['cast'] + new_movies['director']

Converting list of tags for each movie into a string.

In [126]:
def convert_tag(tag):
    tag = ' '.join(tag)
    return tag

In [127]:
new_movies['tags'] = new_movies['tags'].apply(convert_tag)

Keeping only columns 'movie_id', 'title', and 'tags' for content based filtering.

In [128]:
new_movies = new_movies[['movie_id', 'title', 'tags']]

In [129]:
new_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4800 entries, 0 to 4799
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4800 non-null   int64 
 1   title     4800 non-null   object
 2   tags      4800 non-null   object
dtypes: int64(1), object(2)
memory usage: 112.6+ KB


We will now import the CountVectorizer class from scikit-learn, which converts a collection of text documents into a matrix of token counts.

In [130]:
from sklearn.feature_extraction.text import CountVectorizer

1. 'max_features=5000' specifies that only the top 5000 most frequent words (features) from the 'tags' data will be used.
2. stop_words='english' removes common English words (like 'a', 'an', 'the', etc.) from the 'tags' data. These words are often not useful for modeling since they appear frequently and don't carry much specific meaning.

In [131]:
cv = CountVectorizer(max_features=5000,stop_words='english')

1. Now we will fit the CountVectorizer to the 'tags' column of new_movies, which transforms it into a sparse matrix where rows correspond to movies and columns correspond to words.
2. .toarray() converts the sparse matrix representation returned by fit_transform into a dense NumPy array (vector), where each element represents the count of a specific word for each movie.

In [132]:
vector = cv.fit_transform(new_movies['tags']).toarray()

vector.shape[0] is the number of movies and vector.shape[1] is the number of retained words, i.e. the 5000 most frequent words in the 'tags' data.

In [133]:
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [134]:
vector.shape

(4800, 5000)
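
To illustrate what a matrix of token counts looks like, here is a simplified pure-Python stand-in. It is not scikit-learn's implementation: it splits on whitespace only and skips stop-word removal and the `max_features` cap. The `docs` strings are hypothetical.

```python
from collections import Counter

# Hypothetical tag strings; the real tags are much longer.
docs = ['action space marine space', 'comedy animation family', 'action comedy']

# Vocabulary: every distinct token, sorted so the column order is fixed.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary term, each cell a raw count.
matrix = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

print(vocab)      # ['action', 'animation', 'comedy', 'family', 'marine', 'space']
print(matrix[0])  # [1, 0, 0, 0, 1, 2]
```

Most cells are zero because each document uses only a small part of the vocabulary, which is why the real output above shows rows of zeros.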

Now we will calculate cosine similarity which is a measure that calculates the cosine of the angle between two vectors, often used to determine similarity between items in a recommendation system. In our case, it computes the cosine similarity between the rows of vector array, where each row corresponds to a movie.

In [135]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_scores = cosine_similarity(vector)
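
As a worked example of the measure, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand for a few hypothetical count vectors. `cosine` is an illustrative helper, not the scikit-learn function, and it assumes neither vector is all zeros.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical count vectors over a 4-word vocabulary.
movie_a = [1, 1, 0, 0]
movie_b = [1, 1, 0, 0]   # same words as movie_a
movie_c = [0, 0, 2, 1]   # no words in common with movie_a
movie_d = [1, 0, 1, 0]   # shares one word with movie_a

print(cosine(movie_a, movie_b))            # approximately 1.0 (same direction)
print(cosine(movie_a, movie_c))            # 0.0 (no shared words)
print(round(cosine(movie_a, movie_d), 2))  # 0.5
```

Since word counts are never negative, the scores here fall between 0 and 1, and only the direction of a vector matters, not its length.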

similarity_scores is a matrix in which each element is the cosine similarity between two movies; higher scores indicate greater similarity. Each row gives the similarity of one movie to the other 4799 movies and to itself (which is 1). Therefore the shape of the similarity_scores matrix is (4800, 4800).

In [137]:
similarity_scores.shape

(4800, 4800)

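Now creating a function that identifies the top 5 movies most similar to the one entered by the user. Note that this definition reuses the name recommend, so it replaces the genre-based function defined earlier.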

In [138]:
def recommend(movie_name):
    # Locate the row of the requested movie (raises ValueError if the title is absent).
    movies_list = list(new_movies['title'])
    ind = movies_list.index(movie_name)
    # Pair each row position with its score, so movies with tied scores are all kept.
    scores = list(enumerate(similarity_scores[ind]))
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    # Skip the movie itself and keep the next 5 positions.
    top = [i for i, _ in scores if i != ind][:5]
    for i in top:
        print(new_movies['title'].iloc[i])

In [139]:
recommend('Spider-Man 3')

Spider-Man 2
Spider-Man
The Amazing Spider-Man 2
The Amazing Spider-Man
Arachnophobia
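
For reuse outside the notebook, it can help to return the recommendations instead of printing them and to handle unknown titles. The following is a minimal sketch on a toy similarity matrix; `top_similar`, `titles`, and `scores` are hypothetical names, and the values are made up.

```python
def top_similar(titles, scores, query, k=5):
    """Return up to k titles most similar to `query`, or [] if the title is unknown.

    `scores` is a square similarity matrix given as nested lists.
    """
    if query not in titles:
        return []
    idx = titles.index(query)
    # Rank every other row by its similarity to the query, highest first.
    ranked = sorted((i for i in range(len(titles)) if i != idx),
                    key=lambda i: scores[idx][i], reverse=True)
    return [titles[i] for i in ranked[:k]]

# Hypothetical titles and similarity values; B and C tie on purpose.
titles = ['A', 'B', 'C', 'D']
scores = [[1.0, 0.6, 0.6, 0.1],
          [0.6, 1.0, 0.2, 0.3],
          [0.6, 0.2, 1.0, 0.4],
          [0.1, 0.3, 0.4, 1.0]]

print(top_similar(titles, scores, 'A', k=2))  # ['B', 'C']
```

Ranking by row position rather than by score value keeps both tied movies, and Python's stable sort leaves them in their original order.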
