# Developing a Content-Based Recommendation System using Metadata of Movies

## 1. Business Problem

A newly established online movie streaming platform wants to recommend movies to its users. Because very few users log in, the platform cannot collect user-level viewing habits, so it cannot build recommendations with collaborative filtering. However, it does know which movies users are watching from their browser traces. The task is to make movie recommendations based on this information.

## 2. Dataset Story
The dataset contains basic information about roughly 45000 movies. This study works only with the 'overview' variable, which contains the movie descriptions.

## 3. Importing the libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.expand_frame_repr', False)

## 4. Reading the dataset

In [2]:
df = pd.read_csv('/kaggle/input/movie-metadatacsv/movies_metadata.csv', low_memory=False, encoding='utf-8')  # low_memory=False suppresses the DtypeWarning
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
df.shape

(45466, 24)

## 5. Selecting the variable 'overview'

In [4]:
df['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

## 6. Creating the TF-IDF Matrix

Goal: Developing a Content-Based Recommendation System

### 6.1. Use of TF-IDF Method

In [5]:
tfidf = TfidfVectorizer(stop_words='english')

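Passing stop_words='english' excludes very common English words (such as 'the', 'and', 'in'). These words carry little information for telling one overview apart from another, and keeping them would only add uninformative columns to an already sparse matrix.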
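
To make the effect of the stop-word filter concrete, here is a minimal sketch on two made-up sentences (the `docs` list is an assumption for illustration, not part of the dataset). It compares the vocabulary learned with and without `stop_words='english'`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two invented overviews, used only for illustration.
docs = [
    "The toys live happily in the room",
    "A detective and the doctor solve a crime",
]

with_stop_words = TfidfVectorizer().fit(docs)
without_stop_words = TfidfVectorizer(stop_words="english").fit(docs)

vocab_all = set(with_stop_words.vocabulary_)
vocab_filtered = set(without_stop_words.vocabulary_)

# Terms that were dropped by the built-in English stop-word list.
removed = sorted(vocab_all - vocab_filtered)
print(removed)
```

On this toy input, words such as 'the' and 'and' disappear from the vocabulary, while content words such as 'toys' remain.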

### 6.2. Replace missing values in the dataset with empty strings

In [6]:
df['overview'] = df['overview'].fillna('')

### 6.3. Transforming the variable 'overview' with the TF-IDF vectorizer

In [7]:
tfidf_matrix = tfidf.fit_transform(df['overview'])
tfidf_matrix.shape

(45466, 75827)

### 6.4. Getting the names of the variables

In [8]:
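tfidf.get_feature_names()  # deprecated in scikit-learn 1.0 and removed in 1.2; use tfidf.get_feature_names_out() on newer versions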



['00',
 '000',
 '000km',
 '000th',
 '001',
 '006',
 '007',
 '008',
 '009',
 '0093',
 '01',
 '0123',
 '02',
 '03',
 '04',
 '042',
 '05',
 '05pm',
 '06',
 '07',
 '077',
 '07am',
 '08',
 '088',
 '09',
 '10',
 '100',
 '1000',
 '10000',
 '1000s',
 '1000th',
 '1001',
 '100th',
 '101',
 '101st',
 '103',
 '103rd',
 '104',
 '105',
 '1066',
 '108',
 '1080s',
 '108th',
 '109',
 '10b',
 '10crores',
 '10mn',
 '10th',
 '10x',
 '11',
 '110',
 '1100',
 '111',
 '112',
 '1138',
 '114',
 '115',
 '117',
 '117a',
 '118',
 '1183',
 '119',
 '11s',
 '11th',
 '12',
 '120',
 '1200',
 '1200s',
 '1206',
 '1215',
 '1218',
 '1227',
 '125',
 '1250',
 '125th',
 '1263',
 '129',
 '12th',
 '13',
 '130',
 '1300',
 '1300s',
 '1302',
 '1303',
 '133',
 '134',
 '1344',
 '1348',
 '1349',
 '138',
 '13anos',
 '13b',
 '13s',
 '13th',
 '14',
 '140',
 '1400',
 '1408',
 '1413',
 '142',
 '1429',
 '143',
 '144',
 '145',
 '1458',
 '146',
 '1463',
 '1466',
 '1472',
 '1475',
 '148',
 '1482',
 '1483',
 '1492',
 '14pm',
 '14th',
 '15',
 ...

Thus, there are 45466 overviews (one per movie), and 75827 new variables were formed, one for each unique term.
At the intersection of observations and variables, there are TF-IDF scores.
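
As a rough illustration of where these scores come from, the sketch below recomputes them by hand on a three-sentence toy corpus (the `docs` list is invented). It assumes scikit-learn's default settings for `TfidfVectorizer`: raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy corpus; every document contains at least one term.
docs = ["toys live happily", "detective solves crime", "toys solve crime"]

# Scores as produced by scikit-learn with default parameters.
scores = TfidfVectorizer().fit_transform(docs).toarray()

# Manual recomputation. CountVectorizer uses the same tokenisation and
# the same alphabetical column order as TfidfVectorizer.
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
raw = counts * idf                               # term frequency times IDF
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalise rows

print(np.allclose(scores, manual))
```

Terms that occur in fewer documents get a larger IDF, so they weigh more heavily in a row than terms shared by many overviews.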

### 6.5. Getting the TF-IDF scores

In [9]:
tfidf_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## 7. Cosine Similarity Calculation
Goal: Creating the cosine similarity matrix

In [10]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim.shape

(45466, 45466)

This matrix holds the similarity score of every movie with every other movie.
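
For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths. The sketch below checks this against `cosine_similarity` on an invented toy corpus. It also shows that, because `TfidfVectorizer` L2-normalises each row by default, a plain dot product (`linear_kernel`) gives the same values here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented toy overviews for illustration.
docs = [
    "detective holmes solves crime in london",
    "holmes and watson chase a criminal in london",
    "toys come alive in the bedroom",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(matrix)

# Manual cosine similarity between the first two documents.
dense = matrix.toarray()  # fine here: the toy matrix is tiny
a, b = dense[0], dense[1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

same_as_manual = bool(np.isclose(sim[0, 1], manual))
same_as_dot = bool(np.allclose(sim, linear_kernel(matrix, matrix)))
print(same_as_manual, same_as_dot)
```

The two Holmes sentences share terms and therefore score above zero, while the toy sentence shares no terms with the first one and scores zero against it.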

### 7.1. Accessing the similarity scores between the movie at index 1 and all other movies

In [11]:
# cosine_sim[1]

## 8. Recommendation Based on the Similarities of Movies

### 8.1. Adding movie names to similarity scores

In [12]:
indices = pd.Series(df.index, index=df['title'])
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64

### 8.2. Getting the frequencies of the movie titles

In [13]:
indices.index.value_counts()

Cinderella              11
Hamlet                   9
Alice in Wonderland      9
Beauty and the Beast     8
Les Misérables           8
                        ..
Cluny Brown              1
Babies                   1
The Green Room           1
Captain Conan            1
Queerama                 1
Name: title, Length: 42277, dtype: int64

As you can see, some titles appear more than once. In this case, let's keep only the last occurrence of each title, on the assumption that rows later in the dataset are the more recent releases. We can use the 'duplicated' method to do this.

In [14]:
indices = indices[~indices.index.duplicated(keep='last')]

In [15]:
indices.index.value_counts()

Toy Story                   1
Russell Madness             1
Attack of the Sabretooth    1
The Millennials             1
X/Y                         1
                           ..
Wife! Be Like a Rose!       1
Adelheid                    1
PEEPLI [Live]               1
The Moth                    1
Queerama                    1
Name: title, Length: 42277, dtype: int64

Thus, the duplicate-title problem is solved.
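
The behaviour of `duplicated(keep='last')` is easier to see on a small example. The sketch below uses a handful of invented titles: every occurrence of a repeated title except the last one is flagged, and the `~` operator keeps only the rows that are not flagged.

```python
import pandas as pd

# Invented titles; 'Cinderella' and 'Hamlet' each appear twice.
titles = pd.Series(["Cinderella", "Hamlet", "Cinderella", "Toy Story", "Hamlet"])

# Same construction as in the notebook: title -> row index.
indices = pd.Series(titles.index, index=titles)

# True for every occurrence of a title except its last one.
is_earlier_duplicate = indices.index.duplicated(keep="last")
deduped = indices[~is_earlier_duplicate]

print(deduped)
```

After this step each title maps to exactly one row index, so a lookup such as `deduped['Cinderella']` returns a single number instead of a Series.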

In [16]:
indices['Cinderella']

45406

In [17]:
movie_index = indices['Sherlock Holmes']
movie_index

35116

In [18]:
cosine_sim[movie_index]

array([0.        , 0.00392837, 0.00476764, ..., 0.        , 0.0067919 ,
       0.        ])

As you can see, the raw array is still hard to read. We can solve this by placing the scores in a DataFrame.

### 8.3. Getting the similarity scores

In [19]:
similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=['score'])
similarity_scores

Unnamed: 0,score
0,0.000000
1,0.003928
2,0.004768
3,0.000000
4,0.000000
...,...
45461,0.000000
45462,0.000000
45463,0.000000
45464,0.006792


Thus, the similarity scores of the Sherlock Holmes movie and all existing movies are obtained.

### 8.4. Getting the movies most similar to Sherlock Holmes and their titles

The 10 most similar movies to Sherlock Holmes

In [20]:
movie_indices = similarity_scores.sort_values('score', ascending=False)[1: 11].index
movie_indices

Int64Index([34737, 14821, 34750, 9743, 4434, 29706, 18258, 24665, 6432, 29154], dtype='int64')

In [21]:
df.iloc[movie_indices]['title']

34737    Приключения Шерлока Холмса и доктора Ватсона: ...
14821                                    The Royal Scandal
34750    The Adventures of Sherlock Holmes and Doctor W...
9743                           The Seven-Per-Cent Solution
4434                                        Without a Clue
29706                       How Sherlock Changed the World
18258                   Sherlock Holmes: A Game of Shadows
24665     The Sign of Four: Sherlock Holmes' Greatest Case
6432                   The Private Life of Sherlock Holmes
29154                          Sherlock Holmes in New York
Name: title, dtype: object

Thus, the ten movies most similar to Sherlock Holmes are obtained.
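
The slice `[1: 11]` assumes that the queried movie itself lands in the first position after sorting. That holds when its self-similarity of 1.0 is the unique maximum, but it may not hold when another movie has an identical overview (a tie at 1.0) or when the overview is empty (all scores 0). A slightly more defensive sketch drops the query row by its index before ranking. The helper name `top_similar` and the toy scores are illustrative assumptions.

```python
import pandas as pd


def top_similar(scores, query_index, k=10):
    """Return the k largest scores as a Series, excluding the query row."""
    ranked = pd.Series(scores)
    return ranked.drop(query_index).nlargest(k)


# Toy scores: rows 0 and 2 tie at 1.0, and row 2 is the queried movie.
toy_scores = [1.0, 0.2, 1.0, 0.5]
result = top_similar(toy_scores, query_index=2, k=2)
print(result)
```

Because the query row is removed explicitly, the result does not depend on how ties are ordered by the sort.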

## 9. Wrapping the Study in Functions

### 9.1. Importing the libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.expand_frame_repr', False)

### 9.2. Reading the dataset

In [2]:
df = pd.read_csv('/kaggle/input/movie-metadatacsv/movies_metadata.csv', low_memory=False, encoding='utf-8')  # low_memory=False suppresses the DtypeWarning
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


### 9.3. Cosine Similarity Calculation

In [3]:
def calculate_cosine_sim(dataframe):
    # Build a TF-IDF vectorizer that ignores common English stop words
    tfidf = TfidfVectorizer(stop_words='english')

    # Replace missing overviews with empty strings (modifies the dataframe in place)
    dataframe['overview'] = dataframe['overview'].fillna('')

    # Create the TF-IDF matrix (one row per movie, one column per term)
    tfidf_matrix = tfidf.fit_transform(dataframe['overview'])

    # Compute the pairwise cosine similarity between all movies
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

    return cosine_sim

In [4]:
cosine_sim = calculate_cosine_sim(df)

### 9.4. Defining the content-based recommender function

In [5]:
def content_based_recommender(title, cosine_sim, dataframe):
    # Map each title to its row index, keeping the last occurrence of duplicated titles
    indices = pd.Series(dataframe.index, index=dataframe['title'])
    indices = indices[~indices.index.duplicated(keep='last')]

    # Look up the row index of the requested title
    movie_index = indices[title]

    # Similarity scores between the requested title and every movie
    similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=['score'])

    # Take the top 10 movies, skipping position 0 (assumed to be the movie itself)
    movie_indices = similarity_scores.sort_values('score', ascending=False)[1: 11].index

    return dataframe.iloc[movie_indices]['title']

In [6]:
content_based_recommender('Sherlock Holmes', cosine_sim, df)

34737    Приключения Шерлока Холмса и доктора Ватсона: ...
14821                                    The Royal Scandal
34750    The Adventures of Sherlock Holmes and Doctor W...
9743                           The Seven-Per-Cent Solution
4434                                        Without a Clue
29706                       How Sherlock Changed the World
18258                   Sherlock Holmes: A Game of Shadows
24665     The Sign of Four: Sherlock Holmes' Greatest Case
6432                   The Private Life of Sherlock Holmes
29154                          Sherlock Holmes in New York
Name: title, dtype: object

In [7]:
content_based_recommender('The Matrix', cosine_sim, df)

44161                        A Detective Story
44167                              Kid's Story
44163                             World Record
33854                                Algorithm
167                                    Hackers
20707    Underground: The Julian Assange Story
6515                                  Commando
24202                                 Who Am I
22085                           Berlin Express
9159                                  Takedown
Name: title, dtype: object

In [8]:
content_based_recommender('The Godfather', cosine_sim, df)

1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object

In [9]:
content_based_recommender('The Dark Knight Rises', cosine_sim, df)

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object

As a result, movie recommendations were produced for each queried title based solely on the content of the movie overviews.
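
One practical caveat: the full 45466 x 45466 similarity matrix takes roughly 16.5 GB as float64, which may not fit in memory on every machine. A possible alternative, sketched below under that assumption, keeps only the sparse TF-IDF matrix and computes a single row of similarities per request. The function names `build_tfidf_matrix` and `recommend_from_row` and the toy dataframe are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_tfidf_matrix(dataframe):
    """Fit TF-IDF on the overviews without modifying the input dataframe."""
    overviews = dataframe["overview"].fillna("")
    return TfidfVectorizer(stop_words="english").fit_transform(overviews)


def recommend_from_row(title, tfidf_matrix, dataframe, k=10):
    """Recommend k titles using only one row of similarity scores."""
    # Map each title to the row position of its last occurrence.
    positions = pd.Series(range(len(dataframe)), index=dataframe["title"])
    positions = positions[~positions.index.duplicated(keep="last")]
    if title not in positions.index:
        raise KeyError(f"Title not found: {title!r}")
    pos = int(positions[title])

    # Slicing with pos:pos + 1 keeps the query as a 2-D, single-row matrix.
    scores = cosine_similarity(tfidf_matrix[pos:pos + 1], tfidf_matrix).ravel()

    # Drop the query movie by position, then take the k best scores.
    ranked = pd.Series(scores).drop(pos).nlargest(k)
    return dataframe["title"].iloc[ranked.index.to_numpy()]


# Invented toy data; the last row has a missing overview on purpose.
toy = pd.DataFrame({
    "title": ["Sherlock Holmes", "Holmes Returns", "Toy Story", "Untitled"],
    "overview": [
        "detective holmes solves crime london",
        "holmes detective london mystery",
        "toys cowboy space ranger",
        None,
    ],
})
toy_matrix = build_tfidf_matrix(toy)
print(recommend_from_row("Sherlock Holmes", toy_matrix, toy, k=1).tolist())
```

The trade-off is that each request recomputes one row of similarities instead of reading it from a precomputed matrix, which costs a little time per query but avoids holding the full matrix in memory.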