## **Plot Based Recommender System**

Suppose you really like the plot of a movie 'A' and want a recommendation for another movie with a similar plot. Here I will build a recommender system that suggests movies based on the plot of a movie you like.

A plot is a short text summary of a movie. Here is an example, the plot of the movie 'Life Begins for Andy Hardy':

After high school graduation, Andy Hardy (Mickey Rooney) reconsiders going on to college to follow in the footsteps of his father, Judge James Hardy (Lewis Stone). Persuading his parents to allow him to spend the summer in New York City, Andy drives there with wealthy friend Betsy Booth (Judy Garland). Unable to find a job immediately, Andy lets his small allowance be taken by a gold digger, but he writes home that all is well. After Betsy finds Andy sleeping in a park, she reports to the judge.

Let us now build a recommender system that recommends movies whose plots are similar to the plot of a movie we like.

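The dataset is hosted on Kaggle. The link below points to the MovieLens 100k dataset, although the file loaded in this notebook, `movies_metadata.csv`, comes from Kaggle's 'The Movies Dataset':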

https://www.kaggle.com/prajitdatta/movielens-100k-dataset

Let's import the necessary libraries and start our analysis.

In [0]:
#importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


In [0]:
from google.colab import files
uploaded = files.upload()

Saving movies_metadata.csv to movies_metadata.csv


In [0]:
#putting movies data on 'movies' dataframe
movies = pd.read_csv('movies_metadata.csv')



In [0]:
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.0155,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.85949,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.38752,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


Since we are building a plot-based recommender system, let us keep only the columns the model needs: the movie 'id', the movie 'title', and the 'overview', which is the column containing the plot of each movie.

In [0]:
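#keep an independent copy so that later column assignments do not trigger a SettingWithCopyWarning
movies = movies[['id', 'title', 'overview']].copy()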

In [0]:
movies.head()

Unnamed: 0,id,title,overview
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...


Let us see what a movie plot looks like in the dataset.

In [0]:
movies['overview'][0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

In [0]:
#number of movies with a missing overview
movies['overview'].isnull().sum()

954

In [0]:
movies.shape

(45466, 3)

We have a dataset of **45466 movies**, of which 954 have no overview. That is large enough to build a model that recommends movies based on their plots. It is going to be very interesting.

As a first step, we will use **TfidfVectorizer** to convert the '**overview**' (plot) column, which is text, into numerical features.

Machine learning models operate on numerical values, so the text has to be converted into numbers before we can compare plots.

TF-IDF stands for **Term Frequency-Inverse Document Frequency**. The vectorizer creates one feature per distinct word in the overview column (after English stop words are removed). The value for a word in a given movie grows with the number of times the word appears in that movie's overview and shrinks as the number of documents (movies here) containing the word grows. A word that occurs many times in one overview is therefore still down-weighted if it is common to many movies, because such words do little to tell movies apart.
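To make the weighting concrete, here is a minimal sketch on three made-up plots (they are illustrative only and not taken from the dataset). It assumes scikit-learn's default TfidfVectorizer settings apart from `stop_words='english'`, the same configuration used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "plots"; illustrative only, not rows from movies_metadata.csv.
toy_plots = [
    "a cowboy toy fears a shiny space toy",
    "a cowboy guards a dusty town",
    "a space crew explores a distant planet",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_plots)

# One row per plot, one column per distinct word that is not a stop word.
vocab = toy_tfidf.vocabulary_            # maps word -> column index
first_plot = toy_matrix[0].toarray().ravel()

# "toy" occurs twice and only in the first plot, so it gets the largest weight.
# "fears" occurs once and only in the first plot.
# "cowboy" and "space" each occur in two plots, so they are weighted lowest.
for word in ["toy", "fears", "cowboy", "space"]:
    print(word, round(float(first_plot[vocab[word]]), 3))
```

The ordering of the printed weights is the point here: a word is rewarded for being frequent within one plot and penalized for being spread across many plots.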

In [0]:
#drop common English stop words such as 'the' and 'a'
tfidf = TfidfVectorizer(stop_words='english')

#replace missing overviews with an empty string so the vectorizer can process them
movies['overview'] = movies['overview'].fillna('')

#Construct the required TF-IDF matrix by applying the fit_transform method on the overview feature
overview_matrix = tfidf.fit_transform(movies['overview'])

#Output the shape of overview_matrix
overview_matrix.shape

(45466, 75827)

Now we have a TF-IDF feature matrix for all the movies. Every movie is described by **75827** **features** (words).

To measure the similarity between 2 movies, we will use **cosine similarity**. TfidfVectorizer normalizes every row to unit length by default, so the dot product computed by the linear_kernel function equals the cosine similarity, and that is what we use here.

Cosine similarity measures how similar 2 vectors are: it is the cosine of the angle between them. Here, each movie is a vector of 75827 features.
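The sketch below checks, on a small made-up corpus, that `linear_kernel` and `cosine_similarity` give the same numbers for TF-IDF rows. It relies on TfidfVectorizer's default `norm='l2'`; with a different `norm` setting the two functions would no longer agree.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up plots, for illustration only.
toy_plots = [
    "a cowboy toy fears a shiny space toy",
    "a cowboy guards a dusty town",
    "a space crew explores a distant planet",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_plots)

# With the default norm='l2', every row has Euclidean length 1.
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)

# cos(a, b) = a.b / (|a| |b|); when |a| = |b| = 1 this is just the dot product.
dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot_products, cosines))
```

Using `linear_kernel` skips the normalization step that `cosine_similarity` would repeat, which is why it is a common choice for TF-IDF matrices.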

In [0]:
similarity_matrix = linear_kernel(overview_matrix,overview_matrix)

In [0]:
similarity_matrix 

array([[1.        , 0.01504121, 0.        , ..., 0.        , 0.00595453,
        0.        ],
       [0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
        0.00929411],
       [0.        , 0.04681953, 1.        , ..., 0.        , 0.01402548,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.00595453, 0.02198641, 0.01402548, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.00929411, 0.        , ..., 0.        , 0.        ,
        1.        ]])
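A practical note on size: the full similarity matrix has 45466 x 45466 entries, which is roughly 16.5 GB if each entry is stored as a 64-bit float. The following sketch shows one way to sidestep that by computing a single row on demand. The helper `similarity_row` is a hypothetical name introduced here for illustration and is not part of the notebook, and the toy corpus is made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Rough memory estimate for the dense matrix, assuming 8-byte float64 entries.
n_movies = 45466
dense_bytes = n_movies * n_movies * 8
print(f"{dense_bytes / 1e9:.1f} GB")

def similarity_row(tfidf_matrix, row_index):
    """Similarity of one row against all rows (hypothetical helper)."""
    # Slicing a sparse matrix with a single integer keeps it as a 1 x n_features matrix.
    return linear_kernel(tfidf_matrix[row_index], tfidf_matrix).ravel()

# Made-up plots standing in for movies['overview'].
toy_plots = [
    "a cowboy toy fears a shiny space toy",
    "a cowboy guards a dusty town",
    "a space crew explores a distant planet",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_plots)

row0 = similarity_row(toy_matrix, 0)
full = linear_kernel(toy_matrix, toy_matrix)
print(np.allclose(row0, full[0]))
```

The trade-off is that each recommendation request recomputes one row instead of reading it from a precomputed matrix.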

Now, let us create a series that maps the index of the matrix to movie names to make it easy for us to just feed in movie names and get the recommendation.

In [0]:
#movies index mapping
mapping = pd.Series(movies.index,index = movies['title'])

In [0]:
mapping

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64
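One thing to watch with a title-indexed series is that titles need not be unique. The sketch below uses a made-up three-row frame to show what a lookup returns when a title repeats. Whether and how often titles repeat in the real dataset is not shown in this notebook, and keeping the first occurrence is just one possible policy, chosen here as an assumption.

```python
import pandas as pd

# Made-up titles; 'Movie A' appears twice on purpose.
toy_movies = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie A"]})
toy_mapping = pd.Series(toy_movies.index, index=toy_movies["title"])

unique_hit = toy_mapping["Movie B"]      # a single integer position
repeated_hit = toy_mapping["Movie A"]    # a Series holding both positions

print(type(unique_hit).__name__)
print(type(repeated_hit).__name__, list(repeated_hit))

# Assumed policy: keep only the first position for each title.
first_only = toy_mapping[~toy_mapping.index.duplicated(keep="first")]
print(first_only["Movie A"])
```

A function that expects a single integer from the lookup needs to handle the repeated case, for example by deduplicating the mapping as above or by picking one of the returned positions.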

Now, we will write a **recommender function** that recommends movies using cosine similarity. The function takes a movie name as input and returns the 14 most similar movies according to the similarity matrix computed above.

In [0]:
def recommend_movies_based_on_plot(movie_input):

  movie_index = mapping[movie_input]
  #if several movies share this title, mapping returns a Series; use the first match
  if isinstance(movie_index, pd.Series):
    movie_index = movie_index.iloc[0]

  #pair every movie index with its similarity to the input movie
  similarity_score = list(enumerate(similarity_matrix[movie_index]))
  #sort the pairs by similarity score in descending order
  similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)

  #skip the first entry (the input movie itself) and keep the next 14
  similarity_score = similarity_score[1:15]

  #look up the titles of the selected movies by position
  movie_indices = [i[0] for i in similarity_score]
  return movies['title'].iloc[movie_indices]




In [0]:
recommend_movies_based_on_plot('Life Begins for Andy Hardy')

23530                      Andy Hardy Meets Debutante
21422                                 A Family Affair
26304                          You're Only Young Once
10301                          The 40 Year Old Virgin
29369                  Andy Hardy's Private Secretary
23843                     Andy Hardy's Blonde Trouble
15348                                     Toy Story 3
43427                Andy Kaufman Plays Carnegie Hall
38476    Superstar: The Life and Times of Andy Warhol
42721    Andy Peters: Exclamation Mark Question Point
8327                                        The Champ
28128                       The Mayor of Casterbridge
21359                        Andy Hardy's Double Life
32086                                Brother's Keeper
Name: title, dtype: object

We can finally see that when we **input the movie 'Life Begins for Andy Hardy'**, we get 14 recommendations for other movies whose plots are similar to this movie. It's magical, isn't it?
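As a possible variation on the selection step, the sketch below picks the top `k` entries with NumPy and excludes the query movie by its index rather than by assuming it sorts first. That assumption can fail, for instance, when a movie has an empty overview and therefore a similarity of 0 with itself. The helper `top_k_similar` and the toy scores are hypothetical and only illustrate the idea.

```python
import numpy as np

def top_k_similar(scores, query_index, k):
    """Indices of the k highest scores, never including query_index (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float).copy()
    scores[query_index] = -np.inf             # push the query itself to the end
    k = min(k, len(scores) - 1)               # cannot return more than the other items
    # A stable sort on the negated scores gives descending order
    # and keeps ties in their original index order.
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Made-up similarity row for a query movie at index 0.
toy_scores = np.array([1.0, 0.2, 0.0, 0.7, 0.2])
top = top_k_similar(toy_scores, query_index=0, k=3)
print(top)
```

Making `k` a parameter also keeps the number of recommendations in one place instead of encoding it in a slice.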

In the next project, I will build a ***recommender system based on metadata*** (directors, actors, genres and keywords) and a ***recommender system based on collaborative filtering*** (the most common modern recommender system).

Thanks for reading. 

