In [1]:
#Broadly, recommender systems can be classified into 3 types:

#1.Simple recommenders: offer generalized recommendations to every user, based on movie popularity and/or
#genre. The basic idea behind this system is that movies that are more popular and critically acclaimed
#will have a higher probability of being liked by the average audience. An example could be IMDB Top 250.

#2.Content-based recommenders: suggest similar items based on a particular item. This system uses item
#metadata, such as genre, director, description, actors, etc. for movies, to make these recommendations.
#The general idea behind these recommender systems is that if a person likes a particular item, he or
#she will also like an item that is similar to it. To make such recommendations, the system uses the
#metadata of items the user has liked in the past. A good example could be YouTube, which suggests new
#videos that you could potentially watch based on your history.

#3.Collaborative filtering engines: these systems are widely used, and they try to predict the rating or
#preference that a user would give an item based on past ratings and preferences of other users.
#Collaborative filters do not require item metadata like their content-based counterparts.

In [2]:
#1.Simple Recommenders:-

#Simple recommenders are basic systems that recommend the top items based on a certain metric or score. 
#In this section, you will build a simplified clone of IMDB Top 250 Movies using metadata collected from
#IMDB.

#The following are the steps involved:

#1.Decide on the metric or score to rate movies on.

#2.Calculate the score for every movie.

#3.Sort the movies based on the score and output the top results.

#Load your movies metadata dataset into a pandas DataFrame:

# Import Pandas
import pandas as pd

# Load Movies Metadata
metadata = pd.read_csv('E:/datafiles/ml-latest-full/movies_metadata.csv', low_memory=False)

# Print the first three rows
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [3]:
#One of the most basic metrics you can think of is the rating: rank the movies by their average rating
#and take the top 250.

#However, using a rating as a metric has a few caveats:

#For one, it does not take into consideration the popularity of a movie. Therefore, a movie with a rating
#of 9 from 10 voters will be considered 'better' than a movie with a rating of 8.9 from 10,000 voters.

#To compare many movies fairly while taking this shortcoming into consideration, you must come up with
#a weighted rating that takes into account both the average rating and the number of votes a movie has
#accumulated.

#Since you are trying to build a clone of IMDB's Top 250, let's use its weighted rating formula as a
#metric/score. WeightedRating(WR) = (v/(v+m) * R) + (m/(v+m) * C)

#v is the number of votes for the movie;

#m is the minimum votes required to be listed in the chart;

#R is the average rating of the movie;

#C is the mean vote across the whole report (here, the whole dataset).

#You already have the values of v (vote_count) and R (vote_average) for each movie in the dataset.
#It is also possible to directly calculate C from this data.

#Since there is no single right value for 'm', we will use the 90th percentile of vote counts as the
#cutoff. In other words, a movie must have more votes than at least 90% of the movies in the dataset
#to appear in the chart; movies with fewer than 'm' votes are removed.
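
#To see how the formula handles the caveat described above, here is a minimal sketch that applies it
#to the two movies from that example. The helper name weighted_rating_example and the values
#m_example = 160 and C_example = 5.6 are assumptions chosen only for illustration.

```python
# Hypothetical helper: the weighted rating formula for a single movie
def weighted_rating_example(v, R, m, C):
    return (v / (v + m)) * R + (m / (v + m)) * C

m_example = 160   # assumed minimum number of votes
C_example = 5.6   # assumed mean rating across all movies

# A rating of 9 from 10 voters versus a rating of 8.9 from 10,000 voters
few_votes = weighted_rating_example(v=10, R=9.0, m=m_example, C=C_example)
many_votes = weighted_rating_example(v=10000, R=8.9, m=m_example, C=C_example)

print(round(few_votes, 3), round(many_votes, 3))
```

#Under these assumed values, the movie with only 10 votes is pulled down to about 5.8, close to the
#mean C, while the movie with 10,000 votes keeps a score of about 8.85, close to its own rating.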

#Let's calculate the value of C, the mean rating across all movies using the pandas .mean() function:

# Calculate mean of vote average column
C = metadata['vote_average'].mean()
print(C)

#Note:- We observe that the average rating of a movie on IMDB is around 5.6 on a scale of 10.

5.618207215134185


In [4]:
#Let's calculate the number of votes, m, received by a movie in the 90th percentile by using the
#pandas .quantile() function:

# Calculate the minimum number of votes required to be in the chart, m
m = metadata['vote_count'].quantile(0.90)
print(m)

160.0


In [5]:
#Now that you have m, you can use a greater-than-or-equal-to condition to keep only the movies with
#at least 160 votes:

#You can use the .copy() method to ensure that the new q_movies DataFrame is independent of your
#original metadata DataFrame, so changes to q_movies do not affect the original.

# Filter out all qualified movies into a new DataFrame
q_movies = metadata.loc[metadata['vote_count'] >= m].copy()
q_movies.shape

(4555, 24)

In [6]:
metadata.shape

#Note:- From the above output, it is clear that about 10% of the movies (4,555 of 45,466) have at
#least 160 votes and qualify to be on this list.

(45466, 24)
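
#The link between the 90th percentile and the "about 10%" figure can be checked on toy data. This is
#a minimal sketch; the toy_votes Series (vote counts 1 to 100) is an assumption made for illustration.

```python
import pandas as pd

# Assumed toy vote counts: 1, 2, ..., 100
toy_votes = pd.Series(range(1, 101))

# 90th percentile of the vote counts (pandas interpolates linearly by default)
toy_m = toy_votes.quantile(0.90)

# Keep only the entries that reach the cutoff
toy_kept = toy_votes[toy_votes >= toy_m]

print(toy_m, len(toy_kept) / len(toy_votes))
```

#Here 10 of the 100 toy entries survive the filter, mirroring the 4,555 of 45,466 movies above.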

In [7]:
#Next and the most important step is to calculate the weighted rating for each qualified movie:
#To do this, we will:

#1.Define a function, weighted_rating();

#2.Since you have already calculated m and C, you will simply pass them as arguments to the function;

#3.Then you will select the vote_count(v) and vote_average(R) column from the q_movies data frame;

#4.Finally, you will compute the weighted average and return the result.

#You will then define a new feature, score, whose value you'll calculate by applying this function to
#your DataFrame of qualified movies:

# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [8]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
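
#apply(..., axis=1) calls the function once per row. The same formula can also be written on whole
#columns at once, which is usually faster on large DataFrames. Below is a minimal sketch on toy data;
#the names toy, toy_m, toy_C and toy_weighted_rating are assumptions made for illustration.

```python
import pandas as pd

# Assumed toy data standing in for q_movies
toy = pd.DataFrame({
    'vote_count': [10.0, 10000.0, 500.0],
    'vote_average': [9.0, 8.9, 7.0],
})
toy_m, toy_C = 160.0, 5.6  # assumed cutoff and mean rating

# Row-wise version, same shape as weighted_rating() above
def toy_weighted_rating(x, m=toy_m, C=toy_C):
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m) * R) + (m / (m + v) * C)

row_wise = toy.apply(toy_weighted_rating, axis=1)

# Column-wise (vectorized) version of the same formula
toy_v = toy['vote_count']
toy_R = toy['vote_average']
vectorized = (toy_v / (toy_v + toy_m)) * toy_R + (toy_m / (toy_v + toy_m)) * toy_C

# Largest difference between the two results
print((row_wise - vectorized).abs().max())
```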

In [9]:
#Let's sort the DataFrame in descending order based on the score feature column and output the title, 
#vote count, vote average, and weighted rating (score) of the top 20 movies.

#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 20 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


In [10]:
#2.Content-Based Recommender

#Plot Description Based Recommender:

#We will learn how to build a system that recommends movies that are similar to a particular movie.
#To achieve this, you will compute pairwise cosine similarity scores for all movies based on their
#plot descriptions and recommend movies based on that similarity score.

#The plot description is available to you as the overview feature in your metadata dataset. Let's inspect
#the plots of a few movies:

#Print plot overviews of the first 5 movies.
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [11]:
#The problem at hand is a Natural Language Processing problem, and it is not possible to compute the
#similarity between any two overviews in their raw forms. To do this, you need to compute the word
#vectors of each overview or document.

#Word vectors are vectorized representations of the words in a document. In embedding models these
#vectors carry semantic meaning: for example, related words such as man & king have vector
#representations close to each other, while unrelated words have representations far from each other.

#We will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This
#gives a matrix where each column represents a word in the overview vocabulary (all the words that
#appear in at least one document), and each row represents a movie.

#The TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of
#documents in which it occurs. This is done to reduce the importance of words that frequently occur
#in plot overviews and, therefore, their significance in computing the final similarity score.

#Fortunately, scikit-learn gives you a built-in TfidfVectorizer class that produces the TF-IDF matrix
#in a couple of lines.

#1.Import the TfidfVectorizer class from scikit-learn;

#2.Remove stop words like 'the', 'an', etc. since they do not give any useful information about the topic;

#3.Replace not-a-number values with a blank string;

#4.Finally, construct the TF-IDF matrix on the data.

#Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)
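
#To make the layout of the matrix concrete, here is a minimal sketch on three made-up overviews
#(toy_overviews is an assumption for illustration, not data from the dataset). It shows that there
#is one row per document and one column per remaining word, and that each row has length 1, because
#TfidfVectorizer applies L2 normalization by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed toy overviews
toy_overviews = [
    'a toy cowboy leads the toys',
    'two siblings find a magic board game',
    'a cowboy rides into town',
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

# Rows are documents, columns are vocabulary words
print(toy_matrix.shape)

# The learned vocabulary (stop words such as 'the' are gone)
print(sorted(toy_tfidf.vocabulary_))

# Euclidean length of each row vector
row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)
```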

In [12]:
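#Array mapping from feature integer indices to feature name.
#Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions
#use tfidf.get_feature_names_out() instead.
tfidf.get_feature_names()[5000:5010]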

['avails',
 'avaks',
 'avalanche',
 'avalanches',
 'avallone',
 'avalon',
 'avant',
 'avanthika',
 'avanti',
 'avaracious']

In [13]:
#From the above output, you observe that the overviews of the roughly 45,000 movies in your dataset
#contain 75,827 different words.

#With this matrix in hand, you can now compute a similarity score. There are several similarity metrics
#that you can use for this, such as the Manhattan, Euclidean, Pearson, and cosine similarity scores.
#Again, there is no right answer to which score is the best. Different scores work well in different
#scenarios, and it is often a good idea to experiment with different metrics and observe the results.

#Cosine Similarity = (x . y) / (|x| * |y|)

#Since TfidfVectorizer L2-normalizes each vector by default (so |x| = |y| = 1), calculating the dot
#product between each pair of vectors will directly give you the cosine similarity score. Therefore,
#you will use sklearn's linear_kernel() instead of cosine_similarity() since it is faster.

#This would return a matrix of shape 45466x45466, which holds the cosine similarity score of each movie
#overview with every other movie overview. Hence, each movie will be a 1x45466 row vector where each
#column will be a similarity score with another movie.

# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
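
#The claim that the plain dot product equals the cosine similarity for TF-IDF vectors can be checked
#on a small example. This is a minimal sketch; toy_docs is made-up data used only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Assumed toy documents
toy_docs = [
    'a toy cowboy leads the toys',
    'two siblings find a magic board game',
    'a cowboy rides into town',
]

toy_vectors = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

# Dot products between all pairs of (already normalized) TF-IDF vectors
toy_dot = linear_kernel(toy_vectors, toy_vectors)

# Explicit cosine similarity, which normalizes the vectors again
toy_cos = cosine_similarity(toy_vectors, toy_vectors)

print(np.allclose(toy_dot, toy_cos))
print(toy_dot.round(3))
```

#The diagonal is 1 (every document is identical to itself), and the two documents that share the
#word 'cowboy' score higher with each other than with the unrelated one.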

In [14]:
cosine_sim.shape

(45466, 45466)

In [15]:
cosine_sim[1]

array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])

In [16]:
#You're going to define a function that takes in a movie title as an input and outputs a list of the 10 
#most similar movies.

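#Firstly, for this, you need a reverse mapping of movie titles to DataFrame indices.

#In other words, you need a mechanism to identify the index of a movie in your metadata DataFrame, given
#its title.

#Construct a reverse map of indices and movie titles. Series.drop_duplicates() compares the values
#(the row indices, which are already unique), so repeated titles are removed via the index instead.
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]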

In [17]:
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64
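
#One caveat with a title-to-index Series is repeated titles: Series.drop_duplicates() looks at the
#values, not at the index labels, so it does not remove them. The minimal sketch below shows the
#difference on toy data; toy_map and its titles are assumptions made for illustration.

```python
import pandas as pd

# Assumed toy mapping in which the title 'Heat' appears twice
toy_map = pd.Series([0, 1, 2], index=['Heat', 'Sabrina', 'Heat'])

# Values 0, 1, 2 are all unique, so nothing is dropped
by_value = toy_map.drop_duplicates()

# Keep only the first occurrence of each title (index label)
by_title = toy_map[~toy_map.index.duplicated(keep='first')]

print(len(by_value), len(by_title))
print(by_title['Heat'])
```

#With the label-based version, looking up a title always returns a single integer rather than a
#Series of several matches.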

In [18]:
#Now we are in good shape to define our recommendation function. These are the steps we follow:

#1.Get the index of the movie given its title.

#2.Get the list of cosine similarity scores for that particular movie with all movies. Convert it into
#a list of tuples where the first element is its position, and the second is the similarity score.

#3.Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

#4.Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most
#similar to a particular movie is the movie itself).

#5.Return the titles corresponding to the indices of the top elements.


# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]
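
#The same steps can be tried on a tiny hand-made similarity matrix. This is a minimal sketch, not a
#replacement for get_recommendations(): toy_titles, toy_sim, toy_indices and top_k_similar are
#hypothetical names, and it uses numpy's argsort instead of sorting a list of tuples. It also removes
#the query movie by its index rather than assuming it is always the first element after sorting.

```python
import numpy as np
import pandas as pd

# Assumed toy data: four movies and their pairwise similarity scores
toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.4],
    [0.1, 0.8, 0.4, 1.0],
])

# Reverse map from title to row position
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

def top_k_similar(title, k=2):
    idx = toy_indices[title]
    # Positions ordered from most to least similar
    order = np.argsort(-toy_sim[idx])
    # Drop the movie itself, then keep the k best matches
    order = order[order != idx][:k]
    return toy_titles.iloc[order]

print(list(top_k_similar('A')))
```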

In [19]:
get_recommendations('The Dark Knight Rises')

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object

In [20]:
get_recommendations('The Godfather')

1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object

In [21]:
#Note:- You see that, while your system has done a decent job of finding movies with similar plot
#descriptions, the quality of recommendations is not that great. "The Dark Knight Rises" returns all
#Batman movies, while it is more likely that the people who liked that movie are more inclined to enjoy
#other Christopher Nolan movies. This is something that cannot be captured by your present system.

In [25]:
#Credits, Genres, and Keywords Based Recommender

#You will build a recommender system based on the following metadata: the 3 top actors, the director,
#related genres, and the movie plot keywords.

#The keywords, cast, and crew data are not available in your current dataset, so the first step would
#be to load and merge them into your main DataFrame metadata.

credits = pd.read_csv('E:/datafiles/ml-latest-full/credits.csv')
keywords = pd.read_csv('E:/datafiles/ml-latest-full/keywords.csv')

# Remove rows with bad IDs. errors='ignore' skips labels that are no longer present, so the cell
# does not raise "KeyError: '[19730 29503 35587] not found in axis'" when the rows are already gone
# (for example, when the cell is run a second time).
metadata = metadata.drop([19730, 29503, 35587], errors='ignore')

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on = 'id')
metadata = metadata.merge(keywords, on = 'id')
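
#The reason the id columns are cast to int before merging is that the key columns need matching
#types: an id stored as the string '862' does not line up with the integer 862, and recent pandas
#versions typically refuse to merge a text column with an integer column at all. Below is a minimal
#sketch on toy data; toy_metadata and toy_keywords are assumptions made for illustration.

```python
import pandas as pd

# Assumed toy frames: ids are strings on one side and integers on the other
toy_metadata = pd.DataFrame({'id': ['862', '8844'], 'title': ['Toy Story', 'Jumanji']})
toy_keywords = pd.DataFrame({'id': [862, 8844], 'keywords': ['toys', 'board game']})

# Bring both key columns to the same integer type
toy_metadata['id'] = toy_metadata['id'].astype('int')
toy_keywords['id'] = toy_keywords['id'].astype('int')

# Inner join on the shared id column
toy_merged = toy_metadata.merge(toy_keywords, on='id')
print(toy_merged)
```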