The dataset file contains metadata for all 4,803 movies in the dataset.
It captures features such as budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, and vote_count.

I will load our movies dataset into a pandas DataFrame:

In [1]:
# First, import pandas
import pandas as pd

# Load the movies metadata
movies = pd.read_csv('tmdb_5000_movies.csv', low_memory=False)

# Display the first row to get a view of the column headings
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",12/10/2009,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


One of the most basic approaches I can think of is to rank the movies by their ratings and take the top 250.

However, using a rating as a metric has a few caveats:

For one, it does not take into consideration the popularity of a movie. Therefore, a movie with a rating of 9 from 10 voters will be considered 'better' than a movie with a rating of 8.9 from 10,000 voters.

Since I am trying to build a clone of TMDB's Top 250, I will use its weighted rating formula as a metric/score. Mathematically, it is represented as follows:

$$\text{WeightedRating (WR)} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)$$
In the above equation,

v is the number of votes for the movie;

m is the minimum votes required to be listed in the chart;

R is the average rating of the movie;

C is the mean vote across the whole report.
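
To see how the formula tempers ratings that rest on only a few votes, here is a minimal sketch applied to the two hypothetical movies mentioned earlier. The function name and the values of m and C below are round numbers assumed purely for illustration; they are not taken from the dataset.

```python
# Minimal sketch of the weighted rating formula.
# m_assumed and C_assumed are illustrative assumptions, not dataset values.
def weighted_rating_example(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by its vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m_assumed = 1000   # hypothetical minimum number of votes
C_assumed = 6.0    # hypothetical mean rating across all movies

few_votes = weighted_rating_example(v=10, R=9.0, m=m_assumed, C=C_assumed)
many_votes = weighted_rating_example(v=10_000, R=8.9, m=m_assumed, C=C_assumed)

print(round(few_votes, 2))   # stays close to C because only 10 people voted
print(round(many_votes, 2))  # stays close to R because 10,000 people voted
```

With these assumed constants, the movie rated 9 by 10 voters ends up scoring well below the movie rated 8.9 by 10,000 voters, which is the behaviour the plain average fails to give.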

I already have the values for v (vote_count) and R (vote_average) for each movie in the dataset. It is also possible to calculate C directly from this data.

In [2]:
# Calculate mean of vote average column
C = movies['vote_average'].mean()
print(C)

6.092171559442011


From the above output, I can observe that the average rating of a movie on TMDB is around 6.1 on a scale of 10.

Next, I will calculate m, the number of votes received by a movie at the 90th percentile. The pandas .quantile() method makes this task trivial.

In [3]:
# Calculate the minimum number of votes required to be in the chart, m
m = movies['vote_count'].quantile(0.90)
print(m)

1838.4000000000015
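
The fractional value comes from interpolation between neighbouring vote counts. As a small illustration on ten made-up vote counts (not the dataset), and relying on the default linear interpolation of Series.quantile():

```python
import pandas as pd

# Ten made-up vote counts, purely for illustration
votes = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# With the default linear interpolation, the 90th percentile falls
# between the two largest values rather than on one of them
cutoff = votes.quantile(0.90)
print(cutoff)

# Vote counts at or above the cutoff
print(votes[votes >= cutoff].tolist())
```

Only the largest value clears the cutoff here, which matches the idea that roughly the top 10% of entries qualify.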


Now that I have m, I can use a greater-than-or-equal-to condition to keep only the movies with at least 1838.4 votes.

I use the .copy() method to ensure that the new q_movies DataFrame is independent of my original movies DataFrame.
In other words, any changes made to the q_movies DataFrame will not affect the original movies DataFrame.

In [4]:
# Filter out all qualified movies into a new DataFrame
q_movies = movies.loc[movies['vote_count'] >= m].copy()
q_movies.shape

(481, 20)
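
A small sketch of what .copy() buys here, using a made-up three-row DataFrame in place of the real one (the names toy and toy_q are hypothetical):

```python
import pandas as pd

# Tiny stand-in for the movies DataFrame (titles and counts are made up)
toy = pd.DataFrame({'title': ['A', 'B', 'C'], 'vote_count': [5, 50, 500]})

# Filter, then copy, so the result owns its data
toy_q = toy.loc[toy['vote_count'] >= 50].copy()

# Adding a column to the copy leaves the original untouched
toy_q['score'] = 0.0

print(list(toy.columns))
print(list(toy_q.columns))
```

The new score column appears only on the filtered copy, so later cells can add features to the qualified movies without touching the full table.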

In [5]:
#Check the number of rows and columns in the full dataset
movies.shape

(4803, 20)

From the above outputs, it is clear that about 10% of the movies (481 of 4,803) have at least 1838.4 votes and therefore qualify to be on this list.

The next and most important step is to calculate the weighted rating for each qualified movie. To do this, I will:

Define a function, weighted_rating();
Pass m and C, which I have already calculated, as default arguments to the function;
Select the vote_count (v) and vote_average (R) values from each row of the q_movies DataFrame;
Compute the weighted average and return the result.

I will then define a new feature, score, whose value I calculate by applying this function to my DataFrame of qualified movies:

In [6]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the TMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [7]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
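
Because the formula only involves arithmetic on two columns, the same score can also be computed on whole columns at once instead of row by row. The sketch below uses a made-up three-row DataFrame and assumed values for m and C (the names toy, m_toy and C_toy are hypothetical) to check that both approaches agree.

```python
import pandas as pd

# Made-up data and assumed constants, for illustration only
toy = pd.DataFrame({'vote_count': [2000, 5000, 12000],
                    'vote_average': [7.0, 8.4, 8.2]})
m_toy, C_toy = 1800.0, 6.1

def weighted_rating_row(x, m=m_toy, C=C_toy):
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m) * R) + (m / (m + v) * C)

# Row by row, using apply with axis=1
by_row = toy.apply(weighted_rating_row, axis=1)

# Column-wise: pandas applies the arithmetic to every row at once
v, R = toy['vote_count'], toy['vote_average']
by_column = (v / (v + m_toy)) * R + (m_toy / (v + m_toy)) * C_toy

# Largest absolute difference between the two results
print((by_row - by_column).abs().max())
```

On a few hundred rows either form is fine; the column-wise version mainly pays off on much larger tables.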

Finally, I will sort the DataFrame in descending order based on the score feature column and output the title, vote count, vote average, and weighted rating (score) of the top 10 movies.

In [8]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 10 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

Unnamed: 0,title,vote_count,vote_average,score
1881,The Shawshank Redemption,8205,8.5,8.059258
662,Fight Club,9413,8.3,7.939256
65,The Dark Knight,12002,8.2,7.92002
3232,Pulp Fiction,8428,8.3,7.904645
96,Inception,13752,8.1,7.863239
3337,The Godfather,5893,8.4,7.851236
95,Interstellar,10867,8.1,7.809479
809,Forrest Gump,7927,8.2,7.803188
329,The Lord of the Rings: The Return of the King,8064,8.1,7.727243
1990,The Empire Strikes Back,5879,8.2,7.697884


Well, from the above output, I can see that the simple recommender did a great job!

The chart has a lot of movies in common with the TMDB Top 250 chart: for example, my top two movies, "The Shawshank Redemption" and "Fight Club", are the same as TMDB's. We all know they are indeed amazing movies; in fact, all of the top 10 movies deserve to be on that list.

In [9]:
#Print plot overviews of the first 5 movies.
movies['overview'].head()

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

The problem at hand is a Natural Language Processing problem. Hence, I need to extract some kind of features from the above text data before I can compute the similarity or dissimilarity between overviews. Put simply, it is not possible to compute the similarity between any two overviews in their raw form. Instead, I need to compute a word vector for each overview, or document, as it will be called from now on.

Fortunately, scikit-learn gives me a built-in TfidfVectorizer class that produces the TF-IDF matrix in a couple of lines.

In [10]:
#Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
movies['overview'] = movies['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(4803, 20978)
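
To make the numbers inside such a matrix more concrete, here is a hand-rolled sketch on three made-up one-line "overviews". It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that TfidfVectorizer applies by default, but it tokenizes by simple whitespace splitting and removes no stop words, so it is an approximation of the idea rather than a reimplementation of the class.

```python
import math
from collections import Counter

# Three made-up documents, for illustration only
docs = [
    "space marine explores alien moon",
    "pirate captain sails stormy sea",
    "marine captain leads rescue mission",
]
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct word, in a fixed order
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency (assumed to match the scikit-learn default)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Term count times IDF for each vocabulary word, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_vector(t) for t in tokenized]

# One row per document, one column per vocabulary word
print(len(matrix), len(vocab))
```

Words that occur in several documents (here "marine" and "captain") receive a lower IDF than words unique to one document, which is what lets distinctive terms dominate the similarity computed later.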

In [11]:
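#Array mapping from feature integer indices to feature names
#(get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
tfidf.get_feature_names_out()[5000:5010].tolist()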

['define',
 'defined',
 'defines',
 'defining',
 'definite',
 'definitely',
 'definition',
 'definitions',
 'deflower',
 'deformed']

From the above output, I observe that 20,978 distinct words are used to describe the 4,803 movies in my dataset.
With this matrix in hand, I can now compute a similarity score.

I will be using the cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. I use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate (especially when used in conjunction with TF-IDF scores).

Since TfidfVectorizer L2-normalizes each document vector by default, the dot product between any two vectors directly gives their cosine similarity score.
Therefore, I will use sklearn's linear_kernel() instead of cosine_similarity(), since it is faster.

In [12]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
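
The reason a plain dot product is enough can be checked by hand. The sketch below uses two made-up vectors and hypothetical helper functions (dot, l2_normalize, cosine) to show that, once vectors are scaled to unit length, their dot product equals the full cosine similarity.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(a):
    """Scale a vector so that its Euclidean length is 1."""
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two made-up vectors standing in for rows of a TF-IDF matrix
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

full_cosine = cosine(a, b)
plain_dot = dot(l2_normalize(a), l2_normalize(b))

# Both values agree up to floating-point rounding
print(full_cosine, plain_dot)
```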

In [13]:
cosine_sim.shape

(4803, 4803)

In [14]:
cosine_sim[1]

array([0.        , 1.        , 0.        , ..., 0.02160533, 0.        ,
       0.        ])

I am going to define a function that takes in a movie title as an input and outputs a list of the 10 most similar movies. To do this, I first need a reverse mapping from movie titles to DataFrame indices. In other words, I need a mechanism to identify the index of a movie in my movies DataFrame, given its title.

In [15]:
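#Construct a reverse map from movie titles to DataFrame indices
indices = pd.Series(movies.index, index=movies['title'])
#Keep the first entry for any repeated title so that indices[title] returns a single integer
indices = indices[~indices.index.duplicated(keep='first')]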

In [16]:
indices[:10]

title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
Spider-Man 3                                5
Tangled                                     6
Avengers: Age of Ultron                     7
Harry Potter and the Half-Blood Prince      8
Batman v Superman: Dawn of Justice          9
dtype: int64
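
One thing to watch with a title-to-index Series is repeated titles: a label lookup on a repeated title returns a Series of matches rather than a single integer. A minimal sketch with made-up titles (toy_titles, toy_indices and unique_indices are hypothetical names) shows the effect and one possible way to handle it:

```python
import pandas as pd

# Made-up titles; 'Batman' appears twice on purpose
toy_titles = pd.Series(['Avatar', 'Batman', 'Spectre', 'Batman'])
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

# A repeated label returns every match, not a single position
print(type(toy_indices['Batman']).__name__)

# One possible fix: keep only the first occurrence of each title
unique_indices = toy_indices[~toy_indices.index.duplicated(keep='first')]
print(unique_indices['Batman'])
```

Keeping the first occurrence is a simplification: it silently ignores later movies that share a title, so a real application might instead disambiguate by release year or id.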

I am now in good shape to define my recommendation function. These are the steps I will follow:

Get the index of the movie given its title.

Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.

Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

Ignore the first element, as it refers to the movie itself (the movie most similar to a particular movie is the movie itself), and get the next 10 elements of this list.

Return the titles corresponding to the indices of the top elements.

In [17]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movies['title'].iloc[movie_indices]
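
To see what the intermediate values inside the function look like, here is the same enumerate, sort and slice sequence on a single made-up row of five similarity scores, keeping the top two instead of the top 10:

```python
# Made-up similarity scores between the movie at position 1 and five movies (positions 0-4)
sim_row = [0.1, 1.0, 0.0, 0.7, 0.3]

# Pair each score with its position
pairs = list(enumerate(sim_row))

# Sort by score, highest first
pairs = sorted(pairs, key=lambda x: x[1], reverse=True)

# Skip the first entry (the movie itself) and keep the next two
top = pairs[1:3]
top_positions = [i for i, _ in top]
print(top_positions)
```

Skipping the first entry relies on the movie itself sorting first, which holds when its self-similarity of 1 is the largest score in the row.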

In [18]:
get_recommendations('The Godfather')

2731     The Godfather: Part II
1873                 Blood Ties
867     The Godfather: Part III
3727                 Easy Money
3623                       Made
3125                     Eulogy
3896                   Sinister
4506            The Maid's Room
3783                        Joe
2244      The Cold Light of Day
Name: title, dtype: object

In [19]:
get_recommendations('Fight Club')

3619                      UHF
2828                Project X
2585          The Hurt Locker
2344              Raging Bull
2023               The Animal
1414      Blast from the Past
4044               Go for It!
3515             Freaky Deaky
4045    Dancer, Texas Pop. 81
4760    This Is Martin Bonner
Name: title, dtype: object

In [20]:
get_recommendations('The Dark Knight')

3                         The Dark Knight Rises
428                              Batman Returns
3854    Batman: The Dark Knight Returns, Part 2
299                              Batman Forever
1359                                     Batman
119                               Batman Begins
1181                                        JFK
9            Batman v Superman: Dawn of Justice
2507                                  Slow Burn
210                              Batman & Robin
Name: title, dtype: object

In [21]:
get_recommendations('Pulp Fiction')

3526            The Sting
3194       All or Nothing
3466        Sliding Doors
4624            Locker 13
2917          The Fighter
4036            Antibirth
3491         The Wackness
2849             Nebraska
3504                11:14
3346    Jumping the Broom
Name: title, dtype: object

In [22]:
get_recommendations('Inception')

2897                                Cypher
134     Mission: Impossible - Rogue Nation
1930                            Stone Cold
914                   Central Intelligence
1683                       Pitch Perfect 2
1248                        At First Sight
1512                 A History of Violence
2389                           Renaissance
1803                        Blood and Wine
1267                                Duplex
Name: title, dtype: object