# 1. Simple Recommender

In [1]:
# Import commands

import pandas as pd
import numpy as np

In [2]:
#Reading the CSV file, displaying the shape and columns in the dataset

md = pd.read_csv("movies_metadata.csv", low_memory = False)
md.shape
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [3]:
# Defining the average vote in the whole dataset

C = md.vote_average.mean()
print("Average vote is {0}".format(C))

# Defining the cutoff M as the 90th percentile of vote counts: a movie must have at least as many votes as 90% of the movies in the dataset.

M = md.vote_count.quantile(0.9)
print("Minimum vote count is {0}".format(M))

Average vote is 5.618207215134185
Minimum vote count is 160.0


In [4]:
# Copy only the movies whose vote count is at or above the 90th percentile (at least 160 votes)

q_90 = md.loc[md.vote_count >= M].copy()
q_90.shape

(4555, 24)
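
To see what the quantile cutoff does, here is a minimal sketch on made-up vote counts. The numbers and the `toy_counts` name are illustrative and not taken from the dataset, and the sketch assumes pandas' default linear interpolation for `quantile`.

```python
import pandas as pd

# Hypothetical vote counts for ten movies (illustrative only)
toy_counts = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 1000])

# 90th percentile, using pandas' default linear interpolation
cutoff = toy_counts.quantile(0.9)

# Keep only the movies at or above the cutoff
kept = toy_counts[toy_counts >= cutoff]

print(cutoff)
print(kept.tolist())
```

Only the most-voted movie survives the filter here, which mirrors how the real cutoff keeps roughly the top tenth of the dataset.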

In [5]:
# Defining the score function 

def score(X, C = C, M = M):
    V = X.vote_count
    R = X.vote_average
    
    return ((V*R) + (M*C)) / (V + M)

In [6]:
# Adding the score to the new dataset

q_90['score'] = q_90.apply(score, axis = 1)
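
The score is a weighted rating, (V*R + M*C) / (V + M): a movie with few votes is pulled toward the dataset mean C, while a movie with many votes keeps a score close to its own average R. As a sketch, the same formula can be applied to whole columns at once instead of row by row with `apply`. The `weighted_rating` name and the two-row frame are assumptions for illustration; C and M are the values printed earlier in this notebook.

```python
import pandas as pd

C = 5.618207215134185  # mean vote_average printed earlier
M = 160.0              # 90th-percentile vote_count printed earlier

def weighted_rating(df, c=C, m=M):
    """Vectorized form of (V*R + M*C) / (V + M), computed on whole columns."""
    v = df["vote_count"]
    r = df["vote_average"]
    return (v * r + m * c) / (v + m)

# Two rows with vote counts and averages as they appear in the dataset
sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "Dilwale Dulhania Le Jayenge"],
    "vote_count": [8358.0, 661.0],
    "vote_average": [8.5, 9.1],
})
sample["score"] = weighted_rating(sample)
print(sample)
```

On the full frame this would be `q_90['score'] = weighted_rating(q_90)`, which should give the same numbers as the row-wise `apply`.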

In [7]:
# Sorting movies by score in descending order, then showing the title, vote count, vote average and score of the top 10 movies

q_90 = q_90.sort_values('score', ascending = False)
q_90[['title', 'vote_count', 'vote_average', 'score']].head(10).reset_index(drop = True)


| | title | vote_count | vote_average | score |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | 8358.0 | 8.5 | 8.445869 |
| 1 | The Godfather | 6024.0 | 8.5 | 8.425439 |
| 2 | Dilwale Dulhania Le Jayenge | 661.0 | 9.1 | 8.421453 |
| 3 | The Dark Knight | 12269.0 | 8.3 | 8.265477 |
| 4 | Fight Club | 9678.0 | 8.3 | 8.256385 |
| 5 | Pulp Fiction | 8670.0 | 8.3 | 8.251406 |
| 6 | Schindler's List | 4436.0 | 8.3 | 8.206639 |
| 7 | Whiplash | 4376.0 | 8.3 | 8.205404 |
| 8 | Spirited Away | 3968.0 | 8.3 | 8.196055 |
| 9 | Life Is Beautiful | 3643.0 | 8.3 | 8.187171 |


##### Conclusion: This is a very simple recommender based on two voting metrics, vote count and vote average. It is limited, however, because it ignores every other feature in the dataset, some of which may carry more information than the vote metrics. We therefore need to build a better recommender, as shown below.

# 2. NLP Recommender using Cosine Similarity & TF-IDF

1. https://www.machinelearningplus.com/nlp/cosine-similarity/
2. https://medium.com/analytics-vidhya/understanding-tf-idf-in-nlp-4a28eebdee6a

### 2.1 Plot Description Based

In [8]:
# Showcasing the plot descriptions, a feature we'll focus on in this method.

md.overview.head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [9]:
# Importing TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# This removes the common words ('a', 'an', 'the', etc.)
tfidf = TfidfVectorizer(stop_words = 'english')

# Filling missing plot descriptions with blank string
md['overview'] = md['overview'].fillna('')

# This creates a matrix with one row per movie (around 45,000) and one column per vocabulary term (around 75,000)
tfidf_matrix = tfidf.fit_transform(md['overview'])
tfidf_matrix.shape

(45466, 75827)

In [10]:
# The last seven feature names in the vocabulary

tfidf.get_feature_names_out()[75820:75827]

array(['海難1890', '見鬼10', '주식회사', '찾기', '첫사랑', 'ﬁrst', 'ﬁve'], dtype=object)

Observation #1:

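The two words that look like "first" and "five" seem out of place at the end of the vocabulary. This is a character encoding quirk: each begins with the single ligature character "ﬁ" (U+FB01) rather than the two letters "f" and "i". Feature names are sorted by code point, so these two terms land after the CJK and Hangul terms instead of among the other "f" words.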
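
A small sketch with the standard library makes the ligature visible. It also shows one possible cleanup, NFKC normalization, which this notebook does not apply; treat it as an option to try on the overview text before vectorizing, not as part of the method above.

```python
import unicodedata

word = "\ufb01rst"  # begins with U+FB01, a single ligature character

# Four characters, not five: the ligature counts as one
print(len(word), unicodedata.name(word[0]))

# NFKC normalization expands the ligature into plain "f" + "i"
normalized = unicodedata.normalize("NFKC", word)
print(normalized, len(normalized))

# Code-point order: plain letters, then Hangul, then the ligature form
ordered = sorted(["\ufb01rst", "first", "첫사랑"])
print(ordered)
```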

In [11]:
# Calculating cosine-sim using dot product

from sklearn.metrics.pairwise import linear_kernel

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

Since TfidfVectorizer L2-normalizes each row by default, the dot product between two rows is already their cosine similarity. We therefore use sklearn's linear_kernel() instead of cosine_similarity(), which is faster because it skips the redundant normalization step.

This returns a matrix of shape 45466x45466 that holds the cosine similarity of each movie overview with every other movie overview. Each movie is therefore a 1x45466 row vector in which each column is its similarity score with one movie.
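
The claim that a plain dot product equals cosine similarity here can be checked on a toy corpus. This is a minimal sketch with made-up sentences; it relies on `TfidfVectorizer` using its default `norm='l2'`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    "toys come alive when their owner leaves the room",
    "a cowboy toy fears being replaced by a new toy",
    "a mob boss hands the family business to his son",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Each TF-IDF row has unit length under the default L2 normalization
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

dot = linear_kernel(X, X)      # plain dot products
cos = cosine_similarity(X, X)  # normalizes first, then dot products
same = np.allclose(dot, cos)

print(row_norms)
print(same)
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and the two functions would generally disagree.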

In [12]:
cosine_sim.shape

(45466, 45466)
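
A dense matrix of this shape is large: assuming 64-bit floats, it needs about 16.5 GB of memory. The sketch below does that arithmetic and shows one alternative, computing a single row of similarities on demand. The toy corpus and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Memory for the full similarity matrix, 8 bytes per float64 entry
n_movies = 45466
dense_gb = n_movies * n_movies * 8 / 1e9
print(f"Full matrix: about {dense_gb:.1f} GB")

toy_overviews = [
    "toys come alive when their owner leaves the room",
    "a cowboy toy fears being replaced by a new toy",
    "a mob boss hands the family business to his son",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

# Full matrix versus a single row computed only when it is needed
full = linear_kernel(toy_matrix, toy_matrix)
row = linear_kernel(toy_matrix[1], toy_matrix)

print(row.shape)
print(np.allclose(row[0], full[1]))
```

With the real data, the on-demand form would be `linear_kernel(tfidf_matrix[idx], tfidf_matrix)`, which returns a 1x45466 array for one movie.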

In [13]:
# Showing the similarity scores with the first movie

cosine_sim[0:1, :14]

array([[1.        , 0.01504121, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ]])

Observation #2:

Since our first movie is Toy Story, this array shows the similarity scores between Toy Story and the first 14 movies in the dataset, itself included. That is why the first entry, at index [0, 0], is 1: the similarity of a movie with itself is always 1.


Observation #3:

In the cell below, we need the index values to be movie titles, since we'll be searching by title. We also need to drop duplicate movie titles. [Note: Toy Story and Toy Story 2 are still two different titles.]

In [14]:
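# Making movie title the index and keeping the first row for each duplicated title
# (Series.drop_duplicates() would compare the values, i.e. the row numbers, which are already unique)

indices = pd.Series(md.index, index = md['title'])
indices = indices[~indices.index.duplicated(keep = 'first')]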
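
One thing worth checking when deduplicating a title lookup like this: `Series.drop_duplicates()` compares the values, not the index labels. The sketch below uses a made-up three-row Series (the `toy` name and its contents are illustrative) to show the difference from filtering on `Index.duplicated()`.

```python
import pandas as pd

# Hypothetical lookup: two different rows share the title "Heat"
toy = pd.Series([0, 1, 2], index=["Heat", "Sabrina", "Heat"])

# Compares the values 0, 1, 2, which are all unique, so nothing is dropped
by_value = toy.drop_duplicates()

# Compares the index labels and keeps the first row for each title
by_label = toy[~toy.index.duplicated(keep="first")]

print(len(by_value))
print(by_label)
```

With duplicated labels left in place, `toy["Heat"]` returns a Series of two row numbers rather than a single integer, which would break a lookup that expects one index per title.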

In [15]:
# Showing the first 14 movies in the list.
indices[:14]

title
Toy Story                       0
Jumanji                         1
Grumpier Old Men                2
Waiting to Exhale               3
Father of the Bride Part II     4
Heat                            5
Sabrina                         6
Tom and Huck                    7
Sudden Death                    8
GoldenEye                       9
The American President         10
Dracula: Dead and Loving It    11
Balto                          12
Nixon                          13
dtype: int64

Observation #4: Compare this list with the cosine-sim array from cell 13. Can you see why most of these movies have a similarity of 0 with Toy Story?

In [16]:
# Now we look up the movie's row index in the list above, use it to pull that movie's row from the cosine-sim matrix,
# sort the scores, and return the 10 most similar movies. Savvy?

def get_recommendation(title, cosine_sim = cosine_sim):
    idx = indices[title]
    #print(idx) # Try this out
    
    sim_scores = list(enumerate(cosine_sim[idx]))
    #print(sim_scores) # Try this out
    
    # Sort by top scores
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
    #print(sim_scores) # Try this out
    
    # Take the top 10, skipping position 0, which is the movie itself (its self-similarity of 1 sorts first)
    sim_scores = sim_scores[1:11]
    
    # Indices for top 10 to be used to retrieve the title from the main matrix 
    movie_indices = [i[0] for i in sim_scores]
    
    
    return md['title'].iloc[movie_indices]

In [17]:
# Driver Code. Enter the movie name here

get_recommendation('Toy Story')

15348                                     Toy Story 3
2997                                      Toy Story 2
10301                          The 40 Year Old Virgin
24523                                       Small Fry
23843                     Andy Hardy's Blonde Trouble
29202                                      Hot Splash
43427                Andy Kaufman Plays Carnegie Hall
38476    Superstar: The Life and Times of Andy Warhol
42721    Andy Peters: Exclamation Mark Question Point
8327                                        The Champ
Name: title, dtype: object
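
`get_recommendation` skips position 0 on the assumption that the queried movie always sorts first. If another movie has an identical overview, both score 1 and the queried movie may not be at position 0. The sketch below is one possible variant, not the notebook's method: it excludes the queried index explicitly. The `top_k_similar` name and the small matrix are assumptions for illustration.

```python
import numpy as np

def top_k_similar(sim_row, idx, k=10):
    """Return the indices of the k highest scores, never including idx itself."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[idx] = -np.inf              # rule out the queried movie explicitly
    k = min(k, scores.size - 1)        # cannot return more than the other movies
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist()

# Hypothetical similarity matrix: movies 0 and 2 have identical overviews
sim = np.array([
    [1.0, 0.2, 1.0, 0.0],
    [0.2, 1.0, 0.2, 0.5],
    [1.0, 0.2, 1.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
])

result = top_k_similar(sim[2], idx=2, k=2)
print(result)
```

For movie 2, sorting the row and slicing `[1:3]` would drop its twin (index 0) and return movie 2 itself; the variant above returns the twin and never the queried movie.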

### 2.2 Top 3 Actors, Director, Genres, and Plot Keywords Based

###### Coming soon...