# Recommendation Engine Using Similarity Matrix
#### Overview
This is a recommendation engine using Similarity Matrix. The dataset I am using is the movie dataset.
The goal is to use the different text columns available in the dataset to build a similarity matrix using NLP.
The recommendation system like this one can be used for streaming services. Other real world use case can be food delivery app etc. 
#### Importing required packages

In [16]:
import pandas as pd
from IPython.display import display
from sklearn.feature_extraction.text import TfidfVectorizer  
from  sklearn.metrics.pairwise import linear_kernel
import ast

#### Reading the dataset

In [17]:
movies = pd.read_csv('data\movies.csv')

#### EDA
First step would be to do some Exploratory Data Analysis

In [18]:
#checking table structure
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

From the dataset I can see that except, homepage and tagline most of the data is complete (runtime, release date do have 1 and 2 missing data, but is negligible for dataset of this size)

In [19]:
display(movies.head())

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


Let's look at the columns that may be useful in describing the movie.
Following columns maybe have characteristic that would define the movie:
1. genres
2. keywords
3. original_title
4. overview
5. production_companies
6. production_countries
7. spoken_languages
8. tagline

Following were the columns from above list which were in json format. These columns have 'name' key, thus I  used that key to extract that information.

In [20]:
cols_with_json = ['genres', 'keywords','production_companies','production_countries',
       'spoken_languages']

#### Data Preparation and Feature Engineering
Following code will extract the names from the above column and store it in new columns respectively

In [21]:
cols_with_description = []
try:
    for col in cols_with_json:
        new_column_name = col + '_name'
        movies[col] = movies[col].apply(ast.literal_eval)
        movies[new_column_name] = movies[col].apply(lambda x:','.join([item['name'] for item in x]))
        cols_with_description.append(new_column_name)
except Exception as e:
    print(e)

In [22]:
display(movies[cols_with_description])

Unnamed: 0,genres_name,keywords_name,production_companies_name,production_countries_name,spoken_languages_name
0,"Action,Adventure,Fantasy,Science Fiction","culture clash,future,space war,space colony,so...","Ingenious Film Partners,Twentieth Century Fox ...","United States of America,United Kingdom","English,Español"
1,"Adventure,Fantasy,Action","ocean,drug abuse,exotic island,east india trad...","Walt Disney Pictures,Jerry Bruckheimer Films,S...",United States of America,English
2,"Action,Adventure,Crime","spy,based on novel,secret agent,sequel,mi6,bri...","Columbia Pictures,Danjaq,B24","United Kingdom,United States of America","Français,English,Español,Italiano,Deutsch"
3,"Action,Crime,Drama,Thriller","dc comics,crime fighter,terrorist,secret ident...","Legendary Pictures,Warner Bros.,DC Entertainme...",United States of America,English
4,"Action,Adventure,Science Fiction","based on novel,mars,medallion,space travel,pri...",Walt Disney Pictures,United States of America,English
...,...,...,...,...,...
4798,"Action,Crime,Thriller","united states–mexico barrier,legs,arms,paper k...",Columbia Pictures,"Mexico,United States of America",Español
4799,"Comedy,Romance",,,,
4800,"Comedy,Drama,Romance,TV Movie","date,love at first sight,narration,investigati...","Front Street Pictures,Muse Entertainment Enter...",United States of America,English
4801,,,,"United States of America,China",English


In the next step, I combined all the prepared text columns into one column, that's the column to build the similarity matrix.

In [23]:
#adding other non json columns to the description column
movies['description'] = movies['genres_name'] + movies['keywords_name']+movies['production_companies_name']+movies['production_countries_name']+movies['spoken_languages_name']+ movies['overview']+movies['original_title']+movies['tagline']

In [24]:
#Since I have most data, however to be on safe side will go ahead with next step
movies['description'] = movies['description'].fillna('')

#### Model Building
Sklearn's TfidfVectorizer  and Linear_Kernal were used for the Similarity Matrix.
 TfidfVectorizer, will convert a collection of raw documents to a matrix of TF-IDF (Sklearn). This basically count-vectorizer followed by TfidfTransformer
 The outcome of TfidfVectorizer matrix, which would give how a word weighs compared to other occurrence of the same word in other rows. Rarer words have more weight.

In [25]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english') #removing the words like a,the etc.
tfidf_matrix = tfidf_vectorizer.fit_transform(movies['description'])

Linear Kernel provided the clustering of similar movies.

In [26]:
similarity_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)

In [27]:
#similarity_matrix

Next part of the code is to transform the similarity matrix, so that the data can be easily stored in a database. The matrix will be converted into a panda dataframe and would be converted to long form.

In [28]:
pd_similarity = pd.DataFrame(similarity_matrix, columns=movies['title']).reset_index(drop=True)
pd_similarity['title'] = movies['title']
pd_similarity_melt = pd.melt(pd_similarity, id_vars=['title'], var_name='Similar Movies', value_name='Similarity Coefficient')

#### Recommendation at work.
The below function uses the Similarity Matrix to recommend movies to the user 

In [29]:
def movie_recomendation(movie_title, number_of_movies=5): 
    """
    This function gives the recommendation based on current selection.
    :param movie_title: title of the movie for which it will recommend similar movies
    :param number_of_movies: how many similar movies to recommend
    :return: movies to recommend or if the title is not present a message stating the same
    """
    #print(movie_title)
    recommendation = pd_similarity_melt[pd_similarity_melt['title']== movie_title].sort_values('Similarity Coefficient', ascending=False)
    final_list = recommendation.index.tolist()
    if len(list(recommendation['Similar Movies'])) == 0:
        return f"No recommendation for {movie_title}" 
    else :
        return   recommendation['Similar Movies'][1:number_of_movies+1].reset_index(drop=True)
        

Example of end user using the system

In [30]:
try:
    movie_title = input('Enter movie title: ').strip()
    how_many = int(input('How many movies to recommend: ').strip())

    print(movie_recomendation(movie_title, how_many))
except Exception as e:
    #print(e)
    print("Please enter a number in how many movies to recommend")

0                                    Cars 2
1                           Bride of Chucky
2                            McFarland, USA
3                         The Good Dinosaur
4     The Fast and the Furious: Tokyo Drift
5                     The Final Destination
6                                 Silverado
7                  The Fast and the Furious
8                                The Wraith
9                     Gone in Sixty Seconds
10                      Monsters University
11                          Death Race 2000
Name: Similar Movies, dtype: object
