In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from IPython.display import display

# Movie Recommender System
### (Based on Collaborative Filtering)
### Author: Tihomir Todorov

### Table of Contents

1. [Overview](#overview)
    1. [What is a Recommender System?](#what_is_rec_system)
    2. [Benefits](#benefits)
    3. [Types of Recommender Systems](#rec_system_types)
2. [Collaborative Filtering](#collab_filtering)
3. [Python Implementation](#python_implementation)
4. [Conclusion](#conclusion)
5. [Resources](#resources)

<a name="overview"></a>
## 1. Overview

<a name="what_is_rec_system"></a>
### A. What is a Recommender System?


A recommender system is a subclass of [information filtering systems](https://en.wikipedia.org/wiki/Information_filtering_system) that seeks to predict the rating or preference a user would give to an item. In simple terms, it is an algorithm that suggests relevant items to users: which books to buy (**Amazon**), which movies to watch (**Netflix**), which songs to listen to (**Spotify**), where to go on holiday (**Booking**), and many others.

<a href="https://imgur.com/sPJt7Lb"><img src="https://i.imgur.com/sPJt7Lb.png" title="source: imgur.com" width="700"/></a>

<a name="benefits"></a>
### B. Benefits

In this section I share what I consider some of the most important benefits of a recommender system:

<br></br>
    1. **Drive Traffic**: Through personalized email messages and targeted ads, a recommendation engine can encourage elevated amounts of traffic to your site, thus increasing the opportunity to scoop up more data to further enrich a customer profile.
<br></br>
    2. **Deliver Relevant Content**: By analyzing the customer’s current site usage and previous browsing history, a recommendation engine can deliver relevant product recommendations as he or she shops based on said profile. The data is collected in real time so the software can react as shopping habits change constantly.
<br></br>   
    3. **Engage Shoppers**: Shoppers become more engaged when personalized product recommendations are made to them across the customer journey. Through individualized product recs, customers are able to delve more deeply into your product line without having to dive into (and very likely get lost in) an e-commerce rabbit hole.
<br></br>    
    4. **Convert Shoppers to Customers**: Converting shoppers into customers takes a special touch. Personalized interactions from a recommendation engine show your customer that he or she is valued as an individual, in turn, engendering long-term loyalty.
<br></br>    
    5. **Increase Average Order Value**: Average order values typically rise when an engine is leveraged to display personalized options as shoppers are more willing to spend generously on items they thoroughly covet.
<br></br>    
    6. **Control Merchandising and Inventory Rules**: A recommendation engine can add your marketing and inventory control directives to a customer’s profile to feature products that are on clearance or overstocked so as to avoid unnecessary shopping friction.
<br></br>    
    7. **Reduce Workload and Overhead**: The volume of data required to create a personal shopping experience for each customer is usually far too large to be managed manually. Using a recommender system automates this process, easing the workload for the IT staff.
<br></br>
    8. **A Recommendation Engine Provides Reports**: Detailed reports are an integral part of a personalization system. Accurate and up-to-the-minute reporting will allow you to make informed decisions about the direction of a campaign or the structure of a product page.

Etc...

<a name="rec_system_types"></a>
### C. Types of Recommender Systems

There are three main types of recommendation engines: **collaborative filtering**, **content-based filtering** – and a hybrid of the two.

<a href="https://imgur.com/GghIS53"><img src="https://i.imgur.com/GghIS53.png" title="source: imgur.com" /></a>
<br></br>
    1. *Collaborative filtering*: focuses on collecting and analyzing data on user behavior, activities, and preferences, to predict what a person will like, based on their similarity to other users.
    To calculate these similarities, collaborative filtering works on a user-item matrix of ratings or interactions. An advantage of collaborative filtering is that it doesn’t need to analyze or understand the content (products, films, books). It simply picks items to recommend based on what it knows about the users' behavior.
<br></br>    
    2. *Content-based filtering*: Content-based filtering works on the principle that if you like a particular item, you will also like items that are similar to it. To make recommendations, algorithms use a profile of the customer’s preferences and a description of an item (genre, product type, color, word length) to work out the similarity of items using measures such as [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) and [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).  
    The downside of content-based filtering is that the system is limited to recommending products or content similar to what the person is already buying or using. It can’t go beyond this to recommend other types of products or content. For example, it couldn’t recommend products beyond homeware if the customer had only bought homeware.
<br></br>    
    3. A *hybrid recommendation engine*: looks at both the meta (collaborative) data and the transactional (content-based) data. Because it draws on both, it can often outperform either approach on its own.
    In a hybrid recommendation engine, natural language processing tags can be generated for each product or item (movie, song), and vector equations used to calculate the similarity of products. A collaborative filtering matrix can then be used to recommend items to users depending on their behaviors, activities, and preferences. Netflix is the perfect example of a hybrid recommendation engine. It takes into account both the interests of the user (collaborative) and the descriptions or features of the movie or show (content-based).  
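
To make the two similarity measures mentioned for content-based filtering concrete, here is a minimal sketch. The feature vectors are hypothetical (imagine one flag per genre), and the helper names `cosine_similarity` and `euclidean_distance` are our own, not part of any library.

```python
import numpy as np

# Hypothetical item descriptions as genre flags: [action, comedy, romance]
movie_a = np.array([1.0, 0.0, 1.0])
movie_b = np.array([1.0, 1.0, 1.0])


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def euclidean_distance(u, v):
    """Straight-line distance between u and v (0 = identical vectors)."""
    return float(np.linalg.norm(u - v))


cos_ab = cosine_similarity(movie_a, movie_b)
dist_ab = euclidean_distance(movie_a, movie_b)
print(f"cosine similarity: {cos_ab:.3f}, Euclidean distance: {dist_ab:.3f}")
```

Note that the two measures point in opposite directions: a higher cosine similarity means the items are more alike, while a higher Euclidean distance means they are less alike.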

<a name="collab_filtering"></a>
## 2. Collaborative Filtering

Collaborative filtering is rooted in the idea that there are relations between products and the interests of customers. It comes in two main flavors: **user-based** and **item-based**. User-based collaborative filtering works with the similarity between the target-user and other users, while item-based collaborative filtering works with the similarity between the items that the target-user has rated or interacted with and other items.

- In user-based collaborative filtering we have an active user to whom the recommendation is directed. The recommender system first searches for users who share similar evaluation patterns with the target-user. Those similarities are based on the history, preferences and choices of those users when they buy something. The system then uses the evaluations of those users to predict how the target-user would evaluate a product that he or she has not been exposed to before. For example, if two users like similar movies, we can recommend to the target-user the movies that the other user has seen and rated highly.

<a href="https://imgur.com/I1HLweD"><img src="https://i.imgur.com/I1HLweD.png" title="source: imgur.com" width="300px"/></a>
    
### How does it work?

Let's suppose that we have a simple 'user-item' matrix that shows the evaluations of four users for five different movies. Let's also assume that our target-user has seen and evaluated three of those five movies. First we check which movies our target-user still hasn't seen.

<a href="https://imgur.com/0wrhiye"><img src="https://i.imgur.com/0wrhiye.jpg" title="source: imgur.com" width="600px"/></a>

The next step is to find the similarities between the target-user and the other users through one of the methods from statistics and linear algebra (e.g. Euclidean distance, Pearson correlation, cosine similarity, etc.). To calculate the level of similarity between two users, we use the three movies that both have evaluated in the past.

<a href="https://imgur.com/nblx3Kx"><img src="https://i.imgur.com/nblx3Kx.jpg" title="source: imgur.com" width="600px"/></a>

The numbers 0.7, 0.9, 0.4 represent the **weight of similarity** or the **proximity** of the target-user to the other users in this dataset. Using those weights, we can estimate the target-user's opinion of the two movies that he or she has not rated. We do this by multiplying each of the other users' ratings by that user's similarity weight.

<a href="https://imgur.com/dHczOLu"><img src="https://i.imgur.com/dHczOLu.jpg" title="source: imgur.com" width="600px"/></a>

Now, we can generate the recommendation matrix by adding up all the weighted ratings for each movie. However, since three users rated the first potential movie and only two users rated the second, we need to normalize the weighted rating values. To do this, we divide the sum of the weighted ratings by the sum of the similarity indices of the users who rated that movie.

<a href="https://imgur.com/1aDRhEH"><img src="https://i.imgur.com/1aDRhEH.jpg" title="source: imgur.com" width="600px"/></a>

The result is the potential rating that our target-user will give these movies based on their similarity to other users. We can use it to classify the movies and offer recommendations to our active user.
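
The calculation above can be sketched in a few lines of pandas. The similarity weights 0.7, 0.9 and 0.4 come from the example; the ratings, user names and movie names below are hypothetical stand-ins, since the real values live in the figures.

```python
import numpy as np
import pandas as pd

# Similarity of each neighbor to the target-user (weights from the example above)
similarity = pd.Series({"user_1": 0.7, "user_2": 0.9, "user_3": 0.4})

# Hypothetical ratings for the two movies the target-user has not seen;
# NaN means that the neighbor did not rate the movie
ratings = pd.DataFrame(
    {"movie_A": [4.0, 5.0, 3.0], "movie_B": [2.0, np.nan, 4.0]},
    index=similarity.index,
)

# Multiply every rating by the similarity weight of the user who gave it
weighted = ratings.mul(similarity, axis=0)

# Sum the weighted ratings per movie (missing ratings are skipped)
weighted_sum = weighted.sum()

# Sum only the weights of the users who actually rated each movie
similarity_sum = ratings.notna().astype(float).mul(similarity, axis=0).sum()

# Normalized score: the predicted rating of the target-user for each movie
predicted = weighted_sum / similarity_sum
print(predicted)
```

With these made-up numbers, `movie_A` gets (0.7·4 + 0.9·5 + 0.4·3) / (0.7 + 0.9 + 0.4) = 4.25, while `movie_B` is normalized only by the two users who rated it.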

- Now, let's quickly examine the difference between user-based and item-based collaborative filtering.

<a href="https://imgur.com/DBSDaQB"><img src="https://i.imgur.com/DBSDaQB.jpg" title="source: imgur.com" width="700px"/></a>

In the **user-based** approach, the recommendation is based on users in the same cluster with whom you share common preferences. For example, since user 1 and user 3 like item 3 and item 4, we consider them similar, or neighboring users, and we recommend item 1 to user 3, which has been positively rated by user 1.

In the **item-based** approach, similar items build clusters based on user behavior (not content!). For example, item 1 and item 3 are considered neighbors because they have both been positively rated by user 1 and user 2. Item 1 can therefore be recommended to user 3, who has already shown interest in item 3. In other words, the recommendations here are based on the other items in the same cluster as the items a user already likes.
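
Both views can be computed from the same user-item matrix: comparing rows gives user-based neighbors, comparing columns gives item-based neighbors. The sketch below uses a hypothetical like/dislike matrix that loosely mirrors the example (item 2 is filled in arbitrarily) and cosine similarity as the measure; `cosine_similarity_matrix` is our own helper.

```python
import numpy as np
import pandas as pd

# Hypothetical matrix: 1 = the user liked the item, 0 = no positive rating
likes = pd.DataFrame(
    [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 1]],
    index=["user_1", "user_2", "user_3"],
    columns=["item_1", "item_2", "item_3", "item_4"],
    dtype=float,
)


def cosine_similarity_matrix(df):
    """Pairwise cosine similarity between the rows of df (rows must be non-zero)."""
    values = df.to_numpy()
    unit_rows = values / np.linalg.norm(values, axis=1, keepdims=True)
    return pd.DataFrame(unit_rows @ unit_rows.T, index=df.index, columns=df.index)


user_similarity = cosine_similarity_matrix(likes)    # user-based: compare rows
item_similarity = cosine_similarity_matrix(likes.T)  # item-based: compare columns

# Closest neighbor of user 3, and the item closest to item 1
nearest_user = user_similarity.loc["user_3"].drop("user_3").idxmax()
nearest_item = item_similarity.loc["item_1"].drop("item_1").idxmax()
print(nearest_user, nearest_item)
```

In this toy matrix, user 3's closest neighbor is user 1 and item 1's closest neighbor is item 3, so both approaches end up suggesting item 1 to user 3, as in the text.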

<a name="python_implementation"></a>
## 3. Python Implementation

To build a recommender system we need a dataset of items that users have already rated. For this we are going to use the following files:

- ratings.csv
- movies.csv

In [3]:
'''First we read the csv files to get our dataframes (the libraries were imported above)'''

movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
print(f"For this project we have {movies_df.shape[0]} titles available in our 'movies' dataset.")
print(f"The number of ratings available in the 'ratings' dataset is {ratings_df.shape[0]}.")

For this project we have 9742 titles available in our 'movies' dataset.
The number of ratings available in the 'ratings' dataset is 100836.


In [4]:
'''Now let's have a preview of our datasets'''
display(movies_df.head())
display(ratings_df.head())

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
'''For this project we're not going to need the year and the genre, so they can be removed'''

movies_df['title'] = movies_df.title.str.replace(r' (\(\d\d\d\d\))', '', regex=True)
movies_df = movies_df.drop('genres', axis=1)
movies_df.head()

Unnamed: 0,movieId,title
0,1,Toy Story
1,2,Jumanji
2,3,Grumpier Old Men
3,4,Waiting to Exhale
4,5,Father of the Bride Part II


In [6]:
'''
This is the whole list of movies, feel free to go through it 
and use some of the titles for the next step
'''

pd.set_option("display.max_rows", None, "display.max_columns", None)
# print(movies_df)

### The process 

From this point, the steps we will go through are:

1. Create a user with the movies they have seen and their ratings
2. Find the users in the dataset who have rated the same movies
3. Calculate a similarity score between the input user and each of those users
4. Keep the 50 most similar users (neighbors) and get all the movies they have rated
5. Recommend the movies with the highest weighted average score

In [7]:
'''Let's start by creating a user to recommend movies to'''

userInput = [
{'title':'Heat', 'rating':5},
{'title':'Star Wars: Episode II - Attack of the Clones', 'rating':4.5},
{'title':'Secret in Their Eyes, The (El secreto de sus ojos)', 'rating':4.5},
{'title':'Terminator 2: Judgment Day', 'rating':4.5},
{'title':'Four Rooms', 'rating':3.5}
]
inputMovies = pd.DataFrame(userInput)

In [8]:
'''Next step is to group the rows 
by userId. We're also going to sort 
these groups so that users who share 
most movies in common with the input 
get higher priority. This provides a 
richer recommendation since we are 
not going to go through all users.
'''

inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
inputMovies = pd.merge(inputId, inputMovies)
display(inputMovies)

userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
display(userSubset.head())
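# Group by a scalar key so that each group name is a plain userId
# (a one-element list key yields tuple names in recent pandas versions)
userSubsetGroup = userSubset.groupby('userId')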
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)
userSubsetGroup = userSubsetGroup[0:100]

Unnamed: 0,movieId,title,rating
0,6,Heat,5.0
1,18,Four Rooms,3.5
2,589,Terminator 2: Judgment Day,4.5
3,5378,Star Wars: Episode II - Attack of the Clones,4.5
4,71033,"Secret in Their Eyes, The (El secreto de sus o...",4.5


Unnamed: 0,userId,movieId,rating,timestamp
2,1,6,4.0,964982224
552,5,589,3.0,847435258
564,6,6,4.0,845553757
806,6,589,3.0,845553317
886,7,589,2.5,1106635940


We are going to find out how similar each user is to the input user through Pearson's correlation coefficient. When applied to a sample, it is commonly represented by $r_{xy}$ and may be referred to as the **sample correlation coefficient** or the **sample Pearson correlation coefficient**. We can obtain the formula for $r_{xy}$ by substituting sample estimates of the covariance and variances into the population definition $\rho_{X,Y}=\operatorname{cov}(X,Y)/(\sigma_{X}\sigma_{Y})$. Given paired data ${\displaystyle \left\{(x_{1},y_{1}),\ldots ,(x_{n},y_{n})\right\}}$ consisting of ${\displaystyle n}$ pairs, ${\displaystyle r_{xy}}$ is defined as:

$${\displaystyle r_{xy}={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}}}}$$

where:

- $n$ is sample size
- $x_i,y_i$ are the individual sample points indexed with i
- ${\textstyle {\bar {x}}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}}$ (the sample mean); and analogously for ${\displaystyle {\bar {y}}}$

Rearranging gives us this formula for ${\displaystyle r_{xy}}$:

$${\displaystyle r_{xy}={\frac {n\sum x_{i}y_{i}-\sum x_{i}\sum y_{i}}{{\sqrt {n\sum x_{i}^{2}-\left(\sum x_{i}\right)^{2}}}~{\sqrt {n\sum y_{i}^{2}-\left(\sum y_{i}\right)^{2}}}}}.}$$

where:

- ${\displaystyle n,x_{i},y_{i}}$ are defined as above.

This formula suggests a convenient single-pass algorithm for calculating sample correlations, though depending on the numbers involved, it can sometimes be numerically unstable.

Rearranging again gives us this formula for ${\displaystyle r_{xy}}$:

$${\displaystyle r_{xy}={\frac {\sum _{i}x_{i}y_{i}-n{\bar {x}}{\bar {y}}}{{\sqrt {\sum _{i}x_{i}^{2}-n{\bar {x}}^{2}}}~{\sqrt {\sum _{i}y_{i}^{2}-n{\bar {y}}^{2}}}}}.}$$

where:

- ${\displaystyle n,x_{i},y_{i},{\bar {x}},{\bar {y}}}$ are defined as above.

An equivalent expression gives the formula for ${\displaystyle r_{xy}}$ as the mean of the products of the standard scores as follows:

$${\displaystyle r_{xy}={\frac {1}{n-1}}\sum _{i=1}^{n}\left({\frac {x_{i}-{\bar {x}}}{s_{x}}}\right)\left({\frac {y_{i}-{\bar {y}}}{s_{y}}}\right)}$$

where:

- ${\displaystyle n,x_{i},y_{i},{\bar {x}},{\bar {y}}}$ are defined as above, and ${\displaystyle s_{x},s_{y}}$ are defined below

- ${\textstyle \left({\frac {x_{i}-{\bar {x}}}{s_{x}}}\right)}$ is the standard score (and analogously for the standard score of ${\displaystyle y}$)

Alternative formulae for ${\displaystyle r_{xy}}$ are also available. For example, one can use the following formula for ${\displaystyle r_{xy}}$:

$${\displaystyle r_{xy}={\frac {\sum x_{i}y_{i}-n{\bar {x}}{\bar {y}}}{(n-1)s_{x}s_{y}}}}$$

where:

- ${\displaystyle n,x_{i},y_{i},{\bar {x}},{\bar {y}}}$ are defined as above and:

- ${\textstyle s_{x}={\sqrt {{\frac {1}{n-1}}\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}}$ (the sample standard deviation); and analogously for ${\displaystyle s_{y}}$
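
As a sanity check, the defining formula for $r_{xy}$ can be written out directly with NumPy and compared against `np.corrcoef`. This is only a sketch: `pearson_r` is our own helper name and the two rating lists are hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Sample Pearson correlation, computed from the defining formula for r_xy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()  # x_i - x_bar
    y_dev = y - y.mean()  # y_i - y_bar
    denominator = np.sqrt((x_dev ** 2).sum()) * np.sqrt((y_dev ** 2).sum())
    if denominator == 0:
        # Undefined when either variable is constant (or there is one pair)
        return float("nan")
    return float((x_dev * y_dev).sum() / denominator)


# Hypothetical ratings of the same three movies by the input user and a neighbor
input_ratings = [5.0, 4.5, 3.5]
neighbor_ratings = [4.0, 3.0, 2.5]

r = pearson_r(input_ratings, neighbor_ratings)
r_numpy = np.corrcoef(input_ratings, neighbor_ratings)[0, 1]
print(f"formula: {r:.4f}, np.corrcoef: {r_numpy:.4f}")
```

The `nan` case matters in practice: when two users share a single movie, or when one of them gave the same rating to every shared movie, the correlation is undefined and has to be handled explicitly.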

In [9]:
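pearsonCorrelationDict = {}

# Sort the input once so that its ratings line up with each group's ratings by movieId
inputMovies = inputMovies.sort_values(by='movieId')

for name, group in userSubsetGroup:
    group = group.sort_values(by='movieId')
    
    # Keep only the input movies that this user has also rated
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    
    # Ratings of the input user for the shared movies
    tempRatingList = temp_df['rating'].tolist()
    
    # Ratings of the current user for the same movies
    tempGroupList = group['rating'].tolist()
    data_correlation = {'tempGroupList': tempGroupList,
            'tempRatingList': tempRatingList}
    pd_correlation = pd.DataFrame(data_correlation)
    r = pd_correlation.corr(method="pearson")["tempRatingList"]["tempGroupList"]
    
    # The correlation is undefined (NaN) when only one movie is shared or when
    # either set of ratings is constant; we treat that as no similarity
    if math.isnan(r):
        r = 0
    pearsonCorrelationDict[name] = r
    
# Turn the dictionary into a dataframe with one row per user
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()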

Unnamed: 0,similarityIndex,userId
0,0.969458,68
1,0.0,105
2,-0.36074,182
3,0.345857,380
4,0.207514,414


In [10]:
'''
Now let's look at the top 50 users who are most similar to the target-user
'''

topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
# display(topUsers)

In [11]:
'''
Now let's recommend movies to the input user. We're going to do this by
taking the weighted average of the movie ratings, using the Pearson correlation
as the weight. To do this, we first need to get the movies rated by the users
in topUsers from the ratings dataframe, keeping each user's correlation in the
similarityIndex column. This is achieved by merging these two tables.
'''

topUsersRating = topUsers.merge(ratings_df, on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,timestamp
0,1.0,91,1,4.0,1112713037
1,1.0,91,2,3.0,1112713392
2,1.0,91,3,3.0,1112712323
3,1.0,91,6,5.0,1112712032
4,1.0,91,10,3.5,1112713269


In [12]:
'''
Multiply each rating by the similarity of the user who gave it to get a weighted rating:
'''

topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,timestamp,weightedRating
0,1.0,91,1,4.0,1112713037,4.0
1,1.0,91,2,3.0,1112713392,3.0
2,1.0,91,3,3.0,1112712323,3.0
3,1.0,91,6,5.0,1112712032,5.0
4,1.0,91,10,3.5,1112713269,3.5


In [13]:
'''
Sums the similarity indices and the weighted ratings of the top users after grouping them by movieId
'''

tempTopUsersRating = topUsersRating.groupby('movieId')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']

# Create a new dataframe for the recommendations and take the weighted average
recommendation_df = pd.DataFrame()

recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.775806,1
2,3.298585,2
3,3.228402,3
4,3.0,4
5,3.180369,5


In [16]:
'''
Finally, let's rank it and see the top 10 movies recommended by the algorithm.
Note that the table lists them in the order they appear in movies_df, not by score.
'''

recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
display(movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())])

Unnamed: 0,movieId,title
164,194,Smoke
700,918,Meet Me in St. Louis
744,971,Cat on a Hot Tin Roof
918,1217,Ran
5704,27803,"Sea Inside, The (Mar adentro)"
6999,67618,Strictly Sexual
7041,69069,Fired Up
7087,70183,"Ugly Truth, The"
7284,74946,She's Out of My League
8896,134796,Bitter Lake


And this is it! We've successfully used Pearson's correlation to build a user-based collaborative filtering recommender system!

<a name="conclusion"></a>
## 4. Conclusion

This algorithm works from users' ratings. A similar recommender system could be built from purchase history, search history, or watch history. We saw that it is relatively "easy" to create a recommender system in Python. As is often the case in data science, the model depends on having the right data, and enough of it. The value used as the "rating" is also crucial, whether it is a real evaluation by each user or an artificial value that we believe is appropriate. After that, it is a matter of evaluating the user-based and item-based options and selecting the one with the lowest error.
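
One common way to measure that error, which this notebook does not do, is the root mean squared error (RMSE) between predicted ratings and ratings held out from the data. The sketch below assumes such held-out ratings exist; all numbers are hypothetical and `rmse` is our own helper.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Hypothetical ratings that were hidden from the recommender
held_out = [4.0, 3.5, 5.0, 2.0]

# Hypothetical predictions for those ratings from the two approaches
user_based_pred = [3.5, 3.5, 4.5, 2.5]
item_based_pred = [4.5, 2.5, 4.0, 3.0]

user_based_rmse = rmse(held_out, user_based_pred)
item_based_rmse = rmse(held_out, item_based_pred)
print(f"user-based RMSE: {user_based_rmse:.3f}, item-based RMSE: {item_based_rmse:.3f}")
```

The approach with the lower RMSE on held-out ratings would be the one to prefer for that dataset.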

<a name="resources"></a>
## 5. Resources
    
- [Wikipedia](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)
- [Real Python](https://realpython.com/build-recommendation-engine-collaborative-filtering/)
- [Notebook Community](https://notebook.community/AtmaMani/pyChakras/udemy_ml_bootcamp/Machine%20Learning%20Sections/Recommender-Systems/Advanced%20Recommender%20Systems%20with%20Python)
- [Regenerative Today](https://regenerativetoday.com/a-complete-recommender-system-from-scratch-in-python-using-linear-regression/)