<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-Recommender-System" data-toc-modified-id="Introduction-to-Recommender-System-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to Recommender System</a></span></li><li><span><a href="#Collaborative-Filtering" data-toc-modified-id="Collaborative-Filtering-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Collaborative Filtering</a></span><ul class="toc-item"><li><span><a href="#Nearest-Neighborhood" data-toc-modified-id="Nearest-Neighborhood-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Nearest Neighborhood</a></span><ul class="toc-item"><li><span><a href="#K-Nearest-Neighbors" data-toc-modified-id="K-Nearest-Neighbors-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>K-Nearest Neighbors</a></span><ul class="toc-item"><li><span><a href="#Import-data" data-toc-modified-id="Import-data-2.1.1.1"><span class="toc-item-num">2.1.1.1&nbsp;&nbsp;</span>Import data</a></span></li><li><span><a href="#Pivot-dataframe:-Row-as-Users-and-Column-as-Movies" data-toc-modified-id="Pivot-dataframe:-Row-as-Users-and-Column-as-Movies-2.1.1.2"><span class="toc-item-num">2.1.1.2&nbsp;&nbsp;</span>Pivot dataframe: Row as Users and Column as Movies</a></span></li><li><span><a href="#Convert-to-Sparse-Matrix" data-toc-modified-id="Convert-to-Sparse-Matrix-2.1.1.3"><span class="toc-item-num">2.1.1.3&nbsp;&nbsp;</span>Convert to Sparse Matrix</a></span></li><li><span><a href="#Nearest-Neighbors-Fit" data-toc-modified-id="Nearest-Neighbors-Fit-2.1.1.4"><span class="toc-item-num">2.1.1.4&nbsp;&nbsp;</span>Nearest Neighbors Fit</a></span></li></ul></li></ul></li><li><span><a href="#Make-Recommendatations" data-toc-modified-id="Make-Recommendatations-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Make Recommendatations</a></span></li></ul></li></ul></div>

# Introduction to Recommender System

Reference: https://towardsdatascience.com/intro-to-recommender-system-collaborative-filtering-64a238194a26

Approaches of Collaborative Filtering: Nearest Neighborhood and Matrix Factorization

“We are leaving the age of information and entering the age of recommendation.”

Like many machine learning techniques, a recommender system makes predictions based on users’ historical behavior. Specifically, it predicts a user’s preference for a set of items based on past experience. The two most popular approaches to building a recommender system are Content-based and Collaborative Filtering.

<b>Content-based approach</b> requires a good amount of information about the items’ own features, rather than users’ interactions and feedback. For example, it can use movie attributes such as genre, year, director, and actors, or the textual content of articles, which can be extracted by applying Natural Language Processing.

<b>Collaborative Filtering</b>, on the other hand, doesn’t need anything except users’ historical preferences on a set of items. Because it’s based on historical data, the core assumption is that users who have agreed in the past tend to agree in the future. User preference is usually expressed in two categories. An Explicit Rating is a rating given by a user to an item on a sliding scale, like 5 stars for Titanic. This is the most direct feedback from users to show how much they like an item. An Implicit Rating suggests a user’s preference indirectly, through signals such as page views, clicks, purchase records, or whether or not they listened to a music track.

In this article, I will take a close look at collaborative filtering, a traditional and powerful tool for recommender systems.

# Collaborative Filtering

## Nearest Neighborhood
The standard method of Collaborative Filtering is known as the Nearest Neighborhood algorithm. It comes in two variants: <b>user-based CF</b> and <b>item-based CF</b>.

Let’s first look at User-based CF. We have an n × m matrix of ratings, with users uᵢ, i = 1, …, n and items pⱼ, j = 1, …, m. Now we want to predict the rating rᵢⱼ when target user i has not watched or rated item j. The process is to calculate the similarities between target user i and all other users, select the top X most similar users, and take the weighted average of the ratings from these X users, with the similarities as weights.

![image.png](https://miro.medium.com/max/875/1*mM089Lta5X6zkUkULcO9aA.png)
![image.png](https://miro.medium.com/max/875/1*mTRUakSIWmo9OX6D2HakWQ.png)
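
To make the weighted average concrete, here is a minimal sketch on a made-up 3 × 3 ratings matrix. The function name `predict_user_based`, the toy ratings and the similarity values are all assumptions for illustration. The sketch normalizes by the sum of the similarity weights, which is one common convention and may differ from the normalization used in the figure above.

```python
import numpy as np

def predict_user_based(ratings, similarities, target_user, item, top_x=2):
    """Predict ratings[target_user, item] from the top_x most similar users.

    ratings: (n_users, n_items) array with np.nan for missing ratings.
    similarities: (n_users, n_users) array of user-user similarities.
    """
    # Candidate neighbors: other users who actually rated the item.
    candidates = [u for u in range(ratings.shape[0])
                  if u != target_user and not np.isnan(ratings[u, item])]
    # Keep the top_x candidates, ranked by similarity to the target user.
    neighbors = sorted(candidates,
                       key=lambda u: similarities[target_user, u],
                       reverse=True)[:top_x]
    if not neighbors:
        return float("nan")  # nobody to borrow a rating from
    weights = np.array([similarities[target_user, u] for u in neighbors])
    neighbor_ratings = np.array([ratings[u, item] for u in neighbors])
    if weights.sum() == 0:
        return float("nan")
    # Weighted average of neighbor ratings (assumed normalization).
    return float(weights @ neighbor_ratings / weights.sum())

# Toy data (hypothetical): user 0 has not rated item 2.
ratings = np.array([[5.0, 3.0, np.nan],
                    [4.0, 2.0, 4.0],
                    [1.0, 5.0, 2.0]])
similarities = np.array([[1.0, 0.9, 0.1],
                         [0.9, 1.0, 0.2],
                         [0.1, 0.2, 1.0]])

prediction = predict_user_based(ratings, similarities, target_user=0, item=2)
print(round(prediction, 2))  # (0.9 * 4 + 0.1 * 2) / (0.9 + 0.1) = 3.8
```

User 1 is much more similar to user 0 than user 2 is, so the prediction lands close to user 1’s rating of 4.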

Different people may have different baselines when giving ratings: some tend to give high scores in general, while others are strict even when they are satisfied with an item. To reduce this bias, we can subtract each user’s average rating over all items when computing the weighted average, and add the target user’s average back, as shown below.

![image.png](https://miro.medium.com/max/875/1*gLbwJts3g_v2TbPRhFoNfA.png)
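
The mean-centering idea can be sketched as follows. This is only an illustration under assumptions: the function name `predict_mean_centered` and the toy numbers are made up, every user who rated the item is used as a neighbor (no top X cut), and the deviations are normalized by the sum of absolute similarities.

```python
import numpy as np

def predict_mean_centered(ratings, similarities, target_user, item):
    """Mean-centered, similarity-weighted prediction (np.nan = missing)."""
    # Each user's average over the items they actually rated.
    user_means = np.nanmean(ratings, axis=1)
    neighbors = [u for u in range(ratings.shape[0])
                 if u != target_user and not np.isnan(ratings[u, item])]
    if not neighbors:
        return float(user_means[target_user])  # fall back to the user's mean
    weights = similarities[target_user, neighbors]
    # How far each neighbor's rating sits above or below their own baseline.
    deviations = ratings[neighbors, item] - user_means[neighbors]
    norm = np.abs(weights).sum()
    if norm == 0:
        return float(user_means[target_user])
    # Add the weighted deviation back onto the target user's baseline.
    return float(user_means[target_user] + weights @ deviations / norm)

# Toy data (hypothetical): user 2 is a generous rater, user 1 a strict one.
ratings = np.array([[3.0, 4.0, np.nan],
                    [2.0, 3.0, 4.0],
                    [5.0, 4.0, 3.0]])
similarities = np.array([[1.0, 0.8, 0.2],
                         [0.8, 1.0, 0.1],
                         [0.2, 0.1, 1.0]])

centered_prediction = predict_mean_centered(ratings, similarities, 0, 2)
print(round(centered_prediction, 2))  # 3.5 + (0.8 * 1 + 0.2 * -1) / 1.0 = 4.1
```

User 1 rated the item one point above their own average, and user 2 one point below theirs, so the prediction moves above user 0’s baseline of 3.5 because user 1 carries more weight.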

Two ways to calculate similarity are Pearson Correlation and Cosine Similarity.

![image.png](https://miro.medium.com/max/875/1*Xvf2o6kE4VCuueMPikxZ_A.png)
![image.png](https://miro.medium.com/max/875/1*6HISTi8SjbD2VHicoZwKpA.png)
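
As a small sketch of the two similarity measures, the functions below compute them for two rating vectors with NumPy. The function names and the example vectors are assumptions, and the sketch assumes both vectors cover the same co-rated items with no missing values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    return cosine_similarity(a - a.mean(), b - b.mean())

# Hypothetical ratings from three users on the same three items.
user_a = np.array([5.0, 3.0, 4.0])
user_b = np.array([4.0, 2.0, 3.0])  # same taste as user_a, but one point stricter
user_c = np.array([1.0, 5.0, 2.0])  # roughly opposite taste

print(pearson_correlation(user_a, user_b))  # close to 1: the constant offset is removed
print(cosine_similarity(user_a, user_b))    # slightly below 1
print(pearson_correlation(user_a, user_c))  # negative
```

Pearson correlation ignores each user’s baseline, which is why the stricter user still counts as a perfect match, while cosine similarity on raw ratings does not fully remove that offset.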


Basically, the idea is to find the users most similar to your target user (the nearest neighbors) and use the weighted average of their ratings of an item as the predicted rating of that item for the target user.

Without knowing anything about the items and users themselves, we consider two users similar when they give the same item similar ratings. Analogously, for Item-based CF, we say two items are similar when they receive similar ratings from the same user. We then make a prediction for a target user on an item by calculating the weighted average of this user’s ratings on the X most similar items. One key advantage of Item-based CF is stability: the ratings on a given item do not change significantly over time, unlike the tastes of human beings.

![image.png](https://miro.medium.com/max/875/1*dPzd5-dScFplypBGeSwgUw.png)
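
Item-based CF can be sketched the same way, with similarities computed between item columns instead of user rows. The function name `predict_item_based` and the toy matrix are assumptions. The sketch treats 0 as "not rated", uses cosine similarity on the raw columns, and assumes every item has at least one rating.

```python
import numpy as np

def predict_item_based(ratings, user, item, top_x=2):
    """Predict ratings[user, item] from the user's own ratings of similar items.

    ratings: (n_users, n_items) array where 0 means "not rated".
    """
    # Cosine similarity between every pair of item columns.
    norms = np.linalg.norm(ratings, axis=0)
    item_sims = (ratings.T @ ratings) / np.outer(norms, norms)
    # Candidate neighbors: other items this user has rated.
    rated = [j for j in range(ratings.shape[1])
             if j != item and ratings[user, j] > 0]
    # Keep the top_x items most similar to the target item.
    neighbors = sorted(rated, key=lambda j: item_sims[item, j], reverse=True)[:top_x]
    if not neighbors:
        return float("nan")
    weights = item_sims[item, neighbors]
    if weights.sum() == 0:
        return float("nan")
    return float(weights @ ratings[user, neighbors] / weights.sum())

# Toy data (hypothetical): user 0 has not rated item 2.
ratings = np.array([[5.0, 4.0, 0.0, 1.0],
                    [4.0, 5.0, 5.0, 1.0],
                    [1.0, 2.0, 1.0, 5.0],
                    [5.0, 5.0, 4.0, 2.0]])

item_prediction = predict_item_based(ratings, user=0, item=2)
print(round(item_prediction, 2))
```

Items 0 and 1 are rated most like item 2 by the other users, so the prediction is a blend of user 0’s ratings for those two items (5 and 4).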

This method has quite a few limitations. It doesn’t handle sparsity well: no prediction can be made when no one in the neighborhood has rated the item you are trying to predict for the target user. It is also not computationally efficient as the number of users and products grows.


### K-Nearest Neighbors
The k-nearest neighbors (KNN) algorithm doesn’t make any assumptions about the underlying data distribution, but it relies on item feature similarity. When KNN makes a prediction about a movie, it calculates the “distance” between the target movie and every other movie in its database (this notebook uses cosine distance). It then ranks the distances and returns the top k nearest neighbor movies as the most similar movie recommendations.

#### Import data

In [1]:
from recommender_system import utils

In [2]:
import os
import pandas as pd

data_path = 'data/'
movies_filename = 'movie.csv'
ratings_filename = 'rating.csv'

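# load the movies and drop the genres column, which is not used here
df_movies = pd.read_csv(os.path.join(data_path, movies_filename))
df_movies = df_movies.drop('genres', axis=1)

# load the ratings and drop the timestamp column, which is not used here
df_ratings = pd.read_csv(os.path.join(data_path, ratings_filename))
df_ratings = df_ratings.drop('timestamp', axis=1)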

In [3]:
df_movies.head()

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [4]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


In [5]:
# keep only movies that have been rated at least 50 times, so that each one has enough ratings to reflect how users react to it

movies_keep = utils.get_list_movies_to_keep(df_ratings)
df_ratings_drop_movies = df_ratings[df_ratings.movieId.isin(movies_keep)]

print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping unpopular movies: ',
      df_ratings_drop_movies.shape)

# keep only active users, so that each one has enough ratings to approximate their tastes

active_users = utils.get_list_users_to_keep(df_ratings)
df_ratings_drop_users = df_ratings_drop_movies[
    df_ratings_drop_movies.userId.isin(active_users)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping both unpopular movies and inactive users: ', df_ratings_drop_users.shape)

shape of original ratings data:  (20000263, 3)
shape of ratings data after dropping unpopular movies:  (19845397, 3)
shape of original ratings data:  (20000263, 3)
shape of ratings data after dropping both unpopular movies and inactive users:  (18122385, 3)
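
The helpers `utils.get_list_movies_to_keep` and `utils.get_list_users_to_keep` are not shown in this notebook. A count-based filter like the following is one plausible way to write them. The function name `ids_with_min_ratings`, the toy data and the threshold of 2 are assumptions for illustration only.

```python
import pandas as pd

def ids_with_min_ratings(df, column, min_ratings):
    """Return the ids in `column` that appear in at least `min_ratings` rows."""
    counts = df[column].value_counts()
    return counts[counts >= min_ratings].index.tolist()

# Toy ratings table (hypothetical values).
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 30, 10],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Keep movies with at least 2 ratings, then users with at least 2 ratings.
popular_movies = ids_with_min_ratings(toy_ratings, 'movieId', min_ratings=2)
active_users = ids_with_min_ratings(toy_ratings, 'userId', min_ratings=2)
toy_filtered = toy_ratings[toy_ratings.movieId.isin(popular_movies)
                           & toy_ratings.userId.isin(active_users)]
print(sorted(popular_movies), sorted(active_users), toy_filtered.shape)
```

As in the cell above, the user counts here are taken from the unfiltered ratings, so a user’s activity is measured before unpopular movies are dropped.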


In [6]:
df_ratings_clean = df_ratings_drop_users.reset_index(drop=True)
df_ratings_clean

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5
...,...,...,...
18122380,138493,68954,4.5
18122381,138493,69526,4.5
18122382,138493,69644,3.0
18122383,138493,70286,5.0


In [7]:
df_movies_clean = df_movies[df_movies.movieId.isin(df_ratings_clean.movieId)].reset_index(drop=True)
df_movies_clean

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)
...,...,...
10468,119141,The Interview (2014)
10469,119145,Kingsman: The Secret Service (2015)
10470,119155,Night at the Museum: Secret of the Tomb (2014)
10471,120635,Taken 3 (2015)


#### Pivot dataframe: Row as Users and Column as Movies
To make the data easier to work with, we pivot the dataframe to have userId as rows and movieId as columns, filling the null values with 0.0.

Reference: https://machinelearningmastery.com/sparse-matrices-for-machine-learning/

In [8]:
# pivot ratings into movie features
df_movie_features_clean = utils.pivot_dataframe(df_ratings_clean)
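
`utils.pivot_dataframe` is not shown here, so the following is only a guess at what such a pivot might look like on a tiny made-up ratings table. It follows the description above (users as rows, movies as columns, missing ratings filled with 0.0) and also shows the transpose, which gives one row per movie.

```python
import pandas as pd

# Toy ratings table (hypothetical values).
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})

# One row per user, one column per movie, 0.0 where no rating exists.
user_movie = toy_ratings.pivot(index='userId', columns='movieId',
                               values='rating').fillna(0.0)

# Transposed view: one row per movie, so each row is that movie's
# vector of ratings across users.
movie_user = user_movie.T

print(user_movie)
print(movie_user.shape)
```

Which orientation the real helper returns is an assumption either way. To find neighbors that are movies, the matrix passed to the model needs one row per movie.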

#### Convert to Sparse Matrix
Matrices that contain mostly zero values are called sparse, distinct from matrices where most of the values are non-zero, called dense.

The sparsity of a matrix can be quantified with a score, which is the number of zero values in the matrix divided by the total number of elements in the matrix.
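
As a quick illustration of the sparsity score, the sketch below computes it for a small made-up matrix using `scipy.sparse.csr_matrix`. The matrix values are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy matrix (hypothetical): 3 non-zero ratings out of 12 cells.
dense = np.array([[4.0, 0.0, 0.0],
                  [0.0, 0.0, 3.5],
                  [0.0, 5.0, 0.0],
                  [0.0, 0.0, 0.0]])
sparse = csr_matrix(dense)

# Sparsity = zero cells / total cells = 1 - non-zero cells / total cells.
n_cells = sparse.shape[0] * sparse.shape[1]
sparsity = 1.0 - sparse.count_nonzero() / n_cells
print(sparsity)  # 9 zero cells out of 12 = 0.75
```

The CSR format stores only the non-zero entries, which is why it pays off for rating matrices where most cells are empty.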

In [9]:
from scipy.sparse import csr_matrix

mat_movie_features_clean = csr_matrix(df_movie_features_clean.values)

#### Nearest Neighbors Fit

In [10]:
from sklearn.neighbors import NearestNeighbors

# create an instance of the NearestNeighbors class, using cosine distance and brute-force search
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)

## Make Recommendatations
We’ve configured the KNN model and pre-processed the dataset. Now we just need to take a movie title as input and recommend movies based on the nearest neighbors found by KNN.

In [11]:
fav_movie = "Forrest Gump"
data = mat_movie_features_clean
n_recommendations = 10

utils.make_recommendation(model_knn, data, fav_movie, n_recommendations, df_movies_clean)

You have input movie: Forrest Gump
Recommendation system start to make inference
......

Forrest Gump (1994) with distance : 4.2099657093785936e-13
Jurassic Park (1993) with distance : 0.26321274756062996
Pulp Fiction (1994) with distance : 0.2954752834085578
Silence of the Lambs, The (1991) with distance : 0.2970385252573423
Shawshank Redemption, The (1994) with distance : 0.2980727216529797
Braveheart (1995) with distance : 0.30358517087419
Apollo 13 (1995) with distance : 0.32690052924147606
Terminator 2: Judgment Day (1991) with distance : 0.3273906366370589
Fugitive, The (1993) with distance : 0.3323692354837502
Speed (1994) with distance : 0.3379339683783722
Schindler's List (1993) with distance : 0.35064623006748785
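
Since `utils.make_recommendation` is not shown in this notebook, here is a hypothetical stand-in that illustrates the general flow: match the title, fit the model, query the nearest neighbors and map the row indices back to titles. The function name `recommend_similar_movies`, its parameters, the substring matching and the toy ratings are all assumptions. The sketch expects one row per movie and fits the model inside the helper, which may not be how the real helper works.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def recommend_similar_movies(model, movie_matrix, movie_ids, df_titles,
                             query, n_recommendations):
    """Return (title, distance) pairs for the movies nearest to `query`.

    movie_matrix: sparse matrix with one row per movie (row i = movie_ids[i]).
    """
    # Case-insensitive substring match on the title (assumed matching rule).
    matches = df_titles[df_titles.title.str.contains(query, case=False, regex=False)]
    if matches.empty:
        return []
    row = movie_ids.index(matches.movieId.iloc[0])
    model.fit(movie_matrix)
    # Ask for one extra neighbor: the query movie is its own nearest neighbor.
    distances, indices = model.kneighbors(movie_matrix[row],
                                          n_neighbors=n_recommendations + 1)
    id_to_title = dict(zip(df_titles.movieId, df_titles.title))
    return [(id_to_title[movie_ids[i]], float(d))
            for d, i in zip(distances[0], indices[0])]

# Toy data (hypothetical ratings): rows are movies, columns are users.
movie_ids = [1, 2, 3, 4]
df_titles = pd.DataFrame({
    'movieId': movie_ids,
    'title': ['Toy Story (1995)', 'Jumanji (1995)',
              'Grumpier Old Men (1995)', 'Waiting to Exhale (1995)'],
})
movie_matrix = csr_matrix([[5.0, 4.0, 0.0, 5.0],
                           [4.0, 5.0, 0.0, 4.0],
                           [0.0, 1.0, 5.0, 0.0],
                           [1.0, 0.0, 4.0, 0.0]])

model = NearestNeighbors(metric='cosine', algorithm='brute')
results = recommend_similar_movies(model, movie_matrix, movie_ids,
                                   df_titles, 'toy story', n_recommendations=2)
for title, distance in results:
    print(f'{title} with distance : {distance}')
```

The first result is the query movie itself at a distance of roughly zero, which mirrors the output above, where Forrest Gump (1994) heads the list with a distance of about 4.2e-13.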
