# Collaborative Filtering (CF)



In month 1, we learn about some common techniques to recommend items to a user.


[The 1st notebook](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/Month%201%20Part%20I%20-%20Non%20Personalised%20and%20Stereotyped%20Recommendation.ipynb) presented non-personalised and stereotyped recommendations, which only took averages of the population's ratings in order to predict and present the most popular items.


[The 2nd notebook](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/Month%201%20Part%20III%20-%20Content%20Based%20Recommendation.ipynb) introduced a little personalisation, where we created a user's taste vector and used it to 'match' the user with other documents.
    
This notebook introduces the concept of **collaborative filtering**, a recommendation strategy to find and match similar entities. I say entities because there are two variants of collaborative filtering:


* User User CF: The first CF technique created, User User CF takes into consideration only the users' past behaviour, *i.e.*, their ratings, and nothing about the items' characteristics. The idea is pretty simple: if two users $U_{1}$ and $U_{2}$ have both liked items $I_{a}$ and $I_{b}$, and user $U_{2}$ also liked an item $I_{c}$ that $U_{1}$ hasn't seen yet, we infer that item $I_{c}$ would be a good recommendation for $U_{1}$. The following picture illustrates this.

<img src="images/notebook4_image1.png" width="600">

* Item Item CF: User User CF has some drawbacks, which we are going to talk about later. Because of these drawbacks, a more efficient approach was created, Item Item CF. This technique doesn't take the similarities between users into consideration, only the similarities between items. With this, predictions of new items for a user $U$ can be easily calculated from the ratings the user gave to similar items. This approach is going to be presented in the next notebook.

# Example Dataset

For the next explanations on Nearest Neighbours for CF, we're going to use the [dataset](https://drive.google.com/file/d/0BxANCLmMqAyIQ0ZWSy1KNUI4RWc/view?usp=sharing) provided by the Coursera Specialisation in Recommender Systems, specifically the data from the assignment on User User CF in [course 2](https://www.coursera.org/learn/collaborative-filtering) of the specialisation.

The dataset is a matrix with size 100 movies x 25 users and each cell $c_{m,u}$ contains the rating user $u$ gave to movie $m$. If user $u$ didn't rate movie $m$, the cell is empty.



In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('data/User-User Collaborative Filtering - movie-row.csv', index_col=0)
print('Dataset shape: ' + str(df.shape))
df.head()

Dataset shape: (100, 25)


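| Unnamed: 0 | 1648 | 5136 | 918 | 2824 | 3867 | 860 | 3712 | 2968 | 3525 | 4323 | ... | 3556 | 5261 | 2492 | 5062 | 2486 | 4942 | 2267 | 4809 | 3853 | 2288 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11: Star Wars: Episode IV - A New Hope (1977) |  | 45 | 5.0 | 45.0 | 4.0 | 4.0 |  | 5 | 4 | 5 | ... | 4 |  | 45 | 4 | 35 |  |  |  |  |  |
| 12: Finding Nemo (2003) |  | 5 | 5.0 |  | 4.0 | 4.0 | 45.0 | 45 | 4 | 5 | ... | 4 |  | 35 | 4 | 2 | 35.0 |  |  |  | 35.0 |
| 13: Forrest Gump (1994) |  | 5 | 45.0 | 5.0 | 45.0 | 45.0 |  | 5 | 45 | 5 | ... | 4 | 5.0 | 35 | 45 | 45 | 4.0 | 35.0 | 45.0 | 35.0 | 35.0 |
| 14: American Beauty (1999) |  | 4 |  |  |  |  | 45.0 | 2 | 35 | 5 | ... | 4 |  | 35 | 45 | 35 | 4.0 |  | 35.0 |  |  |
| 22: Pirates of the Caribbean: The Curse of the Black Pearl (2003) | 4.0 | 5 | 3.0 | 45.0 | 4.0 | 25.0 |  | 5 | 3 | 4 | ... | 3 | 15.0 | 4 | 4 | 25 | 35.0 |  | 5.0 |  | 35.0 |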


# Nearest Neighbours for CF

The approach for doing CF with nearest neighbours is to compare the entity you want to match with other similar entities. With this, we have to respect two constraints:

* First, in order to bring in only the most similar items or the customers with the most similar tastes, we must limit the number of entities we compare it with.
* Second, when making predictions for unseen data, we must match it with neighbours who have already rated the item we want to predict.

With these two constraints, we have a trade-off when deciding the number of neighbours. If the number of neighbours is too low, chances are that many of them have not rated the item in question, and we end up not being able to provide confident predictions. If we set it too high, we will include too many dissimilar neighbours in our comparison, with tastes different from those of the user we want to make recommendations for.
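As a minimal sketch of how the two constraints interact, the snippet below first keeps only the candidates who rated the item and then takes the `k` most similar among them. The function name `select_neighbours`, its parameters, and the toy arrays are illustrative assumptions, not part of the course material.

```python
import numpy as np

def select_neighbours(similarities, ratings_for_item, k):
    """Return indices of up to k most similar users who rated the item.

    similarities: similarity of each candidate user to the target user.
    ratings_for_item: each candidate's rating for the item (np.nan if unrated).
    """
    similarities = np.asarray(similarities, dtype=float)
    ratings_for_item = np.asarray(ratings_for_item, dtype=float)
    # Second constraint: keep only candidates who rated the item.
    rated = np.flatnonzero(~np.isnan(ratings_for_item))
    # First constraint: among those, keep the k with the highest similarity.
    order = rated[np.argsort(similarities[rated])[::-1]]
    return order[:k]

# Hypothetical similarities and ratings for five candidate users.
sims = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
item_ratings = np.array([np.nan, 3.0, 4.5, 2.0, np.nan])

neighbours = select_neighbours(sims, item_ratings, k=2)
print(neighbours)
```

Note that the most similar candidate (index 0) is skipped because they never rated the item, which is exactly the situation that makes a very small `k` risky.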

(**reference**) ran a few experiments with different configurations for User User CF and found that, for most commercial applications used nowadays, an optimal number of neighbours to consider is between 20 and 30.

## Similarity Function

The next step in defining the neighbours of a specific user is to define the similarity metric. In the User User CF context, the input data is a matrix where the rows are the users, the columns are the items, and each cell $C_{u,i}$ is the rating that user $u$ gave to item $i$. So, if we want to compare two users $u_{1}$ and $u_{2}$ in terms of their ratings, the input to the similarity function is two arrays, each containing all the ratings that one user gave to each item, with blank values where the user didn't rate that specific item.
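To make the shape of this input concrete, here is a small sketch with two made-up rating arrays, where `np.nan` stands for a blank cell. Restricting the comparison to items rated by both users is one common way of handling the blanks; it is an assumption here, not something the text prescribes.

```python
import numpy as np

# Hypothetical ratings of two users over the same five items; np.nan marks "not rated".
u1 = np.array([4.0, np.nan, 3.5, 5.0, np.nan])
u2 = np.array([5.0, 2.0, np.nan, 4.0, 1.0])

# Only items rated by both users can be compared directly.
both_rated = ~np.isnan(u1) & ~np.isnan(u2)
u1_common = u1[both_rated]
u2_common = u2[both_rated]

print(both_rated.sum(), u1_common, u2_common)
```

With sparse data the overlap can be very small, so a similarity computed on it should be treated with caution.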

(**reference**) ran a few experiments with similarity metrics and pointed out that, in the context of User User CF, the Pearson correlation performed well at finding good user neighbours to draw data from for predictions.

The Pearson correlation is described in more detail in the notes below.

# Notes on the Pearson Correlation Coefficient

The Pearson correlation coefficient is the covariance between two variables, normalised to be bounded between -1 and 1, and it answers the following question: **How linearly correlated** are the variables $x$ and $y$? As the values are normalised, it's possible to have some guidelines for the coefficient value, such as:

* Exactly 1. A perfect uphill (positive) linear relationship
* 0.70. A strong uphill (positive) linear relationship
* 0.50. A moderate uphill (positive) linear relationship
* 0.30. A weak uphill (positive) linear relationship
* 0. No **linear** relationship, neither positive nor negative
* The same applies to negative values, which indicate downhill (negative) linear relationships
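The emphasis on **linear** matters: a coefficient of 0 does not mean the variables are unrelated. The sketch below, using made-up data and NumPy's `np.corrcoef`, compares a perfectly linear relationship with a perfectly quadratic one.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

y_linear = 2 * x + 1   # perfect uphill linear relationship
y_quadratic = x ** 2   # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(r_linear, r_quadratic)
```

Here `r_linear` is 1 (up to floating-point error), while `r_quadratic` is 0 even though `y_quadratic` depends entirely on `x`.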

The value comes from dividing the covariance between the variables $x$ and $y$ by the product of the standard deviations of $x$ and $y$:

$$r_{x,y} = \frac{S_{xy}}{S_{x}S_{y}}\hspace{7.0cm}$$

$$= \frac{\frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{n-1}}{\sqrt{\frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}}  \sqrt{\frac{\sum_{i=1}^{n}(y_{i} - \bar{y})^{2}}{n-1}}}\hspace{4.5cm}(1)$$

$$= \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}  {\sqrt{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}} \sqrt{\sum_{i=1}^{n}(y_{i} - \bar{y})^{2}}}\hspace{3cm}(2)$$
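Equation (2) translates almost directly into NumPy. The following is a minimal sketch, with a hypothetical helper named `pearson` and made-up input values, checked against NumPy's built-in `np.corrcoef`.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation following equation (2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    numerator = (dx * dy).sum()
    denominator = np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum())
    return numerator / denominator

# Hypothetical ratings from two users on five items they both rated.
x = [4.0, 5.0, 3.0, 4.5, 2.0]
y = [3.5, 5.0, 2.5, 4.0, 3.0]

r = pearson(x, y)
print(r, np.corrcoef(x, y)[0, 1])
```

The result is undefined when one of the inputs is constant, since the denominator is then zero; a real implementation would need to handle that case explicitly.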

Let's take a look first at equation (1):

- Numerator: The covariance. It basically carries the following information:
    - The covariance $c$ is the average product of the deviations of $x$ and $y$ from their means. A positive $c$ means that when $x$ is above its mean, $y$ tends to be above its mean as well; a negative $c$ means they tend to move in opposite directions.
    - Unlike the single-variable case, where the standard deviation is the square root of the variance, we don't take the square root of the covariance, as the result would still not be easy to interpret. The covariance is expressed in units of $x$ times units of $y$, so its square root would be in units of $\sqrt{xy}$. Instead of looking for a meaningful unit transformation, people tend to go directly to a unitless quantity, the correlation.

- Denominator: A normalisation factor that bounds the output between -1 and 1
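The effect of the normalisation can be checked numerically. In this sketch, with made-up values, rescaling $x$ (as if changing its unit of measurement) changes the covariance but leaves the correlation untouched. `np.cov` uses $n-1$ in the denominator by default, matching equation (1).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Sample covariance, before and after expressing x in a unit 100 times smaller.
cov_xy = np.cov(x, y)[0, 1]
cov_scaled = np.cov(100 * x, y)[0, 1]

# Correlation for the same two cases.
r_xy = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(100 * x, y)[0, 1]

print(cov_xy, cov_scaled)  # covariance scales with the unit of x
print(r_xy, r_scaled)      # correlation does not
```

For this data the covariance goes from 2 to 200, while the correlation stays at 0.8 in both cases.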


## Similarity between Pearson and Cosine Similarity