In the [last notebook](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/Month%202%20Part%20I%20-%20User%20User%20Collaborative%20Filtering.ipynb), we took a look at Nearest Neighbor User User CF, a form of recommender system that uses the similarity between users to decide which item should be suggested to a new customer.

We saw the benefits of User User CF compared with non-personalised and content-based recommendations, but we also saw that it comes with some difficulties:

- It doesn't scale well. Even in a big e-commerce dataset, the number of items two users have in common is usually small, smaller than it would have been in the settings User User CF was created for. Because of this, a single new purchase that overlaps with another customer can change their similarity drastically, forcing the system's owners to recalculate the similarities very often.

- If they don't update the matrix often, they can also lose profits, because the system wouldn't capture the user's short-term interests which, on the internet, can vary quite a lot.

All in all, short-term interests and a sparse space of mutual interests make User User CF unsuitable for large-scale companies.

# Item Item Collaborative Filtering (CF)

[Item Item CF](https://en.wikipedia.org/wiki/Item-item_collaborative_filtering) was created by ([Sarwar et al., 1998](https://patentimages.storage.googleapis.com/41/80/fb/07d4d9e61e7431/US6266649.pdf)) in partnership with Amazon in order to fix the problems with User User CF. As the name suggests, Item Item CF changes the perspective from a user-centred to an item-centred view, *i.e.*, instead of a User User similarity matrix, it uses an item item similarity matrix. Then, when a user $u$ has bought and liked an item $i_{1}$, and $i_{1}$ is similar to item $i_{2}$, we predict that $u$ will also like $i_{2}$. Take a look at the image below:

<img src="images/notebook5_image1.jpeg" width="500">

Why does this simple change in perspective help to solve the inefficiency problems present in User User CF?

In the environment of a big e-commerce company, we end up with number of users >> number of items.
In this case, even if a single user hasn't given many reviews, chances are that many users have reviewed any specific item.
With a large number of reviews, an item's relationship to other items doesn't change much when it receives a few more reviews, *i.e.*, item item relationships are more stable. Because they are more stable, the similarity matrix doesn't have to be recalculated as often as in User User CF.

An extra performance improvement comes from the prediction calculation. In Item Item CF, a new prediction for a user $u$ and a product $p$ is made by retrieving the item similarities and calculating a weighted average. The neighbors for this calculation are only the items that $u$ has liked or bought in the past, and this number is often small enough. Therefore, we don't need to search the big user user similarity matrix to find the best $k$ neighbors.

# Item Item Steps

As always, we're going to work with one of the datasets from [Coursera's Specialization on Recommender Systems](https://www.coursera.org/specializations/recommender-systems). This dataset is from the last week of the [Nearest Neighbor CF](https://www.coursera.org/learn/collaborative-filtering) course, for Item Item CF. The dataset is [here](https://d396qusza40orc.cloudfront.net/umntestsite/on-demand_files/A5/Assignment%205.xls) (Coursera's page) and [here](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/data/Item%20Item%20Collaborative%20Filtering%20-%20Ratings.csv) (personal Github account).
  
The steps taken to evaluate and recommend are similar to User User CF, with some different calculations in the prediction step, as we've said.

- Load traditional input - User Item Review dataset
- Create similarity matrix
- Make predictions

Let's go!

## Example Dataset

The dataset, as loaded below, is a matrix of 20 users x 20 movies, where each cell $c_{u,m}$ contains the rating user $u$ gave to movie $m$. If user $u$ didn't rate movie $m$, the cell is empty. As the float values were stored with decimal commas and were consequently being read as strings, I had to process them a little: replace the commas with dots and then convert the columns to floats.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/Item Item Collaborative Filtering - Ratings.csv', index_col=0, nrows=20)

df.drop('Mean', axis=1, inplace=True) # remove mean column that comes at the end

# replace commas for dots and convert previous string column into float
def processCol(col):
    return col.astype(str).apply(lambda val: val.replace(',','.')).astype(float)
df = df.apply(processCol)

print('Dataset shape: ' + str(df.shape))
df.head()

Dataset shape: (20, 20)


User,1: Toy Story (1995),1210: Star Wars: Episode VI - Return of the Jedi (1983),356: Forrest Gump (1994),"318: Shawshank Redemption, The (1994)","593: Silence of the Lambs, The (1991)",3578: Gladiator (2000),260: Star Wars: Episode IV - A New Hope (1977),2028: Saving Private Ryan (1998),296: Pulp Fiction (1994),1259: Stand by Me (1986),2396: Shakespeare in Love (1998),2916: Total Recall (1990),780: Independence Day (ID4) (1996),541: Blade Runner (1982),1265: Groundhog Day (1993),"2571: Matrix, The (1999)",527: Schindler's List (1993),"2762: Sixth Sense, The (1999)",1198: Raiders of the Lost Ark (1981),34: Babe (1995)
755,2.0,5.0,2.0,,4.0,4.0,1.0,2.0,,3.0,2.0,,5.0,2.0,5.0,4.0,2.0,5.0,,
5277,1.0,,,2.0,4.0,2.0,5.0,,,4.0,3.0,2.0,2.0,,2.0,,5.0,1.0,3.0,
1577,,,,5.0,2.0,,,,,1.0,,1.0,4.0,4.0,1.0,1.0,2.0,3.0,1.0,3.0
4388,2.0,3.0,,,,1.0,,3.0,4.0,,,4.0,,3.0,5.0,,5.0,1.0,1.0,2.0
1202,,3.0,4.0,1.0,4.0,1.0,4.0,4.0,,1.0,5.0,1.0,,4.0,,3.0,5.0,5.0,,
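
As an alternative to replacing the commas by hand, `pandas.read_csv` has a `decimal` parameter. The sketch below is not how this notebook loads its data; it uses a made-up two-movie sample (the `raw` string and the `Movie A` / `Movie B` column names are assumptions) just to show the idea. Whether it works unchanged on the real file depends on how that file quotes its values.

```python
import io

import pandas as pd

# Made-up sample: decimal commas inside a comma-separated file,
# so every number has to be quoted. Empty fields mean "not rated".
raw = 'User,Movie A,Movie B\n1,"2,5",\n2,,"4,0"\n'

# decimal=',' asks the parser to read "2,5" as the float 2.5
sample = pd.read_csv(io.StringIO(raw), index_col=0, decimal=',')

print(sample.dtypes)
print(sample)
```

With this option the columns should come out as floats directly, with missing ratings as `NaN`.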


## Create Similarity Matrix


### Similarity Function

As with User User CF, we have a few options to choose from when deciding how to define whether one item is similar to another. Again, ([Herlocker et al., 2002](https://grouplens.org/site-content/uploads/evaluating-TOIS-20041.pdf)) analysed the performance of these metrics on Item Item CF and found that, for this case, [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) was the best-performing metric, so we're going with it this time. In the **next notebook**, we analyse these metrics to see why they perform better in some cases and not in others.


### Calculating Item Item Similarity with Cosine Similarity

One important point here concerns the denominator of the cosine similarity. Even though the dot product uses only the values present in both arrays, the norm of each individual vector considers all of its values, not just the intersection between array1 and array2.
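
To make the effect of that choice concrete, here is a small hand-checkable sketch with two made-up item vectors (the numbers are assumptions, not taken from the dataset). It compares the norm over all ratings, which is what this notebook uses, against a norm restricted to the co-rated users.

```python
import numpy as np

# Made-up ratings of two items by four users (NaN = not rated)
item1 = np.array([4.0, np.nan, 2.0, 5.0])
item2 = np.array([5.0, 3.0, np.nan, 4.0])

rated1 = ~np.isnan(item1)
rated2 = ~np.isnan(item2)
both = rated1 & rated2  # users who rated both items

# Dot product over co-rated users only: 4*5 + 5*4 = 40
dot = np.dot(item1[both], item2[both])

# Norms over every rating each item received (the choice described above)
sim_all = dot / (np.linalg.norm(item1[rated1]) * np.linalg.norm(item2[rated2]))

# For comparison only: norms over the co-rated users
sim_common = dot / (np.linalg.norm(item1[both]) * np.linalg.norm(item2[both]))

print(round(sim_all, 4), round(sim_common, 4))
```

Here the all-ratings version comes out at roughly 0.84 and the intersection-only version at roughly 0.98, so using the full norms lowers the similarity of item pairs that share only part of their raters.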

In [3]:
def cos_similarity(item1, item2):
    # boolean masks of the users who rated each item
    item1Values = ~np.isnan(item1)
    item2Values = ~np.isnan(item2)
    allValues = np.logical_and(item1Values,item2Values) # users who rated both items
    # dot product over co-rated users; each norm uses all of that item's ratings
    return np.dot(item1[allValues], item2[allValues])/(np.linalg.norm(item1[item1Values]) * np.linalg.norm(item2[item2Values]))

def pre_cos_similarity(item1, df):
    return df.apply(lambda item2: cos_similarity(item1, item2))

df_corr = df.apply(lambda item1: pre_cos_similarity(item1, df))
df_corr.head()

Item,1: Toy Story (1995),1210: Star Wars: Episode VI - Return of the Jedi (1983),356: Forrest Gump (1994),"318: Shawshank Redemption, The (1994)","593: Silence of the Lambs, The (1991)",3578: Gladiator (2000),260: Star Wars: Episode IV - A New Hope (1977),2028: Saving Private Ryan (1998),296: Pulp Fiction (1994),1259: Stand by Me (1986),2396: Shakespeare in Love (1998),2916: Total Recall (1990),780: Independence Day (ID4) (1996),541: Blade Runner (1982),1265: Groundhog Day (1993),"2571: Matrix, The (1999)",527: Schindler's List (1993),"2762: Sixth Sense, The (1999)",1198: Raiders of the Lost Ark (1981),34: Babe (1995)
1: Toy Story (1995),1.0,0.644995,0.58054,0.667424,0.570229,0.587852,0.747409,0.534579,0.667846,0.492659,0.376659,0.623056,0.690665,0.383067,0.661016,0.50501,0.463817,0.421637,0.466817,0.61807
1210: Star Wars: Episode VI - Return of the Jedi (1983),0.644995,1.0,0.563029,0.456052,0.516566,0.483187,0.589805,0.408752,0.685662,0.534324,0.533429,0.391934,0.605856,0.515397,0.526952,0.535673,0.573529,0.565297,0.252604,0.511576
356: Forrest Gump (1994),0.58054,0.563029,1.0,0.293041,0.381346,0.569209,0.59555,0.463003,0.399114,0.527926,0.647153,0.491498,0.498741,0.487713,0.29829,0.631039,0.320494,0.602943,0.288275,0.456849
"318: Shawshank Redemption, The (1994)",0.667424,0.456052,0.293041,1.0,0.589,0.212846,0.565577,0.598344,0.538219,0.340151,0.329203,0.332674,0.617366,0.531981,0.437319,0.255345,0.497511,0.459446,0.467347,0.542782
"593: Silence of the Lambs, The (1991)",0.570229,0.516566,0.381346,0.589,1.0,0.551612,0.682137,0.64059,0.400471,0.661958,0.484751,0.414499,0.738445,0.585662,0.673091,0.530856,0.75763,0.715565,0.702452,0.309159


## Predictions Calculation

By now we already know which items are more similar to each other. This helps us when predicting a new rating: among the items the user has rated, those more similar to the target item get higher weights than the others.
  
The way we're going to calculate the new predictions is the same we used for User User CF, *i.e.*, a weighted average:

$$\frac{\sum_{n=1}^{k} r_{n}w_{n}}{\sum_{n=1}^{k} w_{n}}$$
  
The difference is that we don't have the neighbors anymore, so the $n$ in the summation runs over **all** the items user $u$ has rated, and $w$ is still the similarity, but now the *item similarity*.
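
The formula can be checked by hand on a tiny made-up case. In the sketch below, the ratings and the similarities to the target item are assumptions chosen so the arithmetic is easy to follow: $(5 \times 0.8 + 3 \times 0.2) / (0.8 + 0.2) = 4.6$.

```python
import numpy as np

# Made-up example: one user's ratings for three items (NaN = not rated)
user_ratings = np.array([5.0, np.nan, 3.0])
# Assumed similarity of each of those items to the target item
similarity_to_target = np.array([0.8, 0.5, 0.2])

# Only the items the user has rated take part in the average
rated = ~np.isnan(user_ratings)
numerator = np.dot(user_ratings[rated], similarity_to_target[rated])
denominator = similarity_to_target[rated].sum()

prediction = numerator / denominator
print(prediction)
```

The unrated second item is skipped entirely, even though its similarity to the target (0.5) is known.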

In [4]:
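def predictRating(userRatings, itemSimilarity):
    # use only the items this user has rated
    userHasRating = ~np.isnan(userRatings)
    # weighted average of the user's ratings, weighted by similarity to the target item
    return np.dot(userRatings[userHasRating], itemSimilarity[userHasRating])/np.sum(itemSimilarity[userHasRating])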

def pre_predictRating(userRatings, df_corr):
    return df_corr.apply(lambda itemSimilarity: predictRating(userRatings, itemSimilarity))

predictions = df.apply(lambda userRatings: pre_predictRating(userRatings, df_corr), axis=1)
predictions.head()    

User,1: Toy Story (1995),1210: Star Wars: Episode VI - Return of the Jedi (1983),356: Forrest Gump (1994),"318: Shawshank Redemption, The (1994)","593: Silence of the Lambs, The (1991)",3578: Gladiator (2000),260: Star Wars: Episode IV - A New Hope (1977),2028: Saving Private Ryan (1998),296: Pulp Fiction (1994),1259: Stand by Me (1986),2396: Shakespeare in Love (1998),2916: Total Recall (1990),780: Independence Day (ID4) (1996),541: Blade Runner (1982),1265: Groundhog Day (1993),"2571: Matrix, The (1999)",527: Schindler's List (1993),"2762: Sixth Sense, The (1999)",1198: Raiders of the Lost Ark (1981),34: Babe (1995)
755,3.192403,3.292306,3.100111,3.138513,3.272413,3.169078,3.040837,3.088188,3.121772,3.259814,3.154181,3.146678,3.409717,3.202508,3.391725,3.309275,3.173982,3.326625,3.169455,3.131759
5277,2.666898,2.774215,2.70855,2.723963,2.883304,2.809767,2.92224,2.780218,2.802797,2.928801,2.852131,2.701826,2.674951,2.784252,2.69072,2.770176,2.973883,2.716267,2.818155,2.610945
1577,2.37189,2.408644,2.317307,2.735981,2.320003,2.144546,2.283465,2.4279,2.397125,2.151301,2.307571,2.146637,2.379497,2.512756,2.227885,2.264895,2.327398,2.440141,2.280008,2.612319
4388,2.854355,2.926799,2.668524,2.869568,2.832946,2.600812,2.803925,2.827416,2.980781,2.890292,2.755777,2.974583,2.834529,2.904316,2.99951,2.911118,3.065207,2.784864,2.678815,2.626968
1202,3.078704,3.287867,3.304391,3.213417,3.325591,3.081952,3.191459,3.384223,3.17565,3.112196,3.45759,3.038464,3.209556,3.447542,3.204562,3.4657,3.396677,3.488553,3.205717,3.146035
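
The prediction table contains a value for every user and every item, including items the user has already rated. One way to turn such a table into recommendations, sketched here with made-up ratings and predictions (the frames `ratings` and `preds` and the parameter `top_n` are hypothetical, not part of this notebook), is to keep only the unrated items and pick the highest predictions per user.

```python
import numpy as np
import pandas as pd

# Made-up ratings (NaN = not rated) and made-up predictions for two users
ratings = pd.DataFrame(
    {"Movie A": [5.0, np.nan], "Movie B": [np.nan, 2.0], "Movie C": [3.0, np.nan]},
    index=["u1", "u2"],
)
preds = pd.DataFrame(
    {"Movie A": [4.6, 3.1], "Movie B": [4.2, 2.4], "Movie C": [3.9, 3.5]},
    index=["u1", "u2"],
)

# Keep a prediction only where the user has no rating yet
unseen = preds.where(ratings.isna())

# Highest predicted unseen items per user
top_n = 1
recommendations = {
    user: row.dropna().nlargest(top_n).index.tolist()
    for user, row in unseen.iterrows()
}
print(recommendations)
```

Masking first matters: without it, an item the user already rated highly could be recommended back to them.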


## Mean Normalised Weighted Average

As with User User CF, we can calculate predictions from the absolute values of the reviews or from their mean-centred values. The advantage is the same: it accounts for the different ways reviewers use the rating scale when assigning a final score to an item of interest:

$$\bar{r_{u}} + \frac{\sum_{n=1}^{k} (r_{n} - \bar{r_{n}})w_{n}}{\sum_{n=1}^{k} w_{n}}$$

The code below takes two steps before predicting:
- subtract each user's mean rating (`userMeanRatings`) from that user's ratings, giving `df_ratings_norm`
- recompute the cosine similarity matrix on the mean-centred ratings, giving `df_corr_norm`

In [5]:
# mean normalise
def subtractFromMean(col, meanCol):
    result = np.array([np.nan] * col.shape[0])
    isValidValue = ~np.isnan(col)
    result[isValidValue] = col.values[isValidValue] - meanCol.values[isValidValue]
    return result
userMeanRatings = df.apply(np.mean, axis=1)
df_ratings_norm = df.apply(lambda col: subtractFromMean(col, userMeanRatings))

# similarity matrix
df_corr_norm = df_ratings_norm.apply(lambda item1: pre_cos_similarity(item1, df_ratings_norm))
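
The mean-centring step can also be written with pandas' built-in row-wise subtraction. The sketch below uses a made-up 2 x 3 ratings frame (`toy` and its values are assumptions) to show that `DataFrame.sub(..., axis=0)` subtracts each user's mean from that user's row and leaves missing ratings as `NaN`.

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movies (NaN = not rated)
toy = pd.DataFrame(
    {"Movie A": [4.0, 5.0], "Movie B": [2.0, np.nan], "Movie C": [np.nan, 3.0]},
    index=["u1", "u2"],
)

# Per-user mean; NaNs are skipped, so u1 -> 3.0 and u2 -> 4.0
user_means = toy.mean(axis=1)

# Subtract each user's mean from every rating in that user's row
centred = toy.sub(user_means, axis=0)
print(centred)
```

After centring, a positive value means "above this user's average" and a negative value means "below it", regardless of which part of the scale the user tends to use.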


### Remove negative correlations

In this example, we replace the negative correlations with 0, *i.e.*, dissimilar items get zero weight in the weighted average instead of a negative one:


In [6]:
def replaceNegative(col):
    # return a clipped copy rather than modifying the column in place
    return col.clip(lower=0)
df_corr_norm2 = df_corr_norm.apply(replaceNegative)
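
A small made-up case shows what a negative weight can do to the weighted average. The ratings and weights below are assumptions picked for illustration: with weights 0.6 and -0.5, the denominator shrinks to 0.1 and the prediction lands far outside a 1 to 5 scale.

```python
import numpy as np
import pandas as pd

# Made-up ratings of two items by one user, and assumed similarities
# of those two items to the target item (one of them negative)
ratings = np.array([5.0, 1.0])
weights = pd.Series([0.6, -0.5])

# With the negative weight: (5*0.6 - 1*0.5) / (0.6 - 0.5), about 25
raw_pred = np.dot(ratings, weights) / weights.sum()

# Clip negatives to zero, then average again: (5*0.6) / 0.6, about 5
clipped = weights.clip(lower=0)
clipped_pred = np.dot(ratings, clipped) / clipped.sum()

print(raw_pred, clipped_pred)
```

After clipping, the prediction is a proper weighted average again and stays between the lowest and highest rating that carries a non-zero weight.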

### Predict!

In [7]:
predictions_norm = df.apply(lambda userRatings: pre_predictRating(userRatings, df_corr_norm2), axis=1)
predictions_norm.head() 


User,1: Toy Story (1995),1210: Star Wars: Episode VI - Return of the Jedi (1983),356: Forrest Gump (1994),"318: Shawshank Redemption, The (1994)","593: Silence of the Lambs, The (1991)",3578: Gladiator (2000),260: Star Wars: Episode IV - A New Hope (1977),2028: Saving Private Ryan (1998),296: Pulp Fiction (1994),1259: Stand by Me (1986),2396: Shakespeare in Love (1998),2916: Total Recall (1990),780: Independence Day (ID4) (1996),541: Blade Runner (1982),1265: Groundhog Day (1993),"2571: Matrix, The (1999)",527: Schindler's List (1993),"2762: Sixth Sense, The (1999)",1198: Raiders of the Lost Ark (1981),34: Babe (1995)
755,2.1526,4.365913,2.36254,2.597612,3.90012,3.404381,1.45612,2.120407,2.025099,3.439884,2.286692,3.71603,4.516014,2.615461,4.592515,4.308328,2.098549,4.584935,3.06012,2.611896
5277,1.466851,2.789617,2.904472,1.840136,3.22089,2.324026,4.565822,3.095917,2.804427,3.602141,3.297938,1.996194,2.043163,2.641633,1.922817,2.681204,4.562483,2.032737,2.861378,1.822288
1577,3.084638,2.1731,1.60324,4.375861,2.250719,1.586648,1.725537,1.958046,3.382771,1.161933,2.030983,1.279478,3.472474,3.79885,1.422635,1.487751,2.258958,2.541503,1.40118,2.940755
4388,2.432925,2.910112,2.455503,2.84047,2.335598,1.101522,3.214281,2.863773,3.392348,2.290185,3.688964,3.803343,2.320929,3.045505,3.944862,2.873867,4.401957,1.465993,1.423875,2.785818
1202,2.773346,2.847472,3.383074,1.766545,3.969441,2.04009,4.087295,4.01391,3.085791,2.003364,4.603075,1.338091,3.55023,3.620304,2.597993,3.069085,4.421799,4.215749,3.73691,2.225889


* I didn't quite understand why the non-normalised ratings (`df`) are used in this calculation. Ideally, we would use the mean-centred user ratings as well.
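
One possible reading of this point, offered as an assumption rather than as the course's explanation: if the mean subtracted inside the sum is the user's own mean $\bar{r_{u}}$ (the same constant for every item $n$ the user rated), and the weights do not sum to zero, then the mean-centred formula reduces algebraically to the plain weighted average:

$$\bar{r_{u}} + \frac{\sum_{n=1}^{k} (r_{n} - \bar{r_{u}})w_{n}}{\sum_{n=1}^{k} w_{n}} = \bar{r_{u}} + \frac{\sum_{n=1}^{k} r_{n}w_{n}}{\sum_{n=1}^{k} w_{n}} - \bar{r_{u}}\frac{\sum_{n=1}^{k} w_{n}}{\sum_{n=1}^{k} w_{n}} = \frac{\sum_{n=1}^{k} r_{n}w_{n}}{\sum_{n=1}^{k} w_{n}}$$

Under that assumption, feeding raw or user-mean-centred ratings into the prediction step would give the same number, and the only effect of mean centring would come through the similarity weights $w_{n}$. This would not hold if $\bar{r_{n}}$ denotes a per-item mean instead.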

# Comparison Between Approaches

The comparison follows the same guidelines we used when evaluating the User User CF. 


# Final Considerations on Item Item CF

When we mean-centred the users' ratings for User User CF, the objective was clear: we wanted to take into account that users rate in different parts of the scale. But what about mean centring for Item Item CF? I'll leave this evaluation to the next notebook, where we evaluate and compare the main metrics used for similarity calculation in a CF system.

---

Item Item CF is a step forward in efficiency from the User User CF schema. With it, we get personalised recommendations in a way that is computationally efficient enough to scale for giant e-commerce companies, such as Amazon or Netflix. But Item Item CF isn't a golden system that we can implement and always get good results from. It rests on a few premises:

- First, it assumes that number of users >> number of items. This is a prerequisite for having stable entities, items in this case, so that the similarity matrix doesn't need to be recalculated as often as in User User CF.
  
  
- Secondly, and this is an interesting feature, Item Item CF works better when the item ratings are stable, *i.e.*, the items have lots of evaluations. This means that a user's predictions are probably going to be heavily influenced by these popular items and, in the end, the user receives recommendations of popular items. This is good when you want to be safe about your recommendations, such as for expensive or rarely bought products and services, like houses or cars. However, the resulting lack of '*serendipity*' is a problem when we want to help users find that particular rare item that matches their tastes amazingly well. As an example, if we take Spotify, we don't want to receive recommendations such as 'Hey, as you listened to Mozart, here is what we think you'd like: Bach'. Spotify's greatness rests on the premise of finding the bands and songs that can surprise you, so it wouldn't be effective working with the Item Item CF schema. Of course, we are going to see more advanced techniques in the future, where these companies apply modern algorithms to get good recommendations and still perform well, but the idea here was to show that we can't rely on one algorithm as the best of them all.

<img src="images/notebook5_image2.png" width="500">
  
  
In the [next notebook](https://github.com/caiomiyashiro/RecommenderSystemsNotebooks/blob/master/Month%202%20Part%20III%20-%20Notes%20on%20Similarity%20Metrics%20for%20CF.ipynb), we finalise the discussion of Collaborative Filtering by investigating a little more how the similarity metrics work and trying to find out some of their features, such as:
  
* Why does Pearson end up being better for User User CF and cosine similarity better for Item Item CF?
* What are the strengths and weaknesses of each of the evaluated metrics?
* Some philosophies on what they represent and how we can think about them geometrically
  
Stay tuned :)