### User-user Collaborative Filtering

This notebook is an exercise in user-user collaborative filtering. The basic idea is to find **similar users** and base recommendations on their ratings. The core assumptions are:
1. Our past agreement predicts our future agreement.
2. Our tastes are either individually stable or move in sync with each other.
3. Our system is scoped within a domain of agreement.

To calculate the similarity of users $u$ and $v$, we can use Pearson correlation or cosine similarity. Here we take the Pearson correlation route:

$$w_{uv} = \frac{\sum_{i\in I}(r_{ui}-\bar{r}_{u})(r_{vi}-\bar{r}_{v})}{\sigma_{u}\sigma_{v}},$$

where the sum runs over the items $i\in I$ that both users have rated, $r_{vi}$ is the rating of user $v$ for item $i$, $\bar{r}_{v}$ is the mean rating of user $v$, and $\sigma_{v} = \sqrt{\sum_{i\in I}(r_{vi}-\bar{r}_{v})^2}$ (likewise for $u$). We might need another similarity measure when the number of items that two users have both rated is too small.
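
As a minimal sketch of the formula above, the function below computes the Pearson similarity of two users from plain dictionaries. The names `pearson`, `alice`, and `bob` and the toy ratings are hypothetical. Two choices are assumptions rather than part of the formula: each mean is taken over the co-rated items only, and the function returns 0.0 when fewer than two items overlap or a user has no variance.

```python
from math import sqrt

def pearson(ratings_u: dict, ratings_v: dict) -> float:
    """Pearson correlation over the items both users have rated.

    Assumption: each user's mean is computed over the co-rated items only.
    Returns 0.0 when fewer than two items overlap or a user has zero variance.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    # Numerator: sum of products of the two users' deviations from their means.
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    # Denominator terms: sums of squared deviations for each user.
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / sqrt(var_u * var_v)

# Hypothetical toy ratings (item id -> rating); items a, b, e are co-rated.
alice = {"a": 5.0, "b": 3.0, "d": 4.0, "e": 2.0}
bob = {"a": 4.0, "b": 2.0, "c": 5.0, "e": 3.0}
w_alice_bob = pearson(alice, bob)
print(round(w_alice_bob, 4))
```

With only three co-rated items the estimate is noisy, which is the situation the last sentence above warns about.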

We can then compute the score of item $i$ for user $u$:

$$s(u, i) = \frac{\sum_{v\in U}r_{vi}w_{uv}}{\sum_{v\in U} w_{uv}}.$$

Here $w_{uv}$ determines how much user $v$ contributes to the prediction, and $U$ is the set of neighbors of user $u$. In practice we can add constraints, for example limiting the neighborhood size (25-100 users) or requiring a minimum similarity.
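
The weighted average with both constraints can be sketched as follows. The function name `predict_weighted`, its parameters `k` and `min_sim`, and the toy data are all hypothetical; returning `None` when no neighbor qualifies is likewise just one possible convention.

```python
def predict_weighted(item, neighbors, ratings, k=30, min_sim=0.0):
    """Similarity-weighted mean of the neighbors' ratings for `item`.

    neighbors: dict mapping user v -> similarity w_uv to the target user
    ratings:   dict mapping user v -> {item: rating}
    """
    # Keep neighbors that rated the item and exceed the similarity threshold.
    candidates = [
        (w, ratings[v][item])
        for v, w in neighbors.items()
        if w > min_sim and item in ratings.get(v, {})
    ]
    # Limit the neighborhood to the k most similar of those users.
    top = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    total_w = sum(w for w, _ in top)
    if total_w == 0:
        return None  # nobody in the neighborhood rated this item
    return sum(w * r for w, r in top) / total_w

# Hypothetical similarities and ratings.
similarities = {"v1": 0.8, "v2": 0.5, "v3": -0.2}
ratings = {"v1": {"x": 4.0}, "v2": {"x": 2.0, "y": 5.0}, "v3": {"x": 1.0}}

score_x = predict_weighted("x", similarities, ratings)  # uses v1 and v2 only
score_y = predict_weighted("y", similarities, ratings)  # only v2 rated y
score_z = predict_weighted("z", similarities, ratings)  # no ratings at all
```

Filtering out non-positive similarities keeps the denominator positive; with negative weights allowed, the plain sum $\sum_v w_{uv}$ could approach zero and make the score unstable.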

One issue with this formula is that people use the rating scale differently: two people might like an item equally but give it different ratings. Therefore we normalize each rating by the rater's mean,

$$s(u, i) = \frac{\sum_{v\in U}(r_{vi}-\bar{r}_{v})w_{uv}}{\sum_{v\in U} w_{uv}} + \bar{r}_{u}.$$
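
To illustrate the effect of mean-centering, here is a small sketch with one harsh and one generous rater. All names and numbers are hypothetical. Each neighbor's mean is taken over all of that neighbor's ratings, and falling back to the target user's mean when no neighbor rated the item is an assumption.

```python
def predict_mean_centered(item, target_mean, neighbors, ratings):
    """Mean-centered prediction: target mean plus weighted average deviation.

    neighbors: dict mapping user v -> similarity w_uv
    ratings:   dict mapping user v -> {item: rating}
    """
    num = 0.0
    den = 0.0
    for v, w in neighbors.items():
        r_v = ratings[v]
        if item not in r_v or w <= 0:
            continue
        mean_v = sum(r_v.values()) / len(r_v)  # neighbor's own mean rating
        num += w * (r_v[item] - mean_v)        # deviation from that mean
        den += w
    if den == 0:
        return target_mean  # assumption: fall back to the user's mean
    return target_mean + num / den

# A harsh rater (mean 2.0) and a generous rater (mean 4.5), toy data.
ratings = {
    "harsh": {"x": 3.0, "y": 1.0, "z": 2.0},
    "generous": {"x": 5.0, "y": 4.0, "z": 4.5},
}
weights = {"harsh": 0.6, "generous": 0.4}

centered_score = predict_mean_centered("x", 3.5, weights, ratings)
plain_score = (0.6 * 3.0 + 0.4 * 5.0) / (0.6 + 0.4)
```

The harsh rater's 3.0 is a full point above that user's mean, so the centered prediction for a target user with mean 3.5 comes out higher than the plain weighted average of the raw ratings.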

To formalize this:

Given a set of items $I$, a set of users $U$, and a sparse rating matrix $R$, we compute the prediction $s(u, i)$ as follows:

1. for all users $v\neq u$, compute $w_{uv}$;
2. select a neighborhood of users $V\subset U$ with the highest $w_{uv}$;
3. compute the prediction $s(u, i)$.

Implementation issues:
Given $m=|U|$ and $n=|I|$, computing the correlation between two users takes $O(n)$; computing all correlations for one user takes $O(mn)$; computing all pairwise correlations takes $O(m^2n)$; making recommendations takes $O(mn)$.

When do we use user-user CF? When the items being recommended don't have good attributes or keywords to describe them, which content-based filtering (CBF) would need.

In [1]:
import numpy as np
import pandas as pd

In [2]:
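# The first spreadsheet column holds the row labels (movie titles / user ids),
# so use it as the index; this matches the shapes printed below.
movie_user = pd.read_excel('UUCF Assignment Spreadsheet.xls', sheet_name='movie-row', index_col=0)
user_movie = pd.read_excel('UUCF Assignment Spreadsheet.xls', sheet_name='user-row', index_col=0)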
print(movie_user.shape, user_movie.shape)

(100, 25) (25, 100)


There are 100 movies and 25 users.

In [3]:
movie_user.head(3)

Unnamed: 0,1648,5136,918,2824,3867,860,3712,2968,3525,4323,...,3556,5261,2492,5062,2486,4942,2267,4809,3853,2288
11: Star Wars: Episode IV - A New Hope (1977),,4.5,5.0,4.5,4.0,4.0,,5.0,4.0,5.0,...,4.0,,4.5,4.0,3.5,,,,,
12: Finding Nemo (2003),,5.0,5.0,,4.0,4.0,4.5,4.5,4.0,5.0,...,4.0,,3.5,4.0,2.0,3.5,,,,3.5
13: Forrest Gump (1994),,5.0,4.5,5.0,4.5,4.5,,5.0,4.5,5.0,...,4.0,5.0,3.5,4.5,4.5,4.0,3.5,4.5,3.5,3.5


In [4]:
user_movie.head(3)

Unnamed: 0,11: Star Wars: Episode IV - A New Hope (1977),12: Finding Nemo (2003),13: Forrest Gump (1994),14: American Beauty (1999),22: Pirates of the Caribbean: The Curse of the Black Pearl (2003),24: Kill Bill: Vol. 1 (2003),38: Eternal Sunshine of the Spotless Mind (2004),63: Twelve Monkeys (a.k.a. 12 Monkeys) (1995),77: Memento (2000),85: Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981),...,8467: Dumb & Dumber (1994),8587: The Lion King (1994),9331: Clear and Present Danger (1994),9741: Unbreakable (2000),9802: The Rock (1996),9806: The Incredibles (2004),10020: Beauty and the Beast (1991),36657: X-Men (2000),36658: X2: X-Men United (2003),36955: True Lies (1994)
1648,,,,,4.0,3.0,,,,,...,,4.0,,,5.0,3.5,3.0,,3.5,
5136,4.5,5.0,5.0,4.0,5.0,5.0,5.0,3.0,,5.0,...,1.0,5.0,,,,5.0,5.0,4.5,4.0,
918,5.0,5.0,4.5,,3.0,,5.0,,5.0,,...,,5.0,,,,3.5,,,,


#### 1. Calculate user-user correlations

In [5]:
user_cor_matrix = movie_user.corr()
user_cor_matrix

Unnamed: 0,1648,5136,918,2824,3867,860,3712,2968,3525,4323,...,3556,5261,2492,5062,2486,4942,2267,4809,3853,2288
1648,1.0,0.40298,-0.142206,0.51762,0.3002,0.480537,-0.312412,0.383348,0.092775,0.098191,...,-0.191988,0.493008,0.360644,0.551089,0.002544,0.116653,-0.429183,0.394371,-0.304422,0.245048
5136,0.40298,1.0,0.118979,0.057916,0.341734,0.241377,0.131398,0.206695,0.360056,0.033642,...,0.488607,0.32812,0.422236,0.226635,0.305803,0.037769,0.240728,0.411676,0.189234,0.390067
918,-0.142206,0.118979,1.0,-0.317063,0.294558,0.468333,0.092037,-0.045854,0.367568,-0.035394,...,0.373226,0.470972,0.069956,-0.054762,0.133812,0.015169,-0.273096,0.082528,0.667168,0.119162
2824,0.51762,0.057916,-0.317063,1.0,-0.060913,-0.008066,0.46291,0.21476,0.169907,0.11935,...,-0.201275,0.228341,0.2387,0.25966,0.247097,0.149247,-0.361466,0.474974,-0.262073,0.166999
3867,0.3002,0.341734,0.294558,-0.060913,1.0,0.282497,0.400275,0.264249,0.125193,-0.333602,...,0.174085,0.297977,0.476683,0.293868,0.438992,-0.162818,-0.295966,0.054518,0.46411,0.379856
860,0.480537,0.241377,0.468333,-0.008066,0.282497,1.0,0.171151,0.072927,0.387133,0.146158,...,0.34747,0.399436,0.207314,0.311363,0.276306,0.079698,0.212991,0.165608,0.162314,0.279677
3712,-0.312412,0.131398,0.092037,0.46291,0.400275,0.171151,1.0,0.065015,0.095623,-0.292501,...,0.016406,-0.240764,-0.115254,0.247693,0.166913,0.146011,0.009685,-0.451625,0.19366,0.113266
2968,0.383348,0.206695,-0.045854,0.21476,0.264249,0.072927,0.065015,1.0,0.028529,-0.073252,...,0.049132,-0.009041,0.203613,0.033301,0.137982,0.070602,0.109452,-0.083562,-0.089317,0.229219
3525,0.092775,0.360056,0.367568,0.169907,0.125193,0.387133,0.095623,0.028529,1.0,0.210879,...,0.475711,0.306957,0.136343,0.30175,0.143414,0.0561,0.179908,0.284648,0.170757,0.193131
4323,0.098191,0.033642,-0.035394,0.11935,-0.333602,0.146158,-0.292501,-0.073252,0.210879,1.0,...,-0.040606,0.155045,-0.204164,0.263654,0.167198,-0.084592,0.315712,0.085673,-0.109892,-0.279385


#### 2. Top neighbors of users

Select the user's column of the correlation matrix and sort the correlation coefficients in descending order. The first entry is the user's self-correlation of 1, so we skip it.

In [6]:
user_1 = 3867
user_2 = 89
print(f'\tTop 5 neighbors of user {user_1}:')
print(user_cor_matrix[user_1].sort_values(ascending=False)[1:6])
print('\n')
print(f'\tTop 5 neighbors of user {user_2}:')
print(user_cor_matrix[user_2].sort_values(ascending=False)[1:6])

	Top 5 neighbors of user 3867:
2492    0.476683
3853    0.464110
2486    0.438992
3712    0.400275
2288    0.379856
Name: 3867, dtype: float64


	Top 5 neighbors of user 89:
4809    0.668516
5136    0.562449
860     0.539066
5062    0.525990
3525    0.475495
Name: 89, dtype: float64
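
The slice `[1:6]` relies on the user's own entry sorting first. A slightly more defensive variant, sketched below with a hypothetical helper `top_neighbors` and a toy correlation matrix, drops the user's own label explicitly, so a neighbor that happens to have a correlation of exactly 1.0 is not skipped by mistake.

```python
import pandas as pd

def top_neighbors(cor_matrix: pd.DataFrame, user, k: int = 5) -> pd.Series:
    """Return the k users most correlated with `user`, excluding `user`."""
    sims = cor_matrix[user].drop(labels=user).dropna()
    return sims.nlargest(k)

# Toy correlation matrix: user 30 correlates perfectly with user 10.
toy_cor = pd.DataFrame(
    [[1.0, 0.3, 1.0],
     [0.3, 1.0, -0.1],
     [1.0, -0.1, 1.0]],
    index=[10, 20, 30],
    columns=[10, 20, 30],
)
neighbors_of_10 = top_neighbors(toy_cor, 10, k=2)
print(neighbors_of_10)
```

In the notebook this would be called as `top_neighbors(user_cor_matrix, user_1)`.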


#### 3. Predict movie score for users

Following the score formula, we compute the predictions for all movies at once with matrix multiplication.

In [7]:
# Similarities of the 5 nearest neighbors (position 0 is the user itself).
user_1_similarities = user_cor_matrix[user_1].sort_values(ascending=False)[1:6]
user_2_similarities = user_cor_matrix[user_2].sort_values(ascending=False)[1:6]

# Neighbor ids, in the same order as the similarities.
user_1_neighbors = user_1_similarities.index.tolist()
user_2_neighbors = user_2_similarities.index.tolist()

In [8]:
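# Numerator: similarity-weighted sum of the neighbors' ratings (missing ratings contribute 0).
# Denominator: sum of the similarities of only those neighbors who rated the movie.
user_1_preds = movie_user.loc[:, user_1_neighbors].fillna(0).values\
                .dot(user_1_similarities.values) / \
               movie_user.loc[:, user_1_neighbors].notnull().values\
                .dot(user_1_similarities.values)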
            
user_1_preds = pd.DataFrame({'movie': movie_user.index, 'pred_score': user_1_preds})
user_1_preds.head()

Unnamed: 0,movie,pred_score
0,11: Star Wars: Episode IV - A New Hope (1977),4.020581
1,12: Finding Nemo (2003),3.347734
2,13: Forrest Gump (1994),3.749478
3,14: American Beauty (1999),3.804172
4,22: Pirates of the Caribbean: The Curse of the...,3.345121


In [9]:
user_1_preds.sort_values(by='pred_score', ascending=False).head()

Unnamed: 0,movie,pred_score
77,1891: Star Wars: Episode V - The Empire Strike...,4.760291
21,155: The Dark Knight (2008),4.551454
16,122: The Lord of the Rings: The Return of the ...,4.507637
8,77: Memento (2000),4.472487
15,121: The Lord of the Rings: The Two Towers (2002),4.400194


In [10]:
user_2_preds = movie_user.loc[:, user_2_neighbors].fillna(0).values\
                .dot(user_2_similarities.values) / \
               movie_user.loc[:, user_2_neighbors].notnull().values\
                .dot(user_2_similarities.values)
            
user_2_preds = pd.DataFrame({'movie': movie_user.index, 'pred_score': user_2_preds})
user_2_preds.head()

Unnamed: 0,movie,pred_score
0,11: Star Wars: Episode IV - A New Hope (1977),4.133725
1,12: Finding Nemo (2003),4.267451
2,13: Forrest Gump (1994),4.60147
3,14: American Beauty (1999),3.861582
4,22: Pirates of the Caribbean: The Curse of the...,3.98083


In [11]:
user_2_preds.sort_values(by='pred_score', ascending=False).head()

Unnamed: 0,movie,pred_score
27,238: The Godfather (1972),4.894124
33,278: The Shawshank Redemption (1994),4.882194
64,807: Seven (a.k.a. Se7en) (1995),4.774093
32,275: Fargo (1996),4.770944
38,424: Schindler's List (1993),4.729056
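
The matrix expression used in the cells above can be checked on a tiny example. The sketch below uses made-up ratings and weights. Filling missing ratings with 0 makes them drop out of the numerator, and multiplying the boolean "was rated" mask by the weights sums only the similarities of the neighbors who rated each item.

```python
import numpy as np

# Rows are items, columns are neighbors; NaN marks a missing rating (toy values).
R = np.array([
    [4.0, np.nan, 5.0],
    [np.nan, np.nan, 3.0],
    [2.0, 4.0, np.nan],
])
w = np.array([0.5, 0.3, 0.2])  # similarity of each neighbor to the target user

rated = ~np.isnan(R)                    # True where a neighbor rated the item
numerator = np.where(rated, R, 0.0) @ w # weighted sum over available ratings
denominator = rated.astype(float) @ w   # weights of the neighbors who rated it
scores = numerator / denominator

# Same result computed item by item, for comparison.
loop_scores = np.array([
    sum(w[j] * R[i, j] for j in range(3) if rated[i, j])
    / sum(w[j] for j in range(3) if rated[i, j])
    for i in range(3)
])
```

An item that none of the neighbors rated would give 0/0 here, so its score comes out as NaN.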


#### 4. Calculate normalized predictions

To normalize the ratings, we subtract each user's mean rating from that user's ratings.

In [12]:
user_avg_rating = movie_user.mean(skipna=True)
user_1_avg_rating = user_avg_rating[user_1]
user_2_avg_rating = user_avg_rating[user_2]

user_avg_rating

1648    3.651515
5136    4.107955
918     4.681818
2824    4.058824
3867    3.661538
860     3.666667
3712    4.500000
2968    3.510101
3525    3.713542
4323    4.041176
3617    4.065217
4360    3.859375
2756    3.759740
89      4.397436
442     3.600000
3556    3.628205
5261    2.964286
2492    3.440000
5062    3.865672
2486    2.890000
4942    4.114865
2267    3.423077
4809    4.279412
3853    3.700000
2288    3.369863
dtype: float64

In [13]:
movie_user_normalized = movie_user - user_avg_rating
movie_user_normalized.head()

Unnamed: 0,1648,5136,918,2824,3867,860,3712,2968,3525,4323,...,3556,5261,2492,5062,2486,4942,2267,4809,3853,2288
11: Star Wars: Episode IV - A New Hope (1977),,0.392045,0.318182,0.441176,0.338462,0.333333,,1.489899,0.286458,0.958824,...,0.371795,,1.06,0.134328,0.61,,,,,
12: Finding Nemo (2003),,0.892045,0.318182,,0.338462,0.333333,0.0,0.989899,0.286458,0.958824,...,0.371795,,0.06,0.134328,-0.89,-0.614865,,,,0.130137
13: Forrest Gump (1994),,0.892045,-0.181818,0.941176,0.838462,0.833333,,1.489899,0.786458,0.958824,...,0.371795,2.035714,0.06,0.634328,1.61,-0.114865,0.076923,0.220588,-0.2,0.130137
14: American Beauty (1999),,-0.107955,,,,,0.0,-1.510101,-0.213542,0.958824,...,0.371795,,0.06,0.634328,0.61,-0.114865,,-0.779412,,
22: Pirates of the Caribbean: The Curse of the Black Pearl (2003),0.348485,0.892045,-1.681818,0.441176,0.338462,-1.166667,,1.489899,-0.713542,-0.041176,...,-0.628205,-1.464286,0.56,0.134328,-0.39,-0.614865,,0.720588,,0.130137
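
The subtraction above works because pandas aligns the index of the mean `Series` (user ids) with the columns of the `DataFrame`. The toy frame below, with hypothetical users `u1` and `u2`, makes that alignment explicit.

```python
import numpy as np
import pandas as pd

# Rows are movies, columns are users (toy values).
toy_ratings = pd.DataFrame(
    {"u1": [4.0, np.nan, 2.0], "u2": [5.0, 3.0, np.nan]},
    index=["m1", "m2", "m3"],
)

user_means = toy_ratings.mean()      # per-column mean, NaN skipped by default
centered = toy_ratings - user_means  # Series index is matched to the columns

# Equivalent, with the alignment axis spelled out.
centered_explicit = toy_ratings.sub(user_means, axis="columns")
print(centered)
```

For a frame laid out with users as rows, such as `user_movie`, the row means would be subtracted with `axis="index"` instead.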


In [14]:
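# Same weighted average as before, applied to the mean-centered ratings;
# the target user's own mean rating is added back afterwards.
user_1_preds_normalized = movie_user_normalized.loc[:, user_1_neighbors].fillna(0).values\
                            .dot(user_1_similarities.values) / \
                          movie_user_normalized.loc[:, user_1_neighbors].notnull().values\
                            .dot(user_1_similarities.values)
user_1_preds_normalized += user_1_avg_rating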

user_1_preds_normalized = pd.DataFrame({'movie': movie_user.index, 'pred_score': user_1_preds_normalized})
user_1_preds_normalized.head()

Unnamed: 0,movie,pred_score
0,11: Star Wars: Episode IV - A New Hope (1977),4.5058
1,12: Finding Nemo (2003),3.477161
2,13: Forrest Gump (1994),4.054794
3,14: American Beauty (1999),3.886764
4,22: Pirates of the Caribbean: The Curse of the...,3.773592


In [15]:
user_1_preds_normalized.sort_values(by='pred_score', ascending=False).head()

Unnamed: 0,movie,pred_score
77,1891: Star Wars: Episode V - The Empire Strike...,5.245509
21,155: The Dark Knight (2008),4.85677
8,77: Memento (2000),4.777803
32,275: Fargo (1996),4.771538
64,807: Seven (a.k.a. Se7en) (1995),4.655569


In [16]:
user_2_preds_normalized = movie_user_normalized.loc[:, user_2_neighbors].fillna(0).values\
                            .dot(user_2_similarities.values) / \
                          movie_user_normalized.loc[:, user_2_neighbors].notnull().values\
                            .dot(user_2_similarities.values)
            
user_2_preds_normalized += user_2_avg_rating
    
user_2_preds_normalized = pd.DataFrame({'movie': movie_user.index, 'pred_score': user_2_preds_normalized})
user_2_preds_normalized.head()

Unnamed: 0,movie,pred_score
0,11: Star Wars: Episode IV - A New Hope (1977),4.686099
1,12: Finding Nemo (2003),4.819825
2,13: Forrest Gump (1994),5.049074
3,14: American Beauty (1999),4.240812
4,22: Pirates of the Caribbean: The Curse of the...,4.428435


In [17]:
user_2_preds_normalized.sort_values(by='pred_score', ascending=False).head()

Unnamed: 0,movie,pred_score
27,238: The Godfather (1972),5.322015
33,278: The Shawshank Redemption (1994),5.261424
32,275: Fargo (1996),5.241111
64,807: Seven (a.k.a. Se7en) (1995),5.201984
38,424: Schindler's List (1993),5.199223
