# Example usage of fneighcf

This IPython notebook illustrates the usage of the [fneighcf](https://github.com/david-cortes/fneighcf) package on the [MovieLens 1M dataset](https://grouplens.org/datasets/movielens/1m/).

[fneighcf](https://github.com/david-cortes/fneighcf) is a Python implementation of the collaborative filtering algorithm described in _Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1), 1._, using Cython for fast computations.

**Small note: if the table of contents here is not clickable or the math symbols don't render properly, try viewing this same notebook in nbviewer by following [this link](http://nbviewer.jupyter.org/github/david-cortes/cmfrec/blob/master/example/cmfrec_movielens_sideinfo.ipynb).**
** *
## Sections
* [1. Model description](#p1)
* [2. Preparing the data](#p2)
* [3. Fitting the model](#p3)
* [4. Evaluating predictions](#p4)
* [5. Examining some recommendations](#p5)
* [6. References](#p6)
** *
<a id="p1"></a>
## 1. Model description

The model predicts the rating that each user gave to each movie through parameterized item-item effects, considering both the ratings themselves and the set of items that were rated, the latter as a form of implicit feedback. Having a parameterized model makes recommendations much faster and more scalable than a typical user-user or item-item nearest-neighbor search, and also improves recommendation quality.

Here, ratings are predicted as follows:

$$
\hat{r}_{ui} = \mu + b_u + b_i + |R_u|^{-\alpha} \left( \sum_{j \in R_u} (r_{uj} - \mu - b_u' - b_j') W_{ij} + \sum_{j \in R_u} C_{ij} \right)
$$

Where:
* $r_{ui}$ is the rating from user $u$ to item $i$.
* $\mu$ is the average rating across all data.
* $b_u$ is the bias for user $u$ (part of model parameters).
* $b_i$ is the bias for item $i$ (part of model parameters).
* $R_u$ is the set of items rated by user $u$.
* $\alpha$ is a hyperparameter that controls how strongly the summed item-item effects are scaled down by the number of items the user rated.
* $b_u'$ and $b_i'$ are fixed user and item biases (not part of model parameters) calculated through a simple heuristic; inside the sum, the item bias used is that of the rated item $j$.
* $W$ is a square matrix $(N_{items} \times N_{items})$ that parameterizes the effect of the rating deviation for each item on the rating of each other item.
* $C$ is a square matrix $(N_{items} \times N_{items})$ that parameterizes the implicit effect of having rated an item (regardless of the rating given) on the rating given to every other item.
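
As a rough illustration of the formula above, the following NumPy sketch computes a single prediction from already-known parameters. The function and argument names are hypothetical and are not part of the fneighcf API; the fixed item bias is taken for each rated item $j$, as in Koren's paper.

```python
import numpy as np

def predict_rating(i, rated_items, rated_values, mu, b_u, b_i,
                   b_u_fixed, b_items_fixed, W, C, alpha=0.5):
    """Predicted rating of one user for item i (illustrative names only)."""
    rated_items = np.asarray(rated_items, dtype=int)
    rated_values = np.asarray(rated_values, dtype=float)
    # Deviation of each observed rating from its fixed baseline
    resid = rated_values - mu - b_u_fixed - b_items_fixed[rated_items]
    explicit = np.dot(resid, W[i, rated_items])  # effects of rating deviations
    implicit = C[i, rated_items].sum()           # effects of having rated at all
    scale = len(rated_items) ** (-alpha)         # |R_u|^(-alpha)
    return mu + b_u + b_i + scale * (explicit + implicit)

# Toy example with 3 items and a user who rated items 0 and 1
W = np.array([[0.0, 0.2, -0.1],
              [0.2, 0.0, 0.3],
              [-0.1, 0.3, 0.0]])
C = np.zeros((3, 3))
b_items_fixed = np.array([0.1, -0.2, 0.0])
pred = predict_rating(i=2, rated_items=[0, 1], rated_values=[5.0, 3.0],
                      mu=3.5, b_u=0.0, b_i=0.0, b_u_fixed=0.0,
                      b_items_fixed=b_items_fixed, W=W, C=C)
print(pred)
```

With $\alpha = 0$ the item-item effects are summed without any normalization, while larger values shrink them more for users with many ratings.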

Regularization is applied to the squared $l_2$ norm of all parameters and to the calculation of the fixed biases.
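
The paper cited above computes the fixed baseline biases as shrunk averages: first the item biases, then the user biases on what remains. The sketch below follows that heuristic under the assumption that fneighcf does something similar, which is not confirmed here. The helper name `shrunk_biases` is hypothetical, and its `lambda_bi` and `lambda_bu` arguments are only named after the ones passed to `FNeigh` in section 3.

```python
import pandas as pd

def shrunk_biases(df, lambda_bi=25.0, lambda_bu=10.0):
    """Regularized (shrunk-mean) item and user biases; illustrative helper."""
    mu = df["Rating"].mean()
    # Item bias: summed deviation from the global mean, shrunk toward zero
    dev_i = df["Rating"] - mu
    g_i = dev_i.groupby(df["ItemId"])
    b_i = g_i.sum() / (lambda_bi + g_i.count())
    # User bias: what remains after removing the global mean and item bias
    dev_u = df["Rating"] - mu - df["ItemId"].map(b_i)
    g_u = dev_u.groupby(df["UserId"])
    b_u = g_u.sum() / (lambda_bu + g_u.count())
    return mu, b_u, b_i

toy = pd.DataFrame({
    "UserId": [1, 1, 2, 2, 3],
    "ItemId": [10, 20, 10, 30, 20],
    "Rating": [5, 3, 4, 2, 1],
})
mu, b_u, b_i = shrunk_biases(toy)
print(mu)
print(b_i)
print(b_u)
```

Larger values of the two regularization constants pull the biases of rarely seen users and items closer to zero.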


Note that this model requires storing dense matrices of size $(N_{items} \times N_{items})$ and, in a typical setting, has millions of parameters. The paper above also proposes other models with lower memory requirements but higher computational cost; those are not implemented here.
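
To get a feeling for the size of the model, the parameter count can be estimated from the number of users and items. The sketch below assumes approximate MovieLens 1M sizes (about 6,040 users and 3,706 rated movies) and 8-byte floats; both are assumptions, and the training subset used later may contain slightly fewer users and items.

```python
def param_count(n_users, n_items):
    """Free parameters: matrices W and C plus one bias per user and per item."""
    return 2 * n_items ** 2 + n_users + n_items

# Assumed, approximate MovieLens 1M sizes (not taken from the fitted model)
n_users, n_items = 6040, 3706
n_params = param_count(n_users, n_items)
print(f"{n_params:,} parameters")
print(f"{n_params * 8 / 1e6:.0f} MB if stored as 64-bit floats")
```

The two dense matrices dominate: the bias terms add fewer than ten thousand parameters on top of roughly 27 million.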

** *
<a id="p2"></a>
## 2. Preparing the data


Loading the [MovieLens 1M dataset](https://grouplens.org/datasets/movielens/1m/):

In [1]:
import numpy as np, pandas as pd, time
from datetime import datetime

# ratings.dat uses '::' as separator, which requires the python parsing engine
ratings = pd.read_table('~/ml-1m/ratings.dat', sep='::', engine='python',
                        names=['UserId', 'ItemId', 'Rating', 'Timestamp'])
# convert unix timestamps (seconds) to datetimes in the local time zone
ratings['Timestamp'] = pd.to_datetime(
    ratings.Timestamp.map(lambda x: datetime(*time.localtime(x)[:6])))
ratings = ratings.sort_values(['UserId', 'ItemId']).reset_index(drop=True)
ratings.head()

| | UserId | ItemId | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 5 | 2001-01-06 23:37:48 |
| 1 | 1 | 48 | 5 | 2001-01-06 23:39:11 |
| 2 | 1 | 150 | 5 | 2000-12-31 22:29:37 |
| 3 | 1 | 260 | 4 | 2000-12-31 22:12:40 |
| 4 | 1 | 527 | 5 | 2001-01-06 23:36:35 |


Creating a temporal train-test split:

In [2]:
time_cutoff = '2002-01-01'
train = ratings.loc[ratings.Timestamp <= time_cutoff]
test = ratings.loc[ratings.Timestamp > time_cutoff]
# keep only test rows whose user and item both appear in the training data;
# copy so that a 'Predicted' column can be added to the test set later
test = test.loc[test.UserId.isin(train.UserId) & test.ItemId.isin(train.ItemId)].copy()
print(train.shape)
print(test.shape)

(972815, 4)
(27102, 4)
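
The filtering step matters because the model learns one bias per user and one row of $W$ and $C$ per item, so users or items that are absent from the training data have no fitted parameters. A minimal sketch of the same split on toy data, with a hypothetical helper `temporal_split`, shows which rows survive:

```python
import pandas as pd

def temporal_split(df, cutoff):
    """Split by timestamp, then drop test rows with unseen users or items."""
    train = df.loc[df["Timestamp"] <= cutoff]
    test = df.loc[df["Timestamp"] > cutoff]
    keep = test["UserId"].isin(train["UserId"]) & test["ItemId"].isin(train["ItemId"])
    return train, test.loc[keep].copy()

toy = pd.DataFrame({
    "UserId": [1, 2, 1, 3, 2],
    "ItemId": [10, 20, 20, 10, 30],
    "Rating": [5, 3, 4, 2, 1],
    "Timestamp": pd.to_datetime(["2001-05-01", "2001-07-01", "2002-03-01",
                                 "2002-06-01", "2002-07-01"]),
})
toy_train, toy_test = temporal_split(toy, pd.Timestamp("2002-01-01"))
print(toy_test)
```

Here only the rating of user 1 for item 20 remains in the test set: user 3 and item 30 never appear before the cutoff.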


Loading movie titles to inspect recommendations later:

In [3]:
movie_titles=pd.read_table('~/ml-1m/movies.dat',sep='::',engine='python',header=None)
movie_titles.columns=['ItemId','title','genres']
movie_titles=movie_titles[['ItemId','title']]

movie_id_to_title={i.ItemId:i.title for i in movie_titles.itertuples()}

# aggregate statistics shown next to each recommended movie
avg_movie_rating = train.groupby('ItemId')['Rating'].mean()
num_ratings_per_movie = train.groupby('ItemId')['Rating'].count()

# function to print recommended lists more nicely
def print_reclist(reclist):
    list_w_info = [
        str(rank) + ") - " + movie_id_to_title[item]
        + " - Average Rating: " + str(np.round(avg_movie_rating[item], 2))
        + " - Number of ratings: " + str(num_ratings_per_movie[item])
        for rank, item in enumerate(reclist, start=1)
    ]
    print("\n".join(list_w_info))

** *
<a id="p3"></a>
## 3. Fitting the model

In [4]:
%%time
from fneighcf import FNeigh
import warnings
warnings.filterwarnings("ignore")

rec = FNeigh(center_ratings=True, alpha=0.5, lambda_bu=10, lambda_bi=25,
             lambda_u=5e-1, lambda_i=5e-2, lambda_W=5e-3, lambda_C=5e-2)

# The better-quality alternative
# With this dataset it takes 9GB of RAM
rec.fit(train, method='lbfgs', opts_lbfgs={'maxiter':300, 'disp':True})

# A faster alternative, uses 0.7GB of RAM and takes 8 minutes
# However, the solution it reaches is not as good
# rec.fit(train, method='sgd', epochs=100, step_size=1e-3, early_stop=False, verbose=True)

# Note that the model has 27 million parameters

CPU times: user 20min 40s, sys: 24.6 s, total: 21min 5s
Wall time: 21min 7s


** *
<a id="p4"></a>
## 4. Evaluating predictions

Making predictions on the test set. Predicting like this requires passing `save_data=True` to the `fit` method:

In [5]:
%%time
test['Predicted']=rec.predict(uids=test.UserId, iids=test.ItemId)

CPU times: user 84 ms, sys: 4 ms, total: 88 ms
Wall time: 86.6 ms


RMSE (root mean squared error):

In [6]:
np.sqrt(np.mean((test.Predicted-test.Rating)**2))

1.0514003487805468
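
An RMSE value is easier to interpret next to a trivial baseline, such as always predicting the global mean of the training ratings. The sketch below uses made-up toy numbers, not results from this notebook, and the helper name `rmse` is only illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Made-up ratings and predictions, for illustration only
actual = [5, 3, 4, 1]
model_pred = [4.5, 3.5, 4.0, 2.0]
baseline_pred = np.full(len(actual), 3.25)  # a constant, e.g. a global mean

print("model RMSE:   ", rmse(actual, model_pred))
print("baseline RMSE:", rmse(actual, baseline_pred))
```

In the notebook itself, the same helper could be applied to `test.Rating` and `test.Predicted`, with the constant taken from `train.Rating.mean()`.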

How users would have rated top and bottom predictions, and comparison against a non-personalized recommended list:

In [7]:
avg_ratings=train.groupby('ItemId')['Rating'].mean().to_frame().rename(columns={"Rating":"AvgRating"})
test2=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test2.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-10 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(10).mean())
print('Average rating for bottom-10 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(10).mean())
print('Average rating for top-10 recommendations of best-rated movies:',test2.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(10).mean())
print('----------------------')
print('Average rating for top-10 recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(10).mean())
print('Average rating for bottom-10 (non-)recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(10).mean())

Average movie rating: 3.5696247441040905
Average rating for top-10 rated by each user: 4.353635798632691
Average rating for bottom-10 rated by each user: 2.6254402320281747
Average rating for top-10 recommendations of best-rated movies: 3.9181686347627926
----------------------
Average rating for top-10 recommendations from this model: 3.8479386782680756
Average rating for bottom-10 (non-)recommendations from this model: 3.172570955044541


** *
<a id="p5"></a>
## 5. Examining some recommendations

Now examining top recommended lists for some random users.

Generating a top-N list for a user from the training set:

In [8]:
%%time
rec.topN(uid=1, n=20)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.05 ms


array([1326, 1755, 3936,   71, 2535, 3901, 2938, 1824, 3870, 3736, 2937,
       2904, 2210, 2204, 2203, 1940, 1936, 1935, 1596, 1507])

Generating a top-N list for a new user, given a list of rated items and the ratings given to them:

In [9]:
%%time
rec.topN(items=[1, 48, 150, 260, 527], ratings=[1, 3, 4, 4, 2], n=20)

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 2.94 ms


array([3593,  810, 2383, 2555, 2817, 2799, 1556, 2643,  546, 3268, 3799,
       2382,  181, 1389,   66, 1760, 1381, 1998, 2816, 2153])
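
Conceptually, a top-N list is the set of items with the highest predicted scores, sorted from best to worst. The sketch below shows one common way to extract it from a dense score vector with NumPy. It is not necessarily how `topN` is implemented in fneighcf, and the helper name and its `exclude` argument are hypothetical.

```python
import numpy as np

def top_n_from_scores(scores, n, exclude=()):
    """Indices of the n highest scores, best first, skipping excluded items."""
    scores = np.array(scores, dtype=float)       # copy, input stays untouched
    exclude = np.asarray(list(exclude), dtype=int)
    scores[exclude] = -np.inf                    # push excluded items to the end
    n = min(n, scores.size)
    top = np.argpartition(-scores, n - 1)[:n]    # unordered top n, linear time
    return top[np.argsort(-scores[top])]         # sort only those n by score

scores = [0.1, 0.9, 0.5, 0.7, 0.3]
print(top_n_from_scores(scores, n=2))
print(top_n_from_scores(scores, n=2, exclude=[1]))
```

Using `argpartition` before sorting avoids ordering the full score vector, which helps when the catalog is large and `n` is small.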

Now checking some top-10 recommended lists in detail for some users:

In [10]:
reclist_user1 = rec.topN(uid=1, n=10)
reclist_user100 = rec.topN(uid=100, n=10)

print('Number of ratings from User 1: ',train.loc[train.UserId==1].shape[0])
print('')
print('Recommendations from this model')
print_reclist(reclist_user1)

Number of ratings from User 1:  53

Recommendations from this model
1) - Amityville II: The Possession (1982) - Average Rating: 2.13 - Number of ratings: 55
2) - Shooting Fish (1997) - Average Rating: 3.55 - Number of ratings: 29
3) - Phantom of the Opera, The (1943) - Average Rating: 3.73 - Number of ratings: 110
4) - Fair Game (1995) - Average Rating: 2.13 - Number of ratings: 94
5) - Earthquake (1974) - Average Rating: 2.85 - Number of ratings: 117
6) - Duets (2000) - Average Rating: 2.65 - Number of ratings: 95
7) - Man Facing Southeast (Hombre Mirando al Sudeste) (1986) - Average Rating: 3.7 - Number of ratings: 30
8) - Homegrown (1998) - Average Rating: 3.36 - Number of ratings: 102
9) - Our Town (1940) - Average Rating: 3.81 - Number of ratings: 47
10) - Big Carnival, The (1951) - Average Rating: 3.76 - Number of ratings: 37


In [11]:
print('Number of ratings from User 100: ',train.loc[train.UserId==100].shape[0])
print('')
print('Recommendations from this model')
print_reclist(reclist_user100)

Number of ratings from User 100:  76

Recommendations from this model
1) - Five Wives, Three Secretaries and Me (1998) - Average Rating: 4.0 - Number of ratings: 1
2) - Nueba Yol (1995) - Average Rating: 1.0 - Number of ratings: 1
3) - Specials, The (2000) - Average Rating: 4.33 - Number of ratings: 3
4) - Death in the Garden (Mort en ce jardin, La) (1956) - Average Rating: 3.0 - Number of ratings: 3
5) - Hillbillys in a Haunted House (1967) - Average Rating: 1.0 - Number of ratings: 1
6) - It Happened Here (1961) - Average Rating: 3.0 - Number of ratings: 2
7) - Zero Kelvin (Kjærlighetens kjøtere) (1995) - Average Rating: 4.0 - Number of ratings: 1
8) - Lotto Land (1995) - Average Rating: 1.0 - Number of ratings: 1
9) - I'll Never Forget What's 'is Name (1967) - Average Rating: 3.0 - Number of ratings: 1
10) - Another Man's Poison (1952) - Average Rating: 4.0 - Number of ratings: 1


As seen above, the model seems to make riskier recommendations (less commonly rated items) the more ratings a user has, which might be desirable if increased personalization is the goal.
** *
<a id="p6"></a>
## 6. References
* Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1), 1.