# Example usage of fneighcf

This IPython notebook illustrates the usage of the [fneighcf](https://github.com/david-cortes/fneighcf) package, a Python implementation of the collaborative filtering algorithm described in _Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1), 1_.

Loading the [Movielens-100k dataset](https://grouplens.org/datasets/movielens/100k/):

In [1]:
import pandas as pd, numpy as np, time
from datetime import datetime

ratings=pd.read_table('u.data',sep='\t',engine='python',names=['UserId','ItemId','Rating','Timestamp'])
ratings['Timestamp']=ratings.Timestamp.map(lambda x: datetime(*time.localtime(x)[:6])).map(lambda x: pd.to_datetime(x))
ratings=ratings.sort_values(['UserId','ItemId']).reset_index(drop=True)
ratings.head()

| | UserId | ItemId | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 5 | 1997-09-23 01:02:38 |
| 1 | 1 | 2 | 3 | 1997-10-15 08:26:11 |
| 2 | 1 | 3 | 4 | 1997-11-03 09:42:40 |
| 3 | 1 | 4 | 3 | 1997-10-15 08:25:19 |
| 4 | 1 | 5 | 3 | 1998-03-13 03:15:12 |
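
The `Timestamp` column in `u.data` holds Unix epoch seconds, and the cell above converts them with `time.localtime`, so the displayed times depend on the time zone of the machine that runs the notebook. A minimal sketch of a time-zone-explicit conversion using only the standard library follows; the helper name `epoch_to_utc` and the sample value are illustrative assumptions, not part of the dataset tooling.

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Illustrative epoch value; the result is identical on every machine.
stamp = epoch_to_utc(874965758)
print(stamp.isoformat())
```

For a whole column, `pd.to_datetime(ratings.Timestamp, unit='s')` performs the same UTC-based conversion (returning naive timestamps), which would shift the values shown above by the local UTC offset.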


Splitting the data into training and test sets by time. The model is evaluated only on users and items that appear in the training set, since parameters are estimated only for those:

In [2]:
time_cutoff='1998-01-01'
train=ratings.loc[ratings.Timestamp<=time_cutoff]
test=ratings.loc[ratings.Timestamp>time_cutoff]
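# keep only test rows whose user and item were both seen during training
users_train=set(train.UserId)
items_train=set(train.ItemId)
test=test.loc[test.UserId.isin(users_train)]
# .copy() so that adding prediction columns later does not raise SettingWithCopyWarning
test=test.loc[test.ItemId.isin(items_train)].copy()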
print(train.shape)
print(test.shape)

(52884, 4)
(5835, 4)
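
The split logic can be sketched without pandas to make the two filtering steps explicit. This is an illustration on made-up rows (not MovieLens data), and `temporal_split` is a hypothetical helper name:

```python
from datetime import datetime

def temporal_split(rows, cutoff):
    """Split (user, item, rating, timestamp) rows at `cutoff`.

    Test rows are kept only if both their user and their item
    also occur somewhere in the training part.
    """
    train = [r for r in rows if r[3] <= cutoff]
    users_train = {r[0] for r in train}
    items_train = {r[1] for r in train}
    test = [r for r in rows
            if r[3] > cutoff and r[0] in users_train and r[1] in items_train]
    return train, test

# Made-up toy rows for illustration only.
rows = [
    (1, 10, 5, datetime(1997, 11, 1)),   # training
    (2, 11, 4, datetime(1997, 12, 1)),   # training
    (1, 11, 3, datetime(1998, 2, 1)),    # test: user 1 and item 11 are known
    (2, 12, 2, datetime(1998, 2, 15)),   # dropped: item 12 never seen in training
    (3, 10, 4, datetime(1998, 3, 1)),    # dropped: user 3 never seen in training
]
train_rows, test_rows = temporal_split(rows, datetime(1998, 1, 1))
print(len(train_rows), len(test_rows))
```

Filtering this way avoids scoring cold-start users or items, at the cost of discarding part of the later data, as the shapes printed above show.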


Example usage:

In [3]:
from fneighcf import FNeigh

recc=FNeigh(use_biases=True,norm_nratings=-0.5, reg_param_biases=0.001, reg_param_interactions=0.5, save_ratings=True)
recc.fit(train, maxiter=15, step_size_biases=0.001, step_size_interactions=0.05,
            decrease_step_sqrt=True, use_sgd=False, verbose=True)

Iteration 1
RMSE after biases only: 0.8268826559618411
RMSE before update:  0.8268826559618411

Iteration 2
RMSE after biases only: 0.8193267047229166
RMSE before update:  0.6666345885393363

Iteration 3
RMSE after biases only: 0.8170700866186927
RMSE before update:  0.5886276907025956

Iteration 4
RMSE after biases only: 0.8161189817906047
RMSE before update:  0.5497678554428815

Iteration 5
RMSE after biases only: 0.8157434666605078
RMSE before update:  0.5321702767957052

Iteration 6
RMSE after biases only: 0.8155871304194205
RMSE before update:  0.5247975540332956

Iteration 7
RMSE after biases only: 0.8155253991716207
RMSE before update:  0.5218458802392846

Iteration 8
RMSE after biases only: 0.8155023786279961
RMSE before update:  0.5207388242379916



The model can also be fit with stochastic gradient descent, which iterates over the ratings in random order and updates the parameters immediately after calculating the error for each rating. This requires fewer passes over the data and might therefore take less time (recommended for larger datasets). Note that the step size and regularization for the interactions should be smaller than with full gradient descent, as in the cell below:

In [4]:
recc=FNeigh(use_biases=True,norm_nratings=-0.5, reg_param_biases=0.001, reg_param_interactions=0.001, save_ratings=True)
recc.fit(train, maxiter=5, step_size_biases=0.001, step_size_interactions=0.01,
            decrease_step_sqrt=True, use_sgd=True)
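
To make the per-rating update concrete, here is a minimal sketch of stochastic gradient descent on a toy bias-only model (prediction = global mean + user bias + item bias). It is not fneighcf's implementation: the data, the function names, and the reading of `decrease_step_sqrt` as dividing the step size by the square root of the epoch number are all assumptions made for illustration.

```python
import math
import random

# Made-up (user, item, rating) triples for illustration only.
RATINGS = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 1, 2.0), (2, 0, 5.0), (2, 1, 1.0)]
GLOBAL_MEAN = sum(r for _, _, r in RATINGS) / len(RATINGS)

def rmse(user_bias, item_bias):
    """RMSE of the bias-only prediction: global mean + b_user + b_item."""
    sq = [(r - (GLOBAL_MEAN + user_bias[u] + item_bias[i])) ** 2
          for u, i, r in RATINGS]
    return math.sqrt(sum(sq) / len(sq))

def fit_sgd(epochs=20, step=0.05, reg=0.001, seed=0):
    """Visit ratings in random order, updating right after each error."""
    rng = random.Random(seed)
    user_bias = {u: 0.0 for u, _, _ in RATINGS}
    item_bias = {i: 0.0 for _, i, _ in RATINGS}
    order = list(RATINGS)
    for epoch in range(1, epochs + 1):
        rng.shuffle(order)
        lr = step / math.sqrt(epoch)  # assumed square-root step decay
        for u, i, r in order:
            err = r - (GLOBAL_MEAN + user_bias[u] + item_bias[i])
            # Regularized gradient step on both biases for this one rating.
            user_bias[u] += lr * (err - reg * user_bias[u])
            item_bias[i] += lr * (err - reg * item_bias[i])
    return user_bias, item_bias

zero_u = {u: 0.0 for u, _, _ in RATINGS}
zero_i = {i: 0.0 for _, i, _ in RATINGS}
rmse_before = rmse(zero_u, zero_i)
rmse_after = rmse(*fit_sgd())
print(rmse_before, rmse_after)
```

Full (batch) gradient descent would instead accumulate the gradient over all ratings and apply a single update per pass, which is why it typically needs more passes over the data.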

Test set error (the cell below computes the mean squared error; its square root is the RMSE):

In [5]:
test['Predicted']=test.apply(lambda x: recc.predict(x['UserId'],x['ItemId']),axis=1)
np.mean((test.Rating-test.Predicted)**2)

1.121074058013816
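
The value printed above is the mean of the squared errors, i.e. the MSE. Taking its square root gives the RMSE, which for this run is about 1.059 and is the quantity comparable to the RMSE figures logged during training. A small sketch of both metrics in plain Python, on made-up numbers:

```python
import math

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

# Made-up ratings and predictions for illustration only.
actual = [5, 3, 4]
predicted = [4, 3, 2]
print(mse(actual, predicted), rmse(actual, predicted))
```

In the notebook itself, wrapping the expression in `np.sqrt(...)` would report the RMSE directly.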

The item bias (a measure of overall item popularity) can also be ignored when making recommendations. The resulting recommendations are of lower overall quality, but they are more varied and more customized for each user, with more serendipity (i.e. more likely to include something non-obvious):

In [6]:
test['PredictedNoBias']=test.apply(lambda x: recc.score(
        rating_history=train[['ItemId','Rating']].loc[train.UserId==x['UserId']],
        item=x['ItemId'],
        user=x['UserId'],
        use_item_bias=False), axis=1)
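
The effect of dropping the item bias can be sketched with a toy additive score, assuming each item's score is an item-level bias plus a user-specific interaction term. This is a simplification for illustration and not necessarily how `FNeigh.score` computes its result; `rank_items` and all values are hypothetical.

```python
def rank_items(item_bias, interaction, use_item_bias=True):
    """Return item ids sorted by score, highest first.

    item_bias and interaction are dicts keyed by item id.
    """
    def score(item):
        base = item_bias[item] if use_item_bias else 0.0
        return base + interaction[item]
    return sorted(item_bias, key=score, reverse=True)

# Hypothetical values: A and B are popular, C fits this user best.
item_bias = {"A": 1.2, "B": 0.9, "C": -0.3}
interaction = {"A": 0.05, "B": 0.10, "C": 0.30}

with_bias = rank_items(item_bias, interaction)
without_bias = rank_items(item_bias, interaction, use_item_bias=False)
print(with_bias, without_bias)
```

When the bias term is much larger than the interaction term, it dominates the ordering, so every user sees roughly the same popular items. Removing it lets the user-specific term decide the ranking.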



Recommendations for a user without any ratings, i.e. based on global item popularity:

In [7]:
%%time
print(recc.top_n(rating_history=[],n=20))

[1499, 1536, 1629, 1467, 1599, 1512, 1104, 1500, 119, 851, 1450, 1405, 1189, 1175, 1144, 1233, 1158, 1449, 272, 1193]
Wall time: 17 ms


Now a fictional user with some random ratings:

In [8]:
%%time
ratings_fictional_user=[(i,np.random.randint(low=1,high=6)) for i in range(1,10) if i in items_train] # (ItemId,Rating)
print(recc.top_n(rating_history=ratings_fictional_user,n=20,user=1))

[1499, 1536, 1629, 1467, 1599, 1512, 1104, 1500, 119, 851, 1450, 1405, 1189, 1175, 1233, 1144, 1158, 64, 1449, 272]
Wall time: 21 ms


Adding more ratings for the fictional user:

In [9]:
%%time
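# start at 10: items 1-9 were already rated in the previous cell, so this avoids duplicate entries
ratings_fictional_user+=[(i,np.random.randint(low=1,high=6)) for i in range(10,100) if i in items_train]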
print(recc.top_n(rating_history=ratings_fictional_user,n=20,user=1))

[1536, 1499, 1629, 1467, 1599, 1512, 1500, 119, 1104, 851, 1450, 1405, 1189, 1175, 1233, 1144, 408, 169, 1158, 1449]
Wall time: 33 ms


Recommendations for some random users. A drawback of this model is that the Top-N recommendations tend to be very similar across users:

In [10]:
%%time
print(recc.top_n_saved(user=1,n=20))

[1536, 1499, 1629, 1467, 1599, 1512, 1104, 1500, 851, 1450, 1405, 1189, 1175, 1144, 1233, 1158, 1449, 272, 1193, 733]
Wall time: 24 ms


In [11]:
print(recc.top_n_saved(user=870,n=20))

[1536, 1499, 1629, 1467, 1599, 1512, 1104, 1500, 119, 851, 1450, 1405, 1189, 1175, 1144, 1233, 1158, 1449, 272, 1193]


If we exclude the item popularity from the scoring of items, the recommendations become more varied and tailored to each user:

In [12]:
%%time
print(recc.top_n_saved(user=1,n=20,use_item_bias=False))

[1682, 694, 652, 654, 655, 656, 657, 663, 664, 692, 693, 695, 706, 696, 697, 698, 699, 700, 701, 702]
Wall time: 26 ms


In [13]:
print(recc.top_n_saved(user=870,n=20,use_item_bias=False))

[1682, 694, 652, 656, 664, 695, 706, 696, 698, 700, 701, 702, 703, 629, 617, 615, 611, 686, 687, 688]
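
How similar the lists are can be quantified with the Jaccard overlap (size of the intersection divided by size of the union). The sketch below applies it to the four lists printed above for users 1 and 870; the function name `jaccard` is just an illustrative helper.

```python
def jaccard(a, b):
    """Jaccard overlap of two lists of item ids, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Top-20 lists copied from the outputs above.
bias_user_1 = [1536, 1499, 1629, 1467, 1599, 1512, 1104, 1500, 851, 1450,
               1405, 1189, 1175, 1144, 1233, 1158, 1449, 272, 1193, 733]
bias_user_870 = [1536, 1499, 1629, 1467, 1599, 1512, 1104, 1500, 119, 851,
                 1450, 1405, 1189, 1175, 1144, 1233, 1158, 1449, 272, 1193]
nobias_user_1 = [1682, 694, 652, 654, 655, 656, 657, 663, 664, 692,
                 693, 695, 706, 696, 697, 698, 699, 700, 701, 702]
nobias_user_870 = [1682, 694, 652, 656, 664, 695, 706, 696, 698, 700,
                   701, 702, 703, 629, 617, 615, 611, 686, 687, 688]

overlap_with_bias = jaccard(bias_user_1, bias_user_870)
overlap_without_bias = jaccard(nobias_user_1, nobias_user_870)
print(round(overlap_with_bias, 2), round(overlap_without_bias, 2))
```

With the item bias the two users share 19 of their 20 items (overlap of about 0.90); without it they share 12 (about 0.43).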


Some basic comparisons:

In [14]:
avg_ratings=train.groupby('ItemId')['Rating'].mean().to_frame().rename(columns={"Rating":"AvgRating"})
test=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-5 rated by each user:',test.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 rated by each user:',test.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations of best-rated movies:',test.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('----------------------')
print('Average rating for top-5 recommendations from this model:',test.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations from this model (without item bias):',test.sort_values(['UserId','PredictedNoBias'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 (non-)recommendations from this model:',test.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(5).mean())

Average movie rating: 3.5602718818211856
Average rating for top-5 rated by each user: 4.5298621745788665
Average rating for bottom-5 rated by each user: 2.246554364471669
Average rating for top-5 recommendations of best-rated movies: 4.029096477794793
----------------------
Average rating for top-5 recommendations from this model: 4.065849923430322
Average rating for top-5 recommendations from this model (without item bias): 3.9494640122511484
Average rating for bottom-5 (non-)recommendations from this model: 2.996937212863706


In [15]:
test.sort_values(['UserId','Predicted'],ascending=False).groupby(['UserId']).head(3)

| | UserId | ItemId | Rating | Timestamp | Predicted | PredictedNoBias |
|---|---|---|---|---|---|---|
| 99919 | 943 | 318 | 3 | 1998-02-28 06:11:33 | 4.232707 | 0.136666 |
| 99866 | 943 | 98 | 5 | 1998-02-28 06:09:40 | 4.165017 | 0.168649 |
| 99850 | 943 | 56 | 5 | 1998-02-28 06:14:29 | 4.103143 | 0.256862 |
| 97905 | 921 | 174 | 5 | 1998-01-13 08:43:00 | 3.981879 | 0.262914 |
| 97903 | 921 | 172 | 4 | 1998-01-13 08:43:43 | 3.913905 | 0.238139 |
| 97955 | 921 | 603 | 3 | 1998-01-13 08:44:28 | 3.735175 | 0.065376 |
| 97716 | 919 | 272 | 5 | 1998-01-17 19:50:52 | 4.218175 | 0.010901 |
| 97743 | 919 | 313 | 5 | 1998-01-17 19:50:00 | 3.966001 | 0.036423 |
| 97724 | 919 | 286 | 4 | 1998-01-17 19:50:00 | 3.785180 | 0.163585 |
| 97069 | 913 | 258 | 4 | 1998-03-08 06:24:09 | 3.832721 | 0.091046 |
