## Recommendation models

Reference tutorials:

https://blog.insightdatascience.com/explicit-matrix-factorization-als-sgd-and-all-that-jazz-b00e4d9b21ea
https://www.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/

### First code for basic collaborative filtering models

In [1]:
import numpy as np
import pandas as pd

In [2]:
!curl -O http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip ml-100k.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4808k  100 4808k    0     0  4281k      0  0:00:01  0:00:01 --:--:-- 4281k
Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test       

### Read the ratings dataset

In [5]:
!ls ml-100k

README       u.genre      u.user       u2.test      u4.test      ua.test
allbut.pl    u.info       u1.base      u3.base      u5.base      ub.base
mku.sh       u.item       u1.test      u3.test      u5.test      ub.test
u.data       u.occupation u2.base      u4.base      ua.base


In [6]:
names = ['user_id', 'item_id', 'rating', 'timestamp']
ratings_df = pd.read_csv('./ml-100k/u.data', sep='\t', names=names)
ratings_df.head()

   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596


In [14]:
print(ratings_df.describe())
print()
# Count distinct users and items; these set the shape of the rating matrix
nusers = ratings_df["user_id"].nunique()
nproducts = ratings_df["item_id"].nunique()
print(f"{nusers} number of unique users")
print(f"{nproducts} number of unique items")

            user_id        item_id         rating     timestamp
count  100000.00000  100000.000000  100000.000000  1.000000e+05
mean      462.48475     425.530130       3.529860  8.835289e+08
std       266.61442     330.798356       1.125674  5.343856e+06
min         1.00000       1.000000       1.000000  8.747247e+08
25%       254.00000     175.000000       3.000000  8.794487e+08
50%       447.00000     322.000000       4.000000  8.828269e+08
75%       682.00000     631.000000       4.000000  8.882600e+08
max       943.00000    1682.000000       5.000000  8.932866e+08

943 number of unique users
1682 number of unique items


### Create the user-item rating matrix (explicit-feedback recommender)

In [15]:
ratings = np.zeros((nusers , nproducts))

In [19]:
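# IDs in u.data are 1-based, so shift them to 0-based row/column indices.
# A 0 in the matrix means "not rated".
for row in ratings_df.itertuples():
    ratings[row.user_id - 1, row.item_id - 1] = row.rating
ratings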

array([[5., 3., 4., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 5., 0., ..., 0., 0., 0.]])
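The loop above fills the matrix one rating at a time. A vectorized version is sketched below on a toy DataFrame; `build_rating_matrix` and `toy_df` are hypothetical names, and the sketch assumes the IDs are contiguous and 1-based, which holds for this dataset (943 users with a maximum ID of 943, 1682 items with a maximum ID of 1682).

```python
import numpy as np
import pandas as pd

def build_rating_matrix(df, n_users, n_items):
    """Dense user-item matrix from 1-based IDs; 0 marks 'not rated'."""
    mat = np.zeros((n_users, n_items))
    rows = df["user_id"].to_numpy() - 1   # shift to 0-based row index
    cols = df["item_id"].to_numpy() - 1   # shift to 0-based column index
    mat[rows, cols] = df["rating"].to_numpy()
    return mat

# Toy stand-in for ratings_df (illustrative values only)
toy_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [1, 3, 2, 3],
    "rating":  [5, 3, 4, 1],
})
toy = build_rating_matrix(toy_df, n_users=3, n_items=3)
print(toy)
```

On the real data the call would be `build_rating_matrix(ratings_df, nusers, nproducts)`.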

In [34]:
# Check the density: the share of user-item cells that hold a rating.
# (Sparsity is the complement, so the matrix is roughly 93.7% sparse.)
density = float(len(ratings.nonzero()[0]))
density /= (ratings.shape[0] * ratings.shape[1])
density *= 100
print(f'The density of the ratings matrix is {density}% ')


The density of the ratings matrix is 6.304669364224531% 
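The 6.30% figure printed above is the fraction of cells that contain a rating, so about 93.70% of the matrix is empty. A minimal sketch that reports both numbers, using a hypothetical helper `density_and_sparsity` on a toy matrix:

```python
import numpy as np

def density_and_sparsity(mat):
    """Return (percent of cells that are rated, percent of cells that are empty)."""
    n_rated = np.count_nonzero(mat)
    density = 100.0 * n_rated / mat.size
    return density, 100.0 - density

# Toy matrix: 3 ratings in 8 cells (illustrative values only)
toy = np.array([[5, 0, 3, 0],
                [0, 4, 0, 0]])
density, sparsity = density_and_sparsity(toy)
print(f"density = {density}%, sparsity = {sparsity}%")
```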


In [43]:
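def train_test_split(ratings):
    """Hold out 10 rated items per user as the test set.

    Every user needs at least 10 ratings; MovieLens 100K users
    have at least 20 each. The split is random and unseeded.
    """
    test = np.zeros(ratings.shape)
    train = ratings.copy()
    for user in range(ratings.shape[0]):
        # Sample 10 distinct item indices from this user's rated items
        test_ratings = np.random.choice(ratings[user, :].nonzero()[0],
                                        size=10,
                                        replace=False)
        # Debug: uncomment to print the held-out item indices per user
        # print(test_ratings)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings[user, test_ratings]

    # Train and test must be disjoint: no cell is nonzero in both
    assert np.all((train * test) == 0)
    return train, test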

In [44]:
train, test = train_test_split(ratings)
print(train.shape)
print(test.shape)

[229  83  12  20  41 267  91 129  63 207]
[288 310 126 291 284 254 301 275 289 272]
[299 263 321 341 351 342 353 331 244 270]
...

In [49]:
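# Held-out item indices for user 0 (the first row printed above)
t_ratings = [229, 83, 12, 20, 41, 267, 91, 129, 63, 207]
# The test matrix should hold user 0's original ratings at those indices
test[0, t_ratings]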

array([4., 4., 5., 1., 5., 5., 3., 3., 5., 5.])
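With a train/test split in hand, a basic collaborative filtering model can be scored on the held-out ratings. The sketch below is one simple memory-based baseline in the spirit of the second tutorial linked at the top, not necessarily the model this notebook goes on to use. It runs on a small random toy matrix; `holdout_split`, `cosine_similarity`, `predict_user_based`, and `masked_mse` are hypothetical helper names. Unrated cells are treated as ratings of 0, which pulls predictions downward, so read it as a rough baseline only.

```python
import numpy as np

def holdout_split(ratings, n_test, rng):
    """Move n_test rated items per user from train to test (seeded via rng)."""
    train = ratings.copy()
    test = np.zeros_like(ratings)
    for user in range(ratings.shape[0]):
        rated = ratings[user].nonzero()[0]
        held = rng.choice(rated, size=n_test, replace=False)
        train[user, held] = 0.0
        test[user, held] = ratings[user, held]
    return train, test

def cosine_similarity(mat, eps=1e-9):
    """Row-by-row cosine similarity; eps avoids division by zero."""
    norms = np.linalg.norm(mat, axis=1) + eps
    return (mat @ mat.T) / np.outer(norms, norms)

def predict_user_based(train, user_sim, eps=1e-9):
    """Predict each rating as a similarity-weighted average over all users."""
    weights = np.abs(user_sim).sum(axis=1, keepdims=True) + eps
    return (user_sim @ train) / weights

def masked_mse(pred, actual):
    """Mean squared error over the nonzero (held-out) entries of actual."""
    mask = actual.nonzero()
    return np.mean((pred[mask] - actual[mask]) ** 2)

# Toy data: 20 users x 15 items, every cell rated 1-5 (illustrative only)
rng = np.random.default_rng(0)
toy = rng.integers(1, 6, size=(20, 15)).astype(float)
toy_train, toy_test = holdout_split(toy, n_test=3, rng=rng)
toy_pred = predict_user_based(toy_train, cosine_similarity(toy_train))
toy_mse = masked_mse(toy_pred, toy_test)
print(f"User-based CF MSE on held-out toy ratings: {toy_mse:.3f}")
```

On the notebook's data the same calls would take `train` and `test` in place of `toy_train` and `toy_test`. Passing a seeded generator makes the split repeatable, which the unseeded `np.random.choice` version is not.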

In [38]:
!surprise -algo SVDpp -params "{'n_epochs': 5, 'verbose': True}" -load-builtin ml-100k -n-folds 3

 processing epoch 0
 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
 processing epoch 0
 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
 processing epoch 0
 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
Evaluating RMSE, MAE of algorithm SVDpp on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9540  0.9416  0.9550  0.9502  0.0061  
MAE (testset)     0.7570  0.7451  0.7572  0.7531  0.0057  
Fit time          25.94   24.94   25.07   25.32   0.44    
Test time         4.06    3.93    4.03    4.01    0.06    
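To give a feel for what a factorization model of this family fits, here is a minimal NumPy sketch of biased matrix factorization trained with SGD, where a rating is modeled as global mean + user bias + item bias + dot product of latent factors. This is not the SVD++ algorithm run above (SVD++ adds an implicit-feedback term) and it is not Surprise's implementation. The function names, hyperparameters, and toy data are all assumptions chosen for illustration.

```python
import numpy as np

def sgd_matrix_factorization(train, n_factors=8, n_epochs=30,
                             lr=0.01, reg=0.05, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i on the nonzero entries of train."""
    rng = np.random.default_rng(seed)
    n_users, n_items = train.shape
    users, items = train.nonzero()
    mu = train[users, items].mean()          # global mean of observed ratings
    b_u = np.zeros(n_users)                  # user biases
    b_i = np.zeros(n_items)                  # item biases
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        for idx in rng.permutation(len(users)):  # visit ratings in random order
            u, i = users[idx], items[idx]
            err = train[u, i] - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()              # use the pre-update p_u for q_i
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q

def rmse(pred, actual):
    """Root mean squared error over the nonzero entries of actual."""
    mask = actual.nonzero()
    return np.sqrt(np.mean((pred[mask] - actual[mask]) ** 2))

# Toy ratings with user and item effects, about 70% observed (illustrative only)
rng = np.random.default_rng(1)
raw = (3 + rng.normal(0, 0.8, size=(30, 1)) + rng.normal(0, 0.8, size=(1, 20))
       + rng.normal(0, 0.3, size=(30, 20)))
toy = np.clip(np.rint(raw), 1, 5) * (rng.random((30, 20)) < 0.7)

mu, b_u, b_i, P, Q = sgd_matrix_factorization(toy)
pred = mu + b_u[:, None] + b_i[None, :] + P @ Q.T
fit_rmse = rmse(pred, toy)
base_rmse = rmse(np.full_like(toy, mu), toy)  # always predict the global mean
print(f"train RMSE: model {fit_rmse:.3f} vs. global mean {base_rmse:.3f}")
```

The comparison here is on the training entries only, so it shows that the model fits, not that it generalizes; a held-out split is still needed for a fair error estimate.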
