## Recommendation models

https://blog.insightdatascience.com/explicit-matrix-factorization-als-sgd-and-all-that-jazz-b00e4d9b21ea
and
https://www.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/

### First code for basic collaborative filtering models

In [1]:
import numpy as np
import pandas as pd

In [2]:
!curl -O http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip ml-100k.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4808k  100 4808k    0     0  4281k      0  0:00:01  0:00:01 --:--:-- 4281k
Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test       

### Read the ratings dataset

In [5]:
!ls ml-100k

README       u.genre      u.user       u2.test      u4.test      ua.test
allbut.pl    u.info       u1.base      u3.base      u5.base      ub.base
mku.sh       u.item       u1.test      u3.test      u5.test      ub.test
u.data       u.occupation u2.base      u4.base      ua.base


In [6]:
names = ['user_id', 'item_id', 'rating', 'timestamp']
ratings_df = pd.read_csv('./ml-100k/u.data', sep='\t', names=names)
ratings_df.head()

   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596


In [14]:
print(ratings_df.describe())
print("")
nusers = ratings_df["user_id"].nunique()
nproducts = ratings_df["item_id"].nunique()
print(str(nusers)+ " number of unique users")
print(str(nproducts)+" number of unique items")

            user_id        item_id         rating     timestamp
count  100000.00000  100000.000000  100000.000000  1.000000e+05
mean      462.48475     425.530130       3.529860  8.835289e+08
std       266.61442     330.798356       1.125674  5.343856e+06
min         1.00000       1.000000       1.000000  8.747247e+08
25%       254.00000     175.000000       3.000000  8.794487e+08
50%       447.00000     322.000000       4.000000  8.828269e+08
75%       682.00000     631.000000       4.000000  8.882600e+08
max       943.00000    1682.000000       5.000000  8.932866e+08

943 number of unique users
1682 number of unique items


### Create the user-item rating matrix (this is an explicit-feedback recommendation model)

In [15]:
ratings = np.zeros((nusers , nproducts))

In [19]:
# itertuples() yields (index, user_id, item_id, rating, timestamp); IDs are 1-based
for row in ratings_df.itertuples():
    ratings[row[1]-1, row[2]-1] = row[3]
ratings

array([[5., 3., 4., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 5., 0., ..., 0., 0., 0.]])
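
The loop above can also be written as a single indexed assignment. The following is a minimal sketch, assuming IDs are 1-based and contiguous and that each (user, item) pair appears at most once. The names `build_rating_matrix` and `toy_df` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def build_rating_matrix(df, n_users, n_items):
    """Return a dense (n_users, n_items) matrix; 0 means 'not rated'."""
    matrix = np.zeros((n_users, n_items))
    # IDs start at 1, so shift them to 0-based row/column indices.
    rows = df["user_id"].to_numpy() - 1
    cols = df["item_id"].to_numpy() - 1
    # One fancy-indexed assignment replaces the Python-level loop.
    matrix[rows, cols] = df["rating"].to_numpy()
    return matrix


# Toy data standing in for ratings_df (hypothetical values).
toy_df = pd.DataFrame(
    {"user_id": [1, 1, 2, 3], "item_id": [1, 3, 2, 3], "rating": [5, 4, 3, 1]}
)
toy_matrix = build_rating_matrix(toy_df, n_users=3, n_items=3)
print(toy_matrix)
```

On the notebook's data the equivalent call would be `build_rating_matrix(ratings_df, nusers, nproducts)`.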

In [34]:
# Check the sparsity (the value computed here is the percentage of filled cells, i.e. the density)
sparsity = float(len(ratings.nonzero()[0]))
sparsity /= (ratings.shape[0] * ratings.shape[1])
sparsity *= 100
print(f'The sparsity of the ratings matrix is {sparsity}% ')


The sparsity of the ratings matrix is 6.304669364224531% 
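
Because the figure printed above counts non-zero cells, it may help to report density and sparsity side by side. The helper below is a small sketch; `density_and_sparsity` and the toy matrix are hypothetical names and values used only for illustration.

```python
import numpy as np


def density_and_sparsity(matrix):
    """Return (density, sparsity) as percentages of all cells."""
    filled = np.count_nonzero(matrix)
    density = 100.0 * filled / matrix.size
    return density, 100.0 - density


# Toy matrix (hypothetical): 3 of 8 cells hold a rating.
toy = np.array([[5, 0, 0, 3], [0, 0, 1, 0]], dtype=float)
density, sparsity = density_and_sparsity(toy)
print(f"density = {density:.2f}%, sparsity = {sparsity:.2f}%")
```

For the MovieLens matrix, 100,000 ratings over 943 x 1682 cells gives a density of about 6.30%, so roughly 93.70% of the cells are empty.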


### Hold out 10 random ratings per user as the test set for the collaborative filtering model

In [51]:
def train_test_split(ratings):
    """Move 10 randomly chosen ratings per user from train into test."""
    test = np.zeros(ratings.shape)
    train = ratings.copy()
    for user in range(ratings.shape[0]):
        # Indices of the items this user rated; assumes at least 10 of them
        test_ratings = np.random.choice(ratings[user, :].nonzero()[0],
                                        size=10,
                                        replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings[user, test_ratings]

    # Test and training are truly disjoint
    assert(np.all((train * test) == 0))
    return train, test

In [52]:
# Hold out 10 random ratings per user as the test set for the collaborative filtering model
train, test = train_test_split(ratings)
print(train.shape)
print(test.shape)

(943, 1682)
(943, 1682)
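
The split above draws from NumPy's global random state, so it changes on every run, and it would raise an error for any user with fewer than 10 ratings. A possible variant is sketched below, assuming a fixed seed is acceptable and that users with few ratings should keep at least one rating in the training set. `train_test_split_seeded`, `n_test`, and `seed` are hypothetical names, not part of the original notebook.

```python
import numpy as np


def train_test_split_seeded(ratings, n_test=10, seed=0):
    """Hold out up to n_test rated items per user, reproducibly."""
    rng = np.random.default_rng(seed)
    train = ratings.copy()
    test = np.zeros(ratings.shape)
    for user in range(ratings.shape[0]):
        rated = ratings[user].nonzero()[0]
        # Keep at least one rating in train for users with few ratings.
        n_holdout = min(n_test, max(len(rated) - 1, 0))
        if n_holdout == 0:
            continue
        held_out = rng.choice(rated, size=n_holdout, replace=False)
        train[user, held_out] = 0.0
        test[user, held_out] = ratings[user, held_out]
    return train, test


# Toy matrix (hypothetical): 4 users, 6 items.
toy = np.array(
    [
        [5, 3, 0, 1, 4, 2],
        [4, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 5, 4, 3, 2],
    ],
    dtype=float,
)
toy_train, toy_test = train_test_split_seeded(toy, n_test=2, seed=42)
print(toy_train)
print(toy_test)
```

Calling the function twice with the same `seed` returns the same split, which makes later error metrics comparable across runs.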


### Using the Surprise package for SVD, SVD++, NMF, etc.

In [38]:
!surprise -algo SVDpp -params "{'n_epochs': 5, 'verbose': True}" -load-builtin ml-100k -n-folds 3

 processing epoch 0
 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
 processing epoch 0
 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
 processing epoch 0
 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
Evaluating RMSE, MAE of algorithm SVDpp on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9540  0.9416  0.9550  0.9502  0.0061  
MAE (testset)     0.7570  0.7451  0.7572  0.7531  0.0057  
Fit time          25.94   24.94   25.07   25.32   0.44    
Test time         4.06    3.93    4.03    4.01    0.06    
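
The same experiment can be run from Python instead of the command line. This is a minimal sketch, assuming the `scikit-surprise` package is installed; the wrapper name `run_svdpp_cv` and its arguments are illustrative. Fold assignments are random, so the numbers will not match the table above exactly.

```python
def run_svdpp_cv(n_epochs=5, n_folds=3):
    """Cross-validate SVD++ on Surprise's built-in ml-100k dataset."""
    # Imported inside the function so the sketch can be pasted
    # into a session even before Surprise is installed.
    from surprise import Dataset, SVDpp
    from surprise.model_selection import cross_validate

    # Downloads ml-100k on first use (Surprise asks for confirmation).
    data = Dataset.load_builtin("ml-100k")
    algo = SVDpp(n_epochs=n_epochs, verbose=True)
    # Returns a dict with per-fold 'test_rmse', 'test_mae',
    # 'fit_time' and 'test_time' entries.
    return cross_validate(
        algo, data, measures=["RMSE", "MAE"], cv=n_folds, verbose=True
    )


# Example call (not executed here):
# results = run_svdpp_cv()
# print(results["test_rmse"].mean())
```

Swapping `SVDpp` for `SVD` or `NMF` from the same package would cover the other algorithms named in the heading.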
