# Collaborative Filtering
> Author: [Yalim Demirkesen](github.com/demirkeseny) 

> Date: March 2019

In [1]:
# Necessary libraries:
import pandas as pd
import surprise
from surprise import SVD
from surprise import SVDpp
from surprise import SlopeOne
from surprise import NMF
from surprise import NormalPredictor
from surprise import KNNBaseline
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import BaselineOnly
from surprise import CoClustering
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import GridSearchCV
from surprise.model_selection import cross_validate
from surprise import Dataset
from surprise import Reader

### User-Based with Surprise

For user-based collaborative filtering we use a library called Surprise, which lets us compare many models in a much shorter period of time.

In [2]:
ratings = pd.read_csv('./data/ratings_updated.csv')

We load our dataframe into Surprise using the `Reader` class and `Dataset.load_from_df`.

In [3]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user_id', 'book_id', 'rating']], reader)

In [4]:
comparison = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), 
                  KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Performing cross validation with 3-folds
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results and create a data frame including the algorithm and its error rate and run time
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # pd.concat is used because Series.append was removed in pandas 2.0
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    comparison.append(tmp)
    
pd.DataFrame(comparison).set_index('Algorithm').sort_values('test_rmse') 

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


Algorithm,test_rmse,fit_time,test_time
SVDpp,0.836106,355.712914,15.639056
SVD,0.843045,39.706435,2.967792
BaselineOnly,0.845931,2.640687,2.484811
KNNBaseline,0.859549,85.374498,54.249281
KNNWithZScore,0.865394,109.459474,71.63931
KNNWithMeans,0.865835,110.424613,60.090756
CoClustering,0.869896,18.584387,2.81894
NMF,0.889443,44.29435,2.678582
SlopeOne,0.911281,4.925132,12.595289
KNNBasic,0.929855,115.98364,54.815071
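
The `test_rmse` column above is the root mean squared error between actual and predicted ratings, averaged over the folds. As a minimal sketch of the metric itself (and of MAE, which is used later in the grid search), the helper functions below are hypothetical and the ratings are made up for illustration; they are not taken from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating sequences."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Made-up ratings for illustration only (not taken from the dataset).
toy_actual = [5.0, 3.0, 4.0, 1.0]
toy_predicted = [4.0, 3.0, 5.0, 3.0]

toy_rmse = rmse(toy_actual, toy_predicted)
toy_mae = mae(toy_actual, toy_predicted)
print(toy_rmse, toy_mae)
```

Because the errors are squared before averaging, RMSE penalizes a single large miss (the last pair here) more heavily than MAE does.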


With 3-fold cross-validation, SVDpp obtains the best test RMSE (0.836), but I prefer SVD: its test RMSE (0.843) is only slightly worse, while it fits roughly nine times faster (about 40 seconds versus 356 seconds). After picking the algorithm, I start by splitting the data into training and test sets. Then I run the SVD model with a grid search, which lets me find the best-performing hyperparameters for the model.
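
To make the trade-off between SVDpp and SVD concrete, here is a small worked calculation. The four numbers are copied from the comparison table above; the variable names are only for illustration.

```python
# Mean 3-fold results copied from the comparison table above.
svdpp_rmse, svdpp_fit = 0.836106, 355.712914
svd_rmse, svd_fit = 0.843045, 39.706435

# Relative RMSE penalty paid by choosing SVD over SVDpp.
rmse_increase = (svd_rmse - svdpp_rmse) / svdpp_rmse

# How many times faster SVD fits than SVDpp.
fit_speedup = svdpp_fit / svd_fit

print(f"RMSE increase: {rmse_increase:.2%}, fit speed-up: {fit_speedup:.1f}x")
```

The RMSE penalty stays under one percent, while the fit time drops by a factor of about nine, which is why SVD is used for the grid search.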

In [6]:
trainset, testset = train_test_split(data, test_size=0.20)

In [7]:
param_grid = {'n_factors': [10, 20, 30, 40], 
              'n_epochs': [30, 35, 40], 
              'lr_all': [0.001, 0.003, 0.005, 0.008],
              'reg_all': [0.08, 0.1, 0.15]}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

algo = gs.best_estimator['rmse']
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

0.8299778084397585
{'n_factors': 40, 'n_epochs': 40, 'lr_all': 0.008, 'reg_all': 0.1}
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8245  0.8210  0.8236  0.8231  0.8224  0.8229  0.0012  
MAE (testset)     0.6410  0.6389  0.6400  0.6397  0.6398  0.6399  0.0007  
Fit time          53.85   53.16   53.32   53.16   53.82   53.46   0.31    
Test time         1.98    1.91    2.21    1.86    2.23    2.04    0.15    


{'test_rmse': array([0.82445391, 0.82101992, 0.82363432, 0.82314183, 0.82244423]),
 'test_mae': array([0.6409841 , 0.6388709 , 0.63999116, 0.63968937, 0.6397802 ]),
 'fit_time': (53.851797580718994,
  53.159351110458374,
  53.31809902191162,
  53.15888428688049,
  53.81906247138977),
 'test_time': (1.981522560119629,
  1.9101359844207764,
  2.20574688911438,
  1.8565313816070557,
  2.2260844707489014)}

The grid search yields the best parameters shown above. As a next step, we build our final model with these parameters.
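
To get a feeling for the cost of this search, the sketch below enumerates the grid with the standard library, assuming the search is exhaustive (every combination is fitted once per fold). It only counts the combinations and does not call Surprise; the names `combinations` and `n_fits` are introduced here for illustration.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {'n_factors': [10, 20, 30, 40],
              'n_epochs': [30, 35, 40],
              'lr_all': [0.001, 0.003, 0.005, 0.008],
              'reg_all': [0.08, 0.1, 0.15]}

# One dict per combination: one value picked for each hyperparameter.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_folds = 3  # cv=3 in the grid search above
n_fits = len(combinations) * n_folds

print(len(combinations), "combinations,", n_fits, "model fits")
```

The best parameters reported above are one of these combinations. Note that the selected `n_factors`, `n_epochs` and `lr_all` all sit at the upper edge of their ranges, so a wider grid might be worth trying.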

In [8]:
algo = SVD(n_factors=40, n_epochs=40, lr_all=0.008, reg_all=0.1)
algo.fit(trainset)
test_pred = algo.test(testset)
print("SVD : Test Set")
accuracy.rmse(test_pred, verbose=True)

SVD : Test Set
RMSE: 0.8262


0.8262050115450225
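
Surprise documents its `SVD` algorithm as a biased matrix factorization: the estimate is the global mean plus a user bias, an item bias and the dot product of a user and an item factor vector, all learned by stochastic gradient descent. The following is a minimal pure-Python sketch of that idea, not Surprise's actual implementation. The toy ratings, the function names `train_biased_mf` and `predict`, and the default values are made up for illustration, and clipping to the rating scale is left out.

```python
import math
import random

def train_biased_mf(ratings, n_factors=2, n_epochs=100,
                    lr_all=0.01, reg_all=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + q_i . p_u by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    # Small random initial factors.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # lr_all is the step size, reg_all shrinks every parameter towards 0.
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * qif - reg_all * puf)
                q[i][f] += lr_all * (err * puf - reg_all * qif)
    return {'mu': mu, 'bu': bu, 'bi': bi, 'p': p, 'q': q}

def predict(model, u, i):
    """Estimate a rating; unknown users or items fall back to the known terms."""
    est = model['mu'] + model['bu'].get(u, 0.0) + model['bi'].get(i, 0.0)
    if u in model['p'] and i in model['q']:
        est += sum(pf * qf for pf, qf in zip(model['p'][u], model['q'][i]))
    return est

# Made-up (user, book, rating) triples for illustration only.
toy_ratings = [("ann", "dune", 5), ("ann", "emma", 3), ("bob", "dune", 4),
               ("bob", "emma", 2), ("bob", "ulysses", 3),
               ("cat", "dune", 1), ("cat", "ulysses", 5)]

model = train_biased_mf(toy_ratings)
n = len(toy_ratings)
train_rmse = math.sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in toy_ratings) / n)
baseline_rmse = math.sqrt(sum((r - model['mu']) ** 2 for _, _, r in toy_ratings) / n)
print(f"global-mean RMSE {baseline_rmse:.3f}, trained RMSE {train_rmse:.3f}")
```

In this sketch `n_factors`, `n_epochs`, `lr_all` and `reg_all` play the same roles as the parameters tuned in the grid search: the length of the factor vectors, the number of passes over the data, the step size and the strength of the regularization.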

### Evaluating the Surprise Module

To evaluate how our model performs, we generate a predicted rating for every user-book pair in the test set and compare it with the rating the user actually gave. To start with that evaluation, I create a data frame displaying the real rating, the estimated rating, and the absolute error between them.

In [9]:
test_pred = pd.DataFrame(test_pred)
test_pred.drop(columns = ['details'], inplace = True)
test_pred.columns = ['user_id','book_id','real_rating','est_rating']
test_pred['error'] = abs(test_pred.est_rating - test_pred.real_rating)

In [10]:
test_pred.head()

Unnamed: 0,user_id,book_id,real_rating,est_rating,error
0,32612,1387,5.0,4.45424,0.54576
1,7331,1840,5.0,3.967807,1.032193
2,6281,1350,4.0,3.630505,0.369495
3,7834,1725,5.0,3.891528,1.108472
4,32288,9448,4.0,3.499786,0.500214
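
Each element returned by `algo.test` is a Surprise `Prediction` namedtuple with the fields `uid`, `iid`, `r_ui`, `est` and `details`, which is why the columns can simply be renamed after dropping `details`. The sketch below reproduces that mapping without Surprise or pandas, using a stand-in namedtuple with the same field names and made-up values, so it is an illustration rather than the notebook's actual data.

```python
from collections import namedtuple

# Stand-in mirroring the field names of Surprise's Prediction namedtuple.
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])

# Made-up predictions for illustration only.
toy_predictions = [
    Prediction(1, 10, 5.0, 4.5, {'was_impossible': False}),
    Prediction(2, 11, 3.0, 3.75, {'was_impossible': False}),
]

# Same renaming as in the cell above, plus the absolute error per row.
rows = [
    {'user_id': p.uid,
     'book_id': p.iid,
     'real_rating': p.r_ui,
     'est_rating': p.est,
     'error': abs(p.est - p.r_ui)}
    for p in toy_predictions
]

for row in rows:
    print(row)
```

The `error` column is an absolute error, so it tells us how far off an estimate is but not whether the model over- or underestimated the rating.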


As a next step, I will check how the predictions perform on individual books, which first requires loading the book metadata.

In [12]:
books_ext = pd.read_csv('./data/books_extended.csv', encoding='utf-8-sig')

In [13]:
books = pd.read_csv('./data/books.csv', encoding='utf-8-sig')

We merge these two datasets so that we have the most extensive dataset about our books:

In [14]:
book = pd.merge(books[['id','book_id']], books_ext,
                      on='book_id', how = 'inner')

In [15]:
# Dropping the unnecessary columns:
book.drop(columns = ['book_id','Unnamed: 0', 'image_url', 'small_image_url'], 
           inplace = True)

In [16]:
# Just to have the same column names, I change the column 'id' to 'book_id':
book.rename(columns={'id':'book_id'}, inplace = True)

In [17]:
book.head()

Unnamed: 0,book_id,books_count,isbn,authors,original_publication_year,original_title,title_x,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,description
0,1,272,439023483,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,"Could you survive on your own, in the wild, wi..."
1,2,491,439554934,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,Harry Potter's life is miserable. His parents ...
2,3,226,316015849,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,<b>About three things I was absolutely positiv...
3,4,487,61120081,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267,The unforgettable novel of a childhood in a sl...
4,5,1356,743273567,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718,Alternate Cover Edition ISBN: 0743273567 (ISBN...


Now we will compare the rating given by the user with what our model predicts. In order to do that, we will merge our predictions dataframe and books dataframe.

In [18]:
main = pd.merge(test_pred, book[['book_id','title_x']], 
                on = 'book_id', how = 'inner')

In [19]:
main.columns.tolist()

['user_id', 'book_id', 'real_rating', 'est_rating', 'error', 'title_x']

In [20]:
main.head()

Unnamed: 0,user_id,book_id,real_rating,est_rating,error,title_x
0,32612,1387,5.0,4.45424,0.54576,Origins (The Vampire Diaries: Stefan's Diaries...
1,41015,1387,3.0,2.999321,0.000679,Origins (The Vampire Diaries: Stefan's Diaries...
2,48326,1387,5.0,3.119508,1.880492,Origins (The Vampire Diaries: Stefan's Diaries...
3,43417,1387,3.0,3.096333,0.096333,Origins (The Vampire Diaries: Stefan's Diaries...
4,26327,1387,5.0,4.510642,0.489358,Origins (The Vampire Diaries: Stefan's Diaries...


Just to provide an example, below is the performance of our model for user `2514`:

In [21]:
main[main['user_id'] == 2514].sort_values(by=['error'])[['user_id','title_x','real_rating','est_rating','error']].reset_index(drop=True)

Unnamed: 0,user_id,title_x,real_rating,est_rating,error
0,2514,May We Be Forgiven,4.0,4.194417,0.194417
1,2514,"Marching Powder: A True Story of Friendship, C...",5.0,4.43521,0.56479
2,2514,Amy and Isabelle,5.0,4.262544,0.737456
3,2514,"Silence (Silence, #1)",3.0,3.820086,0.820086
4,2514,This is the Story of a Happy Marriage,5.0,4.145987,0.854013
5,2514,"Silent Scream (D.I. Kim Stone, #1)",3.0,3.984216,0.984216
6,2514,"Blue Monday (Frieda Klein, #1)",3.0,4.139158,1.139158


In [22]:
# Filtering a single user (here, user 12874) out of all users:
user_interest = main[main['user_id'] == 12874]

# Sorting the dataframe by error and taking the 45 rows with the smallest error:
user_interest = user_interest.sort_values(by=['error'])[['user_id','title_x','real_rating','est_rating','error']].head(45)

# Resetting the index so that the rows are renumbered from 0 in order of increasing error:
user_interest = user_interest.reset_index(drop=True)

# Then rounding the estimated rating to one decimal place:
user_interest.est_rating = user_interest.est_rating.round(1)

In [23]:
# Renaming the columns:
user_interest.columns = ['USER ID','BOOK TITLE', 'ACT RATING','EST RATING', 'ERROR']

# Taking the ones with error rate less than 0.5:
user_interest = user_interest[user_interest['ERROR'] < 0.5]

# Sorting by estimated rating so that the books most likely to interest the user are displayed first:
user_interest.sort_values(by=['EST RATING'], ascending=False)

Unnamed: 0,USER ID,BOOK TITLE,ACT RATING,EST RATING,ERROR
4,12874,"Winnie-the-Pooh (Winnie-the-Pooh, #1)",4.0,4.0,0.04163
1,12874,A Light in the Attic,4.0,4.0,0.023927
8,12874,Where the Sidewalk Ends,4.0,3.9,0.090704
9,12874,"Little House on the Prairie (Little House, #2)",4.0,3.9,0.109474
11,12874,The Glass Castle,4.0,3.9,0.128621
12,12874,The Art of Racing in the Rain,4.0,3.8,0.156716
14,12874,Harry Potter and the Sorcerer's Stone (Harry P...,4.0,3.8,0.220691
23,12874,"The Two Towers (The Lord of the Rings, #2)",4.0,3.7,0.319661
17,12874,The Hobbit,4.0,3.7,0.279615
25,12874,"A Wrinkle in Time (A Wrinkle in Time Quintet, #1)",4.0,3.7,0.340359


Using plotly, we can generate an interactive bar chart comparing the actual and estimated ratings for user `12874`.

In [28]:
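# Note: from plotly 4 onwards this module lives in the separate chart_studio package (chart_studio.plotly)
import plotly.plotly as py
import plotly.graph_objs as go

# Sampling once (at most 12 books) instead of three times, so titles and ratings come from the same rows:
sample = user_interest.sample(n=min(12, len(user_interest)), random_state=1)
x = sample['BOOK TITLE'].tolist()
y = sample['ACT RATING'].tolist()
y2 = sample['EST RATING'].tolist()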

actual = go.Bar(
    x=x,
    y=y,
    text=y,
    textposition = 'auto',
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=0.5),
        ),
    opacity=0.6
)

predicted = go.Bar(
    x=x,
    y=y2,
    text=y2,
    textposition = 'auto',
    marker=dict(
        color='rgb(58,200,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5),
        ),
    opacity=0.6
)

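# Named `traces` rather than `data` so that the Surprise dataset defined earlier is not overwritten:
traces = [actual, predicted]

py.iplot(traces, filename='grouped-bar-direct-labels')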

In [25]:
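# Wrapping the steps above in a function:
def top_choice(user_identification, top):
    """Return the `top` test-set books with the highest estimated rating for a user."""
    cols = ['user_id', 'title_x', 'real_rating', 'est_rating', 'error']
    # Sorting by error first, so the index reflects the error rank as in the cells above:
    top_ch = main[main['user_id'] == user_identification].sort_values(by=['error'])[cols].reset_index(drop=True)
    # Then keeping the rows with the highest estimated rating:
    top_ch = top_ch.sort_values(by=['est_rating'], ascending=False).head(top)
    return top_ch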

In [26]:
# Thanks to the above function, we can generate the books that the user 
# would enjoy most according to our recommendation engine:
top_choice(12874, 5)

Unnamed: 0,user_id,title_x,real_rating,est_rating,error
1,12874,A Light in the Attic,4.0,4.023927,0.023927
4,12874,"Winnie-the-Pooh (Winnie-the-Pooh, #1)",4.0,3.95837,0.04163
8,12874,Where the Sidewalk Ends,4.0,3.909296,0.090704
9,12874,"Little House on the Prairie (Little House, #2)",4.0,3.890526,0.109474
11,12874,The Glass Castle,4.0,3.871379,0.128621
