# Capstone Project: Books recommender system

### Overall Contents:
- Background
- Data Collection
- Data Cleaning Books Interactions
- Data Cleaning Booklist
- Exploratory Data Analysis
* Non-personalized recommendation
    - Modeling 1 Popularity-based and Content-based recommendation system 
* Personalized recommendation
    - Modeling 2 Collaborative-filtering-based recommendation system
    - Modeling 3 Clustering-Collaborative-filtering-based recommendation system
    - [Modeling 4 Model based recommendation systems](##9.-Modeling-4-Model-based-recommendation-systems) <br>**(In this notebook)**
- Evaluation
- Conclusion and Recommendation

### Datasets

The datasets were obtained from the [University of California San Diego Book Graph](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home?authuser=0).

The user-book interaction and reference datasets listed below are used for the recommender system.

User-book interactions:-
* user_work_interactions
* user_work_interactions_model
* user_work_interactions_sample
* genrebook_interactions

Reference:-
* booklist_worktitle
* booklist_url

For more details on the datasets, please refer to the data_dictionary_model.ipynb.

## 9. Modeling 4 Model based recommendation systems

### 9.1 Libraries Import

In [1]:
import numpy as np
import pandas as pd
from book_recommender import coverage, ratings_rmse
from sklearn import metrics
import surprise
from numpy import count_nonzero
from sklearn.model_selection import train_test_split, GridSearchCV

%config InlineBackend.figure_format = 'retina'
%matplotlib inline 

# Raise the display limits for column width and number of rows
pd.options.display.max_colwidth = 2000
pd.options.display.max_rows = 2000

### 9.2 Data Import

In [2]:
userbook_interactions = pd.read_parquet("../data/user_work_interactions_sample_int.parquet")

### 9.3 Check the dataset

In [3]:
print(f"The number of user-book interactions is {userbook_interactions.shape[0]}")
print(f"The number of unique users is {userbook_interactions.user_id.nunique()}")
print(f"The number of unique books is {userbook_interactions.work_id.nunique()}")

The number of user-book interactions is 149756
The number of unique users is 22859
The number of unique books is 14376


In [4]:
userbook_interactions.head()

| | user_id | work_id | rating |
|---|---|---|---|
| 6 | 49476 | 13785503 | 3 |
| 7 | 347421 | 1679789 | 4 |
| 8 | 232281 | 1207024 | 5 |
| 14 | 195915 | 1424362 | 4 |
| 26 | 25982 | 3481898 | 4 |


### 9.4 Split the data into train/test data

In [5]:
train, test = train_test_split(userbook_interactions, test_size = 0.25, random_state = 39, stratify = userbook_interactions["user_id"])

In [6]:
#Verify Dimensions
print('train: ', train.shape)
print('test: ', test.shape)
print('train number of unique user_id: ', train.user_id.nunique())
print('test number of unique user_id: ', test.user_id.nunique())

train:  (112317, 3)
test:  (37439, 3)
train number of unique user_id:  22859
test number of unique user_id:  22859


In [7]:
print(f"The presence of train user_id in test: {train.user_id.isin(test.user_id).value_counts()}")
print(f"The presence of train work_id in test: {train.work_id.isin(test.work_id).value_counts()}")

The presence of train user_id in test: True    112317
Name: user_id, dtype: int64
The presence of train work_id in test: True     100624
False     11693
Name: work_id, dtype: int64


In [8]:
print(f"The presence of test user_id in train: {test.user_id.isin(train.user_id).value_counts()}")
print(f"The presence of test work_id in train: {test.work_id.isin(train.work_id).value_counts()}")

The presence of test user_id in train: True    37439
Name: user_id, dtype: int64
The presence of test work_id in train: True     36955
False      484
Name: work_id, dtype: int64


In [9]:
# Move test rows whose work_id never appears in train into the train set,
# so that every book in the test set has been seen during training
seen_in_train = test.work_id.isin(train.work_id)
train = pd.concat([train, test[~seen_in_train]], axis = 0)
test = test[seen_in_train]

# Reset the index for train and test set
train = train.sort_values(by="user_id").reset_index(drop=True)
test = test.sort_values(by="user_id").reset_index(drop=True)

print(f"The train set shape is {train.shape}")

The train set shape is (112801, 3)


In [10]:
print(f"The presence of test user_id in train: {test.user_id.isin(train.user_id).value_counts()}")
print(f"The presence of test work_id in train: {test.work_id.isin(train.work_id).value_counts()}")

The presence of test user_id in train: True    36955
Name: user_id, dtype: int64
The presence of test work_id in train: True    36955
Name: work_id, dtype: int64


In [11]:
reader = surprise.Reader(rating_scale = (1,5))
data_surprise = surprise.Dataset.load_from_df(train, reader)

### 9.5 Non-negative Matrix Factorization

In [12]:
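# Note: in Surprise's NMF, lr_bu and lr_bi are only relevant when biased=True;
# with the default biased=False used here they do not affect the fitted model
param_grid_nmf = {'lr_bu' : [0.01, 0.05],'lr_bi' : [0.005, 0.01], "n_epochs" : [25,50,100], 'n_factors' : [20,30,40]}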
gs_nmf = surprise.model_selection.GridSearchCV(surprise.NMF, param_grid_nmf, measures = ['rmse'], cv=5, n_jobs = -1)
gs_nmf.fit(data_surprise)
print(gs_nmf.best_score['rmse'])
print(gs_nmf.best_params['rmse'])

1.0695592089073456
{'lr_bu': 0.05, 'lr_bi': 0.005, 'n_epochs': 50, 'n_factors': 40}


In [13]:
best_model_nmf = surprise.NMF(lr_bu= 0.05, lr_bi= 0.005, n_epochs= 50, n_factors= 40)
output_nmf = surprise.model_selection.cross_validate(best_model_nmf, data_surprise, measures = ["rmse"], cv = 5, verbose = True)

Evaluating RMSE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0658  1.0665  1.0726  1.0718  1.0717  1.0697  0.0029  
Fit time          11.75   11.75   11.93   11.73   11.77   11.79   0.07    
Test time         0.19    0.19    0.20    0.18    0.12    0.17    0.03    


In [14]:
# Fit with the best parameters
nmf_best_model = gs_nmf.best_estimator['rmse']
nmf_best_model.fit(data_surprise.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.NMF at 0x1aafa2ffe50>

In [15]:
test_nmf = test.copy()

In [16]:
# Predict a rating for every (user, book) pair in the test set
test_nmf["predicted_rating"] = [
    nmf_best_model.predict(user_query_nmf, work_query_nmf).est
    for user_query_nmf, work_query_nmf in zip(test_nmf.user_id, test_nmf.work_id)
]

In [17]:
rmse_nmf = ratings_rmse(test_nmf, "rating", "predicted_rating")
rmse_nmf

1.017577427218871
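
`ratings_rmse` is imported from the project's `book_recommender` module, and its source is not shown in this notebook. As a rough sketch, assuming it computes the standard root-mean-square error between two DataFrame columns, an equivalent helper could look like the following. The name `rmse_from_columns` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def rmse_from_columns(df, actual_col, predicted_col):
    """Root-mean-square error between two columns.

    Hypothetical stand-in for ratings_rmse, whose real source is not shown here.
    """
    errors = df[actual_col] - df[predicted_col]
    return float(np.sqrt(np.mean(errors ** 2)))


# Toy data, not taken from the dataset
toy = pd.DataFrame({"rating": [5, 4, 3], "predicted_rating": [4.0, 4.0, 5.0]})
toy_rmse = rmse_from_columns(toy, "rating", "predicted_rating")
```

The errors here are 1, 0 and -2, so the result is the square root of 5/3. Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.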

In [18]:
test_coverage_nmf = coverage(test_nmf, "predicted_rating")
test_coverage_nmf

The total observations: 36955
Unable to predict:(0, 4)
Able to predict:(36955, 4)

The coverage for the recommender system is


100.0
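
The `coverage` helper also comes from `book_recommender` and is not shown here. Judging only from its printed output, it appears to count the rows for which a prediction is available. A minimal sketch under that assumption, with the hypothetical name `coverage_pct` and toy data:

```python
import pandas as pd


def coverage_pct(df, predicted_col):
    """Percentage of rows that have a non-null prediction.

    Hypothetical stand-in for coverage, whose real source is not shown here.
    """
    return 100.0 * df[predicted_col].notna().mean()


# Toy data: one of the four predictions is missing
toy = pd.DataFrame({"predicted_rating": [3.5, None, 4.2, 2.8]})
toy_coverage = coverage_pct(toy, "predicted_rating")
```

If the helper does work this way, it is worth keeping in mind that Surprise's `predict` returns a fallback estimate and flags it in `details["was_impossible"]` instead of returning a missing value, so a null-based check would report 100% for these models by construction.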

In [19]:
test_nmf.head()

| | user_id | work_id | rating | predicted_rating |
|---|---|---|---|---|
| 0 | 28 | 2960529 | 5 | 3.97289 |
| 1 | 32 | 856555 | 4 | 3.711898 |
| 2 | 39 | 914540 | 3 | 3.458416 |
| 3 | 41 | 19248070 | 5 | 3.811168 |
| 4 | 97 | 16680623 | 5 | 5.0 |


**Analysis: Non-negative matrix factorization has an RMSE of 1.0176 with a coverage of 100%.**

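Surprise's `NMF` is a collaborative filtering algorithm based on non-negative matrix factorization: each rating is predicted as the dot product of a user factor vector and an item factor vector, both constrained to be non-negative. Its test RMSE of 1.0176 is lower than that of the baseline model, which suggests that it predicts ratings more accurately than the baseline, and it returns a prediction for every user-book pair in the test set (100% coverage).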
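
To make the idea of non-negative factors concrete, here is a minimal NumPy sketch on a toy rating matrix. It uses masked multiplicative updates, a common textbook variant, and is not Surprise's implementation, which updates per rating and also supports regularization and optional biases. The function names and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def masked_nmf(R, mask, n_factors=2, n_iters=200, seed=0, eps=1e-9):
    """Factorize R ~ P @ Q.T on observed entries only, keeping P and Q non-negative."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.uniform(0.1, 1.0, size=(n_users, n_factors))
    Q = rng.uniform(0.1, 1.0, size=(n_items, n_factors))
    observed = R * mask
    for _ in range(n_iters):
        # Multiplicative updates: each factor is scaled by a non-negative ratio,
        # so the factors can never become negative.
        P *= (observed @ Q) / ((mask * (P @ Q.T)) @ Q + eps)
        Q *= (observed.T @ P) / ((mask * (P @ Q.T)).T @ P + eps)
    return P, Q


def masked_rmse(R, mask, P, Q):
    """RMSE computed over the observed entries only."""
    diff = mask * (R - P @ Q.T)
    return float(np.sqrt((diff ** 2).sum() / mask.sum()))


# Toy 4-user x 4-book rating matrix; 0 marks an unobserved rating
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 1, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
mask = (R > 0).astype(float)

P0, Q0 = masked_nmf(R, mask, n_iters=0)   # random starting point
P, Q = masked_nmf(R, mask, n_iters=200)   # after fitting
rmse_before = masked_rmse(R, mask, P0, Q0)
rmse_after = masked_rmse(R, mask, P, Q)
```

After fitting, `P @ Q.T` also fills in the unobserved cells, which is how a factorization model produces predictions for books a user has not rated.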

### 9.6 Singular Value Decomposition

In [21]:
param_grid_svd = {'lr_all' : [0.005, 0.01,0.05], "reg_all" : [0.02, 0.3, 0.4, 0.5], 'n_factors' : [10,20,30], 'n_epochs':[20,30,40]}
gs_svd = surprise.model_selection.GridSearchCV(surprise.SVD, param_grid_svd, measures = ['rmse'], cv=5 ,n_jobs = -1)
gs_svd.fit(data_surprise)
print(gs_svd.best_score['rmse'])
print(gs_svd.best_params['rmse'])

0.9232889540629798
{'lr_all': 0.005, 'reg_all': 0.3, 'n_factors': 10, 'n_epochs': 30}


In [22]:
best_model_svd = surprise.SVD(lr_all = 0.005, reg_all = 0.3, n_factors = 10, n_epochs = 30)
output_svd = surprise.model_selection.cross_validate(best_model_svd, data_surprise,measures = ["rmse"], cv =5, verbose = True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9243  0.9224  0.9145  0.9284  0.9264  0.9232  0.0048  
Fit time          2.41    2.33    2.35    2.33    2.34    2.35    0.03    
Test time         0.13    0.25    0.26    0.13    0.13    0.18    0.06    


In [23]:
# Fit with the best parameters
svd_best_model = gs_svd.best_estimator['rmse']
svd_best_model.fit(data_surprise.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1aa9d368d30>

In [24]:
test_svd = test.copy()

In [25]:
# Predict a rating for every (user, book) pair in the test set
test_svd["predicted_rating"] = [
    svd_best_model.predict(user_query_svd, work_query_svd).est
    for user_query_svd, work_query_svd in zip(test_svd.user_id, test_svd.work_id)
]

In [26]:
rmse_svd = ratings_rmse(test_svd, "rating", "predicted_rating")
rmse_svd

0.9086056409639094

In [27]:
test_coverage_svd = coverage(test_svd, "predicted_rating")
test_coverage_svd

The total observations: 36955
Unable to predict:(0, 4)
Able to predict:(36955, 4)

The coverage for the recommender system is


100.0

In [28]:
test_svd.head()

| | user_id | work_id | rating | predicted_rating |
|---|---|---|---|---|
| 0 | 28 | 2960529 | 5 | 3.394128 |
| 1 | 32 | 856555 | 4 | 3.780331 |
| 2 | 39 | 914540 | 3 | 3.80832 |
| 3 | 41 | 19248070 | 5 | 3.819059 |
| 4 | 97 | 16680623 | 5 | 4.346606 |


**Analysis: Singular Value Decomposition has an RMSE of 0.9086 with a coverage of 100%.**

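Surprise's `SVD` is a matrix factorization algorithm that represents each user and each book by a small number of latent factors (10 in the best model here) and predicts a rating from the global mean, the user and item biases, and the dot product of the two factor vectors. Its test RMSE of 0.9086 is lower than that of the baseline model and of NMF (1.0176), with 100% coverage. This suggests that the recommender system predicts ratings relatively well.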
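
The prediction rule behind this kind of model can be sketched in a few lines of NumPy. The code below is an illustrative stochastic-gradient-descent version on toy data, not Surprise's implementation. The function names and toy ratings are assumptions, and `lr` and `reg` play the role of `lr_all` and `reg_all` in the grid search.

```python
import numpy as np


def fit_biased_mf(ratings, n_users, n_items, n_factors=2, n_epochs=30,
                  lr=0.005, reg=0.02, seed=0):
    """SGD on r_hat = mu + b_u + b_i + p_u . q_i over (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))  # global mean rating
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    P = rng.normal(0.0, 0.1, size=(n_users, n_factors))
    Q = rng.normal(0.0, 0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            # One learning rate and one regularization term for all parameters
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q


def predict_rating(model, u, i):
    """Predicted rating, clipped to the 1-5 rating scale."""
    mu, b_u, b_i, P, Q = model
    return float(np.clip(mu + b_u[u] + b_i[i] + P[u] @ Q[i], 1, 5))


def train_rmse(model, ratings):
    """RMSE of the model on the triples it was fitted on."""
    sq = [(r - predict_rating(model, u, i)) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(sq)))


# Toy (user, book, rating) triples, not taken from the dataset
toy_ratings = [
    (0, 0, 5), (0, 1, 4), (0, 3, 1),
    (1, 0, 4), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 3, 5),
    (3, 1, 1), (3, 2, 5), (3, 3, 4),
]
untrained = fit_biased_mf(toy_ratings, n_users=4, n_items=4, n_epochs=0)
trained = fit_biased_mf(toy_ratings, n_users=4, n_items=4, n_epochs=200, lr=0.01)
rmse_untrained = train_rmse(untrained, toy_ratings)
rmse_trained = train_rmse(trained, toy_ratings)
```

With zero epochs the model predicts roughly the global mean for every pair, so comparing the two RMSE values shows how much the biases and latent factors add. A larger `reg` shrinks the parameters toward zero, which may explain why the grid search favoured `reg_all = 0.3` on this sparse dataset.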

### 9.7 Singular Value Decomposition ++

In [29]:
param_grid_svdpp = {'lr_all' : [0.005, 0.01,0.05], "reg_all" : [0.02, 0.3, 0.4, 0.5], 'n_factors' : [10,20,30], 'n_epochs':[20,30,40]}
gs_svdpp = surprise.model_selection.GridSearchCV(surprise.SVDpp, param_grid_svdpp, measures = ['rmse'], cv=5 ,n_jobs = -1)
gs_svdpp.fit(data_surprise)
print(gs_svdpp.best_score['rmse'])
print(gs_svdpp.best_params['rmse'])

0.9234583397670836
{'lr_all': 0.005, 'reg_all': 0.3, 'n_factors': 10, 'n_epochs': 30}


In [30]:
best_model_svdpp = surprise.SVDpp(lr_all = 0.005, reg_all = 0.3, n_factors = 10, n_epochs = 30)
output_svdpp = surprise.model_selection.cross_validate(best_model_svdpp, data_surprise,measures = ["rmse"], cv=5, verbose = True)

Evaluating RMSE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9225  0.9251  0.9241  0.9250  0.9196  0.9233  0.0021  
Fit time          13.88   13.74   14.18   14.24   14.15   14.04   0.19    
Test time         0.31    0.52    0.42    0.31    0.31    0.37    0.09    


In [31]:
# Fit with the best parameters
svdpp_best_model = gs_svdpp.best_estimator['rmse']
svdpp_best_model.fit(data_surprise.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x1aa9d732c40>

In [32]:
test_svdpp = test.copy()

In [33]:
# Predict a rating for every (user, book) pair in the test set
test_svdpp["predicted_rating"] = [
    svdpp_best_model.predict(user_query_svdpp, work_query_svdpp).est
    for user_query_svdpp, work_query_svdpp in zip(test_svdpp.user_id, test_svdpp.work_id)
]

In [34]:
rmse_svdpp = ratings_rmse(test_svdpp, "rating", "predicted_rating")
rmse_svdpp

0.9088537625040092

In [35]:
test_coverage_svdpp = coverage(test_svdpp, "predicted_rating")
test_coverage_svdpp

The total observations: 36955
Unable to predict:(0, 4)
Able to predict:(36955, 4)

The coverage for the recommender system is


100.0

In [36]:
test_svdpp.head()

| | user_id | work_id | rating | predicted_rating |
|---|---|---|---|---|
| 0 | 28 | 2960529 | 5 | 3.367236 |
| 1 | 32 | 856555 | 4 | 3.80885 |
| 2 | 39 | 914540 | 3 | 3.806653 |
| 3 | 41 | 19248070 | 5 | 3.892551 |
| 4 | 97 | 16680623 | 5 | 4.366291 |


**Analysis: Singular Value Decomposition ++ has an RMSE of 0.9089 with a coverage of 100%.**

SVD++ extends SVD by also taking implicit feedback into account: the fact that a user rated a book at all, regardless of the rating value. Its test RMSE of 0.9089 is lower than that of the baseline model with 100% coverage, which suggests that this model also predicts ratings relatively well. It does not improve on plain SVD (0.9086), however, and takes about six times longer to fit (a mean of 14.04 s versus 2.35 s per fold).
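
The implicit-feedback term can be illustrated with a small worked example. The sketch below evaluates the SVD++ prediction rule for a single user-book pair using hand-picked toy numbers. The function name `svdpp_predict` and all values are assumptions for illustration, not parameters learned from this dataset.

```python
import numpy as np


def svdpp_predict(mu, b_u, b_i, p_u, q_i, y_rated):
    """r_hat = mu + b_u + b_i + q_i . (p_u + |I_u|^(-1/2) * sum of y_j over rated books)."""
    implicit = y_rated.sum(axis=0) / np.sqrt(len(y_rated))
    return float(mu + b_u + b_i + q_i @ (p_u + implicit))


# Toy numbers chosen by hand
mu, b_u, b_i = 3.5, 0.2, -0.1
p_u = np.array([0.5, 0.0])   # explicit user factors
q_i = np.array([1.0, 2.0])   # book factors
# Implicit factors y_j of the four books this user has rated
y_rated = np.array([[0.1, 0.0], [0.3, 0.2], [0.0, 0.2], [0.0, 0.0]])

explicit_only = float(mu + b_u + b_i + q_i @ p_u)                # plain SVD rule
with_implicit = svdpp_predict(mu, b_u, b_i, p_u, q_i, y_rated)   # SVD++ rule
```

Here the implicit term shifts the user vector from (0.5, 0.0) to (0.7, 0.2), raising the prediction from about 4.1 to about 4.7. Summing the `y_j` vectors over every book a user has rated is also a plausible reason why SVD++ takes longer to fit than SVD.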