# Capstone Project: Books recommender system

### Overall Contents:
- Background
- Data Collection
- Data Cleaning Books Interactions
- Data Cleaning Booklist
- Exploratory Data Analysis
* Non-personalized recommendation
    - Modeling 1 Popularity-based and Content-based recommendation system
* Personalized recommendation
    - [Modeling 2 Collaborative-filtering based recommendation system](#7.-Modeling-2-Collaborative-filtering-based-recommendation-system)<br>**(In this notebook)**
    - Modeling 3 Clustering-Collaborative-filtering-based recommendation system 
    - Modeling 4 Model-based recommendation systems
- Evaluation
- Conclusion and Recommendation

### Datasets

The datasets are obtained from the [University of California San Diego Book Graph](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home?authuser=0).

The datasets below, covering user-book interactions and reference data, are used for the recommender system.

User-book interactions:-
* user_work_interactions
* user_work_interactions_model
* user_work_interactions_sample
* genrebook_interactions

Reference:-
* booklist_worktitle
* booklist_url

For more details on the datasets, please refer to `data_dictionary_model.ipynb`.

## 7. Modeling 2 Collaborative-filtering based recommendation system

### 7.1 Libraries Import

In [1]:
import numpy as np
import pandas as pd
from book_recommender import user_collaborative_filtering_cosine, item_collaborative_filtering_cosine, coverage, ratings_rmse
from numpy import count_nonzero
from sklearn import metrics
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from IPython.display import clear_output

%config InlineBackend.figure_format = 'retina'
%matplotlib inline 

# Widen the displayed column width and raise the maximum number of rows shown
pd.options.display.max_colwidth = 2000
pd.options.display.max_rows = 2000

### 7.2 Data Import

In [2]:
userbook_interactions = pd.read_parquet("../data/user_work_interactions_sample_int.parquet")

### 7.3 Check the dataset

In [3]:
print(f"The number of user-book interactions is {userbook_interactions.shape[0]}")
print(f"The number of unique users is {userbook_interactions.user_id.nunique()}")
print(f"The number of unique books is {userbook_interactions.work_id.nunique()}")

The number of user-book interactions is 149756
The number of unique users is 22859
The number of unique books is 14376


In [4]:
userbook_interactions.head()

Unnamed: 0,user_id,work_id,rating
6,49476,13785503,3
7,347421,1679789,4
8,232281,1207024,5
14,195915,1424362,4
26,25982,3481898,4


### 7.4 Split the data into train/test data

In [5]:
train, test = train_test_split(userbook_interactions, test_size = 0.25, random_state = 39, stratify = userbook_interactions["user_id"])

In [6]:
#Verify Dimensions
print('train: ', train.shape)
print('test: ', test.shape)
print('train number of unique user_id: ', train.user_id.nunique())
print('test number of unique user_id: ', test.user_id.nunique())

train:  (112317, 3)
test:  (37439, 3)
train number of unique user_id:  22859
test number of unique user_id:  22859


In [7]:
print(f"The presence of train user_id in test: {train.user_id.isin(test.user_id).value_counts()}")
print(f"The presence of train work_id in test: {train.work_id.isin(test.work_id).value_counts()}")

The presence of train user_id in test: True    112317
Name: user_id, dtype: int64
The presence of train work_id in test: True     100624
False     11693
Name: work_id, dtype: int64


In [8]:
print(f"The presence of test user_id in train: {test.user_id.isin(train.user_id).value_counts()}")
print(f"The presence of test work_id in train: {test.work_id.isin(train.work_id).value_counts()}")

The presence of test user_id in train: True    37439
Name: user_id, dtype: int64
The presence of test work_id in train: True     36955
False      484
Name: work_id, dtype: int64


In [9]:
# Move the test rows whose work_id is not present in train into the train set.
# A boolean mask is used instead of a helper column, which avoids assigning
# to a slice of the original DataFrame (SettingWithCopyWarning).
in_train = test.work_id.isin(train.work_id)
train = pd.concat([train, test[~in_train]], axis = 0)
test = test[in_train]

# Reset the index for train and test set
train = train.sort_values(by="user_id").reset_index(drop=True)
test = test.sort_values(by="user_id").reset_index(drop=True)

print(f"The train set shape is {train.shape}")

The train set shape is (112801, 3)


In [10]:
print(f"The presence of test user_id in train: {test.user_id.isin(train.user_id).value_counts()}")
print(f"The presence of test work_id in train: {test.work_id.isin(train.work_id).value_counts()}")

The presence of test user_id in train: True    36955
Name: user_id, dtype: int64
The presence of test work_id in train: True    36955
Name: work_id, dtype: int64


### 7.5 Null model

For the null model, we use random predictions of the ratings (integers from 1 to 5) as the baseline.

In [11]:
test_null = test.copy()

In [12]:
randomstate = np.random.RandomState(39)
for num in range(len(test)):
    test_null.loc[[num], ["predicted_rating_random"]] = randomstate.randint(1,6)

In [13]:
test_null.head()

Unnamed: 0,user_id,work_id,rating,predicted_rating_random
0,28,2960529,5,2.0
1,32,856555,4,2.0
2,39,914540,3,2.0
3,41,19248070,5,1.0
4,97,16680623,5,5.0


In [14]:
test_null.predicted_rating_random.describe()

count    36955.000000
mean         3.001218
std          1.412902
min          1.000000
25%          2.000000
50%          3.000000
75%          4.000000
max          5.000000
Name: predicted_rating_random, dtype: float64

In [15]:
# RMSE_random_int
np.sqrt(metrics.mean_squared_error(test_null.rating, test_null.predicted_rating_random))

1.9265149580634633
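
The row-by-row loop above can be slow on larger test sets. As a minimal sketch, a random baseline of the same kind can be drawn in a single vectorized call. The helper name `random_baseline_rmse` and the toy ratings are hypothetical and used only for illustration, and the draws are not guaranteed to reproduce the loop's exact sequence.

```python
import numpy as np
import pandas as pd


def random_baseline_rmse(ratings, seed=39):
    """Hypothetical helper: predict uniform random integer ratings in 1..5
    and return the predictions together with their RMSE."""
    rng = np.random.RandomState(seed)
    # randint's upper bound is exclusive, so this draws from {1, 2, 3, 4, 5}
    predicted = rng.randint(1, 6, size=len(ratings))
    errors = ratings.to_numpy() - predicted
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return predicted, rmse


# Toy ratings (assumed data, not taken from the dataset)
toy_ratings = pd.Series([5, 4, 3, 5, 5], name="rating")
predicted, rmse = random_baseline_rmse(toy_ratings)
print(predicted, round(rmse, 4))
```

Fixing the seed keeps the baseline reproducible between runs, which matters when it is used as the reference point for the other models.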

### 7.6 Collaborative filtering - user-based

In [16]:
# Form the pivot table and mean-centre the ratings per user
user_rating = pd.pivot_table(train, index = "user_id", columns = "work_id", values = "rating")
user_mean_centering = (user_rating.T - user_rating.mean(axis = 1)).T
user_mean_centering = user_mean_centering.fillna(0)
# Cosine similarity of the users
similarity_matrix = cosine_similarity(user_mean_centering)
similar_users = pd.DataFrame(similarity_matrix, columns=user_mean_centering.index, index=user_mean_centering.index)

In [17]:
for index, value in enumerate(test.user_id):
    userid_query = value
    workid_query = test.work_id[index]
    test.loc[[index], ["predicted_rating_user"]] = user_collaborative_filtering_cosine(user_rating, similar_users, userid_query, workid_query)
    clear_output(wait=True)
    print(f'progress: {index+1}/{len(test.user_id)}')

progress: 36955/36955


In [19]:
test_coverage = coverage(test, "predicted_rating_user")
test_coverage

The total observations: 36955
Unable to predict:(34352, 4)
Able to predict:(2603, 4)

The coverage for the recommender system is


7.04

In [20]:
rmse = ratings_rmse(test, "rating", "predicted_rating_user")
rmse

1.2720746113174732
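
The `coverage` and `ratings_rmse` helpers are imported from `book_recommender`, whose source is not shown in this notebook. The sketch below is one plausible reading of what they compute, assuming that an unpredictable pair is stored as `NaN` and that RMSE is taken only over rows with a prediction. The names `coverage_pct` and `rmse_on_predicted` are hypothetical. The assumption is at least consistent with the printed figures, since 2603 / 36955 is approximately 7.04%.

```python
import numpy as np
import pandas as pd


def coverage_pct(df, pred_col):
    """Hypothetical helper: percentage of rows with a non-null prediction."""
    return float(round(100 * df[pred_col].notna().mean(), 2))


def rmse_on_predicted(df, actual_col, pred_col):
    """Hypothetical helper: RMSE over the rows that received a prediction."""
    scored = df.dropna(subset=[pred_col])
    squared_errors = (scored[actual_col] - scored[pred_col]) ** 2
    return float(np.sqrt(squared_errors.mean()))


# Toy frame (assumed data): two of four rows could not be predicted
toy = pd.DataFrame({
    "rating": [5, 4, 3, 2],
    "predicted": [4.0, np.nan, 3.0, np.nan],
})
toy_coverage = coverage_pct(toy, "predicted")
toy_rmse = rmse_on_predicted(toy, "rating", "predicted")
print(toy_coverage, toy_rmse)
```

Under this reading, RMSE and coverage should be interpreted together: a low RMSE computed on only a few percent of the test set says little about the pairs the model could not score.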

**Analysis: User-based collaborative filtering has an RMSE of 1.2721 with a coverage of 7.04%.** 

The user-based collaborative filtering has an RMSE lower than the baseline model (1.2721 vs. 1.9265), which suggests that the recommender system has some predictive ability. However, its coverage is very low, possibly because similar users cannot be found, or because the similar users have not read the book in query.
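
To illustrate why a prediction can be missing, here is a minimal sketch of a mean-centred, similarity-weighted user-based predictor. It is not the `user_collaborative_filtering_cosine` implementation from `book_recommender`, which is not shown here. The function `predict_user_based`, its `k` parameter, the choice to keep only positively similar neighbours, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def predict_user_based(user_rating, similar_users, user_id, work_id, k=10):
    """Hypothetical helper: predict one rating from the k most similar users."""
    if user_id not in user_rating.index or work_id not in user_rating.columns:
        return np.nan
    # Other users who have rated the queried book
    raters = user_rating[work_id].dropna().drop(user_id, errors="ignore")
    # Keep only positively similar neighbours, at most k of them
    sims = similar_users.loc[user_id, raters.index]
    sims = sims[sims > 0].nlargest(k)
    if sims.empty:
        # No usable neighbour has read the book, so no prediction is made
        return np.nan
    user_means = user_rating.mean(axis=1)
    deviations = raters.loc[sims.index] - user_means.loc[sims.index]
    return float(user_means.loc[user_id] + (sims * deviations).sum() / sims.sum())


# Toy interactions (assumed data, not taken from the dataset)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3, 3],
    "work_id": [10, 20, 10, 20, 30, 20, 40],
    "rating":  [5, 3, 4, 2, 5, 4, 1],
})
user_rating = pd.pivot_table(toy, index="user_id", columns="work_id", values="rating")
centred = user_rating.sub(user_rating.mean(axis=1), axis=0).fillna(0)
similar_users = pd.DataFrame(
    cosine_similarity(centred), index=centred.index, columns=centred.index
)

# User 2 is similar to user 1 and has read book 30, so a prediction exists
pred_found = predict_user_based(user_rating, similar_users, 1, 30)
# Only user 3 has read book 40, and user 3 is negatively similar to user 1
pred_missing = predict_user_based(user_rating, similar_users, 1, 40)
print(pred_found, pred_missing)
```

In this sketch the second query returns `NaN` because the only reader of the book is not a positively similar user, which mirrors the two causes of low coverage suggested above.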

### 7.7 Collaborative filtering - item-based

In [21]:
# Form the pivot table and mean-centre the ratings per book
user_book_rating = pd.pivot_table(train, index = "user_id", columns = "work_id", values = "rating")
user_book_rating_mean = (user_book_rating - user_book_rating.mean(axis=0)).T.fillna(0)
# Cosine similarity of the books
sim_matrix = cosine_similarity(user_book_rating_mean)
books_sim = pd.DataFrame(sim_matrix, columns=user_book_rating_mean.index, index=user_book_rating_mean.index)

In [22]:
for index, value in enumerate(test.user_id):
    userid_query = value
    workid_query = test.work_id[index]
    result = item_collaborative_filtering_cosine(user_book_rating, books_sim, userid_query, workid_query)
    test.loc[[index], ["predicted_rating_item"]] = result
    clear_output(wait=True)
    print(f'progress: {index+1}/{len(test.user_id)}')

progress: 36955/36955


In [24]:
test_coverage = coverage(test, "predicted_rating_item")
test_coverage

The total observations: 36955
Unable to predict:(33848, 5)
Able to predict:(3107, 5)

The coverage for the recommender system is


8.41

In [25]:
rmse = ratings_rmse(test, "rating", "predicted_rating_item")
rmse

1.2662865840090682

**Analysis: Item-based collaborative filtering has an RMSE of 1.2663 with a coverage of 8.41%.** 

The item-based collaborative filtering has an RMSE lower than the baseline model (1.9265) and slightly lower than user-based collaborative filtering (1.2663 vs. 1.2721). A possible explanation is that item-item similarities are more stable than user-user similarities, which vary with the presence of similar users and with each user's preferences. However, coverage is still very low at 8.41%, possibly because no similar items can be found among the books the user has rated, or because the book in query has no similar items.
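
As a minimal sketch of the item-based idea, the predictor below averages the user's own ratings of similar books, weighted by book-to-book cosine similarity. It is not the `item_collaborative_filtering_cosine` implementation from `book_recommender`, which is not shown here. The function `predict_item_based`, its `k` parameter, the positive-similarity filter, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def predict_item_based(user_book_rating, books_sim, user_id, work_id, k=10):
    """Hypothetical helper: predict one rating from the k most similar books."""
    if user_id not in user_book_rating.index or work_id not in books_sim.index:
        return np.nan
    # Books the user has already rated, excluding the queried one
    rated = user_book_rating.loc[user_id].dropna().drop(work_id, errors="ignore")
    # Keep only positively similar books, at most k of them
    sims = books_sim.loc[work_id, rated.index]
    sims = sims[sims > 0].nlargest(k)
    if sims.empty:
        # None of the user's books is similar to the queried book
        return np.nan
    return float((sims * rated.loc[sims.index]).sum() / sims.sum())


# Toy interactions (assumed data); book 40 is rated by a single user
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
    "work_id": [10, 20, 10, 20, 30, 10, 20, 30, 30, 40],
    "rating":  [5, 4, 2, 1, 2, 4, 5, 5, 3, 4],
})
user_book_rating = pd.pivot_table(toy, index="user_id", columns="work_id", values="rating")
# Mean-centre per book, then put books on the rows for the similarity step
centred = (user_book_rating - user_book_rating.mean(axis=0)).T.fillna(0)
books_sim = pd.DataFrame(
    cosine_similarity(centred), index=centred.index, columns=centred.index
)

# User 1 rated books 10 and 20, both positively similar to book 30
pred_found = predict_item_based(user_book_rating, books_sim, 1, 30)
# User 5 rated only book 40, whose centred vector is all zeros
pred_missing = predict_item_based(user_book_rating, books_sim, 5, 10)
print(pred_found, pred_missing)
```

The second query shows one way sparsity can hurt coverage in this sketch: a book rated by a single user becomes a zero vector after mean-centring, so it has zero similarity to every other book and cannot support a prediction.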