## **Book Recommendation System**

### Problem Description
- An online book platform (Goodreads) was losing potential revenue because of a 20% decrease in user activity.
- After conducting research, the Goodreads team found that users felt confused and lost when choosing what to read from a catalog of about 10,000 books. Users also reported declining interest in reading because they no longer found books that matched their interests.

### Business Objective
- Improve user experience and interest in reading while using the platform by solving the problem of confusion when choosing books.

### Solution
Create book recommendations to help users choose books easily and eliminate user difficulties in using the Goodreads platform.

Two recommendation approaches are used:
1. Non-personalized: popularity-based recommendation
2. Personalized: collaborative filtering

Personalized recommender systems can be grouped by the kind of interaction data they use (implicit or explicit):
1. Implicit data is taken from indirect user behavior such as book clicks, time spent scrolling through book pages, purchases, or adding books to the reading list.
2. Explicit data is derived from direct user assessments such as book ratings, book reviews, or other feedback on specific books.

In this problem, users interact directly with books through ratings, which is explicit data, so the approach taken is *collaborative filtering*.

### Data Description
- The data comes from [Goodbooks dataset](https://github.com/zygmuntz/goodbooks-10k).
- The dataset contains 10,000 books and 5,976,479 ratings.

Two files are used:


**Book rating data**: `ratings.csv`

<center>

|Feature|Description|Data Type|
|:--|:--|:--:|
|`user_id`|User ID|`int`|
|`book_id`|Book ID|`int`|
|`rating`|The rating of the book given by the user. Ratings range from `1` to `5`|`int`|

</center>

**Books data** : `books.csv`

<center>

|Feature|Description|Data Type|
|:--|:--|:--:|
|`book_id`|Book ID|`int`|
|`goodreads_book_id`|The goodreads book ID|`int`|
|`best_book_id`|Goodreads ID of the most popular edition of the work|`int`|
|`work_id`|Work ID|`int`|
|`books_count`|Number of editions of the work|`int`|
|`isbn`|International standard book number|`object`|
|`isbn13`|Book identification number (new version of ISBN)|`float`|
|`authors`|The authors of the book|`object`|
|`original_publication_year`|The year of publication|`float`|
|`original_title`|Original title|`object`|
|`title`|Book title|`object`|
|`language_code`|Code of language|`object`|
|`average_rating`|Average rating|`float`|
|`ratings_count`|Rating count|`int`|
|`work_ratings_count`|Work ratings count|`int`|
|`work_text_reviews_count`|Work text reviews count|`int`|
|`ratings_1`|Number of 1-star ratings|`int`|
|`ratings_2`|Number of 2-star ratings|`int`|
|`ratings_3`|Number of 3-star ratings|`int`|
|`ratings_4`|Number of 4-star ratings|`int`|
|`ratings_5`|Number of 5-star ratings|`int`|
|`image_url`|Image link|`object`|
|`small_image_url`|Small image links|`object`|

</center>

### **Import Data**

In [1]:
#load library
import numpy as np
import pandas as pd

In [2]:
#load data from path
rating_path = 'data/ratings.csv'
book_path = 'data/books.csv'

In [3]:
#reads the CSV file data and saves it as a DataFrame
rating_data = pd.read_csv(rating_path, delimiter=',')
book_data = pd.read_csv(book_path, delimiter=',')

In [4]:
#show rating_data
rating_data.head()

| |user_id|book_id|rating|
|:--|:--|:--|:--|
|0|1|258|5|
|1|2|4081|4|
|2|2|260|5|
|3|2|9296|5|
|4|2|2318|3|


In [5]:
#show book_data
book_data.head()

| |book_id|goodreads_book_id|best_book_id|work_id|books_count|isbn|isbn13|authors|original_publication_year|original_title|...|ratings_count|work_ratings_count|work_text_reviews_count|ratings_1|ratings_2|ratings_3|ratings_4|ratings_5|image_url|small_image_url|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|1|2767052|2767052|2792775|272|439023483|9780439000000.0|Suzanne Collins|2008.0|The Hunger Games|...|4780653|4942365|155254|66715|127936|560092|1481305|2706317|https://images.gr-assets.com/books/1447303603m...|https://images.gr-assets.com/books/1447303603s...|
|1|2|3|3|4640799|491|439554934|9780440000000.0|J.K. Rowling, Mary GrandPré|1997.0|Harry Potter and the Philosopher's Stone|...|4602479|4800065|75867|75504|101676|455024|1156318|3011543|https://images.gr-assets.com/books/1474154022m...|https://images.gr-assets.com/books/1474154022s...|
|2|3|41865|41865|3212258|226|316015849|9780316000000.0|Stephenie Meyer|2005.0|Twilight|...|3866839|3916824|95009|456191|436802|793319|875073|1355439|https://images.gr-assets.com/books/1361039443m...|https://images.gr-assets.com/books/1361039443s...|
|3|4|2657|2657|3275794|487|61120081|9780061000000.0|Harper Lee|1960.0|To Kill a Mockingbird|...|3198671|3340896|72586|60427|117415|446835|1001952|1714267|https://images.gr-assets.com/books/1361975680m...|https://images.gr-assets.com/books/1361975680s...|
|4|5|4671|4671|245494|1356|743273567|9780743000000.0|F. Scott Fitzgerald|1925.0|The Great Gatsby|...|2683664|2773745|51992|86236|197621|606158|936012|947718|https://images.gr-assets.com/books/1490528560m...|https://images.gr-assets.com/books/1490528560s...|


### **Check data and handle duplicates**

In [6]:
#show the dimensions of rating_data
rating_data.shape

(5976479, 3)

In [7]:
#show the datatype of rating_data
rating_data.dtypes

user_id    int64
book_id    int64
rating     int64
dtype: object

In [8]:
#check the total number of null values in rating_data
rating_data.isnull().sum()

user_id    0
book_id    0
rating     0
dtype: int64

In [9]:
#check duplicate in rating_data
rating_data.duplicated(subset=['user_id','book_id']).sum()

0

**rating_data** has the correct data types and features. It contains no null values and no duplicated (`user_id`, `book_id`) pairs.

In [10]:
#show the dimensions of book_data
book_data.shape

(10000, 23)

In [11]:
#show columns of book_data
book_data.columns

Index(['book_id', 'goodreads_book_id', 'best_book_id', 'work_id',
       'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year',
       'original_title', 'title', 'language_code', 'average_rating',
       'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
       'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
       'image_url', 'small_image_url'],
      dtype='object')

In [12]:
#copy book_data and drop the features that are not needed
book_copy = book_data.copy()
book_copy = book_copy.drop(columns=['goodreads_book_id','best_book_id','work_id','books_count','isbn',
       'isbn13','title','language_code','average_rating',
       'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
       'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
       'small_image_url'])
book_copy.head(3)

| |book_id|authors|original_publication_year|original_title|image_url|
|:--|:--|:--|:--|:--|:--|
|0|1|Suzanne Collins|2008.0|The Hunger Games|https://images.gr-assets.com/books/1447303603m...|
|1|2|J.K. Rowling, Mary GrandPré|1997.0|Harry Potter and the Philosopher's Stone|https://images.gr-assets.com/books/1474154022m...|
|2|3|Stephenie Meyer|2005.0|Twilight|https://images.gr-assets.com/books/1361039443m...|


In [13]:
#show the datatype of book_data
book_copy.dtypes

book_id                        int64
authors                       object
original_publication_year    float64
original_title                object
image_url                     object
dtype: object

In [14]:
#check the total number of null values in book_data
book_copy.isnull().sum()

book_id                        0
authors                        0
original_publication_year     21
original_title               585
image_url                      0
dtype: int64

In [15]:
#remove null values in book_data
book_copy = book_copy.dropna(axis=0)
book_copy.isnull().sum()

book_id                      0
authors                      0
original_publication_year    0
original_title               0
image_url                    0
dtype: int64

In [16]:
#convert the original_publication_year column to int
#(assign the column directly; writing through .loc[:, col] can keep the old float dtype in recent pandas versions)
book_copy['original_publication_year'] = book_copy['original_publication_year'].astype(int)
book_copy.dtypes

book_id                       int64
authors                      object
original_publication_year     int32
original_title               object
image_url                    object
dtype: object

In [17]:
#show book_data
book_copy.head(3)

| |book_id|authors|original_publication_year|original_title|image_url|
|:--|:--|:--|:--|:--|:--|
|0|1|Suzanne Collins|2008|The Hunger Games|https://images.gr-assets.com/books/1447303603m...|
|1|2|J.K. Rowling, Mary GrandPré|1997|Harry Potter and the Philosopher's Stone|https://images.gr-assets.com/books/1474154022m...|
|2|3|Stephenie Meyer|2005|Twilight|https://images.gr-assets.com/books/1361039443m...|


In [18]:
#check duplicate data
book_copy.duplicated().sum()

0

In [19]:
#show the dimensions of book_data
book_copy.shape

(9409, 5)

**book_copy** now has the required features. The data type of `original_publication_year` has been corrected, rows with null values have been removed, and there are no duplicated rows.

### **Non-personalized: popularity-based recommendation**

In [20]:
#count the number of ratings given to each book and store the result in a new df called 'rating_count'
rating_count = rating_data.groupby('book_id')['rating'].count().reset_index()
rating_count = rating_count.rename(columns={'rating':'rating_count'})
rating_count

| |book_id|rating_count|
|:--|:--|:--|
|0|1|22806|
|1|2|21850|
|2|3|16931|
|3|4|19088|
|4|5|16604|
|...|...|...|
|9995|9996|141|
|9996|9997|93|
|9997|9998|102|
|9998|9999|130|


In [21]:
#compute the mean rating of each book and store the result in a new df called 'mean_rating'
#(select 'rating' before aggregating so that user_id is not averaged as well)
mean_rating = rating_data.groupby('book_id')['rating'].mean().round(2).reset_index()
mean_rating = mean_rating.rename(columns={'rating':'mean_rating'})
mean_rating

| |book_id|mean_rating|
|:--|:--|:--|
|0|1|4.28|
|1|2|4.35|
|2|3|3.21|
|3|4|4.33|
|4|5|3.77|
|...|...|...|
|9995|9996|4.01|
|9996|9997|4.45|
|9997|9998|4.32|
|9998|9999|3.71|


In [22]:
#merge 'rating_count' dataframe with 'mean_rating' dataframe based on 'book_id' column
popular = rating_count.merge(mean_rating, on='book_id')
popular

| |book_id|rating_count|mean_rating|
|:--|:--|:--|:--|
|0|1|22806|4.28|
|1|2|21850|4.35|
|2|3|16931|3.21|
|3|4|19088|4.33|
|4|5|16604|3.77|
|...|...|...|...|
|9995|9996|141|4.01|
|9996|9997|93|4.45|
|9997|9998|102|4.32|
|9998|9999|130|3.71|
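
The count and the mean computed in the three cells above can also be produced in a single pass with pandas named aggregation, which removes the need for the intermediate merge. The following is a minimal sketch on a toy frame; `toy_ratings` and `toy_popular` are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy ratings, invented for illustration only
toy_ratings = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 2],
    'book_id': [10, 10, 10, 20, 20],
    'rating':  [5, 4, 3, 2, 4],
})

# One groupby yields both statistics, so no merge is needed afterwards
toy_popular = (
    toy_ratings.groupby('book_id')['rating']
    .agg(rating_count='count', mean_rating='mean')
    .round(2)
    .reset_index()
)
print(toy_popular)
```

Applied to `rating_data`, the same pattern should give the same columns as the merged `popular` frame, although that has not been run here.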


In [23]:
#merge df 'popular' with df 'book_copy' based on column 'book_id' then select specific columns and remove duplicate rows based on 'book_id'
popular = popular.merge(book_copy, on="book_id").drop_duplicates("book_id")[["book_id","rating_count","mean_rating","authors","original_publication_year","original_title","image_url"]]
popular

| |book_id|rating_count|mean_rating|authors|original_publication_year|original_title|image_url|
|:--|:--|:--|:--|:--|:--|:--|:--|
|0|1|22806|4.28|Suzanne Collins|2008|The Hunger Games|https://images.gr-assets.com/books/1447303603m...|
|1|2|21850|4.35|J.K. Rowling, Mary GrandPré|1997|Harry Potter and the Philosopher's Stone|https://images.gr-assets.com/books/1474154022m...|
|2|3|16931|3.21|Stephenie Meyer|2005|Twilight|https://images.gr-assets.com/books/1361039443m...|
|3|4|19088|4.33|Harper Lee|1960|To Kill a Mockingbird|https://images.gr-assets.com/books/1361975680m...|
|4|5|16604|3.77|F. Scott Fitzgerald|1925|The Great Gatsby|https://images.gr-assets.com/books/1490528560m...|
|...|...|...|...|...|...|...|...|
|9404|9996|141|4.01|Ilona Andrews|2010|Bayou Moon|https://images.gr-assets.com/books/1307445460m...|
|9405|9997|93|4.45|Robert A. Caro|1990|Means of Ascent|https://s.gr-assets.com/assets/nophoto/book/11...|
|9406|9998|102|4.32|Patrick O'Brian|1977|The Mauritius Command|https://images.gr-assets.com/books/1455373531m...|
|9407|9999|130|3.71|Peggy Orenstein|2011|Cinderella Ate My Daughter: Dispatches from th...|https://images.gr-assets.com/books/1279214118m...|


In [24]:
#show the 10 books with the most ratings, from largest to smallest rating_count
popular.sort_values("rating_count", ascending=False).head(10)

| |book_id|rating_count|mean_rating|authors|original_publication_year|original_title|image_url|
|:--|:--|:--|:--|:--|:--|:--|:--|
|0|1|22806|4.28|Suzanne Collins|2008|The Hunger Games|https://images.gr-assets.com/books/1447303603m...|
|1|2|21850|4.35|J.K. Rowling, Mary GrandPré|1997|Harry Potter and the Philosopher's Stone|https://images.gr-assets.com/books/1474154022m...|
|3|4|19088|4.33|Harper Lee|1960|To Kill a Mockingbird|https://images.gr-assets.com/books/1361975680m...|
|2|3|16931|3.21|Stephenie Meyer|2005|Twilight|https://images.gr-assets.com/books/1361039443m...|
|4|5|16604|3.77|F. Scott Fitzgerald|1925|The Great Gatsby|https://images.gr-assets.com/books/1490528560m...|
|16|17|16549|4.13|Suzanne Collins|2009|Catching Fire|https://images.gr-assets.com/books/1358273780m...|
|19|20|15953|3.85|Suzanne Collins|2010|Mockingjay|https://images.gr-assets.com/books/1358275419m...|
|17|18|15855|4.42|J.K. Rowling, Mary GrandPré, Rufus Beck|1999|Harry Potter and the Prisoner of Azkaban|https://images.gr-assets.com/books/1499277281m...|
|22|23|15657|4.23|J.K. Rowling, Mary GrandPré|1998|Harry Potter and the Chamber of Secrets|https://images.gr-assets.com/books/1474169725m...|
|6|7|15558|4.15|J.R.R. Tolkien|1937|The Hobbit or There and Back Again|https://images.gr-assets.com/books/1372847500m...|


In [25]:
#show the dimensions of popular
popular.shape

(9409, 7)

### **Personalized recommender system**

In [26]:
#copy rating_data and make pivot to check total 'user_id' and 'book_id'
rating_copy = rating_data.copy()
user_rating_pivot = rating_copy.pivot(index='user_id',columns='book_id',values='rating')
user_rating_pivot

|user_id \ book_id|1|2|3|4|5|6|7|8|9|10|...|9991|9992|9993|9994|9995|9996|9997|9998|9999|10000|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|1|NaN|NaN|NaN|5.0|NaN|NaN|NaN|NaN|NaN|4.0|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|2|NaN|5.0|NaN|NaN|5.0|NaN|NaN|4.0|NaN|5.0|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|3|NaN|NaN|NaN|3.0|NaN|NaN|NaN|NaN|NaN|NaN|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|4|NaN|5.0|NaN|4.0|4.0|NaN|4.0|4.0|NaN|5.0|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|5|NaN|NaN|NaN|NaN|NaN|4.0|NaN|NaN|NaN|NaN|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|53420|4.0|5.0|3.0|NaN|2.0|NaN|NaN|NaN|4.0|3.0|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|53421|4.0|5.0|NaN|5.0|4.0|NaN|4.0|NaN|5.0|NaN|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|53422|4.0|5.0|NaN|NaN|NaN|NaN|5.0|NaN|NaN|5.0|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|53423|4.0|5.0|NaN|5.0|NaN|NaN|5.0|4.0|NaN|NaN|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|


There are more users than items (books), so *User-to-User Collaborative Filtering (User CF)* is used for the personalized recommender system.
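
To make the idea behind User CF concrete, the sketch below predicts a rating as the similarity-weighted average of the ratings given by other users. It is a simplified illustration in plain Python, not the `KNNBaseline` algorithm used later; the `toy` ratings and both helper functions are invented for this example.

```python
from math import sqrt

# Toy explicit ratings: user -> {book: rating}; values are invented for illustration
toy = {
    'u1': {'a': 5, 'b': 4},
    'u2': {'a': 5, 'b': 4, 'c': 5},
    'u3': {'a': 1, 'b': 5, 'c': 2},
}

def cosine_sim(r1, r2):
    """Cosine similarity computed over the books both users rated."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[b] * r2[b] for b in common)
    norm1 = sqrt(sum(r1[b] ** 2 for b in common))
    norm2 = sqrt(sum(r2[b] ** 2 for b in common))
    return dot / (norm1 * norm2)

def predict_user_cf(ratings, user, book):
    """Similarity-weighted average of the neighbours' ratings for `book`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or book not in other_ratings:
            continue
        sim = cosine_sim(ratings[user], other_ratings)
        num += sim * other_ratings[book]
        den += sim
    return num / den if den else None

pred = predict_user_cf(toy, 'u1', 'c')
print(round(pred, 2))
```

Because `u1` agrees with `u2` on every co-rated book, the prediction for book `c` lands closer to the rating given by `u2` than to the one given by `u3`.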

#### Train Model

In [27]:
#load library
import surprise
from surprise import accuracy, Dataset, Reader, KNNBaseline, BaselineOnly
from surprise.model_selection.search import RandomizedSearchCV
from surprise.model_selection import cross_validate, train_test_split

In [28]:
#Initialize a Reader object in the Surprise library to read rating data on a scale of 1-5
reader = Reader(rating_scale = (1, 5))

In [29]:
#reads the rating data and converts it into a format that can be used to load the recommendation dataset from df 'rating_data'
dataset = Dataset.load_from_df(rating_data[['user_id', 'book_id', 'rating']].copy(), reader)
dataset

<surprise.dataset.DatasetAutoFolds at 0x1f44d3134c0>

In [30]:
#show data
dataset.df

| |user_id|book_id|rating|
|:--|:--|:--|:--|
|0|1|258|5|
|1|2|4081|4|
|2|2|260|5|
|3|2|9296|5|
|4|2|2318|3|
|...|...|...|...|
|5976474|49925|510|5|
|5976475|49925|528|4|
|5976476|49925|722|4|
|5976477|49925|949|5|


#### Split Train-Test

In [31]:
#split dataset into training data and test data
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

In [32]:
#validate splitting
train_data.n_ratings, len(test_data)

(4781183, 1195296)
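
The two sizes above can be checked by hand. The sketch below assumes that the test share is rounded up to a whole number of ratings and that the remainder goes to training; that rounding rule is an assumption about the library, but it reproduces the reported split.

```python
from math import ceil

n_ratings = 5976479   # total ratings reported by rating_data.shape
test_size = 0.2       # same fraction passed to train_test_split

# Assumption: the test share is rounded up and the remainder goes to training
n_test = ceil(test_size * n_ratings)
n_train = n_ratings - n_test
print(n_train, n_test)  # 4781183 1195296
```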

#### Create baseline model

`BaselineOnly` predicts the baseline estimate for each rating: the global mean rating plus a user bias and an item bias.
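
As an illustration of what a baseline estimate looks like, the sketch below computes "global mean + user bias + item bias" on toy data. It is a simplification: the biases here are plain average deviations from the global mean, whereas Surprise fits them with a regularized procedure (ALS by default). The `toy` data and helper names are invented for this example.

```python
# Toy ratings as (user, book, rating); invented for illustration
toy = [('u1', 'a', 5), ('u1', 'b', 3), ('u2', 'a', 4), ('u2', 'b', 2), ('u3', 'a', 3)]

mu = sum(r for _, _, r in toy) / len(toy)  # global mean rating

def mean_offset(rows, key_index):
    """Average deviation from the global mean, grouped by user (0) or book (1)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key_index], []).append(row[2] - mu)
    return {key: sum(devs) / len(devs) for key, devs in groups.items()}

b_user = mean_offset(toy, 0)  # simplification: plain averages, no regularization
b_item = mean_offset(toy, 1)

def baseline(user, book):
    """mu + b_u + b_i, with a bias of 0 for unseen users or books."""
    return mu + b_user.get(user, 0.0) + b_item.get(book, 0.0)

print(round(baseline('u3', 'b'), 2))
```

User `u3` rates below average and book `b` is rated below average, so the estimate for that pair falls below the global mean even though `u3` never rated `b`.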

In [33]:
#initialize
model_baseline = BaselineOnly()
model_baseline

<surprise.prediction_algorithms.baseline_only.BaselineOnly at 0x1f44cea9430>

In [34]:
#perform cross-validation on the initialized recommendation model using the 'BaselineOnly'
cv_baseline = cross_validate(algo=model_baseline, data=dataset, cv=5,measures=['rmse'])

Estimating biases using als...


KeyboardInterrupt: 

In [None]:
#cv result
cv_baseline_rmse = cv_baseline['test_rmse'].mean()
cv_baseline_rmse

#### Hyperparameter candidate

In [None]:
#parameter candidates for the randomized hyperparameter search
#of the KNNBaseline recommendation model
#(user_based must be the boolean True, not the string 'True')
param_dist = {'k': list(np.arange(start=5, stop=200)),
              'sim_options': {'name': ['cosine', 'pearson_baseline'], 'user_based': [True]},
              'min_k': [1, 2, 3]}

In [None]:
#randomized search for hyperparameters in the recommendation model with the KNNBaseline method
knn_search = RandomizedSearchCV(algo_class=KNNBaseline, param_distributions = param_dist, cv=5)
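
A randomized search evaluates only a sample of the possible parameter combinations instead of the full grid. The sketch below uses the standard library to show how small that sample is relative to the grid defined by the candidates above; the value `n_iter = 10` and the seed are assumptions made for illustration.

```python
import random

# Same search space as the candidates above, written as plain lists
space = {
    'k': list(range(5, 200)),
    'name': ['cosine', 'pearson_baseline'],
    'min_k': [1, 2, 3],
}

# Size of the exhaustive grid: product of the number of values per parameter
grid_size = 1
for values in space.values():
    grid_size *= len(values)

rng = random.Random(42)  # fixed seed so the draws are repeatable
n_iter = 10              # assumed number of sampled candidates

# Draw one value per parameter, independently, for each candidate
candidates = [{param: rng.choice(values) for param, values in space.items()}
              for _ in range(n_iter)]

print(grid_size, len(candidates))
```

With 5-fold cross-validation, each sampled candidate still costs five model fits, which matters for a neighborhood model on several million ratings.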

In [None]:
#process search hyperparams
knn_search.fit(data=dataset)

In [None]:
#summarize performance
summary_df = pd.DataFrame({'Model': ['Baseline', 'Neighborhood Collaborative Filtering'],
                           'CV Performance - RMSE': [cv_baseline_rmse,knn_search.best_score['rmse']],
                           'Model Configuration':['N/A',f'{knn_search.best_params["rmse"]}']})

summary_df

In [None]:
#best hyperparams combination
knn_search.best_params["rmse"]

In [None]:
#store the best hyperparameters
best_params = knn_search.best_params['rmse']

In [None]:
#create obj. and retrain whole train data
model_best = KNNBaseline(**best_params)
model_best.fit(train_data)

In [None]:
#predict test data using best model
test_pred = model_best.test(test_data)
test_rmse = accuracy.rmse(test_pred)
test_rmse
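
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it by hand on invented values; the `rmse` helper and the toy lists are illustrative and independent of the Surprise `accuracy` module.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError('inputs must be non-empty and of equal length')
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy values, invented for illustration
toy_actual = [5, 3, 4, 1]
toy_predicted = [4, 3, 5, 3]
print(rmse(toy_actual, toy_predicted))
```

Squaring the errors means that the single miss by 2 stars contributes as much as four misses by 1 star, so RMSE penalizes large errors more heavily than a mean absolute error would.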

In [None]:
#summarize RMSE from tuning and from the test set
summary_test_df = pd.DataFrame({'Model' : ['User to User CF'],
                                'RMSE-Tuning': [knn_search.best_score['rmse']],
                                'RMSE-Test': [test_rmse]})

summary_test_df

#### Prediction

In [None]:
#predict user_id = 2 and book_id = 4
sample_prediction = model_best.predict(uid = 2,
                                      iid = 4)

In [None]:
sample_prediction

Find all books that have not been rated by user_id = 2

In [None]:
#get unique book_id
unique_book_id = set(rating_data['book_id'])
print(unique_book_id)

In [None]:
#get book_id that is rated by user_id = 2
rated_book_id = set(rating_data.loc[rating_data['user_id']==2, 'book_id'])
print(rated_book_id)

In [None]:
#find unrated book_id
unrated_book_id = unique_book_id.difference(rated_book_id)
print(unrated_book_id)

In [None]:
#create predict from unrated book
predicted_unrated_book = {
    'user_id': 2,
    'book_id': [],
    'predicted_rating': []
}

predicted_unrated_book

In [None]:
#loop over all unrated books
for book_id in unrated_book_id:
    #predict the rating user_id = 2 would give this book
    pred = model_best.predict(uid = predicted_unrated_book['user_id'],
                              iid = book_id)
    #store the book_id and its predicted rating
    predicted_unrated_book['book_id'].append(book_id)
    predicted_unrated_book['predicted_rating'].append(pred.est)

In [None]:
#convert to df
predicted_unrated_book = pd.DataFrame(predicted_unrated_book)
predicted_unrated_book

In [None]:
#sort predicted rating
predicted_unrated_book = predicted_unrated_book.sort_values('predicted_rating',
                                                              ascending = False)
predicted_unrated_book

In [None]:
#initialize book data
new_book_data = book_copy
new_book_data.head()

In [None]:
#top k biggest rating
k = 5
top_book = predicted_unrated_book.head(k).copy()
top_book

In [None]:
#add book details by joining on book_id
#(new_book_data keeps a default integer index, so .loc[book_id] would select the wrong rows or fail)
#a left join keeps every recommended book; details are NaN for books removed by dropna
top_book = top_book.merge(new_book_data[['book_id', 'authors', 'original_publication_year',
                                         'original_title', 'image_url']],
                          on='book_id', how='left')

top_book
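
The prediction steps above (collect unrated books, score each one, sort, keep the top k) can be wrapped in a single helper. The sketch below is one possible shape for it; `recommend_top_k` and the stand-in predictor are hypothetical names introduced for illustration, and with a fitted Surprise model the `predict` argument could be something like `lambda u, b: model_best.predict(u, b).est`.

```python
def recommend_top_k(predict, user_id, all_book_ids, rated_book_ids, k=5):
    """Return the k unrated books with the highest predicted rating.

    `predict` is any callable (user_id, book_id) -> float.
    """
    unrated = set(all_book_ids) - set(rated_book_ids)
    scored = [(book_id, predict(user_id, book_id)) for book_id in unrated]
    # Sort by predicted rating (descending), then by book_id so ties are stable
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]

# Stand-in predictor and scores, invented for illustration
fake_scores = {10: 4.2, 20: 3.1, 30: 4.8, 40: 2.0}
top = recommend_top_k(lambda u, b: fake_scores[b], user_id=2,
                      all_book_ids=fake_scores, rated_book_ids=[20], k=2)
print(top)  # [(30, 4.8), (10, 4.2)]
```

Keeping the predictor as a parameter lets the same helper serve any user and any model, including the popularity-based ranking as a fallback.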