## **Book Recommendation System**

### Data Description
- The data comes from [Goodbooks dataset](https://github.com/zygmuntz/goodbooks-10k).
- The dataset contains 10,000 books and 5,976,479 ratings.

There are 2 files that will be used:


**Book rating data**: `ratings.csv`

<center>

|Feature|Description|Data Type|
|:--|:--|:--:|
|`user_id`|User ID|`int`|
|`book_id`|BookID|`int`|
|`rating`|The rating of the book given by the user. Rating starts from `0` to `5`|`int`|

|**Books data** : `books.csv`

<center>

|Feature|Description|Data Type|
|:--|:--|:--:|
|`book_id`|Book ID|`int`|
|`goodreads_book_id`|The goodreads book ID|`int`|
|`best_book_id`|Rating of the book given by the user. Rating starts from `0` to `5`|`int`|
|`work_id`|Work ID|`int`|
|`books_count`|Books count|`int`|
|`isbn`|International standard book number|`object`|
|`isbn13`|Book identification number (new version of ISBN)|`float`|
|`authors`|The authors of the book|`object`|
|`original_publication_year`|The year of publication|`float`|
|`original_title`|Original title|`object`|
|`title`|Book title|`object`|
|`language_code`|Code of language|`object`|
|`average_rating`|Average rating|`float`|
|`ratings_count`|Rating count|`int`|
|`work_ratings_count`|Work ratings count|`int`|
|`work_text_reviews_count`|Work text reviews count|`int`|
|`ratings_1`|Rating 1|`int`|
|`ratings_2`|Rating 2|`int`|
|`ratings_3`|Rating 3|`int`|
|`ratings_4`|Rating 4|`int`|
|`ratings_5`|Rating 5|`int`|
|`image_url`|Image link|`object`|
|`small_image_url`|Small image links|`object`|

### **Import Data**

In [1]:
#load library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#load data from path
rating_path = 'data/ratings.csv'
book_path = 'data/books.csv'

In [None]:
def load_data(book_path, rating_path):
    #reads the CSV file data and saves it as a DataFrame
    rating_data = pd.read_csv(rating_path, delimiter=',')
    book_data = pd.read_csv(book_path, delimiter=',')
    
    #copy dataframe book_data, and delete some feature.
    book_copy = book_data.copy()
    book_copy = book_copy.drop(columns=['best_book_id','work_id','books_count','isbn',
           'isbn13','title','language_code','average_rating',
           'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
           'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
           'small_image_url'], axis=1)
    book_copy.head(3)
    
    #fill null values in book_data
    print("Missing values before fillna: ", book_copy.isnull().sum())
    book_copy['original_publication_year'] = book_copy['original_publication_year'].fillna(0)
    book_copy['original_title'] = book_copy['original_title'].fillna(book_data['title'])
    print("Missing values after fillna: ", book_copy.isnull().sum())
    
    #changes the data type original_publication_year column to int data type
    book_copy.loc[:, 'original_publication_year'] = book_copy['original_publication_year'].astype(int)
    book_copy.dtypes
    
    ## b. drop duplicated rows
    print("Books shape before drop dup: ", book_copy.shape)
    print("Ratings shape before drop dup: ", rating_data.shape)
    book_copy.drop_duplicates(subset = ['book_id', 'goodreads_book_id'], inplace = True)
    rating_data.drop_duplicates(subset=['user_id','book_id'], inplace = True)
    print("Books shape after drop dup: ", book_copy.shape)
    print("Ratings shape after drop dup: ", rating_data.shape)

In [3]:
#reads the CSV file data and saves it as a DataFrame
rating_data = pd.read_csv(rating_path, delimiter=',')
book_data = pd.read_csv(book_path, delimiter=',')

In [4]:
#show rating_data
rating_data.head()

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


In [5]:
#show book_data
book_data.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


### **Check data and handle duplicated**

In [6]:
#show the dimensions of rating_data
rating_data.shape

(5976479, 3)

In [7]:
#show the datatype of rating_data
rating_data.dtypes

user_id    int64
book_id    int64
rating     int64
dtype: object

In [8]:
#check the total number of null values in rating_data
rating_data.isnull().sum()

user_id    0
book_id    0
rating     0
dtype: int64

In [9]:
#check duplicate in rating_data
rating_data.duplicated(subset=['user_id','book_id']).sum()

0

**rating_data** has the correct type and feature. There is no null data and duplicated in rating_data.

In [10]:
#show the dimensions of book_data
book_data.shape

(10000, 23)

In [11]:
#show columns of book_data
book_data.columns

Index(['book_id', 'goodreads_book_id', 'best_book_id', 'work_id',
       'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year',
       'original_title', 'title', 'language_code', 'average_rating',
       'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
       'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
       'image_url', 'small_image_url'],
      dtype='object')

In [12]:
#copy dataframe book_data, and delete some feature.
book_copy = book_data.copy()
book_copy = book_copy.drop(columns=['goodreads_book_id','best_book_id','work_id','books_count','isbn',
       'isbn13','title','language_code','average_rating',
       'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
       'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
       'small_image_url'], axis=1)
book_copy.head(3)

Unnamed: 0,book_id,authors,original_publication_year,original_title,image_url
0,1,Suzanne Collins,2008.0,The Hunger Games,https://images.gr-assets.com/books/1447303603m...
1,2,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,https://images.gr-assets.com/books/1474154022m...
2,3,Stephenie Meyer,2005.0,Twilight,https://images.gr-assets.com/books/1361039443m...


In [13]:
#show the datatype of book_data
book_copy.dtypes

book_id                        int64
authors                       object
original_publication_year    float64
original_title                object
image_url                     object
dtype: object

In [14]:
#check the total number of null values in book_data
book_copy.isnull().sum()

book_id                        0
authors                        0
original_publication_year     21
original_title               585
image_url                      0
dtype: int64

In [15]:
#fill null values in book_data
book_copy['original_publication_year'] = book_copy['original_publication_year'].fillna(0)
book_copy['original_title'] = book_copy['original_title'].fillna(book_data['title'])
book_copy.isnull().sum()

book_id                      0
authors                      0
original_publication_year    0
original_title               0
image_url                    0
dtype: int64

In [16]:
#changes the data type original_publication_year column to int data type
book_copy.loc[:, 'original_publication_year'] = book_copy['original_publication_year'].astype(int)
book_copy.dtypes

book_id                       int64
authors                      object
original_publication_year     int64
original_title               object
image_url                    object
dtype: object

In [17]:
#show book_data
book_copy.head(3)

Unnamed: 0,book_id,authors,original_publication_year,original_title,image_url
0,1,Suzanne Collins,2008,The Hunger Games,https://images.gr-assets.com/books/1447303603m...
1,2,"J.K. Rowling, Mary GrandPré",1997,Harry Potter and the Philosopher's Stone,https://images.gr-assets.com/books/1474154022m...
2,3,Stephenie Meyer,2005,Twilight,https://images.gr-assets.com/books/1361039443m...


In [18]:
book_copy[book_copy['original_publication_year'] == 0].sum().reset_index()

Unnamed: 0,index,0
0,book_id,125975
1,authors,"Mark Cotta VazRobert Kapilow, Dr. SeussYuu Wat..."
2,original_publication_year,0
3,original_title,Twilight: The Complete Illustrated Movie Compa...
4,image_url,https://images.gr-assets.com/books/1352539022m...


In [19]:
#check duplicate data
book_copy.duplicated().sum()

0

In [20]:
#show the dimensions of book_data
book_copy.shape

(10000, 5)

In [66]:
user_id = rating_data.user_id.unique()
np.sort(user_id)

array([    1,     2,     3, ..., 53422, 53423, 53424])

**book_copy** has the correct feature. The data type in 'original_publication_year' has been corrected. There is no duplicated in book_copy and null data has been removed.

### **Non-personalized: popularity-based recommendation**

In [67]:
#reduce data (make sampling)
book_data_small  = book_copy.drop(book_copy[book_copy['book_id'] > 3000].copy().index)
rating_data_small = rating_data.loc[(rating_data['user_id'] <= 10000) & (rating_data['book_id'] <= 3000)]

In [68]:
book_data_small.shape

(3000, 5)

In [69]:
rating_data_small.shape

(885615, 3)

In [70]:
book_data_small['book_id'].max()

3000

In [71]:
rating_data_small['user_id'].max()

10000

In [72]:
rating_data_small

Unnamed: 0,user_id,book_id,rating
0,1,258,5
2,2,260,5
4,2,2318,3
5,2,26,4
6,2,315,3
...,...,...,...
5976167,7666,164,4
5976173,8094,1467,4
5976174,8094,2833,4
5976197,6262,1045,3


In [73]:
#count the number of ratings given for each book and store the result in a new df called 'rating_count'
rating_count = rating_data_small.groupby('book_id').count()['rating'].reset_index()
rating_count.rename(columns={'rating':'rating_count'}, inplace=True)
rating_count

Unnamed: 0,book_id,rating_count
0,1,3902
1,2,3612
2,3,2746
3,4,3736
4,5,3405
...,...,...
2994,2996,37
2995,2997,34
2996,2998,30
2997,2999,128


In [74]:
#count the mean of ratings given for each book and store the result in a new df called 'mean_rating'
mean_rating = rating_data_small.groupby('book_id').mean().round(2)['rating'].reset_index()
mean_rating.rename(columns={'rating':'mean_rating'}, inplace=True)
mean_rating

Unnamed: 0,book_id,mean_rating
0,1,4.26
1,2,4.11
2,3,3.43
3,4,4.34
4,5,3.72
...,...,...
2994,2996,3.81
2995,2997,3.88
2996,2998,4.13
2997,2999,3.57


In [75]:
#merge 'rating_count' dataframe with 'mean_rating' dataframe based on 'book_id' column
popular = rating_count.merge(mean_rating, on='book_id')
popular

Unnamed: 0,book_id,rating_count,mean_rating
0,1,3902,4.26
1,2,3612,4.11
2,3,2746,3.43
3,4,3736,4.34
4,5,3405,3.72
...,...,...,...
2994,2996,37,3.81
2995,2997,34,3.88
2996,2998,30,4.13
2997,2999,128,3.57


In [76]:
#merge df 'popular' with df 'book_copy' based on column 'book_id' then select specific columns and remove duplicate rows based on 'book_id'
popular = popular.merge(book_data_small, on="book_id").drop_duplicates("book_id")[["book_id","rating_count","mean_rating","authors","original_publication_year","original_title","image_url"]]
popular

Unnamed: 0,book_id,rating_count,mean_rating,authors,original_publication_year,original_title,image_url
0,1,3902,4.26,Suzanne Collins,2008,The Hunger Games,https://images.gr-assets.com/books/1447303603m...
1,2,3612,4.11,"J.K. Rowling, Mary GrandPré",1997,Harry Potter and the Philosopher's Stone,https://images.gr-assets.com/books/1474154022m...
2,3,2746,3.43,Stephenie Meyer,2005,Twilight,https://images.gr-assets.com/books/1361039443m...
3,4,3736,4.34,Harper Lee,1960,To Kill a Mockingbird,https://images.gr-assets.com/books/1361975680m...
4,5,3405,3.72,F. Scott Fitzgerald,1925,The Great Gatsby,https://images.gr-assets.com/books/1490528560m...
...,...,...,...,...,...,...,...
2994,2996,37,3.81,Kelley Armstrong,2011,The Gathering,https://images.gr-assets.com/books/1277820938m...
2995,2997,34,3.88,Richelle Mead,2013,The Fiery Heart,https://images.gr-assets.com/books/1383243238m...
2996,2998,30,4.13,Ilona Andrews,2010,Magic Bleeds,https://images.gr-assets.com/books/1407110429m...
2997,2999,128,3.57,"Isabel Allende, Nick Caistor, Amanda Hopkinson",2015,El amante japonés,https://images.gr-assets.com/books/1501991754m...


In [77]:
#show the order of values from largest to smallest
top_30 = popular.sort_values("rating_count", ascending=False).head(30)
top_30

Unnamed: 0,book_id,rating_count,mean_rating,authors,original_publication_year,original_title,image_url
0,1,3902,4.26,Suzanne Collins,2008,The Hunger Games,https://images.gr-assets.com/books/1447303603m...
3,4,3736,4.34,Harper Lee,1960,To Kill a Mockingbird,https://images.gr-assets.com/books/1361975680m...
1,2,3612,4.11,"J.K. Rowling, Mary GrandPré",1997,Harry Potter and the Philosopher's Stone,https://images.gr-assets.com/books/1474154022m...
4,5,3405,3.72,F. Scott Fitzgerald,1925,The Great Gatsby,https://images.gr-assets.com/books/1490528560m...
25,26,3366,3.36,Dan Brown,2003,The Da Vinci Code,https://images.gr-assets.com/books/1303252999m...
16,17,3342,4.07,Suzanne Collins,2009,Catching Fire,https://images.gr-assets.com/books/1358273780m...
17,18,3243,4.24,"J.K. Rowling, Mary GrandPré, Rufus Beck",1999,Harry Potter and the Prisoner of Azkaban,https://images.gr-assets.com/books/1499277281m...
22,23,3229,3.97,"J.K. Rowling, Mary GrandPré",1998,Harry Potter and the Chamber of Secrets,https://images.gr-assets.com/books/1474169725m...
19,20,3187,3.75,Suzanne Collins,2010,Mockingjay,https://images.gr-assets.com/books/1358275419m...
23,24,3160,4.26,"J.K. Rowling, Mary GrandPré",2000,Harry Potter and the Goblet of Fire,https://images.gr-assets.com/books/1361482611m...


In [78]:
book_copy['image_url'][0]

'https://images.gr-assets.com/books/1447303603m/2767052.jpg'

In [79]:
#show the dimensions of popular
popular.shape

(2999, 7)

In [81]:
import pickle
pickle.dump(top_30, open('output/popular.pkl','wb'))

### **Personalized recommender system**

In [82]:
#make matrix
#copy rating_data and make pivot to check total 'user_id' and 'book_id'
user_rating_pivot = rating_data_small.pivot(index='user_id',columns='book_id',values='rating')

The total number of users is more than the number of items, so for personalized recommender systems, *User-to-User Collaborative Filtering (User CF)* is used.

#### Train Model

In [83]:
#load library
import surprise
from surprise import accuracy, Dataset, Reader, BaselineOnly, KNNBasic, KNNBaseline, SVD, NMF
from surprise.model_selection.search import RandomizedSearchCV
from surprise.model_selection import cross_validate, train_test_split

In [84]:
#Initialize a Reader object in the Surprise library to read rating data on a scale of 1-5
reader = Reader(rating_scale = (1, 5))

In [85]:
#reads the rating data and converts it into a format that can be used to load the recommendation dataset from df 'rating_data'
dataset = Dataset.load_from_df(rating_data_small[['user_id', 'book_id', 'rating']].copy(), reader)
dataset

<surprise.dataset.DatasetAutoFolds at 0x7efda60ed550>

In [86]:
#show data
dataset.df

Unnamed: 0,user_id,book_id,rating
0,1,258,5
2,2,260,5
4,2,2318,3
5,2,26,4
6,2,315,3
...,...,...,...
5976167,7666,164,4
5976173,8094,1467,4
5976174,8094,2833,4
5976197,6262,1045,3


#### Split Train-Test

In [87]:
#split dataset into training data and test data
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

In [88]:
#validate splitting
train_data.n_ratings, len(test_data)

(708492, 177123)

#### Create baseline model

Baselineonly calculates the predicted value based on the baseline (global, user, and item averages)

In [89]:
#initialize
model_baseline = BaselineOnly()
model_baseline

<surprise.prediction_algorithms.baseline_only.BaselineOnly at 0x7efda8906640>

In [90]:
#perform cross-validation on the initialized recommendation model using the 'BaselineOnly'
cv_baseline = cross_validate(algo=model_baseline, data=dataset, cv=5,measures=['rmse'])

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


In [92]:
#cv result
cv_baseline_rmse = cv_baseline['test_rmse'].mean()
cv_baseline_rmse

0.8715796982537904

#### Hyperparameter candidate

In [93]:
#initialization of parameters that will be used in a randomized search
#for hyperparameters in the recommendation model with the KNNBaseline method
param_dist = {'k':list(np.arange(start=20, stop=40, step=5)),
          'sim_options':{'name':['pearson','pearson_baseline','cosine'],'user_based':['True']}, 'min_k': [1, 2, 3]}

In [94]:
#randomized search for hyperparameters in the recommendation model with the KNNBasic method
knn_basic = RandomizedSearchCV(algo_class=KNNBasic, param_distributions = param_dist, cv=5)

In [95]:
knn_basic.fit(data=dataset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similari

In [97]:
#randomized search for hyperparameters in the recommendation model with the KNNBaseline method
knn_search = RandomizedSearchCV(algo_class=KNNBaseline, param_distributions = param_dist, cv=5)

In [None]:
#process search hyperparams
knn_search.fit(data=dataset)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


In [None]:
pickle.dump(knn_basic.best_params["rmse"],open('output/knn_basic.pkl','wb'))
pickle.dump(knn_search.best_params["rmse"],open('output/knn_baseline.pkl','wb'))

In [96]:
params_SVD = {'lr_all' : [1,0.1,0.01,0.001], 'n_factors' : [50,100],
              'reg_all' : [1,0.1,0.01, 0.02]
              }  

In [52]:
svd_search = RandomizedSearchCV(algo_class=SVD, param_distributions = params_SVD, cv=5)
svd_search.fit(data=dataset)

In [53]:
params_NMF = {'n_factors': np.arange(5, 50, 5),
              'n_epochs': np.arange(10, 100, 10)
             }

In [54]:
nmf_search = RandomizedSearchCV(algo_class=NMF, param_distributions = params_NMF, cv=5)
nmf_search.fit(data=dataset)

In [55]:
#summarize performance
summary_df = pd.DataFrame({'Model': ['Baseline', 'KNN Basic','KNN Baseline', 'SVD', 'NMF'],
                           'CV Performance - RMSE': [cv_baseline_rmse,knn_basic.best_score['rmse'],knn_search.best_score['rmse'],svd_search.best_score['rmse'],nmf_search.best_score['rmse']],
                           'Model Condiguration':['N/A',f'{knn_basic.best_params["rmse"]}',f'{knn_search.best_params["rmse"]}',f'{svd_search.best_params["rmse"]}',f'{nmf_search.best_params["rmse"]}']})

summary_df

Unnamed: 0,Model,CV Performance - RMSE,Model Condiguration
0,Baseline,0.875839,
1,KNN Basic,0.902894,"{'k': 30, 'sim_options': {'name': 'pearson_bas..."
2,KNN Baseline,0.8573,"{'k': 20, 'sim_options': {'name': 'pearson_bas..."
3,SVD,0.866337,"{'lr_all': 0.01, 'n_factors': 50, 'reg_all': 0.1}"
4,NMF,0.875994,"{'n_factors': 45, 'n_epochs': 90}"


In [70]:
#best hyperparams combination
knn_search.best_params["rmse"]

{'k': 25,
 'sim_options': {'name': 'pearson_baseline', 'user_based': 'True'},
 'min_k': 2}

In [71]:
#intialize ber hyperparams
best_params = knn_search.best_params['rmse']

In [72]:
#create obj. and retrain whole train data
model_best = KNNBaseline(**best_params)
model_best.fit(train_data)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x29b0683c700>

In [73]:
#predict test data using best model
test_pred = model_best.test(test_data)
test_rmse = accuracy.rmse(test_pred)
test_rmse

RMSE: 0.8581


0.8581299488628784

In [74]:
#summarize RMSE tuning dan test
summary_test_df = pd.DataFrame({'Model' : ['User to User CF'],
                                'RMSE-Tuning': [knn_search.best_score['rmse']],
                                'RMSE-Test': [test_rmse]})

summary_test_df

Unnamed: 0,Model,RMSE-Tuning,RMSE-Test
0,User to User CF,0.856963,0.85813


#### Prediction

In [75]:
#predict user_id = 2 and book_id = 4
sample_prediction = model_best.predict(uid = 2,
                                      iid = 4)

In [76]:
sample_prediction

Prediction(uid=2, iid=4, r_ui=None, est=4.961821802113375, details={'actual_k': 25, 'was_impossible': False})

Find all book that have not been viewed/unrated by user_id= 2

In [None]:
#get unique book_id
unique_book_id = set(rating_data['book_id'])
print(unique_book_id)

In [None]:
#get book_id that is rated by user_id = 2
rated_book_id = set(rating_data.loc[rating_data['user_id']==2, 'book_id'])
print(rated_book_id)

In [None]:
#find unrated book_id
unrated_book_id = unique_book_id.difference(rated_book_id)
print(unrated_book_id)

In [80]:
#make function
def get_unrated_book_ids(rating_data, user_id):
    """
    Gets a list of book IDs that a user has not rated yet.

    Parameters
    ----------
    rating_data : DataFrame
        The DataFrame containing the rating data.
    user_id : int
        The ID of the user for whom we want to find unrated book IDs.

    Returns
    -------
    unrated_book_ids : set
        A set of book IDs that the user has not rated.
    """
    #get unique book_id
    unique_book_ids = set(rating_data['book_id'])
    #get book_id that is rated by user_id = 2
    rated_book_ids = set(rating_data.loc[rating_data['user_id'] == user_id, 'book_id'])
    #find unrated book_id
    unrated_book_ids = unique_book_ids.difference(rated_book_ids)
    
    return unrated_book_ids


In [None]:
#check result function and manual
unrated_books = get_unrated_book_ids(rating_data, 2)
print(unrated_books)

In [82]:
#create predict from unrated book
predicted_unrated_book = {
    'user_id': 2,
    'book_id': [],
    'predicted_rating': []
}

predicted_unrated_book

{'user_id': 2, 'book_id': [], 'predicted_rating': []}

In [83]:
#loop all unrated book
for id in unrated_book_id:
    #make predict
    pred_id = model_best.predict(uid = predicted_unrated_book['user_id'],
                                 iid = id)
    #append
    predicted_unrated_book['book_id'].append(id)
    predicted_unrated_book['predicted_rating'].append(pred_id.est)

In [84]:
#convert to df
predicted_unrated_book = pd.DataFrame(predicted_unrated_book)
predicted_unrated_book

Unnamed: 0,user_id,book_id,predicted_rating
0,2,1,4.565140
1,2,3,3.498752
2,2,4,4.961822
3,2,6,4.588569
4,2,7,4.074330
...,...,...,...
9930,2,9996,4.205349
9931,2,9997,4.205349
9932,2,9998,4.205349
9933,2,9999,4.205349


In [85]:
#sort predicted rating
predicted_unrated_book = predicted_unrated_book.sort_values('predicted_rating',
                                                              ascending = False)
predicted_unrated_book

Unnamed: 0,user_id,book_id,predicted_rating
470,2,507,5.000000
66,2,85,5.000000
385,2,422,5.000000
453,2,490,4.973167
2,2,4,4.961822
...,...,...,...
195,2,224,2.784238
19,2,34,2.688126
216,2,246,2.609063
77,2,96,2.582685


In [86]:
#make function
def predict_and_sort_ratings(model, user_id, unrated_book_ids):
    """
    Predicts and sorts unrated books based on predicted ratings for a given user.

    Parameters
    ----------
    model : object
        The collaborative filtering model used for predictions.
    user_id : int
        The ID of the user for whom we want to predict and sort unrated books.
    unrated_book_ids : list
        A list of book IDs that the user has not rated yet.

    Returns
    -------
    predicted_unrated_book_df : DataFrame
        A DataFrame containing the predicted ratings and book IDs,
        sorted in descending order of predicted ratings.
    """

    #initialize
    predicted_unrated_book = {
        'user_id': user_id,
        'book_id': [],
        'predicted_rating': []
    }
    
    #loop all unrated book
    for book_id in unrated_book_ids:
        #make predict
        pred_id = model.predict(uid=predicted_unrated_book['user_id'],
                                iid=book_id)
        #append
        predicted_unrated_book['book_id'].append(book_id)
        predicted_unrated_book['predicted_rating'].append(pred_id.est)

    #create df
    predicted_unrated_book_df = pd.DataFrame(predicted_unrated_book).sort_values('predicted_rating',
                                                                                  ascending=False)

    return predicted_unrated_book_df


In [87]:
predicted_books_df = predict_and_sort_ratings(model_best,2,unrated_book_ids=unrated_books)
predicted_books_df

Unnamed: 0,user_id,book_id,predicted_rating
470,2,507,5.000000
66,2,85,5.000000
385,2,422,5.000000
453,2,490,4.973167
2,2,4,4.961822
...,...,...,...
195,2,224,2.784238
19,2,34,2.688126
216,2,246,2.609063
77,2,96,2.582685


In [88]:
#initialize book data
new_book_data = book_copy
new_book_data.head()

Unnamed: 0,book_id,authors,original_publication_year,original_title,image_url
0,1,Suzanne Collins,2008,The Hunger Games,https://images.gr-assets.com/books/1447303603m...
1,2,"J.K. Rowling, Mary GrandPré",1997,Harry Potter and the Philosopher's Stone,https://images.gr-assets.com/books/1474154022m...
2,3,Stephenie Meyer,2005,Twilight,https://images.gr-assets.com/books/1361039443m...
3,4,Harper Lee,1960,To Kill a Mockingbird,https://images.gr-assets.com/books/1361975680m...
4,5,F. Scott Fitzgerald,1925,The Great Gatsby,https://images.gr-assets.com/books/1490528560m...


In [89]:
#top k biggest rating
k = 5
top_book = predicted_unrated_book.head(k).copy()
top_book

Unnamed: 0,user_id,book_id,predicted_rating
470,2,507,5.0
66,2,85,5.0
385,2,422,5.0
453,2,490,4.973167
2,2,4,4.961822


In [90]:
#add detail
top_book['authors'] = new_book_data.loc[top_book['book_id'], 'authors'].values
top_book['original_publication_year'] = new_book_data.loc[top_book['book_id'], 'original_publication_year'].values
top_book['original_title'] = new_book_data.loc[top_book['book_id'], 'original_title'].values
top_book['image_url'] = new_book_data.loc[top_book['book_id'], 'image_url'].values

top_book

Unnamed: 0,user_id,book_id,predicted_rating,authors,original_publication_year,original_title,image_url
470,2,507,5.0,Barbara Ehrenreich,2001,Nickel and Dimed: On (Not) Getting By in America,https://s.gr-assets.com/assets/nophoto/book/11...
66,2,85,5.0,John Grisham,1989,A Time to Kill,https://s.gr-assets.com/assets/nophoto/book/11...
385,2,422,5.0,Kiera Cass,2013,The Elite,https://images.gr-assets.com/books/1391454595m...
453,2,490,4.973167,"P.C. Cast, Kristin Cast",2008,Untamed: A House of Night Novel,https://images.gr-assets.com/books/1438037020m...
2,2,4,4.961822,F. Scott Fitzgerald,1925,The Great Gatsby,https://images.gr-assets.com/books/1490528560m...


In [91]:
def get_top_predicted_books(model, k, user_id, rating_data, book_data):
    """
    Gets the top predicted books for a given user based on a collaborative filtering model.

    Parameters
    ----------
    model : object
        The collaborative filtering model used for predictions
    k : int
        The number of top predicted books to retrieve
    user_id : int
        The ID of the user for whom to get top predicted books
    rating_data : DataFrame
        The DataFrame containing the rating data
    book_data : DataFrame
        The DataFrame containing the book details

    Returns
    -------
    top_books_df : DataFrame
        A DataFrame containing the top predicted books along with their details
    """

    # Get unrated book IDs for the user
    unrated_book_ids = get_unrated_book_ids(rating_data, user_id)

    # Predict and sort unrated books
    predicted_books_df = predict_and_sort_ratings(model, user_id, unrated_book_ids)

    # Get the top k predicted books
    top_predicted_books = predicted_books_df.head(k).copy()

    # Add book details to the top predicted books
    top_predicted_books['authors'] = book_data.loc[top_predicted_books['book_id'], 'authors'].values
    top_predicted_books['original_publication_year'] = book_data.loc[top_predicted_books['book_id'], 'original_publication_year'].values
    top_predicted_books['original_title'] = book_data.loc[top_predicted_books['book_id'], 'original_title'].values
    top_predicted_books['image_url'] = book_data.loc[top_predicted_books['book_id'], 'image_url'].values

    return top_predicted_books

In [92]:
# Example usage
predicted_books = get_top_predicted_books(model_best, 9, 2, rating_data, new_book_data)
predicted_books

Unnamed: 0,user_id,book_id,predicted_rating,authors,original_publication_year,original_title,image_url
470,2,507,5.0,Barbara Ehrenreich,2001,Nickel and Dimed: On (Not) Getting By in America,https://s.gr-assets.com/assets/nophoto/book/11...
66,2,85,5.0,John Grisham,1989,A Time to Kill,https://s.gr-assets.com/assets/nophoto/book/11...
385,2,422,5.0,Kiera Cass,2013,The Elite,https://images.gr-assets.com/books/1391454595m...
453,2,490,4.973167,"P.C. Cast, Kristin Cast",2008,Untamed: A House of Night Novel,https://images.gr-assets.com/books/1438037020m...
2,2,4,4.961822,F. Scott Fitzgerald,1925,The Great Gatsby,https://images.gr-assets.com/books/1490528560m...


In [111]:
pickle.dump(model_best,open('model_best.pkl','wb'))
pickle.dump(book_data_small,open('books.pkl','wb'))
pickle.dump(rating_data_small,open('rating.pkl','wb'))
pickle.dump(predicted_books,open('predicted_books.pkl','wb'))