## **Book Recommendation System**

### Data Description
- The data comes from [Goodbooks dataset](https://github.com/zygmuntz/goodbooks-10k).
- The dataset contains 10,000 books and 5,976,479 ratings.


There are 2 files that will be used:


**Book rating data**: `ratings.csv`

<center>

|Feature|Description|Data Type|
|:--|:--|:--:|
|`user_id`|User ID|`int`|
|`book_id`|BookID|`int`|
|`rating`|The rating of the book given by the user. Rating starts from `0` to `5`|`int`|

**Books data** : `books.csv`

<center>

|Feature|Description|Data Type|
|:--|:--|:--:|
|`book_id`|Book ID|`int`|
|`goodreads_book_id`|The goodreads book ID|`int`|
|`best_book_id`|Rating of the book given by the user. Rating starts from `0` to `5`|`int`|
|`work_id`|Work ID|`int`|
|`books_count`|Books count|`int`|
|`isbn`|International standard book number|`object`|
|`isbn13`|Book identification number (new version of ISBN)|`float`|
|`authors`|The authors of the book|`object`|
|`original_publication_year`|The year of publication|`float`|
|`original_title`|Original title|`object`|
|`title`|Book title|`object`|
|`language_code`|Code of language|`object`|
|`average_rating`|Average rating|`float`|
|`ratings_count`|Rating count|`int`|
|`work_ratings_count`|Work ratings count|`int`|
|`work_text_reviews_count`|Work text reviews count|`int`|
|`ratings_1`|Rating 1|`int`|
|`ratings_2`|Rating 2|`int`|
|`ratings_3`|Rating 3|`int`|
|`ratings_4`|Rating 4|`int`|
|`ratings_5`|Rating 5|`int`|
|`image_url`|Image link|`object`|
|`small_image_url`|Small image links|`object`|

### **1. Import Data dan Preparation**

In [1]:
#load library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

In [2]:
#load data from path
rating_path = '../data/ratings.csv'
book_path = '../data/books.csv'
book_tags_path = '../data/book_tags.csv'
tags_path = '../data/tags.csv'

In [3]:
def load_data(book_path, rating_path):
    """
    
    Function to load book and rating data
    - subsetting only the used columns
    - fill in missing values
    - drop duplicate rows
    
    """
        
    # reads the CSV file data and saves it as a DataFrame
    rating_data = pd.read_csv(rating_path, delimiter=',')
    book_data = pd.read_csv(book_path, delimiter=',')
    
    # copy dataframe book_data, and delete some feature.
    book_copy = book_data.copy()
    book_copy = book_copy.drop(columns=['best_book_id','work_id','books_count','isbn',
           'isbn13','title','language_code','average_rating',
           'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
           'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
           'small_image_url'], axis=1)
    book_copy.head(3)
    
    # fill null values in book_data
    print("Missing values before fillna: ", book_copy.isnull().sum())
    book_copy['original_publication_year'] = book_copy['original_publication_year'].fillna(0)
    book_copy['original_title'] = book_copy['original_title'].fillna(book_data['title'])
    print("Missing values after fillna: ", book_copy.isnull().sum())
    
    # changes the data type original_publication_year column to int data type
    book_copy.loc[:, 'original_publication_year'] = book_copy['original_publication_year'].astype(int)
    book_copy.dtypes
    
    # drop duplicated rows
    print("Books shape before drop dup: ", book_copy.shape)
    print("Ratings shape before drop dup: ", rating_data.shape)
    book_copy.drop_duplicates(subset = ['book_id', 'goodreads_book_id'], inplace = True)
    rating_data.drop_duplicates(subset=['user_id','book_id'], inplace = True)
    print("Books shape after drop dup: ", book_copy.shape)
    print("Ratings shape after drop dup: ", rating_data.shape)
    
    return book_copy, rating_data

In [4]:
book_copy, rating_data = load_data(book_path, rating_path)

Missing values before fillna:  book_id                        0
goodreads_book_id              0
authors                        0
original_publication_year     21
original_title               585
image_url                      0
dtype: int64
Missing values after fillna:  book_id                      0
goodreads_book_id            0
authors                      0
original_publication_year    0
original_title               0
image_url                    0
dtype: int64
Books shape before drop dup:  (10000, 6)
Ratings shape before drop dup:  (5976479, 3)
Books shape after drop dup:  (10000, 6)
Ratings shape after drop dup:  (5976479, 3)


**rating_data** has the correct type and feature. There is no null data and duplicated in rating_data.

**book_copy** has the correct feature. The data type in 'original_publication_year' has been corrected. There is no duplicated in book_copy and null data has been removed.

**book_copy** has the correct feature. The data type in 'original_publication_year' has been corrected. There is no duplicated in book_copy and null data has been removed.

In [5]:
#show the datatype of book_data
book_copy.dtypes

book_id                       int64
goodreads_book_id             int64
authors                      object
original_publication_year     int32
original_title               object
image_url                    object
dtype: object

In [6]:
#show book_data
book_copy.head(3)

Unnamed: 0,book_id,goodreads_book_id,authors,original_publication_year,original_title,image_url
0,1,2767052,Suzanne Collins,2008,The Hunger Games,https://images.gr-assets.com/books/1447303603m...
1,2,3,"J.K. Rowling, Mary GrandPré",1997,Harry Potter and the Philosopher's Stone,https://images.gr-assets.com/books/1474154022m...
2,3,41865,Stephenie Meyer,2005,Twilight,https://images.gr-assets.com/books/1361039443m...


### Extracting Tags Data to obtain book genres

In [7]:
book_id_limit = 2500
user_id_limit = 13356

# subset the book data and rating data to include only limited number of books, user and bookID
book_data_small  = book_copy.drop(book_copy[book_copy['book_id'] > book_id_limit].copy().index)
rating_data_small = rating_data.loc[(rating_data['user_id'] <= user_id_limit) & (rating_data['book_id'] <= book_id_limit)]


In [8]:
# separate

book_tags = pd.read_csv(book_tags_path, delimiter=',')
tags = pd.read_csv(tags_path, delimiter=',')

genres = ["Art", "Biography", "Business", "Chick Lit", "Children", "Christian", "Classics",
          "Comics", "Contemporary", "Cookbooks", "Crime", "Ebooks", "Fantasy", "Fiction",
          "Gay and Lesbian", "Graphic Novels", "Historical Fiction", "History", "Horror",
          "Humor and Comedy", "Manga", "Memoir", "Music", "Mystery", "Nonfiction", "Paranormal",
          "Philosophy", "Poetry", "Psychology", "Religion", "Romance", "Science", "Science Fiction", 
          "Self Help", "Suspense", "Spirituality", "Sports", "Thriller", "Travel", "Young Adult"]

genres = list(map(str.lower, genres))

def create_genre_list(tag):
    """
    Function for building list of extracted genres
    """
    genre_list = []
    string_tag = str(tag)
    
    for genre in genres:
        
        if ('nonfiction' in string_tag):
            genre_list.append('nonfiction')
        elif (genre in string_tag) & ('non' not in string_tag):
            genre_list.append(genre)
        elif ('sci-fi' in string_tag) | ('scifi' in string_tag):
            genre_list.append('science fiction')
        else:
            pass
        
    return genre_list


def unique_array(list_):
    unique_list = list(set(list_))
    return unique_list

def extract_genres(book_tags, tags, genres):
    """
    Function to extract genres from tag names
    """
    tags['tag_name_lower'] = tags['tag_name'].str.lower()
    available_genres = tags.loc[tags.tag_name_lower.str.lower().isin(genres)]
    available_genres.head()
    
    tags['genre_list'] = [[]] * tags.shape[0]   

    # Add tags
    tags['genre_list'] = tags.apply(lambda row: create_genre_list(row['tag_name_lower']), axis = 1)
    tags_filtered = tags[tags.genre_list.str.len() != 0]
    
    # join with books
    booktags_to_genre = pd.merge(book_tags, tags_filtered, how = "left", on = "tag_id")
    booktags_to_genre.dropna(subset = ["genre_list"], inplace = True)
    booktags_to_genre.drop(['tag_id', 'tag_name', 'tag_name_lower', 'count'], axis=1, inplace = True)
    gr_book_genres = booktags_to_genre.groupby('goodreads_book_id').agg({'genre_list': 'sum'}).reset_index(drop = False)

    gr_book_genres['genres'] = gr_book_genres.apply(lambda row: unique_array(row['genre_list']), axis = 1)
    gr_book_genres.drop(['genre_list'], axis = 1, inplace = True)
    
    # Join with books
    books_with_genres = pd.merge(book_data_small, gr_book_genres, how = "left", on = "goodreads_book_id")
    #books_with_genres = books_with_genres[["book_id", "original_title", "genres"]]
    
    return books_with_genres


In [9]:
books_with_genres = extract_genres(book_tags, tags, genres)
books_with_genres

Unnamed: 0,book_id,goodreads_book_id,authors,original_publication_year,original_title,image_url,genres
0,1,2767052,Suzanne Collins,2008,The Hunger Games,https://images.gr-assets.com/books/1447303603m...,"[romance, suspense, fantasy, science, contempo..."
1,2,3,"J.K. Rowling, Mary GrandPré",1997,Harry Potter and the Philosopher's Stone,https://images.gr-assets.com/books/1474154022m...,"[children, fantasy, science, contemporary, par..."
2,3,41865,Stephenie Meyer,2005,Twilight,https://images.gr-assets.com/books/1361039443m...,"[romance, fantasy, science, contemporary, para..."
3,4,2657,Harper Lee,1960,To Kill a Mockingbird,https://images.gr-assets.com/books/1361975680m...,"[contemporary, history, crime, mystery, fictio..."
4,5,4671,F. Scott Fitzgerald,1925,The Great Gatsby,https://images.gr-assets.com/books/1490528560m...,"[romance, ebooks, fiction, classics]"
...,...,...,...,...,...,...,...
2495,2496,8584686,Karen Russell,2011,Swamplandia!,https://images.gr-assets.com/books/1320536498m...,"[fantasy, contemporary, fiction, ebooks]"
2496,2497,49353,Michael Connelly,1995,The Last Coyote,https://s.gr-assets.com/assets/nophoto/book/11...,"[suspense, contemporary, crime, thriller, myst..."
2497,2498,2715,Mark Kurlansky,2002,Salt: A World History,https://images.gr-assets.com/books/1414608893m...,"[travel, science, history, business, cookbooks..."
2498,2499,31173,"Charlotte Brontë, A.S. Byatt, Μαρία Λαϊνά, Ign...",1853,Villette,https://images.gr-assets.com/books/1320412741m...,"[romance, ebooks, fiction, classics]"


## **2. Non-personalized: popularity-based recommendation**

In [10]:
rating_data_small.shape

(1122956, 3)

In [11]:
def popular_books(rating_data, book_data):
    #count the number of ratings given for each book and store the result in a new df called 'rating_count'
    rating_count = rating_data.groupby('book_id').count()['rating'].reset_index()
    rating_count.rename(columns={'rating':'rating_count'}, inplace=True)
    
    #count the mean of ratings given for each book and store the result in a new df called 'mean_rating'
    mean_rating = rating_data.groupby('book_id').mean().round(2)['rating'].reset_index()
    mean_rating.rename(columns={'rating':'mean_rating'}, inplace=True)
    
    #merge 'rating_count' dataframe with 'mean_rating' dataframe based on 'book_id' column
    popular = rating_count.merge(mean_rating, on='book_id')
    
    #merge df 'popular' with df 'book_copy' based on column 'book_id' then select specific columns and remove duplicate rows based on 'book_id'
    popular = popular.merge(book_data, on="book_id").drop_duplicates("book_id")[["book_id","rating_count","mean_rating","authors","original_publication_year","original_title","image_url"]]

    #merge df 'popular' with genres
    popular_with_genres = popular.merge(books_with_genres, on="book_id").drop_duplicates("book_id")
    popular_with_genres['authors'] = popular_with_genres['authors_x'].fillna(popular_with_genres['authors_y'])
    popular_with_genres.drop(['authors_x', 'authors_y'], axis=1, inplace=True)

    popular_with_genres['original_publication_year'] = popular_with_genres['original_publication_year_x'].fillna(popular_with_genres['original_publication_year_y'])
    popular_with_genres.drop(['original_publication_year_x', 'original_publication_year_y'], axis=1, inplace=True)

    popular_with_genres['original_title'] = popular_with_genres['original_title_x'].fillna(popular_with_genres['original_title_y'])
    popular_with_genres.drop(['original_title_x', 'original_title_y'], axis=1, inplace=True)

    popular_with_genres['image_url'] = popular_with_genres['image_url_x'].fillna(popular_with_genres['image_url_y'])
    popular_with_genres.drop(['image_url_x', 'image_url_y'], axis=1, inplace=True)

    return popular_with_genres

In [12]:
popular_with_genres = popular_books(rating_data_small, book_data_small)
#show the order of values from largest to smallest
top_30 = popular_with_genres.sort_values("rating_count", ascending=False).head(30)
top_30

Unnamed: 0,book_id,rating_count,mean_rating,goodreads_book_id,genres,authors,original_publication_year,original_title,image_url
0,1,5250,4.29,2767052,"[romance, suspense, fantasy, science, contempo...",Suzanne Collins,2008,The Hunger Games,https://images.gr-assets.com/books/1447303603m...
1,2,4892,4.15,3,"[children, fantasy, science, contemporary, par...","J.K. Rowling, Mary GrandPré",1997,Harry Potter and the Philosopher's Stone,https://images.gr-assets.com/books/1474154022m...
3,4,4879,4.32,2657,"[contemporary, history, crime, mystery, fictio...",Harper Lee,1960,To Kill a Mockingbird,https://images.gr-assets.com/books/1361975680m...
16,17,4506,4.1,6148028,"[romance, suspense, fantasy, science, contempo...",Suzanne Collins,2009,Catching Fire,https://images.gr-assets.com/books/1358273780m...
4,5,4446,3.7,4671,"[romance, ebooks, fiction, classics]",F. Scott Fitzgerald,1925,The Great Gatsby,https://images.gr-assets.com/books/1490528560m...
25,26,4391,3.39,968,"[romance, suspense, religion, art, fantasy, sc...",Dan Brown,2003,The Da Vinci Code,https://images.gr-assets.com/books/1303252999m...
22,23,4371,4.02,15881,"[children, fantasy, science, contemporary, par...","J.K. Rowling, Mary GrandPré",1998,Harry Potter and the Chamber of Secrets,https://images.gr-assets.com/books/1474169725m...
17,18,4358,4.27,5,"[children, travel, fantasy, science, contempor...","J.K. Rowling, Mary GrandPré, Rufus Beck",1999,Harry Potter and the Prisoner of Azkaban,https://images.gr-assets.com/books/1499277281m...
19,20,4291,3.79,7260188,"[romance, suspense, fantasy, science, science ...",Suzanne Collins,2010,Mockingjay,https://images.gr-assets.com/books/1358275419m...
23,24,4263,4.29,6,"[children, fantasy, science, contemporary, par...","J.K. Rowling, Mary GrandPré",2000,Harry Potter and the Goblet of Fire,https://images.gr-assets.com/books/1361482611m...


In [90]:
pickle.dump(top_30, open('output/popular.pkl','wb'))

## **3. Personalized recommender system**

There are 5 approach we train with cross validation to obtain the best model:
a. Baseline Approach
b. KNN (Basic and Baseline)
c. Matrix Factorization (SVD and NMF)

The total number of users is more than the number of items, so for personalized recommender systems, *User-to-User Collaborative Filtering (User CF)* is used. 

In [15]:
#load library
import surprise
from surprise import accuracy, Dataset, Reader, BaselineOnly, KNNBasic, KNNBaseline, SVD, NMF
from surprise.model_selection.search import RandomizedSearchCV
from surprise.model_selection import cross_validate, train_test_split

### Preparation Train-Test Data

In [16]:
#make matrix
#copy rating_data and make pivot to check total 'user_id' and 'book_id'
user_rating_pivot = rating_data_small.pivot(index='user_id',columns='book_id',values='rating')

In [17]:
#Initialize a Reader object in the Surprise library to read rating data on a scale of 1-5
reader = Reader(rating_scale = (1, 5))

In [18]:
#reads the rating data and converts it into a format that can be used to load the recommendation dataset from df 'rating_data'
dataset = Dataset.load_from_df(rating_data_small[['user_id', 'book_id', 'rating']].copy(), reader)
#show data
dataset.df.head(5)

Unnamed: 0,user_id,book_id,rating
0,1,258,5
2,2,260,5
4,2,2318,3
5,2,26,4
6,2,315,3


In [19]:
#split dataset into training data and test data
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

In [20]:
#validate splitting
train_data.n_ratings, len(test_data)

(898364, 224592)

### a. Baseline model

Baselineonly calculates the predicted value based on the baseline (global, user, and item averages)

In [21]:
#initialize
model_baseline = BaselineOnly()
model_baseline

<surprise.prediction_algorithms.baseline_only.BaselineOnly at 0x7f216f860280>

In [22]:
#perform cross-validation on the initialized recommendation model using the 'BaselineOnly'
cv_baseline = cross_validate(algo=model_baseline, data=dataset, cv=5,measures=['rmse'])

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


In [23]:
#cv result
cv_baseline_rmse = cv_baseline['test_rmse'].mean()
cv_baseline_rmse

0.8792313203106795

### b. KNN Basic and Baseline

In [39]:
#initialization of parameters that will be used in a randomized search
#for hyperparameters in the recommendation model with the KNNBaseline method
param_dist = {'k':list(np.arange(start=20, stop=40, step=5)),
          'sim_options':{'name':['pearson_baseline'],'user_based':['True']}, 'min_k': [1, 2, 3]}

In [None]:
#randomized search for hyperparameters in the recommendation model with the KNNBasic method
knn_basic = RandomizedSearchCV(algo_class=KNNBasic, param_distributions = param_dist, cv=5)

In [24]:
knn_basic.fit(data=dataset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline si

In [40]:
#randomized search for hyperparameters in the recommendation model with the KNNBaseline method
knn_search = RandomizedSearchCV(algo_class=KNNBaseline, param_distributions = param_dist, cv=5)

In [None]:
#process search hyperparams
knn_search.fit(data=dataset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing simila

In [None]:
pickle.dump(knn_search.best_params["rmse"],open('output/knn_baseline.pkl','wb'))

### c. Matrix Factorization (SVD and NMF)

In [48]:
params_SVD = {'lr_all' : [1,0.1,0.01,0.001], 'n_factors' : [50,100],
              'reg_all' : [1,0.1,0.01, 0.02]
              }  

In [49]:
svd_search = RandomizedSearchCV(algo_class=SVD, param_distributions = params_SVD, cv=5)
svd_search.fit(data=dataset)

In [50]:
params_NMF = {'n_factors': np.arange(5, 50, 5),
              'n_epochs': np.arange(10, 100, 10)
             }

In [51]:
nmf_search = RandomizedSearchCV(algo_class=NMF, param_distributions = params_NMF, cv=5)
nmf_search.fit(data=dataset)

### Summary CV of Multiple Algorithms

In [52]:
#summarize performance
summary_df = pd.DataFrame({'Model': ['Baseline', 'KNN Basic','KNN Baseline', 'SVD', 'NMF'],
                           'CV Performance - RMSE': [cv_baseline_rmse,knn_basic.best_score['rmse'],knn_search.best_score['rmse'],svd_search.best_score['rmse'],nmf_search.best_score['rmse']],
                           'Model Condiguration':['N/A',f'{knn_basic.best_params["rmse"]}',f'{knn_search.best_params["rmse"]}',f'{svd_search.best_params["rmse"]}',f'{nmf_search.best_params["rmse"]}']})

summary_df

Unnamed: 0,Model,CV Performance - RMSE,Model Condiguration
0,Baseline,0.879231,
1,KNN Basic,0.894693,"{'k': 30, 'sim_options': {'name': 'pearson_bas..."
2,KNN Baseline,0.840927,"{'k': 30, 'sim_options': {'name': 'pearson_bas..."
3,SVD,0.864923,"{'lr_all': 0.01, 'n_factors': 50, 'reg_all': 0.1}"
4,NMF,0.873761,"{'n_factors': 35, 'n_epochs': 80}"


## 4. Train Best Model and Predict

In [54]:
#best hyperparams combination
#intialize ber hyperparams
best_params = knn_search.best_params['rmse']
best_params

{'k': 30,
 'sim_options': {'name': 'pearson_baseline', 'user_based': 'True'},
 'min_k': 3}

In [55]:
#create obj. and retrain whole train data
model_best = KNNBaseline(**best_params)
model_best.fit(train_data)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7f213c1e4370>

In [56]:
#predict test data using best model
test_pred = model_best.test(test_data)
test_rmse = accuracy.rmse(test_pred)

RMSE: 0.8419


In [58]:
#summarize RMSE tuning dan test
summary_test_df = pd.DataFrame({'Model' : ['User to User CF'],
                                'RMSE-CV-Tuning': [knn_search.best_score['rmse']],
                                'RMSE-Test': [test_rmse]})

summary_test_df

Unnamed: 0,Model,RMSE-CV-Tuning,RMSE-Test
0,User to User CF,0.840927,0.841932


#### Prediction

In [67]:
#predict user_id = 2 and book_id = 4
sample_prediction = model_best.predict(uid = 2, iid = 4)
sample_prediction

Prediction(uid=2, iid=4, r_ui=None, est=4.886242510686152, details={'actual_k': 30, 'was_impossible': False})

In [69]:
# Your get_unrated_book_ids, predict_and_sort_ratings, and get_top_predicted_books functions here
#make function
def get_unrated_book_ids(rating_data, user_id):
    """
    Gets a list of book IDs that a user has not rated yet.

    Parameters
    ----------
    rating_data : DataFrame
        The DataFrame containing the rating data.
    user_id : int
        The ID of the user for whom we want to find unrated book IDs.

    Returns
    -------
    unrated_book_ids : set
        A set of book IDs that the user has not rated.
    """
    #get unique book_id
    unique_book_ids = set(rating_data['book_id'])
    #get book_id that is rated by user_id = 2
    rated_book_ids = set(rating_data.loc[rating_data['user_id'] == user_id, 'book_id'])
    #find unrated book_id
    unrated_book_ids = unique_book_ids.difference(rated_book_ids)
    
    return unrated_book_ids


In [75]:
#check result function and manual
unrated_books = get_unrated_book_ids(rating_data_small, 2)
unrated_books_lst = list(unrated_books)
print(unrated_books_lst[:10])

[1, 3, 4, 6, 7, 9, 11, 12, 13, 14]


In [76]:
#make function
def predict_and_sort_ratings(model, user_id, unrated_book_ids):
    """
    Predicts and sorts unrated books based on predicted ratings for a given user.

    Parameters
    ----------
    model : object
        The collaborative filtering model used for predictions.
    user_id : int
        The ID of the user for whom we want to predict and sort unrated books.
    unrated_book_ids : list
        A list of book IDs that the user has not rated yet.

    Returns
    -------
    predicted_unrated_book_df : DataFrame
        A DataFrame containing the predicted ratings and book IDs,
        sorted in descending order of predicted ratings.
    """

    #initialize
    predicted_unrated_book = {
        'user_id': user_id,
        'book_id': [],
        'predicted_rating': []
    }
    
    #loop all unrated book
    for book_id in unrated_book_ids:
        #make predict
        pred_id = model.predict(uid=predicted_unrated_book['user_id'],
                                iid=book_id)
        #append
        predicted_unrated_book['book_id'].append(book_id)
        predicted_unrated_book['predicted_rating'].append(pred_id.est)

    #create df
    predicted_unrated_book_df = pd.DataFrame(predicted_unrated_book).sort_values('predicted_rating',
                                                                                  ascending=False)

    return predicted_unrated_book_df

In [77]:
predicted_books_df = predict_and_sort_ratings(model_best,2,unrated_book_ids=unrated_books)
predicted_books_df

Unnamed: 0,user_id,book_id,predicted_rating
2196,2,2244,5.000000
740,2,780,5.000000
2161,2,2209,5.000000
385,2,422,5.000000
2055,2,2101,5.000000
...,...,...,...
2291,2,2340,2.702617
19,2,34,2.640711
1081,2,1125,2.516687
1747,2,1793,1.917401


In [82]:
def get_top_predicted_books(model, k, user_id, rating_data, book_data):
    """
    Gets the top predicted books for a given user based on a collaborative filtering model.

    Parameters
    ----------
    model : object
        The collaborative filtering model used for predictions
    k : int
        The number of top predicted books to retrieve
    user_id : int
        The ID of the user for whom to get top predicted books
    rating_data : DataFrame
        The DataFrame containing the rating data
    book_data : DataFrame
        The DataFrame containing the book details

    Returns
    -------
    top_books_df : DataFrame
        A DataFrame containing the top predicted books along with their details
    """

    # Get unrated book IDs for the user
    unrated_book_ids = get_unrated_book_ids(rating_data, user_id)

    # Predict and sort unrated books
    predicted_books_df = predict_and_sort_ratings(model, user_id, unrated_book_ids)

    # Get the top k predicted books
    top_predicted_books = predicted_books_df.head(k).copy()

    # Add book details to the top predicted books
    top_predicted_books['authors'] = book_data.loc[top_predicted_books['book_id'], 'authors'].values
    top_predicted_books['original_publication_year'] = book_data.loc[top_predicted_books['book_id'], 'original_publication_year'].values
    top_predicted_books['original_title'] = book_data.loc[top_predicted_books['book_id'], 'original_title'].values
    top_predicted_books['image_url'] = book_data.loc[top_predicted_books['book_id'], 'image_url'].values

    return top_predicted_books

In [84]:
# Example usage
predicted_books = get_top_predicted_books(model_best, 9, 2, rating_data_small, book_data_small)
predicted_books

Unnamed: 0,user_id,book_id,predicted_rating,authors,original_publication_year,original_title,image_url
2161,2,2209,5.0,Ursula K. Le Guin,1974,The Dispossessed,https://images.gr-assets.com/books/1353467455m...
740,2,780,5.0,Scott Westerfeld,2006,Specials,https://s.gr-assets.com/assets/nophoto/book/11...
967,2,1010,5.0,Geraldine Brooks,2008,People of the Book,https://s.gr-assets.com/assets/nophoto/book/11...
2208,2,2256,5.0,Edward Albee,1962,Who's Afraid of Virginia Woolf?,https://images.gr-assets.com/books/1327962277m...
2196,2,2244,5.0,Lee Child,2004,The Enemy,https://s.gr-assets.com/assets/nophoto/book/11...
2325,2,2374,5.0,John le Carré,1974,"Tinker, Tailor, Soldier, Spy",https://images.gr-assets.com/books/1327889127m...
427,2,464,5.0,Jennifer Weiner,2002,In Her Shoes,https://images.gr-assets.com/books/1435252471m...
821,2,862,5.0,"Sam McBratney, Anita Jeram",1988,Guess How Much I Love You,https://images.gr-assets.com/books/1320457007m...
385,2,422,5.0,Kiera Cass,2013,The Elite,https://images.gr-assets.com/books/1391454595m...


In [86]:
pickle.dump(best_params,open('output/best_params.pkl','wb'))
pickle.dump(book_data_small,open('output/books.pkl','wb'))
pickle.dump(rating_data_small,open('output/rating.pkl','wb'))