# Lesson/Article 07 - Book Recommendation

**Author: Gabriel Moretti de Souza - 200018205**

**Author's GitHub username: MorettiGS**

## HuggingFace

The HuggingFace Space for this notebook is available at the link below:

[HuggingFace - BooksRecommendation](https://huggingface.co/spaces/MorettiGS/BooksRecommendation)

## Objective

In this article, we walk through the creation and use of a **recommendation system** built with Artificial Intelligence, with the goal of training it to predict which books a person is likely to enjoy.

To build this predictor, we will use collaborative filtering, first with dot-product models and then with a neural network, trained on the [Good Books Rating](https://www.kaggle.com/datasets/mustafayazici/goodbooks-10k-rating-and-description) dataset.

In a real-life setting, this approach and its results could be useful when building recommendation systems, in this case for books.

## Pre-requisites

The following installs were needed beforehand to provide all the dependencies used in the notebook:

In [50]:
# !pip install --upgrade pip setuptools wheel
# !pip install p5py
# !pip install PEP517
# !pip install gcc
# !pip install pandas
# !pip install kaggle
# !pip install opendatasets
# !pip install torch
# !pip install fastai
# !pip install scikit-learn

Before going any further, we need to import the dependencies used for training. They come from the Python libraries below:

In [63]:
import opendatasets as od # Responsible for dataset download
import numpy as np # Responsible for linear algebra
import pandas as pd # Responsible for data processing and manipulation
import pickle # Responsible for model exporting
from fastai.collab import *
from fastai.tabular.all import *

## Data Preprocessing

Now we can work on the data used for training. As mentioned above, it comes from the Kaggle dataset, which can be downloaded as follows using a Kaggle username and key:

In [37]:
od.download(
    "https://www.kaggle.com/datasets/mustafayazici/goodbooks-10k-rating-and-description/data")

# Downloading it by kaggle username and key, or simply download it and add it to the directory

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

  gmorettis


Your Kaggle Key:

  ········


Downloading goodbooks-10k-rating-and-description.zip to ./goodbooks-10k-rating-and-description


100%|██████████████████████████████████████| 33.7M/33.7M [00:04<00:00, 7.49MB/s]





Then, the dataset's columns can be seen organized as follows:

In [64]:
df = pd.read_csv('goodbooks-10k-rating-and-description/ratings.csv')
df

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3
...,...,...,...
5976474,49925,510,5
5976475,49925,528,4
5976476,49925,722,4
5976477,49925,949,5


### Treating Data

In [65]:
# Changing names of the columns
column_mapping = {
    'user_id': 'user',
    'book_id': 'item',
    'rating': 'rating'
}

# Use the rename method to rename columns
df.rename(columns=column_mapping, inplace=True)
df.head()

Unnamed: 0,user,item,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


In [66]:
# Getting smaller amount of data
ratings = df.iloc[:1000000]
ratings

Unnamed: 0,user,item,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3
...,...,...,...
999995,8264,3180,5
999996,2157,1682,4
999997,9573,1839,3
999998,11700,1276,5


In [67]:
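def crosstab(ratings, max_nan_user=100000, max_nan=100000):
    # User x item matrix of ratings; pairs without a rating become NaN
    ctdf = pd.crosstab(ratings.user, ratings.item, values=ratings.rating, aggfunc='sum')

    # Keep only the items (columns) with at most max_nan_user missing ratings
    null_cont_serie = ctdf.isnull().sum(axis=0)
    ctdf = ctdf[null_cont_serie[null_cont_serie <= max_nan_user].index]

    # Drop users (rows) that have no ratings at all
    ctdf = ctdf.dropna(axis=0, how='all')

    # Count missing ratings per user before trimming the columns
    user_null_cont_serie = ctdf.isnull().sum(axis=1)
    # Label-based slice on item ids; the constant 7633 is tied to this data subset
    ctdf = ctdf.loc[:, :ctdf.shape[1] - 7633]
    # Keep only the users with fewer than max_nan missing ratings
    ctdf = ctdf.loc[user_null_cont_serie[user_null_cont_serie < max_nan].index]
    return ctdf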

In [68]:
# Getting crosstab for ratings
cross = crosstab(ratings)
cross


item,1,2,3,4,5,6,7,8,9,10,...,66,67,68,69,70,71,72,73,74,75
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,5.0,,,,,,4.0,...,4.0,3.0,,,5.0,,3.0,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,3.0,,,,,,,...,,,,,,,,,,
4,,5.0,,4.0,4.0,,4.0,4.0,,5.0,...,,,,,4.0,,3.0,,,
6,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22351,,4.0,,,,,5.0,,3.0,,...,4.0,,,,,,,,,3.0
24420,,,,,,,,,,,...,,,,,,,,,,
37076,,5.0,3.0,4.0,,,,,5.0,,...,,,,,,,,,,
41961,,,,,,,,,,,...,,,,,,,,,,
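To make the shape of this matrix concrete, here is a minimal sketch on toy data. The `toy` frame and its values are made up for illustration; only the `pd.crosstab` call mirrors the notebook.

```python
import pandas as pd

# Toy ratings in the same (user, item, rating) layout as the notebook.
toy = pd.DataFrame({
    "user":   [1, 1, 2, 3, 3],
    "item":   [10, 20, 10, 20, 30],
    "rating": [5, 3, 4, 2, 1],
})

# One row per user, one column per item; unrated pairs become NaN.
matrix = pd.crosstab(toy.user, toy.item, values=toy.rating, aggfunc="sum")

# Missing ratings per item (column) and per user (row).
nan_per_item = matrix.isnull().sum(axis=0)
nan_per_user = matrix.isnull().sum(axis=1)

print(matrix)
```

The two NaN counts are the quantities the `crosstab` function thresholds with `max_nan_user` and `max_nan`.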


In [69]:
# Ordering rows by most non-NaN values
row_counts = cross.count(axis=1)

# Add a new column with the count of non-NaN values
cross['non_nan_count'] = row_counts

# Sort the DataFrame based on the number of non-NaN values in each row
cross_sorted = cross.sort_values(by='non_nan_count', ascending=False)

# Drop the temporary column used for sorting and show result
cross_sorted = cross_sorted.drop(columns='non_nan_count')
cross_sorted

item,1,2,3,4,5,6,7,8,9,10,...,66,67,68,69,70,71,72,73,74,75
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
14889,2.0,4.0,3.0,2.0,3.0,,4.0,4.0,,2.0,...,,,,,,3.0,4.0,2.0,1.0,4.0
13925,5.0,5.0,1.0,3.0,5.0,,3.0,2.0,5.0,2.0,...,2.0,,1.0,,1.0,2.0,5.0,,,5.0
9485,5.0,5.0,5.0,,4.0,,5.0,5.0,5.0,5.0,...,5.0,4.0,,,4.0,4.0,3.0,,,5.0
8750,,5.0,5.0,4.0,5.0,,4.0,4.0,4.0,5.0,...,2.0,,,,4.0,4.0,5.0,,,5.0
12946,,5.0,4.0,5.0,5.0,,5.0,4.0,,5.0,...,,5.0,,,,3.0,1.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14098,,,,,,,,,,,...,,,,,,,,,,
9807,,,,,,,,,,,...,,,,,,,,,,
3342,,,,,,,,,,,...,,,,,,,,,,
9801,,,,,,,,,,,...,,,,,,,,,,
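The same ordering can be obtained without a temporary column. The sketch below uses a made-up `toy` matrix and is only one possible alternative to the cell above.

```python
import numpy as np
import pandas as pd

# Toy user x item matrix with missing ratings.
toy = pd.DataFrame(
    {10: [5.0, np.nan, np.nan], 20: [3.0, np.nan, 2.0], 30: [4.0, np.nan, np.nan]},
    index=pd.Index([1, 2, 3], name="user"),
)

# count(axis=1) gives the number of non-NaN cells in each row;
# sorting that Series yields the row order directly.
order = toy.count(axis=1).sort_values(ascending=False).index
toy_sorted = toy.loc[order]

print(toy_sorted)
```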


In [70]:
# Trimming rows to a smaller size
cross_trim = cross_sorted.iloc[:-12000, :]
cross_trim

item,1,2,3,4,5,6,7,8,9,10,...,66,67,68,69,70,71,72,73,74,75
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
14889,2.0,4.0,3.0,2.0,3.0,,4.0,4.0,,2.0,...,,,,,,3.0,4.0,2.0,1.0,4.0
13925,5.0,5.0,1.0,3.0,5.0,,3.0,2.0,5.0,2.0,...,2.0,,1.0,,1.0,2.0,5.0,,,5.0
9485,5.0,5.0,5.0,,4.0,,5.0,5.0,5.0,5.0,...,5.0,4.0,,,4.0,4.0,3.0,,,5.0
8750,,5.0,5.0,4.0,5.0,,4.0,4.0,4.0,5.0,...,2.0,,,,4.0,4.0,5.0,,,5.0
12946,,5.0,4.0,5.0,5.0,,5.0,4.0,,5.0,...,,5.0,,,,3.0,1.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5179,,4.0,2.0,5.0,4.0,,4.0,5.0,,,...,,5.0,,,4.0,4.0,,,,
7375,,5.0,1.0,5.0,2.0,,3.0,2.0,,,...,,,,,,3.0,,,,4.0
14564,,2.0,1.0,4.0,5.0,,1.0,3.0,,3.0,...,1.0,,,,,4.0,4.0,,,
4934,5.0,4.0,3.0,2.0,5.0,,3.0,,4.0,5.0,...,4.0,,,,,5.0,,,,


In [71]:
# Verifying number of NaN values
cross_trim.isna().sum()

item
1     338
2      18
3      87
4      81
5     121
     ... 
71    270
72    368
73    463
74    563
75    302
Length: 74, dtype: int64

With this, we can see how the data is distributed and how sparse it is. We now proceed to the creation of the DataLoaders, which feed the data to our recommendation model.

## Creating DataLoaders

To start building the collaborative filter, we merge our two tables, books and ratings, so that each rating carries its book title.

In [72]:
books = pd.read_csv('goodbooks-10k-rating-and-description/goodbooks_10k_rating_and_description.csv')
books.head()

Unnamed: 0,book_id,book_title,book_series,title,book_authors,genres,book_score,book_rating,book_rating_obj,book_rating_count,...,book_pages_obj,books_count,books_count_obj,goodreads_book_id,book_isbn,isbn,isbn13,image_url_x,image_url_y,small_image_url
0,1,The Hunger Games,"The Hunger Games, #1","The Hunger Games (The Hunger Games, #1)",Suzanne Collins,Young Adult|Fiction|Science Fiction|Dystopia|Fantasy|Science Fiction,1.0,4.33,4.0,5519135,...,3.0,272,14.0,2767052,9780440000000.0,439023483,9780439000000.0,https://images.gr-assets.com/books/1447303603m/2767052.jpg,https://images.gr-assets.com/books/1447303603l/2767052.jpg,https://images.gr-assets.com/books/1447303603s/2767052.jpg
1,2,Harry Potter and the Philosopher's Stone,"Harry Potter, #1","Harry Potter and the Sorcerer's Stone (Harry Potter, #1)","J.K. Rowling, Mary GrandPré",,0.824569,4.44,4.0,6682936,...,,491,15.0,3,,439554934,9780440000000.0,https://images.gr-assets.com/books/1474154022m/3.jpg,,https://images.gr-assets.com/books/1474154022s/3.jpg
2,3,Twilight,"Twilight, #1","Twilight (Twilight, #1)",Stephenie Meyer,Young Adult|Fantasy|Romance|Paranormal|Vampires|Fiction|Fantasy|Paranormal,0.02902,3.58,3.5,4281268,...,4.0,226,14.0,41865,9780320000000.0,316015849,9780316000000.0,https://images.gr-assets.com/books/1361039443m/41865.jpg,https://images.gr-assets.com/books/1361039443l/41865.jpg,https://images.gr-assets.com/books/1361039443s/41865.jpg
3,4,To Kill a Mockingbird,-,To Kill a Mockingbird,Harper Lee,Classics|Fiction|Historical|Historical Fiction|Academic|School,0.231059,4.27,4.0,3745197,...,3.0,487,15.0,2657,9780060000000.0,61120081,9780061000000.0,https://images.gr-assets.com/books/1361975680m/2657.jpg,https://images.gr-assets.com/books/1361975680l/2657.jpg,https://images.gr-assets.com/books/1361975680s/2657.jpg
4,5,The Great Gatsby,-,The Great Gatsby,F. Scott Fitzgerald,Classics|Fiction|Academic|School|Literature|Historical|Historical Fiction|Romance|Novels|Academic|Read For School|Young Adult|High School|Literature|American,0.049427,3.9,3.5,3141842,...,1.0,1356,15.0,4671,,743273567,9780743000000.0,https://images.gr-assets.com/books/1490528560m/4671.jpg,https://images.gr-assets.com/books/1490528560l/4671.jpg,https://images.gr-assets.com/books/1490528560s/4671.jpg


In [73]:
# Removing all but the two first columns
books = books.iloc[:, :2]
books.head()

Unnamed: 0,book_id,book_title
0,1,The Hunger Games
1,2,Harry Potter and the Philosopher's Stone
2,3,Twilight
3,4,To Kill a Mockingbird
4,5,The Great Gatsby


In [74]:
# Changing names of the columns
column_mapping = {
    'book_id': 'item',
    'book_title': 'title'
}

# Use the rename method to rename columns
books.rename(columns=column_mapping, inplace=True)
books.head()

Unnamed: 0,item,title
0,1,The Hunger Games
1,2,Harry Potter and the Philosopher's Stone
2,3,Twilight
3,4,To Kill a Mockingbird
4,5,The Great Gatsby


In [75]:
# Merging datasets
ratings = ratings.merge(books)
ratings

Unnamed: 0,user,item,rating,title
0,1,258,5,La sombra del viento
1,11,258,3,La sombra del viento
2,143,258,4,La sombra del viento
3,242,258,5,La sombra del viento
4,325,258,4,La sombra del viento
...,...,...,...,...
988700,9379,1001,3,Het diner
988701,11804,8778,2,Tyskungen
988702,12377,7192,5,Ruthless
988703,5978,7229,4,Dead End in Norvelt


In [76]:
# Creating dataloader with the first 25000 rows
dls = CollabDataLoaders.from_df(ratings.iloc[:25000], item_name='title', bs=64)
dls.show_batch()


Unnamed: 0,user,title,rating
0,6988,The Da Vinci Code,3
1,8752,The Sun Also Rises,5
2,15119,Harry Potter and the Half-Blood Prince,5
3,1525,The Da Vinci Code,4
4,12795,Ender's Game,3
5,13006,The Sun Also Rises,5
6,9433,Memoirs of a Geisha,4
7,7217,Harry Potter and the Half-Blood Prince,4
8,7033,Ender's Game,1
9,512,Ender's Game,5


In [77]:
n_users  = len(dls.classes['user'])
n_books = len(dls.classes['title'])
n_factors = 50
n_users

8155

In [78]:
n_books

19
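Only 19 title classes appear even though 25,000 ratings were loaded. One plausible cause is that the merged table is grouped by book (its first rows all share one title), so the first rows cover few books. The sketch below, on made-up data, contrasts taking the first rows with drawing a random sample; whether sampling suits the notebook is an assumption, not something the original does.

```python
import pandas as pd

# Toy table grouped by title, like the merged ratings.
toy = pd.DataFrame({
    "user": list(range(12)),
    "title": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
})

# The first rows all belong to the same title.
head_titles = toy.iloc[:4]["title"].nunique()

# A random sample of 9 of the 12 rows must touch all three titles,
# because any two titles together only account for 8 rows.
sample_titles = toy.sample(n=9, random_state=0)["title"].nunique()

print(head_titles, sample_titles)
```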

In [79]:
# DotProduct Function
class DotProduct(Module):
    def __init__(self, n_users, n_books, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.books_factors = Embedding(n_books, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        books = self.books_factors(x[:,1])
        return (users * books).sum(dim=1)
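As a rough illustration of what `forward` computes, the sketch below repeats the lookup and dot product in plain NumPy. The arrays stand in for the `Embedding` layers, and all sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_books, n_factors = 4, 3, 5

# Stand-ins for the two embedding tables: one latent vector per user and per book.
user_factors = rng.normal(size=(n_users, n_factors))
book_factors = rng.normal(size=(n_books, n_factors))

# Each row is a (user index, book index) pair, like the batches the model receives.
batch = np.array([[0, 2], [3, 1]])

users = user_factors[batch[:, 0]]   # look up one vector per user in the batch
books = book_factors[batch[:, 1]]   # look up one vector per book in the batch

# Element-wise product summed over the factor axis = one dot product per pair.
scores = (users * books).sum(axis=1)
print(scores)
```

A high score means the user's and the book's latent vectors point in similar directions.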

In [80]:
model = DotProduct(n_users, n_books, 50)
learn_dot = Learner(dls, model, loss_func=MSELossFlat())
learn_dot.fit_one_cycle(5, 5e-3)


epoch,train_loss,valid_loss,time
0,13.090444,9.990256,00:02
1,3.068362,3.433342,00:01
2,1.401435,3.049355,00:01
3,0.760177,2.823254,00:01
4,0.467484,2.802221,00:02


In [81]:
# Final DotProduct
class DotProduct(Module):
    def __init__(self, n_users, n_books, n_factors, y_range=(0, 5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.books_factors = Embedding(n_books, n_factors)
        self.books_bias = Embedding(n_books, 1)
        self.y_range = y_range
    
    def forward(self, x):
        users = self.user_factors(x[:,0])
        books = self.books_factors(x[:,1])
        res = (users * books).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.books_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
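The final line squashes the raw score into the rating range. The sketch below is a NumPy stand-in for that step, assuming `sigmoid_range` scales a sigmoid to the interval `(low, high)`; the helper name `sigmoid_range_np` is hypothetical.

```python
import numpy as np

def sigmoid_range_np(x, low, high):
    """Map raw scores into the open interval (low, high) with a scaled sigmoid."""
    return 1.0 / (1.0 + np.exp(-x)) * (high - low) + low

raw = np.array([-10.0, 0.0, 10.0])
bounded = sigmoid_range_np(raw, 0, 5.5)
print(bounded)
```

Because a sigmoid never reaches its upper limit, setting the limit to 5.5 rather than 5 lets the model actually predict a rating of 5.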

In [82]:
model = DotProduct(n_users, n_books, 50)
learn_dot = Learner(dls, model, loss_func=MSELossFlat())
learn_dot.fit_one_cycle(5, 10e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,1.155485,0.946477,00:02
1,0.641791,0.837752,00:02
2,0.235273,0.802486,00:02
3,0.086735,0.798092,00:02
4,0.031132,0.796847,00:02


## Collaborative Filter

In [83]:
embs = get_emb_sz(dls)
embs

[(8155, 248), (19, 8)]
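The embedding widths above are consistent with a simple heuristic based on the number of classes. The sketch below assumes the rule `min(600, round(1.6 * n ** 0.56))`; the function here is a standalone re-implementation for illustration, not a call into fastai.

```python
def emb_sz_rule(n_cat):
    """Heuristic embedding width: grows slowly with cardinality, capped at 600."""
    return min(600, round(1.6 * n_cat ** 0.56))

# Class counts taken from the notebook output (users, titles).
sizes = [(n, emb_sz_rule(n)) for n in (8155, 19)]
print(sizes)
```

Under this assumption, 8155 users map to 248 factors and 19 titles map to 8, matching the output of `get_emb_sz(dls)`.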

In [84]:
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

In [85]:
# Creating the model for collaborative filtering, using the embedding sizes from embs
model = CollabNN(*embs)

# Creating the Learner, which handles the model's training
# MSELossFlat() computes the mean squared error between predictions and the actual ratings
learn_deep = Learner(dls, model, loss_func=MSELossFlat())

# Calling fit_one_cycle to train the model
# Parameters: number of epochs, maximum learning rate, and weight decay for regularization
learn_deep.fit_one_cycle(5, 0.005, wd=0.01)

epoch,train_loss,valid_loss,time
0,0.90074,0.892387,00:06
1,0.802449,0.904912,00:06
2,0.619268,0.903931,00:06
3,0.400403,0.937394,00:06
4,0.277477,0.980153,00:06


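With this, the models are trained. For `CollabNN`, the training loss falls steadily (0.901 to 0.277), but the validation loss rises from 0.892 to 0.980 over the five epochs, which suggests overfitting. The dot-product model with biases ends with the lowest validation loss of the three runs (0.797).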
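Since `MSELossFlat` reports squared error, taking the square root expresses the final validation losses in rating points. The sketch below uses the last-epoch values from the tables above; the dictionary labels are just descriptive names.

```python
import math

# Final-epoch validation losses (MSE) copied from the training tables.
valid_mse = {"dot product + bias": 0.796847, "CollabNN": 0.980153}

# RMSE is in the same unit as the ratings themselves.
valid_rmse = {name: math.sqrt(mse) for name, mse in valid_mse.items()}
best = min(valid_rmse, key=valid_rmse.get)

for name, rmse in valid_rmse.items():
    print(f"{name}: RMSE = {rmse:.3f}")
print("lowest validation error:", best)
```

Both models are off by a little under one star on average on the validation set.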

## Inference

With the models trained, we can inspect what they learned. The book bias in the dot-product model captures how a book tends to be rated regardless of who rates it. Here are the five books with the lowest bias values:

In [57]:
books_bias_dot = learn_dot.model.books_bias.weight.squeeze()
idxs = books_bias_dot.argsort()[:5]
[dls.classes['title'][i] for i in idxs]

['Congo',
 'Who Moved My Cheese?',
 'The Lost World',
 "The Memory Keeper's Daughter",
 'The Da Vinci Code']

And the books with the highest bias value:

In [61]:
books_bias_dot = learn_dot.model.books_bias.weight.squeeze()
idxs = books_bias_dot.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]

['#na#',
 'Das Drama des begabten Kindes und die Suche nach dem wahren Selbst: eine Um- und Fortschreibung',
 'Harry Potter Collection (Harry Potter, #1-6)',
 'The Hobbit and The Lord of the Rings',
 'Harry Potter and the Half-Blood Prince']

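These lists show which books the model expects to receive low and high ratings, respectively, independent of a particular user's taste. Note that `#na#` is fastai's placeholder class for unknown titles, not an actual book.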
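The ranking step itself is just an `argsort` over the bias vector. The sketch below shows it in NumPy with made-up titles and bias values.

```python
import numpy as np

titles = np.array(["Book A", "Book B", "Book C", "Book D"])
bias = np.array([0.3, -0.5, 0.9, 0.1])

# argsort returns indices in ascending order of bias.
lowest = titles[np.argsort(bias)[:2]].tolist()

# Negating the values turns the same call into a descending sort.
highest = titles[np.argsort(-bias)[:2]].tolist()

print(lowest, highest)
```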

## Exporting

Now we can save the model for future use. Serving recommendations also requires some of the data, so we set aside the validation set alongside the model:

In [62]:
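# Keeping the validation rows, which are needed later to build recommendations
valid = dls.valid_ds
export = pd.DataFrame(valid.items)

# Save the trained model; `model` here is the CollabNN instance created above
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)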
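A saved pickle can be checked with a quick round trip. The sketch below uses a plain dictionary as a stand-in for the model and a temporary directory, both chosen only for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; in the notebook this would be the trained model.
payload = {"name": "toy-model", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"

    # Write the object to disk.
    with open(path, "wb") as f:
        pickle.dump(payload, f)

    # Read it back and compare with the original.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```

When unpickling a real model, the class definition (here `CollabNN`) must be importable in the loading process, because pickle stores a reference to the class rather than its code.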