# Book Recommendation System for the Goodreads Dataset


## 1. Introduction
In this project, we build a hybrid book recommender system on the Goodreads dataset. It combines a content-based recommender with a collaborative filtering (CF) recommender. In the content-based recommender, we use a sentence transformer to embed each book description, then an autoencoder to reduce the dimensionality of those embeddings, and finally cosine similarity between the reduced embeddings to find similar books. In the CF recommender, we learn book and user embeddings with a deep neural network and compute the cosine similarity between the book embeddings. The hybrid recommender averages the cosine similarity matrices of the content-based and CF recommenders.
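
As a rough illustration of the final averaging step, the sketch below computes two cosine similarity matrices and averages them with equal weight. The toy embeddings and the helper name `cosine_similarity_matrix` are assumptions for illustration only; the notebook itself uses scikit-learn's `cosine_similarity` on the learned embeddings.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Toy stand-ins for the two embedding tables (3 books each); values are made up.
content_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cf_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

cb_sim = cosine_similarity_matrix(content_emb)
cf_sim = cosine_similarity_matrix(cf_emb)
hybrid_sim = (cb_sim + cf_sim) / 2  # equal-weight average of the two views
```

Books 0 and 1 are orthogonal in the content view but identical in the CF view, so their hybrid similarity lands halfway between the two.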

## 2. Data processing
The Goodreads dataset includes a book dataset and a user rating dataset. We import both.

In [18]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
%matplotlib inline
sns.set_style('whitegrid')

In [19]:
import os
dataset_path = '../datasets/goodreads'
books_pd = pd.DataFrame(columns = pd.read_csv(os.path.join(dataset_path,'book1000k-1100k.csv')).columns)
print(books_pd)
users_pd = pd.DataFrame(columns = pd.read_csv(os.path.join(dataset_path,'user_rating_0_to_1000.csv')).columns)
print(users_pd)

Empty DataFrame
Columns: [Id, Name, Authors, ISBN, Rating, PublishYear, PublishMonth, PublishDay, Publisher, RatingDist5, RatingDist4, RatingDist3, RatingDist2, RatingDist1, RatingDistTotal, CountsOfReview, Language, pagesNumber, Description, Count of text reviews]
Index: []
Empty DataFrame
Columns: [ID, Name, Rating]
Index: []


In [20]:
import fnmatch
for dirname, _, filenames in os.walk(dataset_path):
    for filename in filenames:
        # Join with dirname so files in subdirectories resolve correctly
        file_path = os.path.join(dirname, filename)
        print(file_path)
        if fnmatch.fnmatch(filename, 'book*.csv'):
            books_pd = pd.concat([books_pd, pd.read_csv(file_path)])
        if fnmatch.fnmatch(filename, 'user*.csv'):
            users_pd = pd.concat([users_pd, pd.read_csv(file_path)])
books_pd.info(show_counts=True)

../datasets/goodreads/book600k-700k.csv
../datasets/goodreads/book1600k-1700k.csv
../datasets/goodreads/book200k-300k.csv
../datasets/goodreads/user_rating_0_to_1000.csv
../datasets/goodreads/book100k-200k.csv
../datasets/goodreads/book500k-600k.csv
../datasets/goodreads/book1-100k.csv
../datasets/goodreads/book1300k-1400k.csv
../datasets/goodreads/book800k-900k.csv
../datasets/goodreads/user_rating_6000_to_11000.csv
../datasets/goodreads/book1700k-1800k.csv
../datasets/goodreads/user_rating_4000_to_5000.csv
../datasets/goodreads/book1400k-1500k.csv
../datasets/goodreads/book1000k-1100k.csv
../datasets/goodreads/book2000k-3000k.csv
../datasets/goodreads/book400k-500k.csv
../datasets/goodreads/archive.zip
../datasets/goodreads/book900k-1000k.csv
../datasets/goodreads/user_rating_3000_to_4000.csv
../datasets/goodreads/book4000k-5000k.csv
../datasets/goodreads/book1100k-1200k.csv
../datasets/goodreads/user_rating_5000_to_6000.csv
../datasets/goodreads/book1800k-1900k.csv
../datasets/goodr

#### Because the concatenated user rating data contains repeated index values, we reset the index.

In [21]:
users_pd.reset_index(drop=True, inplace=True)
users_pd.index.value_counts()

0         1
241760    1
241736    1
241735    1
241734    1
         ..
120863    1
120862    1
120861    1
120860    1
362595    1
Length: 362596, dtype: int64

#### Drop the duplicated books

In [22]:
books_pd.drop_duplicates(inplace = True)
books_pd.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1850198 entries, 0 to 54272
Data columns (total 21 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Id                     1850198 non-null  object 
 1   Name                   1850198 non-null  object 
 2   Authors                1850198 non-null  object 
 3   ISBN                   1844276 non-null  object 
 4   Rating                 1850198 non-null  float64
 5   PublishYear            1850198 non-null  object 
 6   PublishMonth           1850198 non-null  object 
 7   PublishDay             1850198 non-null  object 
 8   Publisher              1832375 non-null  object 
 9   RatingDist5            1850198 non-null  object 
 10  RatingDist4            1850198 non-null  object 
 11  RatingDist3            1850198 non-null  object 
 12  RatingDist2            1850198 non-null  object 
 13  RatingDist1            1850198 non-null  object 
 14  RatingDistTotal     

#### Drop the books with a publication year before 1900 or after 2021

In [23]:
books_pd.drop((books_pd[(books_pd['PublishYear'] < 1900) | (books_pd['PublishYear'] > 2021)].index).tolist(), inplace = True)
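
One thing to watch for here: `books_pd` was concatenated from many files and its index was not reset (the `info()` output above reports an index of `0 to 54272` for about 1.85 million rows), so index labels repeat. Dropping by label then removes every row that shares a label with an offending row. The sketch below reproduces the effect on made-up toy data and shows a boolean mask as one possible alternative; it is an illustration, not a change to the results reported in this notebook.

```python
import pandas as pd

# Two toy "files" concatenated without resetting the index: labels 0 and 1 repeat.
part_a = pd.DataFrame({"Name": ["A", "B"], "PublishYear": [1850, 2001]})
part_b = pd.DataFrame({"Name": ["C", "D"], "PublishYear": [1999, 2010]})
books = pd.concat([part_a, part_b])

out_of_range = (books["PublishYear"] < 1900) | (books["PublishYear"] > 2021)

# Dropping by label removes every row that shares label 0, including book "C".
dropped_by_label = books.drop(books[out_of_range].index)

# A boolean mask (or a prior reset_index) removes only the offending row.
kept_by_mask = books[~out_of_range]
```

Calling `reset_index(drop=True)` right after the concatenation would make the label-based drop behave the same as the mask.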


#### Drop the data with no publisher info

In [24]:
books_pd.drop((books_pd[books_pd['Publisher'].isnull()].index).tolist(), inplace=True)
books_pd.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1631813 entries, 0 to 54272
Data columns (total 21 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Id                     1631813 non-null  object 
 1   Name                   1631813 non-null  object 
 2   Authors                1631813 non-null  object 
 3   ISBN                   1626743 non-null  object 
 4   Rating                 1631813 non-null  float64
 5   PublishYear            1631813 non-null  object 
 6   PublishMonth           1631813 non-null  object 
 7   PublishDay             1631813 non-null  object 
 8   Publisher              1631813 non-null  object 
 9   RatingDist5            1631813 non-null  object 
 10  RatingDist4            1631813 non-null  object 
 11  RatingDist3            1631813 non-null  object 
 12  RatingDist2            1631813 non-null  object 
 13  RatingDist1            1631813 non-null  object 
 14  RatingDistTotal     

#### The rating distribution columns store the count after a colon; extract the count and convert it to an integer

In [25]:
books_pd['RatingDistTotal'] = books_pd['RatingDistTotal'].apply(lambda rating: rating.split(':')[1]).astype('int')
books_pd['RatingDist1'] = books_pd['RatingDist1'].apply(lambda rating: rating.split(':')[1]).astype('int')
books_pd['RatingDist2'] = books_pd['RatingDist2'].apply(lambda rating: rating.split(':')[1]).astype('int')
books_pd['RatingDist3'] = books_pd['RatingDist3'].apply(lambda rating: rating.split(':')[1]).astype('int')
books_pd['RatingDist4'] = books_pd['RatingDist4'].apply(lambda rating: rating.split(':')[1]).astype('int')
books_pd['RatingDist5'] = books_pd['RatingDist5'].apply(lambda rating: rating.split(':')[1]).astype('int')
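
The six near-identical lines above can be expressed with one small helper. The following is a minimal sketch; the helper name `parse_rating_count` and the sample strings are hypothetical, and it assumes every value has the form `label:count`.

```python
def parse_rating_count(raw: str) -> int:
    """Return the integer after the first colon, e.g. '5:120' -> 120."""
    _, _, count = raw.partition(":")
    return int(count)

# Hypothetical sample values; the real columns are RatingDist1..5 and RatingDistTotal.
samples = ["5:120", "total:340"]
counts = [parse_rating_count(s) for s in samples]
```

On the real frame this could be applied in a loop over the column names, for example `books_pd[col] = books_pd[col].map(parse_rating_count)` for each rating distribution column.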

#### Replace 'en-US', 'en-GB', and 'en-CA' with 'eng'

In [26]:
books_pd['Language'] = books_pd['Language'].str.replace('en-US', 'eng')
books_pd['Language'] = books_pd['Language'].str.replace('en-GB', 'eng')
books_pd['Language'] = books_pd['Language'].str.replace('en-CA', 'eng')

In [27]:
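# Treat the '--' placeholder as a missing language (np.nan; the np.NaN alias was removed in NumPy 2.0)
books_pd['Language'] = books_pd['Language'].replace('--', np.nan)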

#### Drop null descriptions

In [28]:
books_pd.dropna(subset=['Description'], inplace=True)

#### Use regex to clean up the descriptions, as some of them contain embedded HTML tags such as `<br>`

In [29]:
import re
# Compile once only: matches HTML tags and character entities (named, decimal, and hex)
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext
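
Note that the pattern above deletes character entities outright, so `&amp;` disappears rather than becoming `&`. If decoding is preferred, a standard-library alternative could look like the sketch below. The function name `clean_description` is hypothetical and this is not the cleaning used for the results in this notebook.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def clean_description(raw_html: str) -> str:
    """Strip tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw_html)   # replace tags with a space so words stay separated
    decoded = html.unescape(no_tags)      # turn entities such as &amp; into characters
    return " ".join(decoded.split())      # collapse runs of whitespace

example = clean_description("Tom &amp; Jerry<br>go home")
```

Replacing tags with a space instead of an empty string also avoids gluing together sentences that were only separated by a `<br>`.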

#### Remove the HTML tags

In [30]:
books_pd['Description'] = books_pd.Description.apply(cleanhtml)
books_pd['Description'].sample(10)

38000     Maybe it was a moment of insanity. Maybe it wa...
295931    Christmas on a psychiatric ward can be downrig...
138166    In 1880 the Canadian Pacific Railway was born ...
269105    Almasy's photographic work consists of two gre...
36779     This work covers the structure, components, an...
216963    Award-winning director and screenwriter, actor...
130626    Ted Rueter panders to Democratic party lines b...
46238     Finding herself struggling with depression ("l...
39406     Written over a thirty-year period, the essays ...
23890     Phyllisia eventually recognizes that her own s...
Name: Description, dtype: object

#### Drop books not in English. Only recommend English books

In [31]:
books_eng = books_pd[books_pd['Language']=='eng']
books_eng.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105284 entries, 0 to 54253
Data columns (total 21 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Id                     105284 non-null  object 
 1   Name                   105284 non-null  object 
 2   Authors                105284 non-null  object 
 3   ISBN                   104439 non-null  object 
 4   Rating                 105284 non-null  float64
 5   PublishYear            105284 non-null  object 
 6   PublishMonth           105284 non-null  object 
 7   PublishDay             105284 non-null  object 
 8   Publisher              105284 non-null  object 
 9   RatingDist5            105284 non-null  int64  
 10  RatingDist4            105284 non-null  int64  
 11  RatingDist3            105284 non-null  int64  
 12  RatingDist2            105284 non-null  int64  
 13  RatingDist1            105284 non-null  int64  
 14  RatingDistTotal        105284 non-nul

### Drop books with a duplicate name and author

In [32]:
books_eng.drop_duplicates(subset=["Authors", "Name"], inplace=True)
books_eng.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98735 entries, 0 to 54253
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Id                     98735 non-null  object 
 1   Name                   98735 non-null  object 
 2   Authors                98735 non-null  object 
 3   ISBN                   97970 non-null  object 
 4   Rating                 98735 non-null  float64
 5   PublishYear            98735 non-null  object 
 6   PublishMonth           98735 non-null  object 
 7   PublishDay             98735 non-null  object 
 8   Publisher              98735 non-null  object 
 9   RatingDist5            98735 non-null  int64  
 10  RatingDist4            98735 non-null  int64  
 11  RatingDist3            98735 non-null  int64  
 12  RatingDist2            98735 non-null  int64  
 13  RatingDist1            98735 non-null  int64  
 14  RatingDistTotal        98735 non-null  int64  
 15  Co

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)


### Drop some columns not used for this task

In [33]:
books_eng.drop(columns=["PagesNumber", "CountsOfReview", "Count of text reviews","pagesNumber"], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


### Data processing for the user rating file

In [36]:
# drop the user without any rating
no_rating_df = users_pd[users_pd['Rating']=='This user doesn\'t have any rating']
no_rating_df['Name'].value_counts()

Rating    4765
Name: Name, dtype: int64

In [37]:
users_pd = users_pd.drop(no_rating_df.index.tolist())
users_pd['Rating'].value_counts()

really liked it    132808
liked it            96047
it was amazing      92354
it was ok           28811
did not like it      7811
Name: Rating, dtype: int64

### Merge the book table and user rating table

In [39]:
book_user = users_pd.merge(books_eng, on='Name',how='inner')
book_user.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 126824 entries, 0 to 126823
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   ID               126824 non-null  object 
 1   Name             126824 non-null  object 
 2   Rating_x         126824 non-null  object 
 3   Id               126824 non-null  object 
 4   Authors          126824 non-null  object 
 5   ISBN             122533 non-null  object 
 6   Rating_y         126824 non-null  float64
 7   PublishYear      126824 non-null  object 
 8   PublishMonth     126824 non-null  object 
 9   PublishDay       126824 non-null  object 
 10  Publisher        126824 non-null  object 
 11  RatingDist5      126824 non-null  int64  
 12  RatingDist4      126824 non-null  int64  
 13  RatingDist3      126824 non-null  int64  
 14  RatingDist2      126824 non-null  int64  
 15  RatingDist1      126824 non-null  int64  
 16  RatingDistTotal  126824 non-null  int6

In [41]:
book_user['Name'].value_counts()

The Great Gatsby                                                                       2655
Pride and Prejudice                                                                    1160
The Catcher in the Rye                                                                  985
The Da Vinci Code (Robert Langdon, #2)                                                  846
To Kill a Mockingbird                                                                   830
                                                                                       ... 
Dead at Daybreak                                                                          1
The Fatal Shore: History of the Transportation of Convicts to Australia 1787 - 1868       1
The Fahrenheit Twins                                                                      1
Bad Debts (Jack Irish, #1)                                                                1
The Horizontal World: Growing Up Wild in the Middle of Nowhere                  

In [42]:
#Encode each book with Book_Id
book_map = book_user[['Name']]
book_map.drop_duplicates(subset=['Name'],keep='first',inplace=True)
book_map.reset_index(drop=True,inplace=True)
book_map.info()
book_map['Book_Id']=book_map.index.values
print(book_map.index.values)
book_user_wid = pd.merge(book_user,book_map, on=['Name'], how='left')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12292 entries, 0 to 12291
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    12292 non-null  object
dtypes: object(1)
memory usage: 96.2+ KB
[    0     1     2 ... 12289 12290 12291]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  book_map['Book_Id']=book_map.index.values
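
The same first-appearance encoding can be obtained without the intermediate `book_map` slice, which also sidesteps the copy-of-a-slice warnings shown above. This is a minimal sketch on made-up titles, assuming `pd.factorize` is acceptable as a replacement for the drop-duplicates-then-index approach.

```python
import pandas as pd

ratings = pd.DataFrame({"Name": ["Siddhartha", "Dune", "Siddhartha", "Emma"]})

# factorize assigns 0, 1, 2, ... in order of first appearance.
codes, unique_names = pd.factorize(ratings["Name"])
ratings["Book_Id"] = codes
```

`unique_names` plays the role of the lookup table from `Book_Id` back to the title.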


In [43]:
book_user_wid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 126824 entries, 0 to 126823
Data columns (total 20 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   ID               126824 non-null  object 
 1   Name             126824 non-null  object 
 2   Rating_x         126824 non-null  object 
 3   Id               126824 non-null  object 
 4   Authors          126824 non-null  object 
 5   ISBN             122533 non-null  object 
 6   Rating_y         126824 non-null  float64
 7   PublishYear      126824 non-null  object 
 8   PublishMonth     126824 non-null  object 
 9   PublishDay       126824 non-null  object 
 10  Publisher        126824 non-null  object 
 11  RatingDist5      126824 non-null  int64  
 12  RatingDist4      126824 non-null  int64  
 13  RatingDist3      126824 non-null  int64  
 14  RatingDist2      126824 non-null  int64  
 15  RatingDist1      126824 non-null  int64  
 16  RatingDistTotal  126824 non-null  int6

In [44]:
# Select columns
book_user_wid.rename(columns={'ID':'User_ID'}, inplace=True)
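# Select by column name rather than by position so the selection survives column reordering
book_user_sel = book_user_wid[['User_ID', 'Name', 'Rating_x', 'Authors', 'Description', 'Book_Id']]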
book_user_sel

Unnamed: 0,User_ID,Name,Rating_x,Authors,Description,Book_Id
0,1,The Restaurant at the End of the Universe (Hit...,it was amazing,Douglas Adams,Just when you thought it was safe to go back t...,0
1,73,The Restaurant at the End of the Universe (Hit...,really liked it,Douglas Adams,Just when you thought it was safe to go back t...,0
2,116,The Restaurant at the End of the Universe (Hit...,it was amazing,Douglas Adams,Just when you thought it was safe to go back t...,0
3,171,The Restaurant at the End of the Universe (Hit...,really liked it,Douglas Adams,Just when you thought it was safe to go back t...,0
4,338,The Restaurant at the End of the Universe (Hit...,liked it,Douglas Adams,Just when you thought it was safe to go back t...,0
...,...,...,...,...,...,...
126819,3166,The Bible Salesman,really liked it,Clyde Edgerton,When career criminal Preston Clearwater picks ...,12287
126820,3166,My Side Of The Story,liked it,Will Davis,"My name is Jarold, but everyone calls me Jaz, ...",12288
126821,3166,Life as We Know It: A Collection of Personal E...,liked it,Jennifer Foote Sweeney,"""...these essays are jewels of the unexpected,...",12289
126822,3166,"Hello, I Must Be Going",really liked it,Christie Hodgen,"It's the early 1980s, and tomboy Frankie Hawth...",12290


In [45]:
#Extract book information
book_df = book_user_sel[['Name','Authors','Description', 'Book_Id']].drop_duplicates(subset=['Name'],keep='first')
book_df

Unnamed: 0,Name,Authors,Description,Book_Id
0,The Restaurant at the End of the Universe (Hit...,Douglas Adams,Just when you thought it was safe to go back t...,0
48,Siddhartha,Hermann Hesse,"In the novel, Siddhartha, a young man, leaves ...",1
311,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,2
771,The Authoritative Calvin and Hobbes: A Calvin ...,Bill Watterson,"The Authoritative Calvin and Hobbes, is a larg...",3
783,The Name of the Rose,Umberto Eco,Librarian note: Older edition of 9780099466031...,4
...,...,...,...,...
126819,The Bible Salesman,Clyde Edgerton,When career criminal Preston Clearwater picks ...,12287
126820,My Side Of The Story,Will Davis,"My name is Jarold, but everyone calls me Jaz, ...",12288
126821,Life as We Know It: A Collection of Personal E...,Jennifer Foote Sweeney,"""...these essays are jewels of the unexpected,...",12289
126822,"Hello, I Must Be Going",Christie Hodgen,"It's the early 1980s, and tomboy Frankie Hawth...",12290


In [46]:

torch.cuda.set_device(1)

## 3. Build Hybrid Recommender
### Build the content-based similarity using a sentence transformer

In [47]:
#Use sentence transformer for embedding instead of TF-IDF
from sentence_transformers import SentenceTransformer



In [48]:
# Download model
model = SentenceTransformer('all-distilroberta-v1')

In [49]:

book_transformer_embedding = model.encode(book_df['Description'].values.tolist(), show_progress_bar=True)


In [89]:
book_transformer_embedding.shape

(12292, 768)


### Build an autoencoder to reduce the dimensionality of the transformer book embeddings

In [91]:
##Deep Learning specific stuff
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense , Concatenate
from tensorflow.keras.optimizers import Adam,SGD,Adagrad,Adadelta,RMSprop
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.utils import model_to_dot
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.layers import Dropout, Flatten,Activation,Input,Embedding
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from tensorflow.keras.layers import dot
from tensorflow.keras.models import Model

In [92]:
#build autoencoder model
def build_autoencoder(input_dim, middle_dim, latent_factors, drop_out): 
    book_input=Input(shape=(input_dim,),name='book_input',dtype=tf.float64)
    #encoder
    dense1out = Dense(middle_dim, activation='relu')(book_input)
    bat1out = BatchNormalization()(dense1out)
    drop1out = Dropout(drop_out)(bat1out)
    encout = Dense(latent_factors, activation='relu',name='embedding')(drop1out)
    #decoder
    dense3out = Dense(middle_dim, activation='relu')(encout)
    bat2out = BatchNormalization()(dense3out)
    drop2out = Dropout(drop_out)(bat2out)
    dense4out = Dense(input_dim)(drop2out)
    bat3out = BatchNormalization()(dense4out)
    decout = tf.keras.activations.sigmoid(bat3out)


    autoencoder =Model(book_input,decout)
    return autoencoder
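
The decoder above ends in a sigmoid, so its outputs lie strictly between 0 and 1. If the sentence embeddings contain values outside that range (negative entries are common for sentence embeddings, though this was not checked here), those values cannot be reconstructed exactly. One possible remedy, shown as a sketch on made-up numbers, is to min-max scale each feature before training; the helper `minmax_scale` is hypothetical and not part of the notebook.

```python
import numpy as np

def minmax_scale(features: np.ndarray):
    """Rescale each column to [0, 1]; also return the min and range needed to invert."""
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    col_range = np.where(col_range == 0, 1.0, col_range)  # avoid division by zero
    return (features - col_min) / col_range, col_min, col_range

# Toy embeddings with negative entries (values are made up).
emb = np.array([[-0.5, 0.2], [0.1, -0.3], [0.4, 0.0]])
frac_outside = np.mean((emb < 0) | (emb > 1))  # share of values outside [0, 1]
scaled, col_min, col_range = minmax_scale(emb)
restored = scaled * col_range + col_min        # inverse transform
```

An alternative with the same effect would be to drop the final sigmoid and let the decoder output be linear.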

In [93]:

x_train, x_test,y_train, y_test = train_test_split(book_transformer_embedding, book_transformer_embedding, test_size=0.1, random_state=2022)

In [94]:
autoenc2 = build_autoencoder(x_train.shape[1], 300, 16, 0.2)
autoenc2.summary()

Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
book_input (InputLayer)      [(None, 768)]             0         
_________________________________________________________________
dense_12 (Dense)             (None, 300)               230700    
_________________________________________________________________
batch_normalization_10 (Batc (None, 300)               1200      
_________________________________________________________________
dropout_8 (Dropout)          (None, 300)               0         
_________________________________________________________________
embedding (Dense)            (None, 16)                4816      
_________________________________________________________________
dense_13 (Dense)             (None, 300)               5100      
_________________________________________________________________
batch_normalization_11 (Batc (None, 300)               1200

In [95]:
loss_fun = tf.keras.losses.MeanSquaredError()
autoenc2.compile(optimizer=Adam(learning_rate=1e-4),loss=loss_fun)
batch_size=4
epochs=20
History = autoenc2.fit(x_train,y_train, batch_size=batch_size,
                              epochs =epochs, validation_data = (x_test, y_test),
                              verbose = 1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [96]:
#Extract the embedding
extractor2 = Model(inputs=autoenc2.inputs,outputs=autoenc2.get_layer('embedding').output)
book_embedding_trf_reduce = extractor2.predict(book_transformer_embedding, batch_size=4)

In [98]:
book_embedding_trf_reduce.shape

(12292, 16)

### Generate book embeddings from the user ratings with a deep learning collaborative filter

In [65]:
book_map = users_pd[['Name']]
book_map.drop_duplicates(subset=['Name'],keep='first',inplace=True)
book_map.reset_index(drop=True, inplace=True)
book_map['Book_Id']=book_map.index.values
user_rating_temp = pd.merge(users_pd,book_map, on='Name', how='left')
user_rating = user_rating_temp[user_rating_temp['Name']!='Rating'] ##Dropping users who have not rated any books
user_rating.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  book_map['Book_Id']=book_map.index.values


Unnamed: 0,ID,Name,Rating,Book_Id
0,1,Agile Web Development with Rails: A Pragmatic ...,it was amazing,0
1,1,The Restaurant at the End of the Universe (Hit...,it was amazing,1
2,1,Siddhartha,it was amazing,2
3,1,The Clock of the Long Now: Time and Responsibi...,really liked it,3
4,1,"Ready Player One (Ready Player One, #1)",really liked it,4


In [66]:
le = preprocessing.LabelEncoder()
user_rating['Rating_numeric'] = le.fit_transform(user_rating.Rating.values)
user_rating.head()

Unnamed: 0,ID,Name,Rating,Book_Id,Rating_numeric
0,1,Agile Web Development with Rails: A Pragmatic ...,it was amazing,0,1
1,1,The Restaurant at the End of the Universe (Hit...,it was amazing,1,1
2,1,Siddhartha,it was amazing,2,1
3,1,The Clock of the Long Now: Time and Responsibi...,really liked it,3,4
4,1,"Ready Player One (Ready Player One, #1)",really liked it,4,4
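
`LabelEncoder` assigns codes in sorted (alphabetical) order of the labels, which is why "it was amazing" maps to 1 and "really liked it" maps to 4 in the table above. Those codes are not ordered by how much the reader liked the book. If an ordinal target is wanted for the regression loss, an explicit mapping is one option; the sketch below uses the usual Goodreads one-to-five star meaning of each phrase, and the dictionary name is hypothetical.

```python
RATING_TO_SCORE = {
    "did not like it": 1,
    "it was ok": 2,
    "liked it": 3,
    "really liked it": 4,
    "it was amazing": 5,
}

# Alphabetical codes, which is what sorting the labels produces.
alphabetical_codes = {label: i for i, label in enumerate(sorted(RATING_TO_SCORE))}

ratings = ["it was amazing", "really liked it", "it was ok"]
scores = [RATING_TO_SCORE[r] for r in ratings]
```

On the real frame this would be `user_rating['Rating'].map(RATING_TO_SCORE)`. The results reported below were produced with the `LabelEncoder` codes, not with this mapping.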


In [67]:
users = user_rating.ID.unique()
print(users)
books = user_rating.Book_Id.unique()
print(books)
userid2idx = {o:i for i,o in enumerate(users)}
bookid2idx = {o:i for i,o in enumerate(books)}
user_rating['ID'] = user_rating['ID'].apply(lambda x: userid2idx[x])
user_rating['Book_Id'] = user_rating['Book_Id'].apply(lambda x: bookid2idx[x])
user_rating.head()

[1 2 3 ... 2986 3018 3155]
[     0      1      2 ... 103529 103530 103531]


Unnamed: 0,ID,Name,Rating,Book_Id,Rating_numeric
0,0,Agile Web Development with Rails: A Pragmatic ...,it was amazing,0,1
1,0,The Restaurant at the End of the Universe (Hit...,it was amazing,1,1
2,0,Siddhartha,it was amazing,2,1
3,0,The Clock of the Long Now: Time and Responsibi...,really liked it,3,4
4,0,"Ready Player One (Ready Player One, #1)",really liked it,4,4


In [68]:
y=user_rating['Rating_numeric'];
X=user_rating.drop(['Rating_numeric'],axis=1)

In [70]:
def build_model(dropout,latent_factors):
    n_books=len(user_rating['Book_Id'].unique())
    n_users=len(user_rating['ID'].unique())
    n_latent_factors=latent_factors  # hyperparameter to tune
    user_input=Input(shape=(1,),name='user_input',dtype='int64')
    user_embedding=Embedding(n_users,n_latent_factors,name='user_embedding',embeddings_initializer=tf.keras.initializers.GlorotUniform(seed=42))(user_input)
    user_vec =Flatten(name='FlattenUsers')(user_embedding)
    #user_vec=Dropout(dropout)(user_vec)
    book_input=Input(shape=(1,),name='book_input',dtype='int64')
    book_embedding=Embedding(n_books,n_latent_factors,name='book_embedding',embeddings_initializer=tf.keras.initializers.GlorotUniform(seed=42))(book_input)
    book_vec=Flatten(name='FlattenBooks')(book_embedding)
    #book_vec=Dropout(dropout)(book_vec)
    sim = tf.concat([user_vec, book_vec], axis=1)
    #sim=dot([user_vec,book_vec],name='Similarity-Dot-Product',axes=1)
    nn_inp=Dense(256,activation='relu')(sim)
    nn_inp=BatchNormalization()(nn_inp)
    nn_inp=Dropout(dropout)(nn_inp)
    nn_inp=Dense(64,activation='relu')(nn_inp)
    nn_inp=BatchNormalization()(nn_inp)
    nn_inp=Dropout(dropout)(nn_inp)
    nn_inp=Dense(1,activation='relu')(nn_inp)
    nn_model =Model([user_input, book_input],nn_inp)
    return nn_model
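
To make the data flow of the model above concrete: each `Embedding` layer is a lookup table indexed by the integer ID, and the two looked-up vectors are concatenated before the dense layers. The NumPy sketch below mimics only that first stage, with made-up table sizes and random weights; it does not reproduce the Keras model or its training.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_books, dim = 4, 6, 16
user_table = rng.normal(scale=0.05, size=(n_users, dim))  # plays the role of user_embedding
book_table = rng.normal(scale=0.05, size=(n_books, dim))  # plays the role of book_embedding

user_idx = np.array([0, 3, 1])
book_idx = np.array([5, 2, 2])

# Embedding lookup is row indexing; the MLP sees the two vectors side by side.
mlp_input = np.concatenate([user_table[user_idx], book_table[book_idx]], axis=1)
```

Rows 1 and 2 share the same book, so the second half of their input vectors is identical; only the user half differs.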

In [76]:
nn_model_embed = build_model(0.2,16)
nn_model_embed.summary()

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
user_input (InputLayer)         [(None, 1)]          0                                            
__________________________________________________________________________________________________
book_input (InputLayer)         [(None, 1)]          0                                            
__________________________________________________________________________________________________
user_embedding (Embedding)      (None, 1, 16)        66464       user_input[0][0]                 
__________________________________________________________________________________________________
book_embedding (Embedding)      (None, 1, 16)        1656512     book_input[0][0]                 
____________________________________________________________________________________________

In [77]:
loss_fun = tf.keras.losses.MeanSquaredError()
nn_model_embed.compile(optimizer=Adam(learning_rate=5e-5),loss=loss_fun)
batch_size=128
epochs=20
History = nn_model_embed.fit([X.ID,X.Book_Id],y, batch_size=batch_size,
                              epochs =epochs,
                              verbose = 1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [78]:
#Extract the book embedding from the collaborative filtering model
extractor_dl = Model(inputs=nn_model_embed.get_layer('book_input').input,outputs=nn_model_embed.get_layer('book_embedding').output)
extractor_dl.summary()


Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
book_input (InputLayer)      [(None, 1)]               0         
_________________________________________________________________
book_embedding (Embedding)   (None, 1, 16)             1656512   
Total params: 1,656,512
Trainable params: 1,656,512
Non-trainable params: 0
_________________________________________________________________


In [79]:
book_id = np.expand_dims(book_map["Book_Id"].values,axis=1)
book_id.shape

(103532, 1)

In [80]:
book_embedding_dl = extractor_dl.predict(book_id)
book_embedding_dl

array([[[-1.82831902e-02, -1.13010043e-02, -8.36298149e-03, ...,
          1.24472671e-03,  1.28921177e-02,  5.03244018e-03]],

       [[-5.98788029e-03, -1.47160068e-02, -1.34457862e-02, ...,
          2.38346242e-04, -4.95599490e-03,  4.57263459e-03]],

       [[ 1.04435226e-02,  3.79981170e-03, -4.45456943e-03, ...,
         -6.61602302e-04, -7.53294397e-03,  3.02512152e-03]],

       ...,

       [[ 2.97932560e-03, -8.96624941e-03,  1.30520854e-02, ...,
         -9.32689011e-03, -7.58682983e-03, -6.75443700e-03]],

       [[ 8.83808173e-03,  3.27293435e-03, -8.07063188e-03, ...,
          7.22469631e-05,  3.10405088e-03, -1.96278212e-03]],

       [[ 1.94203742e-02, -5.60679706e-03,  1.30649125e-02, ...,
         -7.75150955e-03, -7.70844705e-03, -7.47999933e-04]]],
      dtype=float32)

In [82]:
#select book_embedding_dl for the books in book_df
merge_df = pd.merge(book_df, book_map, how='left', on='Name')
merge_df.head()

Unnamed: 0,Name,Authors,Description,Book_Id_x,Book_Id_y
0,The Restaurant at the End of the Universe (Hit...,Douglas Adams,Just when you thought it was safe to go back t...,0,1
1,Siddhartha,Hermann Hesse,"In the novel, Siddhartha, a young man, leaves ...",1,2
2,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,2,5
3,The Authoritative Calvin and Hobbes: A Calvin ...,Bill Watterson,"The Authoritative Calvin and Hobbes, is a larg...",3,7
4,The Name of the Rose,Umberto Eco,Librarian note: Older edition of 9780099466031...,4,12


In [83]:
book_embedding_dl_sel=np.squeeze(book_embedding_dl[merge_df.Book_Id_y.values,:])
book_embedding_dl_sel.shape


(12292, 16)
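
The selection step above combines fancy indexing with `np.squeeze`. Since `np.squeeze` removes every axis of length 1, selecting a single book would collapse the result to a 1-D vector. Indexing the middle axis explicitly keeps the result 2-D in all cases, as in this small sketch with made-up shapes:

```python
import numpy as np

# Stand-in for the extractor output: 5 books, shape (n_books, 1, dim).
all_book_emb = np.arange(5 * 1 * 3, dtype=np.float32).reshape(5, 1, 3)
selected_ids = np.array([1, 2, 4])  # plays the role of merge_df.Book_Id_y

selected = all_book_emb[selected_ids, 0, :]  # shape (3, 3); no squeeze needed
single = all_book_emb[np.array([2]), 0, :]   # still 2-D: shape (1, 3)
```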

### Generate recommendation list using cosine similarity

In [99]:
#Compute the similarity matrices for both the content-based filter and the DL-based collaborative filter
from sklearn.metrics.pairwise import cosine_similarity
cb_sim_matrix = cosine_similarity(book_embedding_trf_reduce,book_embedding_trf_reduce)
cf_sim_matrix = cosine_similarity(book_embedding_dl_sel,book_embedding_dl_sel)
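
For reference, cosine similarity between rows is the dot product of the rows after scaling each one to unit length. The NumPy sketch below shows that computation on a toy array. The helper name `cosine_similarity_matrix` and the `eps` guard are illustrative assumptions, not part of the notebook, which uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_similarity_matrix(x, eps=1e-12):
    """Pairwise cosine similarity between the rows of a 2-D array (illustrative helper)."""
    x = np.asarray(x, dtype=np.float64)
    # Length of each row vector, kept as a column so it broadcasts over the row.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Scale every row to unit length; eps avoids division by zero for all-zero rows.
    unit = x / np.maximum(norms, eps)
    # Dot products of unit vectors are the cosines of the angles between them.
    return unit @ unit.T

toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

The result is a square, symmetric matrix with ones on the diagonal. With 12,292 books, each matrix holds about 151 million entries, roughly 600 MB in float32, so the two matrices plus their average need a fair amount of memory.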


In [100]:
#hybrid matrix
hybrid_sim_matrix = (cb_sim_matrix + cf_sim_matrix)/2
hybrid_sim_matrix

array([[1.        , 0.23786959, 0.1547549 , ..., 0.24865462, 0.34855962,
        0.34323412],
       [0.23786959, 1.        , 0.5994524 , ..., 0.6069042 , 0.2051981 ,
        0.5673981 ],
       [0.1547549 , 0.5994524 , 1.        , ..., 0.6350025 , 0.18303059,
        0.5492029 ],
       ...,
       [0.24865462, 0.6069042 , 0.6350025 , ..., 1.        , 0.4218791 ,
        0.58473   ],
       [0.34855962, 0.2051981 , 0.18303059, ..., 0.4218791 , 1.        ,
        0.24114972],
       [0.34323412, 0.5673981 , 0.5492029 , ..., 0.58473   , 0.24114972,
        1.        ]], dtype=float32)
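
The plain average above is the special case of a weighted blend with equal weights. The sketch below makes that explicit on toy 2x2 matrices. The function `blend_similarity` and its `alpha` parameter are hypothetical and not part of the notebook; the notebook always uses the equal-weight case.

```python
import numpy as np

def blend_similarity(cb_sim, cf_sim, alpha=0.5):
    """Blend two similarity matrices (hypothetical helper).

    alpha weights the content-based matrix and 1 - alpha the collaborative one;
    alpha = 0.5 reproduces (cb_sim + cf_sim) / 2.
    """
    cb_sim = np.asarray(cb_sim)
    cf_sim = np.asarray(cf_sim)
    # Both matrices must describe the same books in the same row order.
    if cb_sim.shape != cf_sim.shape:
        raise ValueError("similarity matrices must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cb_sim + (1.0 - alpha) * cf_sim

toy_cb = np.array([[1.0, 0.2], [0.2, 1.0]])
toy_cf = np.array([[1.0, 0.6], [0.6, 1.0]])
toy_hybrid = blend_similarity(toy_cb, toy_cf)
print(toy_hybrid)
```

The blend is only meaningful if row `i` refers to the same book in both matrices, which is why the collaborative embeddings were first restricted to the books in `book_df`.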

In [107]:
book_df.reset_index(drop=True, inplace=True)
indices = pd.Series(book_df.index, index=book_df['Name'])

In [108]:
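def get_hybrid_recommendations(title):
    # indices[title] is a scalar for a unique title and a Series when several
    # books share the same title; in that case use the first match
    idx = indices[title]
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    # Pair every book position with its hybrid similarity to the query book
    sim_scores = list(enumerate(hybrid_sim_matrix[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the query book itself) and keep the next 14
    sim_scores = sim_scores[1:15]
    book_indices = [i[0] for i in sim_scores]
    return book_df.iloc[book_indices]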
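
Dropping the first sorted entry assumes the query book ranks first. If another book ties with it at similarity 1.0 (for example a duplicate record), the query itself can remain in the list. A possible alternative, sketched here on a toy matrix, masks the query explicitly before ranking. The helper `top_k_similar` is hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=14):
    """Return positions of the k most similar items to idx, excluding idx itself."""
    scores = np.asarray(sim_matrix[idx], dtype=np.float64).copy()
    # Remove the query explicitly instead of assuming it sorts first.
    scores[idx] = -np.inf
    k = min(k, scores.size - 1)
    # Stable sort on negated scores gives descending order with ties in index order.
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Toy matrix in which items 0 and 3 are exact duplicates (similarity 1.0).
toy_sim = np.array([
    [1.0, 0.9, 0.1, 1.0],
    [0.9, 1.0, 0.3, 0.9],
    [0.1, 0.3, 1.0, 0.1],
    [1.0, 0.9, 0.1, 1.0],
])
print(top_k_similar(toy_sim, 3, k=2))
```

The returned positions could then be passed to `book_df.iloc[...]` in the same way as `book_indices` in the function above.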

In [109]:
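def print_book_info_hybrid(index):
    # Print title, description, and author of the book at this row position
    # in book_df, and return the title so it can be passed to the recommender
    row = book_df.iloc[index]
    title = row["Name"]
    desc = row["Description"]
    author = row["Authors"]
    print("Title:", title, "\nDescription:", desc, "\nAuthor:", author)
    return title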

### Try an example book

In [110]:

title=print_book_info_hybrid(4432)
print(title)

Title: My Cousin, My Gastroenterologist 
Description: Welcome to Mark Leyner's America, where you can order gallium arsenide sushi at a roadside diner, get loaded on a cocktail of growth hormones and anabolic steroids, and support your habit by appearing on TV game shows. Here is fiction the brain can dance to, by one of the funniest and most subversive young writers of this, or any other, decade. 
Author: Mark Leyner
My Cousin, My Gastroenterologist


In [111]:
get_hybrid_recommendations(title)

Unnamed: 0,Name,Authors,Description,Book_Id
6255,Pretty Little Mistakes: A Do-Over Novel,Heather McElhatton,There are hundreds of lives sown inside Pretty...,6255
9866,Granta 101,Jason Crowley,"Reinvigorated and redesigned, Granta has a new...",9866
4680,So You Want to Be a Wizard,Diane Duane,Something stopped Nita's hand as it ran along ...,4680
333,The Blind Side: Evolution of a Game,Michael Lewis,"When we first meet Michael Oher, he is one of ...",333
1027,Roots: The Saga of an American Family,Alex Haley,One of the most important books and television...,1027
11417,Mexico Biography of Power,Enrique Krauze,The concentration of power in the caudillo (le...,11417
2211,Wild Fire,Nelson DeMille,This is an ACE for ISBN13:9780446579674 Wild F...,2211
11178,"Shoots to Kill (A Flower Shop Mystery, #7)",Kate Collins,"Eight years ago, Abby Knight babysat for a pro...",11178
6155,Shoot the Damn Dog: A Memoir of Depression,Sally Brampton,A successful magazine editor and prize-winning...,6155
10699,Inside Out: A Personal History of Pink Floyd,Nick Mason,The definitive history of Pink Floyd by foundi...,10699


#### Examine a few books from the recommendation list

In [112]:
print_book_info_hybrid(6255)

Title: Pretty Little Mistakes: A Do-Over Novel 
Description: There are hundreds of lives sown inside Pretty Little Mistakes, Heather McElhatton's singularly spectacular, breathtakingly unique novel that has more than 150 possible endings. You may end up in an opulent mansion or homeless down by the river; happily married with your own corporation or alone and pecked to death by ducks in London; a Zen master in Japan or morbidly obese in a trailer park.Is it destiny or decision that controls our fate? You can't change your past and start over from scratch in real life—but in Pretty Little Mistakes, you can! But be warned, choose wisely. 
Author: Heather McElhatton


'Pretty Little Mistakes: A Do-Over Novel'

In [113]:
print_book_info_hybrid(9866)

Title: Granta 101 
Description: Reinvigorated and redesigned, Granta has a new editor and a new Web site. But it’s not all change: we will still continue to publish the world’s finest writers of fiction, memoir, and reportage, in an elegant and collectable paperback book. In Granta 101, there is original work from Robert Macfarlane, reporting from a blitzed Beijing ahead of the Olympics, as well as gripping narrative dispatches from Angola, Kenya, and the troubled suburbs of Paris. Highlights include a new opening section, with pieces by Hilary Mantel and Douglas Coupland, fiction from Annie Proulx and Joshua Ferris, brilliant photo essays and a remarkable investigation into the macabre murder of a celebrated London media figure by award-winning writer Tim Lott. 
Author: Jason Crowley


'Granta 101'

In [115]:
print_book_info_hybrid(333)

Title: The Blind Side: Evolution of a Game 
Description: When we first meet Michael Oher, he is one of thirteen children by a mother addicted to crack; he does not know his real name, his father, his birthday, or how to read or write. He takes up football and school after a rich, white, Evangelical family plucks him from the streets. Then two great forces alter Oher: the family's love and the evolution of professional football itself into a game in which the quarterback must be protected at any cost. Our protagonist becomes the priceless package of size, speed, and agility necessary to guard the quarterback's greatest vulnerability: his blind side. 
Author: Michael   Lewis


'The Blind Side: Evolution of a Game'

#### These books seem to share a common thread: each deals with history, tradition, or people's life experiences, viewed from a different perspective

In [116]:
# Another example

title=print_book_info_hybrid(23)
print(title)

Title: The Remains of the Day 
Description: In the summer of 1956, Stevens, a long-serving butler at Darlington Hall, decides to take a motoring trip through the West Country. The six-day excursion becomes a journey into the past of Stevens and England, a past that takes in fascism, two world wars, and an unrealised love between the butler and his housekeeper. 
Author: Kazuo Ishiguro
The Remains of the Day


In [117]:
get_hybrid_recommendations(title)

Unnamed: 0,Name,Authors,Description,Book_Id
4724,Address Unknown,Kathrine Kressmann Taylor,"A rediscovered classic, originally published i...",4724
11923,Mrs. Palfrey at the Claremont,Elizabeth Taylor,"On a rainy Sunday in January, the recently wid...",11923
2057,Lost Island,Phyllis A. Whitney,"Lacey, Elise and Giles. They grew up together ...",2057
3865,Blade Dancer,K.M. Tolan,Emerging from an ancient civil war with only a...,3865
6347,Alexander's Bridge,Willa Cather,“The sun sank rapidly; the silvery light had f...,6347
713,"A Breath of Snow and Ashes (Outlander, #6)",Diana Gabaldon,"Eagerly anticipated by her legions of fans, th...",713
8763,"Truth or Dare (Whispering Springs, #2)",Jayne Ann Krentz,The New York Times bestselling author Jayne An...,8763
2521,Our Man in Havana,Graham Greene,"First published in 1959, Our Man in Havana is ...",2521
1717,The Music of Chance,Paul Auster,Paul Auster fuses Samuel Beckett and The Broth...,1717
10108,The Book of Merlyn: The Unpublished Conclusion...,T.H. White,"""... a personal as well as historical story th...",10108


In [118]:
print_book_info_hybrid(4724)


Title: Address Unknown 
Description: A rediscovered classic, originally published in 1938 --and now an international bestseller.When it first appeared in Story magazine in 1938, Address Unknown became an immediate social phenomenon and literary sensation. Published in book form a year later and banned in Nazi Germany, it garnered high praise in the United States and much of Europe.A series of fictional letters between a Jewish art dealer living in San Francisco and his former business partner, who has returned to Germany, Address Unknown is a haunting tale of enormous and enduring impact. 
Author: Kathrine Kressmann Taylor


'Address Unknown'

In [119]:
print_book_info_hybrid(11923)

Title: Mrs. Palfrey at the Claremont 
Description: On a rainy Sunday in January, the recently widowed Mrs. Palfrey arrives at the Claremont Hotel where she will spend her remaining days. Her fellow residents are magnificently eccentric and endlessly curious, living off crumbs of affection and snippets of gossip. Together, upper lips stiffened, they fight off their twin enemies—boredom and the Grim Reaper. Then one day Mrs. Palfrey strikes up an unexpected friendship with Ludo, a handsome young writer, and learns that even the old can fall in love. 
Author: Elizabeth Taylor


'Mrs. Palfrey at the Claremont'

In [120]:
print_book_info_hybrid(2057)

Title: Lost Island 
Description: Lacey, Elise and Giles. They grew up together on a mist-shrouded island off the Georgia coast. Long ago, and without Giles ever knowing it, Lacey gave birth to his son. But Elise, the beautiful, domineering one, got Giles. She got Lacey's child too, to bring up as her own. Lacey has tried to forget. But in ten years she has not been able to. So she's going back. To see her son. To confront Elise. To exorcise the spell of the island -- and of Giles. Or perhaps to be trapped by them forever. 
Author: Phyllis A. Whitney


'Lost Island'