# Book Recommendation System - Part 2

---

## Overview

In this notebook, we build recommendation systems to predict which books a user would like.


## Contents:

- [Imports](#Imports)
- [Types of Recommendation System](#Types-of-Recommendation-System)
- [Content-Based Filtering](#Content-Based-Filtering)
    - [Building Content Engine](#Building-Content-Engine)
    - [Recommendations](#Recommendations)
- [Collaborative Filtering](#Collaborative-Filtering)
    - [Memory-Based Approach](#Memory-Based-Approach)
    - [Model-Based Approach](#Model-Based-Approach)
        - [Baseline model](#Baseline-model)
        - [Hyperparameters tuning](#Hyperparameters-tuning)
        - [Compute precision@k and recall@k](#Compute-precision@k-and-recall@k)
        - [Predictions](#Predictions)
- [Hybrid Recommendation System](#Hybrid-Recommendation-System)
- [Conclusion](#Conclusion)
- [Future Works](#Future-Works)

## Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time
import recmetrics
import random

from math import sqrt
from scipy import sparse
from bs4 import BeautifulSoup
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity, euclidean_distances
from sklearn.metrics import mean_squared_error

from surprise import (Reader, Dataset,
                      SVD, SVDpp,
                      SlopeOne, NMF, 
                      KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, 
                      BaselineOnly, CoClustering)
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split, KFold
from surprise.accuracy import rmse, mae


In [2]:
books = pd.read_csv('./dataset/books_clean.csv')
ratings = pd.read_csv('./dataset/ratings_clean.csv')

In [3]:
# ratings dataset with title
new_ratings = pd.read_csv('./dataset/new_ratings.csv')

In [4]:
books.head()

Unnamed: 0,authors,average_rating,book_id,genres,cover_image,ratings_count,title,clean_desc,goodreads_link
0,['Suzanne Collins'],4.34,1,"['young-adult', 'fiction', 'fantasy', 'science...",https://images.gr-assets.com/books/1447303603m...,4780653,"The Hunger Games (The Hunger Games, #1)",winning means fame fortune losing means certai...,https://www.goodreads.com/book/show/2767052
1,"['J.K. Rowling', 'Mary GrandPré']",4.44,2,"['fantasy', 'fiction', 'young-adult', 'classics']",https://images.gr-assets.com/books/1474154022m...,4602479,Harry Potter and the Sorcerer's Stone (Harry P...,harry potter life miserable parents dead stuck...,https://www.goodreads.com/book/show/3
2,['Stephenie Meyer'],3.57,3,"['young-adult', 'fantasy', 'romance', 'fiction...",https://images.gr-assets.com/books/1361039443m...,3866839,"Twilight (Twilight, #1)",three things absolutely positive first edward ...,https://www.goodreads.com/book/show/41865
3,['Harper Lee'],4.25,4,"['classics', 'fiction', 'historical-fiction', ...",https://images.gr-assets.com/books/1361975680m...,3198671,To Kill a Mockingbird,unforgettable novel childhood sleepy southern ...,https://www.goodreads.com/book/show/2657
4,['F. Scott Fitzgerald'],3.89,5,"['classics', 'fiction', 'historical-fiction', ...",https://images.gr-assets.com/books/1490528560m...,2683664,The Great Gatsby,great gatsby f scott fitzgerald third book sta...,https://www.goodreads.com/book/show/4671


In [5]:
ratings.head()

Unnamed: 0,user_id,book_id,rating
0,61,11,5
1,61,33,3
2,61,26,3
3,61,2,4
4,61,24,4


## Types of Recommendation System

**Content-Based Filtering** *(similar items)*
- The details from a book are broken down into features, such as its authors and genres.
- Using machine learning, we can build a model that finds other books with similar features.

**Collaborative Filtering** *(similar users)*
- Finding users with similar preferences.
- 2 main types: <br>
    a) Find similar users and recommend what they like **(user-based)**. <br>
    b) Find items similar to already-liked items **(item-based)**.


## Content-Based Filtering

![4](images/Content.png "Content")

The content-based recommendation system will be built using the book's title, authors, genres and description available.
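As a rough illustration of the idea, the sketch below vectorises three made-up feature strings with TF-IDF and compares them with cosine similarity. The strings and the `toy_*` names are invented for this example and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature strings: title, author, genres and description keywords
toy_soup = [
    "hidden symbols janedoe thriller mystery secret society rome",
    "lost code janedoe thriller mystery secret society paris",
    "garden cooking johnroe cookbook vegetables recipes",
]

# Turn each string into a TF-IDF vector
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_soup)

# Pairwise cosine similarity: entry [i, j] compares string i with string j
toy_sim = cosine_similarity(toy_matrix)
```

The two thrillers share an author and several genre words, so `toy_sim[0, 1]` is well above `toy_sim[0, 2]`, which is zero because the cookbook shares no terms with the first string.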

#### Building Content Engine

In [6]:
# Strip spaces and lower-case `authors`
books['authors_clean'] = books['authors'].apply(lambda x: [str.lower(i.strip("[]'").replace("'","").replace(" ", "")) for i in x.split(', ')])

In [7]:
# Formatting by removing quotation marks
books['authors'] = books['authors'].apply(lambda x: x.strip("[]'").replace("'","").split(", "))

In [8]:
books['authors_clean'].head()

0               [suzannecollins]
1    [j.k.rowling, marygrandpré]
2               [stepheniemeyer]
3                    [harperlee]
4            [f.scottfitzgerald]
Name: authors_clean, dtype: object
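The `authors` column holds Python-style list literals stored as strings. The cells above parse them with chained `strip`/`replace` calls, which also delete any apostrophe inside a name. One alternative, sketched here under the assumption that every value is a valid list literal, is `ast.literal_eval`. The variable names are illustrative only.

```python
import ast

# Example raw value, in the same form as the `authors` column shown in books.head()
raw_authors = "['J.K. Rowling', 'Mary GrandPré']"

# Parse the string as a Python list literal
author_list = ast.literal_eval(raw_authors)

# Same normalisation as `authors_clean`: lower-case and remove spaces
authors_clean_example = [a.lower().replace(" ", "") for a in author_list]
```

On a whole column this could be applied with `books['authors'].apply(ast.literal_eval)`, provided the column has not already been converted to lists.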

In [9]:
# Strip spaces and lower-case `genres`
books['genres_clean'] = books['genres'].apply(lambda x: [str.lower(i.strip("[]'").replace("'","").replace(" ", "")) for i in x.split(', ')])

In [10]:
books['genres_clean'].head()

0    [young-adult, fiction, fantasy, science-fictio...
1            [fantasy, fiction, young-adult, classics]
2    [young-adult, fantasy, romance, fiction, paran...
3    [classics, fiction, historical-fiction, young-...
4     [classics, fiction, historical-fiction, romance]
Name: genres_clean, dtype: object

In [11]:
# Convert all items to str
books['clean_desc'] = books['clean_desc'].astype(str)

In [12]:
books['clean_desc'].head()

0    winning means fame fortune losing means certai...
1    harry potter life miserable parents dead stuck...
2    three things absolutely positive first edward ...
3    unforgettable novel childhood sleepy southern ...
4    great gatsby f scott fitzgerald third book sta...
Name: clean_desc, dtype: object

In [13]:
# Combine title, authors, genres and description into features
books['soup'] = books.apply(lambda x: ' '.join([x['title']] + x['authors_clean'] + x['genres_clean'] + [x['clean_desc']]), axis=1)

In [14]:
books['soup'].head()

0    The Hunger Games (The Hunger Games, #1) suzann...
1    Harry Potter and the Sorcerer's Stone (Harry P...
2    Twilight (Twilight, #1) stepheniemeyer young-a...
3    To Kill a Mockingbird harperlee classics ficti...
4    The Great Gatsby f.scottfitzgerald classics fi...
Name: soup, dtype: object

In [15]:
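# Transform text from soup into TF-IDF vectors (unigrams and bigrams)
# min_df=1 keeps every term; recent scikit-learn versions reject the integer 0
tvec = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')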
tfv_matrix = tvec.fit_transform(books['soup'])

In [16]:
tfv_matrix.shape

(9287, 639188)

In [17]:
# Compute the pairwise cosine similarity between all books
cosine_sim = cosine_similarity(tfv_matrix, tfv_matrix)
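Assuming 64-bit floats, a dense 9287 × 9287 similarity matrix takes roughly 690 MB. If memory is a concern, one option is to compute the similarities for a single book on demand. Because `TfidfVectorizer` L2-normalises each row by default, `linear_kernel` (a plain dot product) returns the same values as `cosine_similarity`. The toy strings and the helper `top_similar` below are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented documents standing in for the `soup` column
toy_docs = [
    "dragon wizard magic quest",
    "wizard magic school quest",
    "tax accounting finance guide",
]
toy_tfv = TfidfVectorizer().fit_transform(toy_docs)  # rows are L2-normalised by default

def top_similar(matrix, idx, n=2):
    """Return the indices of the n rows most similar to row idx (hypothetical helper)."""
    scores = linear_kernel(matrix[idx], matrix).ravel()  # one row of similarities
    scores[idx] = -1.0                                   # exclude the query row itself
    return np.argsort(scores)[::-1][:n]

toy_top = top_similar(toy_tfv, 0, n=1)

# On L2-normalised rows the dot product equals the cosine similarity
same = np.allclose(linear_kernel(toy_tfv), cosine_similarity(toy_tfv))
```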

In [18]:
indices = pd.Series(books.index, index=books['title'])
titles = books['title']

#### Recommendations

In [19]:
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    return '<img src="{}" width=50></img>'.format(val)
    
# Function to generate the top-n recommendations based on a book title
def get_recommendations(title, n=10):
    idx = indices[title]
    
    # Get pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Skip the first entry (the book itself) and keep the next n book indices
    sim_scores = sim_scores[1:n + 1]
    book_indices = [i[0] for i in sim_scores]
    
    # Select the columns to display
    df = books.iloc[book_indices][['title', 'authors', 'cover_image', 'goodreads_link',
                                  'average_rating', 'ratings_count']]
    
    return df.style.format({'goodreads_link': make_clickable, 'cover_image': show_image})


In [20]:
get_recommendations("Angels & Demons  (Robert Langdon, #1)")

Unnamed: 0,title,authors,cover_image,goodreads_link,average_rating,ratings_count
3775,"Angels and Demons / The Da Vinci Code (Robert Langdon, #1-2)",['Dan Brown'],,Goodreads,4.19,18711
158,"The Lost Symbol (Robert Langdon, #3)",['Dan Brown'],,Goodreads,3.66,369428
21,"The Da Vinci Code (Robert Langdon, #2)",['Dan Brown'],,Goodreads,3.79,1447148
192,"Inferno (Robert Langdon, #4)",['Dan Brown'],,Goodreads,3.8,287533
9077,The Gemini Contenders,['Robert Ludlum'],,Goodreads,3.87,10428
5180,Temple,['Matthew Reilly'],,Goodreads,4.04,14961
9026,The Third Secret,['Steve Berry'],,Goodreads,3.84,10809
7991,The Dead Key,['D.M. Pulley'],,Goodreads,3.79,18773
9059,"Staked (The Iron Druid Chronicles, #8)",['Brad Thor'],,Goodreads,4.23,9082
8089,The Short and Tragic Life of Robert Peace: A Brilliant Young Man Who Left Newark for the Ivy League,['Jeff Hobbs'],,Goodreads,4.13,14451


The top four recommendations are other Dan Brown books in the Robert Langdon series.

In [21]:
# Function to suggest book titles based on a partial, case-insensitive string
def get_name_from_partial(title):
    return list(books.title[books.title.str.contains(title, case=False, regex=False, na=False)].values)

In [22]:
title = "people"
l = get_name_from_partial(title)
list(enumerate(l))

[(0, 'The Five People You Meet in Heaven'),
 (1,
  'The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change'),
 (2, 'How to Win Friends and Influence People'),
 (3, "A People's History of the United States"),
 (4, 'People of the Book'),
 (5,
  'The 21 Irrefutable Laws of Leadership: Follow Them and People Will Follow You'),
 (6, 'The People of Sparks (Book of Ember, #2)'),
 (7,
  'The Art of Asking; or, How I Learned to Stop Worrying and Let People Help'),
 (8, "Smiley's People"),
 (9, 'Games People Play'),
 (10, 'And the Band Played On: Politics, People, and the AIDS Epidemic'),
 (11, "Why Mosquitoes Buzz in People's Ears"),
 (12, 'What Do You Care What Other People Think?'),
 (13, 'The Bone People'),
 (14,
  'The Righteous Mind: Why Good People are Divided by Politics and Religion'),
 (15, 'Ordinary People'),
 (16, 'Facing Your Giants: A David and Goliath Story for Everyday People'),
 (17,
  "Fresh Wind, Fresh Fire: What Happens When God's Spirit Invades the Hear

In [23]:
get_recommendations("The Five People You Meet in Heaven")

Unnamed: 0,title,authors,cover_image,goodreads_link,average_rating,ratings_count
1917,The First Phone Call from Heaven,['Mitch Albom'],,Goodreads,3.73,38957
662,For One More Day,['Mitch Albom'],,Goodreads,4.09,102193
7197,"The Boy Who Came Back from Heaven: A Remarkable Account of Miracles, Angels, and Life beyond This World","['Kevin Malarkey', 'Alex Malarkey']",,Goodreads,3.89,8945
242,Heaven is for Real: A Little Boy's Astounding Story of His Trip to Heaven and Back,"['Todd Burpo', 'Lynn Vincent']",,Goodreads,4.01,228371
90,Tuesdays with Morrie,"['Mitch Albom', 'Saulius Dagys']",,Goodreads,4.06,556518
5054,"Heaven (Casteel, #1)",['V.C. Andrews'],,Goodreads,4.01,17289
5388,"Rock Chick Rescue (Rock Chick, #2)",['Kristen Ashley'],,Goodreads,4.31,26661
7244,"Heaven (Halo, #3)",['Alexandra Adornetto'],,Goodreads,3.91,14520
8035,"Love Wins: A Book About Heaven, Hell, and the Fate of Every Person Who Ever Lived",['Rob Bell'],,Goodreads,3.54,17668
6133,The Christmas Sweater,['Glenn Beck'],,Goodreads,3.79,12916


## Collaborative Filtering

![5](images/Collaborative.png "Collaborative")

We will try the two main approaches to collaborative filtering: <br>
    - Memory-based <br>
    - Model-based

### Memory-Based Approach

The memory-based approach involves computing the similarity between users or items using user ratings.

In [24]:
# Create a user x title pivot table of ratings (unrated books are filled with 0)
user_matrix = new_ratings.pivot_table(index='user_id', columns='title', values='rating')
user_matrix.fillna(0, inplace=True)
user_matrix.head()

title,'Salem's Lot,11/22/63,1776,1984,1Q84,"1st to Die (Women's Murder Club, #1)","2001: A Space Odyssey (Space Odyssey, #1)",A Bend in the Road,"A Breath of Snow and Ashes (Outlander, #6)",A Brief History of Time,...,"Wolves of the Calla (The Dark Tower, #5)",Wonder,"Wool Omnibus (Silo, #1)","Words of Radiance (The Stormlight Archive, #2)",World War Z: An Oral History of the Zombie War,"World Without End (The Kingsbridge Series, #2)","Xenocide (Ender's Saga, #3)",Year of Wonders,Yes Please,Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0
35,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0
36,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
61,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,3.0


In [25]:
def get_similar(title, mat):
    title_user_ratings = mat[title]
    similar_to_title = mat.corrwith(title_user_ratings)
    corr_title = pd.DataFrame(similar_to_title, columns=['correlation'])
    corr_title.dropna(inplace=True)
    corr_title.sort_values('correlation', ascending=False, inplace=True)
    return corr_title
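The correlation step used by `get_similar` can be illustrated on a small, invented user x title matrix. `corrwith` computes the Pearson correlation between one column and every other column, so books that the same users rate alike score close to 1. The data and the `toy*` names are made up for this sketch.

```python
import pandas as pd

# Toy matrix: rows are users, columns are books, values are invented ratings
toy = pd.DataFrame(
    {
        "Book A": [5, 4, 1, 2],
        "Book B": [4, 5, 2, 1],
        "Book C": [1, 2, 5, 4],
    },
    index=[101, 102, 103, 104],
)

# Correlate every book's rating column with the ratings of "Book A"
toy_corr = toy.corrwith(toy["Book A"]).sort_values(ascending=False)
```

Here "Book B" correlates at 0.8 with "Book A", while "Book C" correlates at -1.0 because users rate it in the opposite direction. Note that the notebook fills missing ratings with 0 before correlating, so an unrated book is treated as a rating of 0.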

In [26]:
title = "The Five People You Meet in Heaven"
smlr = get_similar(title, user_matrix)

In [27]:
# Top 10 similar books
smlr.head(10)

Unnamed: 0_level_0,correlation
title,Unnamed: 1_level_1
The Five People You Meet in Heaven,1.0
Tuesdays with Morrie,0.369439
For One More Day,0.276436
"The Notebook (The Notebook, #1)",0.199514
My Sister's Keeper,0.195729
The Shack,0.192527
The Secret Life of Bees,0.18473
A Thousand Splendid Suns,0.18261
The Memory Keeper's Daughter,0.174425
The Last Lecture,0.174358


In [28]:
# Include ratings_count
smlr = smlr.join(books.set_index('title')['ratings_count'])
smlr.head()

Unnamed: 0_level_0,correlation,ratings_count
title,Unnamed: 1_level_1,Unnamed: 2_level_1
The Five People You Meet in Heaven,1.0,449501
Tuesdays with Morrie,0.369439,556518
For One More Day,0.276436,102193
"The Notebook (The Notebook, #1)",0.199514,1053403
My Sister's Keeper,0.195729,863879


In [29]:
# Keep books with more than 500k ratings, sorted by correlation
smlr[smlr.ratings_count > 5e5].sort_values('correlation', ascending=False).head(10)

Unnamed: 0_level_0,correlation,ratings_count
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Tuesdays with Morrie,0.369439,556518
"The Notebook (The Notebook, #1)",0.199514,1053403
My Sister's Keeper,0.195729,863879
The Secret Life of Bees,0.18473,916189
A Thousand Splendid Suns,0.18261,818742
The Memory Keeper's Daughter,0.174425,501430
The Lovely Bones,0.172442,1605173
The Kite Runner,0.172229,1813044
Water for Elephants,0.163817,1068146
A Walk to Remember,0.15296,546948


### Model-Based Approach



The model-based approach involves building a model based on the user ratings dataset. 

For this part, we use the `Surprise` [library](https://surprise.readthedocs.io/en/stable/) to build the recommendation system. It provides many algorithms that have been built specifically for recommendation systems using explicit ratings.

Reading in the dataset for modelling:

In [30]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user_id', 'book_id', 'rating']], reader)

#### Baseline model

Since there are many prediction algorithms available in the library, we run each of them on our `ratings` dataset with 3-fold cross-validation and pick the best performer. Our goal here is to establish the algorithm with the lowest RMSE as a baseline model. After that, we will tune the hyperparameters of that model.

More information about the algorithms can be found in the [documentation](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).
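RMSE is the square root of the mean squared difference between true and predicted ratings, so it is expressed in rating points and lower is better. A minimal hand computation on invented numbers (not taken from the dataset) might look like this:

```python
from math import sqrt

# Invented (true rating, predicted rating) pairs
pairs = [(5, 4.5), (3, 3.5), (4, 4.0), (2, 3.0)]

# Squared error of each prediction
squared_errors = [(true_r - est) ** 2 for true_r, est in pairs]

# Root of the mean squared error
toy_rmse = sqrt(sum(squared_errors) / len(squared_errors))
```

For these numbers the squared errors are 0.25, 0.25, 0 and 1, giving an RMSE of about 0.61.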

In [24]:
benchmark = []
# Iterate over all algorithms
for algorithm in [
    KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), 
    BaselineOnly(), CoClustering(),
    SlopeOne(), NMF(), 
    SVD(), SVDpp()
]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')  

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


Unnamed: 0_level_0,test_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
SVDpp,0.843055,566.634514,27.828063
SVD,0.859417,35.636158,4.998294
KNNBaseline,0.864164,64.243537,367.939671
SlopeOne,0.869239,3.018741,24.117684
KNNWithZScore,0.870423,64.629724,390.459211
KNNWithMeans,0.871343,58.662309,339.339279
BaselineOnly,0.876753,2.033532,4.390604
CoClustering,0.88388,21.376527,3.852897
NMF,0.885807,44.309575,4.446064
KNNBasic,0.894614,81.12552,380.896954


`SVDpp` obtained the lowest RMSE of all the algorithms. However, it took about 567s (roughly 9.4 minutes) on average to fit one fold, compared with about 36s for `SVD`.

In order to speed up the modelling process, we will use the next best algorithm, `SVD` instead.

#### Hyperparameters tuning

In [31]:
# Performing GridSearchCV on SVD
param_grid = {"n_factors": [200],    #[70, 80, 100, 120, 150, 180, 200]
              "n_epochs": [40],      #[20, 25, 30, 35, 45, 50]
              "lr_all": [0.015],     #[0.005, 0.010, 0.03, 0.018]
              "reg_all": [0.08],     #[0.02, 0.05, 0.1, 0.075, 0.09]
              "random_state": [23],
             }

gs = GridSearchCV(SVD, param_grid, measures=['RMSE'], cv=3)

In [32]:
%%time
gs.fit(data)

Wall time: 16min 16s


In [33]:
gs.best_score

{'rmse': 0.8288784729298851}

After tuning the hyperparameters for `SVD`, we achieved an RMSE of **0.828878**, an improvement of about **0.0305** over the untuned model (0.859417).

We also tried the same set of parameters on `SVDpp`.

In [69]:
param_grid = {"n_factors": [200], #[70, 80, 100, 120, 150, 180]
              "n_epochs": [40], #[20, 25, 30, 35, 45, 50]
              "lr_all": [0.015], #[0.005, 0.010, 0.03, 0.018]
              "reg_all": [0.08], #[0.02, 0.05, 0.1, 0.075, 0.09]
              "random_state": [23],
              "verbose": [True]
             }

pp = GridSearchCV(SVDpp, param_grid, measures=['RMSE'], cv=3)

In [70]:
%%time
pp.fit(data)

 processing epoch 0
 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
 processing epoch 5
 processing epoch 6
 processing epoch 7
 processing epoch 8
 processing epoch 9
 processing epoch 10
 processing epoch 11
 processing epoch 12
 processing epoch 13
 processing epoch 14
 processing epoch 15
 processing epoch 16
 processing epoch 17
 processing epoch 18
 processing epoch 19
 processing epoch 20
 processing epoch 21
 processing epoch 22
 processing epoch 23
 processing epoch 24
 processing epoch 25
 processing epoch 26
 processing epoch 27
 processing epoch 28
 processing epoch 29
 processing epoch 30
 processing epoch 31
 processing epoch 32
 processing epoch 33
 processing epoch 34
 processing epoch 35
 processing epoch 36
 processing epoch 37
 processing epoch 38
 processing epoch 39
 processing epoch 0
 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
 processing epoch 5
 processing epoch 6
 processing epoch 7
 processin

In [71]:
pp.best_score

{'rmse': 0.8281736598195852}

While `SVDpp` improved the RMSE by **0.0007** over our tuned `SVD` model, it took almost 5.5 hours to run through the same set of hyperparameters. Hence, we will continue to use `SVD`.

In [34]:
# instantiate SVD with best params
algo = SVD(n_factors=200,
           n_epochs=40, 
           lr_all=0.015,
           reg_all=0.08,
           random_state=23
                   )

#### Compute precision@k and recall@k

Precision and recall are metrics for binary outcomes. Since our ratings range from 1 to 5, we need to translate them into a binary problem. We assume that any true rating of 3.5 or above corresponds to a relevant item, and any true rating below 3.5 is irrelevant. Likewise, an item counts as recommended when its estimated rating is 3.5 or above. A relevant item translates to a good recommendation.

![](images/rak_pak.jpg "formula")

- Precision @ k is the proportion of recommended items in the top-k set that are relevant.
    - Suppose precision @ 10 in a top-10 recommendation system is 80%. 
    - This means that 80% of recommendations made are relevant to the user. 

- Recall @ k is the proportion of relevant items found in the top-k recommendations.
    - Suppose recall @ 10 in a top-10 recommendation system is 40%. 
    - This means that 40% of the total number of the relevant items appear in the top-10 results.
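The two definitions above can be worked through by hand for a single user. The ratings below are invented, and `k` is set to 4 only to keep the example short.

```python
# (estimated rating, true rating) for one hypothetical user
user_ratings = [(4.8, 5), (4.5, 3), (4.1, 4), (3.9, 2), (3.2, 4), (2.5, 1)]
k, threshold = 4, 3.5

# Top-k items by estimated rating
top_k = sorted(user_ratings, key=lambda x: x[0], reverse=True)[:k]

# Relevant items overall (true rating at or above the threshold)
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)

# Recommended items in the top k (estimated rating at or above the threshold)
n_rec_k = sum(est >= threshold for est, _ in top_k)

# Items in the top k that are both recommended and relevant
n_rel_and_rec_k = sum(est >= threshold and true_r >= threshold for est, true_r in top_k)

precision_at_k = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
recall_at_k = n_rel_and_rec_k / n_rel if n_rel else 0
```

Of the four recommended items, two are relevant, so precision@4 is 0.5. The user has three relevant items in total, so recall@4 is 2/3.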

In [35]:
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls

In [36]:
kf = KFold(n_splits=3)

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=10, threshold=3.5)

    # Precision and recall can then be averaged over all users
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print(sum(rec for rec in recalls.values()) / len(recalls))

0.8268607934272374
0.40451876664004355
0.8276200965346058
0.4057337614014454
0.8264269980551908
0.4060103020382488


Averaged over the three folds: <br>
Precision@10: 82.7% <br>
Recall@10: 40.5%

In [37]:
%%time
rmse(predictions)

RMSE: 0.8282
Wall time: 693 ms


0.828163685341514

#### Predictions

We will re-instantiate SVD and fit it on the full dataset to predict ratings for books that each user has not rated yet.

In [31]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user_id', 'book_id', 'rating']], reader)

In [32]:
# instantiate SVD with best params
algo = SVD(n_factors=200,
           n_epochs=40, 
           lr_all=0.015,
           reg_all=0.08,
           random_state=23
                   )

In [33]:
trainset = data.build_full_trainset()
algo.fit(trainset)
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
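`build_anti_testset()` returns every (user, item) pair that is known to the training set but has no rating, so the model can score books a user has not rated. The pure-Python sketch below mimics that idea on invented ratings. It is not Surprise's implementation, and all names in it are hypothetical.

```python
# Invented ratings: (user_id, book_id, rating)
toy_ratings = [(1, 10, 5), (1, 11, 3), (2, 10, 4), (2, 12, 2)]

users = sorted({u for u, _, _ in toy_ratings})
items = sorted({i for _, i, _ in toy_ratings})
rated = {(u, i) for u, i, _ in toy_ratings}

# Placeholder for the unknown true rating: the mean of all known ratings
global_mean = sum(r for _, _, r in toy_ratings) / len(toy_ratings)

# Every known user paired with every known item they have not rated
toy_anti_testset = [
    (u, i, global_mean) for u in users for i in items if (u, i) not in rated
]
```

The number of pairs grows with users × items minus the number of ratings, so on a full dataset this step can be large and slow.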

In [34]:
def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [35]:
top_n = get_top_n(predictions, n=10)

In [36]:
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    return '<img src="{}" width=50></img>'.format(val)

def get_user_pred(user_id):
    user_recommendations = pd.DataFrame(columns=['user_id', 
                                                 'book_id', 
                                                 'title',
                                                 'authors',
                                                 'cover_image',
                                                 'goodreads_link',
                                                 'average_rating',
                                                 'ratings_count'])
    
    for uId, user_ratings in top_n.items():
        if uId == user_id:
            user_recommendations['book_id'] = [book_id for (book_id, _) in user_ratings]


    user_recommendations['user_id'] = user_recommendations['book_id'].map(
                                            lambda x: user_id)
    
    user_recommendations['title'] = user_recommendations['book_id'].map(
                                        lambda x: books.loc[books['book_id'] == x,
                                                               'title'].values[0])
    
    user_recommendations['authors'] = user_recommendations['book_id'].map(
                                        lambda x: books.loc[books['book_id'] == x,
                                                               'authors'].values[0])

    user_recommendations['average_rating'] = user_recommendations['book_id'].map(
                                        lambda x: books.loc[books['book_id'] == x,
                                                               'average_rating'].values[0])
    
    user_recommendations['ratings_count'] = user_recommendations['book_id'].map(
                                        lambda x: books.loc[books['book_id'] == x,
                                                               'ratings_count'].values[0])    
    
    user_recommendations['cover_image'] = user_recommendations['book_id'].map(
                                        lambda x: books.loc[books['book_id'] == x,
                                                               'cover_image'].values[0])
    
    user_recommendations['goodreads_link'] = user_recommendations['book_id'].map(
                                        lambda x: books.loc[books['book_id'] == x,
                                                               'goodreads_link'].values[0])
        
    return user_recommendations.style.format({'goodreads_link': make_clickable, 'cover_image': show_image})

In [37]:
get_user_pred(6335)

Unnamed: 0,user_id,book_id,title,authors,cover_image,goodreads_link,average_rating,ratings_count
0,6335,1010,The Essential Calvin and Hobbes: A Calvin and Hobbes Treasury,['Bill Watterson'],,Goodreads,4.65,93001
1,6335,780,Calvin and Hobbes,"['Bill Watterson', 'G.B. Trudeau']",,Goodreads,4.61,117788
2,6335,422,"Harry Potter Boxset (Harry Potter, #1-7)",['J.K. Rowling'],,Goodreads,4.74,190050
3,6335,4,To Kill a Mockingbird,['Harper Lee'],,Goodreads,4.25,3198671
4,6335,490,"Maus I: A Survivor's Tale: My Father Bleeds History (Maus, #1)",['Art Spiegelman'],,Goodreads,4.35,184007
5,6335,862,"Words of Radiance (The Stormlight Archive, #2)",['Brandon Sanderson'],,Goodreads,4.77,73572
6,6335,459,A Man Called Ove,"['Fredrik Backman', 'Henning Koch']",,Goodreads,4.35,183777
7,6335,958,"The Complete Anne of Green Gables Boxed Set (Anne of Green Gables, #1-8)",['L.M. Montgomery'],,Goodreads,4.42,92142
8,6335,983,Between the World and Me,['Ta-Nehisi Coates'],,Goodreads,4.4,74218
9,6335,998,The Monster at the End of this Book,"['Jon Stone', 'Michael J. Smollin']",,Goodreads,4.45,102184


In [38]:
new_ratings[new_ratings['user_id'] == 6335].head(10)

Unnamed: 0,user_id,book_id,rating,title
5935,6335,33,4,Memoirs of a Geisha
31116,6335,372,5,Dress Your Family in Corduroy and Denim
57314,6335,180,2,Siddhartha
62323,6335,45,4,Life of Pi
72460,6335,8,2,The Catcher in the Rye
74640,6335,119,5,The Handmaid's Tale
77831,6335,32,4,Of Mice and Men
90085,6335,100,4,The Poisonwood Bible
96565,6335,13,5,1984
104041,6335,14,5,Animal Farm


## Hybrid Recommendation System

There are different approaches to building a hybrid recommendation system. For this notebook, we employ a mixed hybrid approach, where the user profile and the book features are fed into different recommendation models. The predictions are then combined to produce the final recommendations.

- Input: User ID and the Title of a Book
- Output: Similar books sorted on the basis of expected ratings by that particular user.
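The combination step can be sketched as a shortlist-then-re-rank procedure: pick the books most similar in content, then order that shortlist by the rating predicted for the user. The scores, book IDs and the function `toy_hybrid` below are invented for illustration.

```python
# Invented content-similarity scores to the query book: {book_id: similarity}
toy_similarity = {101: 0.62, 102: 0.55, 103: 0.31, 104: 0.12}

# Invented predicted ratings for one user: {book_id: estimated rating}
toy_estimates = {101: 3.4, 102: 4.6, 103: 4.1, 104: 4.9}

def toy_hybrid(similarity, estimates, n_candidates=3, n=2):
    """Shortlist by similarity, then re-rank the shortlist by predicted rating."""
    shortlist = sorted(similarity, key=similarity.get, reverse=True)[:n_candidates]
    return sorted(shortlist, key=estimates.get, reverse=True)[:n]

toy_result = toy_hybrid(toy_similarity, toy_estimates)
```

Book 104 has the highest predicted rating but is left out because it does not make the content shortlist. The shortlist size therefore controls how far the result can drift from the query book.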


In [39]:
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    return '<img src="{}" width=50></img>'.format(val)

def hybrid(user_id, title, n=10):
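    idx = indices[title]

    # Shortlist the 50 books most similar in content (position 0 is the book itself)
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:51]
    book_indices = [i[0] for i in sim_scores]
    
    # Copy the slice before adding the `est` column (avoids pandas' SettingWithCopyWarning)
    df = books.iloc[book_indices][['book_id', 'title', 'average_rating', 'ratings_count',
                                   'authors', 'cover_image', 'goodreads_link']].copy()
    
    # Predict this user's rating for each shortlisted book with the fitted SVD model
    df['est'] = df['book_id'].apply(lambda x: algo.predict(user_id, x).est)

    # Re-rank the shortlist by predicted rating
    df = df.sort_values('est', ascending=False)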
    
    final = df[['title', 'authors', 'cover_image', 'goodreads_link',
                'average_rating', 'ratings_count', 'est']]
    
    return final.head(n).style.format({'goodreads_link': make_clickable, 'cover_image': show_image})


In [40]:
hybrid(202, 'The Five People You Meet in Heaven')

Unnamed: 0,title,authors,cover_image,goodreads_link,average_rating,ratings_count,est
869,"Lord of Chaos (Wheel of Time, #6)",['Robert Jordan'],,Goodreads,4.1,91046,4.016919
787,"The Alchemyst (The Secrets of the Immortal Nicholas Flamel, #1)",['Michael Scott'],,Goodreads,3.84,58396,4.003085
1917,The First Phone Call from Heaven,['Mitch Albom'],,Goodreads,3.73,38957,3.995743
7734,The First Fifteen Lives of Harry August,['Claire North'],,Goodreads,4.04,22327,3.995743
4530,The Rose Garden,['Susanna Kearsley'],,Goodreads,4.01,18553,3.995743
4364,"Abandon (Abandon, #1)",['Meg Cabot'],,Goodreads,3.71,30934,3.995743
978,The Time Keeper,['Mitch Albom'],,Goodreads,3.85,72277,3.995743
4460,The Dive From Clausen's Pier,['Ann Packer'],,Goodreads,3.41,18445,3.995743
4139,"11 Birthdays (Willow Falls, #1)",['Wendy Mass'],,Goodreads,4.18,22339,3.995743
4392,The Chemist,['Stephenie Meyer'],,Goodreads,3.69,25188,3.995743


In [41]:
hybrid(6335, 'The Five People You Meet in Heaven')

Unnamed: 0,title,authors,cover_image,goodreads_link,average_rating,ratings_count,est
17,The Lovely Bones,['Alice Sebold'],,Goodreads,3.77,1605173,4.05685
869,"Lord of Chaos (Wheel of Time, #6)",['Robert Jordan'],,Goodreads,4.1,91046,4.017087
208,How to Win Friends and Influence People,['Dale Carnegie'],,Goodreads,4.13,282623,3.987156
1917,The First Phone Call from Heaven,['Mitch Albom'],,Goodreads,3.73,38957,3.979531
7735,The Guns of August,['Claire North'],,Goodreads,4.18,35147,3.979531
4364,"Abandon (Abandon, #1)",['Meg Cabot'],,Goodreads,3.71,30934,3.979531
978,The Time Keeper,['Mitch Albom'],,Goodreads,3.85,72277,3.979531
4460,The Dive From Clausen's Pier,['Ann Packer'],,Goodreads,3.41,18445,3.979531
4139,"11 Birthdays (Willow Falls, #1)",['Wendy Mass'],,Goodreads,4.18,22339,3.979531
4392,The Chemist,['Stephenie Meyer'],,Goodreads,3.69,25188,3.979531


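The hybrid recommender returns a different ranking for each user ID, so the suggestions are personalised. Note, however, that most estimates within each list are identical (3.995743 for user 202 and 3.979531 for user 6335), so the order among those books carries no information.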
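
The repeated `est` values in the outputs above can be counted with a few lines of Python. This is a minimal sketch rather than part of the original notebook: the helper name `tie_summary` is hypothetical, and the numbers are copied by hand from the output for user 6335. A long run of identical estimates may mean the model is falling back to a default estimate for those books; that is an assumption to check, not a confirmed diagnosis.

```python
from collections import Counter

# 'est' values copied from the output for user 6335 above.
est_user_6335 = [4.05685, 4.017087, 3.987156, 3.979531, 3.979531,
                 3.979531, 3.979531, 3.979531, 3.979531, 3.979531]


def tie_summary(estimates, ndigits=6):
    """Return (most common estimate, its frequency, number of distinct estimates).

    Hypothetical helper: estimates are rounded to `ndigits` before counting.
    """
    rounded = [round(e, ndigits) for e in estimates]
    counts = Counter(rounded)
    value, freq = counts.most_common(1)[0]
    return value, freq, len(counts)


value, freq, distinct = tie_summary(est_user_6335)
print(f"{freq} of {len(est_user_6335)} books share est={value}; {distinct} distinct values")
# prints: 7 of 10 books share est=3.979531; 4 distinct values
```

Only three of the ten books for this user have an estimate that differs from the repeated value, so the ranking is effectively decided by those three.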

## Conclusion

The table below summarises the top 10 predictions from the three recommendation systems for `User ID: 6335` and the book `The Five People You Meet in Heaven`:

| Rank 	| Hybrid 	| Collaborative 	| Content 	|
|---	|---	|---	|---	|
| 1 	| The Lovely Bones 	| The Essential Calvin and Hobbes: A Calvin and Hobbes Treasury 	| The First Phone Call from Heaven 	|
| 2 	| Lord of Chaos (Wheel of Time, #6) 	| Calvin and Hobbes 	| For One More Day 	|
| 3 	| How to Win Friends and Influence People 	| Harry Potter Boxset (Harry Potter, #1-7) 	| The Boy Who Came Back from Heaven: A Remarkable Account of Miracles, Angels, and Life beyond This World 	|
| 4 	| The First Phone Call from Heaven 	| To Kill a Mockingbird 	| Heaven is for Real: A Little Boy's Astounding Story of His Trip to Heaven and Back 	|
| 5 	| The Guns of August 	| Maus I: A Survivor's Tale: My Father Bleeds History (Maus, #1) 	| Tuesdays with Morrie 	|
| 6 	| Abandon (Abandon, #1) 	| Words of Radiance (The Stormlight Archive, #2) 	| Heaven (Casteel, #1) 	|
| 7 	| The Time Keeper 	| A Man Called Ove 	| Rock Chick Rescue (Rock Chick, #2) 	|
| 8 	| The Dive From Clausen's Pier 	| The Complete Anne of Green Gables Boxed Set (Anne of Green Gables, #1-8) 	| Heaven (Halo, #3) 	|
| 9 	| 11 Birthdays (Willow Falls, #1) 	| Between the World and Me 	| Love Wins: A Book About Heaven, Hell, and the Fate of Every Person Who Ever Lived 	|
| 10 	| The Chemist 	| The Monster at the End of this Book 	| The Christmas Sweater 	|

**1. Content-Based Recommendation**: <br>
    We used book titles, authors, genres and descriptions as features, and recommended books whose features are similar.

**2. Collaborative Filtering Recommendation**: <br>
    We tried both memory-based and model-based approaches, building our own matrix for the former and using the `Surprise` library for the latter. For the model-based approach, we chose Singular Value Decomposition (SVD) and obtained the top predictions for a given user from the estimated ratings.

**3. Hybrid Recommendation**: <br>
    We used a mixed hybrid approach to combine our content-based and collaborative filtering models. Given a user ID and a book title, we suggested books with similar features, ranked by highest estimated rating.
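
As a rough check on how much the three lists in the summary table agree, the overlap between them can be computed directly. The sketch below is not part of the original notebook: the helper names `overlap` and `jaccard` are hypothetical, and the titles are copied by hand from the table in this section.

```python
# Top-10 titles copied from the summary table (User ID 6335).
hybrid_top10 = [
    "The Lovely Bones", "Lord of Chaos (Wheel of Time, #6)",
    "How to Win Friends and Influence People", "The First Phone Call from Heaven",
    "The Guns of August", "Abandon (Abandon, #1)", "The Time Keeper",
    "The Dive From Clausen's Pier", "11 Birthdays (Willow Falls, #1)", "The Chemist",
]
collab_top10 = [
    "The Essential Calvin and Hobbes: A Calvin and Hobbes Treasury", "Calvin and Hobbes",
    "Harry Potter Boxset (Harry Potter, #1-7)", "To Kill a Mockingbird",
    "Maus I: A Survivor's Tale: My Father Bleeds History (Maus, #1)",
    "Words of Radiance (The Stormlight Archive, #2)", "A Man Called Ove",
    "The Complete Anne of Green Gables Boxed Set (Anne of Green Gables, #1-8)",
    "Between the World and Me", "The Monster at the End of this Book",
]
content_top10 = [
    "The First Phone Call from Heaven", "For One More Day",
    "The Boy Who Came Back from Heaven: A Remarkable Account of Miracles, "
    "Angels, and Life beyond This World",
    "Heaven is for Real: A Little Boy's Astounding Story of His Trip to Heaven and Back",
    "Tuesdays with Morrie", "Heaven (Casteel, #1)", "Rock Chick Rescue (Rock Chick, #2)",
    "Heaven (Halo, #3)",
    "Love Wins: A Book About Heaven, Hell, and the Fate of Every Person Who Ever Lived",
    "The Christmas Sweater",
]


def overlap(a, b):
    """Titles that appear in both lists, sorted alphabetically."""
    return sorted(set(a) & set(b))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


print(overlap(hybrid_top10, content_top10))   # ['The First Phone Call from Heaven']
print(overlap(hybrid_top10, collab_top10))    # []
print(overlap(collab_top10, content_top10))   # []
print(round(jaccard(hybrid_top10, content_top10), 3))  # 0.053
```

The hybrid list shares one title with the content-based list and none with the collaborative list, which is consistent with the hybrid function drawing its candidates from the content-based step before re-ranking them.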

## Future Works

1. A book's main concepts or storyline cannot be fully captured by its title and description alone. To group books with similar 'content' more accurately, we could include other sources, such as text reviews or discussions, to provide more content-based features.
2. For this project, we used only the `sklearn` and `surprise` libraries to build our recommendation engines. We could consider other recommendation system libraries and/or neural network models, such as `LightFM` or `TensorFlow Recommenders`, for faster or more efficient models.
