# Book Recommendation with Collaborative Filtering

<img src="https://img-cdn.inc.com/image/upload/w_1920,h_1080,c_fill/images/panoramic/GettyImages-577674005_492115_zfpgiw.jpg" alt="header image">

## Build state-of-the-art models for a book recommendation system

## Context
During the last few decades, with the rise of YouTube, Amazon, Netflix and many other web services, recommender systems have taken an ever larger place in our lives. From e-commerce (suggesting articles that could interest buyers) to online advertising (suggesting content that matches users' preferences), recommender systems are now unavoidable in our daily online journeys.
In general terms, recommender systems are algorithms that suggest relevant items to users (movies to watch, texts to read, products to buy, or anything else, depending on the industry).

Recommender systems are critical in some industries: an efficient one can generate a large amount of income or help a company stand out from its competitors. As evidence of their importance, a few years ago Netflix organised a challenge (the "Netflix Prize") whose goal was to produce a recommender system that performed better than its own algorithm, with a prize of 1 million dollars.

Using this simple dataset and its related tasks and notebooks, we will progressively go through different paradigms of recommender algorithms. For each of them, we will present how it works, describe its theoretical basis and discuss its strengths and weaknesses. For extra information, please check <a href="https://www.kaggle.com/arashnic/book-recommendation-dataset">Kaggle Möbius</a>.

## Dataset and Extra Information

https://www.kaggle.com/arashnic/book-recommendation-dataset

### Picture link:

<a href="https://www.inc.com/jessica-stillman/books-reading-intelligence-tyler-cowen.html">Jessica Stillman</a>

### Importing libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")
# Dataframe manipulation library
import pandas as pd
# Math functions, we'll only need the sqrt function so let's import only that
from math import sqrt
import numpy as np

### Separate datasets

In [2]:
# Storing the book information into a pandas dataframe
books_df = pd.read_csv('Books.csv')
books_df = books_df[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication']]
# Storing the rating information into a pandas dataframe
ratings_df = pd.read_csv('Ratings.csv')
# head() returns the first N rows of a dataframe. N defaults to 5.
books_df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication
0,195153448,Classical Mythology,Mark P. O. Morford,2002
1,2005018,Clara Callan,Richard Bruce Wright,2001
2,60973129,Decision in Normandy,Carlo D'Este,1991
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999


### Looking for duplicated book titles

In [3]:
books_df['Book-Title'].duplicated().sum()

29225

In [4]:
books_df.drop_duplicates(subset='Book-Title', keep="last", inplace=True)

In [5]:
books_df['Book-Title'].duplicated().sum()

0
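Dropping duplicate titles keeps one ISBN per title, so ratings recorded against other editions of the same title no longer match the catalogue. The sketch below illustrates this on a made-up three-row catalogue (all ISBNs, users and ratings are hypothetical) and shows one possible alternative: attaching titles to ratings before any deduplication.

```python
import pandas as pd

# Hypothetical catalogue: "Lightning" exists under two ISBNs (two editions)
books = pd.DataFrame({
    "ISBN": ["A1", "A2", "B1"],
    "Book-Title": ["Lightning", "Lightning", "Jane Doe"],
})
# Hypothetical ratings, one per edition
ratings = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "ISBN": ["A1", "A2", "B1"],
    "Book-Rating": [9, 8, 2],
})

# Same deduplication as in the notebook: keep the last ISBN per title
deduped = books.drop_duplicates(subset="Book-Title", keep="last")

# Ratings that still match the deduplicated catalogue (user 1 is lost)
kept = ratings[ratings["ISBN"].isin(deduped["ISBN"])]
print(kept["User-ID"].tolist())

# Alternative: join ratings to the full catalogue first, then work by title
by_title = ratings.merge(books, on="ISBN")
print(by_title.groupby("Book-Title")["User-ID"].count())
```

Whether working by title is appropriate depends on whether different editions should count as the same book for recommendation purposes.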

In [6]:
ratings_df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,5
1,276726,155061224,2
2,276727,446520802,9
3,276729,052165615X,6
4,276729,521795028,7


### User input for making recommendations

In [15]:
userInput = [
            {'Book-Title': 'Lightning', 'Book-Rating': 9},
            {'Book-Title': 'Manhattan Hunt Club', 'Book-Rating': 9},
            {'Book-Title': 'Clara Callan', 'Book-Rating': 7},
            {'Book-Title': "Jane Doe", 'Book-Rating': 2},
            {'Book-Title': 'Wild Animus', 'Book-Rating': 5}
         ]
inputBooks= pd.DataFrame(userInput)
inputBooks

Unnamed: 0,Book-Title,Book-Rating
0,Lightning,9
1,Manhattan Hunt Club,9
2,Clara Callan,7
3,Jane Doe,2
4,Wild Animus,5


### Looking up the ISBN of each book

In [16]:
# Selecting the books by title
inputId = books_df[books_df['Book-Title'].isin(inputBooks['Book-Title'].tolist())]
# Then merging it so we can get the ISBN. It's implicitly merging it by title.
inputBooks = pd.merge(inputId, inputBooks)
# Dropping information we won't use from the input dataframe
inputBooks = inputBooks.drop(columns='Year-Of-Publication')
# Final input dataframe
# If a book you added above isn't here, then it might not be in the original
# dataframe or it might be spelled differently, please check capitalisation.
inputBooks

Unnamed: 0,ISBN,Book-Title,Book-Author,Book-Rating
0,2005018,Clara Callan,Richard Bruce Wright,7
1,971880107,Wild Animus,Rich Shapero,5
2,449006522,Manhattan Hunt Club,JOHN SAUL,9
3,1551665107,Jane Doe,R.J. Kaiser,2
4,553290703,Lightning,Patricia Potter,9


The input dataframe now contains the ISBN and author of each book alongside the rating we gave it.
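Because the merge is implicit and inner, a title that is misspelled or capitalised differently simply disappears from the result. As a minimal sketch (the ISBNs and the misspelled title below are hypothetical), naming the key with `on=` and using pandas' `indicator=True` option makes unmatched titles visible:

```python
import pandas as pd

# Hypothetical catalogue and user input; "wild animus" differs in capitalisation
books = pd.DataFrame({
    "ISBN": ["111", "222"],
    "Book-Title": ["Clara Callan", "Wild Animus"],
})
user_input = pd.DataFrame({
    "Book-Title": ["Clara Callan", "wild animus"],
    "Book-Rating": [7, 5],
})

# Left merge from the user's side keeps every input row;
# the _merge column records whether a catalogue match was found
merged = user_input.merge(books, on="Book-Title", how="left", indicator=True)

# Titles that did not match any catalogue entry
unmatched = merged.loc[merged["_merge"] == "left_only", "Book-Title"].tolist()
print(unmatched)
```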

In [17]:
# Selecting the ratings of users who have rated at least one of the input books
userSubset = ratings_df[ratings_df['ISBN'].isin(inputBooks['ISBN'].tolist())]
userSubset.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
86676,18401,1551665107,3
128287,29806,1551665107,5
162695,35859,1551665107,1
166997,36609,1551665107,5
201177,45114,1551665107,5


In [18]:
userSubsetGroup = userSubset.groupby(['User-ID'])

In [19]:
userSubsetGroup.get_group(18401)

Unnamed: 0,User-ID,ISBN,Book-Rating
86676,18401,1551665107,3


In [20]:
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

In [21]:
userSubsetGroup[0:3]

[(18401,
         User-ID        ISBN  Book-Rating
  86676    18401  1551665107            3),
 (29806,
          User-ID        ISBN  Book-Rating
  128287    29806  1551665107            5),
 (35859,
          User-ID        ISBN  Book-Rating
  162695    35859  1551665107            1)]

In [22]:
userSubsetGroup = userSubsetGroup[0:100]

### Building Pearson correlation for the recommendation

<img src="https://www.datasciencemadesimple.com/wp-content/uploads/2017/07/CORRELATION-COEFFICIENT-FORMULA.png?ezimgfmt=ng:webp/ngcb1" alt="pearson correlation">

Where x̄ and ȳ are the sample means of the two arrays of values, x and y.

r = correlation coefficient

If r is close to +1, this indicates a <strong>strong positive correlation</strong>.

If r is close to -1, this indicates a <strong>strong negative correlation</strong>.
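As a minimal sketch of the formula, the function below computes r from the sums of squares and cross-products about the means and compares the result with `numpy.corrcoef`. The function name `pearson` and the two rating lists are hypothetical, chosen only for illustration; returning 0.0 when the coefficient is undefined mirrors the convention used later in this notebook.

```python
from math import sqrt

import numpy as np


def pearson(x, y):
    """Pearson correlation of two equal-length rating lists (0.0 if undefined)."""
    n = len(x)
    if n != len(y) or n == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    # Sums of squares and cross-products about the means
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A constant list has zero variance, so r is undefined; fall back to 0.0
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings from two users for the same five books
a = [9, 9, 7, 2, 5]
b = [8, 10, 6, 1, 5]

r = pearson(a, b)
r_numpy = float(np.corrcoef(a, b)[0, 1])
print(round(r, 4), round(r_numpy, 4))
```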

In [23]:
# Store the Pearson correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

# For every user group in our subset
for name, group in userSubsetGroup:
    # Sort the input and the current user group by ISBN so the ratings line up
    group = group.sort_values(by='ISBN')
    inputBooks = inputBooks.sort_values(by='ISBN')
    # Get the N for the formula
    nRatings = len(group)
    # Get the input ratings for the books that both users have in common
    temp_df = inputBooks[inputBooks['ISBN'].isin(group['ISBN'].tolist())]
    # Store them in a list to simplify the calculations below
    tempRatingList = temp_df['Book-Rating'].tolist()
    # Put the current user group's ratings in a list as well
    tempGroupList = group['Book-Rating'].tolist()
    # Now calculate the Pearson correlation between the two users, called x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    # If the denominator is non-zero, divide; otherwise use a correlation of 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0

# Thanks to IBM for showing how to build the Pearson correlation in Python.

In [24]:
pearsonCorrelationDict.items()

dict_items([(18401, 0), (29806, 0), (35859, 0), (36609, 0), (45114, 0), (72238, 0), (73394, 0), (78553, 0), (79903, 0), (93518, 0), (135265, 0), (135321, 0), (143411, 0), (158295, 0), (166596, 0), (167349, 0), (175003, 0), (201451, 0), (221908, 0), (230522, 0), (242824, 0)])
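Every coefficient above is 0. This is consistent with the group output shown earlier, where even the largest groups contain a single row: with only one shared book, both sums of squares are zero and the code falls into the `else` branch. The sketch below reproduces that arithmetic and shows one possible way to keep only users who share at least two books; the data, the threshold of 2 and the names `shared_counts` and `eligible_users` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical single shared book: the input user rated it 2, the neighbour 3
input_ratings = [2]
neighbour_ratings = [3]
n = len(neighbour_ratings)

# With one observation, each sum of squares about the mean collapses to zero
sxx = sum(i ** 2 for i in input_ratings) - sum(input_ratings) ** 2 / n
syy = sum(j ** 2 for j in neighbour_ratings) - sum(neighbour_ratings) ** 2 / n
print(sxx, syy)

# Hypothetical ratings subset: only user 1 shares two books with the input
user_subset = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "B"],
    "Book-Rating": [9, 7, 3, 5],
})

# Count distinct shared books per user and keep users with at least two
shared_counts = user_subset.groupby("User-ID")["ISBN"].nunique()
eligible_users = shared_counts[shared_counts >= 2].index.tolist()
print(eligible_users)
```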

### Pearson correlation dictionary as a DataFrame

In [25]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0,18401
1,0,29806
2,0,35859
3,0,36609
4,0,45114


### Sort topUsers DataFrame based on similarity index

In [26]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
0,0,18401
11,0,135321
19,0,230522
18,0,221908
17,0,201451


### Merge topUsers DataFrame with ratings_df DataFrame

In [27]:
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='User-ID', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,User-ID,ISBN,Book-Rating
0,0,18401,18401,60009241,8
1,0,18401,18401,60085444,9
2,0,18401,18401,60092149,5
3,0,18401,18401,60502177,6
4,0,18401,18401,61012513,7


### Multiply the similarity by the user's ratings

In [28]:
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['Book-Rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,User-ID,ISBN,Book-Rating,weightedRating
0,0,18401,18401,60009241,8,0
1,0,18401,18401,60085444,9,0
2,0,18401,18401,60092149,5,0
3,0,18401,18401,60502177,6,0
4,0,18401,18401,61012513,7,0


### Sum the similarity index and weighted rating after grouping topUsers by ISBN

In [29]:
tempTopUsersRating = topUsersRating.groupby('ISBN').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1
000104799X,0,0
002026478X,0,0
002045211X,0,0
002089130X,0,0
006000147X,0,0


In [35]:
recommendation_df = pd.DataFrame()
# Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['ISBN'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,ISBN
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1
000104799X,,000104799X
002026478X,,002026478X
002045211X,,002045211X
002089130X,,002089130X
006000147X,,006000147X
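The scores above are all NaN because every summed similarity is 0, so the weighted average divides 0 by 0. Since `sort_values` places NaN values last by default, the ranking that follows cannot separate one book from another. As a minimal sketch on hypothetical per-book sums (the ISBN labels `A`, `B`, `C` and all numbers are made up), one option is to treat a zero similarity sum as missing evidence and drop those books before ranking:

```python
import numpy as np
import pandas as pd

# Hypothetical per-book sums; book "B" has no similarity evidence at all
sums = pd.DataFrame(
    {
        "sum_similarityIndex": [1.5, 0.0, 0.8],
        "sum_weightedRating": [12.0, 0.0, 4.0],
    },
    index=pd.Index(["A", "B", "C"], name="ISBN"),
)

# Replace zero denominators with NaN so the division never yields inf,
# then drop books whose score is undefined
denominator = sums["sum_similarityIndex"].replace(0, np.nan)
scores = (sums["sum_weightedRating"] / denominator).dropna()

# Rank the remaining books from highest to lowest score
scores = scores.sort_values(ascending=False)
print(scores)
```

With real data this only helps once some neighbours have a non-zero similarity, which in turn requires users who share more than one book with the input.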


### Sort recommendation_df DataFrame based on weighted average recommendation score

In [36]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,ISBN
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1
000104799X,,000104799X
002026478X,,002026478X
002045211X,,002045211X
002089130X,,002089130X
006000147X,,006000147X


### Show recommended books with title, author and the year of publication based on ISBN

In [37]:
books_df.loc[books_df['ISBN'].isin(recommendation_df.head(10)['ISBN'].tolist())]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication
3701,002026478X,AGE OF INNOCENCE (MOVIE TIE-IN),Edith Wharton,1993
47683,006000181X,With Her Last Breath,Cait London,2003
49133,006008460X,Cheaper by the Dozen (Perennial Classics),Frank B. Gilbreth,2002
51433,000104799X,Monk's-hood,Ellis Peters,1994
81494,006000553X,Victoria and the Rogue (An Avon True Romance),Meg Cabot,2003
127878,006000147X,Cherokee Warriors: The Loner,Genell Dellin,2003
225224,006008197X,Once Upon a Town : The Miracle of the North Pl...,Bob Greene,2003
245421,002045211X,Salazar Blinks (Collier Fiction),David R. Slavitt,1990
