# Book recommendation system

Recommendation systems analyse relationships between users and products, between products, or between users.
Understanding these relationships can provide valuable insights, which is why such systems are used in a variety of areas. Commonly recognised examples include playlist generators for video, music and book services, product recommenders for online stores, and content recommenders for social media platforms (personalized homepages, promotional emails, and so on).

This notebook builds a book recommendation system based on the Book-Crossing Dataset mined by Cai-Nicolas Ziegler, DBIS Freiburg. The data was collected from the Book-Crossing community, a community of book lovers who exchange books worldwide.

Several algorithms can be used in a recommendation system, such as content-based filtering, collaborative filtering, and association rule learning. <br><br>
The algorithm that best fits our use case is collaborative filtering, which investigates user-to-user relationships. <br>
The **collaborative filtering** algorithm makes automatic predictions (filtering) about the interests of a user by collecting preference/taste information from many users (collaborating). The underlying assumption is that if person A has the same opinion as person B on one issue (a book they have both read), A is more likely to share B's opinion on a different issue (another book) than the opinion of a randomly chosen person.

The algorithm uses easily captured user behaviour data (the ratings a user has given to books). Based on the ratings a user has given to the books they have already read, the algorithm finds the other users with the most similar taste in books and predicts the user's ratings for unread books from those users' ratings.
To find the most similar users we use the **K-Nearest Neighbors** (KNN) algorithm. KNN can measure the "similarity" (distance) between users with any of several distance metrics. We use one of the simplest, the **Hamming distance**, which measures the fraction of positions at which two sequences of numbers disagree. <br>
Example: <br>
You are given two sequences of numbers, "2|3|5|2|4" and "2|3|7|1|6". Only the first two numbers are equal, so: <br>
The fraction of disagreement is 3/5 = 0.6 <br>
The fraction of agreement is 2/5 = 0.4 <br>
So the Hamming distance (the disagreement) is 0.6.
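
The worked example above can be checked with a few lines of plain Python. This is only an illustrative sketch: `hamming_fraction` is a hypothetical helper written for this example, not part of the notebook, although `scipy.spatial.distance.hamming` (used later) returns the same normalized value for these inputs.

```python
def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    # Count the positions where the two sequences disagree
    disagreements = sum(1 for x, y in zip(a, b) if x != y)
    return disagreements / len(a)


first = [2, 3, 5, 2, 4]
second = [2, 3, 7, 1, 6]
print(hamming_fraction(first, second))  # 0.6
```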

## **Prerequisites**: 

### Download working dataset 

- Download CSV dump of "Book-Crossing Dataset" from http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip 
- Extract the archive into the `$HOME/Downloads/BX-CSV-Dump` directory.

The archive contains 3 files:

- *BX-Books.csv* contains information about the books.
- *BX-Users.csv* contains information about the users.
- *BX-Book-Ratings.csv* contains the users' ratings of the books. (Explicit ratings are given on a scale from 1 to 10; a rating of 0 marks an implicit rating.)


## Book Recommendations Engine

### Install all needed libraries

In [1]:
!python3 -m pip install pandas --user
!python3 -m pip install scipy --user
!python3 -m pip install numpy --user



### Load libraries

In [2]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import hamming 

### Load dataset

#### Load ratings file

In [3]:
def loadRatings():
    ratingsFile='/home/didi/Downloads/BX-CSV-Dump/BX-Book-Ratings.csv'
    # The dump is semicolon-separated and Latin-1 encoded
    ratings=pd.read_csv(ratingsFile,sep=";",header=0, encoding='ISO-8859-1')
    ratings.columns = ["user","isbn","rating"]
    return ratings

ratings = loadRatings()

#### Get the first few rows of the ratings

In [4]:
ratings.head()

| | user | isbn | rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


#### Load Books file by extracting only the first three columns.

In [5]:
def loadBooks():
    bookFile='/home/didi/Downloads/BX-CSV-Dump/BX-Books.csv'
    column_names = ['isbn',"title","author"]
    # on_bad_lines='skip' (pandas 1.3+) replaces the deprecated error_bad_lines=False
    books=pd.read_csv(bookFile,sep=";",header=0,on_bad_lines='skip', usecols=[0,1,2], names = column_names, index_col=0,encoding='ISO-8859-1')
    return books

books = loadBooks()





#### Get the first few rows of the books

In [6]:
books.head()

| isbn | title | author |
|---|---|---|
| 195153448 | Classical Mythology | Mark P. O. Morford |
| 2005018 | Clara Callan | Richard Bruce Wright |
| 60973129 | Decision in Normandy | Carlo D'Este |
| 374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata |
| 393045218 | The Mummies of Urumchi | E. J. W. Barber |


In [7]:
def bookMeta(isbn):
    title = books.at[isbn,"title"]
    author = books.at[isbn,"author"]
    return title, author

bookMeta("0671027360")

('Angels &amp; Demons', 'Dan Brown')

In [8]:
ratings = ratings[ratings["isbn"].isin(books.index)] # Keep only ratings whose ISBN appears in the books table

In [9]:
ratings.shape

(1031175, 3)

### Create Rating Matrix

#### Reduce sparsity 
Restrict the ratings to books that have been rated by more than 10 users and to users who have rated more than 10 books.

In [10]:
def reduceSizeofData(ratings):
    usersPerISBN = ratings.isbn.value_counts() # Number of ratings each book has received
    ISBNsPerUser = ratings.user.value_counts() # Number of books each user has rated
    ratings = ratings[ratings["isbn"].isin(usersPerISBN[usersPerISBN>10].index)]
    ratings = ratings[ratings["user"].isin(ISBNsPerUser[ISBNsPerUser>10].index)]
    return ratings
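
To see what this filtering does, here is a minimal sketch on a toy table. The helper `filter_sparse` and its `min_count` parameter are hypothetical names introduced only for this illustration; the logic mirrors `reduceSizeofData` with the threshold lowered from 10 to 1 so the effect is visible on six rows.

```python
import pandas as pd


def filter_sparse(ratings, min_count):
    """Keep rows whose book and user each appear more than min_count times."""
    users_per_isbn = ratings["isbn"].value_counts()
    isbns_per_user = ratings["user"].value_counts()
    keep_isbn = users_per_isbn[users_per_isbn > min_count].index
    keep_user = isbns_per_user[isbns_per_user > min_count].index
    return ratings[ratings["isbn"].isin(keep_isbn) & ratings["user"].isin(keep_user)]


# Toy data: book "C" and user 3 each appear only once
toy = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3],
    "isbn": ["A", "B", "C", "A", "B", "A"],
    "rating": [5, 3, 4, 4, 2, 1],
})
small = filter_sparse(toy, min_count=1)
print(small)
```

Both counts are computed on the unfiltered table, as in the notebook, so after filtering some remaining users or books may fall below the threshold again.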


#### Create Rating Matrix
Transform the ratings data into a matrix that has ISBNs as columns, user ids as the row index, and ratings as values.

In [11]:
def createRatingMatrix():
    reduced_ratings = reduceSizeofData(ratings)
    userBooksRatingMatrix=pd.pivot_table(reduced_ratings, values='rating',
                                    index=['user'], columns=['isbn'])
    return userBooksRatingMatrix
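
The reshaping done by `pd.pivot_table` can be illustrated on a toy table. The data below is made up for the example; the call uses the same arguments as `createRatingMatrix`.

```python
import pandas as pd

# Toy long-format ratings: one row per (user, book) pair
toy = pd.DataFrame({
    "user": [1, 1, 2, 3],
    "isbn": ["A", "B", "A", "B"],
    "rating": [5, 3, 4, 2],
})

# One row per user, one column per book; missing pairs become NaN
matrix = pd.pivot_table(toy, values="rating", index=["user"], columns=["isbn"])
print(matrix)
```

User 2 never rated book "B" and user 3 never rated book "A", so those cells are NaN. On the real data most cells are NaN, which is why the matrix is described as sparse.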

#### Show a few rows from the matrix

In [12]:
userBooksRatingMatrix = createRatingMatrix()
userBooksRatingMatrix.head()

| user / isbn | 0002005018 | 0002251760 | 0002259834 | 0002558122 | 0006480764 | 000648302X | 0006485200 | 000649840X | 000651202X | 0006512062 | ... | 8845906884 | 8845915611 | 8878188212 | 8885989403 | 9074336329 | 9074336469 | 950491036X | 9681500830 | 9681500954 | 9871138016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 99 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 242 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 243 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 254 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


### Calculate the distance between two users using the Hamming distance
Take each user's row of the rating matrix and compute the Hamming distance between the two rows.

In [13]:
def distance(user1,user2):
    try:
        # Each row of the matrix holds one user's ratings for every book
        user1Ratings = userBooksRatingMatrix.loc[user1]
        user2Ratings = userBooksRatingMatrix.loc[user2]
        dist = hamming(user1Ratings,user2Ratings)
    except KeyError:
        # One of the users is not in the rating matrix
        dist = np.nan
    return dist

In [14]:
distance(204622,10118)

0.9998705585399004
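
A distance this close to 1 is worth a second look. NaN never compares equal to NaN, so an elementwise inequality check counts every book that neither user rated as a disagreement, and in a sparse matrix that is almost every book. The sketch below shows the effect on toy data and one possible alternative that compares only co-rated books. `co_rated_hamming` is a hypothetical helper and is not what the notebook uses.

```python
import math


def naive_hamming(a, b):
    """Fraction of differing positions; NaN != NaN, so shared gaps count as differences."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def co_rated_hamming(a, b):
    """Fraction of differing positions among books that both users rated."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        return math.nan  # No books in common, so the distance is undefined
    return sum(1 for x, y in pairs if x != y) / len(pairs)


nan = math.nan
u1 = [5.0, nan, 3.0, nan, 8.0]
u2 = [5.0, nan, 4.0, 7.0, 8.0]

print(naive_hamming(u1, u2))     # 0.6
print(co_rated_hamming(u1, u2))  # 1 disagreement out of 3 co-rated books
```

Whether the co-rated variant gives better recommendations is an open design question: it ignores how many books two users have in common, so a pair with a single shared rating can appear identical.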

### Function that finds the K nearest neighbours of a specific user

In [15]:
def nearestNeighbors(user,K=10):
    # Get all user ids (the index of the matrix contains the user ids)
    allUserIds = pd.DataFrame(userBooksRatingMatrix.index)
    # Remove the current user from the list of candidates
    allUserIds = allUserIds[allUserIds.user!=user]
    # Add a "distance" column holding each candidate's distance to the current user
    allUserIds["distance"] = allUserIds["user"].apply(lambda x: distance(user,x))
    # Sort by distance in ascending order and keep the K closest users
    KnearestUsers = allUserIds.sort_values(["distance"],ascending=True)["user"][:K]
    return KnearestUsers

In [16]:

user = 204622
KnearestUsers = nearestNeighbors(user)

In [17]:
KnearestUsers

3201     82893
3368     87555
2624     68555
1813     48046
5401    140036
7584    198711
565      16795
8866    232131
239       7346
9693    251422
Name: user, dtype: int64

### Find the top N recommendations for a user

In [18]:
def topNRecommendationsPerUser(user,N=3):
    # Find the K nearest neighbours of the user (K defaults to 10)
    KnearestUserIds = nearestNeighbors(user)
    # Keep only the neighbours' rows of the rating matrix
    NNRatings = userBooksRatingMatrix[userBooksRatingMatrix.index.isin(KnearestUserIds)]
    # Average each book's ratings over the neighbours; mean() skips NaN, and books no neighbour rated are dropped
    avgRating = NNRatings.mean().dropna()
    # Books the user has already rated should not be recommended
    booksAlreadyRead = userBooksRatingMatrix.loc[user].dropna().index
    avgRating = avgRating[~avgRating.index.isin(booksAlreadyRead)]
    # Sort the average ratings in descending order and take the top N ISBNs
    topNISBNs = avgRating.sort_values(ascending=False).index[:N]
    # Look up title and author for each recommended ISBN
    return pd.Series(topNISBNs).apply(bookMeta)
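
The ranking step inside this function can be traced on a toy neighbour matrix. The ratings, user ids and book labels below are invented for the illustration; the three operations are the same ones the function applies to `NNRatings`.

```python
import numpy as np
import pandas as pd

# Ratings given by two hypothetical neighbours (rows) to four books (columns)
neighbour_ratings = pd.DataFrame(
    {
        "A": [8.0, 6.0],
        "B": [np.nan, 9.0],
        "C": [np.nan, np.nan],  # No neighbour rated this book
        "D": [4.0, np.nan],
    },
    index=[10, 11],
)
already_read = ["A"]  # Books the target user has already rated

# Column-wise mean ignoring NaN; book "C" has no ratings and is dropped
avg = neighbour_ratings.mean().dropna()
# Exclude books the user already knows
avg = avg[~avg.index.isin(already_read)]
# Highest average first, keep the top 2
top = list(avg.sort_values(ascending=False).index[:2])
print(top)  # ['B', 'D']
```

Note that book "B" wins on a single rating. With only K neighbours contributing, averages based on one or two ratings are common, so a minimum rating count per book may be worth considering.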

In [19]:
topNRecommendationsPerUser(204813,10)



0    (Special Operations (Badge of Honor Novels (Pa...
1                       (Lady of Desire, Gaelen Foley)
2         (Up &amp; Out (Red Dress Ink), Ariella Papa)
3             (The Little Drummer Girl, John Le Carre)
4    (Stiff: The Curious Lives of Human Cadavers, M...
5     (A Kiss for Little Bear, Else Holmelund Minarik)
6                         (Name Der Rose, Umberto Eco)
7          (Sabriel (The Abhorsen Trilogy), Garth Nix)
8              (Me Talk Pretty One Day, David Sedaris)
9                    (Mixed Blessings, DANIELLE STEEL)
Name: isbn, dtype: object