## Book Recommendation

Dataset URL: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

#### About the dataset:
The Book-Crossing dataset comprises 3 tables.

**BX-Users**
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.

**BX-Books**
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.

**BX-Book-Ratings**
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
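Because a rating of 0 marks an implicit interaction rather than a low score, it can help to separate the two kinds of rating before computing averages. The following is a minimal sketch on made-up rows; `toy_ratings` and its values are assumptions for illustration, not data from the Book-Crossing files.

```python
import pandas as pd

# Toy stand-in for BX-Book-Ratings; the values are invented for illustration.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 10, 0],
})

# A rating of 0 marks an implicit interaction; 1-10 are explicit scores.
explicit = toy_ratings[toy_ratings["Book-Rating"] > 0]
implicit = toy_ratings[toy_ratings["Book-Rating"] == 0]

print(len(explicit), len(implicit))
# Averaging only explicit ratings avoids treating "no score given" as a score of zero.
print(explicit["Book-Rating"].mean())
```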

In [1]:
# importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines='warn'` (pandas >= 1.3)
# skips malformed rows and reports them, which matches the messages shown below.
books = pd.read_csv('BX-Books.csv', sep = ';', on_bad_lines = 'warn', encoding = "latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('BX-Users.csv', sep = ';', on_bad_lines = 'warn', encoding = "latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('BX-Book-Ratings.csv', sep = ';', on_bad_lines = 'warn', encoding = "latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


In [3]:
books.head(2)

| | ISBN | bookTitle | bookAuthor | yearOfPublication | publisher | imageUrlS | imageUrlM | imageUrlL |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |


In [4]:
users.head(2)

| | userID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |


In [5]:
ratings.head(2)

| | userID | ISBN | bookRating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |


In [6]:
books.shape, users.shape, ratings.shape

((271360, 8), (278858, 3), (1149780, 3))

`kNN` (k-nearest neighbors) is a machine learning algorithm that, in this setting, finds the items or users most similar to a given one based on common book ratings, and makes predictions using the average rating of the top-k nearest neighbors. To apply it, we first arrange the ratings in a matrix with one row for each item (book) and one column for each user.
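To make the idea concrete, here is a small sketch of a book-by-user matrix and a neighbor-based prediction. The matrix `R`, the choice of `k`, and the rule of averaging only neighbors the user has actually rated are all assumptions made for illustration; the notebook itself only retrieves neighbors and does not predict ratings.

```python
import numpy as np

# Rows are books, columns are users; 0 means "no rating". Values are invented.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],   # book 0
    [4.0, 5.0, 0.0, 0.0],   # book 1
    [0.0, 1.0, 5.0, 4.0],   # book 2
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = 0  # the book whose neighbors we want
sims = [(cosine_similarity(R[query], R[j]), j) for j in range(len(R)) if j != query]
sims.sort(reverse=True)          # most similar book first
nearest = sims[0][1]

# Predict user 2's rating for the query book as the mean of that user's
# ratings on the k most similar books, skipping books the user has not rated.
k = 2
user = 2
top_k = [j for _, j in sims[:k]]
neighbour_ratings = [R[j, user] for j in top_k if R[j, user] > 0]
prediction = float(np.mean(neighbour_ratings)) if neighbour_ratings else None

print(nearest, top_k, prediction)
```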

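To keep only users with enough ratings for meaningful comparisons, users with fewer than 200 ratings are excluded. Note that the second filter in the cell below counts occurrences of each rating value (0-10) rather than ratings per book, so it does not remove rarely rated books (the counts further down still include books rated only once); those are filtered out later with a popularity threshold.

In [7]:
# Keep only users who have submitted at least 200 ratings.
userRatings = ratings['userID'].value_counts()
ratings = ratings[ratings['userID'].isin(userRatings[userRatings >= 200].index)]
# Caution: this counts rating values, not books; filtering on ratings['ISBN'].value_counts()
# would be needed to drop books with fewer than 100 ratings.
counts = ratings['bookRating'].value_counts()
ratings = ratings[ratings['bookRating'].isin(counts[counts >= 100].index)]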
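For comparison, a per-user and a per-book minimum-count filter can be sketched on toy data as follows. The frame `toy` and the thresholds `min_user_ratings` and `min_book_ratings` are hypothetical and deliberately tiny; this is an illustration of the technique, not a replacement for the cell above, whose recorded outputs depend on the filter exactly as written.

```python
import pandas as pd

# Invented ratings: user 1 rated three books, user 2 two books, user 3 one book.
toy = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["A", "B", "C", "A", "B", "A"],
    "bookRating": [5, 0, 7, 8, 0, 10],
})

min_user_ratings = 2   # hypothetical threshold for this toy example
min_book_ratings = 2   # hypothetical threshold for this toy example

# Keep users with enough ratings.
user_counts = toy["userID"].value_counts()
kept = toy[toy["userID"].isin(user_counts[user_counts >= min_user_ratings].index)]

# Then keep books with enough ratings among the remaining rows.
book_counts = kept["ISBN"].value_counts()
kept = kept[kept["ISBN"].isin(book_counts[book_counts >= min_book_ratings].index)]

print(kept)
```

Counting on `ISBN` (or on the title, after merging) is what ties the threshold to books rather than to rating values.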

In [8]:
df_booksRatings = pd.merge(ratings, books, on = 'ISBN')
columns = ['yearOfPublication', 'publisher', 'bookAuthor', 'imageUrlS', 'imageUrlM', 'imageUrlL']
df_booksRatings = df_booksRatings.drop(columns, axis = 1)
df_booksRatings.head()

| | userID | ISBN | bookRating | bookTitle |
|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... |
| 2 | 11676 | 002542730X | 6 | Politically Correct Bedtime Stories: Modern Ta... |
| 3 | 12538 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... |
| 4 | 13552 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... |


In [9]:
df_booksRatings[df_booksRatings.bookTitle == 'Always Have Popsicles']

| | userID | ISBN | bookRating | bookTitle |
|---|---|---|---|---|


In [10]:
# We then group by book titles and create a new column for total rating count.

df_booksRatings = df_booksRatings.dropna(axis = 0, subset = ['bookTitle'])

df_bookRatingCount = (df_booksRatings.groupby(by = ['bookTitle'])['bookRating'].count().
                      reset_index().rename(columns = {'bookRating': 'totalRatingCount'})[['bookTitle', 'totalRatingCount']])
df_bookRatingCount.head()

| | bookTitle | totalRatingCount |
|---|---|---|
| 0 | A Light in the Storm: The Civil War Diary of ... | 2 |
| 1 | Always Have Popsicles | 1 |
| 2 | Apple Magic (The Collector's series) | 1 |
| 3 | Beyond IBM: Leadership Marketing and Finance ... | 1 |
| 4 | Clifford Visita El Hospital (Clifford El Gran... | 1 |


In [11]:
# We combine the rating data with the total rating count data. This gives us what we need
# to find out which books are popular and to filter out lesser-known books.

df_ratingTotalCount = df_booksRatings.merge(df_bookRatingCount, left_on = 'bookTitle', right_on = 'bookTitle', how = 'left')
df_ratingTotalCount.head()

| | userID | ISBN | bookRating | bookTitle | totalRatingCount |
|---|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | 82 |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | 82 |
| 2 | 11676 | 002542730X | 6 | Politically Correct Bedtime Stories: Modern Ta... | 82 |
| 3 | 12538 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | 82 |
| 4 | 13552 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | 82 |
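The count-then-merge pattern used above can also be written with `groupby(...).transform('count')`, which attaches each group's count to its rows directly. The sketch below compares the two routes on a made-up frame; `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Invented ratings: book "X" has three ratings, book "Y" has one.
toy = pd.DataFrame({
    "userID": [1, 2, 3, 1],
    "bookTitle": ["X", "X", "X", "Y"],
    "bookRating": [10, 0, 6, 8],
})

# Route used in the notebook: aggregate per title, then merge the counts back.
counts = (toy.groupby("bookTitle")["bookRating"].count()
             .reset_index()
             .rename(columns={"bookRating": "totalRatingCount"}))
merged = toy.merge(counts, on="bookTitle", how="left")

# Alternative: transform broadcasts each group's count onto its own rows.
toy["totalRatingCount"] = toy.groupby("bookTitle")["bookRating"].transform("count")

print(merged["totalRatingCount"].tolist())
print(toy["totalRatingCount"].tolist())
```

Both routes should give the same per-row counts; the merge version keeps a separate per-title table, which the notebook reuses for the summary statistics.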


In [12]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(df_bookRatingCount['totalRatingCount'].describe())

count   160576.000
mean         3.044
std          7.428
min          1.000
25%          1.000
50%          1.000
75%          2.000
max        365.000
Name: totalRatingCount, dtype: float64


In [13]:
# The median book has been rated only once. Let’s look at the top of the distribution

print(df_bookRatingCount['totalRatingCount'].quantile(np.arange(.9, 1, .01)))

0.900    5.000
0.910    6.000
0.920    7.000
0.930    7.000
0.940    8.000
0.950   10.000
0.960   11.000
0.970   14.000
0.980   19.000
0.990   31.000
Name: totalRatingCount, dtype: float64


In [14]:
popularity_threshold = 50
df_ratingPopBook = df_ratingTotalCount.query('totalRatingCount >= @popularity_threshold')
df_ratingPopBook.head()

| | userID | ISBN | bookRating | bookTitle | totalRatingCount |
|---|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | 82 |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | 82 |
| 2 | 11676 | 002542730X | 6 | Politically Correct Bedtime Stories: Modern Ta... | 82 |
| 3 | 12538 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | 82 |
| 4 | 13552 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | 82 |


In [15]:
df_ratingPopBook.shape

(62149, 5)

In [16]:
df_userBookRating = df_ratingPopBook.merge(users, left_on = 'userID', right_on = 'userID', how = 'left')
df_userBookRating.head()

| | userID | ISBN | bookRating | bookTitle | totalRatingCount | Location | Age |
|---|---|---|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | 82 | gilbert, arizona, usa | 48.0 |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | 82 | knoxville, tennessee, usa | 29.0 |
| 2 | 11676 | 002542730X | 6 | Politically Correct Bedtime Stories: Modern Ta... | 82 | n/a, n/a, n/a | NaN |
| 3 | 12538 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | 82 | byron, minnesota, usa | 18.0 |
| 4 | 13552 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | 82 | cordova, tennessee, usa | 32.0 |


In [17]:
df_userBookRating['Location'].unique()

array(['gilbert, arizona, usa', 'knoxville, tennessee, usa',
       'n/a, n/a, n/a', 'byron, minnesota, usa',
       'cordova, tennessee, usa', 'mechanicsville, maryland, usa',
       'palos hills, illinois, usa', 'nj, new jersey, usa',
       'hickory, mississippi, usa', 'south ohio, nova scotia, canada',
       'charleston, south carolina, usa', 'jasper, missouri, usa',
       'orlando, florida, usa', 'toronto, ontario, canada',
       'florence, alabama, usa', 'livermore, california, usa',
       'chamblee, georgia, usa', 'alvin, texas, usa',
       'valley center, kansas, usa',
       'atlantic highlands, new jersey, usa', 'lisboa, lisboa, portugal',
       'tigard, oregon, usa', 'washington, dc, usa',
       'nashville, tennessee, usa',
       'christchurch, canterbury, new zealand', 'houston, texas, usa',
       'kirkland, washington, usa', 'albuquerque, new mexico, usa',
       'lakewood, washington, usa', 'evanston, illinois, usa',
       'north vancouver, british columbia, can

##### Filtering to users in the USA, Canada, and the UK to process the data faster

In [18]:
df_locationUserRating = df_userBookRating[df_userBookRating['Location'].str.contains("usa|canada|united kingdom")]
df_locationUserRating = df_locationUserRating.drop('Age', axis=1)
df_locationUserRating.head()

| | userID | ISBN | bookRating | bookTitle | totalRatingCount | Location |
|---|---|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | 82 | gilbert, arizona, usa |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | 82 | knoxville, tennessee, usa |
| 3 | 12538 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | 82 | byron, minnesota, usa |
| 4 | 13552 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | 82 | cordova, tennessee, usa |
| 5 | 16795 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | 82 | mechanicsville, maryland, usa |
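One caveat with a substring match on `Location` is that it looks anywhere in the string, so a city name containing "usa" would also pass. The sketch below contrasts the substring filter with a stricter variant that compares only the last comma-separated field. The frame `toy_users` is invented, and treating the last field as the country is an assumption about the `Location` format rather than something the dataset guarantees.

```python
import pandas as pd

# Invented users; the last row has no location at all.
toy_users = pd.DataFrame({
    "userID": [1, 2, 3, 4, 5, 6],
    "Location": [
        "gilbert, arizona, usa",
        "toronto, ontario, canada",
        "lisboa, lisboa, portugal",
        "n/a, n/a, n/a",
        "busan, busan, south korea",   # contains the substring "usa"
        None,
    ],
})

# Substring match, as in the notebook; na=False treats missing locations as non-matches.
mask = toy_users["Location"].str.contains("usa|canada|united kingdom", na=False)
by_substring = toy_users[mask]

# Stricter variant: compare only the last comma-separated field.
country = toy_users["Location"].str.rsplit(",", n=1).str[-1].str.strip()
by_country = toy_users[country.isin(["usa", "canada", "united kingdom"])]

print(by_substring["userID"].tolist())
print(by_country["userID"].tolist())
```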


##### Implementing kNN
We convert our table to a 2D matrix and fill the missing values with zeros (since we will calculate distances between rating vectors). We then transform the values (ratings) of the matrix dataframe into a SciPy sparse matrix for more efficient calculations.

To find the nearest neighbors, we use the unsupervised `NearestNeighbors` estimator from `sklearn.neighbors`. We set `algorithm='brute'` and `metric='cosine'`, so that the model computes the cosine distance (1 minus the cosine similarity) between rating vectors. Finally, we fit the model.
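As a worked illustration of the cosine metric on zero-filled rating vectors, the sketch below computes the distance by hand with NumPy. The function `cosine_distance` and the three vectors are hypothetical examples, not part of the notebook or of scikit-learn.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cos(angle) between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Each position is one user; 0 means that user did not rate the book.
book_a = [5, 0, 10, 0]
book_b = [10, 0, 20, 0]   # same users, doubled ratings: same direction as book_a
book_c = [0, 7, 0, 3]     # rated by a disjoint set of users

d_ab = cosine_distance(book_a, book_b)
d_ac = cosine_distance(book_a, book_c)
print(d_ab, d_ac)
```

The metric depends only on the direction of the vectors, so books rated by the same users in the same proportions come out close to 0, while books with no raters in common come out at 1.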

In [19]:
from scipy.sparse import csr_matrix

df_locationUserRating = df_locationUserRating.drop_duplicates(['userID', 'bookTitle'])
df_locationUserRatingPivot = df_locationUserRating.pivot(index = 'bookTitle', columns = 'userID', values = 'bookRating').fillna(0)
locationUserRatingMatrix = csr_matrix(df_locationUserRatingPivot.values)

In [20]:
df_locationUserRatingPivot.values

array([[ 9.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0., 10.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
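The pivot step, and the reason for de-duplicating first, can be seen on a tiny example. The frame `toy` below is invented; the sketch assumes the same column names as the notebook and uses only pandas.

```python
import pandas as pd

# Invented ratings; user 3 appears twice for book "Y".
toy = pd.DataFrame({
    "userID": [1, 2, 2, 3, 3],
    "bookTitle": ["X", "X", "Y", "Y", "Y"],
    "bookRating": [9, 10, 7, 5, 6],
})

# pivot() raises an error on repeated (bookTitle, userID) pairs, hence the de-duplication.
# drop_duplicates keeps the first occurrence by default.
deduped = toy.drop_duplicates(["userID", "bookTitle"])
pivot = deduped.pivot(index="bookTitle", columns="userID", values="bookRating").fillna(0)

# Fraction of non-zero cells; on real data this is small, which motivates a sparse matrix.
density = (pivot.values != 0).mean()

print(pivot)
print(density)
```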

In [21]:
from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(locationUserRatingMatrix)

NearestNeighbors(algorithm='brute', metric='cosine')

In [22]:
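# Pick a random book (row) as the query.
queryIndex = np.random.choice(df_locationUserRatingPivot.shape[0])
print(queryIndex)
# Ask for 6 neighbors: the closest match is typically the query book itself, leaving 5 recommendations.
distances, indices = model_knn.kneighbors(df_locationUserRatingPivot.iloc[queryIndex,:].values.reshape(1, -1), n_neighbors = 6)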

657


In [23]:
df_locationUserRatingPivot.iloc[queryIndex,:].values.reshape(1,-1)

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.,  0.,  5.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 10.

In [24]:
df_locationUserRatingPivot.iloc[queryIndex]

userID
254      0.000
2276     0.000
2766     0.000
2977     0.000
3363     0.000
          ... 
274808   0.000
275970   0.000
277427   0.000
277639   0.000
278418   0.000
Name: The Surgeon, Length: 774, dtype: float64

In [26]:
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(df_locationUserRatingPivot.index[queryIndex]))
    else:
        print('{0}: {1} with distance {2}'.format(i, df_locationUserRatingPivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for The Surgeon:

1: The Mulberry Tree with distance 0.680724527860187
2: The Apprentice with distance 0.692977854823377
3: Beach House with distance 0.7252964887079045
4: The Jester with distance 0.7395623766658288
5: Mercy with distance 0.7413004757045465
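To show what a brute-force cosine search does when looking up a book by title, here is a NumPy-only sketch. The helper `nearest_titles` and the frame `toy_pivot` are hypothetical and are not part of the notebook or of scikit-learn; the sketch assumes a pivot with titles as the index and no all-zero rows.

```python
import numpy as np
import pandas as pd

def nearest_titles(pivot, title, k=5):
    """Return the k titles closest to `title` by cosine distance (brute force)."""
    X = pivot.to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=1)
    q = pivot.index.get_loc(title)
    # Cosine distance from the query row to every row (all-zero rows would divide by zero).
    dist = 1.0 - (X @ X[q]) / (norms * norms[q])
    # Sort by distance and drop the query book itself.
    order = [i for i in np.argsort(dist) if i != q][:k]
    return [(pivot.index[i], float(dist[i])) for i in order]

# Invented book-by-user ratings.
toy_pivot = pd.DataFrame(
    [[5, 4, 0, 0], [4, 5, 0, 1], [0, 0, 5, 4], [0, 1, 4, 5]],
    index=["Book A", "Book B", "Book C", "Book D"],
    columns=[101, 102, 103, 104],
)

result = nearest_titles(toy_pivot, "Book A", k=2)
for rank, (name, d) in enumerate(result, start=1):
    print('{0}: {1} with distance {2}'.format(rank, name, d))
```

Looking up by title instead of a random row index makes the query reproducible; on the full matrix, the fitted `NearestNeighbors` model is the more practical choice.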
