In [1]:
import numpy as np
import pandas as pd

In [2]:
#Reading the CSV files into dataframes
books = pd.read_csv("BX-Books.csv", sep = ";", encoding = "latin-1",  on_bad_lines='skip')
users = pd.read_csv("BX-Users.csv", sep = ";", encoding = "latin-1",  on_bad_lines='skip')
ratings = pd.read_csv("BX-Book-Ratings.csv", sep = ";", encoding = "latin-1",  on_bad_lines='skip')



# Preprocessing Data

In [3]:
#Extracting only the columns that we'll need (.copy() so the in-place rename below works on an independent frame)
books = books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher']].copy()

#Renaming the columns to make them easy to use
books.rename(columns = {'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'}, inplace=True)
users.rename(columns = {'User-ID':'user_id', 'Location':'location', 'Age':'age'}, inplace=True)
ratings.rename(columns = {'User-ID':'user_id', 'Book-Rating':'rating'}, inplace=True)   

In [4]:
books.head(100)

Unnamed: 0,ISBN,title,author,year,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company
...,...,...,...,...,...
95,0671867156,Pretend You Don't See Her,Mary Higgins Clark,1998,Pocket
96,0312252617,Fast Women,Jennifer Crusie,2001,St. Martin's Press
97,0312261594,Female Intelligence,Jane Heller,2001,St. Martin's Press
98,0316748641,Pasquale's Nose: Idle Days in an Italian Town,Michael Rips,2002,Back Bay Books


# Exploratory Data Analysis

## Flaw in the dataset 
To keep the model reliable, we rely only on users who have rated a reasonable number of books. We keep the users with more than 200 ratings.

The same goes for books: titles with too few ratings add noise to the model and are unlikely to be recommended anyway, so we keep only the books that have at least 50 ratings.

## Step 1 : Extracting users with more than 200 ratings

In [5]:
#Extracting the users with more than 200 ratings
x = ratings['user_id'].value_counts() > 200
y = x[x].index
print(y.shape) #899 users are included in our model
#Reducing the ratings set to the preselected users
ratings = ratings[ratings['user_id'].isin(y)]

(899,)
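
The boolean-mask idiom used in the cell above can be easier to follow on a toy example. The sketch below uses a made-up `toy_ratings` frame and a threshold of 2 in place of 200; both are assumptions for illustration only and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: user 1 has 3 ratings, user 2 has 2, user 3 has 1
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "b", "a"],
    "rating": [5, 0, 8, 7, 0, 10],
})

MIN_RATINGS = 2  # stands in for the 200 used in the notebook

# value_counts() gives one count per user_id; the comparison yields a boolean Series
counts = toy_ratings["user_id"].value_counts()
is_active = counts > MIN_RATINGS

# Indexing a boolean Series by itself keeps only the True entries
active_users = is_active[is_active].index

# Keep only the ratings made by the selected users
filtered = toy_ratings[toy_ratings["user_id"].isin(active_users)]
print(list(active_users), len(filtered))
```

Note that `>` is strict: a user with exactly `MIN_RATINGS` ratings is dropped. Switching to `>=` would keep them, if "at least" is the intended rule.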


## Step 2 : Merging the ratings with the books 

In [6]:
rating_with_books = ratings.merge(books, on='ISBN')
rating_with_books.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
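
One detail worth keeping in mind: `merge` performs an inner join by default, so any rating whose ISBN does not appear in `books` is silently dropped. The sketch below illustrates this on hypothetical toy frames (`toy_ratings`, `toy_books` are made up) and shows one way to inspect the unmatched rows with `how="left"` and `indicator=True`.

```python
import pandas as pd

# Hypothetical toy data; ISBN "999" has no matching book
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "ISBN": ["111", "222", "999"],
    "rating": [8, 0, 5],
})
toy_books = pd.DataFrame({
    "ISBN": ["111", "222"],
    "title": ["Book A", "Book B"],
})

# Default inner join: the rating for ISBN "999" disappears
inner = toy_ratings.merge(toy_books, on="ISBN")

# Left join with an indicator column keeps every rating and flags the unmatched ones
left = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(inner), len(left), unmatched["ISBN"].tolist())
```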


## Step 3 : Extracting books that have received at least 50 ratings

In [7]:
number_rating = rating_with_books.groupby('title')['rating'].count().reset_index()
number_rating.rename(columns= {'rating':'number_of_ratings'}, inplace=True)
#Merging everything
final_rating = rating_with_books.merge(number_rating, on='title')
final_rating = final_rating[final_rating['number_of_ratings'] >= 50]
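#Keeping one rating per (user, title) pair; reassigning avoids modifying a filtered slice in place
final_rating = final_rating.drop_duplicates(['user_id','title'])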
print(final_rating.shape)

(59850, 8)


## Step 4 : Create pivot table 
Now we create a pivot table where the columns are user ids, the index is the book title, and the values are the ratings. Wherever a user has not rated a book, the cell is NaN, so we fill those cells with zero.

In [8]:
book_pivot = final_rating.pivot_table(columns='user_id', index='title', values="rating")
book_pivot.fillna(0, inplace=True)
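
To make the shape of this table concrete, here is a minimal sketch on a hypothetical three-user, two-book frame (the `toy` data is made up for illustration).

```python
import pandas as pd

# Hypothetical toy ratings: user 3 never rated "Book A", user 2 never rated "Book B"
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["Book A", "Book B", "Book A", "Book B"],
    "rating": [8, 5, 10, 7],
})

# One row per title, one column per user; missing (title, user) pairs become NaN
toy_pivot = toy.pivot_table(index="title", columns="user_id", values="rating")

# Replace the NaN cells with 0
toy_pivot = toy_pivot.fillna(0)
print(toy_pivot)
```

One caveat: the ratings shown earlier already contain 0 values, so after filling, a 0 in the pivot table can mean either "rated 0" or "not rated". The distance computation cannot tell the two apart.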

The pivot table is mostly zeros, and storing and computing distances over all those zero entries wastes memory and computing power. So we convert the pivot table to a sparse matrix before feeding it to the model.

In [9]:
from scipy.sparse import csr_matrix
book_sparse = csr_matrix(book_pivot)
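
As a rough illustration of what the conversion buys, the sketch below builds a small, mostly-zero array (made up for this example) and checks how many entries the CSR format actually stores.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 4 matrix with only three non-zero ratings
dense = np.array([
    [8.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 10.0, 0.0, 0.0],
])
sparse = csr_matrix(dense)

# nnz is the number of explicitly stored values
stored = sparse.nnz
# Fraction of cells that are non-zero
density = stored / (dense.shape[0] * dense.shape[1])

print(stored, density)
```

The same two lines (`nnz` and the density ratio) can be applied to `book_sparse` to see how sparse the real pivot table is.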

Now we train the nearest neighbors algorithm. We set the algorithm to "brute", which means the distance from a query point to every other point is computed directly, without building a search tree.

In [10]:
from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(algorithm = "brute")
model.fit(book_sparse)

NearestNeighbors(algorithm='brute')
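
Before querying the real model, it may help to see what `kneighbors` returns on a tiny example. The sketch below fits a brute-force model on a hypothetical 4-book, 3-user matrix (made-up values) and queries it with the first row. It relies on the library defaults, which use the Euclidean distance (Minkowski with `p=2`); other metrics such as `metric="cosine"` exist, but the notebook does not use them.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings: rows are books, columns are users
toy = csr_matrix(np.array([
    [8.0, 0.0, 5.0],
    [7.0, 0.0, 6.0],
    [0.0, 9.0, 0.0],
    [0.0, 10.0, 1.0],
]))

toy_model = NearestNeighbors(n_neighbors=2, algorithm="brute")
toy_model.fit(toy)

# Both results have shape (n_queries, n_neighbors), here (1, 2)
distances, indices = toy_model.kneighbors(toy[0])
print(indices, distances)
```

Because the query row is part of the fitted data, it is returned as its own nearest neighbor at distance 0. Note also that both arrays are two-dimensional, so the neighbors of a single query live in `indices[0]`.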

## Step 5 : Testing the model 
Let's make a prediction and see whether the model suggests sensible books. We find the nearest neighbors of the input book and print the 5 closest ones (5 is the default number of neighbors). Since the input book is part of the fitted data, it comes back as its own nearest neighbor. The call returns the distances and the row indices of those neighbors. Let us pass the book at index 199, which is Fat Tuesday.

In [11]:
#Getting the suggestions and the distances
INDEX_SUGGESTION = 199
distances, suggestions = model.kneighbors(book_pivot.iloc[INDEX_SUGGESTION, :].values.reshape(1, -1))

In [12]:
#Printing all the suggestions we got
#suggestions has shape (1, 5), so we iterate over its first (and only) row
print("Given book : " + book_pivot.index[INDEX_SUGGESTION])
for i, book_idx in enumerate(suggestions[0]):
  print(f"Book number {i} : {book_pivot.index[book_idx]}")

Given book : Fat Tuesday
Book number 0 : Fat Tuesday
Book number 1 : Exclusive
Book number 2 : Long After Midnight
Book number 3 : Jacob Have I Loved
Book number 4 : No Safe Place
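
Looking a book up by row position is fragile, so a small wrapper that takes a title can be convenient. The sketch below is one possible way to do it, not part of the original notebook: `recommend` is a hypothetical helper, and the toy pivot table and model exist only to make the example self-contained. It asks for one extra neighbor and drops the query book from the result.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend(title, pivot, model, n=5):
    """Return up to n titles closest to `title`, excluding the title itself."""
    # Row position of the title; raises KeyError if the title is not in the index
    position = pivot.index.get_loc(title)
    query = pivot.iloc[position, :].values.reshape(1, -1)
    # Ask for n + 1 neighbors because the query book is its own nearest neighbor
    _, neighbor_idx = model.kneighbors(query, n_neighbors=n + 1)
    return [pivot.index[i] for i in neighbor_idx[0] if i != position][:n]

# Hypothetical toy pivot table: rows are titles, columns are user ids
toy_pivot = pd.DataFrame(
    [[8.0, 0.0, 5.0], [7.0, 0.0, 6.0], [0.0, 9.0, 0.0], [0.0, 10.0, 1.0]],
    index=pd.Index(["Book A", "Book B", "Book C", "Book D"], name="title"),
    columns=[1, 2, 3],
)
toy_model = NearestNeighbors(algorithm="brute").fit(toy_pivot.values)

result = recommend("Book A", toy_pivot, toy_model, n=2)
print(result)
```

With the notebook's objects, the equivalent call would be `recommend("Fat Tuesday", book_pivot, model)`, assuming each title appears only once in the index, which holds for a pivot table indexed by title.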
