# Book Recommendation System

## What Is a Recommendation System?
A recommendation engine is a class of machine learning system that offers relevant suggestions to the customer.  
Before recommendation systems, people mostly decided what to buy by asking friends for suggestions. 
Now Google knows what news you will read and YouTube knows what type of videos you will watch, based on your search history, watch history, or purchase history.

A recommendation system helps an organization create loyal customers and build trust by showing them the products and services they came to the site for. 
Today's recommendation systems are powerful enough to handle even a new customer who is visiting the site for the first time. 
For such a visitor they can recommend products that are currently trending or highly rated, and they can also recommend the products that bring the company the most profit.
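
For a first-time visitor with no history, the "highly rated" fallback described above can be sketched as a simple popularity ranking. This is a minimal illustration on made-up data: the column names, the `popular_items` helper, and the `min_ratings` cutoff are assumptions for the example, not part of the project's dataset or code.

```python
import pandas as pd

# Made-up ratings: item "c" has a perfect score but only one rating.
ratings_toy = pd.DataFrame({
    "item": ["a", "a", "a", "b", "b", "c"],
    "rating": [8, 9, 10, 10, 10, 10],
})

def popular_items(df, min_ratings=2, top_n=2):
    """Rank items by mean rating, ignoring items with too few ratings."""
    stats = df.groupby("item")["rating"].agg(["count", "mean"])
    stats = stats[stats["count"] >= min_ratings]
    return stats.sort_values("mean", ascending=False).head(top_n).index.tolist()

top = popular_items(ratings_toy)
print(top)
```

The minimum-count cutoff keeps a single enthusiastic rating from pushing an item to the top of the list.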

## Types Of Recommendation System
A recommendation system is usually built using one of three techniques: **content-based filtering, collaborative filtering, and a combination of both.**

## 1) Content-Based Filtering
The algorithm recommends products that are similar to the ones a user has already watched or liked. 
In simple words, we try to find look-alike items. 
For example, a person likes to watch Sachin Tendulkar shots, so he may like watching Ricky Ponting shots too because the two videos have similar tags and similar categories.

It only looks at similarity between items and pays little attention to who is watching. It simply recommends the products with the highest similarity score based on the user's past preferences.
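
The "similar tags" idea can be sketched with a tag-overlap score. The catalogue below is invented for illustration, and Jaccard similarity is just one possible choice of score; real systems typically use richer item features.

```python
# Toy catalogue: each video is described by a set of tags (made-up data).
video_tags = {
    "Tendulkar straight drives": {"cricket", "batting", "highlights"},
    "Ponting pull shots": {"cricket", "batting", "highlights", "australia"},
    "Pasta in 10 minutes": {"cooking", "recipe"},
}

def jaccard(a, b):
    """Share of tags two items have in common (0 = nothing, 1 = identical)."""
    return len(a & b) / len(a | b)

def most_similar(title, catalogue):
    """Return the other items ranked by tag overlap with `title`."""
    target = catalogue[title]
    scores = {
        other: jaccard(target, tags)
        for other, tags in catalogue.items()
        if other != title
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = most_similar("Tendulkar straight drives", video_tags)
print(ranked)
```

Nothing about the viewer enters the calculation; only item descriptions are compared, which is the defining trait of content-based filtering.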

## 2) Collaborative Filtering
Collaborative filtering recommender systems are based on past interactions between users and items.  
In simple words, we search for look-alike customers and offer products based on what a user's look-alikes have chosen. 
Let us understand with an example. X and Y are two similar users. User X has watched movies A, B, and C, 
and user Y has watched movies B, C, and D. We will then recommend movie A to user Y and movie D to user X.
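
The X and Y example can be written out directly. This is only a sketch of the idea: it assumes the look-alike user is already known, whereas a real system has to find similar users from the data first. The helper name is hypothetical.

```python
# Watch histories from the example above.
watched = {"X": {"A", "B", "C"}, "Y": {"B", "C", "D"}}

def recommend_from_lookalike(user, lookalike, history):
    """Suggest what the look-alike has watched and `user` has not."""
    return sorted(history[lookalike] - history[user])

rec_for_y = recommend_from_lookalike("Y", "X", watched)
rec_for_x = recommend_from_lookalike("X", "Y", watched)
print(rec_for_y, rec_for_x)
```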

YouTube has shifted its recommendation system from content-based to collaborative filtering. 
You may have noticed that it sometimes recommends videos that are not at all related to your history, 
because other people similar to you have watched them.

## 3) Hybrid Filtering Method
It is basically a combination of the two methods above. It is a more complex model that recommends products based on your own history as well as on users similar to you.

Some organizations use this method: Facebook, for example, shows news that is important to you and to others in your network, and LinkedIn does the same.
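
One simple way to combine the two signals is a weighted blend of a content-based score and a collaborative score. The sketch below is an assumption for illustration only: the scores, the `alpha` weight, and the `hybrid_score` helper are invented, and production hybrid systems are usually far more elaborate.

```python
def hybrid_score(content_score, collaborative_score, alpha=0.5):
    """Weighted blend: alpha=1 is purely content-based, alpha=0 purely collaborative."""
    return alpha * content_score + (1 - alpha) * collaborative_score

# Hypothetical (content_score, collaborative_score) pairs per candidate item.
candidates = {
    "item_1": (0.9, 0.2),
    "item_2": (0.4, 0.8),
    "item_3": (0.1, 0.3),
}

ranked = sorted(
    candidates,
    key=lambda name: hybrid_score(*candidates[name], alpha=0.3),
    reverse=True,
)
print(ranked)
```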

## Book Recommendation System
A book recommendation system is a type of recommendation system 
where we recommend similar books to a reader based on the reader's interests. 
Book recommendation systems are used by online sites that provide books and ebooks, such as Google Play Books, Open Library, and Goodreads.

## Dataset
- https://www.kaggle.com/rxsraghavagrawal/book-recommender-system?select=BX-Users.csv

## Dataset Description
We have 3 files in our dataset, which was extracted from book-selling websites.

- Books – The first file contains information about the books, such as author, title, and publication year.
- Users – The second file contains registered users' information, such as user id and location.
- Ratings – The third file records which user gave which rating to which book.


So based on all these three files we can build a powerful **collaborative filtering model.**


In [7]:
# Import libraries
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [6]:
# Load Data

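# Malformed rows are skipped and reported. Older pandas did this with error_bad_lines=False,
# which was removed in pandas 2.0; on_bad_lines='warn' (pandas >= 1.3) is the replacement.
books = pd.read_csv("Data/BX-Books.csv", sep=';', encoding="latin-1", on_bad_lines='warn')
users = pd.read_csv("Data/BX-Users.csv", sep=';', encoding="latin-1", on_bad_lines='warn')
ratings = pd.read_csv("Data/BX-Book-Ratings.csv", sep=';', encoding="latin-1", on_bad_lines='warn')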

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


In [8]:
books.head()

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |


In [9]:
users.head()

| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |


In [10]:
ratings.head()

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


In [11]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [12]:
users.columns

Index(['User-ID', 'Location', 'Age'], dtype='object')

In [13]:
ratings.columns

Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')

In [14]:
books = books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher']]
books.rename(columns = {'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'}, inplace=True)
users.rename(columns = {'User-ID':'user_id', 'Location':'location', 'Age':'age'}, inplace=True)
ratings.rename(columns = {'User-ID':'user_id', 'Book-Rating':'rating'}, inplace=True)

In [15]:
books.shape

(271360, 5)

In [16]:
users.shape

(278858, 3)

In [17]:
ratings.shape

(1149780, 3)

## Approach to a problem statement
We are not looking for similarity in the content of books or in user profiles; we rely on rating behaviour.
Suppose user A has read and liked books x and y, and user B has also liked these two books. If user A then reads and likes a book z that B has not read, we should recommend z to user B. 
This is what collaborative filtering is.

To do this we build a rating matrix 
where the columns are users, the index (rows) is books, and the values are ratings. In other words, we create a pivot table and then look for books whose rows of ratings are close to each other.

### A big flaw in using the dataset as-is
If we take all the books and all the users for modeling, 
it will create a problem. We have to reduce the number of users and books, because we cannot learn much from a user who has only registered on the website or has rated only one or two books. We cannot rely on such users to recommend books to others, because the knowledge has to be extracted from the data. 
So we will limit the data: we will **keep only users who have rated more than 200 books**, 
and we will **keep only books that have received at least 50 ratings from those users.**
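
The two cutoffs can be tried on a toy table before running them on the real data. This is a minimal sketch with made-up rows and much smaller thresholds; the `filter_by_activity` helper is hypothetical. It uses a strict `>` for users and `>=` for books, which are the comparisons the notebook's code applies later.

```python
import pandas as pd

# Made-up ratings: user 3 rated one book, book "c" has one rating.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "title":   ["a", "b", "c", "a", "b", "a"],
    "rating":  [5, 7, 9, 6, 8, 10],
})

def filter_by_activity(df, min_user_ratings, min_book_ratings):
    """Keep users with more than min_user_ratings ratings, then
    books with at least min_book_ratings ratings from those users."""
    user_counts = df["user_id"].value_counts()
    active_users = user_counts[user_counts > min_user_ratings].index
    df = df[df["user_id"].isin(active_users)]

    book_counts = df["title"].value_counts()
    popular_books = book_counts[book_counts >= min_book_ratings].index
    return df[df["title"].isin(popular_books)]

filtered = filter_by_activity(toy, min_user_ratings=1, min_book_ratings=2)
print(filtered)
```

Because the book counts are taken after the user filter, a book must be popular among the remaining active users, not among all users.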

## Exploratory Data Analysis
So let's start the analysis and prepare the dataset for modeling as discussed. 
Let us see how many ratings each user has given, and then extract the users who have given more than 200 ratings.

In [18]:
ratings['user_id'].value_counts()

11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
271728        1
245123        1
234886        1
259466        1
187812        1
Name: user_id, Length: 105283, dtype: int64

### Step-1) Extract users with more than 200 ratings
We can see that only 105,283 of the 278,858 registered users have given a rating.
Now we will extract the ids of the users who have given more than 200 ratings, and
once we have these user ids we will keep only their ratings from the ratings dataframe.



In [19]:
x = ratings['user_id'].value_counts() > 200
y = x[x].index  #user_ids
print(y.shape)

(899,)


In [24]:
x = ratings['user_id'].value_counts() > 200
y = x[x].index  #user_ids
print(y.shape)
ratings = ratings[ratings['user_id'].isin(y)]
ratings

(899,)


| | user_id | ISBN | rating |
|---|---|---|---|
| 1456 | 277427 | 002542730X | 10 |
| 1457 | 277427 | 0026217457 | 0 |
| 1458 | 277427 | 003008685X | 8 |
| 1459 | 277427 | 0030615321 | 0 |
| 1460 | 277427 | 0060002050 | 0 |
| ... | ... | ... | ... |
| 1147612 | 275970 | 3829021860 | 0 |
| 1147613 | 275970 | 4770019572 | 0 |
| 1147614 | 275970 | 896086097 | 0 |
| 1147615 | 275970 | 9626340762 | 8 |


### Step-2) Merge ratings with books
So there are 899 users, who between them have given about 5.2 lakh (roughly 520,000) ratings, and this is the data we want. 
Now we will merge ratings with books **on the basis of ISBN** so that 
every rating row also carries the title, author, year, and publisher of the book that was rated.

In [26]:
rating_with_books = ratings.merge(books, on='ISBN')
rating_with_books.head()

| | user_id | ISBN | rating | title | author | year | publisher |
|---|---|---|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |
| 2 | 11676 | 002542730X | 6 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |
| 3 | 12538 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |
| 4 | 13552 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |


In [27]:
rating_with_books.shape

(487671, 7)
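
The merged frame has fewer rows than the filtered ratings because `merge` performs an inner join by default, so ratings whose ISBN is missing from the books table are dropped. A quick way to see this is a left merge with `indicator=True`, shown here on made-up data (the toy frames and ISBN values are assumptions for the example):

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ISBN": ["111", "222", "999"],   # "999" does not exist in books_toy
    "rating": [8, 0, 5],
})
books_toy = pd.DataFrame({"ISBN": ["111", "222"], "title": ["Book A", "Book B"]})

# Default inner join: the unmatched rating disappears.
inner = ratings_toy.merge(books_toy, on="ISBN")

# Left join with an indicator column shows which ratings had no matching book.
checked = ratings_toy.merge(books_toy, on="ISBN", how="left", indicator=True)
unmatched = int((checked["_merge"] == "left_only").sum())

print(len(ratings_toy), len(inner), unmatched)
```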

### Step-3) Extract books that have received at least 50 ratings
The dataframe is now smaller, about 4.8 lakh (487,671) rows, because the merge keeps only ratings whose ISBN is present in the books dataframe. Next we count the ratings of each book, so we **group the data by title and count the ratings.**

In [28]:
number_rating = rating_with_books.groupby('title')['rating'].count().reset_index()
number_rating.rename(columns= {'rating':'number_of_ratings'}, inplace=True)
number_rating.head()

| | title | number_of_ratings |
|---|---|---|
| 0 | A Light in the Storm: The Civil War Diary of ... | 2 |
| 1 | Always Have Popsicles | 1 |
| 2 | Apple Magic (The Collector's series) | 1 |
| 3 | Beyond IBM: Leadership Marketing and Finance ... | 1 |
| 4 | Clifford Visita El Hospital (Clifford El Gran... | 1 |


In [29]:

final_rating = rating_with_books.merge(number_rating, on='title')
final_rating.shape


(487671, 8)

In [30]:
final_rating.head(2)

| | user_id | ISBN | rating | title | author | year | publisher | number_of_ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |


In [31]:
final_rating = final_rating[final_rating['number_of_ratings'] >= 50]
# Reassign instead of using inplace=True, because final_rating is a filtered slice at this point.
final_rating = final_rating.drop_duplicates(['user_id','title'])
final_rating.head()

| | user_id | ISBN | rating | title | author | year | publisher | number_of_ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
| 2 | 11676 | 002542730X | 6 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
| 3 | 12538 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
| 4 | 13552 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |


In [33]:
final_rating.shape

(59850, 8)

In [38]:
final_rating['title'].value_counts().shape

(742,)

In [40]:
final_rating['user_id'].value_counts().shape

(888,)

In [36]:
# Duplicates were dropped above because the same user rating the same book more than once would create a problem in the pivot table.
# Finally, we have a dataset of users who have rated more than 200 books and books that have received at least 50 ratings.
# The shape of the final dataframe is 59850 rows and 8 columns.
# Check: every (user_id, title) pair now occurs exactly once.
final_rating[['user_id','title']].value_counts()

user_id  title                                  
254      1984                                       1
190459   Exclusive                                  1
         Angela's Ashes (MMP) : A Memoir            1
         Back Roads                                 1
         Bastard Out of Carolina                    1
                                                   ..
98391    The Five People You Meet in Heaven         1
         The Guardian                               1
         The Gunslinger (The Dark Tower, Book 1)    1
         The Hours: A Novel                         1
278418   Whirlwind (Tyler, Book 1)                  1
Length: 59850, dtype: int64

### Step-4) Create Pivot Table
- Create a **pivot table** where the columns are user ids, the index is the book title, and the values are ratings. 
- A user who has not rated a given book will have NaN in that cell, so we fill it with zero.



In [39]:
book_pivot = final_rating.pivot_table(columns='user_id', index='title', values="rating")
book_pivot.fillna(0, inplace=True)
book_pivot.head()

| title / user_id | 254 | 2276 | 2766 | 2977 | 3363 | 3757 | 4017 | 4385 | 6242 | 6251 | ... | 274004 | 274061 | 274301 | 274308 | 274808 | 275970 | 277427 | 277478 | 277639 | 278418 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1984 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1st to Die: A Novel | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2nd Chance | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 Blondes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 84 Charing Cross Road | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Modeling
- We have prepared our dataset for modeling. 
- We will use the nearest neighbors algorithm, the unsupervised neighbor search that also underlies K-nearest neighbors; by default it measures the Euclidean distance between rows.

The pivot table contains a lot of zero values, and storing and processing all of them  
wastes memory and computing power when distances are calculated,
so we will **convert the pivot table to a sparse matrix** and then feed it to the model.

In [41]:
from scipy.sparse import csr_matrix
book_sparse = csr_matrix(book_pivot)
book_sparse

<742x888 sparse matrix of type '<class 'numpy.float64'>'
	with 14942 stored elements in Compressed Sparse Row format>
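
The figures in this output show how empty the matrix is: 14,942 stored values out of 742 × 888 cells, which is a little under 2.3% of all cells. The sketch below computes the same ratio on a small made-up matrix; the `density` helper is a hypothetical name used only for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 3 x 4 rating matrix with only two non-zero entries.
dense = np.array([
    [9.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 10.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
sparse = csr_matrix(dense)

def density(matrix):
    """Fraction of cells that hold an explicitly stored value."""
    rows, cols = matrix.shape
    return matrix.nnz / (rows * cols)

toy_density = density(sparse)             # 2 stored values out of 12 cells
notebook_density = 14942 / (742 * 888)    # figures taken from the output above

print(toy_density, notebook_density)
```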

In [42]:
# Train the nearest neighbors algorithm.
# algorithm='brute' computes the distance from the query point to every point in the fitted data.

from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(algorithm='brute')
model.fit(book_sparse)

NearestNeighbors(algorithm='brute')
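
To see what the fitted model does, the same steps can be run on a tiny made-up matrix where the answer is easy to check by hand. The titles and ratings below are invented for illustration; rows are books and columns are users, as in the pivot table.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

titles = ["Book A", "Book B", "Book C", "Book D"]
toy_ratings = np.array([
    [10, 8, 0, 0],   # Book A
    [9, 8, 0, 0],    # Book B: rated almost like Book A
    [0, 0, 7, 9],    # Book C
    [0, 2, 8, 9],    # Book D
], dtype=float)
toy_sparse = csr_matrix(toy_ratings)

knn = NearestNeighbors(algorithm="brute")  # default metric is Euclidean distance
knn.fit(toy_sparse)

# Query with the row of Book A and ask for its 2 nearest rows.
distances, indices = knn.kneighbors(toy_sparse[0], n_neighbors=2)
neighbours = [titles[i] for i in indices[0]]

print(neighbours, distances)
```

Book A comes back first at distance 0 because the query row is itself part of the fitted data, and Book B follows because its ratings differ from Book A's by a single point.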

#### Make a prediction and see whether it is suggesting books or not
- We will find the nearest neighbors of the input book. 
- Then we will print the 5 closest books (5 is the default number of neighbors). The model returns the distances and the row indexes of the books at those distances. Let us pass a Harry Potter book, which is at index 237.

In [148]:
distances, suggestions = model.kneighbors(book_pivot.iloc[237, :].values.reshape(1, -1))
# Print all the recommended books; suggestions has one row per query, so this loop runs once.
for i in range(len(suggestions)):
    print(book_pivot.index[suggestions[i]])

Index(['Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       'Harry Potter and the Sorcerer's Stone (Book 1)', 'Exclusive'],
      dtype='object', name='title')


In [149]:
# Euclidean distance to each suggested book; the first is 0 because the input book is its own nearest neighbor
distances

array([[ 0.        , 68.78953409, 69.5413546 , 72.64296249, 76.83098333]])

In [150]:
suggestions  # row indexes of the suggested books in book_pivot

array([[237, 240, 238, 241, 184]], dtype=int64)

## Recommend Books Based on Book Name

In [146]:
def rec_books_ByBookName(book_pivot, book_name):
    list_books = list(book_pivot.index)
    b = [x for x in list_books if book_name.lower() in x.lower()]
    #print(b)
    try:
        first_book = b[0]
    except IndexError:
        #print("No Such a Book Name or check spelling")
        return ["No Such a Book Name or check spelling"]
        
    idx = list_books.index(first_book)
    # n_neighbors is the number of books to return (the default is 5); the input book itself is one of them
    distances, suggestions = model.kneighbors(book_pivot.iloc[idx, :].values.reshape(1, -1), n_neighbors=10)

    suggestions_lst = suggestions.tolist()[0]
    rec_books = book_pivot.index[ suggestions_lst ]
    rec_books = rec_books.tolist()
    return rec_books

In [147]:
book_name = input("Enter Book Name: ")
rec_books = rec_books_ByBookName(book_pivot, book_name)
print("\n\n")
print("Recommended Books : \n")
for i in rec_books:
    print(i)

Enter Book Name: Harry



Recommended Books : 

Harry Potter and the Chamber of Secrets (Book 2)
Harry Potter and the Prisoner of Azkaban (Book 3)
Harry Potter and the Goblet of Fire (Book 4)
Harry Potter and the Sorcerer's Stone (Book 1)
Exclusive
The Cradle Will Fall
Jacob Have I Loved
The Witness
Tom Clancy's Op-Center (Tom Clancy's Op Center (Paperback))
Toxin
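
The first result in this list is the book that was searched for, since the query row is part of the fitted data. A possible variant that drops the input book and takes the model as an argument instead of a global is sketched below. The function name, the toy pivot table, and the choice to return an empty list when nothing matches are all assumptions for illustration, not the notebook's code.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Made-up pivot table: rows are titles, columns are user ids.
toy_pivot = pd.DataFrame(
    [[10, 8, 0, 0], [9, 8, 0, 0], [0, 0, 7, 9], [0, 2, 8, 9]],
    index=pd.Index(
        ["Wizard School 1", "Wizard School 2", "Space War", "Space Peace"],
        name="title",
    ),
    columns=[101, 102, 103, 104],
    dtype=float,
)
toy_model = NearestNeighbors(algorithm="brute").fit(toy_pivot.values)

def recommend_excluding_query(pivot, fitted_model, book_name, n_recs=2):
    """Return up to n_recs neighbours of the first title containing book_name,
    leaving out the matched title itself."""
    matches = [t for t in pivot.index if book_name.lower() in t.lower()]
    if not matches:
        return []
    idx = pivot.index.get_loc(matches[0])
    query = pivot.iloc[idx, :].values.reshape(1, -1)
    # Ask for one extra neighbour so n_recs remain after removing the query.
    _, suggestions = fitted_model.kneighbors(query, n_neighbors=n_recs + 1)
    return [pivot.index[i] for i in suggestions[0] if i != idx][:n_recs]

recs = recommend_excluding_query(toy_pivot, toy_model, "wizard school 1")
print(recs)
```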


In [None]:
# https://www.analyticsvidhya.com/blog/2021/06/build-book-recommendation-system-unsupervised-learning-project/