### Project Name
- Books Recommendation System

### Problem statement:

"Design a system that takes in data on users' reading history and preferences, and uses this information to generate personalized recommendations for books that the user is likely to enjoy."

### **Dataset Description**

The Book-Crossing dataset comprises 3 files.


1. **Users:**
- Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL values.


2. **Books:**
- Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. In addition, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in the case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website.


3. **Ratings:**
- Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1 to 10 (higher values denoting higher appreciation), or implicit, expressed by 0.
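
Because a rating of 0 marks an implicit interaction rather than a low score, it can help to look at the two kinds of rating separately. The snippet below is a minimal sketch of that split. The `sample` frame is a hypothetical stand-in that reuses a few rows in the shape of `Ratings.csv`; it is not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical sample with the same columns as Ratings.csv.
sample = pd.DataFrame({
    "User-ID": [276725, 276726, 276727, 276729, 276729],
    "ISBN": ["034545104X", "0155061224", "0446520802", "052165615X", "0521795028"],
    "Book-Rating": [0, 5, 0, 3, 6],
})

# 0 means "implicit" (the user interacted with the book but gave no score).
implicit = sample[sample["Book-Rating"] == 0]
# 1-10 are explicit scores.
explicit = sample[sample["Book-Rating"] > 0]

print("implicit rows:", len(implicit))
print("explicit rows:", len(explicit))
print("mean explicit rating:", explicit["Book-Rating"].mean())
```

Averaging over all rows, zeros included, would pull the mean rating down, which is one reason to keep the distinction in mind.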

### Github link

### 1. Importing the libraries

In [1]:
import pandas as pd
import numpy as np

### 2. loading all three datasets

In [4]:
books = pd.read_csv("Books.csv")



In [5]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [6]:
user = pd.read_csv("Users.csv")

In [7]:
user.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [9]:
rating = pd.read_csv("Ratings.csv")

In [10]:
rating.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### 3. EDA

#### checking the null values

In [11]:
books.isnull().sum()

ISBN                   0
Book-Title             0
Book-Author            1
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64

In [51]:
# Missing Value Count Function
def show_missing():
    missing = user.columns[user.isnull().any()].tolist()
    return missing

# Missing data counts and percentage
print('Missing Data Count')
print(user[show_missing()].isnull().sum().sort_values(ascending = False))
print('--'*50)
print('Missing Data Percentage')
print(round(user[show_missing()].isnull().sum().sort_values(ascending = False)/len(user)*100,2))

Missing Data Count
age    110762
dtype: int64
----------------------------------------------------------------------------------------------------
Missing Data Percentage
age    39.72
dtype: float64


- The `age` column is missing for **39.7%** of users (110,762 of 278,858 rows)
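
The same count-and-percentage report can be produced for any of the three frames by passing the DataFrame in as an argument. This is a small sketch of that idea; `missing_summary` and `toy_users` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Hypothetical frame shaped like the user table after renaming.
toy_users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "location": ["nyc, new york, usa", "stockton, california, usa",
                 "moscow, yukon territory, russia", "porto, v.n.gaia, portugal",
                 "farnborough, hants, united kingdom"],
    "age": [np.nan, 18.0, np.nan, 17.0, np.nan],
})

summary = missing_summary(toy_users)
print(summary)
```

Calling `missing_summary(books)` or `missing_summary(rating)` would then give the same report for the other tables without redefining the helper.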

In [13]:
rating.isnull().sum()

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

#### checking the categorical and Numerical variables 

In [52]:
# find categorical variables for books
categorical = [var for var in books.columns if books[var].dtype=='O']
print('There are {} categorical variables'.format(len(categorical)))

There are 5 categorical variables


In [53]:
# find categorical variables for rating
categorical = [var for var in rating.columns if rating[var].dtype=='O']
print('There are {} categorical variables'.format(len(categorical)))

There are 1 categorical variables


In [54]:
# find categorical variables for user
categorical = [var for var in user.columns if user[var].dtype=='O']
print('There are {} categorical variables'.format(len(categorical)))

There are 1 categorical variables


In [55]:
# find Numerical variables for books
numerical = [var for var in books.columns if books[var].dtype!='O']
print('There are {} numerical variables'.format(len(numerical)))

There are 0 numerical variables


In [56]:
# find Numerical variables for rating
numerical = [var for var in rating.columns if rating[var].dtype!='O']
print('There are {} numerical variables'.format(len(numerical)))

There are 2 numerical variables


In [57]:
# find Numerical variables for user
numerical = [var for var in user.columns if user[var].dtype!='O']
print('There are {} numerical variables'.format(len(numerical)))

There are 2 numerical variables


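1. For books we have:
- 5 categorical variables and 0 numerical variables (all five remaining columns, including `year`, are stored with object dtype)

2. For rating we have:
- 1 categorical variable (`ISBN`) and 2 numerical variables (`user_id`, `book_rating`)

3. For user we have:
- 1 categorical variable (`location`) and 2 numerical variables (`User-ID`, `age`)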
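
The dtype checks above can also be written with `DataFrame.select_dtypes`, which returns the matching columns directly. The sketch below treats every non-numeric column as categorical, which is an assumption that matches the object/non-object split used here only approximately. `toy_rating` is a hypothetical two-row frame.

```python
import pandas as pd

# Hypothetical frame shaped like the renamed rating table.
toy_rating = pd.DataFrame({
    "user_id": [276725, 276726],
    "ISBN": ["034545104X", "0155061224"],
    "book_rating": [0, 5],
})

# Numeric columns (ints and floats).
numerical = toy_rating.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical in this sketch.
categorical = toy_rating.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical)
print("categorical:", categorical)
```

Listing the column names, rather than only counting them, makes it easier to spot a column such as a publication year that was read as text.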

#### dropping the nan values

In [14]:
books.dropna(inplace=True)
books.isnull().sum()

ISBN                   0
Book-Title             0
Book-Author            0
Year-Of-Publication    0
Publisher              0
Image-URL-S            0
Image-URL-M            0
Image-URL-L            0
dtype: int64

In [15]:
user["Age"].mean()

34.75143370454978

In [16]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

### 4. pre-processing

#### keeping only the necessary data

In [17]:
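# .copy() makes this an independent frame, so the in-place rename below does not act on a slice
books = books[["ISBN","Book-Title","Book-Author","Year-Of-Publication","Publisher"]].copy()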
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


#### renaming the columns

In [18]:
books.rename(columns={"Book-Title":"title", "Book-Author":"author","Year-Of-Publication":"year","Publisher":"publisher"},inplace=True)
books.head()

Unnamed: 0,ISBN,title,author,year,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [19]:
user.rename(columns={"Location":"location","Age":"age"},inplace=True)
user.head()

Unnamed: 0,User-ID,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [20]:
rating.rename(columns={"User-ID":"user_id","Book-Rating":"book_rating"},inplace=True)
rating.head()

Unnamed: 0,user_id,ISBN,book_rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [21]:
print(books.shape, user.shape ,rating.shape)

(271354, 5) (278858, 3) (1149780, 3)


In [22]:
rating["user_id"].value_counts()

11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
116180        1
116166        1
116154        1
116137        1
276723        1
Name: user_id, Length: 105283, dtype: int64

In [24]:
# boolean Series indexed by user_id: True for users with more than 350 ratings
x = rating["user_id"].value_counts() > 350
# keep only the True entries and count them
x[x].shape

(463,)

In [25]:
# user_ids of the users that passed the threshold
y = x[x].index

In [26]:
ratings= rating[rating["user_id"].isin(y)]
ratings.shape

(412186, 3)
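
The three cells above (count ratings per user, apply a threshold, keep the matching rows) can be folded into one helper. The following is a sketch under the assumption that the frame has a `user_id` column; `keep_active_users` and the toy data are hypothetical.

```python
import pandas as pd

def keep_active_users(ratings_df, min_ratings):
    """Keep rows from users who have more than `min_ratings` ratings."""
    counts = ratings_df["user_id"].value_counts()
    active_ids = counts[counts > min_ratings].index
    return ratings_df[ratings_df["user_id"].isin(active_ids)]

# Hypothetical data: user 1 has three ratings, user 2 has two, user 3 has one.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "d", "e", "f"],
    "book_rating": [5, 0, 7, 3, 0, 9],
})

active = keep_active_users(toy, min_ratings=2)
print(active)
```

With the real data, `keep_active_users(rating, 350)` would correspond to the filtering done in the cells above.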

In [27]:
ratings.head()

Unnamed: 0,user_id,ISBN,book_rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0


In [28]:
rating_books = ratings.merge(books,on="ISBN")
rating_books.head()

Unnamed: 0,user_id,ISBN,book_rating,title,author,year,publisher
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc


In [29]:
number_rating = rating_books.groupby("title")["book_rating"].count().reset_index()
number_rating.rename(columns={"book_rating":"no_of_rating"},inplace=True)
number_rating.head()

Unnamed: 0,title,no_of_rating
0,A Light in the Storm: The Civil War Diary of ...,1
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1


In [30]:
final_rating = rating_books.merge(number_rating,on="title")
final_rating.head()

Unnamed: 0,user_id,ISBN,book_rating,title,author,year,publisher,no_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,57
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,57
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,57
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,57
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,57


In [31]:
final_rating.shape

(382649, 8)

In [32]:
final_rating= final_rating[final_rating["no_of_rating"]>=60]

final_rating.shape

(22057, 8)
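
The count-then-merge-then-filter sequence above can also be expressed with `groupby(...).transform("count")`, which attaches the per-title count to every row without a separate merge. This is an alternative sketch on hypothetical toy data, not the notebook's own code; the threshold of 2 is chosen only to suit the toy frame.

```python
import pandas as pd

# Hypothetical ratings: title "A" has three ratings, "B" one, "C" two.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "book_rating": [8, 0, 10, 5, 7, 0],
})

# Attach the number of ratings per title to each row.
toy["no_of_rating"] = toy.groupby("title")["book_rating"].transform("count")

# Keep only titles with at least 2 ratings.
popular = toy[toy["no_of_rating"] >= 2]
print(popular)
```

Either approach should give the same rows; the transform version simply avoids building the intermediate count table.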

#### dropping the duplicate values

In [36]:
# reassign instead of inplace=True, since final_rating is a filtered slice of a larger frame
final_rating = final_rating.drop_duplicates(["user_id","title"])

In [37]:
final_rating.head()
print(final_rating.shape)

(20987, 8)


#### creating a pivot table

In [38]:
book_pivot = final_rating.pivot_table(columns="user_id",index="title",values='book_rating')
book_pivot

title / user_id,2276,3363,3757,6251,6543,6575,7158,7346,8681,11601,...,269719,269728,270713,271284,274004,274061,274308,275970,277427,278418
1st to Die: A Novel,,,,,9.0,,0.0,,,,...,,,,,,,,,,
2nd Chance,10.0,,,,0.0,,,,,0.0,...,,,,,,,0.0,,,
A Bend in the Road,,,,,,1.0,,,,0.0,...,,,,,,,,,,
A Is for Alibi (Kinsey Millhone Mysteries (Paperback)),,,,,,7.0,,,,0.0,...,,,,,,,0.0,,,
A Map of the World,,,,,,7.0,,,,,...,0.0,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
White Oleander : A Novel,,0.0,,8.0,,0.0,,,,0.0,...,,,,,,,,0.0,0.0,
White Oleander : A Novel (Oprah's Book Club),,0.0,,0.0,,,,,,,...,,,,,,,,,0.0,
Wicked: The Life and Times of the Wicked Witch of the West,,0.0,,0.0,10.0,9.0,,7.0,0.0,,...,,,,,,,,,,
Wild Animus,,0.0,,0.0,0.0,,0.0,,0.0,,...,0.0,0.0,,,,,,,0.0,


#### filling the na values

In [39]:
# filling the na values
book_pivot.fillna(0,inplace=True)
book_pivot

title / user_id,2276,3363,3757,6251,6543,6575,7158,7346,8681,11601,...,269719,269728,270713,271284,274004,274061,274308,275970,277427,278418
1st to Die: A Novel,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Is for Alibi (Kinsey Millhone Mysteries (Paperback)),0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Map of the World,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
White Oleander : A Novel,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
White Oleander : A Novel (Oprah's Book Club),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Wicked: The Life and Times of the Wicked Witch of the West,0.0,0.0,0.0,0.0,10.0,9.0,0.0,7.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Wild Animus,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### converting to sparse matrix

In [40]:
from scipy.sparse import csr_matrix

In [41]:
book_sparse = csr_matrix(book_pivot)
book_sparse

<255x456 sparse matrix of type '<class 'numpy.float64'>'
	with 4798 stored elements in Compressed Sparse Row format>
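
The output above reports 4798 stored elements in a 255x456 matrix, so only about 4.1% of the cells are non-zero, which is why a sparse format suits this data. The sketch below shows the pivot, fill, and sparse conversion on a tiny hypothetical frame and computes the density in the same way.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings: three titles, three users, four non-zero ratings.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["A", "B", "A", "C"],
    "book_rating": [8, 5, 10, 7],
})

# Titles as rows, users as columns; unrated cells become 0.
pivot = toy.pivot_table(index="title", columns="user_id", values="book_rating").fillna(0)

# csr_matrix keeps only the non-zero entries.
sparse = csr_matrix(pivot.values)
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])

print(sparse.shape, sparse.nnz, round(density, 3))
```

Note that after `fillna(0)` an unrated cell and an implicit rating of 0 look identical, and neither is stored in the sparse matrix.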

### 5. model building

In [42]:
from sklearn.neighbors import NearestNeighbors

In [43]:
model = NearestNeighbors(algorithm="brute")

In [44]:
model.fit(book_sparse)

In [45]:
distances,suggestion = model.kneighbors(book_pivot.iloc[123,:].values.reshape(1,-1), n_neighbors=6)

In [46]:
for i in range(len(suggestion)):
    print(book_pivot.index[suggestion[i]])

Index(['Pop Goes the Weasel', 'Isle of Dogs', 'Slow Waltz in Cedar Bend',
       'The Simple Truth', 'Full Tilt (Janet Evanovich's Full Series)',
       'Black and Blue'],
      dtype='object', name='title')
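
The model above uses the default distance of `NearestNeighbors` (Minkowski with p=2, i.e. Euclidean), and a query for a book that is already in the matrix will usually return that book as its own nearest neighbour. The sketch below is a variant, not the notebook's configuration: it swaps in `metric="cosine"` and drops the query title from the result. `toy_pivot` and `similar_titles` are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical title-by-user rating matrix.
toy_pivot = pd.DataFrame(
    [[10, 9, 0, 0],
     [9, 10, 0, 0],
     [0, 0, 8, 7],
     [0, 1, 7, 8]],
    index=pd.Index(["Book A", "Book B", "Book C", "Book D"], name="title"),
    columns=[1, 2, 3, 4],
    dtype=float,
)

# Cosine distance is an assumption here; the notebook keeps the default metric.
knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(csr_matrix(toy_pivot.values))

def similar_titles(title, k=1):
    """Return the k nearest titles to `title`, excluding the title itself."""
    row = toy_pivot.index.get_loc(title)
    _, idx = knn.kneighbors(toy_pivot.iloc[[row]].values, n_neighbors=k + 1)
    return [t for t in toy_pivot.index[idx[0]] if t != title][:k]

print(similar_titles("Book A"))
print(similar_titles("Book C"))
```

Asking for `k + 1` neighbours and filtering out the query keeps the number of returned suggestions equal to `k`.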


### 6. building recommendation

In [47]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_scores = cosine_similarity(book_pivot)

In [48]:
def recommend(book_name):
    # row position of the requested title in the pivot table
    index = np.where(book_pivot.index == book_name)[0][0]
    # sort by similarity, skip position 0 (the book itself) and keep the next four
    similar_items = sorted(enumerate(similarity_scores[index]), key=lambda x: x[1], reverse=True)[1:5]

    data = []
    for i in similar_items:
        # look up the recommended title once and collect its title and author
        temp_df = books[books['title'] == book_pivot.index[i[0]]].drop_duplicates('title')
        item = list(temp_df['title'].values) + list(temp_df['author'].values)
        data.append(item)

    return data

In [49]:
recommend('Red Storm Rising')

[['Congo', 'Michael Crichton'],
 ['The Partner', 'John Grisham'],
 ['The Chamber', 'John Grisham'],
 ['The Client', 'John Grisham']]
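
`recommend` raises an `IndexError` when the title is not in `book_pivot.index`, because `np.where(...)[0]` is then empty. The sketch below shows one possible guard and also returns the similarity score next to each title. `recommend_safe` and `toy_pivot` are hypothetical names on toy data; this is an illustration, not a replacement for the function above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical title-by-user rating matrix.
toy_pivot = pd.DataFrame(
    [[10, 9, 0, 0],
     [9, 10, 0, 0],
     [0, 0, 8, 7],
     [0, 1, 7, 8]],
    index=pd.Index(["Book A", "Book B", "Book C", "Book D"], name="title"),
    columns=[1, 2, 3, 4],
    dtype=float,
)
scores = cosine_similarity(toy_pivot.values)

def recommend_safe(book_name, top_n=2):
    """Return up to top_n (title, similarity) pairs, or [] for an unknown title."""
    if book_name not in toy_pivot.index:
        return []
    idx = toy_pivot.index.get_loc(book_name)
    # Highest similarity first, with the query book itself removed.
    order = [i for i in np.argsort(-scores[idx]) if i != idx][:top_n]
    return [(toy_pivot.index[i], round(float(scores[idx, i]), 3)) for i in order]

print(recommend_safe("Book A"))
print(recommend_safe("Not In The Data"))
```

Returning an empty list for an unknown title is one choice; raising a clearer error or suggesting close title matches would be equally reasonable.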