# Book Rental Recommendation .
Course-end Project 4
# Description
Book Rent is the largest online and offline book rental chain in India. They provide books of various genres, such as thrillers, mysteries, romances, and science fiction. The company charges a fixed rental fee for a book per month. Lately, the company has been losing its user base. The main reason for this is that users are not able to choose the right books for themselves. The company wants to solve this problem and increase its revenue and profit. 
Project Objective:
You, as an ML expert, should focus on improving the user experience by personalizing it to the user's needs. You have to model a recommendation engine so that users get recommendations for books based on the behavior of similar users. This will ensure that users are renting the books based on their tastes and traits.
Note: You have to perform user-based collaborative filtering and item-based collaborative filtering.
Dataset description:
BX-Users: It contains the information of users.
•	user_id - These have been anonymized and mapped to integers
•	Location - Demographic data is provided
•	Age - Demographic data is provided
If available, otherwise, these fields contain NULL-values.
 
BX-Books: 
•	isbn - Books are identified by their respective ISBNs. Invalid ISBNs have already been removed from the dataset.
•	book_title
•	book_author
•	year_of_publication
•	publisher

 
BX-Book-Ratings: Contains the book rating information. 
•	user_id
•	isbn
•	rating - Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1–10 (higher values denoting higher appreciation), or implicit, expressed by 0.
 


In [1]:
import  pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from sklearn import preprocessing 
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import PolynomialFeatures

# Following operations should be performed:
# •	Read the books dataset and explore it


In [2]:
book_df =pd.read_csv('BX-Books.csv', encoding ='latin')
book_df.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


In [3]:
ratings_df =pd.read_csv('BX-Book-Ratings.csv',encoding ='latin')
ratings_df.head()

Unnamed: 0,user_id,isbn,rating
0,276725,034545104X,0
1,276726,155061224,5
2,276727,446520802,0
3,276729,052165615X,3
4,276729,521795028,6


In [4]:
user_df =pd.read_csv('BX-Users.csv',encoding ='latin')
user_df.head()

Unnamed: 0,user_id,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [5]:
recommend_df =pd.read_csv('Recommend.csv',encoding='latin')
recommend_df.head()

Unnamed: 0,196,242,3,881250949
0,186,302,3,891717742
1,22,377,1,878887116
2,244,51,2,880606923
3,166,346,1,886397596
4,298,474,4,884182806


In [6]:
print( "the shape of book_df{}".format(book_df.shape))
print( "the shape of book_ratings_df{}".format(ratings_df.shape))
print( "the shape of user_df{}".format(user_df.shape))

the shape of book_df(271379, 5)
the shape of book_ratings_df(1048575, 3)
the shape of user_df(278859, 3)


In [7]:
# lets find the columns
print( "the shape of bx_book_df{}".format(book_df.columns))
print( "the shape of bx_book_ratings_df{}".format(ratings_df.columns))
print( "the shape of bx_user_df{}".format(user_df.columns))

the shape of bx_book_dfIndex(['isbn', 'book_title', 'book_author', 'year_of_publication',
       'publisher'],
      dtype='object')
the shape of bx_book_ratings_dfIndex(['user_id', 'isbn', 'rating'], dtype='object')
the shape of bx_user_dfIndex(['user_id', 'Location', 'Age'], dtype='object')


In [8]:
# lets find the nan value with its percentage for each book, ratings,user book.
percentage_book = (book_df.isna().sum(axis=0)/book_df.shape[0])*100
percentage_book

isbn                   0.000000
book_title             0.000000
book_author            0.000368
year_of_publication    0.000000
publisher              0.000737
dtype: float64

In [9]:
book_df.isnull().sum(axis=0)

isbn                   0
book_title             0
book_author            1
year_of_publication    0
publisher              2
dtype: int64

In [10]:
book_df.isnull().sum(axis=0).value_counts()

0    3
2    1
1    1
dtype: int64

In [11]:
ratings_df.isnull().sum(axis=0)

user_id    0
isbn       0
rating     0
dtype: int64

In [12]:
user_df.isnull().sum(axis=0)

user_id          0
Location         1
Age         110763
dtype: int64

In [13]:
user_df.isnull().sum(axis=0).value_counts()

110763    1
1         1
0         1
dtype: int64

##### 

Interpretation: we have seen that in the book dataset we have  book_author=1 and publisher=2 as a NAN values and in the user dataset we have 
Location= 1 ,age= 110763 NAN values.

# •	Clean up NaN values.


In [14]:
user_df['Location'].unique()

array(['nyc, new york, usa', 'stockton, california, usa',
       'moscow, yukon territory, russia', ...,
       'sergnano, lombardia, italy', 'stranraer, n/a, united kingdom',
       'tacoma, washington, united kingdom'], dtype=object)

In [15]:
book_df= book_df.dropna()

In [16]:
book_df.isna().sum(axis=0)

isbn                   0
book_title             0
book_author            0
year_of_publication    0
publisher              0
dtype: int64

In [17]:
user_df =user_df.dropna()

In [18]:
user_df.isna().sum(axis=0)

user_id     0
Location    0
Age         0
dtype: int64

In [19]:
ratings_df.isnull().sum(axis=0)

user_id    0
isbn       0
rating     0
dtype: int64

Interpretation: we have cleaned the NAN values by droping all the NAN values.

# •	Read the data where ratings are given by users


In [20]:
book_df.shape

(271376, 5)

In [21]:
book_df.columns

Index(['isbn', 'book_title', 'book_author', 'year_of_publication',
       'publisher'],
      dtype='object')

In [22]:
user_df.shape

(168096, 3)

In [23]:
user_df.columns

Index(['user_id', 'Location', 'Age'], dtype='object')

In [24]:
ratings_df.shape

(1048575, 3)

In [25]:
ratings_df.columns

Index(['user_id', 'isbn', 'rating'], dtype='object')

In [26]:
ratings_df.describe()

Unnamed: 0,user_id,rating
count,1048575.0,1048575.0
mean,128508.9,2.879907
std,74218.76,3.85787
min,2.0,0.0
25%,63394.0,0.0
50%,128835.0,0.0
75%,192779.0,7.0
max,278854.0,10.0


we can see we cannot clear with  the dataset values ,hence,lets recall rating datasets and call 10000 rows to rechack the description of the data.

In [27]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   user_id  1048575 non-null  int64 
 1   isbn     1048575 non-null  object
 2   rating   1048575 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 24.0+ MB


In [28]:
ratings1_df =pd.read_csv('BX-Book-Ratings.csv',encoding ='latin', nrows=10000)
ratings1_df.head()

Unnamed: 0,user_id,isbn,rating
0,276725,034545104X,0
1,276726,155061224,5
2,276727,446520802,0
3,276729,052165615X,3
4,276729,521795028,6


In [29]:
ratings1_df.describe()

Unnamed: 0,user_id,rating
count,10000.0,10000.0
mean,265844.3796,1.9747
std,56937.189618,3.424884
min,2.0,0.0
25%,277478.0,0.0
50%,278418.0,0.0
75%,278418.0,4.0
max,278854.0,10.0


Interpretation: we have seen that the ratings and user id count is 1048575  values including int64, object and float as well. but in calling the rating1 data we have seen that count is 10000 standard deviation is aprox 56000 and min value =2 and max= 278854. hence we have seen that calling all the values has made memory crashed to read the data. we will stick to work with 10000 dataset.

# lets merge the new rating1df and book dataset and then re merge for our final data set with the user dataset.

In [30]:
# lets first merge the  book and rating data.
final_df =pd.merge(ratings1_df,book_df, on='isbn')
final_df

Unnamed: 0,user_id,isbn,rating,book_title,book_author,year_of_publication,publisher
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
1,276726,155061224,5,Rites of Passage,Judith Rae,2001,Heinle
2,276727,446520802,0,The Notebook,Nicholas Sparks,1996,Warner Books
3,278418,446520802,0,The Notebook,Nicholas Sparks,1996,Warner Books
4,276729,052165615X,3,Help!: Level 1,Philip Prowse,1999,Cambridge University Press
...,...,...,...,...,...,...,...
8696,243,385720106,7,A Map of the World,Jane Hamilton,1999,Anchor Books/Doubleday
8697,243,425092917,0,The Accidental Tourist,Anne Tyler,1994,Berkley Publishing Group
8698,243,425098834,0,If Morning Ever Comes,Anne Tyler,1983,Berkley Publishing Group
8699,243,425163407,9,Unnatural Exposure,Patricia Daniels Cornwell,1998,Berkley Publishing Group


# •	Take a quick look at the number of unique users and books


In [31]:
 # check for the unique id list and isbn(books):
print("The length of unique number of user  is {}".format(len(final_df.user_id.unique()))); print("The length of unique number of books is {}".format(len(final_df.isbn.unique())))

The length of unique number of user  is 828
The length of unique number of books is 8051


In [32]:
final_book_df =pd.merge(final_df,user_df, on='user_id')
final_book_df.head()


Unnamed: 0,user_id,isbn,rating,book_title,book_author,year_of_publication,publisher,Location,Age
0,99,451166892,3,The Pillars of the Earth,Ken Follett,1996,Signet Book,"franktown, colorado, usa",42.0
1,99,786868716,0,The Five People You Meet in Heaven,Mitch Albom,2003,Hyperion,"franktown, colorado, usa",42.0
2,99,067976397X,0,Corelli's Mandolin : A Novel,LOUIS DE BERNIERES,1995,Vintage,"franktown, colorado, usa",42.0
3,99,312252617,8,Fast Women,Jennifer Crusie,2001,St. Martin's Press,"franktown, colorado, usa",42.0
4,99,312261594,8,Female Intelligence,Jane Heller,2001,St. Martin's Press,"franktown, colorado, usa",42.0


Interpretation: we have noticed that in the above data final_book_df there will be no impact on location and age column hence we will be continuing to take the final dataset as final_df

In [33]:
print("the shape of the final book data", final_book_df.shape)
print("the columns in the final book dataset",final_book_df.columns)

the shape of the final book data (136, 9)
the columns in the final book dataset Index(['user_id', 'isbn', 'rating', 'book_title', 'book_author',
       'year_of_publication', 'publisher', 'Location', 'Age'],
      dtype='object')


In [34]:
 print("The length of unique number  user  is {}".format(len(final_book_df.user_id.unique()))); print("The length of unique  number of books is {}".format(len(final_book_df.isbn.unique())))

The length of unique number  user  is 43
The length of unique  number of books is 134


In [35]:
final_book_df.describe()

Unnamed: 0,rating,Age
count,136.0,136.0
mean,4.492647,36.044118
std,4.113206,12.004856
min,0.0,14.0
25%,0.0,27.0
50%,6.0,37.0
75%,8.0,42.0
max,10.0,62.0


Interpretation: we have seen that the final book dataset the rating  and user id count is 682099.

In [36]:
book_df.describe()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
count,271376,271376,271376,271376,271376
unique,271376,242148,102041,202,16822
top,156649222X,Selected Poems,Agatha Christie,2002,Harlequin
freq,1,27,632,17144,7535


Interpretation : in the final data the user and number of books is around 43 and 134 
and in the ratings dataset the number of user and number of books are 828 and 8051 respectivvely who rated the books.

# •	Convert ISBN variables to numeric numbers in the correct order


In [37]:
final_df.isbn

0       034545104X
1        155061224
2        446520802
3        446520802
4       052165615X
           ...    
8696     385720106
8697     425092917
8698     425098834
8699     425163407
8700     425164403
Name: isbn, Length: 8701, dtype: object

we have noticed that in the numeric value some parts of values has string value attached and the dtype is object, lets convert it into numeric fully.

In [38]:
isbn_list = final_df.isbn.unique()
isbn_list

array(['034545104X', '155061224', '446520802', ..., '425098834',
       '425163407', '425164403'], dtype=object)

In [39]:
print("the length of number of  book", len(isbn_list))
   
def isbn_numeric(isbn):
    # print("isbn:",isbn)
    isbn_index =np.where(isbn_list==isbn)
    return isbn_index[0][0]     #This line returns the index of the matched ISBN in the isbn_list. 
                                #It can be used to map a given ISBN to its index within the list of unique ISBNs.








    

the length of number of  book 8051


Overall, the code is creating a function (get_isbn_numeric_id) that takes an ISBN as input and returns its index within the list of unique ISBNs extracted from the df_final DataFrame. 


# •	Convert  ISBN to the ordered list, i.e., from 0...n-1


In [40]:
final_df['isbn_id']=final_df['isbn'].apply(isbn_numeric)
final_df.head()

Unnamed: 0,user_id,isbn,rating,book_title,book_author,year_of_publication,publisher,isbn_id
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,0
1,276726,155061224,5,Rites of Passage,Judith Rae,2001,Heinle,1
2,276727,446520802,0,The Notebook,Nicholas Sparks,1996,Warner Books,2
3,278418,446520802,0,The Notebook,Nicholas Sparks,1996,Warner Books,2
4,276729,052165615X,3,Help!: Level 1,Philip Prowse,1999,Cambridge University Press,3


# •	Convert the user_id variable to numeric numbers in the correct order


In [41]:
user_id_list =final_df.user_id.unique()

In [42]:
# similarly creates the function for user_id to get converted into numeric data
print("the number of user",len(user_id_list))
def user_is_numeric(user_id):
    user_id_index= np.where(user_id_list==user_id)
    return user_id_index[0][0]
    

the number of user 828


# •	Convert user_id to the ordered list, i.e., from 0...n-1


In [43]:
final_df['user_id_order'] = final_df['user_id'].apply(user_is_numeric)

In [44]:
final_df.head()

Unnamed: 0,user_id,isbn,rating,book_title,book_author,year_of_publication,publisher,isbn_id,user_id_order
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,0,0
1,276726,155061224,5,Rites of Passage,Judith Rae,2001,Heinle,1,1
2,276727,446520802,0,The Notebook,Nicholas Sparks,1996,Warner Books,2,2
3,278418,446520802,0,The Notebook,Nicholas Sparks,1996,Warner Books,2,3
4,276729,052165615X,3,Help!: Level 1,Philip Prowse,1999,Cambridge University Press,3,4


Interpretation: now we can user user_id_order and isbn_id for model prediction.

# •	Re-index the columns to build a matrix


In [45]:
# lets re index the column for building the matrix before that lets cal the columns
final_df.columns

Index(['user_id', 'isbn', 'rating', 'book_title', 'book_author',
       'year_of_publication', 'publisher', 'isbn_id', 'user_id_order'],
      dtype='object')

In [46]:
# lets ordered it accordingly:
cols=['user_id_order','isbn_id', 'rating','book_title','book_author','year_of_publication', 'publisher','user_id','isbn']
final_df =final_df.reindex(columns=cols)
final_df.head()

Unnamed: 0,user_id_order,isbn_id,rating,book_title,book_author,year_of_publication,publisher,user_id,isbn
0,0,0,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,276725,034545104X
1,1,1,5,Rites of Passage,Judith Rae,2001,Heinle,276726,155061224
2,2,2,0,The Notebook,Nicholas Sparks,1996,Warner Books,276727,446520802
3,3,2,0,The Notebook,Nicholas Sparks,1996,Warner Books,278418,446520802
4,4,3,3,Help!: Level 1,Philip Prowse,1999,Cambridge University Press,276729,052165615X


Now it will be easy for us to view  the data and connecting it.

# •	Split your data into two sets (training and testing)


In [47]:
from sklearn.model_selection import train_test_split

In [48]:
train_data, test_data =train_test_split(final_df,random_state=7,train_size=0.8)

In [49]:
train_data.columns

Index(['user_id_order', 'isbn_id', 'rating', 'book_title', 'book_author',
       'year_of_publication', 'publisher', 'user_id', 'isbn'],
      dtype='object')

In [50]:
print("the datset in train data is{}".format(train_data.shape))
print("the datset in test data is {}".format(test_data.shape))

the datset in train data is(6960, 9)
the datset in test data is (1741, 9)


# Approach for Recommendation Book:
    
# a) User-based nearest-neighbor collaborative filtering:
The system finds out the users who have the same sort of taste of books rading and similarity between users is computed based upon the rating behavior.

# b) Item-based nearest-neighbor collaborative filtering:
The system checks the items that are similar to the items the user bought. The similarity between different items is computed based on the items and not the users for the prediction.

In [51]:

n_user= final_df.user_id.nunique()
n_books =final_df.isbn.nunique()

print("Numbr of Users"+str(n_user))
print("Number of books:"+str(n_books))
train_matrix= np.zeros((n_user, n_books))
for line in train_data.itertuples():
    train_matrix[line[1]-1,line[2]-1]=line[3]

    
# Create user-book matrix for testing
test_matrix = np.zeros((n_user,n_books))
for line in test_data.itertuples():
    test_matrix[line[1]-1, line[2]-1] = line[3]

Numbr of Users828
Number of books:8051


# •	Make predictions based on user and item variables


In [52]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import pairwise_distances

In [53]:
# colaborative user based recommendation system
user_similarity =pairwise_distances(train_matrix, metric='cosine')
user_similarity

array([[0., 1., 1., ..., 1., 1., 1.],
       [1., 0., 1., ..., 1., 1., 1.],
       [1., 1., 0., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 0., 1., 1.],
       [1., 1., 1., ..., 1., 0., 1.],
       [1., 1., 1., ..., 1., 1., 0.]])

In [54]:
user_similarity.shape

(828, 828)

In [55]:
# item based collaborative recommendation system
item_similarity =pairwise_distances(train_matrix.T,metric='cosine')
item_similarity

array([[0., 1., 1., ..., 1., 1., 1.],
       [1., 0., 1., ..., 1., 1., 1.],
       [1., 1., 0., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 0., 1., 1.],
       [1., 1., 1., ..., 1., 0., 1.],
       [1., 1., 1., ..., 1., 1., 0.]])

In [56]:
item_similarity.shape

(8051, 8051)

# Make Prediction

In [57]:
def prediction(ratings, similarity,type='user'):
    if type== 'user':
        mean_user =ratings.mean(axis=1)
        rating_diff =(ratings-mean_user[:,np.newaxis])
        pred= mean_user[:,np.newaxis]+similarity.dot(rating_diff)/np.array([np.abs(similarity).sum(axis=1)]).T
    elif type=='item':
        pred= ratings.dot(similarity)/np.array([np.abs(similarity).sum(axis=1)])
    return pred

In [58]:
test_matrix.shape 

(828, 8051)

In [59]:
user_prediction= prediction(train_matrix,user_similarity, type='user')
user_prediction

array([[-0.0013735 , -0.0013735 ,  0.00225407, ..., -0.0013735 ,
        -0.0013735 , -0.0013735 ],
       [ 0.00405066, -0.00199529,  0.00163228, ..., -0.00199529,
        -0.00199529, -0.00199529],
       [ 0.06511313,  0.05906554,  0.06269409, ...,  0.05906554,
         0.05906554,  0.05906554],
       ...,
       [ 0.00405066, -0.00199529,  0.00163228, ..., -0.00199529,
        -0.00199529, -0.00199529],
       [ 0.00405066, -0.00199529,  0.00163228, ..., -0.00199529,
        -0.00199529, -0.00199529],
       [ 0.00405066, -0.00199529,  0.00163228, ..., -0.00199529,
        -0.00199529, -0.00199529]])

In [60]:
item_similarity.shape

(8051, 8051)

In [61]:
item_prediction =prediction(train_matrix,item_similarity,type='item')
item_prediction

array([[0.        , 0.00062112, 0.0006212 , ..., 0.00062112, 0.00062112,
        0.00062112],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.06099379, 0.06099379, 0.06100137, ..., 0.06099379, 0.06099379,
        0.06099379],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

# •	Use RMSE to evaluate the predictions


In [64]:
# Importing RMSE function 
from sklearn.metrics import mean_squared_error
from math import sqrt

# Defining custom function to filter out elements with ground_truth.nonzero
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

In [65]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_matrix)))


User-based CF RMSE: 7.818113790950216
Item-based CF RMSE: 7.817194775495292


# Conclusion: 
we have seen that recommendation system by using collaborative filtering user based and item based has an accuracy of model pridiction for the book name according to rating is 78%.
    

In [109]:
final_df.shape

(8701, 9)

In [110]:
final_df['user_id'].value_counts()

278418    3997
277427     490
277639     265
278188     194
277478     187
          ... 
277827       1
277819       1
277811       1
277803       1
278528       1
Name: user_id, Length: 828, dtype: int64

In [114]:
# Extra Work: 
#we can presume the recoomendation code:
# using collaborative filtering method
# how many user rated more than 10
x =final_df.groupby('user_id').count()['rating']>10
x[x]# count the index place where it is true 

user_id
8         True
99        True
242       True
243       True
276762    True
          ... 
278633    True
278637    True
278771    True
278843    True
278851    True
Name: rating, Length: 77, dtype: bool

Interpretation: 
    we can check there are books whose ratings are above 200.

In [115]:
# how many books are there which are rated by 828 user?
required_user =x[x].index
required_user

Int64Index([     8,     99,    242,    243, 276762, 276798, 276822, 276828,
            276847, 276856, 276859, 276866, 276925, 276929, 276939, 276954,
            276964, 276984, 276994, 277042, 277051, 277157, 277168, 277171,
            277187, 277195, 277196, 277203, 277378, 277427, 277439, 277466,
            277478, 277523, 277629, 277639, 277662, 277681, 277710, 277711,
            277744, 277879, 277882, 277901, 277922, 277928, 277929, 277937,
            277945, 277954, 277965, 277982, 277984, 278002, 278026, 278137,
            278144, 278188, 278194, 278202, 278221, 278314, 278346, 278356,
            278390, 278418, 278506, 278522, 278535, 278554, 278563, 278582,
            278633, 278637, 278771, 278843, 278851],
           dtype='int64', name='user_id')

In [116]:
final_df[final_df['user_id'].isin(required_user)]

Unnamed: 0,user_id_order,isbn_id,rating,book_title,book_author,year_of_publication,publisher,user_id,isbn
3,3,2,0,The Notebook,Nicholas Sparks,1996,Warner Books,278418,446520802
8,3,6,0,A Painted House,JOHN GRISHAM,2001,Doubleday,278418,038550120X
10,8,7,0,Lightning,Dean R. Koontz,1996,Berkley Publishing Group,277427,425115801
12,9,8,8,Manhattan Hunt Club,JOHN SAUL,2002,Ballantine Books,278026,449006522
15,3,10,0,Night Sins,TAMI HOAG,1995,Bantam,278418,055356451X
...,...,...,...,...,...,...,...,...,...
8696,96,8046,7,A Map of the World,Jane Hamilton,1999,Anchor Books/Doubleday,243,385720106
8697,96,8047,0,The Accidental Tourist,Anne Tyler,1994,Berkley Publishing Group,243,425092917
8698,96,8048,0,If Morning Ever Comes,Anne Tyler,1983,Berkley Publishing Group,243,425098834
8699,96,8049,9,Unnatural Exposure,Patricia Daniels Cornwell,1998,Berkley Publishing Group,243,425163407


In [117]:
final_df['user_id'].nunique()

828

In [118]:
# lets filter the rating according the  name of the books and the user that rated it?
filter_rating =final_df[final_df['user_id'].isin(required_user)]

In [119]:
filter_rating.shape

(7240, 9)

In [120]:
filter_rating.head()

Unnamed: 0,user_id_order,isbn_id,rating,book_title,book_author,year_of_publication,publisher,user_id,isbn
3,3,2,0,The Notebook,Nicholas Sparks,1996,Warner Books,278418,446520802
8,3,6,0,A Painted House,JOHN GRISHAM,2001,Doubleday,278418,038550120X
10,8,7,0,Lightning,Dean R. Koontz,1996,Berkley Publishing Group,277427,425115801
12,9,8,8,Manhattan Hunt Club,JOHN SAUL,2002,Ballantine Books,278026,449006522
15,3,10,0,Night Sins,TAMI HOAG,1995,Bantam,278418,055356451X


In [121]:
filter_rating.describe()

Unnamed: 0,user_id_order,isbn_id,rating,user_id
count,7240.0,7240.0,7240.0,7240.0
mean,56.077348,3985.956768,1.383978,274224.448481
std,107.459209,2230.542641,3.021188,32601.232483
min,3.0,2.0,0.0,8.0
25%,3.0,2017.75,0.0,277639.0
50%,3.0,4093.5,0.0,278418.0
75%,89.0,5898.25,0.0,278418.0
max,775.0,8050.0,10.0,278851.0


In [122]:
# lets put this in the new variable called as y
y =filter_rating.groupby('book_title').count()['rating']
y[y]

book_title
01-01-00: The Novel of the Millennium    1
01-01-00: The Novel of the Millennium    1
01-01-00: The Novel of the Millennium    1
01-01-00: The Novel of the Millennium    1
01-01-00: The Novel of the Millennium    1
                                        ..
01-01-00: The Novel of the Millennium    1
01-01-00: The Novel of the Millennium    1
01-01-00: The Novel of the Millennium    1
01-01-00: The Novel of the Millennium    1
01-01-00: The Novel of the Millennium    1
Name: rating, Length: 6760, dtype: int64

In [123]:
y.value_counts()

1     6394
2      295
3       49
4       13
5        5
7        1
10       1
6        1
9        1
Name: rating, dtype: int64

In [124]:
famous_books= y[y].index
famous_books

Index(['01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '100 Best-Loved Poems (Dover Thrift Editions)',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       ...
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '100 Best-Loved Poems (Dover Thrift Editions)',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium',
       '01-01-00: The Novel of the Millennium'],
      dtype='o

In [125]:
final_df['book_title']

0           Flesh Tones: A Novel
1               Rites of Passage
2                   The Notebook
3                   The Notebook
4                 Help!: Level 1
                  ...           
8696          A Map of the World
8697      The Accidental Tourist
8698       If Morning Ever Comes
8699          Unnatural Exposure
8700    Only Love (Magical Love)
Name: book_title, Length: 8701, dtype: object

In [126]:
final_rating=filter_rating[filter_rating['book_title'].isin(famous_books)]
final_rating

Unnamed: 0,user_id_order,isbn_id,rating,book_title,book_author,year_of_publication,publisher,user_id,isbn
915,202,749,5,01-01-00: The Novel of the Millennium,R. J. Pineiro,1999,Tor Books (Mm),277168,812568710
1206,239,977,8,101 Dalmatians,Walt Disney,1995,Stoddart+publishing,277203,717284832
1914,8,1566,0,101 Great Resumes,Career Press,1995,Delmar Learning,277427,1564142019
3165,95,2644,9,100 Best-Loved Poems (Dover Thrift Editions),Philip Smith,1995,Dover Publications,277965,486285537
4368,3,3756,0,101 Dalmatians,Justine Korman,1996,Golden Books Publishing Company,278418,307001164
6020,3,5401,0,1001 Ways to Cut Your Expenses,Jonathan D. Pond,1992,Dell Publishing Company,278418,440504953
6564,3,5937,0,101 Bug Jokes,Lisa Eisenberg,1986,Scholastic Paperbacks,278418,590332473
6573,3,5946,0,101 Pet Jokes,Phil Hirsch,1980,Scholastic Inc.,278418,590371177
7480,3,6848,0,100 Days of Fun at School,Janet Palazzo Craig,1998,Troll Communications,278418,816745412
7800,3,7166,0,101 Best Home-Based Businesses for Women,Priscilla Y. Huff,1995,Prima Lifestyles,278418,1559587032


In [127]:
# lets apply the collaborative filtering method by finding the pivot table
pivot =final_rating.pivot_table(index='book_title',columns= 'user_id',values='rating')
pivot

user_id,277168,277203,277427,277965,278418
book_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
01-01-00: The Novel of the Millennium,5.0,,,,
100 Best-Loved Poems (Dover Thrift Editions),,,,9.0,
100 Days of Fun at School,,,,,0.0
1001 Ways to Cut Your Expenses,,,,,0.0
101 Best Home-Based Businesses for Women,,,,,0.0
101 Bug Jokes,,,,,0.0
101 Dalmatians,,8.0,,,0.0
101 Great Resumes,,,0.0,,
101 Pet Jokes,,,,,0.0


In [132]:
pivot.fillna(0, inplace =True)
pivot

user_id,277168,277203,277427,277965,278418
book_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
01-01-00: The Novel of the Millennium,5.0,0.0,0.0,0.0,0.0
100 Best-Loved Poems (Dover Thrift Editions),0.0,0.0,0.0,9.0,0.0
100 Days of Fun at School,0.0,0.0,0.0,0.0,0.0
1001 Ways to Cut Your Expenses,0.0,0.0,0.0,0.0,0.0
101 Best Home-Based Businesses for Women,0.0,0.0,0.0,0.0,0.0
101 Bug Jokes,0.0,0.0,0.0,0.0,0.0
101 Dalmatians,0.0,8.0,0.0,0.0,0.0
101 Great Resumes,0.0,0.0,0.0,0.0,0.0
101 Pet Jokes,0.0,0.0,0.0,0.0,0.0


In [133]:
pivot.shape

(9, 5)

In [135]:
# find the similarity using cosine_similarity pairwise distances
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

In [155]:
similar_score = cosine_similarity(pivot)
similar_score

array([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [156]:
pivot.index

Index(['01-01-00: The Novel of the Millennium',
       '100 Best-Loved Poems (Dover Thrift Editions)',
       '100 Days of Fun at School', '1001 Ways to Cut Your Expenses',
       '101 Best Home-Based Businesses for Women', '101 Bug Jokes',
       '101 Dalmatians', '101 Great Resumes', '101 Pet Jokes'],
      dtype='object', name='book_title')

In [157]:
pivot.index[0]

'01-01-00: The Novel of the Millennium'

In [158]:
# lets define a recoomendation book 
def recommend(book_name):
    index= np.where(pivot.index==book_name)[0][0]
    similar_items =sorted(list(enumerate(similar_score[index])),key=lambda x:x[1],reverse=True)[1:6]
    for i in similar_items:
        print(pivot.index[i[0]])
    
    

In [159]:
recommend('01-01-00: The Novel of the Millennium')

100 Best-Loved Poems (Dover Thrift Editions)
100 Days of Fun at School
1001 Ways to Cut Your Expenses
101 Best Home-Based Businesses for Women
101 Bug Jokes


In [163]:
book_df.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


Interpretation: we have seen after applying the cosine similarity function between book title and user id with respect to ratings, we can ask the recoomendation wihich is similar to the novel '01-01-00: The Novel of the Millennium'
 and we the model had predicted certainly:
Model Prediction: 

100 Best-Loved Poems (Dover Thrift Editions)
100 Days of Fun at School
1001 Ways to Cut Your Expenses
101 Best Home-Based Businesses for Women
101 Bug Jokes
 

In [174]:
# lets call the rating and user id alon with the book title
# we are using the same define class recommmend but with little modification:

def recommend(book_name):
    # fetch index using book_name
    index = np.where(pivot.index==book_name)[0][0]
    similar_items = sorted(list(enumerate(similar_score[index])),key = lambda x: x[1], reverse=True)[1:6]
    
    data=[]
    for i in similar_items:
        items=[]
        temp_df = book_df[book_df['book_title'] == pivot.index[i[0]]]
        items.extend(list(temp_df.drop_duplicates('book_title')['book_title'].values))
        items.extend(list(temp_df.drop_duplicates('book_title')['book_author'].values))
        items.extend(list(temp_df.drop_duplicates('book_title')['publisher'].values))
        
        data.append(items)
    return data

In [175]:
recommend('01-01-00: The Novel of the Millennium')

[['100 Best-Loved Poems (Dover Thrift Editions)',
  'Philip Smith',
  'Dover Publications'],
 ['100 Days of Fun at School', 'Janet Palazzo Craig', 'Troll Communications'],
 ['1001 Ways to Cut Your Expenses',
  'Jonathan D. Pond',
  'Dell Publishing Company'],
 ['101 Best Home-Based Businesses for Women',
  'Priscilla Y. Huff',
  'Prima Lifestyles'],
 ['101 Bug Jokes', 'Lisa Eisenberg', 'Scholastic Paperbacks']]

# Conclusion:
hence we noticed that recommendation of books according to user and ratings has been displayed above.
 Machine Learning Using recoomendation system techniques shows the accuracy of prediction novels is 78%. 
 list of novels, publisher and authors predicted accordingly by using cosine_similarity function and user based and item based collaborative filtering techniques.