<a href="https://colab.research.google.com/github/amanichivilkar/Books-Recommendation-System/blob/main/Amani_Chivilkar_Content_Based_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many other
web services, recommender systems have come to play a growing role in our lives. From
e-commerce (suggesting products that might interest buyers) to online advertising
(showing users content that matches their preferences), recommender systems are
now an unavoidable part of our daily online journeys.
In general terms, recommender systems are algorithms that suggest relevant
items to users (movies to watch, text to read, products to buy, or anything else,
depending on the industry).
Recommender systems are critical in some industries: when they work well they can
generate substantial revenue, and they can help a company stand out from its
competitors. The main objective of this notebook is to build a book recommendation system for users.

 **Content**
--------------
The Book-Crossing dataset comprises 3 files.
*  Users
---
Contains the users. Note that user IDs (User-ID) have been anonymized and map to
integers. Demographic data is provided (Location, Age) if available. Otherwise, these
fields contain NULL values. 
*  Books
---------
Books are identified by their respective ISBN. Invalid ISBNs have already been removed
from the dataset. Moreover, some content-based information is given (Book-Title,
Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web
Services. Note that in the case of several authors, only the first is provided. URLs linking
to cover images are also given, appearing in three different flavors (Image-URL-S,
Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website.
*   Ratings
------------
Contains the book rating information. Ratings (Book-Rating) are either explicit,
expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit,
expressed by 0.
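
As a minimal sketch of what the explicit/implicit distinction means in practice, the snippet below splits a small, made-up ratings frame on the value 0. The frame `ratings_demo` and its values are illustrative assumptions; only the column names come from the dataset description above.

```python
import pandas as pd

# Toy ratings frame (made-up values) using the column names described above.
ratings_demo = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 5, 0],
})

# A rating of 0 marks an implicit interaction; 1 to 10 is an explicit score.
is_explicit = ratings_demo["Book-Rating"] > 0
explicit = ratings_demo[is_explicit]
implicit = ratings_demo[~is_explicit]

# Averages are only meaningful over explicit ratings, since 0 is not a score.
mean_explicit = explicit["Book-Rating"].mean()
print(len(explicit), len(implicit), mean_explicit)
```

Including the implicit zeros in an average would pull it down even though they do not express dislike, which is why the two groups are usually handled separately.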

## **Import data and libraries**

In [None]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
a = np.random.randn(12288, 150)
a.shape

(12288, 150)

In [None]:
b = np.random.randn(150, 45) 

In [None]:
c=np.dot(a,b)
c.shape

(12288, 45)

In [None]:
books = pd.read_csv('/content/drive/MyDrive/data/data_book_recommendation/Books.csv')
ratings = pd.read_csv('/content/drive/MyDrive/data/data_book_recommendation/Ratings.csv')
user = pd.read_csv('/content/drive/MyDrive/data/data_book_recommendation/Users.csv')

In [None]:
books.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], inplace=True)
# Rename only the columns that exist in the books frame
books.rename(columns={'Book-Title':'title', 'Book-Author':'author',
                      'Year-Of-Publication':'year', 'Publisher':'publisher'}, inplace=True)
print(len(books))
books.head(2)

271360


Unnamed: 0,ISBN,title,author,year,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada


In [None]:
# Look up a title that is shared by several different books
books[(books['title']=='Selected Poems')].head()

Unnamed: 0,ISBN,title,author,year,publisher
4523,081120958X,Selected Poems,William Carlos Williams,1985,New Directions Publishing Corporation
39416,0811201465,Selected Poems,K. Patchen,1957,New Directions Publishing Corporation
41316,0679750800,Selected Poems,Rita Dove,1993,Vintage Books USA
106885,0060931744,Selected Poems,Gwendolyn Brooks,1999,Perennial
118775,0517101548,Selected Poems,John Donne,1994,Gramercy Books


*  Because the same title can belong to different authors, we combine the title with the author to tell apart books that share a title.

*  Because the same book can appear under several ISBNs, the ISBN cannot serve as a unique book identifier. Instead, we combine the title and author and create a `book_id` for each unique combination.

In [None]:
# Combine title with author to differentiate between books with the same title
books['title'] = books['title'] + " " + books['author']

In [None]:
# Generating book_id for each unique book title
books['book_id'] = books[['title']].sum(axis=1).map(hash)

In [None]:
books.head()

Unnamed: 0,ISBN,title,author,year,publisher,book_id
0,195153448,Classical Mythology Mark P. O. Morford,Mark P. O. Morford,2002,Oxford University Press,-7293735737548367065
1,2005018,Clara Callan Richard Bruce Wright,Richard Bruce Wright,2001,HarperFlamingo Canada,7149617095307170465
2,60973129,Decision in Normandy Carlo D'Este,Carlo D'Este,1991,HarperPerennial,4608496659329701630
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,1160329172564954229
4,393045218,The Mummies of Urumchi E. J. W. Barber,E. J. W. Barber,1999,W. W. Norton &amp; Company,-8279562061523982204
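
One caveat about the `book_id` above: Python's built-in `hash` is salted per interpreter process for strings by default, so the same title can receive a different id in a new session. If ids need to be reproducible (for example, to save and reload results), a deterministic alternative could look like the sketch below. The helper `stable_id` and the frame `books_demo` are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import hashlib
import pandas as pd

# Toy frame with combined "title author" strings, as built above.
books_demo = pd.DataFrame({
    "title": [
        "Selected Poems Rita Dove",
        "Selected Poems John Donne",
        "Selected Poems Rita Dove",
    ],
})

def stable_id(text: str) -> int:
    """Hypothetical helper: first 8 bytes of a SHA-256 digest as a signed 64-bit int."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Same string always maps to the same id, in any session.
books_demo["book_id"] = books_demo["title"].map(stable_id)

# Alternative: small consecutive integer codes, one per distinct title.
books_demo["book_code"] = pd.factorize(books_demo["title"])[0]
print(books_demo)
```

The digest-based id is stable across runs and machines, while `pd.factorize` gives compact codes that depend on row order; which one fits better depends on whether the ids need to survive a reshuffle of the data.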


In [None]:
print(len(ratings))
ratings.rename(columns={'User-ID':'user_id', 'Book-Rating':'rating'}, inplace=True)
ratings.head(2)

1149780


Unnamed: 0,user_id,ISBN,rating
0,276725,034545104X,0
1,276726,0155061224,5


## **Data Preprocessing**

In [None]:
# Select users who have given more than 50 ratings
df1 = ratings.groupby(['user_id'])['rating'].count().reset_index()
list_of_imp_user = list(df1[df1['rating'] > 50]['user_id'])
len(list_of_imp_user)

3371

In [None]:
# Keep only the ratings from users who have given more than 50 ratings
ratings = ratings[ratings['user_id'].isin(list_of_imp_user)]
print(len(ratings))
ratings.head()

765672


Unnamed: 0,user_id,ISBN,rating
173,276847,446364193,0
174,276847,3257200552,5
175,276847,3379015180,0
176,276847,3404145909,8
177,276847,3404148576,8
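
The user filter above can also be written in one step with a grouped `transform`, which avoids the intermediate list of user ids. The sketch below uses a toy frame and a small threshold so the boundary case is visible; `ratings_demo` and `min_ratings` are illustrative assumptions, not names from the notebook.

```python
import pandas as pd

# Toy data: user 1 has 3 ratings, user 2 has 1, user 3 has 2.
ratings_demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "C", "A", "B", "C"],
    "rating": [0, 5, 8, 7, 0, 9],
})

min_ratings = 2  # hypothetical threshold; the notebook compares against 50

# Number of ratings per user, broadcast back onto every row of that user.
n_per_user = ratings_demo.groupby("user_id")["rating"].transform("count")

# Strict comparison, as in the notebook: a user with exactly
# `min_ratings` ratings is dropped.
active = ratings_demo[n_per_user > min_ratings]

# Inclusive comparison would also keep user 3 in this toy example.
active_inclusive = ratings_demo[n_per_user >= min_ratings]
print(len(active), len(active_inclusive))
```

The choice between `>` and `>=` only matters for users sitting exactly on the threshold, but it is worth stating explicitly so the reported user count can be reproduced.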


In [None]:
# merge ratings with books
df=ratings.merge(books, on='ISBN')
print(len(df))
df.head()

700848


Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,book_id
0,276847,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
1,278418,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
2,5483,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
3,7346,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
4,8362,446364193,0,Along Came a Spider (Alex Cross Novels) James ...,James Patterson,1993,Warner Books,6935983676206818760
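
`DataFrame.merge` performs an inner join by default, and the row count here falls from 765672 to 700848, presumably because some rated ISBNs do not appear in the books file. A minimal way to inspect such unmatched rows, shown on toy frames (`ratings_demo` and `books_demo` are made-up names and values), is a left join with `indicator=True`:

```python
import pandas as pd

# Toy frames: ISBN "Z" is rated but missing from the books table.
ratings_demo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "ISBN": ["A", "B", "Z"],
    "rating": [5, 0, 8],
})
books_demo = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "title": ["t1", "t2", "t3"],
})

# A left join keeps every rating; the `_merge` column records where each row matched.
merged = ratings_demo.merge(books_demo, on="ISBN", how="left", indicator=True)

# Ratings whose ISBN has no entry in the books table.
unmatched = merged[merged["_merge"] == "left_only"]

# Rows that an inner join would have kept.
inner = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(len(unmatched), len(inner))
```

Looking at `unmatched` before discarding it can reveal whether the losses come from malformed ISBNs or from books that are simply absent from the catalogue.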


In [None]:
print(f"unique title = {df['title'].nunique()}")
print(f"unique ISBN = {df['ISBN'].nunique()}")

unique title = 206071
unique ISBN = 221678


The same title can appear under more than one ISBN, for example:

*  ISBN=014028009X, title=Bridget Jones's Diary, year=1999
*  ISBN=0330375253, title=Bridget Jones's Diary, year=2001
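
To see which titles are spread over several ISBNs, one option is to count distinct ISBNs per title. The sketch below does this on a three-row toy frame built from rows shown in this notebook; `df_demo` is an illustrative name, and on the real data the same pattern would be applied to `df`.

```python
import pandas as pd

# Three rows taken from the examples shown in this notebook.
df_demo = pd.DataFrame({
    "ISBN": ["014028009X", "0330375253", "0679750800"],
    "title": ["Bridget Jones's Diary", "Bridget Jones's Diary", "Selected Poems"],
    "year": [1999, 2001, 1993],
})

# Number of distinct ISBNs recorded for each title.
isbn_per_title = df_demo.groupby("title")["ISBN"].nunique()

# Titles that exist in more than one edition (more than one ISBN).
multi_edition = isbn_per_title[isbn_per_title > 1]
print(multi_edition)
```

Ratings for such titles are split across editions when keyed by ISBN, which is the motivation for aggregating on a title-based id instead.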