# Deep Learning for Book Recommendation System
This project builds a recommendation system that suggests books to users based on their past behavior, i.e., the books they liked and the ratings they gave.
## About the dataset:
The Book-Crossing dataset can be found on the following website:

http://www2.informatik.uni-freiburg.de/~cziegler/BX/

This dataset is a collaborative filtering dataset and contains information about users, books, and ratings. It was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community, and contains 278,858 users (anonymized) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.
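
The "explicit / implicit" distinction matters when the ratings are later used as training targets. As a minimal sketch, assuming (per the dataset's description) that a `Book-Rating` of 0 marks an implicit interaction and 1 to 10 an explicit score, the two kinds can be separated as follows. The toy frame is made up for illustration and is not part of the project code:

```python
import pandas as pd

# Hypothetical toy ratings; a rating of 0 is assumed to mean implicit feedback
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["034545104X", "0155061224", "0446520802", "0155061224"],
    "Book-Rating": [0, 5, 0, 8],
})

# Explicit ratings carry a score; implicit ones only record an interaction
explicit = toy_ratings[toy_ratings["Book-Rating"] > 0]
implicit = toy_ratings[toy_ratings["Book-Rating"] == 0]
print(len(explicit), len(implicit))
```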

## Load all necessary libraries
This section imports the libraries used throughout the walkthrough.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf

## Load the datasets into this workspace

In [2]:
ratings = pd.read_csv("Data/BX-Book-Ratings.csv", delimiter=";", on_bad_lines='skip')
books = pd.read_csv("Data/BX-Books.csv", delimiter=";", on_bad_lines='skip', low_memory=False)
users = pd.read_csv("Data/BX-Users.csv", delimiter=";", on_bad_lines='skip')
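
Note that `on_bad_lines='skip'` silently drops rows that do not parse. The sketch below shows the effect on a small in-memory sample (the sample rows are hypothetical, not taken from the real files), so it may be worth comparing row counts against the figures quoted for the dataset:

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited sample; the last row has one field too many
raw = io.StringIO(
    "User-ID;ISBN;Book-Rating\n"
    "1;034545104X;0\n"
    "2;0155061224;5\n"
    "3;0446520802;7;unexpected\n"
)

# The malformed row is dropped instead of raising a parser error
sample = pd.read_csv(raw, delimiter=";", on_bad_lines="skip")
print(len(sample))
```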

## Datasets overview and information

### Book Ratings
To get a glimpse of the dataset, I will show the first 3 rows of the book ratings dataset.

In [3]:
ratings.head(3)

|  | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |


* Brief information about different columns of the ratings dataframe:

In [4]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


The `ISBN` (i.e., the book ID) is stored as an `object` column. This happens because some IDs contain non-numeric characters, such as the trailing `X` in `034545104X`.
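
To illustrate how pandas infers the column type, here is a small sketch on made-up in-memory CSV data. A column of digit-only ISBNs is parsed as integers (losing leading zeros), while a single non-numeric character keeps the whole column as text:

```python
import io
import pandas as pd

# Digit-only ISBNs are inferred as integers, so leading zeros are lost
numeric_only = pd.read_csv(io.StringIO("ISBN\n0155061224\n0446520802\n"))

# One ISBN ending in "X" makes pandas keep every value as text
with_check_char = pd.read_csv(io.StringIO("ISBN\n034545104X\n0155061224\n"))

# Forcing a string dtype preserves the IDs exactly as written
as_text = pd.read_csv(io.StringIO("ISBN\n0155061224\n0446520802\n"), dtype={"ISBN": str})

print(numeric_only["ISBN"].dtype, with_check_char["ISBN"].dtype, as_text["ISBN"].dtype)
```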

### Book information
* Top 3 rows of the books information:

In [5]:
books.head(3)

|  | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |


* Brief information about different columns of the books dataframe:

In [6]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271361 entries, 0 to 271360
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271361 non-null  object
 1   Book-Title           271361 non-null  object
 2   Book-Author          271360 non-null  object
 3   Year-Of-Publication  271361 non-null  object
 4   Publisher            271359 non-null  object
 5   Image-URL-S          271361 non-null  object
 6   Image-URL-M          271361 non-null  object
 7   Image-URL-L          271358 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


The same is true for this dataset: the `ISBN` column is stored as an `object` instead of an integer.

### Users information
* First 5 rows of the users information:

In [7]:
users.head()

|  | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |


* Brief information about different columns of the users dataframe:

In [8]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


### OBSERVATIONS:
- The `books` dataframe contains columns that are not needed for this analysis, namely `Image-URL-S`, `Image-URL-M`, and `Image-URL-L`.
- Some rows in the `books` dataframe have values from other columns in `Year-Of-Publication`, which causes the column to be loaded as an `object` instead of an integer.
- The `ISBN` column (i.e., the book ID) is loaded as an `object` because some values contain non-numeric characters.
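
One way to locate the offending `Year-Of-Publication` rows is to coerce the column to numbers and look at what fails. This is only a sketch on a hypothetical three-row sample; the misparsed row below is invented to mimic the problem and is not copied from the dataset:

```python
import pandas as pd

# Hypothetical sample: the last row mimics a record whose fields shifted,
# leaving a publisher name in the year column
toy_books = pd.DataFrame({
    "Book-Title": ["Classical Mythology", "Clara Callan", "Some Misparsed Title"],
    "Year-Of-Publication": ["2002", "2001", "Example Publisher Inc"],
})

# Values that cannot be parsed as numbers become NaN
years = pd.to_numeric(toy_books["Year-Of-Publication"], errors="coerce")
bad_rows = toy_books[years.isna()]
print(bad_rows)
```

The same pattern could be applied to `books` to inspect or drop such rows before using the year as a feature.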

## Data Cleaning
In this section, I will remove the unnecessary columns and the invalid rows from the dataframes. Some book IDs contain non-numeric characters, which causes them to be stored as objects. So, I will use `LabelEncoder` from the sklearn library to encode every ID in the dataset as an integer. The user IDs are already integers, but they are encoded as well so that both sets of IDs are zero-based and contiguous. Before that, I will drop books and users with too few ratings, and then remove ratings whose `User-ID` or `ISBN` does not appear in the `users` or `books` dataframe.

* Drop all redundant columns from the books dataframe:

In [9]:
# Drop unnecessary columns
books.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], inplace=True)

* Remove IDs with too few ratings from the dataframe:

In [10]:
# Remove rarely rated books (fewer than 5 ratings)
rbook_counts = ratings['ISBN'].value_counts()
ratings = ratings[ratings['ISBN'].isin(rbook_counts[rbook_counts >= 5].index)]

# Remove low-activity users (fewer than 10 ratings)
ruser_counts = ratings['User-ID'].value_counts()
ratings = ratings[ratings['User-ID'].isin(ruser_counts[ruser_counts >= 10].index)]
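
The count-based filter can be checked on a toy frame. The helper name `filter_by_min_count` and the thresholds below are illustrative assumptions, not part of the project code:

```python
import pandas as pd

def filter_by_min_count(df, column, min_count):
    """Keep rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].value_counts()
    return df[df[column].isin(counts[counts >= min_count].index)]

toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["A", "B", "C", "A", "B", "A"],
})

kept = filter_by_min_count(toy, "ISBN", 2)      # drops book "C" (1 rating)
kept = filter_by_min_count(kept, "User-ID", 2)  # drops user 3 (1 remaining rating)
print(kept)
```

Because the user filter runs after the book filter, it can push some books back below the book threshold. If strict thresholds matter, the two filters could be repeated until the row count stops changing.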

* Remove ratings whose IDs are not in the books and users dataframes:

In [11]:
ratings = ratings[ratings['ISBN'].isin(books['ISBN'].values)]
ratings = ratings[ratings['User-ID'].isin(users['User-ID'].values)]

* Transform all IDs to their integer equivalents:

In [12]:
# Transform both the user and the book IDs
isbn_transformer = LabelEncoder().fit(books['ISBN'])
books['ISBN'] = isbn_transformer.transform(books['ISBN'])
ratings['ISBN'] = isbn_transformer.transform(ratings['ISBN'])

userid_transformer = LabelEncoder().fit(users['User-ID'])
users['User-ID'] = userid_transformer.transform(users['User-ID'])
ratings['User-ID'] = userid_transformer.transform(ratings['User-ID'])
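
As a small illustration of what `LabelEncoder` does here, the sketch below encodes a few ISBN strings, maps the codes back, and shows that an ID the encoder never saw raises an error. That is why the ratings are restricted to known books and users before this step. The `"unknown-isbn"` value is hypothetical:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

isbns = np.array(["034545104X", "0155061224", "0446520802", "0155061224"])

# Codes are assigned by sorted order of the distinct values seen during fit
encoder = LabelEncoder().fit(isbns)
codes = encoder.transform(isbns)

# The mapping is reversible, which helps when reporting recommendations
original = encoder.inverse_transform(codes)

# Transforming a value that was not seen during fit raises ValueError
try:
    encoder.transform(["unknown-isbn"])
    unseen_failed = False
except ValueError:
    unseen_failed = True

print(codes, original, unseen_failed)
```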

## Brief overview of dataframes after data cleaning

#### Ratings

In [13]:
ratings.head()

|  | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 133 | 276821 | 3504 | 10 |
| 134 | 276821 | 21203 | 9 |
| 137 | 276821 | 62977 | 9 |
| 138 | 276821 | 65338 | 10 |
| 139 | 276821 | 83980 | 0 |


In [14]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 540760 entries, 133 to 1149772
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   User-ID      540760 non-null  int64
 1   ISBN         540760 non-null  int32
 2   Book-Rating  540760 non-null  int64
dtypes: int32(1), int64(2)
memory usage: 14.4 MB


#### Books

In [15]:
books.head()

|  | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher |
|---|---|---|---|---|---|
| 0 | 25028 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press |
| 1 | 73 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada |
| 2 | 8211 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial |
| 3 | 60198 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux |
| 4 | 71711 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company |


In [16]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271361 entries, 0 to 271360
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271361 non-null  int32 
 1   Book-Title           271361 non-null  object
 2   Book-Author          271360 non-null  object
 3   Year-Of-Publication  271361 non-null  object
 4   Publisher            271359 non-null  object
dtypes: int32(1), object(4)
memory usage: 9.3+ MB


#### Users

In [17]:
users.head()

|  | User-ID | Location | Age |
|---|---|---|---|
| 0 | 0 | nyc, new york, usa | NaN |
| 1 | 1 | stockton, california, usa | 18.0 |
| 2 | 2 | moscow, yukon territory, russia | NaN |
| 3 | 3 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 4 | farnborough, hants, united kingdom | NaN |


In [18]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


## Building Recommendation System
In this section, I will build the recommendation system. Before that, the data should be split into a training set and a testing set so that the model's performance can be tested. The `train_test_split()` function from sklearn's `model_selection` module can perform this split.
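
A minimal sketch of such a split on a toy ratings frame is shown here. Whether `RecommendationSystem` performs the split internally is not shown in this walkthrough, and the 80/20 ratio and `random_state` are assumptions chosen for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy ratings with already-encoded integer IDs
toy_ratings = pd.DataFrame({
    "User-ID": list(range(10)),
    "ISBN": list(range(10)),
    "Book-Rating": [0, 5, 0, 8, 10, 7, 0, 9, 6, 3],
})

# Hold out 20% of the rows for testing; fix the seed for reproducibility
train, test = train_test_split(toy_ratings, test_size=0.2, random_state=42)
print(len(train), len(test))
```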

In [None]:
from model import RecommendationSystem

In [None]:
rmodel = RecommendationSystem(ratings=ratings,
                              books=books,
                              users=users)

In [None]:
rmodel.build_fit_model()
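
The internals of `build_fit_model()` live in `model.py` and are not shown here. As a rough sketch of one common deep learning approach to collaborative filtering, and not necessarily what `RecommendationSystem` does, the model below learns an embedding per user and per book and predicts a rating from their dot product. The function name, embedding size, and toy data are all assumptions:

```python
import numpy as np
import tensorflow as tf

def build_embedding_model(n_users, n_books, embedding_dim=16):
    """Hypothetical dot-product model over user and book embeddings."""
    user_in = tf.keras.layers.Input(shape=(1,), dtype="int32", name="user_id")
    book_in = tf.keras.layers.Input(shape=(1,), dtype="int32", name="book_id")

    # Each integer ID is mapped to a trainable vector
    user_vec = tf.keras.layers.Flatten()(
        tf.keras.layers.Embedding(n_users, embedding_dim)(user_in))
    book_vec = tf.keras.layers.Flatten()(
        tf.keras.layers.Embedding(n_books, embedding_dim)(book_in))

    # The predicted rating is the similarity of the two vectors
    score = tf.keras.layers.Dot(axes=1)([user_vec, book_vec])

    model = tf.keras.Model(inputs=[user_in, book_in], outputs=score)
    model.compile(optimizer="adam", loss="mse")
    return model

# Toy data: 4 ratings from 3 users on 5 books (encoded IDs)
user_ids = np.array([[0], [1], [2], [0]], dtype="int32")
book_ids = np.array([[1], [3], [4], [2]], dtype="int32")
targets = np.array([[8.0], [5.0], [10.0], [7.0]], dtype="float32")

model = build_embedding_model(n_users=3, n_books=5, embedding_dim=4)
model.fit([user_ids, book_ids], targets, epochs=1, verbose=0)
predictions = model.predict([user_ids, book_ids], verbose=0)
print(predictions.shape)
```

Zero-based, contiguous IDs from the encoding step are what make the `Embedding` lookups possible, since each ID is used directly as a row index.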

#### Recommend books to a user [`276821`]
> * These are the books the interested user (i.e., `276821`) has read:

In [None]:
books.loc[books['ISBN'].isin(ratings.loc[ratings['User-ID']==276821, "ISBN"])]

* These are the books recommended to the user:

In [None]:
rmodel.recommend_books(276821)
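
How `recommend_books()` picks its results is not shown in this walkthrough. A typical approach, sketched below under that caveat, is to score every book for the user, exclude the books the user has already rated, and keep the highest-scoring ones. The helper `top_n_unread` and the scores are hypothetical:

```python
import numpy as np

def top_n_unread(scores, read_ids, n=3):
    """Return the indices of the n highest-scoring books not in `read_ids`."""
    scores = np.asarray(scores, dtype=float).copy()
    # Books already read can never be recommended
    scores[list(read_ids)] = -np.inf
    # Sort by descending score and keep the first n indices
    order = np.argsort(-scores)
    return order[:n].tolist()

# Hypothetical predicted scores for 5 books (index = encoded book ID)
predicted = [0.1, 0.9, 0.4, 0.8, 0.3]
recs = top_n_unread(predicted, read_ids={1}, n=2)
print(recs)
```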

More details about the different recommended books:

In [None]:
rmodel.recommended_books_table()

#### Recommend books to a user [`242`]
> * These are the books the interested user (i.e., `242`) has read:

In [None]:
books.loc[books['ISBN'].isin(ratings.loc[ratings['User-ID']==242, "ISBN"])]

* These are the books recommended to the user:

In [None]:
rmodel.recommend_books(242)

More details about the different recommended books:

In [None]:
rmodel.recommended_books_table()