# Books Recommendation system - Data Wrangling and EDA
## Capstone Project Two : Springboard Data Science career track
### Notebook by Debisree Ray

For the source code and details of the data wrangling and EDA, visit the following notebook:
https://github.com/debisree/Springboard-Data-Science-Career-Track/blob/master/Capstone_2_Book_Recommending_System/books_eda.ipynb

### Acknowledgement:
* Mentor: Max Sop 

* Springboard Team

* Book crossing (Cai-Nicolas Ziegler) for the data

* Cover image: Internet

## 1. Introduction - The Problem:

Online recommendation systems are now a standard feature of many e-commerce websites. A recommendation system suggests the products best suited to a customer's tastes and traits. This project builds several kinds of book recommendation engines: a Simple Generic Recommender, a Content-Based Filter and a User-Based Collaborative Filter. The performance of the systems will be evaluated both qualitatively and quantitatively.

<img src="book.png" align="center" width="100%"/>

## 2. The Client:

Any e-commerce business or online book-selling portal is a potential client.

## 3. The Data:

The data is from the 'Book Crossing dataset'. http://www2.informatik.uni-freiburg.de/~cziegler/BX/

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books, and ratings. All three files listed below are available in the CSV dump file (BX-CSV-Dump.zip).

* **BX-Book-Ratings.csv** (referred to as the rating file) has the following data fields:
     * **User ID: -**    The ID of the reviewer
     * **ISBN: -**     International Standard Book Number (Unique no. identifying the book)
     * **Book Rating: -**    Numeric rating on a 0-10 scale (1-10 are explicit ratings; 0 marks an implicit rating, see Section 6.3)
     
     
     
     
* **BX-Users.csv** (referred to as the users' file) has the following data fields:
     * **User ID: -**  The ID of the reviewer
     * **Location: -**  City of the reviewer
     * **Age: -**   Age of the reviewer
     
     
     
     
* **BX-Books.csv** (referred to as the books file) has the following data fields:

     * **ISBN: -** International Standard Book Number (Unique no. identifying the book)
     * **Book Title: -** Title of the book
     * **Book Author: -** Author name
     * **Year of Publication: -** Year
     * **Publisher: -** Publisher name/company
     * **Image-URL-S: -** URL
     * **Image-URL-M: -** URL
     * **Image-URL-L: -** URL

## 4. The questions of interest:
The data analysis and story-telling report is organized around the following questions of interest:

* Do the ratings depend on the age of the reviewers?

* What are the unique features affecting the user rating?

* Is there any relation between the cities and the rating?

* Which locations have the most and the fewest reviewers?

* Do the year of publication and the publisher have any effect on the rating?

* Distribution of ratings by book
    - Most rated books
    - Highest rated books
    - Highest variance of ratings (most controversial books)
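
The three per-book views listed above could be computed along the following lines. This is only a minimal sketch on toy data: the column names follow the rating file, while the values and the `min_ratings` threshold are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Toy ratings table; column names follow BX-Book-Ratings.csv, values are made up.
ratings = pd.DataFrame({
    "ISBN": ["A", "A", "A", "B", "B", "C"],
    "Book-Rating": [8, 9, 10, 2, 10, 7],
})

# Per-book summary: number of ratings, mean rating and rating variance.
book_stats = ratings.groupby("ISBN")["Book-Rating"].agg(
    n_ratings="count", mean_rating="mean", var_rating="var"
)

# Most rated book.
most_rated = book_stats["n_ratings"].idxmax()

# Hypothetical threshold so that books with a single rating do not dominate
# the "highest rated" and "most controversial" rankings.
min_ratings = 2
eligible = book_stats[book_stats["n_ratings"] >= min_ratings]

highest_rated = eligible["mean_rating"].idxmax()
most_controversial = eligible["var_rating"].idxmax()
```

On the real data, the implicit '0' ratings would probably need to be excluded first, since they do not express a score.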

## 5. Data Wrangling:

This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

**NumPy:** Provides a fast numerical array structure and helper functions.

**pandas:** Provides a DataFrame structure to store data in memory and work with it easily and efficiently.

**scikit-learn:** The essential Machine Learning package in Python.

**Matplotlib:** Basic plotting library in Python; most other Python plotting libraries are built on top of it.

**Seaborn:** Advanced statistical plotting library.

There are three datasets, which we import and then merge into a single final dataframe.

* The first dataset is **'BX-Book-Ratings.csv'**. It has three columns and 1149780 rows. The columns are 'User-ID', 'ISBN', and 'Book-Rating'. There are no missing values.

<img src="data1.png" align="center" width="25%"/>

* The second dataset is **'BX-Users.csv'**. It also has three columns, with 278858 rows. The columns are 'User-ID', 'Location', and 'Age'. The 'Age' column has missing values (39.7%).

<img src="data2.png" align="center" width="30%"/>

* The third dataset is **'BX-Books.csv'**. It has 8 columns and 271360 rows. The columns are 'ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M', and 'Image-URL-L'. The last three columns are of little use for model building, so we dropped them. There are almost no missing values.

<img src="data3.png" align="center" width="70%"/>
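
The missing-value percentages and the column drop described above can be reproduced roughly as follows. This is a sketch on small made-up frames, not the notebook's code; only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the users file (values are made up).
users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Location": ["city, state, country"] * 5,
    "Age": [34.0, np.nan, 27.0, np.nan, 51.0],
})

# Percentage of missing values per column.
missing_pct = users.isna().mean() * 100

# Toy stand-in for the books file (values are made up).
books = pd.DataFrame({
    "ISBN": ["A", "B"],
    "Book-Title": ["Title A", "Title B"],
    "Book-Author": ["Author A", "Author B"],
    "Year-Of-Publication": [1999, 2002],
    "Publisher": ["Pub A", "Pub B"],
    "Image-URL-S": ["url", "url"],
    "Image-URL-M": ["url", "url"],
    "Image-URL-L": ["url", "url"],
})

# Drop the three image URL columns, which are not used for modelling.
image_cols = ["Image-URL-S", "Image-URL-M", "Image-URL-L"]
books = books.drop(columns=image_cols)
```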


We merged the dataframes with the pandas `merge` function to build the final dataframe, **'final_df'**. It has 1031136 rows and 9 columns: 'User-ID', 'ISBN', 'Book-Rating', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Location', and 'Age'. (The data fields are described above.)

The 'Age' column has 26.94% missing values.

<img src="final_df.png" align="center" width="100%"/>
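
A merge of this kind might look like the sketch below. The `how="inner"` choice is an assumption: it is consistent with the final dataframe having fewer rows than the rating file (ratings whose ISBN is absent from the books file are dropped), but the notebook may do it differently. All values are toy data.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "Z"],  # "Z" has no entry in the books table
    "Book-Rating": [0, 8, 10, 5],
})
books = pd.DataFrame({
    "ISBN": ["A", "B"],
    "Book-Title": ["Title A", "Title B"],
    "Book-Author": ["Author A", "Author B"],
    "Year-Of-Publication": [1999, 2002],
    "Publisher": ["Pub A", "Pub B"],
})
users = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "Location": ["c1, s1, n1", "c2, s2, n2", "c3, s3, n3"],
    "Age": [34.0, np.nan, 27.0],
})

# Ratings -> books on ISBN, then -> users on User-ID (inner joins assumed).
final_df = (
    ratings
    .merge(books, on="ISBN", how="inner")
    .merge(users, on="User-ID", how="inner")
)

# Share of missing ages in the merged frame, in percent.
age_missing_pct = final_df["Age"].isna().mean() * 100
```

With inner joins, the rating for the unknown ISBN "Z" disappears, which mirrors the drop in row count reported above.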

## 6. Exploratory Data Analysis (EDA) :
### 6.1 User-ID:

* User-ID is a categorical variable that identifies a user.
* There are 92106 unique User-IDs.
* 44.3% of the users are frequent reviewers (they reviewed more than once), while 55.7% are one-time reviewers.
* The reviewer with the User-ID '11676' is the most frequent reviewer (11144 reviews).
* The next most frequent reviewer is the one with User-ID '198711' (6456 reviews), and so on.
* Below are the 20 most frequent reviewers and their total number of reviews.

<img src="1.png" align="center" width="50%"/>
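
Counts like these are typically obtained with `value_counts`. The following sketch uses a toy 'User-ID' column, so the numbers differ from the ones reported above.

```python
import pandas as pd

# Toy merged frame: one row per review (values are made up).
final_df = pd.DataFrame({"User-ID": [11, 11, 11, 22, 33, 33, 44, 55]})

# Number of reviews per user, sorted from most to least frequent.
reviews_per_user = final_df["User-ID"].value_counts()

n_users = reviews_per_user.size
# Share of users with more than one review, in percent.
returning_share = (reviews_per_user > 1).mean() * 100
# Top reviewers (at most 20 rows).
top_reviewers = reviews_per_user.head(20)
```

The same pattern applies to 'ISBN', 'Book-Author' and 'Publisher' in the sections that follow.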


### 6.2 ISBN:

* 270151 unique books (ISBNs) are listed.
* Some books are very popular and have been reviewed many times, while others are less popular and were reviewed only once.
* 46.1% of the books have been reviewed more than once, while 53.9% have been reviewed just once.
* The most popular book (ISBN = 0971880107) has been reviewed 2502 times.
* The next most popular book (ISBN = 0316666343) has been reviewed 1295 times, and so on.

<img src="2.png" align="center" width="50%"/>

### 6.3 Ratings:

* This is the numerical rating given to a book.
* The recorded scale runs from 0 to 10.
* 62.8% of the ratings are '0', making it the most frequent value.
* The next most frequent rating is '8' (8.9%), followed by '10' (6.9%).

<img src="3.png" align="left" width="50%"/><img src="4.png" align="right" width="50%"/>
* The rating scale is actually 1 to 10; a '0' indicates an 'implicit' rather than an 'explicit' rating. An implicit rating records an interaction (which may be positive or negative) between the user and the item. We therefore split the dataframe into explicit and implicit ratings.

* The 'rating_explicit' dataframe (ratings 1-10 only) has 383842 rows, whereas the 'rating_implicit' dataframe (rating 0 only) has 647294 rows.

* Among the explicit ratings, 23.9% have a score of '8'.
<img src="3a.png" align="left" width="50%"/><img src="4a.png" align="right" width="50%"/>
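
The split into explicit and implicit ratings amounts to a boolean filter on 'Book-Rating'. A minimal sketch on toy data, reusing the dataframe names mentioned above:

```python
import pandas as pd

# Toy merged frame (values are made up).
final_df = pd.DataFrame({"Book-Rating": [0, 0, 0, 8, 10, 8, 5, 0]})

# Explicit ratings are 1-10; a rating of 0 marks an implicit interaction.
rating_explicit = final_df[final_df["Book-Rating"] != 0]
rating_implicit = final_df[final_df["Book-Rating"] == 0]

# Percentage of each score among the explicit ratings only.
explicit_share = rating_explicit["Book-Rating"].value_counts(normalize=True) * 100
```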

### 6.4 Book-Title:

* These are the titles of the books, stored as strings (object type).
* There are 241071 unique book titles, compared with 270151 unique ISBNs (book identifier codes). The gap of 29080 indicates that some titles map to more than one ISBN (for example, different editions of the same book) or that some ISBNs lack title information.
* The top 20 book titles (and the corresponding number of reviews) are as follows:

<img src="title.png" align="center" width="50%"/>
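
One way to look into the gap between the number of unique ISBNs and unique titles is to count how many ISBNs share each title and how many titles are missing. The sketch below does this on a toy table; it is not taken from the notebook.

```python
import pandas as pd

# Toy books table: two editions of one title, one single-ISBN title,
# and one ISBN without a title (values are made up).
books = pd.DataFrame({
    "ISBN": ["A1", "A2", "B1", "C1"],
    "Book-Title": ["Dune", "Dune", "Emma", None],
})

n_isbn = books["ISBN"].nunique()
n_titles = books["Book-Title"].nunique()  # missing titles are not counted

# Number of distinct ISBNs per title; groupby skips missing titles by default.
isbns_per_title = books.groupby("Book-Title")["ISBN"].nunique()
shared_titles = isbns_per_title[isbns_per_title > 1]

n_missing_titles = int(books["Book-Title"].isna().sum())
```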

### 6.5 Book-Author:

* There are 101588 unique author names.
* The most reviewed author is **Stephen King** (10053 reviews).
* 51.5% of the authors are 'popular', meaning their books were reviewed more than once, whereas 48.5% had their books reviewed only once.
* In one row, the book-author name is null.
* The top 20 authors (and the corresponding number of reviews) are as follows:
<img src="author.png" align="center" width="30%"/>
<img src="5.png" align="center" width="50%"/>

### 6.6 Year of Publication:

* This is a numerical feature giving the year of publication.
* This column contains many zeros and NaNs. We replaced all zeros with NaN.
* There are 115 unique years of publication.
* The oldest recorded publication year is 1376.
* 2002 is the most frequent publication year (it appears 91800 times).
* The latest year of publication is 2050, which cannot be correct given that the dataset was compiled in 2004.
* The plot shows the general trend that more recent books are much more frequent.
<img src="6.png" align="center" width="60%"/>
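
Replacing the zero years with NaN and summarising the column can be sketched as follows (toy values only; the numbers above come from the real data):

```python
import numpy as np
import pandas as pd

# Toy merged frame (values are made up).
final_df = pd.DataFrame(
    {"Year-Of-Publication": [2002, 0, 1999, 2002, 2050, 0, 1376]}
)

# A year of 0 is a placeholder for "unknown", so treat it as missing.
year = final_df["Year-Of-Publication"].replace(0, np.nan)
final_df["Year-Of-Publication"] = year

n_unique_years = year.nunique()        # NaN is not counted
most_common_year = year.mode().iloc[0]
oldest, latest = year.min(), year.max()
```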

### 6.7 Publisher:

* 16729 publisher names are listed, of which 572 are uniquely different.
* The most popular one is 'Ballantine Books' (with 34724 books).
* 57.2% of the publishing houses are popular (their books were reviewed more than once).
* 42.8% of the publishers are not popular (their books were reviewed only once).
* The top 20 publishers (and the corresponding number of reviews) are as follows:
<img src="pub.png" align="center" width="30%"/>
<img src="7.png" align="center" width="50%"/>

### 6.8 Location:

* This is a categorical variable giving the geographic location of the reviewers.
* There are 762 unique locations (city, state, country).
* The most popular location is **'toronto, ontario, canada'**, and one of the least popular is **'essex, grays, united kingdom'**.
* The location was split into three separate columns (city, state and country) for easier analysis.

<img src="loc.png" align="center" width="80%"/>
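
Splitting the 'Location' string into three columns can be done with the pandas string accessor. The sketch below assumes the "city, state, country" layout seen in the examples above; the new column names `city`, `state` and `country` are illustrative.

```python
import pandas as pd

# Toy merged frame using location strings in the dataset's format.
final_df = pd.DataFrame({
    "Location": [
        "toronto, ontario, canada",
        "essex, grays, united kingdom",
        "omaha, nebraska, usa",
    ]
})

# Split on the first two commas only, so any extra commas stay in the last part.
parts = final_df["Location"].str.split(",", n=2, expand=True)
# Remove the spaces that follow each comma.
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["city", "state", "country"]

final_df = final_df.join(parts)
```

In the real data, locations with fewer than three parts would leave some of the new columns empty, so they may need separate handling.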

* 14670 unique cities are listed (including 'NA').
* 1959 unique states are listed.
* 452 unique countries are listed.
* Toronto is the most popular city (15124 reviewers are from this city alone). Many cities have only a single rating (e.g. hillsoboro).
* California is the most popular state (107465 reviews in total), followed by Texas (44158).
* The USA is the most popular country (745052 reviews), followed by Canada (92917 reviews).
* Based on the 10 most popular cities, the ratings do not appear to depend on the city.
* We also analyzed the age distribution across the 10 most popular cities: 'Omaha' has the lowest average age, while 'Olympia' has the largest spread in ages.

<img src="13.png" align="left" width="50%"/><img src="14.png" align="center" width="50%"/>
<img src="15.png" align="center" width="50%"/>

### 6.9 Age:

* This is a numeric column, featuring the age of the reviewers.
* There are 141 unique ages listed.
* The minimum age = 0
* The maximum age = 244!
* The descriptive statistics of the original 'Age' is as follows:

<img src="age.png" align="center" width="20%"/>

* Many of these values are unrealistic (e.g. 0 and 244).
* We therefore replaced the **unrealistic** ages (age < 5 or age > 90) with NaN.
* In total, 4979 values were replaced with NaN.
* After the replacement, 84 unique (reasonable) ages remain.
* For the cleaned ages, the minimum is 5 (as set), the maximum is 90 and the mean is 37.047.
* The descriptive statistics of the new 'Age' (after replacing the unrealistic values with NaN) are as follows:

<img src="age_new.png" align="center" width="20%"/>
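
The age cleaning step can be sketched with a boolean mask. The bounds 5 and 90 come from the text above; the data is a toy example.

```python
import numpy as np
import pandas as pd

# Toy merged frame (values are made up).
final_df = pd.DataFrame(
    {"Age": [0.0, 4.0, 5.0, 37.0, 90.0, 91.0, 244.0, np.nan]}
)

age = final_df["Age"]
# Ages below 5 or above 90 are treated as unrealistic.
out_of_range = (age < 5) | (age > 90)
n_replaced = int(out_of_range.sum())

# mask() sets the flagged entries to NaN and keeps the rest unchanged.
final_df["Age"] = age.mask(out_of_range)
```

Comparisons with NaN evaluate to False, so ages that were already missing are not counted in `n_replaced`.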

* As a result, the mean age decreased by 0.94%.
* The histogram of the age distribution shows that the distribution is skewed: younger people review more than older ones.
<img src="8.png" align="left" width="50%"/><img src="9.png" align="right" width="50%"/>
<img src="10.png" align="center" width="50%"/>

* To see whether the age of the reviewers and the rating are correlated, we plotted boxplots of the age distribution for every rating group (including zero). The two do not appear to be correlated.

<img src="11.png" align="center" width="50%"/>

* The average age of the reviewers ranges from about 33 to 37 years across all rating bands.
* The error bars are roughly 10-12 years.
* The conclusion is that the rating does not depend on age (the mean age is almost the same for all rating bands).

<img src="12.png" align="center" width="50%"/>
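
The mean age and its spread per rating band can be computed with a `groupby`, as in the sketch below (toy data). The Pearson correlation at the end is an extra check added here for illustration and is not something the text above reports.

```python
import numpy as np
import pandas as pd

# Toy merged frame (values are made up).
final_df = pd.DataFrame({
    "Book-Rating": [0, 0, 8, 8, 10, 10],
    "Age": [30.0, 40.0, 25.0, 45.0, 35.0, np.nan],
})

# Mean age, standard deviation and number of known ages per rating band.
age_by_rating = final_df.groupby("Book-Rating")["Age"].agg(["mean", "std", "count"])

# Pearson correlation between rating and age (rows with missing age are skipped).
corr = final_df["Book-Rating"].corr(final_df["Age"])
```

If the mean age is nearly flat across rating bands and the correlation is close to zero, that supports the conclusion that the rating does not depend on age.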

## 7. Executive Summary:

* There are three original datasets. We read them separately with pandas and then used the `merge` function to combine them into the final dataframe.
* The final dataframe has 1031136 rows and 9 columns: 'User-ID', 'ISBN', 'Book-Rating', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Location', and 'Age'. (The data fields are described above.)
* There are 92106 unique User-IDs.
* 44.3% of the users are frequent reviewers (they reviewed more than once), while 55.7% are one-time reviewers.
* The reviewer with the User-ID '11676' is the most frequent reviewer (11144 times).
* 270151 unique books (ISBNs) are listed.
* Some books are very popular and have been reviewed many times, while others are less popular and were reviewed only once.
* 46.1% of the books have been reviewed more than once, while 53.9% have been reviewed just once.
* The most popular book (ISBN = 0971880107) has been reviewed 2502 times.
* The recorded rating scale runs from 0 to 10.
* 62.8% of the ratings are '0', making it the most frequent value.
* The next most frequent rating is '8' (8.9%), followed by '10' (6.9%).
* However, the rating scale is actually 1 to 10; a '0' indicates an 'implicit' rather than an 'explicit' rating. An implicit rating records an interaction (which may be positive or negative) between the user and the item. We therefore split the dataframe into explicit and implicit ratings.
* The 'rating_explicit' dataframe (ratings 1-10 only) has 383842 rows, whereas the 'rating_implicit' dataframe (rating 0 only) has 647294 rows. Among the explicit ratings, 23.9% have a score of '8'.
* There are 241071 unique book titles, compared with 270151 unique ISBNs (book identifier codes). The gap of 29080 indicates that some titles map to more than one ISBN (for example, different editions of the same book) or that some ISBNs lack title information.
* There are 101588 unique author names, including a null entry.
* The most reviewed author is Stephen King (10053 reviews).
* 51.5% of the authors are 'popular', meaning their books were reviewed more than once, whereas 48.5% had their books reviewed only once.
* 'Year of publication' is a numerical feature giving the year in which a book was published.
* This column contains many zeros and NaNs. We replaced all zeros with NaN.
* There are 115 unique years of publication.
* The oldest recorded publication year is 1376.
* 2002 is the most frequent publication year (it appears 91800 times).
* The latest year of publication is 2050, which cannot be correct given that the dataset was compiled in 2004.
* The plot shows the general trend that more recent books are much more frequent.
* 16729 publisher names are listed, of which 572 are uniquely different.
* The most popular one is 'Ballantine Books' (with 34724 books).
* 57.2% of the publishing houses are popular (their books were reviewed more than once).
* 42.8% of the publishers are not popular (their books were reviewed only once).
* 'Location' is a categorical variable giving the geographic location of the reviewers.
* There are 762 unique locations (city, state, country).
* The most popular location is **'toronto, ontario, canada'**, and one of the least popular is **'essex, grays, united kingdom'**.
* The location was split into three separate columns (city, state and country) for easier analysis.
* 14670 unique cities are listed (including 'NA').
* 1959 unique states are listed.
* 452 unique countries are listed.
* **Toronto** is the most popular city (15124 reviewers are from this city alone). Many cities have only a single rating (e.g. hillsoboro).
* California is the most popular state (107465 reviews in total), followed by Texas (44158).
* The USA is the most popular country (745052 reviews), followed by Canada (92917 reviews).
* Based on the 10 most popular cities, the ratings do not appear to depend on the city.
* We also analyzed the age distribution across the 10 most popular cities: 'Omaha' has the lowest average age, while 'Olympia' has the largest spread in ages.
* 'Age' is a numerical variable giving the age of the reviewers.
* There are 141 unique ages listed.
* The minimum age = 0
* The maximum age = 244!
* Many of these values are unrealistic (e.g. 0 and 244).
* We therefore replaced the unrealistic ages (age < 5 or age > 90) with NaN.
* In total, 4979 values were replaced with NaN.
* After the replacement, 84 unique (reasonable) ages remain.
* For the cleaned ages, the minimum is 5 (as set), the maximum is 90 and the mean is 37.047.
* As a result, the mean age decreased by 0.94%.
* The histogram of the age distribution shows that the distribution is skewed: younger people review more than older ones.
* The average age of the reviewers ranges from about 33 to 37 years across all rating bands, and the error bars are roughly 10-12 years.
* The conclusion is that the rating does not depend on age (the mean age is almost the same for all rating bands).