In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## The Data

The data consists of three tables: ratings, books info, and users info.

In [None]:
# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines='skip'` (pandas >= 1.3) skips malformed rows instead.
books = pd.read_csv('BX-Books.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']
users = pd.read_csv('BX-Users.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

### Ratings Data

The ratings data set lists the ratings that users have given to books. It includes 1,149,780 records and 3 fields: userID, ISBN, and bookRating.

In [None]:
print(ratings.shape)
print(list(ratings.columns))

In [None]:
ratings.head()

### Ratings Distribution

In [None]:
plt.rc("font", size=15)
ratings.bookRating.value_counts(sort=False).plot(kind='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('system1.png', bbox_inches='tight')
plt.show()

### Books dataset

This dataset provides book details. It includes 271,360 records and 8 fields: ISBN, book title, book author, publisher and so on.

In [None]:
print(books.shape)
print(list(books.columns))

In [None]:
books.head()

### Users dataset 

This dataset provides user demographic information. It includes 278,858 records and 3 fields: user ID, location and age.

In [None]:
print(users.shape)
print(list(users.columns))

In [None]:
users.head()

### Age distribution

The largest age groups among users are the 20s and 30s.

In [None]:
users.Age.hist(bins=[0, 10, 20, 30, 40, 50, 100])
plt.title('Age Distribution\n')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('system2.png', bbox_inches='tight')
plt.show()

## Recommendations based on rating counts

In [None]:
rating_count = pd.DataFrame(ratings.groupby('ISBN')['bookRating'].count())
rating_count.sort_values('bookRating', ascending=False).head()

The book with ISBN "0971880107" received the most ratings. Let's find out which books are in the top 5.

In [None]:
most_rated_books = pd.DataFrame(['0971880107', '0316666343', '0385504209', '0060928336', '0312195516'], index=np.arange(5), columns = ['ISBN'])
most_rated_books_summary = pd.merge(most_rated_books, books, on='ISBN')
most_rated_books_summary

The book that received the most ratings in this data set is Rich Shapero's "Wild Animus". The five most-rated books have something in common: they are all novels. The count-based recommender suggests that novels are popular and likely to receive more ratings. So if someone likes "Wild Animus", we should probably recommend "The Lovely Bones: A Novel" to them.
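The hard-coded ISBN list above can also be derived from the counts themselves. The following is a minimal sketch on made-up toy tables, not the notebook's own data; `top_rated_by_count` is a hypothetical helper name introduced only for illustration, and the left merge is an assumption chosen so that ISBNs missing from the books table stay visible.

```python
import pandas as pd

# Toy stand-ins for the `ratings` and `books` tables (all values are made up).
toy_ratings = pd.DataFrame({
    'userID': [1, 2, 3, 1, 2, 3, 1],
    'ISBN': ['A', 'A', 'A', 'B', 'B', 'C', 'D'],
    'bookRating': [0, 5, 7, 8, 9, 10, 6],
})
toy_books = pd.DataFrame({
    'ISBN': ['A', 'B', 'C'],
    'bookTitle': ['Book A', 'Book B', 'Book C'],
})

def top_rated_by_count(ratings, books, n=5):
    """Return the n most-rated ISBNs with their metadata (hypothetical helper)."""
    counts = (ratings.groupby('ISBN')['bookRating']
              .count()
              .rename('ratingCount')
              .sort_values(ascending=False)
              .head(n)
              .reset_index())
    # A left merge keeps ISBNs that have no entry in the books table.
    return counts.merge(books, on='ISBN', how='left')

top2 = top_rated_by_count(toy_ratings, toy_books, n=2)
top4 = top_rated_by_count(toy_ratings, toy_books, n=4)
print(top2)
```

With an inner merge (the default of `pd.merge`), a top-rated ISBN that is absent from the books table would silently drop out of the summary.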

## Recommendations based on correlations

Find out the average rating and the number of ratings each book received.

In [None]:
average_rating = pd.DataFrame(ratings.groupby('ISBN')['bookRating'].mean())
average_rating['ratingCount'] = ratings.groupby('ISBN')['bookRating'].count()
average_rating.sort_values('ratingCount', ascending=False).head()

#### Observation: 

In this dataset, the book that received the most ratings is not highly rated at all. So if we relied on recommendations based on rating counts alone, we would definitely make mistakes here.

#### To ensure statistical significance, users with fewer than 200 ratings and books with fewer than 100 ratings are excluded.

In [None]:
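# Count ratings per user and per book on the full table, then apply both thresholds.
user_counts = ratings['userID'].value_counts()
book_counts = ratings['ISBN'].value_counts()
ratings = ratings[ratings['userID'].isin(user_counts[user_counts >= 200].index)]
# Filter on ISBN (not on the rating value) so that rarely rated books are dropped.
ratings = ratings[ratings['ISBN'].isin(book_counts[book_counts >= 100].index)]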
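The exclusion rule stated above (keep only users and books with enough ratings) can be sketched on a toy table as follows. `filter_by_min_count` is a hypothetical helper, and the thresholds of 2 are made up so that the effect is visible on six rows.

```python
import pandas as pd

def filter_by_min_count(df, column, min_count):
    """Keep rows whose value in `column` occurs at least `min_count` times (hypothetical helper)."""
    counts = df[column].value_counts()
    return df[df[column].isin(counts[counts >= min_count].index)]

# Toy ratings table (all values are made up).
toy = pd.DataFrame({
    'userID': [1, 1, 1, 2, 2, 3],
    'ISBN': ['A', 'B', 'C', 'A', 'B', 'A'],
    'bookRating': [5, 0, 8, 7, 9, 10],
})

active = filter_by_min_count(toy, 'userID', 2)    # keeps users 1 and 2
popular = filter_by_min_count(active, 'ISBN', 2)  # keeps books A and B among those users
print(popular)
```

Note that the second call counts ratings only among the remaining users. Counting ratings per book on the unfiltered table first is an equally plausible reading of the rule and can give a different result.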

### Rating matrix

Convert the ratings table to a 2D matrix with one row per user and one column per book. The matrix will be sparse because not every user rated every book.

In [None]:
ratings_pivot = ratings.pivot(index='userID', columns='ISBN', values='bookRating')
userID = ratings_pivot.index
ISBN = ratings_pivot.columns
print(ratings_pivot.shape)
ratings_pivot.head()
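To make "sparse" concrete, one can measure the share of empty cells in the pivoted matrix. This is a small sketch on made-up data; the variable names `matrix`, `filled`, `total` and `sparsity` are assumptions for illustration and do not appear in the notebook.

```python
import pandas as pd

# Toy ratings table (all values are made up).
toy = pd.DataFrame({
    'userID': [1, 1, 2, 3],
    'ISBN': ['A', 'B', 'A', 'C'],
    'bookRating': [5, 8, 7, 10],
})

# One row per user, one column per book; unrated pairs become NaN.
matrix = toy.pivot(index='userID', columns='ISBN', values='bookRating')

filled = matrix.notna().sum().sum()         # number of observed ratings
total = matrix.shape[0] * matrix.shape[1]   # number of user-book pairs
sparsity = 1 - filled / total               # share of empty cells
print(matrix)
print(f"sparsity: {sparsity:.2%}")
```

The same three lines applied to `ratings_pivot` would report how empty the real matrix is.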

Let's find out which books are correlated with the second most-rated book, "The Lovely Bones: A Novel". To quote Wikipedia: "It is the story of a teenage girl who, after being raped and murdered, watches from her personal Heaven as her family and friends struggle to move on with their lives while she comes to terms with her own death."

In [None]:
bones_ratings = ratings_pivot['0316666343']
similar_to_bones = ratings_pivot.corrwith(bones_ratings)
corr_bones = pd.DataFrame(similar_to_bones, columns=['pearsonR'])
corr_bones.dropna(inplace=True)
corr_summary = corr_bones.join(average_rating['ratingCount'])
corr_summary[corr_summary['ratingCount']>=300].sort_values('pearsonR', ascending=False).head(10)
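It may help to see what `corrwith` measures for a single pair of books: the Pearson correlation computed only over users who rated both. The sketch below checks this by hand on a made-up matrix; the column names `target` and `other` are placeholders, not ISBNs from the data set.

```python
import numpy as np
import pandas as pd

# Toy user x book matrix; NaN means the user did not rate the book.
toy_pivot = pd.DataFrame({
    'target': [8.0, 6.0, 9.0, np.nan, 3.0],
    'other':  [7.0, 4.0, np.nan, 4.0, 2.0],
}, index=pd.Index([1, 2, 3, 4, 5], name='userID'))

# What corrwith reports for the 'other' column.
via_corrwith = toy_pivot.corrwith(toy_pivot['target'])['other']

# The same value by hand: keep only users who rated both books.
both = toy_pivot.dropna(subset=['target', 'other'])
by_hand = np.corrcoef(both['target'], both['other'])[0, 1]
n_common = len(both)  # number of users behind the coefficient

print(via_corrwith, by_hand, n_common)
```

Here only three users rated both books, so the coefficient rests on three points. This is one reason a filter on the number of ratings, as in the cell above, is worth applying before trusting the ranking.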

We obtained the books' ISBNs, but we need to find out the names of the books to see whether they make sense.

In [None]:
books_corr_to_bones = pd.DataFrame(['0312291639', '0316601950', '0446610038', '0446672211', '0385265700', '0345342968', '0060930535', '0375707972', '0684872153'], 
                                  index=np.arange(9), columns=['ISBN'])
corr_books = pd.merge(books_corr_to_bones, books, on='ISBN')
corr_books

Let's select three books from the highly correlated list above to examine: "The Nanny Diaries: A Novel", "The Pilot's Wife: A Novel" and "Where the Heart Is".

"The Nanny Diaries" satirizes upper class Manhattan society as seen through the eyes of their children's caregivers. 

"The Pilot's Wife" is the third novel in Anita Shreve's informal trilogy set in a large beach house on the New Hampshire coast that used to be a convent.

"Where the Heart Is" dramatizes in detail the tribulations of lower-income and foster children in the United States.

All three books are plausible matches for "The Lovely Bones", so our correlation-based recommender appears to be working.
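The correlation steps used for "The Lovely Bones" could be wrapped in a function so that any book can be queried. This is only a sketch under stated assumptions: `recommend_by_correlation` and its parameters are hypothetical names, the toy matrix and counts are made up, and dropping the query book from its own results is a choice made here, not something the notebook does explicitly.

```python
import pandas as pd

def recommend_by_correlation(pivot, isbn, rating_counts, min_ratings=300, top_n=10):
    """Rank books by Pearson correlation with `isbn` (hypothetical helper)."""
    corr = pivot.corrwith(pivot[isbn]).rename('pearsonR').dropna().to_frame()
    summary = corr.join(rating_counts.rename('ratingCount'))
    summary = summary[summary['ratingCount'] >= min_ratings]
    # Drop the query book itself, which always correlates perfectly with itself.
    summary = summary.drop(index=isbn, errors='ignore')
    return summary.sort_values('pearsonR', ascending=False).head(top_n)

# Toy user x book matrix and per-book rating counts (all values are made up).
toy_pivot = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'B': [2.0, 4.0, 5.0, 8.0, 10.0],
    'C': [5.0, 4.0, 3.0, 2.0, 1.0],
}, index=pd.Index([1, 2, 3, 4, 5], name='userID'))
toy_pivot.columns.name = 'ISBN'
toy_counts = pd.Series({'A': 500, 'B': 400, 'C': 50})

result = recommend_by_correlation(toy_pivot, 'A', toy_counts)
result_all = recommend_by_correlation(toy_pivot, 'A', toy_counts, min_ratings=0)
print(result)
```

On the real data the call would look like `recommend_by_correlation(ratings_pivot, '0316666343', average_rating['ratingCount'])`, and the returned ISBNs could then be merged with `books` to obtain titles.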