# Book recommendation engine using unsupervised learning on book rating data
## _Udacity Machine Learning Engineer nanodegree capstone project_

Emily Behlmann
July 2018

The goal of this project is to help readers find books to read next that match their tastes. Although readers' opinions about books are subjective, I can measure how much readers like or dislike books through the ratings they give on a five-star scale. A good solution is one that maximizes the ratings readers give to the books selected for them.

This project relies on the [goodbooks-10k dataset](http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/), which was published under a Creative Commons license by Zygmunt Zając on the FastML website in 2017. The dataset includes about 6 million ratings of the 10,000 books with the highest volume of ratings (ratings.csv). Each rating is associated with a user ID, making it possible to cluster individual readers based on their ratings. The dataset also includes metadata about each book, such as its title, author and average rating (books.csv).

### Data exploration

First, I import the libraries and load the data.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display  # explicit import so the cell also runs outside a notebook

# Load the dataset
ratings_file = 'goodbooks-10k-data/ratings.csv'
ratings = pd.read_csv(ratings_file)

books_file = 'goodbooks-10k-data/books.csv'
books = pd.read_csv(books_file)

display(ratings.head())
display(books.head())

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...
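
Because each row of ratings.csv pairs a user ID with a book ID, clustering readers would likely start from a reader-by-book matrix. The sketch below shows one possible way to build such a matrix with pandas `pivot_table`, using only the five sample rows displayed above. The names `sample` and `user_book` are my own placeholders, not part of the dataset, and a dense matrix over the full dataset may be too large to hold comfortably in memory.

```python
import pandas as pd

# Toy sample copied from the first rows of ratings.csv shown above
sample = pd.DataFrame({
    'user_id': [1, 2, 2, 2, 2],
    'book_id': [258, 4081, 260, 9296, 2318],
    'rating': [5, 4, 5, 5, 3],
})

# One row per user, one column per book; NaN marks books a user has not rated
user_book = sample.pivot_table(index='user_id', columns='book_id', values='rating')
print(user_book)
```

Each row of `user_book` can then be treated as one reader's rating profile, with missing values standing in for unrated books.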


Next, I will look at some characteristics of the dataset.

In [42]:
# book info
book_count = books.shape[0]
ratings_on_books = books['ratings_count'].sum()
# idxmax/idxmin return index labels, so look rows up with .loc rather than .iloc
most_ratings = books.loc[books['ratings_count'].idxmax()]
fewest_ratings = books.loc[books['ratings_count'].idxmin()]

highest_average_rating = books.loc[books['average_rating'].idxmax()]
lowest_average_rating = books.loc[books['average_rating'].idxmin()]

most_5_star_ratings = books.loc[books['ratings_5'].idxmax()]
most_1_star_ratings = books.loc[books['ratings_1'].idxmax()]

print('Total number of books in dataset: %s' % "{:,}".format(book_count))
print('Total number of ratings on books in dataset: %s' % "{:,}".format(ratings_on_books))
print('Book with most ratings: %s with %s ratings' % (most_ratings['title'], "{:,}".format(most_ratings['ratings_count'])))
print('Book with fewest ratings: %s with %s ratings' % (fewest_ratings['title'], "{:,}".format(fewest_ratings['ratings_count'])))
print('Book with highest average rating: %s with %s average' % (highest_average_rating['title'], highest_average_rating['average_rating']))
print('Book with lowest average rating: %s with %s average' % (lowest_average_rating['title'], lowest_average_rating['average_rating']))
print('Book with most 5-star ratings: %s with %s 5-star ratings' % (most_5_star_ratings['title'], "{:,}".format(most_5_star_ratings['ratings_5'])))
print('Book with most 1-star ratings: %s with %s 1-star ratings' % (most_1_star_ratings['title'], "{:,}".format(most_1_star_ratings['ratings_1'])))

Total number of books in dataset: 10,000
Total number of ratings on books in dataset: 540,012,351
Book with most ratings: The Hunger Games (The Hunger Games, #1) with 4,780,653 ratings
Book with fewest ratings: درخت زیبای من with 2,716 ratings
Book with highest average rating: The Complete Calvin and Hobbes with 4.82 average
Book with lowest average rating: One Night at the Call Center with 2.47 average
Book with most 5-star ratings: Harry Potter and the Sorcerer's Stone (Harry Potter, #1) with 3,011,543 5-star ratings
Book with most 1-star ratings: Twilight (Twilight, #1) with 456,191 1-star ratings
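
A note on the lookups in this cell: `idxmax` and `idxmin` return index labels, not positions, so pairing them with `.loc` is the safer choice. Labels and positions coincide only when a frame has the default 0 to n-1 index, as `books` does here. The small sketch below uses a made-up frame (`df` and its values are purely illustrative) with a non-default index to show the difference.

```python
import pandas as pd

# Made-up frame with a non-default index to show label vs. position
df = pd.DataFrame(
    {'title': ['A', 'B', 'C'], 'ratings_count': [10, 30, 20]},
    index=[100, 200, 300],
)

label = df['ratings_count'].idxmax()  # index label of the row with the max value
by_label = df.loc[label]              # label-based lookup finds the right row
print(by_label['title'])

# df.iloc[label] would raise IndexError here, because 200 is not a valid position
```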


In [64]:
# ratings info
rating_count = ratings.shape[0]
percent_of_total = (rating_count / ratings_on_books) * 100

mean_rating = ratings['rating'].mean()
median_rating = ratings['rating'].median()

user_count = ratings['user_id'].nunique()
ratings_per_user = rating_count / user_count

print('Total number of ratings in dataset: %s' % "{:,}".format(rating_count))
print('Percent of total ratings included in this dataset: %s percent' % round(percent_of_total, 2))
print('Mean rating: %s' % round(mean_rating, 2))
print('Median rating: %s' % median_rating)
print('Total number of users in dataset: %s' % "{:,}".format(user_count))
print('Ratings per user: %s' % round(ratings_per_user, 2))

Total number of ratings in dataset: 5,976,479
Percent of total ratings included in this dataset: 1.11 percent
Mean rating: 3.92
Median rating: 4.0
Total number of users in dataset: 53,424
Ratings per user: 111.87
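
The mean and median summarize the ratings in two numbers, but a per-star breakdown would show more directly how ratings are spread across the scale. The sketch below is one way to compute that with `value_counts(normalize=True)`. The helper name `rating_distribution` and the toy data are my own assumptions for illustration; in this notebook the function would be called on the `ratings` DataFrame instead.

```python
import pandas as pd

def rating_distribution(ratings_df):
    """Return the share of ratings at each star level, as percentages."""
    shares = ratings_df['rating'].value_counts(normalize=True).sort_index()
    return (shares * 100).round(2)

# Toy data for illustration only; in the notebook, pass the `ratings` DataFrame
toy = pd.DataFrame({'rating': [5, 4, 4, 3, 5, 4, 2, 5, 4, 1]})
dist = rating_distribution(toy)
print(dist)
```

The result is indexed by star value, so `dist.loc[5]` gives the percentage of 5-star ratings.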


### Observations about the data

* The books included in books.csv have a total of 540,012,351 ratings, but only 5,976,479 are included here, or about 1.11 percent.
* The mean rating (3.92) and the median rating (4.0) are both quite high, meaning many readers concentrate their ratings at the higher end of the scale. This could be because they know their tastes well enough to select books they end up enjoying, because they don't want to give negative reviews, or because they gave relatively high scores to books they only somewhat enjoyed and now have little room to go higher. Regardless, it appears that some readers aren't taking full advantage of the 5-star scale.