# Building a Goodreads recommendation system
**Capstone Project 2 Final Report**

## Background

Goodreads is a social media platform that allows users to rate and review books as well as see what their friends are reading, rating, and reviewing.

Users can use the platform to keep track of the books that they have read while also identifying books they'd like to read next. In addition to pulling reading recommendations from their friends' profiles, they can also receive book recommendations through the Goodreads recommender system. However, based on personal experience, the system does not always offer the most helpful suggestions. Most of the books it recommends are obscure and do not appear to be based on what a user previously rated or what their friends have read. 

Seeing as Amazon purchased Goodreads in 2013, this seems like a huge missed opportunity to provide useful book recommendations that could turn into book sales.  

Is it possible to create a more useful recommendation system for readers? 

## The Data 

To answer this question, I will be using the Goodbooks __[datasets](https://github.com/zygmuntz/goodbooks-10k)__ provided by GitHub user __[zygmuntz](https://github.com/zygmuntz)__. 

The Goodbooks dataset includes nearly six million ratings of ten thousand books on Goodreads. It is separated into three different files:

* **Ratings:** Contains nearly 6 million user ratings from 53,424 users 
* **To-Read:** Contains nearly 1 million entries for books that users added to their 'to-read' shelf 
* **Books:** Contains all of the metadata for 10,000 books. The metadata includes: title, author, number of ratings, number of each type of rating, and more 

For the purposes of this project, I will be using the Ratings and Books datasets.

### Loading the Data


In [None]:
# Import libraries
# ----------------

# Pandas
import pandas as pd

# Numpy
import numpy as np

# SKLearn
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import pairwise_distances

# Matplotlib
%matplotlib inline
%config InlineBackend.figure_format='retina'
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [None]:
# Load Datasets
# ----------------

# Ratings
ratings_data = "../data/raw/ratings.csv"
ratings = pd.read_csv(ratings_data)

# Books
books_data = "../data/raw/books.csv"
books = pd.read_csv(books_data)
books = books[['book_id', 'title']] ## I only need the book_id and title columns for modeling 

# Merge ratings and books on their shared book_id column
merged = pd.merge(books, ratings, on='book_id')

print("Number of books in dataset: ", merged.book_id.nunique())
print("Number of users in dataset: ", merged.user_id.nunique())
merged.sample(10)

## Exploratory Data Analysis

Before I start recommending books, let's explore the data:

In [None]:
# Visualize the number of reviews per user
eda = merged.copy()  # work on a copy so the EDA columns are not added to merged
eda['number_of_ratings_user'] = eda.groupby('user_id')['user_id'].transform('count')

plt.hist(eda.number_of_ratings_user, alpha = 0.6, color = 'g')
plt.xlabel('number of ratings')
plt.ylabel('frequency')
plt.title('Number of Ratings per User')
plt.show()

<img src="https://github.com/ameenamarie/Goodreads-Recommendation-System/blob/master/reports/figures/fig3.png?raw=true" title="Number of Ratings per User" width=600 />

In [None]:
# Visualize the distribution of ratings from users
plt.hist(eda.rating, alpha = 0.6, color='g')
plt.xlabel('rating')
plt.ylabel('frequency')
plt.title('Distribution of Ratings')
plt.show()

<img src="https://github.com/ameenamarie/Goodreads-Recommendation-System/blob/master/reports/figures/fig4.png?raw=true" title="Distribution of Ratings" width=600 />

In [None]:
# Show the ratings count for the 10 most popular books
eda['number_of_ratings_book'] = eda['book_id'].groupby(eda['book_id']).transform('count')
popular = eda.sort_values('number_of_ratings_book', ascending=False)

# Drop duplicate books
popular = popular.drop_duplicates(subset='book_id', keep="first")
popular = popular[['book_id', 'title', 'number_of_ratings_book']]
popular.head(10)

## Building a simple model

A simple model recommends the same books to all users regardless of their reading or rating history. The recommendations are usually based on the most popular or highest rated items or a score that combines the two. For my simple model, I will be giving each book a weighted score. This will prevent a book with only 10 reviews that are all 5 stars from skewing my results.

To build my model I will be adapting DataCamp's _[IMDB 250 tutorial](https://www.datacamp.com/community/tutorials/recommender-systems-python)_ which gives each item a weighted score using the following formula: 

<img src="https://github.com/ameenamarie/Goodreads-Recommendation-System/blob/master/reports/figures/image1.png?raw=true" title="Weighted Score" width=400 />

* _v_ is the number of ratings a book has received;
* _m_ is the minimum number of ratings a book needs to receive to be included in the model;
* _R_ is the average rating for the book; and
* _C_ is the average rating across all books
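
Because the formula is embedded as an image, it may help to have it written out. Reading it off the `weighted_score` function used later in this report, it appears to be:

$$
\text{score} = \frac{v}{v + m}\,R + \frac{m}{v + m}\,C
$$

As _v_ grows relative to _m_, the score approaches the book's own average _R_; for books with few ratings it is pulled toward the overall average _C_.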

To start, I created a column with the number of ratings that each book received (*v*) and a column with the average rating that each book received (*R*). I then calculated the average rating across the dataset (*C*) and set the minimum number of ratings (*m*) to the 90th percentile of the ratings counts, keeping only the books at or above that threshold. 

In [None]:
# Create a column for the number of ratings a book has received called 'ratings_count'
merged['ratings_count'] = merged['book_id'].groupby(merged['book_id']).transform('count')

# Create a column for the average rating a book has received called 'average_rating'
merged['average_rating'] = merged['rating'].groupby(merged['book_id']).transform('mean')

# Calculate the average rating for all books
C = merged['rating'].mean()

# Calculate the minimum number of ratings a book needs to receive in order to be included in the model
m = merged['ratings_count'].quantile(0.90)

# Filter the dataset based on value m
filtered = merged.copy().loc[merged['ratings_count'] >= m]

I then built a function that gives a weighted score to each book using the formula above:

In [None]:
# Create a function that gives a weighted score to each book
def weighted_score(x, m=m, C=C):
    v = x['ratings_count']
    R = x['average_rating']
    # Calculation based on an IMDB formula 
    return (v/(v+m) * R) + (m/(m+v) * C)

# Create a 'score' column and give each book a weighted score
filtered['score'] = filtered.apply(weighted_score, axis=1)
filtered.sample(5)
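
To see how the weighting behaves, here is a small worked example in plain Python. The function name `toy_weighted_score` and all of the numbers are made up for illustration; they are not values from the Goodbooks data.

```python
# Toy illustration of the weighted score; every number below is hypothetical.
def toy_weighted_score(v, R, m, C):
    """Blend a book's own average R with the global average C, weighted by its rating count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 100   # hypothetical minimum-ratings threshold
C = 3.9   # hypothetical average rating across the dataset

# A book with only 10 ratings, all 5 stars
few_ratings = toy_weighted_score(v=10, R=5.0, m=m, C=C)

# A book with 10,000 ratings averaging 4.5 stars
many_ratings = toy_weighted_score(v=10000, R=4.5, m=m, C=C)

print(round(few_ratings, 3))   # 4.0
print(round(many_ratings, 3))  # 4.494
```

The perfect-but-sparsely-rated book ends up below the heavily rated one, which is the behavior the weighted score is meant to produce.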

Once every book had a score, I dropped the duplicates and ranked the books by score. I then identified the ten books with the highest score. 

In [None]:
# Sort books based on score
top10 = filtered.sort_values('score', ascending=False)

# Drop duplicate books
top10 = top10.drop_duplicates(subset='book_id', keep="first")

#Print the top 10 books
top10[['title', 'ratings_count', 'average_rating', 'score']].head(10)

Wow! It looks like everyone really loves Harry Potter. 

If I were to use this model as my recommendation system, Goodreads would recommend these 10 books to every user regardless of their rating or reading history. However, I do not want to recommend Harry Potter to every single user, so I'll be building a collaborative model to get more specific recommendations. 

I can use this simple recommendation model to recommend books to "cold start" users. That is, users who have just joined Goodreads and do not yet have a reading or rating history on the platform. 

## Building a collaborative model from scratch

Collaborative models make recommendations by comparing users to other users, and books to other books, and grouping them by similarity. This is accomplished by measuring the distance between every pair of users and every pair of items in order to group similar users and items together. Because the model makes a calculation for every pair, the cost grows quickly with the number of users and books, so it is best to have a smaller dataset. 

I started by filtering the data to only include users that have reviewed at least 175 books.

In [None]:
# Filter data to only include users who have reviewed at least 175 books
collab = merged.copy()
collab['number_of_reviews'] = collab.groupby('user_id')['user_id'].transform('count')
collab = collab[collab['number_of_reviews'] >= 175]
collab = collab.drop(columns='number_of_reviews')

When I removed users who have rated fewer than 175 books, I had to create a new user index and a new book index so that I could place the ratings for each user and book into a matrix. 

In [None]:
# Create a new column for user_index and book_index
collab = collab.assign(user_index=(collab['user_id']).astype('category').cat.codes)
collab = collab.assign(book_index=(collab['book_id']).astype('category').cat.codes)

n_users = collab.user_id.unique().shape[0]
n_items = collab.book_id.unique().shape[0]
print('DataFrame shape: {}'.format(collab.shape))
print('Number of unique users:', n_users)
print('Number of books:', n_items)
collab.sample(5)
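
The re-indexing step can be illustrated with plain Python. As far as I know, `astype('category').cat.codes` numbers the sorted unique values starting from 0; the sketch below reproduces that idea on a made-up list of user IDs (the names `user_ids`, `index_of`, and `user_index` are only for illustration).

```python
# Hypothetical user IDs that survive the filter (note the gaps between them)
user_ids = [314, 75, 2077, 75, 314, 999]

# Map each distinct ID to a contiguous 0-based index, in sorted order
index_of = {uid: i for i, uid in enumerate(sorted(set(user_ids)))}
user_index = [index_of[uid] for uid in user_ids]

print(index_of)    # {75: 0, 314: 1, 999: 2, 2077: 3}
print(user_index)  # [1, 0, 3, 0, 1, 2]
```

With contiguous indices, every user maps to exactly one row of the matrix and no rows are left empty for users who were filtered out.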

The collaborative model requires a matrix using the user_index, book_index, and rating in order to measure the distance between each item and user. I created a training and test matrix. 

In [None]:
# Rearrange the order of the columns to prep for modeling
collab = collab[['user_index', 'book_index', 'rating', 'book_id', 'title', 'user_id', 'ratings_count',
                  'average_rating']]

# Split the ratings into training and test sets
train_data, test_data = train_test_split(collab, test_size=0.25)

# Create two user-item matrices, one for training and another for testing.
# user_index and book_index are already 0-based, so they are used directly as positions.
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line.user_index, line.book_index] = line.rating

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line.user_index, line.book_index] = line.rating
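
The same kind of user-item matrix can also be filled with a single NumPy assignment instead of a Python loop. This is a minimal sketch on made-up data, assuming 0-based row and column indices (such as those produced by `cat.codes`); the array names are hypothetical.

```python
import numpy as np

# Made-up (user_index, book_index, rating) triples
user_idx = np.array([0, 0, 1, 2])
book_idx = np.array([0, 2, 1, 2])
rating = np.array([5, 3, 4, 1])

# Rows are users, columns are books, and 0 means "not rated"
toy_matrix = np.zeros((3, 3))
toy_matrix[user_idx, book_idx] = rating  # one vectorized assignment

print(toy_matrix)
```

On a matrix with hundreds of thousands of ratings, this form is usually much faster than iterating with `itertuples()`.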

Once everything was placed in a train and test matrix, I was able to calculate the similarity between each pair of users and each pair of items using `pairwise_distances`. I used the cosine metric, which looks at the cosine of the angle between two rating vectors. The closer two items are, the larger the cosine. Note that `pairwise_distances` returns the cosine *distance* (1 minus the cosine), so I subtract the result from 1 to obtain a similarity. 

In [None]:
# Calculate the similarity as 1 - cosine distance
user_similarity = 1 - pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = 1 - pairwise_distances(train_data_matrix.T, metric='cosine')
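
To make the relationship between cosine similarity and cosine distance concrete, here is a small NumPy sketch on a made-up ratings matrix. The helper `cosine_similarity_matrix` and the toy data are illustrative assumptions, not part of the project code.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (rows must be non-zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms
    return unit @ unit.T

toy = np.array([
    [5.0, 4.0, 0.0],   # user A
    [4.0, 5.0, 0.0],   # user B, similar taste to A
    [0.0, 0.0, 5.0],   # user C, no books in common with A or B
])

sim = cosine_similarity_matrix(toy)
dist = 1.0 - sim  # cosine distance: small for similar users, 1 for users with no overlap

print(np.round(sim, 3))
print(np.round(dist, 3))
```

Users A and B get a similarity close to 1 (a distance close to 0), while user C has a similarity of 0 to both of them.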

Now I'm ready to make predictions!

In [None]:
# Define a function to make predictions
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        #We use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
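
Written as equations, the `predict` function above appears to compute the following, where $s$ is the similarity matrix, $r_{u,i}$ is the entry of the ratings matrix for user $u$ and book $i$, and $\bar{r}_u$ is the mean of row $u$ as computed in the code (so unrated books, stored as 0, are included in the mean). For the user-based case:

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v} \lvert s_{u,v} \rvert}
$$

For the item-based case:

$$
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \lvert s_{i,j} \rvert}
$$

In both cases the prediction is a similarity-weighted average, normalized by the total absolute similarity.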

In [None]:
# Make Predictions
user_prediction = predict(train_data_matrix, user_similarity, type='user')
item_prediction = predict(train_data_matrix, item_similarity, type='item')

### Evaluate the model 

Let's see how the model performed:

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
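
The key detail in `rmse` is that only the entries actually rated in the test matrix are scored. A minimal NumPy sketch of the same idea on made-up numbers (the name `toy_rmse` and the arrays are hypothetical):

```python
import numpy as np

def toy_rmse(prediction, ground_truth):
    """RMSE over the entries where ground_truth is non-zero (i.e. actually rated)."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

truth = np.array([[5.0, 0.0],
                  [0.0, 3.0]])   # two real ratings, two unrated cells
pred = np.array([[4.0, 2.5],
                 [1.0, 5.0]])

# Only the errors (4 - 5) and (5 - 3) are counted; the unrated cells are ignored
print(toy_rmse(pred, truth))
```

Without the mask, every unrated cell would be treated as a true rating of 0 and would dominate the error.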

In [None]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

Not bad! Now let's see what kind of recommendations we can get. I'll be printing some recommendations for user 487.

In [None]:
# Create a function for making recommendations
def collab_rec(predictions_df, user_id, books_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions.
    # Note: user_id - 1 is used as a row position, which matches the user only when the
    # rows of predictions_df are ordered by consecutive user IDs starting at 1.
    user_row_number = user_id - 1
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)
    
    preds = pd.DataFrame(sorted_user_predictions)
    preds['book_id'] = preds.index + 1
    
    # Get the user's data and merge in the book information.
    user_data = original_ratings_df[original_ratings_df.user_id == (user_id)]
    user_full = (user_data.merge(books_df, how = 'left', left_on = 'book_id', right_on = 'book_id').
                     sort_values(['rating'], ascending=False)
                 )

    print('User {0} has already rated {1} books.'.format(user_id, user_full.shape[0]))
    print('Recommending the highest {0} predicted ratings books not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating books that the user hasn't seen yet.
    recommendations = (books_df[~books_df['book_id'].isin(user_full['book_id'])].
         merge(pd.DataFrame(preds).reset_index(), how = 'left',
               left_on = 'book_id',
              right_on = 'book_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

In [None]:
# Make predictions 
user_df = pd.DataFrame(user_prediction)

user_read, recommendations = collab_rec(user_df, 487, books, ratings, 10)

# print books already read by user 487
user_read[['title']].head(10)

In [None]:
# Print recommendations for user 487
recommendations[['title']]

Based on what this user has previously read, I would say these are decent recommendations. Let's see if we can do better with Singular Value Decomposition. 

## Singular Value Decomposition Model

The Singular Value Decomposition (SVD) model breaks a matrix (like the one used to create the collaborative recommendations above) into its component parts. Keeping only the largest singular values gives a low-rank approximation of the ratings matrix, which can then be used to estimate the ratings a user has not yet given. The matrix is broken down as follows: 

<img src="https://github.com/ameenamarie/Goodreads-Recommendation-System/blob/master/reports/figures/image2.png?raw=true" title="SVD Formula" width=150 />

* *R* is the user ratings matrix;
* *U* is the user "features" matrix;
* *Σ* is the diagonal matrix of singular values (weights); and
* *Vt* is the (transposed) book "features" matrix 
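
Written out, the standard form of the decomposition is shown below (this is the textbook form, not a transcription of the image), together with the rank-$k$ approximation obtained by keeping only the $k$ largest singular values:

$$
R = U \, \Sigma \, V^{T}, \qquad R \approx U_k \, \Sigma_k \, V_k^{T}
$$

Here $U_k$ and $V_k$ contain the first $k$ columns of $U$ and $V$, and $\Sigma_k$ is the $k \times k$ diagonal matrix of the corresponding singular values.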

Before I can break the matrix down into its component parts, I need to filter the dataset to only include users who have rated at least 175 books like I did with the collaborative model. 

I will then convert the dataframe into a pivot table with one row for every user and one column for every book. The value in the pivot table is the rating that the user gave a particular book. This pivot table will eventually become the matrix that I will break down. 

In [None]:
svd = merged.copy()

# Filter data to only include users who have reviewed at least 175 books 
svd['number_of_reviews'] = svd.groupby('user_id')['user_id'].transform('count')
svd = svd[svd['number_of_reviews'] >= 175]
svd = svd.drop(columns='number_of_reviews')

# Create a pivot table of users and ratings
pivot = svd.pivot(index = 'user_id', columns ='book_id', values = 'rating').fillna(0)
pivot.head()

To build the matrix, I calculated each user's average rating and subtracted it from every entry in that user's row, so that each value is expressed relative to the user's own average. 

In [None]:
# Convert the pivot table to a NumPy array
R = pivot.to_numpy()

# Identify each user's average rating (taken over the full row, so unrated books count as 0)
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)

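I am now ready to pass the demeaned matrix to SciPy's `svds` function, which computes a truncated SVD; here it keeps the 50 largest singular values (`k = 50`).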

In [None]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)

sigma = np.diag(sigma)

all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = pivot.columns)
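
The effect of keeping only `k` singular values can be seen on a tiny example. The sketch below uses `np.linalg.svd` in place of SciPy's `svds` (a swap made only to keep the example dependency-light) on a made-up 4x4 ratings matrix; the names `R_toy`, `R_centered`, and `approx` are hypothetical.

```python
import numpy as np

# Made-up ratings for 4 users x 4 books (0 = not rated)
R_toy = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Subtract each user's row mean, as in the cell above
row_means = R_toy.mean(axis=1, keepdims=True)
R_centered = R_toy - row_means

# Full SVD; singular values in s come back in descending order
U, s, Vt = np.linalg.svd(R_centered, full_matrices=False)

# Keep only the k largest singular values and add the means back
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + row_means

print(np.round(s, 3))
print(np.round(approx, 2))
```

The entries of `approx` play the role of predicted ratings: cells that were 0 in `R_toy` now hold estimates driven by the dominant patterns in the data.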

Now that I have the predicted ratings for each book, I can start making recommendations. I started by building a function that identifies the row for the desired user, sorts their predicted ratings, removes the books they've already read, and then from the remaining predicted ratings returns the top values as recommendations.

In [None]:
def recommend_books(predictions_df, user_id, books_df, original_ratings_df, num_recommendations=5):
    
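    # Get and sort the user's predictions.
    # Note: user_id - 1 is used as a row position, which matches the user only when the
    # rows of predictions_df are ordered by consecutive user IDs starting at 1.
    user_row_number = user_id - 1
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)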
    
    # Get the user's data and merge in the book information.
    user_data = original_ratings_df[original_ratings_df.user_id == (user_id)]
    user_full = (user_data.merge(books_df, how = 'left', left_on = 'book_id', right_on = 'book_id').
                     sort_values(['rating'], ascending=False)
                 )

    print('User {0} has already rated {1} books.'.format(user_id, user_full.shape[0]))
    print('Recommending the highest {0} predicted ratings books not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating books that the user hasn't seen yet.
    recommendations = (books_df[~books_df['book_id'].isin(user_full['book_id'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'book_id',
              right_on = 'book_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations
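
The function above picks the user's row by position. As a hedged alternative, the sketch below looks a user up by label instead, assuming the predictions are stored in a DataFrame indexed by `user_id` with `book_id` columns. The names `toy_preds`, `already_rated_toy`, and `top_unrated` are hypothetical and all of the numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up predictions for three users (IDs with gaps) and three books
toy_preds = pd.DataFrame(
    np.array([[4.1, 2.0, 3.3],
              [1.5, 4.8, 2.2],
              [3.0, 3.1, 4.9]]),
    index=pd.Index([75, 487, 2077], name='user_id'),
    columns=pd.Index([1, 2, 3], name='book_id'),
)

# Hypothetical rating history: user 487 has already rated book 2
already_rated_toy = {487: [2]}

def top_unrated(preds, user_id, rated, n=2):
    """Return the n highest predicted ratings for books the user has not rated yet."""
    user_preds = preds.loc[user_id]                           # lookup by label, not position
    user_preds = user_preds.drop(labels=rated.get(user_id, []))
    return user_preds.sort_values(ascending=False).head(n)

print(top_unrated(toy_preds, 487, already_rated_toy))
```

Looking rows up by label keeps the predictions tied to the right user even when the matrix only covers a filtered subset of users.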

I am going to once again test the function on User 487:

In [None]:
already_rated, recommendations = recommend_books(preds_df, 487, books, ratings, 10)

In [None]:
# View books the user has already rated
already_rated[['title']].head(10)

In [None]:
# View the 10 recommended books for User 487
recommendations

## Next Steps

The three models above used only ratings to determine recommendations. I would like to try to build a model that incorporates book features such as the year of publication, author, genre, and title in order to make more specific recommendations for each user.

I can do this using natural language processing to look for similar words and phrases to group books and make recommendations.  