# Goodreads Recommender System
### Giselle Kurniawan 

### Project Objectives
* Making a book recommendation system using the Goodreads Dataset (https://github.com/zygmuntz/goodbooks-10k)
* Input: Book -> Output: Books similar to input book
* Content-Based Filtering
* Memory-Based Collaborative Filtering
    - Item Based
    - User Based

## Importing Libraries and Importing Our Data

In [1]:
# Importing libraries
import json
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

In [2]:
# Changing directory
os.chdir('/Users/gisellekurniawan/Desktop/Projects/goodreads')
print('Current working directory: ', os.getcwd())

Current working directory:  /Users/gisellekurniawan/Desktop/Projects/goodreads


## Loading Dataset

In [3]:
ratings = pd.read_csv("ratings.csv") 
books = pd.read_csv("books.csv")
book_tags = pd.read_csv("book_tags.csv")
to_read = pd.read_csv("to_read.csv")
tags = pd.read_csv("tags.csv")

## Dataset Cleaning

### Cleaning books.csv

In [4]:
# Looking at proportion of missing data
print(books.isnull().mean())

book_id                      0.0000
goodreads_book_id            0.0000
best_book_id                 0.0000
work_id                      0.0000
books_count                  0.0000
isbn                         0.0700
isbn13                       0.0585
authors                      0.0000
original_publication_year    0.0021
original_title               0.0585
title                        0.0000
language_code                0.1084
average_rating               0.0000
ratings_count                0.0000
work_ratings_count           0.0000
work_text_reviews_count      0.0000
ratings_1                    0.0000
ratings_2                    0.0000
ratings_3                    0.0000
ratings_4                    0.0000
ratings_5                    0.0000
image_url                    0.0000
small_image_url              0.0000
dtype: float64


* **language_code** has the highest percentage of missing data (~10.84%)  
* **isbn**, **isbn13**, **original_publication_year**, and **original_title** has a smaller percentage of missing data (.2-7% range)

In [5]:
# Dropping columns isbn, isbn13, and language_code as not important
## books.drop(['isbn', 'isbn13', 'language_code'], axis=1)
books = books.drop(['image_url', 'small_image_url', 'isbn', 'isbn13', 'language_code'], axis=1)
print(books.isnull().mean())

book_id                      0.0000
goodreads_book_id            0.0000
best_book_id                 0.0000
work_id                      0.0000
books_count                  0.0000
authors                      0.0000
original_publication_year    0.0021
original_title               0.0585
title                        0.0000
average_rating               0.0000
ratings_count                0.0000
work_ratings_count           0.0000
work_text_reviews_count      0.0000
ratings_1                    0.0000
ratings_2                    0.0000
ratings_3                    0.0000
ratings_4                    0.0000
ratings_5                    0.0000
dtype: float64


In [6]:
# Replacing all columns that contain NAs to ""
str_columns = ["original_publication_year", "original_title"]
for i in str_columns:
    books[i] = books[i].fillna("")

### Cleaning ratings.csv

In [7]:
# Checking for duplicate pairs
ratings[ratings.duplicated(['user_id', 'book_id'], keep=False)]

Unnamed: 0,user_id,book_id,rating


There are no duplicate pairs!

In [8]:
# Adding a title column to ratings.csv
ratings = pd.merge(ratings, books[['book_id', 'title']], on='book_id', how='left')
ratings.head(10)

Unnamed: 0,user_id,book_id,rating,title
0,1,258,5,The Shadow of the Wind (The Cemetery of Forgot...
1,2,4081,4,I am Charlotte Simmons
2,2,260,5,How to Win Friends and Influence People
3,2,9296,5,The Drama of the Gifted Child: The Search for ...
4,2,2318,3,The Millionaire Next Door: The Surprising Secr...
5,2,26,4,"The Da Vinci Code (Robert Langdon, #2)"
6,2,315,3,Who Moved My Cheese?
7,2,33,4,Memoirs of a Geisha
8,2,301,5,Heart of Darkness
9,2,2686,5,Blue Ocean Strategy: How To Create Uncontested...


## Content-Based Recommender

A **content-based recommender** suggests items based on the similarity between the content of the items and the user's preferences. 

I used https://www.datacamp.com/tutorial/recommender-systems-python as a guide.

**Steps:**
1. Identify item and user profiles
2. Map each book to its corresponding genre (using tags.csv)
3. Convert author names into lowercase and strip all spaces between them 
4. Use CountVectorizer to create a frequency matrix
5. Create a function that takes in a book, and outputs similar books
    

### Identifying item and user profiles

In a typical content-based recommendation system, each item is represented by a set of features that describe the item's content.
In our case, our features will be **authors**, **genres** and **title**.

In [9]:
# Importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

### Mapping each book to its genre

As we've determined one of our features to be genres, we need to add a genre column to our books dataframe.

We can define a function that takes in a book_id and outputs its corresponding genre, then apply this function to the entire books dataset.

In [10]:
def get_genres(x):
    t = book_tags[book_tags.goodreads_book_id==x]
    return [i.lower().replace(" ", "") for i in tags.tag_name.loc[t.tag_id].values]

In [11]:
books['genres'] = books.book_id.apply(get_genres)

### Convert Author Names into Lowercase and Strip Spaces between Them

Removing spaces between words is important so that our vectorizer doesn't count "Stephen Covey" and "Stephen King" as the same. 

In [12]:
books['authors'] = books.authors.apply(lambda x: x.strip().replace(" ", ""))

### Creating our CountVectorizer

To feed data to our vectorizer, we need to create a **metadata soup**.  
Our metadata soup will consist of our defined features: authors, genres, and title

In [13]:
# Creating our metadata soup
def create_soup(x):
    return ''.join(x['authors']) + ' ' + ''.join(x['genres']) + ' ' + ''.join(x['title'])
                    

In [14]:
books['soup'] = books.apply(create_soup, axis=1)

In [15]:
books['soup'].head()

0    SuzanneCollins to-readfantasyfavoritescurrentl...
1    J.K.Rowling,MaryGrandPré to-readcurrently-read...
2    StephenieMeyer to-readfavoritesfantasycurrentl...
3                     HarperLee  To Kill a Mockingbird
4    F.ScottFitzgerald favoritesfantasycurrently-re...
Name: soup, dtype: object

In [16]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(books['soup'])

In [17]:
# Compute Cosine Similarity based on the count matrix.
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [18]:
books = books.reset_index()
indices = pd.Series(books.index, index=books['title'])

### Function to recommend a book

In [19]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the book that matches the input 
    idx = indices[title]
    # Get pairwise similarity score of all books with our input
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the books based on similarity score
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the top 10 most similar books
    sim_scores = sim_scores[1:11]
    # Get the book indices associated with these scores
    book_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return books['title'].iloc[book_indices]
    

In [20]:
get_recommendations("The Great Gatsby")

347                                         We Were Liars
8305                            Time and Again (Time, #1)
2                                 Twilight (Twilight, #1)
0                 The Hunger Games (The Hunger Games, #1)
1       Harry Potter and the Sorcerer's Stone (Harry P...
9                                     Pride and Prejudice
6818    Life is What You Make It: A Story of Love, Hop...
5                                  The Fault in Our Stars
6271                                       Being and Time
6272                                          In Our Time
Name: title, dtype: object

## Memory Based Collaborative Filtering

Memory-Based Collaborative Filtering uses historical data to compute similarities between users (user-based) and items (item-based).

## User Based

User-Based Collaborative Filtering recommends books that a user might like by looking at similar users' preferences. 

Steps:
1. Create a user-item matrix
2. For a user U, find similar users based on rating vectors consisting of given item ratings.
3. Predict ratings for books that user U have not rated using the ratings of given books that other similar users have already rated.

### Creating a user-item matrix

In [24]:
user_item_mat = ratings.pivot_table(index='user_id', columns='book_id', values='rating')

In [25]:
user_item_mat = user_item_mat.fillna(0)
user_item_mat.head()

book_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,5.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,5.0,0.0,4.0,4.0,0.0,4.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Defining a Function to Find Similar Users

In [26]:
from sklearn.metrics.pairwise import cosine_similarity
import operator

In [27]:
def similar_users(user_id, mat, k=3):
    # Creating a data frame of user_id x
    user = mat[mat.index == user_id]
    # Creating a data frame of other users
    other_users = mat[mat.index != user_id]
    # Creating list of indices (user_ids) of other_users
    indices = other_users.index.tolist()
    # Computing cosine similarity between user_id x and other users
    sim_score = cosine_similarity(user, other_users)[0].tolist()
    # Creating key-value pairs of other users' user_ids and their similarity scores
    index_sim = dict(zip(indices, sim_score))
    # Sorting Sim Scores in Decreasing Order
    index_sim_sorted = sorted(index_sim.items(), key=lambda item: item[1], reverse=True)
    # Getting only top k similar users
    top_k_users_sim = index_sim_sorted[:k] 
    # user_ids of top k similar users
    users = [x[0] for x in top_k_users_sim]
    return users
    

### Defining a Function that Suggests Books

In [28]:
def recommend_me(user_id, i=5):
    # Getting user_ids of similar users
    sim_users = similar_users(user_id, user_item_mat)
    # Similar Users Books and their Ratings
    sim_users = user_item_mat[user_item_mat.index.isin(sim_users)]
    # Getting other users overall average rating of books
    sim_users = sim_users.mean(axis=0)
    # Converting to dataframe
    sim_users_df =  pd.DataFrame(sim_users, columns=['mean'])
    # User_id's book and ratings
    user_books = user_item_mat[user_item_mat.index == user_id].transpose()
    # Rename column name in user_books to rating
    user_books.columns = ['rating']
    # Getting a list of unseen books
    books_unseen = user_books[user_books.rating == 0].index.tolist()
    # Filter sim_users_df to only show books user_id hasn't seen
    sim_users_df = sim_users_df[sim_users_df.index.isin(books_unseen)]
    # Sorting this dataframe based on decreasing average rating and getting only top i books
    sim_users_df_sorted = sim_users_df.sort_values(by='mean', ascending=False)[:i]
    # Converting this dataframe into a list
    top_books = sim_users_df_sorted.index.tolist()
    # Finding these books in the books dataframe to get names
    rec_books = books[books['book_id'].isin(top_books)]
    return rec_books
    

In [29]:
recommend_me(2)

Unnamed: 0,index,book_id,goodreads_book_id,best_book_id,work_id,books_count,authors,original_publication_year,original_title,title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,genres,soup
3,3,4,2657,2657,3275794,487,HarperLee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,[],HarperLee To Kill a Mockingbird
12,12,13,5470,5470,153313,995,"GeorgeOrwell,ErichFromm,CelâlÜster",1949.0,Nineteen Eighty-Four,1984,...,1956832,2053394,45518,41845,86425,324874,692021,908229,"[to-read, currently-reading, favorites, scienc...","GeorgeOrwell,ErichFromm,CelâlÜster to-readcurr..."
13,13,14,7613,7613,2207778,896,GeorgeOrwell,1945.0,Animal Farm: A Fairy Story,Animal Farm,...,1881700,1982987,35472,66854,135147,433432,698642,648912,[],GeorgeOrwell Animal Farm
27,27,28,7624,7624,2766512,458,WilliamGolding,1954.0,Lord of the Flies,Lord of the Flies,...,1605019,1671484,26886,92779,160295,425648,564916,427846,"[to-read, travel, non-fiction, currently-readi...",WilliamGolding to-readtravelnon-fictioncurrent...
34,34,35,865,865,4835472,458,"PauloCoelho,AlanR.Clarke",1988.0,O Alquimista,The Alchemist,...,1299566,1403995,55781,74846,123614,289143,412180,504212,[],"PauloCoelho,AlanR.Clarke The Alchemist"


## Item Based

Item-Based Collaborative Filtering recommends books by looking into the similarities between books, and predicting missing ratings by using ratings to the other items by the user.

Steps:
1. Create an item_user_matrix
2. Find similarities between all book pairs.
3. Recommend books by choosing top k books that are closely correlated with our given book.

### Creating an item-user matrix

In [35]:
item_user_mat = ratings.pivot_table(index='user_id', columns='book_id', values='rating')

In [37]:
item_user_mat = item_user_mat.fillna(0)
item_user_mat.head()

book_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,5.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,5.0,0.0,4.0,4.0,0.0,4.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [88]:
def get_similar(title, mat, b = 5):
    # Getting similarity matrix of only given book
    title_user_ratings = mat[title]
    # Computes pairwise correlation between user_item similarity matrix and the similarity matrix of only given book
    similar_to_title = mat.corrwith(title_user_ratings)
    # Creating a dataframe of books and its correlation scores
    corr_title = pd.DataFrame(similar_to_title, columns=['correlation'])
    corr_title.dropna(inplace=True)
    # Sort this dataframe in descending order
    corr_title.sort_values('correlation', ascending=False, inplace=True)
    # Finding top b books
    top_books = corr_title.index.tolist()[1:b]
    # Finding these books in the books dataframe to get names
    top_books = books[books['book_id'].isin(top_books)]
    return top_books

In [93]:
user_item_mat.corrwith(user_item_mat[1])

book_id
1        1.000000
2        0.314606
3        0.349821
4        0.117798
5        0.081628
           ...   
9996    -0.010968
9997    -0.021445
9998    -0.012843
9999     0.005286
10000   -0.019142
Length: 10000, dtype: float64