# Item-to-Item Recommender System

We build an item-to-item recommender system for our books. With this recommender, users can search for a book they liked and view the 5 most similar books. We use only explicit ratings (`1-10`) and include only books that have more than 3 ratings. We then scale the ratings to remove item bias, so that a book is not recommended repeatedly simply because it is popular. Cosine similarity is the metric we use to measure how closely a searched book and its recommendations are related.

## Contents
-  [Preprocessing](#Preprocessing)
    -  [Count Threshold](#Count-Threshold)
    -  [Sparse DataFrame](#Sparse-DataFrame)
    -  [Scale Ratings](#Scale-Ratings)
    -  [Create Pivot Table](#Create-Pivot-Table)
-  [Modeling](#Modeling)
    -  [Cosine Similarity](#Cosine-Similarity)
    -  [Recommender System](#Recommender-System)
-  [Conclusions and Recommendations](#Conclusions-and-Recommendations)

In [1]:
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.sparse
import pickle

sns.set_style('darkgrid')
sns.set_palette(palette='colorblind')
%matplotlib inline

## Preprocessing

Implicit ratings (`0`) indicate an interaction between the reader and book, but no preference was given. Since we do not know the rating an individual would give, we will only be using explicit ratings for our recommender. 

In [2]:
exp_ratings = pd.read_csv('../data/exp_ratings.csv')

In [3]:
exp_ratings.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,user_id,book_rating
0,843485676X,Gran Angular - Alerta Roja: Un Frio Viento Del...,C Puerto,0,Ediciones SM,184386,7
1,0821721828,Moontide Embrace,Constance O'Banyon,1987,Zebra Books,114368,5
2,8434852500,"Piedra del Toque, La Ga 6",Montserrat del Amo,1998,S &amp; M Books,184386,7
3,0415938740,"Boss Ladies, Watch Out!: Essays on Women, Sex ...",Terry Castle,2002,Routledge,74281,7
4,080943752X,The healthy heart (Library of health),Arthur Fisher,1981,school and library distribution by Silver Burd...,145166,5


The `exp_ratings` dataframe saved from `02_EDA_and_Cleaning` currently has the following:

In [4]:
print('Users:  ', exp_ratings['user_id'].nunique())
print('Books:  ', exp_ratings['isbn'].nunique())
print('Ratings:', exp_ratings.shape[0])

Users:   68092
Books:   139648
Ratings: 383852


### Count Threshold

Many books have only one rating, and we do not want to include them in our recommender: a single reader's preference is not a good indication of a book's quality. To mitigate this, we set a threshold of 3, meaning that only books with more than 3 ratings (at least 4) are included.

In [5]:
# Count ratings per ISBN once, then keep the ISBNs with more than 3 ratings
isbn_counts = exp_ratings['isbn'].value_counts()
isbn_greater_three = isbn_counts[isbn_counts > 3].index

In [6]:
new_exp_ratings = exp_ratings[exp_ratings['isbn'].isin(isbn_greater_three)]
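
To make the threshold concrete, here is a minimal sketch on a small, made-up ratings table (the `toy` data and the `counts`/`kept` names are illustrative assumptions, not part of the notebook). It applies the same "more than 3 ratings" rule using a per-row rating count:

```python
import pandas as pd

# Hypothetical toy ratings: book 'A' has 4 ratings, 'B' has 3, 'C' has 1
toy = pd.DataFrame({
    'isbn': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'user_id': [1, 2, 3, 4, 1, 2, 3, 1],
    'book_rating': [8, 7, 9, 6, 5, 5, 6, 10],
})

# Count the ratings per ISBN and broadcast that count back onto every row
counts = toy.groupby('isbn')['isbn'].transform('count')

# Keep only rows whose book has more than 3 ratings (i.e. at least 4)
kept = toy[counts > 3]
print(kept['isbn'].unique())
```

Only book `A` survives here: `B` has exactly 3 ratings, so it falls on the excluded side of the threshold.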

In [7]:
print('Users:  ', new_exp_ratings['user_id'].nunique())
print('Books:  ', new_exp_ratings['isbn'].nunique())
print('Ratings:', new_exp_ratings.shape[0])

Users:   51762
Books:   18263
Ratings: 224114


The counts before and after excluding books with 3 or fewer ratings are:

||Original|New|Decreased By|
|---|---|---|---|
|Users|68092|51762|16330|
|Books|139648|18263|121385|
|Ratings|383852|224114|159738|

-  After removing books with fewer than 4 ratings, the numbers of `users`, `books` and `ratings` have all decreased.
-  The number of `books` decreased substantially; we will use the remaining `18,263` for recommendations.

### Sparse DataFrame

Given the size of `new_exp_ratings`, we restructure the data before creating a sparse dataframe, which speeds up computation. We turn the dataframe into a dictionary of dictionaries and save it as `wide_ratings`. Each key is a user, and the nested dictionary maps book ISBNs to the ratings given by that user.

In [8]:
wide_ratings = {}
for user, book, rating in zip(new_exp_ratings['user_id'], new_exp_ratings['isbn'], new_exp_ratings['book_rating']):
    if wide_ratings.get(user):
        wide_ratings[user][book] = rating
    else:
        wide_ratings[user] = {book: rating}
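
As a small illustration of this structure, the sketch below builds a dictionary of dictionaries from three made-up ratings and passes it to a regular (dense) `pd.DataFrame`. The names `rows`, `wide_toy` and `df_toy` are hypothetical; the point is only to show that the outer keys (users) become columns and the inner keys (ISBNs) become the index:

```python
import pandas as pd

# Hypothetical (user, isbn, rating) triples
rows = [(101, 'isbn_a', 8), (101, 'isbn_b', 5), (102, 'isbn_a', 7)]

# setdefault creates the inner dict the first time a user is seen
wide_toy = {}
for user, book, rating in rows:
    wide_toy.setdefault(user, {})[book] = rating

# Outer keys -> columns, inner keys -> index, missing pairs -> NaN
df_toy = pd.DataFrame(wide_toy)
print(df_toy)
```

User `102` never rated `isbn_b`, so that cell is `NaN`, which is the kind of missing entry the sparse representation is meant to avoid storing.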

We then turn the nested dictionary `wide_ratings` into a sparse dataframe for memory efficiency. We expect the dataframe to consist mostly of `NaN`s, since each user rates only a small fraction of the books. Compressing it with `pd.SparseDataFrame` avoids storing these missing values.

In [9]:
sdf_ratings = pd.SparseDataFrame(wide_ratings)

In [10]:
sdf_ratings.head()

Unnamed: 0,276964,2276,16795,49133,76097,112109,114255,154824,162738,166003,...,188118,106956,178484,216817,80208,248399,104611,128296,69944,72866
1047647,,,,,,,,,,,...,,,,,,,,,,
2005018,,,,,,,,,,,...,,,,,,,,,,
2116286,,,,,,,,,,,...,,,,,,,,,,
2240114,,,,,,,,,,,...,,,,,,,,,,
2241447,,,,,,,,,,,...,,,,,,,,,,


Since we want an item-to-item recommender, our `sdf_ratings` has:
-  `isbn` as index
-  `users` as columns
-  `book_rating` as values
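
`pd.SparseDataFrame` was removed in pandas 1.0, so on a newer installation the same ISBN-by-user layout has to be built another way. One possible approach, sketched here on made-up data, is to construct a SciPy sparse matrix directly from the long-format ratings. The `toy` frame and the `isbn_cat`, `user_cat` and `toy_matrix` names are assumptions for illustration; note also that unrated pairs become implicit zeros here rather than `NaN`s:

```python
import pandas as pd
from scipy import sparse

# Hypothetical long-format ratings
toy = pd.DataFrame({
    'isbn': ['A', 'A', 'B', 'C'],
    'user_id': [1, 2, 1, 3],
    'book_rating': [8, 7, 5, 10],
})

# Categorical codes give each ISBN a row position and each user a column position
isbn_cat = toy['isbn'].astype('category')
user_cat = toy['user_id'].astype('category')

# Build a COO matrix from (value, (row, col)) triples, then convert to CSR
toy_matrix = sparse.coo_matrix(
    (toy['book_rating'].to_numpy(dtype=float),
     (isbn_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy())),
    shape=(len(isbn_cat.cat.categories), len(user_cat.cat.categories)),
).tocsr()

print(toy_matrix.shape)  # rows are ISBNs, columns are users
print(toy_matrix.nnz)    # number of stored ratings
```

The row and column labels can be recovered from `isbn_cat.cat.categories` and `user_cat.cat.categories`.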

We'll save the sparse dataframe so that we don't have to create it again. Since the index and columns are not included in the saved matrix, we pickle them separately so that we can map them back to the dataframe when needed.

The column labels must first be converted to strings to avoid an error.

In [11]:
sdf_ratings.columns = sdf_ratings.columns.astype(str)

We use `to_coo` since it's a fast format for constructing sparse matrices.

In [12]:
sdf_coo = sdf_ratings.to_coo()

In [13]:
scipy.sparse.save_npz('../assets/sdf_ratings.npz', sdf_coo)

In [14]:
with open('../assets/cols_ratings.pkl', 'wb+') as f:
    pickle.dump(sdf_ratings.columns, f)

In [15]:
with open('../assets/index_ratings.pkl', 'wb+') as f:
    pickle.dump(sdf_ratings.index, f)
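
For completeness, here is a hedged sketch of how saved pieces like these could be loaded back and recombined. It uses a tiny made-up matrix and a temporary directory instead of the `../assets` files, and for brevity stores both label sets in one pickle; all names (`toy_matrix`, `toy_index`, `toy_cols`, `restored`) are illustrative assumptions:

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
import scipy.sparse

# Hypothetical 2x2 ratings matrix with its row and column labels
toy_matrix = scipy.sparse.coo_matrix(np.array([[8.0, 0.0], [0.0, 5.0]]))
toy_index = pd.Index(['isbn_a', 'isbn_b'])
toy_cols = pd.Index(['101', '102'])

with tempfile.TemporaryDirectory() as tmp:
    npz_path = os.path.join(tmp, 'toy_ratings.npz')
    pkl_path = os.path.join(tmp, 'toy_labels.pkl')

    # Save the matrix and the labels separately
    scipy.sparse.save_npz(npz_path, toy_matrix)
    with open(pkl_path, 'wb') as f:
        pickle.dump((toy_index, toy_cols), f)

    # Load both back
    loaded = scipy.sparse.load_npz(npz_path)
    with open(pkl_path, 'rb') as f:
        loaded_index, loaded_cols = pickle.load(f)

# Reattach the labels to the (densified) matrix
restored = pd.DataFrame(loaded.toarray(), index=loaded_index, columns=loaded_cols)
print(restored)
```

Densifying with `toarray()` is only reasonable for a toy example; for the full ratings matrix it is usually preferable to keep the data in sparse form.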

### Scale Ratings

Now, we scale the ratings by row to remove item bias. Removing item bias is important because it prevents popular books from always being recommended, regardless of the searched book.

In [16]:
sc_sdf_ratings = (sdf_ratings - np.nanmean(sdf_ratings, axis=0)) / np.nanstd(sdf_ratings, axis=0) 
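
As a point of reference, the sketch below standardizes each row (book) of a small, made-up dense dataframe. It is an illustration under stated assumptions rather than the notebook's own computation: `toy`, `row_mean`, `row_std` and `scaled` are hypothetical names. In pandas and NumPy, `axis=1` reduces across the columns and yields one value per row (per book in this layout), whereas `axis=0` yields one value per column (per user), so it is worth checking which axis matches the intended kind of scaling:

```python
import numpy as np
import pandas as pd

# Hypothetical ISBN-by-user ratings with missing entries
toy = pd.DataFrame(
    [[8.0, 6.0, np.nan],
     [np.nan, 3.0, 5.0]],
    index=['isbn_a', 'isbn_b'],
    columns=['101', '102', '103'],
)

# Per-book statistics; NaNs are skipped by default
row_mean = toy.mean(axis=1)
row_std = toy.std(axis=1, ddof=0)  # ddof=0 matches np.nanstd

# axis=0 aligns the per-book values with the row index
scaled = toy.sub(row_mean, axis=0).div(row_std, axis=0)
print(scaled)
```

A book whose ratings are all identical has a standard deviation of 0, so dividing by it does not produce a usable score; such rows may need separate handling.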

In [21]:
sc_sdf_ratings.head()

Unnamed: 0,14958,178890,204567,205483,217740,28492,30411,50769,55187,62895,...,141840,143663,149245,175851,183129,189718,193272,213859,243376,244340
2005018,,,,,,,,,,,...,,,,,,,,,,
2116286,,,,,,,,,,,...,,,,,,,,,,
2240114,,,,,,,,,,,...,,,,,,,,,,
2243962,,,,,,,,,,,...,,,,,,,,,,
2251760,,,,,,,,,,,...,,,,,,,,,,


In the scaled dataframe, the mean is 0: a value greater than 0 indicates an above-average (favorable) rating, and a value less than 0 indicates a below-average (unfavorable) one.

Again, we save out the scaled, sparse dataframe to avoid having to run the code again later.

In [22]:
sc_sdf_ratings.columns = sc_sdf_ratings.columns.astype(str)

In [23]:
sc_sdf_coo = sc_sdf_ratings.to_coo()

In [24]:
scipy.sparse.save_npz('../assets/sc_ratings.npz', sc_sdf_coo)

In [25]:
with open('../assets/cols_sc_ratings.pkl', 'wb+') as f:
    pickle.dump(sc_sdf_ratings.columns, f)

In [26]:
with open('../assets/index_sc_ratings.pkl', 'wb+') as f:
    pickle.dump(sc_sdf_ratings.index, f)

### Create Pivot Table

Before calculating cosine similarity, we want to create a sparse matrix where `NaNs` are filled with 0s.

In [27]:
isbn_pivot = sparse.csr_matrix(sc_sdf_ratings.fillna(0))

## Modeling

### Cosine Similarity

We calculate the cosine similarity between books by treating the rows of `isbn_pivot` as vectors. For each pair of books, this takes the dot product of the two vectors and divides it by the product of their magnitudes.

![Cosine Similarity](https://neo4j.com/docs/graph-algorithms/current/images/cosine-similarity.png)

In [28]:
cos_sim = cosine_similarity(isbn_pivot, isbn_pivot)
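
The formula can be checked by hand on two small vectors. This is a minimal sketch with made-up vectors `a` and `b` (not rows of `isbn_pivot`), comparing a manual computation against `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical scaled-rating vectors for two books
a = np.array([1.0, -1.0, 0.0])
b = np.array([1.0, 0.0, 1.0])

# Dot product divided by the product of the magnitudes
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity expects 2D inputs, one vector per row
library = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, library)
```

Both values are 0.5: the dot product is 1 and each vector has magnitude √2.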

For easier readability, we'll load `cos_sim` back into a dataframe.

In [29]:
ratings_cos_sim = pd.DataFrame(cos_sim, index=sc_sdf_ratings.index, columns=sc_sdf_ratings.index)

In [30]:
ratings_cos_sim.head()

Unnamed: 0,0002005018,0002116286,0002240114,0002243962,0002251760,0002255081,0002259834,0002261820,0002550563,0003300277,...,9684112017,9707100036,9722016563,972210277X,9722105248,9726101794,9727110800,9728423160,9812327975,9871138148
2005018,1.0,0.013408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.049209,0.0
2116286,0.013408,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2240114,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2243962,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2251760,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The similarities range from -1 to 1:
-  `-1`: books are exactly opposite
-  `0`: books are orthogonal (not related)
-  `1`: books are exactly the same
-  `0 < similarity < 1`: intermediate similarity between books
-  `-1 < similarity < 0`: intermediate dissimilarity between books

### Recommender System

Our `ratings_cos_sim` has ISBNs as both the index and the columns. To allow users to search by book title, we create a dataframe containing unique ISBNs and their corresponding information.

In [31]:
unique_title = new_exp_ratings.drop_duplicates('isbn')

Define a function that returns the top 5 most similar books and their cosine similarity with the given ISBN.

In [32]:
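def top_5_recs(isbn):
    titles = []
    sims = []
    # Sort by similarity, highest first; position 0 is expected to be the
    # searched book itself, so take positions 1 through 5
    for book, sim in ratings_cos_sim[isbn].sort_values(ascending=False)[1:6].items():
        titles.append(unique_title[unique_title['isbn'] == book]['book_title'].values[0])
        sims.append(sim)
    # Build the result once, after all recommendations have been collected
    top_5 = pd.DataFrame({'Recommendations': titles, 'Similarity': sims})
    return top_5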
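
The `[1:6]` slice relies on the searched book sorting into first place. A possible variation, sketched here on a made-up 3x3 similarity table, drops the searched ISBN explicitly before taking the top entries. The `toy_sim` frame and the `top_n_similar` helper are hypothetical and not part of the notebook:

```python
import pandas as pd

# Hypothetical symmetric similarity table
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.9],
     [0.2, 1.0, 0.4],
     [0.9, 0.4, 1.0]],
    index=['isbn_a', 'isbn_b', 'isbn_c'],
    columns=['isbn_a', 'isbn_b', 'isbn_c'],
)

def top_n_similar(sim, isbn, n=5):
    """Return the n highest similarities for isbn, excluding the book itself."""
    return sim[isbn].drop(isbn).nlargest(n)

result = top_n_similar(toy_sim, 'isbn_a', n=2)
print(result)
```

For `isbn_a` this returns `isbn_c` (0.9) followed by `isbn_b` (0.2), whether or not another book happens to tie with the searched book at a similarity of 1.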

Define a function where users can enter a search term, whether that's the book title or a portion of it. If the search term is:
-  **Found in one book**: the `top_5_recs` function defined above will be used to return the top 5 recommendations.
-  **Not found**: the user will be notified of this. 
-  **Found in multiple book titles**: we ask the user to specify which book they mean. A numbered list of the matching titles is printed for selection, and `top_5_recs` is then run on the selected book.

In [33]:
def book_search():
    search_term = input('Please enter a book: ')
    # Titles containing the search term (case-sensitive; str.contains
    # treats the term as a regular expression by default)
    search_mask = unique_title[unique_title['book_title'].str.contains(search_term)]['book_title'].values
    if len(search_mask) == 0:
        print("Sorry, it doesn't look like we have this book in our system.")
        return None
    if len(search_mask) == 1:
        book_title = search_mask[0]
    else:
        print('Which book are you looking for?')
        for i, title in enumerate(search_mask):
            print(i, title)
        search_again = input('Please enter the number of the correct book: ')
        book_title = search_mask[int(search_again)]
    # Map the chosen title back to an ISBN; if several editions share
    # the same title, the first matching ISBN is used
    isbn = unique_title[unique_title['book_title'] == book_title]['isbn'].values[0]
    return top_5_recs(isbn)
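
Because `book_search` reads from `input`, it is awkward to test. One option is to separate the title lookup into a function that takes the search term as an argument. The sketch below does this on a made-up catalog; `toy_titles` and `find_titles` are hypothetical names, and the case-insensitive, literal (non-regex) matching is an assumption about what a user would expect rather than the notebook's behavior:

```python
import pandas as pd

# Hypothetical catalog of unique titles
toy_titles = pd.DataFrame({
    'isbn': ['111', '222', '333'],
    'book_title': ['The Little Mermaid', 'The Mermaids Singing', 'Cinder Edna'],
})

def find_titles(catalog, search_term):
    """Return the rows whose title contains search_term, ignoring case."""
    mask = catalog['book_title'].str.contains(
        search_term, case=False, regex=False, na=False
    )
    return catalog.loc[mask, ['isbn', 'book_title']]

matches = find_titles(toy_titles, 'mermaid')
print(matches)
```

Returning the ISBN alongside each title also keeps editions that share a title distinguishable when the user picks one.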

We will use our function with:
-  `search_term = Mermaid`
-  `search_again = 4 The Little Mermaid`
-  [Synopsis](https://www.amazon.com/Little-Mermaid-Original-Illustrations/dp/0615963943/ref=sr_1_6?s=books&ie=UTF8&qid=1549620530&sr=1-6&keywords=the+little+mermaid)
> After saving a prince from drowning, a mermaid princess embraces a life of extreme self-sacrifice to win his love and gain an immortal soul.

In [34]:
book_search()

Which book are you looking for?
0 The Mermaids Singing
1 The Mermaids Singing (A Dr. Tony Hill & Carol Jordan Mystery)
2 Little Mermaid (Little Golden Book)
3 The Little Mermaid
4 The Little Mermaid
5 Disney's: The Little Mermaid (Disney's Wonderful World of Reading)


Unnamed: 0,Recommendations,Similarity
0,Cinder Edna,0.634178
1,Pound It (Popular Mechanics for Kids),0.634178
2,Los Detectives Salvajes,0.634178
3,Flowers for Travis,0.634178
4,A Time To Dance,0.634178


-  The top 5 recommendations for `The Little Mermaid` all have a cosine similarity of `0.634178`.
-  Based on [Cinder Edna's synopsis](https://www.amazon.com/Cinder-Edna-Ellen-Jackson/dp/0688162959), it's pretty similar to our searched book:
    > Once upon a time there were two girls who lived next door to each other. Cinder Edna was forced to work for her wicked stepmother and stepsisters, just as her neighbor, Cinderella, was. Edna, on the other hand, had learned a thing or two from doing all that housework, such as how to make tuna casserole sixteen different ways and how to get spots off everything from rugs to ladybugs. And she was strong and spunky and knew some good jokes. Then one day the king announced that he would give a ball ...
-  The similarity with the other books is not as apparent.

## Conclusions and Recommendations

The quality of our recommendations varies depending on the book searched. Our search for The Little Mermaid returned five books with the same similarity (`0.634178`). Despite the identical scores, it's not readily apparent that they share qualities with the searched book.

The inconsistent results are largely due to the sparsity of our data. Since implicit ratings and books with 3 or fewer ratings were removed, our recommender system consists of only 18,263 books with 224,114 ratings.

To improve our recommender, we can:
-  **Collect more data:** Collecting more ratings data will be the most direct and advantageous method to improve our recommendations. This will help address the high count of implicit ratings and data sparsity issues.

-  **Build a hybrid recommender system:** We can build a hybrid recommender system that takes into account both the ratings and book metadata. APIs like Google Books can supply metadata such as page count, genre and book description.