## Collaborative Filtering

In [1]:
# import pandas
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

We will create a recommender engine based on Item-Based Collaborative Filtering (IBCF), which finds the most similar books based on user ratings. The data can be downloaded from [here](https://drive.google.com/file/d/1WvTmAfO09TCX7xp7uu06__ziic7JnrL5/view?usp=sharing).
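
To make the idea concrete before loading the real data, here is a minimal sketch of item-based similarity on a tiny, made-up ratings matrix (the values and the names `ratings`, `similarity` and `most_similar_to_a` are illustrative assumptions, not part of the dataset). Each book is a row of user ratings, and two books count as similar when their rows point in a similar direction, measured here with cosine similarity.

```python
import numpy as np

# Toy ratings matrix (hypothetical data): rows are books, columns are users,
# and 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 0],   # book A
    [4, 5, 0, 1],   # book B
    [0, 0, 5, 4],   # book C
], dtype=float)

# Cosine similarity between every pair of book rows.
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
similarity = (ratings @ ratings.T) / (norms @ norms.T)

# For book A (row 0), rank the other books from most to least similar.
order = np.argsort(-similarity[0])
most_similar_to_a = [int(i) for i in order if i != 0]
print(most_similar_to_a)
```

Books A and B are rated highly by the same users, so B ranks ahead of C as a neighbour of A.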

In [2]:
book_ratings = pd.read_csv('BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", encoding="latin")
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
books = pd.read_csv('BX-CSV-Dump/BX-Books.csv', sep=";", encoding="latin", on_bad_lines='skip')


* Explore both datasets

In [None]:
book_ratings.head()

In [None]:
plt.figure(figsize=(6,4))
book_ratings['Book-Rating'].hist(bins=50)
# we can see that most ratings are 0
# let's count the ratings for each book

In [None]:
book_ratings.groupby('ISBN')['Book-Rating'].count()

In [None]:
books.head()

* Create a DataFrame named `df_book_features` from `book_ratings` with `ISBN` as the index, `User-ID` as the columns and `Book-Rating` as the values.
    - The data are quite big, so it's OK to use only a sample if your PC has limited RAM.
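
As a rough sketch of the shape this step asks for, the snippet below pivots a tiny, made-up ratings table (`toy_ratings` is a hypothetical stand-in that only borrows the column names of `BX-Book-Ratings.csv`) into a book-by-user matrix. The sampling hint in the comment is one possible way to cut memory use, not a requirement.

```python
import pandas as pd

# Hypothetical miniature of the ratings file, using the same column names.
toy_ratings = pd.DataFrame({
    'User-ID': [1, 1, 2, 2, 3],
    'ISBN': ['111', '222', '111', '333', '222'],
    'Book-Rating': [8, 0, 5, 10, 7],
})

# With limited RAM, a random sample of rows could be taken first, e.g.
# toy_ratings.sample(frac=0.1, random_state=0); here we keep every row.
book_user = toy_ratings.pivot_table(
    index='ISBN', columns='User-ID', values='Book-Rating'
).fillna(0)  # books a user never rated become 0
print(book_user)
```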


In [None]:
merged_df = pd.merge(books, book_ratings, on='ISBN')

In [None]:
df_book_features = pd.DataFrame(data=merged_df, columns=['ISBN', 'User-ID', 'Book-Rating', 'Book-Title'])
df_book_features = df_book_features.set_index('ISBN')
df_book_features.head()

In [None]:
# looks like we have quite a few NaN values here
df_book_features = df_book_features.dropna(axis=0, subset=['Book-Title'])

In [None]:
len(df_book_features.index)

In [None]:
len(df_book_features.index.unique())

In [None]:
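# let's look at the rating count for each book, since we know many books have few ratings
# transform('count') returns one value per row, so it lines up with the repeated ISBN index
df_book_features['rating_count'] = df_book_features.groupby('ISBN')['Book-Rating'].transform('count')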

In [None]:
df_book_features.head(2)

In [None]:
# awesome, now we want to keep the books with many ratings and discard the ones with few
# let's look at the stats of the rating_count column
pd.set_option('display.float_format', lambda x: '%.3f' %x)
df_book_features['rating_count'].describe()

In [None]:
# so the 75th percentile of rating_count is 42 ratings
# because we have so many books, I will limit this to the top few percentiles
# let's examine them first:
# np.arange(0.9, 1, .01) yields 0.90, 0.91, ..., 0.99,
# so this prints the 90th to 99th percentiles
print(df_book_features['rating_count'].quantile(np.arange(0.9, 1, .01)))

In [None]:
# the 95th percentile is 236 ratings, so I'll use that as the cutoff
threshold = 236
df_book_features = df_book_features[df_book_features['rating_count'] > threshold]
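
Note that `rating_count` is stored once per rating row, so the quantiles above are weighted toward heavily rated books. One possible variation, sketched here on made-up data (the frame `toy` and the median cutoff are assumptions for illustration only), is to compute the cutoff over one row per book and then filter the ratings with it.

```python
import pandas as pd

# Hypothetical ratings: book 'a' has 3 ratings, 'b' has 1, 'c' has 2.
toy = pd.DataFrame({
    'ISBN': ['a', 'a', 'a', 'b', 'c', 'c'],
    'Book-Rating': [5, 7, 0, 9, 3, 8],
})

# transform('count') returns one value per row, aligned with the original rows.
toy['rating_count'] = toy.groupby('ISBN')['Book-Rating'].transform('count')

# Keep one row per book so that each book counts once in the quantile.
per_book = toy.drop_duplicates('ISBN').set_index('ISBN')['rating_count']
threshold = per_book.quantile(0.5)

# Keep only the ratings that belong to books at or above the cutoff.
popular = toy[toy['rating_count'] >= threshold]
print(threshold, sorted(popular['ISBN'].unique()))
```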

* Create an instance of the NearestNeighbors class

In [None]:
nn = NearestNeighbors(n_neighbors=5, metric='cosine', algorithm='brute')

In [None]:
# creating the matrix to compare similarities
# it will be sparse as not all books are rated by all users
user_book_rating = df_book_features.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')
user_book_rating.head(3)

* Fit NearestNeighbors using 'df_book_features' (pivoted above into `user_book_rating`)

In [None]:
# we need to fill NaN values with 0s before using KNN, since it measures distances between the rating vectors
user_book_rating = user_book_rating.fillna(0)
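
Because most entries of such a matrix are 0 after filling, it may be worth storing it in sparse form. The sketch below uses a small, made-up matrix (`dense`, `sparse` and `knn` are hypothetical names) to show that `NearestNeighbors` with `metric='cosine'` and `algorithm='brute'` accepts a SciPy CSR matrix, and that the cosine distance it reports is 1 minus the cosine similarity.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical book x user matrix; 0 stands for "no rating".
dense = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# csr_matrix stores only the non-zero entries.
sparse = csr_matrix(dense)

knn = NearestNeighbors(n_neighbors=2, metric='cosine', algorithm='brute')
knn.fit(sparse)

# Nearest neighbours of book 0; the first hit is book 0 itself at distance ~0.
distances, indices = knn.kneighbors(sparse[0])
print(indices[0], distances[0])
```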

In [None]:
model = nn.fit(user_book_rating)

In [None]:
user_book_rating.head(2)

In [None]:
# testing the model with some recommendations
query_idx = np.random.choice(user_book_rating.shape[0])
# ask for 6 neighbors: the closest one is normally the query book itself, which leaves 5 recommendations
distances, idx = nn.kneighbors(user_book_rating.iloc[[query_idx]], n_neighbors=6)
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(user_book_rating.index[query_idx]))
    else:
        print('{0}: {1}, with distance of {2}'.format(i, user_book_rating.index[idx.flatten()[i]], distances.flatten()[i]))

* Create a function that returns the top 5 most similar books (according to the KNN model) for a selected book
    * the input will be a Book-Title from the DataFrame books
    * the output will be the Book-Titles of the top 5 most similar books.
    * for every book in the top 5, also print its distance from the selected book (the title we passed to the function)

In [None]:
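def top_5_similar(book, model=model):
    # query 6 neighbors and skip the first, which is normally the book itself
    distances, idx = model.kneighbors(user_book_rating.loc[[book]], n_neighbors=6)
    titles = user_book_rating.index[idx[0, 1:]]
    print(pd.Series(distances[0, 1:], index=titles, name='distance'))
    return list(titles)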

* Apply the function to a book of your choice