# Recommendation Systems

## Introduction

Recommendation systems are a collection of algorithms that recommend items to users based on information gathered from those users. These systems have become ubiquitous and are commonly seen in online stores, movie databases and job finders.

<hr>

## 0. Acquiring the Data

To acquire and extract the data, run the following shell commands. The dataset comes from [Book Crossing](http://www2.informatik.uni-freiburg.de/~cziegler/BX/):

In [None]:
!wget -O bookdataset.zip http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
!unzip -o bookdataset.zip -d /resources/data

Now you're ready to start working with the data!

## 1. Preprocessing

First, let's get all of the imports out of the way:

In [None]:
#Dataframe manipulation library
import pandas as pd
#Math functions, we'll only need the sqrt function so let's import only that
from math import sqrt
import numpy as np

Now let's read each file into its own dataframe:

In [None]:
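#Storing the book information into a pandas dataframe
#A multi-character separator needs the Python parsing engine; the files are Latin-1 encoded
books_df = pd.read_csv('/resources/data/BX-Books.csv', sep='";"', engine='python', encoding='latin-1')
#Storing the ratings information into a pandas dataframe
ratings_df = pd.read_csv('/resources/data/BX-Book-Ratings.csv', sep=';', encoding='latin-1')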

Let's begin by having you take a peek at how the dataset is organized:

In [None]:
#Your code here


Highlight the green box below for the answer
<Oops! If you can read this, you can press Shift+Enter to convert this cell back to text.>
<p style= "padding: 6px; background-color: white
; border: green 2px solid"> 
<font color = "white">

#Head is a function that gets the first N rows of a dataframe. N's default is 5. <br>
books_df.head()
</font>
</p>

So each book has a unique ISBN, a title, an author, a year of publication, a publisher and three images of the book's cover in varying sizes. Let's first rename the columns to make them easier to access and also remove any non-ASCII characters from the titles.

In [None]:
#New column names
books_df.columns = ['ISBN', 'Title', 'Author', 'Year', 'Publisher', 'ImageS', 'ImageM', 'ImageL']
#Applying a lambda function that removes non-ASCII characters and strips the strings of any whitespace before or after
#the resulting string (str has no decode method in Python 3, so we encode to ASCII and decode back)
books_df['Title'] = books_df['Title'].apply(lambda x: x.encode('ascii', 'ignore').decode('ascii').strip())

Now, it's your turn! Finish cleaning up the dataframe by removing quotation marks from the ISBN column and dropping the three unnecessary image columns.

In [None]:
#Your code here!






Highlight the green box below for the answer
<Oops! If you can read this, you can press Shift+Enter to convert this cell back to text.>
<p style= "padding: 6px; background-color: white
; border: green 2px solid"> 
<font color = "white">

#Dropping the three image columns <br>
books_df = books_df.drop(columns=['ImageS', 'ImageM', 'ImageL']) <br>
#Removing the quotes from the ISBN column <br>
books_df['ISBN'] = books_df['ISBN'].str.replace('"', '') <br>
</font>
</p>

Let's look at the final books dataframe!

In [None]:
books_df.head()

<br>

Next, let's look at the ratings dataframe.

In [None]:
ratings_df.head()

Every row in the ratings dataframe associates a user ID with one book's unique ISBN and the rating that user gave it, ranging from 0 to 10.

Let's just change the name of the columns for ease of access in the future:

In [None]:
ratings_df.columns = ['UserID', 'ISBN', 'Rating']

Here's what the final ratings dataframe looks like:

In [None]:
ratings_df.head()

## 2. Collaborative Filtering

Now, let's start building a recommendation system.

Here's the user we'll be recommending books to:

In [None]:
buffer = [
            {'Title':'Complete Sherlock Holmes', 'Rating':8},
            {'Title':"The Hitchhiker's Guide to the Galaxy", 'Rating':10},
            {'Title':'Pride and Prejudice', 'Rating':6},
            {'Title':'The Adventures of Tom Sawyer', 'Rating':5},
            {'Title':'You Can Surf the Net: Your Guide to the World of the Internet', 'Rating':3}
         ] 
inputUser = pd.DataFrame(buffer)
inputUser

#### Add ISBNs to input user
With the input complete, let's extract the input books' ISBNs from the books dataframe and add them to our input.

We can achieve this by first selecting the rows of the books dataframe whose titles appear in the input and then merging this subset with the input dataframe. We also drop unnecessary columns like Author, Year and Publisher to save memory.
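As a rough illustration of the select, merge and drop pattern described here, the sketch below uses a tiny made-up catalogue rather than the real dataset. The names `catalogue`, `wanted`, `matches` and `merged`, as well as the titles and ISBNs, are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical toy catalogue standing in for the books dataframe (made-up data)
catalogue = pd.DataFrame({
    'ISBN': ['111', '222', '333'],
    'Title': ['Book A', 'Book B', 'Book C'],
    'Author': ['Ann', 'Bob', 'Cy'],
})
# Hypothetical user input: titles with ratings, but no ISBNs yet
wanted = pd.DataFrame({'Title': ['Book A', 'Book C'], 'Rating': [7, 9]})

# Step 1: keep only the catalogue rows whose title appears in the input
matches = catalogue[catalogue['Title'].isin(wanted['Title'])]
# Step 2: merge on the shared 'Title' column so each input row gains its ISBN
merged = pd.merge(matches, wanted, on='Title')
# Step 3: drop the columns that are not needed downstream
merged = merged.drop(columns=['Author'])
print(merged)
```

Naming the merge key explicitly with `on='Title'` is a small design choice made for the sketch: it keeps the join predictable if the two frames ever share more than one column name.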

Try implementing this part yourself!

In [None]:
#Your code here
#1: Implement a way of retrieving information of books inserted through the input from the main Books dataframe and store it

#2: Get the stored information and merge it with the input

#3: Drop the Author, Year and Publisher columns

Highlight the green box below for the answer
<Oops! If you can read this, you can press Shift+Enter to convert this cell back to text.>
<p style= "padding: 6px; background-color: white
; border: green 2px solid"> 
<font color = "white">

#Filtering out the books by title <br>
inputId = books_df[books_df['Title'].isin(inputUser['Title'].tolist())]<br>
#Then merging it so we can get the ISBN. It's implicitly merging it by title.<br>
inputUser = pd.merge(inputId, inputUser)<br>
inputUser = inputUser.drop(columns=['Author', 'Year', 'Publisher'])<br>
#Final input dataframe<br>
#If a book you added above isn't here, then it might not be in the original <br>
#dataframe or it might be spelled differently, so please check the capitalisation.<br>
inputUser.head()<br>
</font>
</p>

Here's a look at the result:

In [None]:
inputUser.head()

#### The users have read the same books
With the book ISBNs in our input, we can now get the subset of users that have read and reviewed the books in our input.

Try implementing this!

In [None]:
#Store the subset of users that have read the same books as our input in the variable below
userSubset = #Your code here


Highlight the green box below for the answer
<Oops! If you can read this, you can press Shift+Enter to convert this cell back to text.>
<p style= "padding: 6px; background-color: white
; border: green 2px solid"> 
<font color = "white">
#Selecting the users that have read books that the input has read and storing them <br>
userSubset = ratings_df[ratings_df['ISBN'].isin(inputUser['ISBN'].tolist())] <br> 
</font>
</p>

We now group up the rows by user ID.

In [None]:
#Groupby creates several sub dataframes where they all have the same value in the column specified as the parameter
#Passing the column name as a plain string keeps the group keys as scalar user IDs
userSubsetGroup = userSubset.groupby('UserID')

Let's look at one of the users, e.g. the one with UserID=11676:

In [1]:
userSubsetGroup.get_group(11676)

Let's also sort these groups so that the users who share the most books with the input have higher priority. Since we won't go through every single user, putting the users with the largest overlap first gives a richer recommendation.
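To make the sorting step concrete, here is a minimal sketch on made-up ratings. It assumes that iterating over a pandas groupby yields `(key, sub-dataframe)` pairs, so the length of each sub-dataframe is the number of shared books. The names `toy_ratings`, `groups`, `ranked` and `ranked_ids` are hypothetical.

```python
import pandas as pd

# Made-up ratings: user 3 shares three books, user 2 two books, user 1 one book
toy_ratings = pd.DataFrame({
    'UserID': [1, 2, 2, 3, 3, 3],
    'ISBN': ['111', '111', '222', '111', '222', '333'],
    'Rating': [5, 8, 6, 9, 7, 10],
})

# Iterating over the groupby yields (UserID, sub-dataframe) pairs
groups = toy_ratings.groupby('UserID')
# Sort the pairs by the number of rows in each sub-dataframe, largest first
ranked = sorted(groups, key=lambda pair: len(pair[1]), reverse=True)
ranked_ids = [user_id for user_id, _ in ranked]
print(ranked_ids)
```

Note that `sorted` returns a plain Python list of pairs, so the result can be sliced like any other list but no longer offers groupby methods such as `get_group`.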

In [None]:
#Sorting it so users with most books in common with the input will have priority
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

Now let's look at the first user:

In [None]:
userSubsetGroup[0]

We will select a subset of users to iterate through. This limit is imposed because we don't want to waste too much time going through every single user.

Try implementing this!

In [None]:
#Simply subset the first 100 users from our group
userSubsetGroup = #Your code here

Highlight the green box below for the answer
<Oops! If you can read this, you can press Shift+Enter to convert this cell back to text.>
<p style= "padding: 6px; background-color: white
; border: green 2px solid"> 
<font color = "white">
userSubsetGroup = userSubsetGroup[0:100]
</font>
</p>

Now, we calculate the Pearson correlation between the input user and each user in the subset group, and store it in a dictionary, where the key is the user ID and the value is the coefficient.
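For reference, the coefficient can be written as r = Sxy / sqrt(Sxx * Syy), where Sxx = sum(x^2) - (sum(x))^2 / n, Syy is the same for y, and Sxy = sum(x*y) - sum(x) * sum(y) / n. The sketch below is a minimal standalone version of that formula on made-up rating lists. The helper name `pearson` is hypothetical, and the sketch assumes both lists are already aligned book by book and have the same length.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long, aligned rating lists."""
    n = len(x)
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A user who gave every book the same rating has zero variance,
    # so the coefficient is undefined; treat it as no correlation
    if sxx == 0 or syy == 0:
        return 0
    return sxy / sqrt(sxx * syy)

same_trend = pearson([8, 10, 6], [4, 5, 3])   # ratings rise and fall together
opposite = pearson([8, 10, 6], [3, 1, 5])     # ratings move in opposite directions
flat = pearson([8, 10, 6], [7, 7, 7])         # second user rates everything the same
print(same_trend, opposite, flat)
```

The first pair shows that the coefficient ignores scale: a user who rates everything at half the input's values still counts as perfectly similar, which is one reason Pearson correlation is a common choice for comparing users with different rating habits.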


In [None]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='ISBN')
    inputUser = inputUser.sort_values(by='ISBN')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the books that they both have in common
    temp_df = inputUser[inputUser['ISBN'].isin(group['ISBN'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['Rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['Rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
pearsonDF = pd.DataFrame(list(pearsonCorrelationDict.items()), columns=['UserID', 'similarityIndex'])
pearsonDF.head()

#### The top x similar users to input user
Now let's get the top 50 users that are most similar to the input.

In [None]:
topUsers = pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Now, let's start recommending books to the input user.

#### Rating of selected users to all books
We're going to do this by taking the weighted average of the ratings of the books, using the Pearson correlation as the weight. To do this, we first need to get the books read by the users in our __topUsers__ dataframe from the ratings dataframe, keeping each user's correlation alongside their ratings in the "similarityIndex" column. This is achieved below by merging the two tables.

In [None]:
topUsersRating = topUsers.merge(ratings_df, left_on='UserID', right_on='UserID', how='inner')
topUsersRating.head()

Now all we need to do is multiply each book rating by its weight (the similarity index), sum up the weighted ratings for each book and divide the result by the sum of the weights.
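As a small worked example of this weighted average, the sketch below uses two made-up users and two made-up books. The names `toy`, `sums` and `scores` are hypothetical, and the similarity values are chosen by hand rather than computed.

```python
import pandas as pd

# Made-up data: user 1 is very similar to the input (1.0), user 2 less so (0.5)
toy = pd.DataFrame({
    'UserID': [1, 1, 2, 2],
    'similarityIndex': [1.0, 1.0, 0.5, 0.5],
    'ISBN': ['111', '222', '111', '222'],
    'Rating': [10, 4, 4, 10],
})

# Weight each rating by the similarity of the user who gave it
toy['weightedRating'] = toy['similarityIndex'] * toy['Rating']
# Per book: sum of the weights and sum of the weighted ratings
sums = toy.groupby('ISBN')[['similarityIndex', 'weightedRating']].sum()
# Weighted average = sum of weighted ratings / sum of weights
scores = sums['weightedRating'] / sums['similarityIndex']
print(scores)
```

Both books received one 10 and one 4, so a plain average would score them equally at 7. The weighted average gives book 111 a score of 8 and book 222 a score of 6, because the more similar user preferred book 111.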

We can easily do this by simply multiplying two columns, then grouping up the dataframe by ISBN and then dividing two columns:

The following cells show how the ratings from all similar users are combined into scores for the candidate books for the input user:

In [None]:
#Multiplies the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['Rating']
topUsersRating.head()

In [None]:
#Applies a sum to the similarity and weighted rating columns after grouping topUsersRating by ISBN
tempTopUsersRating = topUsersRating.groupby('ISBN')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

In [None]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()

Implement the weighted average division!

In [None]:
#Simply apply the division using the correct columns
recommendation_df['weighted average recommendation score'] = #Your code here

Highlight the green box below for the answer
<Oops! If you can read this, you can press Shift+Enter to convert this cell back to text.>
<p style= "padding: 6px; background-color: white
; border: green 2px solid"> 
<font color = "white">
#Now we take the weighted average <br>
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating'] / tempTopUsersRating['sum_similarityIndex'] <br>
</font>
</p>

In [None]:
recommendation_df['ISBN'] = tempTopUsersRating.index
recommendation_df.head()

Now let's sort it and see the top 8 books that the algorithm recommended!

In [None]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head()

In [None]:
books_df.loc[books_df['ISBN'].isin(recommendation_df.head(8)['ISBN'].tolist())]

Author: Gabriel Garcez Barros Sousa

## References
[Book Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)

[Collaborative Filtering Recommender Systems](http://files.grouplens.org/papers/FnT%20CF%20Recsys%20Survey.pdf)

[Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/)