# Recommendation System

Recommendation systems are machine learning systems that analyze the relationships between products and users to help
users discover relevant products.

There are three types of recommendation systems:
1. Collaborative Filtering
1. Content-Based Filtering
1. Hybrid Recommendation Systems

**Collaborative filtering** collects and analyzes information about users' preferences and
predicts what a user will like based on their similarity to other users.

**Content-based filtering** uses item features to recommend items similar to those the user already likes.

**Hybrid recommendation systems** combine collaborative filtering and content-based filtering.

# Book Recommendation System

This notebook builds a simple book recommendation system based on the Book-Crossing Dataset mined by
Cai-Nicolas Ziegler, DBIS Freiburg.

The data was collected from the BookCrossing community, a worldwide community of book lovers who exchange books.

We will use a **collaborative filtering** algorithm to predict a user's interests from
their own preferences and those of similar users. The underlying assumption of
the collaborative filtering approach is that if person A has the same opinion as person B on one issue
(a book they both read), A is more likely to share B's opinion on a different issue (another book)
than the opinion of a randomly chosen person.

Based on the ratings a user has given to books they have already read, the algorithm finds the most "similar"
users and predicts the user's ratings for unread books from the ratings those similar users gave.

To find the most similar users, we will use the **K-Nearest Neighbors** (KNN) algorithm.
To measure the "similarity" between users, we will use the **Hamming distance**.
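To make the distance measure concrete, here is a minimal pure-Python sketch of the Hamming distance as the fraction of positions at which two equal-length vectors differ, which matches the normalized definition used by `scipy.spatial.distance.hamming`. The function name `hamming_fraction` and the toy rating vectors are illustrative assumptions, not part of the notebook.

```python
def hamming_fraction(a, b):
    """Return the fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

# Toy ratings for four books (hypothetical values)
alice = [5, 3, 0, 8]
bob = [5, 0, 0, 7]

print(hamming_fraction(alice, bob))  # two of four positions differ -> 0.5
```

A distance of 0 means the two users gave identical ratings everywhere, and 1 means they disagree at every position.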

### Install all needed libraries

In [85]:
!python3 -m pip install pandas scipy numpy --user



## Data

The dataset is downloaded from [here](http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip).

The downloaded zip file contains the following `csv` files:
- *BX-Books.csv* - contains information about the books
- *BX-Users.csv* - contains information about the users
- *BX-Book-Ratings.csv* - contains the users' book ratings, either explicit on a scale from 1 to 10, or implicit, recorded as 0

### Download Data

In [86]:
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

tempPath = "/tmp/osd-demo/"
zipUrl = "http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip"

with urlopen(zipUrl) as zipresp:
    with ZipFile(BytesIO(zipresp.read())) as zfile:
        zfile.extractall(tempPath)

ratingsFilePath = "/tmp/osd-demo/BX-Book-Ratings.csv"
booksFilePath = "/tmp/osd-demo/BX-Books.csv"
usersFilePath = "/tmp/osd-demo/BX-Users.csv"
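The download cell reads the whole archive into memory and wraps it in `BytesIO`, because `ZipFile` needs a seekable file-like object and the HTTP response is not seekable. The sketch below shows the same pattern on a tiny archive built in memory, so it runs without network access. The file name `example.csv`, its contents, and the temporary directory are made up for illustration.

```python
import os
import tempfile
from io import BytesIO
from zipfile import ZipFile

# Build a tiny archive in memory to stand in for the downloaded bytes
# (the file name and contents are hypothetical)
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "user;isbn;rating\n1;0000000000;5\n")
zip_bytes = buffer.getvalue()

# Same pattern as the download cell: wrap the raw bytes in BytesIO so that
# ZipFile gets a seekable file-like object, then extract everything
target_dir = tempfile.mkdtemp()
with ZipFile(BytesIO(zip_bytes)) as zfile:
    zfile.extractall(target_dir)

extracted = sorted(os.listdir(target_dir))
print(extracted)
```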

### Load the Data into Dataframes

#### Load Libraries

In [87]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import hamming 

#### Load Ratings file

In [88]:
def loadRatings():
    column_names = ["user","isbn","rating"]
    ratings = pd.read_csv(ratingsFilePath, sep=";", header=0, names=column_names, encoding='ISO-8859-1')
    return ratings

ratings = loadRatings()

#### Get the first 10 rows of the ratings

In [89]:
ratings.head(10)

| | user | isbn | rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |
| 5 | 276733 | 2080674722 | 0 |
| 6 | 276736 | 3257224281 | 8 |
| 7 | 276737 | 0600570967 | 6 |
| 8 | 276744 | 038550120X | 7 |
| 9 | 276745 | 342310538 | 10 |


#### Load Books file by extracting only the first three columns.

In [96]:
def loadBooks():
    column_names = ['isbn',"title","author"]
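    # on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0 (requires pandas >= 1.3)
    books = pd.read_csv(booksFilePath, sep=";", header=0, on_bad_lines="skip", usecols=[0, 1, 2],
                        names=column_names, index_col=0, encoding='ISO-8859-1')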
    return books

books = loadBooks()

#### Get the first 10 rows of the books

In [97]:
books.head(10)

| isbn | title | author |
|---|---|---|
| 0195153448 | Classical Mythology | Mark P. O. Morford |
| 0002005018 | Clara Callan | Richard Bruce Wright |
| 0060973129 | Decision in Normandy | Carlo D'Este |
| 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata |
| 0393045218 | The Mummies of Urumchi | E. J. W. Barber |
| 0399135782 | The Kitchen God's Wife | Amy Tan |
| 0425176428 | What If?: The World's Foremost Military Histor... | Robert Cowley |
| 0671870432 | PLEADING GUILTY | Scott Turow |
| 0679425608 | Under the Black Flag: The Romance and the Real... | David Cordingly |
| 074322678X | Where You'll Find Me: And Other Stories | Ann Beattie |


In [98]:
def get_title_and_author_by_isbn(isbn):
    title = books.at[isbn, "title"]
    author = books.at[isbn, "author"]
    return title, author

In [99]:
print(get_title_and_author_by_isbn("038550120X"))
print(get_title_and_author_by_isbn("0060973129"))

('A Painted House', 'JOHN GRISHAM')
('Decision in Normandy', "Carlo D'Este")


In [100]:
ratings = ratings[ratings["isbn"].isin(books.index)]

In [101]:
ratings.shape

(1031175, 3)

### Create Rating Matrix

#### Reduce sparsity

Restrict the ratings to:
- books that were rated by more than 10 users
- users who have rated more than 10 books

In [102]:
def reduceSizeofData(ratings):
    
    # Count how many users rated each book
    usersPerISBN = ratings.isbn.value_counts()
    
    # Count how many books each user rated
    ISBNsPerUser = ratings.user.value_counts()
    
    ratings = ratings[ratings["isbn"].isin(usersPerISBN[usersPerISBN > 10].index)]
    ratings = ratings[ratings["user"].isin(ISBNsPerUser[ISBNsPerUser > 10].index)]
    
    return ratings
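The effect of the two filters is easier to see on a handful of rows. The sketch below applies the same `value_counts` and `isin` pattern to a made-up DataFrame, with a hypothetical `min_count` of 1 standing in for the notebook's threshold of 10. As in the function above, both counts are computed before any rows are removed.

```python
import pandas as pd

# Hypothetical toy ratings: user 1 rated three books, users 2 and 3 fewer
toy = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 3],
    "isbn":   ["A", "B", "C", "A", "B", "A"],
    "rating": [5, 7, 8, 6, 9, 4],
})

min_count = 1  # stand-in for the notebook's threshold of 10

users_per_isbn = toy["isbn"].value_counts()  # number of ratings each book received
isbns_per_user = toy["user"].value_counts()  # number of ratings each user gave

# Keep books with more than min_count ratings, then users with more than min_count ratings
kept = toy[toy["isbn"].isin(users_per_isbn[users_per_isbn > min_count].index)]
kept = kept[kept["user"].isin(isbns_per_user[isbns_per_user > min_count].index)]

print(kept)
```

Book `C` (one rating) and user `3` (one rating) are dropped, leaving the four ratings that users 1 and 2 gave to books `A` and `B`.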

#### Create Rating Matrix
Transform the ratings data into a matrix with ISBNs as columns, user ids as the row index, and ratings as values.

In [103]:
def createRatingMatrix():
    reduced_ratings = reduceSizeofData(ratings)
    
    userBooksRatingMatrix=pd.pivot_table(reduced_ratings,
                                         values='rating',
                                         index=['user'],
                                         columns=['isbn'])
    return userBooksRatingMatrix

userBooksRatingMatrix = createRatingMatrix()
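As a small illustration of what the pivot produces, the sketch below runs the same `pd.pivot_table` call on a made-up ratings frame. Every user and book pair without a rating becomes `NaN`, which is why the real matrix is so sparse. The toy data is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, book) rating
toy = pd.DataFrame({
    "user":   [1, 1, 2, 3],
    "isbn":   ["A", "B", "A", "C"],
    "rating": [5, 7, 6, 4],
})

# One row per user, one column per book, NaN where the user gave no rating
matrix = pd.pivot_table(toy, values="rating", index=["user"], columns=["isbn"])

print(matrix)
```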

#### Show a few rows of the matrix

In [104]:
userBooksRatingMatrix.head()

| user | 0002005018 | 0002251760 | 0002259834 | 0002558122 | 0006480764 | 000648302X | 0006485200 | 000649840X | 000651202X | 0006512062 | ... | 8845906884 | 8845915611 | 8878188212 | 8885989403 | 9074336329 | 9074336469 | 950491036X | 9681500830 | 9681500954 | 9871138016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 99 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 242 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 243 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 254 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


### Calculate the Hamming distance between two users

Take the full rating vectors of both users and compute the Hamming distance between them.

In [105]:
def distance(user1, user2):
    try:
        # Each row of the matrix holds one user's ratings across all books
        user1Ratings = userBooksRatingMatrix.loc[user1]
        user2Ratings = userBooksRatingMatrix.loc[user2]
        return hamming(user1Ratings, user2Ratings)
    except KeyError:
        # One of the users is not in the rating matrix
        return np.nan

In [106]:
distance(204622,10118)

0.9998705585399004
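The distance above is very close to 1. A likely reason is that the rating vectors are mostly `NaN`, and `NaN` never compares equal to anything, itself included, so every book that at least one of the two users has not rated counts as a mismatch. The sketch below shows this on made-up vectors. The helper `co_rated_hamming` is a hypothetical variant that only compares books both users rated; it is one possible alternative, not what this notebook does.

```python
import numpy as np
from scipy.spatial.distance import hamming

# Hypothetical ratings for four books; NaN means "not rated"
u = np.array([5.0, np.nan, np.nan, 3.0])
v = np.array([5.0, np.nan, 4.0, np.nan])

# Positions with a NaN on either side (or both) are counted as mismatches
full_distance = hamming(u, v)

def co_rated_hamming(a, b):
    """Hamming distance restricted to books both users rated (illustrative variant)."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return np.nan  # no overlap, so the distance is undefined
    return float(np.mean(a[both] != b[both]))

print(full_distance)           # three of four positions mismatch
print(co_rated_hamming(u, v))  # the single co-rated book has equal ratings
```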

### Function that finds the K nearest neighbours of a specific user

In [115]:
def nearestNeighbors(user, K = 10):
    
    # Get all user ids (the index of the matrix contains the user ids)
    allUserIds = pd.DataFrame(userBooksRatingMatrix.index)
  
    # Remove the current user from the list; copy so the new column is added to an independent frame
    allUserIds = allUserIds[allUserIds.user != user].copy()
    
    # Add a "distance" column by applying the distance function to each remaining user
    allUserIds["distance"] = allUserIds["user"].apply(lambda x: distance(user, x))
    
    # allUserIds now holds every other user with their distance to the current user.
    # Sort the DataFrame by distance in ascending order and keep the K closest users.
    KnearestUsers = allUserIds.sort_values(["distance"], ascending=True)["user"][:K]
    
    return KnearestUsers

In [116]:
user = 204622
KnearestUsers = nearestNeighbors(user)

In [117]:
KnearestUsers

3201     82893
3368     87555
2624     68555
1813     48046
5401    140036
7584    198711
565      16795
8866    232131
239       7346
9693    251422
Name: user, dtype: int64
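The neighbor search can be traced by hand on a small matrix. The sketch below is a simplified version of the same idea: compute the Hamming distance from one user to every other user, sort, and keep the closest `k`. The names `toy_matrix` and `toy_nearest_neighbors` and all the ratings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import hamming

# Hypothetical user x book matrix, NaN = not rated
toy_matrix = pd.DataFrame(
    {"A": [5.0, 5.0, 1.0, 5.0], "B": [3.0, 3.0, np.nan, 3.0], "C": [8.0, 8.0, 2.0, 7.0]},
    index=pd.Index([10, 20, 30, 40], name="user"),
)

def toy_nearest_neighbors(user, k=2):
    """Return the k user ids with the smallest Hamming distance to `user`."""
    others = toy_matrix.drop(index=user)
    distances = others.apply(lambda row: hamming(toy_matrix.loc[user], row), axis=1)
    return distances.sort_values().index[:k].tolist()

print(toy_nearest_neighbors(10))
```

User 20 has identical ratings (distance 0), user 40 differs on one of three books (distance 1/3), and user 30 differs everywhere (distance 1), so the two nearest neighbors of user 10 are 20 and 40.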

### Find the top N recommendations for a user

In [111]:
def topNRecommendationsPerUser(user, N = 3):
    # Find the K nearest neighbors of the user (K defaults to 10)
    KnearestUserIds = nearestNeighbors(user)
    
    # From the matrix, keep only the ratings given by the nearest neighbors
    NNRatings = userBooksRatingMatrix[userBooksRatingMatrix.index.isin(KnearestUserIds)]
    
    # Calculate the mean rating the nearest neighbors gave each book
    # (DataFrame.mean skips NaN by default) and ignore books with a NaN result,
    # that is, books that none of the neighbors rated
    avgRating = NNRatings.mean().dropna()
    
    # Find the books already rated by the user
    booksAlreadyRead = userBooksRatingMatrix.loc[user].dropna().index
    
    # Remove those books from avgRating (they should not be in the recommendation list)
    avgRating = avgRating[~avgRating.index.isin(booksAlreadyRead)]
    
    # Sort the average ratings in descending order and get the top N books
    topNISBNs = avgRating.sort_values(ascending=False).index[:N]
    
    # Apply get_title_and_author_by_isbn to the top N ISBNs to get the books' metadata
    return pd.Series(topNISBNs).apply(get_title_and_author_by_isbn)

In [19]:
topNRecommendationsPerUser(204813, 10)


0    (Special Operations (Badge of Honor Novels (Pa...
1                       (Lady of Desire, Gaelen Foley)
2         (Up &amp; Out (Red Dress Ink), Ariella Papa)
3             (The Little Drummer Girl, John Le Carre)
4    (Stiff: The Curious Lives of Human Cadavers, M...
5     (A Kiss for Little Bear, Else Holmelund Minarik)
6                         (Name Der Rose, Umberto Eco)
7          (Sabriel (The Abhorsen Trilogy), Garth Nix)
8              (Me Talk Pretty One Day, David Sedaris)
9                    (Mixed Blessings, DANIELLE STEEL)
Name: isbn, dtype: object
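The ranking step can be checked on a small example. The sketch below averages the neighbors' ratings per book, drops books no neighbor rated, removes books the target user already rated, and keeps the highest-rated remainder. The matrix, the user ids, and the assumption that users 20 and 30 are the nearest neighbors are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical user x book matrix, NaN = not rated
toy_matrix = pd.DataFrame(
    {
        "A": [5.0, 4.0, np.nan],
        "B": [np.nan, 9.0, 7.0],
        "C": [np.nan, 2.0, 4.0],
        "D": [np.nan, np.nan, np.nan],
    },
    index=pd.Index([10, 20, 30], name="user"),
)

target_user = 10
neighbor_ids = [20, 30]  # assumed to be the nearest neighbors already

# Column-wise mean over the neighbors; NaN entries are skipped, and books
# that no neighbor rated (D) stay NaN and are dropped
avg_rating = toy_matrix.loc[neighbor_ids].mean().dropna()

# Exclude books the target user has already rated (A)
already_read = toy_matrix.loc[target_user].dropna().index
avg_rating = avg_rating[~avg_rating.index.isin(already_read)]

top_books = avg_rating.sort_values(ascending=False).index[:2].tolist()
print(top_books)
```

Book `B` averages 8.0 and book `C` averages 3.0 among the neighbors, so `B` is recommended first.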