# Example - Book recommendations

### Introduction

**Bookcrossing** is defined as *the practice of leaving a book in a public place to be picked up and read by others, who then do likewise*. The term is derived from BookCrossing (`bookcrossing.com`), a free online book club which was founded to encourage the practice, aiming to *make the whole world a library*.

The crossing or exchanging of books may take any of a number of forms, including wild-releasing books in public, direct swaps with other members of the website, or *book rings* in which books travel in a specified order to participants who want to read a certain book. The community aspect of BookCrossing has grown and expanded in the form of blog or forum discussions, mailing lists and annual conventions throughout the world.

BookCrossing was launched on April 21, 2001. By November 2019, there were over 1.9 million members and over 13 million books traveling through 132 countries. Of these, over 25 thousand books had been newly "released in the wild" in the previous month, across more than 60 countries. Over 80% of the books were released in the eight most active countries (Germany, United States, Spain, Italy, Australia, United Kingdom, the Netherlands and Brazil). The world's first official International BookCrossing Day took place on April 21st, 2014.

Cai-Nicolas Ziegler collected over one million ratings of books from the BookCrossing website. Since the books are identified by ISBN, what we call a book here is really an edition of a book. Popular books, such as *Harry Potter and the Sorcerer's Stone*, have several editions in this database.

### The data

The data for this example come in three tables. The file `book_users.csv` contains information about 278,858 users. The columns are:

* `user`, the user's ID (anonymized).

* `location`, the user's location, as 'city, state, country'. Example: `moscow, yukon territory, russia`.

* `age`, the user's age, in years. Almost 40% of the data are missing.

The file `book_items.csv` contains information about 271,360 books in circulation. The columns are:

* `isbn`, the ISBN code of the book, which works as the book ID. Invalid ISBNs have already been removed.

* `title`, the title of the book.

* `author`, the book's author, obtained from Amazon Web Services. In case of several authors, only the first one is provided. One value missing.

* `year`, the year of publication, obtained from Amazon Web Services. 1.7% of the values missing.

* `publisher`, the book's publisher, obtained from Amazon Web Services. Two values missing.

* `image`, a URL linking to cover images. These URLs point to the Amazon web site. Three values missing.

The file `book_ratings.csv` contains 1,031,136 ratings. The columns are:

* `user`, the user's ID.

* `isbn`, the book's ISBN.

* `rating`, either the rating given by the user (1--10 range) or 0, when the user did not provide a rating.

Source: C-N Ziegler, SM McNee, JA Konstan & G Lausen (2005), Improving Recommendation Lists Through Topic Diversification, in *Proceedings of the 14th International World Wide Web Conference*, Chiba, Japan.

### Importing the data

We start by importing Pandas, as usual.

In [1]:
import pandas as pd

The data consists of three tables. We import the table `ratings` from the GitHub repository as we did with other tables before:

In [2]:
url1 = 'https://raw.githubusercontent.com/cinnData/DataSci/main/'
url2a = '12.%20Recommender%20systems/book_ratings.csv'
urla = url1 + url2a
ratings = pd.read_csv(urla)

In [3]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1031136 entries, 0 to 1031135
Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   user    1031136 non-null  int64 
 1   isbn    1031136 non-null  object
 2   rating  1031136 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 23.6+ MB


The same for the table `users`: 

In [4]:
url2b = '12.%20Recommender%20systems/book_users.csv'
urlb = url1 + url2b
users = pd.read_csv(urlb)

In [5]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   user      278858 non-null  int64  
 1   location  278858 non-null  object 
 2   age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


The table `items` is split into two parts, to comply with the GitHub limitation of 25 MB. We import the two parts and put them together with the Pandas function `concat`. By default, `concat` concatenates data frames vertically (`axis=0`).

In [6]:
url2c1 = '12.%20Recommender%20systems/book_items-1.csv'
urlc1 = url1 + url2c1
items1 = pd.read_csv(urlc1)
url2c2 = '12.%20Recommender%20systems/book_items-2.csv'
urlc2 = url1 + url2c2
items2 = pd.read_csv(urlc2)
items = pd.concat([items1, items2])

In [7]:
items.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271360 entries, 0 to 121359
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   isbn       271360 non-null  object 
 1   title      271360 non-null  object 
 2   author     271359 non-null  object 
 3   year       266742 non-null  float64
 4   publisher  271358 non-null  object 
 5   image      271357 non-null  object 
dtypes: float64(1), object(5)
memory usage: 14.5+ MB
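
The index reported above runs from 0 to 121359 although the table has 271,360 rows, because `concat` keeps the original row labels of each part. The following is a minimal sketch on two toy data frames (`part1` and `part2` are hypothetical stand-ins, not the real files) showing the default behavior and the effect of the `ignore_index` argument:

```python
import pandas as pd

# Toy stand-ins for the two halves of a split table (hypothetical data)
part1 = pd.DataFrame({'isbn': ['A1', 'A2', 'A3'], 'year': [1999, 2001, 1984]})
part2 = pd.DataFrame({'isbn': ['B1', 'B2'], 'year': [2003, 1995]})

# Default: rows are stacked (axis=0) and each part keeps its own index labels
stacked = pd.concat([part1, part2])
print(list(stacked.index))      # labels 0, 1, 2, 0, 1 (repeated)

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([part1, part2], ignore_index=True)
print(list(renumbered.index))   # labels 0, 1, 2, 3, 4
```

Repeated labels do no harm in this example, since the tables are matched on the column `isbn` and not on the index.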


### Merging tables

The Pandas function `merge` joins two data frames based on common columns. By default, this function performs what in SQL is called a **natural join**, that is, it joins the tables using the columns that have the same name in both tables.
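
As an illustration, here is a small sketch with two toy tables (`toy_ratings` and `toy_items` are made up for this purpose). It suggests that calling `merge` without further arguments gives the same result as naming the key column and the join type explicitly:

```python
import pandas as pd

# Hypothetical miniature versions of the ratings and items tables
toy_ratings = pd.DataFrame({'user': [1, 1, 2],
                            'isbn': ['A', 'B', 'A'],
                            'rating': [0, 8, 10]})
toy_items = pd.DataFrame({'isbn': ['A', 'B', 'C'],
                          'title': ['Alpha', 'Beta', 'Gamma']})

# No `on` argument: the shared column name ('isbn') is used as the key
implicit = pd.merge(toy_ratings, toy_items)

# The same join with the key and the join type spelled out
explicit = pd.merge(toy_ratings, toy_items, on='isbn', how='inner')

print(implicit)
print(implicit.equals(explicit))
```

Book `C` has no ratings, so it does not appear in the result. Spelling out `on` and `how` is a matter of taste, but it protects against surprises when two tables happen to share more column names than expected.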

We first join `ratings` with `items`, calling the resulting data frame `df`. The join is based on the column `isbn`.

In [8]:
df = pd.merge(ratings, items)

Next, we merge `df` with `users`, based on the column `user`. Again, we call the resulting data frame `df`:

In [9]:
df = pd.merge(df, users)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031136 entries, 0 to 1031135
Data columns (total 10 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   user       1031136 non-null  int64  
 1   isbn       1031136 non-null  object 
 2   rating     1031136 non-null  int64  
 3   title      1031136 non-null  object 
 4   author     1031135 non-null  object 
 5   year       1017127 non-null  float64
 6   publisher  1031134 non-null  object 
 7   image      1031132 non-null  object 
 8   location   1031136 non-null  object 
 9   age        753301 non-null   float64
dtypes: float64(2), int64(2), object(6)
memory usage: 86.5+ MB


### How often do bookcrossers rate the books they pick?

Compared to e-commerce sites like Amazon, rating in BookCrossing is quite frequent, although more than one half of the books picked are not rated. The books actually rated are those with a positive rating, so we can use the Boolean mask `ratings['rating'] > 0` to capture them. The mean of the mask is the proportion of rated books:

In [10]:
round((ratings['rating'] > 0).mean(), 3)

0.372
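
The calculation works because Pandas treats `True` as 1 and `False` as 0 when averaging. A minimal sketch on a made-up series of five ratings (the name `toy` is hypothetical):

```python
import pandas as pd

# Five hypothetical ratings; 0 means that the user did not rate the book
toy = pd.Series([0, 7, 0, 10, 0])

mask = toy > 0              # Boolean series: False, True, False, True, False
n_rated = mask.sum()        # number of True values
share_rated = mask.mean()   # proportion of True values

print(n_rated, share_rated)
```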

### Which titles have been picked most often?

To answer this question, we apply `value_counts` to the column `title`, which returns the number of times that every title occurs, sorted in descending order:

In [11]:
df['title'].value_counts().head(10)

Wild Animus                                        2502
The Lovely Bones: A Novel                          1295
The Da Vinci Code                                   898
A Painted House                                     838
The Nanny Diaries: A Novel                          828
Bridget Jones's Diary                               815
The Secret Life of Bees                             774
Divine Secrets of the Ya-Ya Sisterhood: A Novel     740
The Red Tent (Bestselling Backlist)                 723
Angels &amp; Demons                                 670
Name: title, dtype: int64

### Which books have been rated highest? 

To calculate the average rating for the books of this collection, we first select the rows in which the book has actually been rated, with `df['rating'] > 0`. Then, we group by title and calculate the average rating. We add the number of ratings for future filtering.

In [12]:
df1 = df[df['rating'] > 0].groupby(by='title')['rating'].agg(['mean', 'count'])

Now we can restrict the selection to the titles that have been rated by at least 25 users and sort by the average rating:

In [13]:
df1[df1['count'] >= 25].sort_values(by='mean', ascending=False).head(10)

| title | mean | count |
|---|---|---|
| The Giving Tree | 9.423077 | 26 |
| Where the Sidewalk Ends : Poems and Drawings | 9.4 | 25 |
| The Two Towers (The Lord of the Rings, Part 2) | 9.330882 | 136 |
| The Cat in the Hat | 9.242424 | 33 |
| Weirdos From Another Planet! | 9.24 | 25 |
| The Return of the King (The Lord of the Rings, Part 3) | 9.213592 | 103 |
| Lonesome Dove | 9.162162 | 37 |
| The Grapes of Wrath | 9.129032 | 31 |
| Harry Potter and the Goblet of Fire (Book 4) | 9.125506 | 247 |
| Roots | 9.12 | 25 |
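
The grouping, filtering and sorting steps can be wrapped in a small function, so that different thresholds are easy to try. This is only a sketch: the helper `top_rated` and the toy data are hypothetical, and the function assumes the column names `title` and `rating` used in this example.

```python
import pandas as pd

def top_rated(data, min_count, n=10):
    """Mean explicit rating per title, for titles with at least min_count ratings."""
    explicit = data[data['rating'] > 0]                    # drop the unrated picks
    stats = explicit.groupby('title')['rating'].agg(['mean', 'count'])
    stats = stats[stats['count'] >= min_count]             # popularity threshold
    return stats.sort_values(by='mean', ascending=False).head(n)

# Made-up data: title A has two ratings, B and C one each
toy = pd.DataFrame({'title': ['A', 'A', 'A', 'B', 'B', 'C'],
                    'rating': [8, 10, 0, 7, 0, 10]})

print(top_rated(toy, min_count=1))
print(top_rated(toy, min_count=2))
```

With `min_count=1`, title C comes first on the strength of a single rating. With `min_count=2`, only A survives. This is the same effect as the one observed in the real data when the threshold changes.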


Raising the threshold to 100 would give us a more homogeneous selection:

In [14]:
df1[df1['count'] >= 100].sort_values(by='mean', ascending=False).head(10)

| title | mean | count |
|---|---|---|
| The Two Towers (The Lord of the Rings, Part 2) | 9.330882 | 136 |
| The Return of the King (The Lord of the Rings, Part 3) | 9.213592 | 103 |
| Harry Potter and the Goblet of Fire (Book 4) | 9.125506 | 247 |
| Harry Potter and the Sorcerer's Stone (Book 1) | 9.0625 | 176 |
| Harry Potter and the Order of the Phoenix (Book 5) | 9.047393 | 211 |
| Harry Potter and the Prisoner of Azkaban (Book 3) | 9.043321 | 277 |
| To Kill a Mockingbird | 8.977528 | 267 |
| Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback)) | 8.936508 | 315 |
| Ender's Game (Ender Wiggins Saga (Paperback)) | 8.92053 | 151 |
| The Fellowship of the Ring (The Lord of the Rings, Part 1) | 8.882927 | 205 |


### Exploring *The Martian Chronicles*

In the second part of this example, we are going to extract a list of 10 books to be recommended to users having read the SF classic *The Martian Chronicles*, using two methods. Before getting specific, we explore the available data on this book.

In [15]:
items[items['title'] == 'The Martian Chronicles']

| | isbn | title | author | year | publisher | image |
|---|---|---|---|---|---|---|
| 175 | 0553278223 | The Martian Chronicles | RAY BRADBURY | 1984.0 | Spectra | http://images.amazon.com/images/P/0553278223.0... |


The ISBN is 0553278223, so we can use this identifier. We have 65 readers:

In [16]:
(ratings['isbn'] == '0553278223').sum()

65

Among these readers, 30 have rated the book:

In [17]:
((ratings['isbn'] == '0553278223') & (ratings['rating'] > 0)).sum()

30

### Restricting the data set

To make the computations manageable, we will restrict the data to those related to *The Martian Chronicles*, which means, in practice, the books picked by the users who have picked *The Martian Chronicles*. These books are the only ones that matter.

We extract first a list of the users that have picked *The Martian Chronicles*, which we call `mc_users`:

In [18]:
mc_users = ratings[ratings['isbn'] == '0553278223']['user']
mc_users

8623          242
10271         882
13574        2313
18161        4098
22814        6280
            ...  
910825     243292
917945     245410
941440     250962
958315     255078
1027938    275922
Name: user, Length: 65, dtype: int64

Second, we extract a list containing all the ratings given by these users, which we call `mc_ratings`. This will be our transactions data set for the rest of the analysis. 

In [19]:
mc_ratings = ratings[ratings['user'].isin(mc_users)]
mc_ratings

| | user | isbn | rating |
|---|---|---|---|
| 8623 | 242 | 0553278223 | 10 |
| 8624 | 242 | 0971880107 | 0 |
| 8625 | 242 | 3150000335 | 10 |
| 8626 | 242 | 3257203659 | 9 |
| 8627 | 242 | 3257207522 | 10 |
| ... | ... | ... | ... |
| 1027974 | 275922 | 1578563763 | 9 |
| 1027975 | 275922 | 1579546463 | 10 |
| 1027976 | 275922 | 1579549586 | 9 |
| 1027977 | 275922 | 1583330844 | 0 |
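
The two-step selection (first the users of the target book, then all the rows of those users) can be checked on a toy table. In this sketch, the data and the book code `X` are made up:

```python
import pandas as pd

# Hypothetical transactions: users 1 and 3 have picked the target book 'X'
toy = pd.DataFrame({'user': [1, 1, 2, 2, 3],
                    'isbn': ['X', 'A', 'B', 'C', 'X'],
                    'rating': [9, 0, 7, 0, 10]})

# Step 1: the users who have picked the target book
target_users = toy[toy['isbn'] == 'X']['user']

# Step 2: every row of those users, whatever the book
subset = toy[toy['user'].isin(target_users)]

print(subset)
```

User 2 has not picked `X`, so the two rows of that user are dropped, while the row in which user 1 picked book `A` is kept.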


Which are the books that have been picked by these users? We get the list, sorted by the number of readers, with `value_counts`:

In [20]:
mc_ratings['isbn'].value_counts()

0553278223    65
0345342968    20
044021145X    19
0971880107    18
0449212602    17
              ..
0441006841     1
0312274742     1
0553213504     1
0918273811     1
0385498861     1
Name: isbn, Length: 28862, dtype: int64

On top, we find *The Martian Chronicles*, with 65 readers. These are the 65 users involved in our transactions data set. Leaving out the first entry, which is *The Martian Chronicles* itself, the ISBNs of the other books involved in the transactions data set are:

In [21]:
mc_isbn = mc_ratings['isbn'].value_counts().index[1:]
mc_isbn

Index(['0345342968', '044021145X', '0971880107', '0449212602', '0316666343',
       '0316769487', '0553572997', '0345337662', '0385504209', '0345313860',
       ...
       '188301106X', '0140280189', '0373765193', '0373160526', '0316279722',
       '0441006841', '0312274742', '0553213504', '0918273811', '0385498861'],
      dtype='object', length=28861)

The exercise consists in selecting 10 books from this list of 28,861.

### Recommendation based on association rules with strong confidence

To produce the top-10 recommendation to the readers of *The Martian Chronicles*, we will use two different approaches. Our first recommendation will be based on the **association rules** with strong **confidence**.

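Dividing the counts by 65, the number of readers of *The Martian Chronicles*, we get the confidence of the rules that have ISBN 0553278223 as the antecedent, that is, the proportion of those readers who have also picked each of the other books: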

In [22]:
mc_conf = mc_ratings['isbn'].value_counts()[1:]/65.
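
For a rule *X → b*, the confidence is the number of users who have picked both *X* and *b* divided by the number of users who have picked *X*. The following sketch reproduces the calculation on a made-up transactions table, in which `X` plays the role of the target book. It assumes that every (user, book) pair appears at most once, so that counting rows is the same as counting users.

```python
import pandas as pd

# Hypothetical transactions, one row per (user, book) pick
picks = pd.DataFrame({'user': [1, 1, 2, 2, 2, 3, 4],
                      'isbn': ['X', 'A', 'X', 'A', 'B', 'X', 'A']})

# Readers of the target book: users 1, 2 and 3
readers_x = picks[picks['isbn'] == 'X']['user']
n_x = len(readers_x)

# Other books picked by those readers, with the number of common readers
others = picks[picks['user'].isin(readers_x) & (picks['isbn'] != 'X')]
co_counts = others['isbn'].value_counts()

# Confidence of the rules X -> A and X -> B
conf = co_counts / n_x
print(conf)
```

User 4 has picked `A` but not `X`, so that pick does not count. The confidence of *X → A* is 2/3 and that of *X → B* is 1/3.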

The ISBNs of the top-10 recommendations are:

In [23]:
mc_conf.index[:10]

Index(['0345342968', '044021145X', '0971880107', '0449212602', '0316666343',
       '0316769487', '0553572997', '0345337662', '0385504209', '0345313860'],
      dtype='object')

Finally, we extract the corresponding titles from the `items` table. We join `mc_conf` with `items` based on the ISBN with `join`. To do this, on one side we transform `mc_conf` into a data frame. On the other side, we set the column `isbn` as the index of `items`. This is needed because we can only join data frames (not series) and `join` uses the index.

In [24]:
pd.DataFrame({'conf': mc_conf[:10]}).join(items.set_index('isbn'))[['title', 'author', 'conf']]

| | title | author | conf |
|---|---|---|---|
| 0345342968 | Fahrenheit 451 | RAY BRADBURY | 0.307692 |
| 044021145X | The Firm | John Grisham | 0.292308 |
| 0971880107 | Wild Animus | Rich Shapero | 0.276923 |
| 0449212602 | The Handmaid's Tale | Margaret Atwood | 0.261538 |
| 0316666343 | The Lovely Bones: A Novel | Alice Sebold | 0.246154 |
| 0316769487 | The Catcher in the Rye | J.D. Salinger | 0.230769 |
| 0553572997 | The Alienist | Caleb Carr | 0.230769 |
| 0345337662 | Interview with the Vampire | Anne Rice | 0.230769 |
| 0385504209 | The Da Vinci Code | Dan Brown | 0.230769 |
| 0345313860 | The Vampire Lestat (Vampire Chronicles, Book II) | ANNE RICE | 0.215385 |
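
The mechanics of this join can be seen more easily on a toy example. In the sketch below, `scores` and `catalog` are made-up stand-ins for `mc_conf` and `items`:

```python
import pandas as pd

# A series of scores indexed by book code (hypothetical values)
scores = pd.Series([0.5, 0.25], index=['A', 'B'])

# A catalog with the book code as an ordinary column
catalog = pd.DataFrame({'isbn': ['A', 'B', 'C'],
                        'title': ['Alpha', 'Beta', 'Gamma']})

# Series -> one-column data frame, catalog -> indexed by book code
left = pd.DataFrame({'conf': scores})
right = catalog.set_index('isbn')

# join matches the rows of the two data frames by index label
result = left.join(right)
print(result)
```

The result keeps the rows and the order of the left data frame, so book `C`, which has no score, is left out.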


### Neighborhood-based recommendation

In the second approach, we use the **neighborhood** of *The Martian Chronicles* defined by the **cosine similarity formula**. We use **implicit ratings**, that is, 1/0 values indicating whether a user has picked a book. 

The cosine formula is:

$$\cos(x,y)={x_1y_1+\cdots+x_ny_n\over\sqrt{\big(x_1^2+\cdots+x_n^2\big)\big(y_1^2+\cdots+y_n^2\big)}}\,.$$

In the application of the formula to this example, $x$ and $y$ are vectors containing the implicit ratings of two books, with one term for every user. The output of the formula is the similarity between the books. Here, one of the two books is *The Martian Chronicles* and we apply the formula to calculate the similarities between that book and those from the list `mc_isbn`. The neighbors of *The Martian Chronicles* will be the books with the highest similarity.

Note that, in this application of the cosine formula, the numerator is zero unless the two books have at least one reader in common. This is why we can restrict the analysis to the books in the list `mc_isbn`.

Now, for a particular book, the numerator of the cosine formula, which in mathematical terms is called a **dot product**, can be calculated as the sum of as many ones as there are users who have read both that book and *The Martian Chronicles*. We have already obtained these counts:
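
With 1/0 vectors, every term of the formula reduces to a count. The following sketch, with two made-up vectors for five hypothetical users, compares the formula as written with the version based on counts:

```python
import numpy as np

# Implicit ratings of two books by five hypothetical users (1 = picked)
x = np.array([1, 1, 1, 0, 0])
y = np.array([1, 1, 0, 1, 0])

# The cosine formula as written
cos_formula = (x @ y) / np.sqrt((x**2).sum() * (y**2).sum())

# The same quantity from counts: common readers and readers of each book
common = int(np.sum((x == 1) & (y == 1)))
cos_counts = common / np.sqrt(x.sum() * y.sum())

print(cos_formula, cos_counts)
```

Both expressions give 2/3 here: the two books have two readers in common and three readers each.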

In [25]:
dotprod = mc_ratings['isbn'].value_counts()[1:]
dotprod

0345342968    20
044021145X    19
0971880107    18
0449212602    17
0316666343    16
              ..
0441006841     1
0312274742     1
0553213504     1
0918273811     1
0385498861     1
Name: isbn, Length: 28861, dtype: int64

The denominator has two factors. If $x$ stands for *The Martian Chronicles*, the sum of squares equals 65, which is the number of readers of that book. So the square root of this sum, which we denote by `mc_mod`, can be obtained as:

In [26]:
import numpy as np
mc_mod = np.sqrt(65)

Suppose that $y$ stands for a particular book from the list `mc_isbn`. The sum of squares equals the sum of as many ones as there are users who have picked the book, which is, again, calculated with `value_counts`. Note that the count is taken on the complete table `ratings`, not on `mc_ratings`, since all the readers of the book matter here. This is calculated only for the books of the list, which are filtered with the Boolean mask `ratings['isbn'].isin(mc_isbn)`.

In [27]:
items_mod = np.sqrt(ratings[ratings['isbn'].isin(mc_isbn)]['isbn'].value_counts())
items_mod

0971880107    50.019996
0316666343    35.986108
0385504209    29.715316
0060928336    27.055499
0312195516    26.888659
                ...    
020131245X     1.000000
0590400215     1.000000
0345224876     1.000000
0822005875     1.000000
0850095824     1.000000
Name: isbn, Length: 28861, dtype: float64

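Putting these pieces together, the cosine similarities are calculated as follows. Although `dotprod` and `items_mod` are sorted in different orders, the division matches the terms by ISBN, because both series are indexed by ISBN: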

In [28]:
cos = dotprod/(mc_mod*items_mod)

In [29]:
cos = cos.sort_values(ascending=False)
cos.head()

0553206966    0.214834
0446325058    0.214834
0440405157    0.214834
0486297659    0.214834
0380730863    0.202548
Name: isbn, dtype: float64
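
The division above pairs numerators and denominators correctly even though the two series are sorted differently, because arithmetic between Pandas series is aligned on the index labels. A minimal sketch with two made-up series (`num` and `den` are hypothetical names):

```python
import pandas as pd

# Same labels, different order
num = pd.Series([20, 18], index=['A', 'B'])
den = pd.Series([9.0, 4.0], index=['B', 'A'])

# The division pairs the terms by label, not by position
ratio = num / den
print(ratio)
```

Here `A` gets 20/4 = 5 and `B` gets 18/9 = 2. Dividing by position would give 20/9 and 18/4 instead.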

The ISBNs of the top-10 recommendations are:

In [30]:
cos.index[:10]

Index(['0553206966', '0446325058', '0440405157', '0486297659', '0380730863',
       '055327449X', '0553096028', '0451524705', '034540291X', '0486287564'],
      dtype='object')

Finally, we extract the corresponding titles from the table `items`, as we did above for the confidence-based recommendation:

In [31]:
pd.DataFrame({'cos': cos[:10]}).join(items.set_index('isbn'))[['title', 'author', 'cos']]

| | title | author | cos |
|---|---|---|---|
| 0553206966 | Demian the Story of Emil Sinclairs Youth | Hermann Hesse | 0.214834 |
| 0446325058 | The Patient in Cabin C | Mignon Good Eberhart | 0.214834 |
| 0440405157 | Two Weirdos and a Ghost | Patricia Windsor | 0.214834 |
| 0486297659 | The Taming of the Shrew (Dover Thrift Editions) | William Shakespeare | 0.214834 |
| 0380730863 | A Medicine for Melancholy and Other Stories | Ray Bradbury | 0.202548 |
| 055327449X | The Illustrated Man (Grand Master Editions) | RAY BRADBURY | 0.194325 |
| 0553096028 | Philadelphia: A Novel | Christopher Davis | 0.186052 |
| 0451524705 | Idylls of the King and a Selection of Poems | Alfred Lord Tennyson | 0.186052 |
| 034540291X | Requiem for Moses (Father Koesler Mystery) | William X. Kienzle | 0.186052 |
| 0486287564 | "Miniver Cheevy" and Other Poems (Dover Thrif... | Edwin Arlington Robinson | 0.175412 |
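
The steps of the neighborhood-based approach can be gathered in a single function. The following is a sketch under stated assumptions, not a definitive implementation: the helper `cosine_neighbors` and the toy data are hypothetical, the table must have the columns `user` and `isbn`, and every (user, book) pair is assumed to appear at most once.

```python
import numpy as np
import pandas as pd

def cosine_neighbors(ratings, target, n=10):
    """Rank books by cosine similarity to `target`, based on 1/0 implicit ratings."""
    # Users who have picked the target book
    target_users = ratings[ratings['isbn'] == target]['user']
    # Picks of those users, leaving out the target book itself
    subset = ratings[ratings['user'].isin(target_users) & (ratings['isbn'] != target)]
    # Numerator: number of common readers for every candidate book
    dotprod = subset['isbn'].value_counts()
    # Total number of readers of every candidate book, from the full table
    readers = ratings[ratings['isbn'].isin(dotprod.index)]['isbn'].value_counts()
    # Cosine similarity, with the two series aligned by ISBN
    cos = dotprod / np.sqrt(len(target_users) * readers)
    return cos.sort_values(ascending=False).head(n)

# Made-up transactions: 'X' is the target book
toy = pd.DataFrame({'user': [1, 1, 2, 2, 2, 3, 4, 4],
                    'isbn': ['X', 'A', 'X', 'A', 'B', 'X', 'A', 'B']})

print(cosine_neighbors(toy, 'X'))
```

In the toy data, `A` shares two of its three readers with `X`, which has three readers, so the similarity is 2/√9 = 2/3. Book `B` shares one of its two readers, which gives 1/√6.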
