# Book Recommendation Dataset
In this notebook, I attempt to reproduce the recommendation engine from chapter 8 using the [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

Source: *Improving Recommendation Lists Through Topic Diversification*,
Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear.



In [1]:
%%capture
!pip install --upgrade fastai

In [2]:
# Data Source http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd

path = untar_data("http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip")

In [3]:
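# The CSVs are read from path.parent, which is where the archive's files end up.
bookratings = pd.read_csv(path.parent / 'BX-Book-Ratings.csv', sep=';', encoding="ISO-8859-1")
# Keep only the first two columns of the books file: ISBN and Book-Title.
booknames = pd.read_csv(path.parent / 'BX-Books.csv', sep=';', encoding="ISO-8859-1",
                        quotechar='"', escapechar='\\', usecols=(0, 1))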

In [4]:
bookratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [5]:
booknames.head()

Unnamed: 0,ISBN,Book-Title
0,0195153448,Classical Mythology
1,0002005018,Clara Callan
2,0060973129,Decision in Normandy
3,0374157065,Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It
4,0393045218,The Mummies of Urumchi


In [6]:
#print(booknames['Book-Title'][booknames['ISBN']=='0195153448'][0])
books = bookratings.merge(booknames, on='ISBN')

In [7]:
books.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title
0,276725,034545104X,0,Flesh Tones: A Novel
1,2313,034545104X,5,Flesh Tones: A Novel
2,6543,034545104X,0,Flesh Tones: A Novel
3,8680,034545104X,5,Flesh Tones: A Novel
4,10314,034545104X,9,Flesh Tones: A Novel


In [8]:
def get_title(isbn, booknames=booknames):
    # ISBNs are stored as strings; return the title of the first exact match.
    return booknames.loc[booknames['ISBN'] == isbn, 'Book-Title'].values[0]
    
get_title("0374157065")

'Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It'

Now we have the user ratings joined to their book titles, plus a helper that looks up a title by ISBN. Let's proceed with our collaborative learner.
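`get_title` rescans the whole frame on every call and, because of `.values[0]`, raises an `IndexError` when the ISBN is missing. If many lookups are needed, a plain dict is one lighter alternative. The sketch below is only an illustration: `title_by_isbn` and `get_title_safe` are hypothetical names, and the two entries are ISBN/title pairs that already appear above.

```python
# ISBN -> title pairs taken from the cells above (ISBNs are strings, leading zeros included).
title_by_isbn = {
    "0195153448": "Classical Mythology",
    "0374157065": ("Flu: The Story of the Great Influenza Pandemic of 1918 "
                   "and the Search for the Virus That Caused It"),
}

def get_title_safe(isbn, lookup=title_by_isbn, default=None):
    """Return the title for `isbn`, or `default` if the ISBN is unknown."""
    return lookup.get(isbn, default)

print(get_title_safe("0195153448"))
print(get_title_safe("no-such-isbn"))
```

On the full data, the same dict could be built with `dict(zip(booknames['ISBN'], booknames['Book-Title']))`.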

## Prepare the Data

In [9]:
dls = CollabDataLoaders.from_df(books, item_name='Book-Title', bs=64)
dls.show_batch()

Unnamed: 0,User-ID,Book-Title,Book-Rating
0,129503,El anatomista,9
1,25601,The Abacus/Book and Abacus,0
2,45971,How the Garcia Girls Lost Their Accents (Plume Contemporary Fiction),10
3,102624,The Blue Suit: A Memoir of Crime,8
4,170264,The Weight of Water,0
5,392,Una casa en Irlanda,9
6,85526,The Northern Lights: The True Story of the Man Who Unlocked the Secrets of the Aurora Borealis,0
7,148258,The Right Word!: How to Say What You Really Mean (Right! Series),10
8,235105,Office of Innocence,7
9,248496,LUCKY,9


In [10]:
books['Book-Rating'].max(), books['Book-Rating'].min()

(10, 0)
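The minimum of 0 deserves a second look. As I understand the Book-Crossing documentation, explicit ratings run from 1 to 10 and a 0 records an implicit interaction rather than a bad score, so treating 0 as the lowest rating may be part of why the losses below are so large. A minimal sketch of separating the two, using the five rows from `bookratings.head()` as toy input (the names `explicit` and `implicit_share` are mine):

```python
# (user_id, isbn, rating) rows copied from the bookratings.head() output above.
rows = [
    (276725, "034545104X", 0),
    (276726, "0155061224", 5),
    (276727, "0446520802", 0),
    (276729, "052165615X", 3),
    (276729, "0521795028", 6),
]

# Keep only explicit ratings (1-10); a 0 is ASSUMED to mean "implicit interaction".
explicit = [r for r in rows if r[2] > 0]

implicit_share = (len(rows) - len(explicit)) / len(rows)
mean_explicit = sum(r[2] for r in explicit) / len(explicit)

print(f"{implicit_share:.0%} of rows are implicit; "
      f"mean explicit rating = {mean_explicit:.2f}")
```

On the merged frame the equivalent filter would be `books[books['Book-Rating'] > 0]`.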

In [11]:
learn = collab_learner(dls, n_factors=50, y_range=(0,10.5))
learn.fit_one_cycle(5,5e-3,wd=0.1)

epoch,train_loss,valid_loss,time
0,13.423079,13.638583,03:54
1,13.661947,13.605501,03:53
2,13.296583,13.434108,03:53
3,12.895421,13.227719,03:53
4,12.179284,13.239008,03:53


### QUESTIONS
- How does it know which column is the user and which is the item? Is there a meaningful difference between the two?
- Why does this one take so much longer to train than the movies dataset? (I haven't actually compared the sizes of the two datasets.)
- It doesn't seem to matter that `User-ID` isn't any kind of contiguous index. Would we need to create one to redo this in the same manual way as we did with the movie dataset, or would the `Embedding` layer take care of that?
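On the last question: an embedding layer is indexed by position, so raw IDs such as 276725 cannot be used directly; they first have to be mapped to contiguous integers. As far as I can tell, `CollabDataLoaders` does this by treating both columns as categories (which is why `dls.classes` exists), so a manual reimplementation would need an equivalent step. A rough sketch of that mapping in plain Python, where `build_index` is a hypothetical helper and index 0 is left free for unknown values (an assumption modelled on fastai's `#na#` entry):

```python
def build_index(values):
    """Map each distinct raw value to a contiguous integer, starting at 1.

    Index 0 is deliberately left unused so it can stand for "unknown".
    """
    vocab = sorted(set(values))
    return {value: i for i, value in enumerate(vocab, start=1)}

# User IDs as they appear in books.head(), with one repeat added.
user_ids = [276725, 2313, 6543, 8680, 10314, 2313]

user_to_idx = build_index(user_ids)
encoded = [user_to_idx[u] for u in user_ids]

print(user_to_idx)
print(encoded)  # these positions are what an embedding table would be indexed by
```

An embedding table for these users would then need `len(user_to_idx) + 1` rows, however large the raw IDs are.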

In [14]:
book_bias = learn.model.i_bias.weight.squeeze()
idxs = book_bias.argsort(descending=True)[:5]
[dls.classes['Book-Title'][i] for i in idxs]

['Harry Potter and the Prisoner of Azkaban (Book 3)',
 'Free',
 'Harry Potter and the Goblet of Fire (Book 4)',
 "Ender's Game (Ender Wiggins Saga (Paperback))",
 'Harry Potter and the Order of the Phoenix (Book 5)']

In [15]:
learn = collab_learner(dls, n_factors=50, y_range=(0,10.5))
learn.fit_one_cycle(5)

epoch,train_loss,valid_loss,time
0,14.531243,14.581391,03:55
1,12.123333,12.555926,03:56
2,11.019968,12.49255,03:57
3,9.853679,12.542894,03:57
4,9.234361,12.552869,03:56


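This run passes no `wd` or learning rate, so fastai falls back to its defaults instead of the explicit `wd=0.1` used earlier, and we have some evidence of overfitting. The training loss falls to 9.23, considerably better than the 12.18 we saw before, but the validation loss bottoms out at 12.49 in epoch 2 and then starts increasing again. Even so, the final validation loss of 12.55 is lower than the first run's 13.24.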
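As a reminder of what the `wd` argument controls: in its usual L2 form, weight decay adds the sum of squared parameters, scaled by `wd`, to the loss, which discourages large weights. The sketch below shows only that arithmetic on made-up numbers; `penalized_loss` is a hypothetical helper, and fastai applies the decay inside the optimizer step rather than by literally modifying the loss.

```python
def penalized_loss(base_loss, weights, wd):
    """Return base_loss plus an L2 penalty: wd * sum(w**2)."""
    return base_loss + wd * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]  # made-up parameters; their squares sum to 5.25
base = 12.0                 # made-up mean squared error

# A larger wd makes the same weights more "expensive".
for wd in (0.0, 0.01, 0.1):
    print(wd, penalized_loss(base, weights, wd))
```

With `wd=0` the penalty vanishes and only the data term is minimized, which is the regime where the embeddings are freest to memorize the training ratings.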

I'm very curious about hyperparameter tuning in these situations.
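One simple starting point for tuning would be a small sweep over `wd`, keeping whichever value gives the lowest final validation loss. The sketch below is generic: `sweep` and `train_and_score` are hypothetical names, and the stand-in scoring function is invented so the example runs without training anything. In this notebook, `train_and_score` would build a fresh `collab_learner`, call `fit_one_cycle`, and return the last validation loss.

```python
def sweep(train_and_score, candidates):
    """Score every candidate and return (best_candidate, all_scores)."""
    scores = {c: train_and_score(c) for c in candidates}
    best = min(scores, key=scores.get)  # lowest validation loss wins
    return best, scores

# Stand-in for a real training run: an INVENTED curve with its minimum at wd=0.1.
def fake_train_and_score(wd):
    return 12.0 + (wd - 0.1) ** 2

best_wd, scores = sweep(fake_train_and_score, [0.01, 0.05, 0.1, 0.2])
print(best_wd, scores)
```

With each five-epoch run taking roughly 20 minutes here, even a four-value sweep is over an hour, so subsampling the data first may be worthwhile.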

In [16]:
book_bias = learn.model.i_bias.weight.squeeze()
idxs = book_bias.argsort(descending=True)[:5]
[dls.classes['Book-Title'][i] for i in idxs]

['Free',
 'Harry Potter and the Prisoner of Azkaban (Book 3)',
 'Harry Potter and the Goblet of Fire (Book 4)',
 "Ender's Game (Ender Wiggins Saga (Paperback))",
 "Harry Potter and the Sorcerer's Stone (Book 1)"]

In [19]:
booknames2 = pd.read_csv(path.parent / 'BX-Books.csv', sep=';', encoding="ISO-8859-1",
                         quotechar='"', escapechar='\\', usecols=(0, 1, 2, 3, 4))
booknames2[booknames2['Book-Title']=='Free']

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
3028,1844262553,Free,Paul Vincent,2003,Upfront Publishing
6204,0385468490,Free,Todd Komarnicki,1993,Bantam Dell Pub Group
8224,0951426133,Free,M. Gilliland,1990,Attack International


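I remain confused by "Free". The lookup shows three different books sharing that title, and because the model uses `Book-Title` as the item, any ratings for them are pooled into a single item. I also wonder whether it has something to do with people looking for "free books".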
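One way to probe this would be to look at how many ratings and how many distinct ISBNs sit behind each title, since a title with only a handful of high ratings can end up with a large bias. The sketch below does that on invented rows (the ISBN strings, users, and ratings are all made up, and `summarize_by_title` is a hypothetical helper), so it shows the check rather than any result about the real data.

```python
from collections import defaultdict

# Toy rows: (user, isbn, title, rating). All values are invented for illustration.
rows = [
    (1, "A1", "Free", 10),
    (2, "B2", "Free", 9),
    (3, "C3", "Free", 10),
    (1, "D4", "Some Popular Novel", 8),
    (2, "D4", "Some Popular Novel", 0),
    (3, "D4", "Some Popular Novel", 7),
    (4, "D4", "Some Popular Novel", 9),
]

def summarize_by_title(rows):
    """Per title: number of ratings, number of distinct ISBNs, and mean rating."""
    ratings, isbns = defaultdict(list), defaultdict(set)
    for _user, isbn, title, rating in rows:
        ratings[title].append(rating)
        isbns[title].add(isbn)
    return {
        title: {
            "n_ratings": len(vals),
            "n_isbns": len(isbns[title]),
            "mean": sum(vals) / len(vals),
        }
        for title, vals in ratings.items()
    }

summary = summarize_by_title(rows)
for title, stats in summary.items():
    print(title, stats)
```

On the merged frame, the same summary could be computed with `books.groupby('Book-Title')` and compared against the titles with the highest bias.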