# Book Recommendation Dataset
In this notebook, I attempt to reproduce the recommendation engine from chapter 8 using the [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

Source: *Improving Recommendation Lists Through Topic Diversification*,
Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear.



In [17]:
# Data Source http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd

path = untar_data("http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip")

In [83]:
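# The CSV files are read from the parent of the path returned by untar_data.
bookratings = pd.read_csv(path.parent / 'BX-Book-Ratings.csv', sep=';', encoding='ISO-8859-1')
# Only the first two columns (ISBN and Book-Title) are needed from the books file.
booknames = pd.read_csv(path.parent / 'BX-Books.csv', sep=';', encoding='ISO-8859-1',
                        quotechar='"', escapechar='\\', usecols=(0, 1))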

In [31]:
bookratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [84]:
booknames.head()

Unnamed: 0,ISBN,Book-Title
0,0195153448,Classical Mythology
1,0002005018,Clara Callan
2,0060973129,Decision in Normandy
3,0374157065,Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It
4,0393045218,The Mummies of Urumchi


In [87]:
#print(booknames['Book-Title'][booknames['ISBN']=='0195153448'][0])
# Inner join on ISBN: ratings whose ISBN is missing from booknames are dropped.
books = bookratings.merge(booknames, on='ISBN')

In [89]:
books.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title
0,276725,034545104X,0,Flesh Tones: A Novel
1,2313,034545104X,5,Flesh Tones: A Novel
2,6543,034545104X,0,Flesh Tones: A Novel
3,8680,034545104X,5,Flesh Tones: A Novel
4,10314,034545104X,9,Flesh Tones: A Novel


In [82]:
def get_title(isbn, booknames=booknames):
    """Return the title of the first row in booknames whose ISBN matches."""
    return booknames.loc[booknames['ISBN'] == isbn, 'Book-Title'].values[0]

get_title("0374157065")

'Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It'

We now have the user ratings joined to their book titles, plus a way to look up a title by ISBN. Let's proceed with the collaborative learner.
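
The lookup in `get_title` scans the whole `ISBN` column on every call. A minimal sketch of a real dictionary lookup follows, using a tiny stand-in frame with the same two columns. `toy_booknames` and `isbn_to_title` are hypothetical names, and the two ISBNs are the ones used in the cells above.

```python
import pandas as pd

# Tiny stand-in for booknames; ISBNs are kept as strings so leading zeros survive.
toy_booknames = pd.DataFrame({
    'ISBN': ['0195153448', '0374157065'],
    'Book-Title': ['Classical Mythology',
                   'Flu: The Story of the Great Influenza Pandemic of 1918 '
                   'and the Search for the Virus That Caused It'],
})

# Build the ISBN -> title mapping once; later lookups do not rescan the column.
isbn_to_title = dict(zip(toy_booknames['ISBN'], toy_booknames['Book-Title']))

print(isbn_to_title['0195153448'])
```

If an ISBN appears more than once, the dictionary keeps the last title seen, whereas `get_title` returns the first match.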

## Prepare the Data

In [90]:
dls = CollabDataLoaders.from_df(books, item_name='Book-Title', bs=64)
dls.show_batch()

Unnamed: 0,User-ID,Book-Title,Book-Rating
0,39646,The Trials of Tiffany Trott,0
1,136306,The Beast,0
2,265595,A Step from Cinnamon Alley,0
3,75860,Billy Boy: A Novel,0
4,254971,Sea Glass: A Novel,0
5,224249,House of Sand and Fog,5
6,180378,"Kid, You Sing My Songs of Love, and Loss, and Hope",7
7,242824,Princess of Fire (Zebra Historical Romance),0
8,35859,Grover's Adventure Under the Sea: Featuring Jim Henson's Sesame Street Muppets (Peek-a-Board Books),0
9,189835,The Lost Princess of Oz,5


In [92]:
books['Book-Rating'].max(), books['Book-Rating'].min()

(10, 0)
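
The minimum of 0 deserves a closer look: in the Book-Crossing data, a rating of 0 marks an implicit interaction rather than a score, while explicit ratings run from 1 to 10. Below is a minimal sketch of checking that split on a hypothetical `toy_ratings` frame (the values are illustrative, not real counts). Whether to drop the zeros before training is a judgment call that this notebook does not make.

```python
import pandas as pd

# Toy stand-in for the ratings columns of `books`.
toy_ratings = pd.DataFrame({
    'User-ID': [276725, 2313, 6543, 8680, 10314],
    'Book-Rating': [0, 5, 0, 5, 9],
})

# Count how often each rating value occurs.
rating_counts = toy_ratings['Book-Rating'].value_counts().sort_index()
print(rating_counts)

# Keep only explicit ratings (assumption: 0 means "implicit", not "worst score").
explicit = toy_ratings[toy_ratings['Book-Rating'] > 0]
print(len(explicit), explicit['Book-Rating'].min(), explicit['Book-Rating'].max())
```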

In [93]:
learn = collab_learner(dls, n_factors=50, y_range=(0,10.5))
learn.fit_one_cycle(5,5e-3,wd=0.1)

epoch,train_loss,valid_loss,time
0,20.10957,,03:00


KeyboardInterrupt: 

### QUESTIONS
- How does `CollabDataLoaders.from_df` know which column is the user and which is the item? Is there a meaningful difference between the two?
- Why does this take so much longer to train than the movies dataset? (I haven't actually compared the dataset sizes.)
- It doesn't seem to matter that `User-ID` isn't a contiguous index. Would we need to create one to redo this in the same manual way as we did with the movie dataset, or would the embedding layer take care of that?
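
On the third question, a hand-written embedding model looks rows up by position, so raw `User-ID` values and titles would first have to be mapped to contiguous indices starting at 0. Here is a minimal sketch using `pandas.factorize` on a hypothetical `toy_books` frame. The column names match `books`, the rows are made up, and `user_idx`, `item_idx`, `n_users`, and `n_items` are illustrative names. The same counts also give a quick way to compare dataset sizes for the second question.

```python
import pandas as pd

# Toy stand-in for `books`; values are illustrative only.
toy_books = pd.DataFrame({
    'User-ID': [276725, 2313, 6543, 2313, 276725],
    'Book-Title': ['Flesh Tones: A Novel', 'Flesh Tones: A Novel',
                   'The Beast', 'The Beast', 'Sea Glass: A Novel'],
    'Book-Rating': [0, 5, 0, 5, 9],
})

# factorize assigns codes 0, 1, 2, ... in order of first appearance and
# returns them together with the unique values they stand for.
toy_books['user_idx'], user_ids = pd.factorize(toy_books['User-ID'])
toy_books['item_idx'], titles = pd.factorize(toy_books['Book-Title'])

# Number of rows, distinct users, and distinct items.
n_users, n_items = len(user_ids), len(titles)
print(len(toy_books), n_users, n_items)
print(toy_books[['user_idx', 'item_idx', 'Book-Rating']])
```

`n_users` and `n_items` are the row counts a pair of embedding tables would need, and `user_ids[code]` recovers the original `User-ID` for any code.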