<a href="https://colab.research.google.com/github/djliden/deep_learning_practice/blob/main/notebooks/Collaborative%20Filtering%20Book%20Recommendations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Book Recommendation Dataset
In this notebook, I attempt to reproduce the recommendation engine from chapter 8 of the fastai book using the [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

Source: *Improving Recommendation Lists Through Topic Diversification*,
Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear.



In [69]:
%%capture
!pip install --upgrade fastai

Note the use of the `%%capture` cell magic: it suppresses the cell's output, which is long and not particularly useful when simply reinstalling or upgrading fastai at the beginning of a Colab session.

## Downloading and Exploring the Data
We will be following the same general process as in chapter 8 using the "Book-Crossing" dataset. Book-Crossing is a website for "releasing your books 'into the wild' for a stranger to find." I didn't actually see book ratings on their website, so I don't have any additional details on the rating system used.

In [70]:
# Data Source http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd

path = untar_data("http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip")

Note that, in this case, the unzipped data do *not* end up in their own separate directory but are unzipped directly into `~/.fastai/data` (the default location for `untar_data`). The returned `path` (apparently) assumes that the archive creates a new directory, so to access the unzipped files we need to look in `path.parent`.
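
A slightly more defensive lookup, sketched under the assumption that the files sit either inside the returned directory or directly beside it, is to try both locations. The helper name `find_extracted` is hypothetical and not part of fastai.

```python
from pathlib import Path

def find_extracted(path, filename):
    """Return the first existing location of `filename`: inside `path`, else beside it."""
    path = Path(path)
    for candidate in (path / filename, path.parent / filename):
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"{filename} not found in {path} or {path.parent}")
```

With this in place, `find_extracted(path, 'BX-Book-Ratings.csv')` could replace `path.parent / 'BX-Book-Ratings.csv'` and would keep working if the archive layout changed.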

In [71]:
bookratings = pd.read_csv(path.parent / 'BX-Book-Ratings.csv', sep=';', encoding = "ISO-8859-1")
booknames = pd.read_csv(path.parent / 'BX-Books.csv', sep=';', encoding = "ISO-8859-1",
                       quotechar='"', escapechar='\\', usecols=(0,1,2))

Parsing the data took some work because several of the fields contain quotes or semicolons. Semicolons are the separator for these `csv` files and quotes mark text fields, so we need to make sure that these characters are handled appropriately when they appear within fields. I also needed to specify the encoding, though I can't say with any certainty why that particular encoding worked; I found it in a Stack Exchange answer.
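
To make the role of the separator, quote, and escape characters concrete, here is a small sketch using Python's built-in `csv` module on a made-up row in the same style as the BX files (the row is invented for illustration). The second half illustrates one possible reason the encoding "worked": ISO-8859-1 assigns a character to every byte value, so decoding never fails, but bytes that were really UTF-8 come out as the wrong characters. Whether that applies to this file is an assumption.

```python
import csv
import io

# Made-up sample in the same style as the BX files: ';' separator,
# '"' around text fields, and a backslash escaping embedded quotes.
sample = 'ISBN;Book-Title;Book-Author\n"0001";"A \\"quoted\\" title; with a semicolon";"Some Author"\n'

reader = csv.reader(io.StringIO(sample), delimiter=';', quotechar='"', escapechar='\\')
rows = list(reader)

# ISO-8859-1 maps every byte value to a character, so decoding never raises,
# but UTF-8 bytes decoded this way turn into the wrong characters.
raw = 'Exupéry'.encode('utf-8')
as_latin1 = raw.decode('ISO-8859-1')
as_utf8 = raw.decode('utf-8')
```

The quoted title keeps both its embedded quotes and its semicolon as part of a single field, which is what `quotechar` and `escapechar` do in the `pd.read_csv` call above.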

In [72]:
bookratings.head(), bookratings.shape

(   User-ID        ISBN  Book-Rating
 0   276725  034545104X            0
 1   276726  0155061224            5
 2   276727  0446520802            0
 3   276729  052165615X            3
 4   276729  0521795028            6, (1149780, 3))

With 1.15 million ratings, this is quite a bit bigger than the MovieLens subset we worked through in chapter 8.

In [73]:
bookratings.nunique()

User-ID        105283
ISBN           340556
Book-Rating        11
dtype: int64

So we have more than 100,000 unique users and more than 340,000 unique books. We also have 11 distinct rating values, on a 0 through 10 scale.

In [74]:
np.sort(bookratings['Book-Rating'].unique())

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
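
One caveat about this scale: the Book-Crossing documentation describes a rating of 0 as implicit (an interaction with no score given) and 1 through 10 as explicit scores. If that distinction matters for the model, a sketch like the following separates the two. It uses only the five ratings shown in `bookratings.head()` above.

```python
from collections import Counter

# The five ratings shown in bookratings.head() above.
ratings = [0, 5, 0, 3, 6]

counts = Counter(ratings)
explicit = [r for r in ratings if r > 0]   # 1-10: an actual score
n_implicit = counts[0]                     # 0: no explicit score given
mean_explicit = sum(explicit) / len(explicit)
```

On the full dataframe the equivalent filter would be `bookratings[bookratings['Book-Rating'] > 0]`.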

### Getting Book Titles
We also want a way to figure out the titles of books. These were stored in a separate csv file, which we extracted above. Let's take a look.

In [75]:
booknames['title-author'] = booknames['Book-Title'] + ' (' + booknames['Book-Author'] + ')'
booknames = booknames.drop(columns=['Book-Title', 'Book-Author'])
booknames.head()

| | ISBN | title-author |
|---|---|---|
| 0 | 195153448 | Classical Mythology (Mark P. O. Morford) |
| 1 | 2005018 | Clara Callan (Richard Bruce Wright) |
| 2 | 60973129 | Decision in Normandy (Carlo D'Este) |
| 3 | 374157065 | Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It (Gina Bari Kolata) |
| 4 | 393045218 | The Mummies of Urumchi (E. J. W. Barber) |


And now we'll merge this into our book ratings dataframe.

In [76]:
books = bookratings.merge(booknames, on='ISBN')
books.head()

| | User-ID | ISBN | Book-Rating | title-author |
|---|---|---|---|---|
| 0 | 276725 | 034545104X | 0 | Flesh Tones: A Novel (M. J. Rose) |
| 1 | 2313 | 034545104X | 5 | Flesh Tones: A Novel (M. J. Rose) |
| 2 | 6543 | 034545104X | 0 | Flesh Tones: A Novel (M. J. Rose) |
| 3 | 8680 | 034545104X | 5 | Flesh Tones: A Novel (M. J. Rose) |
| 4 | 10314 | 034545104X | 9 | Flesh Tones: A Novel (M. J. Rose) |


Now we have our dataframe of users, ratings, and titles. Let's proceed with our collaborative learner.

### Prepare the DataLoaders

In [77]:
dls = CollabDataLoaders.from_df(books,
                                user_name='User-ID',
                                rating_name='Book-Rating',
                                item_name='title-author', bs=64,
                                seed=46)
dls.show_batch()

| | User-ID | title-author | Book-Rating |
|---|---|---|---|
| 0 | 232106 | Confessions of an Ugly Stepsister : A Novel (Gregory Maguire) | 0 |
| 1 | 2132 | Racing Moon (Cassie Edwards) | 0 |
| 2 | 171986 | Stillwatch (MARY HIGGINS CLARK) | 0 |
| 3 | 185233 | Every Day I Pray : Prayers for Awakening to the Grace of Inner Communion (Iyanla Vanzant) | 0 |
| 4 | 135061 | Kittycat's Motor Is Running! (Bil Keane) | 6 |
| 5 | 94496 | The Sisterhood of the Traveling Pants (Ann Brashares) | 0 |
| 6 | 251422 | Murder Gets a Life: : A Southern Sisters Mystery (Southern Sisters Mysteries (Paperback)) (Anne George) | 0 |
| 7 | 179734 | Single Wife: A Novel (Nina Solomon) | 10 |
| 8 | 96294 | TUTANKHAMUN (Thomas Pearsall Field, Hoving) | 0 |
| 9 | 252282 | Catch Me If You Can: The True Story of a Real Fake (Stan Redding) | 0 |


## Fit the First Model

In [78]:
learn = collab_learner(dls, n_factors=50, y_range=(0,10.5))
#learn.lr_find()
learn.fit_one_cycle(5,.1,wd=0.1)

| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 15.497638 | 15.470654 | 03:56 |
| 1 | 15.250862 | 15.386806 | 03:57 |
| 2 | 14.999686 | 14.837164 | 03:59 |
| 3 | 14.077789 | 14.059855 | 04:01 |
| 4 | 13.002513 | 13.337444 | 04:01 |
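
For intuition about what `n_factors` and `y_range` control, here is a pure-Python sketch of a dot-product model with biases whose output is squashed into `y_range`, which is the formulation chapter 8 builds up to. It is an illustration, not fastai's actual code, and the factor and bias values are made up; the real model learns them.

```python
import math

def sigmoid_range(x, lo, hi):
    """Squash any real number into the open interval (lo, hi)."""
    return 1 / (1 + math.exp(-x)) * (hi - lo) + lo

def predict(user_factors, item_factors, user_bias, item_bias, y_range=(0, 10.5)):
    """Dot product of the two factor vectors plus both biases, mapped into y_range."""
    dot = sum(u * i for u, i in zip(user_factors, item_factors))
    return sigmoid_range(dot + user_bias + item_bias, *y_range)

# Made-up 3-factor example (the notebook uses n_factors=50).
pred = predict([0.2, -0.1, 0.4], [0.5, 0.3, -0.2], user_bias=0.1, item_bias=0.3)
```

Because a sigmoid never quite reaches its bounds, setting the upper limit slightly above 10 is what allows predictions to get all the way to a rating of 10.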


### QUESTIONS
- How does it know which is the user and which is the item? **Answer:** *Assumes user name is first column, item name is second, and rating is third. Though in this case I just made it explicit.*
- Why does this one take so much longer than the movies dataset? **Answer:** *The number of observations is much larger.*
- It doesn't seem to matter that the User-ID isn't any kind of index. Would we need to create an index to redo this in the same manual way as we did with the movie dataset? Or would `embedding` at least take care of that?
- How do we interpret the loss in terms of ratings? Would $\sqrt{MSELossFlat}$ be in terms of the original scale? I.e., does a loss of 13.3 translate to an average rating miss of about 3.6 (on the 0 to 10 scale)?
- Hyperparameter tuning in fastai? I am more used to e.g. `scikit-learn` pipelines with grid search. What is the process here, especially for something as time- and resource-intensive as this?
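
On the loss question: the square root of a mean squared error is in the units of the ratings, so the arithmetic can be sketched as below. It uses the final validation loss from the table above. The toy targets and predictions are invented, only to show that RMSE is not the same thing as the mean absolute miss.

```python
import math

valid_mse = 13.337444            # final valid_loss reported above
rmse = math.sqrt(valid_mse)      # back on the 0-10 rating scale

# Invented toy example: RMSE weights large misses more heavily than MAE does.
targets = [0, 0, 8, 10]
preds = [1, 1, 7, 3]
errors = [p - t for p, t in zip(preds, targets)]
toy_rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
toy_mae = sum(abs(e) for e in errors) / len(errors)
```

So a loss of 13.3 corresponds to an RMSE of roughly 3.65 rating points. RMSE is never smaller than the mean absolute error, so the average absolute miss is at most that large and may be noticeably smaller if a few large errors dominate.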

In [79]:
book_bias = learn.model.i_bias.weight.squeeze()
idxs = book_bias.argsort(descending=True)[:5]
[dls.classes['title-author'][i] for i in idxs]

['Harry Potter and the Goblet of Fire (Book 4) (J. K. Rowling)',
 'Free (Paul Vincent)',
 'Harry Potter and the Prisoner of Azkaban (Book 3) (J. K. Rowling)',
 'The Little Prince (Antoine de Saint-Exupéry)',
 'Where the Sidewalk Ends : Poems and Drawings (Shel Silverstein)']

## As a Neural Network

In [80]:
embs = get_emb_sz(dls)
embs

[(81366, 600), (250105, 600)]
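
Both widths come out as 600 because the rule of thumb that fastai appears to use for embedding sizes, `min(600, round(1.6 * n ** 0.56))`, is capped at 600. The sketch below reimplements that rule in plain Python (treat the exact formula as an assumption if your fastai version differs) and counts the embedding weights those sizes imply.

```python
def emb_sz_rule(n_cat):
    """Rule-of-thumb embedding width for a categorical variable with n_cat levels."""
    return min(600, round(1.6 * n_cat ** 0.56))

# Cardinalities reported by get_emb_sz(dls) above.
n_users, n_items = 81366, 250105
sizes = [(n, emb_sz_rule(n)) for n in (n_users, n_items)]

# Number of weights held in the two embedding tables alone.
n_embedding_weights = sum(n * sz for n, sz in sizes)
```

That works out to roughly 199 million embedding weights before any of the dense layers are counted. If memory turns out to be a problem, smaller widths can be requested through the `emb_szs` argument of `collab_learner`.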

In [84]:
learn = collab_learner(dls, use_nn=True, y_range = (0,10.5), layers=[100,50])
learn.lr_find()
#learn.fit_one_cycle(5, 0.1, wd=0.1)

RuntimeError: ignored

In [None]:
learn.recorder.plot_loss()

The final loss was marginally worse than what we obtained before.

In [None]:
book_bias = learn.model.i_bias.weight.squeeze()
idxs = book_bias.argsort(descending=True)[:5]
[dls.classes['title-author'][i] for i in idxs]

Similar. I remain confused by "Free." It doesn't seem to be a particularly popular or well-known book based on my searching. I found a fairly detailed EDA from a Kaggle user [here](https://www.kaggle.com/somnambwl/very-basic-book-recommendations). It found that *Free* was, indeed, one of the most-rated books, with a high average rating.

At the end of this exercise, my biggest outstanding question is: what is a good process for hyperparameter tuning using `fastai`?
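
One generic option, not specific to fastai, is a small grid search run on a subsample of the data so that each trial stays cheap. The sketch below shows only the search loop. `train_and_score` is a hypothetical stand-in that would wrap `collab_learner(...)` and `fit_one_cycle(...)` and return the validation loss; the formula inside it is invented purely so the example runs, and the grid values are arbitrary.

```python
from itertools import product

def train_and_score(n_factors, lr, wd):
    """Hypothetical stand-in: replace the body with real training on a subsample
    and return the validation loss. The formula below is invented."""
    return (n_factors - 50) ** 2 / 1000 + abs(lr - 0.005) * 100 + abs(wd - 0.1)

# Arbitrary example grid; every combination is tried once.
grid = {
    "n_factors": [20, 50, 100],
    "lr": [0.1, 0.005],
    "wd": [0.01, 0.1],
}

results = []
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    results.append((train_and_score(**params), params))

# Keep the combination with the lowest (stand-in) validation loss.
best_loss, best_params = min(results, key=lambda r: r[0])
```

For the learning rate specifically, `learn.lr_find()` (already used in this notebook) is usually a cheaper starting point than adding it to the grid.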

In [None]:
## Book Similarity

book_factors = learn.model.i_weight.weight
idx = dls.classes['title-author'].o2i["Harry Potter and the Goblet of Fire (Book 4) (J. K. Rowling)"]
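# Cosine similarity between every book's factors and the chosen book's factors
similarities = nn.CosineSimilarity(dim=1)(book_factors, book_factors[idx][None])
# Position 0 is the book itself (similarity 1), so take the next-highest
idx_closest = similarities.argsort(descending=True)[1]
dls.classes['title-author'][idx_closest]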

Makes sense. Though to be completely transparent, most of the "book similarities" did *not* make intuitive sense to me.
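
For reference, the cosine-similarity lookup used above can be sketched in plain Python. The three factor vectors are made up. The point is that the query item always ranks first, with a similarity of 1 to itself, which is why the notebook takes position `[1]` rather than `[0]`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up item factors; real ones come from learn.model.i_weight.weight.
factors = {
    "book A": [1.0, 0.0, 1.0],
    "book B": [0.9, 0.1, 1.1],
    "book C": [-1.0, 0.5, 0.0],
}

query = "book A"
ranked = sorted(
    factors,
    key=lambda title: cosine_similarity(factors[title], factors[query]),
    reverse=True,
)
most_similar = ranked[1]   # ranked[0] is the query itself
```

Cosine similarity compares only the direction of the factor vectors, not their length, so two books can score as similar even if one has far fewer ratings behind it. That may be part of why some of the results look unintuitive.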