# Homework: Collaborative Filtering

This notebook is a graded homework that you will turn in. You must complete Problems 1 and 2 and then *either* Problem 3 *or* Problem 4. If you submit a notebook with both Problems 3 and 4 completed, I will grade only Problem 3.

You may:

* Ask questions of and work with other students in this course. You must each write up your own solutions.
* *Special exception: for Problems 1 and 2, you may work with other students and copy each other's code.* Those problems are copied verbatim from the in-class exercises.
* Ask questions of the instructor. Your instructor is generous with hints and guidance.
* Use the [Python Documentation](https://docs.python.org/3.11/), especially the [tutorial section](https://docs.python.org/3/tutorial/index.html).
* Use the PyTorch and fast.ai documentation, as well as any other relevant documentation.
* Use the [W3Schools](https://www.w3schools.com/python/) and [Python for Everybody](https://www.py4e.com/) tutorials/reference guides.

You *may not*:
* Discuss this with anyone outside this course (students in other courses, faculty, your friends, your family, &c.) before it is submitted.
* Use Stack Overflow, Google, ChatGPT or other unspecified resources.


## Problems in this Notebook

1. Collaborative Filtering: New Dataset
2. Cosine Similarity
3. Improve the Model
4. PCA and UMAP Embeddings

In [None]:
# Load some libraries my dudes
from fastai.collab import *
from fastai.tabular.all import *

## 1. Another Dataset

Using a new dataset, fit the best recommender system you can, using the techniques from class. Here are some recommendations (haha) for datasets hosted on Kaggle and around the web. Pick one you know well enough to make the rest of the problems more tractable.

* [Goodreads book ratings](https://www.kaggle.com/datasets/zygmunt/goodbooks-10k)
* [Anime ratings from MyAnimeList](https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020)
* [Board game recommendations from BoardGameGeek](https://www.kaggle.com/datasets/nfedorov/top-2000-board-games-ratings)
* [Steam video game interaction](https://www.kaggle.com/datasets/tamber/steam-video-games)
* [Amazon Music Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html)

Hints to get things running more smoothly:
1. [Rename your columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) to be `"user"`, `"title"`, and `"rating"`. Also, when you create a learner with `collab_learner()`, make sure to set `use_nn=False`. The code for Problem 2 assumes those, and you'll have to either rename your columns here or edit the code below.
2. Enable GPU on your notebook! Some recent update to PyTorch causes a bunch of warnings to pop up.
3. Scale up your batch size as large as you can and still fit into GPU RAM.
4. Scale down your dataset if necessary, using the [sample method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).
5. Make sure your epochs run relatively quickly! We don't have all day, people.

Load in your data and fit a model. You should be able to get the error down to about 15% of the rating scale (RMSE of $\lesssim 0.8$ if ratings are out of 5, or $\lesssim 1.5$ if out of 10).
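To see where the 15% figure comes from, divide the RMSE by the top of the rating scale. The sketch below does that arithmetic in plain Python. The helper names `rmse` and `relative_error` and the toy ratings are made up for illustration; they are not part of fastai, whose `rmse` metric is computed for you during training.

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

def relative_error(rmse_value, scale_max):
    """RMSE expressed as a fraction of the top of the rating scale."""
    return rmse_value / scale_max

# The two thresholds quoted above land at roughly the same fraction.
print(relative_error(0.8, 5))   # ratings out of 5
print(relative_error(1.5, 10))  # ratings out of 10

# Toy example (made-up numbers): predictions vs. true ratings out of 10.
toy_preds = [7.0, 8.5, 4.0, 9.0]
toy_true = [7, 9, 5, 10]
print(rmse(toy_preds, toy_true))
```

If your dataset uses a different scale, the same division tells you which RMSE to aim for.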

In [None]:
# API key redacted: never leave real credentials in a notebook you turn in or share
%env KAGGLE_USERNAME="donnydutch"
%env KAGGLE_KEY="<your-kaggle-api-key>"

!kaggle datasets download -d hernan4444/anime-recommendation-database-2020
!unzip anime-recommendation-database-2020

env: KAGGLE_USERNAME="donnydutch"
env: KAGGLE_KEY="<your-kaggle-api-key>"
Dataset URL: https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020
License(s): CC0-1.0
Downloading anime-recommendation-database-2020.zip to /content
100% 659M/661M [00:31<00:00, 22.2MB/s]
100% 661M/661M [00:31<00:00, 22.3MB/s]
Archive:  anime-recommendation-database-2020.zip
  inflating: anime.csv               
  inflating: anime_with_synopsis.csv  
  inflating: animelist.csv           
  inflating: html folder/html/1/details.html  
  inflating: html folder/html/1/pictures.html  
  inflating: html folder/html/1/recomendations.html  
  inflating: html folder/html/1/reviews_1.html  
  inflating: html folder/html/1/reviews_10.html  
  inflating: html folder/html/1/reviews_11.html  
  inflating: html folder/html/1/reviews_12.html  
  inflating: html folder/html/1/reviews_13.html  
  inflating: html folder/html/1/reviews_14.html  
  inflating: html folder/html/1/reviews_15.html  
 

In [None]:

rating = pd.read_csv("rating_complete.csv")  # seeing what columns we need
rating = rating.sample(n=100000, random_state=42)
rating = rating.dropna()
rating = rating.rename(columns = {"user_id":"user","anime_id":"ID"})
rating

| |user|ID|rating|
|:-:|:-:|:-:|:-:|
|41605545|255419|4059|7|
|33600014|206382|10793|7|
|39884414|245012|34246|4|
|45251042|277697|5030|7|
|53777062|329602|227|7|
|...|...|...|...|
|52643198|322744|2890|9|
|55545607|340485|8769|5|
|33383720|205080|1519|7|
|2805155|17291|33558|7|


In [None]:
anime = pd.read_csv("/content/anime.csv")
anime = anime[["MAL_ID","Name"]]
anime = anime.rename(columns = {"MAL_ID":"ID","Name":"title"})

In [None]:
rating = rating.merge(anime)

In [None]:
rating

| |user|ID|rating|title|
|:-:|:-:|:-:|:-:|:-:|
|0|255419|4059|7|Clannad: Mou Hitotsu no Sekai, Tomoyo-hen|
|1|206382|10793|7|Guilty Crown|
|2|245012|34246|4|Kimi no Mana wa Rina Witch|
|3|277697|5030|7|Maria†Holic|
|4|329602|227|7|FLCL|
|...|...|...|...|...|
|99995|322744|2890|9|Gake no Ue no Ponyo|
|99996|340485|8769|5|Ore no Imouto ga Konnani Kawaii Wake ga Nai|
|99997|205080|1519|7|Black Lagoon: The Second Barrage|
|99998|17291|33558|7|Tales of Zestiria the Cross: Saiyaku no Jidai|


In [None]:
dls = CollabDataLoaders.from_df(rating, item_name='title', bs=32)
dls.show_batch()

| |user|title|rating|
|:-:|:-:|:-:|:-:|
|0|294358|Yoru wa Mijikashi Arukeyo Otome|9|
|1|220587|Araburu Kisetsu no Otome-domo yo.|8|
|2|89443|Another|6|
|3|74400|Love Hina: Motoko no Sentaku, Koi ka Ken... "Naku na"|10|
|4|212745|Senyuu.|5|
|5|32237|Ao no Exorcist Movie|8|
|6|312856|One Piece Film: Strong World|10|
|7|147786|Golden Kamuy 2nd Season|6|
|8|129350|Byousoku 5 Centimeter|9|
|9|55293|Sword Art Online II|9|


In [None]:
x,y = dls.one_batch()

In [None]:
print(x[0:10,:])
print(y[0:10])

tensor([[ 3049,  5186],
        [51892,  2174],
        [26446,  7665],
        [20036,  5963],
        [18392,  6172],
        [54522,  3876],
        [47316,  2052],
        [49528,  2997],
        [ 1633,  1848],
        [49724,  6527]])
tensor([[7],
        [9],
        [4],
        [5],
        [9],
        [8],
        [7],
        [9],
        [8],
        [8]], dtype=torch.int8)


In [None]:
print(len(dls.classes["user"]))
print(len(dls.classes["title"]))

60776
7754


In [None]:
user_embedding = Embedding(60776, 5)
anime_embedding = Embedding(7754, 5)

In [None]:
user_features = user_embedding(x[:,0])
user_features[0:10,:]

tensor([[-0.0028,  0.0060, -0.0054, -0.0075, -0.0060],
        [-0.0143, -0.0132, -0.0180, -0.0089, -0.0026],
        [ 0.0012,  0.0029,  0.0123, -0.0010,  0.0003],
        [ 0.0027, -0.0028,  0.0060,  0.0133,  0.0114],
        [ 0.0056,  0.0027, -0.0146,  0.0015,  0.0095],
        [-0.0048, -0.0050,  0.0052,  0.0006, -0.0003],
        [ 0.0022,  0.0068,  0.0068, -0.0006,  0.0018],
        [ 0.0113,  0.0101,  0.0091, -0.0021, -0.0018],
        [ 0.0045, -0.0139,  0.0099,  0.0005, -0.0083],
        [-0.0032, -0.0081,  0.0066,  0.0073,  0.0047]],
       grad_fn=<SliceBackward0>)

In [None]:
anime_features = anime_embedding(x[:,1])
anime_features[0:10, :]

tensor([[ 1.0931e-02, -1.8621e-03, -1.7358e-02, -6.3469e-04, -1.6913e-03],
        [-9.0159e-03,  1.0898e-03,  3.6968e-03, -8.0159e-03, -1.1336e-03],
        [ 9.3665e-05,  6.4031e-03, -7.8150e-03, -1.0959e-03,  2.0662e-03],
        [-1.3998e-02,  1.9959e-03, -6.1591e-03, -6.9893e-03, -1.9610e-02],
        [-6.7251e-03, -3.8844e-03, -5.4091e-03, -1.0875e-02, -1.0280e-02],
        [ 1.3665e-02, -1.0590e-02,  6.0840e-03,  1.0763e-02, -1.3152e-02],
        [ 3.2062e-03,  1.6077e-03,  5.9711e-03,  1.1561e-02,  4.7092e-03],
        [-4.8273e-03,  5.9940e-03, -4.4720e-03,  8.0280e-03,  2.6282e-03],
        [ 5.8889e-03,  1.4251e-02,  1.9122e-03,  8.3904e-03,  5.0501e-03],
        [ 7.1998e-05, -3.0043e-03, -1.1708e-02, -7.0159e-04,  5.9380e-03]],
       grad_fn=<SliceBackward0>)

In [None]:
predictions = (user_features*anime_features).sum(axis=1)
predictions

tensor([-1.3922e-04, -9.0333e-05, -1.1379e-04, -1.5507e-05,  8.0330e-05,
         1.5089e-04, -7.4593e-05, -4.0581e-04, -7.4510e-06, -2.2548e-05,
        -2.9781e-04, -2.3359e-04, -1.0404e-04, -2.0093e-04, -7.1239e-05,
        -4.0012e-04, -1.0279e-04,  8.8769e-05,  2.1874e-04,  2.5051e-04,
        -2.1546e-04, -7.4139e-05,  3.8021e-04,  6.8862e-04, -3.5406e-04,
         5.7926e-05, -2.5436e-04, -1.9809e-04, -7.1280e-05,  1.8701e-04,
         7.4674e-05,  2.9053e-04], grad_fn=<SumBackward1>)

In [None]:
class DotProduct (Module):
    def __init__(self, n_users, n_animes, n_features):
        self.user_embedding  = Embedding(n_users, n_features)
        self.anime_embedding = Embedding(n_animes, n_features)

    def forward(self, x):
        users  = self.user_embedding(x[:,0])
        animes = self.anime_embedding(x[:,1])
        return (users*animes).sum(dim=1)

In [None]:
n_users  = len(dls.classes["user"])
n_animes = len(dls.classes["title"])
n_features = 50

model = DotProduct(n_users, n_animes, n_features)
# collab_learner() builds its own model and expects n_factors as its second
# positional argument, so passing a custom Module there raises a TypeError.
# A hand-written model goes through the generic Learner instead.
learn = Learner(dls, model, loss_func=MSELossFlat(), metrics=rmse)
learn.lr_find()
#learn.fit_one_cycle(5, 5e-4)

## 2. Cosine Similarity

We can use the embeddings for each title to determine which items are closest to each other (alternatively, which other user has the most similar taste). Imagine each embedding as a vector in space. Two items are similar to each other if the angle between their vectors, $\theta$, is small. This is usually reported as $\cos\theta$, because it's fast to calculate. But it's also useful for humans: $\cos0^\circ=1$, and $\cos90^\circ=0$, so it will always be higher for similar vectors. (In general, for high-dimensional spaces, two random directions are usually close to orthogonal to each other.)
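As a quick illustration of the paragraph above, here is a minimal plain-Python sketch of $\cos\theta = \frac{a\cdot b}{\lVert a\rVert\,\lVert b\rVert}$. The function name `cosine_similarity` and the toy vectors are assumptions for this example only; in the notebook itself you would use PyTorch's `nn.CosineSimilarity` on the real embedding weights.

```python
import math
import random

def cosine_similarity(a, b):
    """cos(theta) between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Parallel vectors (one is just a scaled copy of the other).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors.
right_angle = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Two random directions in a 1000-dimensional space.
rng = random.Random(0)
u = [rng.gauss(0, 1) for _ in range(1000)]
v = [rng.gauss(0, 1) for _ in range(1000)]
random_pair = cosine_similarity(u, v)

print(same_direction, right_angle, random_pair)
```

The parallel pair comes out at 1 (up to rounding), the perpendicular pair at 0, and the random pair lands close to 0. Note that scaling a vector does not change the result, because only the direction matters.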

Below is code [adapted from the textbook](https://github.com/fastai/fastbook/blob/master/08_collab.ipynb) which will find some similar titles based on one submitted. You almost certainly will need to modify the code based on the exact format of your dataset.

(a) First, get the code below to work. It's based on when I did the homework, and I was looking at board games.

In [None]:
# Getting a list of some of the unique titles in the dataset
rating["title"].unique()

In [None]:
# This is the item's name that I'm looking up.
# I did this assignment with board games
itemname = "6 nimmt!"

In [None]:
weights = learn.model.i_weight.weight                  # Grab the embeddings for the items
idx = torch.tensor(dls.classes['title'].o2i[itemname]) # Determine where this item is
idx # This will be 0 if the item you looked up doesn't exist
    # (or you picked the 0th entry, but you should know that)

In [None]:
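# Calculate the cosine similarity between every item and the chosen one
similarities = nn.CosineSimilarity()(weights, weights[idx,None])

# Sort from most to least similar and keep the top 10
# (the first entry should be the item itself, with similarity 1)
top10 = similarities.argsort(descending=True)[0:10]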
dls.classes['title'][top10]

(b) Produce a few recommendation lists based on items in the dataset. Comment on the results, especially relative to the model accuracy you got in Problem 1.

Do you see any trends in the predictions?

(c) If we're using the dot product bias model, do you expect two similar songs of different popularity to have a higher or lower cosine similarity than two very popular songs which are otherwise quite different?

Explain.

## 3. Modifying the Model

To add inputs to the model beyond the user, title, and rating, we're going to have to take a few steps:

1. Change the input data
2. Change the DataLoaders
3. Modify the model to take extra inputs

Let me lead you through those.

(a) Modify the DataFrame that contains your dataset to add at least one continuous variable related to the *item* or *title* that you're making predictions for.

(b) Use a `TabularDataLoaders`, rather than a `CollabDataLoaders`, to load in the data. Make sure your titles and usernames are categorical!

Once you've successfully loaded them in, use your TabularDataLoaders' `one_batch()` method to see what format your data will be passed to your model. You should see a tuple of three tensors: one with the categorical data (one column of ID numbers for the user, one of ID numbers for the title, in order based on how you loaded the data), one with the continuous variables, and one with the target values (the ratings).

(c) Now it's time to modify the model itself. Here's our neural network model from class. You'll need to modify two things:
* In the `__init__()` function, you need to be able to take in additional continuous variables, that is, however many you're using in parts a and b, above.
* In the forward function, you will be passed *three* arguments. In addition to `self`, there will be an argument for the categorical IDs (in the model below, those are just called `x`), but now there will be an additional argument with the numeric values. Make sure the function takes those in as well, and then actually passes them into the neural network you made.

In [None]:
class RecommenderNN (Module):
    def __init__(self, user_sz, item_sz, num_continuous, range=[0.5,5.5], n_act=[200,100]):

        self.user_embedding  = Embedding(*user_sz)
        self.movie_embedding = Embedding(*item_sz)

        # What needs to be modified to take into account the number of continuous
        # variables in the model?

        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act[0]),
            nn.ReLU(),
            nn.Linear(n_act[0], n_act[1]),
            nn.ReLU(),
            nn.Linear(n_act[1], 1))

        self.min = range[0]
        self.max = range[1]

    def forward(self, x): # Add another argument for the continuous inputs, then use them!
        users   = self.user_embedding(x[:,0])
        movies  = self.movie_embedding(x[:,1])
        embeddings = torch.cat([users, movies], dim=1)

        raw_rating = self.layers(embeddings)

        return torch.sigmoid(raw_rating)*(self.max-self.min) + self.min

In [None]:
# Here's some starter code for feeding everything into the model.
# The variables names may or may not match what you've got.

embs = get_emb_sz(dls)

model = RecommenderNN(*embs, 4, [0.5,5.5])
learn = Learner(dls, model, loss_func=MSELossFlat(), metrics=rmse)


(d) Train the model. Do you get an improvement in prediction quality?

(e) We did this with a neural network. Could we have done it with the dot product model instead? Why or why not?

## 4. PCA and UMAP: More Embedding Interpretation

Here are two (hopefully) useful articles to read. Seriously, read them. My explanations below are insufficient and more for me to learn how to write about them than actually useful.
* [Understanding PCA](https://setosa.io/ev/principal-component-analysis/)
* [Understanding UMAP](https://pair-code.github.io/understanding-umap/)

**Principal Component Analysis (PCA)** takes a matrix (like the weights in our model) and finds the best-fit line going through it (which will end up being a combination of each of the dimensions in the data). This is the first principal component. The line won't be a perfect fit, so after removing it, another line is found, which is the 2nd principal component. This can be repeated $N$ times, where $N$ is the number of dimensions the data exist in (in our case, the number of latent factors in the embedding, not the number of unique items). In practice, usually just the first two components are used, so they can be graphed.
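The "find a line, remove it, find the next line" description can be sketched in plain Python using power iteration on the covariance matrix. This is only an illustration under stated assumptions: the helpers `center`, `covariance`, `leading_direction`, and `deflate` and the toy matrix are invented for this example, and this is not necessarily how the `pca` call used later in this notebook is implemented. Real code would normally call a library routine instead.

```python
import math

def center(rows):
    """Subtract each column's mean so the data are centred on the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix (d x d) of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_direction(matrix, steps=200):
    """Power iteration: repeatedly apply the matrix and renormalise.

    Starts from an arbitrary fixed vector (1, 2, ..., d); it only needs to
    not be exactly perpendicular to the direction being sought.
    """
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]
    for _ in range(steps):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def deflate(matrix, v):
    """Remove the variance along unit vector v so the next line can be found."""
    d = len(matrix)
    lam = sum(v[i] * matrix[i][j] * v[j] for i in range(d) for j in range(d))
    return [[matrix[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]

# Toy "embedding" matrix (made-up numbers): 6 items, 3 latent factors.
toy = [[2.0, 2.0, 0.0], [-2.0, -2.0, 0.0], [1.0, -1.0, 0.0],
       [-1.0, 1.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, -0.5]]
centred = center(toy)
cov = covariance(centred)
pc1 = leading_direction(cov)                # first principal component
pc2 = leading_direction(deflate(cov, pc1))  # second principal component

# Project each item onto the first two components to get 2-D plot coordinates.
coords = [(sum(a * b for a, b in zip(row, pc1)),
           sum(a * b for a, b in zip(row, pc2))) for row in centred]
print(pc1, pc2)
print(coords)
```

In this toy data most of the spread lies along the diagonal of the first two factors, so that direction becomes the first component. The last two items differ only in the low-variance third factor, so they collapse onto the origin of the 2-D plot, which is the information you give up by keeping only two components.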

**Uniform Manifold Approximation and Projection (UMAP)** tries to maintain clusters and distances between clusters. It operates not on the weights directly, like PCA does, but finds all pairwise distances between entries, which forms a matrix of distances between items like this:

| |item 1|item 2|item 3|
|:-:|:-:|:-:|:-:|
|item 1|0|1.2|4.2|
|item 2|1.2|0|2.0|
|item 3|4.2|2.0|0|

It then attempts to find a two-dimensional projection that best maintains that distance matrix.
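To make the distance-matrix idea concrete, here is a small plain-Python sketch that builds one from toy embeddings. The function name `pairwise_distances` and the numbers are assumptions for illustration; this shows only the matrix described above, not UMAP itself, which the `umap-learn` library handles for you.

```python
import math

def pairwise_distances(points):
    """Euclidean distance between every pair of rows, as a square matrix."""
    return [[math.dist(p, q) for q in points] for p in points]

# Toy embeddings (made-up numbers): 3 items, 2 latent factors each.
items = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = pairwise_distances(items)
for row in D:
    print(row)
```

Each item is at distance 0 from itself, and the distance from item $i$ to item $j$ equals the distance from $j$ to $i$, so a matrix built this way has a zero diagonal and is symmetric.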

___

(a) Below is code I've more or less copied from the textbook, with some reworking. I've also added comments.

What the code does is cut the dataset down to the 1000 items with the most ratings and perform PCA on them. It then visualizes the positions of the 50 most-rated items, along with labels.

Make the code work so you get your own PCA visualization.

In [None]:
# Get a list of the 1000 most-rated items in the dataset
# This means that these items should have the best-established embeddings
most_rated = rating.groupby('title')['rating'].count()
most_rated = most_rated.sort_values(ascending=False).index.values[:1000]
most_rated[0:10]

In [None]:
# Get the indices for the most-rated items
top_idxs = tensor([learn.dls.classes['title'].o2i[mr] for mr in most_rated])
# Extract the weights for those items
weights = learn.model.i_weight.weight[top_idxs].cpu().detach()
# Perform PCA to get the 3 most informative dimensions
rating_pca = weights.pca(3)

In [None]:
fac0,fac1,fac2 = rating_pca.t()
top50 = list(range(50))

# Extract the arbitrary X and Y axes using factor 0 and 1 from the PCA
X = fac0[top50]
Y = fac1[top50]
plt.figure(figsize=(8,8))
plt.scatter(X, Y)
for i, x, y in zip(most_rated[top50], X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()

(b) Describe the *regions* in your plot, in the context of your dataset. You should be talking about the left/right, top/bottom, or topleft/bottomright &c. distinctions, not talking about specific clusterings. Point out a few examples to illustrate this.

If you check the textbook, it mentions that PCA on movie data seems to find a split between movies which are popular/unpopular amongst viewers (and made a lot of money) on one axis, and ones which are critically acclaimed or panned on another.

(c) Install and use the [UMAP](https://umap-learn.readthedocs.io/en/latest/) library, and repeat part a of this question to get a UMAP visualization. Luckily, UMAP uses essentially the same syntax as Scikit-Learn, which is what was used for PCA.

(d) UMAP preserves groupings and distances between groups, more or less. Repeat part b of this question, describing the clusters you find.