In [1]:
import fastbook
fastbook.setup_book()

In [2]:
from fastbook import *
from fastai.collab import *
from fastai.tabular.all import *

In [3]:
# Downloading the data the usual way
path = untar_data(URLs.ML_100k)

# Extracting the Ratings
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])

# Extracting the Movie Titles
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                    usecols=(0,1), names=('movie', 'title'), header=None)

# Merging the two dataframes
ratings = ratings.merge(movies)

# Creating our DataLoaders
dls = CollabDataLoaders.from_df(ratings,
                                 user_name='user',
                                 item_name='title',
                                 rating_name='rating',
                                 bs=64)

# Initialising our Latent Factors
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

# Collaborative Filtering Deep Dive

## Using fastai.collab

We can create and train a collaborative filtering model with the exact structure shown earlier using fastai's `collab_learner`.

In [4]:
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))

learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.933766,0.952832,00:03
1,0.866779,0.877993,00:03
2,0.749645,0.834382,00:03
3,0.599252,0.823029,00:03
4,0.493067,0.823142,00:03


The names of the layers can be seen by printing the model.

In [5]:
learn.model

EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)
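To make the role of each layer concrete, here is a minimal pure-Python sketch of the prediction a dot-product-plus-bias model makes for one (user, movie) pair: the dot product of the two factor vectors, plus the user bias and the movie bias, squashed into `y_range` with a sigmoid. The function name `predict_rating` and the toy numbers are hypothetical; this illustrates the structure only and is not fastai's own code.

```python
import math

def predict_rating(user_vec, movie_vec, user_bias, movie_bias, y_range=(0, 5.5)):
    """Dot product of the factor vectors plus both biases, scaled into y_range."""
    dot = sum(u * m for u, m in zip(user_vec, movie_vec))
    raw = dot + user_bias + movie_bias
    lo, hi = y_range
    # A sigmoid maps any real number into (0, 1); rescale that to (lo, hi)
    return lo + (hi - lo) / (1 + math.exp(-raw))

# Toy 3-factor example (made-up numbers, not learned weights)
pred = predict_rating([0.5, -0.2, 0.1], [0.4, 0.3, -0.1],
                      user_bias=0.1, movie_bias=0.2)
```

With all factors and biases at zero the sigmoid sits at its midpoint, so the prediction is the middle of `y_range`.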

We can use these to replicate any of the analyses we did in the previous section. For instance, here are the five movies with the highest learned bias:


In [6]:
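# One learned bias per movie; squeeze() drops the trailing size-1 dimension
movie_bias = learn.model.i_bias.weight.squeeze()
# Indices of the five largest biases
idx = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idx]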

['Shawshank Redemption, The (1994)',
 'Titanic (1997)',
 'Silence of the Lambs, The (1991)',
 'L.A. Confidential (1997)',
 'Star Wars (1977)']

Another interesting thing we can do with these learned embeddings is to look at *distance*.

## Embedding Distance

On a two-dimensional map, we can calculate the distance between two coordinates using Pythagoras' theorem: square the distance along each axis, add the squares, and take the square root. For a 50-dimensional embedding, we can do exactly the same thing, except that we add up the squares of all 50 coordinate distances before taking the square root.

If two movies were nearly identical, their embedding vectors would also have to be nearly identical, because the users who like them would be almost exactly the same.
The general idea is that movie similarity can be defined by the similarity of the users who like those movies. It follows that the distance between two movies' embedding vectors can define that similarity.
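As a small illustration of the distance calculation described above, the following sketch computes Euclidean distance in any number of dimensions using only the standard library. The helper name `euclidean_distance` and the example vectors are made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared per-coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two dimensions: the familiar 3-4-5 right triangle
d2 = euclidean_distance([0.0, 0.0], [3.0, 4.0])

# Fifty dimensions: same formula, just more terms in the sum
v = [1.0] * 50
w = [0.0] * 50
d50 = euclidean_distance(v, w)
```

Here `d2` is 5.0, and `d50` is the square root of 50, since each of the 50 coordinates contributes a squared difference of 1.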

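We can use this to find the movie most similar to *Silence of the Lambs*. The code below measures closeness with cosine similarity (the cosine of the angle between two embedding vectors) rather than Euclidean distance, so higher values mean more similar.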

In [9]:
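# Learned movie embedding matrix, one row per title
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
# Despite its name, `distances` holds similarity scores (higher = closer)
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
# The top match should be the movie itself, so take the runner-up
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]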

'12 Angry Men (1957)'
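To show what the cosine-similarity lookup is doing, here is a minimal standard-library sketch on a toy set of embeddings. The titles `'A'`, `'B'`, `'C'`, their 2-factor vectors, and the helper names are all hypothetical. Unlike the cell above, which sorts every score and skips the top entry, this version excludes the query title explicitly.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product over product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(title, embeddings):
    """Title (other than the query) whose vector has the highest cosine similarity."""
    query = embeddings[title]
    scores = {t: cosine_similarity(query, vec)
              for t, vec in embeddings.items() if t != title}
    return max(scores, key=scores.get)

# Made-up 2-factor embeddings, not learned weights
embeddings = {
    'A': [1.0, 0.0],
    'B': [2.0, 0.1],   # almost the same direction as A, but longer
    'C': [0.0, 1.0],   # orthogonal to A
}

nearest = most_similar('A', embeddings)
```

Cosine similarity ignores vector length and compares direction only, which is why `'B'` counts as close to `'A'` even though its magnitude is about twice as large.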