<a href="https://colab.research.google.com/github/arunoda/fastai-v4/blob/master/07_1_colab_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Collaborative Filtering with fast.ai**

## Setting Up on Colab

You only need to run this on Colab.

In [1]:
!pip install fastai2 > /dev/null 2>&1
!git clone https://github.com/arunoda/fastai-v4 > /dev/null 2>&1
%cd fastai-v4

/content/fastai-v4


In [0]:
from fastai2.collab import *
from fastai2.tabular.all import *

## **Loading the Dataset**

Here we are going to use MovieLens 100k, a small version of the MovieLens ratings dataset.

In [3]:
data_path = untar_data(URLs.ML_100k)

In [4]:
!ls -all {data_path}

total 15784
drwxr-xr-x 2 root root    4096 May 29 15:37 .
drwxr-xr-x 3 root root    4096 May 29 15:37 ..
-rw-r--r-- 1 root root     716 May 29 15:37 allbut.pl
-rw-r--r-- 1 root root     643 May 29 15:37 mku.sh
-rw-r--r-- 1 root root    6750 May 29 15:37 README
-rw-r--r-- 1 root root 1586544 May 29 15:37 u1.base
-rw-r--r-- 1 root root  392629 May 29 15:37 u1.test
-rw-r--r-- 1 root root 1583948 May 29 15:37 u2.base
-rw-r--r-- 1 root root  395225 May 29 15:37 u2.test
-rw-r--r-- 1 root root 1582546 May 29 15:37 u3.base
-rw-r--r-- 1 root root  396627 May 29 15:37 u3.test
-rw-r--r-- 1 root root 1581878 May 29 15:37 u4.base
-rw-r--r-- 1 root root  397295 May 29 15:37 u4.test
-rw-r--r-- 1 root root 1581776 May 29 15:37 u5.base
-rw-r--r-- 1 root root  397397 May 29 15:37 u5.test
-rw-r--r-- 1 root root 1792501 May 29 15:37 ua.base
-rw-r--r-- 1 root root  186672 May 29 15:37 ua.test
-rw-r--r-- 1 root root 1792476 May 29 15:37 ub.base
-rw-r--r-- 1 root root  186697 May 29 15:37 ub.test
-rw-r--r-- 

In [5]:
!head {data_path/"u.data"}

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013


This is the file we want. Each line has four tab-separated fields:

```
user, movie, rating, timestamp
```

In [0]:
df_ml = pd.read_csv(data_path/"u.data", delimiter = "\t", header=None, names=["user", "movie", "rating", "timestamp"])


In [7]:
df_ml.head()

   user  movie  rating  timestamp
0   196    242       3  881250949
1   186    302       3  891717742
2    22    377       1  878887116
3   244     51       2  880606923
4   166    346       1  886397596


In [0]:
dls = CollabDataLoaders.from_df(df_ml, bs=5)

In [9]:
dls.classes

{'movie': (#1657) ['#na#',1,2,3,4,5,6,7,8,9...],
 'user': (#944) ['#na#',1,2,3,4,5,6,7,8,9...]}

In [0]:
x, y = dls.one_batch()

In [11]:
x

tensor([[536, 698],
        [307,  99],
        [654,   3],
        [504, 179],
        [125, 372]], device='cuda:0')

In [12]:
y

tensor([[3],
        [4],
        [3],
        [1],
        [1]], device='cuda:0')

So, based on the data above, the first column of `x` is the user and the second is the movie. `y` contains the rating.

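For example, the first row says that **user `536` gave `3` points to movie `698`.** The single-pair walkthrough in the next section uses user `49` and movie `581` as its example.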



## **Idea of Collaborative Filtering**

For now, let's give each movie 2 factors and each user 2 factors. So, it'll look like this:

In [0]:
user_factors = torch.randn((len(dls.classes['user']), 2), requires_grad=True)
movie_factors = torch.randn((len(dls.classes['movie']), 2), requires_grad=True)

Let's pick the factors for user `49` and movie `581`.

In [14]:
user_factors[49], movie_factors[581]

(tensor([-0.6954,  0.6502]), tensor([0.3967, 0.9922]))

Let's create a single number from these by multiplying them element-wise and summing the result, which is their dot product.

In [15]:
(user_factors[49] * movie_factors[581]).sum()

tensor(0.3693)
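To make the arithmetic explicit, here is a minimal sketch that repeats the calculation with the two factor vectors printed above. The names `user_vec` and `movie_vec` are made up for this illustration and the values are copied by hand from the output, so the result only matches to the printed precision.

```python
import torch

# Factor vectors copied from the printed output above (rounded to 4 decimals).
user_vec = torch.tensor([-0.6954, 0.6502])
movie_vec = torch.tensor([0.3967, 0.9922])

# Multiply matching factors, then add the products together.
elementwise = user_vec * movie_vec
score = elementwise.sum()

# The same number, computed with PyTorch's built-in dot product.
same_as_dot = torch.dot(user_vec, movie_vec)

print(elementwise, score, same_as_dot)
```

A large positive score means the user's factors line up with the movie's factors. A negative score means they point in opposite directions.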

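But let's make sure we are within the 0-5 range. The upper bound below is `5.1` rather than `5` because a sigmoid never quite reaches its limit, so a little headroom lets the prediction get to `5`.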

In [0]:
pred_value = sigmoid_range((user_factors[49] * movie_factors[581]).sum(), 0, 5.1)
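For intuition, `sigmoid_range` can be thought of as a sigmoid that is stretched and shifted to sit between the two bounds. The function below is a hand-written sketch of that idea, not the library source. `sigmoid_range_sketch` is a hypothetical name, and the raw score is the rounded value printed earlier.

```python
import torch

def sigmoid_range_sketch(x, low, high):
    # Sigmoid maps any real number into (0, 1); scale and shift it into (low, high).
    return torch.sigmoid(x) * (high - low) + low

raw_score = torch.tensor(0.3693)  # dot product printed earlier (rounded)
pred = sigmoid_range_sketch(raw_score, 0, 5.1)

# Very negative and very positive scores approach the bounds.
low_end = sigmoid_range_sketch(torch.tensor(-20.0), 0, 5.1)
high_end = sigmoid_range_sketch(torch.tensor(20.0), 0, 5.1)

print(pred, low_end, high_end)
```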

Now we need to compare that with the actual value.

In [17]:
(y[0][0] - pred_value).abs()

tensor(0.0155, device='cuda:0')

**That's it. Now we need to get a gradient out of this and update the factors.**
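As a rough sketch of what "get a gradient and update the factors" could look like for this single pair, the snippet below runs one step of plain gradient descent. The factor values are copied from the output above and the target `3.0` is the rating from `y[0][0]`. The learning rate `lr` and the helper `predict` are assumptions made only for this illustration.

```python
import torch

# Values copied from the printed outputs above; lr is an assumed learning rate.
user_vec = torch.tensor([-0.6954, 0.6502], requires_grad=True)
movie_vec = torch.tensor([0.3967, 0.9922], requires_grad=True)
target = torch.tensor(3.0)
lr = 0.001

def predict(u, m):
    # Dot product squashed into (0, 5.1), as in the cells above.
    return torch.sigmoid((u * m).sum()) * 5.1

loss_before = (target - predict(user_vec, movie_vec)).abs()
loss_before.backward()  # fills user_vec.grad and movie_vec.grad

with torch.no_grad():
    # Move each factor a small step against its gradient.
    user_vec -= lr * user_vec.grad
    movie_vec -= lr * movie_vec.grad
    # Reset the gradients so the next backward pass starts from zero.
    user_vec.grad.zero_()
    movie_vec.grad.zero_()

loss_after = (target - predict(user_vec, movie_vec)).abs()
print(loss_before.item(), loss_after.item())
```

With a step this small the loss should shrink slightly. A larger step could overshoot the target, because the starting prediction is already close to `3`.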

## **Doing this for a Batch**

Let's try to do this for our whole mini-batch.


In [18]:
x

tensor([[536, 698],
        [307,  99],
        [654,   3],
        [504, 179],
        [125, 372]], device='cuda:0')

In [19]:
user_factors

tensor([[ 0.6531, -0.0616],
        [ 0.9453, -0.5028],
        [ 0.1189, -0.3562],
        ...,
        [-1.3274, -0.2172],
        [-1.5928, -1.0457],
        [ 1.4170, -0.0804]])

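Basically, we need to pick the rows of `user_factors` for the users in `x`. One way to write that lookup so that gradients clearly flow back into `user_factors` is as a matrix product with a one-hot encoded vector. (PyTorch can also differentiate through plain indexing, but the matrix form makes the mechanics explicit.) So, let's do it.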

In [32]:
one_hot([1], 10)

tensor([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.uint8)

In [0]:
one_hot_i = one_hot(536, len(user_factors)).float()

In [37]:
one_hot_i.shape, user_factors.shape

(torch.Size([944]), torch.Size([944, 2]))

In [38]:
one_hot_i @ user_factors

tensor([-0.2978, -1.0514])

**See. That's supported with gradient calculation.**
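As a sanity check, the sketch below compares the one-hot matrix product with plain indexing on a tiny made-up factor matrix. It uses `torch.nn.functional.one_hot` instead of the `one_hot` helper from the cells above. The sizes, the seed and the variable names are assumptions for this example only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
factors = torch.randn(6, 2, requires_grad=True)  # toy factor matrix: 6 rows, 2 factors
idx = torch.tensor([4, 1, 4])                    # rows to pick; row 4 appears twice

# Lookup written as a matrix product with one-hot rows.
one_hot_rows = F.one_hot(idx, num_classes=factors.shape[0]).float()
via_matmul = one_hot_rows @ factors

# The same lookup written as plain indexing.
via_index = factors[idx]
same_values = torch.allclose(via_matmul, via_index)

# Backpropagate through each form and compare the gradients on `factors`.
via_matmul.sum().backward()
grad_from_matmul = factors.grad.clone()
factors.grad.zero_()
via_index.sum().backward()
grad_from_index = factors.grad.clone()

print(same_values)
print(grad_from_matmul)
```

Both forms pick the same rows and send the same gradient back. Rows that were never picked get a gradient of zero, and a row that was picked twice receives twice the gradient.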

Now let's do it for the whole batch.


In [51]:
one_hot_users = torch.stack([one_hot(i, len(user_factors)) for i in x[:, 0]]).float()
one_hot_users.shape

torch.Size([5, 944])

In [52]:
one_hot_movies = torch.stack([one_hot(i, len(movie_factors)) for i in x[:, 1]]).float()
one_hot_movies.shape

torch.Size([5, 1657])

In [83]:
picked_user_factors = one_hot_users @ user_factors
picked_user_factors

tensor([[-0.4655, -1.4049],
        [ 2.4640, -2.3262],
        [-0.9884, -0.9596],
        [-1.1680, -0.6322],
        [-0.3300,  1.1854]], grad_fn=<MmBackward>)

In [84]:
picked_movie_factors = one_hot_movies @ movie_factors
picked_movie_factors

tensor([[-1.3993, -0.9144],
        [ 0.5534,  0.4997],
        [ 0.4733,  0.0558],
        [ 0.4869,  1.9532],
        [ 1.1472, -0.7067]], grad_fn=<MmBackward>)

In [85]:
score_preds = sigmoid_range((picked_user_factors * picked_movie_factors).sum(dim=1), 0, 5.1)
score_preds

tensor([4.4570, 2.8057, 1.8999, 0.7212, 1.1657], grad_fn=<AddBackward0>)

In [86]:
loss = (score_preds - y.cpu()[:, 0]).abs().sum()
loss

tensor(4.1959, grad_fn=<SumBackward0>)

In [0]:
loss.backward()

In [96]:
movie_factors.grad.sum(), user_factors.grad.sum()

(tensor(2.9811), tensor(-4.3750))

**The summed gradients above have no meaning in themselves. I just wanted to show that getting a gradient is possible.**
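A more informative thing to look at is which rows of each factor matrix received a gradient at all. The sketch below repeats the batch steps on toy data and lists those rows. The sizes, the batch `x`, the ratings `y` and the seed are all made up for this illustration, and `torch.nn.functional.one_hot` stands in for the `one_hot` helper used above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_users, n_movies, n_factors = 6, 8, 2  # toy sizes, not the MovieLens ones
user_factors = torch.randn(n_users, n_factors, requires_grad=True)
movie_factors = torch.randn(n_movies, n_factors, requires_grad=True)

# A made-up batch: columns are (user index, movie index), with one rating per row.
x = torch.tensor([[1, 3], [4, 3], [1, 7]])
y = torch.tensor([3.0, 5.0, 1.0])

# Same steps as the cells above: one-hot lookup, dot product, range squashing, L1 loss.
users = F.one_hot(x[:, 0], n_users).float() @ user_factors
movies = F.one_hot(x[:, 1], n_movies).float() @ movie_factors
preds = torch.sigmoid((users * movies).sum(dim=1)) * 5.1
loss = (preds - y).abs().sum()
loss.backward()

# Indices of the rows whose gradient is not all zeros.
user_rows = user_factors.grad.abs().sum(dim=1).nonzero().flatten().tolist()
movie_rows = movie_factors.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(user_rows, movie_rows)
```

Only users and movies that appear in the batch can show up in these lists, so a single batch only ever adjusts the factors it actually used.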
