# Quick start with the FINN.no recsys slate dataset [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/finn-no/recsys-slates-dataset/blob/master/quickstart-finn-recsys-slate-data.ipynb)

This notebook introduces the dataset released with the paper [Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling](https://arxiv.org/abs/2104.15046).
It is compatible with Google Colab and can be run interactively by using the "Open In Colab" button.

### Install dependencies, download and unzip data
This step is needed on Google Colab; skip it if you have already downloaded the repository manually.

In [None]:
!sudo apt-get install git-lfs -q
!git lfs install
!echo Clone data repository..:
!git clone https://github.com/finn-no/recsys-slates-dataset.git
!echo Unzip datafile..:
!gunzip -c recsys-slates-dataset/data/data.pt.gz >recsys-slates-dataset/data/data.pt

Clone data repository..:
Cloning into 'recsys-slates-dataset'...
remote: Enumerating objects: 121, done.
remote: Counting objects: 100% (121/121), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 121 (delta 57), reused 90 (delta 34), pack-reused 0
Receiving objects: 100% (121/121), 844.46 KiB | 7.68 MiB/s, done.
Resolving deltas: 100% (57/57), done.
Filtering content: 100% (3/3), 1.30 GiB | 35.09 MiB/s, done.
Unzip datafile..:


In [None]:
import torch
import pickle

### Main dataset file `data.pt`
The dataset consists of roughly 2.3M unique users (2,277,645 rows in `data.pt`) who have each interacted up to 20 times with the internet platform and have been exposed to up to 25 items at each interaction.
`data.pt` contains all the slate and click data; the two main arrays are `click` and `action`.
By convention, the first dimension of each array indexes the user, the second the interaction number (time), and the third the position in the presented slate.
All arrays are described below:

| Name        | Dimension           | Description  |
| ------------- |:-------------:| -----:|
| action      | [userId, interaction num, slate pos]| The slates presented to the users |
| click      | [userId, interaction num]      | The item clicked by the user in each slate |
| displayType      | [userId, interaction num]      | Type of interaction the user had with the platform (search or recommendation) |
| click_idx      | [userId, interaction num]      | Auxiliary data: the position of the click in the `action` array (integer from 0 to 24). <br> Useful for e.g. categorical likelihoods |
| lengths      | [userId, interaction num]      | Auxiliary data: the actual length of the slate. <br> Equal to 25 minus the number of pad indices in `action` |
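
The indexing convention can be illustrated with a tiny stand-in for the real tensors. The values below are made up for illustration (2 users, 3 interactions, slates of width 4); only the layout mirrors `data.pt`.

```python
import torch

# Toy stand-in for the real tensors: 2 users, 3 interactions, slate width 4.
# Index 0 is padding and index 1 is the "no-click" item, as in the dataset.
action = torch.tensor([
    [[1, 10, 11, 0], [1, 12, 0, 0], [0, 0, 0, 0]],
    [[1, 20, 21, 22], [1, 23, 24, 0], [1, 25, 0, 0]],
])
click = torch.tensor([
    [11, 1, 0],
    [20, 24, 1],
])

print(action.shape)    # [user, interaction, slate position]
slate = action[1, 2]   # slate shown to user 1 at interaction 2
clicked = click[1, 2]  # item that user clicked in that slate
print(slate, clicked)
```

Indexing with `[user, interaction]` returns one slate from `action` and a single item index from `click`.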





In [None]:
# Load dataset
dat = torch.load("recsys-slates-dataset/data/data.pt")

In [None]:
# Print dimensions of all arrays:
for key, val in dat.items():
  print(f"{key} : \t {val.size()}")

userId : 	 torch.Size([2277645])
lengths : 	 torch.Size([2277645, 20])
displayType : 	 torch.Size([2277645, 20])
action : 	 torch.Size([2277645, 20, 25])
click : 	 torch.Size([2277645, 20])
click_idx : 	 torch.Size([2277645, 20])


#### Example: Get one interaction
Get the presented slate + click for user 5 at interaction number 3

In [None]:
print("Slate:")
print(dat['action'][5,3])
print(" ")
print("Click:")
print(dat['click'][5,3])
print("Type of interaction: (1 implies search, see ind2val file)")
print(dat['displayType'][5,3])

Slate:
tensor([     1, 638995, 638947, 638711, 637590, 637930, 638894,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0])
 
Click:
tensor(637590)
Type of interaction: (1 implies search, see ind2val file)
tensor(1)


From the output above we can see that user 5 at interaction number 3 was presented with a total of 7 items: 6 "real" items and the "no-click" item, which has index 1. The remaining positions in the array are padded with index 0.
The "no-click" item is always present in the slates, since the user always has the option of not clicking on any of the presented items.
Further, we see that the user clicked on the 4th real item, which sits at position 4 of the slate (counting from 0, where the "no-click" item sits).
The slate length and the click position are also available in the following auxiliary arrays:

In [None]:
print("Click_idx:")
print(dat['click_idx'][5,3])
print("lengths:")
print(dat['lengths'][5,3])

Click_idx:
tensor(4)
lengths:
tensor(7)
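
As a sanity check, both auxiliary values can be recomputed from `action` and `click`. The sketch below rebuilds the slate printed above by hand rather than loading `data.pt`, so it runs on its own; the variable names are illustrative.

```python
import torch

# The slate and click of user 5, interaction 3, copied from the output above.
# Shapes are kept as [user, interaction, slate position] and [user, interaction].
action = torch.tensor([[[1, 638995, 638947, 638711, 637590, 637930, 638894] + [0] * 18]])
click = torch.tensor([[637590]])

# Slate length: number of non-pad entries (pad index is 0).
lengths = (action != 0).sum(dim=-1)

# Click position: index of the first slate entry equal to the clicked item.
matches = action == click.unsqueeze(-1)
click_idx = matches.long().argmax(dim=-1)

print(lengths, click_idx)
```

Both results agree with the stored `lengths` and `click_idx` values for this interaction.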


### Index to item file `ind2val.pickle`
This file contains the mappings from indices to values for the attributes userId, itemId, category and displayType.

| Name         | Length           | Description  |
| -------------|:----:| -----:|
| userId       | 2.3M | Scrambled id of users |
| itemId       | 1.3M | Scrambled id of items. <br> The first indices are reserved for the pad, noclick and unk items. |
| category     | 290  | Mapping from the category index to a text string that describes the category and location of the group |
| displayType  | 3    | Whether the presented slate originated from search or recommendations |

#### Example `ind2val`
We print out the first elements of each index.
For example, we see that category 3 is "BAP,antiques,Trøndelag" which implies the category contains antiques sold in the county of Trøndelag.

In [None]:
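# Load the index-to-value mappings; the file handle is closed automatically
with open("recsys-slates-dataset/data/ind2val.pickle", "rb") as f:
  ind2val = pickle.load(f)
for key, val in ind2val.items():
  print(" ")
  print(f"{key} first entries:")
  # Print the first few entries of each mapping
  for idx, name in val.items():
    print(f"{idx}: {name}")
    if idx > 3:
      break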

 
itemId first entries:
0: PAD
1: noClick
2: <UNK>
3: item_3
4: item_4
 
category first entries:
0: PAD
1: noClick
2: <UNK>
3: BAP,antiques,Trøndelag
4: MOTOR,,Sogn og Fjordane
 
displayType first entries:
1: search
2: rec
0: <UNK>
 
userId first entries:
1: user_1
2: user_2
3: user_3
4: user_4


### Item attributes file `itemattr.pickle`
A small attribute file that provides two pieces of information on the items. These are stored as numpy arrays.

| Name        | Dimension           | Description  |
| ------------- |:-------------:| -----:|
| category      | [itemId] | The group that each item belongs to |
| actions       | [itemId] | Auxiliary data: the total number of exposures per item. <br> `-1` is used for the special items (unk, pad, noclick) |


In [None]:
# Load the item attributes; the file handle is closed automatically
with open("recsys-slates-dataset/data/itemattr.pickle", "rb") as f:
  itemattr = pickle.load(f)

for key, val in itemattr.items():
  print(f"{key} : {val.shape}")

print("\nThe full dictionary:")
itemattr

actions : (1311775,)
category : (1311775,)

The full dictionary:


{'actions': array([-1., -1., -1., ..., 39., 14.,  4.]),
 'category': array([  0.,   1.,   2., ..., 289., 289., 289.])}

#### Example `itemattr`
Get the category of the clicked item above (from user 5, interaction number 3)

In [None]:
print("Find the itemId that was clicked by user 5 in interaction 3:")
itemId = [dat['click'][5,3]]
print(f"itemId: {itemId}")

print("\nFind the category index of that item in itemattr:")
cat_idx = itemattr['category'][itemId]
print(f"Category index: {cat_idx}")

print("\nFinally, find the category name by using ind2val:")
cat_name = ind2val['category'][cat_idx.item()]
print(f"Category name: {cat_name}")

Find the itemId that was clicked by user 5 in interaction 3:
itemId: [tensor(637590)]

Find the category index of that item in itemattr:
Category index: [135.]

Finally, find the category name by using ind2val:
Category name: REAL_ESTATE,,Oppland
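
The same lookup works for a whole slate at once, because the attribute arrays are indexed by item index. The following is a minimal sketch with made-up values; `category` stands in for `itemattr['category']`, which is stored as a float array.

```python
import numpy as np
import torch

# Toy category array: one category index per item (items 0-2 are special items).
category = np.array([0., 1., 2., 5., 5., 7.])

# A made-up slate of item indices, padded with 0 as in the dataset.
slate = torch.tensor([1, 3, 5, 4, 0, 0])

# NumPy integer-array indexing returns the category of every slate position.
slate_categories = category[slate.numpy()].astype(int)
print(slate_categories)
```

Each resulting index can then be translated to a name through `ind2val['category']`.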


## Print some statistics about the dataset

In [None]:
print(f"Ratio of no clicks: {(dat['click']==1).sum() / (dat['click']!=0).sum():.2f}")
print(f"Average slate length: {(dat['lengths'][dat['lengths']!=0]).float().mean():.2f}")
print(f"Ratio of slates that are recommendations: {(dat['displayType']==2).sum() / (dat['displayType']!=0).sum():.3f}")
print(f"Average number of interactions per user: {(dat['click']!=0).sum(-1).float().mean():.2f}")

Ratio of no clicks: 0.24
Average slate length: 11.14
Ratio of slates that are recommendations: 0.303
Average number of interactions per user: 16.43
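
The same masking pattern extends to statistics per display type. The sketch below computes the no-click ratio separately for search and recommendation slates on made-up tensors; the codes 1 (search) and 2 (rec) follow the `ind2val` output above.

```python
import torch

# Toy data: 2 users, 3 interactions. click == 0 marks padded interactions.
click = torch.tensor([[10, 1, 0], [1, 11, 12]])
display_type = torch.tensor([[1, 1, 0], [2, 1, 2]])

valid = click != 0
ratios = {}
for code, name in [(1, "search"), (2, "rec")]:
    shown = valid & (display_type == code)  # real interactions of this type
    no_click = shown & (click == 1)         # ... that ended without a click
    ratios[name] = no_click.sum().item() / shown.sum().item()
    print(f"{name}: no-click ratio {ratios[name]:.2f}")
```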


# Directly load PyTorch dataloaders with train/valid/test split
It is also possible to load the dataset directly as PyTorch dataloaders, using the same dataset splits as in the original paper.
Use the `load_dataloaders` function in the `dataset.py` file. It has the following options:

| Argument       | Description  |
| ------------- |-----:|
| batch_size       | Number of unique users sampled in each batch |
| split_trainvalid | Ratio of the full dataset dedicated to train <br> (the remainder is split evenly between valid and test) |
| t_testsplit       | For users in valid and test, <br> the number of initial interactions that belong to the training set |
| sample_uniform_action | If this is True, the exposures in the dataset <br> are sampled as in the `all-item likelihood` (see paper) |

The function returns the same `ind2val` and `itemattr` as above, together with a dictionary containing all the dataloaders.

In [None]:
import dataset
ind2val, itemattr, dataloaders = dataset.load_dataloaders("recsys-slates-dataset/data",
                     batch_size=30000,
                     split_trainvalid=0.9,
                     t_testsplit = 5,
                     sample_uniform_action=False)
print(" ")
print("Dictionary containing the dataloaders:")
print(dataloaders)

2021-05-27 18:25:55,179 Load data..
2021-05-27 18:26:01,887 Loading dataset with action size=torch.Size([2277645, 20, 25]) and uniform candidate sampling=False
2021-05-27 18:26:02,651 In train: num_users: 2277645, num_batches: 76
2021-05-27 18:26:02,651 In valid: num_users: 113882, num_batches: 4
2021-05-27 18:26:02,651 In test: num_users: 113882, num_batches: 4
 
Dictionary containing the dataloaders:
{'train': <torch.utils.data.dataloader.DataLoader object at 0x7fe7c0631b50>, 'valid': <torch.utils.data.dataloader.DataLoader object at 0x7fe7c0631c40>, 'test': <torch.utils.data.dataloader.DataLoader object at 0x7fe7c06315b0>}
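
The user and batch counts in the log can be reproduced with simple arithmetic. This is only a sketch of the bookkeeping, assuming the split sizes are rounded down; it is not the code in `dataset.py`.

```python
import math

num_users = 2277645      # users reported for the train loader in the log
split_trainvalid = 0.9
batch_size = 30000

# The remaining 10% of users is split evenly between valid and test
# (assumption: fractional users are rounded down).
num_valid = int(num_users * (1 - split_trainvalid) / 2)
num_test = num_valid

# Number of batches needed to cover each set of users.
batches = {
    "train": math.ceil(num_users / batch_size),
    "valid": math.ceil(num_valid / batch_size),
    "test": math.ceil(num_test / batch_size),
}
print(num_valid, batches)
```

Note that the train loader reports all users in the log, not only 90% of them.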


In [None]:
batch = next(iter(dataloaders['train']))
for key, val in batch.items():
    print(key, val.size())

userId torch.Size([30000])
lengths torch.Size([30000, 20])
displayType torch.Size([30000, 20])
action torch.Size([30000, 20, 25])
click torch.Size([30000, 20])
click_idx torch.Size([30000, 20])
mask_type torch.Size([30000, 20])


### Masking of train/test/val
Each batch is a dictionary of PyTorch tensors that contains the data fields described above.
In addition, it contains a `mask_type` tensor which indicates whether each click belongs to _train_, _valid_ or _test_.
It has the same shape as the click tensor (`num users * num interactions`).
The full sequence of interactions is returned so that e.g. the test set can use the first clicks of the user (which belong to the training set) to build a user profile.
The mask is defined in the following way:

```
mask2split = {
    0 : 'PAD',
    1 : 'train',
    2 : 'valid',
    3 : 'test'
}
```
A mask value of zero means that the user's sequence is shorter than this index.
The modeler has to take care not to train on elements in the validation or test sets.
Typically, this can be done by masking all losses that do not originate from the training set:

In [None]:
train_mask = (batch['mask_type']==1)
train_mask

tensor([[True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        ...,
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True]])
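
A masked loss could look roughly like the sketch below. The scores are random stand-ins for a hypothetical model output with one score per slate position, and the loss is a categorical likelihood over positions with `click_idx` as the target. A real model would also need to exclude padded slate positions from the softmax.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_users, num_steps, slate_size = 4, 3, 5

# Hypothetical model output: one score per slate position (random here).
scores = torch.randn(num_users, num_steps, slate_size)
click_idx = torch.randint(0, slate_size, (num_users, num_steps))
mask_type = torch.tensor([[1, 1, 1], [1, 1, 2], [1, 3, 3], [1, 1, 0]])

# Negative log-likelihood of the clicked position, one value per interaction.
nll = F.cross_entropy(
    scores.reshape(-1, slate_size), click_idx.reshape(-1), reduction="none"
).reshape(num_users, num_steps)

# Keep only interactions that belong to the training set (mask value 1).
train_mask = mask_type == 1
train_loss = (nll * train_mask).sum() / train_mask.sum()
print(train_loss)
```

Multiplying by the boolean mask zeroes out the validation, test and padded interactions before averaging.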

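For a user in the validation split, the first five interactions (`t_testsplit = 5`) belong to the training set and the remaining ones belong to the validation set. The user at index 1 of this batch (the second user) has mask value 1 throughout, so all of its interactions belong to the training set.
We can extract the clicks that belong to the training set by using `mask_type`: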

In [None]:
print("Mask of user 2:")
print(batch['mask_type'][1,])
print(" ")
print("Clicks belonging to the training set:")
print(train_mask[1,])
print(" ")
print("Select only the clicks in training dataset:")
batch['click'][1,][train_mask[1,]]

Mask of user 2:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
 
Clicks belonging to the training set:
tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True])
 
Select only the clicks in training dataset:


tensor([  73296,   66666, 1154594,  613719,  642978, 1231978, 1231727,       1,
          56397,       0,       0,       0,       0,       0,       0,       0,
              0,       0,       0,       0])
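
To make the `mask2split` convention concrete, the sketch below builds the mask for a single user. `build_mask_type` is a hypothetical helper written for illustration; it follows the description above and is not the implementation in `dataset.py`, which may treat padded positions differently.

```python
import torch

def build_mask_type(num_interactions, split_code, t_testsplit, max_len=20):
    """Hypothetical helper: mask for one user following the mask2split codes."""
    steps = torch.arange(max_len)
    mask = torch.full((max_len,), split_code, dtype=torch.long)
    mask[steps < t_testsplit] = 1        # early interactions count as train
    mask[steps >= num_interactions] = 0  # positions past the sequence are PAD
    return mask

# A validation user (code 2) with 8 interactions and t_testsplit = 5.
valid_user = build_mask_type(num_interactions=8, split_code=2, t_testsplit=5, max_len=10)
print(valid_user)
```

For this user, the first five positions are marked as train, the next three as valid, and the last two as padding.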