## Install the recsys_slates_dataset pip package


In [1]:
!pip install recsys_slates_dataset -q

## Download and load dataloaders that are ready to use
The dataset can be loaded directly as PyTorch dataloaders that use the same dataset splits as the original paper.
Use the `load_dataloaders` function in the `dataset_torch` module. It has the following options:

| Argument       | Description  |
| ------------- |-----:|
| batch_size       | Number of unique users sampled in each batch |
| split_trainvalid | Ratio of full dataset dedicated to train <br> (val/test is split evenly among the rest) |
| t_testsplit       | For users in valid and test, <br> how many interactions should belong to training set |
| sample_candidate_items | Number of negative item examples sampled from the item universe for each interaction. If positive, the dataset provides an additional dictionary item "allitem". See Eide et al. 2021 for more information. |

The function returns `ind2val`, `itemattr` and a dictionary of PyTorch dataloaders for training, validation and test.

In [2]:
import torch
from recsys_slates_dataset import dataset_torch
ind2val, itemattr, dataloaders = dataset_torch.load_dataloaders(data_dir="dat")

print("Dictionary containing the dataloaders:")
print(dataloaders)

2021-11-09 14:08:28,371 Download data if not in data folder..
2021-11-09 14:08:28,372 Downloading data.npz
2021-11-09 14:08:28,372 Downloading ind2val.json
2021-11-09 14:08:28,373 Downloading itemattr.npz
2021-11-09 14:08:28,373 Done downloading all files.
2021-11-09 14:08:28,374 Load data..
2021-11-09 14:08:52,015 Loading dataset with slate size=torch.Size([2277645, 20, 25]) and number of negative samples=False
2021-11-09 14:08:52,037 Loading dataset with slate size=torch.Size([113882, 20, 25]) and number of negative samples=False
2021-11-09 14:08:52,058 Loading dataset with slate size=torch.Size([113882, 20, 25]) and number of negative samples=False
2021-11-09 14:08:52,059 In train: num_users: 2277645, num_batches: 2225
2021-11-09 14:08:52,059 In valid: num_users: 113882, num_batches: 112
2021-11-09 14:08:52,060 In test: num_users: 113882, num_batches: 112


Dictionary containing the dataloaders:
{'train': <torch.utils.data.dataloader.DataLoader object at 0x7f414b28d7f0>, 'valid': <torch.utils.data.dataloader.DataLoader object at 0x7f414b2261c0>, 'test': <torch.utils.data.dataloader.DataLoader object at 0x7f414b226ac0>}
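
The batch counts in the log can be checked by hand. The sketch below assumes a batch size of 1024 users (the leading dimension of each batch tensor printed in the next section) and that the last, partially filled batch is kept; `num_batches` is a helper name introduced here for illustration.

```python
import math

def num_batches(num_users: int, batch_size: int) -> int:
    """Number of batches when the last, partially filled batch is kept."""
    return math.ceil(num_users / batch_size)

# User counts copied from the log output above.
users_per_phase = {"train": 2277645, "valid": 113882, "test": 113882}

for phase, num_users in users_per_phase.items():
    print(f"{phase}: {num_batches(num_users, 1024)} batches")
```

Under these assumptions the counts agree with the log (2225, 112 and 112 batches).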


### Batches
The batches are split by userId and provide the information needed for training. Each element is explained below:

In [3]:
batch = next(iter(dataloaders['train']))
for key, val in batch.items():
    print(key, val.size())

userId torch.Size([1024])
click torch.Size([1024, 20])
click_idx torch.Size([1024, 20])
slate_lengths torch.Size([1024, 20])
slate torch.Size([1024, 20, 25])
interaction_type torch.Size([1024, 20])
phase_mask torch.Size([1024, 20])


### Interaction data (`data.npz`)
The dataset consists of 2.2M unique users who have interacted up to 20 times with the internet platform and have been exposed to up to 25 items at each interaction.
`data.npz` contains all the slate and click data; the two main arrays are `click` and `slate`.
By convention, the first dimension of each array indexes the user, the second indexes time (the interaction number) and the third indexes the position in the presented slate.
All arrays are described below:

| Name        | Dimension           | Description  |
| ------------- |:-------------:| -----:|
| slate      | [userId, interaction num, slate pos]| the presented slates to the users; |
| click      | [userId, interaction num]      | items clicked by the users in each slate |
| interaction_type      | [userId, interaction num]      | type of interaction the user had with the platform (search or recommendation) |
| click_idx      | [userId, interaction num]      | Auxiliary data: the position of the click in the `slate` array (integer from 0-24). <br> Useful for e.g. categorical likelihoods |
| slate_lengths      | [userId, interaction num]      | Auxiliary data: the actual length of the slate. <br> Equal to 25 minus the number of padding indices in the slate |





In [4]:
# Load interaction data
dat = dataloaders['train'].dataset.data

# Print dimensions of all arrays:
for key, val in dat.items():
  print(f"{key} : \t {val.size()}")

userId : 	 torch.Size([2277645])
click : 	 torch.Size([2277645, 20])
click_idx : 	 torch.Size([2277645, 20])
slate_lengths : 	 torch.Size([2277645, 20])
slate : 	 torch.Size([2277645, 20, 25])
interaction_type : 	 torch.Size([2277645, 20])
phase_mask : 	 torch.Size([2277645, 20])


#### Example: Get one interaction
Get the presented slate + click for user 5 at interaction number 3

In [5]:
print("Slate:")
print(dat['slate'][5,3])
print(" ")
print("Click:")
print(dat['click'][5,3])
print("Type of interaction: (1 implies search, see ind2val file)")
print(dat['interaction_type'][5,3])

Slate:
tensor([     1, 638995, 638947, 638711, 637590, 637930, 638894,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0],
       dtype=torch.int32)
 
Click:
tensor(637590, dtype=torch.int32)
Type of interaction: (1 implies search, see ind2val file)
tensor(1, dtype=torch.int32)


From the above extraction we can see that user 5 at interaction number 3 was presented with a total of 7 items: 6 "real" items and the "no-click" item that has index 1. The remaining positions in the array are padded with the index 0.
The "no-click" item is always present in the slates, as the user has the alternative not to click on any of the presented items in the slate.
Further, we see that the user clicked on the fourth real item, which sits at position 4 of the slate because the "no-click" item occupies position 0.
The slate length and the click position can be found in the following auxiliary arrays:

In [6]:
print("Click_idx:")
print(dat['click_idx'][5,3])
print("Slate lengths:")
print(dat['slate_lengths'][5,3])

Click_idx:
tensor(4, dtype=torch.int32)
Slate lengths:
tensor(7, dtype=torch.int32)
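
Both auxiliary values can also be recomputed from the slate and the click. The following is a dependency-free sketch that uses the slate printed above as a plain Python list; `PAD_ID` is a name introduced here for the padding index 0.

```python
PAD_ID = 0  # padding index, as described in the text

# Slate and click of user 5 at interaction 3, copied from the output above.
slate = [1, 638995, 638947, 638711, 637590, 637930, 638894] + [PAD_ID] * 18
click = 637590

# The slate length counts every non-padding entry, including the no-click item.
slate_length = sum(1 for item in slate if item != PAD_ID)

# The click index is the zero-based position of the clicked item in the slate.
click_idx = slate.index(click)

print(f"click_idx={click_idx}, slate_length={slate_length}")
```

The result agrees with the stored `click_idx` (4) and `slate_lengths` (7) for this interaction.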


### Index to item (`ind2val.json`)
This file contains mappings from indices to values for the attributes category and interaction_type.

| Name         | Length           | Description  |
| -------------|:----:| -----:|
| category     | 290  | Mapping from the category index to a text string that describes the category and location of the group |
| interaction_type  | 3    | Indices of whether the presented slate originated from search or recommendations|

We created the category string by combining different tags and categories from the marketplace, so it contains internal names that are not always clear to the reader.
The first word is always the main category of the item, like "MOTOR", "REAL_ESTATE" and "BAP" (some funny developer named the general merchandise category (FINN.no/torget) "bits and pieces", and it's been like that ever since).
The subcategories are then added, separated by commas, if there are enough items in the group that we do not run any identification risk. Two consecutive commas (e.g. "MOTOR, , ") imply that there were either too few items in the group, or that the category was not applicable to the item.

We also have two special values in these strings: `PAD` and `<UNK>`.
`PAD` implies that there is no data for the field, whereas we use `<UNK>` to imply that there is data, but the value of the datapoint is missing. This can, unfortunately, happen in large systems.
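
As a rough illustration of this format, a category string can be split into its main category and the remaining fields. This is a sketch, not part of the package: `parse_category` and `SPECIAL_VALUES` are hypothetical names, and it assumes that fields are separated by commas and that an empty field marks a withheld or non-applicable subcategory. `noClick` is treated as a special value because it appears as category 1 in `ind2val`.

```python
SPECIAL_VALUES = {"PAD", "noClick", "<UNK>"}

def parse_category(label):
    """Split a category string into (main category, remaining fields)."""
    if label in SPECIAL_VALUES:
        return label, []
    fields = [field.strip() for field in label.split(",")]
    # Empty fields become None: too few items in the group or not applicable.
    rest = [field or None for field in fields[1:]]
    return fields[0], rest

print(parse_category("BAP,antiques,Trøndelag"))
print(parse_category("MOTOR,,Sogn og Fjordane"))
print(parse_category("<UNK>"))
```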

#### Example `ind2val`
We print out the first elements of each index.
For example, we see that category 3 is "BAP,antiques,Trøndelag" which implies the category contains antiques sold in the county of Trøndelag.

In [7]:
for key, val in ind2val.items():
    print(" ")
    print(f"{key} first entries:")
    for idx, name in val.items():
        print(f"{idx}: {name}")
        # Stop after the first few entries
        if idx > 3:
            break

 
category first entries:
0: PAD
1: noClick
2: <UNK>
3: BAP,antiques,Trøndelag
4: MOTOR,,Sogn og Fjordane
 
interaction_type first entries:
1: search
2: rec
0: <UNK>


### Item attributes (`itemattr.npz`)
A numpy array that encodes the category of each item.

| Name        | Dimension           | Description  |
| ------------- |:-------------:| -----:|
| category      | [itemId] | The group that each item belongs to |


In [8]:
for key, val in itemattr.items():
  print(f"{key} : {val.shape}")

print("\nThe full dictionary:")
itemattr

category : (1311775,)

The full dictionary:


{'category': array([  0.,   1.,   2., ..., 289., 289., 289.])}

#### Example `itemattr`
Get the category of the clicked item above (from user 5, interaction number 3)

In [9]:
print("Find the itemId that was clicked by user 5 in interaction 3:")
itemId = [dat['click'][5,3]]
print(f"itemId: {itemId}")

print("\nFind the category index of that item in itemattr:")
cat_idx = itemattr['category'][itemId]
print(f"Category index: {cat_idx}")

print("\nFinally, find the category name by using ind2val:")
cat_name = ind2val['category'][cat_idx.item()]
print(f"Category name: {cat_name}")

Find the itemId that was clicked by user 5 in interaction 3:
itemId: [tensor(637590, dtype=torch.int32)]

Find the category index of that item in itemattr:
Category index: [135.]

Finally, find the category name by using ind2val:
Category name: REAL_ESTATE,,Oppland
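
The same lookup can be wrapped in a helper that maps a whole slate to category names. The sketch below runs on small made-up stand-ins for `itemattr['category']` and `ind2val['category']`, so the item ids are not real; `slate_categories` is a hypothetical helper. The `int(...)` cast mirrors the fact that the category array is stored as floats.

```python
# Toy stand-ins (made up for illustration) for the real lookup tables.
item_category = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0]  # category index per itemId
category_names = {
    0: "PAD",
    1: "noClick",
    2: "<UNK>",
    3: "BAP,antiques,Trøndelag",
    4: "MOTOR,,Sogn og Fjordane",
}

def slate_categories(slate, pad_id=0):
    """Return the category name of every non-padding item in a slate."""
    return [
        category_names[int(item_category[item])]
        for item in slate
        if item != pad_id
    ]

print(slate_categories([1, 5, 4, 0, 0]))
```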


## Print some statistics about the dataset

In [10]:
print(f"Ratio of no clicks: {(dat['click']==1).sum() / (dat['click']!=0).sum():.2f}")
print(f"Average slate length: {(dat['slate_lengths'][dat['slate_lengths']!=0]).float().mean():.2f}")
print(f"Ratio of slates that are recommendations: {(dat['interaction_type']==2).sum() / (dat['interaction_type']!=0).sum():.3f}")
print(f"Average number of interactions per user: {(dat['click']!=0).sum(-1).float().mean():.2f}")

Ratio of no clicks: 0.24
Average slate length: 11.14
Ratio of slates that are recommendations: 0.303
Average number of interactions per user: 16.43
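
To make the definitions behind these numbers explicit, here is the same arithmetic on a tiny made-up click matrix in plain Python. Index 0 is padding and index 1 is the no-click item, as in the dataset; all other values are invented.

```python
# Two users, three interaction slots each (0 = padding, 1 = no-click).
clicks = [
    [5, 1, 0],
    [1, 7, 8],
]

flat = [c for user in clicks for c in user]
num_interactions = sum(1 for c in flat if c != 0)

# Share of real (non-padding) interactions that ended without a click.
no_click_ratio = sum(1 for c in flat if c == 1) / num_interactions

# Average number of real interactions per user.
avg_interactions = num_interactions / len(clicks)

print(f"Ratio of no clicks: {no_click_ratio:.2f}")
print(f"Average number of interactions per user: {avg_interactions:.2f}")
```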


### Masking of train/test/val
Each batch is a dictionary of PyTorch tensors that contains the data fields described above.
In addition, it contains a `phase_mask` tensor which indicates whether each interaction belongs to the phase of the dataloader it came from (train, valid or test).
It has the same shape as the click tensor (`num users * num interactions`).

For example, if the batch came from `dataloaders['train']` then each element of `batch['phase_mask']` will have a value of 1 if the interaction is part of the training dataset, and a 0 otherwise.
The full sequence of interactions is returned so that e.g. the test set can use the first clicks of the user (which belong to the training set) to build a user profile.

This transformation happens inside the dataloaders. The stored data instead encode the split of each interaction as an integer, defined in the following way:

```
mask2split = {
    0 : 'PAD',
    1 : 'train',
    2 : 'valid',
    3 : 'test'
}
```
A mask value of zero means that the user's sequence is shorter than this index, i.e. the position is padding.
The modeler must take care not to train on elements in the validation or test dataset.
Typically this is done by masking all losses that do not originate from the training dataset:

In [11]:
train_mask = (batch['phase_mask']==1)
train_mask

tensor([[True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        ...,
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True]])
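
The relation between the stored integer mask, `t_testsplit` and the boolean `phase_mask` can be sketched in plain Python. This is an illustration under assumptions, not the package's implementation: `build_split_mask` and `phase_mask_for` are hypothetical helpers, and the sketch assumes that the first `t_testsplit` interactions of a validation or test user are assigned to training.

```python
SPLIT_ID = {"PAD": 0, "train": 1, "valid": 2, "test": 3}  # inverse of mask2split

def build_split_mask(seq_len, user_phase, t_testsplit, max_len=20):
    """Integer split mask for one user with `seq_len` real interactions."""
    mask = []
    for t in range(max_len):
        if t >= seq_len:
            mask.append(SPLIT_ID["PAD"])
        elif user_phase == "train" or t < t_testsplit:
            mask.append(SPLIT_ID["train"])
        else:
            mask.append(SPLIT_ID[user_phase])
    return mask

def phase_mask_for(split_mask, phase):
    """Boolean mask that is True where an interaction belongs to `phase`."""
    return [value == SPLIT_ID[phase] for value in split_mask]

# A made-up validation user with 8 interactions, padded to length 10.
stored = build_split_mask(seq_len=8, user_phase="valid", t_testsplit=5, max_len=10)
print(stored)
print(phase_mask_for(stored, "train"))
print(phase_mask_for(stored, "valid"))
```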

For example, for user number 4 in this batch, the first five interactions belong to the training set, and the remaining ones belong to the validation set.
We can extract the clicks that belong to the training set by using `phase_mask`:

In [12]:
print("Mask of user 4:")
print(batch['phase_mask'][4,])
print(" ")
print("Clicks belonging to the training set:")
print(train_mask[4,])
print(" ")
print("Select only the clicks in training dataset:")
batch['click'][4,][train_mask[4,]]

Mask of user 4:
tensor([ True,  True,  True,  True,  True, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False])
 
Clicks belonging to the training set:
tensor([ True,  True,  True,  True,  True, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False])
 
Select only the clicks in training dataset:


tensor([ 492578,  711722,       1, 1095461,       1], dtype=torch.int32)
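
Putting the pieces together, a masked loss only averages over interactions whose mask entry is true. The sketch below computes a categorical negative log-likelihood of the clicked position for a single user in plain Python; `slate_log_softmax` and `masked_nll` are hypothetical helpers, the scores are made up, and a real model would do the same with batched tensor operations.

```python
import math

def slate_log_softmax(scores, slate_length):
    """Log-probabilities over the first `slate_length` (non-padded) positions."""
    valid = scores[:slate_length]
    top = max(valid)
    log_norm = top + math.log(sum(math.exp(s - top) for s in valid))
    return [s - log_norm for s in valid]

def masked_nll(scores, click_idx, slate_lengths, train_mask):
    """Mean negative log-likelihood over interactions where train_mask is True."""
    total, count = 0.0, 0
    for s, c, n, keep in zip(scores, click_idx, slate_lengths, train_mask):
        if not keep:
            continue  # validation, test or padding: contributes no loss
        total -= slate_log_softmax(s, n)[c]
        count += 1
    return total / count if count else 0.0

# One user, three interactions, slates padded to length 4 (made-up scores).
scores = [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 0.5, 9.9], [3.0, 1.0, 0.0, 0.0]]
click_idx = [1, 0, 2]
slate_lengths = [4, 3, 3]
train_mask = [True, True, False]

loss = masked_nll(scores, click_idx, slate_lengths, train_mask)
print(f"Masked training loss: {loss:.4f}")
```

Because the third interaction is masked out, changing its scores leaves the loss unchanged, which is the property needed to avoid training on validation or test interactions.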