# Quick start with the FINN.no recsys slate dataset [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/finn-no/recsys-slates-dataset/blob/master/quickstart-finn-recsys-slate-data.ipynb)

This notebook gives an introduction to the dataset released with the paper [XXX]. 
It is compatible with Google Colab and can be run interactively by using the "Open in Colab" button.

### Install dependencies, download and unzip data

In [1]:
!sudo apt-get install git-lfs -q
!git lfs install
!echo Clone data repository..:
!git clone https://github.com/finn-no/recsys-slates-dataset.git
!echo Unzip datafile..:
!gunzip -c recsys-slates-dataset/data/data.pt.gz >recsys-slates-dataset/data/data.pt


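If `gunzip` is not available (for example outside Colab), the unzip step can be done with the Python standard library instead. The following is a minimal sketch; the helper name `gunzip_file` is hypothetical and not part of the dataset repository.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dst):
    """Decompress the gzip file at `src` into `dst`, streaming in chunks."""
    src, dst = Path(src), Path(dst)
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


# Usage with the paths from the cell above (only runs if the repository is cloned).
archive = Path("recsys-slates-dataset/data/data.pt.gz")
if archive.exists():
    gunzip_file(archive, archive.with_suffix(""))  # writes .../data.pt
```

Streaming with `shutil.copyfileobj` avoids holding the whole decompressed file in memory.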
In [2]:
%ls recsys-slates-dataset

[0m[01;34mdata[0m/  README.md


In [3]:
import torch
import pickle

### Main dataset file `data.pt`
The dataset consists of about 2.3M unique users (2,277,645), each of whom has interacted with the internet platform up to 20 times and has been exposed to up to 25 items at each interaction.
`data.pt` contains all the slate and click data, and the two main arrays are `click` and `action`. 
By convention, the first dimension of each array indexes the user, the second dimension the interaction (time), and the third dimension the position in the presented slate.
The arrays are described in full below:

| Name        | Dimension           | Description  |
| ------------- |:-------------:| -----:|
| action      | [userId, interaction num, slate pos]| The slates presented to the users |
| click      | [userId, interaction num]      | The item clicked by the user in each slate |
| displayType      | [userId, interaction num]      | Type of interaction the user had with the platform (search or recommendation) |
| click_idx      | [userId, interaction num]      | Auxiliary data: the position of the click in the `action` array (integer from 0 to 24). <br> Useful for e.g. categorical likelihoods |
| lengths      | [userId, interaction num]      | Auxiliary data: the actual length of the slate. <br> Equals 25 minus the number of pad indices in `action` |
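The relation between the auxiliary arrays and the two main arrays can be sketched on a small toy tensor. The values below are made up and only mimic the layout of `action` and `click`; the sketch assumes the pad index is 0 and that every clicked item appears exactly once in its slate.

```python
import torch

PAD = 0  # padding index in `action`

# Toy stand-ins for dat['action'] and dat['click']:
# 2 users, 2 interactions each, slate size 5 (the real slate size is 25).
action = torch.tensor([
    [[1, 10, 11, 12, 0], [1, 20, 21, 0, 0]],
    [[1, 30, 0, 0, 0], [1, 40, 41, 42, 43]],
])
click = torch.tensor([[11, 1], [30, 43]])

# lengths: slate size minus the number of pad entries in each slate.
lengths = action.size(-1) - (action == PAD).sum(dim=-1)

# click_idx: position in the slate where the clicked item appears.
click_idx = (action == click.unsqueeze(-1)).long().argmax(dim=-1)

print(lengths.tolist())    # [[4, 3], [2, 5]]
print(click_idx.tolist())  # [[2, 0], [1, 4]]
```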





In [4]:
# Load dataset
dat = torch.load("recsys-slates-dataset/data/data.pt")

In [5]:
# Print dimensions of all arrays:
for key, val in dat.items():
  print(f"{key} : \t {val.size()}")

userId : 	 torch.Size([2277645])
lengths : 	 torch.Size([2277645, 20])
displayType : 	 torch.Size([2277645, 20])
action : 	 torch.Size([2277645, 20, 25])
click : 	 torch.Size([2277645, 20])
click_idx : 	 torch.Size([2277645, 20])


#### Example: Get one interaction
Get the presented slate and the click for user 5 at interaction number 3.

In [6]:
print("Slate:")
print(dat['action'][5,3])
print(" ")
print("Click:")
print(dat['click'][5,3])
print("Type of interaction: (1 implies search, see ind2val file)")
print(dat['displayType'][5,3])

Slate:
tensor([     1, 638995, 638947, 638711, 637590, 637930, 638894,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0])
 
Click:
tensor(637590)
Type of interaction: (1 implies search, see ind2val file)
tensor(1)


From the output above, we can see that user 5 at interaction number 3 was presented with a total of 7 items: 6 "real" items and the "no-click" item, which has index 1. The remaining positions in the array are padded with index 0.
Further, we see that the user clicked on the fourth real item, which sits at position 4 of the slate (counting from 0, with the no-click item at position 0).
The slate length and the click position can also be read from the following auxiliary arrays:

In [7]:
print("Click_idx:")
print(dat['click_idx'][5,3])
print("lengths:")
print(dat['lengths'][5,3])

Click_idx:
tensor(4)
lengths:
tensor(7)
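As an illustration of how `click_idx` and `lengths` can feed a categorical likelihood, here is a minimal sketch for the slate shown above. The scores are random placeholders for the output of some model (an assumption, since this notebook does not define one); the point is only the masking of padded positions and the lookup of the clicked position.

```python
import torch
import torch.nn.functional as F

# The slate from the example above (user 5, interaction 3), with its auxiliary values.
slate = torch.tensor([1, 638995, 638947, 638711, 637590, 637930, 638894] + [0] * 18)
click_idx = torch.tensor(4)
length = torch.tensor(7)

# Hypothetical model scores, one per slate position (random here, not a real model).
torch.manual_seed(0)
scores = torch.randn(slate.size(0))

# Mask padded positions so that they receive zero probability.
valid = torch.arange(slate.size(0)) < length
masked_scores = scores.masked_fill(~valid, float("-inf"))

# Categorical log-likelihood of the observed click.
log_probs = F.log_softmax(masked_scores, dim=-1)
log_lik = log_probs[click_idx]

print(log_probs[:7].exp().sum())  # probabilities over the 7 valid positions sum to 1
print(log_lik)
```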


### Index to item file `ind2val.pickle`
This file contains mappings from indices to values for the attributes userId, itemId, category and displayType.

| Name         | Length           | Description  |
| -------------|:----:| -----:|
| userId       | 2.3M | Scrambled id of users |
| itemId       | 1.3M | Scrambled id of items. <br> The first indices are reserved for the pad, noclick and unk items. |
| category     | 290  | Mapping from the category index to a text string that describes the category and location of the group |
| displayType  | 3    | Indicates whether the presented slate originated from search or recommendations |

#### Example `ind2val`
We print out the first entries of each mapping.
For example, we see that category 3 is "BAP,antiques,Trøndelag", which implies that the category contains antiques sold in the county of Trøndelag.

In [16]:
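# Load the index-to-value mappings; the context manager closes the file afterwards
with open("recsys-slates-dataset/data/ind2val.pickle", "rb") as f:
  ind2val = pickle.load(f)

# Print entries of each mapping, stopping after the first index above 3
for key, val in ind2val.items():
  print(" ")
  print(f"{key} first entries:")
  for idx, name in val.items():
    print(f"{idx}: {name}")
    if idx > 3:
      break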

 
itemId first entries:
0: PAD
1: noClick
2: <UNK>
3: item_3
4: item_4
 
category first entries:
0: PAD
1: noClick
2: <UNK>
3: BAP,antiques,Trøndelag
4: MOTOR,,Sogn og Fjordane
 
displayType first entries:
1: search
2: rec
0: <UNK>
 
userId first entries:
1: user_1
2: user_2
3: user_3
4: user_4
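The mappings can be used to decode index arrays into readable values. The sketch below uses a toy excerpt of `ind2val` copied from the printed entries above, together with a hypothetical slate and display type, so it runs without the data files.

```python
# Toy excerpt of ind2val, copied from the printed entries above.
ind2val = {
    "itemId": {0: "PAD", 1: "noClick", 2: "<UNK>", 3: "item_3", 4: "item_4"},
    "displayType": {1: "search", 2: "rec", 0: "<UNK>"},
}

# Hypothetical slate and display type, using only indices present in the excerpt.
slate = [1, 4, 3, 0, 0]
display_type = 1

# Decode indices to values, dropping padding.
items = [ind2val["itemId"][i] for i in slate if ind2val["itemId"][i] != "PAD"]
print(items)                                 # ['noClick', 'item_4', 'item_3']
print(ind2val["displayType"][display_type])  # search

# Invert a mapping to go from a value back to its index.
val2ind = {name: idx for idx, name in ind2val["itemId"].items()}
print(val2ind["item_4"])                     # 4
```

With the real data, a slate such as `dat['action'][5,3]` is a tensor, so it would first need to be converted with `.tolist()` to obtain plain integer keys.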


### Item attributes file `itemattr.pickle`
A small attribute file that provides two pieces of information on the items. These are stored as numpy arrays.

| Name        | Dimension           | Description  |
| ------------- |:-------------:| -----:|
| category      | [itemId] | The group that each item belongs to |
| actions       | [itemId] | Auxiliary data: the total number of exposures per item. <br> `-1` is used as a placeholder for the special items (unk, pad, noclick) |


In [45]:
with open("recsys-slates-dataset/data/itemattr.pickle", "rb") as f:
  itemattr = pickle.load(f)

for key, val in itemattr.items():
  print(f"{key} : {val.shape}")

print("\nThe full dictionary:")
itemattr

actions : (1311775,)
category : (1311775,)

The full dictionary:


{'actions': array([-1., -1., -1., ..., 39., 14.,  4.]),
 'category': array([  0.,   1.,   2., ..., 289., 289., 289.])}
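Because both arrays are indexed by item, they combine naturally. As a sketch, the following filters out the special items via the `-1` placeholder and sums exposures per category. The arrays are small made-up stand-ins with the same layout as `itemattr`, not real dataset values.

```python
import numpy as np

# Toy stand-in for itemattr: 3 special items followed by 5 real items (made-up values).
itemattr = {
    "actions": np.array([-1., -1., -1., 39., 14., 4., 7., 20.]),
    "category": np.array([0., 1., 2., 3., 3., 4., 4., 4.]),
}

# Real items are those whose exposure count is not the -1 placeholder.
is_real = itemattr["actions"] >= 0

# Total exposures per category, using the category index as the bin.
category = itemattr["category"].astype(int)
exposures = np.bincount(category[is_real], weights=itemattr["actions"][is_real])

print(int(is_real.sum()))  # 5
print(exposures.tolist())  # [0.0, 0.0, 0.0, 53.0, 31.0]
```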

#### Example `itemattr`
Get the category of the clicked item above (from user 5, interaction number 3)

In [40]:
print("Find the itemId that was clicked by user 5 in interaction 3:")
itemId = [dat['click'][5,3]]
print(f"itemId: {itemId}")

print("\nFind the category index of that item in itemattr:")
cat_idx = itemattr['category'][itemId]
print(f"Category index: {cat_idx}")

print("\nFinally, find the category name by using ind2val:")
cat_name = ind2val['category'][cat_idx.item()]
print(f"Category name: {cat_name}")

Find the itemId that was clicked by user 5 in interaction 3:
itemId: [tensor(637590)]

Find the category index of that item in itemattr:
Category index: [135.]

Finally, find the category name by using ind2val:
Category name: REAL_ESTATE,,Oppland
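The same lookup can be done for a whole slate at once by indexing the category array with all item indices in one step. This is a sketch on made-up stand-ins for `itemattr['category']`, `ind2val['category']` and one row of `dat['action']`; only the two category names are taken from the output printed earlier.

```python
import numpy as np
import torch

# Toy stand-ins (made-up values) for the loaded objects.
category_of_item = np.array([0., 1., 2., 3., 4., 3.])  # like itemattr['category']
category_names = {                                      # like ind2val['category']
    0: "PAD", 1: "noClick", 2: "<UNK>",
    3: "BAP,antiques,Trøndelag", 4: "MOTOR,,Sogn og Fjordane",
}
slate = torch.tensor([1, 5, 3, 4, 0, 0])                # like one row of dat['action']

# Convert the tensor to a NumPy integer array, then index once for the whole slate.
cat_idx = category_of_item[slate.numpy()].astype(int)

# Map each category index to its name.
names = [category_names[i] for i in cat_idx.tolist()]
print(cat_idx.tolist())  # [1, 3, 3, 4, 0, 0]
print(names)
```

Casting the category indices to `int` makes the dictionary lookup explicit, since the array stores them as floats.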
