
Transform to numpy arrays #3

Merged
merged 14 commits into from Jun 30, 2021
2 changes: 2 additions & 0 deletions .gitattributes
@@ -1,3 +1,5 @@
*.pt filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.json filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
36 changes: 18 additions & 18 deletions README.md
@@ -1,12 +1,8 @@
# FINN.no Recommender Systems Slate Dataset
This repository accompanies the paper ["Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling"](https://arxiv.org/abs/2104.15046) by Simen Eide, David S. Leslie and Arnoldo Frigessi.
The article is under review, and the pre-print can be obtained [here](https://arxiv.org/abs/2104.15046).

The repository is split into the dataset (`data/`) and the accompanying code for the paper (`code/`).

We release the *FINN.no recommender systems slate dataset* to improve recommender systems research.
The dataset includes both search and recommendation interactions between users and the platform over a 30 day period.
The dataset has logged both exposures and clicks, *including interactions where the user did not click on any of the items in the slate*.
To our knowledge, no such large-scale dataset exists, and we hope this contribution can help researchers construct improved models and improve offline evaluation metrics.

![A visualization of a presented slate to the user on the frontpage of FINN.no](finn-frontpage.png)

@@ -16,25 +12,27 @@ The dataset consists of 37.4 million interactions, |U| ≈ 2.3 million users a
![A visualization of a presented slate to the user on the frontpage of FINN.no](interaction_illustration.png)

FINN.no is the leading marketplace in the Norwegian classifieds market and provides users with a platform to buy and sell general merchandise, cars, real estate, as well as house rentals and job offerings.

This repository is currently *work in progress*, and we will provide descriptions and tutorials. Suggestions and contributions to make the material more accessible are welcome.

For questions, email simen.eide@finn.no or file an issue.

## Download and prepare dataset

The data file is compressed; unzip it with the following command: `gunzip -c data.pt.gz >data.pt`
## Organization
The repository is organized as follows:
- The dataset is placed in `data/`.
- The code open sourced from the article ["Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling"](https://arxiv.org/abs/2104.15046) is found in `code/`. However, we are in the process of making the data more generally available, which makes the code incompatible with the current (newer) version of the data. Please use [the v1.0 release of the repository](https://github.com/finn-no/recsys-slates-dataset/tree/v1.0) for a compatible version of the code and dataset.

1. Install git-lfs: This repository uses `git-lfs` to store the dataset, so you need the git-lfs package in addition to git. See [git-lfs.github.com](https://git-lfs.github.com/) for installation instructions
(e.g. `sudo apt-get install git-lfs` on apt-based systems).
2. Clone the repository
3. The data file is compressed; unzip it with the following command: `gunzip -c data.pt.gz >data.pt`
## Download and prepare dataset
The data files can be obtained either by cloning this repository with git-lfs or (preferably) by using the [datahelper.download_data_files()](https://github.com/finn-no/recsys-slates-dataset/blame/transform-to-numpy-arrays/datahelper.py#L3) function, which downloads the same dataset from Google Drive.
PyTorch users can directly call `dataset_torch.load_dataloaders()` to get ready-to-use dataloaders for the training, validation and test datasets, as in the sketch below.
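A minimal usage sketch (both functions exist in this PR; the `data_dir="data"` argument is an assumption matching the default download directory in `datahelper.py`):

```python
import dataset_torch

# load_dataloaders() downloads the data files on first use (via datahelper)
# and returns index mappings, item attributes and the three dataloaders:
ind2val, itemattr, dataloaders = dataset_torch.load_dataloaders(data_dir="data")
train_dl = dataloaders["train"]
```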

## Quickstart dataset [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/finn-no/recsys-slates-dataset/blob/master/quickstart-finn-recsys-slate-data.ipynb)
We provide a quickstart Jupyter notebook (`quickstart-finn-recsys-slate-data.ipynb`) that runs on Google Colab and includes all the necessary steps above.

NB: This quickstart notebook is currently incompatible with the main branch.
We will update the notebook as soon as we have published a pip package. In the meantime, please use [the v1.0 release of the repository](https://github.com/finn-no/recsys-slates-dataset/tree/v1.0).

## Citations
This repository accompanies the paper ["Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling"](https://arxiv.org/abs/2104.15046) by Simen Eide, David S. Leslie and Arnoldo Frigessi.
The article is under review, and the pre-print can be obtained [here](https://arxiv.org/abs/2104.15046).

If you use either the code, data or paper, please consider citing the paper.

```
@@ -49,11 +47,13 @@ If you use either the code, data or paper, please consider citing the paper.
```

# Todo
There are some limitations on the repository today that we would like to improve:
This repository is currently *work in progress*, and we will provide descriptions and tutorials. Suggestions and contributions to make the material more accessible are welcome.
There are some features of the repository that we are working on:

- [ ] Release the dataset as numpy objects instead of pytorch arrays. This will help non-pytorch users to more easily utilize the data
- [ ] Maintain a pytorch dataset for easy usage
- [x] Release the dataset as numpy objects instead of pytorch arrays. This will help non-pytorch users to more easily utilize the data
- [x] Maintain a pytorch dataset for easy usage
- [ ] Create a pip package for easier installation and usage. The package should download the dataset using a function.
- [ ] Make the quickstart guide compatible with the pip package and numpy format.
- [ ] Add easy-to-use functions that compute relevant metrics such as hitrate, log-likelihood etc. (a minimal sketch follows after this list).
- [ ] Distribute the data on other platforms such as Kaggle.
- [ ] Add a short description of the data directly in the readme.md.
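The metric helpers above do not exist yet; as an illustration only, here is a minimal hitrate sketch in PyTorch, assuming a hypothetical tensor of model-ranked top-k items and the no-click index 1 used by the dataset code below:

```python
import torch

def hitrate(predicted_topk: torch.Tensor, click: torch.Tensor, noclick_idx: int = 1) -> float:
    """Fraction of clicked interactions whose clicked item appears in the model's top-k.

    predicted_topk: [num_interactions, k] item indices ranked by a model (hypothetical input).
    click: [num_interactions] clicked item indices, where noclick_idx means "no click".
    """
    clicked = click != noclick_idx  # only evaluate interactions with an actual click
    hits = (predicted_topk[clicked] == click[clicked].unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```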
42 changes: 22 additions & 20 deletions code/dataset.py
@@ -5,41 +5,39 @@
import pickle
from torch.utils.data import Dataset, DataLoader
import torch

import json
def mkdir(path):
if os.path.isdir(path) == False:
os.makedirs(path)

import numpy as np
#%% DATALOADERS
class SequentialDataset(Dataset):
'''
Note: displayType has been uncommented for future easy implementation.
'''
def __init__(self, data, sample_uniform_action=False):
def __init__(self, data, sample_uniform_slate=False):

self.data = data
self.num_items = self.data['action'].max()+1
self.sample_uniform_action = sample_uniform_action
logging.info(f"Loading dataset with action size={self.data['action'].size()} and uniform candidate sampling={self.sample_uniform_action}")
self.num_items = self.data['slate'].max()+1
self.sample_uniform_slate = sample_uniform_slate
logging.info(f"Loading dataset with slate size={self.data['slate'].size()} and uniform candidate sampling={self.sample_uniform_slate}")

def __getitem__(self, idx):
batch = {key: val[idx] for key, val in self.data.items()}

if self.sample_uniform_action:
if self.sample_uniform_slate:
# Sample actions uniformly:
action = torch.randint_like(batch['action'], low=3, high=self.num_items)
action = torch.randint_like(batch['slate'], low=3, high=self.num_items)

# Add noclick action at pos0
# and the actual click action at pos 1 (unless noclick):
action[:,0] = 1
clicked = batch['click']!=1
action[:,1][clicked] = batch['click'][clicked]
batch['action'] = action
batch['slate'] = action
# Set click idx to 0 if noclick, and 1 otherwise:
batch['click_idx'] = clicked.long()



return batch

def __len__(self):
@@ -50,14 +48,19 @@ def load_dataloaders(data_dir,
batch_size=1024,
split_trainvalid=0.90,
t_testsplit = 5,
sample_uniform_action=False):
sample_uniform_slate=False):

logging.info('Load data..')
data = torch.load(f'{data_dir}/data.pt')
dataset = SequentialDataset(data, sample_uniform_action)
with np.load(f'{data_dir}/data.npz') as data_np:
data = {key: torch.tensor(val) for key, val in data_np.items()}
dataset = SequentialDataset(data, sample_uniform_slate)

with open(f'{data_dir}/ind2val.pickle', 'rb') as handle:
ind2val = pickle.load(handle)
with open(f'{data_dir}/ind2val.json', 'rb') as handle:
# Use string2int object_hook found here: https://stackoverflow.com/a/54112705
ind2val = json.load(
handle,
object_hook=lambda d: {int(k) if k.lstrip('-').isdigit() else k: v for k, v in d.items()}
)

num_validusers = int(len(dataset) * (1-split_trainvalid)/2)
num_testusers = int(len(dataset) * (1-split_trainvalid)/2)
@@ -79,16 +82,15 @@
}

dataloaders = {
phase: DataLoader(ds, batch_size=batch_size, shuffle=True)
phase: DataLoader(ds, batch_size=batch_size, shuffle=(phase=="train"), num_workers=12)
for phase, ds in subsets.items()
}
for key, dl in dataloaders.items():
logging.info(
f"In {key}: num_users: {len(dl.dataset)}, num_batches: {len(dl)}"
)


with open(f'{data_dir}/itemattr.pickle', 'rb') as handle:
itemattr = pickle.load(handle)
with np.load(f'{data_dir}/itemattr.npz', mmap_mode=None) as itemattr_file:
itemattr = {key : val for key, val in itemattr_file.items()}

return ind2val, itemattr, dataloaders
3 changes: 3 additions & 0 deletions data/data.npz
Git LFS file not shown
3 changes: 0 additions & 3 deletions data/data.pt.gz

This file was deleted.

3 changes: 3 additions & 0 deletions data/ind2val.json
Git LFS file not shown
3 changes: 0 additions & 3 deletions data/ind2val.pickle

This file was deleted.

3 changes: 3 additions & 0 deletions data/itemattr.npz
Git LFS file not shown
3 changes: 0 additions & 3 deletions data/itemattr.pickle

This file was deleted.

19 changes: 19 additions & 0 deletions datahelper.py
@@ -0,0 +1,19 @@
import logging
from google_drive_downloader import GoogleDriveDownloader as gdd
def download_data_files(data_dir : str = "data", overwrite=False):
"""
Downloads the data from google drive.
If files exist they will not be downloaded again unless overwrite=True
"""
gdrive_file_ids = {
'data.npz' : '1VXKXIvPCJ7z4BCa4G_5-Q2XMAD7nXOc7',
'ind2val.json' : '1WOCKfuttMacCb84yQYcRjxjEtgPp6F4N',
'itemattr.npz' : '1rKKyMQZqWp8vQ-Pl1SeHrQxzc5dXldnR'
}
for filename, gdrive_id in gdrive_file_ids.items():
logging.info("Downloading {}".format(filename))
gdd.download_file_from_google_drive(file_id=gdrive_id,
dest_path="{}/{}".format(data_dir, filename),
overwrite=overwrite)
logging.info("Done downloading all files.")
return True
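A short usage sketch for this helper (the `data` directory is the default from the signature above; the logging setup is an assumption):

```python
import logging
import datahelper

logging.basicConfig(level=logging.INFO)

datahelper.download_data_files(data_dir="data")                  # skips files that already exist
datahelper.download_data_files(data_dir="data", overwrite=True)  # forces a re-download
```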
1 change: 0 additions & 1 deletion dataset.py

This file was deleted.

110 changes: 110 additions & 0 deletions dataset_torch.py
@@ -0,0 +1,110 @@
#%% Imports
import torch
import datahelper
from torch.utils.data import Dataset, DataLoader
import json
import numpy as np
import logging
logging.basicConfig(format='%(asctime)s %(message)s', level='INFO')

#%% DATALOADERS
class SequentialDataset(Dataset):
'''
Note: displayType has been uncommented for future easy implementation.
'''
def __init__(self, data, sample_uniform_slate=False):

self.data = data
self.num_items = self.data['slate'].max()+1
self.sample_uniform_slate = sample_uniform_slate
logging.info(
"Loading dataset with slate size={} and uniform candidate sampling={}"
.format(self.data['slate'].size(), self.sample_uniform_slate))

def __getitem__(self, idx):
batch = {key: val[idx] for key, val in self.data.items()}

if self.sample_uniform_slate:
# Sample actions uniformly:
action = torch.randint_like(batch['slate'], low=3, high=self.num_items)

# Add noclick action at pos0
# and the actual click action at pos 1 (unless noclick):
action[:,0] = 1
clicked = batch['click']!=1
action[:,1][clicked] = batch['click'][clicked]
batch['slate'] = action
# Set click idx to 0 if noclick, and 1 otherwise:
batch['click_idx'] = clicked.long()

return batch

def __len__(self):
return len(self.data['click'])

#%% PREPARE DATA IN TRAINING
def load_dataloaders(data_dir,
batch_size=1024,
num_workers= 0,
sample_uniform_slate=False,
valid_pct= 0.05,
test_pct= 0.05,
t_testsplit= 5):

logging.info("Download data if not in data folder..")
datahelper.download_data_files(data_dir=data_dir)

logging.info('Load data..')
with np.load("{}/data.npz".format(data_dir)) as data_np:
data = {key: torch.tensor(val) for key, val in data_np.items()}
dataset = SequentialDataset(data, sample_uniform_slate)

with open('{}/ind2val.json'.format(data_dir), 'rb') as handle:
# Use string2int object_hook found here: https://stackoverflow.com/a/54112705
ind2val = json.load(
handle,
object_hook=lambda d: {
int(k) if k.lstrip('-').isdigit() else k: v
for k, v in d.items()
}
)

# Split dataset into train, validation and test:
num_validusers = int(len(dataset) * valid_pct)
num_testusers = int(len(dataset) * test_pct)
torch.manual_seed(0)
num_users = len(dataset)
perm_user = torch.randperm(num_users)
valid_user_idx = perm_user[:num_validusers]
test_user_idx = perm_user[num_validusers:(num_validusers+num_testusers)]
train_user_idx = perm_user[(num_validusers+num_testusers):]
# Mask type: 1: train, 2: valid, 3: test
dataset.data['mask_type'] = torch.ones_like(dataset.data['click'])
dataset.data['mask_type'][valid_user_idx, t_testsplit:] = 2
dataset.data['mask_type'][test_user_idx, t_testsplit:] = 3

subsets = {
'train': dataset,
'valid': torch.utils.data.Subset(dataset, valid_user_idx),
'test': torch.utils.data.Subset(dataset, test_user_idx)
}

# Build dataloaders for each data subset:
dataloaders = {
phase: DataLoader(ds, batch_size=batch_size, shuffle=(phase=="train"), num_workers=num_workers)
for phase, ds in subsets.items()
}
for key, dl in dataloaders.items():
logging.info(
"In {}: num_users: {}, num_batches: {}".format(key, len(dl.dataset), len(dl))
)

# Load item attributes:
with np.load('{}/itemattr.npz'.format(data_dir), mmap_mode=None) as itemattr_file:
itemattr = {key : val for key, val in itemattr_file.items()}

return ind2val, itemattr, dataloaders

if __name__ == "__main__":
load_dataloaders(data_dir="data")
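A sketch of how the returned objects could be consumed (the key names follow the tensors used in `SequentialDataset` above; the shape comments are assumptions based on the indexing in `__getitem__`):

```python
ind2val, itemattr, dataloaders = load_dataloaders(data_dir="data")

for batch in dataloaders["train"]:
    # Assumed shapes: slate [batch, time, slate_len], click [batch, time]:
    slate, click = batch["slate"], batch["click"]
    # mask_type marks each interaction as 1=train, 2=valid or 3=test:
    train_mask = batch["mask_type"] == 1
```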
1 change: 1 addition & 0 deletions requirements.txt
@@ -11,3 +11,4 @@ ax==0.52.0
plotly==4.14.3
pyro==3.16
PyYAML==5.4.1
googledrivedownloader==0.4