
Transform to numpy arrays #3

Merged
merged 14 commits into from Jun 30, 2021
2 changes: 2 additions & 0 deletions .gitattributes
@@ -1,3 +1,5 @@
*.pt filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.json filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
36 changes: 18 additions & 18 deletions README.md
@@ -1,12 +1,8 @@
# FINN.no Recommender Systems Slate Dataset
This repository accompanies the paper ["Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling"](https://arxiv.org/abs/2104.15046) by Simen Eide, David S. Leslie and Arnoldo Frigessi.
The article is under review, and the pre-print can be obtained [here](https://arxiv.org/abs/2104.15046).

The repository is split into the dataset (`data/`) and the accompanying code for the paper (`code/`).

We release the *FINN.no recommender systems slate dataset* to improve recommender systems research.
The dataset includes both search and recommendation interactions between users and the platform over a 30 day period.
The dataset has logged both exposures and clicks, *including interactions where the user did not click on any of the items in the slate*.
To our knowledge, no such large-scale dataset exists, and we hope this contribution can help researchers construct improved models and improve offline evaluation metrics.

![A visualization of a presented slate to the user on the frontpage of FINN.no](finn-frontpage.png)

@@ -16,25 +12,27 @@ The dataset consists of 37.4 million interactions, |U| ≈ 2.3 million users a
![A visualization of a presented slate to the user on the frontpage of FINN.no](interaction_illustration.png)

FINN.no is the leading marketplace in the Norwegian classifieds market and provides users with a platform to buy and sell general merchandise, cars, real estate, as well as house rentals and job offerings.

This repository is currently *work in progress*, and we will provide descriptions and tutorials. Suggestions and contributions to make the material more accessible are welcome.

For questions, email simen.eide@finn.no or file an issue.

## Download and prepare dataset

The data file is compressed; unzip it with the following command: `gunzip -c data.pt.gz >data.pt`
## Organization
The repository is organized as follows:
- The dataset is placed in `data/`.
- The code open sourced from the article ["Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling"](https://arxiv.org/abs/2104.15046) is found in `code/`. However, we are in the process of making the data more generally available, which makes the code incompatible with the current (newer) version of the data. Please use [the v1.0 release of the repository](https://github.com/finn-no/recsys-slates-dataset/tree/v1.0) for a compatible version of the code and dataset.

1. Install git-lfs: This repository uses `git-lfs` to store the dataset, so you need the git-lfs package in addition to git. See [git-lfs.github.com](https://git-lfs.github.com/) for installation instructions
(e.g. `sudo apt-get install git-lfs` on apt-based systems).
2. Clone the repository
3. The data file is compressed; unzip it with the following command: `gunzip -c data.pt.gz >data.pt`
## Download and prepare dataset
The data files can be obtained either by cloning this repository with git-lfs or (preferably) by using the [datahelper.download_data_files()](https://github.com/finn-no/recsys-slates-dataset/blame/transform-to-numpy-arrays/datahelper.py#L3) function, which downloads the same dataset from Google Drive.
PyTorch users can directly call `dataset_torch.load_dataloaders()` to get ready-to-use dataloaders for the training, validation and test datasets, as in the sketch below.
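A minimal usage sketch (both functions exist in this PR; the `data_dir="data"` argument is an assumption matching the default download directory in `datahelper.py`):

```python
import dataset_torch

# load_dataloaders() downloads the data files on first use (via datahelper)
# and returns index mappings, item attributes and the three dataloaders:
ind2val, itemattr, dataloaders = dataset_torch.load_dataloaders(data_dir="data")
train_dl = dataloaders["train"]
```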

## Quickstart dataset [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/finn-no/recsys-slates-dataset/blob/master/quickstart-finn-recsys-slate-data.ipynb)
We provide a quickstart Jupyter notebook (`quickstart-finn-recsys-slate-data.ipynb`) that runs on Google Colab and includes all the necessary steps above.

NB: This quickstart notebook is currently incompatible with the main branch.
We will update the notebook as soon as we have published a pip package. In the meantime, please use [the v1.0 release of the repository](https://github.com/finn-no/recsys-slates-dataset/tree/v1.0).

## Citations
This repository accompanies the paper ["Dynamic Slate Recommendation with Gated Recurrent Units and Thompson Sampling"](https://arxiv.org/abs/2104.15046) by Simen Eide, David S. Leslie and Arnoldo Frigessi.
The article is under review, and the pre-print can be obtained [here](https://arxiv.org/abs/2104.15046).

If you use either the code, data or paper, please consider citing the paper.

```
@@ -49,11 +47,13 @@ If you use either the code, data or paper, please consider citing the paper.
```

# Todo
There are some limitations on the repository today that we would like to improve:
This repository is currently *work in progress*, and we will provide descriptions and tutorials. Suggestions and contributions to make the material more accessible are welcome.
There are some features of the repository that we are working on:

- [ ] Release the dataset as numpy objects instead of pytorch arrays. This will help non-pytorch users to more easily utilize the data
- [ ] Maintain a pytorch dataset for easy usage
- [x] Release the dataset as numpy objects instead of pytorch arrays. This will help non-pytorch users to more easily utilize the data
- [x] Maintain a pytorch dataset for easy usage
- [ ] Create a pip package for easier installation and usage. The package should download the dataset using a function.
- [ ] Make the quickstart guide compatible with the pip package and numpy format.
- [ ] Add easy-to-use functions that compute relevant metrics such as hitrate, log-likelihood etc. (a minimal sketch follows after this list).
- [ ] Distribute the data on other platforms such as Kaggle.
- [ ] Add a short description of the data directly in the readme.md.
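The metric helpers above do not exist yet; as an illustration only, here is a minimal hitrate sketch in PyTorch, assuming a hypothetical tensor of model-ranked top-k items and the no-click index 1 used by the dataset code below:

```python
import torch

def hitrate(predicted_topk: torch.Tensor, click: torch.Tensor, noclick_idx: int = 1) -> float:
    """Fraction of clicked interactions whose clicked item appears in the model's top-k.

    predicted_topk: [num_interactions, k] item indices ranked by a model (hypothetical input).
    click: [num_interactions] clicked item indices, where noclick_idx means "no click".
    """
    clicked = click != noclick_idx  # only evaluate interactions with an actual click
    hits = (predicted_topk[clicked] == click[clicked].unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```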
42 changes: 22 additions & 20 deletions code/dataset.py
@@ -5,41 +5,39 @@
import pickle
from torch.utils.data import Dataset, DataLoader
import torch

import json
def mkdir(path):
if os.path.isdir(path) == False:
os.makedirs(path)

import numpy as np
#%% DATALOADERS
class SequentialDataset(Dataset):
'''
Note: displayType has been uncommented for future easy implementation.
'''
def __init__(self, data, sample_uniform_action=False):
def __init__(self, data, sample_uniform_slate=False):

self.data = data
self.num_items = self.data['action'].max()+1
self.sample_uniform_action = sample_uniform_action
logging.info(f"Loading dataset with action size={self.data['action'].size()} and uniform candidate sampling={self.sample_uniform_action}")
self.num_items = self.data['slate'].max()+1
self.sample_uniform_slate = sample_uniform_slate
logging.info(f"Loading dataset with slate size={self.data['slate'].size()} and uniform candidate sampling={self.sample_uniform_slate}")

def __getitem__(self, idx):
batch = {key: val[idx] for key, val in self.data.items()}

if self.sample_uniform_action:
if self.sample_uniform_slate:
# Sample actions uniformly:
action = torch.randint_like(batch['action'], low=3, high=self.num_items)
action = torch.randint_like(batch['slate'], low=3, high=self.num_items)

# Add noclick action at pos0
# and the actual click action at pos 1 (unless noclick):
action[:,0] = 1
clicked = batch['click']!=1
action[:,1][clicked] = batch['click'][clicked]
batch['action'] = action
batch['slate'] = action
# Set click idx to 0 if noclick, and 1 otherwise:
batch['click_idx'] = clicked.long()



return batch

def __len__(self):
@@ -50,14 +48,19 @@ def load_dataloaders(data_dir,
batch_size=1024,
split_trainvalid=0.90,
t_testsplit = 5,
sample_uniform_action=False):
sample_uniform_slate=False):

logging.info('Load data..')
data = torch.load(f'{data_dir}/data.pt')
dataset = SequentialDataset(data, sample_uniform_action)
with np.load(f'{data_dir}/data.npz') as data_np:
data = {key: torch.tensor(val) for key, val in data_np.items()}
dataset = SequentialDataset(data, sample_uniform_slate)

with open(f'{data_dir}/ind2val.pickle', 'rb') as handle:
ind2val = pickle.load(handle)
with open(f'{data_dir}/ind2val.json', 'rb') as handle:
# Use string2int object_hook found here: https://stackoverflow.com/a/54112705
ind2val = json.load(
handle,
object_hook=lambda d: {int(k) if k.lstrip('-').isdigit() else k: v for k, v in d.items()}
)

num_validusers = int(len(dataset) * (1-split_trainvalid)/2)
num_testusers = int(len(dataset) * (1-split_trainvalid)/2)
@@ -79,16 +82,15 @@
}

dataloaders = {
phase: DataLoader(ds, batch_size=batch_size, shuffle=True)
phase: DataLoader(ds, batch_size=batch_size, shuffle=(phase=="train"), num_workers=12)
for phase, ds in subsets.items()
}
for key, dl in dataloaders.items():
logging.info(
f"In {key}: num_users: {len(dl.dataset)}, num_batches: {len(dl)}"
)


with open(f'{data_dir}/itemattr.pickle', 'rb') as handle:
itemattr = pickle.load(handle)
with np.load(f'{data_dir}/itemattr.npz', mmap_mode=None) as itemattr_file:
itemattr = {key : val for key, val in itemattr_file.items()}

return ind2val, itemattr, dataloaders
3 changes: 3 additions & 0 deletions data/data.npz
Git LFS file not shown
3 changes: 0 additions & 3 deletions data/data.pt.gz

This file was deleted.

3 changes: 3 additions & 0 deletions data/ind2val.json
Git LFS file not shown
3 changes: 0 additions & 3 deletions data/ind2val.pickle

This file was deleted.

3 changes: 3 additions & 0 deletions data/itemattr.npz
Git LFS file not shown
3 changes: 0 additions & 3 deletions data/itemattr.pickle

This file was deleted.

19 changes: 19 additions & 0 deletions datahelper.py
@@ -0,0 +1,19 @@
import logging
from google_drive_downloader import GoogleDriveDownloader as gdd
def download_data_files(data_dir : str = "data", overwrite=False):
"""
Downloads the data from google drive.
If files exist they will not be downloaded again unless overwrite=True
"""
gdrive_file_ids = {
'data.npz' : '1VXKXIvPCJ7z4BCa4G_5-Q2XMAD7nXOc7',
'ind2val.json' : '1WOCKfuttMacCb84yQYcRjxjEtgPp6F4N',
'itemattr.npz' : '1rKKyMQZqWp8vQ-Pl1SeHrQxzc5dXldnR'
}
for filename, gdrive_id in gdrive_file_ids.items():
logging.info("Downloading {}".format(filename))
gdd.download_file_from_google_drive(file_id=gdrive_id,
dest_path="{}/{}".format(data_dir, filename),
overwrite=overwrite)
logging.info("Done downloading all files.")
return True
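A short usage sketch for this helper (the `data` directory is the default from the signature above; the logging setup is an assumption):

```python
import logging
import datahelper

logging.basicConfig(level=logging.INFO)

datahelper.download_data_files(data_dir="data")                  # skips files that already exist
datahelper.download_data_files(data_dir="data", overwrite=True)  # forces a re-download
```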
1 change: 0 additions & 1 deletion dataset.py

This file was deleted.

110 changes: 110 additions & 0 deletions dataset_torch.py
@@ -0,0 +1,110 @@
#%% Imports
import torch
import datahelper
from torch.utils.data import Dataset, DataLoader
import json
import numpy as np
import logging
logging.basicConfig(format='%(asctime)s %(message)s', level='INFO')

#%% DATALOADERS
class SequentialDataset(Dataset):
'''
Note: displayType has been uncommented for future easy implementation.
'''
def __init__(self, data, sample_uniform_slate=False):

self.data = data
self.num_items = self.data['slate'].max()+1
self.sample_uniform_slate = sample_uniform_slate
logging.info(
"Loading dataset with slate size={} and uniform candidate sampling={}"
.format(self.data['slate'].size(), self.sample_uniform_slate))

def __getitem__(self, idx):
batch = {key: val[idx] for key, val in self.data.items()}

if self.sample_uniform_slate:
# Sample actions uniformly:
action = torch.randint_like(batch['slate'], low=3, high=self.num_items)

# Add noclick action at pos0
# and the actual click action at pos 1 (unless noclick):
action[:,0] = 1
clicked = batch['click']!=1
action[:,1][clicked] = batch['click'][clicked]
batch['slate'] = action
# Set click idx to 0 if noclick, and 1 otherwise:
batch['click_idx'] = clicked.long()

return batch

def __len__(self):
return len(self.data['click'])

#%% PREPARE DATA IN TRAINING
def load_dataloaders(data_dir,
batch_size=1024,
num_workers= 0,
sample_uniform_slate=False,
valid_pct= 0.05,
test_pct= 0.05,
t_testsplit= 5):

logging.info("Download data if not in data folder..")
datahelper.download_data_files(data_dir=data_dir)

logging.info('Load data..')
with np.load("{}/data.npz".format(data_dir)) as data_np:
data = {key: torch.tensor(val) for key, val in data_np.items()}
dataset = SequentialDataset(data, sample_uniform_slate)

with open('{}/ind2val.json'.format(data_dir), 'rb') as handle:
# Use string2int object_hook found here: https://stackoverflow.com/a/54112705
ind2val = json.load(
handle,
object_hook=lambda d: {
int(k) if k.lstrip('-').isdigit() else k: v
for k, v in d.items()
}
)

# Split dataset into train, validation and test:
num_validusers = int(len(dataset) * valid_pct)
num_testusers = int(len(dataset) * test_pct)
torch.manual_seed(0)
num_users = len(dataset)
perm_user = torch.randperm(num_users)
valid_user_idx = perm_user[:num_validusers]
test_user_idx = perm_user[num_validusers:(num_validusers+num_testusers)]
train_user_idx = perm_user[(num_validusers+num_testusers):]
# Mask type: 1: train, 2: valid, 3: test
dataset.data['mask_type'] = torch.ones_like(dataset.data['click'])
dataset.data['mask_type'][valid_user_idx, t_testsplit:] = 2
dataset.data['mask_type'][test_user_idx, t_testsplit:] = 3

subsets = {
'train': dataset,
'valid': torch.utils.data.Subset(dataset, valid_user_idx),
'test': torch.utils.data.Subset(dataset, test_user_idx)
}

# Build dataloaders for each data subset:
dataloaders = {
phase: DataLoader(ds, batch_size=batch_size, shuffle=(phase=="train"), num_workers=num_workers)
for phase, ds in subsets.items()
}
for key, dl in dataloaders.items():
logging.info(
"In {}: num_users: {}, num_batches: {}".format(key, len(dl.dataset), len(dl))
)

# Load item attributes:
with np.load('{}/itemattr.npz'.format(data_dir), mmap_mode=None) as itemattr_file:
itemattr = {key : val for key, val in itemattr_file.items()}

return ind2val, itemattr, dataloaders

if __name__ == "__main__":
load_dataloaders(data_dir="data")
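A sketch of how the returned objects could be consumed (the key names follow the tensors used in `SequentialDataset` above; the shape comments are assumptions based on the indexing in `__getitem__`):

```python
ind2val, itemattr, dataloaders = load_dataloaders(data_dir="data")

for batch in dataloaders["train"]:
    # Assumed shapes: slate [batch, time, slate_len], click [batch, time]:
    slate, click = batch["slate"], batch["click"]
    # mask_type marks each interaction as 1=train, 2=valid or 3=test:
    train_mask = batch["mask_type"] == 1
```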
1 change: 1 addition & 0 deletions requirements.txt
@@ -11,3 +11,4 @@ ax==0.52.0
plotly==4.14.3
pyro==3.16
PyYAML==5.4.1
googledrivedownloader==0.4