Public Data Release 1.1.0
This repo contains the description of the data released together with our SIGIR eCom 2020 paper Fantastic Embeddings and How to Align Them: Zero-Shot Inference in a Multi-Shop Scenario.
The dataset is available for research and educational purposes at this page. To obtain the dataset, you are required to fill out a form with information about yourself and your institution, and to agree to the Terms And Conditions for fair usage of the data.
For convenience, the Terms And Conditions are also included in a pure txt format in this repo: usage of the data implies the acceptance of these Terms And Conditions.
The dataset is provided in five files inside a zip archive:
- a json file, structured as a list of lists. Each list contains a cross-shop session, that is, a shopping session initiated on Shop A and terminated on Shop B (or vice versa); items in each list are ordered chronologically, and they all have the syntax SHOP1_SKU41, that is, a (hashed) identifier of the shop first, followed by _ and a hashed identifier of the product the shopper interacted with. A sample json file is provided in this repo: the cross-shop session ["SHOP1_SKU21", "SHOP1_SKU32", "SHOP2_SKU13"] means that an anonymous shopper interacted with products 21 and 32 on the first shop, then browsed to the second shop and interacted with 13. Please remember that each shop has a different identifier policy, which makes the alignment problem interesting. The cross-shop dataset contains a total of 12,259 sessions;
- two json files, labelled original_vectors, one for each shop: they contain a map between product identifiers (hashed in the same way as in the cross-shop dataset) and the related product embeddings, as trained separately for each shop (a previous release included a pickle file; versions >= 1.1.0 are the recommended versions);
- two json files, labelled aligned_vectors, one for each shop: they contain a map between product identifiers and the related product embeddings, after the alignment proposed in the paper.
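As a minimal sketch of how the cross-shop session file described above might be read, the snippet below parses a list of lists and separates each session at the point where the shopper moves to the other shop. The helper names (split_identifier, split_session) and the file name in the comment are hypothetical, and splitting at the first underscore assumes that the hashed shop identifier itself contains no underscore.

```python
import json
from typing import List, Tuple


def split_identifier(item: str) -> Tuple[str, str]:
    """Split 'SHOP1_SKU21' into ('SHOP1', 'SKU21') at the first underscore.

    Assumption: the shop part of the identifier contains no underscore.
    """
    shop, _, product = item.partition("_")
    return shop, product


def split_session(session: List[str]) -> Tuple[List[str], List[str]]:
    """Return (items on the first shop, items from the first shop switch onwards).

    If the session never leaves the first shop, the second list is empty.
    """
    if not session:
        return [], []
    first_shop = split_identifier(session[0])[0]
    for i, item in enumerate(session):
        if split_identifier(item)[0] != first_shop:
            return session[:i], session[i:]
    return list(session), []


# With the real data one would load the file instead, for example:
#   with open("cross_shop_sessions.json") as f:  # hypothetical file name
#       sessions = json.load(f)
sessions = json.loads('[["SHOP1_SKU21", "SHOP1_SKU32", "SHOP2_SKU13"]]')

for session in sessions:
    before, after = split_session(session)
    print(before, "->", after)
```

Keeping the full identifier in each returned item, rather than only the product part, makes it easy to look the same keys up in the vector files later.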
We refer the reader to the original work for an extended explanation of the alignment problem. Usage of this data implies the acceptance of the Terms And Conditions as set forth on the download page.
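One way to explore the aligned_vectors files is to look up, for a product in one shop, the closest products in the other shop by cosine similarity. The sketch below is illustrative only and is not the evaluation procedure of the paper: it assumes each file decodes to a JSON object mapping a product identifier to a list of floats, it assumes the aligned vectors of the two shops can be compared directly, and it uses toy two-dimensional vectors and a hypothetical helper name (nearest_in_other_shop) in place of the real data.

```python
import math
from typing import Dict, List, Tuple

Vectors = Dict[str, List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_in_other_shop(
    query_id: str, source: Vectors, target: Vectors, k: int = 3
) -> List[Tuple[str, float]]:
    """Rank the products of `target` by similarity to `source[query_id]`, best first."""
    query = source[query_id]
    scored = [(pid, cosine(query, vec)) for pid, vec in target.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for the two aligned_vectors files (assumed layout: id -> list of floats).
# With the real data, each dictionary would come from json.load on one file.
shop1 = {"SHOP1_SKU21": [1.0, 0.0], "SHOP1_SKU32": [0.0, 1.0]}
shop2 = {"SHOP2_SKU13": [0.9, 0.1], "SHOP2_SKU7": [0.0, 1.0]}

print(nearest_in_other_shop("SHOP1_SKU21", shop1, shop2, k=1))
```

The brute-force scan is fine for a quick inspection; for larger catalogs a vectorized or approximate nearest-neighbor search would be a more practical choice.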
For questions about the paper or the dataset, please reach out to Jacopo Tagliabue.
The original paper is a collaboration between industry and academia, over a dataset kindly provided by Coveo. The authors of the paper are:
- Federico Bianchi - Postdoctoral Researcher at Università Bocconi
- Jacopo Tagliabue - Coveo AI Labs
- Bingqing Yu - Coveo
- Luca Bigon - Coveo
- Ciro Greco - Coveo AI Labs
The authors wish to thank Richard Tessier and Coveo's legal team for supporting our research and believing in this data sharing initiative.
If you make use of this dataset, please cite our work:
@inproceedings{BianchiSIGIReCom2020,
title = {Fantastic Embeddings and How to Align Them: Zero-Shot Inference in a Multi-Shop Scenario},
author = {Bianchi, Federico and Tagliabue, Jacopo and Yu, Bingqing and Bigon, Luca and Greco, Ciro},
url = {https://arxiv.org/abs/2007.14906},
booktitle = {Proceedings of the SIGIR 2020 eCom workshop, July 2020, Virtual Event, published at
http://ceur-ws.org (to appear)},
year = {2020}
}