# SIGIR eCOM 2021 Data Challenge

- [GitHub repository](https://github.com/coveooss/SIGIR-ecom-data-challenge) for the competition
- The data is available here: https://www.coveo.com/en/ailabs/sigir-ecom-data-challenge
- The NVIDIA team's solution
    - [Winning the SIGIR eCommerce Challenge on session-based recommendation with Transformers](https://medium.com/nvidia-merlin/winning-the-sigir-ecommerce-challenge-on-session-based-recommendation-with-transformers-v2-793f6fac2994)
    - [GitHub](https://github.com/NVIDIA-Merlin/competitions/tree/main/SIGIR_eCommerce_Challenge_2021)
    - [Paper](https://arxiv.org/abs/2107.05124)
- Transformers4Rec
    - NVIDIA, [Transformers4Rec](https://github.com/NVIDIA-Merlin/Transformers4Rec) on GitHub
    - [Tutorial](https://nvidia-merlin.github.io/Transformers4Rec/main/examples/tutorial/index.html)

In [7]:
import os 
import pandas as pd

## [How to Start](https://github.com/coveooss/SIGIR-ecom-data-challenge/blob/main/README_DC_2021.md#how-to-start)

Download the `zip` folder and unzip it on your local machine. To verify that all is well, run the simple `start/dataset_stats.py` script in the folder: it parses the three files, shows some sample rows, and prints basic stats and counts (if you don't modify the three paths, it runs on the sample `csv` files).

In [14]:
PATH = "/Users/alain/data/SIGIR-ecom-data-challenge"

In [15]:
import csv
from datetime import datetime

# put here the file paths if you did not unzip in same folder
BROWSING_FILE_PATH = os.path.join(PATH, 'start/browsing_train_sample.csv')
SEARCH_TRAIN_PATH = os.path.join(PATH, 'start/search_train_sample.csv')
SKU_2_CONTENT_PATH = os.path.join(PATH, 'start/sku_to_content_sample.csv')


def get_rows(file_path: str, print_limit: int = 2):
    """
    Util function reading the csv file and printing the first few lines out for visual debugging.

    :param file_path: local path to the csv file
    :param print_limit: specifies how many rows to print out in the console for debug
    :return: list of dictionaries, one per each row in the file
    """
    rows = []
    print("\n============== {}".format(file_path))
    with open(file_path) as csvfile:
        reader = csv.DictReader(csvfile)
        for idx, row in enumerate(reader):
            # print out first few lines
            if idx < print_limit:
                print(row)
            rows.append(row)

    return rows


def get_descriptive_stats(
        browsing_train_path : str,
        search_train_path: str,
        sku_2_content_path: str
):
    """
    Simple function showing how to read the main training files, print out some
    example rows, and producing the counts found in the Data Challenge paper.

    We use basic python library commands, optimizing for clarity, not performance.

    :param browsing_train_path: path to the file containing the browsing interactions
    :param search_train_path: path to the file containing the search interactions
    :param sku_2_content_path: path to the file containing the product meta-data
    :return:
    """
    print("Starting our counts at {}".format(datetime.utcnow()))
    # first, just read in the csv files and display some rows
    browsing_events = get_rows(browsing_train_path)
    print("# {} browsing events".format(len(browsing_events)))
    search_events = get_rows(search_train_path)
    print("# {} search events".format(len(search_events)))
    sku_mapping = get_rows(sku_2_content_path)
    print("# {} products".format(len(sku_mapping)))
    # now do some counts
    print("\n\n=============== COUNTS ===============")
    print("# {} of distinct SKUs with interactions".format(
        len(set([r['product_sku_hash'] for r in browsing_events if r['product_sku_hash']]))))
    print("# {} of add-to-cart events".format(sum(1 for r in browsing_events if r['product_action'] == 'add')))
    print("# {} of purchase events".format(sum(1 for r in browsing_events if r['product_action'] == 'purchase')))
    print("# {} of total interactions".format(sum(1 for r in browsing_events if r['product_action'])))
    print("# {} of distinct sessions".format(
        len(set([r['session_id_hash'] for r in browsing_events if r['session_id_hash']]))))
    # now run some tests
    print("\n\n*************** TESTS ***************")
    for r in browsing_events:
        assert len(r['session_id_hash']) == 64
        assert not r['product_sku_hash'] or len(r['product_sku_hash']) == 64
    for p in sku_mapping:
        assert not p['price_bucket'] or float(p['price_bucket']) <= 10
    # say goodbye
    print("All done at {}: see you, space cowboy!".format(datetime.utcnow()))

    return


if __name__ == "__main__":
    get_descriptive_stats(
        BROWSING_FILE_PATH,
        SEARCH_TRAIN_PATH,
        SKU_2_CONTENT_PATH
    )

Starting our counts at 2023-10-15 15:16:56.860418

{'session_id_hash': '20c458b802f6ea9374783bfc528b19421be977a6769785ec2b357915dd6fda84', 'event_type': 'event_product', 'product_action': 'detail', 'product_sku_hash': 'd5157f8bc52965390fa21ad5842a8502bc3eb8b0930f3f8eafbc503f4012f69c', 'server_timestamp_epoch_ms': '1550885210881', 'hashed_url': '7e4527ac6a32deed4f4f06bb7c49b907b7ca371e59d57d2bb8b99951799f45ea'}
{'session_id_hash': '20c458b802f6ea9374783bfc528b19421be977a6769785ec2b357915dd6fda84', 'event_type': 'event_product', 'product_action': 'detail', 'product_sku_hash': '61ef3869355b78e11011f39fc7ac8f8dfb209b3442a9d516576eb808a0904dc3', 'server_timestamp_epoch_ms': '1550885213307', 'hashed_url': '4ed279f4f0deab6dfc80f4f7bf49d527fd894fa478a9ce44dc007eaed41f4ad9'}
# 34 browsing events

{'session_id_hash': '48fade624d47870058ce07dd789ccc04e46c70c0fa2a1b6a2bbbec644b8ced78', 'query_vector': '[-0.20255649089813232, -0.016908567398786545, 0.03185821324586868, -0.015581827610731125, -0.101

## Loading Data

See https://github.com/NVIDIA-Merlin/competitions/blob/main/SIGIR_eCommerce_Challenge_2021/task1_session_based_rec/0-eda/EDA_train_x_test_set.ipynb

### Train set

In [38]:
train_browsing_df = pd.read_csv(os.path.join(PATH, 'train/browsing_train.csv'))
len(train_browsing_df)

36079307

In [39]:
train_browsing_df

| | session_id_hash | event_type | product_action | product_sku_hash | server_timestamp_epoch_ms | hashed_url |
|---|---|---|---|---|---|---|
| 0 | 20c458b802f6ea9374783bfc528b19421be977a6769785... | event_product | detail | d5157f8bc52965390fa21ad5842a8502bc3eb8b0930f3f... | 1550885210881 | 7e4527ac6a32deed4f4f06bb7c49b907b7ca371e59d57d... |
| 1 | 20c458b802f6ea9374783bfc528b19421be977a6769785... | event_product | detail | 61ef3869355b78e11011f39fc7ac8f8dfb209b3442a9d5... | 1550885213307 | 4ed279f4f0deab6dfc80f4f7bf49d527fd894fa478a9ce... |
| 2 | 20c458b802f6ea9374783bfc528b19421be977a6769785... | pageview | | | 1550885213307 | 4ed279f4f0deab6dfc80f4f7bf49d527fd894fa478a9ce... |
| 3 | 20c458b802f6ea9374783bfc528b19421be977a6769785... | event_product | detail | d5157f8bc52965390fa21ad5842a8502bc3eb8b0930f3f... | 1550885215484 | 7e4527ac6a32deed4f4f06bb7c49b907b7ca371e59d57d... |
| 4 | 20c458b802f6ea9374783bfc528b19421be977a6769785... | pageview | | | 1550885215484 | 7e4527ac6a32deed4f4f06bb7c49b907b7ca371e59d57d... |
| ... | ... | ... | ... | ... | ... | ... |
| 36079302 | 0676f342dc490b0f8bd9c22d16e4c67f8f7af1f85679f1... | pageview | | | 1552162324909 | 38f5bd3c9a1cc5b39e6b965f1aa6c565737f58e19a560a... |
| 36079303 | 0676f342dc490b0f8bd9c22d16e4c67f8f7af1f85679f1... | pageview | | | 1552162336608 | 38f5bd3c9a1cc5b39e6b965f1aa6c565737f58e19a560a... |
| 36079304 | 0676f342dc490b0f8bd9c22d16e4c67f8f7af1f85679f1... | pageview | | | 1552162343684 | 38f5bd3c9a1cc5b39e6b965f1aa6c565737f58e19a560a... |
| 36079305 | 0676f342dc490b0f8bd9c22d16e4c67f8f7af1f85679f1... | pageview | | | 1552162356368 | 433b0e71df1fe9a8d1f45647545701f6108414c40eef76... |


In [40]:
train_browsing_df['server_timestamp_epoch_ms'].agg(['min', 'max'])

min    1547528564513
max    1555300798560
Name: server_timestamp_epoch_ms, dtype: int64
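
The two values above are Unix epoch timestamps in milliseconds, which are hard to read as is. A minimal sketch of converting them to dates with `pd.to_datetime(..., unit="ms")` follows; the variable names are illustrative, and the timestamps are assumed to be UTC, which the source does not state.

```python
import pandas as pd

# Min and max of server_timestamp_epoch_ms, copied from the output above
span_ms = pd.Series({"min": 1547528564513, "max": 1555300798560})

# Interpret the integers as milliseconds since the Unix epoch
span = pd.to_datetime(span_ms, unit="ms")
print(span)

# Length of the period covered by the browsing data
print("covered period:", span["max"] - span["min"])
```

The same call can be applied to the whole `server_timestamp_epoch_ms` column when per-event dates are needed.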

In [25]:
train_search_df = pd.read_csv(os.path.join(PATH, 'train/search_train.csv'))
len(train_search_df)

819516

In [26]:
train_search_items_clicked = train_search_df[~train_search_df['clicked_skus_hash'].isna()]['clicked_skus_hash'].explode().unique()
len(train_search_items_clicked)

73311

In [27]:
train_sku_to_content_df = pd.read_csv(os.path.join(PATH, 'train/sku_to_content.csv'))
len(train_sku_to_content_df)

66386

### Test set

In [28]:
import json 

with open(os.path.join(PATH, 'baselines/session_rec_sigir_data/test/rec_test_sample.json')) as json_file:
    test_queries = json.load(json_file)
    testset_recommendation_df = pd.json_normalize(test_queries, 'query')
    print(len(testset_recommendation_df))

59


**NOTE**: This is not the same JSON file as the one NVIDIA used. I get 59 rows, whereas NVIDIA's file has 576435 rows. The file loaded here is `rec_test_sample.json`, so it is presumably only a small sample of the full test set. See https://github.com/NVIDIA-Merlin/competitions/blob/main/SIGIR_eCommerce_Challenge_2021/task1_session_based_rec/0-eda/EDA_train_x_test_set.ipynb
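
When reading the row count, it helps to remember that `pd.json_normalize(..., 'query')` yields one row per event, not one row per session. The sketch below illustrates this on a made-up payload; it assumes each test entry is a dictionary whose `query` key holds a list of event records, and the toy records and the name `toy_queries` are hypothetical.

```python
import pandas as pd

# Hypothetical toy payload mimicking the assumed layout of the test file
toy_queries = [
    {"query": [
        {"session_id_hash": "s1", "event_type": "pageview", "product_sku_hash": None},
        {"session_id_hash": "s1", "event_type": "event_product", "product_sku_hash": "sku_a"},
    ]},
    {"query": [
        {"session_id_hash": "s2", "event_type": "pageview", "product_sku_hash": None},
    ]},
]

# record_path="query" flattens every event of every entry into one row
toy_df = pd.json_normalize(toy_queries, record_path="query")

print(len(toy_df))                          # number of events
print(toy_df["session_id_hash"].nunique())  # number of sessions
```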

## Analyzing train data

### Items

In [29]:
train_browsing_items = train_browsing_df['product_sku_hash'].unique()
len(train_browsing_items)

57484

In [30]:
import ast

def convert_str_to_list(x): 
    if pd.isnull(x): 
        return x
    return ast.literal_eval(x)

In [31]:
for col in ['product_skus_hash', 'clicked_skus_hash']: 
    train_search_df[col] = train_search_df[col].apply(convert_str_to_list)

In [32]:
train_search_items = train_search_df[~train_search_df['product_skus_hash'].isna()]['product_skus_hash'].explode().unique()
len(train_search_items)

30399

**NOTE**: There is a difference here. I get 30399, whereas NVIDIA gets 189796. See https://github.com/NVIDIA-Merlin/competitions/blob/main/SIGIR_eCommerce_Challenge_2021/task1_session_based_rec/0-eda/EDA_train_x_test_set.ipynb

In [33]:
content_items = train_sku_to_content_df['product_sku_hash'].unique()
len(content_items)

66386

In [34]:
len(set(train_browsing_items).difference(set(train_search_items)))

28182

In [35]:
len(set(train_search_items).difference(set(train_browsing_items)))

1097

In [41]:
len(set(train_search_items_clicked).difference(set(train_browsing_items)))

73311

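**NOTE**: There is a big difference here. I get 73311, whereas NVIDIA gets 100. A likely cause: `train_search_items_clicked` was computed in cell In [26], before the string-to-list conversion in In [31], so `explode()` operated on the raw strings. The array then holds string representations of lists rather than individual SKU hashes, none of which can match a browsing SKU, hence the difference equals the full 73311.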
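
The effect of calling `explode()` before versus after parsing the list columns can be checked on a toy column. This is a minimal sketch on made-up data; `raw` and `browsed` are hypothetical names, not part of the challenge files.

```python
import ast
import pandas as pd

# Hypothetical toy column as read_csv delivers it: lists stored as strings
raw = pd.Series(["['sku_a', 'sku_b']", None, "['sku_a']"])

# explode() on strings leaves each string untouched
as_strings = raw.dropna().explode().unique()

# Parsing first turns every string into a real list that explode() can unpack
parsed = raw.dropna().apply(ast.literal_eval)
as_items = parsed.explode().unique()

browsed = {"sku_a"}  # hypothetical set of browsed SKUs

# Unparsed: every "item" is a whole list string, so nothing matches
print(len(set(as_strings) - browsed))
# Parsed: only the SKU that was never browsed remains
print(len(set(as_items) - browsed))
```

Recomputing the clicked-SKU set after the conversion cell would show whether this accounts for the gap with NVIDIA's figure.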

### Sessions

In [42]:
train_browsing_df['session_id_hash'].nunique()

4934699

In [43]:
len(set(train_browsing_df['session_id_hash'].unique()).difference(set(train_search_df['session_id_hash'].unique())))

4460582

In [44]:
len(set(train_search_df['session_id_hash'].unique()).difference(set(train_browsing_df['session_id_hash'].unique())))

75983
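
The two set differences above can be complemented with the intersection to get a full picture of how the tables overlap. A minimal sketch on toy frames follows; the session ids and the `overlap` dictionary are illustrative only.

```python
import pandas as pd

# Hypothetical toy tables standing in for the browsing and search data
browsing = pd.DataFrame({"session_id_hash": ["s1", "s1", "s2", "s3"]})
search = pd.DataFrame({"session_id_hash": ["s2", "s4"]})

browsing_sessions = set(browsing["session_id_hash"])
search_sessions = set(search["session_id_hash"])

overlap = {
    "browsing_only": len(browsing_sessions - search_sessions),
    "search_only": len(search_sessions - browsing_sessions),
    "both": len(browsing_sessions & search_sessions),
}
print(overlap)
```

With the counts reported above, 4934699 - 4460582 = 474117 browsing sessions also appear in the search table.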

## Comparing train and test data

See https://github.com/NVIDIA-Merlin/competitions/blob/main/SIGIR_eCommerce_Challenge_2021/task1_session_based_rec/0-eda/EDA_train_x_test_set.ipynb

**NOTE**: A real comparison is not possible, since the test `json` file used here is not comparable to NVIDIA's.

## Processing DATA with pandas

See https://github.com/NVIDIA-Merlin/competitions/blob/main/SIGIR_eCommerce_Challenge_2021/task1_session_based_rec/1-preprocessing/ecom_preproc_step1.ipynb

In [6]:
import os

In [45]:
DATA_FOLDER = "/Users/alain/data/SIGIR-ecom-data-challenge/train/"
FILENAME_PATTERN_BROWSING = 'browsing_train.csv'
FILENAME_PATTERN_SEARCH = 'search_train.csv'
FILENAME_PATTERN_SKU = 'sku_to_content.csv'
DATA_PATH_BROWSING = os.path.join(DATA_FOLDER, FILENAME_PATTERN_BROWSING)
DATA_PATH_SEARCH = os.path.join(DATA_FOLDER, FILENAME_PATTERN_SEARCH)
DATA_PATH_SKU = os.path.join(DATA_FOLDER, FILENAME_PATTERN_SKU)
OUTPUT_DIR = "/Users/alain/data/workspace"
!ls $DATA_PATH_BROWSING

/Users/alain/data/SIGIR-ecom-data-challenge/train/browsing_train.csv


In [3]:
MINIMUM_SESSION_LENGTH = 2
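
How the threshold is applied is not shown in this section. One plausible reading, sketched here on made-up data, is to drop sessions with fewer than `MINIMUM_SESSION_LENGTH` rows; the `toy` frame and the filtering logic are assumptions, not the NVIDIA notebook's code.

```python
import pandas as pd

MINIMUM_SESSION_LENGTH = 2

# Hypothetical toy interactions: s2 has a single event
toy = pd.DataFrame({
    "session_id_hash": ["s1", "s1", "s2", "s3", "s3", "s3"],
    "product_sku_hash": ["a", "b", "c", "a", "d", "e"],
})

# Number of rows per session, broadcast back to every row
session_length = toy.groupby("session_id_hash")["product_sku_hash"].transform("size")

# Keep only the rows that belong to sufficiently long sessions
filtered = toy[session_length >= MINIMUM_SESSION_LENGTH]
print(filtered["session_id_hash"].unique())
```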

### Preprocessing of search tables

In [10]:
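# load search data
NROWS = 100  # not defined in the original cell; value inferred from the 100-entry output of search.info()
search = pd.read_csv(DATA_PATH_SEARCH, sep=',', nrows=NROWS)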

In [13]:
search.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   session_id_hash            100 non-null    object
 1   query_vector               100 non-null    object
 2   clicked_skus_hash          16 non-null     object
 3   product_skus_hash          69 non-null     object
 4   server_timestamp_epoch_ms  100 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 4.0+ KB


In [15]:
search.head()

| | session_id_hash | query_vector | clicked_skus_hash | product_skus_hash | server_timestamp_epoch_ms |
|---|---|---|---|---|---|
| 0 | 48fade624d47870058ce07dd789ccc04e46c70c0fa2a1b... | [-0.20255649089813232, -0.016908567398786545, ... | | | 1548575194779 |
| 1 | 8731ca84ff7bb8cb647531d54e64feedb2519b4a7792a7... | [-0.007610442116856575, -0.14909175038337708, ... | | ['9ee9ffd7e2529a65f9a0b0c9eaae6330df85cf2e3af3... | 1548276763869 |
| 2 | 9be980708345944960645d03606ea83b637cae9106b705... | [-0.20023074746131897, -0.03151938319206238, 0... | | ['7cc72dbed53bab78ec6a62feaa5052a7a1db7d201664... | 1548937997295 |
| 3 | 9be980708345944960645d03606ea83b637cae9106b705... | [-0.18556387722492218, -0.07620412111282349, 0... | | ['62c4ddab6c1c81c74d315376b3c0dc7768c0286b3dc6... | 1548938038268 |
| 4 | 9be980708345944960645d03606ea83b637cae9106b705... | [-0.03269264101982117, -0.27234694361686707, 0... | | ['2a0ee2924feabeec35e21e8fcb4d5b0684d190e46cef... | 1548938093827 |


In [17]:
import ast

ast.literal_eval("[1,2,3]")

[1, 2, 3]