# Amazon ESCI dataset EDA

The goal of this notebook is to understand how the Amazon ESCI dataset was created, which of its features are interesting, and which benchmarks and metrics are used to evaluate models on it. It also includes some simple EDA to look at the distribution of features.

In [96]:
import plotly.express as px
import pandas as pd
from ranx import Qrels, Run, evaluate

In [97]:
df_examples = pd.read_parquet('../data/shopping_queries_dataset_examples.parquet')
df_products = pd.read_parquet('../data/shopping_queries_dataset_products.parquet')
df_sources = pd.read_csv("../data/shopping_queries_dataset_sources.csv")

In [98]:
# https://github.com/amazon-science/esci-data: suggested filter for task 1: Query-Product Ranking 
# Query-Product Ranking: Given a user specified query and a list of matched products, the goal of this 
# task is to rank the products so that the relevant products are ranked above the non-relevant ones.
df_examples_products = pd.merge(
    df_examples,
    df_products,
    how='left',
    left_on=['product_locale','product_id'],
    right_on=['product_locale', 'product_id']
)

df_task_1 = df_examples_products[df_examples_products["small_version"] == 1]
df_task_1_train = df_task_1[df_task_1["split"] == "train"]
df_task_1_test = df_task_1[df_task_1["split"] == "test"]
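A left merge silently keeps examples whose product is missing and silently duplicates rows if a product key appears more than once. The sketch below shows one way to guard against both, using toy frames (the values and the `product_title` column are made up for illustration) in place of `df_examples` and `df_products`.

```python
import pandas as pd

# Toy stand-ins for df_examples and df_products (hypothetical values).
examples = pd.DataFrame({
    "example_id": [0, 1, 2],
    "query_id": [10, 10, 11],
    "product_id": ["A", "B", "A"],
    "product_locale": ["us", "us", "es"],
})
products = pd.DataFrame({
    "product_id": ["A", "B"],
    "product_locale": ["us", "us"],
    "product_title": ["title A", "title B"],
})

merged = pd.merge(
    examples,
    products,
    how="left",
    on=["product_locale", "product_id"],
    validate="many_to_one",  # raises MergeError if a product key is duplicated
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Examples whose (locale, product_id) pair was not found in the product table.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows after merge,", len(unmatched), "without product data")
```

On the real data the same two arguments could be added to the merge above; whether any rows end up unmatched there has not been checked here.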

# simple EDA

In [99]:
# describe
df_task_1.describe()

| | example_id | query_id | small_version | large_version |
|---|---|---|---|---|
| count | 1118011.0 | 1118011.0 | 1118011.0 | 1118011.0 |
| mean | 1376919.0 | 69634.81 | 1.0 | 1.0 |
| std | 819569.7 | 41907.52 | 0.0 | 0.0 |
| min | 16.0 | 1.0 | 1.0 | 1.0 |
| 25% | 645385.5 | 32029.0 | 1.0 | 1.0 |
| 50% | 1405883.0 | 71429.0 | 1.0 | 1.0 |
| 75% | 2159588.0 | 110668.0 | 1.0 | 1.0 |
| max | 2621255.0 | 130649.0 | 1.0 | 1.0 |


In [101]:
# number of judgements (rows) per product locale
df_task_1.product_locale.value_counts()

product_locale
us    601354
jp    297883
es    218774
Name: count, dtype: int64

### Align values to dataset description
Check that the judgement counts per locale match the dataset statistics published at https://github.com/amazon-science/esci-data:
![reduced-dataset-count-product-locale](imgs/dataset_total.png)
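The comparison against the published table can also be done programmatically. The following is a minimal sketch: `count_mismatches` is a hypothetical helper, and the expected values are simply the counts printed by the `value_counts()` cell above, used as placeholders; they should be replaced with the figures read from the README table.

```python
import pandas as pd

# Placeholder expectations: copied from the value_counts() output in this
# notebook. Replace with the numbers from the README table before relying on it.
EXPECTED_JUDGEMENTS = {"us": 601354, "jp": 297883, "es": 218774}

def count_mismatches(observed: pd.Series, expected: dict) -> dict:
    """Return {locale: (observed, expected)} for every locale that differs."""
    observed_counts = observed.to_dict()
    locales = set(observed_counts) | set(expected)
    return {
        loc: (observed_counts.get(loc), expected.get(loc))
        for loc in locales
        if observed_counts.get(loc) != expected.get(loc)
    }

# Stand-in for df_task_1.product_locale.value_counts()
observed = pd.Series(EXPECTED_JUDGEMENTS)
print(count_mismatches(observed, EXPECTED_JUDGEMENTS))
```

An empty dictionary means every locale matches. In the notebook, the call would be `count_mismatches(df_task_1.product_locale.value_counts(), EXPECTED_JUDGEMENTS)`.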

### Dataset understanding

Each query_id identifies one unique search query. \
Each judgement is a product that was manually labelled for a given query_id. For example, 35 products might have been shown to the user for a given query. \
The depth is the number of products that were judged for a query.

In [106]:
# judgements per query (the depth), computed once and reused
judgements_per_query = df_task_1.groupby('query_id')['example_id'].nunique()
print("Average judgements per query", judgements_per_query.mean())
print("Max judgements per query", judgements_per_query.max())
print("Min judgements per query", judgements_per_query.min())

Average judgements per query 23.14722567287785
Max judgements per query 188
Min judgements per query 8


### ESCI understanding
- E: Exact
- S: Substitute
- C: Complement
- I: Irrelevant

These labels give a rough relevance ordering of the judged products within each query_id, from Exact (most relevant) down to Irrelevant (least relevant).
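To make the ordering concrete, the labels can be mapped to integer ranks and each query's products sorted by that rank. This is a sketch on toy data; the `esci_rank` column and the integer values are illustrative assumptions, and only the ordering E > S > C > I is taken from the label definitions.

```python
import pandas as pd

# Illustrative ranks: only the relative order matters, not the magnitudes.
ESCI_RANK = {"I": 0, "C": 1, "S": 2, "E": 3}

toy = pd.DataFrame({
    "query_id": [1, 1, 1, 1, 2, 2],
    "product_id": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "esci_label": ["S", "I", "E", "C", "C", "E"],
})
toy["esci_rank"] = toy["esci_label"].map(ESCI_RANK)

# Ideal ordering: within each query, most relevant product first.
ideal = toy.sort_values(
    ["query_id", "esci_rank"], ascending=[True, False], kind="stable"
)
print(ideal[["query_id", "product_id", "esci_label"]])
```

Using a separate rank column leaves `esci_label` untouched, so later group-bys on the label behave as before.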

In [111]:
# share of each ESCI label within each query_id, then averaged over all queries
esci_counts = df_task_1.groupby(['query_id', 'esci_label']).size().unstack(fill_value=0)
esci_ratios = esci_counts.div(esci_counts.sum(axis=1), axis=0)
avg_esci_ratio = esci_ratios.mean(axis=0)

avg_esci_ratio

esci_label
C    0.053341
E    0.437810
I    0.163013
S    0.345836
dtype: float64
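Note that the cell above averages per-query ratios, so every query counts equally regardless of how many judgements it has (a macro average). Pooling all judgements first (a micro average) can give different numbers. The toy example below, with made-up labels, shows the difference.

```python
import pandas as pd

# Toy data: query 1 has four judgements, query 2 has two.
toy = pd.DataFrame({
    "query_id":   [1, 1, 1, 1, 2, 2],
    "esci_label": ["E", "E", "E", "I", "S", "S"],
})

counts = toy.groupby(["query_id", "esci_label"]).size().unstack(fill_value=0)

# Macro average: each query contributes equally (same logic as the cell above).
macro = counts.div(counts.sum(axis=1), axis=0).mean(axis=0)

# Micro average: each judgement contributes equally.
micro = toy["esci_label"].value_counts(normalize=True)

print(macro.round(3).to_dict())
print(micro.round(3).to_dict())
```

Here the share of `E` is 0.375 under the macro average and 0.5 under the micro average, because the query with more judgements dominates the pooled count.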

### Model evaluation
Amazon published BERT-based baseline models evaluated on the ESCI dataset. The dataset is used for three tasks: query-product ranking, multi-class product classification, and product substitute identification. For the first task, they fine-tuned an MS MARCO Cross-Encoder for the us locale and a multilingual MPNet model for the es and jp locales.

![dataset-benchmark](imgs/amazon_finetune_results.png)
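Ranking baselines of this kind are typically scored with nDCG. The sketch below is a dependency-free illustration of that metric, not the benchmark's own evaluation code. The gain values are an assumption (E=1.0, S=0.1, C=0.01, I=0.0, as commonly cited for the ESCI ranking task) and should be checked against the official repository. The `dcg` and `ndcg` helpers are hypothetical names.

```python
import math

# Assumed gains per label; verify against the official ESCI evaluation setup.
GAINS = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}

def dcg(labels):
    """Discounted cumulative gain of labels listed in ranked order."""
    return sum(GAINS[lab] / math.log2(pos + 2) for pos, lab in enumerate(labels))

def ndcg(ranked_labels):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_labels, key=GAINS.get, reverse=True)
    best = dcg(ideal)
    return dcg(ranked_labels) / best if best > 0 else 0.0

# Labels of one query's products, in the order a model ranked them.
perfect = ndcg(["E", "S", "C", "I"])   # already in ideal order
reversed_ = ndcg(["I", "C", "S", "E"])  # worst possible order
print(perfect, round(reversed_, 3))
```

A ranking already in ideal order scores 1.0, and any other order scores lower. The `ranx` imports at the top of the notebook offer the same metric through `evaluate`, but that path is not exercised in this sketch.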