# Amazon KDD Cup 2023 - Task 1 - Next Product Recommendation 

![](https://images.aicrowd.com/raw_images/challenges/banner_file/1116/6c8fecd6d7c225b4ed11.jpg)

This notebook contains instructions and an example submission that uses random predictions.



## Installations 🤖

1. `aicrowd-cli` for downloading challenge data and making submissions
2. `pyarrow` for saving to parquet for submissions
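
Before running the cells below, it can help to confirm that both tools are available in the current environment. The following is a minimal sketch using only the standard library. `missing_requirements` is a hypothetical helper name. It assumes the CLI executable is called `aicrowd` (as in the commands used later in this notebook) and that the package is importable as `pyarrow`.

```python
import importlib.util
import shutil


def missing_requirements(modules=("pyarrow",), executables=("aicrowd",)):
    """Return the names of modules/executables that cannot be found."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in executables:
        # shutil.which returns None when the command is not on PATH
        if shutil.which(name) is None:
            missing.append(name)
    return missing


print("Missing:", missing_requirements() or "nothing")
```

An empty result suggests both the package and the command line tool can be found. Any name that is listed probably needs to be installed, or added to `PATH`, first.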

## Login to AIcrowd and download the data 📚

In [3]:
!aicrowd login

'aicrowd' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
!aicrowd dataset download --challenge task-1-next-product-recommendation

'aicrowd' is not recognized as an internal or external command,
operable program or batch file.


## Setup data and task information

In [1]:
import os
import numpy as np
import pandas as pd
from functools import lru_cache

In [2]:
train_data_dir = './data/'
test_data_dir = './data/'
task = 'task1'
PREDS_PER_SESSION = 100

In [3]:
# Cache loading of data for multiple calls

@lru_cache(maxsize=1)
def read_product_data():
    return pd.read_csv(os.path.join(train_data_dir, 'products_train.csv'))

@lru_cache(maxsize=1)
def read_train_data():
    return pd.read_csv(os.path.join(train_data_dir, 'sessions_train.csv'))

@lru_cache(maxsize=3)
def read_test_data(task):
    return pd.read_csv(os.path.join(test_data_dir, f'sessions_test_{task}.csv'))

## Data Description

The Multilingual Shopping Session Dataset is a collection of **anonymized customer sessions** containing products from six different locales, namely English, German, Japanese, French, Italian, and Spanish. It consists of two main components: **user sessions** and **product attributes**. User sessions are a list of products that a user has engaged with in chronological order, while product attributes include various details like product title, price in local currency, brand, color, and description.

---

### Each product has the following associated information:


**locale**: the locale code of the product (e.g., DE)

**id**: a unique identifier for the product, also known as the Amazon Standard Identification Number (ASIN) (e.g., B07WSY3MG8)

**title**: title of the item (e.g., “Japanese Aesthetic Sakura Flowers Vaporwave Soft Grunge Gift T-Shirt”)

**price**: price of the item in local currency (e.g., 24.99)

**brand**: item brand name (e.g., “Japanese Aesthetic Flowers & Vaporwave Clothing”)

**color**: color of the item (e.g., “Black”)

**size**: size of the item (e.g., “xxl”)

**model**: model of the item (e.g., “iphone 13”)

**material**: material of the item (e.g., “cotton”)

**author**: author of the item (e.g., “J. K. Rowling”)

**desc**: description of an item’s key features and benefits, called out via bullet points (e.g., “Solid colors: 100% Cotton; Heather Grey: 90% Cotton, 10% Polyester; All Other Heathers …”)


## EDA 💽

In [4]:
def read_locale_data(locale, task):
    products = read_product_data().query(f'locale == "{locale}"')
    sess_train = read_train_data().query(f'locale == "{locale}"')
    sess_test = read_test_data(task).query(f'locale == "{locale}"')
    return products, sess_train, sess_test

def show_locale_info(locale, task):
    products, sess_train, sess_test = read_locale_data(locale, task)

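    # Note: prev_items is read from the CSV as a string such as "['A' 'B']",
    # so len() counts characters, not items (two 10-character IDs give 27).
    train_l = sess_train['prev_items'].apply(len)
    test_l = sess_test['prev_items'].apply(len)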

    print(f"Locale: {locale} \n"
          f"Number of products: {products['id'].nunique()} \n"
          f"Number of train sessions: {len(sess_train)} \n"
          f"Train session lengths - "
          f"Mean: {train_l.mean():.2f} | Median {train_l.median():.2f} | "
          f"Min: {train_l.min():.2f} | Max {train_l.max():.2f} \n"
          f"Number of test sessions: {len(sess_test)}"
    )
    if len(sess_test) > 0:
        print(
            f"Test session lengths - "
            f"Mean: {test_l.mean():.2f} | Median {test_l.median():.2f} | "
            f"Min: {test_l.min():.2f} | Max {test_l.max():.2f} \n"
        )
    print("======================================================================== \n")

In [6]:
products = read_product_data()
locale_names = products['locale'].unique()
for locale in locale_names:
    show_locale_info(locale, task)

Locale: DE 
Number of products: 518327 
Number of train sessions: 1111416 
Train session lengths - Mean: 57.89 | Median 40.00 | Min: 27.00 | Max 2060.00 
Number of test sessions: 104568
Test session lengths - Mean: 57.23 | Median 40.00 | Min: 27.00 | Max 700.00 


Locale: JP 
Number of products: 395009 
Number of train sessions: 979119 
Train session lengths - Mean: 59.61 | Median 40.00 | Min: 27.00 | Max 6257.00 
Number of test sessions: 96467
Test session lengths - Mean: 59.90 | Median 40.00 | Min: 27.00 | Max 1479.00 


Locale: UK 
Number of products: 500180 
Number of train sessions: 1182181 
Train session lengths - Mean: 54.85 | Median 40.00 | Min: 27.00 | Max 2654.00 
Number of test sessions: 115936
Test session lengths - Mean: 53.51 | Median 40.00 | Min: 27.00 | Max 872.00 


Locale: ES 
Number of products: 42503 
Number of train sessions: 89047 
Train session lengths - Mean: 48.82 | Median 40.00 | Min: 27.00 | Max 792.00 
Number of test sessions: 0

Locale: FR 
Number of produc

In [8]:
products.sample(5)

Unnamed: 0,id,locale,title,price,brand,color,size,model,material,author,desc
1496332,B076NDYGZT,FR,"Yogi Biologique Relaxation, Infusion 100% Bio ...",2.53,YOGI,,12 Unité (Lot de 1),310801.0,,,Contenu : 1 boite d'infusions Relaxation Yogi ...
1541738,B019QCXUYU,IT,Iris Ohyama Set di 6 scatole di immagazzinaggi...,69.99,Iris Ohyama,Trasparente,45 L,103542.0,Plastica,,CARATTERISTICHE: 2 clip per fissare la chiusur...
791873,B09JVV89MH,JP,【最新12インチ】VANBAR ドライブレコーダー ミラー型 2.5K 【日本語音声コントロ...,13800.0,VANBAR,黒,"12""2.5k+1080P",,,,🚗【GPS機能搭載・24時間駐車監視・バック連動機能搭載】●GPS機能搭載によって、自車の走...
434019,B08KYC92VJ,DE,Simple Joy® PAX75 Organizer für ikea Kleidersc...,21.99,Oro-Kong GmbH,Grau,PAX75,,Vliesstoff; Polypropylen-Innenstruktur,,🇩🇪 Patentmarke SIMPLE JOY
738146,B09227GF62,JP,IFEND ビジネスリュック レディース ノートパソコンバッグ 大容量 バックパック 3Wa...,3580.0,IFEND,ブラック,M,,ポリエステル,,【USBポート付き＆スーツケース取り付け可能】バッグパックを明けずにスマホの充電ができます。...


In [7]:
train_sessions = read_train_data()
train_sessions.sample(5)

Unnamed: 0,prev_items,next_item,locale
2994445,['B009E40Y0E' 'B00BFDY8W2' 'B07JDFZ1CQ' 'B07JC...,B005KRN5YQ,UK
2164161,['B092XLR5N3' 'B07K3W35Y6' 'B0868K6WMF' 'B09M9...,B097LPT9Q8,UK
2188521,['B005BAK3ZG' 'B005BAK30G' 'B004LXSBGC' 'B004L...,B004LXVP92,UK
2309433,['B08PCZKGBS' 'B01MRUSHG9' 'B07HHN5VTT'],B09GTMZ1ZH,UK
2725300,['B09H3GTNTB' 'B0BHPL6LML'],B07Q7T134W,UK
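
The `prev_items` column arrives from the CSV as plain strings rather than Python lists, so it usually has to be parsed before counting or iterating over items. Below is a minimal sketch. It assumes each ID is wrapped in single quotes and separated by whitespace, as in the rows shown above. `parse_prev_items` is a hypothetical helper name.

```python
import re

# Matches each quoted product ID inside a string like "['B09H3GTNTB' 'B0BHPL6LML']"
_ITEM_PATTERN = re.compile(r"'([^']+)'")


def parse_prev_items(raw):
    """Convert the string form of prev_items into a list of product IDs."""
    return _ITEM_PATTERN.findall(raw)


example = "['B09H3GTNTB' 'B0BHPL6LML']"
items = parse_prev_items(example)
print(items)
print(len(example), len(items))  # character count vs. item count: 27 2
```

With the data loaded, `train_sessions['prev_items'].apply(parse_prev_items)` would give real lists, so `len` then reports the number of products in a session instead of the number of characters.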


In [8]:
test_sessions = read_test_data(task)
test_sessions.sample(5)

Unnamed: 0,prev_items,locale
56749,['B07WNNSZBF' 'B07WNNSZBF' 'B08B1B64TX'],DE
315342,['B07LH48G6R' 'B08ZL8G35H' 'B09XV42FZL' 'B09Z2...,UK
125785,['B07PJYY115' 'B09RB1PYVG' 'B07PJYY115' 'B07PJ...,JP
256525,['B0BKSZ2HBH' 'B0BKT1PG4D' 'B0BKT1PG4D' 'B0BKS...,UK
34598,['B01DU739RK' 'B00Y9AWBFO'],DE


In [9]:
train_sessions.shape, test_sessions.shape 

((3606249, 3), (316971, 2))

## Generate Submission 🏋️‍♀️



Submission format:
1. The submission should be a **parquet** file with the sessions from all the locales.
2. Predicted product IDs for each locale should only be valid product IDs of that locale.
3. Predictions should be added in a new column named **"next_item_prediction"**.
4. Each prediction should be a list of string ID values.
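
Rules 2 and 4 can be checked row by row before anything is written to disk. The sketch below is one possible way to do that in plain Python. `find_format_problems` and its arguments are hypothetical names, and the IDs in the toy data are only for illustration.

```python
def find_format_problems(rows, valid_ids_by_locale):
    """Return (row_number, message) pairs for rows that break rules 2 or 4.

    rows: iterable of (locale, prediction_list) pairs.
    valid_ids_by_locale: dict mapping a locale code to a set of product IDs.
    """
    problems = []
    for i, (locale, preds) in enumerate(rows):
        # Rule 4: the prediction must be a list whose entries are all strings
        if not isinstance(preds, list) or not all(isinstance(p, str) for p in preds):
            problems.append((i, "prediction must be a list of strings"))
            continue
        # Rule 2: every predicted ID must belong to the row's own locale
        allowed = valid_ids_by_locale.get(locale, set())
        unknown = [p for p in preds if p not in allowed]
        if unknown:
            problems.append((i, f"IDs not in locale {locale}: {unknown}"))
    return problems


valid = {"DE": {"B07WSY3MG8", "B08KYC92VJ"}, "UK": {"B005KRN5YQ"}}
rows = [
    ("DE", ["B07WSY3MG8", "B08KYC92VJ"]),  # fine
    ("UK", ["B07WSY3MG8"]),                # DE product in a UK row
    ("DE", "B07WSY3MG8"),                  # a bare string, not a list
]
for problem in find_format_problems(rows, valid):
    print(problem)
```

For a real DataFrame, the rows could be produced with `zip(df['locale'], df['next_item_prediction'])`, and the per-locale ID sets could be built from the product table.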

In [10]:
def random_predicitons(locale, sess_test_locale):
    random_state = np.random.RandomState(42)
    products = read_product_data().query(f'locale == "{locale}"')
    predictions = []
    for _ in range(len(sess_test_locale)):
        predictions.append(
            list(products['id'].sample(PREDS_PER_SESSION, replace=True, random_state=random_state))
        ) 
    sess_test_locale['next_item_prediction'] = predictions
    sess_test_locale.drop('prev_items', inplace=True, axis=1)
    return sess_test_locale
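
Because the function above samples with `replace=True`, a single session's list can occasionally contain the same product more than once. If distinct predictions are preferred, one option is to sample without replacement. Here is a minimal sketch with the standard library's `random.sample`. `sample_unique_predictions` is a hypothetical name, and the toy IDs are made up for illustration.

```python
import random


def sample_unique_predictions(product_ids, n_sessions, k=100, seed=42):
    """Draw k distinct product IDs for each of n_sessions sessions.

    product_ids must contain at least k distinct entries.
    """
    rng = random.Random(seed)
    # Sort the de-duplicated IDs so results are reproducible for a given seed
    unique_ids = sorted(set(product_ids))
    return [rng.sample(unique_ids, k) for _ in range(n_sessions)]


toy_ids = [f"P{i:04d}" for i in range(500)]  # made-up IDs for illustration
preds = sample_unique_predictions(toy_ids, n_sessions=3, k=100)
print(len(preds), len(preds[0]), len(set(preds[0])))  # 3 100 100
```

Whether duplicates affect the score is not stated in this notebook, so this is only a precaution, not a requirement.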

In [11]:
test_sessions = read_test_data(task)
predictions = []
test_locale_names = test_sessions['locale'].unique()
print(test_locale_names)
for locale in test_locale_names:
    sess_test_locale = test_sessions.query(f'locale == "{locale}"').copy()
    predictions.append(
        random_predicitons(locale, sess_test_locale)
    )
predictions = pd.concat(predictions).reset_index(drop=True)
predictions.sample(5)

['DE' 'JP' 'UK']


Unnamed: 0,locale,next_item_prediction
13956,DE,"[B010B4UG3K, B07SKBHL5C, B077XRG25Q, B00TSBCC3..."
131849,JP,"[B0BJJ942TQ, B08L55NSPV, B09LS5JGG5, B01GK2F38..."
7428,DE,"[B0B93M9BXH, B0B17TPWNS, B07G2JBCBT, B09G75TPJ..."
152619,JP,"[B07VQLB33D, 4578210464, B083192XGN, B098M99GR..."
95849,DE,"[B08CFL3JXM, B0073E5880, B0868MJNSG, B08MW31D3..."


In [12]:
predictions.shape, test_sessions.shape 

((316971, 2), (316971, 2))

In [13]:
predictions.head(5)

Unnamed: 0,locale,next_item_prediction
0,DE,"[B0B3DSBCSC, B073TZ234Q, B0BGF8GJ8V, B09WK9GW1..."
1,DE,"[B01CO6E6WU, B09DFGZ1RK, B09KN7LH6X, B09PV5BKL..."
2,DE,"[B07C7HL1WZ, B01A8SYW6Q, B01MFCKG43, B01M0PFZJ..."
3,DE,"[B0BHP1HC4Y, B07TY1RFH5, B0876S52CV, B0BF5KWYG..."
4,DE,"[B089S6LXMZ, B097GXS9B9, B075LC9YJV, B09W5TGFD..."


## Validate predictions ✅

In [18]:
def check_predictions(predictions, check_products=False):
    """
    These tests need to pass as they will also be applied on the evaluator
    """
    test_locale_names = test_sessions['locale'].unique()
    for locale in test_locale_names:
        sess_test = test_sessions.query(f'locale == "{locale}"')
        preds_locale = predictions[predictions['locale'] == sess_test['locale'].iloc[0]]
        assert sorted(preds_locale.index.values) == sorted(sess_test.index.values), f"Session ids of {locale} don't match"

        if check_products:
            # This check is not done on the evaluator
            # but you can run it to verify there is no mixing of products between locales
            # Since the ground truth next item will always belong to the same locale
            # Warning - This can be slow to run
            products = read_product_data().query(f'locale == "{locale}"')
            predicted_products = np.unique( np.array(list(preds_locale["next_item_prediction"].values)) )
            assert np.all( np.isin(predicted_products, products['id']) ), f"Invalid products in {locale} predictions"

In [19]:
check_predictions(predictions)

In [14]:
# It's important that the parquet file you submit is saved with the pyarrow backend
os.makedirs('output', exist_ok=True)  # make sure the target folder exists
predictions.to_parquet(f'output/submission_{task}_rand.parquet', engine='pyarrow')

In [15]:
predictions

Unnamed: 0,locale,next_item_prediction
0,DE,"[B0B3DSBCSC, B073TZ234Q, B0BGF8GJ8V, B09WK9GW1..."
1,DE,"[B01CO6E6WU, B09DFGZ1RK, B09KN7LH6X, B09PV5BKL..."
2,DE,"[B07C7HL1WZ, B01A8SYW6Q, B01MFCKG43, B01M0PFZJ..."
3,DE,"[B0BHP1HC4Y, B07TY1RFH5, B0876S52CV, B0BF5KWYG..."
4,DE,"[B089S6LXMZ, B097GXS9B9, B075LC9YJV, B09W5TGFD..."
...,...,...
316966,UK,"[B0866LD4XF, B08D1R268V, B08QCLCQDM, B00TI6MS8..."
316967,UK,"[B078V75PDB, B00SVEMNU2, B0BBNBSS48, B09NK5F56..."
316968,UK,"[B09BW1BHC3, B08P1NL4J1, B0795DP124, B08PX7HC2..."
316969,UK,"[B071DN8LJ8, B007G9URME, B081QJ8YX6, B07NRP32F..."


## Submit to AIcrowd 🚀

In [None]:
# You can submit with aicrowd-cli, or upload manually on the challenge page.
!aicrowd submission create -c task-1-next-product-recommendation -f "output/submission_task1_rand.parquet"