## 5.0. Validate intuition

Running the full code is time- and resource-intensive, so we designed a quick way to validate ideas. The resulting scores are heavily overfit, but they are good enough for testing ideas.

Ideas tested:
1. Open carts. Intuitively, items still sitting in an open cart are the most likely to be ordered.
2. Repeated orders. Intuitively, a user who orders the same item repeatedly needs to buy that item often.
3. Recent items: carts and orders combined, orders only, and carts only.

In [None]:
from google.colab import drive
drive.mount('/content/drive')



In [None]:
import pandas as pd
import numpy as np

from collections import Counter

In [None]:
train_cart_4 = pd.read_parquet('/content/drive/MyDrive/0.capstone/train_cart_4.parquet')

train_order_wo_val_4 = pd.read_parquet('/content/drive/MyDrive/0.capstone/train_val/train_order_wo_val_4.parquet')
val_order_4 = pd.read_parquet('/content/drive/MyDrive/0.capstone/train_val/val_order_4.parquet')

**Create functions**

In [None]:
# Function to provide validation score
def validate_recall(validation):
    """Print the mean per-session recall of predictions (aid_y) against ground truth (aid_x)."""
    recalls = []
    for session, group in validation.groupby('session'):
        # Counters treat the aid lists as multisets, so duplicate aids count separately
        ground_truth = Counter(group['aid_x'].explode())
        predicted = Counter(group['aid_y'].explode())
        intersection_count = sum((ground_truth & predicted).values())
        ground_truth_count = sum(ground_truth.values())
        if ground_truth_count > 0:
            recalls.append(intersection_count / ground_truth_count)

    # Mean of the per-session recalls; NaN if no session has ground truth
    aggregate_recall = sum(recalls) / len(recalls) if recalls else float('nan')
    print(f'Validation recall score: {aggregate_recall}')
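
As a minimal sketch of what `validate_recall` computes for a single session, the example below uses made-up aids (101, 202 and 303 are hypothetical, not taken from the data). `Counter & Counter` keeps the minimum count per key, so an item that appears twice in the ground truth is only fully matched if it is also predicted twice.

```python
from collections import Counter

# Hypothetical session: aid 202 was ordered twice, aid 101 once.
ground_truth = Counter([101, 202, 202])
# Hypothetical prediction list without duplicates.
predicted = Counter([202, 303, 101])

# Multiset intersection: each aid is kept with the smaller of its two counts.
hits = sum((ground_truth & predicted).values())
recall = hits / sum(ground_truth.values())

print(f'hits={hits}, recall={recall:.3f}')
```

Because the prediction lists in this section are deduplicated, a session with repeated ground-truth orders cannot reach a recall of 1.0 under this definition.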

In [None]:
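def merge_two(df_left, df_right):
    # Left-join predictions (aid_y) onto ground truth (aid_x) by session,
    # then keep only the sessions that actually have a prediction
    df_new = df_left.merge(df_right, on='session', how='left')
    return df_new[df_new['aid_y'].notna()]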
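
One consequence of `merge_two` is worth keeping in mind when comparing scores: sessions without any prediction are dropped before recall is computed, so each score is a recall over the covered sessions only. The sketch below illustrates this with hypothetical toy frames (`truth`, `preds` and the `coverage` figure are illustrative names, not part of the notebook).

```python
import pandas as pd

# Copy of the notebook's helper so the example is self-contained.
def merge_two(df_left, df_right):
    df_new = df_left.merge(df_right, on='session', how='left')
    return df_new[df_new['aid_y'].notna()]

# Hypothetical ground truth for three sessions and predictions for two of them.
truth = pd.DataFrame({'session': [1, 2, 3], 'aid': [[10], [20, 21], [30]]})
preds = pd.DataFrame({'session': [1, 3], 'aid': [[10, 11], [99]]})

merged = merge_two(truth, preds)

# Share of ground-truth sessions that received any prediction at all.
coverage = len(merged) / len(truth)

print(merged)
print(f'coverage={coverage:.3f}')
```

Session 2 disappears from `merged`, so it contributes nothing to the recall, neither as a hit nor as a miss. Reporting a coverage figure next to each recall score would be one way to make the strategies easier to compare.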

**1. Validate list of open carts**

In [None]:
cart_order_4 = train_cart_4.merge(train_order_wo_val_4, how='left', on=['session', 'aid'])

# open carts: no matching order, or the cart event (ts_x) is later than the order (ts_y)
cart_order_4 = cart_order_4[(pd.isna(cart_order_4['ts_y'])) | (cart_order_4['ts_x'] > cart_order_4['ts_y'])]

# keep only session and aid columns; drop duplicates
cart_order_4 = cart_order_4.loc[:, ['session', 'aid']].drop_duplicates(subset=['session', 'aid'], keep='first')

# create list of items by session
cart_order_4 = cart_order_4.groupby('session')['aid'].apply(list).reset_index(name='aid')
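
To make the open-cart rule concrete, here is a small sketch on hypothetical data (the `carts` and `orders` frames and their timestamps are invented). It follows the same steps as the cell above: left-join carts to orders, keep carts with no order or with a cart event after the order, then collect the aids per session.

```python
import pandas as pd

# Hypothetical cart and order events for one session.
carts = pd.DataFrame({
    'session': [1, 1, 1],
    'aid': [10, 20, 30],
    'ts': [100, 100, 300],
})
orders = pd.DataFrame({
    'session': [1, 1],
    'aid': [20, 30],
    'ts': [200, 200],
})

# ts_x is the cart timestamp, ts_y the order timestamp (NaN if never ordered).
merged = carts.merge(orders, how='left', on=['session', 'aid'])
is_open = merged['ts_y'].isna() | (merged['ts_x'] > merged['ts_y'])

open_carts = merged.loc[is_open, ['session', 'aid']].drop_duplicates()
open_lists = open_carts.groupby('session')['aid'].apply(list).reset_index(name='aid')

print(open_lists)
```

Aid 10 was never ordered and aid 30 was carted again after its order, so both count as open. Aid 20 was carted and then ordered, so it is excluded.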

In [None]:
# Merge by session. x = ground truth ; y = predict
validation_open_carts = merge_two(val_order_4, cart_order_4)
validation_open_carts.head()

Unnamed: 0,session,aid_x,aid_y
0,0,"[1199474, 543308]","[974651, 280978, 1521766, 1760145, 275288, 543..."
1,15,[791627],"[791627, 11972]"
2,19,[383546],[383546]
3,35,"[1182605, 1148596]","[126784, 1182605, 1148596]"
4,36,"[1500844, 726640, 205516]","[726640, 1660229, 1709859, 636340, 1500844, 20..."


In [None]:
validate_recall(validation_open_carts)

Validation recall score: 0.875360544644799


Open carts alone give a good validation score (recall ≈ 0.875).

**2. Validate list of repeated orders**

In [None]:
# Group dataframe by session and create a list of all aids for each session
sessions = train_order_wo_val_4.groupby('session')['aid'].apply(list)

# Create counter for each aid in each session
aid_counts = {}
for session, aids in sessions.items():
    aid_counts[session] = Counter(aids)

# Create a dictionary of repeated aids for each session
repeated_aids = {}
for session, counts in aid_counts.items():
    repeated = {aid: count for aid, count in counts.items() if count > 1}
    if len(repeated) > 0:
        repeated_aids[session] = repeated

In [None]:
# Convert to dataframe
repeated_aids = {k: list(set(v)) for k, v in repeated_aids.items()}
repeated_aids_df = pd.DataFrame(repeated_aids.items(), columns=['session', 'aid'])
repeated_aids_df.head()

Unnamed: 0,session,aid
0,11603,[157700]
1,82998,[434410]
2,86070,"[1598401, 1689601, 545907, 283527, 1617448, 13..."
3,116550,[645863]
4,134544,[489150]
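
The same result can probably be reached without Python loops. The following is a sketch of a vectorized alternative, shown on a hypothetical `orders` frame rather than on `train_order_wo_val_4`; the column name `n_orders` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical orders: session 1 orders aid 10 twice.
orders = pd.DataFrame({
    'session': [1, 1, 1, 2, 2],
    'aid': [10, 10, 20, 30, 40],
})

# Count how often each (session, aid) pair was ordered.
counts = orders.groupby(['session', 'aid']).size()

# Keep the pairs ordered more than once and collect them per session.
repeated = counts[counts > 1].reset_index(name='n_orders')
repeated_lists = repeated.groupby('session')['aid'].apply(list).reset_index(name='aid')

print(repeated_lists)
```

Sessions without any repeated order (session 2 here) do not appear in the result, which matches the behaviour of the dictionary-based version above.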


In [None]:
# Merge by session. x = ground truth ; y = predict
validation_repeated_orders = merge_two(val_order_4, repeated_aids_df)
validation_repeated_orders.head()

Unnamed: 0,session,aid_x,aid_y
815,11603,"[899558, 1477820, 1605401, 153169, 934840, 174...",[157700]
5319,82998,"[866199, 143414, 1792127, 879688, 693231, 8796...",[434410]
5596,86070,"[120556, 1598401, 120556, 1341343, 106402, 168...","[1598401, 1689601, 545907, 283527, 1617448, 13..."
8416,116550,"[531898, 1186014, 606921, 1040894, 1763549, 32...",[645863]
10027,134544,"[489150, 1117119, 1674938, 289762, 1117119, 48...",[489150]


In [None]:
validate_recall(validation_repeated_orders)

Validation recall score: 0.027656250000000004


There are only a few repeated orders, and they contribute little to the score (recall ≈ 0.028).

**3.1. Recent 20 carts and orders**

In [None]:
# Combine data
recent = pd.concat([train_order_wo_val_4, train_cart_4], ignore_index=True)
recent = recent.sort_values(['session','ts'])
recent.head()

Unnamed: 0,session,aid,ts,type
9512,0,974651,1661336218,1
9513,0,974651,1661336225,1
9514,0,974651,1661336252,1
9515,0,280978,1661336260,1
9516,0,1521766,1661336265,1


In [None]:
# Keep only last 20 unique aids per session
recent = recent.groupby('session').aid.unique().apply(lambda x: x[-20:]).reset_index()
recent.head()

Unnamed: 0,session,aid
0,0,"[974651, 280978, 1521766, 1760145, 275288, 543..."
1,2,[161269]
2,4,"[1554752, 917213, 758750]"
3,15,"[791627, 11972]"
4,19,[383546]
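
One subtlety of this step: `unique()` returns values in order of first appearance, so slicing with `[-20:]` keeps the aids whose first event is latest, which is not necessarily the same as the aids interacted with most recently. The sketch below contrasts the two on hypothetical events, with `n = 2` standing in for 20; the `drop_duplicates(keep='last')` variant is an assumption about what "most recent" could mean, not the notebook's method.

```python
import pandas as pd

# Hypothetical events: aid 10 is seen first and then again at the very end.
events = pd.DataFrame({
    'session': [1, 1, 1, 1],
    'aid': [10, 20, 30, 10],
    'ts': [1, 2, 3, 4],
}).sort_values(['session', 'ts'])

n = 2  # stands in for the 20 used in the notebook

# Notebook approach: unique aids ordered by FIRST appearance, then the last n.
by_first_seen = events.groupby('session')['aid'].unique().apply(lambda x: x[-n:].tolist())

# Alternative: keep each aid at its LAST appearance, then take the last n.
by_last_seen = (
    events.drop_duplicates(['session', 'aid'], keep='last')
          .groupby('session')['aid']
          .apply(lambda s: s.iloc[-n:].tolist())
)

print(by_first_seen.loc[1], by_last_seen.loc[1])
```

The notebook does not measure whether this difference changes the recall, so the second variant is an option to test rather than a known improvement.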


In [None]:
# Merge by session. x = ground truth ; y = predict
validation_recent_buys = merge_two(val_order_4, recent)
validation_recent_buys.head()

Unnamed: 0,session,aid_x,aid_y
0,0,"[1199474, 543308]","[974651, 280978, 1521766, 1760145, 275288, 543..."
1,15,[791627],"[791627, 11972]"
2,19,[383546],[383546]
3,35,"[1182605, 1148596]","[126784, 1182605, 1148596]"
4,36,"[1500844, 726640, 205516]","[726640, 1660229, 1709859, 636340, 1500844, 20..."


In [None]:
validate_recall(validation_recent_buys)

Validation recall score: 0.908700604997075


The 20 most recent carts and orders combined give a good score (recall ≈ 0.909).

**3.2. Recent 20 orders only**

In [None]:
recent_2 = train_order_wo_val_4
recent_2 = recent_2.sort_values(['session','ts'])
recent_2 = recent_2.groupby('session').aid.unique().apply(lambda x: x[-20:]).reset_index()
recent_2.head()

Unnamed: 0,session,aid
0,6057,"[726916, 1148683]"
1,8600,"[245182, 721202, 1289468, 974727, 1826023]"
2,11603,"[1841602, 908432, 906854, 157700, 103347, 1530..."
3,30980,"[784013, 1293914, 960233, 135394, 1511425, 110..."
4,32441,[882772]


In [None]:
# Merge by session. x = ground truth ; y = predict
validation_recent_orders = merge_two(val_order_4, recent_2)
validation_recent_orders.head()

Unnamed: 0,session,aid_x,aid_y
461,6057,"[1025795, 825289, 1486848, 681850, 1519855, 53...","[726916, 1148683]"
614,8600,"[603364, 8983, 393538, 1676953, 601093, 149198...","[245182, 721202, 1289468, 974727, 1826023]"
815,11603,"[899558, 1477820, 1605401, 153169, 934840, 174...","[1841602, 908432, 906854, 157700, 103347, 1530..."
2008,30980,"[404612, 1775588, 1603371, 157267, 1735077, 13...","[784013, 1293914, 960233, 135394, 1511425, 110..."
2080,32441,"[954252, 1847280, 1518733, 1389994, 234686, 92...",[882772]


In [None]:
validate_recall(validation_recent_orders)

Validation recall score: 0.06742876076872056


Recent orders alone give a poor score (recall ≈ 0.067).

**3.3. Recent 20 carts only**

In [None]:
recent_3 = train_cart_4
recent_3 = recent_3.sort_values(['session','ts'])
recent_3 = recent_3.groupby('session').aid.unique().apply(lambda x: x[-20:]).reset_index()
recent_3.head()

Unnamed: 0,session,aid
0,0,"[974651, 280978, 1521766, 1760145, 275288, 543..."
1,2,[161269]
2,4,"[1554752, 917213, 758750]"
3,15,"[791627, 11972]"
4,19,[383546]


In [None]:
# Merge by session. x = ground truth ; y = predict
validation_recent_carts = merge_two(val_order_4, recent_3)
validation_recent_carts.head()

Unnamed: 0,session,aid_x,aid_y
0,0,"[1199474, 543308]","[974651, 280978, 1521766, 1760145, 275288, 543..."
1,15,[791627],"[791627, 11972]"
2,19,[383546],[383546]
3,35,"[1182605, 1148596]","[126784, 1182605, 1148596]"
4,36,"[1500844, 726640, 205516]","[726640, 1660229, 1709859, 636340, 1500844, 20..."


In [None]:
validate_recall(validation_recent_carts)

Validation recall score: 0.9087389002843216


Recent carts alone score marginally higher than carts and orders combined (0.90874 vs. 0.90870).

**Summary**
- Recent carts have a strong influence on the next item. This is consistent with the ranker models we built.
- Open carts also have a strong influence on the next item. However, the Kaggle score later showed that adding them did not improve the model, possibly because they are already captured by recent carts. We therefore did not include them in the model.
- Repeated orders have little influence on the prediction. The boosting reranker models later showed the same thing: the importance of this feature is small.
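
The hypothesis that open carts are already captured by recent carts could be checked directly by measuring how many open-cart candidates also appear in the recent-cart list of the same session. The sketch below does this on hypothetical frames shaped like `cart_order_4` and `recent_3` (one list of aids per session); the names `open_carts`, `recent_carts` and `overlap_share` are illustrative.

```python
import pandas as pd

# Hypothetical candidate lists per session.
open_carts = pd.DataFrame({'session': [1, 2], 'aid': [[10, 30], [40]]})
recent_carts = pd.DataFrame({'session': [1, 2], 'aid': [[10, 20, 30], [50]]})

# Align the two candidate lists by session.
both = open_carts.merge(
    recent_carts, on='session', how='inner', suffixes=('_open', '_recent')
)

# Count open-cart aids that are also in the recent-cart list of the same session.
covered = sum(
    len(set(o) & set(r)) for o, r in zip(both['aid_open'], both['aid_recent'])
)
total = sum(len(set(o)) for o in both['aid_open'])
overlap_share = covered / total

print(f'{covered} of {total} open-cart aids are also recent carts ({overlap_share:.1%})')
```

A share close to 100% on the real data would support the explanation given above. A clearly lower share would suggest that the missing Kaggle improvement has another cause.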