## 6.3. Add features to candidates for Reranker

We need to add features to the train and test candidates before fitting the ranker and predicting with it. The 15 features we selected are as follows:

**Key user-item feature:**
- rank according to candidate generation sequence ("buy rank")

This feature carries over the ordering produced by candidate generation in Parts 6.1 / 6.2. The idea is to use it as a simple way to account for recency, type, frequency and co-visitation, much as the rule-based model does. Ideally, the ranker will give this feature heavy weight.
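
As a minimal sketch of how such a rank feature can be derived, assuming the candidates are already stored in generation order within each session, the reciprocal of each candidate's position gives 1.0 to the first candidate and decays for later ones. The toy frame and its values are made up for illustration only.

```python
import pandas as pd

# Toy candidates, already ordered by generation sequence within each session
# (session and aid values are illustrative, not taken from the dataset).
toy_candidates = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "aid": [10, 11, 12, 20, 21],
})

# cumcount() numbers rows 0, 1, 2, ... within each session;
# adding 1 and taking the reciprocal yields 1, 1/2, 1/3, ...
toy_candidates["buy_rank"] = 1 / (toy_candidates.groupby("session").cumcount() + 1)

print(toy_candidates)
```

The reciprocal keeps the feature in (0, 1] and makes the gap between the first few positions larger than the gap between later ones.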

**User features:**
- total number of interactions for each user
- number of unique items for each user
- user session length

The intuition is that users with a lot of activity are more likely to make a next purchase.
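
The three user features can be sketched on a toy event log as follows. This is an illustrative pandas version, not the Polars implementation used later in this section, and it assumes `ts` is a Unix timestamp in seconds; all values are made up.

```python
import pandas as pd

# Toy event log; ts is assumed to be a Unix timestamp in seconds.
toy_events = pd.DataFrame({
    "session": [1, 1, 1, 2],
    "aid": [10, 11, 10, 20],
    "ts": [1000, 1030, 1090, 5000],
})

toy_session_features = (
    toy_events.groupby("session")
    .agg(
        session_total_interactions=("aid", "size"),   # number of events
        session_unique_aids=("aid", "nunique"),       # distinct items touched
        ts_min=("ts", "min"),
        ts_max=("ts", "max"),
    )
    .reset_index()
)

# Session length = time between the first and last event, in seconds
toy_session_features["session_length"] = (
    toy_session_features["ts_max"] - toy_session_features["ts_min"]
)
toy_session_features = toy_session_features.drop(columns=["ts_min", "ts_max"])

print(toy_session_features)
```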

**Item features:**
- number of unique users for each item
- total interactions for each item (carts only and orders only)
- number of co-visitation pairs

The intuition is that these items are more likely to be popular, and thus have a higher chance of being among the next 20 orders.
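
A small sketch of the per-item counts, split by event type, is shown here in pandas for illustration. The helper `toy_aid_features` and the toy data are hypothetical; the type coding (0 = click, 1 = cart, 2 = order) follows the filters used later in this notebook.

```python
import pandas as pd

# Toy event log; type coding: 0 = click, 1 = cart, 2 = order.
# All values are illustrative.
toy_events = pd.DataFrame({
    "session": [1, 1, 2, 2, 3],
    "aid": [10, 10, 10, 20, 20],
    "type": [1, 2, 1, 1, 0],
})

def toy_aid_features(events, event_type, prefix):
    """Per-item unique sessions and total interactions for one event type."""
    subset = events[events["type"] == event_type]
    return (
        subset.groupby("aid")
        .agg(**{
            f"{prefix}_aid_unique_sessions": ("session", "nunique"),
            f"{prefix}_aid_total_interactions": ("session", "size"),
        })
        .reset_index()
    )

toy_cart_features = toy_aid_features(toy_events, 1, "cart")
toy_order_features = toy_aid_features(toy_events, 2, "order")

print(toy_cart_features)
print(toy_order_features)
```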

**Item feature:**
- cart_order_interaction_ratio

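The intuition is that an item with high cart-to-order conversion has a higher chance of being among the next 20 orders. Note that the feature is computed as cart interactions divided by order interactions, so a lower ratio corresponds to higher conversion.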
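
To make the direction of `cart_order_interaction_ratio` concrete, here is a small pandas sketch using three counts copied from the joined outputs shown later in this section. The `order_cart_conversion` column is a hypothetical reciprocal added only to illustrate the interpretation; it is not one of the 15 features.

```python
import numpy as np
import pandas as pd

# Counts copied from the joined outputs later in this section;
# NaN marks an item with no recorded orders.
toy_items = pd.DataFrame({
    "aid": [11830, 397451, 689970],
    "cart_aid_total_interactions": [4438, 23, 2],
    "order_aid_total_interactions": [1377, 5, np.nan],
})

# Carts per order: a larger value means more carts were needed per order
toy_items["cart_order_interaction_ratio"] = (
    toy_items["cart_aid_total_interactions"]
    / toy_items["order_aid_total_interactions"]
)

# Hypothetical reciprocal view (orders per cart), which rises with conversion
toy_items["order_cart_conversion"] = 1 / toy_items["cart_order_interaction_ratio"]

print(toy_items)
```

Items with no orders end up with a missing ratio rather than a numeric value, which matches the empty cells in the joined test output.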



**User-item features:**
- recurring items (incl carts only and orders only): the intuition is that items a user has bought often are more likely to be purchased again, e.g. out of necessity
- last click / last cart: the intuition is that the last click and the last cart are likely to have far more influence on the next action than earlier events
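
As an illustrative pandas sketch of the last-click flag (the data is made up, and the notebook's own version uses Polars), the latest click per session is selected, marked with 1, and left-joined onto the candidates so that all other candidates stay missing:

```python
import pandas as pd

# Toy click log; ts is an increasing timestamp within each session.
toy_clicks = pd.DataFrame({
    "session": [1, 1, 2],
    "aid": [10, 11, 20],
    "ts": [100, 200, 150],
})

# Keep the most recent click per session and flag it with 1
toy_last_click = (
    toy_clicks.sort_values("ts")
    .groupby("session")
    .tail(1)[["session", "aid"]]
    .assign(last_click=1)
)

# Candidates that are not the session's last click get NaN after the left join
toy_candidates = pd.DataFrame({"session": [1, 1, 2], "aid": [10, 11, 21]})
toy_flagged = toy_candidates.merge(toy_last_click, how="left", on=["session", "aid"])

print(toy_flagged)
```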
<br><br>

---
<br>

**Use of Polars** <br>
We explore using Polars here. Polars is a DataFrame library written in Rust, a low-level systems programming language known for its performance and memory safety. Polars often runs faster than pandas. However, because it is newer, there is less documentation and fewer worked examples for more obscure DataFrame manipulations than there are for pandas.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import numpy as np
import pandas as pd

# !pip install polars
import polars as pl

import seaborn as sns

In [None]:
candidates_train = pd.read_parquet('/content/drive/MyDrive/0.capstone/covisit_prepare/ranker_candidates_train_3.parquet')
candidates_test = pd.read_parquet('/content/drive/MyDrive/0.capstone/covisit_prepare/ranker_candidates_test_3.parquet')

validation_data_a = pl.read_parquet('/content/drive/MyDrive/0.capstone/train_val/test.parquet')

train = pl.read_parquet('/content/drive/MyDrive/0.capstone/train.parquet')
test = pl.read_parquet('/content/drive/MyDrive/0.capstone/test.parquet')

In [None]:
top_30_buy2buy_train = pl.read_parquet('/content/drive/MyDrive/0.capstone/covisit_prepare/top_30_buy2buy_v4_0.parquet')
top_30_buy_train_1 = pl.read_parquet('/content/drive/MyDrive/0.capstone/covisit_prepare/top_30_carts_orders_v4_0.parquet')
top_30_buy_train_2 = pl.read_parquet('/content/drive/MyDrive/0.capstone/covisit_prepare/top_30_carts_orders_v4_1.parquet')

top_30_buy2buy_test = pd.read_parquet('/content/drive/MyDrive/0.capstone/covisit_prepare/top_30_buy2buy_v3_0.parquet')
top_30_buy_test_1 = pd.read_parquet('/content/drive/MyDrive/0.capstone/covisit_prepare/top_30_carts_orders_v3_0.parquet')
top_30_buy_test_2 = pd.read_parquet('/content/drive/MyDrive/0.capstone/covisit_prepare/top_30_carts_orders_v3_1.parquet')

**Preprocessing**

In [None]:
# Function to align datatypes for merge
def change_type(df):
  df = df.astype({'aid_x': 'int32', 'aid_y': 'int32'})
  return pl.from_pandas(df)

In [None]:
# Align datatypes for merge
top_30_buy2buy_test = change_type(top_30_buy2buy_test)
top_30_buy_test_1 = change_type(top_30_buy_test_1)
top_30_buy_test_2 = change_type(top_30_buy_test_2)

In [None]:
# Concat train with validation_data_a for aid features
train_new = pl.concat([train, validation_data_a])

# Concat train with test for aid features
test_new = pl.concat([train, test])

**1 Key user-item feature:** Rank according to candidate generation sequence

In [None]:
# Assign rank based on candidate generation sequence
# Convert to int32 dtype and then convert to Polars
def buy_rank(df):
    df['buy_rank'] = 1 / (df.groupby('session').cumcount() + 1)
    df = df.astype({'session': 'int32', 'aid': 'int32', 'buy_rank': 'float64'})
    return pl.from_pandas(df)

In [None]:
candidates_train = buy_rank(candidates_train)
candidates_test = buy_rank(candidates_test)

print(candidates_train.dtypes)
print(candidates_test.dtypes)

print(candidates_train.shape)
print(candidates_test.shape)

[Int32, Int32, Float64]
[Int32, Int32, Float64]
(20500399, 3)
(35835181, 3)


In [None]:
candidates_train.head(35)

session,aid,buy_rank
i32,i32,f64
11098528,11830,1.0
11098528,1732105,0.5
11098528,588923,0.333333
11098528,884502,0.25
11098528,876129,0.2
11098528,1157882,0.166667
11098528,231487,0.142857
11098528,571762,0.125
11098528,1182614,0.111111
11098528,1790438,0.1


In [None]:
candidates_test.head()

session,aid,buy_rank
i32,i32,f64
12899779,59625,1.0
12899779,397451,0.5
12899779,689970,0.333333
12899779,1493965,0.25
12899779,469285,0.2


**3 user features:** total interactions, number of unique aids, session length

In [None]:
test.head()

session,aid,ts,type
i32,i32,i32,u8
12899779,59625,1661724000,0
12899780,1142000,1661724000,0
12899780,582732,1661724058,0
12899780,973453,1661724109,0
12899780,736515,1661724136,0


In [None]:
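# Session features
# Note: the `ts` values shown above (e.g. 1661724000) look like Unix seconds, so
# `ts*1000` is in milliseconds, while pl.Datetime defaults to microseconds.
# As written, session_length therefore equals (last ts - first ts) / 1000,
# i.e. a uniformly rescaled duration. The constant 2-hour offset cancels out.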
def session_features(df):
  return df.select([
          pl.col('session'),
          pl.col('aid'),
          (pl.col('ts')*1000 + 2*60*60*1000*1000).cast(pl.Datetime).alias('timestamp'),
      ]).groupby('session').agg([
          pl.col('timestamp').count().cast(pl.Int16).alias('session_total_interactions'),
          pl.col('aid').n_unique().cast(pl.Int16).alias('session_unique_aids'),
          pl.col('timestamp').min().alias('transaction_start_timestamp'),
          pl.col('timestamp').max().alias('transaction_end_timestamp')
      ]).select([
          '*',
          ((pl.col('transaction_end_timestamp') - pl.col('transaction_start_timestamp')).apply(lambda x: x.total_seconds())).cast(pl.Float32).alias('session_length')
      ]).drop(['transaction_end_timestamp', 'transaction_start_timestamp'])

In [None]:
train_session_features = session_features(validation_data_a)

train_session_features.head()

session,session_total_interactions,session_unique_aids,session_length
i32,i16,i16,f32
11670968,1,1,0.0
11979392,2,2,0.055
11271784,2,2,0.076
12124376,1,1,0.0
12326320,2,2,0.1


In [None]:
test_session_features = session_features(test)

test_session_features.head()

session,session_total_interactions,session_unique_aids,session_length
i32,i16,i16,f32
14026320,3,3,83.519997
14090576,1,1,0.0
12994424,2,2,0.761
13122576,59,22,251.837997
13790336,5,1,82.444


**2 x 2 item features:** number of unique sessions and total interactions (carts only and orders only)

In [None]:
# Aid features
def aid_features(df):
  return df.select([
      pl.col('session'),
      pl.col('aid'),    
      (pl.col('ts')*1000 + 2*60*60*1000*1000).cast(pl.Datetime).cast(pl.Date).alias('timestamp'),
      pl.col('session').n_unique().over(pl.lit(1)).alias('total_sessions')
      ]).groupby('aid').agg([
          pl.col('session').n_unique().cast(pl.Int32).alias('aid_unique_sessions'),
          pl.col('timestamp').count().cast(pl.Int32).alias('aid_total_interactions'),
      ])

In [None]:
# Filter orders only
train_orders = train_new.filter(pl.col('type') == pl.lit(2))
test_orders = test_new.filter(pl.col('type') == pl.lit(2))

In [None]:
train_orders_aid_features = aid_features(train_orders)
train_orders_aid_features = train_orders_aid_features.rename({
  'aid_unique_sessions': 'order_aid_unique_sessions',
  'aid_total_interactions': 'order_aid_total_interactions'
})
train_orders_aid_features.head()

aid,order_aid_unique_sessions,order_aid_total_interactions
i32,i32,i32
1746632,1,1
1024768,14,14
380608,2,2
1371504,3,6
800568,1,1


In [None]:
test_orders_aid_features = aid_features(test_orders)
test_orders_aid_features = test_orders_aid_features.rename({
  'aid_unique_sessions': 'order_aid_unique_sessions',
  'aid_total_interactions': 'order_aid_total_interactions'
})
test_orders_aid_features.head()

aid,order_aid_unique_sessions,order_aid_total_interactions
i32,i32,i32
669576,7,8
933920,8,8
1003544,1,1
1022576,5,5
746960,3,3


In [None]:
# Filter carts only
train_carts = train_new.filter(pl.col('type') == pl.lit(1))
test_carts = test_new.filter(pl.col('type') == pl.lit(1))

In [None]:
train_carts_aid_features = aid_features(train_carts)
train_carts_aid_features = train_carts_aid_features.rename({
  'aid_unique_sessions': 'cart_aid_unique_sessions',
  'aid_total_interactions': 'cart_aid_total_interactions'
})
train_carts_aid_features.head()

aid,cart_aid_unique_sessions,cart_aid_total_interactions
i32,i32,i32
229160,3,3
811448,35,51
995120,1,1
514672,1,1
1644456,1,1


In [None]:
test_carts_aid_features = aid_features(test_carts)
test_carts_aid_features = test_carts_aid_features.rename({
  'aid_unique_sessions': 'cart_aid_unique_sessions',
  'aid_total_interactions': 'cart_aid_total_interactions'
})
test_carts_aid_features.head()

aid,cart_aid_unique_sessions,cart_aid_total_interactions
i32,i32,i32
1198360,24,29
362904,2,2
1132072,1,1
1611400,38,46
587248,14,16


**2 item features:** number of co-visitation pairs (from the buy2buy and carts-orders matrices)

In [None]:
top_30_buy2buy_train.head()

aid_x,aid_y,wgt,__index_level_0__
i32,i32,f32,i64
1,1533875,1.0,0
1,1832177,1.0,1
3,1180285,10.0,2
3,776187,6.0,3
3,1771163,4.0,4


In [None]:
def covisit_pair(df):
  return df.groupby('aid_x').agg(pl.col('aid_y').n_unique()).rename({'aid_x': 'aid', 'aid_y': 'pair_count'})

In [None]:
train_covisit_1_feature = covisit_pair(top_30_buy2buy_train)
train_covisit_1_feature.head()

aid,pair_count
i32,u32
1259936,8
1335176,28
709032,2
930024,19
971368,6


In [None]:
test_covisit_1_feature = covisit_pair(top_30_buy2buy_test)
test_covisit_1_feature.head()

aid,pair_count
i32,u32
754432,19
435768,26
1103192,4
1286720,30
925408,30


In [None]:
# Concat parts
top_30_buy_train = pl.concat([top_30_buy_train_1, top_30_buy_train_2])
top_30_buy_test = pl.concat([top_30_buy_test_1, top_30_buy_test_2])

In [None]:
train_covisit_2_feature = covisit_pair(top_30_buy_train)
train_covisit_2_feature = train_covisit_2_feature.rename({'pair_count': 'pair_count_2'})
train_covisit_2_feature.head()

aid,pair_count_2
i32,u32
953368,30
1314088,30
1540400,30
332784,30
1639928,30


In [None]:
test_covisit_2_feature = covisit_pair(top_30_buy_test)
test_covisit_2_feature = test_covisit_2_feature.rename({'pair_count': 'pair_count_2'})
test_covisit_2_feature.head()

aid,pair_count_2
i32,u32
615776,30
856872,30
357568,30
557008,30
786272,30


**1 x 2 session-aid features:** recurring aids (carts only and orders only)

In [None]:
def session_aid_features(df):
  return df.select([
          pl.col('session'),
          pl.col('aid'),
      ]).groupby('session', 'aid').agg([
          (pl.count('aid').alias('num_recurring_aids')).cast(pl.Int16)
      ])

In [None]:
# Filter orders only
train_session_aid_orders = validation_data_a.filter(pl.col('type') == pl.lit(2))
test_session_aid_orders = test.filter(pl.col('type') == pl.lit(2))

In [None]:
train_session_aid_orders_features = session_aid_features(train_session_aid_orders)
train_session_aid_orders_features = train_session_aid_orders_features.rename({'num_recurring_aids': 'order_num_recurring_aids'})

test_session_aid_orders_features = session_aid_features(test_session_aid_orders)
test_session_aid_orders_features = test_session_aid_orders_features.rename({'num_recurring_aids': 'order_num_recurring_aids'})

In [None]:
# Filter carts only (type 1; type 2 would select orders again)
train_session_aid_carts = validation_data_a.filter(pl.col('type') == pl.lit(1))
test_session_aid_carts = test.filter(pl.col('type') == pl.lit(1))

In [None]:
train_session_aid_carts_features = session_aid_features(train_session_aid_carts)
train_session_aid_carts_features = train_session_aid_carts_features.rename({'num_recurring_aids': 'cart_num_recurring_aids'})

test_session_aid_carts_features = session_aid_features(test_session_aid_carts)
test_session_aid_carts_features = test_session_aid_carts_features.rename({'num_recurring_aids': 'cart_num_recurring_aids'})

**1 x 2 session-aid features:** last click / cart

In [None]:
# filter only last cart by timestamp
train_last_cart = train_carts.sort('ts').groupby('session').tail(1)
test_last_cart = test_carts.sort('ts').groupby('session').tail(1)

# keep only relevant columns
train_last_cart = train_last_cart.select(['session', 'aid', 'type']).rename({'type': 'last_cart'})
test_last_cart = test_last_cart.select(['session', 'aid', 'type']).rename({'type': 'last_cart'})

train_last_cart.head()

session,aid,last_cart
i32,i32,u8
9604424,823247,1
8497872,236899,1
1638304,1015289,1
423176,880620,1
10402520,549486,1


In [None]:
# Filter clicks only
train_clicks = train_new.filter(pl.col('type') == pl.lit(0))
test_clicks = test_new.filter(pl.col('type') == pl.lit(0))

In [None]:
# filter only last click by timestamp
train_last_click = train_clicks.sort('ts').groupby('session').tail(1)
test_last_click = test_clicks.sort('ts').groupby('session').tail(1)

# keep only relevant columns
train_last_click = train_last_click.select(['session', 'aid', 'type']).rename({'type': 'last_click'})
test_last_click = test_last_click.select(['session', 'aid', 'type']).rename({'type': 'last_click'})

train_last_click = train_last_click.with_columns(pl.col('last_click').apply(lambda x: 1 if x == 0 else x))
test_last_click = test_last_click.with_columns(pl.col('last_click').apply(lambda x: 1 if x == 0 else x))

train_last_click.head()

session,aid,last_click
i32,i32,i64
925328,6851,1
7929784,1175706,1
4522432,1452249,1
9782680,498303,1
5980792,436495,1


**Join candidates with features**

In [None]:
train_candidates_features = (
    candidates_train
    .join(train_session_features, how='left', on='session')
    .join(train_orders_aid_features, how='left', on='aid')
    .join(train_carts_aid_features, how='left', on='aid')
    .join(train_covisit_1_feature, how='left', on='aid')
    .join(train_covisit_2_feature, how='left', on='aid')
    .join(train_session_aid_orders_features, how='left', on=['session', 'aid'])
    .join(train_session_aid_carts_features, how='left', on=['session', 'aid'])
    .join(train_last_cart, how='left', on=['session', 'aid'])
    .join(train_last_click, how='left', on=['session', 'aid'])
)

# create new column based on cart-order interaction ratio
train_candidates_features = train_candidates_features.select(
    "*",
    (pl.col("cart_aid_total_interactions") / pl.col("order_aid_total_interactions")).alias("cart_order_interaction_ratio")
)

train_candidates_features.head(50)

session,aid,buy_rank,session_total_interactions,session_unique_aids,session_length,order_aid_unique_sessions,order_aid_total_interactions,cart_aid_unique_sessions,cart_aid_total_interactions,pair_count,pair_count_2,order_num_recurring_aids,cart_num_recurring_aids,last_cart,last_click,cart_order_interaction_ratio
i32,i32,f64,i16,i16,f32,i32,i32,i32,i32,u32,u32,i16,i16,u8,i64,f64
11098528,11830,1.0,1,1,0.0,1195,1377,3294,4438,30,30,,,,,3.222948
11098528,1732105,0.5,1,1,0.0,565,654,1597,2268,30,30,,,,,3.46789
11098528,588923,0.333333,1,1,0.0,687,795,1671,2114,30,30,,,,1,2.659119
11098528,884502,0.25,1,1,0.0,1032,1183,2352,3003,30,30,,,,,2.538462
11098528,876129,0.2,1,1,0.0,840,944,1816,2211,30,30,,,,,2.342161
11098528,1157882,0.166667,1,1,0.0,785,913,2202,2712,30,30,,,,,2.970427
11098528,231487,0.142857,1,1,0.0,3951,4522,8816,10739,30,30,,,,,2.374834
11098528,571762,0.125,1,1,0.0,587,667,1258,1514,30,30,,,,,2.269865
11098528,1182614,0.111111,1,1,0.0,1544,1763,3326,4069,30,30,,,,,2.307998
11098528,1790438,0.1,1,1,0.0,462,536,1312,1609,30,30,,,,,3.001866


In [None]:
test_candidates_features = (
    candidates_test
    .join(test_session_features, how='left', on='session')
    .join(test_orders_aid_features, how='left', on='aid')
    .join(test_carts_aid_features, how='left', on='aid')
    .join(test_covisit_1_feature, how='left', on='aid')
    .join(test_covisit_2_feature, how='left', on='aid')
    .join(test_session_aid_orders_features, how='left', on=['session', 'aid'])
    .join(test_session_aid_carts_features, how='left', on=['session', 'aid'])
    .join(test_last_cart, how='left', on=['session', 'aid'])
    .join(test_last_click, how='left', on=['session', 'aid'])
)

# create new column based on cart-order interaction ratio
test_candidates_features = test_candidates_features.select(
    "*",
    (pl.col("cart_aid_total_interactions") / pl.col("order_aid_total_interactions")).alias("cart_order_interaction_ratio")
)

test_candidates_features.head(50)

session,aid,buy_rank,session_total_interactions,session_unique_aids,session_length,order_aid_unique_sessions,order_aid_total_interactions,cart_aid_unique_sessions,cart_aid_total_interactions,pair_count,pair_count_2,order_num_recurring_aids,cart_num_recurring_aids,last_cart,last_click,cart_order_interaction_ratio
i32,i32,f64,i16,i16,f32,i32,i32,i32,i32,u32,u32,i16,i16,u8,i64,f64
12899779,59625,1.0,1,1,0.0,,,,,,30,,,,1,
12899779,397451,0.5,1,1,0.0,5,5,19,23,30,30,,,,,4.6
12899779,689970,0.333333,1,1,0.0,,,2,2,21,30,,,,,
12899779,1493965,0.25,1,1,0.0,,,11,11,30,30,,,,,
12899779,469285,0.2,1,1,0.0,,,3,3,12,30,,,,,
12899779,438191,0.166667,1,1,0.0,4,4,98,114,30,30,,,,,28.5
12899779,731692,0.142857,1,1,0.0,4,5,14,17,30,30,,,,,3.4
12899779,737445,0.125,1,1,0.0,,,12,16,30,30,,,,,
12899779,1253524,0.111111,1,1,0.0,,,6,6,30,30,,,,,
12899779,1790770,0.1,1,1,0.0,1,1,2,2,17,30,,,,,2.0


In [None]:
print(train_candidates_features.shape)
print(test_candidates_features.shape)

(20500399, 17)
(35835181, 17)


In [None]:
# train_candidates_features.write_parquet("/content/drive/MyDrive/0.capstone/covisit_prepare/reranker_candidates_features_train_5.parquet", compression="snappy")
# test_candidates_features.write_parquet("/content/drive/MyDrive/0.capstone/covisit_prepare/reranker_candidates_features_test_5.parquet", compression="snappy")