## 3.0 Baseline Model

As established in the EDA, recency could have a significant impact on the products predicted. Our baseline model is therefore based on the most recent products (aids) for each user (session).
<br><br>

---
<br>

**Evaluation** <br>
Scores are evaluated on `Recall@20` for each action type, and the three recall values are weight-averaged:

    Score = (0.10 ⋅ Rclicks) + (0.30 ⋅ Rcarts) + (0.60 ⋅ Rorders)

where

    R = TP / P

TP is the number of correctly predicted items and P is the lower of 20 and the number of ground truth items, each summed over all sessions.
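
As a rough illustration of the metric, the sketch below computes Recall@20 for one action type and combines three recall values into the weighted score. The function names, the dictionary layout (session id mapped to a list of aids) and the toy data are assumptions made for this example; the weights are the competition weights of 0.10 for clicks, 0.30 for carts and 0.60 for orders. It is not the official scoring code.

```python
# Competition weights for each action type (they sum to 1)
WEIGHTS = {"clicks": 0.10, "carts": 0.30, "orders": 0.60}


def recall_at_20(predictions, truth):
    """Recall@20 for one action type, pooled over all sessions.

    predictions / truth: dict mapping session id -> list of aids
    (a hypothetical layout chosen for this sketch).
    """
    hits = 0
    possible = 0
    for session, true_aids in truth.items():
        predicted = set(predictions.get(session, [])[:20])
        true_set = set(true_aids)
        hits += len(predicted & true_set)      # TP for this session
        possible += min(20, len(true_set))     # P for this session
    return hits / possible if possible else 0.0


def weighted_score(recalls):
    """Combine per-type recalls into the weighted score."""
    return sum(WEIGHTS[t] * recalls[t] for t in WEIGHTS)


# Toy data, made up for illustration only
toy_predictions = {1: [10, 20, 30], 2: [40]}
toy_truth = {1: [10, 99], 2: [40]}
toy_recall = recall_at_20(toy_predictions, toy_truth)  # 2 hits out of 3 possible
toy_score = weighted_score({"clicks": 0.5, "carts": 0.4, "orders": 0.3})
```

Because hits and possible items are pooled before dividing, sessions with many ground truth items influence the recall more than sessions with a single one.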

Difference between Precision@k and Recall@k:<br>
- Precision@k is the proportion of recommended items in the top-k set that are relevant.
- Recall@k is the proportion of relevant items found in the top-k recommendations.

Recall@k is a suitable metric as it measures the platform's ability to recommend all the relevant items that the user might be interested in.
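
To make the difference concrete, here is a minimal sketch of both metrics on a made-up recommendation list. The helper names and the toy lists are assumptions for illustration, and the recommended list is assumed to contain no duplicates.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = len(set(top_k) & set(relevant))
    return hits / len(top_k)


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(set(relevant))


# Toy example: 5 recommendations, 4 relevant items, 2 of them recommended
toy_recommended = [1, 2, 3, 4, 5]
toy_relevant = [2, 5, 9, 11]
toy_precision = precision_at_k(toy_recommended, toy_relevant, k=5)  # 2 of 5 recommended are relevant
toy_recall_k = recall_at_k(toy_recommended, toy_relevant, k=5)      # 2 of 4 relevant are recommended
```

The same two hits give different values because the denominators differ: precision divides by the number of recommendations, recall by the number of relevant items.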

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import numpy as np
import pandas as pd

import pyarrow.parquet as pq

In [None]:
# Read parquet files
test = pd.read_parquet('/content/drive/MyDrive/0.capstone/test.parquet')

click_rank = pd.read_parquet('/content/drive/MyDrive/0.capstone/preprocessed/click_rank.parquet')
cart_rank = pd.read_parquet('/content/drive/MyDrive/0.capstone/preprocessed/cart_rank.parquet')
order_rank = pd.read_parquet('/content/drive/MyDrive/0.capstone/preprocessed/order_rank.parquet')

In [None]:
# Sort test by session and timestamp (oldest first, latest last)
test = test.sort_values(['session', 'ts'])
test.head()

Unnamed: 0,session,aid,ts,type
0,12899779,59625,1661724000,0
1,12899780,1142000,1661724000,0
2,12899780,582732,1661724058,0
3,12899780,973453,1661724109,0
4,12899780,736515,1661724136,0


**Prepare baseline prediction: latest 20 products for each user**

Intuition: items a user interacted with recently are more likely to appear among that user's next interactions, so they are reasonable candidates for the 20 predictions.

In [None]:
# Preprocess by listing out latest 20 products (aid) for each user (session)
session_aids = test.groupby('session')['aid'].apply(lambda x: list(x)[-20:])
session_aids.head()

session
12899779                                              [59625]
12899780           [1142000, 582732, 973453, 736515, 1142000]
12899781    [141736, 199008, 57315, 194067, 199008, 199008...
12899782    [889671, 1099390, 987399, 987399, 638410, 1072...
12899783    [255297, 1114789, 255297, 300127, 198385, 3001...
Name: aid, dtype: object
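
The output above shows that the same aid can appear more than once within a session (for example, 1142000 appears twice in session 12899780), so repeated aids take up several of the 20 slots. As a possible variation that is not part of the baseline scored here, the sketch below keeps only the latest 20 unique aids per session. The helper name `latest_unique_aids` and the toy DataFrame are assumptions for illustration; the result is ordered most recent first.

```python
import pandas as pd


def latest_unique_aids(aids, k=20):
    """Return up to k unique aids, most recent first.

    Assumes `aids` is ordered from oldest to latest, as after sorting by ts.
    """
    seen = set()
    unique_latest = []
    # Walk backwards from the latest event, keeping each aid's most recent occurrence
    for aid in reversed(list(aids)):
        if aid not in seen:
            seen.add(aid)
            unique_latest.append(aid)
        if len(unique_latest) == k:
            break
    return unique_latest


# Toy sessions mirroring the first two sessions shown above (timestamps are placeholders)
toy_test = pd.DataFrame({
    'session': [1, 1, 1, 1, 1, 2],
    'aid': [1142000, 582732, 973453, 736515, 1142000, 59625],
    'ts': [1, 2, 3, 4, 5, 1],
}).sort_values(['session', 'ts'])

toy_session_aids = toy_test.groupby('session')['aid'].apply(latest_unique_aids)
```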

In [None]:
session_types = ['clicks', 'carts', 'orders']

In [None]:
# Format for prediction: click, carts and orders for each user
session_type = []
labels = []

for session, aids in session_aids.items():
    for t in session_types:
        session_type.append(f'{session}_{t}')
        labels.append(' '.join([str(a) for a in aids]))

In [None]:
# Merge into dataframe
base_predict = pd.DataFrame({'session_type': session_type, 'labels': labels})
base_predict.head()

Unnamed: 0,session_type,labels
0,12899779_clicks,59625
1,12899779_carts,59625
2,12899779_orders,59625
3,12899780_clicks,1142000 582732 973453 736515 1142000
4,12899780_carts,1142000 582732 973453 736515 1142000


In [None]:
# base_predict.to_csv('/content/drive/MyDrive/0.capstone/for_submission/base_prediction.csv', index=False)

Baseline score (Kaggle): 0.46499

We have not fully utilised the 20 products we are allowed to predict: sessions with fewer than 20 events leave slots empty. We will thus fill the remaining slots with the top 20 most popular clicks/carts/orders.

**Prepare baseline prediction: latest 20 products for each user (filled with top 20 popular click/cart/orders in Week 4)**

In [None]:
# Format for prediction: click, carts and orders for each user. Labels as list of integers. Include type column.
session_type_2 = []
type_2 = []
labels_2 = []

for session, aids in session_aids.items():
    for t in session_types:
        session_type_2.append(f'{session}_{t}')
        type_2.append(f'{t}')
        labels_2.append(aids[-20:])

base_predict_2 = pd.DataFrame({'session_type': session_type_2, 'labels': labels_2, 'type': type_2})

In [None]:
base_predict_2.head()

Unnamed: 0,session_type,labels,type
0,12899779_clicks,[59625],clicks
1,12899779_carts,[59625],carts
2,12899779_orders,[59625],orders
3,12899780_clicks,"[1142000, 582732, 973453, 736515, 1142000]",clicks
4,12899780_carts,"[1142000, 582732, 973453, 736515, 1142000]",carts


In [None]:
# Return the ranked items in `column` (e.g. the Week 4 top 20) as a list of strings
def prepare_base_model(aid_rank, column):
    return [str(aid) for aid in aid_rank[column].to_list()]

In [None]:
click_list = prepare_base_model(click_rank, 'week_4_item')
click_list

['485256',
 '1460571',
 '1551213',
 '108125',
 '1406660',
 '184976',
 '876493',
 '1531805',
 '1116095',
 '29735',
 '1236775',
 '554660',
 '332654',
 '613493',
 '959208',
 '1126038',
 '166037',
 '832192',
 '321547',
 '171982']

In [None]:
cart_list = prepare_base_model(cart_rank, 'week_4_item')
cart_list

['485256',
 '152547',
 '33343',
 '613493',
 '876493',
 '1406660',
 '166037',
 '122983',
 '1531805',
 '1022566',
 '1736857',
 '554660',
 '1460571',
 '332654',
 '660655',
 '544144',
 '1116095',
 '1562705',
 '1236775',
 '923948']

In [None]:
order_list = prepare_base_model(order_rank, 'week_4_item')
order_list

['876493',
 '1406660',
 '122983',
 '1445562',
 '1531805',
 '166037',
 '332654',
 '801774',
 '231487',
 '1022566',
 '923948',
 '1460571',
 '1534690',
 '321547',
 '1025795',
 '544144',
 '1257293',
 '162064',
 '258353',
 '1476166']

In [None]:
# Fill in with popular items if fewer than 20 labels, skipping items already present.
popular = {'clicks': click_list, 'carts': cart_list, 'orders': order_list}
filled_labels = []

for labels, t in zip(base_predict_2['labels'], base_predict_2['type']):
    if len(labels) < 20:
        # Popular items are strings while session aids are integers, so compare as strings
        seen = {str(a) for a in labels}
        items_to_add = [item for item in popular[t] if item not in seen][:20 - len(labels)]
        labels = labels + items_to_add
    filled_labels.append(labels)

# Assign the whole column at once: writing to the row yielded by iterrows() may not update the DataFrame
base_predict_2['labels'] = pd.Series(filled_labels, index=base_predict_2.index)

In [None]:
# Convert all integers to string and join with space in between
base_predict_2['labels'] = base_predict_2['labels'].apply(lambda x: ' '.join(map(str, x)))

In [None]:
# Drop excess column
base_predict_2.drop(columns='type', inplace=True)

In [None]:
base_predict_2.head()

Unnamed: 0,session_type,labels
0,12899779_clicks,59625 485256 1460571 1551213 108125 1406660 18...
1,12899779_carts,59625 485256 152547 33343 613493 876493 140666...
2,12899779_orders,59625 876493 1406660 122983 1445562 1531805 16...
3,12899780_clicks,1142000 582732 973453 736515 1142000 485256 14...
4,12899780_carts,1142000 582732 973453 736515 1142000 485256 15...
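
Before writing the file, a quick sanity check on the label strings can catch rows that are still short, too long, or contain repeated aids. The sketch below is a hypothetical helper, not part of the original notebook; it assumes a `labels` column of space-separated aids as shown above and is demonstrated on a made-up DataFrame.

```python
import pandas as pd


def check_submission(df, k=20):
    """Summarise label counts per row for a submission-style DataFrame."""
    tokens = df['labels'].str.split()
    counts = tokens.str.len()
    has_duplicates = tokens.apply(lambda aids: len(set(aids)) < len(aids))
    return {
        'max_labels': int(counts.max()),
        'rows_under_k': int((counts < k).sum()),
        'rows_over_k': int((counts > k).sum()),
        'rows_with_duplicates': int(has_duplicates.sum()),
    }


# Toy submission, made up for illustration (k=3 keeps the example small)
toy_submission = pd.DataFrame({
    'session_type': ['1_clicks', '1_carts', '1_orders'],
    'labels': ['1 2 3', '4 4 5', '6 7'],
})
toy_report = check_submission(toy_submission, k=3)
```

On the real data this would be called as `check_submission(base_predict_2)`, where any non-zero `rows_over_k` would indicate more predictions than the 20 allowed.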


In [None]:
# base_predict_2.to_csv('/content/drive/MyDrive/0.capstone/for_submission/base_prediction_2.1.csv', index=False)

**Results** <br>
Baseline score v2 (Kaggle): 0.46682 <br>
An improvement of only 0.00183 from filling all empty slots with top 20 click/cart/orders from Week 4. 

We chose Week 4 instead of Week 5 because using Week 5 may introduce leakage. The test sessions are truncated at different timestamps, so the Week 5 top 20 lists could include events that occur after a given session's truncation timestamp.
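
One way to keep a popularity list free of such leakage is to count only events inside a time window that ends before the test period starts. The sketch below illustrates this idea on toy data; the helper name `top_k_in_window`, the window boundaries and the toy events are assumptions, and this is not necessarily how `click_rank`, `cart_rank` and `order_rank` were produced.

```python
import pandas as pd


def top_k_in_window(events, event_type, start_ts, end_ts, k=20):
    """Most frequent aids of one event type with start_ts <= ts < end_ts."""
    in_window = (
        (events['type'] == event_type)
        & (events['ts'] >= start_ts)
        & (events['ts'] < end_ts)
    )
    # value_counts sorts by frequency, most frequent first
    return events.loc[in_window, 'aid'].value_counts().head(k).index.tolist()


# Toy events: aid 9 is the most frequent overall but only occurs after the cutoff
toy_events = pd.DataFrame({
    'aid':  [1, 1, 1, 2, 2, 3, 9, 9, 9, 9],
    'ts':   [10, 11, 12, 13, 14, 15, 100, 101, 102, 103],
    'type': [0] * 10,
})
toy_top = top_k_in_window(toy_events, event_type=0, start_ts=0, end_ts=50, k=2)
```

Here aid 9 is excluded from `toy_top` even though it has the highest count, because all of its events fall after the window's end.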

These may not be the best candidates to fill the remaining slots. We will consider a co-visitation matrix next.