## 4.1 Rule-based Model (Covisitation Matrix)

**Covisitation Matrix** <br>
A covisitation matrix captures products that are frequently viewed and bought together. This partially addresses the *cold start problem*: if a user has viewed or purchased too few products, we fall back on items that are usually viewed and bought together with those few.

An alternative (or complementary) technique is *matrix factorization*, which decomposes the user-to-product interaction matrix into latent factors that estimate each user's preference for each product. A nearest-neighbour method such as KNN could then be used to identify and rank similar products. As the dataset is huge, this would require far more compute resources, so we use only the covisitation matrix here.
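For comparison only (this notebook does not use it), matrix factorization followed by a nearest-neighbour lookup could look roughly like the sketch below. It uses a truncated SVD from NumPy on a tiny invented interaction matrix; the matrix values, `n_factors`, and `nearest_items` are all hypothetical.

```python
import numpy as np

# Toy user-by-item interaction counts (4 users x 5 items); values are invented.
interactions = np.array([
    [3, 1, 0, 0, 0],
    [2, 2, 0, 0, 1],
    [0, 0, 4, 1, 0],
    [0, 0, 1, 3, 0],
], dtype=float)

n_factors = 2  # hypothetical number of latent factors

# Truncated SVD: interactions is approximated by U_k @ diag(s_k) @ Vt_k.
u, s, vt = np.linalg.svd(interactions, full_matrices=False)
item_factors = (np.diag(s[:n_factors]) @ vt[:n_factors]).T  # shape (n_items, n_factors)


def nearest_items(item_id, k=2):
    """Return the k items whose latent vectors are most cosine-similar to item_id."""
    norms = np.linalg.norm(item_factors, axis=1) + 1e-12
    unit = item_factors / norms[:, None]
    sims = unit @ unit[item_id]
    sims[item_id] = -np.inf  # exclude the query item itself
    return np.argsort(-sims)[:k].tolist()
```

On the real dataset the interaction matrix would be far too large for a dense SVD, which is one reason the covisitation approach is the cheaper option.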

How the covisitation matrix is computed and used here:
1. Two items that the same session interacts with within 1 day (24 hours) of each other are considered a covisitation pair.
2. Count the number of occurrences of each such pair.
3. If a user has fewer than 20 products, fill in the remainder with the products most commonly paired with each of the user's products, without duplicates.
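
The first two steps can be sketched on a toy event log. The column names mirror the notebook's data (`session`, `aid`, `ts`), with `ts` assumed to be in seconds; the values and the names `events` and `toy_counts` are made up for illustration.

```python
from collections import Counter, defaultdict

import pandas as pd

# Toy event log; values are invented, ts is assumed to be in seconds.
events = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "aid":     [10, 11, 12, 10, 11],
    "ts":      [0, 3_600, 200_000, 0, 60],
})

ONE_DAY = 24 * 60 * 60  # seconds

# Self-join on session to enumerate ordered pairs, then drop self-pairs.
pairs = events.merge(events, on="session")
pairs = pairs[pairs.aid_x != pairs.aid_y]

# Keep pairs where aid_y follows aid_x within one day.
gap = pairs.ts_y - pairs.ts_x
pairs = pairs[(gap >= 0) & (gap <= ONE_DAY)]

# Tally how often each aid_y follows each aid_x.
toy_counts = defaultdict(Counter)
for aid_x, aid_y in zip(pairs.aid_x, pairs.aid_y):
    toy_counts[aid_x][aid_y] += 1
```

Here item 11 follows item 10 within a day in both sessions, so `toy_counts[10][11]` is 2, while item 12 arrives more than a day later and is not paired with anything.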

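**Weighted prediction (for 20 or more products)** <br>
If a session has 20 or more products, we rank them by a weight that combines:
- recency: more recent interactions weigh more
- type: carts weigh most because we expect them to result in orders most often, followed by previous orders and lastly clicks
- frequency of aid: the weights of repeated interactions with the same aid are summed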
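
A minimal sketch of how these three factors could combine, assuming events arrive in chronological order and recency weights that double with each more recent event. The type codes (0 = clicks, 1 = carts, 2 = orders) are inferred from the notebook's `type_weight` dictionary; `rank_session` and the sample session are hypothetical.

```python
from collections import defaultdict

# Weights mirror the notebook's type_weight; codes assumed: 0=clicks, 1=carts, 2=orders.
TYPE_WEIGHT = {0: 1, 1: 6, 2: 3}


def rank_session(aids, types, top_n=20):
    """Rank a session's aids by summed recency x type weight (chronological input)."""
    n = len(aids)
    scores = defaultdict(float)
    for position, (aid, event_type) in enumerate(zip(aids, types)):
        recency = 2.0 ** (position - (n - 1))  # newest event = 1, each older step halves
        scores[aid] += recency * TYPE_WEIGHT[event_type]
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# aid 7 was carted, aid 5 was clicked twice, aid 9 was clicked last
ranked = rank_session(aids=[5, 7, 5, 9], types=[0, 1, 0, 0])
```

In this example the carted item 7 scores 1.5, the most recent click 9 scores 1.0, and the twice-clicked item 5 scores 0.625, so the ranking is 7, 9, 5.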

<br>

**Rule based approach** <br>
This is a rule-based approach that relies on heuristics instead of a reranker such as XGB Ranker. It requires far fewer steps than a reranker and, if done correctly, can give quite good and explainable results. In principle, we want to push the rule-based approach as far as possible before using a reranker (a black-box model) to boost the score further. <br>
Note that the current model makes the same prediction for clicks, carts and orders.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np

from collections import defaultdict, Counter

In [None]:
train = pd.read_parquet('/content/drive/MyDrive/0.capstone/train.parquet')
test = pd.read_parquet('/content/drive/MyDrive/0.capstone/test.parquet')

In [None]:
# Chunk to manage the huge dataset
chunk_size = 10_000

# Counter for covisitation pair frequency
count_pair_aid = defaultdict(Counter)

sessions = train['session'].unique()

**Find covisitation pairs and their frequency**

In [None]:
# loop over the sessions in chunks
for i in range(0, sessions.shape[0], chunk_size):

    # slice the current chunk by session label (this relies on train's index holding the session id);
    # min() keeps the last chunk from running past the final session
    last = min(sessions.shape[0] - 1, i + chunk_size - 1)
    current_chunk = train.loc[sessions[i]:sessions[last]].reset_index(drop=True)

    # keep only the latest 30 rows of each session
    current_chunk = current_chunk.groupby('session', as_index=False).nth(list(range(-30, 0))).reset_index(drop=True)

    # self-join the chunk on session to enumerate item pairs, then drop self-pairs;
    # .copy() avoids SettingWithCopyWarning when the new column is added below
    pair_aid = current_chunk.merge(current_chunk, on='session')
    pair_aid = pair_aid[pair_aid.aid_x != pair_aid.aid_y].copy()

    # time from x to y in days (assumes ts is in seconds)
    pair_aid['days_elapsed'] = (pair_aid.ts_y - pair_aid.ts_x) / (24 * 60 * 60)

    # keep only pairs where y occurs within 1 day after x
    pair_aid = pair_aid[(pair_aid.days_elapsed >= 0) & (pair_aid.days_elapsed <= 1)]

    # covisitation matrix: for each (aid_x, aid_y) pair, increment its count in count_pair_aid
    for aid_x, aid_y in zip(pair_aid['aid_x'], pair_aid['aid_y']):
        count_pair_aid[aid_x][aid_y] += 1

In [None]:
# list of session types; build lists of aids and types for each test session
session_types = ['clicks', 'carts', 'orders']
test_session_aid = test.reset_index(drop=True).groupby('session')['aid'].apply(list)
test_session_types = test.reset_index(drop=True).groupby('session')['type'].apply(list)

**Rule-based Ranking**

In [None]:
labels = []

# assume that carts have greater weight, followed by orders and then clicks
type_weight = {0: 1, 1: 6, 2: 3}

# this looks at test session only
for aids, types in zip(test_session_aid, test_session_types):
    # weigh by recency and type when the session has 20 or more aids
    if len(aids) >= 20:
        # recency weights double with each more recent event (aids are in chronological order):
        # the newest event gets 1 and each older event half of the next one.
        # Float powers stay finite for long sessions, whereas integer 2 ** n overflows int64 at n = 63.
        recency_weight = 2.0 ** (np.arange(len(aids)) - (len(aids) - 1))

        # zero as default for aids not seen yet
        weight_temp = defaultdict(float)

        # accumulate recency weight x type weight per aid, so repeated aids score higher
        for aid, w, t in zip(aids, recency_weight, types):
            weight_temp[aid] += np.float32(w) * np.float32(type_weight[t])

        # sort aids by weight, largest to smallest, and append the top 20 to labels
        sorted_aids = [k for k, v in sorted(weight_temp.items(), key=lambda item: -item[1])]
        labels.append(sorted_aids[:20])
    
    # for fewer than 20 aids in a session
    else:
        # consider the most recent aid first, with duplicates removed
        aids = list(dict.fromkeys(aids[::-1]))

        # collect the 20 most common covisited aids for each aid in the session
        candidates = []
        for aid in aids:
            if aid in count_pair_aid:
                candidates += [paired_aid for paired_aid, count in count_pair_aid[aid].most_common(20)]

        # append the 40 most frequent candidates that are not already in the session
        aids += [aid for aid, cnt in Counter(candidates).most_common(40) if aid not in aids]

        # keep only the top 20
        labels.append(aids[:20])

**Prepare prediction**

In [None]:
labels_formatted = [' '.join(map(str, label)) for label in labels]

weighted_predict_temp = pd.DataFrame(data={'session_type': test_session_aid.index, 'labels': labels_formatted})
weighted_predict_temp.head()

| | session_type | labels |
|---|---|---|
| 0 | 12899779 | 59625 |
| 1 | 12899780 | 1142000 736515 973453 582732 889686 487136 760... |
| 2 | 12899781 | 918667 199008 194067 57315 141736 1460571 1681... |
| 3 | 12899782 | 1596098 413962 45034 603159 779477 1037537 562... |
| 4 | 12899783 | 1817895 607638 1754419 1216820 1729553 300127 ... |


In [None]:
prediction_df = []

for t in session_types:
    modified_predictions = weighted_predict_temp.copy()
    modified_predictions.session_type = modified_predictions.session_type.astype('str') + f'_{t}'
    prediction_df.append(modified_predictions)

weighted_predict = pd.concat(prediction_df).reset_index(drop=True)
weighted_predict.head()

| | session_type | labels |
|---|---|---|
| 0 | 12899779_clicks | 59625 |
| 1 | 12899780_clicks | 1142000 736515 973453 582732 889686 487136 760... |
| 2 | 12899781_clicks | 918667 199008 194067 57315 141736 1460571 1681... |
| 3 | 12899782_clicks | 1596098 413962 45034 603159 779477 1037537 562... |
| 4 | 12899783_clicks | 1817895 607638 1754419 1216820 1729553 300127 ... |


In [None]:
# weighted_predict.to_csv('/content/drive/MyDrive/0.capstone/for_submission/weighted_predict.csv', index=False)

**Results**

Kaggle score: 0.51218 <br>
Applying weights to recent items and using the co-visitation matrix improved the score by about 0.05.

Running this code took a long time and hit memory errors many times. To proceed further, we need to use the more memory-efficient techniques shared in Kaggle discussions.
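
One possible direction, sketched here under assumptions and not necessarily what the Kaggle discussions recommend: count pairs with a pandas `groupby` instead of a Python loop, downcast the ids to `int32`, and keep only the top `k` partners per item so the stored matrix stays small. The function `top_k_pairs`, the parameter `k`, and the toy frame are hypothetical; combining counts across chunks is left out.

```python
import pandas as pd


def top_k_pairs(pairs, k=20):
    """Count (aid_x, aid_y) pairs and keep the k most frequent aid_y per aid_x."""
    # int32 halves the memory of the default int64 id columns
    ids = pairs[["aid_x", "aid_y"]].astype("int32")
    counts = ids.groupby(["aid_x", "aid_y"]).size().reset_index(name="n")
    # order each aid_x's partners from most to least frequent, then trim to k
    counts = counts.sort_values(["aid_x", "n"], ascending=[True, False])
    counts = counts.groupby("aid_x").head(k)
    return counts.groupby("aid_x")["aid_y"].apply(list).to_dict()


# toy pairs frame with the same column names the self-join produces
toy_pairs = pd.DataFrame({"aid_x": [1, 1, 1, 2], "aid_y": [2, 2, 3, 1]})
top_1 = top_k_pairs(toy_pairs, k=1)
top_2 = top_k_pairs(toy_pairs, k=2)
```

Trimming to the top `k` partners discards rarely seen pairs, which is acceptable here because the ranking step only ever reads the 20 most common partners of each item.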