In this notebook we train a <code>Word2Vec</code> model, build an approximate nearest neighbour model on top of it, and test it on a local validation set.

We use the <code>gensim</code> library for <code>Word2Vec</code> and the <code>annoy</code> library for the approximate nearest neighbour model.

For this purpose, I used the [dataset](https://www.kaggle.com/datasets/radek1/otto-full-optimized-memory-footprint) provided by Radek Osmulski. The prepared sets let you test the recommendation system. The code in this notebook is based on the example prepared by Radek (see his [notebook](https://www.kaggle.com/code/radek1/word2vec-how-to-training-and-submission)). To improve the result obtained by the model, we introduce some modifications to its parameters. For tips related to these parameters, please see this [notebook](https://www.kaggle.com/code/balaganiarz0/word2vec-model-local-validation).

# Data preprocessing

In [1]:
!pip install polars
import gc
import polars as pl
from gensim.models import Word2Vec

train = pl.read_parquet('../input/otto-full-optimized-memory-footprint/train.parquet')
test = pl.read_parquet('../input/otto-full-optimized-memory-footprint/test.parquet')

Collecting polars
  Downloading polars-0.15.14-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.8 MB)
Installing collected packages: polars
Successfully installed polars-0.15.14

In [2]:
# One "sentence" per session: the list of aids that occur in it
sentences_df = pl.concat([train, test]).groupby('session').agg(
    pl.col('aid').alias('sentence')
)

sentences = sentences_df['sentence'].to_list()
del sentences_df; gc.collect()

0
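To make the shape of `sentences` concrete, here is a minimal pure-Python sketch of the same grouping on a toy event log. The rows and the helper name `build_sentences` are invented for illustration; the notebook itself does this step with polars on the real data.

```python
from collections import defaultdict

def build_sentences(events):
    """Group (session, aid) rows into one list of aids per session, keeping row order."""
    by_session = defaultdict(list)
    for session, aid in events:
        by_session[session].append(aid)
    return list(by_session.values())

# Hypothetical toy rows: (session, aid), assumed to be in time order within a session.
toy_events = [(1, 10), (1, 11), (2, 10), (1, 12), (2, 13)]
toy_sentences = build_sentences(toy_events)
print(toy_sentences)  # [[10, 11, 12], [10, 13]]
```

Each inner list plays the role of a sentence and each aid the role of a word, which is the input format `Word2Vec` expects.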

# Training word2vec model

In [3]:
%%time

w2vec = Word2Vec(sentences=sentences, vector_size=64, window=3, negative=8,
                 ns_exponent=0.2, sg=1, min_count=1, workers=4)

CPU times: user 6h 41min 10s, sys: 41.6 s, total: 6h 41min 51s
Wall time: 1h 45min 36s


# Preparing approximate nearest neighbour model

In [4]:
%%time

from annoy import AnnoyIndex

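# Map each aid to its row in the embedding matrix
aid2idx = {aid: i for i, aid in enumerate(w2vec.wv.index_to_key)}
# The index dimension must match the Word2Vec vector_size (64)
index = AnnoyIndex(64, 'euclidean')

for aid, idx in aid2idx.items():
    index.add_item(idx, w2vec.wv.vectors[idx])

# Build 32 trees; more trees give higher precision at the cost of a larger index
index.build(32)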

CPU times: user 3min 2s, sys: 2.01 s, total: 3min 4s
Wall time: 59.3 s


True
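Annoy returns approximate neighbours, so it can help to compare its output with an exact search on a small sample. The sketch below is a brute-force Euclidean search in pure Python; the function name `brute_force_neighbours` and the toy vectors are assumptions made for illustration and are not part of the notebook.

```python
import math

def brute_force_neighbours(vectors, query_idx, k):
    """Return the indices of the k vectors closest to vectors[query_idx].

    Distances are Euclidean. The query itself has distance 0, so it comes first.
    """
    query = vectors[query_idx]
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]

# Hypothetical 2-D "embeddings"; the real ones have 64 dimensions.
toy_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
nearest = brute_force_neighbours(toy_vectors, 0, 3)
print(nearest)  # [0, 1, 2]
```

The query item appearing first in its own neighbour list is the reason the prediction code later drops the first element of each result with `[1:]`.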

# Creating aid predictions

In [5]:
import pandas as pd
import numpy as np

from collections import defaultdict
import collections

session_types = ['clicks', 'carts', 'orders']
# Convert to pandas once and reuse the frame for both groupings
test_df = test.to_pandas().reset_index(drop=True)
test_session_AIDs = test_df.groupby('session')['aid'].apply(list)
test_session_types = test_df.groupby('session')['type'].apply(list)

In [6]:
labels = []

type_weight_multipliers = {0: 1, 1: 6, 2: 3}

session_num = len(test_session_AIDs)

for AIDs, types in zip(test_session_AIDs[:session_num], test_session_types[:session_num]):
    if len(AIDs) >= 20:
        # if the session already has 20 or more aids, we don't need to look for candidates; we just re-rank them with the old logic
        weights=np.logspace(0.1,1,len(AIDs),base=2, endpoint=True)-1
        aids_temp=defaultdict(lambda: 0)
        for aid,w,t in zip(AIDs,weights,types): 
            aids_temp[aid]+= w * type_weight_multipliers[t]
            
        sorted_aids=[k for k, v in sorted(aids_temp.items(), key=lambda item: -item[1])]
        labels.append(sorted_aids[:20])
    else:
        # here we don't have 20 aids to output -- we will use word2vec embeddings to generate candidates!
        AIDs = list(dict.fromkeys(AIDs[::-1]))
        
        # let's use up to the 3 most recent aids as queries
        recent_len = max(min(3,len(AIDs)),1)
        
        # how many neighbours to request for each recent aid
        AIDs_num = round((20-len(AIDs))/recent_len) + 2
        
        # let's look for some neighbors!        
        nns_it = []
        for it in range(0,recent_len):
            nns_it += [w2vec.wv.index_to_key[i] for i in index.get_nns_by_item(aid2idx[AIDs[it]], AIDs_num)[1:]]
        
        # select repeating and unique neighbors
        nns_repeated = [item for item, count in collections.Counter(nns_it).items() if count > 1]
        nns_once = [item for item, count in collections.Counter(nns_it).items() if count == 1]

        # prepare selection
        nns = (nns_repeated+nns_once)[:20]
        labels.append((AIDs+nns)[:20])
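The two branches above can be sketched as small standalone functions. This is a pure-Python illustration under stated assumptions: `recency_weights`, `rerank` and `merge_candidates` are hypothetical helper names, `recency_weights` reproduces `np.logspace(0.1, 1, n, base=2) - 1` without NumPy, and the toy sessions are invented.

```python
from collections import Counter, defaultdict

TYPE_WEIGHTS = {0: 1, 1: 6, 2: 3}  # same multipliers as in the notebook

def recency_weights(n, start=0.1, stop=1.0):
    """Pure-Python equivalent of np.logspace(start, stop, n, base=2) - 1."""
    if n == 1:
        return [2 ** start - 1]
    step = (stop - start) / (n - 1)
    return [2 ** (start + i * step) - 1 for i in range(n)]

def rerank(aids, types, top_k=20):
    """Long-session branch: score each aid by recency and event type, best first."""
    scores = defaultdict(float)
    for aid, w, t in zip(aids, recency_weights(len(aids)), types):
        scores[aid] += w * TYPE_WEIGHTS[t]
    return [aid for aid, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:top_k]

def merge_candidates(session_aids, neighbours, top_k=20):
    """Short-session branch: session aids (most recent first, de-duplicated),
    then neighbours found more than once, then neighbours found once."""
    aids = list(dict.fromkeys(session_aids[::-1]))
    counts = Counter(neighbours)
    repeated = [a for a, c in counts.items() if c > 1]
    once = [a for a, c in counts.items() if c == 1]
    return (aids + repeated + once)[:top_k]

# Hypothetical toy session: aid 8 has the only event of type 1, so it ranks first.
ranked = rerank([7, 8, 7, 9], [0, 1, 0, 0])
print(ranked)  # [8, 9, 7]

# Hypothetical neighbours: 5 was returned for two different query aids.
merged = merge_candidates([1, 2, 1, 3], [5, 6, 5, 7])
print(merged)  # [3, 1, 2, 5, 6, 7]
```

Later events get larger weights because the exponent grows from 0.1 to 1, so the last event in a session always has weight 1 before the type multiplier is applied.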

# Preparing submission dataframe

In [7]:
labels_as_strings = [' '.join([str(l) for l in lls]) for lls in labels]

predictions = pd.DataFrame(data={'session_type': test_session_AIDs.index, 'labels': labels_as_strings})

prediction_dfs = []

for st in session_types:
    modified_predictions = predictions.copy()
    modified_predictions.session_type = modified_predictions.session_type.astype('str') + f'_{st}'
    prediction_dfs.append(modified_predictions)

submission = pd.concat(prediction_dfs).reset_index(drop=True)
submission.to_csv('submission.csv', index=False)

del labels, labels_as_strings, predictions, prediction_dfs
gc.collect()

0

In [8]:
submission.head()

Unnamed: 0,session_type,labels
0,12899779_clicks,59625 509607 943880 345122 1212829 559524 1548...
1,12899780_clicks,1142000 736515 973453 582732 889686 487136 136...
2,12899781_clicks,918667 199008 194067 57315 141736 1119163 1228...
3,12899782_clicks,834354 595994 740494 889671 987399 779477 1344...
4,12899783_clicks,1817895 607638 1754419 1216820 1729553 300127 ...
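For reference, the row layout produced above can be sketched without pandas. The helper name `to_submission_rows` and the toy session ids are assumptions made for illustration; as in the notebook, the same labels are reused for all three session types.

```python
def to_submission_rows(session_ids, labels, session_types=('clicks', 'carts', 'orders')):
    """Build (session_type, labels) rows: one row per session and per type."""
    rows = []
    for st in session_types:
        for sid, aids in zip(session_ids, labels):
            rows.append((f'{sid}_{st}', ' '.join(str(a) for a in aids)))
    return rows

# Hypothetical toy predictions for two sessions.
toy_rows = to_submission_rows([100, 101], [[1, 2], [3]])
for row in toy_rows:
    print(row)
```

Two sessions and three types give six rows, ordered by type first, which matches the order produced by concatenating the per-type dataframes.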


Wishing you all the best in the challenge! 🤞