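Notes, todo

[todo] - visualize session lengths, explain the session concept

- Evaluation functions (Recall, MRR), mAP
- Understanding the session-based task
- Train/valid/test split strategy
- Why session-parallel mini-batches were used -> in fact, recent papers rarely use them; they favor modeling that exploits the characteristics of the data instead.
- (Note) loss and sampling are not covered
- Labels are built from the data itself (the next click in a session serves as the target)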
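
The evaluation functions listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own evaluation code: the helper names `recall_at_k` and `mrr_at_k`, the toy `scores` matrix, and the cutoff values are assumptions made for the example. Each row of `scores` holds the model's score for every item, and `targets` holds the index of the item that was actually clicked next.

```python
import numpy as np


def recall_at_k(scores, targets, k=20):
    """Fraction of rows whose target item is among the k highest-scored items."""
    top_k = np.argsort(-scores, axis=1, kind='stable')[:, :k]
    hits = (top_k == targets[:, None]).any(axis=1)
    return hits.mean()


def mrr_at_k(scores, targets, k=20):
    """Mean reciprocal rank of the target item; rows that miss the top k count as 0."""
    top_k = np.argsort(-scores, axis=1, kind='stable')[:, :k]
    hit_rows, hit_cols = np.nonzero(top_k == targets[:, None])
    reciprocal_ranks = np.zeros(len(targets))
    reciprocal_ranks[hit_rows] = 1.0 / (hit_cols + 1)  # ranks are 1-based
    return reciprocal_ranks.mean()


# Toy example: 3 predictions over 4 items (hypothetical numbers).
scores = np.array([[0.10, 0.70, 0.20, 0.00],
                   [0.50, 0.05, 0.30, 0.15],
                   [0.30, 0.15, 0.05, 0.50]])
targets = np.array([1, 2, 0])

print(recall_at_k(scores, targets, k=1), mrr_at_k(scores, targets, k=1))
print(recall_at_k(scores, targets, k=2), mrr_at_k(scores, targets, k=2))
```

Recall@K only asks whether the target made the top K, while MRR@K also rewards placing it higher, which is why the two are usually reported together.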


- [recsys 2015 challenge](https://recsys.yoochoose.net/challenge.html) dataset
- (Note) The archive is compressed in 7z format; the download and extraction steps are omitted.

- ![aladin](./asset/시크릿모드.png)

>The YOOCHOOSE dataset contain a collection of sessions from a retailer, where each session
is encapsulating the click events that the user performed in the session.
For some of the sessions, there are also buy events; means that the session ended
with the user bought something from the web shop. The data was collected during several
months in the year of 2014, reflecting the clicks and purchases performed by the users
of an on-line retailer in Europe.  **To protect end users privacy, as well as the retailer,
all numbers have been modified.** Do not try to reveal the identity of the retailer.

In [1]:
import numpy as np
import pandas as pd
import datetime as dt

PATH_TO_ORIGINAL_DATA = '/Users/zimin/Downloads/archive/'  # 'D:\\data\\yoochoose-data\\'
PATH_TO_PROCESSED_DATA = '/Users/zimin/Downloads/archive/'  # 'D:\\data\\yoochoose-data\\'

In [2]:
# Load the first 100,000 click events: session id, timestamp, item id.
data = pd.read_csv(PATH_TO_ORIGINAL_DATA + 'yoochoose-clicks.dat', sep=',', header=None, usecols=[0, 1, 2],
                   parse_dates=[1],
                   dtype={0: np.int32, 2: np.int32}, nrows=100000)
data.columns = ['SessionId', 'Time', 'ItemId']

# Drop single-click sessions: they have no next item to predict.
session_lengths = data.groupby('SessionId').size()
data = data[np.isin(data.SessionId, session_lengths[session_lengths > 1].index)]

# Keep only items that appear at least 5 times.
item_supports = data.groupby('ItemId').size()
data = data[np.isin(data.ItemId, item_supports[item_supports >= 5].index)]

# Removing items can shorten sessions, so apply the length filter again.
session_lengths = data.groupby('SessionId').size()
data = data[np.isin(data.SessionId, session_lengths[session_lengths >= 2].index)]

# Time-based split: sessions that end within the last day form the test set.
max_time = data['Time'].max()
session_max_times = data.groupby('SessionId')['Time'].max()
session_train = session_max_times[session_max_times < max_time - dt.timedelta(1)].index
session_test = session_max_times[session_max_times >= max_time - dt.timedelta(1)].index

train = data[np.isin(data.SessionId, session_train)]
test = data[np.isin(data.SessionId, session_test)]

# Remove test clicks on items never seen in training.
test = test[np.isin(test.ItemId, train.ItemId)]

# Drop test sessions left with fewer than 2 clicks.
test_length = test.groupby('SessionId').size()
test = test[np.isin(test.SessionId, test_length[test_length >= 2].index)]

print(
    f'Full train set\n\tEvents: {len(train)}\n\tSessions: {train.SessionId.nunique()}\n\tItems: {train.ItemId.nunique()}')
train.to_csv(PATH_TO_PROCESSED_DATA + 'rsc15_train_full.txt', sep='\t', index=False)

print(f'Test set\n\tEvents: {len(test)}\n\tSessions: {test.SessionId.nunique()}\n\tItems: {test.ItemId.nunique()}')
test.to_csv(PATH_TO_PROCESSED_DATA + 'rsc15_test.txt', sep='\t', index=False)

# Repeat the same last-day split inside the training set to obtain a validation set.
max_train_time = train.Time.max()
session_max_times = train.groupby('SessionId').Time.max()
session_train = session_max_times[session_max_times < max_train_time - dt.timedelta(1)].index
session_valid = session_max_times[session_max_times >= max_train_time - dt.timedelta(1)].index
train_tr = train[np.isin(train.SessionId, session_train)]
valid = train[np.isin(train.SessionId, session_valid)]
# Keep only validation clicks on items seen in train_tr, then drop sessions with fewer than 2 clicks.
valid = valid[np.isin(valid.ItemId, train_tr.ItemId)]
valid_length = valid.groupby('SessionId').size()
valid = valid[np.isin(valid.SessionId, valid_length[valid_length >= 2].index)]
print(
    f'Train set\n\tEvents: {len(train_tr)}\n\tSessions: {train_tr.SessionId.nunique()}\n\tItems: {train_tr.ItemId.nunique()}')
train_tr.to_csv(PATH_TO_PROCESSED_DATA + 'rsc15_train_tr.txt', sep='\t', index=False)

print(
    f'Validation set\n\tEvents: {len(valid)}\n\tSessions: {valid.SessionId.nunique()}\n\tItems: {valid.ItemId.nunique()}')
valid.to_csv(PATH_TO_PROCESSED_DATA + 'rsc15_train_valid.txt', sep='\t', index=False)

Full train set
	Events: 70278
	Sessions: 17794
	Items: 2933
Test set
	Events: 13568
	Sessions: 3416
	Items: 1771
Train set
	Events: 53254
	Sessions: 13629
	Items: 2873
Validation set
	Events: 16539
	Sessions: 4084
	Items: 2029


In [3]:
class SessionDataset:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, data, session_key='SessionId', item_key='ItemId', time_key='Time'):
        # Work on a copy so the caller's DataFrame is not modified
        # (this also avoids pandas' SettingWithCopyWarning).
        self.df = data.copy()
        self.session_key = session_key
        self.item_key = item_key
        self.time_key = time_key
        # Despite its name, idx2id maps each ItemId to a contiguous index starting at 0.
        self.idx2id = self.add_item_indices()
        self.df.sort_values([session_key, time_key], inplace=True)
        # After sorting, the clicks of a session are adjacent and time-ordered.
        self.click_offsets = self.get_click_offsets()
        self.session_idx_arr = np.arange(self.df[self.session_key].nunique())  # positional index of each session

    def add_item_indices(self):
        # enumerate() yields (index, item_id), so the dict maps item_id -> index.
        idx2id = {item_id: index for index, item_id in enumerate(self.df[self.item_key].unique())}
        self.df['item_idx'] = self.df[self.item_key].map(idx2id.get)
        return idx2id

    @property
    def items(self):
        return self.df[self.item_key].unique()

    def get_click_offsets(self):
        """
        Return the offsets of the beginning clicks of each session IDs,
        where the offset is calculated against the first click of the first session ID.
        """
        offsets = np.zeros(self.df[self.session_key].nunique() + 1, dtype=np.int32)
        # group & sort the df by session_key and get the offset values
        offsets[1:] = self.df.groupby(self.session_key).size().cumsum()

        return offsets


In [4]:
self = SessionDataset(train)



In [21]:
class SessionDataLoader:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, dataset, batch_size=50):
        """
        A class for creating session-parallel mini-batches.
        Args:
            dataset (SessionDataset): the session dataset to generate the batches from
            batch_size (int): size of the batch
        """
        self.dataset = dataset
        self.batch_size = batch_size
        self.done_sessions_counter = 0

    def __iter__(self):  # https://dojang.io/mod/page/view.php?id=2405
        """Return a generator that produces session-parallel training mini-batches.

        Yields:
            input (B,): item indices of the current clicks; one-hot encoded later.
            target (B,): item indices of the next click in each session.
            mask: batch positions whose session just ended, so their hidden state must be reset.
        """

        df = self.dataset.df
        self.n_items = df['ItemId'].nunique() + 1
        click_offsets = self.dataset.click_offsets
        session_idx_arr = self.dataset.session_idx_arr

        iters = np.arange(self.batch_size)
        max_iter = iters.max()
        start = click_offsets[session_idx_arr[iters]]  # Session Start
        end = click_offsets[session_idx_arr[iters] + 1]  # Session End
        mask = []  # indicator for the sessions to be terminated
        finished = False

        while not finished:
            min_len = (end - start).min()  # Shortest Session
            # Item indices (for embedding) for clicks where the first sessions start
            for i in range(min_len - 1):
                # Build inputs & targets
                inp = df.item_idx.values[start + i]
                target = df.item_idx.values[start + i + 1]
                yield inp, target, mask

            # move every slot forward to the last click of the shortest session
            start = start + (min_len - 1)
            # slots with at most one click left have no further target, so their sessions end here
            mask = np.arange(len(iters))[(end - start) <= 1]
            self.done_sessions_counter = len(mask)
            for idx in mask:
                max_iter += 1
                if max_iter >= len(click_offsets) - 1:
                    finished = True
                    break
                # update the next starting/ending point
                iters[idx] = max_iter
                start[idx] = click_offsets[session_idx_arr[max_iter]]
                end[idx] = click_offsets[session_idx_arr[max_iter] + 1]


In [24]:
two = SessionDataLoader(self)
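
To make the session-parallel idea concrete, here is a standalone sketch on plain Python lists. It is not the `SessionDataLoader` above: the helper `session_parallel_batches` and the toy `sessions` are hypothetical, and it assumes every session has at least 2 items and that there are at least `batch_size` sessions. Each batch slot walks through one session; when that session runs out of targets, the slot loads the next unused session and is reported so its hidden state can be reset.

```python
import numpy as np


def session_parallel_batches(sessions, batch_size):
    """Yield (inputs, targets, reset_slots) from a list of item-index sequences."""
    active = list(range(batch_size))  # session currently assigned to each slot
    pos = [0] * batch_size            # position of the input click inside that session
    next_session = batch_size         # next session that has not been assigned yet
    reset_slots = []                  # slots that switched to a new session before this batch
    while True:
        inputs = np.array([sessions[s][p] for s, p in zip(active, pos)])
        targets = np.array([sessions[s][p + 1] for s, p in zip(active, pos)])
        yield inputs, targets, reset_slots
        reset_slots = []
        for slot in range(batch_size):
            pos[slot] += 1
            if pos[slot] + 1 >= len(sessions[active[slot]]):
                # No target left in this session: hand the slot to the next session.
                if next_session >= len(sessions):
                    return  # out of sessions; remaining clicks in other slots are dropped
                active[slot] = next_session
                pos[slot] = 0
                next_session += 1
                reset_slots.append(slot)


sessions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11]]
for inputs, targets, reset_slots in session_parallel_batches(sessions, batch_size=2):
    print(inputs, targets, reset_slots)
```

Like the loader above, this sketch stops as soon as a slot needs a new session and none is left, so the tail of the longest remaining session (here the click pair 8 -> 9) never becomes a training example.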

- The model structure is simple, so it is built with the Keras Functional API

In [28]:
import numpy as np
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Input, Dense, Dropout, GRU
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tqdm import tqdm

In [66]:
class Args:
    def __init__(self, num_items, batch_size, hsz, drop_rate, lr, epochs):
        self.num_items = num_items
        self.batch_size = batch_size
        self.hsz = hsz
        self.drop_rate = drop_rate
        self.lr = lr
        self.epochs = epochs


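# num_items = 2933 distinct items in the full train set + 1 (matches SessionDataLoader.n_items)
args = Args(num_items=2934, batch_size=20, hsz=50, drop_rate=0.1, lr=0.001, epochs=3)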
args.train_samples_qty = train['SessionId'].nunique()

In [67]:
inputs = Input(batch_shape=(args.batch_size, 1, args.num_items))
gru, gru_states = GRU(args.hsz, stateful=True, return_state=True, name='GRU')(inputs)
dropout = Dropout(args.drop_rate)(gru)
predictions = Dense(args.num_items, activation='softmax')(dropout)
model = Model(inputs=inputs, outputs=[predictions])
model.compile(loss=categorical_crossentropy, optimizer=Adam(args.lr))
model.summary()

Model: "functional_15"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         [(20, 1, 2934)]           0         
_________________________________________________________________
GRU (GRU)                    [(20, 50), (20, 50)]      447900    
_________________________________________________________________
dropout_7 (Dropout)          (20, 50)                  0         
_________________________________________________________________
dense_7 (Dense)              (20, 2934)                149634    
Total params: 597,534
Trainable params: 597,534
Non-trainable params: 0
_________________________________________________________________
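
The parameter counts in the summary can be checked by hand. The sketch below assumes the GRU uses the TensorFlow 2 default `reset_after=True`, which keeps separate input and recurrent bias vectors for each of the three gates; the helper names are made up for this check.

```python
def gru_param_count(input_dim, units):
    # 3 gates x (input kernel + recurrent kernel + input bias + recurrent bias)
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_param_count(input_dim, units):
    # weight matrix + bias vector
    return input_dim * units + units


num_items, hsz = 2934, 50
gru_params = gru_param_count(num_items, hsz)
dense_params = dense_param_count(hsz, num_items)
print(gru_params, dense_params, gru_params + dense_params)  # 447900 149634 597534
```

Almost all of the weights sit in the input kernel of the GRU and in the output layer, because both scale with the number of items in the one-hot encoding.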


In [68]:
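def train_model(model, args):
    train_dataset = SessionDataset(train)
    train_loader = SessionDataLoader(train_dataset, batch_size=args.batch_size)

    for epoch in range(1, args.epochs + 1):
        # Every epoch restarts from the first sessions, so clear the stateful GRU's hidden states.
        model.get_layer(name='GRU').reset_states()
        # Note: `total` is the number of sessions, not the number of batch steps,
        # so the progress bar is only a rough indicator.
        loader = tqdm(train_loader, total=args.train_samples_qty)
        for i, (feat, target, mask) in enumerate(loader):
            # Zero the hidden state of the slots whose session just ended.
            reset_hidden_states(model, mask)

            # One-hot encode inputs to (batch, 1, n_items) and targets to (batch, n_items).
            input_ohe = to_categorical(feat, num_classes=train_loader.n_items)
            input_ohe = np.expand_dims(input_ohe, axis=1)
            target_ohe = to_categorical(target, num_classes=train_loader.n_items)

            tr_loss = model.train_on_batch(input_ohe, target_ohe)
            loader.set_postfix(train_loss=tr_loss)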


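def reset_hidden_states(model, mask):
    """Zero the GRU hidden state of the batch slots listed in `mask`."""
    if len(mask) == 0:  # no session ended, nothing to reset
        return
    gru_layer = model.get_layer(name='GRU')
    hidden_states = gru_layer.states[0].numpy()
    for elt in mask:
        hidden_states[elt, :] = 0
    gru_layer.reset_states(states=hidden_states)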
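
The effect of the masked reset can be seen on a plain NumPy array, without building a model. This is only an illustration of the indexing: `reset_rows` and the toy `hidden` matrix are hypothetical and stand in for the GRU state of shape (batch_size, hsz).

```python
import numpy as np


def reset_rows(hidden, mask):
    """Return a copy of `hidden` in which the rows listed in `mask` are set to zero."""
    hidden = hidden.copy()
    hidden[np.asarray(mask, dtype=np.intp)] = 0.0
    return hidden


hidden = np.arange(1, 7, dtype=float).reshape(3, 2)  # 3 batch slots, hidden size 2
print(reset_rows(hidden, [1]))  # only slot 1 starts its new session from a zero state
print(reset_rows(hidden, []))   # empty mask: nothing changes
```

Only the slots that switched to a new session are cleared; the other rows keep the state accumulated from their ongoing sessions, which is what lets one stateful GRU serve many sessions in parallel.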

In [69]:
train_model(model, args)

 12%|█▏        | 2111/17794 [00:17<02:06, 124.04it/s, train_loss=6.47]


KeyboardInterrupt: 