We will build a session-based recommendation system on the MovieLens 1M dataset.

**Session-Based Recommendation** predicts the item a user will click or purchase next, based on session data.  
*A session is the data that holds important information generated while a user uses a service, and it is stored on the server side.*

# PROJECT 17 - MovieLens Movie SBR

## 1. Loading the Data

In [1]:
import datetime as dt
from pathlib import Path
import os
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

data_path = Path(os.getenv('HOME')+'/aiffel/exploration/yoochoose-data_') 
train_path = data_path / 'ratings.dat'

def load_data(data_path: Path, nrows=None):
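    # A multi-character separator such as '::' is only supported by the Python parsing engine.
    data = pd.read_csv(data_path, sep='::', header=None, usecols=[0, 1, 2, 3], dtype={0: np.int32, 1: np.int32, 2: np.int32}, nrows=nrows, engine='python')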
    data.columns = ['UserId', 'ItemId', 'Rating', 'Time']
    return data

data = load_data(train_path, None)
data.sort_values(['UserId', 'Time'], inplace=True)  # Sort the data by user ID and time.
data

Unnamed: 0,UserId,ItemId,Rating,Time
31,1,3186,4,978300019
22,1,1270,5,978300055
27,1,1721,4,978300055
37,1,1022,5,978300055
24,1,2340,3,978300103
...,...,...,...,...
1000019,6040,2917,4,997454429
999988,6040,1921,4,997454464
1000172,6040,1784,3,997454464
1000167,6040,161,3,997454486


- UserId : unique user ID  
- ItemId : unique item (movie) ID  
- Rating : rating  
- Time : time in seconds elapsed since January 1, 1970 (Unix time)

In [2]:
user_id_len = data['UserId'].nunique()
item_id_len = data['ItemId'].nunique()

print(f"Users : {user_id_len}")
print(f"Items : {item_id_len}")

Users : 6040
Items : 3706


There are 6040 users (each treated as one session at this stage) and 3706 items.  

## 2. Preprocessing the Data

The times in the dataset are Unix timestamps, so we convert them to the datetime type.

In [3]:
from datetime import datetime

times = data["Time"]
time_lst = []

for time in times: 
    temp_date = datetime.fromtimestamp(time) 
    time_lst.append(temp_date)
    
data["Time"] = time_lst
data

Unnamed: 0,UserId,ItemId,Rating,Time
31,1,3186,4,2001-01-01 07:00:19
22,1,1270,5,2001-01-01 07:00:55
27,1,1721,4,2001-01-01 07:00:55
37,1,1022,5,2001-01-01 07:00:55
24,1,2340,3,2001-01-01 07:01:43
...,...,...,...,...
1000019,6040,2917,4,2001-08-10 23:40:29
999988,6040,1921,4,2001-08-10 23:41:04
1000172,6040,1784,3,2001-08-10 23:41:04
1000167,6040,161,3,2001-08-10 23:41:26
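
The loop above converts one timestamp at a time. A vectorized alternative is sketched below. It is only a sketch under one assumption: `pd.to_datetime(..., unit='s')` yields timezone-naive UTC values, whereas `datetime.fromtimestamp` uses the machine's local time zone, so the wall-clock values can differ from the output above by the local UTC offset. The toy frame is illustrative and reuses the first timestamps shown earlier.

```python
import pandas as pd

# Toy frame with the same 'Time' column layout (Unix seconds).
toy = pd.DataFrame({"Time": [978300019, 978300055, 978300103]})

# Vectorized conversion; the result is timezone-naive UTC.
toy["Time"] = pd.to_datetime(toy["Time"], unit="s")
print(toy)
```

If local time matters for the analysis, the converted column would need an explicit time-zone conversion afterwards.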


In [4]:
data.columns = ['SessionId', 'ItemId', 'Rating', 'Time']
data

Unnamed: 0,SessionId,ItemId,Rating,Time
31,1,3186,4,2001-01-01 07:00:19
22,1,1270,5,2001-01-01 07:00:55
27,1,1721,4,2001-01-01 07:00:55
37,1,1022,5,2001-01-01 07:00:55
24,1,2340,3,2001-01-01 07:01:43
...,...,...,...,...
1000019,6040,2917,4,2001-08-10 23:40:29
999988,6040,1921,4,2001-08-10 23:41:04
1000172,6040,1784,3,2001-08-10 23:41:04
1000167,6040,161,3,2001-08-10 23:41:26


## 3. Session Length

session_length is the number of actions a user took during a session (that is, how many items they clicked).

In [5]:
session_length = data.groupby('SessionId').size()
session_length

SessionId
1        53
2       129
3        51
4        21
5       198
       ... 
6036    888
6037    202
6038     20
6039    123
6040    341
Length: 6040, dtype: int64

In [6]:
session_length.median(), session_length.mean()

(96.0, 165.5975165562914)

In [7]:
session_length.min(), session_length.max()

(20, 2314)

In [8]:
session_length.quantile(0.999)

1343.181000000005

The median session length is 96 and the mean is about 165.6; the longest session has 2314 events, and 99.9% of sessions have a length of about 1343 or less.  

## 4. Session Time

We check the time-related information in the data.

In [9]:
oldest, latest = data['Time'].min(), data['Time'].max()
print(oldest) 
print(latest)

2000-04-26 08:05:32
2003-03-01 02:49:50


The data covers roughly three years (2000-04-26 to 2003-03-01, about 34 months).  

## 5. Data Cleansing

Our goal is to predict the next click once a user has clicked at least one item, so we remove sessions of length 1.  
Items that were clicked too rarely may be anomalous, so we remove them as well.

In [10]:
# Removing unpopular items after removing short sessions can create new sessions of length 1.
# We therefore repeat both steps in a loop until nothing more is removed.
def cleanse_recursive(data: pd.DataFrame, shortest, least_click) -> pd.DataFrame:
    while True:
        before_len = len(data)
        data = cleanse_short_session(data, shortest)
        data = cleanse_unpopular_item(data, least_click)
        after_len = len(data)
        if before_len == after_len:
            break
    return data


def cleanse_short_session(data: pd.DataFrame, shortest):
    session_len = data.groupby('SessionId').size()
    session_use = session_len[session_len >= shortest].index
    data = data[data['SessionId'].isin(session_use)]
    return data


def cleanse_unpopular_item(data: pd.DataFrame, least_click):
    item_popular = data.groupby('ItemId').size()
    item_use = item_popular[item_popular >= least_click].index
    data = data[data['ItemId'].isin(item_use)]
    return data

In [11]:
data_user = cleanse_recursive(data, shortest=2, least_click=5)
data_user

Unnamed: 0,SessionId,ItemId,Rating,Time
31,1,3186,4,2001-01-01 07:00:19
22,1,1270,5,2001-01-01 07:00:55
27,1,1721,4,2001-01-01 07:00:55
37,1,1022,5,2001-01-01 07:00:55
24,1,2340,3,2001-01-01 07:01:43
...,...,...,...,...
1000019,6040,2917,4,2001-08-10 23:40:29
999988,6040,1921,4,2001-08-10 23:41:04
1000172,6040,1784,3,2001-08-10 23:41:04
1000167,6040,161,3,2001-08-10 23:41:26


In [12]:
data_user = data_user.dropna(axis=0)
data_user

Unnamed: 0,SessionId,ItemId,Rating,Time
31,1,3186,4,2001-01-01 07:00:19
22,1,1270,5,2001-01-01 07:00:55
27,1,1721,4,2001-01-01 07:00:55
37,1,1022,5,2001-01-01 07:00:55
24,1,2340,3,2001-01-01 07:01:43
...,...,...,...,...
1000019,6040,2917,4,2001-08-10 23:40:29
999988,6040,1921,4,2001-08-10 23:41:04
1000172,6040,1784,3,2001-08-10 23:41:04
1000167,6040,161,3,2001-08-10 23:41:26


## 6. Train / Valid / Test split

We create a validation set and a test set for model evaluation.

In [13]:
def split_by_date(data: pd.DataFrame, n_days: int):
    final_time = data['Time'].max()
    session_last_time = data.groupby('SessionId')['Time'].max()
    session_in_train = session_last_time[session_last_time < final_time - dt.timedelta(n_days)].index
    session_in_test = session_last_time[session_last_time >= final_time - dt.timedelta(n_days)].index

    before_date = data[data['SessionId'].isin(session_in_train)]
    after_date = data[data['SessionId'].isin(session_in_test)]
    after_date = after_date[after_date['ItemId'].isin(before_date['ItemId'])]
    return before_date, after_date

In [14]:
tr, test = split_by_date(data_user, n_days=30)
tr, val = split_by_date(tr, n_days=30)

In [15]:
# Inspect summary statistics of the data.
def stats_info(data: pd.DataFrame, status: str):
    print(f'* {status} Set Stats Info\n'
          f'\t Events: {len(data)}\n'
          f'\t Sessions: {data["SessionId"].nunique()}\n'
          f'\t Items: {data["ItemId"].nunique()}\n'
          f'\t First Time : {data["Time"].min()}\n'
          f'\t Last Time : {data["Time"].max()}\n')

In [16]:
stats_info(tr, 'train')
stats_info(val, 'valid')
stats_info(test, 'test')

* train Set Stats Info
	 Events: 919209
	 Sessions: 5858
	 Items: 3416
	 First Time : 2000-04-26 08:05:32
	 Last Time : 2002-12-30 11:26:14

* valid Set Stats Info
	 Events: 29477
	 Sessions: 79
	 Items: 2960
	 First Time : 2000-05-06 02:20:21
	 Last Time : 2003-01-29 12:00:40

* test Set Stats Info
	 Events: 50925
	 Sessions: 103
	 Items: 3172
	 First Time : 2000-05-01 20:15:13
	 Last Time : 2003-03-01 02:49:50



In [17]:
# Items missing from the train set can appear in the val/test period, so we index items based on the train data.
id2idx = {item_id : index for index, item_id in enumerate(tr['ItemId'].unique())}

def indexing(df, id2idx):
    df['item_idx'] = df['ItemId'].map(lambda x: id2idx.get(x, -1))  # Items not in id2idx are treated as unknown (-1).
    return df

tr = indexing(tr, id2idx)
val = indexing(val, id2idx)
test = indexing(test, id2idx)

In [18]:
save_path = data_path / 'processed'
save_path.mkdir(parents=True, exist_ok=True)

tr.to_pickle(save_path / 'train.pkl')
val.to_pickle(save_path / 'valid.pkl')
test.to_pickle(save_path / 'test.pkl')

## 7. Session Dataset

Given the data, we build a class that stores the row index at which each session starts and a new zero-based index for the sessions.

In [19]:
class SessionDataset:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, data):
        self.df = data
        self.click_offsets = self.get_click_offsets()
        self.session_idx = np.arange(self.df['SessionId'].nunique())  # indexing to SessionId

    def get_click_offsets(self):
        """
        Return the row index of the first click of each session ID.
        """
        offsets = np.zeros(self.df['SessionId'].nunique() + 1, dtype=np.int32)
        offsets[1:] = self.df.groupby('SessionId').size().cumsum()
        return offsets

We create a SessionDataset object from the train data.  

In [20]:
tr_dataset = SessionDataset(tr)
tr_dataset.df.head(10)

Unnamed: 0,SessionId,ItemId,Rating,Time,item_idx
31,1,3186,4,2001-01-01 07:00:19,0
22,1,1270,5,2001-01-01 07:00:55,1
27,1,1721,4,2001-01-01 07:00:55,2
37,1,1022,5,2001-01-01 07:00:55,3
24,1,2340,3,2001-01-01 07:01:43,4
36,1,1836,5,2001-01-01 07:02:52,5
3,1,3408,4,2001-01-01 07:04:35,6
7,1,2804,5,2001-01-01 07:11:59,7
47,1,1207,4,2001-01-01 07:11:59,8
0,1,1193,5,2001-01-01 07:12:40,9


The click_offsets variable holds the row index at which each session starts.

In [21]:
tr_dataset.click_offsets

array([     0,     53,    182, ..., 918745, 918868, 919209], dtype=int32)
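
To make the role of `click_offsets` concrete, here is a minimal sketch on hypothetical toy data (three sessions of lengths 3, 2 and 4). It mirrors the cumulative-sum idea used in `get_click_offsets`; the toy frame and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy log: three sessions of lengths 3, 2 and 4, already sorted by SessionId.
toy = pd.DataFrame({"SessionId": [1, 1, 1, 2, 2, 3, 3, 3, 3]})

sizes = toy.groupby("SessionId").size()             # session lengths: 3, 2, 4
offsets = np.zeros(len(sizes) + 1, dtype=np.int32)  # one extra slot for the end boundary
offsets[1:] = sizes.cumsum()                        # running total of session lengths

print(offsets)  # [0 3 5 9]
```

Session `i` then occupies rows `offsets[i]` through `offsets[i + 1] - 1`, which is why the array has one more entry than there are sessions.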

The session_idx variable is an np.array that indexes each session.

In [22]:
tr_dataset.session_idx

array([   0,    1,    2, ..., 5855, 5856, 5857])

## 8. Session Data Loader

We now build a class that takes a SessionDataset object and produces session-parallel mini-batches.

The mask will later be used to reset the hidden state of the RNN.

In [23]:
class SessionDataLoader:
    """Credit to yhs-968/pyGRU4REC."""

    def __init__(self, dataset: SessionDataset, batch_size=50):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        """ Returns the iterator for producing session-parallel training mini-batches.
        Yields:
            input (B,):  Item indices that will be encoded as one-hot vectors later.
            target (B,): a Variable that stores the target item indices
            masks: Numpy array indicating the positions of the sessions to be terminated
        """

        start, end, mask, last_session, finished = self.initialize()  # See the initialize method.
        """
        start : Index Where Session Start
        end : Index Where Session End
        mask : indicator for the sessions to be terminated
        """

        while not finished:
            min_len = (end - start).min() - 1  # Shortest Length Among Sessions
            for i in range(min_len):
                # Build inputs & targets
                inp = self.dataset.df['item_idx'].values[start + i]
                target = self.dataset.df['item_idx'].values[start + i + 1]
                yield inp, target, mask

            start, end, mask, last_session, finished = self.update_status(start, end, min_len, last_session, finished)

    def initialize(self):
        first_iters = np.arange(self.batch_size)    # Indices of the sessions used in the first batch.
        last_session = self.batch_size - 1    # Index of the last session currently being processed.
        start = self.dataset.click_offsets[self.dataset.session_idx[first_iters]]       # Row where each session starts in the data.
        end = self.dataset.click_offsets[self.dataset.session_idx[first_iters] + 1]  # Row just after the last row of each session.
        mask = np.array([])   # Batch slots whose session has been fully consumed are recorded here.
        finished = False         # Tracks whether the whole dataset has been consumed.
        return start, end, mask, last_session, finished

    def update_status(self, start: np.ndarray, end: np.ndarray, min_len: int, last_session: int, finished: bool):  
        # Update the state to generate the next batch.
        
        start += min_len   # __iter__ looped min_len times, so advance start by min_len.
        mask = np.arange(self.batch_size)[(end - start) == 1]  
        # end is where the next session starts, so a gap of one from start means the session has ended. Record it in mask.

        for i, idx in enumerate(mask, start=1):  # Start as many new sessions as were added to mask.
            new_session = last_session + i  
            if new_session > self.dataset.session_idx[-1]:  # If the new session index exceeds the last session index, all training data has been consumed.
                finished = True
                break
            # update the next starting/ending point
            start[idx] = self.dataset.click_offsets[self.dataset.session_idx[new_session]]     # Record the start of the new session in place of the finished one.
            end[idx] = self.dataset.click_offsets[self.dataset.session_idx[new_session] + 1]

        last_session += len(mask)  # Record the position of the last session.
        return start, end, mask, last_session, finished

In [24]:
tr_data_loader = SessionDataLoader(tr_dataset, batch_size=4)
tr_dataset.df.head(15)

Unnamed: 0,SessionId,ItemId,Rating,Time,item_idx
31,1,3186,4,2001-01-01 07:00:19,0
22,1,1270,5,2001-01-01 07:00:55,1
27,1,1721,4,2001-01-01 07:00:55,2
37,1,1022,5,2001-01-01 07:00:55,3
24,1,2340,3,2001-01-01 07:01:43,4
36,1,1836,5,2001-01-01 07:02:52,5
3,1,3408,4,2001-01-01 07:04:35,6
7,1,2804,5,2001-01-01 07:11:59,7
47,1,1207,4,2001-01-01 07:11:59,8
0,1,1193,5,2001-01-01 07:12:40,9


In [25]:
iter_ex = iter(tr_data_loader)

In [26]:
inputs, labels, mask =  next(iter_ex)
print(f'Model Input Item Idx are : {inputs}')
print(f'Label Item Idx are : {"":5} {labels}')
print(f'Previous Masked Input Idx are {mask}')

Model Input Item Idx are : [ 0 53 65 54]
Label Item Idx are :       [ 1 54 62 24]
Previous Masked Input Idx are []
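
The first batch above pairs, for each of the four batch slots, the current item of one session with the next item of that same session. The sketch below replays that logic on hypothetical toy data with two slots. It is a simplified illustration of the first pass of `__iter__` and the mask computation in `update_status`, not a replacement for the class; `items`, `offsets` and `steps` are made-up names.

```python
import numpy as np

# Hypothetical toy data: item indices of three sessions stored back to back.
items = np.array([10, 11, 12, 20, 21, 30, 31, 32, 33])
offsets = np.array([0, 3, 5, 9])          # session boundaries, as in click_offsets

batch_size = 2
start = offsets[:batch_size].copy()       # first row of sessions 0 and 1
end = offsets[1:batch_size + 1].copy()    # one past the last row of each session

steps = []
min_len = (end - start).min() - 1         # steps available before the shortest session runs out
for i in range(min_len):
    inp = items[start + i]                # current item in every batch slot
    target = items[start + i + 1]         # next item in the same session
    steps.append((inp, target))
    print(inp, target)

start += min_len
mask = np.arange(batch_size)[(end - start) == 1]   # slots whose session is now exhausted
print(mask)
```

Here the second session has only two events, so a single step is produced and slot 1 is reported in `mask`. In the full loader that slot would then be refilled with the third session and its hidden state reset.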


## 9. Modeling

### Evaluation Metric

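In a session-based recommendation task, when the model presents k items, the more of them the user actually clicks or purchases, the better.

Here we use MRR and Recall@k. MRR (Mean Reciprocal Rank) averages the reciprocal of the rank at which the correct item appears.  
The metric is therefore higher when the correct item appears near the top of the recommendations, and lower when it appears near the bottom or not at all.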

In [27]:
def mrr_k(pred, truth: int, k: int):
    indexing = np.where(pred[:k] == truth)[0]
    if len(indexing) > 0:
        return 1 / (indexing[0] + 1)
    else:
        return 0


def recall_k(pred, truth: int, k: int) -> int:
    answer = truth in pred[:k]
    return int(answer)
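
As a small worked example of both metrics, the sketch below scores one hypothetical ranked list by hand. The item IDs are made up, and the expressions mirror the bodies of `mrr_k` and `recall_k` rather than calling them, so the block runs on its own.

```python
import numpy as np

# Hypothetical ranked recommendations (best first) and the true next item.
pred = np.array([7, 3, 9, 1, 5])
truth = 9

k = 3
hit_positions = np.where(pred[:k] == truth)[0]    # the hit sits at position 2, i.e. rank 3
recall_at_k = int(truth in pred[:k])              # 1: the true item is inside the top 3
rr_at_k = 1 / (hit_positions[0] + 1) if len(hit_positions) > 0 else 0   # 1/3

k2 = 2                                            # a shorter list misses the item
recall_at_2 = int(truth in pred[:k2])             # 0
hits_2 = np.where(pred[:k2] == truth)[0]
rr_at_2 = 1 / (hits_2[0] + 1) if len(hits_2) > 0 else 0                 # 0

print(recall_at_k, rr_at_k, recall_at_2, rr_at_2)
```

Averaging these per-prediction values over all evaluation steps gives the Recall@k and MRR@k figures reported during training.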

### Model Architecture

The model structure is fairly simple, so we build it with the Keras Functional API.

In [28]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, GRU
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tqdm import tqdm

In [29]:
def create_model(args):
    inputs = Input(batch_shape=(args.batch_size, 1, args.num_items))
    gru, _ = GRU(args.hsz, stateful=True, return_state=True, name='GRU')(inputs)
    dropout = Dropout(args.drop_rate)(gru)
    predictions = Dense(args.num_items, activation='softmax')(dropout)
    model = Model(inputs=inputs, outputs=[predictions])
    model.compile(loss=categorical_crossentropy, optimizer=Adam(args.lr), metrics=['accuracy'])
    model.summary()
    return model

In [30]:
class Args:
    def __init__(self, tr, val, test, batch_size, hsz, drop_rate, lr, epochs, k):
        self.tr = tr
        self.val = val
        self.test = test
        self.num_items = tr['ItemId'].nunique()
        self.num_sessions = tr['SessionId'].nunique()
        self.batch_size = batch_size
        self.hsz = hsz
        self.drop_rate = drop_rate
        self.lr = lr
        self.epochs = epochs
        self.k = k

args = Args(tr, val, test, batch_size=64, hsz=50, drop_rate=0.1, lr=0.001, epochs=3, k=20)

In [31]:
model = create_model(args)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(64, 1, 3416)]           0         
_________________________________________________________________
GRU (GRU)                    [(64, 50), (64, 50)]      520200    
_________________________________________________________________
dropout (Dropout)            (64, 50)                  0         
_________________________________________________________________
dense (Dense)                (64, 3416)                174216    
Total params: 694,416
Trainable params: 694,416
Non-trainable params: 0
_________________________________________________________________
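
The parameter counts in the summary can be checked by hand. The sketch below assumes the default Keras GRU variant (`reset_after=True`), which keeps two bias vectors per gate; the helper names `gru_params` and `dense_params` are illustrative and not part of Keras.

```python
def gru_params(input_dim: int, units: int) -> int:
    # Three weight sets (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel and two bias vectors.
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


num_items, hsz = 3416, 50
print(gru_params(num_items, hsz))      # 520200
print(dense_params(hsz, num_items))    # 174216
print(gru_params(num_items, hsz) + dense_params(hsz, num_items))  # 694416
```

Almost all parameters come from the one-hot input dimension, so the model size scales roughly linearly with the number of items.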


### Model Training

We now train the model with the dataset and model prepared so far.  
Training runs for 3 epochs.

In [32]:
# Train on the train set while validating on the valid set.
def train_model(model, args):
    train_dataset = SessionDataset(args.tr)
    train_loader = SessionDataLoader(train_dataset, batch_size=args.batch_size)

    for epoch in range(1, args.epochs + 1):
        total_step = len(args.tr) - args.tr['SessionId'].nunique()
        tr_loader = tqdm(train_loader, total=total_step // args.batch_size, desc='Train', mininterval=1)
        for feat, target, mask in tr_loader:
            reset_hidden_states(model, mask)  # Reset the hidden state of finished sessions. See the function below.

            input_ohe = to_categorical(feat, num_classes=args.num_items)
            input_ohe = np.expand_dims(input_ohe, axis=1)
            target_ohe = to_categorical(target, num_classes=args.num_items)

            result = model.train_on_batch(input_ohe, target_ohe)
            tr_loader.set_postfix(train_loss=result[0], accuracy = result[1])

        val_recall, val_mrr = get_metrics(args.val, model, args, args.k)  # Evaluate on the valid set.

        print(f"\t - Recall@{args.k} epoch {epoch}: {val_recall:3f}")
        print(f"\t - MRR@{args.k}    epoch {epoch}: {val_mrr:3f}\n")


def reset_hidden_states(model, mask):
    gru_layer = model.get_layer(name='GRU')  # Get the GRU layer from the model.
    hidden_states = gru_layer.states[0].numpy()  # Get the hidden state of the GRU layer.
    for elt in mask:  # Loop over the masked indices, i.e. the batch slots whose session has ended,
        hidden_states[elt, :] = 0  # and reset their hidden state to zero.
    gru_layer.reset_states(states=hidden_states)


def get_metrics(data, model, args, k: int):  # Evaluates the valid and test sets.
                                             # Nearly the same as training, but with lines that compute MRR and recall.
    dataset = SessionDataset(data)
    loader = SessionDataLoader(dataset, batch_size=args.batch_size)
    recall_list, mrr_list = [], []

    total_step = len(data) - data['SessionId'].nunique()
    for inputs, label, mask in tqdm(loader, total=total_step // args.batch_size, desc='Evaluation', mininterval=1):
        reset_hidden_states(model, mask)
        input_ohe = to_categorical(inputs, num_classes=args.num_items)
        input_ohe = np.expand_dims(input_ohe, axis=1)

        pred = model.predict(input_ohe, batch_size=args.batch_size)
        pred_arg = tf.argsort(pred, direction='DESCENDING')  # Sort by softmax value in descending order.

        length = len(inputs)
        recall_list.extend([recall_k(pred_arg[i], label[i], k) for i in range(length)])
        mrr_list.extend([mrr_k(pred_arg[i], label[i], k) for i in range(length)])

    recall, mrr = np.mean(recall_list), np.mean(mrr_list)
    return recall, mrr

In [33]:
# Training takes a while.
train_model(model, args)

Train:  99%|█████████▊| 14089/14271 [02:00<00:01, 116.95it/s, accuracy=0.0156, train_loss=6.09]
Evaluation:  29%|██▊       | 131/459 [00:26<01:05,  4.97it/s]
Train:   0%|          | 0/14271 [00:00<?, ?it/s, accuracy=0.0312, train_loss=5.82]

	 - Recall@20 epoch 1: 0.253340
	 - MRR@20    epoch 1: 0.067813



Train:  99%|█████████▊| 14089/14271 [01:52<00:01, 124.71it/s, accuracy=0.0781, train_loss=5.81]
Evaluation:  29%|██▊       | 131/459 [00:25<01:03,  5.17it/s]
Train:   0%|          | 0/14271 [00:00<?, ?it/s, accuracy=0.0938, train_loss=5.28]

	 - Recall@20 epoch 2: 0.291746
	 - MRR@20    epoch 2: 0.084857



Train:  99%|█████████▊| 14089/14271 [01:51<00:01, 126.12it/s, accuracy=0.0469, train_loss=5.62]
Evaluation:  29%|██▊       | 131/459 [00:25<01:04,  5.10it/s]

	 - Recall@20 epoch 3: 0.303197
	 - MRR@20    epoch 3: 0.091478






### Inference

We check whether the trained model reaches a similar level of performance on the test set.

In [34]:
def test_model(model, args, test):
    test_recall, test_mrr = get_metrics(test, model, args, args.k)  # Use args.k so the metric matches the printed label.
    print(f"\t - Recall@{args.k}: {test_recall:3f}")
    print(f"\t - MRR@{args.k}: {test_mrr:3f}\n")

test_model(model, args, test)

Evaluation:  53%|█████▎    | 418/794 [01:19<01:11,  5.27it/s]

	 - Recall@20: 0.284951
	 - MRR@20: 0.082269






## 10. Redefining Sessions

A new Session Id is started whenever the DateTime gap between consecutive rows is one hour or more.

In [35]:
session = []

start = data.iloc[0, -1]
session_id = 0
session.append(session_id)

for i in range(len(data)-1):
    end = data.iloc[i+1, -1]
    
    day = (end-start).days
    hour = (end-start).seconds / 3600
    
    if day >= 1 or hour >= 1:
        session_id = session_id + 1
    
    start = data.iloc[i+1, -1]
    session.append(session_id)
   
data["SessionId"] = session
data

Unnamed: 0,SessionId,ItemId,Rating,Time
31,0,3186,4,2001-01-01 07:00:19
22,0,1270,5,2001-01-01 07:00:55
27,0,1721,4,2001-01-01 07:00:55
37,0,1022,5,2001-01-01 07:00:55
24,0,2340,3,2001-01-01 07:01:43
...,...,...,...,...
1000019,24230,2917,4,2001-08-10 23:40:29
999988,24230,1921,4,2001-08-10 23:41:04
1000172,24230,1784,3,2001-08-10 23:41:04
1000167,24230,161,3,2001-08-10 23:41:26
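
The loop above compares consecutive rows by time only, so the last rating of one user and the first rating of the next user are separated only when their time difference happens to trigger the check. A vectorized sketch that also forces a break at every user change is shown below. It assumes a separate `UserId` column is still available, which is not the case in the notebook at this point because that column was renamed to `SessionId`; the toy data and column names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 2],
    "Time": pd.to_datetime([
        "2001-01-01 07:00:00",
        "2001-01-01 07:30:00",   # 30 min gap -> same session
        "2001-01-01 09:00:00",   # 90 min gap -> new session
        "2001-01-01 09:10:00",   # different user -> new session
        "2001-01-01 09:20:00",
    ]),
})

# Time gap to the previous row of the same user; NaT on each user's first row.
gap = toy.groupby("UserId")["Time"].diff()

# A session starts at a user's first row or after a gap of one hour or more.
new_session = gap.isna() | (gap >= pd.Timedelta(hours=1))
toy["SessionId"] = new_session.cumsum() - 1   # zero-based session ids

print(toy["SessionId"].tolist())  # [0, 0, 1, 2, 2]
```

Besides handling user boundaries explicitly, this avoids the row-by-row `iloc` access, which is slow on a table with about one million rows.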


In [36]:
data_session = cleanse_recursive(data, shortest=2, least_click=5)
data_session

Unnamed: 0,SessionId,ItemId,Rating,Time
31,0,3186,4,2001-01-01 07:00:19
22,0,1270,5,2001-01-01 07:00:55
27,0,1721,4,2001-01-01 07:00:55
37,0,1022,5,2001-01-01 07:00:55
24,0,2340,3,2001-01-01 07:01:43
...,...,...,...,...
999923,24230,232,5,2001-08-10 23:39:58
1000019,24230,2917,4,2001-08-10 23:40:29
999988,24230,1921,4,2001-08-10 23:41:04
1000172,24230,1784,3,2001-08-10 23:41:04


In [37]:
data_session = data_session.dropna(axis=0)
data_session

Unnamed: 0,SessionId,ItemId,Rating,Time
31,0,3186,4,2001-01-01 07:00:19
22,0,1270,5,2001-01-01 07:00:55
27,0,1721,4,2001-01-01 07:00:55
37,0,1022,5,2001-01-01 07:00:55
24,0,2340,3,2001-01-01 07:01:43
...,...,...,...,...
999923,24230,232,5,2001-08-10 23:39:58
1000019,24230,2917,4,2001-08-10 23:40:29
999988,24230,1921,4,2001-08-10 23:41:04
1000172,24230,1784,3,2001-08-10 23:41:04


In [38]:
tr, test = split_by_date(data_session, n_days=30)
tr, val = split_by_date(tr, n_days=30)

In [39]:
stats_info(tr, 'train')
stats_info(val, 'valid')
stats_info(test, 'test')

* train Set Stats Info
	 Events: 988407
	 Sessions: 17353
	 Items: 3405
	 First Time : 2000-04-26 08:05:32
	 Last Time : 2002-12-30 11:26:14

* valid Set Stats Info
	 Events: 2289
	 Sessions: 137
	 Items: 1312
	 First Time : 2000-07-23 23:38:24
	 Last Time : 2003-01-29 15:58:42

* test Set Stats Info
	 Events: 2250
	 Sessions: 95
	 Items: 1232
	 First Time : 2000-05-21 02:02:15
	 Last Time : 2003-03-01 02:49:50



In [40]:
# Items missing from the train set can appear in the val/test period, so we index items based on the train data.
id2idx = {item_id : index for index, item_id in enumerate(tr['ItemId'].unique())}

tr = indexing(tr, id2idx)
val = indexing(val, id2idx)
test = indexing(test, id2idx)

In [41]:
save_path = data_path / 'processed'
save_path.mkdir(parents=True, exist_ok=True)

tr.to_pickle(save_path / 'train.pkl')
val.to_pickle(save_path / 'valid.pkl')
test.to_pickle(save_path / 'test.pkl')

In [42]:
tr_dataset = SessionDataset(tr)
tr_dataset.df.head(10)

Unnamed: 0,SessionId,ItemId,Rating,Time,item_idx
31,0,3186,4,2001-01-01 07:00:19,0
22,0,1270,5,2001-01-01 07:00:55,1
27,0,1721,4,2001-01-01 07:00:55,2
37,0,1022,5,2001-01-01 07:00:55,3
24,0,2340,3,2001-01-01 07:01:43,4
36,0,1836,5,2001-01-01 07:02:52,5
3,0,3408,4,2001-01-01 07:04:35,6
7,0,2804,5,2001-01-01 07:11:59,7
47,0,1207,4,2001-01-01 07:11:59,8
0,0,1193,5,2001-01-01 07:12:40,9


In [43]:
tr_dataset.click_offsets

array([     0,     40,     53, ..., 988372, 988386, 988407], dtype=int32)

In [44]:
tr_dataset.session_idx

array([    0,     1,     2, ..., 17350, 17351, 17352])

In [45]:
tr_data_loader = SessionDataLoader(tr_dataset, batch_size=4)
tr_dataset.df.head(15)

Unnamed: 0,SessionId,ItemId,Rating,Time,item_idx
31,0,3186,4,2001-01-01 07:00:19,0
22,0,1270,5,2001-01-01 07:00:55,1
27,0,1721,4,2001-01-01 07:00:55,2
37,0,1022,5,2001-01-01 07:00:55,3
24,0,2340,3,2001-01-01 07:01:43,4
36,0,1836,5,2001-01-01 07:02:52,5
3,0,3408,4,2001-01-01 07:04:35,6
7,0,2804,5,2001-01-01 07:11:59,7
47,0,1207,4,2001-01-01 07:11:59,8
0,0,1193,5,2001-01-01 07:12:40,9


In [46]:
iter_ex = iter(tr_data_loader)

In [47]:
inputs, labels, mask =  next(iter_ex)
print(f'Model Input Item Idx are : {inputs}')
print(f'Label Item Idx are : {"":5} {labels}')
print(f'Previous Masked Input Idx are {mask}')

Model Input Item Idx are : [ 0 40 53 65]
Label Item Idx are :       [ 1 41 54 62]
Previous Masked Input Idx are []


In [48]:
args = Args(tr, val, test, batch_size=64, hsz=50, drop_rate=0.1, lr=0.001, epochs=3, k=20)

In [49]:
model = create_model(args)

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(64, 1, 3405)]           0         
_________________________________________________________________
GRU (GRU)                    [(64, 50), (64, 50)]      518550    
_________________________________________________________________
dropout_1 (Dropout)          (64, 50)                  0         
_________________________________________________________________
dense_1 (Dense)              (64, 3405)                173655    
Total params: 692,205
Trainable params: 692,205
Non-trainable params: 0
_________________________________________________________________


In [50]:
# Training takes a while.
train_model(model, args)

Train:  99%|█████████▉| 15068/15172 [01:56<00:00, 129.02it/s, accuracy=0.0469, train_loss=5.81]
Evaluation:  15%|█▌        | 5/33 [00:01<00:06,  4.17it/s]
Train:   0%|          | 0/15172 [00:00<?, ?it/s, accuracy=0.0625, train_loss=5.86]

	 - Recall@20 epoch 1: 0.112500
	 - MRR@20    epoch 1: 0.039164



Train:  99%|█████████▉| 15068/15172 [01:52<00:00, 133.38it/s, accuracy=0.0312, train_loss=5.56]
Evaluation:  15%|█▌        | 5/33 [00:00<00:05,  5.11it/s]
Train:   0%|          | 0/15172 [00:00<?, ?it/s, accuracy=0.0312, train_loss=5.63]

	 - Recall@20 epoch 2: 0.140625
	 - MRR@20    epoch 2: 0.053412



Train:  99%|█████████▉| 15068/15172 [01:54<00:00, 131.85it/s, accuracy=0.0625, train_loss=5.53]
Evaluation:  15%|█▌        | 5/33 [00:01<00:05,  4.76it/s]

	 - Recall@20 epoch 3: 0.146875
	 - MRR@20    epoch 3: 0.059934






In [51]:
test_model(model, args, test)

Evaluation:   6%|▌         | 2/33 [00:00<00:06,  4.66it/s]

	 - Recall@20: 0.101562
	 - MRR@20: 0.030130






## 11. Changing the Model Architecture

We add one more dropout layer and one more GRU layer to the previous model.

In [57]:
def reset_hidden_states(model, mask):
    gru_layer = model.get_layer(name='GRU1')  # Get the first GRU layer from the model.
    hidden_states = gru_layer.states[0].numpy()  # Get the hidden state of the GRU layer.
    for elt in mask:  # Loop over the masked indices, i.e. the batch slots whose session has ended,
        hidden_states[elt, :] = 0  # and reset their hidden state to zero.
    gru_layer.reset_states(states=hidden_states)
    
    gru_layer = model.get_layer(name='GRU2')  # Get the second GRU layer from the model.
    hidden_states = gru_layer.states[0].numpy()  # Get the hidden state of the GRU layer.
    for elt in mask:  # Loop over the masked indices, i.e. the batch slots whose session has ended,
        hidden_states[elt, :] = 0  # and reset their hidden state to zero.
    gru_layer.reset_states(states=hidden_states)

In [58]:
def create_model2(args):
    inputs = Input(batch_shape=(args.batch_size, 1, args.num_items))
    gru, _ = GRU(args.hsz, stateful=True, return_state=True,return_sequences=True, name='GRU1')(inputs)
    dropout = Dropout(args.drop_rate)(gru)
    gru, _ = GRU(args.hsz, stateful=True, return_state=True, name='GRU2')(dropout)
    dropout = Dropout(args.drop_rate)(gru)    
    predictions = Dense(args.num_items, activation='softmax')(dropout)
    
    model = Model(inputs=inputs, outputs=[predictions])
    model.compile(loss=categorical_crossentropy, optimizer=Adam(args.lr), metrics=['accuracy'])
    model.summary()
    return model

In [59]:
model = create_model2(args)

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         [(64, 1, 3405)]           0         
_________________________________________________________________
GRU1 (GRU)                   [(64, 1, 50), (64, 50)]   518550    
_________________________________________________________________
dropout_6 (Dropout)          (64, 1, 50)               0         
_________________________________________________________________
GRU2 (GRU)                   [(64, 50), (64, 50)]      15300     
_________________________________________________________________
dropout_7 (Dropout)          (64, 50)                  0         
_________________________________________________________________
dense_4 (Dense)              (64, 3405)                173655    
Total params: 707,505
Trainable params: 707,505
Non-trainable params: 0
_____________________________________________________

In [60]:
# Training takes a while.
train_model(model, args)

Train:  99%|█████████▉| 15068/15172 [02:19<00:00, 108.20it/s, accuracy=0.0469, train_loss=5.86]
Evaluation:  15%|█▌        | 5/33 [00:01<00:07,  3.58it/s]
Train:   0%|          | 0/15172 [00:00<?, ?it/s, accuracy=0.0312, train_loss=6.12]

	 - Recall@20 epoch 1: 0.100000
	 - MRR@20    epoch 1: 0.031209



Train:  99%|█████████▉| 15068/15172 [02:14<00:00, 112.16it/s, accuracy=0.0469, train_loss=5.62]
Evaluation:  15%|█▌        | 5/33 [00:01<00:05,  4.93it/s]
Train:   0%|          | 0/15172 [00:00<?, ?it/s, accuracy=0.0469, train_loss=5.44]

	 - Recall@20 epoch 2: 0.137500
	 - MRR@20    epoch 2: 0.047805



Train:  99%|█████████▉| 15068/15172 [02:11<00:00, 114.20it/s, accuracy=0.0469, train_loss=5.52]
Evaluation:  15%|█▌        | 5/33 [00:00<00:05,  5.02it/s]

	 - Recall@20 epoch 3: 0.146875
	 - MRR@20    epoch 3: 0.052307






In [61]:
test_model(model, args, test)

Evaluation:   6%|▌         | 2/33 [00:00<00:06,  4.81it/s]

	 - Recall@20: 0.085938
	 - MRR@20: 0.020358






## 12. Changing the Hyperparameters

We change the hsz and drop_rate values.

In [62]:
def reset_hidden_states(model, mask):
    gru_layer = model.get_layer(name='GRU')  # Get the GRU layer from the model.
    hidden_states = gru_layer.states[0].numpy()  # Get the hidden state of the GRU layer.
    for elt in mask:  # Loop over the masked indices, i.e. the batch slots whose session has ended,
        hidden_states[elt, :] = 0  # and reset their hidden state to zero.
    gru_layer.reset_states(states=hidden_states)

In [63]:
args = Args(tr, val, test, batch_size=64, hsz=30, drop_rate=0.2, lr=0.001, epochs=3, k=20)

In [64]:
model = create_model(args)

Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         [(64, 1, 3405)]           0         
_________________________________________________________________
GRU (GRU)                    [(64, 30), (64, 30)]      309330    
_________________________________________________________________
dropout_8 (Dropout)          (64, 30)                  0         
_________________________________________________________________
dense_5 (Dense)              (64, 3405)                105555    
Total params: 414,885
Trainable params: 414,885
Non-trainable params: 0
_________________________________________________________________


In [65]:
# Training takes a while.
train_model(model, args)

Train:  99%|█████████▉| 15068/15172 [01:58<00:00, 127.61it/s, accuracy=0.0156, train_loss=6.15]
Evaluation:  15%|█▌        | 5/33 [00:01<00:06,  4.02it/s]
Train:   0%|          | 0/15172 [00:00<?, ?it/s, accuracy=0.0156, train_loss=6.31]

	 - Recall@20 epoch 1: 0.096875
	 - MRR@20    epoch 1: 0.028244



Train:  99%|█████████▉| 15068/15172 [02:10<00:00, 115.40it/s, accuracy=0.0156, train_loss=5.76]
Evaluation:  15%|█▌        | 5/33 [00:01<00:05,  4.85it/s]
Train:   0%|          | 0/15172 [00:00<?, ?it/s, accuracy=0.0469, train_loss=5.81]

	 - Recall@20 epoch 2: 0.118750
	 - MRR@20    epoch 2: 0.041895



Train:  99%|█████████▉| 15068/15172 [01:52<00:00, 134.44it/s, accuracy=0.0469, train_loss=5.77]
Evaluation:  15%|█▌        | 5/33 [00:01<00:05,  4.97it/s]

	 - Recall@20 epoch 3: 0.131250
	 - MRR@20    epoch 3: 0.051495






In [66]:
test_model(model, args, test)

Evaluation:   6%|▌         | 2/33 [00:00<00:07,  4.41it/s]

	 - Recall@20: 0.062500
	 - MRR@20: 0.005799






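## [ Results - Rubric ]
**1. The MovieLens dataset was preprocessed systematically from a session-based recommendation perspective.**
- The process of defining the session unit (length analysis, time analysis) based on an analysis of the dataset is described. :)  

**2. The RNN-based prediction model was built correctly and training proceeded stably.** 
- Over an appropriate number of epochs, the train loss decreased steadily and Recall and MRR improved at the validation stage. :)
  
**3. The session definition, model architecture, and hyperparameters were varied experimentally, and the resulting changes in Recall and MRR were observed.**
- Three or more variations were tried and the resulting trends were observed. :)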