-----------------------

# Chosen Dataset: <u> YOOCHOOSE - RecSys Challenge 2015</u> 
<u>General explanation of the dataset:</u><br>
The YOOCHOOSE dataset contains a collection of sessions from a retailer, where each session<br>
encapsulates the click events that the user performed in that session.<br>
For some of the sessions there are also buy events, meaning that the session ended with the user buying something from the web shop.<br> The data was collected over several
months in 2014 and reflects the clicks and purchases performed by the users of an online retailer in Europe.<br>
**We thus conclude that the dataset poses an implicit-feedback recommendation challenge, because the data is binary: clicked or not, bought or not.**<br>
The dataset consists of 3 files (plus a README):
 - yoochoose-buys.dat, ~55 MB
 - yoochoose-clicks.dat, ~1.5 GB
 - yoochoose-test.dat, ~363 MB
 
<br>**The authors of the original paper ignored the test set and simply split yoochoose-clicks.dat into train and test datasets.
To stay consistent and to try to reproduce the authors' results, we will do the same.**

#### <u>CLICKS DATASET FILE DESCRIPTION</u>

The file yoochoose-clicks.dat comprises the clicks of the users on the items.<br>
Each record/line in the file has the following fields/format: Session ID, Timestamp, Item ID, Category<br>
-Session ID – the id of the session. One session contains one or many clicks. Could be represented as an integer number.<br>
-Timestamp – the time when the click occurred. Format: YYYY-MM-DDThh:mm:ss.SSSZ<br>
-Item ID – the unique identifier of the item that has been clicked. Could be represented as an integer number.<br>
-Category – the context of the click. The value "S" indicates a special offer, "0" indicates a missing value, a number between 1 and 12 indicates a real category identifier,<br>
 any other number indicates a brand. E.g. if an item has been clicked in the context of a promotion or special offer, the value will be "S"; if the context was a brand, e.g. BOSCH,<br>
 the value will be an 8-10 digit number. If the item has been clicked under a regular category, e.g. sport, the value will be a number between 1 and 12. <br>
 
* The explanation above is based on the README.txt attached to the dataset.<br>
    This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
    International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/
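
The Category rules above can be sketched as a small helper. This is only an illustration of the README description: the function name and the returned labels are hypothetical, and the notebook itself does not load the Category column.

```python
def classify_category(value: str) -> str:
    """Map a raw Category field to a coarse label (labels are illustrative)."""
    value = value.strip()
    if value == "S":
        return "special_offer"
    if value == "0":
        return "missing"
    if value.isdigit():
        # 1..12 are real category identifiers, any other number is a brand
        return "category" if 1 <= int(value) <= 12 else "brand"
    raise ValueError(f"unexpected Category value: {value!r}")


# made-up example values covering the four cases
examples = ["S", "0", "7", "2089046311"]
labels = [classify_category(v) for v in examples]
print(labels)
```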

In [1]:
import numpy as np
import pandas as pd
import datetime as dt
from tqdm.notebook import tqdm
import seaborn as sns


## Data Loading and Pre-processing

In [None]:
%%time
clicks_df = pd.read_csv('data/yoochoose-clicks.dat',
                        names=['SessionID', 'Time', 'ItemID'],
                        usecols=[0, 1, 2],
                        dtype={0: np.int32, 1: str, 2: np.int64})
# convert the ISO-8601 timestamp strings to Unix timestamps (seconds):
clicks_df['Time'] = clicks_df['Time'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp())
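
Note that `strptime` returns a naive `datetime` here, so `.timestamp()` interprets it in the local time zone of the machine running the notebook. The trailing `Z` in the format suggests the values are UTC. If machine-independent values are wanted, one option is to attach UTC explicitly. The sketch below uses a hypothetical helper name and a made-up timestamp string in the documented format; it is not what the cell above does.

```python
import datetime as dt

FMT = '%Y-%m-%dT%H:%M:%S.%fZ'


def to_unix_utc(ts: str) -> float:
    """Parse a timestamp string and return Unix seconds, treating it as UTC."""
    parsed = dt.datetime.strptime(ts, FMT)          # naive datetime
    return parsed.replace(tzinfo=dt.timezone.utc).timestamp()


sample = '2014-04-07T10:51:09.277Z'                 # illustrative value only
unix_seconds = to_unix_utc(sample)
round_trip = dt.datetime.fromtimestamp(unix_seconds, tz=dt.timezone.utc)
print(unix_seconds, round_trip.isoformat())
```

Because the split below only compares timestamps against `tmax - day`, a constant offset between local time and UTC shifts every value equally; the results can still differ slightly on machines whose time zone has daylight saving changes.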

Filter out sessions with only 1 interaction:

In [None]:
session_lengths = clicks_df.groupby('SessionID').size()
clicks_df = clicks_df[np.in1d(clicks_df['SessionID'], session_lengths[session_lengths>1].index)]

Filter out rarely clicked items, keeping only items that have been clicked 5 times or more:

In [None]:
item_supports = clicks_df.groupby('ItemID').size()
clicks_df = clicks_df[np.in1d(clicks_df['ItemID'], item_supports[item_supports>=5].index)]

Re-filter sessions with only 1 interaction, since removing rare items may have shortened some sessions:

In [None]:
session_lengths = clicks_df.groupby('SessionID').size()
clicks_df = clicks_df[np.in1d(clicks_df['SessionID'], session_lengths[session_lengths>1].index)]
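
The reason for the second session filter can be seen on a toy example: dropping rare items can leave a session with a single click. The sketch below uses made-up data, hypothetical helper names, and a support threshold of 3 instead of 5 so that the effect shows up on a few rows; it uses pandas `isin` in place of `np.in1d`.

```python
import pandas as pd

toy = pd.DataFrame({
    'SessionID': [1, 1, 2, 2, 3, 3, 3],
    'ItemID':    [10, 10, 10, 20, 10, 10, 30],
})
MIN_ITEM_SUPPORT = 3  # toy threshold; the notebook uses 5


def drop_short_sessions(df, min_len=2):
    """Keep only sessions with at least `min_len` clicks."""
    lengths = df.groupby('SessionID').size()
    return df[df['SessionID'].isin(lengths[lengths >= min_len].index)]


def drop_rare_items(df, min_support):
    """Keep only clicks on items that were clicked at least `min_support` times."""
    supports = df.groupby('ItemID').size()
    return df[df['ItemID'].isin(supports[supports >= min_support].index)]


step1 = drop_short_sessions(toy)                   # nothing removed yet
step2 = drop_rare_items(step1, MIN_ITEM_SUPPORT)   # items 20 and 30 removed
step3 = drop_short_sessions(step2)                 # session 2 now has 1 click
print(len(step1), len(step2), len(step3))
```

After the item filter, session 2 is left with a single click on item 10, so only the second pass over session lengths removes it.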

### Train - Test split:

In [None]:
tmax = clicks_df['Time'].max()
day  = 86400

Split the dataset into
- test: last day of sessions
- train: all days of sessions except last

In [None]:
session_max_times = clicks_df.groupby('SessionID')['Time'].max()
session_train = session_max_times[session_max_times < tmax-day].index
session_test = session_max_times[session_max_times >= tmax-day].index
train = clicks_df[np.in1d(clicks_df['SessionID'], session_train)]
test = clicks_df[np.in1d(clicks_df['SessionID'], session_test)]

Filter out clicks from the test set whose items do not appear in the train set:

In [None]:
test = test[np.in1d(test['ItemID'], train['ItemID'])]

If any sessions in the test set are left with fewer than 2 clicks, filter them out:

In [None]:
tslength = test.groupby('SessionID').size()
test = test[np.in1d(test['SessionID'], tslength[tslength>=2].index)]
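
The three steps above (split by session end time, drop unseen items, drop short sessions) can be sketched as one reusable function. The function name, the `window_seconds` parameter, and the toy data are assumptions made for illustration; the notebook itself repeats the steps inline.

```python
import pandas as pd


def split_last_window(df, window_seconds):
    """Hold out sessions whose last click falls in the final `window_seconds`."""
    tmax = df['Time'].max()
    session_end = df.groupby('SessionID')['Time'].max()
    held_ids = session_end[session_end >= tmax - window_seconds].index
    is_held = df['SessionID'].isin(held_ids)
    train_part, held = df[~is_held], df[is_held]
    # keep only held-out clicks on items that also appear in the training part
    held = held[held['ItemID'].isin(train_part['ItemID'])]
    # drop held-out sessions left with fewer than 2 clicks
    lengths = held.groupby('SessionID').size()
    held = held[held['SessionID'].isin(lengths[lengths >= 2].index)]
    return train_part, held


toy = pd.DataFrame({
    'SessionID': [1, 1, 2, 2, 3, 3, 3, 4, 4],
    'ItemID':    [10, 20, 20, 30, 10, 20, 99, 99, 10],
    'Time':      [0, 50, 100, 150, 900, 950, 990, 1000, 995],
})
toy_train, toy_test = split_last_window(toy, window_seconds=200)
print(toy_train['SessionID'].unique(), toy_test[['SessionID', 'ItemID']].values.tolist())
```

In the toy data, sessions 3 and 4 end within the last 200 seconds. Item 99 never occurs in training, so its clicks are removed, which leaves session 4 with one click and causes it to be dropped.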

#### Final train and test files:

In [None]:
print('Full train set\n\tEvents: {}\n\tSessions: {}\n\tItems: {}'.format(len(train), train['SessionID'].nunique(), train['ItemID'].nunique()))
print('Test set\n\tEvents: {}\n\tSessions: {}\n\tItems: {}'.format(len(test), test['SessionID'].nunique(), test['ItemID'].nunique()))


In [None]:
#saving the files:
train.to_csv('data/train.txt', sep='\t', index=False)
test.to_csv('data/test.txt', sep='\t', index=False)

### Creating a validation set from the training set
Same mechanism as splitting the clicks dataframe into train and test: sessions ending in the last day of the training set become the validation set.

In [None]:
tmax = train['Time'].max()
session_max_times = train.groupby('SessionID')['Time'].max()
session_train = session_max_times[session_max_times < tmax-day].index
session_valid = session_max_times[session_max_times >= tmax-day].index
train_tr = train[np.in1d(train['SessionID'], session_train)]
valid = train[np.in1d(train['SessionID'], session_valid)]
valid = valid[np.in1d(valid['ItemID'], train_tr['ItemID'])]
tslength = valid.groupby('SessionID').size()
valid = valid[np.in1d(valid['SessionID'],tslength[tslength>=2].index)]
#Convert To CSV
print('Train set\n\tEvents: {}\n\tSessions: {}\n\tItems: {}'.format(len(train_tr), train_tr['SessionID'].nunique(), train_tr['ItemID'].nunique()))
train_tr.to_csv('data/train_tr.txt', sep=',', index=False)
print('Validation set\n\tEvents: {}\n\tSessions: {}\n\tItems: {}'.format(len(valid), valid['SessionID'].nunique(), valid['ItemID'].nunique()))
valid.to_csv('data/train_valid.txt', sep=',', index=False)

### Smaller Sample: first 4.5 days of sessions
#### Train - Test split:

In [None]:
(len(test) / len(train))*100

In the original paper, the authors used a very small portion of the data as the test set.
Because our sample is much smaller, we aim for a higher percentage of the data in the test and validation sets.

In [None]:
tmin = clicks_df['Time'].min()
day  = 86400
tmax = tmin +day*4.5

Restrict the data to the first 4.5 days, then split it into
- test: sessions ending in the last half day
- train: all other sessions

In [None]:
clicks_df = clicks_df[clicks_df['Time'] <= tmax]

In [None]:
session_max_times = clicks_df.groupby('SessionID')['Time'].max()
session_train = session_max_times[session_max_times < tmax-day*0.5].index
session_test = session_max_times[session_max_times >= tmax-day*0.5].index
train_samp = clicks_df[np.in1d(clicks_df['SessionID'], session_train)]
test_samp = clicks_df[np.in1d(clicks_df['SessionID'], session_test)]

Filter out clicks from the test set whose items do not appear in the train set:

In [None]:
test_samp = test_samp[np.in1d(test_samp['ItemID'], train_samp['ItemID'])]

If any sessions in the test set are left with fewer than 2 clicks, filter them out:

In [None]:
tslength = test_samp.groupby('SessionID').size()
test_samp = test_samp[np.in1d(test_samp['SessionID'], tslength[tslength>=2].index)]

In [None]:
print('Sampled train set\n\tEvents: {}\n\tSessions: {}\n\tItems: {}'.format(len(train_samp), train_samp['SessionID'].nunique(), train_samp['ItemID'].nunique()))
print('Sampled test set\n\tEvents: {}\n\tSessions: {}\n\tItems: {}'.format(len(test_samp), test_samp['SessionID'].nunique(), test_samp['ItemID'].nunique()))

### Creating a validation set from the sampled training set
Same mechanism as above, with a shorter window: sessions ending in the last half day of the sampled training set become the validation set.

In [None]:
tmax = train_samp['Time'].max()
session_max_times = train_samp.groupby('SessionID')['Time'].max()
session_train_samp = session_max_times[session_max_times < tmax-day*0.5].index
session_valid = session_max_times[session_max_times >= tmax-day*0.5].index
train_samp_tr = train_samp[np.in1d(train_samp['SessionID'], session_train_samp)]
valid = train_samp[np.in1d(train_samp['SessionID'], session_valid)]
valid = valid[np.in1d(valid['ItemID'], train_samp_tr['ItemID'])]
tslength = valid.groupby('SessionID').size()
valid = valid[np.in1d(valid['SessionID'],tslength[tslength>=2].index)]
#Convert To CSV
print('train_samp set\n\tEvents: {}\n\tSessions: {}\n\tItems: {}'.format(len(train_samp_tr), train_samp_tr['SessionID'].nunique(), train_samp_tr['ItemID'].nunique()))
train_samp_tr.to_csv('data/train_samp_tr.txt', sep=',', index=False)
print('Validation set\n\tEvents: {}\n\tSessions: {}\n\tItems: {}'.format(len(valid), valid['SessionID'].nunique(), valid['ItemID'].nunique()))
valid.to_csv('data/train_samp_valid.txt', sep=',', index=False)

Filter the test set again, this time keeping only clicks on items that appear in the reduced training set (`train_samp_tr`):

In [None]:
test_samp = test_samp[np.in1d(test_samp['ItemID'], train_samp_tr['ItemID'])]

If any sessions in the test set are now left with fewer than 2 clicks, filter them out:

In [None]:
tslength = test_samp.groupby('SessionID').size()
test_samp = test_samp[np.in1d(test_samp['SessionID'], tslength[tslength>=2].index)]

In [None]:
print('Sampled test set\n\tEvents: {}\n\tSessions: {}\n\tItems: {}'.format(len(test_samp), test_samp['SessionID'].nunique(), test_samp['ItemID'].nunique()))

In [None]:
test_samp.to_csv('data/test_samp.txt', sep=',', index=False)