## 1.1. Understanding and Processing Dataset

The dataset is large, at around 1 GB of parquet files. To reduce the risk of out-of-memory errors, all work was done on Google Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import numpy as np
import pandas as pd
import pyarrow.parquet as pq

import datetime
import pytz

In [None]:
# import parquet files
train = pd.read_parquet('/content/drive/MyDrive/0.capstone/train.parquet')
test = pd.read_parquet('/content/drive/MyDrive/0.capstone/test.parquet')

In [None]:
# see number of rows and columns
print(train.shape)
print(test.shape)

(216716096, 4)
(6928123, 4)


As mentioned, this is a huge dataset: over 216 million rows in the train set and about 6.9 million rows in the test set.

In [None]:
train.dtypes

session    int32
aid        int32
ts         int32
type       uint8
dtype: object

The columns are already stored in compact dtypes (`int32` and `uint8`), which keeps memory usage down.
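To illustrate why the compact dtypes matter, here is a minimal sketch on a toy frame built from the first rows of `train.head()`. The names `raw` and `small` are made up for the example, and the 64-bit starting point is an assumption about what pandas would use by default, not a statement about how the files were originally produced.

```python
import pandas as pd

# Toy frame forced to 64-bit integers (pandas' usual default for Python ints).
raw = pd.DataFrame(
    {
        "session": [0, 0, 0],
        "aid": [1517085, 1563459, 1309446],
        "ts": [1659304800, 1659304904, 1659367439],
        "type": [0, 0, 0],
    },
    dtype="int64",
)

# Downcast to the dtypes reported by train.dtypes.
small = raw.astype({"session": "int32", "aid": "int32", "ts": "int32", "type": "uint8"})

# Bytes per row, ignoring the index.
bytes_per_row_raw = int(raw.memory_usage(index=False).sum()) // len(raw)
bytes_per_row_small = int(small.memory_usage(index=False).sum()) // len(small)
print(bytes_per_row_raw, bytes_per_row_small)
```

At 216,716,096 rows, 13 bytes per row comes to roughly 2.8 GB in memory, against roughly 6.9 GB at 32 bytes per row.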

In [None]:
train.head()

| | session | aid | ts | type |
|---|---|---|---|---|
| 0 | 0 | 1517085 | 1659304800 | 0 |
| 1 | 0 | 1563459 | 1659304904 | 0 |
| 2 | 0 | 1309446 | 1659367439 | 0 |
| 3 | 0 | 16246 | 1659367719 | 0 |
| 4 | 0 | 1781822 | 1659367871 | 0 |


In [None]:
test.head()

| | session | aid | ts | type |
|---|---|---|---|---|
| 0 | 12899779 | 59625 | 1661724000 | 0 |
| 1 | 12899780 | 1142000 | 1661724000 | 0 |
| 2 | 12899780 | 582732 | 1661724058 | 0 |
| 3 | 12899780 | 973453 | 1661724109 | 0 |
| 4 | 12899780 | 736515 | 1661724136 | 0 |


Test users do not appear in the train dataset, so only 1 week of data is available for them. This may present a cold-start problem: there is little information on which to base recommendations for these users.
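One way to confirm that the two sets of sessions do not overlap is sketched below on toy data. The frames `train_toy` and `test_toy` are hypothetical stand-ins; on the real data the same two checks would be run against `train['session']` and `test['session']`.

```python
import pandas as pd

# Hypothetical stand-ins for the real train and test frames.
train_toy = pd.DataFrame({"session": [0, 0, 1, 2]})
test_toy = pd.DataFrame({"session": [3, 4, 4]})

# Count test rows whose session ID also occurs in train.
shared_rows = int(test_toy["session"].isin(train_toy["session"]).sum())

# Cheaper check: if every train ID is below every test ID, the sets cannot overlap.
disjoint_by_range = bool(train_toy["session"].max() < test_toy["session"].min())

print(shared_rows, disjoint_by_range)
```

The range comparison only needs a min and a max, so it avoids building large sets in memory. It is sufficient but not necessary: interleaved IDs could still be disjoint, in which case the `isin` count is the reliable check.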

In [None]:
# Number of unique values in each column
train.nunique()

session    12899779
aid         1855603
ts          2416913
type              3
dtype: int64

In [None]:
# Number of unique values in each column
test.nunique()

session    1671803
aid         783486
ts          580195
type             3
dtype: int64

In [None]:
# Number of products in the test set that do not appear in the train set
len(set(test['aid']) - set(train['aid']))

0

There are no new products in the test set: every test product also appears in train. This simplifies analysis and prediction.

In [None]:
train.isnull().sum()


session    0
aid        0
ts         0
type       0
dtype: int64

In [None]:
test.isnull().sum()

session    0
aid        0
ts         0
type       0
dtype: int64

There is no missing data in the train or test set, so no further cleaning is needed.

**Split by type**

In this part and in part 1.2, we split the data by the 3 event types and by the 5 weeks to make it more manageable for analysis.

In some sections (e.g. modelling) we use files that were broken into even smaller parts (129 + 17 files), which are shared on Kaggle.

In [None]:
# split by type
train_click = train[train['type'] == 0].reset_index(drop=True)
test_click = test[test['type'] == 0].reset_index(drop=True)

train_cart = train[train['type'] == 1].reset_index(drop=True)
test_cart = test[test['type'] == 1].reset_index(drop=True)

train_order = train[train['type'] == 2].reset_index(drop=True)
test_order = test[test['type'] == 2].reset_index(drop=True)

In [None]:
print(train_click.shape)
print(train_cart.shape)
print(train_order.shape)

(194720954, 4)
(16896191, 4)
(5098951, 4)
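As a compact alternative to the three explicit filters, the same split can be sketched with a single `groupby`. This is only an illustration on toy data; `events`, `type_names` and `parts` are names invented here, and the 0/1/2 to click/cart/order mapping follows the filters used above.

```python
import pandas as pd

# Toy event log with the same type codes as the real data.
events = pd.DataFrame(
    {
        "session": [0, 0, 1, 1, 2],
        "aid": [10, 11, 10, 12, 11],
        "type": pd.Series([0, 1, 0, 2, 0], dtype="uint8"),
    }
)

type_names = {0: "click", 1: "cart", 2: "order"}

# One pass over the frame, one sub-frame per event type.
parts = {
    type_names[code]: group.reset_index(drop=True)
    for code, group in events.groupby("type")
}

sizes = {name: len(frame) for name, frame in parts.items()}
print(sizes)
```

A quick sanity check on any such split is that the part sizes add up to the original row count. For the train set above, 194,720,954 + 16,896,191 + 5,098,951 = 216,716,096, which matches `train.shape`.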


In [None]:
train_click.nunique()

session    12899779
aid         1855603
ts          2416414
type              1
dtype: int64

In [None]:
train_cart.nunique()

session    3810706
aid        1234735
ts         2218096
type             1
dtype: int64

In [None]:
train_order.nunique()

session    1626338
aid         657940
ts         1374048
type             1
dtype: int64

From the train data, the ratio of unique products to unique users (sessions) for each event type is:
- clicks: 0.14
- carts: 0.32
- orders: 0.40

Only 13% of users who clicked actually ordered (click-to-cart conversion is 30%; cart-to-order conversion is 43%). 35% of the products that were clicked were also ordered.
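The figures above can be reproduced from the `nunique()` outputs. The sketch below hard-codes those counts; the dictionary and variable names are made up for the example.

```python
# Unique sessions and products per event type, copied from the nunique() outputs.
sessions = {"clicks": 12899779, "carts": 3810706, "orders": 1626338}
products = {"clicks": 1855603, "carts": 1234735, "orders": 657940}

# Products per session for each event type.
ratios = {k: round(products[k] / sessions[k], 2) for k in sessions}

# Session-level conversion rates.
click_to_cart = round(sessions["carts"] / sessions["clicks"], 2)
cart_to_order = round(sessions["orders"] / sessions["carts"], 2)
click_to_order = round(sessions["orders"] / sessions["clicks"], 2)

# Share of clicked products that were also ordered.
product_click_to_order = round(products["orders"] / products["clicks"], 2)

print(ratios)
print(click_to_cart, cart_to_order, click_to_order, product_click_to_order)
```

Every train session contains at least one click (the click subset has the same number of unique sessions as the full train set), so the click counts serve as the denominator for the overall conversion rates.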



In [None]:
train_click.dtypes

session    int32
aid        int32
ts         int32
type       uint8
dtype: object

In [None]:
# train_click.to_parquet('/content/drive/MyDrive/0.capstone/train_click.parquet')
# train_cart.to_parquet('/content/drive/MyDrive/0.capstone/train_cart.parquet')
# train_order.to_parquet('/content/drive/MyDrive/0.capstone/train_order.parquet')

# test_click.to_parquet('/content/drive/MyDrive/0.capstone/test_click.parquet')
# test_cart.to_parquet('/content/drive/MyDrive/0.capstone/test_cart.parquet')
# test_order.to_parquet('/content/drive/MyDrive/0.capstone/test_order.parquet')

**Split by weeks (part 1)**

In [None]:
# Create timezone object for Germany (Europe/Berlin is UTC+2 during summer time)
utc_plus_two = pytz.timezone('Europe/Berlin')

In [None]:
# Convert the min/max Unix timestamps (seconds) to datetimes in German local time (UTC+2)
train_min_date = datetime.datetime.fromtimestamp(train['ts'].min(), tz=pytz.utc).astimezone(utc_plus_two)
train_max_date = datetime.datetime.fromtimestamp(train['ts'].max(), tz=pytz.utc).astimezone(utc_plus_two)
test_min_date = datetime.datetime.fromtimestamp(test['ts'].min(), tz=pytz.utc).astimezone(utc_plus_two)
test_max_date = datetime.datetime.fromtimestamp(test['ts'].max(), tz=pytz.utc).astimezone(utc_plus_two)

print(f"train min date: {train_min_date}")
print(f"train max date: {train_max_date}")
print(f"test min date: {test_min_date}")
print(f"test max date: {test_max_date}")

train min date: 2022-08-01 00:00:00+02:00
train max date: 2022-08-28 23:59:59+02:00
test min date: 2022-08-29 00:00:00+02:00
test max date: 2022-09-04 23:59:51+02:00


Train data runs from 1 Aug 22 to 28 Aug 22, spanning 4 weeks. <br>
Test data runs from 29 Aug 22 to 4 Sep 22, spanning 1 week.
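The same conversion can be done on a whole column at once with pandas. This is a minimal sketch on two timestamps taken from the outputs above, not the approach used in this notebook; the `int64` cast is a precaution added for the example.

```python
import pandas as pd

# First train timestamp and first test timestamp (Unix seconds).
ts = pd.Series([1659304800, 1661724000], dtype="int32")

# Interpret as UTC seconds, then convert to German local time.
local = pd.to_datetime(ts.astype("int64"), unit="s", utc=True).dt.tz_convert("Europe/Berlin")

formatted = local.dt.strftime("%Y-%m-%d %H:%M:%S%z").tolist()
weekdays = local.dt.day_name().tolist()
print(formatted, weekdays)
```

Both timestamps fall on a Monday at midnight local time, which is consistent with the data being organised into calendar weeks.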

In [None]:
# Week start boundaries: Mondays at 00:00, UTC+2
dt1 = datetime.datetime(2022, 8, 1, 0, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(hours=2)))
dt2 = datetime.datetime(2022, 8, 8, 0, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(hours=2)))
dt3 = datetime.datetime(2022, 8, 15, 0, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(hours=2)))
dt4 = datetime.datetime(2022, 8, 22, 0, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(hours=2)))
dt5 = datetime.datetime(2022, 8, 29, 0, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(hours=2)))

# Convert each datetime object to a Unix timestamp (seconds)
print(int(dt1.timestamp()))
print(int(dt2.timestamp()))
print(int(dt3.timestamp()))
print(int(dt4.timestamp()))
print(int(dt5.timestamp()))

1659304800
1659909600
1660514400
1661119200
1661724000
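With these boundaries, each event can be assigned to a week. The sketch below uses `np.searchsorted` for this; it is one possible approach under the assumption that `ts` is in Unix seconds, and `week_starts` and `week_index` are hypothetical names introduced for illustration.

```python
import numpy as np

# Week start boundaries in Unix seconds, as printed above.
week_starts = np.array([1659304800, 1659909600, 1660514400, 1661119200, 1661724000])

def week_index(ts):
    """Return the 0-based week for each timestamp (0-3 = train weeks, 4 = test week)."""
    ts = np.asarray(ts)
    # side="right" places a timestamp equal to a boundary in the week that boundary starts.
    return np.searchsorted(week_starts, ts, side="right") - 1

sample_ts = [1659304800, 1659367439, 1659909600, 1661723999, 1661724000]
weeks = week_index(sample_ts)
print(weeks)
```

The resulting index can then be used as a mask or group key, for example `train[week_index(train['ts']) == 0]` for the first week.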


**Summary**
- This is a huge dataset that needs careful handling to avoid memory errors (parquet files, splitting into chunks, Google Colab RAM, code optimization, specialised dataframe libraries, GPUs, etc.).
- There are 4 weeks of train data and 1 week of test data. Test users are not in the train dataset, so only 1 week of data is available for them. This may present a cold-start problem, with little information on which to base recommendations for these users.
- Some users click and/or cart but never order. We may therefore predict 20 products for a user who places no orders. Under the Kaggle scoring this does not incur a penalty. In reality, these users would still be recommended products (as should be the case).
- Not every product that is clicked gets ordered: only 35% of clicked products were also ordered.