# Data Understanding

Source: [Data Mining Cup (DMC) 2013 - shopping cart abandonment prediction](https://www.data-mining-cup.com/reviews/dmc-2013/)

## Data Dictionary

- sessionNo: running number of the session
- startHour: hour in which the session has begun
- startWeekday: day of week in which the session has begun (1: Mon, 2: Tue, ..., 7: Sun)
- duration: time in seconds passed since start of the session
- cCount: number of the products clicked on
- cMinPrice: lowest price of a product clicked on
- cMaxPrice: highest price of a product clicked on
- cSumPrice: sum of the prices of all products clicked on
- bCount: number of products put in the shopping basket
- bMinPrice: lowest price of all products put in the shopping basket
- bMaxPrice: highest price of all products put in the shopping basket
- bSumPrice: sum of the prices of all products put in the shopping basket
- bStep: purchase processing step (1,2,3,4,5)
- onlineStatus: indication whether the customer is online
- availability: delivery status
- customerNo: customer number
- maxVal: maximum admissible purchase price for the customer
- customerScore: customer evaluation from the point of view of the shop
- accountLifetime: lifetime of the customer's account in months
- payments: number of payments made by the customer
- age: age of the customer
- address: form of address of the customer (1: Mr, 2: Mrs, 3: company)
- lastOrder: time in days passed since the last order
- order: outcome of the session (y: purchase, n: non-purchase)
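
Several fields in the dictionary above are stored as codes (`startWeekday`, `address`, `order`). A minimal sketch of decoding them into readable labels follows. The mapping names and the toy rows are illustrative assumptions, and the weekday codes 3 to 6 are inferred from the "1: Mon, 2: Tue, ..., 7: Sun" pattern.

```python
import pandas as pd

# Code-to-label mappings based on the data dictionary.
# Weekday codes 3-6 are inferred from the "1: Mon, ..., 7: Sun" pattern.
WEEKDAY_LABELS = {1: "Mon", 2: "Tue", 3: "Wed", 4: "Thu", 5: "Fri", 6: "Sat", 7: "Sun"}
ADDRESS_LABELS = {1: "Mr", 2: "Mrs", 3: "company"}
ORDER_LABELS = {"y": True, "n": False}  # True means the session ended in a purchase

# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame(
    {"startWeekday": [5, 6, 7], "address": [1, 2, 3], "order": ["y", "n", "y"]}
)

# Replace each coded column with its decoded version.
decoded = toy.assign(
    startWeekday=toy["startWeekday"].map(WEEKDAY_LABELS),
    address=toy["address"].map(ADDRESS_LABELS),
    order=toy["order"].map(ORDER_LABELS),
)
print(decoded)
```

Codes that are absent from a mapping come back as `NaN` with `Series.map`, which makes unexpected values easy to spot.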

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os

sns.set_context('talk')

In [2]:
input_train_path = os.path.join(
    '..',
    'dataset',
    'transact_train.txt'
)

In [3]:
df_train = pd.read_csv(input_train_path, sep="|")
df_train.head()

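| | sessionNo | startHour | startWeekday | duration | cCount | cMinPrice | cMaxPrice | cSumPrice | bCount | bMinPrice | ... | availability | customerNo | maxVal | customerScore | accountLifetime | payments | age | address | lastOrder | order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6 | 5 | 0.0 | 1 | 59.99 | 59.99 | 59.99 | 1 | 59.99 | ... | ? | 1 | 600 | 70 | 21 | 1 | 43 | 1 | 49 | y |
| 1 | 1 | 6 | 5 | 11.94 | 1 | 59.99 | 59.99 | 59.99 | 1 | 59.99 | ... | completely orderable | 1 | 600 | 70 | 21 | 1 | 43 | 1 | 49 | y |
| 2 | 1 | 6 | 5 | 39.887 | 1 | 59.99 | 59.99 | 59.99 | 1 | 59.99 | ... | completely orderable | 1 | 600 | 70 | 21 | 1 | 43 | 1 | 49 | y |
| 3 | 2 | 6 | 5 | 0.0 | 0 | ? | ? | ? | 0 | ? | ... | completely orderable | ? | ? | ? | ? | ? | ? | ? | ? | y |
| 4 | 2 | 6 | 5 | 15.633 | 0 | ? | ? | ? | 0 | ? | ... | completely orderable | ? | ? | ? | ? | ? | ? | ? | ? | y |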
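
The rows above use `?` where a value is unavailable. One possible way to handle this at load time is to pass `na_values="?"` to `pd.read_csv`, so that those cells become `NaN` and the affected columns get numeric dtypes. The sketch below runs on a small in-memory sample with a subset of the columns. The sample text and the names `sample`, `df_sample`, and `missing_share` are assumptions for illustration, and it presumes `?` never occurs as a legitimate value.

```python
import io

import pandas as pd

# Toy pipe-separated sample in the same layout as the real file
# (subset of columns, for illustration only).
sample = io.StringIO(
    "sessionNo|duration|cCount|cMinPrice|availability|age|order\n"
    "1|0.0|1|59.99|?|43|y\n"
    "1|11.94|1|59.99|completely orderable|43|y\n"
    "2|0.0|0|?|completely orderable|?|y\n"
)

# Treat "?" as a missing-value marker while parsing.
df_sample = pd.read_csv(sample, sep="|", na_values="?")

# Share of missing values per column.
missing_share = df_sample.isna().mean()
print(df_sample.dtypes)
print(missing_share)
```

With this option, columns such as `cMinPrice` and `age` are parsed as floats rather than strings, so they can take part in numeric summaries.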


In [4]:
df_train.shape

(429013, 24)

In [5]:
df_train.iloc[0]

sessionNo              1
startHour              6
startWeekday           5
duration             0.0
cCount                 1
cMinPrice          59.99
cMaxPrice          59.99
cSumPrice          59.99
bCount                 1
bMinPrice          59.99
bMaxPrice          59.99
bSumPrice          59.99
bStep                  ?
onlineStatus           ?
availability           ?
customerNo             1
maxVal               600
customerScore         70
accountLifetime       21
payments               1
age                   43
address                1
lastOrder             49
order                  y
Name: 0, dtype: object

In [6]:
df_train.describe()

| | sessionNo | startHour | startWeekday | duration | cCount | bCount |
|---|---|---|---|---|---|---|
| count | 429013.0 | 429013.0 | 429013.0 | 429013.0 | 429013.0 | 429013.0 |
| mean | 25274.631293 | 14.617061 | 5.924839 | 1573.90164 | 24.140317 | 4.135168 |
| std | 14441.366146 | 4.485914 | 0.79093 | 2427.123356 | 30.398164 | 4.451778 |
| min | 1.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 |
| 25% | 12731.0 | 11.0 | 5.0 | 225.07 | 5.0 | 1.0 |
| 50% | 25470.0 | 15.0 | 6.0 | 738.199 | 13.0 | 3.0 |
| 75% | 37542.0 | 18.0 | 7.0 | 1880.265 | 31.0 | 5.0 |
| max | 50000.0 | 23.0 | 7.0 | 21580.092 | 200.0 | 108.0 |
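
The summary covers only six of the 24 columns because `describe()` reports numeric columns by default, and any column containing `?` is read as strings. A hedged sketch of one way to bring such columns into the summary is to coerce them with `pd.to_numeric(..., errors="coerce")`. The toy frame and the `numeric_cols` selection are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy frame mimicking columns that were read as strings because of "?".
toy = pd.DataFrame(
    {
        "cCount": [1, 0, 3],
        "cMinPrice": ["59.99", "?", "12.5"],
        "bStep": ["?", "1", "4"],
    }
)

# Hypothetical selection of columns to coerce; adjust to the real frame.
numeric_cols = ["cMinPrice", "bStep"]

converted = toy.copy()
# Values that cannot be parsed (here "?") become NaN.
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric, errors="coerce")

summary = converted.describe()
print(summary)
```

After the conversion, the `count` row of the summary also shows how many values are present in each column, which doubles as a quick missing-value check.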


In [7]:
df_train['order'].value_counts(normalize=True)

y    0.67604
n    0.32396
Name: order, dtype: float64
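
These proportions are computed per row. The frame holds 429,013 rows for session numbers up to 50,000, and the first rows show several records sharing one `sessionNo`, so long sessions weigh more heavily in a row-level count. A minimal sketch of a session-level view follows, assuming the `order` label is constant within a session (the first rows suggest this, but it is not verified here). The toy data is illustrative only.

```python
import pandas as pd

# Toy data: session 1 has three rows, sessions 2 and 3 have one row each.
toy = pd.DataFrame(
    {
        "sessionNo": [1, 1, 1, 2, 3],
        "order": ["y", "y", "y", "n", "n"],
    }
)

# Row-level balance: every record counts once.
row_level = toy["order"].value_counts(normalize=True)

# Session-level balance: keep one label per session (the last record),
# assuming the label does not change within a session.
session_level = (
    toy.groupby("sessionNo")["order"].last().value_counts(normalize=True)
)

print(row_level)
print(session_level)
```

In the toy data the two views disagree (60% versus one third of purchases), which is why it can be worth checking both before judging class balance.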

In [8]:
df_train['bStep'].value_counts()

?    191333
1     90058
2     60682
4     41142
3     30062
5     15736
Name: bStep, dtype: int64
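
The most frequent `bStep` value is the `?` placeholder. A short sketch, using the counts printed above, works out how large that share is and how the known steps are distributed once `?` is set aside. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
bstep_counts = pd.Series(
    {"?": 191333, "1": 90058, "2": 60682, "4": 41142, "3": 30062, "5": 15736},
    name="bStep",
)

# Share of rows where the processing step is unknown.
missing_share = bstep_counts["?"] / bstep_counts.sum()

# Distribution over the known steps, ordered by step number.
known = bstep_counts.drop("?")
known.index = known.index.astype(int)
known_share = (known / known.sum()).sort_index()

print(f"share of '?': {missing_share:.3f}")
print(known_share)
```

The counts add up to the 429,013 rows of the frame, so `?` accounts for a little under half of all records.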