## Processing Steps
- Separate the Object-type variables and encode them as categorical variables.
- Separate the DateTime variables from the Object type and split them into Year, Month, Day, Hour, Min - **LabelEncode**
- Separate the Float variables and treat them as continuous.
- Separate the Integer variables and check whether any of them are categorical; apply binning to those that are.
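
The steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual code: `df_toy`, its column names, and the `MAX_LEVELS` threshold are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_data; the column names are made up.
df_toy = pd.DataFrame({
    "cat_a": pd.Series(["H", "R", "H", "H"], dtype=object),
    "when": pd.Series(["13MAR09:00:00:00"] * 4, dtype=object),
    "flag": [True, False, True, True],
    "ratio": [0.1, 0.5, 0.9, 0.3],
    "code": [1, 2, 2, 1],
})

# Group the columns by dtype family.
obj_cols = df_toy.select_dtypes(include=["object", "bool"]).columns.tolist()
float_cols = df_toy.select_dtypes(include=[np.floating]).columns.tolist()
int_cols = df_toy.select_dtypes(include=[np.integer]).columns.tolist()

# Hypothetical rule of thumb: an integer column with very few distinct
# values is a candidate for categorical treatment.
MAX_LEVELS = 3
int_catg_candidates = [c for c in int_cols if df_toy[c].nunique() <= MAX_LEVELS]

print(obj_cols, float_cols, int_cols, int_catg_candidates)
```

The threshold is only a starting point; each flagged column still needs a manual look before it is binned or encoded.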

In [55]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import category_encoders as ce
sDir = '/home/pabhijit/data/'

In [3]:
df_data = pd.read_pickle(sDir + 'step02_data.pkl')
df_data.shape

(290463, 651)

In [24]:
df_data['target'].value_counts()

0.0    111458
1.0     33773
Name: target, dtype: int64

In [26]:
df_data.drop(df_data[df_data['target'].isnull()].index, axis=0, inplace=True)

In [28]:
df_data.dtypes.value_counts()

float16    191
int16      182
int8       132
int32       86
float32     38
object      17
bool         4
float64      1
dtype: int64

In [29]:
df_data.head()

Unnamed: 0,VAR_0001,VAR_0005,VAR_0073,VAR_0075,VAR_0200,VAR_0204,VAR_0217,VAR_0226,VAR_0230,VAR_0232,...,VAR_1918,VAR_1919,VAR_1923,VAR_1924,VAR_1925,VAR_1929,VAR_1931,VAR_1932,VAR_1933,target
0,H,C,13MAR09:00:00:00,08NOV11:00:00:00,FT LAUDERDALE,29JAN14:21:16:00,08NOV11:02:00:00,False,False,True,...,9998,9998,999999998,9998,0,999999998,998,9998,9998,0.0
1,H,B,04SEP12:00:00:00,10NOV11:00:00:00,SANTEE,01FEB14:00:11:00,02OCT12:02:00:00,False,False,False,...,9996,111,999999998,9998,0,999999998,998,9998,9998,0.0
2,H,C,13MAR09:00:00:00,13DEC11:00:00:00,REEDSVILLE,30JAN14:15:11:00,13DEC11:02:00:00,False,False,True,...,9996,113,999999998,9998,0,999999998,998,9998,9998,0.0
3,H,C,13MAR09:00:00:00,23SEP10:00:00:00,LIBERTY,01FEB14:00:07:00,01NOV12:02:00:00,False,False,False,...,9998,9998,999999998,9998,0,999999998,998,9998,9998,0.0
4,R,N,13MAR09:00:00:00,15OCT11:00:00:00,FRANKFORT,29JAN14:19:31:00,15OCT11:02:00:00,False,False,True,...,9998,9998,999999998,9998,0,999999998,998,9998,9998,1.0


In [32]:
df_data[['VAR_0073', 'VAR_0075', 'VAR_0204', 'VAR_0217']]

Unnamed: 0,VAR_0073,VAR_0075,VAR_0204,VAR_0217
0,13MAR09:00:00:00,08NOV11:00:00:00,29JAN14:21:16:00,08NOV11:02:00:00
1,04SEP12:00:00:00,10NOV11:00:00:00,01FEB14:00:11:00,02OCT12:02:00:00
2,13MAR09:00:00:00,13DEC11:00:00:00,30JAN14:15:11:00,13DEC11:02:00:00
3,13MAR09:00:00:00,23SEP10:00:00:00,01FEB14:00:07:00,01NOV12:02:00:00
4,13MAR09:00:00:00,15OCT11:00:00:00,29JAN14:19:31:00,15OCT11:02:00:00
...,...,...,...,...
145226,16MAY12:00:00:00,27APR10:00:00:00,31JAN14:16:36:00,07JUL12:02:00:00
145227,22MAY12:00:00:00,22DEC08:00:00:00,30JAN14:23:23:00,24MAY12:02:00:00
145228,07MAY12:00:00:00,29NOV11:00:00:00,31JAN14:21:10:00,21AUG12:02:00:00
145229,13MAR09:00:00:00,09MAY12:00:00:00,30JAN14:22:34:00,09MAY12:02:00:00


In [40]:
df_data[['VAR_0073', 'VAR_0075', 'VAR_0204', 'VAR_0217']].describe()

Unnamed: 0,VAR_0073,VAR_0075,VAR_0204,VAR_0217
count,145231,145231,145231,145231
unique,1458,2371,1192,397
top,13MAR09:00:00:00,22SEP10:00:00:00,31JAN14:15:54:00,06DEC11:02:00:00
freq,101387,1224,298,949


Now that redundant columns are removed, we will separate categorical and continuous variables and do some more preprocessing.

### Split Data into Training and Validation sets

In [44]:
from sklearn.model_selection import train_test_split

In [42]:
y = df_data[['target']]
X = df_data.drop(['target'], axis=1)

In [45]:
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

print('X_train', X_train.shape)
print('X_valid', X_valid.shape)
print('y_train', y_train.shape)
print('y_valid', y_valid.shape)

X_train (116184, 650)
X_valid (29047, 650)
y_train (116184, 1)
y_valid (29047, 1)


### Object Variables

In [49]:
df_data.select_dtypes(include=['bool'])

Unnamed: 0,VAR_0226,VAR_0230,VAR_0232,VAR_0236
0,False,False,True,True
1,False,False,False,True
2,False,False,True,True
3,False,False,False,True
4,False,False,True,True
...,...,...,...,...
145226,False,False,False,True
145227,False,False,False,True
145228,False,False,False,True
145229,False,False,True,True


In [50]:
df_data_obj = df_data.select_dtypes(include=['object', 'bool'])
df_data_obj.shape

(145231, 21)

In [51]:
df_data_obj.head()

Unnamed: 0,VAR_0001,VAR_0005,VAR_0073,VAR_0075,VAR_0200,VAR_0204,VAR_0217,VAR_0226,VAR_0230,VAR_0232,...,VAR_0237,VAR_0274,VAR_0283,VAR_0305,VAR_0325,VAR_0342,VAR_0352,VAR_0353,VAR_0354,VAR_1934
0,H,C,13MAR09:00:00:00,08NOV11:00:00:00,FT LAUDERDALE,29JAN14:21:16:00,08NOV11:02:00:00,False,False,True,...,FL,FL,S,S,-1,CF,O,U,O,IAPS
1,H,B,04SEP12:00:00:00,10NOV11:00:00:00,SANTEE,01FEB14:00:11:00,02OCT12:02:00:00,False,False,False,...,CA,MI,S,S,H,EC,O,R,R,IAPS
2,H,C,13MAR09:00:00:00,13DEC11:00:00:00,REEDSVILLE,30JAN14:15:11:00,13DEC11:02:00:00,False,False,True,...,WV,WV,S,P,R,UU,R,R,-1,IAPS
3,H,C,13MAR09:00:00:00,23SEP10:00:00:00,LIBERTY,01FEB14:00:07:00,01NOV12:02:00:00,False,False,False,...,TX,TX,S,P,H,-1,R,R,-1,RCC
4,R,N,13MAR09:00:00:00,15OCT11:00:00:00,FRANKFORT,29JAN14:19:31:00,15OCT11:02:00:00,False,False,True,...,IL,IL,S,P,S,-1,R,U,O,BRANCH


In [48]:
# Check how many unique values each object/bool column has
for col in df_data_obj.columns:
    print(col, df_data_obj[col].nunique())

VAR_0001 3
VAR_0005 4
VAR_0073 1458
VAR_0075 2371
VAR_0200 12385
VAR_0204 1192
VAR_0217 397
VAR_0237 45
VAR_0274 57
VAR_0283 7
VAR_0305 8
VAR_0325 9
VAR_0342 50
VAR_0352 4
VAR_0353 4
VAR_0354 4
VAR_1934 5


- **VAR_0200** - Looks like city names. We can drop this variable, since a corresponding variable already represents the state, i.e. "VAR_0237".
- **VAR_0237, VAR_0274** - These are US states. We can group them into regions and drop the original variables.
- **Datetime Variables** - VAR_0073, VAR_0075, VAR_0204 and VAR_0217 are datetime variables, so they will be treated separately.
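
Grouping states into regions could look roughly like the sketch below. The `STATE_TO_REGION` dictionary is a deliberately partial, illustrative mapping (US Census Bureau regions for a handful of states), and `add_region` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Partial, illustrative mapping; a real one would cover every state code.
STATE_TO_REGION = {
    "FL": "South", "WV": "South", "TX": "South",
    "CA": "West",
    "IL": "Midwest", "MI": "Midwest",
}

def add_region(df, state_col, default="Unknown"):
    """Replace a state column with a coarser region column."""
    out = df.copy()
    # Codes missing from the mapping (including placeholders such as '-1')
    # fall back to the default label.
    out[state_col + "_region"] = out[state_col].map(STATE_TO_REGION).fillna(default)
    return out.drop(columns=[state_col])

toy = pd.DataFrame({"VAR_0237": ["FL", "CA", "WV", "-1"]})
toy_region = add_region(toy, "VAR_0237")
print(toy_region)
```

This reduces cardinality from dozens of states to a handful of regions, at the cost of losing state-level signal.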

In [60]:
fetrs_catg = ['VAR_0001', 'VAR_0005', 'VAR_0200', 'VAR_0237', 'VAR_0274', 'VAR_0283', 'VAR_0305', 
              'VAR_0325', 'VAR_0342', 'VAR_0352', 'VAR_0353', 'VAR_0354', 'VAR_1934']

fetr_datatime = ['VAR_0073', 'VAR_0075', 'VAR_0204', 'VAR_0217']

fetr_bool = ['VAR_0226', 'VAR_0230', 'VAR_0232', 'VAR_0236']

catg_fetrs = fetrs_catg + fetr_bool

In [64]:
type(fetrs_catg)

list

#### Encode Categorical Variables

In [61]:
# Create Target Encoder
target_enc = ce.TargetEncoder(cols=catg_fetrs)

# Learn encoding from the training set only. Use the 'target' column as the target.
target_enc.fit(X_train[catg_fetrs], y_train['target'])

# Apply encoding to the train and validation sets as new columns
X_train_encoded = target_enc.transform(X_train[catg_fetrs]).add_suffix('_target')
X_valid_encoded = target_enc.transform(X_valid[catg_fetrs]).add_suffix('_target')

In [62]:
print('X_train_encoded : ', X_train_encoded.shape)
print('X_valid_encoded : ', X_valid_encoded.shape)

X_train_encoded :  (116184, 17)
X_valid_encoded :  (29047, 17)


In [63]:
X_train_encoded.head()

Unnamed: 0,VAR_0001_target,VAR_0005_target,VAR_0200_target,VAR_0237_target,VAR_0274_target,VAR_0283_target,VAR_0305_target,VAR_0325_target,VAR_0342_target,VAR_0352_target,VAR_0353_target,VAR_0354_target,VAR_1934_target,VAR_0226_target,VAR_0230_target,VAR_0232_target,VAR_0236_target
1438,0.261643,0.192024,0.292683,0.274251,0.224892,0.22416,0.225135,0.157379,0.231586,0.25322,0.236862,0.231144,0.179237,0.233112,0.232938,0.160956,0.233062
70125,0.261643,0.192024,0.127273,0.183339,0.193531,0.276638,0.270561,0.248367,0.26123,0.259409,0.236862,0.231144,0.179237,0.233112,0.232938,0.160956,0.233062
13800,0.192149,0.192024,0.273006,0.249231,0.247607,0.22416,0.225135,0.235321,0.240108,0.192175,0.213545,0.21329,0.179237,0.233112,0.232938,0.160956,0.233062
7676,0.261643,0.259636,0.232967,0.187747,0.187221,0.22416,0.225135,0.235321,0.211661,0.25322,0.239092,0.231144,0.179237,0.233112,0.232938,0.160956,0.233062
23219,0.192149,0.259636,0.205882,0.228376,0.228743,0.22416,0.225135,0.235321,0.240108,0.25322,0.213545,0.231144,0.286781,0.233112,0.232938,0.160956,0.233062
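
To make the encoded values above easier to read: target encoding replaces each category with a statistic of the target for that category, learned on the training set only. The sketch below uses a plain per-category mean on made-up data. It is a simplification, since `ce.TargetEncoder` additionally smooths each category mean toward the global mean, so its numbers will generally differ from a raw mean.

```python
import pandas as pd

# Made-up training and validation data.
train = pd.DataFrame({"cat": ["a", "a", "b", "b", "b"],
                      "target": [1, 0, 1, 1, 0]})
valid = pd.DataFrame({"cat": ["a", "b", "c"]})

# Global mean of the target, used for categories unseen in training.
prior = train["target"].mean()

# Per-category mean of the target, learned from the training rows only.
means = train.groupby("cat")["target"].mean()

train_enc = train["cat"].map(means)
valid_enc = valid["cat"].map(means).fillna(prior)
print(valid_enc.tolist())
```

Fitting on the training split and only transforming the validation split, as the cell above does, keeps validation labels out of the encoding.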


#### Datetime Variables
We will parse each datetime string into separate variables: **Year, Month, Day of week, Hour**. Day of month, week of year and minute are left commented out in the code below.
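
As a quick check of the format string, here is a minimal sketch that parses two of the sample values shown earlier. The `sample` and `parts` names are illustrative only, and parsing the `%b` month abbreviation assumes an English locale.

```python
import pandas as pd

# Two raw values in the notebook's DDMONYY:HH:MM:SS layout.
sample = pd.Series(["29JAN14:21:16:00", "01FEB14:00:11:00"])
parsed = pd.to_datetime(sample, format="%d%b%y:%H:%M:%S")

parts = pd.DataFrame({
    "yy": parsed.dt.year,
    "mm": parsed.dt.month,
    "dw": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
    "hh": parsed.dt.hour,
})
print(parts)
```

Passing an explicit `format` avoids ambiguous inference and makes malformed strings fail loudly.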

In [77]:
lstDateTimeVar = ['VAR_0204', 'VAR_0217', 'VAR_0073', 'VAR_0075']

In [78]:
df_data[lstDateTimeVar]

Unnamed: 0,VAR_0204,VAR_0217,VAR_0073,VAR_0075
0,29JAN14:21:16:00,08NOV11:02:00:00,12MAR09:00:00:00,08NOV11:00:00:00
1,01FEB14:00:11:00,02OCT12:02:00:00,04SEP12:00:00:00,10NOV11:00:00:00
2,30JAN14:15:11:00,13DEC11:02:00:00,12MAR09:00:00:00,13DEC11:00:00:00
3,01FEB14:00:07:00,01NOV12:02:00:00,12MAR09:00:00:00,23SEP10:00:00:00
4,29JAN14:19:31:00,15OCT11:02:00:00,12MAR09:00:00:00,15OCT11:00:00:00
...,...,...,...,...
290458,31JAN14:18:12:00,07AUG12:02:00:00,12MAR09:00:00:00,07AUG12:00:00:00
290459,29JAN14:21:15:00,06NOV11:02:00:00,03NOV11:00:00:00,08OCT11:00:00:00
290460,31JAN14:23:56:00,11OCT12:02:00:00,12MAR09:00:00:00,11OCT12:00:00:00
290461,31JAN14:18:26:00,12AUG12:02:00:00,28JUL12:00:00:00,17NOV11:00:00:00


In [79]:
df_datetime = pd.DataFrame()

for var in lstDateTimeVar:
    dtDatetime = pd.to_datetime(df_data[var], format='%d%b%y:%H:%M:%S')
    
    sColYY = var + '_yy' #year
    sColMM = var + '_mm' #month
    #sColDD = var + '_dd' # day
    sColDW = var + '_dw' # day of week
    #sColWY = var + '_wy' # week of year
    sColHH = var + '_hh' # hour
    #sColMI = var + '_mi' # minute
    
    df_datetime[sColYY] = dtDatetime.dt.year
    
    df_datetime[sColMM] = dtDatetime.dt.month
    
    #df_datetime[sColDD] = dtDatetime.dt.day
    df_datetime[sColDW] = dtDatetime.dt.dayofweek
    #df_datetime[sColWY] = dtDatetime.dt.weekofyear

    df_datetime[sColHH] = dtDatetime.dt.hour
    #df_datetime[sColMI] = dtDatetime.dt.minute


In [80]:
df_datetime

Unnamed: 0,VAR_0204_yy,VAR_0204_mm,VAR_0204_dw,VAR_0204_hh,VAR_0217_yy,VAR_0217_mm,VAR_0217_dw,VAR_0217_hh,VAR_0073_yy,VAR_0073_mm,VAR_0073_dw,VAR_0073_hh,VAR_0075_yy,VAR_0075_mm,VAR_0075_dw,VAR_0075_hh
0,2014,1,2,21,2011,11,1,2,2009,3,3,0,2011,11,1,0
1,2014,2,5,0,2012,10,1,2,2012,9,1,0,2011,11,3,0
2,2014,1,3,15,2011,12,1,2,2009,3,3,0,2011,12,1,0
3,2014,2,5,0,2012,11,3,2,2009,3,3,0,2010,9,3,0
4,2014,1,2,19,2011,10,5,2,2009,3,3,0,2011,10,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
290458,2014,1,4,18,2012,8,1,2,2009,3,3,0,2012,8,1,0
290459,2014,1,2,21,2011,11,6,2,2011,11,3,0,2011,10,5,0
290460,2014,1,4,23,2012,10,3,2,2009,3,3,0,2012,10,3,0
290461,2014,1,4,18,2012,8,6,2,2012,7,5,0,2011,11,3,0


In [81]:
# Identify DateTime Variables with No variation
s = df_datetime.var()
s0 = s[s==0]
s0.index

Index(['VAR_0073_hh', 'VAR_0075_hh'], dtype='object')

In [82]:
# Drop above variables
df_datetime = df_datetime.drop(s0.index, axis=1)
df_datetime.head()

Unnamed: 0,VAR_0204_yy,VAR_0204_mm,VAR_0204_dw,VAR_0204_hh,VAR_0217_yy,VAR_0217_mm,VAR_0217_dw,VAR_0217_hh,VAR_0073_yy,VAR_0073_mm,VAR_0073_dw,VAR_0075_yy,VAR_0075_mm,VAR_0075_dw
0,2014,1,2,21,2011,11,1,2,2009,3,3,2011,11,1
1,2014,2,5,0,2012,10,1,2,2012,9,1,2011,11,3
2,2014,1,3,15,2011,12,1,2,2009,3,3,2011,12,1
3,2014,2,5,0,2012,11,3,2,2009,3,3,2010,9,3
4,2014,1,2,19,2011,10,5,2,2009,3,3,2011,10,5


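**Year Variables** - VAR_0217_yy, VAR_0073_yy, VAR_0075_yy  
These are ordinal. Because they are already numeric, the default OrdinalEncoder, which assigns codes in sorted order of the observed values, preserves their natural order.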
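
A small sketch of what the default encoder does to numeric years, on made-up values. Note that consecutive codes hide gaps between years; subtracting the minimum year (shown as `offset`) is one alternative that keeps the spacing. Neither `enc_demo` nor `offset` is part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up year column with a gap (no 2010).
years = np.array([[2009], [2012], [2009], [2011]])

enc_demo = OrdinalEncoder()
codes = enc_demo.fit_transform(years)

# Categories are the sorted unique values; codes are their positions.
print(enc_demo.categories_[0])
print(codes.ravel())

# Alternative that preserves the distance between years.
offset = years - years.min()
print(offset.ravel())
```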

In [83]:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()

In [84]:
arr = enc.fit_transform(df_datetime[['VAR_0217_yy', 'VAR_0073_yy', 'VAR_0075_yy']])
arr

array([[ 0.,  1.,  9.],
       [ 1.,  4.,  9.],
       [ 0.,  1.,  9.],
       ...,
       [ 1.,  1., 10.],
       [ 1.,  4.,  9.],
       [ 1.,  1., 10.]])

In [85]:
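# Reuse df_datetime's index so the later pd.concat(axis=1) aligns rows correctly
dfYear = pd.DataFrame(arr, columns=['VAR_0217_yy', 'VAR_0073_yy', 'VAR_0075_yy'], index=df_datetime.index)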
dfYear

Unnamed: 0,VAR_0217_yy,VAR_0073_yy,VAR_0075_yy
0,0.0,1.0,9.0
1,1.0,4.0,9.0
2,0.0,1.0,9.0
3,1.0,1.0,8.0
4,0.0,1.0,9.0
...,...,...,...
290458,1.0,1.0,10.0
290459,0.0,3.0,9.0
290460,1.0,1.0,10.0
290461,1.0,4.0,9.0


In [86]:
df_datetime.drop(['VAR_0217_yy', 'VAR_0073_yy', 'VAR_0075_yy'], axis=1, inplace=True)

In [87]:
df_datetime = pd.concat([dfYear, df_datetime], axis=1)
df_datetime.head()

Unnamed: 0,VAR_0217_yy,VAR_0073_yy,VAR_0075_yy,VAR_0204_yy,VAR_0204_mm,VAR_0204_dw,VAR_0204_hh,VAR_0217_mm,VAR_0217_dw,VAR_0217_hh,VAR_0073_mm,VAR_0073_dw,VAR_0075_mm,VAR_0075_dw
0,0.0,1.0,9.0,2014,1,2,21,11,1,2,3,3,11,1
1,1.0,4.0,9.0,2014,2,5,0,10,1,2,9,1,11,3
2,0.0,1.0,9.0,2014,1,3,15,12,1,2,3,3,12,1
3,1.0,1.0,8.0,2014,2,5,0,11,3,2,3,3,9,3
4,0.0,1.0,9.0,2014,1,2,19,10,5,2,3,3,10,5


#### Merge the Object Variables into a Final Object DataFrame

In [88]:
df_processed = pd.concat([df_catgOHE, df_datetime], axis=1)
df_processed.head()

Unnamed: 0,VAR_0001_Q,VAR_0001_R,VAR_0005_C,VAR_0005_N,VAR_0005_S,VAR_0226_True,VAR_0230_True,VAR_0232_True,VAR_0236_True,VAR_0283_F,...,VAR_0204_mm,VAR_0204_dw,VAR_0204_hh,VAR_0217_mm,VAR_0217_dw,VAR_0217_hh,VAR_0073_mm,VAR_0073_dw,VAR_0075_mm,VAR_0075_dw
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,1,2,21,11,1,2,3,3,11,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,2,5,0,10,1,2,9,1,11,3
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,1,3,15,12,1,2,3,3,12,1
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,2,5,0,11,3,2,3,3,9,3
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,1,2,19,10,5,2,3,3,10,5


In [89]:
df_processed.columns

Index(['VAR_0001_Q', 'VAR_0001_R', 'VAR_0005_C', 'VAR_0005_N', 'VAR_0005_S',
       'VAR_0226_True', 'VAR_0230_True', 'VAR_0232_True', 'VAR_0236_True',
       'VAR_0283_F',
       ...
       'VAR_0204_mm', 'VAR_0204_dw', 'VAR_0204_hh', 'VAR_0217_mm',
       'VAR_0217_dw', 'VAR_0217_hh', 'VAR_0073_mm', 'VAR_0073_dw',
       'VAR_0075_mm', 'VAR_0075_dw'],
      dtype='object', length=114)

In [90]:
df_processed.to_pickle(sDir + 'step03_object.pkl')

In [91]:
# Delete DFs
del (df_data, df_catg, df_catgOHE, df_datetime, dfYear)

## Misc