## Preprocessing Training Data Development
In this notebook we will create dummy features or one-hot encode features where necessary, scale or standardize the data where necessary, and finally split the data into training and testing subsets to use for modeling.

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

In [2]:
data = pd.read_csv('train.csv')

In [3]:
data = data.drop(['id', 'site_id'], axis=1)

There is no need to create dummy features for my data. The target variable, click, is already numerical: 1 when a click occurred and 0 when it did not. Creating dummy variables for the other columns wouldn't be helpful and would make an already very large dataset much larger.
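
To make the size concern concrete, here is a minimal sketch of how the number of dummy columns could be estimated before encoding anything. The `toy` frame and its values are made up for illustration; only the column names `click`, `site_category`, and `device_id` come from the data above.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "click": [0, 1, 0, 0, 1, 0],
    "site_category": ["a", "b", "a", "c", "a", "b"],
    "device_id": ["d1", "d2", "d3", "d4", "d5", "d6"],
})

# Columns we would consider one-hot encoding (assumed list for this sketch).
categorical_cols = ["site_category", "device_id"]

# Number of distinct values per column.
cardinality = toy[categorical_cols].nunique().sort_values(ascending=False)
print(cardinality)

# One-hot encoding adds one column per distinct value,
# so the total is the sum of the cardinalities.
n_dummy_columns = int(cardinality.sum())
print(n_dummy_columns)

# Cross-check against what pd.get_dummies actually produces.
dummies = pd.get_dummies(toy[categorical_cols], columns=categorical_cols)
print(dummies.shape)
```

Running the same `nunique()` check on the real object columns would show how many columns one-hot encoding would add.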

In [4]:
data.dtypes

click                int64
hour                 int64
C1                   int64
banner_pos           int64
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type          int64
device_conn_type     int64
C14                  int64
C15                  int64
C16                  int64
C17                  int64
C18                  int64
C19                  int64
C20                  int64
C21                  int64
dtype: object

There is also no need to standardize my values because most of the columns are categorical and the rest (C14 to C21) are variables whose meaning is unknown.

In [5]:
# Create the training and testing data splits.

from sklearn.model_selection import train_test_split

columns = ['hour', 'C1', 'banner_pos', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model']
X = data[columns]
y = data.click

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

In [6]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((32343173, 21), (8085794, 21), (32343173,), (8085794,))

Our training and testing subsets may be too large and too imbalanced to run our models on efficiently and accurately. We might therefore need to sample the data to make it smaller and use an oversampling method to balance the number of rows in the click (1) and no-click (0) classes.
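
As a rough sketch of how the imbalance could be measured, the snippet below computes the share of each class and then splits with `stratify`, which keeps that share the same in both subsets. The toy data is made up, and `stratify` is an optional argument of `train_test_split` that the split above does not use; it is shown here only as one possible option.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: 100 rows with 20% clicks (made-up values).
toy = pd.DataFrame({
    "hour": range(100),
    "click": [1, 0, 0, 0, 0] * 20,
})

# Share of each class in the full data.
class_share = toy["click"].value_counts(normalize=True)
print(class_share)

X_toy = toy[["hour"]]
y_toy = toy["click"]

# stratify=y_toy preserves the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100, stratify=y_toy
)

# Click rate in each subset.
print(y_tr.mean(), y_te.mean())
```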

### Sampled Data

In [7]:
data_0 = data[data['click']==0]
data_1 = data[data['click']==1]
print(data_0.shape, data_1.shape)

(33563901, 22) (6865066, 22)


In [8]:
data_0 = data_0.sample(frac=0.01, random_state=42)
data_0.shape

(335639, 22)

In [9]:
data_1 = data_1.sample(frac=0.03, random_state=42)
data_1.shape

(205952, 22)

In [10]:
data_sample = pd.concat([data_0, data_1], axis = 0, ignore_index=True)
data_sample['click'].value_counts()

0    335639
1    205952
Name: click, dtype: int64

We won't apply oversampling yet because we first want to see how our models perform on the original data. If the performance metrics are lacking, we will use RandomOverSampler from the imbalanced-learn (imblearn) library to balance the number of 0 and 1 clicks.
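
To illustrate what random oversampling does, here is a pandas-only sketch of the idea: minority-class rows are duplicated at random, with replacement, until both classes have the same count. This is a stand-in written for illustration, not imblearn's RandomOverSampler itself, and the toy data is made up.

```python
import pandas as pd

# Toy stand-in: 7 no-click rows and 3 click rows (made-up values).
toy = pd.DataFrame({
    "hour": range(10),
    "click": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

counts = toy["click"].value_counts()
minority_label = counts.idxmin()

# How many extra minority rows are needed to match the majority class.
n_extra = int(counts.max() - counts.min())

# Draw the extra minority rows at random, with replacement.
extra_rows = toy[toy["click"] == minority_label].sample(
    n=n_extra, replace=True, random_state=42
)

# Append the duplicates to the original rows.
balanced = pd.concat([toy, extra_rows], ignore_index=True)
print(balanced["click"].value_counts())
```

If we do oversample, it would usually make sense to apply it to the training subset only, so that the testing subset keeps the original class balance.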