# Data Preprocessing

## Summary of Data Preprocessing Steps
1. Remove Informational Offers
2. Remove unneeded features
3. Split data into Train/Test/Validation
4. Use Sklearn Pipeline to Preprocess Data
5. Upload Files to S3

### Import Libraries and Data

In [1]:
import pandas as pd
import numpy as np
import math
import json
import os
import time
from time import gmtime, strftime
import boto3
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.model_selection
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker import image_uris
from sagemaker.predictor import csv_serializer

# This is an object that represents the SageMaker session that we are currently operating in. This
# object contains some useful information that we will need to access later such as our region.
session = sagemaker.Session()

# This is an object that represents the IAM role that we are currently assigned. When we construct
# and launch the training job later we will need to tell it what IAM role it should have. Since our
# use case is relatively simple we will simply assign the training job the role we currently have.
role = get_execution_role()

In [2]:
df_conversion = pd.read_pickle('../input/df_conversion.pkl')
df_conversion.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63288 entries, 0 to 63287
Data columns (total 28 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   person                 63288 non-null  object        
 1   event                  63288 non-null  object        
 2   time_received          63288 non-null  int64         
 3   amount                 63288 non-null  float64       
 4   offer_id               63288 non-null  object        
 5   reward                 63288 non-null  float64       
 6   difficulty             63288 non-null  float64       
 7   duration               63288 non-null  float64       
 8   offer_type             63288 non-null  object        
 9   email                  63288 non-null  float64       
 10  mobile                 63288 non-null  float64       
 11  web                    63288 non-null  float64       
 12  social                 63288 non-null  float64       
 13  t

### Step 1) Remove Informational Offers

In [3]:
df_conversion_v2 = df_conversion[df_conversion['offer_type'] != 'informational']
df_conversion_v2.shape

(50637, 28)

### Step 2) Remove Unneeded Features

In [4]:
columns_to_drop = ['person', 'event', 'time_received', 'amount', 'offer_id', 'email', 'mobile', 'web', 'social',
                   'member_date', 'member_year', 'offer_received_count', 'time_viewed', 'offer_viewed_count',
                   'time_completed', 'offer_completed_count', 'offer_completed_flag']
df = df_conversion_v2.drop(columns_to_drop, axis = 1)

# An age of 118 is treated as a missing value here: convert it to NaN so it can be imputed later
df['age'] = df['age'].map(lambda x: np.nan if x == 118 else x)
df.head()

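| | reward | difficulty | duration | offer_type | total_channels | gender | age | income | tenure | offer_viewed_flag | conversion_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 10.0 | 7.0 | discount | 3.0 | M | 33.0 | 72000.0 | 1 | 0 | 0 |
| 3 | 5.0 | 5.0 | 5.0 | bogo | 4.0 | M | 33.0 | 72000.0 | 1 | 1 | 1 |
| 4 | 2.0 | 10.0 | 10.0 | discount | 4.0 | M | 33.0 | 72000.0 | 1 | 1 | 1 |
| 5 | 5.0 | 5.0 | 5.0 | bogo | 4.0 | Unknown | NaN | NaN | 0 | 1 | 0 |
| 6 | 5.0 | 20.0 | 10.0 | discount | 2.0 | O | 40.0 | 57000.0 | 0 | 1 | 1 |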


### Step 3) Train/Test Split

In [5]:
# Separate the input features (X) from the target variable (conversion_flag). Keeping both as pandas
# objects will make saving the data to a file a little easier later on.
# Shuffle the DataFrame rows first.
df = df.sample(frac = 1, random_state = 10)

X = df.drop(['conversion_flag'], axis=1)
y = df['conversion_flag']

# Hold out 25% of the dataset as the test set; the remaining 75% is used for training and validation.
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.25, random_state = 10)

# Then we split the remaining data further into 67% training and 33% validation sets.
X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33, random_state = 10)

In [6]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_val.shape)
print(Y_test.shape)

(25444, 10)
(12533, 10)
(12660, 10)
(25444,)
(12533,)
(12660,)
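
The shapes above follow directly from the two `test_size` values. A minimal sketch of the arithmetic, assuming scikit-learn rounds a fractional `test_size` up to the next whole row and gives the rest to the training side; `split_sizes` is a hypothetical helper written for this illustration, not part of the notebook:

```python
import math


def split_sizes(n_rows, test_size, val_size):
    """Return (n_train, n_val, n_test) for a two-stage split.

    Assumes the held-out side of each split is ceil(fraction * rows)
    and the remainder stays on the training side.
    """
    n_test = math.ceil(test_size * n_rows)
    n_remaining = n_rows - n_test
    n_val = math.ceil(val_size * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


# 50637 rows remain after the informational offers are removed
n_train, n_val, n_test = split_sizes(50637, test_size=0.25, val_size=0.33)
print(n_train, n_val, n_test)  # 25444 12533 12660
```

In other words, roughly 50% of the rows end up in training, 25% in validation, and 25% in testing.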


### Step 4) Create Sklearn Pipeline

We'll use a scikit-learn Pipeline to apply the necessary preprocessing steps to the data:
- Impute the median for null numeric features (income and age in our case)
- Apply StandardScaler to numeric features
- Impute 'missing' for null categorical data
- One-hot encode categorical data

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [8]:
from sklearn.compose import ColumnTransformer

numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

In [9]:
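# Fit the pipeline on the training data only, then apply the fitted transformations to the
# validation and test sets. Refitting on each split would give every split its own medians,
# scaling parameters and one-hot columns, and would leak information from the held-out data.
X_train_processed = pipeline.fit_transform(X_train)
X_val_processed = pipeline.transform(X_val)
X_test_processed = pipeline.transform(X_test)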

# Convert to pandas DataFrames for convenience when writing to CSV. Note that this step would not be ideal
# if the data were very large.
X_train_processed = pd.DataFrame(X_train_processed, index=X_train.index)
X_val_processed = pd.DataFrame(X_val_processed, index=X_val.index)
X_test_processed = pd.DataFrame(X_test_processed, index=X_test.index)

print(X_train_processed.shape)
print(X_val_processed.shape)
print(X_test_processed.shape)

(25444, 14)
(12533, 14)
(12660, 14)
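
Preprocessing statistics (medians, scaling parameters, one-hot categories) are normally learned from the training split and then reused unchanged on any other data. The sketch below illustrates this on a toy frame with the same kind of transformers as above; the column values and the names `toy_train`, `toy_new` and `toy_preprocessor` are made up for illustration, and `sparse_threshold=0` is set only to force a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy "training" data with one missing age
toy_train = pd.DataFrame({
    "age": [33.0, 40.0, np.nan, 59.0],
    "offer_type": pd.Series(["discount", "bogo", "bogo", "discount"], dtype=object),
})
# Toy "new" data: a missing age and a category never seen during fitting
toy_new = pd.DataFrame({
    "age": [np.nan, 25.0],
    "offer_type": pd.Series(["bogo", "informational"], dtype=object),
})

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])
categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric, ["age"]),
        ("cat", categorical, ["offer_type"])],
    sparse_threshold=0)  # always return a dense array

# Learn the median, scaling parameters and categories from the training data only
train_out = toy_preprocessor.fit_transform(toy_train)
# Reuse those learned values on the new data
new_out = toy_preprocessor.transform(toy_new)

print(train_out.shape, new_out.shape)
# The missing age in toy_new is filled with the *training* median, so it scales
# to the same value as the imputed row in toy_train.
print(np.isclose(new_out[0, 0], train_out[2, 0]))
# The unseen category is encoded as all zeros instead of adding a new column.
print(new_out[1, 1:])
```

Because `transform` reuses the fitted encoder, the number of output columns is the same for every split, which is what a model trained on the processed training data expects.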


### Step 5) Upload Files to S3

When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. We can use the SageMaker API to do this and hide some of the details.

### Save the data locally

First we need to create the train, validation, and test CSV files, which we will then upload to S3.

In [10]:
# This is our local data directory. We need to make sure that it exists.
data_dir = '../input'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [11]:
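# We use pandas to save our train, validation, and test data to CSV files. The processed files are written
# without a header or an index, as this is required by the built-in algorithms provided by Amazon. For the
# train and validation files it is also assumed that the first entry in each row is the target variable, so
# the label is concatenated in front of the features. test_unprocessed.csv holds the raw test features and
# keeps its header.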
X_test_processed.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)
X_test.to_csv(os.path.join(data_dir, 'test_unprocessed.csv'), header=True, index=False)
Y_test.to_csv(os.path.join(data_dir, 'Y_test.csv'), header=False, index=False)

pd.concat([Y_val, X_val_processed], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([Y_train, X_train_processed], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
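
The label-first, header-less layout can be checked on a toy example. This is a small sketch with made-up values; `features`, `labels` and the temporary file name are placeholders, not the notebook's data. Note that the label and the features must share the same index for `pd.concat` to line them up row by row.

```python
import os
import tempfile

import pandas as pd

# Toy processed features and labels sharing the same (shuffled) index
features = pd.DataFrame([[0.5, 1.0], [-0.5, 0.0]], index=[7, 3])
labels = pd.Series([1, 0], index=[7, 3], name="conversion_flag")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_toy.csv")
    # Label first, then features; no header row and no index column
    pd.concat([labels, features], axis=1).to_csv(path, header=False, index=False)
    with open(path) as f:
        lines = f.read().splitlines()

print(lines)  # ['1,0.5,1.0', '0,-0.5,0.0']
```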

### Save the data to S3

In [12]:
bucket = session.default_bucket()
prefix = 'mle-capstone'

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), bucket = bucket, key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), bucket = bucket, key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), bucket = bucket, key_prefix=prefix)
Y_test_location = session.upload_data(os.path.join(data_dir, 'Y_test.csv'), bucket = bucket, key_prefix=prefix)
test_unprocessed_location = session.upload_data(os.path.join(data_dir, 'test_unprocessed.csv'), bucket = bucket, key_prefix=prefix)
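
Each `upload_data` call returns the S3 URI of the uploaded file, which is built from the bucket, the key prefix, and the local file name. A sketch of that layout follows; `expected_s3_uri` is a hypothetical helper and `example-bucket` is a placeholder, not the session's actual default bucket.

```python
import os


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the s3://bucket/prefix/filename URI expected for an uploaded file."""
    return "s3://{}/{}/{}".format(bucket, key_prefix, os.path.basename(local_path))


print(expected_s3_uri("example-bucket", "mle-capstone", "../input/train.csv"))
# s3://example-bucket/mle-capstone/train.csv
```

These URIs are what a later training or batch transform job would be pointed at.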