<h1>Part 2 - Data Wrangling</h1>

In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**Import Data and Restate Variables**

In [2]:
train_data = pd.read_pickle('data/model/train_data.pkl')
test_data = pd.read_pickle('data/model/test_data.pkl')

vars_cont = ['Age', 'Flight Distance', 'Arrival Delay in Minutes', 'Departure Delay in Minutes']

vars_cat_num = ['Inflight wifi service', 'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness']

vars_cat_str = ['Gender', 'Customer Type', 'Type of Travel', 'Class', 'satisfaction']

**Map Categorical String Variables**

Here, we map the categorical string variables to numerical values for use in our models.<br/>

We justify this numerical scheme on the basis that most of these variables are binary. For the only categorical variable with three values (Class), we chose the codes so that the result is roughly ordinal: in general, the price paid for Eco is less than for Eco Plus, which in turn is less than for Business.

The plan is to use an ensemble of trees, so an ordinal encoding makes sense: a single threshold split can then separate cheaper classes from more expensive ones, which keeps the resulting sets easier to interpret.

In [3]:
gender_dict = dict([('Male',0), ('Female',1)])
cust_type_dict = dict([('disloyal Customer', 0), ('Loyal Customer', 1)])
travel_dict = dict([('Personal Travel',0), ('Business travel',1)])
class_dict = dict([('Eco',0),('Eco Plus',1), ('Business',2)])
sat_dict = dict([('neutral or dissatisfied',0), ('satisfied',1)])

category_dict = dict([('Gender', gender_dict),
                     ('Customer Type', cust_type_dict),
                     ('Type of Travel', travel_dict),
                     ('Class', class_dict),
                     ('satisfaction', sat_dict)])

train_data_num = train_data.copy()
test_data_num = test_data.copy()

# Replace each string label with its numeric code; an unknown label raises KeyError
for col in vars_cat_str:
    train_data_num[col] = train_data_num[col].map(lambda x: category_dict[col][x])
    test_data_num[col] = test_data_num[col].map(lambda x: category_dict[col][x])
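As a sanity check on the mapping step, the sketch below shows one way to list labels that have no entry in a mapping dictionary before converting them. It runs on hypothetical toy data (the label `'First'` is invented for illustration and does not appear in the real data set). It passes the dictionary directly to `Series.map`, which differs from the lambda lookup above: unknown labels become `NaN` instead of raising a `KeyError`.

```python
import pandas as pd

# Hypothetical toy data; 'First' is an invented label with no mapping.
toy = pd.DataFrame({'Class': ['Eco', 'Business', 'Eco Plus', 'First']})
class_dict = {'Eco': 0, 'Eco Plus': 1, 'Business': 2}

# Passing the dict to Series.map yields NaN for labels missing from the dict.
mapped = toy['Class'].map(class_dict)

# Collect the labels that could not be mapped, for inspection.
unmapped_labels = sorted(toy.loc[mapped.isna(), 'Class'].unique())
print(unmapped_labels)  # ['First']
```

An empty list here would indicate that every label in the column is covered by the dictionary.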

**Drop NAs and drop 0s in Survey Questions**

In [4]:
print('              Training Size   Test Size')
print('Before Drop:        {0:,d}      {1:,d}'.format(len(train_data_num), len(test_data_num)))
train_data_num.dropna(how='any', inplace=True)
test_data_num.dropna(how='any', inplace=True)

print('After NA Drop:      {0:,d}      {1:,d}'.format(len(train_data_num), len(test_data_num)))

# Keep only rows in which no survey question was answered with 0
train_data_num = train_data_num[~(train_data_num[vars_cat_num] == 0).any(axis=1)]
test_data_num = test_data_num[~(test_data_num[vars_cat_num] == 0).any(axis=1)]
print('After 0 Drop:        {0:,d}      {1:,d}'.format(len(train_data_num), len(test_data_num)))

              Training Size   Test Size
Before Drop:        103,904      25,976
After NA Drop:      103,594      25,893
After 0 Drop:        95,415      23,789
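To make the zero filter easier to follow, here is a minimal sketch of the same boolean-mask pattern on a hypothetical four-row frame. The values and the `survey_cols` name are made up for illustration; only the masking expression mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy data: two survey columns and one continuous column.
toy = pd.DataFrame({
    'Seat comfort': [3, 0, 5, 4],
    'Cleanliness': [4, 2, 0, 1],
    'Age': [25, 40, 0, 33],  # not a survey column, so its 0 is ignored
})
survey_cols = ['Seat comfort', 'Cleanliness']

# True for every row in which at least one survey column equals 0.
has_zero = (toy[survey_cols] == 0).any(axis=1)

# Negate the mask to keep only the rows without a 0 survey answer.
kept = toy[~has_zero]
print(kept.index.tolist())  # [0, 3]
```

Restricting the comparison to the survey columns matters: a 0 in a continuous variable such as a delay in minutes is a legitimate value and should not cause the row to be dropped.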


**Save Variables**<br/>
We save the wrangled training and test sets and move on to the next stage: Feature Extraction.

In [5]:
pd.to_pickle(train_data_num, 'data/model/train_data_num_pkl')
pd.to_pickle(test_data_num, 'data/model/test_data_num_pkl')
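Before moving on, it can be worth confirming that a pickled frame reads back unchanged. The sketch below does this for a hypothetical toy frame written to a temporary directory, so it does not touch the project's `data/model` folder. The file name `toy_num.pkl` is an assumption made for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the wrangled data.
toy = pd.DataFrame({'Class': [0, 1, 2], 'satisfaction': [1, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_num.pkl')  # hypothetical file name
    pd.to_pickle(toy, path)          # write the frame to disk
    restored = pd.read_pickle(path)  # read it back

# DataFrame.equals compares values, dtypes, and index/column labels.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

The same check could be applied to `train_data_num` and `test_data_num` by reading the saved files back with `pd.read_pickle`.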