# Data Exploration and Preprocessing Prototyping

- No duplicate values in 'id' (no customer appears more than once), so the 'id' column is dropped  

- 310 NaN values in 'Arrival Delay in Minutes'  
    - the data dictionary does not explicitly address this, so I assume there was no delay in those instances  
    - I will replace NaNs here with 0s  
    - No other missing or null values found  

- Intended target variable ('Customer Type') is very imbalanced    
    Loyal Customer:       0.817322  
    disloyal Customer:    0.182678  

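- Mean and standard deviation of the survey answer features are fairly uniform across features. The means in particular hover around 3 (scale of 1-5, although most of these columns also contain some 0s, which may mean "not applicable" or no answer), indicating this particular subset of features is fairly balanced.  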

- 'Class' feature is represented by 3 categories, one of which is highly under-represented. It should also be noted that first-class flights are ***not*** represented at all. This should be kept in mind when interpreting the findings at the end of the project.  
    Business: 0.477989  
    Eco: 0.449886  
    Eco Plus: 0.072124 (minority category)

- 'Travel type' is a little more than 2/3 business

- Delays on arrival *and* departure are heavily right-skewed: the median is 0 and the 75th percentile is 12-13 minutes, but the maximum is roughly 1,590 minutes  
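
The uniqueness and missing-value checks summarized in the list above are not shown in the cells that follow. Here is a minimal sketch of how they could be run, using a small made-up frame (`toy`) rather than the real data. On the real data, the 'id' check would have to run before the column is dropped.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "Departure Delay in Minutes": [25, 1, 0, 11],
    "Arrival Delay in Minutes": [18.0, np.nan, 0.0, 9.0],
})

# True when no customer id appears more than once.
ids_unique = toy["id"].is_unique

# Missing values per column, keeping only the columns that have gaps.
missing = toy.isna().sum()
missing = missing[missing > 0]

print(ids_unique)
print(missing)
```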




In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline 

In [2]:
df = pd.read_csv('data/train.csv.zip',index_col=0 ,compression='zip')
df = df.drop('id', axis=1)
df.head()

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103904 entries, 0 to 103903
Data columns (total 23 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Gender                             103904 non-null  object 
 1   Customer Type                      103904 non-null  object 
 2   Age                                103904 non-null  int64  
 3   Type of Travel                     103904 non-null  object 
 4   Class                              103904 non-null  object 
 5   Flight Distance                    103904 non-null  int64  
 6   Inflight wifi service              103904 non-null  int64  
 7   Departure/Arrival time convenient  103904 non-null  int64  
 8   Ease of Online booking             103904 non-null  int64  
 9   Gate location                      103904 non-null  int64  
 10  Food and drink                     103904 non-null  int64  
 11  Online boarding                    1039

In [4]:
objx = ['Gender','Customer Type','Type of Travel','Class','satisfaction']
df.loc[df['Arrival Delay in Minutes'].isna()].shape

(310, 23)

In [5]:
# replace NaNs with zeros for arrival delay feature
# fillna avoids the np.NaN alias, which was removed in NumPy 2.0
df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].fillna(0)
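
The zero-fill rests on the assumption that a missing arrival delay means no delay. One way to probe that assumption, sketched on made-up values rather than the real data, is to look at the departure delays on the same rows. If many of those flights left late, zero may not be a safe fill value.

```python
import numpy as np
import pandas as pd

# Toy stand-in; the values are made up for illustration.
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 40, 5, 0, 120],
    "Arrival Delay in Minutes": [0.0, np.nan, 3.0, np.nan, 115.0],
})

missing_mask = toy["Arrival Delay in Minutes"].isna()

# Departure delays on the rows where the arrival delay is missing.
dep_when_missing = toy.loc[missing_mask, "Departure Delay in Minutes"]

# Share of those rows that left late at all.
share_with_dep_delay = (dep_when_missing > 0).mean()

# Zero-fill, matching the assumption made in the notes.
filled = toy["Arrival Delay in Minutes"].fillna(0)

print(share_with_dep_delay)
print(filled.tolist())
```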

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103904 entries, 0 to 103903
Data columns (total 23 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Gender                             103904 non-null  object 
 1   Customer Type                      103904 non-null  object 
 2   Age                                103904 non-null  int64  
 3   Type of Travel                     103904 non-null  object 
 4   Class                              103904 non-null  object 
 5   Flight Distance                    103904 non-null  int64  
 6   Inflight wifi service              103904 non-null  int64  
 7   Departure/Arrival time convenient  103904 non-null  int64  
 8   Ease of Online booking             103904 non-null  int64  
 9   Gate location                      103904 non-null  int64  
 10  Food and drink                     103904 non-null  int64  
 11  Online boarding                    1039

In [7]:
# split objects and numerics
objx_df = df[objx]
ints_df = df.drop(objx,axis=1)


In [8]:
objx_df.head()

Unnamed: 0,Gender,Customer Type,Type of Travel,Class,satisfaction
0,Male,Loyal Customer,Personal Travel,Eco Plus,neutral or dissatisfied
1,Male,disloyal Customer,Business travel,Business,neutral or dissatisfied
2,Female,Loyal Customer,Business travel,Business,satisfied
3,Female,Loyal Customer,Business travel,Business,neutral or dissatisfied
4,Male,Loyal Customer,Business travel,Business,satisfied


In [33]:
df.satisfaction.value_counts(normalize=True)

neutral or dissatisfied    0.566667
satisfied                  0.433333
Name: satisfaction, dtype: float64

In [9]:
objx_df.describe()

Unnamed: 0,Gender,Customer Type,Type of Travel,Class,satisfaction
count,103904,103904,103904,103904,103904
unique,2,2,2,3,2
top,Female,Loyal Customer,Business travel,Business,neutral or dissatisfied
freq,52727,84923,71655,49665,58879


In [10]:
# Gender is very balanced; the target (Customer Type, i.e. loyalty) is heavily imbalanced.
# Type of Travel is a little over 2/3 business.
# Class is fairly balanced between Business and Eco, with Eco Plus as a small minority; first class is not represented at all.
for col in objx_df.columns:
    print(df[col].value_counts(normalize=True))
    print("\n----------------------------------\n")

Female    0.507459
Male      0.492541
Name: Gender, dtype: float64

----------------------------------

Loyal Customer       0.817322
disloyal Customer    0.182678
Name: Customer Type, dtype: float64

----------------------------------

Business travel    0.689627
Personal Travel    0.310373
Name: Type of Travel, dtype: float64

----------------------------------

Business    0.477989
Eco         0.449886
Eco Plus    0.072124
Name: Class, dtype: float64

----------------------------------

neutral or dissatisfied    0.566667
satisfied                  0.433333
Name: satisfaction, dtype: float64

----------------------------------



In [11]:
# split survey data and continuous data (flight metrics)
conts = ['Age','Flight Distance','Departure Delay in Minutes','Arrival Delay in Minutes']
cont_df = df[conts]
survey_df = ints_df.drop(conts,axis=1)

In [12]:
cont_df.head()

Unnamed: 0,Age,Flight Distance,Departure Delay in Minutes,Arrival Delay in Minutes
0,13,460,25,18.0
1,25,235,1,6.0
2,26,1142,0,0.0
3,25,562,11,9.0
4,61,214,0,0.0


In [13]:
# good age range, good flight distance range; delays are heavily right-skewed (median 0, 75th percentile 12-13 min, max ~1,590 min)
cont_df.describe()

Unnamed: 0,Age,Flight Distance,Departure Delay in Minutes,Arrival Delay in Minutes
count,103904.0,103904.0,103904.0,103904.0
mean,39.379706,1189.448375,14.815618,15.133392
std,15.114964,997.147281,38.230901,38.649776
min,7.0,31.0,0.0,0.0
25%,27.0,414.0,0.0,0.0
50%,40.0,843.0,0.0,0.0
75%,51.0,1743.0,12.0,13.0
max,85.0,4983.0,1592.0,1584.0


In [31]:
df['Arrival Delay in Minutes'].value_counts()

0.0      58469
1.0       2211
2.0       2064
3.0       1952
4.0       1907
         ...  
823.0        1
459.0        1
518.0        1
370.0        1
429.0        1
Name: Arrival Delay in Minutes, Length: 455, dtype: int64
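
The counts above show 58,469 of 103,904 flights (about 56%) with an arrival delay of 0 after the zero-fill, alongside a long tail of rare large values. A small sketch of how that skew could be summarized with shares and upper quantiles follows. The series is made up for illustration and `delays` is a hypothetical name.

```python
import pandas as pd

# Toy delay values with the same shape as the real column:
# mostly zeros plus one very large value.
delays = pd.Series([0, 0, 0, 0, 0, 0, 2, 5, 13, 1584],
                   name="Arrival Delay in Minutes")

share_zero = (delays == 0).mean()     # fraction of flights with no delay
share_over_30 = (delays > 30).mean()  # fraction delayed more than 30 minutes

# Upper quantiles describe the tail better than the maximum alone.
upper = delays.quantile([0.5, 0.75, 0.99])

print(share_zero, share_over_30)
print(upper)
```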

In [14]:
survey_df.head()

Unnamed: 0,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness
0,3,4,3,1,5,3,5,5,4,3,4,4,5,5
1,3,2,3,3,1,3,1,1,1,5,3,1,4,1
2,2,2,2,2,5,5,5,5,4,3,4,4,4,5
3,2,5,5,5,2,2,2,2,2,5,3,1,4,2
4,3,3,3,3,4,5,5,3,3,4,4,3,3,3


In [15]:
# means cluster around 3 with std roughly 1.2-1.5; note that most columns have a minimum of 0, outside the 1-5 scale
survey_df.describe()

Unnamed: 0,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness
count,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0
mean,2.729683,3.060296,2.756901,2.976883,3.202129,3.250375,3.439396,3.358158,3.382363,3.351055,3.631833,3.30429,3.640428,3.286351
std,1.327829,1.525075,1.398929,1.277621,1.329533,1.349509,1.319088,1.332991,1.288354,1.315605,1.180903,1.265396,1.175663,1.312273
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,3.0,2.0
50%,3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,3.0
75%,4.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,4.0,4.0,5.0,4.0,5.0,4.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


In [29]:
for col in survey_df.columns:
    print(col)
    print(df[col].value_counts())
    print("\n----------------------------------\n")

Inflight wifi service
3    25868
2    25830
4    19794
1    17840
5    11469
0     3103
Name: Inflight wifi service, dtype: int64

----------------------------------

Departure/Arrival time convenient
4    25546
5    22403
3    17966
2    17191
1    15498
0     5300
Name: Departure/Arrival time convenient, dtype: int64

----------------------------------

Ease of Online booking
3    24449
2    24021
4    19571
1    17525
5    13851
0     4487
Name: Ease of Online booking, dtype: int64

----------------------------------

Gate location
3    28577
4    24426
2    19459
1    17562
5    13879
0        1
Name: Gate location, dtype: int64

----------------------------------

Food and drink
4    24359
5    22313
3    22300
2    21988
1    12837
0      107
Name: Food and drink, dtype: int64

----------------------------------

Online boarding
4    30762
3    21804
5    20713
2    17505
1    10692
0     2428
Name: Online boarding, dtype: int64

----------------------------------

Seat comfort
4

- What will be the performance metric of focus?
- Consider cross-validation, since the data already comes split into train and test sets?
- How to deal with class imbalance?
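
On the metric question, a quick arithmetic sketch shows why plain accuracy may be a poor choice with the 0.817 / 0.183 split in 'Customer Type'. A model that always predicts the majority class already reaches about 82% accuracy, while its balanced accuracy (the mean of the per-class recalls) is only 0.5. The row count `n` is hypothetical, and this only illustrates the two metrics. It does not decide which one the project should use.

```python
import numpy as np

# Share of 'disloyal Customer' reported in the notes above.
p_disloyal = 0.182678

n = 10_000  # hypothetical number of rows
n_disloyal = round(n * p_disloyal)

# 1 = disloyal (minority), 0 = loyal (majority)
y_true = np.array([1] * n_disloyal + [0] * (n - n_disloyal))
y_pred = np.zeros_like(y_true)  # always predict the majority class

accuracy = (y_true == y_pred).mean()

# Recall per class, then their unweighted mean (balanced accuracy).
recall_disloyal = (y_pred[y_true == 1] == 1).mean()
recall_loyal = (y_pred[y_true == 0] == 0).mean()
balanced_accuracy = (recall_disloyal + recall_loyal) / 2

print(accuracy, balanced_accuracy)
```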

##### Trying undersampling using Tomek links

In [18]:
X = df.drop('Customer Type',axis=1)
y = df['Customer Type']

In [20]:
original_shapes = (X.shape,y.shape)

In [19]:
# import library
from imblearn.under_sampling import TomekLinks
TomekLinker = TomekLinks(sampling_strategy='majority')

In [21]:
# fit predictor and target variable
x_tl, y_tl = TomekLinker.fit_resample(X, y)


ValueError: could not convert string to float: 'Male'
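
The error occurs because `X` still contains string columns, and the resampler needs numeric features to compute distances. One possible fix, sketched on a made-up frame, is to one-hot encode the categorical features with `pd.get_dummies` before resampling. The column list `cat_cols` and the frame `toy` are assumptions for illustration. On the real data, 'Type of Travel' and 'satisfaction' would also need encoding.

```python
import pandas as pd

# Toy stand-in; the values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Customer Type": ["Loyal Customer", "disloyal Customer",
                      "Loyal Customer", "Loyal Customer"],
    "Class": ["Eco Plus", "Business", "Eco", "Business"],
    "Age": [13, 25, 26, 61],
})

X = toy.drop("Customer Type", axis=1)
y = toy["Customer Type"]

# Categorical feature columns to encode (the target is left as-is).
cat_cols = ["Gender", "Class"]

# One indicator column per category; numeric columns pass through unchanged.
X_enc = pd.get_dummies(X, columns=cat_cols, dtype=int)

print(X_enc.dtypes)
```

The encoded frame `X_enc`, rather than the raw `X`, would then be passed to `TomekLinker.fit_resample`. Because Tomek links are based on nearest-neighbor distances, scaling features such as 'Flight Distance' first is probably worth considering, since otherwise they may dominate the distance.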