# Feature Engineering

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import chi2_contingency, iqr

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (
    cross_val_score, KFold, train_test_split, GridSearchCV, RandomizedSearchCV
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report

import pickle
import os

A major step in the machine learning cycle is feature engineering. The goal is to find a set of features that is informative enough for the model to perform well while leaving out irrelevant ones. This can be done through *feature selection* (choosing the most useful existing features), *feature extraction* (combining existing features into stronger ones), or collecting new features from external data (Ref 1). Only the first two methods are attempted in this project.

## Feature Selection

The simplest approach to feature engineering is *feature selection*. Variables that the EDA notebook showed to be correlated with the response variable `Satisfaction` are included in the model; all other features are left out.
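The EDA notebook is not reproduced here, so the exact test behind the selection is an assumption. Since `chi2_contingency` is imported above, one plausible check for a categorical feature is a chi-square test of independence against the response. The sketch below uses a small invented frame (not the project data) and an illustrative 0.05 threshold:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data, invented for illustration only; not taken from the project dataset.
toy = pd.DataFrame({
    'Type of Travel': ['Business travel'] * 10 + ['Personal travel'] * 10,
    'Satisfaction': ['satisfied'] * 9 + ['neutral or dissatisfied']
                    + ['satisfied'] + ['neutral or dissatisfied'] * 9,
})

# Contingency table of feature levels against the response.
contingency = pd.crosstab(toy['Type of Travel'], toy['Satisfaction'])

# chi2_contingency returns the statistic, p-value, degrees of freedom
# and the expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Assumption: 0.05 is used here as an illustrative significance threshold.
keep_feature = p_value < 0.05
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}, keep={keep_feature}")
```

A small p-value suggests the feature and the response are not independent, which makes the feature a candidate to keep. For numeric features a different measure would be needed.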

In [2]:
# Importing the preprocessed data:
train = pd.read_pickle('../Data/Preprocessed_1/train_preprocessed_1.pkl')
test = pd.read_pickle('../Data/Preprocessed_1/test_preprocessed_1.pkl')

In [6]:
# Keep only the features that the EDA found to be related to `Satisfaction`:
train_fs = train[['Age', 'Class', 'Flight Distance', 'Inflight Wifi Service', 'Ease of Online Booking',
                  'Food and Drink', 'Online Boarding', 'Seat Comfort', 'Inflight Entertainment',
                  'On-board Service', 'Leg Room Service', 'Baggage Handling', 'Checkin Service',
                  'Inflight Service', 'Cleanliness', 'Type of Travel', 'Customer Type', 'Gender']]

train_fs.shape

(98860, 18)

Before saving the dataset, nominal variables will be one-hot encoded.

In [8]:
train_fs = pd.get_dummies(train, columns = ['Gender', 'Customer Type', 'Type of Travel'], prefix = 'Dummy', dtype = int)

train_fs.head()

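| Unnamed: 0 | id | Age | Class | Flight Distance | Inflight Wifi Service | Departure/Arrival Time Convenient | Ease of Online Booking | Gate Location | Food and Drink | Online Boarding | ... | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | Satisfaction | Dummy_Female | Dummy_Male | Dummy_Disloyal customer | Dummy_Loyal customer | Dummy_Business travel | Dummy_Personal travel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70172 | 13 | 2 | 460 | 3 | 4 | 3 | 1 | 5 | 3 | ... | 5 | 25 | 18.0 | neutral or dissatisfied | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 5047 | 25 | 3 | 235 | 3 | 2 | 3 | 3 | 1 | 3 | ... | 1 | 1 | 6.0 | neutral or dissatisfied | 0 | 1 | 1 | 0 | 1 | 0 |
| 2 | 110028 | 26 | 3 | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | ... | 5 | 0 | 0.0 | satisfied | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 24026 | 25 | 3 | 562 | 2 | 5 | 5 | 5 | 2 | 2 | ... | 2 | 11 | 9.0 | neutral or dissatisfied | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 119299 | 61 | 3 | 214 | 3 | 3 | 3 | 3 | 4 | 5 | ... | 3 | 0 | 0.0 | satisfied | 0 | 1 | 0 | 1 | 1 | 0 |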
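The displayed frame still contains columns such as `id` and the two delay columns, because `get_dummies` is called on `train` rather than on the `train_fs` subset selected earlier. If the intent is to encode only the selected features, a minimal sketch could look like the following. The toy rows and the names `toy`, `toy_fs`, `toy_encoded` and `nominal_cols` are invented for illustration:

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'Age': [13, 25, 26],
    'Gender': ['Male', 'Male', 'Female'],
    'Customer Type': ['Loyal customer', 'Disloyal customer', 'Loyal customer'],
    'Type of Travel': ['Personal travel', 'Business travel', 'Business travel'],
})

nominal_cols = ['Gender', 'Customer Type', 'Type of Travel']

# Select the subset first, then encode that subset rather than the full frame.
toy_fs = toy[['Age'] + nominal_cols]
toy_encoded = pd.get_dummies(toy_fs, columns=nominal_cols, prefix='Dummy', dtype=int)

# 'id' is no longer present; each nominal column is replaced by 0/1 indicators.
print(toy_encoded.columns.tolist())
```

Note that a subset built this way would also need the `Satisfaction` column carried along (or stored separately) before the data can be used for training.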


In [9]:
# Saving the 'feature selected' dataset to pickle to preserve data type information:
# train_fs.to_pickle('../Data/Feature_Selection/train_fs.pkl')

In [10]:
# Importing the feature-selected data:
train_fs = pd.read_pickle('../Data/Feature_Selection/train_fs.pkl')
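To illustrate why pickle is used here rather than CSV, the sketch below round-trips a small invented frame through both formats and compares the resulting dtypes. The ordered categorical `Class` column and the file names are assumptions made for the example, not the project's actual encoding:

```python
import os
import tempfile

import pandas as pd

# Toy frame, invented for illustration; 'Class' is an ordered categorical here.
toy = pd.DataFrame({
    'Class': pd.Categorical(['Eco', 'Business', 'Eco Plus'],
                            categories=['Eco', 'Eco Plus', 'Business'],
                            ordered=True),
    'Age': [13, 25, 26],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy.pkl')
    csv_path = os.path.join(tmp_dir, 'toy.csv')

    # Write the same frame in both formats and read each back.
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle round trip keeps the categorical dtype and its ordering;
# the CSV round trip reads the column back as plain strings.
print(from_pickle['Class'].dtype)
print(from_csv['Class'].dtype)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.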

## Feature Extraction

Here, feature interactions are added in an attempt to improve the models' performance. The first new feature is `Total Delay`, which combines the departure delay and the arrival delay of each flight.
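As a minimal sketch, and assuming that "combination" means a plain sum of the two delay columns, `Total Delay` could be computed as follows. The delay values are copied from the five rows displayed earlier in this notebook, and the frame name `delays` is hypothetical:

```python
import pandas as pd

# Delay values copied from the five rows displayed earlier in this notebook.
delays = pd.DataFrame({
    'Departure Delay in Minutes': [25, 1, 0, 11, 0],
    'Arrival Delay in Minutes': [18.0, 6.0, 0.0, 9.0, 0.0],
})

# Assumption: "combination" is read here as a plain sum of the two delays.
delays['Total Delay'] = (
    delays['Departure Delay in Minutes'] + delays['Arrival Delay in Minutes']
)

print(delays['Total Delay'].tolist())  # [43.0, 7.0, 0.0, 20.0, 0.0]
```

If either delay column contains missing values, the sum is missing as well, so such rows would need to be handled before or after this step.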

# References

1. Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*, 3rd Edition. Kindle edition.