# Feature Engineering

In [88]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import chi2_contingency

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import iqr
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

import pickle
import os

Feature engineering is a major step in the machine learning cycle. The goal is to give the model informative features while leaving out uninformative ones. This can be done through *feature selection* (keeping the most useful existing features), *feature extraction* (combining existing features into stronger ones), or collecting new features from external data (Ref 1). Only the first two methods are attempted in this project.

In [121]:
# Importing the preprocessed data:
train = pd.read_pickle('../Data/Preprocessed_1/train_preprocessed_1.pkl')

In [122]:
change = {'satisfied': 1, 'neutral or dissatisfied': 2}
train['Satisfaction'] = train['Satisfaction'].replace(change).infer_objects(copy=False)
train['Satisfaction'] = train['Satisfaction'].astype('int')

# Convert to ordered categorical:
train['Satisfaction'] = train['Satisfaction'].astype('category')
train['Satisfaction'] = train['Satisfaction'].cat.set_categories(new_categories = [1, 2], ordered = True)


In [123]:
# One-Hot Encoding:
train = pd.get_dummies(train, columns = ['Gender', 'Customer Type', 'Type of Travel'], prefix = 'Dummy', dtype = int)

In [125]:
# Saving the first preprocessed dataset to pickle to preserve data type information:
# train.to_pickle('../Data/Preprocessed_1/train_preprocessed_1_dummies.pkl')

In [161]:
# Importing the preprocessed data:
train = pd.read_pickle('../Data/Preprocessed_1/train_preprocessed_1_dummies.pkl')
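
The comments above mention saving to pickle to preserve data type information. As a minimal sketch of what that means, the following compares a pickle round trip with a CSV round trip on a small invented frame. The frame `toy` and the temporary file name are illustrative assumptions, not part of the project data.

```python
import io
import os
import tempfile

import pandas as pd

# Toy frame with an ordered categorical column (illustrative data only).
toy = pd.DataFrame({'Satisfaction': [1, 2, 2, 1]})
toy['Satisfaction'] = pd.Categorical(toy['Satisfaction'], categories=[1, 2], ordered=True)

# Round trip through pickle: the categorical dtype and its ordering survive.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(path)
    from_pickle = pd.read_pickle(path)

# Round trip through CSV: the column is re-inferred from text and
# loses its categorical dtype.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(from_pickle['Satisfaction'].dtype)
print(from_csv['Satisfaction'].dtype)
```

With CSV, the categorical conversion would have to be repeated after every load, which is presumably why pickle is used throughout this notebook.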

## Feature Selection

The simplest approach to feature engineering is *feature selection*. Variables that the EDA notebook showed to be associated with the response variable, `Satisfaction`, are included in the model; all other features are left out.
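
The EDA notebook is not reproduced here, so the exact checks it used are not shown. As a hedged sketch of one way the association between a categorical feature and `Satisfaction` might be tested, the following runs a chi-square test of independence with `chi2_contingency` (imported at the top of this notebook) on invented counts. The frame `toy` and its values are assumptions made purely for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented data: 80 passengers split evenly across two travel classes.
toy = pd.DataFrame({
    'Class': ['Business'] * 40 + ['Economy'] * 40,
    'Satisfaction': (['satisfied'] * 30 + ['neutral or dissatisfied'] * 10
                     + ['satisfied'] * 10 + ['neutral or dissatisfied'] * 30),
})

# Cross-tabulate the feature against the response.
contingency = pd.crosstab(toy['Class'], toy['Satisfaction'])

# Chi-square test of independence; a small p-value suggests the
# feature and the response are associated.
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(contingency)
print(f'chi2 = {chi2:.2f}, p = {p_value:.4g}, dof = {dof}')
```

A feature with a small p-value would be a candidate to keep under this approach. For numeric features such as `Age`, a different check would be needed.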

In [170]:
# Importing the preprocessed data:
train = pd.read_pickle('../Data/Preprocessed_1/train_preprocessed_1.pkl')

In [171]:
selected_columns = [
    'Age', 'Class', 'Flight Distance', 'Inflight Wifi Service', 'Ease of Online Booking', 'Food and Drink',
    'Online Boarding', 'Seat Comfort', 'Inflight Entertainment', 'On-board Service', 'Leg Room Service',
    'Baggage Handling', 'Checkin Service', 'Inflight Service', 'Cleanliness', 'Type of Travel',
    'Customer Type', 'Gender', 'Satisfaction',
]
train_fs = train[selected_columns]

train_fs.shape

(98860, 19)

Before saving the dataset, nominal variables will be one-hot encoded.

In [172]:
train_fs = pd.get_dummies(train_fs, columns = ['Gender', 'Customer Type', 'Type of Travel'], prefix = 'Dummy', dtype = int)
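
To show what this encoding produces, here is a small sketch on an invented frame. The category labels below are assumptions for illustration and may not match the labels in the real data.

```python
import pandas as pd

# Invented frame with two nominal columns and one numeric column.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Type of Travel': ['Business travel', 'Personal Travel', 'Business travel'],
    'Age': [34, 51, 27],
})

# The same prefix is applied to every encoded column, as in the cell above.
encoded = pd.get_dummies(toy, columns=['Gender', 'Type of Travel'], prefix='Dummy', dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Because a single prefix is shared, the new column names carry the category label but not the name of the column they came from. This works as long as no two encoded columns share a category label; otherwise duplicate column names would result.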

In [173]:
change = {'satisfied': 1, 'neutral or dissatisfied': 2}
train_fs['Satisfaction'] = train_fs['Satisfaction'].replace(change).infer_objects(copy=False)
train_fs['Satisfaction'] = train_fs['Satisfaction'].astype('int')

# Convert to ordered categorical:
train_fs['Satisfaction'] = train_fs['Satisfaction'].astype('category')
train_fs['Satisfaction'] = train_fs['Satisfaction'].cat.set_categories(new_categories = [1, 2], ordered = True)


In [175]:
# Saving the 'feature selected' dataset to pickle to preserve data type information:
# train_fs.to_pickle('../Data/Feature_Selection/train_fs.pkl')

In [176]:
# Importing the feature-selected data:
train_fs = pd.read_pickle('../Data/Feature_Selection/train_fs.pkl')

## Feature Extraction

The count plots in the exploratory data analysis notebook showed that the categorical service-related features all had 'premium' ratings with which passengers were mostly satisfied. For most variables these were categories '4' and '5'. Two new variables are created from this observation. The first, `Overall Service Score`, is the rounded mean of the service-related ratings and gives an overall score for the quality of service provided. The second, `Premium Service`, is a binary variable indicating whether the overall service score is '4' or higher.

In [152]:
# Importing the preprocessed data:
train_oss = pd.read_pickle('../Data/Preprocessed_1/train_preprocessed_1_dummies.pkl')

In [153]:
# Service-related rating columns that feed the overall score:
service_columns = ['Departure/Arrival Time Convenient', 'Ease of Online Booking', 'Gate Location',
                   'Food and Drink', 'Online Boarding', 'Seat Comfort', 'Inflight Entertainment',
                   'On-board Service', 'Leg Room Service', 'Baggage Handling', 'Checkin Service']

# Row-wise mean of the ratings, rounded to the nearest integer:
train_oss['Overall Service Score'] = train_oss[service_columns].astype('int').mean(axis = 1).round()

train_oss['Overall Service Score'] = train_oss['Overall Service Score'].astype('int')

In [154]:
premium_service = {1: 0,
                   2: 0,
                   3: 0,
                   4: 1,
                   5: 1}
train_oss['Premium Service'] = train_oss['Overall Service Score'].replace(premium_service)
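
The two derived columns can be illustrated on a small invented frame. This is only a sketch: the frame `toy` uses three of the service columns with made-up ratings, and the binary flag is computed with a `>= 4` comparison rather than the dictionary mapping above, which gives the same result for scores from 1 to 5.

```python
import pandas as pd

# Invented ratings for three passengers (illustrative only).
toy = pd.DataFrame({
    'Seat Comfort': [5, 2, 4],
    'Food and Drink': [4, 3, 3],
    'Online Boarding': [5, 1, 4],
})
service_columns = ['Seat Comfort', 'Food and Drink', 'Online Boarding']

# Row-wise mean of the ratings, rounded to the nearest integer.
toy['Overall Service Score'] = toy[service_columns].astype(int).mean(axis=1).round().astype(int)

# Flag rows whose overall score is 4 or higher.
toy['Premium Service'] = (toy['Overall Service Score'] >= 4).astype(int)

print(toy)
```

The row means here are about 4.67, 2.00 and 3.67, so the scores are 5, 2 and 4, and the first and third passengers are flagged as premium.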

In [156]:
# Saving the 'feature extracted' dataset to pickle to preserve data type information:
# train_oss.to_pickle('../Data/Feature_Extraction/train_oss.pkl')

In [157]:
# Importing the feature-extracted data:
train_oss = pd.read_pickle('../Data/Feature_Extraction/train_oss.pkl')

# References

1. Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd Edition). Kindle