# Feature Engineering

In [1]:
import pandas as pd
import numpy as np
import pickle

A major step in the machine learning workflow is feature engineering. The goal is to give the model a set of informative features while leaving out uninformative ones, so that model performance is as good as possible. This can be done through *feature selection* (keeping the most useful existing features), *feature extraction* (combining existing features into stronger ones), or collecting new features from external data [(Aurélien Géron, 2023)](#ref-Geron2023). Only the first two methods are attempted in this project.

## Feature Selection

The simplest approach to feature engineering is *feature selection*. Variables that the EDA notebook showed to be correlated with the response variable, `Satisfaction`, are kept for modelling. The remaining features, along with the `id` column, are dropped.
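
The screening itself lives in the EDA notebook and is not repeated here. As a rough sketch of what a correlation-based screen could look like, assuming a numerically encoded target and a hypothetical cut-off of 0.3 (neither is taken from the EDA notebook), with made-up toy data:

```python
import pandas as pd

# Toy data standing in for the real training set (all values are made up).
toy = pd.DataFrame({
    "Seat Comfort": [1, 2, 4, 5, 5, 3],
    "Gate Location": [3, 3, 2, 3, 4, 3],
    "Satisfaction": [0, 0, 1, 1, 1, 0],  # assumed 0/1 encoding of the target
})

# Absolute Pearson correlation of each feature with the target.
corr = (
    toy.drop(columns="Satisfaction")
       .corrwith(toy["Satisfaction"])
       .abs()
       .sort_values(ascending=False)
)

threshold = 0.3  # hypothetical cut-off, chosen only for illustration
selected = corr[corr >= threshold].index.tolist()

print(corr)
print(selected)
```

In this toy frame `Seat Comfort` tracks the target closely and is kept, while `Gate Location` is uncorrelated with it and falls below the cut-off.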

In [19]:
# Importing the preprocessed data:
train = pd.read_pickle('../Data/Preprocessed/non_applicable_imputed.pkl')

In [20]:
train_fs = train.drop(['id', 'Departure/Arrival Time Convenient', 'Gate Location', 
                      'Departure Delay in Minutes', 'Arrival Delay in Minutes', 'Age'], axis = 1)
train_fs.shape

(103904, 21)

In [21]:
# Saving the 'feature selected' dataset to pickle to preserve data type information:
train_fs.to_pickle('../Data/Feature_Selection/train_fs.pkl')
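
The comment above gives dtype preservation as the reason for choosing pickle. As a minimal sketch of that point, assuming the data contains a column with a non-default dtype such as `category` (the toy frame and its `Class` values are made up for illustration), a pickle round trip keeps the dtype while a CSV round trip does not:

```python
import io
import os
import tempfile

import pandas as pd

# Toy frame with a categorical column (values are illustrative only).
toy = pd.DataFrame({"Class": pd.Categorical(["Eco", "Business", "Eco"])})

# Round trip through pickle in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

# Round trip through CSV text held in memory.
from_csv = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(from_pickle["Class"].dtype)  # still categorical
print(from_csv["Class"].dtype)     # no longer categorical
```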

## Feature Extraction

The count plots in the exploratory data analysis notebook revealed that the categorical service-related features all included 'premium' ratings at which passengers were mostly satisfied. These corresponded to categories '4' and '5' for most variables. Two new variables will be created: the first averages the service-related features into an overall score for the quality of service delivered, rounded to the nearest integer; the second is a binary variable indicating whether the overall service score is '4' or higher.

In [24]:
# Importing the preprocessed data:
train_oss = pd.read_pickle('../Data/Preprocessed/non_applicable_imputed.pkl')

In [25]:
# Service-related rating columns to aggregate:
service_cols = ['Departure/Arrival Time Convenient', 'Ease of Online Booking', 'Gate Location',
                'Food and Drink', 'Online Boarding', 'Seat Comfort', 'Inflight Entertainment',
                'On-board Service', 'Leg Room Service', 'Baggage Handling', 'Checkin Service']

# Row-wise mean of the ratings, rounded and stored as an integer score:
train_oss['Overall Service Score'] = (train_oss[service_cols].astype('int')
                                      .mean(axis = 1).round().astype('int'))

In [26]:
premium_service = {1: 0,
                   2: 0,
                   3: 0,
                   4: 1,
                   5: 1}
train_oss['Premium Service'] = train_oss['Overall Service Score'].replace(premium_service)
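
To make the two derived features concrete, the sketch below applies the same steps to a made-up three-row frame with only three rating columns (the values are illustrative, not taken from the dataset). It uses a threshold comparison in place of the dictionary lookup, which gives the same 0/1 flag for scores of 1 to 5:

```python
import pandas as pd

# Toy ratings on the 1-5 scale (values are made up).
toy = pd.DataFrame({
    "Seat Comfort": [5, 2, 4],
    "Food and Drink": [4, 1, 3],
    "Online Boarding": [5, 3, 4],
})

# Row-wise mean, rounded to the nearest integer score.
score = toy.mean(axis=1).round().astype("int")

# Premium flag: 1 when the overall score is 4 or higher, else 0.
premium = (score >= 4).astype("int")

print(score.tolist())    # [5, 2, 4]
print(premium.tolist())  # [1, 0, 1]

# Rounding uses round-half-to-even, so exact .5 means go to the even integer.
halves = pd.Series([2.5, 3.5, 4.5]).round()
print(halves.tolist())   # [2.0, 4.0, 4.0]
```

The last lines are worth keeping in mind: a mean of exactly 4.5 becomes a score of 4, not 5, which still counts as premium, while a mean of exactly 3.5 is rounded up to 4 and is also flagged as premium.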

In [27]:
# Saving the 'feature extracted' dataset to pickle to preserve data type information:
train_oss.to_pickle('../Data/Feature_Extraction/train_oss.pkl')

# References

1. <a id="ref-Geron2023"></a>Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (3rd Edition), O'Reilly Media Inc, 20 January 2023.