# Feature Selection

Once the data is preprocessed, we want to decide which features should actually be used.

This is important for several reasons, including:
1. `Occam's razor`: From a more *philosophical* point of view, 'the simplest of competing theories (models) should be preferred'.
2. Reducing the number of features without losing too much information can make model training and assessment more efficient.
3. Some methods assume that the features are uncorrelated, so we might need to remove correlated features
    and keep only the most significant one.
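
Point 3 can be made concrete with a small sketch. The toy DataFrame, its column names, the 0.9 cut-off and the helper `drop_correlated` are all assumptions made for illustration; none of them is part of this notebook's pipeline. The idea is to walk through the features from most to least correlated with the target and keep a feature only if it is not strongly correlated with one that has already been kept.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, target: pd.Series, cutoff: float = 0.9) -> pd.DataFrame:
    """Greedily keep, from each highly correlated group, the feature most correlated with the target."""
    # Rank features by absolute correlation with the target, strongest first.
    relevance = features.corrwith(target).abs().sort_values(ascending=False)
    corr = features.corr().abs()
    kept = []
    for name in relevance.index:
        # Keep a feature only if it is not too correlated with one already kept.
        if all(corr.loc[name, other] < cutoff for other in kept):
            kept.append(name)
    return features[kept]


rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.01, size=500),  # near-duplicate of "a"
    "b": rng.normal(size=500),                         # independent feature
})
toy_target = pd.Series(2 * a + rng.normal(scale=0.1, size=500))

reduced = drop_correlated(toy, toy_target)
print(list(reduced.columns))  # one of "a" / "a_noisy", plus "b"
```

Only one of the two near-duplicate columns survives, while the independent column `b` is kept.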

## Imports

In [91]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectFdr

## Global Parameters

In [92]:
LOAD_DATA_PATH = '../src/data/preprocessed/preprocess1.csv'
SAVE_DATA_PATH = '../src/data/preprocessed/selected_features.csv'
FEATURE_SELECTION_VARIANCE_THRESHOLD = 0.005
FEATURE_SELECTION_BH_ALPHA = 0.05

## Loading the preprocessed data

In [93]:
data = pd.read_csv(LOAD_DATA_PATH, sep=';', decimal=',')
data

Unnamed: 0,aircrafttype_221,aircrafttype_223,aircrafttype_295,aircrafttype_318,aircrafttype_319,aircrafttype_31B,aircrafttype_320,aircrafttype_321,aircrafttype_32A,aircrafttype_32B,...,sector_DK,sector_EG,sector_IQ,sector_IS,sector_MX,sector_NL,sector_QA,sector_SG,sector_US,loadfactor
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0.408451
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.189189
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0.570423
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0.333333
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0.204301
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36765,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0.522222
36766,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.532051
36767,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.602564
36768,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0.417989


In [94]:
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

print(f'X shape: {X.shape}')
print(f'y shape: {y.shape}')

X shape: (36770, 539)
y shape: (36770,)
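
The first column of `X` is named `Unnamed: 0`, which is the name pandas gives to a header-less column when reading a CSV. This usually means that a row index was written to the file together with the data, although this notebook does not show how `preprocess1.csv` was produced, so that remains an assumption. The sketch below uses a hypothetical toy frame to show how such a column arises and one way to drop it, so that a plain row counter is not treated as a feature.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for a preprocessed file that was saved together with its index.
toy = pd.DataFrame({"sector_US": [1, 0, 1], "loadfactor": [0.41, 0.19, 0.57]})
buffer = io.StringIO()
toy.to_csv(buffer, sep=';', decimal=',')  # the index is written as a header-less first column
buffer.seek(0)

reloaded = pd.read_csv(buffer, sep=';', decimal=',')
print(list(reloaded.columns))  # the saved index comes back as 'Unnamed: 0'

# Drop the column only if it really is a plain 0..n-1 row counter.
leaked = 'Unnamed: 0'
if leaked in reloaded.columns:
    is_row_counter = (reloaded[leaked].to_numpy() == np.arange(len(reloaded))).all()
    if is_row_counter:
        reloaded = reloaded.drop(columns=leaked)
print(list(reloaded.columns))
```

Passing `index_col=0` to `pd.read_csv` is an alternative that turns the column back into the index at load time.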


## Feature selection

First, we drop features with very low variance, since a column that is almost constant across all rows carries very little information for the prediction problem.

In [95]:
# Keep only the columns whose variance exceeds the threshold
selector = VarianceThreshold(threshold=FEATURE_SELECTION_VARIANCE_THRESHOLD)
X = pd.DataFrame(selector.fit_transform(X), columns=selector.get_feature_names_out())
X.shape

(36770, 254)

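Next, we apply the `Benjamini-Hochberg procedure` explained in the week 3 lecture, which selects features while keeping the expected false discovery rate below `FEATURE_SELECTION_BH_ALPHA`.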

In [96]:
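# SelectFdr applies Benjamini-Hochberg to per-feature p-values.
# Note: no score_func is passed, so scikit-learn's default (f_classif, an ANOVA F-test
# meant for class labels) is used; f_regression is the counterpart for a continuous target.
selector = SelectFdr(alpha=FEATURE_SELECTION_BH_ALPHA)
X = pd.DataFrame(selector.fit_transform(X, y), columns=selector.get_feature_names_out())
X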

Unnamed: 0,aircrafttype_223,aircrafttype_319,aircrafttype_320,aircrafttype_321,aircrafttype_32A,aircrafttype_32B,aircrafttype_32N,aircrafttype_32Q,aircrafttype_333,aircrafttype_359,...,scheduletime week number_5,scheduletime week number_51,scheduletime week number_6,scheduletime year_2021,scheduletime year_2022,seatcapacity,sector_CA,sector_IS,sector_QA,sector_US
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,142,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,74,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,142,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,72,0,0,0,1
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,186,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36765,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,1,180,1,0,0,0
36766,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,1,156,1,0,0,0
36767,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,1,156,1,0,0,0
36768,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,189,0,0,0,1
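
For reference, the Benjamini-Hochberg step-up rule sorts the `m` p-values, finds the largest `k` with `p_(k) <= (k / m) * alpha`, and selects every feature whose p-value is at most `p_(k)`. The sketch below implements that rule by hand and compares it with `SelectFdr` on synthetic data. The helper `benjamini_hochberg` and the toy data are assumptions for illustration. The sketch also passes `score_func=f_regression`, the univariate test scikit-learn provides for continuous targets, whereas `SelectFdr` uses `f_classif` when no score function is given.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFdr, f_regression


def benjamini_hochberg(pvalues: np.ndarray, alpha: float) -> np.ndarray:
    """Return a boolean mask of the hypotheses rejected at FDR level alpha."""
    m = len(pvalues)
    sorted_p = np.sort(pvalues)
    # Step-up rule: which sorted p-values lie on or under the line (k / m) * alpha?
    below = sorted_p <= alpha / m * np.arange(1, m + 1)
    if not below.any():
        return np.zeros(m, dtype=bool)
    # Reject every hypothesis whose p-value is at most the largest such p-value.
    return pvalues <= sorted_p[below].max()


# Synthetic regression data: 5 informative features out of 50 (toy assumption).
X_toy, y_toy = make_regression(
    n_samples=300, n_features=50, n_informative=5, noise=10.0, random_state=0
)

# Per-feature p-values from a univariate linear regression F-test.
_, pvalues = f_regression(X_toy, y_toy)
manual_mask = benjamini_hochberg(pvalues, alpha=0.05)

selector_toy = SelectFdr(score_func=f_regression, alpha=0.05).fit(X_toy, y_toy)
sklearn_mask = selector_toy.get_support()

print(manual_mask.sum(), sklearn_mask.sum())
```

Both masks select the same features, which shows that `SelectFdr` is the step-up rule applied to the p-values of whichever score function it is given.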


In [97]:
# Copy X so that adding the target column does not modify X itself
final_data = X.copy()
final_data['LoadFactor'] = y
final_data

Unnamed: 0,aircrafttype_223,aircrafttype_319,aircrafttype_320,aircrafttype_321,aircrafttype_32A,aircrafttype_32B,aircrafttype_32N,aircrafttype_32Q,aircrafttype_333,aircrafttype_359,...,scheduletime week number_51,scheduletime week number_6,scheduletime year_2021,scheduletime year_2022,seatcapacity,sector_CA,sector_IS,sector_QA,sector_US,LoadFactor
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,142,0,0,0,1,0.408451
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,74,1,0,0,0,0.189189
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,142,0,0,0,1,0.570423
3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,72,0,0,0,1,0.333333
4,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,186,0,0,0,1,0.204301
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36765,0,0,1,0,0,0,0,0,0,0,...,0,1,0,1,180,1,0,0,0,0.522222
36766,0,1,0,0,0,0,0,0,0,0,...,0,1,0,1,156,1,0,0,0,0.532051
36767,0,1,0,0,0,0,0,0,0,0,...,0,1,0,1,156,1,0,0,0,0.602564
36768,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,189,0,0,0,1,0.417989


In [98]:
final_data.to_csv(SAVE_DATA_PATH, sep=',', decimal='.')
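
By default, `DataFrame.to_csv` also writes the row index, and the file saved here uses `sep=','` and `decimal='.'`, whereas the input file was read with `sep=';'` and `decimal=','`. A later step that loads `selected_features.csv` therefore needs to use the matching settings. The sketch below shows a save and reload without the index, using a hypothetical toy frame and a temporary file rather than the real paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_final = pd.DataFrame({
    "seatcapacity": [142, 74, 186],
    "sector_US": [1, 0, 1],
    "LoadFactor": [0.408451, 0.189189, 0.204301],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "selected_features_toy.csv"  # hypothetical file name
    # index=False keeps the row counter out of the file.
    toy_final.to_csv(path, sep=',', decimal='.', index=False)
    # Reading must use the same separator and decimal mark that were used for writing.
    round_trip = pd.read_csv(path, sep=',', decimal='.')

print(round_trip.columns.tolist())
```

With `index=False`, the reloaded frame has exactly the original columns and no extra `Unnamed: 0` column.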