# Feature Selection

Once the data has been preprocessed, we need to decide which features to actually use.
<br><br>
This matters for several reasons, including:
1. `Occam's Razor`: from a more *philosophical* point of view, 'the simplest of competing theories (models) should be preferred'.
2. Reducing the number of features without losing too much information can make model training and assessment more efficient.
3. Some methods assume that the features are uncorrelated, so we may need to remove correlated features
    and keep only the most significant one.

## Imports

In [10]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectFdr, f_regression, mutual_info_regression

## Global Parameters

In [2]:
LOAD_DATA_PATH = '../src/data/preprocessed/WeeklyDepartureDerivatives.csv'
SAVE_DATA_PATH = '../src/data/preprocessed/selected_features.csv'
FEATURE_SELECTION_VARIANCE_THRESHOLD = 0.005
FEATURE_SELECTION_BH_ALPHA = 0.00005

## Loading the preprocessed data

In [19]:
data = pd.read_csv(LOAD_DATA_PATH)
data

Unnamed: 0,aircrafttype_223,aircrafttype_319,aircrafttype_320,aircrafttype_321,aircrafttype_32A,aircrafttype_32B,aircrafttype_32N,aircrafttype_32Q,aircrafttype_333,aircrafttype_359,...,scheduletime week number_51,scheduletime week number_6,scheduletime year_2021,scheduletime year_2022,seatcapacity,sector_CA,sector_IS,sector_QA,sector_US,loadfactor
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,142.0,0.0,0.0,0.0,1.0,0.408451
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,74.0,1.0,0.0,0.0,0.0,0.189189
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,142.0,0.0,0.0,0.0,1.0,0.570423
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,72.0,0.0,0.0,0.0,1.0,0.333333
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,186.0,0.0,0.0,0.0,1.0,0.204301
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36765,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,180.0,1.0,0.0,0.0,0.0,0.522222
36766,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,156.0,1.0,0.0,0.0,0.0,0.532051
36767,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,156.0,1.0,0.0,0.0,0.0,0.602564
36768,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,189.0,0.0,0.0,0.0,1.0,0.417989


In [20]:
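# Drop incomplete rows before splitting so that X and y stay aligned
data = data.dropna()
X = data.iloc[:, :-1]  # all columns except the last
y = data.iloc[:, -1]   # target: loadfactor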

print(f'X shape: {X.shape}')
print(f'y shape: {y.shape}')

X shape: (36770, 171)
y shape: (36770,)
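
A small sketch of why missing values are best handled before the split: if rows are dropped from the features only, the feature matrix and the target can end up with different lengths. The toy frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): one missing feature value in row 1
toy = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0],
    "feature_b": [0.0, 1.0, 0.0],
    "target": [0.4, 0.2, 0.6],
})

# Dropping NaNs on the features only leaves the target one row longer
X_only = toy.iloc[:, :-1].dropna()
y_all = toy.iloc[:, -1]
print(X_only.shape, y_all.shape)  # (2, 2) (3,)

# Dropping on the full frame first keeps both parts aligned
clean = toy.dropna()
X_toy, y_toy = clean.iloc[:, :-1], clean.iloc[:, -1]
print(X_toy.shape, y_toy.shape)   # (2, 2) (2,)
```

In this notebook both shapes report 36770 rows, so the feature columns apparently contain no missing values and the two approaches give the same result here.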


## Feature selection

First, we drop features with very low variance: a feature that is almost constant across all rows carries little useful information for the prediction problem.

In [21]:
# Keep only the columns whose variance is above the threshold
selector = VarianceThreshold(threshold=FEATURE_SELECTION_VARIANCE_THRESHOLD)
X = pd.DataFrame(selector.fit_transform(X), columns=selector.get_feature_names_out())
X.shape

(36770, 171)

Next, we apply the `Benjamini-Hochberg procedure` explained in the week 3 lecture, which controls the false discovery rate across the per-feature significance tests.
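
As a rough sketch of the procedure: sort the `m` p-values in ascending order, find the largest rank `k` whose p-value is at most `(k / m) * alpha`, and keep every feature whose p-value is at or below that one. The helper `benjamini_hochberg_mask` and the synthetic data are assumptions made for illustration. Note that `SelectFdr` needs a score function that returns p-values, such as `f_regression`; `mutual_info_regression` returns scores only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFdr, f_regression


def benjamini_hochberg_mask(pvalues, alpha):
    """Return a boolean mask of the p-values accepted by the BH step-up rule."""
    pvalues = np.asarray(pvalues, dtype=float)
    m = pvalues.size
    sorted_p = np.sort(pvalues)
    # BH thresholds: (k / m) * alpha for rank k = 1..m
    thresholds = alpha / m * np.arange(1, m + 1)
    below = sorted_p <= thresholds
    if not below.any():
        return np.zeros(m, dtype=bool)
    # Largest sorted p-value that still sits under its own threshold
    cutoff = sorted_p[below].max()
    return pvalues <= cutoff


# Synthetic data (hypothetical): 20 features, 5 of them informative
X_toy, y_toy = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Univariate F-test p-value for each feature
_, pvalues = f_regression(X_toy, y_toy)
manual_mask = benjamini_hochberg_mask(pvalues, alpha=0.05)

# The same selection through scikit-learn
sk_mask = SelectFdr(score_func=f_regression, alpha=0.05).fit(X_toy, y_toy).get_support()
print(manual_mask.sum(), np.array_equal(manual_mask, sk_mask))
```

The hand-written mask is only meant to make the rule visible; in the notebook itself `SelectFdr` does this work.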

In [26]:
X.columns


In [11]:
# SelectFdr needs per-feature p-values, which f_regression provides
# (mutual_info_regression returns scores only and fails with a TypeError)
selector = SelectFdr(score_func=f_regression, alpha=FEATURE_SELECTION_BH_ALPHA)
X = pd.DataFrame(selector.fit_transform(X, y), columns=selector.get_feature_names_out())
X

In [9]:
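# Copy so that adding the target column does not modify X
final_data = X.copy()
# Assign by position, since X was rebuilt with a fresh default index
final_data['loadfactor'] = y.to_numpy()
final_data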

Unnamed: 0,aircrafttype_319,aircrafttype_320,aircrafttype_321,aircrafttype_32A,aircrafttype_32B,aircrafttype_32N,aircrafttype_32Q,aircrafttype_333,aircrafttype_359,aircrafttype_73H,...,scheduletime week number_3,scheduletime week number_4,scheduletime week number_41,scheduletime week number_6,scheduletime year_2021,scheduletime year_2022,seatcapacity,sector_CA,sector_QA,loadfactor
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,142.0,0.0,0.0,0.408451
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,74.0,1.0,0.0,0.189189
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,142.0,0.0,0.0,0.570423
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,72.0,0.0,0.0,0.333333
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,186.0,0.0,0.0,0.204301
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36765,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,180.0,1.0,0.0,0.522222
36766,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,156.0,1.0,0.0,0.532051
36767,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,156.0,1.0,0.0,0.602564
36768,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,189.0,0.0,0.0,0.417989


In [29]:
final_data.to_csv(SAVE_DATA_PATH, sep=',', decimal='.')

## Correlation

In [46]:
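# Pearson correlation between every pair of columns
cor = final_data.corr()
# Absolute correlation of each column with the target
cor_target = cor['loadfactor'].abs()
# Keep columns with |r| > 0.5; this includes loadfactor itself (r = 1)
relevant_features = cor_target[cor_target > 0.5]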

In [47]:
relevant_features.shape

(1,)
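
A shape of `(1,)` is consistent with only `loadfactor` itself passing the filter, since every column has a correlation of 1 with itself. The sketch below shows the same Pearson filter on made-up toy data, with the target removed before thresholding so that only real features can be reported. The column names and the relationship between them are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500
strong = rng.normal(size=n_rows)
weak = rng.normal(size=n_rows)
toy = pd.DataFrame({
    "strong_feature": strong,
    "weak_feature": weak,
    # Target depends mostly on strong_feature (hypothetical relationship)
    "target": 2.0 * strong + 0.1 * weak + rng.normal(scale=0.5, size=n_rows),
})

cor_toy = toy.corr()  # Pearson is the default method

# Drop the target's correlation with itself before thresholding
cor_with_target = cor_toy["target"].drop("target").abs()
relevant_toy = cor_with_target[cor_with_target > 0.5]
print(relevant_toy)
```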

In [49]:
cor[['d_departure', 'dd_departure']]

Unnamed: 0,d_departure,dd_departure
aircrafttype_223,0.002613,0.000225
aircrafttype_319,0.006157,-0.002267
aircrafttype_320,0.006214,0.009445
aircrafttype_321,-0.005382,-0.005520
aircrafttype_32A,-0.003732,-0.011341
...,...,...
sector_CA,-0.001256,-0.001737
sector_IS,0.001579,-0.004007
sector_QA,-0.007973,-0.000039
sector_US,0.004861,0.003535
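
The third reason from the introduction, removing correlated features and keeping the most significant one, could be sketched as follows. This is one possible greedy approach, not something the cells above implement; the helper `drop_correlated_features`, the 0.9 threshold, and the toy columns are all assumptions.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(features, target, threshold=0.9):
    """Greedy sketch: for each feature pair with |r| > threshold,
    drop the one that is less correlated with the target."""
    corr = features.corr().abs()
    target_corr = features.corrwith(target).abs()
    to_drop = set()
    cols = list(features.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    return features.drop(columns=sorted(to_drop))


# Toy data (hypothetical): x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x3 = rng.normal(size=300)
features_toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=300),
    "x3": x3,
})
target_toy = pd.Series(x1 + x3 + rng.normal(scale=0.1, size=300))

reduced = drop_correlated_features(features_toy, target_toy)
print(list(reduced.columns))
```

Only one of `x1` and `x2` survives, together with `x3`. Which of the two near-duplicates is kept depends on which happens to correlate more strongly with the target.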
