# Feature Selection Phase
In this notebook I perform feature selection on our train and test sets so that our models produce more reliable predictions.

**Author**: Arthur G.
***

## Loading Dependencies
Here I'm loading all the dependencies for this notebook.

In [1]:
# adding custom functions
import sys
sys.path.append('../')

# libs
import os
import joblib
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# settings
seed = 42  # np.random.seed() returns None, so keep the integer in its own variable
np.random.seed(seed)
pd.set_option('display.precision', 3)
pd.set_option('display.max_columns', None)

## Loading Train and Test Sets
Here I'm loading our serialized train and test sets.

In [2]:
# loading predictors
x_train = pd.read_csv(os.path.join('..', 'data', 'processed', 'x_train.csv'))
x_test = pd.read_csv(os.path.join('..', 'data', 'processed', 'x_test.csv'))

# loading targets
y_train = pd.read_csv(os.path.join('..', 'data', 'processed', 'y_train.csv'))
y_test = pd.read_csv(os.path.join('..', 'data', 'processed', 'y_test.csv'))

x_train.head()

Unnamed: 0,DEPARTURE_ARRIVAL_DURATION,ARRIVAL_DEPARTURE_DURATION,STOPOVERS,VESSEL_TYPE,HULL_MATERIAL,VESSEL_LENGTH,VESSEL_BEAN,VESSEL_DRAFT,VESSEL_DEPTH,MOTOR_POWER,SPEED,NUM_PROPELLERS,NUM_GENERATORS,DWT,DWT_na,LIGHT_DISPLACEMENT,LIGHT_DISPLACEMENT_na,CREW,PASS_CAPACITY,LOAD_CAPACITY
0,0.387,0.399,0.0,0.833,1.0,0.691,0.643,0.44,0.364,0.088,0.383,0.25,0.4,0.113,0.0,0.542,0.0,0.381,0.158,0.098
1,0.852,0.844,0.091,0.667,0.667,0.494,0.498,0.325,0.364,0.045,0.124,0.25,0.4,0.102,1.0,0.558,1.0,0.143,0.033,0.109
2,0.852,0.862,0.0,0.667,0.667,0.441,0.643,0.325,0.186,0.028,0.792,0.25,0.4,0.102,1.0,0.558,1.0,0.095,0.025,0.013
3,0.603,0.616,0.0,0.667,0.667,0.505,0.445,0.456,0.382,0.106,0.548,0.25,0.4,0.022,0.0,0.251,0.0,0.238,0.05,0.019
4,0.812,0.799,0.0,0.667,0.667,0.303,0.217,0.259,0.213,0.045,0.225,0.0,0.2,0.102,1.0,0.558,1.0,0.048,0.013,0.0


## Lasso for Feature Selection
Before understanding the use of LASSO for feature selection, let's first state what regularization is.

Regularization is a technique in machine learning and statistics that reduces overfitting by adding a penalty on the size of the model's coefficients to the loss function. The penalty trades a small increase in bias for lower variance, which yields a simpler model that generalizes better to unseen data.

LASSO (Least Absolute Shrinkage and Selection Operator) uses the L1 penalty to shrink its coefficients towards zero and eliminate weak predictors from the equation. The L2 penalty is described after it for comparison.

### L1 Regularization
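This penalty is proportional to the sum of the absolute values of the coefficients, $\alpha \sum_j |w_j|$, where $w_j$ are the coefficients and $\alpha$ is the regularization strength (the `alpha` argument of `Lasso`). It can drive the coefficients of weak predictors to exactly zero, which excludes them from the model.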

### L2 Regularization
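This penalty is proportional to the sum of the squared coefficients and is the one used by ridge regression, not by LASSO. It shrinks coefficients towards zero but generally does not set them to exactly zero, so on its own it does not perform feature selection.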
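
The difference between the two penalties can be seen in a small sketch. It uses synthetic data (an assumption for illustration, not our vessel dataset) in which only the first three of ten features influence the target, and it compares how many coefficients `Lasso` (L1) and `Ridge` (L2) set to exactly zero. The `alpha` values here are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data, for illustration only: 10 features, 3 of them informative.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# Fit one model per penalty type.
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("zero coefficients with L1:", n_zero_lasso)
print("zero coefficients with L2:", n_zero_ridge)
```

The L1 model should zero out uninformative features while keeping the informative ones, whereas the L2 model keeps small but nonzero weights on all of them.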

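Feature selection with LASSO therefore comes down to checking which coefficients are nonzero. `SelectFromModel` keeps the features whose absolute coefficient is at least a small threshold (1e-5 by default for L1-penalized estimators such as `Lasso`).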

In [3]:
selector_ = SelectFromModel(Lasso(alpha=0.001, random_state=seed))
selector_.fit(x_train, y_train)
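
To see why a feature was kept or dropped, it can help to look at the fitted coefficients directly. Below is a minimal sketch on a toy frame; the names `demo_x`, `demo_y`, `demo_selector` and the `F_*` columns are hypothetical stand-ins for `x_train`, `y_train`, `selector_` and our real columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Hypothetical toy data: only F_A and F_C drive the target.
rng = np.random.default_rng(0)
demo_x = pd.DataFrame(rng.uniform(size=(300, 4)),
                      columns=["F_A", "F_B", "F_C", "F_D"])
demo_y = 2.0 * demo_x["F_A"] - 1.0 * demo_x["F_C"] + rng.normal(scale=0.01, size=300)

demo_selector = SelectFromModel(Lasso(alpha=0.001, random_state=42))
demo_selector.fit(demo_x, demo_y)

# estimator_ is the fitted Lasso; coef_ holds one weight per input column.
coefs = pd.Series(demo_selector.estimator_.coef_, index=demo_x.columns)

# get_support() is the boolean mask of the columns that were kept.
kept = demo_x.columns[demo_selector.get_support()]

print(coefs)
print(list(kept))
```

The same two attributes, `estimator_.coef_` and `get_support()`, are available on the selector fitted in this notebook.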

Now let's see which columns we still have.

In [9]:
selected_features = x_train.columns[(selector_.get_support())]
x_train[selected_features].head()

Unnamed: 0,DEPARTURE_ARRIVAL_DURATION,STOPOVERS,VESSEL_TYPE,HULL_MATERIAL,VESSEL_LENGTH,VESSEL_BEAN,VESSEL_DRAFT,VESSEL_DEPTH,MOTOR_POWER,SPEED,NUM_PROPELLERS,NUM_GENERATORS,DWT,DWT_na,LIGHT_DISPLACEMENT,CREW,PASS_CAPACITY,LOAD_CAPACITY
0,0.387,0.0,0.833,1.0,0.691,0.643,0.44,0.364,0.088,0.383,0.25,0.4,0.113,0.0,0.542,0.381,0.158,0.098
1,0.852,0.091,0.667,0.667,0.494,0.498,0.325,0.364,0.045,0.124,0.25,0.4,0.102,1.0,0.558,0.143,0.033,0.109
2,0.852,0.0,0.667,0.667,0.441,0.643,0.325,0.186,0.028,0.792,0.25,0.4,0.102,1.0,0.558,0.095,0.025,0.013
3,0.603,0.0,0.667,0.667,0.505,0.445,0.456,0.382,0.106,0.548,0.25,0.4,0.022,0.0,0.251,0.238,0.05,0.019
4,0.812,0.0,0.667,0.667,0.303,0.217,0.259,0.213,0.045,0.225,0.0,0.2,0.102,1.0,0.558,0.048,0.013,0.0
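
The same column subset has to be applied to the test set, using the selection learned on the training data rather than refitting on the test data. A minimal sketch, assuming toy frames `toy_train` and `toy_test` with hypothetical `F_*` columns in place of `x_train` and `x_test`:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Hypothetical toy data standing in for the real train and test sets.
rng = np.random.default_rng(1)
cols = ["F_A", "F_B", "F_C"]
toy_train = pd.DataFrame(rng.uniform(size=(200, 3)), columns=cols)
toy_test = pd.DataFrame(rng.uniform(size=(50, 3)), columns=cols)
toy_target = 3.0 * toy_train["F_A"] + rng.normal(scale=0.01, size=200)

# Fit the selector on the training data only.
toy_selector = SelectFromModel(Lasso(alpha=0.001, random_state=42))
toy_selector.fit(toy_train, toy_target)
toy_features = toy_train.columns[toy_selector.get_support()]

# Apply the same columns to both sets so they stay aligned.
toy_train_sel = toy_train[toy_features]
toy_test_sel = toy_test[toy_features]
print(toy_train_sel.shape, toy_test_sel.shape)
```

Indexing by column name keeps the result a DataFrame; `toy_selector.transform(toy_test)` selects the same columns.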


Saving the names of the selected features.

In [10]:
pd.Series(selected_features).to_csv(os.path.join('..', 'data', 'interim', 'selected_features.csv'), index=False)
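
A later notebook can read this file back to recover the list of column names. The sketch below shows the round trip in a temporary directory with hypothetical feature names, so it does not touch the real `data/interim` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical feature names, standing in for selected_features.
toy_features = pd.Index(["F_A", "F_C"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "selected_features.csv")

    # Same call pattern as in the cell above: one header line, one name per row.
    pd.Series(toy_features).to_csv(path, index=False)

    # Read the single column back as a plain Python list.
    restored = pd.read_csv(path).iloc[:, 0].tolist()

print(restored)
```

The restored list can then be used to index the predictors, for example `x_test[restored]`.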

This concludes our feature selection phase.