# Introduction

In this notebook we show how to use this library, which performs feature selection by introducing a column of Gaussian noise (mean = 0, standard deviation = 1).



# Import toy regression data

We import a toy regression dataset.
The feature matrix consists of 1000 samples and 300 features, while the output consists of a single variable.

In [1]:
import numpy as np
import pandas as pd

X = pd.read_csv('data/toy-regression-features.csv')
y = pd.read_csv('data/toy-regression-labels.csv')

In [2]:
print(X.shape)
X.head()

(1000, 300)


Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,col_290,col_291,col_292,col_293,col_294,col_295,col_296,col_297,col_298,col_299
0,0.53707,-1.10214,1.614065,0.446704,2.150072,0.022795,-0.062325,0.842601,1.337646,0.423004,...,-0.923334,-0.481815,1.203645,-1.458585,-0.382055,0.778446,1.28167,0.083526,0.690047,0.117204
1,-0.291104,0.870662,0.989858,0.340181,0.462467,-0.582148,1.88876,1.326881,-1.654321,-0.130696,...,1.429273,0.845691,-1.089257,-0.918355,0.394018,2.608926,-1.485463,-0.907812,-0.17366,0.920506
2,-0.623914,0.645679,-0.603598,-0.382241,-1.038855,1.036846,-0.411746,0.309138,0.37786,1.115033,...,-0.853996,-1.977211,-0.36007,0.457125,-1.372804,0.320784,-0.961563,-0.203412,0.920264,0.799161
3,-0.00728,-1.159284,1.205723,-0.869215,-0.571466,0.540196,0.656639,0.041661,0.24431,-0.860549,...,0.646255,-0.762227,-0.940969,-0.889827,-0.534136,-0.649951,-0.387092,-1.089814,0.054935,0.955872
4,0.578238,-0.756635,-0.768636,1.339886,0.612525,-0.431343,-0.058266,0.975151,-1.992118,0.179272,...,0.263658,0.837735,0.724682,-2.493489,-2.1086,-1.64607,-0.674911,-0.344457,-0.771689,-0.691474


In [3]:
y.head()

Unnamed: 0,labels
0,219.664211
1,40.998126
2,109.423858
3,-280.667507
4,599.178538


# Use the library to select only the relevant features

We use the library to select only the relevant features.
A column containing Gaussian noise (mean = 0, std. dev. = 1) is created at each epoch. Then the feature importance is computed, and every feature that is less important than the random column is excluded.

The process is repeated for the selected number of `epochs`. Early stopping is also available: set the `patience` parameter to a value smaller than the number of epochs, and the process stops once the number of selected features has remained the same for `patience` epochs.
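The procedure described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the library's implementation: the helper name `select_above_noise` is hypothetical, and it assumes that importance is the absolute value of an ordinary least-squares coefficient, whereas the library call below uses a Lasso model with k-fold splitting.

```python
import numpy as np

def select_above_noise(X, y, epochs=10, patience=3, seed=0):
    """Hypothetical helper: keep the columns that beat a Gaussian noise column."""
    rng = np.random.default_rng(seed)
    kept = np.arange(X.shape[1])  # indices of the features still selected
    unchanged = 0                 # epochs in a row with no feature removed
    for _ in range(epochs):
        # Fresh noise column (mean 0, std. dev. 1) appended at each epoch.
        noise = rng.normal(0.0, 1.0, size=(X.shape[0], 1))
        X_aug = np.hstack([X[:, kept], noise])
        # Assumption: importance = |least-squares coefficient|.
        coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
        importance = np.abs(coef)
        # Keep only features strictly more important than the noise column.
        mask = importance[:-1] > importance[-1]
        if mask.all():
            unchanged += 1
            if unchanged >= patience:
                break  # early stopping: selection has been stable
        else:
            unchanged = 0
            kept = kept[mask]
    return kept

# Synthetic demo data: only the first two columns drive the target.
rng_demo = np.random.default_rng(42)
X_demo = rng_demo.normal(size=(200, 10))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 0.1 * rng_demo.normal(size=200)

kept_demo = select_above_noise(X_demo, y_demo).tolist()
print("Selected feature indices:", kept_demo)
```

Because a new noise column is drawn at every epoch, an uninformative feature that beats the noise once by chance is likely to be removed in a later epoch, which is why the process is repeated.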

In [4]:
from src.ml import get_relevant_features
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=1)

X_reduced = get_relevant_features(X, y, 
                                  model=lasso_model, 
                                  epochs=100, 
                                  patience=10, 
                                  splitting_type='kfold',
                                  filename_output='results/reduced_dataset.csv',
                                  random_state=42)

Fitting the model with 301 features
Train score 0.9252
Test score 0.8687
Fitting the model with 301 features
Train score 0.9302
Test score 0.8272
Fitting the model with 301 features
Train score 0.926
Test score 0.864
Fitting the model with 301 features
Train score 0.9237
Test score 0.8608
Fitting the model with 301 features
Train score 0.9238
Test score 0.8639
Selected 286 features out of 300
Fitting the model with 287 features
Train score 0.9267
Test score 0.8527
Fitting the model with 287 features
Train score 0.9271
Test score 0.8579
Fitting the model with 287 features
Train score 0.9294
Test score 0.8502
Fitting the model with 287 features
Train score 0.9228
Test score 0.8675
Fitting the model with 287 features
Train score 0.9245
Test score 0.8594
Selected 110 features out of 286
Fitting the model with 111 features
Train score 0.9145
Test score 0.8962
Fitting the model with 111 features
Train score 0.9209
Test score 0.8703
Fitting the model with 111 features
Train score 0.9161
Test 

# Inspect the new (reduced) dataset

The reduced dataset contains only a subset of the original features, namely the most relevant ones.

In [None]:
X_reduced.head()
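To see which columns were discarded, the column labels of the two datasets can be compared. The sketch below assumes that `X_reduced` is a pandas DataFrame (as the `.head()` call suggests) whose column names are a subset of those in `X`; the small frames `X_demo` and `X_reduced_demo` are stand-ins for illustration, not the notebook's data.

```python
import pandas as pd

# Stand-in frames; in the notebook these would be X and X_reduced.
X_demo = pd.DataFrame({"col_0": [0.1, 0.2], "col_1": [1.0, 2.0], "col_2": [3.0, 4.0]})
X_reduced_demo = X_demo[["col_0", "col_2"]]

# Columns present in the original dataset but missing from the reduced one.
dropped = [c for c in X_demo.columns if c not in X_reduced_demo.columns]

print(f"Kept {X_reduced_demo.shape[1]} of {X_demo.shape[1]} features")
print("Dropped:", dropped)
```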