# Introduction

This notebook shows how to use the library, which performs feature selection by adding a column of Gaussian noise (mean = 0, standard deviation = 1) to the dataset.



# Import toy regression data

We import a toy regression dataset.
The feature matrix contains 1000 samples and 300 features, while the output is a single variable.

In [1]:
import numpy as np
import pandas as pd

X = pd.read_csv('data/toy-regression-features.csv')
y = pd.read_csv('data/toy-regression-labels.csv')

In [2]:
print(X.shape)
X.head()

(1000, 300)


Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,col_290,col_291,col_292,col_293,col_294,col_295,col_296,col_297,col_298,col_299
0,0.53707,-1.10214,1.614065,0.446704,2.150072,0.022795,-0.062325,0.842601,1.337646,0.423004,...,-0.923334,-0.481815,1.203645,-1.458585,-0.382055,0.778446,1.28167,0.083526,0.690047,0.117204
1,-0.291104,0.870662,0.989858,0.340181,0.462467,-0.582148,1.88876,1.326881,-1.654321,-0.130696,...,1.429273,0.845691,-1.089257,-0.918355,0.394018,2.608926,-1.485463,-0.907812,-0.17366,0.920506
2,-0.623914,0.645679,-0.603598,-0.382241,-1.038855,1.036846,-0.411746,0.309138,0.37786,1.115033,...,-0.853996,-1.977211,-0.36007,0.457125,-1.372804,0.320784,-0.961563,-0.203412,0.920264,0.799161
3,-0.00728,-1.159284,1.205723,-0.869215,-0.571466,0.540196,0.656639,0.041661,0.24431,-0.860549,...,0.646255,-0.762227,-0.940969,-0.889827,-0.534136,-0.649951,-0.387092,-1.089814,0.054935,0.955872
4,0.578238,-0.756635,-0.768636,1.339886,0.612525,-0.431343,-0.058266,0.975151,-1.992118,0.179272,...,0.263658,0.837735,0.724682,-2.493489,-2.1086,-1.64607,-0.674911,-0.344457,-0.771689,-0.691474


In [3]:
y.head()

Unnamed: 0,labels
0,219.664211
1,40.998126
2,109.423858
3,-280.667507
4,599.178538


# Use the library to select only the relevant features

We use the library to keep only the relevant features.
At each epoch, a column of Gaussian noise (mean = 0, std. dev. = 1) is added to the dataset. The model is then fitted, the feature importances are computed, and every feature that is less important than the noise column is excluded.

The process is repeated for the chosen number of `epochs`. Early stopping is also available: set the `patience` parameter to a value smaller than the number of epochs, and the process stops once the number of selected features has stayed the same for `patience` epochs.
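
To make the procedure concrete, here is a minimal sketch of the noise-column idea. It is not the library's implementation: the helper name `select_above_noise`, the `noise` column name, and the use of the absolute linear coefficient as the importance measure are all assumptions made for illustration. The train/test scoring that the library reports is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso


def select_above_noise(X, y, model, epochs=10, patience=3, random_state=0):
    """Hypothetical helper: keep the features that beat a Gaussian noise column."""
    rng = np.random.default_rng(random_state)
    selected = list(X.columns)
    stale = 0  # epochs in a row in which the number of features did not change
    for _ in range(epochs):
        X_aug = X[selected].copy()
        # Fresh Gaussian noise column (mean 0, std. dev. 1) at every epoch
        X_aug["noise"] = rng.normal(0.0, 1.0, size=len(X_aug))
        model.fit(X_aug, y)
        # Assumption: importance = absolute value of the linear coefficient
        importance = pd.Series(np.abs(np.ravel(model.coef_)), index=X_aug.columns)
        threshold = importance["noise"]
        kept = [c for c in selected if importance[c] > threshold]
        stale = stale + 1 if len(kept) == len(selected) else 0
        selected = kept
        # Early stopping: nothing changed for `patience` epochs, or nothing is left
        if stale >= patience or not selected:
            break
    return X[selected]


# Small synthetic example: only col_0 and col_1 drive the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 10)),
                      columns=[f"col_{i}" for i in range(10)])
y_demo = 5.0 * X_demo["col_0"] - 3.0 * X_demo["col_1"] + rng.normal(scale=0.1, size=200)

X_small = select_above_noise(X_demo, y_demo, Lasso(alpha=0.1))
print(list(X_small.columns))
```

In this synthetic example the two informative columns should survive the selection, while the noise column itself never appears in the returned frame. A model that exposes `feature_importances_` instead of `coef_` would need a different importance line.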

In [9]:
from src.ml import get_relevant_features
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=1)

X_reduced = get_relevant_features(X, y, 
                                  model=lasso_model, 
                                  epochs=100, 
                                  patience=5, 
                                  filename_output='results/reduced_dataset.csv',
                                  random_state=42)

Fitting the model with 301 features
Train score 0.9252
Test score 0.8687
Selected 133 features out of 300
Fitting the model with 134 features
Train score 0.9154
Test score 0.8853
Selected 109 features out of 133
Fitting the model with 110 features
Train score 0.9114
Test score 0.8976
Selected 109 features out of 109
The feature selection did not improve in the last 1 epochs
Fitting the model with 110 features
Train score 0.9164
Test score 0.8758
Selected 102 features out of 109
Fitting the model with 103 features
Train score 0.9178
Test score 0.8727
Selected 36 features out of 102
Fitting the model with 37 features
Train score 0.9018
Test score 0.8942
Selected 30 features out of 36
Fitting the model with 31 features
Train score 0.8997
Test score 0.8964
Selected 30 features out of 30
The feature selection did not improve in the last 1 epochs
Fitting the model with 31 features
Train score 0.8981
Test score 0.9026
Selected 21 features out of 30
Fitting the model with 22 features
Train sco

### Inspect the new (reduced) dataset

The reduced dataset contains only a subset of the original features, namely the most relevant ones.

In [5]:
X_reduced.head()

Unnamed: 0,col_88,col_277,col_168,col_283,col_8,col_270,col_258,col_187,col_76,col_171,...,col_68,col_176,col_23,col_286,col_119,col_174,col_25,col_250,col_24,col_274
0,-1.019952,-0.003618,-0.003067,-2.092326,1.337646,0.220574,1.673497,0.236082,0.656732,-0.878309,...,0.348374,-1.667387,0.195945,0.581283,-0.645682,0.57702,1.178854,1.053236,-0.309741,0.692207
1,0.920548,0.259951,0.030676,0.685426,-1.654321,-0.04125,-0.05117,0.853362,-0.367285,-0.343505,...,-0.420332,0.043677,1.218682,-1.651792,2.052382,0.653817,-0.570561,-1.545429,0.830601,-0.808267
2,-3.277806,-1.351804,0.459295,0.933398,0.37786,0.758731,0.971993,-0.73901,0.200066,-0.076279,...,1.086326,1.938669,-1.066968,-0.526487,-0.745177,-0.26001,1.957113,2.176639,-1.145208,-1.399284
3,-0.936796,-0.413398,0.002783,1.153329,0.24431,-0.781265,0.744198,1.123666,0.328173,-0.379941,...,-1.436389,-0.581704,-2.105314,-0.025478,-1.638474,-0.077514,0.245085,-2.033497,-1.016165,0.221333
4,0.842412,3.081069,0.054107,-0.148948,-1.992118,0.513912,0.093666,2.831712,1.261611,0.741283,...,0.566953,0.895309,-0.994838,-0.6518,-0.557619,-0.676896,-0.719393,0.351711,2.785978,0.234328
