# Feature Selection

![Data Science Workflow](img/ds-workflow.png)

- **Feature selection** is about selecting the attributes that have the greatest impact on the **problem** you are solving.

- Note: All steps in the workflow are interconnected.

## Why Feature Selection?
- Higher accuracy
- Simpler models
- Reducing overfitting risk

See more details on [Wikipedia](https://en.wikipedia.org/wiki/Feature_selection)

## Feature Selection Techniques
### Filter methods
- Independent of the model
- Based on statistical scores
- Easy to understand
- Good for early feature removal
- Low computational requirements

#### Examples
- [Chi square](https://en.wikipedia.org/wiki/Chi-squared_test)
- [Information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees)
- [Correlation score](https://en.wikipedia.org/wiki/Correlation_coefficient)
- [Correlation Matrix with Heatmap](https://vitalflux.com/correlation-heatmap-with-seaborn-pandas/)
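
As a minimal sketch of a filter method, the example below scores each feature against the target with the chi-squared statistic and keeps the best one, without training any model. The toy data is hypothetical (it is not the dataset used later in this lesson), and `SelectKBest` with `chi2` is just one possible scoring setup.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Hypothetical count features: one depends on the class, three are pure noise.
# chi2 requires non-negative values, which Poisson counts satisfy.
informative = rng.poisson(lam=1 + 4 * y)
noise = rng.poisson(lam=3, size=(n, 3))
X = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

# Score every feature independently of any model and keep the single best one
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)
best_index = int(np.argmax(selector.scores_))
selected = selector.get_support(indices=True)
print(selector.scores_, selected)
```

The scores are computed once per feature, which is why filter methods are cheap and well suited for early feature removal.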

### Wrapper methods
- Compare different subsets of features and run the model on them
- Basically a search problem

#### Examples
- [Best-first search](https://en.wikipedia.org/wiki/Best-first_search)
- [Random hill-climbing algorithm](https://en.wikipedia.org/wiki/Hill_climbing)
- [Forward selection](https://en.wikipedia.org/wiki/Stepwise_regression)
- [Backward elimination](https://en.wikipedia.org/wiki/Stepwise_regression)

See more on [Wikipedia](https://en.wikipedia.org/wiki/Feature_selection#Subset_selection)
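
To make the "search problem" concrete, here is a minimal sketch of greedy forward selection: start with no features and repeatedly add the one that improves the cross-validated score the most. The helper `forward_selection`, the toy data, and the choice of `LogisticRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_selection(X, y, n_features, cv=3):
    """Greedily add the feature that gives the best cross-validated accuracy."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            # The model is re-trained for every candidate subset
            scores[j] = cross_val_score(LogisticRegression(), X[:, cols], y, cv=cv).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
# Hypothetical target that only depends on columns 0 and 2
y = (X[:, 0] + X[:, 2] > 0).astype(int)

selected = forward_selection(X, y, n_features=2)
print(sorted(selected))
```

Because the model is trained once per candidate subset, wrapper methods are far more expensive than filter methods.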

### Embedded methods
- Find features that contribute most to the accuracy of the model while it is created
- Regularization is the most common method - it penalizes higher complexity

#### Examples
- [LASSO](https://en.wikipedia.org/wiki/Lasso_(statistics))
- [Elastic Net](https://en.wikipedia.org/wiki/Elastic_net_regularization)
- [Ridge Regression](https://en.wikipedia.org/wiki/Ridge_regression)
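
A minimal sketch of an embedded method, assuming hypothetical toy data: LASSO's L1 penalty is applied while the model is fitted, and it can shrink the coefficients of unhelpful features to exactly zero. The value `alpha=0.1` is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Hypothetical target that only depends on columns 0 and 3
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

# The penalty strength alpha controls how aggressively coefficients are shrunk
model = Lasso(alpha=0.1).fit(X, y)

# Features with a non-zero coefficient are the ones the model kept
kept = np.flatnonzero(model.coef_)
print(model.coef_, kept)
```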

### Feature Selection Resources
- [An Introduction to Feature Selection](https://machinelearningmastery.com/an-introduction-to-feature-selection/)
- [Comprehensive Guide on Feature Selection](https://www.kaggle.com/prashant111/comprehensive-guide-on-feature-selection/)

### Before Feature Selection
- Clean data (lesson 09)
- Divide into training and test set (lesson 10)
- Feature scaling (lesson 11)
- Only do feature selection on the training set
    - To avoid overfitting to information leaked from the test set
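
The order of these steps can be sketched as follows, using a hypothetical toy DataFrame: split first, fit the selector on the training set only, then apply the same column choice to the test set.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical data: two informative-looking columns and one constant column
toy = pd.DataFrame({
    'a': rng.normal(size=100),
    'constant': np.ones(100),
    'b': rng.normal(size=100),
    'target': rng.integers(0, 2, size=100),
})
X = toy.drop('target', axis=1)
y = toy['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the selector on the training data only
selector = VarianceThreshold().fit(X_train)
kept = X_train.columns[selector.get_support()]

# Reuse the columns chosen on the training set for the test set
X_train_sel = X_train[kept]
X_test_sel = X_test[kept]
print(list(kept), X_test_sel.shape)
```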

### Dataset
- [Santander Customer Satisfaction](https://www.kaggle.com/c/santander-customer-satisfaction/)
    - Which customers are happy customers?

## Filter Methods
### Constant features
- Remove constant features
- Constant features add no value

In [71]:
import pandas as pd
import sklearn

In [3]:
data = pd.read_parquet('./files/customer_satisfaction.parquet')
data.head()

ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,imp_op_var40_ult1,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
1,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
3,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
4,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
8,2,37,0.0,195.0,195.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
10,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


In [6]:
len(data), len(data.columns)

(76020, 370)

In [9]:
data['TARGET'].value_counts()/len(data)

0    0.960431
1    0.039569
Name: TARGET, dtype: float64

In [10]:
data.describe()

statistic,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,imp_op_var40_ult1,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
count,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,...,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0
mean,-1523.199277,33.212865,86.208265,72.363067,119.529632,3.55913,6.472698,0.412946,0.567352,3.160715,...,7.935824,1.365146,12.21558,8.784074,31.505324,1.858575,76.026165,56.614351,117235.8,0.039569
std,39033.462364,12.956486,1614.757313,339.315831,546.266294,93.155749,153.737066,30.604864,36.513513,95.268204,...,455.887218,113.959637,783.207399,538.439211,2013.125393,147.786584,4040.337842,2852.579397,182664.6,0.194945
min,-999999.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5163.75,0.0
25%,2.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67870.61,0.0
50%,2.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,106409.2,0.0
75%,2.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,118756.3,0.0
max,238.0,105.0,210000.0,12888.03,21024.81,8237.82,11073.57,6600.0,6600.0,8237.82,...,50003.88,20385.72,138831.63,91778.73,438329.22,24650.01,681462.9,397884.3,22034740.0,1.0


#### Constant features directly with DataFrames

In [12]:
len(data.columns[(data == data.iloc[0]).all()])

34
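
To see why the comparison against the first row works, here is a small sketch on a hypothetical toy DataFrame. A column is constant exactly when every value equals the value in the first row; counting unique values per column is an alternative way to express the same check.

```python
import pandas as pd

# Hypothetical toy data: 'b' and 'c' are constant, 'a' and 'd' are not
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [7, 7, 7, 7],
    'c': [0.0, 0.0, 0.0, 0.0],
    'd': [1, 1, 1, 2],
})

# Compare every row with the first row, then require a match in all rows
first_row_match = (toy == toy.iloc[0]).all()
constant_cols = list(toy.columns[first_row_match])

# Alternative: a constant column has exactly one unique value
constant_by_nunique = list(toy.columns[toy.nunique() == 1])

print(constant_cols, constant_by_nunique)
```

Note that the two checks can differ on columns with missing values, since `NaN` never compares equal to itself while `nunique()` ignores `NaN` by default.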

#### Using Sklearn
- Remove constant and quasi-constant features
- [`VarianceThreshold`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html): Feature selector that removes all low-variance features.

In [73]:
from sklearn.feature_selection import VarianceThreshold

In [74]:
sel = VarianceThreshold()
sel.fit_transform(data)

array([[2.00000000e+00, 2.30000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 3.92051700e+04, 0.00000000e+00],
       [2.00000000e+00, 3.40000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 4.92780300e+04, 0.00000000e+00],
       [2.00000000e+00, 2.30000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 6.73337700e+04, 0.00000000e+00],
       ...,
       [2.00000000e+00, 2.30000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 7.40281500e+04, 0.00000000e+00],
       [2.00000000e+00, 2.50000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 8.42781600e+04, 0.00000000e+00],
       [2.00000000e+00, 4.60000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 1.17310979e+05, 0.00000000e+00]])

In [75]:
len(data.columns[sel.get_support()])

336

In [81]:
# get_feature_names_out() requires scikit-learn 1.0 or newer (older versions raise AttributeError).
# get_support() returns the same selection as a boolean mask and also works on older versions.
sel.get_support().sum()

#### Quasi constant features
- Same value for the great majority of the observations
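
As a rough sketch of what "great majority" means, the hypothetical helper `quasi_constant_columns` below flags columns where the most frequent value covers at least a given share of the rows. This is an alternative view to a variance threshold: for a 0/1 column where a share `p` of the rows is 1, the variance is `p * (1 - p)`, so a variance threshold of 0.01 corresponds to a dominant value in roughly 99% of the rows.

```python
import pandas as pd


def quasi_constant_columns(df, min_share=0.95):
    """Return columns whose most frequent value covers at least min_share of rows."""
    cols = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share >= min_share:
            cols.append(col)
    return cols


# Hypothetical toy data
toy = pd.DataFrame({
    'mostly_zero': [0] * 99 + [1],
    'balanced': [0, 1] * 50,
    'constant': [5] * 100,
})

result = quasi_constant_columns(toy, min_share=0.95)
print(result)
```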

In [38]:
sel = VarianceThreshold(threshold = 0.01)

In [39]:
sel.fit(data)

VarianceThreshold(threshold=0.01)

In [40]:
# Number of features kept (get_support() also works on scikit-learn versions before 1.0)
sel.get_support().sum()

In [41]:
# Columns rejected by the selector are the constant and quasi-constant features
quasi_constant = list(data.columns[~sel.get_support()])

### Correlation with color
- [`corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html): Compute pairwise correlation of columns, excluding NA/null values.
    - For better readability use: `.style.background_gradient(cmap='Blues')`
- Good features are highly correlated with the target
- Ideally, features should be correlated with the target but uncorrelated among themselves
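
A minimal sketch of ranking features by their correlation with the target, using hypothetical toy data. The absolute value is taken so that strongly negative correlations also rank high.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=300)

# Hypothetical data: 'signal' follows the target, 'noise' does not
toy = pd.DataFrame({
    'signal': target + rng.normal(scale=0.3, size=300),
    'noise': rng.normal(size=300),
    'target': target,
})

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()['target'].drop('target').abs().sort_values(ascending=False)
ranked = list(target_corr.index)
print(target_corr)
```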

In [59]:
# Keep only the columns that passed the variance threshold
train = data.loc[:, sel.get_support()]

In [None]:
train.shape

In [None]:
train.corr().style.background_gradient(cmap='Blues')

### Find correlated features
- The goal is to find and remove correlated features
- Calculate the correlation matrix (assign it to `corr_matrix`)
- A feature is correlated with a previous feature if the following is true
    - Notice that we use a correlation threshold of 0.8
```Python
feature = 'imp_op_var39_comer_ult1'
(corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
```
- Get all the correlated features by using list comprehension

In [None]:
corr_matrix = train.corr()

In [None]:
feature = 'imp_op_var39_comer_ult1'
(corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()

In [None]:
# A feature is flagged when it correlates above 0.8 with any earlier column
corr_features = [
    feature for feature in corr_matrix.columns
    if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
]

In [None]:
len(corr_features)
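
The same idea can be tried on a hypothetical toy DataFrame where the answer is known in advance. The helper `correlated_with_earlier` and its `use_abs` option are assumptions added for illustration: the plain `> 0.8` check only catches positive correlations, while comparing absolute values also flags strongly negative ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical data: x2 nearly duplicates x1, x4 nearly mirrors x1, x3 is independent
toy = pd.DataFrame({
    'x1': base,
    'x2': base * 2 + rng.normal(scale=0.01, size=200),
    'x3': rng.normal(size=200),
    'x4': -base + rng.normal(scale=0.01, size=200),
})
corr_matrix = toy.corr()


def correlated_with_earlier(corr, feature, threshold=0.8, use_abs=False):
    """True if the feature correlates above the threshold with any earlier column."""
    earlier = corr[feature].iloc[:corr.columns.get_loc(feature)]
    if use_abs:
        earlier = earlier.abs()
    return bool((earlier > threshold).any())


positive_only = [f for f in corr_matrix.columns if correlated_with_earlier(corr_matrix, f)]
with_abs = [f for f in corr_matrix.columns if correlated_with_earlier(corr_matrix, f, use_abs=True)]
print(positive_only, with_abs)
```

Only looking at earlier columns means that of two correlated features the first one is kept and the later one is flagged for removal.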

## Wrapper Methods
### Forward Selection
- [`SequentialFeatureSelector`](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector) Sequential Feature Selection for Classification and Regression.
- First install it by running the following in a cell
```
!pip install mlxtend
```
- For preparation remove all quasi-constant features and correlated features
```Python
X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']
```
- To demonstrate this we create a small training set
```Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)
```
- We will use the `SVC` model with the `SequentialFeatureSelector`.
    - Select two features (`k_features=2`)

In [85]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

In [None]:
X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']

In [None]:
len(X.columns)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)

In [None]:
sfs = SFS(SVC(), k_features=2, verbose=2, cv=2, n_jobs=8)

In [None]:
sfs.fit(X_train, y_train)

#### Good score?

In [None]:
y_train.value_counts()/ len(y_train)
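
The class balance matters when judging an accuracy score. As a minimal sketch with hypothetical labels that are roughly as imbalanced as `TARGET` (about 4% ones), a "model" that always predicts the majority class already reaches a high accuracy, so a useful model has to beat that baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical labels with roughly 4% positives
y = (rng.random(10_000) < 0.04).astype(int)

# Always predicting the most frequent class gives the baseline accuracy
majority_class = np.bincount(y).argmax()
baseline_accuracy = (y == majority_class).mean()
print(majority_class, baseline_accuracy)
```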