# Introduction
In machine learning, it is important to select "good" features before building a model. <i>Feature selection</i> is the process of choosing the most useful features from a given set of features.<br><br>

Feature selection serves several purposes:

- It decreases the computational cost and memory required
- It increases the performance of the model
- It prevents the model from overfitting to the dataset
- It reduces the complexity of the model


For example, if we are trying to predict whether a customer will default on a loan given their features (age, gender, occupation, etc.), which features should we consider while building the model? Should we include age? Should we consider occupation? <br><br>These questions are answered by the feature selection methods that we're going to go through in this notebook.

# Feature selection strategies

There are two types of feature selection strategies:

- Supervised
- Unsupervised

<b>Supervised</b> takes the output variable into consideration while <b>unsupervised</b> doesn't. We're going to use supervised techniques in this notebook.
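
To make the distinction concrete, here is a minimal sketch of an unsupervised selector, `VarianceThreshold` from scikit-learn, which drops features based only on the inputs. The toy array `X_toy` is a hypothetical example, not data used elsewhere in this notebook.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: the middle column is constant, so it carries no information.
X_toy = np.array([
    [0.0, 1.0, 2.1],
    [1.0, 1.0, 0.3],
    [0.0, 1.0, 1.7],
    [1.0, 1.0, 0.9],
])

# Unsupervised: fit_transform() only sees X, never a target variable.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X_toy)

kept = selector.get_support()  # boolean mask of the retained columns
print(kept)
print(X_reduced.shape)
```

A supervised selector, by contrast, would also need `y` in its `fit` call.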

# Supervised Techniques

As explained above, supervised strategies use the target variable to select the best features. These can be divided into three types:

- Wrapper methods
- Filter methods
- Intrinsic models

Wrapper methods evaluate candidate feature subsets by the performance of the resulting model, typically on a hold-out dataset. Filter methods score each feature with a statistical measure, independently of any model. Intrinsic models are algorithms that have a built-in mechanism to eliminate irrelevant features.

### 1) Wrapper methods

These methods create many models with different subsets of input features and select the features that result in the best performing model according to a performance metric. They evaluate multiple models, adding or removing features to find the combination that maximizes the model's performance. <br><br>
The advantage of these methods is that they are agnostic to variable types (numerical, categorical). The disadvantage is that they can be computationally expensive.

<b>Example:</b> Recursive Feature Elimination (RFE)
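
The idea of "many models, one per feature subset" can be sketched with a brute-force search. This is only an illustration under assumptions: it uses the small iris dataset (not used elsewhere in this notebook), tries every pair of features, and scores each pair with 5-fold cross-validation. Real wrapper methods such as RFE avoid trying every subset.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_iris, y_iris = iris["data"], iris["target"]

best_score, best_subset = -np.inf, None
# Try every subset of exactly 2 features and keep the best-scoring one.
for subset in combinations(range(X_iris.shape[1]), 2):
    model = DecisionTreeClassifier(random_state=0)
    # Mean accuracy over 5 cross-validation folds is the performance metric here.
    score = cross_val_score(model, X_iris[:, list(subset)], y_iris, cv=5).mean()
    if score > best_score:
        best_score, best_subset = score, subset

print([iris["feature_names"][i] for i in best_subset], round(best_score, 3))
```

The number of subsets grows combinatorially with the number of features, which is why wrapper methods are computationally expensive.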

### 2) Filter methods

Filter methods use statistical measures to score the relationship between each input variable and the target variable, and these scores are used as the basis to choose (or filter) the input variables.

<b>Example:</b> Chi-Square
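
As a rough sketch of the "score, then filter" idea, the snippet below ranks features by the absolute Pearson correlation with the target and keeps the top `k`. The synthetic data (`X_syn`, `y_syn`) is an assumption made for illustration: only columns 0 and 3 influence the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 500, 5
X_syn = rng.normal(size=(n_samples, n_features))
# Assumed ground truth: the target depends on columns 0 and 3 only.
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 3] + rng.normal(scale=0.5, size=n_samples)

# Step 1: score each feature independently of any model.
scores = np.array([
    abs(np.corrcoef(X_syn[:, j], y_syn)[0, 1]) for j in range(n_features)
])

# Step 2: filter, keeping the indices of the k highest-scoring features.
k = 2
top_k = np.sort(np.argsort(scores)[-k:])
print(top_k)
```

No model is trained at any point, which is what makes filter methods cheap compared to wrapper methods.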

### 3) Intrinsic models

This category contains models that perform feature selection automatically as part of training.

<b>Example:</b> Lasso, Decision Trees
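
A minimal sketch of intrinsic selection with Lasso follows. The data is synthetic (an assumption, generated with `make_regression` so that only 3 of 10 features are informative), and `alpha=1.0` is an arbitrary choice. The L1 penalty drives some coefficients to exactly zero, and `SelectFromModel` keeps the features whose coefficients are not zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression problem: 10 features, only 3 of them informative.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Fitting the Lasso performs the selection as a side effect of training.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_reg, y_reg)
coef = selector.estimator_.coef_
print(np.round(coef, 2))

# Indices of the features with a (practically) non-zero coefficient.
selected = np.flatnonzero(selector.get_support())
print(selected)
```

Tree-based models expose a similar signal through their `feature_importances_` attribute.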

# Implementations

Let's go through their implementation in Python.

### 1) Wrapper methods

In [1]:
import sklearn
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE

In [2]:
dataset = load_breast_cancer()

RFE stands for Recursive Feature Elimination. It starts by building a model (in this case a `DecisionTreeClassifier`) with all the available input features, evaluates every feature's importance in the fitted model, and eliminates the least important feature. It repeats this until the number of remaining features equals `n_features_to_select`. This is a "backward elimination" technique. Because the tree is created without a fixed `random_state`, the selected features can vary between runs.

In [3]:
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)

In [4]:
rfe.fit(X=dataset['data'], y=dataset['target'])

RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)

In [5]:
print(f"5 most important features according to RFE:\n\n{dataset['feature_names'][rfe.support_].tolist()}")

5 most important features according to RFE:

['worst radius', 'worst texture', 'worst area', 'worst smoothness', 'worst concave points']
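
The elimination loop that RFE performs can be sketched by hand. This is a simplified illustration, not scikit-learn's actual implementation: it removes one feature per iteration and breaks ties by taking the first least-important feature, so its result may differ from `RFE`. The `random_state=0` is an added assumption to make the run repeatable.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_bc, y_bc = data["data"], data["target"]

estimator = DecisionTreeClassifier(random_state=0)
remaining = list(range(X_bc.shape[1]))  # column indices still in play
n_features_to_select = 5

while len(remaining) > n_features_to_select:
    # Refit a fresh copy of the model on the features that are left.
    model = clone(estimator).fit(X_bc[:, remaining], y_bc)
    # Drop the feature with the lowest importance in this fit.
    weakest = int(np.argmin(model.feature_importances_))
    del remaining[weakest]

print(data["feature_names"][remaining].tolist())
```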


### 2) Filter methods

Filter methods use various statistical measures for scoring the features. We'll use these statistics:

- Pearson's correlation statistic
- ANOVA's F measure statistic
- Chi-Squared statistic
- Mutual Information statistic

In [6]:
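# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
from sklearn.datasets import load_boston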
from sklearn.feature_selection import SelectKBest, f_regression, f_classif, chi2, mutual_info_classif

In [7]:
boston = load_boston()

#### Using `f_regression`, which is based on Pearson's correlation

In [8]:
skb = SelectKBest(score_func=f_regression, k=5)

In [9]:
skb.fit(boston['data'], boston['target'])

SelectKBest(k=5, score_func=<function f_regression at 0x7fdb36f912f0>)

In [10]:
print(f"5 most important features according to Pearson Correlation:\n{boston['feature_names'][skb.get_support()]}")

5 most important features according to Pearson Correlation:
['INDUS' 'RM' 'TAX' 'PTRATIO' 'LSTAT']


This is because these 5 features have the highest absolute value of Pearson's correlation coefficient with the target variable, shown here:

In [11]:
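corr = pd.DataFrame(np.c_[boston['data'], boston['target']], columns=boston['feature_names'].tolist() + ['target']).corr()
# The target correlates perfectly with itself, so skip the first of the 6 largest values
corr.iloc[-1].abs().nlargest(6)[1:]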

LSTAT      0.737663
RM         0.695360
PTRATIO    0.507787
INDUS      0.483725
TAX        0.468536
Name: target, dtype: float64
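
The link between `f_regression` and Pearson's correlation can be checked numerically. For a single feature with correlation r and n samples, the F-statistic is r² / (1 − r²) · (n − 2), so ranking by F is the same as ranking by |r|. The sketch below uses `load_diabetes` as a stand-in dataset (an assumption, chosen because it ships with current scikit-learn versions).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import f_regression

diabetes = load_diabetes()
X_db, y_db = diabetes["data"], diabetes["target"]
n = X_db.shape[0]

# Pearson correlation of each feature with the target.
r = np.array([np.corrcoef(X_db[:, j], y_db)[0, 1] for j in range(X_db.shape[1])])

# F-statistic derived from r, with n - 2 degrees of freedom.
f_manual = r**2 / (1.0 - r**2) * (n - 2)
f_sklearn, p_values = f_regression(X_db, y_db)
print(np.allclose(f_manual, f_sklearn))

# The top-5 features by F-score and by |r| are the same set.
top5_by_f = set(np.argsort(f_sklearn)[-5:].tolist())
top5_by_r = set(np.argsort(np.abs(r))[-5:].tolist())
print(top5_by_f == top5_by_r)
```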

#### Using `f_classif` (ANOVA F measure) (https://towardsdatascience.com/anova-for-feature-selection-in-machine-learning-d9305e228476)

In [12]:
skb2 = SelectKBest(f_classif, k=5)

In [13]:
skb2.fit(dataset['data'], dataset['target'])

SelectKBest(k=5)

In [14]:
print(f"5 most important features according to ANOVA:\n{dataset['feature_names'][skb2.get_support()].tolist()}")

5 most important features according to ANOVA:
['mean perimeter', 'mean concave points', 'worst radius', 'worst perimeter', 'worst concave points']
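
To see what `f_classif` computes, the sketch below compares its scores with a one-way ANOVA run per feature using `scipy.stats.f_oneway`, where the groups are the samples of each class. It reloads the breast cancer dataset under new variable names (`X_bc`, `y_bc`) so that it is self-contained.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

data = load_breast_cancer()
X_bc, y_bc = data["data"], data["target"]

f_sklearn, p_sklearn = f_classif(X_bc, y_bc)

# One-way ANOVA per feature: do the two classes have different means?
f_scipy = np.array([
    f_oneway(X_bc[y_bc == 0, j], X_bc[y_bc == 1, j]).statistic
    for j in range(X_bc.shape[1])
])
print(np.allclose(f_sklearn, f_scipy))

# Indices of the 5 features with the largest F-statistic.
top5 = np.sort(np.argsort(f_sklearn)[-5:])
print(data["feature_names"][top5].tolist())
```

A large F-statistic means the feature's mean differs strongly between the classes relative to its spread within each class.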


#### Using `chi2` (Chi-Square statistic) (https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)

In [15]:
breast_cancer = pd.read_csv('breast-cancer.csv', header=None)

In [16]:
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

def prepare_inputs(X):
    oe = OrdinalEncoder()
    oe.fit(X)
    X_enc = oe.transform(X)
    return X_enc

def prepare_targets(y):
    le = LabelEncoder()
    le.fit(y)
    y_enc = le.transform(y)
    return y_enc

In [17]:
X = breast_cancer.values[:, :-1].astype(str)
y = breast_cancer.values[:, -1]

In [18]:
X = prepare_inputs(X)
y = prepare_targets(y)

In [19]:
skb3 = SelectKBest(chi2, k=5)

In [20]:
skb3.fit(X, y)

SelectKBest(k=5, score_func=<function chi2 at 0x7fdb36f91268>)

These are the 5 most important features according to the Chi-Squared statistic:

In [21]:
breast_cancer.iloc[:, :-1].iloc[:, skb3.get_support()]

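| | 2 | 3 | 4 | 5 | 8 |
|---|---|---|---|---|---|
| 0 | '15-19' | '0-2' | 'yes' | '3' | 'no' |
| 1 | '15-19' | '0-2' | 'no' | '1' | 'no' |
| 2 | '35-39' | '0-2' | 'no' | '2' | 'no' |
| 3 | '35-39' | '0-2' | 'yes' | '3' | 'yes' |
| 4 | '30-34' | '3-5' | 'yes' | '2' | 'no' |
| ... | ... | ... | ... | ... | ... |
| 281 | '30-34' | '6-8' | 'yes' | '2' | 'no' |
| 282 | '25-29' | '3-5' | 'yes' | '2' | 'yes' |
| 283 | '30-34' | '6-8' | 'yes' | '2' | 'no' |
| 284 | '15-19' | '0-2' | 'no' | '2' | 'no' |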
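
It helps to know what scikit-learn's `chi2` actually computes: it treats each (non-negative) feature value as a frequency, sums it per class to get the observed totals, and compares these with the totals expected if the feature were independent of the class. The sketch below reproduces this by hand on hypothetical toy data (`X_toy`, `y_toy`), as an illustration rather than a copy of the library code.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
X_toy = np.column_stack([
    y_toy * 2 + rng.integers(0, 2, size=200),  # codes that depend on the class
    rng.integers(0, 4, size=200),              # codes unrelated to the class
])

# One-hot encode the classes so that Y.T @ X sums feature values per class.
classes = np.unique(y_toy)
Y = (y_toy[:, None] == classes[None, :]).astype(float)
observed = Y.T @ X_toy

# Expected totals under independence: class frequency times feature total.
expected = np.outer(Y.mean(axis=0), X_toy.sum(axis=0))
chi2_manual = ((observed - expected) ** 2 / expected).sum(axis=0)

chi2_sklearn, p_values = chi2(X_toy, y_toy)
print(np.allclose(chi2_manual, chi2_sklearn))
print(chi2_manual[0] > chi2_manual[1])
```

Because the values are treated as frequencies, the scores for ordinal-encoded categories depend on how the integer codes are assigned.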


#### Using `mutual_info_classif` (Mutual Information) (https://machinelearningmastery.com/information-gain-and-mutual-information/)

In [22]:
skb4 = SelectKBest(mutual_info_classif, k=5)

In [23]:
skb4.fit(X, y)

SelectKBest(k=5, score_func=<function mutual_info_classif at 0x7fdb3689bf28>)

These are the 5 most important features according to the Mutual Information statistic:

In [24]:
breast_cancer.iloc[:, :-1].iloc[:, skb4.get_support()]

| | 0 | 2 | 3 | 4 | 8 |
|---|---|---|---|---|---|
| 0 | '40-49' | '15-19' | '0-2' | 'yes' | 'no' |
| 1 | '50-59' | '15-19' | '0-2' | 'no' | 'no' |
| 2 | '50-59' | '35-39' | '0-2' | 'no' | 'no' |
| 3 | '40-49' | '35-39' | '0-2' | 'yes' | 'yes' |
| 4 | '40-49' | '30-34' | '3-5' | 'yes' | 'no' |
| ... | ... | ... | ... | ... | ... |
| 281 | '50-59' | '30-34' | '6-8' | 'yes' | 'no' |
| 282 | '50-59' | '25-29' | '3-5' | 'yes' | 'yes' |
| 283 | '30-39' | '30-34' | '6-8' | 'yes' | 'no' |
| 284 | '50-59' | '15-19' | '0-2' | 'no' | 'no' |
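
One detail worth knowing: for dense input, `mutual_info_classif` treats features as continuous by default (`discrete_features='auto'`), and its continuous estimator involves randomness. For encoded categorical data it may be preferable to pass `discrete_features=True`. The sketch below shows one way to do that inside `SelectKBest` with `functools.partial`, on hypothetical toy data where the first column is a copy of the target.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 100)  # balanced binary target
X_toy = np.column_stack([
    y_toy,                          # perfectly informative feature
    rng.integers(0, 3, size=200),   # random category codes, unrelated to y
])

# Bind the extra keyword arguments so SelectKBest can call score_func(X, y).
score_func = partial(mutual_info_classif, discrete_features=True, random_state=0)
skb_mi = SelectKBest(score_func, k=1).fit(X_toy, y_toy)

print(skb_mi.scores_)
print(skb_mi.get_support())
```

For a balanced binary target, a feature that determines the target exactly has mutual information ln 2 ≈ 0.693 (in nats), which is the score of the first column here.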


# References

- https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
- https://machinelearningmastery.com/feature-selection-with-categorical-data/