# Feature Selection

This notebook walks through how to select features for a machine learning dataset. Feature selection can improve model performance dramatically in some situations, and it usually speeds up training because the model has fewer inputs to process.

## 1) Regression

In [None]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load dataset
housing = fetch_california_housing()

# Convert to DataFrames for easier manipulation
df_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
df_housing['MEDV'] = housing.target

print(df_housing.shape)

### 1.A) Filter Methods

**Q: What statistics could you apply to the features for purposes of selection?** <br>


#### Pearson's R Correlation

\begin{equation}
R = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}
\end{equation}

$R = 1$: Perfect Positive <br>
$R > 0.5$: Strong Positive <br>
$.3 < R \leq .5$: Moderate Positive <br>
$0 < R \leq .3$: Weak Positive <br>
$R = 0$: None <br>
$0 > R \geq -.3$: Weak Negative <br>
$-.3 > R \geq -.5$: Moderate Negative <br>
$R < -.5$: Strong Negative <br>
$R = -1$: Perfect Negative <br>
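
As a quick check of the formula above, here is a minimal sketch that computes $R$ term by term and compares it with NumPy's built-in result. The helper name `pearson_r` and the toy arrays are made up for illustration and are not part of the housing data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's R computed term by term from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = np.sqrt(
        (n * np.sum(x**2) - np.sum(x) ** 2) * (n * np.sum(y**2) - np.sum(y) ** 2)
    )
    return numerator / denominator

# Toy data (hypothetical): y rises with x, plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r_manual, r_numpy)
```

The two values should agree up to floating-point error, and `df_housing.corr()` applies the same statistic to every pair of columns.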


In [None]:
df_housing.head()

In [None]:
# Use Pearson's R to find correlations with the target variable
corr = df_housing.corr()
print(corr)

# Show the correlation matrix as a heatmap on a fixed [-1, 1] color scale
plt.imshow(corr, vmin=-1, vmax=1)

# Label the ticks with the DataFrame's own column names so they match the matrix
features = list(corr.columns)
plt.xticks(range(len(features)), labels=features, rotation=45)
plt.yticks(range(len(features)), labels=features, rotation=45)
plt.colorbar()

**Q: From a quick glance, which features seem correlated with median house value?** <br>

**Q: Are there any features that could be removed?** <br>

**Q: Is there a situation where you would remove a feature, despite it having a good correlation with your target of interest?** <br>


In [None]:
# Selecting some features that both make sense to use and have some correlation

# Split dataset into train and test set

# Train a linear regression model for a quick test

# Plot results

Calculate a relevant evaluation metric for this model
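
For a regression model, RMSE and $R^2$ are two common choices. The sketch below shows one way to compute both on a held-out test set. It uses synthetic data from `make_regression` as a stand-in (an assumption, so that it runs without the housing download), and the split size and random seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 features, 3 of them informative
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE is in the units of the target; R^2 is unitless with a maximum of 1
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}  R^2: {r2:.3f}")
```

Computing the same metrics on the same test split for every model keeps the later comparisons fair.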

In [None]:
# Evaluate model using a proper evaluation metric


How does this compare to a model trained on all features?

In [None]:
# Retrain a linear regression model on all features, calculate same evaluation metric and compare


**Q: How does the performance of the model trained on all features compare with the model trained on a subset of features? Explain why.** <br>

### 1.B) Wrapper Methods

#### Forward Selection

We can use either sklearn's or mlxtend's ```SequentialFeatureSelector()``` class.

In mlxtend: <br>
```SequentialFeatureSelector()``` class accepts the following major parameters:
* ```LinearRegression()``` acts as the estimator for the feature selection process. It can be replaced with any other regression or classification algorithm.
* ```k_features``` indicates the number of features to be selected. For demonstration purposes, 5 features are selected from the original 8. This value can be optimized by analyzing the scores for different numbers of features.
* ```forward``` indicates the direction of the wrapper method used: ```forward = True``` for forward selection, ```forward = False``` for backward elimination.
* ```scoring``` specifies the evaluation criterion to be used.
* ```cv``` sets the number of folds for k-fold cross-validation. By default it is set to 5. Bear in mind that a larger number of folds is more time-consuming and compute-intensive.
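
The parameters above belong to mlxtend. sklearn's class of the same name exposes the same ideas under different names: `n_features_to_select` plays the role of `k_features`, and `direction` replaces the `forward` flag. A minimal sketch of forward selection with sklearn, on synthetic data (an assumption, standing in for the housing set) and with an arbitrary choice of 3 features:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): 8 features, only 3 drive the target
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,  # sklearn's counterpart of mlxtend's k_features
    direction="forward",     # "backward" gives backward elimination
    scoring="r2",
    cv=5,
)
selector.fit(X, y)

selected = selector.get_support(indices=True)  # column indices that were kept
X_reduced = selector.transform(X)              # data restricted to those columns
print("Selected feature indices:", selected)
print("Reduced shape:", X_reduced.shape)
```

At each step the selector adds the single feature that most improves the cross-validated score, so the cost grows with both the number of features and the number of folds.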

In [None]:
# Ensure we selected all features to start


In [None]:
# Use sklearn to do forward selection
from sklearn.feature_selection import SequentialFeatureSelector as SFS


In [None]:
# Retrain a linear regression model on selected features, calculate same evaluation metric and compare


Let's try using the mlxtend library this time.

In [None]:
# Ensure we selected all features to start


In [None]:
# Using mlxtend
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs


In [None]:
# How do we select the optimal number for k_features?


#### Recursive Feature Elimination (RFE)
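
RFE works in the opposite direction to forward selection: it fits the estimator on all features, drops the least important one (judged by coefficient magnitude for a linear model), and repeats until the requested number remains. A minimal sketch, assuming synthetic data and an arbitrary target of 3 features:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption); its features share a common scale,
# which matters because RFE ranks features by coefficient size
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
rfe.fit(X, y)

print("Kept features (mask):", rfe.support_)
print("Ranking (1 = kept):  ", rfe.ranking_)
```

On real data with features on very different scales, standardizing first is usually advisable so that coefficient sizes are comparable.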

In [None]:
# Ensure we selected all features to start


In [None]:
from sklearn.feature_selection import RFE


In [None]:
# Retrain a linear regression model on selected features, calculate same evaluation metric and compare


### 1.C) Embedded Methods
Let's try some embedded methods for feature selection.
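
Embedded methods select features as a side effect of training. The sketch below contrasts the two regularizers covered next: an L1 penalty (Lasso) can push coefficients to exactly zero, while an L2 penalty (Ridge) only shrinks them. The synthetic data and `alpha=1.0` are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, only 3 drive the target
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)
# Standardize so the penalty treats every feature on the same scale
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Exact zeros -> Lasso:", n_zero_lasso, " Ridge:", n_zero_ridge)
```

Features whose Lasso coefficient is zero are effectively removed from the model, which is what makes L1 regularization usable as a feature selector.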

In [None]:
# Ensure we selected all features to start


#### L1 Regularization / Lasso Regression

In [None]:
from sklearn.linear_model import Lasso


#### L2 Regularization / Ridge Regression

In [None]:
from sklearn.linear_model import Ridge


**Q: Compare the performance of the models obtained using Lasso vs Ridge Regression. Explain your observation** <br>


## 2) Classification

In [None]:
# Import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_wine
from scipy.stats import f_oneway

# Load dataset
wine = load_wine()

# Convert to DataFrames for easier manipulation
df_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
df_wine['wine_class'] = wine.target

df_wine

### 2.A) Filter Methods
**Q: What statistics could you apply to the features for purposes of selection?** <br>


#### ANOVA

ANOVA tests whether the means of two or more groups differ significantly from each other:

* $H_0$: the means of all groups are equal
* $H_1$: at least one group mean is different

ANOVA assumes that 1) the observations are independent, 2) the feature is approximately Gaussian within each group, and 3) the groups have similar variances.

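One-way ANOVA tests the relationship between a categorical variable and a continuous variable. For feature selection, the class label defines the groups and each feature is the continuous variable being compared.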

\begin{gather}
SS_{between} = \sum_{j=1}^k n_j (\bar{x}_j - \bar{x})^2 \\
SS_{within} = \sum_{j=1}^k \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \\
\\
MS_{between} = \frac{SS_{between}}{k - 1}\\
MS_{within} = \frac{SS_{within}}{N - k}\\
\\
F = \frac{MS_{between}}{MS_{within}}
\end{gather}

where: <br>
$SS_{between} =$ sum of squares between the groups <br>
$SS_{within} =$ sum of squares within the groups <br>
$k =$ number of groups <br>
$n_j =$ number of observations in group $j$ <br>
$x_{ij} =$ observation $i$ in group $j$ <br>
$\bar{x}_j =$ mean of group $j$, $\bar{x} =$ mean of all observations <br>
$N =$ total number of observations across all groups
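
To make the formulas concrete, the sketch below computes $F$ by hand for a single wine feature and compares it with `scipy.stats.f_oneway`. The choice of the first feature column is arbitrary and made only for illustration.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_wine

wine = load_wine()
x = wine.data[:, 0]  # first feature column, chosen arbitrarily
# One array of feature values per wine class
groups = [x[wine.target == c] for c in np.unique(wine.target)]

k = len(groups)         # number of groups
N = len(x)              # total number of observations
grand_mean = x.mean()   # mean of all observations

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (N - k))
f_scipy, p_value = f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```

The two F values should match up to floating-point error. `f_classif` in sklearn runs this same test once per feature column.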

In [None]:
# Doing a one way ANOVA for each feature
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif


Real world data collection is expensive. Select the top 3 features to train on.
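
One possible way to do this is sketched below, under some assumptions: `SelectKBest` with `f_classif` picks the 3 features with the highest F-scores, and a small pipeline then estimates accuracy with 5-fold cross-validation. The scaling step and the fold count are choices made for this sketch, not requirements.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X, y = wine.data, wine.target

# Rank every feature by its one-way ANOVA F-score and keep the top 3
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
top3 = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Top 3 features by F-score:", top3)

# Putting the selector inside the pipeline re-runs selection within each fold,
# so the held-out fold never influences which features are chosen
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=3),
    LogisticRegression(max_iter=10000),
)
scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```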

Accuracy leaves something to be desired. What if we search all possible combinations of 3 features to empirically discover the optimal 3?

In [None]:
from itertools import combinations

# Every possible set of 3 feature indices
combos = list(combinations(range(len(wine.feature_names)), 3))

best = 0.0
best_feats = []
for combo in combos:
    feature_ids = list(combo)
    # Fit and score on the same data, so this measures training accuracy
    model = LogisticRegression(max_iter=10000).fit(wine.data[:, feature_ids], wine.target)
    preds = model.predict(wine.data[:, feature_ids])
    mat = confusion_matrix(wine.target, preds)
    # Correct predictions lie on the diagonal of the confusion matrix
    accuracy = np.trace(mat) / np.sum(mat)
    if accuracy > best:
        best_feats = combo
        best = accuracy
print(best_feats, best)

**Q: If a one-way ANOVA was used above to identify features, why did it not work so well?**

**Q: What are the pros and cons of searching for the best features by checking model results?**