# DSCI 6003 7.2 Lecture

## A Guide to Basic Feature Selection


### By the End of This Lecture You Will Be:
1. Familiar with the principles of feature construction
2. Familiar with Forward Selection
3. Familiar with Backward Selection
4. Familiar with Recursive Feature Elimination


Feature selection is intended to select the “best” subset of predictors. But why bother?

1. We want to explain the data in the simplest way, so redundant predictors (features) should be removed. All else being equal, the simplest adequate model is preferred. (QUIZ: Why? Robustness)
2. Unnecessary predictors (features) add noise to the estimates of the quantities we are interested in. They also waste degrees of freedom, which inflates the error estimates.
3. Collinearity is caused by having too many variables trying to do the same job.
4. Cost: if the model is to be used for prediction, we can save time and/or money by avoiding measurement of redundant predictors in future experiments.

Prior to feature selection, during EDA:

1. Identify outliers and influential points - maybe exclude them at least temporarily.
2. Add in any transformations of the variables that seem appropriate. This may take a while - in particular look for transformations that reproduce linearity when projected against other variables.

### QUIZ:
How do you determine appropriateness of a transformation?

## The nature of features - statistical relationships are not always obvious to the eye


### 1. Features can be meaningless alone, but informative together.

![f1](./images/Features_1.png)
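
A small synthetic example can make this concrete. The sketch below does not use the data behind the figure; it assumes an XOR-like pattern in which the class depends only on whether two hypothetical features `x1` and `x2` share the same sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Two features in {-1, +1}; the class is +1 when the signs agree (XOR-like).
x1 = rng.choice([-1.0, 1.0], size=n)
x2 = rng.choice([-1.0, 1.0], size=n)
y = x1 * x2

corr_x1 = np.corrcoef(x1, y)[0, 1]          # each feature alone: near zero
corr_x2 = np.corrcoef(x2, y)[0, 1]
corr_joint = np.corrcoef(x1 * x2, y)[0, 1]  # the pair together: perfect
print(corr_x1, corr_x2, corr_joint)
```

A univariate screen would discard both features here, even though together they determine the class exactly.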


### 2. Sometimes you need both highly correlated features to properly distinguish classes.

![f2](./images/Features_2.png)

### 3. Sometimes highly relevant features are useless.

![f3](./images/Features_3a.png)

## Model Hierarchy

Some models have a natural hierarchy. For example, in polynomial models, $x^2$ is a term of higher order than $x$. You must respect this hierarchy: higher-order terms should be removed from the model before lower-order terms in the same variable, because higher-order terms multiply the effect of the error terms. If

$$\tilde{x} = x_{observed}+\epsilon$$

then placing this observable in the higher-order terms propagates the error term directly. If the model includes polynomial terms, $y = \beta_{0}+\beta_{1}x+\beta_{2}x^2$, then the estimated $y$ includes squared error terms:

$$\hat{y} = \beta_{0}+\beta_{1}(x_{observed}+\epsilon)+\beta_{2}(x_{observed}+\epsilon)^2$$


$$\hat{y} = \beta_{0}+\beta_{1}(x_{observed}+\epsilon)+\beta_{2}(x_{observed}^2+2\ x_{observed}\ \epsilon+\epsilon^2)$$

This is a problem. Expanding the square makes a first-order term in $x$ reappear, $2\beta_{2}\ x_{observed}\ \epsilon$, but without a corresponding weight of its own to fit on.

Suppose you remove the first-order term and then shift the origin of the variable, $x \rightarrow x+a$. The same problem appears:

$$\hat{y} = \beta_{0}+\beta_{2}(x_{observed}^2+2\ x_{observed}\ a+a^2)$$
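
The effect of the shift can be checked numerically. This is a minimal sketch on noise-free synthetic data; the helper `rss` and the shift value `a = 1.5` are hypothetical choices made only for illustration.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 3.0 * x**2            # noise-free quadratic with no linear term

def rss(columns, y):
    """Residual sum of squares of a least-squares fit on the given columns."""
    A = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

ones = np.ones_like(x)
a = 1.5                         # hypothetical shift of the origin
z = x + a                       # the same measurements on the shifted axis

rss_original = rss([ones, x**2], y)          # no linear term needed: exact fit
rss_shifted = rss([ones, z**2], y)           # linear term missing: poor fit
rss_shifted_full = rss([ones, z, z**2], y)   # hierarchy respected: exact again
print(rss_original, rss_shifted, rss_shifted_full)
```

Dropping the first-order term only looks harmless in one particular coordinate system; after the shift, the model needs that term back to describe the same curve.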

## Interaction Terms and Cross Terms

Suppose we produce a second-order model with a response surface as follows:

$$\hat{y} = \beta_{0}+\beta_{1}(x_{1})+\beta_{2}(x_{2})+\beta_{11}(x_{1})^2+\beta_{22}(x_{2})^2+\beta_{12}x_{1}\ x_{2}$$

We would not normally consider removing the $x_{1}x_{2}$ interaction term without also considering the simultaneous removal of the $x_{1}^2$ and $x_{2}^2$ terms. A joint removal corresponds to the clearly meaningful comparison of a quadratic surface with a linear one. Removing only the $x_{1}x_{2}$ term corresponds to a surface that is aligned with the coordinate axes.

Such a move would be difficult to interpret and would not be done unless some meaning can be attached. Any rotation of the feature space would reintroduce the interaction term (as above).
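
The rotation argument can be verified with a short experiment. The sketch below assumes a synthetic, noise-free surface with no interaction term and a 45-degree rotation; the helper `quadratic_fit` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=200)
x2 = rng.uniform(-1, 1, size=200)
y = x1**2 + 2.0 * x2**2         # axis-aligned surface: no x1*x2 term

def quadratic_fit(a, b, y):
    """Least-squares coefficients for the columns [1, a, b, a^2, b^2, a*b]."""
    A = np.column_stack((np.ones_like(a), a, b, a**2, b**2, a * b))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Rotate the feature axes by 45 degrees.
u = (x1 + x2) / np.sqrt(2)
v = (x1 - x2) / np.sqrt(2)

cross_original = quadratic_fit(x1, x2, y)[-1]   # interaction weight, original axes
cross_rotated = quadratic_fit(u, v, y)[-1]      # interaction weight, rotated axes
print(cross_original, cross_rotated)
```

In the original axes the fitted interaction weight is zero up to rounding. In the rotated axes the same surface is $1.5u^2 + 1.5v^2 - uv$, so the fitted interaction weight comes out as $-1$.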

### In general: Effect Heredity

The effect heredity principle indicates that an interaction can be active
only if one or both of its parent effects are also active. For example,
a two-factor interaction can be active only if one or both of the corresponding main effects are active.
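
As a rough illustration, the principle can be used to prune the list of candidate interactions before any model is fit. The function `candidate_interactions` and its `strong` flag are hypothetical: the default follows the "one or both parents" rule stated above, and `strong=True` is a stricter variant (both parents required) added here as an assumption.

```python
from itertools import combinations

def candidate_interactions(features, active_main_effects, strong=False):
    """Return the two-factor interactions allowed by effect heredity.

    strong=False: at least one parent main effect must be active.
    strong=True:  both parent main effects must be active.
    """
    active = set(active_main_effects)
    rule = all if strong else any
    return [(a, b) for a, b in combinations(features, 2)
            if rule(parent in active for parent in (a, b))]

features = ["x1", "x2", "x3"]
weak = candidate_interactions(features, active_main_effects=["x1"])
strong = candidate_interactions(features, active_main_effects=["x1", "x2"], strong=True)
print(weak)
print(strong)
```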

## Statsmodels - Looking at OLS outputs

```python
import numpy as np
import statsmodels.api as sm

np.random.seed(9876789)

# Artificial data: y is quadratic in x plus Gaussian noise
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])   # intercept, weight on x, weight on x^2
e = np.random.normal(size=nsample)

# Add an intercept column to the design matrix
X = sm.add_constant(X)
y = np.dot(X, beta) + e

# Fit ordinary least squares and print the summary table
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
```

```text
                            OLS Regression Results
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.020e+06
Date:                Wed, 09 Dec 2015   Prob (F-statistic):          2.83e-239
Time:                        11:56:07   Log-Likelihood:                -146.51
No. Observations:                 100   AIC:                             299.0
Df Residuals:                      97   BIC:                             306.8
Df Model:                           2
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.3423      0.313      4.292      0.0
```

## Backward Selection


1. Start with all the predictors in the model.
2. Remove the predictor with the highest p-value greater than $\alpha_{crit}$.
3. Refit the model and go to step 2.
4. Stop when all p-values are less than $\alpha_{crit}$.

The $\alpha_{crit}$ is sometimes called the “p-to-remove” and does not have to be 5%. If prediction performance is the goal, then a 15-20% cut-off may work best, although methods designed more directly for optimal
prediction should be preferred.

![BS_1](./images/BS_1.png)
![BS_1_b](./images/BS_1_bottom.png)

Typically we remove the largest p-value violator first (Area):
![BS_2](./images/BS_2a.png)

Then we can remove the other two in sequence:

![BS_3](./images/BS_3a.png)

![BS_4](./images/BS_4a.png)

We end up with a robust model that has similar predictive performance:

![BS_5_top](./images/BS_5_top.png)
![BS_5_main](./images/BS_5_main_a.png)

## Forward Selection

1. Start with no variables in the model.
2. For each predictor not in the model, check the p-value it would have if it were added to the model. Choose the one with the lowest p-value below $\alpha_{crit}$ and add it.
3. Continue until no new predictors can be added.

## Stepwise Regression

This is a combination of backward elimination and forward selection. It addresses the situation where variables are added or removed early in the process and we want to change our mind about them later. At each stage a variable may be added or removed. There are several variations on exactly how this is done, typically with automated methods of generating combinations. A related procedure implemented in scikit-learn is recursive feature elimination (`RFE`), which repeatedly fits an estimator, ranks the features by coefficient or feature importance, and prunes the weakest until the desired number remains. Note that RFE only removes features; it never adds one back.
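
A minimal sketch of recursive feature elimination with scikit-learn's `RFE` is shown here. The data are synthetic, and the choices of `LinearRegression` as the estimator and `n_features_to_select=2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)   # columns 2-4 are noise

# Fit, rank the features by coefficient size, drop the weakest one (step=1),
# and repeat until n_features_to_select remain.
selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the columns that were kept
print(selector.ranking_)   # 1 = kept; larger ranks were eliminated earlier
```

Because the ranking uses raw coefficient sizes, the features should be on comparable scales before they are passed to `RFE` with a linear estimator.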

Stepwise procedures are relatively cheap computationally but they do have some drawbacks.

1. Because of the “one-at-a-time” nature of adding/dropping variables, it’s possible to miss the “optimal” model.
2. The p-values used should not be treated too literally. The removal of less significant predictors tends to increase the significance of the remaining predictors. This effect leads one to overstate the importance of the remaining predictors.
3. The procedures are not directly linked to the final objectives of prediction or explanation and so may not really help solve the problem of interest. The variables that stay in the model look more significant than they are, while features that are dropped can still be correlated with the response.
4. Stepwise feature selection tends to pick models that are smaller than desirable for prediction purposes. To give a simple example, consider the simple regression with just one predictor variable. Suppose that the slope for this predictor is not quite statistically significant. We might not have enough evidence to say that it is related to y but it might be better to use it than not.

## Feature selection by Lasso (Elastic net)

You have already learned feature selection through regularization procedures. This can be a powerful alternative to classical feature selection; you will need to construct the squid plot (solution path plot) as outlined. Relationships amongst variables can be examined by eye, but careful analysis of the meaning of the features is required to determine which of these are most useful for a final model. 
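
The numbers behind a solution path plot can be computed with scikit-learn's `lasso_path`. This is a sketch on synthetic data; the penalty grid `alphas` is a hypothetical choice, and plotting is left out so the block runs without a display.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)   # columns 2-4 are noise

alphas = np.logspace(1, -3, 30)          # hypothetical grid, strongest penalty first
alphas_out, coefs, _ = lasso_path(X, y, alphas=alphas)

# coefs has shape (n_features, n_alphas); count the active features per penalty.
n_active = (np.abs(coefs) > 1e-10).sum(axis=0)
for alpha, k in zip(alphas_out[::6], n_active[::6]):
    print(f"alpha={alpha:.4f}  nonzero coefficients={k}")
```

Plotting each row of `coefs` against `np.log10(alphas_out)` gives the path plot: the features whose weights leave zero first as the penalty relaxes are the ones the lasso considers most useful.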

## Feature selection by Feature Importance

Feature importance with random forests is a great way to tease out the most predictive features; however, one must be careful of the heredity effect. Dropping a feature that is a parent of a more predictive feature eliminates dependency structure that the model likely needs.
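
A minimal sketch of ranking features by random forest importance with scikit-learn follows. The data are synthetic, and the forest settings (`n_estimators=100`, `random_state=0`) are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=300)   # columns 2-4 are noise

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_    # impurity-based; the values sum to 1

# List the features from most to least important.
for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f}")
```

Before dropping a low-ranked feature, check whether it is a parent of a retained interaction or higher-order term.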