# Feature selection
Unit 2 / Lesson 1 / Assignment 7

__Feature selection__ is like handing out roses on _The Bachelor_.
We want to keep the features that have the strongest connection to the outcome, while also prioritizing features that bring something unique to the table.
Unlike on _The Bachelor_, our goal isn't to narrow the options down to only one ideal featurette, but to settle on a set of features that is relatively straightforward to understand, is predictively powerful, minimizes overfitting, and is relatively computationally efficient.
__Feature selection__ is a balancing act between explanatory power and model parsimony.
Fortunately, many __feature selection__ algorithms are available to help data scientists optimize their feature sets.

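The one thing all __feature selection__ algorithms have in common is that they should be run after the data is separated into a training set and a test set, and on the training set only. If the test set helps decide which features are kept, information about it leaks into the model, and performance on the test set will look better than it really is.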
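As a minimal sketch of that workflow, the snippet below splits some made-up rows, decides which columns to keep by looking at the training rows only, and then applies that same decision to the test rows. The helper names (`train_test_split`, `choose_columns`, `keep_columns`) and the simple variance rule are assumptions made for illustration, not part of any particular library.

```python
import random
from statistics import pvariance


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split the rows into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


def choose_columns(train_rows, min_variance=0.0):
    """Pick column positions by looking at the training rows only."""
    n_cols = len(train_rows[0])
    return [
        j for j in range(n_cols)
        if pvariance([row[j] for row in train_rows]) > min_variance
    ]


def keep_columns(rows, columns):
    """Keep only the chosen column positions in every row."""
    return [[row[j] for j in columns] for row in rows]


# Made-up data: column 1 is constant, so it carries no information.
rows = [[float(i), 1.0, float(i % 3)] for i in range(12)]

train_rows, test_rows = train_test_split(rows)
columns = choose_columns(train_rows)                # decided on training data only
train_selected = keep_columns(train_rows, columns)
test_selected = keep_columns(test_rows, columns)    # same columns, no re-selection
```

The test rows never influence `columns`; they simply receive whatever the training rows decided.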

__Feature selection__ algorithms fall into three broad groups: _filter methods_, _wrapper methods_, and _embedded methods_.


### Filter methods

__Filter methods__ evaluate each feature separately and assign it a "score" that is used to rank the features. Features that score above a chosen cutoff are retained, and the rest are discarded.
The feature may be evaluated independently of the outcome, or in combination with it.
Variance thresholds, where only features with a variance above a certain cutoff are retained, are an example of evaluating features independently of the outcome.
Scoring each feature by its correlation with the outcome is an example of evaluating features in combination with it.
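Both kinds of filter can be sketched in a few lines. The example below is only an illustration under assumed data: the feature names, the cutoffs (`min_variance`, `min_abs_corr`), and the helper functions are all hypothetical. It applies a variance threshold first, then ranks the surviving features by the absolute value of their Pearson correlation with the outcome.

```python
from math import sqrt
from statistics import pvariance


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def variance_filter(features, min_variance):
    """Keep features whose variance exceeds the cutoff (the outcome is not used)."""
    return {
        name: col for name, col in features.items()
        if pvariance(col) > min_variance
    }


def correlation_filter(features, outcome, min_abs_corr):
    """Rank features by |correlation with the outcome| and keep those at or above the cutoff."""
    scores = {name: abs(pearson(col, outcome)) for name, col in features.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= min_abs_corr]


# Made-up data for illustration.
outcome = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
features = {
    "constant": [1.0] * 6,
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}

varying = variance_filter(features, min_variance=0.0)
kept = correlation_filter(varying, outcome, min_abs_corr=0.5)
```

Running the variance threshold first also avoids a division by zero, since the correlation of a constant feature with anything is undefined.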

__Filter methods__ are good at picking out individual features that are likely to be related to the outcome.
They are computationally simple and straightforward, but because relationships between features are not considered, they are likely to produce lists of redundant features.
Because they're "cheap" to run, you might use __filter methods__ as a first pass at reducing features before applying more computationally demanding algorithms like _wrapper methods_.
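To see how redundancy slips through, consider a made-up case where the same measurement appears twice in different units. In this sketch the column names, the values, and the 0.9 cutoff are all hypothetical. Each feature is scored on its own, so both copies pass the filter even though the second adds nothing new.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical outcome and a feature recorded in two different units.
outcome = [50.0, 58.0, 63.0, 70.0, 79.0, 84.0]
height_cm = [150.0, 160.0, 165.0, 170.0, 180.0, 185.0]
features = {
    "height_cm": height_cm,
    "height_in": [value / 2.54 for value in height_cm],
}

# The filter scores each feature separately against the outcome.
scores = {name: abs(pearson(col, outcome)) for name, col in features.items()}
passed = [name for name, score in scores.items() if score >= 0.9]

# Only a comparison between the features themselves reveals the overlap.
overlap = abs(pearson(features["height_cm"], features["height_in"]))
```

Here `passed` contains both columns, while `overlap` is essentially 1, which is the kind of relationship a per-feature score never sees.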


### Wrapper methods

__Wrapper methods__ select sets of features.
Different sets are constructed, each set is evaluated in terms of its predictive power in a model, and its performance is compared to the performance of other sets.
__Wrapper methods__ differ in terms of how the sets of features are constructed.
Two such set construction strategies are "forward passes" and "backward passes".
In _forward passes_, the algorithm begins with no features and adds features one-by-one, always adding the feature that results in the highest increase in predictive power and stopping at some predetermined threshold.
In _backward passes_, the algorithm begins with all features and drops features one-by-one, always dropping the feature with the least predictive power and stopping at some predetermined threshold.
_Forward_ and _backward pass_ methods are considered "greedy" because once a feature is added (forward) or removed (backward) it is never again evaluated for the model.
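A forward pass can be sketched as a short greedy loop. The version below is an illustration under several assumptions: the model is a tiny 1-nearest-neighbor classifier scored by accuracy on a held-out validation split, the stopping threshold is a minimum gain (`min_gain`), and the data and function names are invented for the example. Feature 2 is a copy of feature 0, and feature 1 is noise.

```python
from collections import Counter


def one_nn_accuracy(features, X_train, y_train, X_valid, y_valid):
    """Validation accuracy of a 1-nearest-neighbor model that sees only `features`."""
    if not features:
        # With no features, always predict the most common training label.
        guess = Counter(y_train).most_common(1)[0][0]
        return sum(label == guess for label in y_valid) / len(y_valid)
    correct = 0
    for row, label in zip(X_valid, y_valid):
        distances = [
            sum((row[j] - other[j]) ** 2 for j in features) for other in X_train
        ]
        nearest = distances.index(min(distances))
        correct += y_train[nearest] == label
    return correct / len(y_valid)


def forward_select(n_features, score_fn, min_gain=1e-9):
    """Greedy forward pass: add the best feature until the gain falls below min_gain."""
    selected, best_score = [], score_fn([])
    remaining = list(range(n_features))
    while remaining:
        scores = {f: score_fn(selected + [f]) for f in remaining}
        best_f = max(remaining, key=scores.get)  # ties go to the lowest index
        if scores[best_f] - best_score < min_gain:
            break
        selected.append(best_f)      # greedy: an added feature is never reconsidered
        remaining.remove(best_f)
        best_score = scores[best_f]
    return selected, best_score


# Made-up data: feature 0 separates the classes, feature 1 is noise,
# and feature 2 duplicates feature 0.
X_train = [
    [0.0, 0.9, 0.0], [0.1, 0.1, 0.1], [0.2, 0.5, 0.2],
    [0.8, 0.2, 0.8], [0.9, 0.8, 0.9], [1.0, 0.4, 1.0],
]
y_train = [0, 0, 0, 1, 1, 1]
X_valid = [
    [0.15, 0.75, 0.15], [0.05, 0.3, 0.05],
    [0.85, 0.12, 0.85], [0.95, 0.47, 0.95],
]
y_valid = [0, 0, 1, 1]


def score(features):
    return one_nn_accuracy(features, X_train, y_train, X_valid, y_valid)


selected, final_score = forward_select(3, score)
```

On this data the pass keeps only feature 0: once it is in the set, neither the noise feature nor the duplicate improves validation accuracy, so the loop stops. A backward pass would mirror this loop, starting from all features and removing one at a time.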

__Wrapper methods__ are good at selecting useful sets of features that effectively predict the outcome.
For larger sets of features, however, __wrapper methods__ can be highly computationally intensive, since every candidate set requires fitting and evaluating a model, and they are more vulnerable to overfitting than _filter methods_.


### Embedded methods

__Embedded methods__ also select sets of features, but do so as an intrinsic part of the _fitting method_ for the particular type of model you're using.
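This may involve _regularization_, where a "complexity penalty" is added to goodness-of-fit measures typically used to assess the predictive power of a model.
With some penalties, such as the L1 penalty used in lasso regression, the coefficients of less useful features shrink to exactly zero, which removes those features from the model.
__Embedded methods__ provide many of the benefits of _wrapper methods_ but are typically less computationally intensive.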
Different types of models will use different __embedded methods__.
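As one illustration of selection happening inside the fitting step, here is a minimal sketch of lasso-style regression fit by coordinate descent. It is written under simplifying assumptions: there is no intercept, no column is all zeros, and the data is made up so that the columns are uncorrelated. The function names and the penalty strength `alpha` are chosen for the example. The quantity being minimized is the mean squared error divided by two, plus `alpha` times the sum of the absolute coefficients.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, snapping small values to exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Fit coefficients one at a time under an L1 complexity penalty."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with what the other features leave unexplained.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * coefs[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n  # assumes a nonzero column
            coefs[j] = soft_threshold(rho, alpha) / scale
    return coefs


# Made-up data: the outcome depends strongly on feature 0, barely on feature 1,
# and not at all on feature 2.
X = [
    [1.0, 1.0, 1.0],
    [-1.0, 1.0, -1.0],
    [1.0, -1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
y = [2.0 * row[0] + 0.05 * row[1] for row in X]

coefs = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, coef in enumerate(coefs) if coef != 0.0]
```

The penalty pulls the strong coefficient down slightly and sets the other two to exactly zero, so fitting the model and choosing the features happen in the same step. A larger `alpha` would drop more features; a smaller one would keep more.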


For a deep dive into the world of feature selection algorithms, check out [An Introduction to Variable and Feature Selection by Isabelle Guyon and André Elisseeff](http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf), published in the _Journal of Machine Learning Research_.