## Feature selection

* Select the best features available;
* Don't select unnecessary features;
* Create new features

## Adding a new feature

1. Use human intuition to guess which feature might carry information;
2. Code up the new feature;
3. Visualize it;
4. Repeat.
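
Step 2 might look like the sketch below. The records and field names (`messages_from_flagged`, `messages_received`) are hypothetical and only illustrate turning two raw counts into one ratio feature.

```python
def fraction(part, total):
    """Return part / total, or 0.0 when the total is missing or zero."""
    if not total:
        return 0.0
    return part / total

# Hypothetical records: two raw counts per data point.
records = [
    {"name": "a", "messages_from_flagged": 30, "messages_received": 120},
    {"name": "b", "messages_from_flagged": 2, "messages_received": 400},
    {"name": "c", "messages_from_flagged": 0, "messages_received": 0},
]

# The new feature is a ratio, which is often easier to compare across
# data points than the raw counts themselves.
for record in records:
    record["fraction_from_flagged"] = fraction(
        record["messages_from_flagged"], record["messages_received"]
    )

new_feature = [record["fraction_from_flagged"] for record in records]
print(new_feature)
```

The next step would be to plot `new_feature` against the labels and check whether it separates them at all.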

## Getting rid of features

Reasons to drop a feature:

* It's noisy;
* It causes overfitting;
* It is highly correlated with a feature that's already present;
* It slows down the training/testing process.
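
One way to spot the "highly correlated" case is to look at pairwise Pearson correlations. This is a minimal sketch on synthetic data; the helper name `highly_correlated_pairs` and the 0.95 threshold are assumptions chosen for illustration.

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.95):
    """Return (i, j, r) for every pair of columns with |Pearson r| > threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    n_features = corr.shape[0]
    pairs = []
    for i in range(n_features):
        for j in range(i + 1, n_features):
            if abs(corr[i, j]) > threshold:
                pairs.append((i, j, float(corr[i, j])))
    return pairs

# Synthetic data: column 2 is almost a scaled copy of column 0.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = 2.0 * a + 0.01 * rng.normal(size=200)
X = np.column_stack([a, b, c])

pairs = highly_correlated_pairs(X)
print(pairs)
```

For each reported pair, one of the two features is a candidate for removal.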

## Features != Information

* We want information so we can draw conclusions and gain insights;
* A feature is the actual number or characteristic of a particular data point that attempts to access that information;
* It's a little like the difference between the quantity of something and its quality: having many features doesn't necessarily mean having much information;
* An example of preprocessing is in `tools/email_preprocess.py`.

## Sklearn options

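* `SelectPercentile`: selects the top X% of features by score (the `percentile` parameter);
* `SelectKBest`: selects the `k` highest-scoring features;
* In `TfidfVectorizer`, the `max_df` argument can be used for feature reduction (dimensionality reduction): terms whose document frequency is above `max_df` are ignored.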
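
A minimal sketch of the three options on made-up data. The synthetic features, the toy documents, and the choice of `f_classif` as the scoring function are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Synthetic data: column 0 tracks the label, columns 1-3 are pure noise.
rng = np.random.default_rng(42)
n = 200
y = rng.integers(0, 2, size=n)
informative = y + 0.1 * rng.normal(size=n)
noise = rng.normal(size=(n, 3))
X = np.column_stack([informative, noise])

# Keep the single best-scoring feature.
k_best = SelectKBest(score_func=f_classif, k=1).fit(X, y)
kept_by_k = k_best.get_support(indices=True)

# Keep the top 25% of features (1 of 4 here).
top_quarter = SelectPercentile(score_func=f_classif, percentile=25).fit(X, y)
kept_by_percentile = top_quarter.get_support(indices=True)

print(kept_by_k, kept_by_percentile)

# max_df=0.5 drops any term that appears in more than half of the documents.
docs = ["spam offer now", "offer meeting notes",
        "offer lunch today", "offer budget report"]
vectorizer = TfidfVectorizer(max_df=0.5).fit(docs)
print(sorted(vectorizer.vocabulary_))
```

Here "offer" occurs in every document, so it carries no discriminating power and is dropped from the vocabulary.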

## Bias-variance dilemma and number of features

**High bias algorithm**
* Pays little attention to the training data and is oversimplified;
* High error on training set (in regression it means low r-squared or a large sum of the squared residual errors);
* Few features used.

**High variance algorithm**
* Pays too much attention to the training data and basically memorizes the training examples, so it doesn't generalize well to new situations it hasn't seen before (overfitting);
* Fits the training set well but has a higher error on the test set;
* Many features, carefully optimized performance on training data.

Using too few features can put the model in a classic high-bias regime: the fit is oversimplified.

A model that was very carefully tuned to minimize the sum of squared errors (SSE) of a regression, using lots of features to squeeze every little bit of information out of the data, **can end up in a high-variance regime** and overfit the data.

So there's a tradeoff between the goodness of the fit and the simplicity of the fit.

The goal is to fit the algorithm with few features while still getting, in the case of regression, a large r-squared (r²) or, equivalently, a low sum of squared residual errors (SSE).
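
The tradeoff can be seen on a small synthetic regression. In this sketch (all data and helper names are made up for illustration) the target depends on 3 of 25 features, and an ordinary least-squares fit is compared when given 1, 3, or all 25 features.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to the feature matrix."""
    return np.column_stack([np.ones(len(X)), X])

def fit_least_squares(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return coef

def r_squared(coef, X, y):
    """1 - SSE / total sum of squares."""
    residuals = y - design(X) @ coef
    sse = residuals @ residuals
    total = (y - y.mean()) @ (y - y.mean())
    return 1.0 - sse / total

# Synthetic data: only the first 3 of 25 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 25))
y = X[:, 0] + X[:, 1] + X[:, 2] + 0.5 * rng.normal(size=60)
X_train, y_train = X[:30], y[:30]
X_test, y_test = X[30:], y[30:]

scores = {}
for n_features in (1, 3, 25):
    coef = fit_least_squares(X_train[:, :n_features], y_train)
    scores[n_features] = (
        r_squared(coef, X_train[:, :n_features], y_train),  # training r²
        r_squared(coef, X_test[:, :n_features], y_test),    # test r²
    )
    print(n_features, scores[n_features])
```

Training r² can only go up as features are added, because each larger model contains the smaller one. The test r² is what reveals whether the extra features helped or were only fitting noise.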

## Balancing errors with the number of features

* How can we mathematically define this arc (model quality as a function of the number of features) so that its maximum point can be found?
* This process is called **regularization**;
* It's an automatic form of feature selection that some algorithms can do completely on their own: they trade off the precision of the fit (goodness of fit, very low error) against the complexity of fitting on lots of different features.

## Regularization in regression

A method for automatically penalizing extra features used in a model.

**Lasso regression**

minimize $SSE+\lambda|\beta|$

* The first term minimizes the sum of the squared errors of the fit, i.e. the squared distance between the fit and each data point;
* The second term adds a penalty for the features being used, so the fit also tries to use as few features as possible;
* $\lambda$: penalty parameter;
* $\beta$: coefficients of the regression, one per feature; $|\beta|$ is the sum of their absolute values, so every feature with a nonzero coefficient adds to the penalty.

Comparing two fits that use different numbers of features:
* The one with more features will almost always have a smaller sum of squared errors (it can fit the points more precisely), but it pays a penalty for each extra feature;
* The gain in precision (the goodness of fit of the regression) has to be bigger than the loss incurred by having the additional feature in the regression;
* The result balances small errors against a simpler fit that uses fewer features;
* Because the penalty is taken into account automatically, Lasso figures out which features have the most important effect on the regression;
* Once it has found those features, it can eliminate the rest by setting the coefficients of the features that don't help to zero.
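
The comparison can be worked through by hand with the objective $SSE+\lambda|\beta|$. The data and the two candidate coefficient vectors below are made up, and the fits have no intercept to keep the arithmetic short.

```python
def sse(X, y, beta):
    """Sum of squared errors of the linear fit y ≈ X · beta (no intercept)."""
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(b * x for b, x in zip(beta, row))
        total += (target - prediction) ** 2
    return total

def lasso_objective(X, y, beta, lam):
    """SSE plus lam times the sum of absolute coefficient values."""
    return sse(X, y, beta) + lam * sum(abs(b) for b in beta)

X = [[1, 0], [2, 1], [3, 0], [4, 1]]
y = [1.0, 2.1, 3.0, 4.1]

simple_fit = [1.0, 0.0]   # uses only the first feature, SSE is about 0.02
complex_fit = [1.0, 0.1]  # uses both features, SSE is about 0

results = {}
for lam in (1.0, 0.1):
    results[lam] = (
        lasso_objective(X, y, simple_fit, lam),
        lasso_objective(X, y, complex_fit, lam),
    )
    print(lam, results[lam])
```

The second feature reduces the SSE by about 0.02 and costs an extra $0.1\lambda$ in penalty. With $\lambda = 1$ the simpler fit has the lower objective; with $\lambda = 0.1$ the extra feature pays for itself.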

## Lasso regression

For features that don't help the regression results enough, Lasso can set their coefficients to zero. In sklearn, the penalty parameter $\lambda$ is called `alpha`.

* [sklearn Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)

```python
from sklearn import linear_model

clf = linear_model.Lasso(alpha=0.1)
clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
print(clf.coef_)
```

```
[0.85 0.  ]
```


* We have two features for each training point, and the second coefficient is 0, which means that feature is effectively not being used in the regression. The second feature can be disregarded: at least in this particular regression, all the discriminating power is coming from the first feature.
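
To see how the penalty parameter changes which features survive, here is a sketch on synthetic data with one strong feature, one weak feature, and one noise feature. The data and the three `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: strong effect from column 0, weak effect from column 1,
# no effect from column 2.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.2 * X[:, 1] + 0.1 * rng.normal(size=n)

coefs = {}
for alpha in (0.01, 0.5, 5.0):
    model = Lasso(alpha=alpha).fit(X, y)
    coefs[alpha] = model.coef_
    print(alpha, np.round(model.coef_, 2))
```

A small `alpha` keeps the weak feature, a moderate one drops it and shrinks the strong coefficient, and a very large one zeroes every coefficient.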

If a decision tree is overfit, we expect a high accuracy on the training set and a pretty low accuracy on the test set.
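
That pattern is easy to reproduce with a small sketch. The data is synthetic (the label depends on one feature, with 20% of the labels flipped as noise), and the depth limit of 1 is an arbitrary choice for comparison.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label is the sign of feature 0, with 20% label noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2
y[flip] = 1 - y[flip]

X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# An unrestricted tree keeps splitting until it memorizes the training set.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A depth-1 tree can only make a single split.
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

deep_train, deep_test = deep.score(X_train, y_train), deep.score(X_test, y_test)
shallow_train, shallow_test = shallow.score(X_train, y_train), shallow.score(X_test, y_test)

print("deep:   ", deep_train, deep_test)
print("shallow:", shallow_train, shallow_test)
```

The gap between training and test accuracy, rather than either number alone, is the signal that the deep tree is overfit.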