# Example Sets

# Cleaning
## drop nans
## filling in values
# Model prep
This section will cover functions and methods in the [sklearn.preprocessing](http://scikit-learn.org/stable/modules/preprocessing.html) module and patsy.

## Encoding class variables
### Binarizing

In [None]:
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=1.1)
binarizer.transform(X)
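
To see what the threshold does, here is a minimal sketch on a made-up array (`X_demo` and its values are invented for illustration). Values strictly greater than `threshold` become 1, and everything else becomes 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up data, for illustration only
X_demo = np.array([[0.5, 1.1, 2.0],
                   [1.2, 0.0, -3.0]])

binarizer = Binarizer(threshold=1.1)
# Binarizer is stateless, so fit learns nothing; fit_transform keeps the usual API shape
X_bin = binarizer.fit_transform(X_demo)
print(X_bin)  # 1.1 is not strictly greater than the threshold, so it maps to 0
```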

### [onehot](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
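
In [None]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit_transform(X)  # the encoder must be fitted before it can transform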
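
A small sketch of one-hot encoding on made-up categorical data (the array `X_cat` and the choice of `handle_unknown="ignore"` are assumptions for this example, not part of the original notes):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical data: one column of colors, one of sizes
X_cat = np.array([["red", "S"],
                  ["blue", "M"],
                  ["red", "L"]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
enc = OneHotEncoder(handle_unknown="ignore")
X_onehot = enc.fit_transform(X_cat)  # a scipy sparse matrix by default

print(enc.categories_)      # categories found in each column, sorted
print(X_onehot.toarray())   # dense view, for inspection only
```

Each input column expands into one output column per category, so two columns with 2 and 3 categories give 5 output columns.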

Note that patsy does one-hot encoding automatically.

## Train test split
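
The scaling examples in these notes use `X_train` and `X_test`. One common way to produce them is `sklearn.model_selection.train_test_split`; the sketch below assumes made-up data and an arbitrary 25% test fraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 12 samples, 2 features
X = np.arange(24).reshape(12, 2)
y = np.arange(12)

# Hold out 25% of the rows; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)
```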

## Scaling
### Standard scaling
#### Quick and dirty [scale](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html)

In [None]:
from sklearn.preprocessing import scale
X_scaled = scale(X)

#### The right way [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
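
The reason this is "the right way" is that the scaler learns its mean and standard deviation from the training data only and then reuses them on the test data. A minimal sketch with made-up arrays (the values are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up train/test data, for illustration only
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from train only
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print(scaler.mean_)    # per-column training means
print(scaler.scale_)   # per-column training standard deviations
print(X_test_scaled)   # test data need not end up with mean 0 or std 1
```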

### Scaling to a range

In [None]:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)
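
One thing worth knowing is that the range is only guaranteed for the data the scaler was fitted on. The sketch below uses made-up values to show test points landing outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: the test set extends beyond the training range
X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[0.0], [6.0]])

min_max_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_minmax = min_max_scaler.fit_transform(X_train)  # maps train min to 0, train max to 1
X_test_minmax = min_max_scaler.transform(X_test)        # same mapping, may fall outside [0, 1]

print(X_train_minmax.ravel())
print(X_test_minmax.ravel())
```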

### Normalization
You can normalize data so that each sample (row) has unit norm, for example using `l2` normalization. This matters for methods that compare samples using dot products, such as cosine similarity.  [`sklearn.preprocessing.normalize`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html) divides each row by its norm, which is the quantity that [`np.linalg.norm`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html) computes.

In [None]:
# Function
from sklearn.preprocessing import normalize
X_normalized = normalize(X, norm='l2')

# Object
from sklearn.preprocessing import Normalizer
normalizer = Normalizer(norm='l2').fit(X)  # fit is a no-op; Normalizer is stateless
X_normalized = normalizer.transform(X)
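
To make the relationship with `np.linalg.norm` concrete, here is a sketch on a made-up array that compares `normalize` with dividing each row by its L2 norm by hand:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up data, for illustration only
X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0]])

X_normalized = normalize(X_demo, norm="l2")  # axis=1 by default: each row is rescaled

# Same computation with NumPy: divide each row by its L2 norm
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)
X_manual = X_demo / row_norms

print(X_normalized)
print(np.linalg.norm(X_normalized, axis=1))  # every row now has norm 1
```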

### Scaling sparse data
Scaling sparse data is a little more tricky, because you want to do it without allocating a ton of memory.  Centering the data just won't do, because subtracting the mean turns every implicit zero into a nonzero value, so the matrix is no longer sparse.

Fortunately, sklearn has a few methods that can work with sparse data.

#### MaxAbs Scaler
The MaxAbs scaler divides each feature by its maximum absolute value.  It never shifts the data, so zeros stay zeros and the matrix stays sparse.  That matters because, in many cases, sparse matrices hold counts and the missing values are all 0s.  Min-max scaling subtracts the minimum, which would move those entries whenever the minimum is not 0, and we don't want that.

We can use this scaler as a [`MaxAbsScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) object, or using the quick-and-dirty function [`maxabs_scale`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.maxabs_scale.html).

In [None]:
from sklearn.preprocessing import MaxAbsScaler
max_abs_scaler = MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_test_maxabs = max_abs_scaler.transform(X_test)
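
A short sketch on a made-up sparse count matrix (built with `scipy.sparse`, which is an assumption of this example) shows that the output stays sparse and the zeros are untouched:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Made-up count data stored as a sparse matrix
X_counts = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                       [4.0, 0.0, 0.0],
                                       [0.0, 1.0, 5.0]]))

max_abs_scaler = MaxAbsScaler()
X_scaled = max_abs_scaler.fit_transform(X_counts)  # divides each column by its max |value|

print(sparse.issparse(X_scaled))       # the result is still a sparse matrix
print(X_scaled.nnz == X_counts.nnz)    # no new nonzero entries were created
print(X_scaled.toarray())
```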

#### scale and StandardScaler
By default, `scale` and `StandardScaler` will raise an error if you use them with sparse matrices, but you can still use them if you set `with_mean=False`.  This scales by the standard deviation but does not center the data. The result is hard to interpret unless the data is already roughly centered, so I would only use this when positive and negative values occur in roughly equal proportion and missing values represent 0.
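
The sketch below illustrates both behaviors on a made-up sparse matrix whose columns already have mean 0: the default scaler refuses to center it, while `with_mean=False` rescales it and keeps it sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Made-up sparse data that is already centered (each column sums to 0)
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0],
                                       [-1.0, 0.0],
                                       [0.0, 2.0],
                                       [0.0, -2.0]]))

# The default (with_mean=True) cannot center a sparse matrix
try:
    StandardScaler().fit(X_sparse)
    centering_failed = False
except ValueError:
    centering_failed = True

# Scale by the standard deviation only; the zeros stay zeros
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_sparse)

print(centering_failed)
print(X_scaled.toarray())
```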

### Scaling with outliers
If you have a lot of extreme outliers, the mean and standard deviation are dominated by them, and the usual methods will not give you good results.  In this case, you can use a [`RobustScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html), which centers on the median and scales by the interquartile range.  There is also a quick and dirty function [`robust_scale`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.robust_scale.html).
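
As a rough illustration, the sketch below scales one made-up column containing a single extreme value with both `RobustScaler` and `StandardScaler`. With the standard scaler the four ordinary points are squashed together; with the robust scaler they keep a usable spread.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up data: four ordinary values and one extreme outlier
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

X_robust = RobustScaler().fit_transform(X_demo)      # (x - median) / IQR
X_standard = StandardScaler().fit_transform(X_demo)  # (x - mean) / std

# Spread of the four ordinary points under each scaler
inlier_spread_robust = X_robust[:4].max() - X_robust[:4].min()
inlier_spread_standard = X_standard[:4].max() - X_standard[:4].min()

print(X_robust.ravel())
print(inlier_spread_robust, inlier_spread_standard)
```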

## Odds and ends
There are a few things in model prep that are interesting but that I didn't want to go into, so I'll just point to them here.
### [Feature Extraction](http://scikit-learn.org/stable/modules/feature_extraction.html)

In [None]:
from sklearn.feature_extraction import DictVectorizer

## Higher-order features
[`PolynomialFeatures`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) will compute higher order features for you.  It has a few useful flags.

`degree` sets the highest order of the polynomials.  `PolynomialFeatures` will compute all of the degrees from 0 to `degree`.

`interaction_only` eliminates terms in which any feature is raised to a power of 2 or higher, e.g. [a^2, b^2], leaving only products of distinct features.  It does not exclude 0th or 1st order terms.

`include_bias`, if set to `True`, includes a column of 1s.  Generally, this is a good idea, but it's worth paying attention to whether any of the other preprocessing steps will add another column of 1s.  If you add the 1s yourself, always do it after computing higher-order features, or else the constant column gets multiplied into the other features and you end up with a bunch of redundant terms.

After applying this transform, you will probably want to know what your cross terms are.  Fortunately, `PolynomialFeatures` has a method `get_feature_names_out` (called `get_feature_names` in scikit-learn versions before 1.0), which takes an array of names of the old features and tells you the names of your new features.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
X_poly = poly.fit_transform(X)
X_poly_names = poly.get_feature_names_out(input_features=None) # Uses [x0, x1 ..] by default

## Handling imbalanced classes: undersampling/oversampling
## Dimensionality reduction
### PCA
### SVD
# Regression
## Models
## Accuracy
### Scores
# Classification
## models
## accuracy
### ROC analysis
### Confusion matrix
### Scores
# Tuning
## Pipelines
[Pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) offer a convenient way to store all of the stages of a modeling process.  This is especially useful for tuning. 
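
A minimal sketch of a pipeline used for tuning, assuming synthetic data from `make_classification`, a logistic regression model, and an arbitrary grid of `C` values (all choices made for this example only). Step parameters are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),   # refit on the training folds of each CV split
    ("clf", LogisticRegression()),
])

# Tune the classifier's regularization strength through the pipeline
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
```

Because the scaler lives inside the pipeline, cross-validation fits it only on each training fold, which avoids leaking test-fold statistics into the preprocessing.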
## Parameter optimization
## Feature selection
### Recursive feature elimination
### nbest
#### linear corr
#### mutual information
#### distance correlation
# Visualization
## Regression line
## Space segmentation