# Preprocessing and Feature Engineering

## Scaling

Many models, particularly K-NN and linear models, benefit substantially from feature scaling; the notable exception is tree-based models, whose splits do not depend on the scale of a feature. We cover a few methods readily available in scikit-learn and other packages. (In the formulas below, $i$ denotes the row and $j$ the column.)  

`from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer, MaxAbsScaler`  

**&#9755; StandardScaler:**  

$\large{x_{ij}' = \frac{x_{ij} - \mu_j}{\sigma_j}} \quad \forall i,j$

All scaled features now have zero mean and a standard deviation of 1. Note that the scaled data is not bounded, since it is only divided by the standard deviation: an outlier before the transformation is still an outlier afterwards. StandardScaler is also sensitive to outliers, because they skew the mean and standard deviation that it estimates. 


**&#9755; MinMaxScaler**  

$\large{x_{ij}' = \frac{x_{ij} - \min(x_j)}{\max(x_j) - \min(x_j)}} \quad \forall i,j$  

Maps the minimum of each column to 0 and the maximum to 1. The formula shows that MinMaxScaler is very sensitive to outliers: a single extreme value stretches the range, so the remaining points are squeezed into a small part of the unit interval. Useful when the data has clearly defined bounds, e.g. greyscale image pixels.  


**&#9755; MaxAbsScaler**  
Divides each feature by its maximum absolute value, so the result lies in $[-1, 1]$. Unlike MinMaxScaler it does not shift the data, so zeros stay zero. Useful for sparse data. 

**&#9755; RobustScaler**  
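Similar to StandardScaler, but built on robust statistics: it subtracts the median and divides by the interquartile range (IQR) instead of using the mean and standard deviation, which makes the scaling parameters robust to outliers. (Further reading: the median absolute deviation (MAD), a related robust measure of spread.)  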

**&#9755; Normalizer**  
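Rescales each row (sample), rather than each column, to unit norm, i.e. projects it onto the L1 or L2 unit sphere (L2 means Euclidean length 1). This is mostly useful when only the direction of a sample matters and not its magnitude, e.g. when comparing text documents by cosine similarity.  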
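
To make the differences between the scalers concrete, here is a minimal sketch on made-up data with one outlier. The toy arrays and variable names are assumptions for illustration only; the hand-written formulas mirror the definitions above.

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)

# Toy data (made up for illustration): one feature with an outlier in the last row.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)   # (x - mean) / std
minmax = MinMaxScaler().fit_transform(X)       # (x - min) / (max - min)
maxabs = MaxAbsScaler().fit_transform(X)       # x / max(|x|)
robust = RobustScaler().fit_transform(X)       # (x - median) / IQR

# The same formulas written out by hand.
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
q25, q50, q75 = np.percentile(X, [25, 50, 75], axis=0)
manual_robust = (X - q50) / (q75 - q25)

# Normalizer works per row, not per column.
rows = np.array([[3.0, 4.0], [1.0, 1.0]])
unit_rows = Normalizer(norm="l2").fit_transform(rows)

print(np.hstack([standard, minmax, maxabs, robust]))
print(unit_rows)
```

With MinMaxScaler the four inliers end up very close to 0 because the outlier defines the range, while RobustScaler keeps them spread out around 0.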

## Pitfalls

**&#9841; Scaling Sparse data:**  
Do not center sparse data (i.e. do not subtract the mean as StandardScaler does, and do not use MinMaxScaler), since shifting turns the zeros into non-zeros, makes the matrix dense, and blows up RAM and CPU usage. Instead, scale by a constant factor only: a constant times zero is still zero, which preserves sparsity. Use MaxAbsScaler for sparse data.  
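
A small sketch of this pitfall, using a made-up 3x3 sparse matrix (the data and variable names are illustrative assumptions). It shows that MaxAbsScaler keeps the matrix sparse and that StandardScaler refuses to center sparse input unless centering is switched off.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Toy sparse matrix (made up): mostly zeros, as in bag-of-words data.
X_sparse = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                       [4.0, 0.0, 0.0],
                                       [0.0, 0.0, -8.0]]))

# Scaling by a constant keeps zeros at zero, so the result stays sparse.
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# Scaling without centering also works on sparse input.
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)

# Centering is rejected for sparse input because it would densify the matrix.
try:
    StandardScaler().fit_transform(X_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True

print(X_maxabs.toarray())
print(centering_rejected)
```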

**&#9841; Including test set in scaler's `fit()`**  
Including the test set in `fit()` leads to artificially high scores, since the scaler's parameters are then estimated partly from the test set. In deployment, we obviously can't estimate scaler parameters from unseen data. 

**&#9841; Calling `fit()` on training and test set separately:**  
Never call `fit()` on the test set separately, since that changes the position of the test points relative to the training points. Call `fit()` on the training set only, then call `transform()` on both the training and test sets. 
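
The fit/transform discipline can be sketched as follows. The data is synthetic and the variable names are assumptions; the point is only that the test set is transformed with the training set's mean and standard deviation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic data
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)       # parameters come from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # same shift and scale as the training set

# What NOT to do: a separate fit on the test set uses different parameters.
X_test_wrong = StandardScaler().fit_transform(X_test)
```

Note that `X_test_scaled` will generally not have exactly zero mean and unit variance, and that is expected.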

**&#9841; Not using pipelines in cross validation:**  
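If you scale the whole training set first and then cross-validate on it, each validation fold has already influenced the scaler's parameters. This is the same leak as including the test set in `fit()`. Putting the scaler in a pipeline solves this, because the scaler is then refit on the training part of every split.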

<img src="files/images/scaling.png">

## Detour: Pipelines

In [15]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline is a thin wrapper around Pipeline that names the steps automatically
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

# Pipeline gives flexibility in naming the steps. Useful when tuning parameters.
# Steps are a list of (name, estimator instance) tuples.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("regressor", KNeighborsRegressor())])

# cross validation with pipeline: the scaler is refit on each training fold
knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
scores = cross_val_score(knn_pipe, X_train, y_train, cv=10)
np.mean(scores), np.std(scores)


# pipeline and GridSearchCV. Use modified names ([step_name]__[parameter]) 
# in GridSearch param_grid. Named steps require Pipeline, not make_pipeline.
knn_pipe = Pipeline([('scale', StandardScaler()), ('model', KNeighborsRegressor())])
param_grid = {'model__n_neighbors': range(1, 10)}
grid = GridSearchCV(knn_pipe, param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))

## Feature Distributions 

Linear models may perform better when features are approximately normally distributed. Scaling only shifts and stretches a feature; it does not change the shape of its distribution. Several transformations are available that do change the shape. The most common are power transformations, particularly the Box-Cox transformation.  

**&#9755; Box-Cox transformation:**  

$bc_{\lambda}(x) = \begin{cases}\frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0\end{cases}$  

Only applicable for strictly positive $x$!  
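
As a worked check of the formula, here is a small hand-written version for a fixed $\lambda$. The helper name `box_cox` and the sample values are assumptions made for this sketch; it is compared against `scipy.stats.boxcox` with an explicit `lmbda`.

```python
import numpy as np
from scipy import stats


def box_cox(x, lam):
    """Box-Cox transform of strictly positive x for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox is only defined for strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam


x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])  # made-up positive values
half = box_cox(x, 0.5)       # lambda = 0.5 behaves like a shifted square root
log_case = box_cox(x, 0.0)   # lambda = 0 is the log transform

print(half)
print(np.allclose(half, stats.boxcox(x, lmbda=0.5)))
```

The $\lambda = 0$ branch is the limit of the first branch as $\lambda \to 0$, which is why the two cases join smoothly.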



In [None]:
# PowerTransformer is available from sklearn 0.20 onwards
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='box-cox') 
# method='yeo-johnson' (the default) also handles zero and negative values
pt.fit(X)

## Categorical Data  

We can one-hot encode a feature's categorical values (e.g. $\{'green', 'blue', 'yellow'\}$) using pandas. We can encode all k values, which introduces a redundant column, or only k-1 of them, which makes the coefficients of a linear model easier to interpret. We can also one-hot encode with sklearn's `OneHotEncoder`; older versions (before 0.20) required the categories to be integer-coded first.  
  
We face a problem when a categorical feature has high cardinality (50, 100, or more distinct values). We must then find a way to compress the values into fewer than k features. The right solution is often specific to the dataset. 
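
A minimal sketch of the k versus k-1 encodings and of the sklearn encoder, on a made-up `color` column (the data and variable names are illustrative assumptions). `handle_unknown="ignore"` is one way to cope with categories that were not seen during `fit`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["green", "blue", "yellow", "blue"]})  # toy data

all_k = pd.get_dummies(df, columns=["color"])                        # k columns
k_minus_1 = pd.get_dummies(df, columns=["color"], drop_first=True)   # k-1 columns

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore").fit(df[["color"]])
encoded_train = encoder.transform(df[["color"]]).toarray()
encoded_unseen = encoder.transform(pd.DataFrame({"color": ["red"]})).toarray()

print(list(all_k.columns))
print(list(k_minus_1.columns))
print(encoded_unseen)
```

Because the encoder is fit once and then only transforms, it fits into a pipeline, unlike a bare `pd.get_dummies` call.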

## Pitfalls

**&#9841; Categorical Values in test but not training set:**  
Make sure to one-hot encode all possible values of the feature, not just those that happen to appear in the training and test sets.

In [None]:
import pandas as pd
df = pd.DataFrame({'salary': [103, 89, 142, 54, 63, 219],
                   'boro': ['Manhattan', 'Queens', 'Manhattan',
                            'Brooklyn', 'Brooklyn', 'Bronx']})

# note that there are more categories in the definition of the 
# column than those seen in the column
df['boro'] = pd.Categorical(
    df.boro, categories=['Manhattan', 'Queens', 'Brooklyn', 'Bronx', 'Staten Island'])
pd.get_dummies(df, columns=['boro'])

## Feature interactions  

Linear models particularly benefit from feature interactions. Feature transformations such as polynomial and interaction terms allow a linear model to fit non-linear boundaries or curves. However, these transformations blow up the feature space. Kernel methods give the power of such transformations without explicitly constructing the expanded features, which can avoid much of the CPU and memory cost. 
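
To see what the expansion looks like and how fast it grows, here is a small sketch with `PolynomialFeatures` on a made-up two-feature array (the toy values are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [1.0, 5.0]])  # toy data with two features a and b

# degree=2 yields the columns: 1, a, b, a^2, a*b, b^2
poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(X)

# interaction_only=True keeps only products of distinct features: 1, a, b, a*b
inter = PolynomialFeatures(degree=2, interaction_only=True)
X_inter = inter.fit_transform(X)

# The number of columns grows quickly with the number of input features.
n_out_10 = PolynomialFeatures(degree=2).fit(np.zeros((1, 10))).n_output_features_

print(X_poly[0])
print(X_inter[0])
print(n_out_10)
```

Ten input features already produce 66 columns at degree 2, which is the blow-up mentioned above.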

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Imputation  

Imputation refers to filling in missing values. We can always drop columns that contain NaN values, but then we lose all the information carried by the non-NaN values in those columns. We could also drop the observations that contain NaN values if they are few. There are four families of imputation methods that try to estimate the missing values: mean/median, kNN, regression, and probabilistic. A good general practice is to add a dummy feature indicating whether the value was NaN, in addition to imputing the value. Check out the fancyimpute library (note that its original API does not implement the fit/transform paradigm, so it does not work with pipelines and invites information leaks).
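
The "impute plus dummy indicator" practice can be sketched with sklearn's `SimpleImputer` and its `add_indicator` option. The toy array is an assumption for illustration; mean imputation is used here only because it is the simplest strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 30.0]])  # toy data with one NaN per column

# add_indicator=True appends one 0/1 column per feature that had missing values.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The first two output columns are the imputed features and the last two flag where a value was originally missing, so a downstream model can still use the fact that the value was absent.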

**&#9755; Mean Imputation:**  
Mean imputation fills in NaN values with the mean of the given feature. This only makes sense for non-binary data, since the mean of a binary feature is not a valid value. Mean imputation can be acceptable if missing values are few. Otherwise, it can distort the feature's distribution and hide useful structural relationships between the input and target data.  

**&#9755; kNN Imputation:**  
kNN imputation takes the k nearest neighbors of the observation with the missing value and replaces the NaN with the average value of that feature among those neighbors. Note that the distance can only be computed on features that are not NaN, since Euclidean distance is undefined for NaN values, so coordinates that are missing in either point are left out when searching for the k neighbors.  
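
For everyday use, a library implementation is preferable to a hand-rolled loop. The sketch below uses sklearn's `KNNImputer` (available in recent sklearn versions) on a made-up array; the data and the choice of `n_neighbors=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 1.0, 10.0],
              [1.1, 1.0, 12.0],
              [0.9, 1.0, np.nan],
              [9.0, 9.0, 100.0]])  # toy data; row 2 is missing its last feature

# Distances are computed on the coordinates that are present in both rows.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[2])
```

Row 2 is closest to rows 0 and 1 on the observed features, so its missing value becomes the average of their values, while the distant row 3 is ignored.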

**&#9755; Model Driven Imputation:**  
Train a model on the non-missing features to predict the missing ones (kNN imputation is arguably model-driven imputation too, with some slight differences). Random forests are a popular choice of model (see code below). 

**&#9755; (MICE) Multiple Imputation by Chained Equations:**  
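MICE fills in each feature that has missing values using a model fit on the other features, and cycles through the features repeatedly until the imputed values settle, much like model-driven imputation but typically with regression models. This is a very popular imputation method. It is available in fancyimpute, and sklearn's experimental `IterativeImputer` is inspired by it. 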

In [19]:
# mean imputation
# (the old sklearn.preprocessing.Imputer was replaced by sklearn.impute.SimpleImputer)
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean').fit(X_train)
imp.transform(X_train)


# kNN imputation: very inefficient didactic implementation
# use library!
distances = np.zeros((X_train.shape[0], X_train.shape[0]))
for i, x1 in enumerate(X_train):
    for j, x2 in enumerate(X_train):
        dist = (x1 - x2) ** 2
        nan_mask = np.isnan(dist)
        distances[i, j] = dist[~nan_mask].mean() * X_train.shape[1]
neighbors = np.argsort(distances, axis=1)[:, 1:]
n_neighbors = 3
X_train_knn = X_train.copy()
for feature in range(X_train.shape[1]):
    has_missing_value = np.isnan(X_train[:, feature])
    for row in np.where(has_missing_value)[0]:
        neighbor_features = X_train[neighbors[row], feature]
        non_nan_neighbors = neighbor_features[~np.isnan(neighbor_features)]
        X_train_knn[row, feature] = non_nan_neighbors[:n_neighbors].mean()
    
    
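# Model Driven Imputation with random forests.     
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(n_estimators=100)
# start from a mean-imputed copy so the forest never sees NaN inputs
X_imputed = SimpleImputer(strategy='mean').fit_transform(X_train)
for i in range(10):
    last = X_imputed.copy()
    for feature in range(X_train.shape[1]):
        inds_not_f = np.arange(X_train.shape[1])
        inds_not_f = inds_not_f[inds_not_f != feature]
        f_missing = np.isnan(X_train[:, feature])
        if not f_missing.any():
            continue  # nothing to impute for this feature
        rf.fit(X_imputed[~f_missing][:, inds_not_f], X_train[~f_missing, feature])
        X_imputed[f_missing, feature] = rf.predict(
            X_imputed[f_missing][:, inds_not_f])
    # stop once the imputed values barely change between sweeps
    if (np.linalg.norm(last - X_imputed)) < .5:
        break
# logreg is not defined in this section; a plain LogisticRegression is assumed
logreg = LogisticRegression()
scores = cross_val_score(logreg, X_imputed, y_train, cv=10)
np.mean(scores)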


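# MICE-style imputation
# fancyimpute.MICE has been dropped from recent fancyimpute releases; sklearn's
# (still experimental) IterativeImputer is used here as the closest replacement.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

mice = IterativeImputer(random_state=0)
X_train_mice = mice.fit_transform(X_train)
scores = cross_val_score(logreg, X_train_mice, y_train, cv=10)
scores.mean()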


<img src="files/images/imputation_methods.png">

## Feature Selection  

It is often beneficial to reduce the feature space. Fewer features lead to faster training and prediction, less data to gather once the model is in production, less storage for the model and dataset and, more importantly, more interpretable models. There are supervised and unsupervised feature selection methods.  

**&#9755; Covariance based:**  
When two features are highly correlated, we can remove one of them without substantially affecting model predictions (probably; it could be that the small difference between the two features is really important for predictions, but we can test this easily). TODO: show code for sorting a correlation matrix heatmap.
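
One possible way to drop one feature from each highly correlated pair is sketched below with pandas. The synthetic columns and the 0.95 threshold are assumptions for illustration, not a recommended default.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=500),  # nearly identical to "a"
    "b": rng.normal(size=500),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # arbitrary cut-off chosen for this sketch
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)
```

Which member of a correlated pair gets dropped depends only on column order here, so it is worth checking the effect on validation scores before committing to the reduced set.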

**&#9755; PCA:**  
PCA maps the data to a linear subspace. It only reduces the feature space for training and prediction, not for data gathering, since every original feature is still needed to compute the components. It makes the model less interpretable, BUT it can lead to very useful visualizations in 2D and 3D and can speed up training considerably. It can also remove useful information. 

**&#9755; Univariate Statistics:**  
We can fit a linear regression model on a single feature and the target and compute the F statistic and p-value for the coefficient. These values tell us whether the feature is useful for prediction on its own. However, linear regression assumes a linear relationship between input and target, which may be a poor assumption. Furthermore, the linear assumption decreases the importance of binary features.  

Mutual information (MI) relies on nonparametric methods based on entropy estimation from k-nearest-neighbor distances. MI can deal with binary features (we must specify which columns are discrete). It is a good choice when we expect a non-linear relationship between feature and target. However, it is much more computationally intensive than `f_regression`.  

**&#9755; Model Based feature selection:**  
Train a model and check the coefficients (in the case of linear models) or the splits (in the case of tree-based models). Discard features with coefficients close to zero or with no (or uninformative) splits. This is usually a good choice but computationally expensive. 

**&#9755; Iterative Model based feature selection:**  
Removing many features at once may change the model more than we intend. A better way is to remove features iteratively: drop the single least important feature, retrain the model, and repeat. We can do this with RFE (Recursive Feature Elimination) in sklearn. We can also iterate in the opposite direction, adding one feature at a time, with Sequential Feature Selection from the mlxtend library, which supports both directions. 


## Pitfalls  

**&#9841; Univariate statistics do not account for correlation:**  
Univariate statistics give highly correlated features the same importance. Model-driven feature selection accounts for correlation and can thus assign much higher importance to some of the correlated features while removing it from the others (which feature ends up with the importance is usually arbitrary).

In [None]:
# Get F and p values so we can visualize and drop features
from sklearn.feature_selection import f_regression

f_values, p_values = f_regression(X, y)


# or we can use SelectKBest to automatically keep only the k highest-scoring features.
from sklearn.feature_selection import SelectKBest, SelectPercentile, SelectFpr
from sklearn.linear_model import RidgeCV


select = SelectKBest(k=2, score_func=f_regression)
select.fit(X_train, y_train)
print(X_train.shape)
print(select.transform(X_train).shape)

# SelectKBest with pipelines
make_pipeline(StandardScaler(), SelectKBest(k=2, score_func=f_regression), RidgeCV())

# Mutual information
from sklearn.feature_selection import mutual_info_regression

scores = mutual_info_regression(X_train, y_train, discrete_features=[3])

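# Model Driven feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# keep features whose absolute Lasso coefficient is at least the threshold
select_lassocv = SelectFromModel(LassoCV(), threshold=1e-5)
select_lassocv.fit(X_train, y_train)
print(select_lassocv.transform(X_train).shape)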


# Recursive Feature Elimination
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# create ranking among all features by selecting only one
rfe = RFE(LinearRegression(), n_features_to_select=1)
rfe.fit(X_train_scaled, y_train)
rfe.ranking_

# RFE that automatically selects best number of features to keep
from sklearn.feature_selection import RFECV

rfe = RFECV(LinearRegression(), cv=10)
rfe.fit(X_train_scaled, y_train)
print(rfe.support_)
# assumes `boston` is the dataset object loaded earlier in the notebook
print(boston.feature_names[rfe.support_])

# Sequential Feature Selection
from mlxtend.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(LinearRegression(), forward=False, k_features=7)
sfs.fit(X_train_scaled, y_train)


# Dimensionality Reduction  

Transforming data using unsupervised learning can have many motivations. The most common motivations are visualization, compressing the data, and finding a representation that is more informative for further processing.

## PCA

At a high level, principal component analysis iteratively does the following: 

Find the direction, orthogonal to all previously found directions, along which the data points vary the most (highest variance), and describe it with a unit vector. Continue until we have n directions. These n vectors define an n-dimensional linear map.

These n vectors, representing n directions, are called principal components. For n-dimensional data, an n-dimensional PCA is just a rotation (a change of basis of the vector space) in which the dimensions are sorted in order of decreasing variance. 
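
The description above can be checked numerically with a small NumPy sketch. It uses an SVD of the centered data, which is one standard way to compute the principal directions; the synthetic data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic correlated 3-D data (made up for illustration).
X = rng.normal(size=(300, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.1]])

X_centered = X - X.mean(axis=0)
# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt
explained_variance = S ** 2 / (X.shape[0] - 1)

X_rotated = X_centered @ components.T   # full n-dimensional PCA: a change of basis
X_2d = X_centered @ components[:2].T    # keep only the two leading directions

print(explained_variance)
```

The components are orthonormal, the variances come out in decreasing order, and the total variance is unchanged by the full rotation; only truncating to fewer components discards information.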

TODO talk about PCA equation  

TODO Whitening

We usually scale the data to have unit standard deviation before applying PCA (PCA centers the data to zero mean under the covers). It is important to note that PCA is an unsupervised method and does not use any class information when finding the rotation. It simply looks at the correlations in the data.  

## Pitfalls  

**&#9841; Uneven class distributions**  
When classes are very unevenly represented in the dataset (or when there are outliers), PCA gives a lot of weight to the classes with many observations, since they dominate the variance. One remedy is to randomly discard observations of the frequently occurring classes so that the classes have similar counts.


<img src="files/images/pca-intuition.png">

In [None]:
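from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# StandardScaler takes no data in its constructor; fit_transform returns the scaled array
X_train_scaled = StandardScaler().fit_transform(X_train)
pca = PCA(n_components=2)
# fit() alone returns the estimator, so use fit_transform to get the projected data
X_train_pca = pca.fit_transform(X_train_scaled)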