# Scikit-learn processing pipelines

## Data preprocessing

Sources: http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html

### Encoding categorical features

In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings('once')

In [None]:
import pandas as pd

print(pd.get_dummies(['A', 'B', 'C', 'A', 'B', 'D']))

### Standardization of input features

Sources:

- http://scikit-learn.org/stable/modules/preprocessing.html

- http://stats.stackexchange.com/questions/111017/question-about-standardizing-in-ridge-regression


"Standardizing", i.e., mean removal and variance scaling, is not always necessary. For example, multiple linear regression does not require it. However, it is good practice in many cases:

- **The variable combination method is sensitive to scales**. If the input variables are combined via a distance function (such as Euclidean distance) in an RBF network, standardizing inputs can be crucial. The contribution of an input will depend heavily on its variability relative to other inputs. If one input has a range of 0 to 1, while another input has a range of 0 to 1,000,000, then the contribution of the first input to the distance will be swamped by the second input.

- **Regularized learning algorithms**. Lasso and Ridge regression regularize linear regression by imposing a penalty on the size of the coefficients. Thus the coefficients are shrunk toward zero and toward each other. But when this happens, and if the independent variables do not have the same scale, the shrinking is not fair. Two independent variables with different scales will contribute differently to the penalty term, because the penalty term is a norm (a sum of squares or of absolute values) of all the coefficients. To avoid such problems, the independent variables are very often centered and scaled to have variance 1.
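
The first point can be checked numerically. The sketch below is only an illustration on hypothetical uniform toy data (the variable names and the seed are assumptions, not part of the original material): it measures which share of the squared Euclidean distances to a reference sample comes from a feature ranging over 0 to 1 when the other feature ranges over 0 to 1,000,000, before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical toy data: feature 0 in [0, 1], feature 1 in [0, 1e6]
X = np.column_stack([rng.uniform(0, 1, 200),
                     rng.uniform(0, 1e6, 200)])

def feature_share(X):
    """Share of each feature in the summed squared distances to sample 0."""
    sq_diff = (X - X[0]) ** 2          # per-feature squared differences
    return sq_diff.sum(axis=0) / sq_diff.sum()

share_raw = feature_share(X)

# Standardize: remove the mean and scale each column to unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
share_std = feature_share(X_std)

print("Share of feature 0, raw data:     %.2e" % share_raw[0])
print("Share of feature 0, standardized: %.2f" % share_std[0])
```

On the raw data the small-range feature contributes almost nothing to the distance; after standardization both features carry comparable weight.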


In [None]:
import numpy as np
from sklearn import linear_model as lm
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score

# dataset
np.random.seed(42)
n_samples, n_features, n_features_info = 100, 5, 3
X = np.random.randn(n_samples, n_features)
beta = np.zeros(n_features)
beta[:n_features_info] = 1
Xbeta = np.dot(X, beta)
eps = np.random.randn(n_samples)
y = Xbeta + eps

X[:, 0] *= 1e6  # inflate the first feature
X[:, 1] += 1e6  # bias the second feature
y = 100 * y + 1000  # bias and scale the output

print("== Linear regression: scaling is not required ==")
model = lm.LinearRegression()
model.fit(X, y)
print("Coefficients:", model.coef_, model.intercept_)
print("Test R2:%.2f" % cross_val_score(estimator=model, X=X, y=y, cv=5).mean())

print("== Lasso without scaling ==")
model = lm.LassoCV(cv=3)
model.fit(X, y)
print("Coefficients:", model.coef_, model.intercept_)
print("Test R2:%.2f" % cross_val_score(estimator=model, X=X, y=y, cv=5).mean())

print("== Lasso with scaling ==")
model = lm.LassoCV(cv=3)
scaler = preprocessing.StandardScaler()
Xc = scaler.fit(X).transform(X)
model.fit(Xc, y)
print("Coefficients:", model.coef_, model.intercept_)
print("Test R2:%.2f" % cross_val_score(estimator=model, X=Xc, y=y, cv=5).mean())


## Scikit-learn pipelines

Sources: http://scikit-learn.org/stable/modules/pipeline.html

Note that statistics such as the mean and standard deviation must be computed from the training data, not from the validation or test data. The validation and test data must be standardized using the statistics computed from the training data. Thus standardization should be merged together with the learner using a ``Pipeline``.

A pipeline chains multiple estimators into one. All estimators in a pipeline, except the last one, must have the ``fit()`` and ``transform()`` methods. The last one must implement ``fit()`` and, for the pipeline to make predictions, ``predict()``.
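
To make this concrete, here is a minimal sketch, assuming a hypothetical toy dataset and an arbitrary train/test split. It compares a manual workflow (scaler fitted on the training set only, then applied to the test set) with the equivalent pipeline. ``Ridge`` with ``alpha=1.0`` is used purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with features on different scales and offsets
rng = np.random.RandomState(0)
X = rng.randn(60, 3) * np.array([1.0, 10.0, 100.0]) + np.array([0.0, 5.0, -50.0])
y = X @ np.array([1.0, 0.1, 0.01]) + rng.randn(60)
X_train, X_test, y_train = X[:40], X[40:], y[:40]

# Manual workflow: statistics come from the training set only
scaler = StandardScaler().fit(X_train)
ridge = Ridge(alpha=1.0).fit(scaler.transform(X_train), y_train)
pred_manual = ridge.predict(scaler.transform(X_test))

# Pipeline: fit() fits the scaler on X_train, predict() reuses those statistics
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_train, y_train)
pred_pipe = pipe.predict(X_test)

same = np.allclose(pred_manual, pred_pipe)
print("Same predictions:", same)
```

Because the pipeline is a single estimator, ``cross_val_score`` refits the scaler on each training fold, which avoids leaking test-fold statistics into the model.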

### Standardization of input features

In [None]:
from sklearn import preprocessing
import sklearn.linear_model as lm

from sklearn.pipeline import make_pipeline
model = make_pipeline(preprocessing.StandardScaler(), lm.LassoCV(cv=3))

# or
from sklearn.pipeline import Pipeline
model = Pipeline([('standardscaler', preprocessing.StandardScaler()), 
                  ('lassocv', lm.LassoCV(cv=3))])

scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
print("Test  r2:%.2f" % scores.mean())

### Features selection

An alternative to feature selection based on the $\ell_1$ penalty is to use a preprocessing step of univariate feature selection.

Such methods, called **filters**, are a simple, widely used method for supervised dimension reduction [26]. Filters are univariate methods that rank features according to their ability to predict the target, independently of other features. This ranking may be based on parametric (e.g., t-tests) or nonparametric (e.g., Wilcoxon tests) statistical methods. Filters are computationally efficient and more robust to overfitting than multivariate methods. However, they are blind to feature interrelations, a problem that can be addressed only with multivariate selection such as learning with $\ell_1$ penalty.
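
As a small illustration of such a filter, the sketch below ranks features with the univariate F-test (``f_regression``) and keeps the two best ones. The toy data, in which only features 0 and 5 carry signal, is a hypothetical setup chosen for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical toy data: only features 0 and 5 are related to the target
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.randn(200)

# Score each feature independently, then keep the k best-ranked ones
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = np.flatnonzero(selector.get_support())

print("Selected feature indices:", selected)
print("Three largest F-scores:", np.sort(selector.scores_)[::-1][:3])
```

Each score is computed from a single feature and the target, which is why the filter is cheap but cannot account for interactions between features.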

In [None]:
import numpy as np
import sklearn.linear_model as lm
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline

np.random.seed(42)
n_samples, n_features, n_features_info = 100, 100, 3
X = np.random.randn(n_samples, n_features)
beta = np.zeros(n_features)
beta[:n_features_info] = 1
Xbeta = np.dot(X, beta)
eps = np.random.randn(n_samples)
y = Xbeta + eps

X[:, 0] *= 1e6  # inflate the first feature
X[:, 1] += 1e6  # bias the second feature
y = 100 * y + 1000  # bias and scale the output

model = Pipeline([('anova', SelectKBest(f_regression, k=3)),
                  ('lm', lm.LinearRegression())])
scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
print("Anova filter + linear regression, test  r2:%.2f" % scores.mean())

from sklearn.pipeline import Pipeline
model = Pipeline([('standardscaler', preprocessing.StandardScaler()),
                  ('lassocv', lm.LassoCV(cv=3))])
scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
print("Standardize + Lasso, test  r2:%.2f" % scores.mean())

## Regression pipelines with CV for parameters selection

Now we combine standardization of input features, feature selection and a learner with hyperparameters within a pipeline, which is wrapped in a grid search procedure to select the best hyperparameters based on an (inner) CV. The whole procedure is plugged into an outer CV.
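
The nesting can be sketched on a deliberately small problem. The dataset size, the grid values and the number of folds below are assumptions chosen to keep the example fast, not recommended settings. Grid keys follow the ``<step name>__<parameter>`` convention of pipelines.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=80, n_features=20, n_informative=3,
                       noise=10, random_state=0)

pipe = Pipeline([
    ('standardscaler', StandardScaler()),
    ('selectkbest', SelectKBest(f_regression)),
    ('ridge', Ridge()),
])
# Keys are "<step name>__<parameter name>"
param_grid = {'selectkbest__k': [3, 10, 20],
              'ridge__alpha': [0.1, 1.0, 10.0]}

# Inner CV: hyperparameter selection within each outer training fold
inner = GridSearchCV(pipe, param_grid=param_grid, cv=3)

# Outer CV: performance estimate of the whole selection procedure
outer_scores = cross_val_score(inner, X, y, cv=5)
print("Outer CV r2: %.2f" % outer_scores.mean())

# cross_val_score works on clones, so fit explicitly to inspect the choice
inner.fit(X, y)
best = inner.best_params_
print("Best parameters on the full data:", best)
```

The outer scores estimate how well the complete procedure generalizes, including the hyperparameter search, whereas ``best_params_`` from a single fit only reports what was selected on that data.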

In [None]:
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import sklearn.metrics as metrics

# Datasets
n_samples, n_features, noise_sd = 100, 100, 20
X, y, coef = datasets.make_regression(n_samples=n_samples, n_features=n_features, 
                                      noise=noise_sd, n_informative=5,
                                      random_state=42, coef=True)
 
# Use this to tune the noise parameter such that snr < 5
print("SNR:", np.std(np.dot(X, coef)) / noise_sd)

print("=============================")
print("== Basic linear regression ==")
print("=============================")

scores = cross_val_score(estimator=lm.LinearRegression(), X=X, y=y, cv=5)
print("Test  r2:%.2f" % scores.mean())

print("==============================================")
print("== Scaler + anova filter + ridge regression ==")
print("==============================================")

anova_ridge = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
    ('selectkbest', SelectKBest(f_regression)),
    ('ridge', lm.Ridge())
])
param_grid = {'selectkbest__k':np.arange(10, 110, 10), 
              'ridge__alpha':[.001, .01, .1, 1, 10, 100] }

# %time is an IPython magic; remove it when running as a plain Python script
print("----------------------------")
print("-- Parallelize inner loop --")
print("----------------------------")

anova_ridge_cv = GridSearchCV(anova_ridge, cv=5,  param_grid=param_grid, n_jobs=-1)
%time scores = cross_val_score(estimator=anova_ridge_cv, X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())

print("----------------------------")
print("-- Parallelize outer loop --")
print("----------------------------")

anova_ridge_cv = GridSearchCV(anova_ridge, cv=5,  param_grid=param_grid)
%time scores = cross_val_score(estimator=anova_ridge_cv, X=X, y=y, cv=5, n_jobs=-1)
print("Test r2:%.2f" % scores.mean())


print("=====================================")
print("== Scaler + Elastic-net regression ==")
print("=====================================")

alphas = [.0001, .001, .01, .1, 1, 10, 100, 1000] 
l1_ratio = [.1, .5, .9]

print("----------------------------")
print("-- Parallelize outer loop --")
print("----------------------------")

enet = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
    ('enet', lm.ElasticNet(max_iter=10000)),
])
param_grid = {'enet__alpha':alphas ,
              'enet__l1_ratio':l1_ratio}
enet_cv = GridSearchCV(enet, cv=5,  param_grid=param_grid)
%time scores = cross_val_score(estimator=enet_cv, X=X, y=y, cv=5, n_jobs=-1)
print("Test r2:%.2f" % scores.mean())

print("-----------------------------------------------")
print("-- Parallelize outer loop + built-in CV      --")
print("-- Remark: scaler is only done on outer loop --")
print("-----------------------------------------------")

enet_cv = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
    ('enet', lm.ElasticNetCV(max_iter=10000, l1_ratio=l1_ratio, alphas=alphas, cv=3)),
])

%time scores = cross_val_score(estimator=enet_cv, X=X, y=y, cv=5, n_jobs=-1)
print("Test r2:%.2f" % scores.mean())

## Classification pipelines with CV for parameters selection

Now we combine standardization of input features, feature selection and a learner with hyperparameters within a pipeline, which is wrapped in a grid search procedure to select the best hyperparameters based on an (inner) CV. The whole procedure is plugged into an outer CV.
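
The classification examples in this section are scored with balanced accuracy, the mean of the per-class recalls. The sketch below uses hypothetical imbalanced labels to show how it differs from plain accuracy, and that it matches ``metrics.balanced_accuracy_score`` from scikit-learn.

```python
import numpy as np
from sklearn import metrics

# Hypothetical imbalanced labels: 8 samples of class 0, 2 of class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])  # one class-1 sample missed

recalls = metrics.recall_score(y_true, y_pred, average=None)  # one recall per class
bacc_manual = recalls.mean()
bacc_builtin = metrics.balanced_accuracy_score(y_true, y_pred)
acc = metrics.accuracy_score(y_true, y_pred)

print("Per-class recalls:", recalls)
print("Balanced accuracy: %.2f, accuracy: %.2f" % (bacc_manual, acc))
```

Plain accuracy is dominated by the majority class, while balanced accuracy gives each class the same weight.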

In [None]:
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import sklearn.metrics as metrics
    
# Datasets
n_samples, n_features = 100, 100
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                         n_informative=5, random_state=42)


def balanced_acc(estimator, X, y, **kwargs):
    '''
    Balanced accuracy scorer: mean of the per-class recalls
    '''
    return metrics.recall_score(y, estimator.predict(X), average=None).mean()

print("===============================")
print("== Basic logistic regression ==")
print("===============================")

scores = cross_val_score(estimator=lm.LogisticRegression(C=1e8, 
                                   class_weight='balanced',
                                   solver='lbfgs'),
                         X=X, y=y, cv=5, scoring=balanced_acc)
print("Test  bACC:%.2f" % scores.mean())

print("=======================================================")
print("== Scaler + anova filter + ridge logistic regression ==")
print("=======================================================")

anova_ridge = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
    ('selectkbest', SelectKBest(f_classif)),
    ('ridge', lm.LogisticRegression(penalty='l2', 
                                    class_weight='balanced',
                                   solver='lbfgs'))
])
param_grid = {'selectkbest__k':np.arange(10, 110, 10), 
              'ridge__C':[.0001, .001, .01, .1, 1, 10, 100, 1000, 10000]}


# %time is an IPython magic; remove it when running as a plain Python script
print("----------------------------")
print("-- Parallelize inner loop --")
print("----------------------------")

anova_ridge_cv = GridSearchCV(anova_ridge, cv=5,  param_grid=param_grid, 
                              scoring=balanced_acc, n_jobs=-1)
%time scores = cross_val_score(estimator=anova_ridge_cv, X=X, y=y, cv=5,\
                               scoring=balanced_acc)
print("Test bACC:%.2f" % scores.mean())

print("----------------------------")
print("-- Parallelize outer loop --")
print("----------------------------")

anova_ridge_cv = GridSearchCV(anova_ridge, cv=5,  param_grid=param_grid,
                              scoring=balanced_acc)
%time scores = cross_val_score(estimator=anova_ridge_cv, X=X, y=y, cv=5,\
                               scoring=balanced_acc, n_jobs=-1)
print("Test bACC:%.2f" % scores.mean())


print("========================================")
print("== Scaler + lasso logistic regression ==")
print("========================================")

Cs = np.array([.0001, .001, .01, .1, 1, 10, 100, 1000, 10000])
alphas = 1 / Cs
l1_ratio = [.1, .5, .9]

print("----------------------------")
print("-- Parallelize outer loop --")
print("----------------------------")

lasso = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
    # The l1 penalty needs a solver that supports it (e.g. liblinear or saga)
    ('lasso', lm.LogisticRegression(penalty='l1', 
                                    solver='liblinear',
                                    class_weight='balanced')),
])
param_grid = {'lasso__C':Cs}
lasso_cv = GridSearchCV(lasso, cv=5,  param_grid=param_grid, scoring=balanced_acc)
%time scores = cross_val_score(estimator=lasso_cv, X=X, y=y, cv=5,\
                               scoring=balanced_acc, n_jobs=-1)
print("Test bACC:%.2f" % scores.mean())


print("-----------------------------------------------")
print("-- Parallelize outer loop + built-in CV      --")
print("-- Remark: scaler is only done on outer loop --")
print("-----------------------------------------------")

lasso_cv = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
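    # penalty='l1' makes this a lasso model (the default penalty is l2)
    ('lasso', lm.LogisticRegressionCV(Cs=Cs, penalty='l1', solver='liblinear',
                                      scoring=balanced_acc)),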
])

%time scores = cross_val_score(estimator=lasso_cv, X=X, y=y, cv=5,\
                               scoring=balanced_acc, n_jobs=-1)
print("Test bACC:%.2f" % scores.mean())


print("=============================================")
print("== Scaler + Elasticnet logistic regression ==")
print("=============================================")

print("----------------------------")
print("-- Parallelize outer loop --")
print("----------------------------")

enet = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
    # loss="log_loss" is the logistic loss (named "log" in scikit-learn < 1.1)
    ('enet', lm.SGDClassifier(loss="log_loss", penalty="elasticnet",
                            alpha=0.0001, l1_ratio=0.15, class_weight='balanced')),
])

param_grid = {'enet__alpha':alphas,
              'enet__l1_ratio':l1_ratio}

enet_cv = GridSearchCV(enet, cv=5,  param_grid=param_grid, scoring=balanced_acc)
%time scores = cross_val_score(estimator=enet_cv, X=X, y=y, cv=5,\
    scoring=balanced_acc, n_jobs=-1)
print("Test bACC:%.2f" % scores.mean())