## Sklearn pipelines

Pipelines are containers of steps. A step can be one of the following:

- Transformer
- Estimator
- Pipeline
- FeatureUnion
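
The step kinds listed above can be nested. The following is a minimal sketch, not taken from the original notebook: the step names (`scale`, `features`, `preprocess`), the choice of `PCA` and `StandardScaler`, and the `f_classif` score function are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)

# FeatureUnion runs its transformers side by side and concatenates their outputs.
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(f_classif, k=3)),
])

# A nested Pipeline used as a single preprocessing step.
preprocess = Pipeline([("scale", StandardScaler()), ("features", features)])

# The outer pipeline ends with an estimator.
model = Pipeline([("preprocess", preprocess), ("svc", SVC(kernel="linear"))])
model.fit(X, y)

# 2 PCA components + 3 selected columns = 5 features reach the SVC.
n_features = model.named_steps["preprocess"].transform(X).shape[1]
print(n_features)
```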

Pipelines are especially useful for packaging data preprocessing and model fitting into a single (serializable) object.
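
As a rough illustration of the "single serializable object" point, the sketch below fits a pipeline, round-trips it through `pickle`, and checks that the restored object predicts the same labels. The variable names are hypothetical, and `pickle` is only one possible serialization route. Pickles should only be loaded from trusted sources and, ideally, with the same scikit-learn version that wrote them.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)

pipe = Pipeline([
    ("anova", SelectKBest(f_regression, k=5)),
    ("svc", SVC(kernel="linear")),
])
pipe.fit(X, y)

# Preprocessing and model travel together in one byte string.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

# The restored pipeline applies the same feature selection and classifier.
same = np.array_equal(pipe.predict(X), restored.predict(X))
print(same)
```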



From the help of the Pipeline object:
```
Sequentially apply a list of transforms and a final estimator.
Intermediate steps of the pipeline must be 'transforms', that is, they
must implement fit and transform methods.
The final estimator only needs to implement fit.


The purpose of the pipeline is to assemble several steps that can be
cross-validated together while setting different parameters.
For this, it enables setting parameters of the various steps using their
names and the parameter name separated by a '__', as in the example below.
A step's estimator may be replaced entirely by setting the parameter
with its name to another estimator, or a transformer removed by setting
to None.
```
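
The docstring's two main points, cross-validating steps together through `step__parameter` names and swapping or removing a step, might look like the sketch below. The parameter grid values and the use of `LogisticRegression` as a replacement are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)

pipe = Pipeline([
    ("anova", SelectKBest(f_regression)),
    ("svc", SVC(kernel="linear")),
])

# Keys are "<step name>__<parameter name>"; both steps are tuned together,
# and the feature selection is refit inside every cross-validation fold.
param_grid = {"anova__k": [3, 5, 7], "svc__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
best = search.best_params_
print(best)

# Replace a whole step by setting the parameter named after it
# (the step keeps its name "svc" even though it now holds another estimator).
pipe.set_params(svc=LogisticRegression(max_iter=1000))

# Remove the transformer by setting its step to None, as the docstring describes.
pipe.set_params(anova=None)
pipe.fit(X, y)
```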

In [1]:
import sklearn
from sklearn import pipeline

In [2]:
?sklearn.pipeline.Pipeline

In [3]:
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline

# generate some data to play with
# (make_classification is imported directly: the samples_generator
# module is deprecated and removed in recent scikit-learn releases)
X, y = make_classification(
    n_informative=5, n_redundant=0, random_state=42)

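# ANOVA SVM-C: keep the 5 best-scoring features, then fit a linear SVM
# (f_regression follows the original example; f_classif is the usual
# score function when y holds class labels)
anova_filter = SelectKBest(f_regression, k=5)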
clf = svm.SVC(kernel='linear')
svmpipe = Pipeline([('anova', anova_filter), ('svc', clf)])

### Inspecting the first step's transformation

In [4]:
X.shape

(100, 20)

In [5]:
svmpipe.steps[0][1].fit(X,y)

SelectKBest(k=5, score_func=<function f_regression at 0x7f06ac7e4d08>)

In [6]:
svmpipe.steps[0][1].transform(X).shape

(100, 5)

In [7]:
svmpipe.steps[0][1].__dict__

{'score_func': <function sklearn.feature_selection.univariate_selection.f_regression(X, y, center=True)>,
 'k': 5,
 'scores_': array([1.23977183e-01, 2.54349641e-01, 4.38691648e+00, 8.50993664e+00,
        3.05588566e-01, 3.05419416e-01, 7.24129592e-01, 2.22731093e+01,
        1.01372597e-01, 2.14898175e+01, 5.56918995e-03, 1.09088355e+01,
        7.25814092e-01, 4.85637398e-01, 2.00376966e+00, 4.91894354e-01,
        7.69678969e-01, 9.48327951e-01, 3.33672446e-01, 2.58987004e-01]),
 'pvalues_': array([7.25516352e-01, 6.15160920e-01, 3.87954258e-02, 4.37964905e-03,
        5.81658683e-01, 5.81763072e-01, 3.96867119e-01, 7.86760605e-06,
        7.50866172e-01, 1.09815358e-05, 9.40663608e-01, 1.33672777e-03,
        3.96320524e-01, 4.87529621e-01, 1.60078717e-01, 4.84745633e-01,
        3.82462045e-01, 3.32542871e-01, 5.64829416e-01, 6.11960827e-01])}

### Setting step parameters with stepname__parameter

In [8]:
svmpipe.steps

[('anova',
  SelectKBest(k=5, score_func=<function f_regression at 0x7f06ac7e4d08>)),
 ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False))]

In [9]:
svmpipe.set_params(svc__C=0.3)

Pipeline(memory=None,
     steps=[('anova', SelectKBest(k=5, score_func=<function f_regression at 0x7f06ac7e4d08>)), ('svc', SVC(C=0.3, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])

### Fitting the pipeline

In [10]:
svmpipe.set_params(anova__k=7, svc__C=.1)

Pipeline(memory=None,
     steps=[('anova', SelectKBest(k=7, score_func=<function f_regression at 0x7f06ac7e4d08>)), ('svc', SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])

In [11]:
svmpipe.fit(X, y)

Pipeline(memory=None,
     steps=[('anova', SelectKBest(k=7, score_func=<function f_regression at 0x7f06ac7e4d08>)), ('svc', SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])

In [12]:
prediction = svmpipe.predict(X)

In [13]:
svmpipe.score(X, y)

0.82

In [14]:
svmpipe.named_steps['anova'].get_support()

array([False, False,  True,  True, False, False, False,  True, False,
        True, False,  True, False, False,  True, False, False,  True,
       False, False])

In [15]:
svmpipe.named_steps.anova.get_support()

array([False, False,  True,  True, False, False, False,  True, False,
        True, False,  True, False, False,  True, False, False,  True,
       False, False])
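
The boolean mask returned by `get_support()` is easier to read as column indices. The sketch below rebuilds the same pipeline configuration (`k=7`, `C=0.1`) and converts the mask. The exact indices depend on the generated data, so none are quoted here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)

pipe = Pipeline([
    ("anova", SelectKBest(f_regression, k=7)),
    ("svc", SVC(kernel="linear", C=0.1)),
])
pipe.fit(X, y)

# Boolean mask over the 20 input columns: True where a column was kept.
mask = pipe.named_steps["anova"].get_support()

# Positions of the True entries, i.e. the indices of the selected columns.
selected = np.flatnonzero(mask)

# get_support(indices=True) returns the indices directly.
selected_direct = pipe.named_steps["anova"].get_support(indices=True)
print(selected)
```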

### Direct SVM (no feature selection)

In [16]:
clf = svm.SVC(kernel='linear')

In [17]:
clf.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [18]:
clf.score(X, y)

0.9
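
Both scores above are computed on the same `X` and `y` used for fitting, so they are training accuracies. A fairer comparison would score on held-out data. The sketch below uses 5-fold cross-validation, which is an assumption here and not something the original notebook does. No expected numbers are given because they depend on the scikit-learn version and the folds.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)

# Same configuration as the fitted pipeline: 7 features, C=0.1.
pipe = Pipeline([
    ("anova", SelectKBest(f_regression, k=7)),
    ("svc", SVC(kernel="linear", C=0.1)),
])

# Plain linear SVM on all 20 features.
direct = SVC(kernel="linear")

# Each fold refits the whole pipeline, so the feature selection
# never sees the held-out samples.
pipe_scores = cross_val_score(pipe, X, y, cv=5)
direct_scores = cross_val_score(direct, X, y, cv=5)

print("pipeline:", pipe_scores.mean())
print("direct SVM:", direct_scores.mean())
```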