# Recap
- we have seen the definitions of Transformers and Estimators, with some examples from scikit-learn
- we have built our own Transformers.

We even built a class that combines two transformers in a sequence. In this chapter we will see a more efficient way to do this: **scikit-learn pipelines**.

# Definition (Pipeline)

The word *pipeline* can refer to different things in data science/engineering. In our context we use it to refer to scikit-learn pipelines.
These are sequences of **transformers** that end with a final **estimator**:

- the intermediate steps of the pipeline must have `fit` and `transform` methods
- the final step of the pipeline must have a `fit` method
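To make these two requirements concrete, here is a minimal sketch of what fitting a pipeline does under the hood. The toy data and the step names `scale` and `clf` are made up for illustration, and `StandardScaler` is just one possible intermediate transformer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this example
X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0],
              [3.0, 7.0], [4.0, 9.0], [5.0, 11.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# With a pipeline: the intermediate step is fitted and applied,
# then the final estimator is fitted on the transformed data
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# The same thing done by hand
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)       # intermediate step: fit + transform
clf = LogisticRegression()
clf.fit(X_scaled, y)                     # final step: fit only
manual_pred = clf.predict(scaler.transform(X))

print((pipe_pred == manual_pred).all())
```

Both approaches should give the same predictions; the pipeline simply keeps the bookkeeping in one object.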


## Pipeline and make_pipeline

There are two ways you can build a pipeline in scikit-learn:

1. `Pipeline` is a scikit-learn class that takes a list of *steps* as input: each step is a tuple of (name, transformer), and the last step is the final estimator.
2. `make_pipeline` is a function that acts as a shorthand for the `Pipeline` constructor: you pass only the transformers and the estimator, without names.


```python 
from sklearn.pipeline import Pipeline

Pipeline(steps = [('transformer_one', Transformer1(...)),
                 ('transformer_two', Transformer2(...)),
                 ....
                 ('final_estimator', Model(...))])
```
Example:
```python
Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
```

```python
from sklearn.pipeline import make_pipeline

make_pipeline(Transformer1(...),Transformer2(...), ... , Model())
```
Example:
```python
make_pipeline(CountVectorizer(), LogisticRegression()) 
```

The only difference between `Pipeline` and `make_pipeline` is that with `Pipeline` you need to specify the name of each step. Explicit names can be useful with model selection utilities (like `GridSearchCV`), where parameters are addressed as `<step name>__<parameter name>`:

Example 1:

```python
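from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]  # 'clf' is the step name chosen above
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)  # X: list of texts, y: labels (defined elsewhere)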
```

Example 2:

```python
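from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]  # autogenerated step name
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)  # X: list of texts, y: labels (defined elsewhere)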
```
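The key `logisticregression__C` in Example 2 comes from the step name that `make_pipeline` generates, which is the lowercased class name. A quick way to check the names and the available parameter keys is to inspect the pipeline's `steps` attribute and its `get_params()` dictionary, as in this small sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), LogisticRegression())

# Each step is a (name, estimator) tuple
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['countvectorizer', 'logisticregression']

# Parameter keys follow the pattern <step name>__<parameter name>
has_c = "logisticregression__C" in pipe.get_params()
print(has_c)  # True
```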

As a consequence, with `Pipeline` we could replace the final estimator (for example, from `LogisticRegression` to `RandomForestClassifier`) and the name of the step would stay the same. With `make_pipeline`, on the other hand, the step names are generated automatically from the class names, so they change whenever the estimator changes.
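One way to swap the final estimator while keeping its name is `set_params`, which accepts a step name as a keyword. This is only a sketch: the `n_estimators` and `random_state` values are arbitrary choices for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])

# Replace the estimator stored under the name 'clf'
pipe.set_params(clf=RandomForestClassifier(n_estimators=10, random_state=0))

names = [name for name, _ in pipe.steps]
print(names)  # ['vec', 'clf']
print(type(pipe.named_steps["clf"]).__name__)  # RandomForestClassifier
```

The step is still called `clf`, but the part after the double underscore in a parameter grid must name a parameter of the new estimator (for example `clf__n_estimators` instead of `clf__C`).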
             

**Remember** when we talked about object-oriented programming? scikit-learn Pipelines are also classes with methods, such as
`fit`, `fit_transform` and `fit_predict`.
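Which of these methods you can call depends on the final step: `predict` is available when the last step is a predictor, and `fit_transform` when the last step is a transformer. The sketch below illustrates this with a tiny made-up corpus; the texts, labels and variable names are assumptions for the example only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: 1 = positive, 0 = negative
texts = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

# Pipeline ending with a classifier: fit, predict and score are available
clf_pipe = make_pipeline(CountVectorizer(), LogisticRegression())
clf_pipe.fit(texts, labels)
preds = clf_pipe.predict(["good film"])
accuracy = clf_pipe.score(texts, labels)

# Pipeline ending with a transformer: fit_transform returns the features
feat_pipe = make_pipeline(CountVectorizer(), TfidfTransformer())
features = feat_pipe.fit_transform(texts)
print(features.shape)  # (4, 6): 4 texts, 6 distinct words
```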


Check the scikit-learn documentation for more information on the methods of `Pipeline`: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

## Feature Union and make_union

In certain situations you want to apply a list of transformers in parallel instead of one after the other. Example: you have text data, you want to extract word frequencies using `CountVectorizer`, and you also want to use the length of the text as a feature. In this case you need to make a *union* of transformers:

1. `FeatureUnion` is a class that takes a list of transformers as input, in particular a list of tuples (name, transformer), applies each of them to the same data and concatenates their outputs column-wise
2. `make_union` is a function that acts as a shorthand for `FeatureUnion`


```python
from sklearn.pipeline import FeatureUnion
featunion = FeatureUnion([('count_vect', CountVectorizer(...)),('length', CustomTransformer(...))])
```

```python
# you can then combine the FeatureUnion into a pipeline
pipeline = Pipeline([
    ('feats', featunion),
    ('clf', Classifier())  # classifier
])
```
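A runnable version of this idea could look like the sketch below. `TextLength` is a hypothetical custom transformer written for this example (it is not part of scikit-learn), and the corpus and labels are invented.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class TextLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column with the length of each text."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(text)] for text in X])


texts = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

featunion = FeatureUnion([("count_vect", CountVectorizer()),
                          ("length", TextLength())])

# The word-count columns and the length column are stacked side by side
features = featunion.fit_transform(texts)
print(features.shape)  # (4, 7): 6 word counts + 1 length column

pipeline = Pipeline([("feats", featunion), ("clf", LogisticRegression())])
pipeline.fit(texts, labels)
preds = pipeline.predict(["good film"])
```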

<img src="./images/diagram_data_union.png">
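As with `make_pipeline`, the shorthand `make_union` names the transformers automatically. A small sketch, using two text vectorizers purely as an illustration (combining these two is an arbitrary choice for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_union

texts = ["good movie", "great film", "bad movie", "terrible film"]

union = make_union(CountVectorizer(), TfidfVectorizer())

# Names are the lowercased class names
union_names = [name for name, _ in union.transformer_list]
print(union_names)  # ['countvectorizer', 'tfidfvectorizer']

# Each transformer produces 6 columns (one per word), concatenated to 12
union_features = union.fit_transform(texts)
print(union_features.shape)  # (4, 12)
```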

# Other Examples of Pipelines

The word *pipeline* in Data Science can refer to many different things, but in general it refers to the whole process of working with data, from getting raw data to delivering something meaningful.

Check here for pipelines in Spark: https://spark.apache.org/docs/2.2.0/ml-pipeline.html