<font color='red'> HERE! Insert a recap</font>

### Assumptions
- We will be using the scikit-learn interface to pipelines
- We will work with pandas DataFrames for transformations

### Definition

**Transformers** are Python classes used in data manipulation to select features, normalize, reduce dimensionality, or perform other transformations on data.
A Transformer implements the method ``transform`` that applies the transformation. Examples:
- a transformer might take a DataFrame and add a column to it
- a transformer might take an array as input, calculate the mean and normalize it

**Estimators** are Python classes that learn something from the data by "fitting" or "training" on it. Estimators implement the method ``fit``, which accepts an input (DataFrame, array, etc.).


### Examples

One-Hot Encoder (Transformer)
    https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    
Logistic Regression (Estimator)
    https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

**Transformers** are said to be:
- **stateful** if they learn something from the data using the ``fit`` method
- **stateless** if they do not need to learn (the ``fit`` method does not do anything)

Example of a stateless Transformer in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html
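
The difference between the two kinds can be sketched on a small array. This is a minimal illustration, not part of the original notes: ``StandardScaler`` is chosen here as an assumed example of a stateful transformer, and the toy array ``X`` is made up.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy data, made up for illustration.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Stateful: fit() learns the column means (and variances) and stores them.
scaler = StandardScaler().fit(X)
learned_means = scaler.mean_

# Stateless: fit() learns nothing; each row is rescaled to unit L2 norm
# using only the values found in that row.
X_unit = Normalizer().fit(X).transform(X)
row_norms = np.linalg.norm(X_unit, axis=1)
```

After fitting, ``learned_means`` holds the per-column means of ``X``, while the ``Normalizer`` has no learned attributes and every entry of ``row_norms`` is 1.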

### Custom Transformer and Estimators

In scikit-learn it is possible to write your own transformers and estimators to perform transformations on data or to use a custom model for making predictions.
Custom transformers are written as classes that inherit attributes and methods from two special scikit-learn classes: ``BaseEstimator`` and ``TransformerMixin``.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    ...
```

```BaseEstimator``` https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/base.py#L176
This is the Python base class for all estimators. An estimator inherits methods such as ```get_params``` from ```BaseEstimator```.

```TransformerMixin``` https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/base.py#L490
A mixin is a class that provides methods to other classes through inheritance and is not meant to be used on its own (e.g. ``TransformerMixin`` provides the method ``fit_transform``, which chains the ``fit`` and ``transform`` methods).
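
As a rough sketch of what the two parent classes provide, the class below defines only ``__init__``, ``fit`` and ``transform``, yet ``get_params`` and ``fit_transform`` are available through inheritance. ``ColumnSelector`` and the toy DataFrame are hypothetical names introduced for this illustration only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that keeps only the named DataFrame columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X[self.columns]


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
selector = ColumnSelector(columns=["a", "c"])

params = selector.get_params()         # inherited from BaseEstimator
selected = selector.fit_transform(df)  # inherited from TransformerMixin
```

Here ``params`` is a dictionary of the constructor arguments, and ``selected`` is the DataFrame restricted to columns ``a`` and ``c``.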

Can you give an example of a scikit-learn estimator that inherits from the BaseEstimator?

### Examples

```python
# Example of a Transformer that does not do anything
from sklearn.base import BaseEstimator, TransformerMixin

class LazyTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x
```


...

Exercise
--------------

1. Write a transformer that adds a number to the input; the number to add should be passed in `__init__`
2. Write a transformer that normalizes the input:
   - in the fit method you must save the column means
3. Combine these 2 transformers into a pipeline:
   - hint: write a class that accepts a list of transformers as argument

## Pipeline and make_pipeline

The word "Pipeline" can refer to different things in software engineering. In our context we use this word to refer to scikit-learn pipelines:
these are **sequences** of transformers that end with a final estimator. Intermediate steps of the pipeline must have fit and transform methods, while the very last step only needs a fit method.

Different ways to create a Pipeline in scikit-learn: 

1. ```Pipeline```

2. ```make_pipeline```


1. **Pipeline** is a class in scikit-learn: ```class sklearn.pipeline.Pipeline(steps, memory=None)```

```python 
from sklearn.pipeline import Pipeline

Pipeline(steps = [('transformer_one', Transformer1(...)),
                 ('transformer_two', Transformer2(...)),
                 ....
                 ('final_estimator', Model(...))])
```
Example 1:
```python
Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
```

2. **make_pipeline** is a Python function that serves as a shorthand for the ```Pipeline``` constructor

```python
from sklearn.pipeline import make_pipeline

make_pipeline(Transformer1(...),Transformer2(...), ... , Model())
```
Example 2:
```python
make_pipeline(CountVectorizer(), LogisticRegression()) 
```

The only difference between ```Pipeline``` and ```make_pipeline``` is that with ```Pipeline``` you need to specify the name of each step. This can be useful if you need to use model selection utilities (like ```GridSearchCV```), because the parameters to tune are addressed as ```<step name>__<parameter name>```:

Example 1:

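```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)  # X: text documents, y: labels (defined elsewhere)
```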

Example 2:

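```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)  # X: text documents, y: labels (defined elsewhere)
```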

As a consequence, with ```Pipeline``` we could, for example, replace the final estimator (from LogisticRegression to a random forest) and the name of the step would stay the same. With ```make_pipeline```, on the other hand, the names of the steps are generated automatically from the lowercased class names.
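
The naming behaviour can be checked directly by listing the step names. The snippet below is a small sketch; the variable names and the choice of ``RandomForestClassifier`` as the replacement are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline

named = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])
auto = make_pipeline(CountVectorizer(), LogisticRegression())

# Explicit names versus names derived from the class names.
named_steps = [name for name, _ in named.steps]
auto_steps = [name for name, _ in auto.steps]

# Replace the final estimator by step name; the step is still called "clf".
named.set_params(clf=RandomForestClassifier())
swapped_steps = [name for name, _ in named.steps]
final_estimator = named.steps[-1][1]
```

``named_steps`` is ``['vec', 'clf']`` both before and after the swap, while ``auto_steps`` is ``['countvectorizer', 'logisticregression']``.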
             

Calling ```fit``` on the pipeline runs the ```fit``` and ```transform``` methods of each transformer in turn: the transformed data is passed to the next step as input, and finally the ```fit``` method of the last estimator is called. The method ```fit_predict``` fits the pipeline and returns the predictions of the last estimator; it is only available when that estimator implements ```fit_predict``` itself.
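
To make this flow concrete, here is a hedged sketch that fits a pipeline and then repeats the same steps by hand. The four toy documents and their labels are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration.
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]
new_texts = ["great movie", "awful movie"]

# Pipeline version: fit() calls fit/transform on 'vec', then fit on 'clf'.
pipe = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])
pipe.fit(texts, labels)
predictions = pipe.predict(new_texts)

# Manual version of the same steps.
vec = CountVectorizer()
X_counts = vec.fit_transform(texts)
clf = LogisticRegression().fit(X_counts, labels)
manual_predictions = clf.predict(vec.transform(new_texts))
```

Both versions should produce the same predictions; the pipeline simply bundles the steps so that ``fit`` and ``predict`` are each a single call.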

## FeatureUnion and make_union

In certain situations you want to apply a list of transformers in parallel instead of one after the other.

Example: you have text data and you want to extract word frequencies using ```CountVectorizer```, and you also want to use the length of the text as a feature. In this case a pipeline is not suitable: both operations need the same input data, so you cannot apply one transformer after the other.

```FeatureUnion``` and its shorthand version ```make_union``` create a union of transformers and concatenate their results.

Example:
```python
from sklearn.pipeline import FeatureUnion, Pipeline
featunion = FeatureUnion([('count_vect', CountVectorizer(...)),('length', CustomTransformer(...))])
# you can then combine the FeatureUnion into a pipeline
pipeline = Pipeline([
    ('feats', featunion),
    ('clf', Classifier())  # classifier
])
```
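
A runnable variant of the text-length scenario might look like the sketch below. ``TextLength`` is a hypothetical custom transformer written for this illustration, and the two sample documents are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class TextLength(BaseEstimator, TransformerMixin):
    """Hypothetical stateless transformer: one column with each text's length."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X])


texts = ["good movie", "bad"]

# Both transformers receive the same input; their outputs are stacked side by side.
union = FeatureUnion([("count_vect", CountVectorizer()), ("length", TextLength())])
features = union.fit_transform(texts)

# CountVectorizer returns a sparse matrix, so the combined result is sparse too.
dense = features.toarray()
```

The first columns of ``dense`` hold the word counts (one column per vocabulary word) and the last column holds the character length of each document.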