# Data Pipelines

## Introduction

We have learnt that data rarely arrives in an ideal format, so it has to be transformed before it can be analysed or fed to learning algorithms.

`scikit-learn` provides a unified way of integrating data transformations, so the whole process can be run in a consistent way.

## What you will learn in this session

* Understand the Pipeline paradigm
* Create Pipelines using `scikit-learn`
* Convert categorical variables
* Transform numeric variables

## Contents
* [What is a Pipeline?](#What-is-a-Pipeline?)
* [What is `scikit-learn`?](#What-is-scikit-learn?)
    * [`scikit-learn` Transformers](#scikit-learn-Transformers)
    * [`scikit-learn` Pipelines](#scikit-learn-Pipelines)
* [Transforming Variables](#Transforming-Variables)
    * [Categorical to Numeric](#Categorical-to-Numeric)
    * [Numeric Data](#Numeric-Data)
    * [Heterogeneous Data](#Heterogeneous-Data)

## Acknowledgments
* https://pbpython.com/categorical-encoding.html
* https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/
* https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf
* https://chrisalbon.com/python/data_wrangling/pandas_create_pipeline/
* https://scikit-learn.org/stable/modules/compose.html

## What is a Pipeline?

We have seen that preparing data for analysis may involve several steps, including different functions for data transformation.

Once we have designed a data preparation process, we often want to apply the same transformations to different datasets.

For example, when we receive an update of a previous dataset, we want to apply the same transformations, one after the other, to the new data.

The process of chaining a series of transformations over a dataset is commonly known as a Pipeline.

Let's see a basic example:

* We have designed a 3 step process
    * select columns
    * convert strings
    * replace `NaN`

In [None]:
import numpy as np
import pandas as pd

def select_columns(in_df, column_list):
    # return a copy so later steps do not modify the original DataFrame
    return in_df.loc[:, column_list].copy()

def name_lower(in_df, column):
    in_df[column] = in_df[column].str.lower()
    return in_df

def replace_nan(in_df, column, replace_value):
    in_df[column] = in_df[column].fillna(replace_value)
    return in_df

In [None]:
df = pd.DataFrame({
    "age": [12, 42, 24, np.nan],
    "name": ["alice", "BOB", "Charlie", "dan"],
    "city": ["Lleida", "Barcelona", "New York", "Dubai"]
})

out_df = select_columns(df, ["age", "name"])
out_df = name_lower(out_df, "name")
mean_age = out_df["age"].mean()
out_df = replace_nan(out_df, "age", mean_age)
out_df

##### `pandas.DataFrame.pipe`

We can use `DataFrame.pipe` when chaining together functions that expect `Series`, `DataFrames` or `GroupBy` objects.

In [None]:
(df.pipe(select_columns, column_list=["age", "name"])
    .pipe(name_lower, column="name")
    .pipe(replace_nan, column="age", replace_value=df["age"].mean())
)

## What is `scikit-learn`?

`scikit-learn` is a Machine Learning library written in Python. 

It contains many widely used algorithms, such as KNN, random forests, gradient boosting and SVM, among others.

It also contains several tools for data management.

It's built upon `NumPy`, `SciPy` and `Matplotlib`, and it works well with `pandas`, which we have already seen!

It is also widely known for having an API that has become a kind of standard in the Machine Learning community. As a very quick summary, it contains two types of algorithms:

* **Transformers:** these have two main methods: `.fit()` and `.transform()`
* **Estimators:** these have two main methods: `.fit()` and `.predict()`

Normally, estimators are algorithms that learn a model from data.

Transformers, on the other hand, are data management algorithms that take array-like data (such as NumPy arrays) as input and return the transformed data, usually as NumPy arrays.

Using this API, one can chain transformers and estimators to define a Pipeline.
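To make the two method pairs concrete, here is a minimal sketch that uses one transformer and one estimator by hand, without a Pipeline. The toy data and the choice of `StandardScaler` and `KNeighborsClassifier` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two features, binary target
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0],
              [10.0, 900.0], [11.0, 950.0], [12.0, 870.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Transformer: fit() learns per-column mean and standard deviation,
# transform() applies them to the data
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Estimator: fit() learns a model, predict() applies it to new data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)

# New data must go through the same fitted transformer first
X_new = np.array([[2.5, 210.0], [10.5, 910.0]])
predictions = knn.predict(scaler.transform(X_new))
print(predictions)
```

Note that the new data has to be passed manually through the fitted scaler before predicting; a Pipeline automates exactly this bookkeeping.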

### `scikit-learn` Transformers

A Transformer is an object that has two main methods: `.fit(X, y=None)` and `.transform(X)`
* `.fit(X, y=None)` takes `X` values and sets up a mapping between `X` values and values in a new domain
* `.transform(X)` takes `X` values and maps them to values in the new domain

In [None]:
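from sklearn.base import BaseEstimator, TransformerMixin

class DumbFeaturizer(TransformerMixin, BaseEstimator):
    """Toy transformer: learns the middle value of X, then flags values not above it."""

    def fit(self, X, y=None):
        half = len(X) // 2
        if len(X) % 2 == 0:
            # even length: average the two central elements
            self.middle_ = (X[half - 1] + X[half]) / 2
        else:
            # odd length: take the central element
            self.middle_ = X[half]
        return self

    def transform(self, X):
        # 0 if the value is above the learned middle, 1 otherwise
        return [0 if x > self.middle_ else 1 for x in X]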

In [None]:
dumb = DumbFeaturizer()
dumb.fit([1,2,3,4,5,6,7])
dumb.transform([-1,2,3,6, 7, 8])

### `scikit-learn` Pipelines

Calling `fit` on the pipeline is the same as calling `fit` on each step in turn, transforming the input and passing the result on to the next step.

The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. 

If the last estimator is a transformer, again, so is the pipeline.

In [None]:
from sklearn.pipeline import Pipeline

steps = [("dumb_feat1", DumbFeaturizer()), ("dumb_feat2", DumbFeaturizer())]
p = Pipeline(steps)

p.fit_transform([1,2,3,4,5])
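The following sketch illustrates the case where the last step is a classifier, so the whole Pipeline can be used as one. The toy data and the choice of `StandardScaler` followed by `LogisticRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): one feature, two well separated classes
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf_pipeline = Pipeline([
    ("scaler", StandardScaler()),          # transformer step
    ("classifier", LogisticRegression()),  # final step: a classifier
])

# fit() fits the scaler, transforms X, then fits the classifier on the result
clf_pipeline.fit(X, y)

# predict() scales the new data with the fitted scaler, then predicts
pred = clf_pipeline.predict(np.array([[1.5], [11.5]]))
print(pred)
```

Because the final step exposes `.predict()`, so does the Pipeline; new data is scaled with the statistics learned during `fit`.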

## Transforming Variables

We will see how to use `Transformers` and `Pipelines` to convert variables in order to be able to train models in `scikit-learn`.

To this end we will use a real dataset: the Automobile dataset.

### Automobile Dataset

This data set consists of three types of entities: 
1. the specification of an auto in terms of various characteristics
2. its assigned insurance risk rating
3. its normalized losses in use as compared to other cars

For more info: https://archive.ics.uci.edu/ml/datasets/Automobile

It is an interesting dataset because it has a mix of categorical and numeric variables.

In [None]:
import pandas as pd
import numpy as np

headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

df = pd.read_csv(
    "http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
    header=None,
    names=headers, 
    na_values="?"
)

df.sample().T

In [None]:
df.dtypes

We can see that `num_doors` and `num_cylinders` are encoded as strings, although they represent numeric quantities.

In [None]:
df.num_doors.value_counts()
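One possible way to turn such word-encoded columns into numbers is a lookup table applied with `Series.map`. This is only a sketch: the small `cars` frame below is made up to mimic the two columns, and `word_to_number` is a hypothetical mapping that would need to cover every word present in the real data.

```python
import numpy as np
import pandas as pd

# Small made-up sample mimicking the `num_doors` and `num_cylinders` columns
cars = pd.DataFrame({
    "num_doors": ["two", "four", "four", np.nan],
    "num_cylinders": ["four", "six", "eight", "four"],
})

# Hypothetical lookup table; extend it with any other words found in the data
word_to_number = {"two": 2, "three": 3, "four": 4, "five": 5,
                  "six": 6, "eight": 8, "twelve": 12}

for column in ["num_doors", "num_cylinders"]:
    # words missing from the table (and NaN) become NaN
    cars[column] = cars[column].map(word_to_number)

print(cars)
print(cars.dtypes)
```

Missing values stay missing after the mapping, so they can still be imputed in a later step.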

### Categorical to Numeric

Some algorithms cannot handle categorical data, so it has to be converted to numeric values before using any `Estimator`.

##### `sklearn.preprocessing.LabelEncoder`

Encode labels with value between 0 and `n_classes-1`. 

This is especially useful for encoding classification target variables.

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# we have to fit on all the values in the variable's range
_ = le.fit([1, 2, 2, 6])

In [None]:
le.classes_

We can now transform values:

In [None]:
le.transform([1, 1, 2, 6]) 

If we try to transform a variable value not in the range fitted we get an error.

In [None]:
le.transform([3]) 

Once we have a transformed variable, we can recover the original values:

In [None]:
le.inverse_transform([0, 0, 1, 2])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

In [None]:
le = preprocessing.LabelEncoder()
_ = le.fit(df.make)

In [None]:
list(le.classes_)

In [None]:
le.transform(['mazda',
 'mitsubishi',
 'nissan',
 'peugot',
 'porsche',]) 

In [None]:
list(le.inverse_transform([2, 2, 1]))

##### `sklearn.preprocessing.OneHotEncoder`

One of the problems of transforming a categorical variable into an integer value is that a learning algorithm may interpret the integers as ordered magnitudes (for example, that category 2 is "twice" category 1) and produce biased results.

To avoid this behaviour we can use so-called One Hot Encoding.

The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column.

This has the benefit of not weighting a value improperly, but it has the downside of adding more columns to the dataset.

Let's see an example

In [None]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

In [None]:
enc.categories_

In [None]:
enc.transform([['Female', 1], ['Male', 4]]).toarray()

In [None]:
enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])

In [None]:
enc.get_feature_names_out()  # `get_feature_names()` was removed in scikit-learn 1.2

Note the sparsity it introduces in the dataset: with many categories, most of the new columns are 0 for any given row.

In [None]:
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([[m] for m in df.make.values])

In [None]:
enc.transform([["peugot"]]).toarray()
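The encoders above are created with `handle_unknown='ignore'`. As a small sketch of what that option does, the example below fits on three made-up makes and then encodes one known and one unseen value; the category names are assumptions for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on three made-up categories (one column, one row per sample)
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["audi"], ["bmw"], ["volvo"]])

# A category seen during fit gets a single 1 in its own column
known = enc.transform([["bmw"]]).toarray()

# A category not seen during fit is encoded as all zeros instead of raising
unknown = enc.transform([["tesla"]]).toarray()

print(known)
print(unknown)
```

With the default `handle_unknown='error'`, the second call would raise an error instead.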

##### `sklearn.feature_extraction.text.CountVectorizer`

We have worked with text before, and using it as a dataset feature is not easy.

One common approach is to count how many times each word of a vocabulary appears in each text. The vocabulary is the set of distinct words found in a corpus, and a corpus is a collection of texts.

Let's see an example.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

In [None]:
vectorizer.get_feature_names_out()  # `get_feature_names()` was removed in scikit-learn 1.2

In [None]:
X.toarray() 

### Numeric Data
##### Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. 

If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

The function `scale` provides a quick and easy way to perform this operation on a single array-like dataset:

In [None]:
%matplotlib inline
df.length.plot(kind="kde")

In [None]:
from sklearn import preprocessing
import numpy as np

X_scaled = preprocessing.scale(df.length)

pd.Series(X_scaled).plot(kind="kde")

In [None]:
X_scaled.mean(axis=0)

In [None]:
X_scaled.std(axis=0)
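Standardization amounts to computing `z = (x - mean) / std` for each feature. The sketch below checks this by hand against `preprocessing.scale` on a small made-up array (the values are an assumption for illustration).

```python
import numpy as np
from sklearn import preprocessing

# Made-up lengths, for illustration only
x = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

# Manual standardization: remove the mean, divide by the (population) standard deviation
manual = (x - x.mean()) / x.std()

# Same operation using scikit-learn
scaled = preprocessing.scale(x)

print(manual)
print(np.allclose(manual, scaled))
```

Both use the population standard deviation (`ddof=0`), which is why the results match.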

The preprocessing module further provides a utility class `StandardScaler` that implements the `Transformer` API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. 

This class is hence suitable for use in the early steps of a `sklearn.pipeline.Pipeline`:

In [None]:
scaler = preprocessing.StandardScaler().fit([[l] for l in df.length.values])
scaler

In [None]:
scaler.mean_

In [None]:
scaler.scale_                             

In [None]:
# transform the same column the scaler was fitted on
scaler.transform(df.length.values.reshape(-1, 1))

The scaler instance can then be used on new data to transform it the same way it did on the training set:

In [None]:
flat_list = [v for l in scaler.transform([[l] for l in df.length.values]) for v in l]
pd.Series(flat_list).plot(kind="kde")

It is possible to disable either centering or scaling by either passing `with_mean=False` or `with_std=False` to the constructor of `StandardScaler`.
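A short sketch of both options on a made-up single-column array (the data is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# Only scale: divide by the standard deviation, keep the mean where it is
only_scaled = StandardScaler(with_mean=False).fit_transform(X)

# Only center: subtract the mean, keep the original spread
only_centered = StandardScaler(with_std=False).fit_transform(X)

print(only_scaled.ravel())
print(only_centered.ravel())
```

Disabling centering is mainly useful for sparse data, where subtracting the mean would turn the zero entries into non-zero values.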

##### Scaling features to a range

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. 

This can be achieved using `MinMaxScaler` or `MaxAbsScaler`, respectively.

The motivation for using this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data.

Here is an example that scales the `wheel_base` column to the `[0, 1]` range:

In [None]:
df.wheel_base.plot(kind="kde")

In [None]:
X_scaled = preprocessing.scale(df.wheel_base)
pd.Series(X_scaled).plot(kind="kde")

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform([[l] for l in df.wheel_base.values])

flat_list = [v for l in X_train_minmax for v in l]
pd.Series(flat_list).plot(kind="kde")

The same instance of the transformer can then be applied to some new test data unseen during the fit call: the same scaling and shifting operations will be applied to be consistent with the transformation performed on the train data:

In [None]:
X_test = np.array([[-3.], [-1.], [4.]])  # one column, like the data used in fit
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax
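Because the scaler reuses the minimum and maximum learned during `fit`, new values outside the training range end up outside `[0, 1]`. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test columns
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

# The scaler learns min=10 and max=30 from the training data only
scaler = MinMaxScaler().fit(train)

# Test values are mapped with (x - 10) / (30 - 10)
scaled_test = scaler.transform(test)
print(scaled_test.ravel())
```

Values below the training minimum become negative and values above the training maximum exceed 1, because by default no clipping is applied.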

It is possible to inspect the scaler attributes to find out the exact nature of the transformation learned on the training data:

In [None]:
min_max_scaler.scale_                             

In [None]:
min_max_scaler.min_

If `MinMaxScaler` is given an explicit `feature_range=(min, max)` the full formula is:

In [None]:
X = df.wheel_base.values
range_min, range_max = 0, 1  # the requested feature_range=(min, max)
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (range_max - range_min) + range_min
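As a sanity check of this formula with a non-default range, the sketch below compares a manual computation with `MinMaxScaler(feature_range=(-1, 1))`. The three values and the names `range_min` and `range_max` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data
X = np.array([[86.6], [94.5], [120.9]])
range_min, range_max = -1, 1  # target feature_range

# Manual formula: first map to [0, 1], then stretch and shift to the target range
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
manual = X_std * (range_max - range_min) + range_min

# Same transformation using scikit-learn
from_sklearn = MinMaxScaler(feature_range=(range_min, range_max)).fit_transform(X)

print(manual.ravel())
print(np.allclose(manual, from_sklearn))
```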

`MaxAbsScaler` works in a very similar fashion, but scales in a way that the training data lies within the range `[-1, 1]` by dividing by the largest absolute value in each feature.

It is meant for data that is already centered at zero or sparse data.

In [None]:
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform([[v] for v in df.wheel_base])

flat_list = [v for l in X_train_maxabs for v in l]
pd.Series(flat_list).plot(kind="kde")

In [None]:
X_test = np.array([[-3.], [-1.], [4.]])  # one column, like the data used in fit
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs                 

In [None]:
max_abs_scaler.scale_         

As with `scale`, the module further provides the convenience functions `minmax_scale` and `maxabs_scale` if you don't want to create an object.
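A brief sketch of the two convenience functions on a made-up column of values:

```python
import numpy as np
from sklearn.preprocessing import maxabs_scale, minmax_scale

# Made-up single-column data with min=-4 and max=2
X = np.array([[-4.0], [0.0], [2.0]])

# (x - min) / (max - min) maps the column to [0, 1]
X_minmax = minmax_scale(X)

# x / max(|x|) maps the column to [-1, 1] and keeps zeros at zero
X_maxabs = maxabs_scale(X)

print(X_minmax.ravel())
print(X_maxabs.ravel())
```

These functions fit and transform in one call and do not keep the learned parameters, so they cannot reapply the same transformation to new data.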

### Heterogeneous Data

Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. 

Often it is easiest to preprocess data before applying `scikit-learn` methods, for example using `pandas`.

However, processing your data before passing it to `scikit-learn` might be problematic for one of the following reasons:

* Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.

* You may want to include the parameters of the preprocessors in a parameter search.
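Both points can be sketched with a Pipeline inside a grid search: the scaler is refitted on the training part of each fold, and its parameters are searched together with those of the model. The synthetic data from `make_classification`, the step names and the parameter grid are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, for illustration only
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "scaler__with_mean": [True, False],
    "knn__n_neighbors": [3, 5],
}

# In each fold the scaler is fitted on the training part only, avoiding leakage
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```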

The `ColumnTransformer` helps perform different transformations for different columns of the data, within a `Pipeline` that is safe from data leakage and that can be parametrized.

`ColumnTransformer` works on arrays, sparse matrices, and pandas `DataFrames`.

A different transformation, such as preprocessing or a specific feature extraction method, can be applied to each column.

For our data, we might want to encode the `make` column as a categorical variable using `preprocessing.OneHotEncoder` but apply a `preprocessing.MinMaxScaler()` to the `length` column.

As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say `make_category` and `scaled_length`. By default, the remaining columns are ignored (`remainder='drop'`):

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
column_trans = ColumnTransformer(
     [('make_category', OneHotEncoder(dtype='int'),['make']),
      ('scaled_length', preprocessing.MinMaxScaler(), ['length'])],
     remainder='drop')

column_trans.fit(df) 

In [None]:
column_trans.transform(df).toarray()

We can keep the remaining columns by setting `remainder='passthrough'`. The values are appended to the end of the transformation:

In [None]:
column_trans = ColumnTransformer(
     [('make_category', OneHotEncoder(dtype='int'),['make']),
      ('scaled_length', preprocessing.MinMaxScaler(), ['length'])],
     remainder='passthrough')

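# passthrough columns must be numeric when the output is sparse,
# so we select a few numeric columns plus `make`
column_trans.fit_transform(df.loc[:, ["wheel_base", "length", "width", "height", "make"]]).toarray()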

The `remainder` parameter can also be set to a transformer, which is then applied to the remaining columns.

The transformed values are appended to the end of the transformation:

In [None]:
column_trans = ColumnTransformer(
     [('make_category', OneHotEncoder(), ['make']),
      ('length_scaled', preprocessing.MinMaxScaler(), ["length"])],
     remainder=preprocessing.MaxAbsScaler())

column_trans.fit_transform(df.loc[:, ["wheel_base", "length", "width", "height", "make"]]).toarray()

The `make_column_transformer` function is available to more easily create a `ColumnTransformer` object. 

Specifically, the names will be given automatically. The equivalent for the above example would be:

In [None]:
from sklearn.compose import make_column_transformer
column_trans = make_column_transformer(
     (OneHotEncoder(), ['make']),
     (preprocessing.MinMaxScaler(), ["length"]),
     remainder=preprocessing.MaxAbsScaler())
column_trans 
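To close the loop, a `ColumnTransformer` can be the first step of a Pipeline that ends in a model. The sketch below is only illustrative: the small `cars` frame, its prices and the choice of `LinearRegression` are made up, not taken from the Automobile dataset.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up data with one categorical feature, one numeric feature and a target
cars = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo", "bmw", "volvo"],
    "length": [176.6, 189.0, 177.3, 188.8, 176.8, 190.0],
    "price": [13950, 24565, 15250, 16845, 16430, 18950],
})

# One-hot encode `make`, scale `length` to [0, 1]
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["make"]),
    (MinMaxScaler(), ["length"]),
)

# Preprocessing and model are fitted together as a single object
model = make_pipeline(preprocess, LinearRegression())
model.fit(cars[["make", "length"]], cars["price"])

# New rows only need the raw columns; the pipeline applies the fitted transformations
new_car = pd.DataFrame({"make": ["audi"], "length": [180.0]})
predicted = model.predict(new_car)
print(predicted)
```

Since preprocessing lives inside the model object, the same fitted transformations are applied automatically at prediction time.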