# Transformer and Pipeline Quickstart

`Transformer` addresses the engineering side of **data preprocessing**.

## Applicable Scenarios

Data preprocessing usually involves a lot of **repetitive work**.

Once the training dataset has been processed, the same preprocessing steps still
have to be collected and packaged as a function, an API, or something similar.

## Sample Data

<div class="alert alert-info">
Note

All data are fictitious.
</div>

The data describe the sales of several stores of one chain brand.

- All stores are located in one region.
- The time span is one specific year.
- Sale is the total amount for that year.
- Population is the daily number of people within a $200\,\mathrm{m}$ buffer around the store.
- Score is given by an expert and ranges from 0 to 10.

In [None]:
import pandas as pd

In [None]:
store_sale_dict = {
    "code": ["811-10001", "811-10002", "811-10003", "811-10004"],
    "name": ["A", "B", "C", "D"],
    "floor": ["1F", "2F", "1F", "B2"],
    "level": ["strategic", "normal", "important", "normal"],
    "type": ["School", "Mall", "Office", "Home"],
    "area": [100, 95, 177, 70],
    "population": [3000, 1000, 2000, 1500],
    "score": [10, 8, 6, 5],
    "opendays": [300, 100, 250, 15],
    "sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df

## Feature Types and Dealing Steps

First of all, we should know there are three types of features ($X$) and one label ($y$).

- Additional information features: drop
  - code
  - name
- Categorical features: encode to one-hot
  - floor
  - type: drop the `'Home'` type, because very few stores belong to it.
- Numeric features: scale
  - level: it is ordinal rather than **categorical**, because its values can be ranked.
  - area
  - population: the raw value counts people within the buffer, but the number of people likely to enter the store is more useful; it equals $\frac{score}{10} \times population$.
  - score
  - opendays: filter out stores with `opendays <= 30`, then drop this field
- Label: needs to be balanced, so transform it to daily sale, $\frac{sale}{opendays}$, then scale it
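
As a quick sanity check, the two derived quantities in the list above can be computed by hand for one store. The sketch below uses the values of store "B" from the sample data; the variable names are chosen only for this illustration.

```python
# Values for store "B", copied from the sample data above.
store_b = {"population": 1000, "score": 8, "opendays": 100, "sale": 5000}

# Entry population: the share of the buffer population expected to enter the store.
entry_population = store_b["score"] / 10 * store_b["population"]

# Daily sale: the yearly total divided by the number of open days.
daily_sale = store_b["sale"] / store_b["opendays"]

print(entry_population)  # about 800
print(daily_sale)        # 50.0
```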


<div class="alert alert-info">
Mission

Our mission is to find relationships between these features and the label.
</div>

## The Pandas Way

In pandas, most users would write something like this:

First, set a series of feature name constants.

In [None]:
features_category = ["floor", "type"]
features_number = ["level", "area", "population", "score"]
features = features_category + features_number
label = ["sale"]

### Process X and y

Filter out stores that have been open for 30 days or fewer,
because these samples are not typical stores.

In [None]:
df = df.query("opendays > 30")
df

Filter out `'Home'` stores.

In [None]:
df = df[df["type"] != "Home"]
df

Transform sale to daily sale.

In [None]:
df.eval("sale = sale / opendays", inplace=True)
df

Transform population to the store-entry population.

In [None]:
df.eval("population = score / 10 * population", inplace=True)
df

Split `df` into `df_x` and `y`, and process them separately.

In [None]:
df_x = df[features]
df_x

In [None]:
y = df[label]
y

### Process y

Scale `y`.

In [None]:
from sklearn.preprocessing import MinMaxScaler

y_scaler = MinMaxScaler()

The scaler treats each column as a unit, so `y` must be a 2-D column array.

In [None]:
y = y.values.reshape(-1, 1)
y = y_scaler.fit_transform(y)
y

Most models require a 1-D array for `y` and emit a warning otherwise.

In [None]:
y = y.ravel()
y

### Process X

Replace store levels with ranking numbers.

In [None]:
df_x.replace({"normal": 1, "important": 2, "strategic": 3}, inplace=True)
df_x

Encode categorical features.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# In scikit-learn >= 1.2 this argument is named `sparse_output`.
x_encoder = OneHotEncoder(sparse=False)
x_category = x_encoder.fit_transform(df_x[features_category])
x_category

Scale numeric features.

In [None]:
x_scaler = MinMaxScaler()
# Keep the fitted scaler in `x_scaler` and store the scaled array separately.
x_number = x_scaler.fit_transform(df_x[features_number])
x_number

Merge all features into one array.

In [None]:
import numpy as np

X = np.hstack([x_number, x_category])
X

## The Pipeline Way

The [The Pandas Way](#the-pandas-way) section shows that:

- The steps are full of intermediate variables. Most of the time we don't care about them, except when debugging or reviewing.
- The data workflow is messy. It is hard to separate data from operations.
- The output data structure is inconvenient. The input type is `pandas.DataFrame` but the output type is `numpy.ndarray`.
- It is hard to apply the same steps to prediction data.
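
The last point can be illustrated with the scaler alone. The sketch below assumes a hypothetical new store with an area of 136 (not part of the sample data) and shows that prediction data must go through the already fitted scaler with `transform`, which means every fitted object from the pandas workflow has to be kept and reapplied by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Areas of the three stores that survive filtering in the sample data.
train_area = np.array([[100.0], [95.0], [177.0]])

# Hypothetical new store to predict on; not part of the original data.
new_area = np.array([[136.0]])

scaler = MinMaxScaler()
scaler.fit(train_area)  # learns min=95 and max=177 from the training data

# Reusing the fitted scaler puts the new value on the training scale:
# (136 - 95) / (177 - 95) = 0.5
scaled_new = scaler.transform(new_area)

# Refitting on the new data forgets the training range; a single value
# collapses to 0.
refit_new = MinMaxScaler().fit_transform(new_area)

print(scaled_new, refit_new)
```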

### Further One Step to Pipeline

`sklearn.pipeline.Pipeline` is a good framework for fixing these problems.

The next step would be to convert the code in the [process X](#process-x) and [process y](#process-y) sections into pipeline code.

In practice, however, these steps are hard to convert into a pipeline.
Most of them are pandas methods; only `OneHotEncoder` and `MinMaxScaler` can be added
to `sklearn.pipeline.Pipeline`.

The code would still be messy both to **write** and to **apply**.
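
A rough sketch of how far plain scikit-learn gets on its own. It uses `make_column_transformer` from `sklearn.compose`, which this tutorial does not otherwise rely on, and a reduced copy of the sample data that is assumed to have been filtered with pandas beforehand.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# A reduced copy of the sample data, already filtered with pandas.
df_x = pd.DataFrame(
    {
        "floor": ["1F", "2F", "1F"],
        "type": ["School", "Mall", "Office"],
        "area": [100, 95, 177],
        "score": [10, 8, 6],
    }
)

# Only the encoder and the scaler fit into scikit-learn's framework;
# the query, filter, and eval steps stay outside as pandas code.
ct = make_column_transformer(
    (OneHotEncoder(), ["floor", "type"]),
    (MinMaxScaler(), ["area", "score"]),
)

X = ct.fit_transform(df_x)
# Depending on density, the result may be a SciPy sparse matrix.
if hasattr(X, "toarray"):
    X = X.toarray()

print(type(X).__name__)  # ndarray: the column names are gone
print(X.shape)           # (3, 7): 2 floor + 3 type + 2 scaled columns
```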

## The `dtoolkit.transformer` Way

The framework is good, but the [Further One Step to Pipeline](#further-one-step-to-pipeline) section shows
that the core problem is **missing transformers**.

- Pandas methods can't be used as transformers.
- NumPy methods can't be used as transformers.
- Scikit-learn transformers can't take pandas objects in and give pandas objects out.
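
To make the gap concrete, here is a minimal sketch of what such a missing transformer could look like. `QueryStep` is a hypothetical class written only for this illustration; it is not how `dtoolkit.transformer` is implemented.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class QueryStep(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper that runs DataFrame.query as a pipeline step."""

    def __init__(self, expr):
        self.expr = expr

    def fit(self, X, y=None):
        # Nothing to learn; the step is stateless.
        return self

    def transform(self, X):
        # DataFrame in, DataFrame out.
        return X.query(self.expr)


df = pd.DataFrame({"name": ["A", "B", "C", "D"], "opendays": [300, 100, 250, 15]})
pl = make_pipeline(QueryStep("opendays > 30"))
out = pl.fit_transform(df)
print(out["name"].tolist())  # ['A', 'B', 'C']
```

Writing one such class per pandas or NumPy method quickly becomes repetitive, which is the gap a ready-made transformer collection is meant to fill.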

In [None]:
from dtoolkit.transformer import (
    EvalTF,
    FilterInTF,
    GetTF,
    MinMaxScaler,
    ReplaceTF,
    OneHotEncoder,
    QueryTF,
    make_union,
    RavelTF,
)
from sklearn.pipeline import make_pipeline

In [None]:
pl_xy = make_pipeline(
    QueryTF("opendays > 30"),
    FilterInTF({"type": ["School", "Mall", "Office"]}),
    EvalTF("sale = sale / opendays"),
    EvalTF("population = score / 10 * population"),
)
pl_xy

In [None]:
pl_x = make_pipeline(
    GetTF(features),
    ReplaceTF({"normal": 1, "important": 2, "strategic": 3}),
    make_union(
        make_pipeline(
            GetTF(features_category),
            OneHotEncoder(),
        ),
        make_pipeline(
            GetTF(features_number),
            MinMaxScaler(),
        ),
    ),
)
pl_x

In [None]:
pl_y = make_pipeline(
    GetTF(label),
    MinMaxScaler(),
    RavelTF(),
)
pl_y

In [None]:
store_sale_dict = {
    "code": ["811-10001", "811-10002", "811-10003", "811-10004"],
    "name": ["A", "B", "C", "D"],
    "floor": ["1F", "2F", "1F", "B2"],
    "level": ["strategic", "normal", "important", "normal"],
    "type": ["School", "Mall", "Office", "Home"],
    "area": [100, 95, 177, 70],
    "population": [3000, 1000, 2000, 1500],
    "score": [10, 8, 6, 5],
    "opendays": [300, 100, 250, 15],
    "sale": [8000, 5000, 3000, 1500],
}
df = pd.DataFrame(store_sale_dict)
df

In [None]:
xy = pl_xy.fit_transform(df)
xy

In [None]:
X = pl_x.fit_transform(xy)
X

In [None]:
y = pl_y.fit_transform(xy)
y

These pipelines can also be saved as binary files via `pickle` or `joblib`.
When new data arrive, they can be transformed quickly by loading the saved pipelines.
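
A minimal sketch of the save-and-reload round trip with `pickle`. To stay self-contained it uses a small stand-in pipeline built from scikit-learn's own `MinMaxScaler` rather than the pipelines above, and a hypothetical new store with an area of 136.

```python
import pickle

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Fit a small stand-in pipeline on the training areas from the sample data.
pl = make_pipeline(MinMaxScaler())
pl.fit(np.array([[100.0], [95.0], [177.0]]))

# Serialize the fitted pipeline; write these bytes to a file to persist them.
blob = pickle.dumps(pl)

# Later, possibly in another process: restore it and transform new data.
restored = pickle.loads(blob)
new_area = np.array([[136.0]])  # hypothetical new store
print(restored.transform(new_area))  # close to [[0.5]]
```

Pickled objects should only be loaded from trusted sources, and preferably with the same library versions that were used to create them.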

## Other Ways to Handle This

`pandas.DataFrame.pipe` and plain functions also work.

But they are:

- hard to convert correctly into application code
- hard to debug, and hard to use for inspecting the intermediate data
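
For comparison, a short sketch of the `DataFrame.pipe` style on two of the steps. The function names `keep_open_stores` and `to_daily_sale` are hypothetical helpers made up for this example.

```python
import pandas as pd


def keep_open_stores(df, min_days=30):
    """Keep stores that have been open for more than `min_days` days."""
    return df[df["opendays"] > min_days]


def to_daily_sale(df):
    """Replace the yearly sale with the sale per open day."""
    return df.assign(sale=df["sale"] / df["opendays"])


df = pd.DataFrame({"opendays": [300, 100, 250, 15], "sale": [8000, 5000, 3000, 1500]})
result = df.pipe(keep_open_stores).pipe(to_daily_sale)
print(result["sale"].round(2).tolist())  # [26.67, 50.0, 12.0]
```

The chain reads well, but it holds no fitted state: anything learned from the training data, such as scaling ranges, has to be stored and passed around separately.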

## What's Next - Learn or Build Transformers

This tutorial gave a quick glance at `dtoolkit.transformer`.

As a next step, learn about the other transformers in the
[Transformer API](../reference/transformer.rst) documentation.
If those transformers don't meet your requirements, you can build your own
by following [How to Build Transformer](build_transformer.ipynb).