# Pipeline

[Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is a scikit-learn class that chains a sequence of transformations followed by a final estimator. <br>
To do so, the intermediate steps must implement `fit` and `transform`, while the final estimator only needs to implement `fit`. <br>
The purpose of a `Pipeline` is to:
- bundle several steps so they can be cross-validated together while varying their parameters
- help produce code that follows a standard pattern, easy to understand and to share between data science and data engineering teams.
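
As a minimal sketch of the first point, the cell below cross-validates a whole pipeline while varying one parameter of its final step. The synthetic data, the `Ridge` model and the `alpha` grid are assumptions chosen for illustration; they are not part of the tips example used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only, not the tips dataset)
X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)  # the scaler is refit on the training part of every fold

print(search.best_params_)
```

Because the scaler lives inside the pipeline, it never sees the validation part of a fold during fitting, which is the main practical benefit of cross-validating the steps together.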

<img src="images/pipeline.png" text="https://nbviewer.org/github/rasbt/python-machine-learning-book/blob/master/code/ch06/ch06.ipynb#Combining-transformers-and-estimators-in-a-pipeline">



## Pipeline

The examples used here were taken from https://towardsdatascience.com/pipeline-columntransformer-and-featureunion-explained-f5491f815f

Before we start, we need to define two terms:

- __Transformer:__ A transformer is an object of a class that has the fit() and transform() methods and helps us transform the data into the shape we want. OneHotEncoder, SimpleImputer and MinMaxScaler are examples of transformers.
- __Estimator:__ An estimator is an ML model. It is an object of a class that has the fit() and predict() methods. Examples of estimators can be found [here](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).
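
To make the transformer definition concrete, here is a small sketch of a custom transformer. `MeanCenterer` is a hypothetical class written for this illustration (it is not part of scikit-learn); it only relies on `BaseEstimator` and `TransformerMixin`, which are imported further down in this notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # Learn the statistics from the data passed to fit() only
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows chaining and use inside a Pipeline

    def transform(self, X):
        # Apply the statistics learned in fit() to any data
        return np.asarray(X, dtype=float) - self.mean_


X_demo = np.array([[1.0, 10.0], [3.0, 30.0]])
# fit_transform() is provided by TransformerMixin
centered = MeanCenterer().fit_transform(X_demo)
print(centered)
```

The column means here are 2 and 20, so the rows become `[-1, -10]` and `[1, 10]`.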

Today we will use a simpler example dataset, keeping only 5 randomly sampled rows. We will use the tips data, described [at this link](https://vincentarelbundock.github.io/Rdatasets/doc/reshape2/tips.html).

In [1]:
# Import all libraries at once
# Set seed for reproducibility
seed = 123

# Import package/module for data
import numpy as np
import pandas as pd
from seaborn import load_dataset

# Import modules for feature engineering and modelling
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

# Load dataset
df = load_dataset('tips').drop(columns=['tip', 'sex']).sample(n=5, random_state=seed)

# Add missing values to make the dataset more interesting
df.iloc[[1, 2, 4], [2, 4]] = np.nan
df

Unnamed: 0,total_bill,smoker,day,time,size
112,38.07,No,Sun,Dinner,3.0
19,20.65,No,,Dinner,
187,30.46,Yes,,Dinner,
169,10.63,Yes,Sat,Dinner,2.0
31,18.35,No,,Dinner,


## Pipeline
Let's start with only the _smoker_, _day_ and _time_ columns to predict the _total_bill_ column.

In [3]:
# Drop the size column and split the data
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['total_bill', 'size']), 
                                                    df['total_bill'], 
                                                    test_size=.2, 
                                                    random_state=seed)

To use this data in a model, we need the following transformations:
- Impute the missing values (here we replace them with 'missing')
- One-hot encode the categories

<br>
How we would normally do it:

In [21]:
# Impute training data
imputer = SimpleImputer(strategy='constant', fill_value='missing')
X_train_imputed = imputer.fit_transform(X_train)

# Encode training data
encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
X_train_encoded = encoder.fit_transform(X_train_imputed)

# Inspect training data before and after
print("******************** Training data ********************")
print('Before')
display(X_train)
print('After the imputer')
display(pd.DataFrame(X_train_imputed, columns=X_train.columns))
print('After the encoder')
display(pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out(X_train.columns)))

# Transform test data
X_test_imputed = imputer.transform(X_test)
X_test_encoded = encoder.transform(X_test_imputed)

# Inspect test data before and after
print("******************** Test data ********************")
print('Before')
display(X_test)
print('After the imputer')
display(pd.DataFrame(X_test_imputed, columns=X_train.columns))
print('After the encoder')
display(pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out(X_train.columns)))

print("******************** Predicted data ********************")

# Fit model to training data
model = LinearRegression()
model.fit(X_train_encoded, y_train)

# Predict training data
y_train_pred = model.predict(X_train_encoded)
print(f"Predictions on training data: {y_train_pred}")

# Predict test data
y_test_pred = model.predict(X_test_encoded)
print(f"Predictions on test data: {y_test_pred}")

******************** Training data ********************
Before


Unnamed: 0,smoker,day,time
169,Yes,Sat,Dinner
31,No,,Dinner
112,No,Sun,Dinner
187,Yes,,Dinner


After the imputer


Unnamed: 0,smoker,day,time
0,Yes,Sat,Dinner
1,No,missing,Dinner
2,No,Sun,Dinner
3,Yes,missing,Dinner


After the encoder


Unnamed: 0,smoker_No,smoker_Yes,day_Sat,day_Sun,day_missing,time_Dinner
0,0.0,1.0,1.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,1.0,1.0
2,1.0,0.0,0.0,1.0,0.0,1.0
3,0.0,1.0,0.0,0.0,1.0,1.0


******************** Test data ********************
Before


Unnamed: 0,smoker,day,time
19,No,,Dinner


After the imputer


Unnamed: 0,smoker,day,time
0,No,missing,Dinner


After the encoder


Unnamed: 0,smoker_No,smoker_Yes,day_Sat,day_Sun,day_missing,time_Dinner
0,1.0,0.0,0.0,0.0,1.0,1.0


******************** Predicted data ********************
Predictions on training data: [10.63 18.35 38.07 30.46]
Predictions on test data: [18.35]


First, let's build a pipeline without the estimator so we can inspect how the data is transformed:

In [26]:
# Fit pipeline to training data
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
pipe.fit(X_train)

# Inspect training data before and after
print("******************** Training data ********************")
print('Before')
display(X_train)
print('After')
display(pd.DataFrame(pipe.transform(X_train), columns=pipe['encoder'].get_feature_names_out(X_train.columns)))

# Inspect test data before and after
print("******************** Test data ********************")
print('Before')
display(X_test)
print('After')
display(pd.DataFrame(pipe.transform(X_test), columns=pipe['encoder'].get_feature_names_out(X_train.columns)))

******************** Training data ********************
Before


Unnamed: 0,smoker,day,time
169,Yes,Sat,Dinner
31,No,,Dinner
112,No,Sun,Dinner
187,Yes,,Dinner


After


Unnamed: 0,smoker_No,smoker_Yes,day_Sat,day_Sun,day_missing,time_Dinner
0,0.0,1.0,1.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,1.0,1.0
2,1.0,0.0,0.0,1.0,0.0,1.0
3,0.0,1.0,0.0,0.0,1.0,1.0


******************** Test data ********************
Before


Unnamed: 0,smoker,day,time
19,No,,Dinner


After


Unnamed: 0,smoker_No,smoker_Yes,day_Sat,day_Sun,day_missing,time_Dinner
0,1.0,0.0,0.0,0.0,1.0,1.0


To add an estimator to the pipeline, we need to pass y_train to fit():

In [25]:
# Fit pipeline to training data
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse=False)), 
    ('model', LinearRegression())
])
pipe.fit(X_train, y_train)

# Predict training data
y_train_pred = pipe.predict(X_train)
print(f"Predictions on training data: {y_train_pred}")

# Predict test data
y_test_pred = pipe.predict(X_test)
print(f"Predictions on test data: {y_test_pred}")

Predictions on training data: [10.63 18.35 38.07 30.46]
Predictions on test data: [18.35]
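
The manual route and the pipeline route produce the same predictions above. The sketch below checks that equivalence on a tiny hand-made frame; the values in `X_toy` and `y_toy` are hypothetical, and the encoder is left at its default (sparse) output so the cell does not depend on the name of the sparse-related argument.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values, not the tips sample used above)
X_toy = pd.DataFrame({"day": ["Sat", np.nan, "Sun", np.nan],
                      "smoker": ["Yes", "No", "No", "Yes"]}, dtype=object)
y_toy = np.array([10.0, 18.0, 38.0, 30.0])

# Manual route: fit and apply each step by hand
imputer = SimpleImputer(strategy="constant", fill_value="missing")
encoder = OneHotEncoder(handle_unknown="ignore")
model = LinearRegression()
X_enc = encoder.fit_transform(imputer.fit_transform(X_toy))
manual_pred = model.fit(X_enc, y_toy).predict(X_enc)

# Pipeline route: the same three steps behind a single fit/predict
pipe_toy = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("model", LinearRegression()),
])
pipe_pred = pipe_toy.fit(X_toy, y_toy).predict(X_toy)

print(np.allclose(manual_pred, pipe_pred))
# Fitted steps remain accessible by name
print(pipe_toy.named_steps["encoder"].categories_)
```

Access through `named_steps` is handy when a fitted attribute of an intermediate step (such as the learned categories) needs to be inspected after training.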


But what if we don't want to apply OneHotEncoder to every column, and want specific transformations for other columns?

## ColumnTransformer()
Let's redo the train/test split, this time keeping all the columns:

In [27]:
# Partition data
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['total_bill']), 
                                                    df['total_bill'], 
                                                    test_size=.2, 
                                                    random_state=seed)

# Define categorical columns
categorical = list(X_train.select_dtypes('category').columns)
print(f"Categorical columns are: {categorical}")

# Define numerical columns
numerical = list(X_train.select_dtypes('number').columns)
print(f"Numerical columns are: {numerical}")

Categorical columns are: ['smoker', 'day', 'time']
Numerical columns are: ['size']


First, let's create a pipeline specific to the categorical variables and apply it through ColumnTransformer:

In [35]:
# Define categorical pipeline
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

# Fit column transformer to training data 
# remainder='passthrough' keeps the columns that are not used by any transformer
preprocessor = ColumnTransformer([('cat', cat_pipe, categorical)], remainder='passthrough')
preprocessor.fit(X_train)

# Prepare column names
cat_columns = preprocessor.named_transformers_['cat']['encoder'].get_feature_names_out(categorical)
columns = np.append(cat_columns, numerical)

# Inspect training data before and after
print("******************** Training data ********************")
print('Before')
display(X_train)
print('After')
display(pd.DataFrame(preprocessor.transform(X_train), columns=columns))

# Inspect test data before and after
print("******************** Test data ********************")
print('Before')
display(X_test)
print('After')
display(pd.DataFrame(preprocessor.transform(X_test), columns=columns))

******************** Training data ********************
Before


Unnamed: 0,smoker,day,time,size
169,Yes,Sat,Dinner,2.0
31,No,,Dinner,
112,No,Sun,Dinner,3.0
187,Yes,,Dinner,


After


Unnamed: 0,smoker_No,smoker_Yes,day_Sat,day_Sun,day_missing,time_Dinner,size
0,0.0,1.0,1.0,0.0,0.0,1.0,2.0
1,1.0,0.0,0.0,0.0,1.0,1.0,
2,1.0,0.0,0.0,1.0,0.0,1.0,3.0
3,0.0,1.0,0.0,0.0,1.0,1.0,


******************** Test data ********************
Before


Unnamed: 0,smoker,day,time,size
19,No,,Dinner,


After


Unnamed: 0,smoker_No,smoker_Yes,day_Sat,day_Sun,day_missing,time_Dinner,size
0,1.0,0.0,0.0,0.0,1.0,1.0,
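
The effect of `remainder` can be seen in isolation with a small sketch. The frame `toy` and its values are hypothetical; the point is only to compare the default `remainder='drop'` with `remainder='passthrough'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Toy frame with two numeric columns (hypothetical values)
toy = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [1.0, 2.0, 3.0]})

# Default remainder='drop': only the listed column survives
dropped = ColumnTransformer([("scale_a", MinMaxScaler(), ["a"])]).fit_transform(toy)

# remainder='passthrough': unlisted columns are appended, untouched,
# after the transformed ones
kept = ColumnTransformer([("scale_a", MinMaxScaler(), ["a"])],
                         remainder="passthrough").fit_transform(toy)

print(dropped.shape)  # (3, 1)
print(kept.shape)     # (3, 2)
```

Note that passed-through columns come last in the output, which is why the `size` column appears at the end of the table above.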


The output for the categorical variables is the same as before, but now we also have the size column. <br>
Now let's apply the following transformations to the numerical column:
- impute the median for missing values
- apply MinMaxScaler

In [37]:
# Define categorical pipeline
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

# Define numerical pipeline
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())
])

# Fit column transformer to training data
preprocessor = ColumnTransformer([
    ('cat', cat_pipe, categorical),
    ('num', num_pipe, numerical)
])
preprocessor.fit(X_train)

# Prepare column names
cat_columns = preprocessor.named_transformers_['cat']['encoder'].get_feature_names_out(categorical)
columns = np.append(cat_columns, numerical)

# Inspect training data before and after
print("******************** Training data ********************")
print('Before')
display(X_train)
print('After')
display(pd.DataFrame(preprocessor.transform(X_train), columns=columns))

# Inspect test data before and after
print("******************** Test data ********************")
print('Before')
display(X_test)
print('After')
display(pd.DataFrame(preprocessor.transform(X_test), columns=columns))

******************** Training data ********************
Before


Unnamed: 0,smoker,day,time,size
169,Yes,Sat,Dinner,2.0
31,No,,Dinner,
112,No,Sun,Dinner,3.0
187,Yes,,Dinner,


After


Unnamed: 0,smoker_No,smoker_Yes,day_Sat,day_Sun,day_missing,time_Dinner,size
0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,1.0,1.0,0.5
2,1.0,0.0,0.0,1.0,0.0,1.0,1.0
3,0.0,1.0,0.0,0.0,1.0,1.0,0.5


******************** Test data ********************
Before


Unnamed: 0,smoker,day,time,size
19,No,,Dinner,


Unnamed: 0,smoker_No,smoker_Yes,day_Sat,day_Sun,day_missing,time_Dinner,size
0,1.0,0.0,0.0,0.0,1.0,1.0,0.5
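
The values in the size column can be checked by hand. The sketch below redoes the two numerical steps with plain NumPy on the four training values shown above; it is only an arithmetic check, not how scikit-learn implements them internally.

```python
import numpy as np

# 'size' values of the training rows shown above (NaN marks the missing entries)
size_train = np.array([2.0, np.nan, 3.0, np.nan])

# Step 1: median imputation; the median ignores the missing entries
median = np.nanmedian(size_train)  # 2.5
imputed = np.where(np.isnan(size_train), median, size_train)

# Step 2: min-max scaling, (x - min) / (max - min) with min=2 and max=3
scaled = (imputed - imputed.min()) / (imputed.max() - imputed.min())
print(scaled)  # [0.  0.5 1.  0.5]
```

The test row has a missing size as well, so it receives the training median (2.5) and is scaled with the training minimum and maximum, giving 0.5.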


We can append a model after the preprocessor:

In [38]:
# Fit a pipeline with transformers and an estimator to the training data
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])
pipe.fit(X_train, y_train)

# Predict training data
y_train_pred = pipe.predict(X_train)
print(f"Predictions on training data: {y_train_pred}")

# Predict test data
y_test_pred = pipe.predict(X_test)
print(f"Predictions on test data: {y_test_pred}")

Predictions on training data: [10.63 18.35 38.07 30.46]
Predictions on test data: [18.35]


In [39]:
# We can visualize our pipe
from sklearn import set_config
set_config(display="diagram")
pipe  # click on the diagram below to see the details of each step

In [47]:
pipe['preprocessor']

<img src="images/columntransform.png" width="500px">
<br>
<br>


## Summary
<img src="images/comparing.png" width="500px">

For an example of FeatureUnion applied to this training set, see the
[link](https://towardsdatascience.com/pipeline-columntransformer-and-featureunion-explained-f5491f815f)
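
As a rough idea of what FeatureUnion does, the sketch below applies two transformers to the same input and concatenates their outputs side by side. The random data and the choice of `PCA` are assumptions for illustration and are unrelated to the linked article's example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_num = rng.normal(size=(10, 3))  # synthetic numeric data

# Both transformers receive the full input; outputs are stacked side by side
union = FeatureUnion([
    ("scaled", MinMaxScaler()),    # 3 columns out
    ("pca", PCA(n_components=2)),  # 2 columns out
])
X_union = union.fit_transform(X_num)
print(X_union.shape)  # (10, 5)
```

The contrast with ColumnTransformer is that FeatureUnion sends all columns to every transformer, while ColumnTransformer sends each transformer only its own subset of columns.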

In [51]:
# Load dataset
df = load_dataset('tips').drop(columns=['tip', 'sex']).sample(n=5, random_state=seed)

# Add missing values to make the dataset more interesting
df.iloc[[1, 2, 4], [2, 4]] = np.nan

# Partition data
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['total_bill']), 
                                                    df['total_bill'], 
                                                    test_size=.2, 
                                                    random_state=seed)

In [52]:
from sklearn.model_selection import cross_validate

A simpler example: we want a single metric. Since this is a regression problem, we use the negative mean squared error, which gives one global value per fold.

In [56]:
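# Note: SCORERS was removed in scikit-learn 1.3; newer releases provide sklearn.metrics.get_scorer_names()
from sklearn.metrics import SCORERS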
SCORERS.keys()

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_wei

In [57]:
cross_validate(pipe, X_train, y_train, scoring='neg_mean_squared_error', cv=3)

{'fit_time': array([0.00892138, 0.00877595, 0.00743437]),
 'score_time': array([0.00527382, 0.0039556 , 0.0034914 ]),
 'test_score': array([-362.79439062, -878.233225  , -370.81921111])}
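
The scores above are negative because scikit-learn scorers follow a "greater is better" convention, so error metrics are reported with the sign flipped. The sketch below, on synthetic data chosen only for illustration, recovers the usual MSE and RMSE from such a result.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data (illustrative only)
X_syn, y_syn = make_regression(n_samples=30, n_features=3, noise=10.0, random_state=0)

cv_result = cross_validate(LinearRegression(), X_syn, y_syn,
                           scoring="neg_mean_squared_error", cv=3)

mse_per_fold = -cv_result["test_score"]  # flip the sign to recover the MSE
rmse_per_fold = np.sqrt(mse_per_fold)
print(mse_per_fold, rmse_per_fold)
```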

A more complex example: we want several metrics at once (MSE, RMSE and MAE).

In [63]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer

In [70]:
scoring = {'mean_squared_error': make_scorer(mean_squared_error),
           'root_mean_squared_error': make_scorer(mean_squared_error, squared=False),
           'mean_absolute_error':make_scorer(mean_absolute_error)}

In [71]:
cross_validate(pipe, X_train, y_train, scoring=scoring, cv=3)

{'fit_time': array([0.00911951, 0.00874209, 0.00697446]),
 'score_time': array([0.00798941, 0.00455952, 0.00368953]),
 'test_mean_squared_error': array([362.79439062, 878.233225  , 370.81921111]),
 'test_root_mean_squared_error': array([19.04716227, 29.635     , 19.25666667]),
 'test_mean_absolute_error': array([18.82375   , 29.635     , 19.25666667])}
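
An alternative sketch for multi-metric scoring is to pass a list of built-in scorer names instead of building each scorer with `make_scorer`. This avoids relying on the `squared` argument of `mean_squared_error`, which, as far as I know, later scikit-learn releases have deprecated. The synthetic data is an assumption for illustration, and the values come back negated, as with any `neg_*` scorer.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data (illustrative only)
X_syn, y_syn = make_regression(n_samples=30, n_features=3, noise=10.0, random_state=0)

scoring_names = ["neg_mean_squared_error",
                 "neg_root_mean_squared_error",
                 "neg_mean_absolute_error"]
multi = cross_validate(LinearRegression(), X_syn, y_syn, scoring=scoring_names, cv=3)

# One 'test_<name>' entry per requested metric; negate to get the usual error
for name in scoring_names:
    print(name, -multi[f"test_{name}"])
```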

## References and Further Reading
- [Python Machine Learning Book](https://github.com/rasbt/python-machine-learning-book-3rd-edition)
- [Documentation](https://scikit-learn.org/stable/modules/compose.html)
- [ColumnTransformer](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data)
- [FeatureUnion](https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces)