### Machine Learning Process Workflow



*   Pipelines let us automate the steps of a workflow.
*   They reduce the amount of code we write.
*   They lower the risk of errors, such as preprocessing training and test data differently.



In [0]:
# Install category_encoders
!pip install category_encoders

In [0]:
from sklearn.pipeline import Pipeline
from sklearn import datasets
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn import tree
import pandas as pd

In [0]:
from google.colab import drive
drive.mount('/content/drive')

**Reading the dataset**

In [0]:
df = pd.read_csv('/content/drive/My Drive/Live/adult.data')

In [0]:
df.head()

**Removing unnecessary columns**

In [0]:
df.drop(['education'], axis=1, inplace=True)

**Separating features and labels**

In [0]:
X = df.drop('income', axis=1, inplace=False)
y = df.income

**Splitting into training and test sets**

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

**Selecting non-numeric columns**

In [0]:
df.select_dtypes(include='object')
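
The call above returns the non-numeric columns as a DataFrame. When only the column names are needed for later steps, something like the following sketch may help. The sample frame and its values are hypothetical; only the column names echo the dataset.

```python
import pandas as pd

# Tiny hypothetical sample; only the column names mirror the dataset above.
sample = pd.DataFrame({
    'age': [39, 50, 38],
    'race': pd.Series(['White', 'Black', 'White'], dtype=object),
    'sex': pd.Series(['Male', 'Male', 'Female'], dtype=object),
})

# Names of the object-typed (categorical) columns and of the numeric ones.
categorical_cols = sample.select_dtypes(include='object').columns.tolist()
numeric_cols = sample.select_dtypes(include='number').columns.tolist()

print(categorical_cols)
print(numeric_cols)
```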

**Applying the One Hot Encoder**

In [0]:
ohe = OneHotEncoder(use_cat_names=True)

In [0]:
X_train = ohe.fit_transform(X_train)

In [0]:
X_train.head()
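
To make the effect of one-hot encoding concrete, here is a minimal sketch on made-up data. It uses `pandas.get_dummies` as a stand-in for the `category_encoders` encoder, purely to show the layout of the result: one 0/1 column per category value.

```python
import pandas as pd

# Hypothetical sample with one numeric and one categorical column.
sample = pd.DataFrame({
    'age': [39, 50, 38],
    'race': ['White', 'Black', 'White'],
})

# get_dummies is a stand-in here, not the encoder used in the notebook.
# Each distinct value of 'race' becomes its own indicator column.
encoded = pd.get_dummies(sample, columns=['race'], dtype=int)

print(encoded)
```

Unlike `get_dummies`, an encoder object that is fitted on the training data can later be applied to the test data with the same set of columns, which is why the notebook uses one.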

**Applying a preprocessor**

In [0]:
scaler = StandardScaler().fit(X_train)

In [0]:
scaler

In [0]:
valores_scalados = scaler.transform(X_train)

In [0]:
valores_scalados[:10]

In [0]:
X_train = scaler.transform(X_train)
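
As a rough illustration of what `StandardScaler` computes, the sketch below compares its output with the same calculation done by hand on a hypothetical single-feature array: each value has the column mean subtracted and is divided by the column's population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with four observations.
values = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler_demo = StandardScaler().fit(values)
scaled = scaler_demo.transform(values)

# Same result by hand: subtract the mean, divide by the population standard deviation.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```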

**Building the model**

In [0]:
clf_tree = tree.DecisionTreeClassifier()

In [0]:
clf_tree = clf_tree.fit(X_train,y_train)

**Applying OHE and the preprocessor to the test data**

In [0]:
X_test = ohe.transform(X_test)

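In [0]:
# Reuse the scaler fitted on the training data. Fitting a new StandardScaler
# on the test set would scale it with different statistics than the model saw.
X_test = scaler.transform(X_test)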

In [0]:
X_test[:10]

In [0]:
clf_tree.predict(X_test)

**Validating the model**

In [0]:
acuracia = clf_tree.score(X_test, y_test)

In [0]:
acuracia

### Creating Pipelines



*   A pipeline chains steps in sequence.
*   The same steps are applied to both the training and the test data.
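
The two points above can be sketched on a small example. The code below is an illustration only: it uses the iris dataset bundled with scikit-learn and fixed `random_state` values (both are assumptions, not part of the notebook) to show that a pipeline gives the same predictions as fitting each step by hand on the training data and reusing the fitted steps on the test data.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: the iris dataset that ships with scikit-learn.
X_demo, y_demo = datasets.load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# Manual version: fit the scaler on the training data, reuse it on the test data.
scaler_demo = StandardScaler().fit(Xtr)
clf_demo = tree.DecisionTreeClassifier(random_state=0)
clf_demo.fit(scaler_demo.transform(Xtr), ytr)
manual_pred = clf_demo.predict(scaler_demo.transform(Xte))

# Pipeline version: the same two steps chained in one object.
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', tree.DecisionTreeClassifier(random_state=0)),
])
pipe_demo.fit(Xtr, ytr)
pipe_pred = pipe_demo.predict(Xte)

print(np.array_equal(manual_pred, pipe_pred))
```

The scaler inside the pipeline is fitted only during `fit`, so `predict` and `score` apply the training statistics to whatever data they receive.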



In [0]:
pip_1 = Pipeline([
    ('ohe', OneHotEncoder()),                 
    ('scaler', StandardScaler()),
    ('clf', tree.DecisionTreeClassifier())
])

In [0]:
pip_1.steps

**Process workflow with a Pipeline**

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [0]:
pip_1.fit(X_train,y_train)

**Validating the model**

In [0]:
acuracia = pip_1.score(X_test, y_test)

In [0]:
acuracia

### Creating other Pipelines



*   Create pipelines with different configurations.
*   Validate several models.



In [0]:
from sklearn.preprocessing import MinMaxScaler

In [0]:
pip_minmax = Pipeline([
    ('ohe', OneHotEncoder()),              
    ('min_max_scaler', MinMaxScaler()),
    ('clf', tree.DecisionTreeClassifier())
])

pip_max_depth = Pipeline([
    ('ohe', OneHotEncoder()),              
    ('min_max_scaler', MinMaxScaler()),
    ('clf', tree.DecisionTreeClassifier(max_depth=3))
])

pip_max_depth_std = Pipeline([
    ('ohe', OneHotEncoder()),              
    ('standardscaler', StandardScaler()),
    ('clf', tree.DecisionTreeClassifier(max_depth=3))
])

**Validating the models**

In [0]:
pip_minmax.fit(X_train,y_train)
acuracia = pip_minmax.score(X_test, y_test)
acuracia

In [0]:
pip_max_depth.fit(X_train,y_train)
acuracia = pip_max_depth.score(X_test, y_test)
acuracia

In [0]:
pip_max_depth_std.fit(X_train,y_train)
acuracia = pip_max_depth_std.score(X_test, y_test)
acuracia
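
When several pipelines are compared, the fit-and-score cells can be folded into a loop. The following is a minimal sketch under stated assumptions: it uses the iris dataset bundled with scikit-learn, omits the one-hot encoding step because that data is fully numeric, and the dictionary keys are hypothetical labels.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative data: the iris dataset that ships with scikit-learn.
X_demo, y_demo = datasets.load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# Candidate pipelines keyed by a descriptive (hypothetical) label.
candidates = {
    'minmax': Pipeline([
        ('scaler', MinMaxScaler()),
        ('clf', tree.DecisionTreeClassifier(random_state=0)),
    ]),
    'minmax_depth3': Pipeline([
        ('scaler', MinMaxScaler()),
        ('clf', tree.DecisionTreeClassifier(max_depth=3, random_state=0)),
    ]),
    'standard_depth3': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', tree.DecisionTreeClassifier(max_depth=3, random_state=0)),
    ]),
}

# Fit each pipeline on the training split and score it on the test split.
scores = {name: pipe.fit(Xtr, ytr).score(Xte, yte) for name, pipe in candidates.items()}

for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```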

### Processing different columns

In [0]:
X.head()

In [0]:
from sklearn.compose import ColumnTransformer

In [0]:
from sklearn.impute import SimpleImputer

**Pipeline with a median imputation step**

In [0]:
mediana = Pipeline(steps=[
    ('mediana', SimpleImputer(strategy='median'))
])

**Pipeline with a most-frequent imputation step**

In [0]:
frequente = Pipeline(steps=[
    ('frequente', SimpleImputer(strategy='most_frequent'))
])

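**Creating the ColumnTransformer that combines the two pipelines**

In [0]:
# Note: with the default remainder='drop', only the columns listed here are passed on
data_cleaning = ColumnTransformer(transformers=[
    ('mediana', mediana, ['education-num']),
    ('frequent', frequente, ['race'])
])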
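
A `ColumnTransformer` applies each transformer only to the columns named for it. The sketch below, on a small made-up frame, shows the imputed values and how the `remainder` parameter decides whether unlisted columns are dropped (the default) or kept. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical sample with one missing value per imputed column.
sample = pd.DataFrame({
    'age': [39.0, 50.0, 38.0, 53.0],
    'education-num': [9.0, np.nan, 13.0, 10.0],
    'race': pd.Series(['White', 'Black', np.nan, 'White'], dtype=object),
})

transformers = [
    ('mediana', SimpleImputer(strategy='median'), ['education-num']),
    ('frequent', SimpleImputer(strategy='most_frequent'), ['race']),
]

# Default remainder='drop': only the two listed columns come out.
dropped = ColumnTransformer(transformers).fit_transform(sample)

# remainder='passthrough': unlisted columns ('age') are appended unchanged.
kept = ColumnTransformer(transformers, remainder='passthrough').fit_transform(sample)

print(dropped.shape)
print(kept.shape)
```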


**Final Pipeline**

In [0]:
pipeline_final = Pipeline([
    ('datacleaning', data_cleaning),                    # first step of the pipeline is the data cleaning
    ('ohe', OneHotEncoder()),                           # apply one-hot encoding to the data
    ('standardscaler', StandardScaler()),               # preprocess the data with StandardScaler
    ('tree', tree.DecisionTreeClassifier())])           # build the model with a decision tree

In [0]:
pipeline_final.fit(X_train, y_train)

In [0]:
pipeline_final.predict(X_test)

In [0]:
acuracia = pipeline_final.score(X_test, y_test)
acuracia

### Grid Search and Pipelines

In [0]:
from sklearn.model_selection import GridSearchCV

**Parameters for the grid**

In [0]:
parametros_grid = dict(tree__max_depth=[3,4,5,6,7,8,9,10])
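
The key `tree__max_depth` follows scikit-learn's `<step name>__<parameter>` convention: the part before the double underscore is the name given to the step in the pipeline. A small sketch, using a hypothetical two-step pipeline, of how to list the available names and set one of them:

```python
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline; the step names match those used in the grid keys.
pipe_demo = Pipeline([
    ('standardscaler', StandardScaler()),
    ('tree', tree.DecisionTreeClassifier()),
])

# Every tunable parameter of the 'tree' step is exposed with the 'tree__' prefix.
tree_params = sorted(name for name in pipe_demo.get_params() if name.startswith('tree__'))
print(tree_params)

# The same names work with set_params, which is what a grid search relies on.
pipe_demo.set_params(tree__max_depth=3)
print(pipe_demo.named_steps['tree'].max_depth)
```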

**Creating the GridSearchCV object with the defined parameters and 5-fold cross-validation**

In [0]:
grid = GridSearchCV(pipeline_final, param_grid=parametros_grid, cv=5, scoring='accuracy')

**Running the grid search**

In [0]:
grid.fit(X,y)

**Results**

In [0]:
grid.cv_results_

In [0]:
grid.best_params_

In [0]:
grid.best_score_
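
The raw `cv_results_` dictionary is easier to read as a table. One possible approach, sketched here on the iris dataset bundled with scikit-learn with a shortened grid (both are assumptions for illustration), is to load it into a DataFrame and sort by rank:

```python
import pandas as pd
from sklearn import datasets, tree
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and pipeline; the step name 'tree' matches the grid key prefix.
X_demo, y_demo = datasets.load_iris(return_X_y=True)
pipe_demo = Pipeline([
    ('standardscaler', StandardScaler()),
    ('tree', tree.DecisionTreeClassifier(random_state=0)),
])

grid_demo = GridSearchCV(pipe_demo, param_grid={'tree__max_depth': [2, 3, 4]},
                         cv=5, scoring='accuracy')
grid_demo.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
columns = ['param_tree__max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(grid_demo.cv_results_)[columns].sort_values('rank_test_score')

print(results)
```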