# 1. PIPELINES - DEFINITION

## 1. Definition
#### Pipelines are classes that automate the data-preparation workflow in Machine Learning projects by chaining the preparation steps, and optionally a final estimator, into a single object.

## 2. Why use pipelines?
#### a) Preparing the training, validation, and test sets separately can be costly. Once built, a pipeline can be applied to every dataset (training, validation, and test);
#### b) It lets us call fit() and predict() once for a whole sequence of estimators;
#### c) It lets us tune the parameters of every estimator inside the pipeline together;
#### d) Cleaner code.
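
Points (b) and (c) can be illustrated with a minimal sketch. It uses synthetic data from `make_classification` and a hypothetical two-step pipeline; the step names `scaler` and `clf` and the grid of `C` values are assumptions chosen for illustration, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to keep the example self-contained.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical two-step pipeline: scaling followed by a classifier.
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# (b) A single fit() and a single predict() run every step in order.
pipe.fit(X, y)
predictions = pipe.predict(X)

# (c) Parameters of any step are addressed as '<step name>__<parameter>'.
param_grid = {'clf__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the search wraps the whole pipeline, the scaler is refit on each training fold, which keeps the preprocessing from seeing the held-out fold.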

# 2. LOADING THE DATA

In [1]:
import pandas as pd

In [38]:
PATH_TRAIN = '../datasets/train.csv'
PATH_TEST = '../datasets/test.csv'

In [39]:
def load_dataset(path):
    return pd.read_csv(path)

In [41]:
train = load_dataset(PATH_TRAIN)
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
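
The `Non-Null Count` column above shows that `Age`, `Cabin`, and `Embarked` contain missing values (714, 204, and 889 non-null entries out of 891). One quick way to count missing values directly is `isna().sum()`. The sketch below uses a small made-up DataFrame so that it runs without the CSV file; on the real data the equivalent call would be `train.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 71.28, 8.05, 26.55],
    'Embarked': ['S', 'C', None, 'S'],
})

# Number of missing values per column.
missing = toy.isna().sum()
print(missing)
```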


In [42]:
test = load_dataset(PATH_TEST)
test.head(2)

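|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |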


# 3. BUILDING PIPELINES

### imports

In [43]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report

### field settings

In [45]:
labels = ['Survived', 'Pclass', 'Age', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked']

numerical_labels = ['Age', 'Fare']
categorical_labels = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']

In [46]:
from sklearn.model_selection import train_test_split

treino, validacao = train_test_split(train[labels], test_size=0.3)

X_treino = treino.drop('Survived', axis=1)
y_treino = treino['Survived']

X_valid = validacao.drop('Survived', axis=1)
y_valid = validacao['Survived']
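
The split above is drawn at random on every run, so the validation metrics can change between executions. One possible variation, sketched here on synthetic arrays, fixes `random_state` and passes `stratify` so that both subsets keep the same class ratio. The seed value 42 and the toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 30% of them in the positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 30 + [0] * 70)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,  # arbitrary seed, makes the split repeatable
    stratify=y,       # keep the class ratio in both subsets
)

# Share of the positive class in each subset.
print(y_tr.mean(), y_va.mean())
```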

### Pipeline for numerical features

In [47]:
num_pipe = Pipeline(steps=[
    ('imp', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
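
To see what these two steps do, here is a minimal sketch that rebuilds the same pipeline and applies it to a made-up single-column array with one missing value. The array `X_toy` is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipe = Pipeline(steps=[
    ('imp', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Made-up single-column input with one missing value.
X_toy = np.array([[1.0], [np.nan], [3.0]])

# The missing entry is replaced by the median of the observed values (2.0);
# the column is then rescaled to zero mean and unit variance.
X_out = num_pipe.fit_transform(X_toy)
print(X_out.ravel())
```

The imputed row ends up at 0 after scaling because the median of this column equals its mean.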

### Pipeline for categorical features

In [48]:
cat_pipe = Pipeline(steps=[
    ('imp', SimpleImputer(strategy='most_frequent')),
    # Encode categories unseen during fit as all zeros instead of raising.
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
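
The behaviour of the categorical steps can be sketched on a made-up column. The example sets `handle_unknown='ignore'` explicitly so that a category unseen during `fit` is encoded as all zeros instead of raising an error; the name `cat_demo` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_demo = Pipeline(steps=[
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

# Made-up training column with one missing value, filled with 'S'.
fit_data = pd.DataFrame({'Embarked': ['S', 'C', 'S', np.nan]})
encoded = cat_demo.fit_transform(fit_data).toarray()
print(encoded)  # columns follow the sorted categories: C, S

# 'Q' was never seen during fit, so its row is all zeros.
new_data = pd.DataFrame({'Embarked': ['Q']})
unseen = cat_demo.transform(new_data).toarray()
print(unseen)
```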

### Combining all pipelines in a ColumnTransformer

In [49]:
preprocess = ColumnTransformer(transformers=[
    ('num_pipe', num_pipe, numerical_labels),
    ('cat_pipe', cat_pipe, categorical_labels)
])
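
A minimal sketch of how a `ColumnTransformer` routes columns, using a made-up four-row DataFrame. The names `num_demo`, `cat_demo`, and `pre_demo`, the toy values, and the extra `Ticket` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_demo = Pipeline(steps=[
    ('imp', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
cat_demo = Pipeline(steps=[
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder()),
])
pre_demo = ColumnTransformer(transformers=[
    ('num', num_demo, ['Age', 'Fare']),
    ('cat', cat_demo, ['Sex', 'Embarked']),
])

# Made-up rows; 'Ticket' is not listed in any transformer.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 58.0],
    'Fare': [7.25, 71.28, 8.05, 26.55],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', np.nan],
    'Ticket': ['A1', 'B2', 'C3', 'D4'],
})

# 2 scaled numeric columns + 2 one-hot columns for Sex + 2 for Embarked.
# 'Ticket' is dropped because remainder defaults to 'drop'.
X_out = pre_demo.fit_transform(toy)
print(X_out.shape)
```

Columns that should pass through untouched would need `remainder='passthrough'`.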

# 4. PIPELINES AND ESTIMATORS

In [52]:
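def cross_validate_models(model):
    """Fit the full pipeline on the training split and report hold-out metrics.

    Despite the name, this uses the single train/validation split above,
    not k-fold cross-validation.
    """
    # Chain the shared preprocessing with the given estimator.
    pipeline = Pipeline(steps=[
        ('preprocess', preprocess),
        ('classifier', model)
    ])

    pipeline.fit(X_treino, y_treino)
    y_predict = pipeline.predict(X_valid)
    print(classification_report(y_valid, y_predict))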

In [54]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = [RandomForestClassifier(), LogisticRegression()]

for model in models:
    cross_validate_models(model)

              precision    recall  f1-score   support

           0       0.80      0.86      0.83       162
           1       0.76      0.67      0.71       106

    accuracy                           0.78       268
   macro avg       0.78      0.76      0.77       268
weighted avg       0.78      0.78      0.78       268

              precision    recall  f1-score   support

           0       0.82      0.90      0.86       162
           1       0.82      0.71      0.76       106

    accuracy                           0.82       268
   macro avg       0.82      0.80      0.81       268
weighted avg       0.82      0.82      0.82       268
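
The reports above come from a single hold-out split. For k-fold cross-validation, one option is `cross_val_score`, which refits the whole pipeline on each fold so that the preprocessing steps never see the held-out fold. The sketch below runs on synthetic data; the two-step pipeline, `cv=5`, and the accuracy metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to keep the example self-contained.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# Each of the 5 folds refits the scaler and the classifier from scratch.
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a steadier estimate than one split.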

