# Sklearn Pipeline
<img src="pipeline-diagram.png">
A pipeline automates a sequence of steps, such as preprocessing followed by model fitting.  
Example: using a pipeline to select both the preprocessing parameters and the model hyperparameters.

Advantage of cross-validating a pipeline: the one-hot encoder transform runs inside the pipeline, which is possible with newer versions of sklearn (0.20.2 or later).  
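
As a minimal sketch of the idea, the block below chains a preprocessing step and a model into a single estimator. The toy data and the `StandardScaler` step are assumptions made for illustration; they are not part of this notebook's Titanic workflow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one numeric feature, binary target.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# The pipeline chains preprocessing and model into a single estimator.
toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# fit() runs fit_transform on the scaler, then fits the model on the result.
toy_pipe.fit(X_toy, y_toy)

# predict() applies the already fitted scaler before calling the model.
toy_pred = toy_pipe.predict(np.array([[2.5], [10.5]]))
print(toy_pred)
```

Because the whole sequence is one object, it can be passed to tools such as `cross_val_score` exactly like a plain model.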
  

Using the Titanic data, described at: https://www.openml.org/d/40945

## Exploratory data analysis

In [12]:
# !pip install pandas
# !pip install scikit-learn
# !pip install matplotlib
# Note: `pip install sklearn` is deprecated; the package name is `scikit-learn`.

In [89]:
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt



In [14]:
# df = pd.read_csv('http://bit.ly/kaggletrain')
df = pd.read_csv(
    "https://www.openml.org/data/get_csv/16826755/phpMYEkMl", na_values="?"
)

In [15]:
df.shape

(1309, 14)

In [16]:
df.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

In [17]:
df.cabin

0            B5
1       C22 C26
2       C22 C26
3       C22 C26
4       C22 C26
         ...   
1304        NaN
1305        NaN
1306        NaN
1307        NaN
1308        NaN
Name: cabin, Length: 1309, dtype: object

In [18]:
df.isna().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [19]:
df = df.loc[df.embarked.notna(), ["pclass", "survived", "sex", "embarked"]]

In [20]:
df.isna().sum()

pclass      0
survived    0
sex         0
embarked    0
dtype: int64

In [21]:
df.shape

(1307, 4)

In [22]:
df.head()

Unnamed: 0,pclass,survived,sex,embarked
0,1,1,female,S
1,1,1,male,S
2,1,0,female,S
3,1,0,male,S
4,1,0,female,S


In [23]:
df.sample(10)

Unnamed: 0,pclass,survived,sex,embarked
83,1,1,female,S
1176,3,0,male,S
1298,3,0,male,S
1300,3,1,female,C
703,3,0,male,Q
668,3,0,male,S
1199,3,0,male,S
767,3,0,male,S
1189,3,1,female,S
592,2,0,male,S


In [24]:
df.survived.value_counts()

survived
0    809
1    498
Name: count, dtype: int64

In [25]:
df.survived.value_counts(normalize=True)

survived
0    0.618975
1    0.381025
Name: proportion, dtype: float64

Tip:  
What is the minimum acceptable accuracy?  
In a binary classification problem like this one, it is about 61.9%, which is the accuracy obtained by always predicting the majority class ('0' = did not survive).
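
The baseline can be reproduced from the class counts alone. This is a small sketch that uses only the counts printed by `value_counts()` above; the variable names are illustrative.

```python
from collections import Counter

# Class counts taken from the value_counts() output above.
y_counts = Counter({0: 809, 1: 498})

# Always predicting the most frequent class gets this fraction right.
majority_class, majority_count = y_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(y_counts.values())

print(majority_class, round(baseline_accuracy, 4))
```

Any model scoring below this value does worse than ignoring the features entirely.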

## Cross-validation
<img src='cross-validation.png'>  
Cross-validation splits the data into parts. For example, with k-fold = 5, the data are split into 5 parts and the model is trained 5 times.  
In each round, one part is held out as the test set (to evaluate the model) and the remaining parts are used for training.
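
The splitting described above can be sketched with `KFold` on ten hypothetical rows (the array below is made up for illustration and is not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified only by their row position.
X_demo = np.arange(10).reshape(-1, 1)

# k = 5: each fold holds out 2 of the 10 rows for testing.
kfold = KFold(n_splits=5)

held_out = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_demo), start=1):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    held_out.extend(test_idx.tolist())

# Every row is used for testing exactly once across the 5 folds.
print(sorted(held_out) == list(range(10)))
```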

### Without cross-validation

In [26]:
X = df.loc[:, ["pclass"]]
y = df.survived

In [27]:
X

Unnamed: 0,pclass
0,1
1,1
2,1
3,1
4,1
...,...
1304,3
1305,3
1306,3
1307,3


In [28]:
from sklearn.linear_model import LogisticRegression

In [29]:
reg_log = LogisticRegression(solver="lbfgs")

In [30]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

In [31]:
reg_log.fit(X_train, y_train)

In [32]:
reg_log.score(X_test, y_test)

0.7633587786259542

### With cross-validation

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [34]:
# df = df.sample(len(df))



In [35]:
X = df.loc[:, ["pclass"]]
y = df.survived

In [36]:
# With plenty of data, set aside a test set that you may use only once, at the end of your ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=45, stratify=X.pclass
)

In [37]:
reg_log = LogisticRegression(solver="lbfgs")

In [38]:
cv_result = cross_val_score(reg_log, X_train, y_train, cv=3, scoring="accuracy")

cv_result.mean()

0.6583863035053629

In [39]:
X_test

Unnamed: 0,pclass
1265,3
964,3
374,2
55,1
254,1
...,...
443,2
799,3
567,2
1166,3


In [40]:
X_test.pclass.value_counts() / len(X_test.pclass)

pclass
3    0.541985
1    0.244275
2    0.213740
Name: count, dtype: float64

In [41]:
X_train.pclass.value_counts() / len(X_train.pclass)

pclass
3    0.542584
1    0.245933
2    0.211483
Name: count, dtype: float64

In [42]:
cv_result

array([0.64469914, 0.66954023, 0.66091954])

In [43]:
reg_log.fit(X_train, y_train)

In [44]:
reg_log.score(X_test, y_test)

0.7480916030534351

In [45]:
cv_result.std()

0.010298313710304641

In [46]:
cv_result = cross_val_score(
    reg_log,
    X,
    y,
    cv=10,
    scoring="accuracy",
)

cv_result.mean()

0.6762301820317088

In [47]:
cv_result

array([0.38167939, 0.67938931, 1.        , 0.98473282, 0.61832061,
       0.61832061, 0.61832061, 0.61538462, 0.62307692, 0.62307692])

In [48]:
cv_result.std()

0.17498119933511946
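
The fold scores above range from about 0.38 to 1.00. A likely cause is row order: the rows appear to be sorted by `pclass`, and an integer `cv` builds its folds without shuffling. One option is to pass an explicit splitter with `shuffle=True`. The sketch below uses synthetic data (an assumption, with made-up survival probabilities) rather than the Titanic file, so the numbers it prints are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the Titanic data (an assumption for illustration):
# rows are sorted by class, and survival is more likely in lower class numbers.
rng = np.random.default_rng(0)
pclass_demo = np.repeat([1, 2, 3], [300, 300, 600])
survive_prob = np.select(
    [pclass_demo == 1, pclass_demo == 2], [0.6, 0.4], default=0.25
)
y_demo = (rng.random(pclass_demo.size) < survive_prob).astype(int)
X_demo = pclass_demo.reshape(-1, 1)

model = LogisticRegression()

# Integer cv: stratified folds for a classifier, taken in row order, no shuffling.
scores_ordered = cross_val_score(model, X_demo, y_demo, cv=10, scoring="accuracy")

# Explicit splitter: same number of folds, but rows are shuffled first.
cv_shuffled = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores_shuffled = cross_val_score(
    model, X_demo, y_demo, cv=cv_shuffled, scoring="accuracy"
)

print(f"ordered : mean={scores_ordered.mean():.3f} std={scores_ordered.std():.3f}")
print(f"shuffled: mean={scores_shuffled.mean():.3f} std={scores_shuffled.std():.3f}")
```

Shuffling the rows before splitting, by hand or through the splitter, should make each fold look more like the full data set.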

In [49]:
from sklearn.model_selection import KFold

In [50]:
# One way to see which rows go into the train or test part of each CV fold
cv10 = KFold(10, shuffle=True, random_state=42)

In [51]:
[k for k in cv10.split(X, y)]

[(array([   0,    1,    2, ..., 1303, 1304, 1305]),
  array([  23,   32,   43,   44,   49,   51,   58,   63,   65,   76,   78,
           81,  101,  115,  184,  192,  198,  208,  210,  231,  240,  243,
          244,  247,  254,  261,  270,  275,  282,  291,  308,  309,  316,
          322,  331,  351,  358,  371,  382,  390,  394,  409,  410,  427,
          429,  432,  438,  447,  481,  482,  503,  506,  536,  538,  549,
          552,  573,  583,  584,  602,  614,  630,  651,  676,  705,  707,
          711,  722,  731,  741,  743,  744,  745,  752,  764,  774,  787,
          792,  793,  802,  818,  841,  845,  846,  861,  862,  868,  875,
          903,  908,  910,  914,  932,  936,  945,  947,  968,  973,  986,
         1009, 1015, 1024, 1037, 1040, 1043, 1047, 1057, 1091, 1100, 1106,
         1110, 1114, 1116, 1140, 1156, 1157, 1164, 1170, 1174, 1176, 1192,
         1225, 1243, 1246, 1265, 1281, 1283, 1291, 1295, 1299, 1306])),
 (array([   0,    1,    2, ..., 1304, 1305, 1306]),

In [52]:
# Let's shuffle the rows manually
df_shuffle = df.sample(frac=1).reset_index()
df_shuffle

Unnamed: 0,index,pclass,survived,sex,embarked
0,817,3,0,male,S
1,116,1,1,female,S
2,1299,3,0,male,C
3,334,2,0,male,S
4,554,2,0,male,S
...,...,...,...,...,...
1302,947,3,1,female,S
1303,993,3,1,female,Q
1304,1248,3,0,male,S
1305,799,3,0,male,S


In [53]:
X = df_shuffle.loc[:, ["pclass"]]
y = df_shuffle.survived

In [54]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=45
)

In [55]:
reg_log = LogisticRegression(solver="lbfgs")

In [56]:
cv_result = cross_val_score(reg_log, X_train, y_train, cv=10, scoring="accuracy")

cv_result

array([0.64761905, 0.67619048, 0.68571429, 0.65714286, 0.67619048,
       0.67307692, 0.70192308, 0.68269231, 0.66346154, 0.69230769])

In [57]:
print(f"mean = {cv_result.mean()}")
print(f"std = {cv_result.std()}")

mean = 0.6756318681318682
std = 0.01550047133999041


In [58]:
cv_result = cross_val_score(reg_log, X_train, y_train, cv=5, scoring="accuracy")

cv_result

array([0.66028708, 0.67464115, 0.66985646, 0.6937799 , 0.67942584])

In [59]:
print(f"mean = {cv_result.mean()}")
print(f"std = {cv_result.std()}")

mean = 0.675598086124402
std = 0.011077355887837516


In [60]:
cv_result = cross_val_score(reg_log, X_train, y_train, cv=2, scoring="accuracy")

cv_result

array([0.66730402, 0.68390805])

In [61]:
print(f"mean = {cv_result.mean()}")
print(f"std = {cv_result.std()}")

mean = 0.6756060306366893
std = 0.00830201534032221


## Encoding categorical variables (features)
If the categories have no specific order, one-hot encoding (dummy encoding) is usually the best option.
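
To make the distinction concrete, the sketch below encodes an unordered variable with `OneHotEncoder` and an ordered one with `OrdinalEncoder`. The `sizes` column and its ordering are hypothetical and only serve as a contrast; they do not appear in the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Unordered category (port of embarkation): one column per category.
ports = np.array([["S"], ["C"], ["Q"], ["S"]], dtype=object)
onehot = OneHotEncoder().fit_transform(ports).toarray()
print(onehot)

# Ordered category (hypothetical example): integers that follow the given order.
sizes = np.array([["small"], ["large"], ["medium"]], dtype=object)
ordinal = OrdinalEncoder(
    categories=[["small", "medium", "large"]]
).fit_transform(sizes)
print(ordinal)
```

The one-hot columns follow the sorted category order (C, Q, S), so no port is treated as larger than another.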

In [62]:
from sklearn.preprocessing import OneHotEncoder

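# sparse_output=False returns a dense array (the parameter was named `sparse` before scikit-learn 1.2)
hot_enc = OneHotEncoder(sparse_output=False)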

In [63]:
df.head()

Unnamed: 0,pclass,survived,sex,embarked
0,1,1,female,S
1,1,1,male,S
2,1,0,female,S
3,1,0,male,S
4,1,0,female,S


In [64]:
df.tail()

Unnamed: 0,pclass,survived,sex,embarked
1304,3,0,female,C
1305,3,0,female,C
1306,3,0,male,C
1307,3,0,male,C
1308,3,0,male,S


### Encoding only the 'embarked' variable to see the result

In [65]:
["embarked_C", "embarked_Q", "embarked_S"]

['embarked_C', 'embarked_Q', 'embarked_S']

In [66]:
df[["embarked_C", "embarked_Q", "embarked_S"]] = hot_enc.fit_transform(df[["embarked"]])



In [67]:
df.sample(5)

Unnamed: 0,pclass,survived,sex,embarked,embarked_C,embarked_Q,embarked_S
1283,3,0,male,S,0.0,0.0,1.0
738,3,1,female,S,0.0,0.0,1.0
515,2,1,male,S,0.0,0.0,1.0
1306,3,0,male,C,1.0,0.0,0.0
1032,3,0,male,Q,0.0,1.0,0.0


In [68]:
hot_enc.categories_

[array(['C', 'Q', 'S'], dtype=object)]

### Encoding all categorical variables: 'sex' and 'embarked'

In [69]:
X = df[["pclass", "sex", "embarked"]]

In [70]:
X.head()

Unnamed: 0,pclass,sex,embarked
0,1,female,S
1,1,male,S
2,1,female,S
3,1,male,S
4,1,female,S


In [71]:
X.sample(10)

Unnamed: 0,pclass,sex,embarked
373,2,female,S
1181,3,male,S
824,3,male,S
1058,3,female,S
1059,3,male,S
514,2,male,S
340,2,female,S
888,3,male,S
279,1,male,S
533,2,female,S


In [72]:
from sklearn.compose import make_column_transformer

In [73]:
col_transf = make_column_transformer(
    (OneHotEncoder(), ["sex", "embarked"]),
    remainder="passthrough",
)

In [74]:
col_transf.fit_transform(X)

array([[1., 0., 0., 0., 1., 1.],
       [0., 1., 0., 0., 1., 1.],
       [1., 0., 0., 0., 1., 1.],
       ...,
       [0., 1., 1., 0., 0., 3.],
       [0., 1., 1., 0., 0., 3.],
       [0., 1., 0., 0., 1., 3.]])

In [75]:
col_transf.get_feature_names_out()

array(['onehotencoder__sex_female', 'onehotencoder__sex_male',
       'onehotencoder__embarked_C', 'onehotencoder__embarked_Q',
       'onehotencoder__embarked_S', 'remainder__pclass'], dtype=object)

In [76]:
df_transform = pd.DataFrame(
    col_transf.fit_transform(X), columns=col_transf.get_feature_names_out()
)

In [77]:
df_transform

Unnamed: 0,onehotencoder__sex_female,onehotencoder__sex_male,onehotencoder__embarked_C,onehotencoder__embarked_Q,onehotencoder__embarked_S,remainder__pclass
0,1.0,0.0,0.0,0.0,1.0,1.0
1,0.0,1.0,0.0,0.0,1.0,1.0
2,1.0,0.0,0.0,0.0,1.0,1.0
3,0.0,1.0,0.0,0.0,1.0,1.0
4,1.0,0.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...
1302,1.0,0.0,1.0,0.0,0.0,3.0
1303,1.0,0.0,1.0,0.0,0.0,3.0
1304,0.0,1.0,1.0,0.0,0.0,3.0
1305,0.0,1.0,1.0,0.0,0.0,3.0


## Pipeline with categorical encoding and cross-validation

In [78]:
from sklearn.pipeline import make_pipeline

In [79]:
X = df[["pclass", "sex", "embarked"]]
y = df["survived"]

In [80]:
col_transf = make_column_transformer(
    (OneHotEncoder(), ["sex", "embarked"]), remainder="passthrough"
)

In [81]:
reg_log = LogisticRegression(solver="lbfgs")

In [82]:
pipe = make_pipeline(col_transf, reg_log)

In [83]:
pipe

In [84]:
cross_val_score(pipe, X, y, cv=10, scoring="accuracy").mean()

0.7450733998825602

In [85]:
novo_X = X.sample(5, random_state=123)
novo_X

Unnamed: 0,pclass,sex,embarked
358,2,female,S
653,3,female,C
870,3,female,S
84,1,male,C
816,3,male,C


In [86]:
pipe.fit(X, y)

In [87]:
pipe.predict(novo_X)

array([1, 1, 1, 1, 0])

In [88]:
y.sample(5, random_state=123)

358    1
653    1
870    1
84     0
816    0
Name: survived, dtype: int64
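
The pipeline can predict on new rows even when they lack some of the categories seen in training, because the fitted encoder remembers them. The sketch below contrasts this with `pd.get_dummies`, using a small hypothetical data set (the values are made up for illustration).

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Small hypothetical training set with three ports, and new rows with only two.
train = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1, 3],
    "embarked": ["S", "C", "Q", "S", "C", "Q"],
})
y_small = [1, 0, 1, 0, 1, 0]
new_rows = pd.DataFrame({"pclass": [2, 3], "embarked": ["S", "C"]})

# pd.get_dummies only creates columns for categories present in each frame.
n_cols_train = pd.get_dummies(train).shape[1]
n_cols_new = pd.get_dummies(new_rows).shape[1]
print(n_cols_train, n_cols_new)

# The fitted encoder inside the pipeline remembers all training categories.
small_pipe = make_pipeline(
    make_column_transformer((OneHotEncoder(), ["embarked"]), remainder="passthrough"),
    LogisticRegression(),
)
small_pipe.fit(train, y_small)
small_pred = small_pipe.predict(new_rows)
print(small_pred.shape)
```

Categories that never appeared in training are a separate case; `OneHotEncoder` has a `handle_unknown` parameter for that situation.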

### Pipelines have some advantages:

1- Your training file stays the same and does not grow because of the one-hot encoding.  
2- When predicting on new data, there is no need to run pandas dummies on the new file. This also avoids problems when the new data do not contain every category present in the training data: with manual dummy encoding, the new dataset would have different dimensions and prediction would raise an error.  
3- You can run a grid search over the preprocessing parameters and the model parameters together.  
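
The third point can be sketched with `GridSearchCV`. The data set below is small and hypothetical, and the searched values (`drop` for the encoder, `C` for the model) are assumptions chosen only to show how one preprocessing parameter and one model parameter are tuned together.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Small hypothetical data set with the same column names used above.
X_small = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 1, 2, 3, 3, 1, 3],
    "sex": ["female", "male"] * 6,
    "embarked": ["S", "C", "Q"] * 4,
})
y_small = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

search_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(), ["sex", "embarked"]), remainder="passthrough"
    ),
    LogisticRegression(solver="lbfgs"),
)

# Parameter names follow <step name>__<transformer name>__<parameter>.
param_grid = {
    "columntransformer__onehotencoder__drop": [None, "first"],  # preprocessing
    "logisticregression__C": [0.1, 1.0, 10.0],  # model
}

grid = GridSearchCV(search_pipe, param_grid, cv=3, scoring="accuracy")
grid.fit(X_small, y_small)

print(grid.best_params_)
print(grid.best_score_)
```

The step names come from `make_pipeline` and `make_column_transformer`, which name each step after its lowercased class; `search_pipe.get_params().keys()` lists every name that can be searched.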
