
# Target Encoder

A study of the target encoder that shows how to turn categorical variables into numeric ones.

Note that this study uses the Titanic dataset, whose categorical variables have few distinct categories. Target encoding is best suited to variables with many distinct categories, because in that case one-hot encoding generates many highly imbalanced columns.
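To make the column-count argument concrete, the sketch below compares the two encodings on a synthetic high-cardinality column. The `city` and `target` columns and their values are made up for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): 1,000 rows and 200 distinct categories.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "city": [f"city_{i % 200}" for i in range(1000)],
    "target": rng.integers(0, 2, size=1000),
})

# One-hot encoding creates one column per distinct category.
one_hot = pd.get_dummies(df_demo["city"], prefix="city")

# Target encoding replaces the column with a single numeric column:
# the mean of the target within each category.
target_encoded = df_demo.groupby("city")["target"].transform("mean")

print(one_hot.shape)         # one column per category
print(target_encoded.shape)  # a single column
```

In this toy setup each one-hot column is active in only 5 of the 1,000 rows, which is the imbalance the paragraph above refers to.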

In [None]:
import pandas as pd
import numpy as np

df_result = pd.DataFrame()

### Data Set: Titanic

In [None]:
from catboost.datasets import titanic
df_titanic, _ = titanic()

df_titanic.fillna(0, inplace=True)
display(df_titanic.head(3))

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,0,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0,S


* PassengerId: Passenger identification number
* Survived: Whether the passenger survived the disaster
    * 0 = No
    * 1 = Yes
* Pclass: Ticket class
    * 1 = 1st class
    * 2 = 2nd class
    * 3 = 3rd class
* Name: Passenger's name
* Sex: Passenger's sex
* Age: Passenger's age
* SibSp: Number of spouses and siblings aboard
* Parch: Number of parents and children aboard
* Ticket: Ticket number
* Fare: Ticket price
* Cabin: Passenger's cabin number
* Embarked: Port where the passenger embarked
    * C = Cherbourg
    * Q = Queenstown
    * S = Southampton

## One-Hot Encoding

Converting the categorical columns into numeric ones. The transformed columns are: Pclass, Sex, Embarked.

#### *Applying the Method*

In [None]:
X_Pclass_one = pd.get_dummies(df_titanic['Pclass'], prefix='Pclass')
X_Sex_one = pd.get_dummies(df_titanic['Sex'], prefix='Sex')
X_Embarked_one = pd.get_dummies(df_titanic['Embarked'], prefix='Embarked')

df_one = pd.concat([df_titanic, X_Pclass_one, X_Sex_one, X_Embarked_one], axis=1)
df_one.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Embarked,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_0,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,S,0,0,1,0,1,0,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,C,1,0,0,1,0,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,...,S,0,0,1,1,0,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,...,S,1,0,0,1,0,0,0,0,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,...,S,0,0,1,0,1,0,0,0,1


#### *Selecting the columns*

In [None]:
X_one = df_one[['Age', 'SibSp', 'Parch', 'Fare',
            'Pclass_1', 'Pclass_2', 'Pclass_3',
            'Sex_female', 'Sex_male',
            'Embarked_C', 'Embarked_Q', 'Embarked_S']]
display(X_one.head(5))

y_one = df_one['Survived']
display(y_one.head(5))

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,22.0,1,0,7.25,0,0,1,0,1,0,0,1
1,38.0,1,0,71.2833,1,0,0,1,0,1,0,0
2,26.0,0,0,7.925,0,0,1,1,0,0,0,1
3,35.0,1,0,53.1,1,0,0,1,0,0,0,1
4,35.0,0,0,8.05,0,0,1,0,1,0,0,1


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

#### *Splitting train and test*

In [None]:
from sklearn.model_selection import train_test_split

X_train_one, X_test_one, y_train_one, y_test_one = train_test_split(X_one, y_one, test_size=0.2, random_state=13051980)
display(X_train_one.head(5))
display(X_test_one.head(5))

display(y_train_one.head(5))
display(y_test_one.head(5))

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
675,18.0,0,0,7.775,0,0,1,0,1,0,0,1
774,54.0,1,3,23.0,0,1,0,1,0,0,0,1
664,20.0,1,0,7.925,0,0,1,0,1,0,0,1
517,0.0,0,0,24.15,0,0,1,0,1,0,1,0
130,33.0,0,0,7.8958,0,0,1,0,1,1,0,0


Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
400,39.0,0,0,7.925,0,0,1,0,1,0,0,1
568,0.0,0,0,7.2292,0,0,1,0,1,1,0,0
135,23.0,0,0,15.0458,0,1,0,0,1,1,0,0
341,24.0,3,2,263.0,1,0,0,1,0,0,0,1
239,33.0,0,0,12.275,0,1,0,0,1,0,0,1


675    0
774    1
664    1
517    0
130    0
Name: Survived, dtype: int64

400    1
568    0
135    0
341    1
239    0
Name: Survived, dtype: int64

#### *Training the Model*

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, mean_absolute_error, make_scorer

rf_one = RandomForestClassifier(n_estimators=100, random_state=42)
rf_one.fit(X_train_one, y_train_one)

rf_one

#### *Prediction & Error*

In [None]:
pred_train_one = rf_one.predict_proba(X_train_one)[:, 1]
pred_test_one = rf_one.predict_proba(X_test_one)[:, 1]

df_result = pd.DataFrame({'Método':['One-Hot-Encoding'],
                          'Train ROC':[roc_auc_score(y_train_one, pred_train_one)],
                          'Test ROC':[roc_auc_score(y_test_one, pred_test_one)]})
df_result

Unnamed: 0,Método,Train ROC,Test ROC
0,One-Hot-Encoding,0.997621,0.867666


## Target Encoder

The Target Encoder is a method that replaces each category with the frequency of the target within that category (for a binary target, the mean of the target in that category). [Ref 01]

Care must be taken to avoid information leakage, that is, using information during development that would not be available at prediction time.

The transformed columns are: Pclass, Sex, Embarked.

We use an off-the-shelf library for the transformation [Ref 02]. Setting min_samples_leaf=0 and smoothing=0 in this library gives pure target encoding, that is, without smoothing.


- Ref 01: https://www.youtube.com/watch?v=589nCGeWG1w
- Ref 02: https://contrib.scikit-learn.org/category_encoders/targetencoder.html
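
As a minimal sketch of the idea (not the category_encoders implementation), pure target encoding can be written with a pandas `groupby`. The toy data and the names `train`, `test` and `encode` are hypothetical. The point is that the category means are learned from the training rows only and then mapped onto the test rows, which is what avoids leakage. Falling back to the global training mean for unseen categories is an assumption of this sketch.

```python
import pandas as pd

train = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Survived": [0, 0, 1, 1, 1],
})
test = pd.DataFrame({"Sex": ["female", "male", "unknown"]})

# Learn the mapping from the training rows only.
category_means = train.groupby("Sex")["Survived"].mean()
global_mean = train["Survived"].mean()

def encode(column: pd.Series) -> pd.Series:
    """Map each category to its training mean; unseen categories get the global mean."""
    return column.map(category_means).fillna(global_mean)

train["Sex_TE"] = encode(train["Sex"])
test["Sex_TE"] = encode(test["Sex"])
print(test)
```

The test targets are never read, so nothing from the test set can influence the encoded values.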


#### *Splitting train and test*

In [None]:
from sklearn.model_selection import train_test_split

X_TE = df_titanic[['Age', 'SibSp', 'Parch', 'Fare',
            'Pclass',
            'Sex',
            'Embarked']]
display(X_TE.head(5))

y_TE = df_titanic['Survived']
display(y_TE.head(5))

X_train_TE, X_test_TE, y_train_TE, y_test_TE = train_test_split(X_TE, y_TE, test_size=0.2, random_state=13051980)
display(X_train_TE.head(5))
display(X_test_TE.head(5))

display(y_train_TE.head(5))
display(y_test_TE.head(5))

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
0,22.0,1,0,7.25,3,male,S
1,38.0,1,0,71.2833,1,female,C
2,26.0,0,0,7.925,3,female,S
3,35.0,1,0,53.1,1,female,S
4,35.0,0,0,8.05,3,male,S


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
675,18.0,0,0,7.775,3,male,S
774,54.0,1,3,23.0,2,female,S
664,20.0,1,0,7.925,3,male,S
517,0.0,0,0,24.15,3,male,Q
130,33.0,0,0,7.8958,3,male,C


Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
400,39.0,0,0,7.925,3,male,S
568,0.0,0,0,7.2292,3,male,C
135,23.0,0,0,15.0458,2,male,C
341,24.0,3,2,263.0,1,female,S
239,33.0,0,0,12.275,2,male,S


675    0
774    1
664    1
517    0
130    0
Name: Survived, dtype: int64

400    1
568    0
135    0
341    1
239    0
Name: Survived, dtype: int64

#### *Applying the Method*

In [None]:
import category_encoders as ce

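# Pclass
# Note: Pclass is an integer column. With the default cols=None, TargetEncoder
# only encodes string/categorical columns, so Pclass_TE keeps the original
# values 1, 2, 3 (see the output below). Passing cols=['Pclass'] should encode it.
enc_Pclass_TE = ce.TargetEncoder(min_samples_leaf=0, smoothing=0)
enc_Pclass_TE_fit = enc_Pclass_TE.fit(X_train_TE[['Pclass']], y_train_TE)
X_train_TE['Pclass_TE'] = enc_Pclass_TE_fit.transform(X_train_TE['Pclass'])
X_test_TE['Pclass_TE'] = enc_Pclass_TE_fit.transform(X_test_TE['Pclass'])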

# SEX
enc_Sex_TE = ce.TargetEncoder(min_samples_leaf=0, smoothing=0)
enc_Sex_TE_fit = enc_Sex_TE.fit(X_train_TE[['Sex']], y_train_TE)
X_train_TE['Sex_TE'] = enc_Sex_TE_fit.transform(X_train_TE['Sex'])
X_test_TE['Sex_TE'] = enc_Sex_TE_fit.transform(X_test_TE['Sex'])

# Embarked
enc_Embarked_TE = ce.TargetEncoder(min_samples_leaf=0, smoothing=0)
enc_Embarked_TE_fit = enc_Embarked_TE.fit(X_train_TE[['Embarked']], y_train_TE)
X_train_TE['Embarked_TE'] = enc_Embarked_TE_fit.transform(X_train_TE['Embarked'])
X_test_TE['Embarked_TE'] = enc_Embarked_TE_fit.transform(X_test_TE['Embarked'])

display(X_train_TE.head(5))
display(X_test_TE.head(5))



Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked,Pclass_TE,Sex_TE,Embarked_TE
675,18.0,0,0,7.775,3,male,S,3,0.197849,0.327485
774,54.0,1,3,23.0,2,female,S,2,0.716599,0.327485
664,20.0,1,0,7.925,3,male,S,3,0.197849,0.327485
517,0.0,0,0,24.15,3,male,Q,3,0.197849,0.378788
130,33.0,0,0,7.8958,3,male,C,3,0.197849,0.564885


Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked,Pclass_TE,Sex_TE,Embarked_TE
400,39.0,0,0,7.925,3,male,S,3,0.197849,0.327485
568,0.0,0,0,7.2292,3,male,C,3,0.197849,0.564885
135,23.0,0,0,15.0458,2,male,C,2,0.197849,0.564885
341,24.0,3,2,263.0,1,female,S,1,0.716599,0.327485
239,33.0,0,0,12.275,2,male,S,2,0.197849,0.327485


#### *Selecting the columns*

In [None]:
X_train_TE = X_train_TE[['Age', 'SibSp', 'Parch', 'Fare',
                         'Sex_TE', 'Pclass_TE', 'Embarked_TE']]
display(X_train_TE.head(5))

X_test_TE = X_test_TE[['Age', 'SibSp', 'Parch', 'Fare',
                       'Sex_TE', 'Pclass_TE', 'Embarked_TE']]
display(X_test_TE.head(5))

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_TE,Pclass_TE,Embarked_TE
675,18.0,0,0,7.775,0.197849,3,0.327485
774,54.0,1,3,23.0,0.716599,2,0.327485
664,20.0,1,0,7.925,0.197849,3,0.327485
517,0.0,0,0,24.15,0.197849,3,0.378788
130,33.0,0,0,7.8958,0.197849,3,0.564885


Unnamed: 0,Age,SibSp,Parch,Fare,Sex_TE,Pclass_TE,Embarked_TE
400,39.0,0,0,7.925,0.197849,3,0.327485
568,0.0,0,0,7.2292,0.197849,3,0.564885
135,23.0,0,0,15.0458,0.197849,2,0.564885
341,24.0,3,2,263.0,0.716599,1,0.327485
239,33.0,0,0,12.275,0.197849,2,0.327485


#### *Training the Model*

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, mean_absolute_error, make_scorer

rf_TE = RandomForestClassifier(n_estimators=100, random_state=42)
rf_TE.fit(X_train_TE, y_train_TE)

rf_TE

#### *Prediction & Error*

In [None]:
pred_train_TE = rf_TE.predict_proba(X_train_TE)[:, 1]
pred_test_TE = rf_TE.predict_proba(X_test_TE)[:, 1]

df_aux_TE = pd.DataFrame({'Método':['Target Encoder'],
                          'Train ROC':[roc_auc_score(y_train_TE, pred_train_TE)],
                          'Test ROC':[roc_auc_score(y_test_TE, pred_test_TE)]})

df_result = pd.concat([df_result, df_aux_TE], axis=0)
df_result

Unnamed: 0,Método,Train ROC,Test ROC
0,One-Hot-Encoding,0.997621,0.867666
0,Target Encoder,0.997722,0.86618


## Target Encoder + Smoothing

For smoothing we use the same library, but set the parameters to min_samples_leaf=20, smoothing=10.

Smoothing pulls the encoding of rarely occurring categories toward the global mean.
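
The effect can be sketched numerically. The function below assumes the sigmoid weighting described in the category_encoders documentation, where the weight given to the category mean grows with the category count `n`. Other versions of the library may differ, so treat it as an illustration rather than the library's exact code. The function name `smoothed_encoding` and the example numbers are hypothetical.

```python
import math

def smoothed_encoding(category_mean, n, global_mean, min_samples_leaf=20, smoothing=10):
    """Blend the category mean with the global mean using a sigmoid weight on n."""
    weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
    return weight * category_mean + (1.0 - weight) * global_mean

global_mean = 0.40
# A rare category (5 rows) is pulled strongly toward the global mean...
rare = smoothed_encoding(category_mean=0.90, n=5, global_mean=global_mean)
# ...while a frequent category (500 rows) keeps almost exactly its own mean.
frequent = smoothed_encoding(category_mean=0.90, n=500, global_mean=global_mean)
print(round(rare, 3), round(frequent, 3))
```

This is consistent with the outputs in this notebook: the categories shown are frequent, so the smoothed values barely differ from the unsmoothed ones (for example 0.378778 versus 0.378788 for Embarked = Q).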

#### *Splitting train and test*

In [None]:
from sklearn.model_selection import train_test_split

X_TES = df_titanic[['Age', 'SibSp', 'Parch', 'Fare',
            'Pclass',
            'Sex',
            'Embarked']]
display(X_TES.head(5))

y_TES = df_titanic['Survived']
display(y_TES.head(5))

X_train_TES, X_test_TES, y_train_TES, y_test_TES = train_test_split(X_TES, y_TES, test_size=0.2, random_state=13051980)
display(X_train_TES.head(5))
display(X_test_TES.head(5))

display(y_train_TES.head(5))
display(y_test_TES.head(5))

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
0,22.0,1,0,7.25,3,male,S
1,38.0,1,0,71.2833,1,female,C
2,26.0,0,0,7.925,3,female,S
3,35.0,1,0,53.1,1,female,S
4,35.0,0,0,8.05,3,male,S


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
675,18.0,0,0,7.775,3,male,S
774,54.0,1,3,23.0,2,female,S
664,20.0,1,0,7.925,3,male,S
517,0.0,0,0,24.15,3,male,Q
130,33.0,0,0,7.8958,3,male,C


Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
400,39.0,0,0,7.925,3,male,S
568,0.0,0,0,7.2292,3,male,C
135,23.0,0,0,15.0458,2,male,C
341,24.0,3,2,263.0,1,female,S
239,33.0,0,0,12.275,2,male,S


675    0
774    1
664    1
517    0
130    0
Name: Survived, dtype: int64

400    1
568    0
135    0
341    1
239    0
Name: Survived, dtype: int64

#### *Applying the Method*

In [None]:
import category_encoders as ce

# Pclass
# Note: as an integer column, Pclass is not encoded with the default cols=None,
# so Pclass_TE keeps the original values 1, 2, 3 (see the output below).
enc_Pclass_TES = ce.TargetEncoder(min_samples_leaf=20, smoothing=10)
enc_Pclass_TES_fit = enc_Pclass_TES.fit(X_train_TES[['Pclass']], y_train_TES)
X_train_TES['Pclass_TE'] = enc_Pclass_TES_fit.transform(X_train_TES['Pclass'])
X_test_TES['Pclass_TE'] = enc_Pclass_TES_fit.transform(X_test_TES['Pclass'])

# SEX
enc_Sex_TES = ce.TargetEncoder(min_samples_leaf=20, smoothing=10)
enc_Sex_TES_fit = enc_Sex_TES.fit(X_train_TES[['Sex']], y_train_TES)
X_train_TES['Sex_TE'] = enc_Sex_TES_fit.transform(X_train_TES['Sex'])
X_test_TES['Sex_TE'] = enc_Sex_TES_fit.transform(X_test_TES['Sex'])

# Embarked
enc_Embarked_TES = ce.TargetEncoder(min_samples_leaf=20, smoothing=10)
enc_Embarked_TES_fit = enc_Embarked_TES.fit(X_train_TES[['Embarked']], y_train_TES)
X_train_TES['Embarked_TE'] = enc_Embarked_TES_fit.transform(X_train_TES['Embarked'])
X_test_TES['Embarked_TE'] = enc_Embarked_TES_fit.transform(X_test_TES['Embarked'])

display(X_train_TES.head(5))
display(X_test_TES.head(5))



Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked,Pclass_TE,Sex_TE,Embarked_TE
675,18.0,0,0,7.775,3,male,S,3,0.197849,0.327485
774,54.0,1,3,23.0,2,female,S,2,0.716599,0.327485
664,20.0,1,0,7.925,3,male,S,3,0.197849,0.327485
517,0.0,0,0,24.15,3,male,Q,3,0.197849,0.378778
130,33.0,0,0,7.8958,3,male,C,3,0.197849,0.564883


Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked,Pclass_TE,Sex_TE,Embarked_TE
400,39.0,0,0,7.925,3,male,S,3,0.197849,0.327485
568,0.0,0,0,7.2292,3,male,C,3,0.197849,0.564883
135,23.0,0,0,15.0458,2,male,C,2,0.197849,0.564883
341,24.0,3,2,263.0,1,female,S,1,0.716599,0.327485
239,33.0,0,0,12.275,2,male,S,2,0.197849,0.327485


#### *Selecting the columns*

In [None]:
X_train_TES = X_train_TES[['Age', 'SibSp', 'Parch', 'Fare',
                         'Sex_TE', 'Pclass_TE', 'Embarked_TE']]
display(X_train_TES.head(5))

X_test_TES = X_test_TES[['Age', 'SibSp', 'Parch', 'Fare',
                       'Sex_TE', 'Pclass_TE', 'Embarked_TE']]
display(X_test_TES.head(5))

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_TE,Pclass_TE,Embarked_TE
675,18.0,0,0,7.775,0.197849,3,0.327485
774,54.0,1,3,23.0,0.716599,2,0.327485
664,20.0,1,0,7.925,0.197849,3,0.327485
517,0.0,0,0,24.15,0.197849,3,0.378778
130,33.0,0,0,7.8958,0.197849,3,0.564883


Unnamed: 0,Age,SibSp,Parch,Fare,Sex_TE,Pclass_TE,Embarked_TE
400,39.0,0,0,7.925,0.197849,3,0.327485
568,0.0,0,0,7.2292,0.197849,3,0.564883
135,23.0,0,0,15.0458,0.197849,2,0.564883
341,24.0,3,2,263.0,0.716599,1,0.327485
239,33.0,0,0,12.275,0.197849,2,0.327485


#### *Training the Model*

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, mean_absolute_error, make_scorer

rf_TES = RandomForestClassifier(n_estimators=100, random_state=42)
rf_TES.fit(X_train_TES, y_train_TES)

rf_TES

#### *Prediction & Error*

In [None]:
pred_train_TES = rf_TES.predict_proba(X_train_TES)[:, 1]
pred_test_TES = rf_TES.predict_proba(X_test_TES)[:, 1]

df_aux_TES = pd.DataFrame({'Método':['Target Encoder Suavização'],
                          'Train ROC':[roc_auc_score(y_train_TES, pred_train_TES)],
                          'Test ROC':[roc_auc_score(y_test_TES, pred_test_TES)],})

df_result = pd.concat([df_result, df_aux_TES], axis=0)
df_result

Unnamed: 0,Método,Train ROC,Test ROC
0,One-Hot-Encoding,0.997621,0.867666
0,Target Encoder,0.997722,0.86618
0,Target Encoder Suavização,0.997713,0.867343


## Target Encoder + Smoothing + Shuffle Split

The Shuffle Split method [Ref 01] is used to increase variability. Here we draw 5 random splits of the training sample, each dividing it into two parts; in each split we compute the target encoding on one part and apply it to the other, which reduces overfitting.

The TargetEncoderSSS class was created to handle all of this manipulation.

* Ref 01: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit
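
A stripped-down sketch of this procedure, using a plain category mean instead of the smoothed category_encoders encoder, might look like the following. The toy data and variable names are hypothetical, and this is not the `TargetEncoderSSS` class itself: it only shows how each split fits on one part and encodes the other, so that one category ends up with several encoded values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit

# Toy data (hypothetical): one categorical column and a binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame({"Sex": rng.choice(["male", "female"], size=200)})
y = pd.Series(rng.integers(0, 2, size=200), name="Survived")

n_splits = 5
splitter = ShuffleSplit(n_splits=n_splits, test_size=1 / n_splits, random_state=0)

values_per_category = {}  # category -> one encoded value per split
out_of_fold = []          # encoded "out" part of each split

for in_index, out_index in splitter.split(X):
    # Fit: category means computed on the "in" part only.
    means = y.iloc[in_index].groupby(X["Sex"].iloc[in_index]).mean()
    # Apply: the "out" part receives the means learned on the "in" part.
    out_of_fold.append(X["Sex"].iloc[out_index].map(means))
    for category, value in means.items():
        values_per_category.setdefault(category, []).append(value)

for category, values in values_per_category.items():
    print(category, np.round(values, 3))
```

Each category collects one value per split, which is the source of the extra variability described above.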


#### *TargetEncoderSSS Class*

In [None]:
import random

from sklearn.model_selection import ShuffleSplit
import category_encoders as ce


___VAZIO___ = '___VAZIO___'

def pecco(msg, ecco:bool=True, use_data:bool=True, status:str='INFO') -> None:
    """Print msg, optionally prefixed with a timestamp and a status tag."""
    from datetime import datetime
    if ecco:
        pre_msg = ''
        data_hora = ''  # stays empty when use_data is False (avoids a NameError below)
        if use_data:
            data_hora = f"[{datetime.now().strftime('%d/%m/%Y:%H:%M')}]"
        if status != '':
            status = f'[{status}]'
        if use_data or status != '':
            pre_msg = f'{data_hora}{status} - '

        print(f'{pre_msg}{msg}')


class TargetEncoderSSS:
    def __init__(self,
                 n_splits:int=5, random_state=13051980,
                 min_samples_leaf:int=20, smoothing:int=10,
                 ecco:bool=False) -> None:

        self._tess = {}
        self._cols = []
        self._cols_te = []
        self._ecco = ecco
        self._n_splits = n_splits
        self._random_state = random_state
        self._smoothing = smoothing
        self._min_samples_leaf = min_samples_leaf
        self._target_encoder_fit = {}
        random.seed(self._random_state)

    def fit(self, X_train, y_train):
        X_train = X_train.copy()
        y_train = y_train.copy()

        pecco(msg=f'Iniciando fit...', status='INFO', use_data=True, ecco=self._ecco)

        pecco(msg=f'Fazendo Split com Shuffle Split', status='INFO', use_data=True, ecco=self._ecco)
        shuffle_split = ShuffleSplit(n_splits=self._n_splits, test_size=1/self._n_splits, random_state=self._random_state)

        pecco(msg=f"   Train [{len(X_train)}] ", status='', use_data=False, ecco=self._ecco)
        for i, (in_index, out_index) in enumerate(shuffle_split.split(X_train)):
            pecco(msg=f"   Fold {i}:", status='', use_data=False, ecco=self._ecco)
            pecco(msg=f"      Train [{len(in_index)}] ", status='', use_data=False, ecco=self._ecco)
            pecco(msg=f"      Test  [{len(out_index)}] ", status='', use_data=False, ecco=self._ecco)

        pecco(msg=f'Fazendo Target Encoder...', status='INFO', use_data=True, ecco=self._ecco)
        pecco(msg=f'  Colunas tratadas:', status='', use_data=False, ecco=self._ecco)
        self._cols = list(sorted(X_train.columns))
        self._cols_te = []

        X_train_tess = pd.DataFrame()

        for col in self._cols:
            col_te = f'TE_{col}'
            self._cols_te.append(col_te)
            pecco(msg=f'   {col} -> {col_te}', status='', use_data=False, ecco=self._ecco)

            X_train[[col]] = X_train[[col]].astype('str')

            for i, (in_index, out_index) in enumerate(shuffle_split.split(X_train)):

                X_train_in = X_train.iloc[in_index].copy()
                y_train_in = y_train.iloc[in_index].copy()
                X_train_out = X_train.iloc[out_index].copy()

                encoder = ce.TargetEncoder(return_df=True, verbose=self._ecco, smoothing=self._smoothing, min_samples_leaf=self._min_samples_leaf)
                encoder_fit = encoder.fit(X_train_in[[col]], y_train_in)

                X_train_out.loc[:, col_te] = encoder_fit.transform(X=X_train_out[[col]])
                X_train_tess = pd.concat([X_train_tess, X_train_out], axis=0)
                pecco(msg=f'      X_train TE SS [{X_train_tess.shape}]', status='', use_data=False, ecco=self._ecco)

                df_vazio = pd.DataFrame({col:[___VAZIO___]})
                df_vazio.loc[:, col_te] = encoder_fit.transform(X=df_vazio[[col]])
                X_train_tess = pd.concat([X_train_tess, df_vazio], axis=0)
                pecco(msg=f'      X_train TE SS [{X_train_tess.shape}] Média Global', status='', use_data=False, ecco=self._ecco)

            self._tess[col] = {}
            self._tess[col][col_te] = {}
            for vv in X_train_tess[col].unique():
                self._tess[col][col_te][vv] = []
                for valor_te in X_train_tess[X_train_tess[col] == vv][col_te].unique():
                    if pd.notna(valor_te):  # `!= np.nan` is always True, so test for NaN explicitly
                        self._tess[col][col_te][vv].append(valor_te)


        return self

    def transforme(self, X):
        pecco(msg=f'Fazendo transformação...', status='INFO', use_data=True, ecco=self._ecco)
        pecco(msg=f'   Colunas tratadas:', status='', use_data=False, ecco=self._ecco)

        X_tess = pd.DataFrame()
        X = X.copy()

        for col in X.columns:
            if col in self._cols:
                col_te = list(self._tess[col].keys())[0]
                X[[col]] = X[[col]].astype('str')

                df_cat = pd.DataFrame()

                for categ in X[col].unique():
                    df_aux = X[X[col] == categ][[col]].copy()

                    if categ not in self._tess[col][col_te]:
                        valor_te = self._tess[col][col_te][___VAZIO___]
                        pecco(msg=f'      {categ:49}(zero) -> {valor_te}', status='', use_data=False, ecco=self._ecco)
                    else:
                        valor_te = self._tess[col][col_te][categ]
                        pecco(msg=f'      {categ:55} -> {valor_te}', status='', use_data=False, ecco=self._ecco)

                    df_aux[col_te] = df_aux.apply(lambda row : random.choice(valor_te), axis=1)

                    df_cat = pd.concat([df_cat, df_aux], axis=0)

                X_tess = pd.concat([X_tess, df_cat], axis=1)

        return X_tess

#### *Splitting train and test*

In [None]:
from sklearn.model_selection import train_test_split

X_TESSS = df_titanic[['Age', 'SibSp', 'Parch', 'Fare',
                      'Pclass',
                      'Sex',
                      'Embarked']]
display(X_TESSS.head(5))

y_TESSS = df_titanic['Survived']
display(y_TESSS.head(5))

X_train_TESSS, X_test_TESSS, y_train_TESSS, y_test_TESSS = train_test_split(X_TESSS, y_TESSS, test_size=0.2, random_state=13051980)
display(X_train_TESSS.head(5))
display(X_test_TESSS.head(5))

display(y_train_TESSS.head(5))
display(y_test_TESSS.head(5))

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
0,22.0,1,0,7.25,3,male,S
1,38.0,1,0,71.2833,1,female,C
2,26.0,0,0,7.925,3,female,S
3,35.0,1,0,53.1,1,female,S
4,35.0,0,0,8.05,3,male,S


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
675,18.0,0,0,7.775,3,male,S
774,54.0,1,3,23.0,2,female,S
664,20.0,1,0,7.925,3,male,S
517,0.0,0,0,24.15,3,male,Q
130,33.0,0,0,7.8958,3,male,C


Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
400,39.0,0,0,7.925,3,male,S
568,0.0,0,0,7.2292,3,male,C
135,23.0,0,0,15.0458,2,male,C
341,24.0,3,2,263.0,1,female,S
239,33.0,0,0,12.275,2,male,S


675    0
774    1
664    1
517    0
130    0
Name: Survived, dtype: int64

400    1
568    0
135    0
341    1
239    0
Name: Survived, dtype: int64

#### *Applying the Method*

In [None]:
teSSS = TargetEncoderSSS(ecco=False)
teSSS_fit = teSSS.fit(X_train=X_train_TESSS[['Pclass', 'Sex', 'Embarked']], y_train=y_train_TESSS)

In [None]:
X_train_TESSS

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked
675,18.0,0,0,7.7750,3,male,S
774,54.0,1,3,23.0000,2,female,S
664,20.0,1,0,7.9250,3,male,S
517,0.0,0,0,24.1500,3,male,Q
130,33.0,0,0,7.8958,3,male,C
...,...,...,...,...,...,...,...
757,18.0,0,0,11.5000,2,male,S
884,25.0,0,0,7.0500,3,male,S
5,0.0,0,0,8.4583,3,male,Q
870,26.0,0,0,7.8958,3,male,S


In [None]:
X_train_TESSS = pd.concat([X_train_TESSS, teSSS_fit.transforme(X=X_train_TESSS[['Pclass', 'Sex', 'Embarked']])], axis=1)
X_train_TESSS

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked,Pclass.1,TE_Pclass,Sex.1,TE_Sex,Embarked.1,TE_Embarked
675,18.0,0,0,7.7750,3,male,S,3,0.255319,male,0.197260,S,0.330097
774,54.0,1,3,23.0000,2,female,S,2,0.452817,female,0.723618,S,0.321867
664,20.0,1,0,7.9250,3,male,S,3,0.271875,male,0.197260,S,0.342995
517,0.0,0,0,24.1500,3,male,Q,3,0.247706,male,0.197297,Q,0.334885
130,33.0,0,0,7.8958,3,male,C,3,0.247706,male,0.197260,C,0.588181
...,...,...,...,...,...,...,...,...,...,...,...,...,...
757,18.0,0,0,11.5000,2,male,S,2,0.457623,male,0.197260,S,0.330097
884,25.0,0,0,7.0500,3,male,S,3,0.255319,male,0.197297,S,0.342995
5,0.0,0,0,8.4583,3,male,Q,3,0.241486,male,0.197297,Q,0.369178
870,26.0,0,0,7.8958,3,male,S,3,0.255319,male,0.218329,S,0.330097


In [None]:
X_test_TESSS = pd.concat([X_test_TESSS, teSSS_fit.transforme(X=X_test_TESSS[['Pclass', 'Sex', 'Embarked']])], axis=1)
X_test_TESSS

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass,Sex,Embarked,Pclass.1,TE_Pclass,Sex.1,TE_Sex,Embarked.1,TE_Embarked
400,39.0,0,0,7.9250,3,male,S,3,0.247706,male,0.208672,S,0.330097
568,0.0,0,0,7.2292,3,male,C,3,0.247706,male,0.201635,C,0.588181
135,23.0,0,0,15.0458,2,male,C,2,0.513500,male,0.201635,C,0.613166
341,24.0,3,2,263.0000,1,female,S,1,0.637679,female,0.747475,S,0.342995
239,33.0,0,0,12.2750,2,male,S,2,0.495308,male,0.201635,S,0.330097
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8,27.0,0,2,11.1333,3,female,S,3,0.255319,female,0.725490,S,0.330097
71,16.0,5,2,46.9000,3,female,S,3,0.271875,female,0.730000,S,0.330097
476,34.0,1,0,21.0000,2,male,S,2,0.452817,male,0.197297,S,0.353659
872,33.0,0,0,5.0000,1,male,S,1,0.648851,male,0.208672,S,0.342995


Note that each category now maps to several values. For example, Pclass originally had only the values 1, 2 and 3, but each of those categories now has several encoded values; this variability is what reduces overfitting.

In [None]:
X_train_TESSS['TE_Pclass'].unique()

array([0.25531915, 0.45281704, 0.271875  , 0.24770642, 0.66399222,
       0.49530812, 0.24148607, 0.68749386, 0.51349994, 0.24035608,
       0.63432553, 0.64885108, 0.4864771 , 0.45762318, 0.63767939])
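
At transform time, the `transforme` method above picks one of the stored values at random for each row (via `random.choice`). A reduced sketch of that step follows; the `stored_values` mapping is hypothetical and contains only a rounded subset of the values shown in the tables above.

```python
import random

# Hypothetical, rounded subset of the per-category values stored by fit().
# Keys are strings because the class casts each column to str.
stored_values = {
    "3": [0.255, 0.272, 0.248, 0.241],
    "1": [0.638, 0.649],
}

random.seed(13051980)
rows = ["3", "3", "1", "3", "1"]
# Each row independently draws one of the values stored for its category.
encoded = [random.choice(stored_values[category]) for category in rows]
print(encoded)
```

Because the draw is random, the same category can receive different values in different rows, and the result depends on the random seed.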

#### *Selecting the columns*

In [None]:
X_train_TESSS = X_train_TESSS[['Age', 'SibSp', 'Parch', 'Fare', 'TE_Sex', 'TE_Pclass', 'TE_Embarked']]
display(X_train_TESSS.head(5))

X_test_TESSS = X_test_TESSS[['Age', 'SibSp', 'Parch', 'Fare', 'TE_Sex', 'TE_Pclass', 'TE_Embarked']]
display(X_test_TESSS.head(5))

Unnamed: 0,Age,SibSp,Parch,Fare,TE_Sex,TE_Pclass,TE_Embarked
675,18.0,0,0,7.775,0.19726,0.255319,0.330097
774,54.0,1,3,23.0,0.723618,0.452817,0.321867
664,20.0,1,0,7.925,0.19726,0.271875,0.342995
517,0.0,0,0,24.15,0.197297,0.247706,0.334885
130,33.0,0,0,7.8958,0.19726,0.247706,0.588181


Unnamed: 0,Age,SibSp,Parch,Fare,TE_Sex,TE_Pclass,TE_Embarked
400,39.0,0,0,7.925,0.208672,0.247706,0.330097
568,0.0,0,0,7.2292,0.201635,0.247706,0.588181
135,23.0,0,0,15.0458,0.201635,0.5135,0.613166
341,24.0,3,2,263.0,0.747475,0.637679,0.342995
239,33.0,0,0,12.275,0.201635,0.495308,0.330097


#### *Training the Model*

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, mean_absolute_error, make_scorer

rf_TESSS = RandomForestClassifier(n_estimators=100, random_state=42)
rf_TESSS.fit(X_train_TESSS, y_train_TESSS)

rf_TESSS

#### *Prediction & Error*

In [None]:
pred_train_TESSS = rf_TESSS.predict_proba(X_train_TESSS)[:, 1]
pred_test_TESSS = rf_TESSS.predict_proba(X_test_TESSS)[:, 1]

df_aux_TESSS = pd.DataFrame({'Método':['Target Encoder Suavização Shuffle Split'],
                             'Train ROC':[roc_auc_score(y_train_TESSS, pred_train_TESSS)],
                             'Test ROC':[roc_auc_score(y_test_TESSS, pred_test_TESSS)]})

df_result = pd.concat([df_result, df_aux_TESSS], axis=0)
df_result

Unnamed: 0,Método,Train ROC,Test ROC
0,One-Hot-Encoding,0.997621,0.867666
0,Target Encoder,0.997722,0.86618
0,Target Encoder Suavização,0.997713,0.867343
0,Target Encoder Suavização Shuffle Split,0.999996,0.870574


## Results

In [None]:
df_result

Unnamed: 0,Método,Train ROC,Test ROC
0,One-Hot-Encoding,0.997621,0.867666
0,Target Encoder,0.997722,0.86618
0,Target Encoder Suavização,0.997713,0.867343
0,Target Encoder Suavização Shuffle Split,0.999996,0.870574


Although all values are very close, the Target Encoder with smoothing and Shuffle Split (`Target Encoder Suavização Shuffle Split` in the table) achieved the best results, with the highest test ROC (0.870574).
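
To put the comparison in numbers, the gap between train and test ROC can be computed directly from the table above. The snippet simply re-enters the reported values; the names `df_summary` and `Gap` are introduced here for illustration.

```python
import pandas as pd

# Values copied from the results table above.
df_summary = pd.DataFrame({
    "Método": ["One-Hot-Encoding", "Target Encoder",
               "Target Encoder Suavização",
               "Target Encoder Suavização Shuffle Split"],
    "Train ROC": [0.997621, 0.997722, 0.997713, 0.999996],
    "Test ROC": [0.867666, 0.866180, 0.867343, 0.870574],
})

# Train-minus-test gap as a rough indicator of overfitting.
df_summary["Gap"] = df_summary["Train ROC"] - df_summary["Test ROC"]
print(df_summary.sort_values("Test ROC", ascending=False))
```

The best and worst test scores differ by less than 0.005, so on a dataset this small the ranking could plausibly change with a different random seed.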