<a href="https://colab.research.google.com/github/edgaracabral/Kaggle_BankMarketing/blob/main/12_Pre_Processamento_de_Dados.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle Bank Marketing Case
## Pré-Processamento de Dados - **Sem Dados de Campanha Atual**

Neste fase temos como desafio:
* Codificar, normalizar, padronizar variáveis (justificar escolhas)
* Separar em bases de treino/teste

Para este caso em particular, eliminaremos as variáveis:
* contact
* day
* month
* duration
* campaign

O raciocínio é que estas variáveis de alguma forma vazam parte da informação da variável Target. A suspeita é que a duração na última chamada (**duration**) já incorpora a acquisição do investimento. E portanto outros dados possam também contaminar a modelagem.


## Framework de Preparação de Dados (DataPrep)
Usaremos o framework de DataPrep orientado pela PoD Academy, que consiste nos seguintes passos:
- Gerar Metadados da ABT (Tabela Analítica de Modelagem)
- Tratamento de missing (nulos)
 - Média para variáveis numéricas
 - 'Desconhecido' para variáveis categóricas
- Tratamento de categóricas de alta cardinalidade (LabelEncoder)
- Tratamento de categóricas de baixa cardinalidade (OneHotEncoder)
- Aplicar normalização a toda tabela de modelagem (ABT)
- Gerar artefatos para implantação do data prep realizado

# 1. Leitura dos Dados

## 1.1 Setando o Ambiente
* Google Drive
* Carregar Bibliotecas Utilizadas
* Definir Path dos arquivos

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Nota: `pod_academy_functions` é a nossa biblioteca criada no curso de Ciência de Dados

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import pickle

# Redirecione para o folder on pod_academy_functions.py is localizado
%cd /content/drive/MyDrive/PoD Academy/modelos/Hackaton_DS_2023/PoD Framework
import pod_academy_functions as pod

/content/drive/MyDrive/PoD Academy/modelos/Hackaton_DS_2023/PoD Framework


In [7]:
# Armazene o caminho do folder de Dados em file_path
file_path = '/content/drive/MyDrive/Kaggle/Bank\ Marketing/01\ original\ data'

# Vá para o folder de dados
%cd $file_path

/content/drive/MyDrive/Kaggle/Bank Marketing/01 original data


## 1.2 Carregando os Dados

In [8]:
# Carregar o arquivo CSV em um DataFrame
df_train_00 = pd.read_csv('bank-full.csv', sep=';')

# Exibir o tamanho do arquivo lido
print(f"df_train_00.shape = {df_train_00.shape}")

# Exibir as primeiras linhas do DataFrame
df_train_00.head()

df_train_00.shape = (45211, 17)


Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


## 1.3 Redirecionado Artefatos para pasta `dataprep`
A partir de agora, todos os arquivos e artefatos gerados serão armazenados nesta pasta.

In [9]:
# A partir de agora, todos os artefatos gerados serão colocados dentro da pasta dataprep
%cd /content/drive/MyDrive/Kaggle/Bank\ Marketing/12\ data\ prep

/content/drive/MyDrive/Kaggle/Bank Marketing/12 data prep


## 1.4 Adequando os dados ao Padrão do Framework
Para reduzir o trabalho de recodificação do Framework, os seguintes cuidados foram tomados:
* Criar variável **id** como chave primária de cada linha da tabela
* Criar variável **target** binária, onde `target = 0` para `y = no` e `target = 1` para `y == yes`
* Remover a coluna **y**

In [10]:
# Adicionado a coluna 'id' ao dataframe, para termos uma chave primária para cada linha da tabela
df_train_00['id'] = range(1, len(df_train_00) + 1)

# Crie a coluna target a partir da coluna y
df_train_00['target'] = df_train_00['y'].map({'yes': 1, 'no': 0})

# Remover a coluna y da base de treino
df_train_00 = df_train_00.drop(axis=1, columns=['y'])

## 1.5 Eliminando variáveis da Campanha Atual

Eliminaremos as variáveis **contact**, **day**, **month**, **duration** e **campaign**

In [11]:
# Removendo as colunas contact, day, month, duration e campaign
df_train_00.drop(axis=1,columns=['contact', 'day', 'month', 'duration', 'campaign'],inplace=True)
df_train_00.shape

(45211, 13)

# 2. Separando *Base de Treino* da *Base de Teste* (Holdout 70/30)

## 2.1 Passo intermediário para acelerar depuração do código
O código abaixo permite chavear entre usar a base de treino integral ou apenas uma amostragem. Use a amostragem para acelerar a execução do notebook, o que permite identificar erros de forma acelerada. No modo amostragem, pode-se também testar se os blocos de OneHot Encoding e Label Encoding estão tratando excessões adequadamente.

* Para usar a base integral, comente a 1a linha mas remova comentário da 2a linha.
* Para usar a amostra, remova comentário da 1a linha e comente a 2a linha.

In [12]:
#df_train_00_sample = df_train_00.sample(n=10000, random_state=42)
df_train_00_sample = df_train_00.copy()
df_train_00_sample.shape

(45211, 13)

In [13]:
df_train_00_sample

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,pdays,previous,poutcome,id,target
0,58,management,married,tertiary,no,2143,yes,no,-1,0,unknown,1,0
1,44,technician,single,secondary,no,29,yes,no,-1,0,unknown,2,0
2,33,entrepreneur,married,secondary,no,2,yes,yes,-1,0,unknown,3,0
3,47,blue-collar,married,unknown,no,1506,yes,no,-1,0,unknown,4,0
4,33,unknown,single,unknown,no,1,no,no,-1,0,unknown,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,-1,0,unknown,45207,1
45207,71,retired,divorced,primary,no,1729,no,no,-1,0,unknown,45208,1
45208,72,retired,married,secondary,no,5715,no,no,184,3,success,45209,1
45209,57,blue-collar,married,secondary,no,668,no,no,-1,0,unknown,45210,0


## 2.2 Separando **treino** e **teste** para garantir validação cruzada Holdout 70/30
Cria-se as tabelas de `treino` e de `teste`.

In [14]:
# Suponha que você queira separar 70% dos dados para treino e 30% para validação
train, test = train_test_split(df_train_00_sample, test_size=0.3, random_state=42)
train.shape,test.shape

((31647, 13), (13564, 13))

In [15]:
train.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,pdays,previous,poutcome,id,target
10747,36,technician,single,tertiary,no,0,no,no,-1,0,unknown,10748,0
26054,56,entrepreneur,married,secondary,no,196,no,no,-1,0,unknown,26055,0
9125,46,blue-collar,married,secondary,no,0,yes,no,-1,0,unknown,9126,0
41659,41,management,divorced,tertiary,no,3426,no,no,119,5,success,41660,0
4443,38,blue-collar,married,secondary,no,0,yes,no,-1,0,unknown,4444,0


Cria cópias físicas dos dataframes para possível comparação (debug).

In [16]:
# Criando um novo dataframe baseado no original (Original fazia transformações aqui)
df_train_01 = train.copy()
df_test_01 = test.copy()

In [17]:
df_train_01.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,pdays,previous,poutcome,id,target
10747,36,technician,single,tertiary,no,0,no,no,-1,0,unknown,10748,0
26054,56,entrepreneur,married,secondary,no,196,no,no,-1,0,unknown,26055,0
9125,46,blue-collar,married,secondary,no,0,yes,no,-1,0,unknown,9126,0
41659,41,management,divorced,tertiary,no,3426,no,no,119,5,success,41660,0
4443,38,blue-collar,married,secondary,no,0,yes,no,-1,0,unknown,4444,0


## 4. Criar metadados de Domínio (tipo dos dados)
Este processo ajuda a verificar **quantidade de nulos**, **cardinalidade** e **tipo das variáveis**. Facilita o uso de rotinas para visualização automática dos dados.

In [18]:
metadados = pod.pod_academy_generate_metadata(df_train_01,
                                          ids=['id'],
                                          targets=['target'],
                                          orderby = 'PC_NULOS')

metadados

Unnamed: 0,FEATURE,USO_FEATURE,QT_NULOS,PC_NULOS,CARDINALIDADE,TIPO_FEATURE
0,age,Explicativa,0,0.0,77,int64
1,job,Explicativa,0,0.0,12,object
2,marital,Explicativa,0,0.0,3,object
3,education,Explicativa,0,0.0,4,object
4,default,Explicativa,0,0.0,2,object
5,balance,Explicativa,0,0.0,6307,int64
6,housing,Explicativa,0,0.0,2,object
7,loan,Explicativa,0,0.0,2,object
8,pdays,Explicativa,0,0.0,518,int64
9,previous,Explicativa,0,0.0,39,int64


# 5. Excluindo variáveis com mais que 70% de nulos

* Entram dataframes **df_train_01** e **df_test_01**
* Saem dataframes **df_train_02** e **df_test_02**


In [19]:
missing_cutoff = 70

drop_vars_nulos = metadados[(metadados['PC_NULOS'] >= missing_cutoff)]
lista_drop_vars = list(drop_vars_nulos.FEATURE.values)

print('Variáveis que serão excluídas por alto percentual de nulos: ',lista_drop_vars)
# retirando lista de variáveis com alto percentual de nulos
df_train_02 = df_train_01.drop(axis=1,columns=lista_drop_vars)
df_train_02.shape

Variáveis que serão excluídas por alto percentual de nulos:  []


(31647, 13)

In [20]:
# Salvar a lista em um arquivo .pkl
with open('prd_drop_nullvars_hktn.pkl', 'wb') as f:
    pickle.dump(lista_drop_vars, f)

Tratando a base de teste da mesma forma que a de treino

In [21]:
# retirando lista de variáveis com alto percentual de nulos
df_test_02 = df_test_01.drop(axis=1,columns=lista_drop_vars)
df_test_02.shape

(13564, 13)

In [22]:
df_train_02.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,pdays,previous,poutcome,id,target
10747,36,technician,single,tertiary,no,0,no,no,-1,0,unknown,10748,0
26054,56,entrepreneur,married,secondary,no,196,no,no,-1,0,unknown,26055,0
9125,46,blue-collar,married,secondary,no,0,yes,no,-1,0,unknown,9126,0
41659,41,management,divorced,tertiary,no,3426,no,no,119,5,success,41660,0
4443,38,blue-collar,married,secondary,no,0,yes,no,-1,0,unknown,4444,0


In [23]:
# Retirar ID e Target do tratamento de nulos na base de treino
df_train_02 = df_train_02.drop(axis=1, columns=['id', 'target'])
df_train_02.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,pdays,previous,poutcome
10747,36,technician,single,tertiary,no,0,no,no,-1,0,unknown
26054,56,entrepreneur,married,secondary,no,196,no,no,-1,0,unknown
9125,46,blue-collar,married,secondary,no,0,yes,no,-1,0,unknown
41659,41,management,divorced,tertiary,no,3426,no,no,119,5,success
4443,38,blue-collar,married,secondary,no,0,yes,no,-1,0,unknown


In [24]:
# Retirar ID e Target do tratamento de nulos na base de teste
df_test_02 = df_test_02.drop(axis=1, columns=['id', 'target'])
df_test_02.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,pdays,previous,poutcome
3776,40,blue-collar,married,secondary,no,580,yes,no,-1,0,unknown
9928,47,services,single,secondary,no,3644,no,no,-1,0,unknown
33409,25,student,single,tertiary,no,538,yes,no,-1,0,unknown
31885,42,management,married,tertiary,no,1773,no,no,336,1,failure
15738,56,management,married,tertiary,no,217,no,yes,-1,0,unknown


# 6. Substituindo os nulos
- pela média para variáveis numéricas
- por 'POD_VERIFICAR" para categóricas

In [25]:
# Substituindo os nulos na base de treino
df_train_03, means = pod.pod_custom_fillna(df_train_02)

with open('prd_fillna_hktn.pkl', 'wb') as f:
  pickle.dump(means, f)

In [26]:
with open('prd_fillna_hktn.pkl', 'rb') as f:
  loaded_means = pickle.load(f)
loaded_means

{'age': 40.941669036559546,
 'balance': 1359.3179719129555,
 'pdays': 225.2088520055325,
 'previous': 0.585869118715834}

In [27]:
# Substituindo os nulos na base de teste usando as médias da base de treino
df_test_03 = pod.pod_custom_fillna_prod(df_test_02, loaded_means)
df_test_03.shape

(13564, 11)

In [28]:
df_test_03.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,pdays,previous,poutcome
3776,40,blue-collar,married,secondary,no,580.0,yes,no,225.208852,0,unknown
9928,47,services,single,secondary,no,3644.0,no,no,225.208852,0,unknown
33409,25,student,single,tertiary,no,538.0,yes,no,225.208852,0,unknown
31885,42,management,married,tertiary,no,1773.0,no,no,336.0,1,failure
15738,56,management,married,tertiary,no,217.0,no,yes,225.208852,0,unknown


# 7. Tratamento de categóricas de alta cardinalidade (`LabelEncoder`)

In [29]:
# Identifica colunas para LabelEncoder que não estejam na lista lista_drop_vars
card_cutoff = 35
df_categ_labelenc = metadados[(metadados['CARDINALIDADE'] > card_cutoff) & (metadados['TIPO_FEATURE'] == 'object')]
lista_vars_abt = list(df_train_03.columns)
lista_lenc = list(df_categ_labelenc.FEATURE.values)

for item in lista_drop_vars:
    if item in lista_lenc:
        lista_lenc.remove(item)

print('Lista de vars para Label Encoding: ',lista_lenc)

Lista de vars para Label Encoding:  []


In [30]:
# transforma via LabelEncoder colunas na base de treino
import pickle
from sklearn.preprocessing import LabelEncoder

encoders = {}

for col in lista_lenc:
    encoder = LabelEncoder()
    df_train_03[col] = encoder.fit_transform(df_train_03[col])

    # Armazena o encoder para a coluna atual em um dicionário
    encoders[col] = encoder

# Salva o dicionário de encoders e a lista de colunas em um arquivo .pkl
data_to_serialize = {
    'encoders': encoders,
    'columns': lista_lenc
}

with open('prd_labelenc_hktn.pkl', 'wb') as f:
    pickle.dump(data_to_serialize, f)

In [31]:
print(data_to_serialize)

{'encoders': {}, 'columns': []}


In [32]:
# Carregar os encoders e a lista de colunas
with open('prd_labelenc_hktn.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

loaded_encoders = loaded_data['encoders']
loaded_columns = loaded_data['columns']

# Suponha df_test_03 como sua base de teste
for col in loaded_columns:
    if col in loaded_encoders:
        # Transforma a coluna usando o encoder carregado
        df_test_03[col] = loaded_encoders[col].transform(df_test_03[col])


# 8. Tratamento para categóricas de baixa cardinalidade (`OneHot Encoder`)

In [33]:
print(df_train_03.shape)
print(df_test_03.shape)

(31647, 11)
(13564, 11)


In [34]:
import pickle
from sklearn.preprocessing import OneHotEncoder

card_cutoff = 35
df_categ_onehot = metadados[(metadados['CARDINALIDADE'] <= card_cutoff) & (metadados['TIPO_FEATURE'] == 'object')]
lista_onehot = list(df_categ_onehot.FEATURE.values)
print('Lista de vars para OneHot Encoding: ',lista_onehot)

# Instanciando o encoder
encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')

# Aplicando o one-hot encoding
encoded_data = encoder.fit_transform(df_train_03[lista_onehot])
encoded_cols = encoder.get_feature_names_out(lista_onehot)
encoded_df = pd.DataFrame(encoded_data, columns=encoded_cols, index=df_train_03.index)

df_train_03 = pd.concat([df_train_03.drop(lista_onehot, axis=1), encoded_df], axis=1)

# Salva o encoder e a lista de colunas em um arquivo .pkl
data_to_serialize = {
    'encoder': encoder,
    'columns': lista_onehot
}

with open('prd_onehotenc_hktn.pkl', 'wb') as f:
    pickle.dump(data_to_serialize, f)

df_train_03.shape

Lista de vars para OneHot Encoding:  ['job', 'marital', 'education', 'default', 'housing', 'loan', 'poutcome']


(31647, 26)

In [35]:
# Carregar o encoder e a lista de colunas
with open('prd_onehotenc_hktn.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

loaded_encoder = loaded_data['encoder']
loaded_columns = loaded_data['columns']

# Suponha df_test_03 como sua base de teste
encoded_data_test = loaded_encoder.transform(df_test_03[loaded_columns])
encoded_cols_test = loaded_encoder.get_feature_names_out(loaded_columns)
encoded_df_test = pd.DataFrame(encoded_data_test, columns=encoded_cols_test, index=df_test_03.index)

df_test_03 = pd.concat([df_test_03.drop(loaded_columns, axis=1), encoded_df_test], axis=1)

df_test_03.shape

(13564, 26)

# 9. Aplicar `Padronização` a toda tabela de modelagem tratada ate este ponto

In [36]:
import pickle
from sklearn.preprocessing import StandardScaler

# Excluindo IDs e Targets
df_id_target = metadados[(metadados['USO_FEATURE'] == 'ID') | (metadados['USO_FEATURE'] == 'Target')]
lista_id_target = list(df_id_target.FEATURE.values)
print('Lista de IDs e Target: ',lista_id_target)

# Instanciando o scaler
scaler = StandardScaler()

# Selecionando colunas numéricas
numeric_cols = df_train_03.select_dtypes(include=['float64', 'int64','int32']).columns

# Aplicando a normalização
df_train_03[numeric_cols] = scaler.fit_transform(df_train_03[numeric_cols])

# Salva o scaler em um arquivo .pkl
with open('prd_scaler_hktn.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print(scaler)
df_train_03.shape

Lista de IDs e Target:  ['id', 'target']
StandardScaler()


(31647, 26)

In [37]:
# Carregar o scaler
with open('prd_scaler_hktn.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)

# Suponha df_test_03 como sua base de teste

# Selecionando colunas numéricas
numeric_cols = df_test_03.select_dtypes(include=['float64', 'int64','int32']).columns

# Aplicando a normalização
df_test_03[numeric_cols] = loaded_scaler.fit_transform(df_test_03[numeric_cols])

df_test_03.shape

(13564, 26)

In [38]:
df_test_03.head()

Unnamed: 0,age,balance,pdays,previous,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,...,marital_single,education_secondary,education_tertiary,education_unknown,default_yes,housing_yes,loan_yes,poutcome_other,poutcome_success,poutcome_unknown
3776,-0.087221,-0.248404,0.008043,-0.289532,1.930746,-0.185029,-0.166294,-0.519766,-0.227594,-0.186303,...,-0.629323,0.988419,-0.660382,-0.202638,-0.122953,0.894768,-0.435928,-0.198462,-0.187992,0.472084
9928,0.573919,0.709957,0.008043,-0.289532,-0.517935,-0.185029,-0.166294,-0.519766,-0.227594,-0.186303,...,1.589009,0.988419,-0.660382,-0.202638,-0.122953,-1.117608,-0.435928,-0.198462,-0.187992,0.472084
33409,-1.503949,-0.261541,0.008043,-0.289532,-0.517935,-0.185029,-0.166294,-0.519766,-0.227594,-0.186303,...,1.589009,-1.011717,1.514276,-0.202638,-0.122953,0.894768,-0.435928,-0.198462,-0.187992,0.472084
31885,0.101676,0.124744,2.327377,0.220761,-0.517935,-0.185029,-0.166294,1.923944,-0.227594,-0.186303,...,-0.629323,-1.011717,1.514276,-0.202638,-0.122953,-1.117608,-0.435928,-0.198462,-0.187992,-2.118267
15738,1.423956,-0.361943,0.008043,-0.289532,-0.517935,-0.185029,-0.166294,1.923944,-0.227594,-0.186303,...,-0.629323,-1.011717,1.514276,-0.202638,-0.122953,-1.117608,2.293956,-0.198462,-0.187992,0.472084


Retornar **id** e **target** para a tabela pós data prep

In [39]:
## Trazer o id e target para a tabela pós dataprep

abt_train = df_train_03.merge(train[['id','target']], left_index=True, right_index=True, how='inner')
abt_test = df_test_03.merge(test[['id','target']], left_index=True, right_index=True, how='inner')

print(abt_train.shape)
print(abt_test.shape)

(31647, 28)
(13564, 28)


In [40]:
abt_train.head()

Unnamed: 0,age,balance,pdays,previous,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,...,education_tertiary,education_unknown,default_yes,housing_yes,loan_yes,poutcome_other,poutcome_success,poutcome_unknown,id,target
10747,-0.464799,-0.45668,-1.138504e-15,-0.240512,-0.526225,-0.184151,-0.168627,-0.512,-0.230455,-0.191898,...,1.564172,-0.208793,-0.140544,-1.119132,-0.437178,-0.209122,-0.185066,0.472706,10748,0
26054,1.416343,-0.390831,-1.138504e-15,-0.240512,-0.526225,5.430326,-0.168627,-0.512,-0.230455,-0.191898,...,-0.639316,-0.208793,-0.140544,-1.119132,-0.437178,-0.209122,-0.185066,0.472706,26055,0
9125,0.475772,-0.45668,-1.138504e-15,-0.240512,1.900329,-0.184151,-0.168627,-0.512,-0.230455,-0.191898,...,-0.639316,-0.208793,-0.140544,0.89355,-0.437178,-0.209122,-0.185066,0.472706,9126,0
41659,0.005486,0.694328,-2.127233,1.812098,-0.526225,-0.184151,-0.168627,1.953125,-0.230455,-0.191898,...,1.564172,-0.208793,-0.140544,-1.119132,-0.437178,-0.209122,5.403473,-2.115481,41660,0
4443,-0.276685,-0.45668,-1.138504e-15,-0.240512,1.900329,-0.184151,-0.168627,-0.512,-0.230455,-0.191898,...,-0.639316,-0.208793,-0.140544,0.89355,-0.437178,-0.209122,-0.185066,0.472706,4444,0


# 10. Salvando tabelas de treino e teste pós preparação dos dados

In [41]:
abt_train.to_csv('abt_train.csv')
abt_test.to_csv('abt_test.csv')