## Variable Selection Template - Classification Problems (Binary Target)
* Gradient Boosting
* Random Forest
* Decision Tree Classifier
* Pearson Corr (Target)


#### Importing the Turing Lab and Turing IA function libraries

In [1]:
id_empresa = '1022'
%run -i '/home/.Turing/TuringCredentialsAccess.py3'
%run -i '/home/.Turing/TuringLabFunctions.py3'

-------------------- LabTuring ---------------------------
----------- Exploração de Dados e Modelagem --------------
------ Funções carregadas em memória com sucesso ---------
--------- Data da última atualização: --------------------
--------------- 10/11/2019 -------------------------------
----------------------------------------------------------


#### Importing the required Python libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

#### Mapping the Turing AutoML project

In [4]:
#---------- Project parameter definitions ---------------------------------
nm_bucket = 'turing-bkt-treinamentos-prod'
nm_file = 'ABT_V01_BJ.csv'
targetname = 'TARGET'
id_name = 'SK_ID_CURR'
abt_delimiter = ','

#### Generating credentials to access the project data

In [5]:
S3fs,S3session,S3client,S3resource = TuringUsersCredentialsControl(id_empresa)
s3_path_write_abt = nm_bucket+'/Projetos/FLAT_TABLES/'

#### Reading the source table (modeling ABT)

In [6]:
path_file = 'Projetos/FLAT_TABLES/'+nm_file
input_df = TuringReadS3File(S3client,nm_bucket,path_file,sep=abt_delimiter)
input_df.shape

Arquivo csv carregado em memoria


(307511, 150)
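The internals of `TuringReadS3File` are not shown in this notebook. As a rough sketch of what such a read might look like, the helper below assumes a boto3-style S3 client whose `get_object` returns a dict with a readable `Body`. The name `read_s3_csv` is hypothetical and is not part of the Turing library.

```python
import io

import pandas as pd


def read_s3_csv(client, bucket, key, sep=","):
    """Hypothetical helper: download one S3 object and parse it as CSV."""
    # A boto3-style get_object returns a dict whose "Body" is a byte stream.
    body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pd.read_csv(io.BytesIO(body), sep=sep)
```

With a real client this would be called as `read_s3_csv(S3client, nm_bucket, path_file, sep=abt_delimiter)`. The whole file is loaded into memory, which matches the "loaded in memory" message printed above.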

#### Generating a sample

In [7]:
df_00_amt = TuringSampleData(input_df,pct=1,seed=10004)

('Amostra de ', 307511, 'linhas gerada com sucesso')
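With `pct=1` the sample keeps all 307511 rows. How `TuringSampleData` draws the sample is not documented here. A minimal pandas equivalent, assuming simple random sampling without replacement, could look like this (`sample_data` is a hypothetical name):

```python
import pandas as pd


def sample_data(df, pct=1.0, seed=None):
    """Hypothetical helper: random sample of a fraction `pct` of the rows."""
    # frac=1.0 returns every row (in shuffled order); the seed makes it repeatable.
    return df.sample(frac=pct, random_state=seed)


demo = pd.DataFrame({"x": range(10)})
half = sample_data(demo, pct=0.5, seed=10004)
print(len(half), "rows sampled")
```

Fixing the seed matters when the same sample has to be regenerated for a later modeling run.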


#### Building the metadata

In [8]:
metadados = TuringFastMetadata(df_00_amt,id_name,targetname)

Gerando Metadados...
Metadado Gerado Com Sucesso


#### Applying the data preparation function

In [9]:
df_dp = TuringFastDataPrep(metadados,df_00_amt,id_name,targetname)

Iniciando Preparação de Dados...
Executando
Turing Fast Data Preparation Concluído com Sucesso
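The notebook does not show what `TuringFastDataPrep` does internally. Later cells refer to columns named `id` and `target` in `df_dp`, so the sketch below assumes the key and target are renamed, categorical columns are label-encoded, and missing numeric values are filled with the median. All of these steps are assumptions for illustration, and `fast_data_prep` is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def fast_data_prep(df, id_name, target_name):
    """Hypothetical data prep: rename keys, encode categories, impute numbers."""
    out = df.rename(columns={id_name: "id", target_name: "target"})
    for col in out.columns.drop(["id", "target"]):
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumed imputation rule: replace missing values with the median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Assumed encoding rule: missing categories become their own label.
            filled = out[col].fillna("MISSING").astype(str)
            out[col] = LabelEncoder().fit_transform(filled)
    return out


raw = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
    "num": [1.0, None, 3.0],
    "cat": ["a", None, "b"],
})
prepared = fast_data_prep(raw, "SK_ID_CURR", "TARGET")
print(prepared.dtypes)
```

Label encoding imposes an arbitrary order on categories. Tree-based models tolerate this reasonably well, but it can distort a Pearson correlation computed on the encoded column.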


### Supervised variable selection process
* Random Forest Classifier
* Gradient Boosting Classifier
* Decision Tree Classifier
* Pearson Corr (target)

In [10]:
abt_vars_rf, imp, vars_modelo_imp_aux  = TuringVariableImportance('rf',df_dp,'target','Supervisionado')
imp.sort_values(by='Importancia', ascending=False).head(10)

| | Variaveis | Importancia |
|---|---|---|
| 0 | EXT_SOURCE_3 | 0.219 |
| 1 | EXT_SOURCE_2 | 0.197 |
| 2 | AVG_DAYS_CREDIT | 0.063 |
| 3 | EXT_SOURCE_1 | 0.05 |
| 4 | QT_CRED_Active | 0.05 |
| 5 | MIN_DAYS_CREDIT | 0.043 |
| 6 | DAYS_BIRTH | 0.028 |
| 7 | AVG_AMT_CREDIT_Closed | 0.024 |
| 8 | AVG_AMT_CREDIT_Microloan | 0.024 |
| 9 | MAX_DAYS_CREDIT | 0.022 |


In [11]:
abt_vars_gb, imp, vars_modelo_imp_aux  = TuringVariableImportance('gbk',df_dp,'target','Supervisionado')
imp.sort_values(by='Importancia', ascending=False).head(10)

| | Variaveis | Importancia |
|---|---|---|
| 0 | EXT_SOURCE_3 | 0.321 |
| 1 | EXT_SOURCE_2 | 0.31 |
| 2 | EXT_SOURCE_1 | 0.089 |
| 3 | DAYS_BIRTH | 0.037 |
| 4 | AMT_GOODS_PRICE | 0.036 |
| 5 | AMT_CREDIT | 0.019 |
| 6 | AVG_AMT_CREDIT_Microloan | 0.018 |
| 7 | AMT_ANNUITY | 0.016 |
| 8 | DAYS_EMPLOYED | 0.015 |
| 9 | FLAG_DOCUMENT_3 | 0.013 |


In [12]:
abt_vars_tree, imp, vars_modelo_imp_aux  = TuringVariableImportance('tree',df_dp,'target','Supervisionado')
imp.sort_values(by='Importancia', ascending=False).head(10)

| | Variaveis | Importancia |
|---|---|---|
| 0 | EXT_SOURCE_3 | 0.3 |
| 1 | EXT_SOURCE_2 | 0.268 |
| 2 | EXT_SOURCE_1 | 0.074 |
| 3 | DAYS_BIRTH | 0.036 |
| 4 | AMT_ANNUITY | 0.025 |
| 5 | DAYS_ID_PUBLISH | 0.016 |
| 6 | AVG_AMT_CREDIT_Microloan | 0.014 |
| 7 | DAYS_EMPLOYED | 0.014 |
| 8 | FLAG_DOCUMENT_3 | 0.013 |
| 9 | AVG_DAYS_CREDIT_UPDATE | 0.013 |
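The three importance rankings above come from tree-based models. `TuringVariableImportance` is a proprietary function, so the following is only a sketch of how comparable rankings can be obtained with scikit-learn's `feature_importances_` attribute on synthetic data. The function name, the model keys, and the hyperparameters are assumptions; only the output column names mirror the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed model settings; the Turing function's actual settings are unknown.
MODELS = {
    "rf": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    "gb": lambda: GradientBoostingClassifier(random_state=0),
    "tree": lambda: DecisionTreeClassifier(max_depth=5, random_state=0),
}


def variable_importance(kind, df, target):
    """Fit one model and return its impurity-based importances, sorted."""
    X = df.drop(columns=[target])
    model = MODELS[kind]().fit(X, df[target])
    imp = pd.DataFrame(
        {"Variaveis": X.columns, "Importancia": model.feature_importances_}
    )
    return imp.sort_values("Importancia", ascending=False).reset_index(drop=True)


# Synthetic binary-target data, used only to make the sketch runnable.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
demo = pd.DataFrame(X, columns=[f"x{i}" for i in range(6)])
demo["target"] = y

for kind in MODELS:
    top = variable_importance(kind, demo, "target").head(3)
    print(kind, top["Variaveis"].tolist())
```

On real data the key column should also be dropped before fitting, otherwise the model may assign importance to the identifier. Impurity-based importances sum to 1 within each model, so values are comparable across variables of one model but not directly across models.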


In [13]:
vl_corte_corr = 0.1  # minimum absolute Pearson correlation with the target

pearson_order = TuringPearsonCorrTarget(df_dp,'target')
pearson_order['CorrPearson'] = pearson_order['target'].abs()
# Note: the target itself (correlation 1.0) also passes this cutoff
vars_selected_corr_df = pearson_order[pearson_order['CorrPearson']>=vl_corte_corr]
vars_selected_corr = vars_selected_corr_df['Variaveis'].values

abt_vars_pearson = df_dp[vars_selected_corr]
vars_selected_corr_df.head()

| | index | target | Variaveis | CorrPearson |
|---|---|---|---|---|
| 1 | target | 1.0 | target | 1.0 |
| 31 | EXT_SOURCE_3 | -0.157397 | EXT_SOURCE_3 | 0.157397 |
| 30 | EXT_SOURCE_2 | -0.160303 | EXT_SOURCE_2 | 0.160303 |
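The Pearson filter keeps variables whose absolute correlation with the target reaches the cutoff. A minimal pandas sketch of the same idea, using `DataFrame.corrwith` on toy data, is shown below. `pearson_corr_target` is a hypothetical name, and unlike the output above it excludes the target from its own ranking.

```python
import pandas as pd


def pearson_corr_target(df, target):
    """Pearson correlation of every other column with the target column."""
    corr = df.drop(columns=[target]).corrwith(df[target])
    out = corr.rename("corr").to_frame()
    out["abs_corr"] = out["corr"].abs()
    return out.sort_values("abs_corr", ascending=False)


demo = pd.DataFrame({
    "x_pos": [1, 2, 3, 4, 5, 6],    # rises with the target
    "x_neg": [6, 5, 4, 3, 2, 1],    # falls with the target
    "x_flat": [1, 2, 1, 1, 2, 1],   # same pattern in both classes
    "target": [0, 0, 0, 1, 1, 1],
})

ranked = pearson_corr_target(demo, "target")
cutoff = 0.1
selected = ranked[ranked["abs_corr"] >= cutoff].index.tolist()
print(ranked)
print(selected)
```

Taking the absolute value matters because a strongly negative correlation, such as those of `EXT_SOURCE_2` and `EXT_SOURCE_3` above, is as informative as a positive one. Pearson correlation only captures linear association, which may explain why it selects far fewer variables here than the tree-based rankings.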


#### Bringing the IDs back into the ABTs (all join keys)

In [14]:
abt_vars_rf = pd.merge(df_dp[['id']], abt_vars_rf, left_index=True, right_index=True)

abt_vars_gb = pd.merge(df_dp[['id']], abt_vars_gb, left_index=True, right_index=True)

abt_vars_tree = pd.merge(df_dp[['id']], abt_vars_tree, left_index=True, right_index=True)

abt_vars_pearson = pd.merge(df_dp[['id']], abt_vars_pearson, left_index=True, right_index=True)
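These merges rely on the row index being shared between `df_dp` and each selected-variable table. The toy example below shows the same index-on-index merge, with the optional `validate` argument of `pd.merge` added as a safety check. The notebook does not use `validate`; it is included here only as a suggestion.

```python
import pandas as pd

df_dp = pd.DataFrame({"id": [101, 102, 103], "a": [1.0, 2.0, 3.0], "b": [0, 1, 0]})
selected = df_dp[["a"]]  # a feature subset that no longer carries the id

with_id = pd.merge(
    df_dp[["id"]],
    selected,
    left_index=True,
    right_index=True,
    validate="one_to_one",  # raises MergeError if an index value repeats
)
print(with_id)
```

If both frames already contain an `id` column, pandas keeps both and renames them `id_x` and `id_y`, so it is worth checking the column list after the merge.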

### Saving the tables for modeling in Turing AutoML

In [15]:
nm_s3_file = 'ABT_V02_RF_VARS.csv'
TuringWriteS3CSVFile(abt_vars_rf,s3_path_write_abt,nm_s3_file,S3fs,delimiter=',')

encoding=latin-1


'Arquivo Salvo Com Sucesso'

In [16]:
nm_s3_file = 'ABT_V02_GB_VARS.csv'
TuringWriteS3CSVFile(abt_vars_gb,s3_path_write_abt,nm_s3_file,S3fs,delimiter=',')

encoding=latin-1


'Arquivo Salvo Com Sucesso'

In [17]:
nm_s3_file = 'ABT_V02_TREE_VARS.csv'
TuringWriteS3CSVFile(abt_vars_tree,s3_path_write_abt,nm_s3_file,S3fs,delimiter=',')

encoding=latin-1


'Arquivo Salvo Com Sucesso'

In [18]:
nm_s3_file = 'ABT_V02_PEARSON_VARS.csv'
TuringWriteS3CSVFile(abt_vars_pearson,s3_path_write_abt,nm_s3_file,S3fs,delimiter=',')

encoding=latin-1


'Arquivo Salvo Com Sucesso'
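The four writes above follow the same pattern. `TuringWriteS3CSVFile` is not shown, so the sketch below assumes a filesystem object with an `open(path, mode)` method (as s3fs and other fsspec filesystems provide) and the `latin-1` encoding reported in the output. The helper name and the choice to omit the row index are assumptions.

```python
import pandas as pd


def write_csv(df, fs, folder, file_name, delimiter=",", encoding="latin-1"):
    """Hypothetical helper: serialize a frame to CSV and write it through `fs`."""
    path = folder.rstrip("/") + "/" + file_name
    # Assumption: the row index is not written to the file.
    data = df.to_csv(sep=delimiter, index=False).encode(encoding)
    with fs.open(path, "wb") as handle:
        handle.write(data)
    return path
```

A loop over `(table, file_name)` pairs, for example the four ABTs and their file names, would replace the repeated cells with a single one.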

In [24]:
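# Note: `imp` holds the result of the most recent TuringVariableImportance call
lista_select_vars = list(imp.Variaveis.values) + [targetname, id_name]
lista_select_vars.remove('id')  # input_df is keyed by id_name, not 'id'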

### Filtering the selected variables from the original table

In [25]:
abt_rf_selvar = input_df[lista_select_vars]
abt_rf_selvar.shape

(307511, 69)
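Selecting columns by a list of names fails with a `KeyError` if any name is absent from the original table, for instance a variable created during data preparation. A small guard, sketched below with the hypothetical name `select_columns`, reports the missing names explicitly and drops duplicate entries while preserving order.

```python
import pandas as pd


def select_columns(df, wanted):
    """Return df restricted to `wanted`, failing clearly on unknown names."""
    missing = [name for name in wanted if name not in df.columns]
    if missing:
        raise KeyError(f"columns not found in the table: {missing}")
    # dict.fromkeys removes duplicates and keeps the first-seen order.
    return df[list(dict.fromkeys(wanted))]


table = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1], "x": [0.1, 0.2]})
subset = select_columns(table, ["x", "TARGET", "SK_ID_CURR", "x"])
print(subset.shape)
```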

In [26]:
nm_s3_file = 'ABT_V01_RF_VARS.csv'
TuringWriteS3CSVFile(abt_rf_selvar,s3_path_write_abt,nm_s3_file,S3fs,delimiter=',')

encoding=latin-1


'Arquivo Salvo Com Sucesso'