<a id="section_CART"></a> 
## Final Workshop, DS Digital House

<h1 style='background:#3f4788; border:2; font-size:300%; font-weight: bold;
color:white; padding:20px'><center>Credit Score Dataset Analysis</center></h1>

<center><img src = "https://storage.googleapis.com/kaggle-datasets-images/2289007/3846912/ad5e128929f5ac26133b67a6110de7c0/dataset-cover.jpg?t=2022-06-22-14-33-45" width = 900 height = 400/></center>


You are working as a data scientist at a global financial company. Over the years, the company has collected basic banking data and gathered a lot of credit-related information about some of its customers. Management wants to build an intelligent system that sorts people into credit score brackets, to reduce manual effort and improve the accuracy of the assessments.

Task
Given a person's credit-related information, build a machine learning model that can classify their credit score.


<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">

<h1 style='background:#3f4788; border:0; border-radius: 10px; color:white;padding:20px'><center> Contents </center></h1>

### [**1. Importing libraries and loading data**](#title-one)

### [**2. Data Wrangling, EDA and Visualizations**](#title-two)

### [**3. Data cleaning**](#title-three)

### [**4. Data preprocessing**](#title-four)

### [**5. Feature Importance and Feature Selection**](#title-five)

### [**6. Modeling**](#title-six)

<a id="title-one"></a>
<h1 style='background:#3f4788; border:2; border-radius: 10px; color:white;padding:20px'><center>Importing libraries and loading data</center></h1>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns
pd.set_option('display.max_columns', None)
import missingno as msno
#from dataprep.eda import plot, plot_correlation, create_report, plot_missing

%matplotlib inline

# Setting the default font
font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 30}

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## Importing the data

In [None]:
# loading the train and test sets
df_test = pd.read_csv(r"../DataSet/test.csv", low_memory = False)
df_train = pd.read_csv(r"../DataSet/train.csv", low_memory = False)

## Combining the datasets to clean the data

In [None]:
# # adding a flag column so the data can later be split back the way it originally was
# df_train['test'] = 0
# df_test['test'] = 1

# # adding the score column, filled with NaN, to the test data
# df_test['Credit_Score'] = np.nan

In [None]:
# concatenating the datasets, since they share the same columns

df_total = pd.concat([df_train, df_test], ignore_index = True)
df_orig = df_total.copy()

In [None]:
print('train data: ', df_train.shape)
print('test data: ', df_test.shape)
print('all data combined: ', df_total.shape)

In [None]:
df_total.head()

## Dataset columns:

* ID - record identifier
* Customer_ID - customer ID
* Month - month of the year
* Name - customer name
* Age - customer age
* SSN - Social Security Number (the Brazilian counterpart is the CPF)
* Occupation - customer occupation
* Annual_Income - annual income
* Monthly_Inhand_Salary - customer's monthly take-home salary
* Num_Bank_Accounts - number of bank accounts
* Num_Credit_Card - number of credit cards
* Interest_Rate - credit card interest rate
* Num_of_Loan - number of loans taken from the bank
* Type_of_Loan - types of loan taken by the customer
* Delay_from_due_date - number of days a card payment is past its due date
* Num_of_Delayed_Payment - average number of payments delayed by the customer
* Changed_Credit_Limit - percentage change in the credit card limit
* Num_Credit_Inquiries - number of credit inquiries
* Credit_Mix - customer's credit mix
* Outstanding_Debt - debt still to be paid
* Credit_Utilization_Ratio - credit card utilization ratio
* Credit_History_Age - length of the customer's credit history
* Payment_of_Min_Amount - minimum payment
* Total_EMI_per_month - fixed monthly payment, in dollars
* Amount_invested_monthly - amount of money the customer invests each month
* Payment_Behaviour - customer's payment behaviour
* Monthly_Balance - customer's monthly balance
* Credit_Score - target, credit score bracket
* test - column used to split the dataset back into train and test data (only created by the commented-out cell above)

In [None]:
df_total.info()

#### Percentage of missing data

In [None]:
# This function plots the percentage of missing values per column
def plot_nas(df: pd.DataFrame):
    if df.isnull().sum().sum() != 0:
        na_df = (df.isnull().sum() / len(df)) * 100      
        na_df = na_df.drop(na_df[na_df == 0].index).sort_values(ascending=False)
        missing_data = pd.DataFrame({'Missing Ratio %' :na_df})
        missing_data.plot(kind = "barh", color='#087E8B')
        plt.title("% of missing data")
        plt.show()
    else:
        print('No NAs found')
        
plot_nas(df_total)

In [None]:
msno.bar(df_total, color='#087E8B');

## Checking the most frequent values in each column

In [None]:
colunas = df_total.columns

for coluna in colunas:
    print('Variable: ', coluna)
    print(20*'-')
    print(df_total[coluna].value_counts(dropna=False))

### Observations

1. Numeric columns containing "_" ok
    * Age,
    * Annual_Income,
    * Monthly_Inhand_Salary,
    * Num_Bank_Accounts,
    * Num_Credit_Card,
    * Interest_Rate
    * Num_of_Loan
    * Delay_from_due_date
    * Num_of_Delayed_Payment
    * Changed_Credit_Limit
    * Num_Credit_Inquiries
    * Outstanding_Debt
    * Credit_Utilization_Ratio
    * Total_EMI_per_month
    * Amount_invested_monthly
    * Monthly_Balance
2. SSN #F%$D@*&8 ok
3. Occupation _______ ok
4. Type_of_Loan - convert to a list and index
5. Changed_Credit_Limit "_" -> NaN ok
6. Credit_Mix "_" -> NaN ok
7. Credit_History_Age - convert to number of months
8. Payment_of_Min_Amount "NM" -> NaN ok
9. Payment_Behaviour "!@9#%8" -> NaN and transform the data ok
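
Item 4 (turning `Type_of_Loan` into a list) could be approached along the following lines. This is only a minimal sketch: it assumes the raw values are comma-separated loan names with an optional "and" before the last one, and `split_loans` is a hypothetical helper that does not exist in this notebook.

```python
import re

def split_loans(raw):
    """Turn a raw Type_of_Loan string into a list of loan names."""
    # Missing values (NaN / None) are not strings: treat them as "no loans recorded".
    if not isinstance(raw, str):
        return []
    # Split on commas (optionally followed by "and") or on a bare " and ".
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", raw)
    return [p.strip() for p in parts if p.strip()]

example = "Auto Loan, Credit-Builder Loan, and Home Equity Loan"
loans = split_loans(example)
print(loans)
```

On the real data this could be applied with `df_total['Type_of_Loan'].apply(split_loans)`, after which the list length or per-loan indicator columns can be derived.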

<a id="title-two"></a>
<h1 style='background:#3f4788; border:2; border-radius: 10px; color:white;padding:20px'><center>Data Wrangling, EDA and Visualizations</center></h1>

## Fixing numeric fields stored as strings because some records contain underscores

In [None]:
# Numeric fields stored as strings - strip the underscores from the numbers

colunas_ul = ['Age', 'Annual_Income', 'Num_of_Loan', 'Num_of_Delayed_Payment',
              'Changed_Credit_Limit', 'Outstanding_Debt', 'Amount_invested_monthly', 'Monthly_Balance']
for row in colunas_ul:
    # regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
    df_total[row] = df_total[row].str.replace(r'_+', '', regex=True)

## Data Wrangling

* Cleaning fields with inconsistent data

In [None]:
# Dropping the ID column, which carries no information for the analysis
df_total.drop(['ID'], axis = 1, inplace = True)

# replacing the garbage placeholder in SSN with NaN
df_total['SSN'] = df_total['SSN'].replace('#F%$D@*&8', np.nan)

# replacing the underscore placeholder in Occupation with NaN
df_total['Occupation'] = df_total['Occupation'].replace('_______', np.nan)

df_total['Changed_Credit_Limit'] = df_total['Changed_Credit_Limit'].replace(['_', ''], np.nan)

df_total['Credit_Mix'] = df_total['Credit_Mix'].replace('_', np.nan)

df_total['Payment_of_Min_Amount'] = df_total['Payment_of_Min_Amount'].replace('NM', np.nan)

df_total['Payment_Behaviour'] = df_total['Payment_Behaviour'].replace('!@9#%8', np.nan)

### Converting the credit history field to a number of months
* 1 year = 12 months

In [None]:
# converting Credit_History_Age to a number of months
def converter_mes(x):
    if pd.notnull(x):
        ano = int(x.split(' ')[0])
        mes = int(x.split(' ')[3])
        return (ano*12)+mes
    else:
        return x

df_total['Credit_History_age'] = df_total['Credit_History_Age'].apply(lambda x: converter_mes(x)).astype(float)

### Removing 'and' from the field values

In [None]:
df_total['Type_of_Loan_ajustado'] = df_total['Type_of_Loan'].replace("[abc]* and ", " ", regex=True)

### Data type conversion

In [None]:
# building a dictionary of target dtypes and converting the data

dicionario_conversao = {
    'Age': int,
    'Num_Bank_Accounts': int,
    'Num_Credit_Card': int,
    'Num_of_Loan': int,
    'Annual_Income' : float,
    'Monthly_Inhand_Salary' : float,
    'Interest_Rate' : float,
    'Delay_from_due_date' : float,
    'Changed_Credit_Limit' : float,
    'Num_Credit_Inquiries' : float,
    'Outstanding_Debt' : float,
    'Credit_Utilization_Ratio' : float,
    'Amount_invested_monthly' : float,
    'Total_EMI_per_month' : float,
    'Num_of_Delayed_Payment' : float,
    'Monthly_Balance' : float,
    # 'ID' : object,
    'Customer_ID' : object,
    'Name' : object,
    'Month' : object,
    'SSN' : object,
    'Type_of_Loan' : object,
    'Occupation' : object,
    'Credit_Mix' : object,
    'Payment_of_Min_Amount' : object,
    'Payment_Behaviour' : object
    # 'test' : object
    }
# applying the dtypes to the columns

df_total = df_total.astype(dicionario_conversao)

In [None]:
# converting month names to month numbers

import datetime

df_total['Month'] = df_total['Month'].apply(lambda x: datetime.datetime.strptime(x, '%B').month)

### Visualizing the data before cleaning

In [None]:
numCols = ['Monthly_Inhand_Salary', 'Delay_from_due_date', 'Credit_Utilization_Ratio']

for col in numCols:
    # displot is figure-level and creates its own figure, so it is sized with height/aspect
    sns.displot(x=col, data=df_total, hue='Credit_Score', palette=["#ff006e", "#83c5be", "#3a0ca3"], height=6, aspect=2)
    plt.show()

In [None]:
# dropping rows with missing values

df_nonull = df_total.dropna()

# getting the indices

# df2 = df_nonull.groupby(["Customer_ID"])["Month"].nlargest(1)

# list comprehension to collect the indices

# indice_final = [i[1] for i in df2.index.values]

# applying the mask, keeping only the last month's row for each customer

# df_nonull_uniqueCID = df_total.loc[indice_final]

df_nonull_uniqueCID = df_nonull.copy() # we gave up on this slicing and kept the name to avoid renaming every variable afterwards

In [None]:
df_agegroup = df_nonull_uniqueCID.copy()

df_agegroup["Age_Group"] = pd.cut(df_agegroup.Age,
                             bins=[14, 25, 30, 45, 60, 100],
                             labels=["14-25", "25-30", "30-45", "45-60", "60-100"],
                            )

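# selecting several columns after groupby needs a list (double brackets); the tuple form fails on pandas >= 2.0
age_groups = (df_agegroup.groupby(["Age_Group", "Credit_Score"])[["Outstanding_Debt", "Annual_Income", "Num_Bank_Accounts", "Num_Credit_Card"]].sum().reset_index())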

sns.catplot(data=age_groups,
                x="Age_Group",
                y="Outstanding_Debt",
                height=7,
                aspect=1,
                col="Credit_Score",
                kind="bar",
                ci=None,
                palette='viridis'
               ).set_axis_labels("Age group", "Total debt", size=20, fontweight="bold")

plt.show()

In [None]:
df_agegroup = df_nonull_uniqueCID.copy()

df_agegroup["Age_Group"] = pd.cut(df_agegroup.Age,
                             bins=[14, 25, 30, 45, 60, 100],
                             labels=["14-25", "25-30", "30-45", "45-60", "60-100"],
                            )

age_groups = (df_agegroup.groupby(["Age_Group", 'Credit_Score'])['Total_EMI_per_month'].mean().reset_index())

sns.catplot(data=age_groups,
                x="Age_Group",
                y="Total_EMI_per_month",
                hue='Credit_Score',
                height=7,
                aspect=1,
                kind="bar",
                ci=None,
                palette='viridis'
               ).set_axis_labels("Age group", "Monthly installment", size=20, fontweight="bold")

plt.show()

In [None]:
df_agegroup = df_nonull_uniqueCID.copy()

df_agegroup["Age_Group"] = pd.cut(df_agegroup.Age,
                             bins=[14, 25, 30, 45, 60, 100],
                             labels=["14-25", "25-30", "30-45", "45-60", "60-100"],
                            )

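# selecting several columns after groupby needs a list (double brackets); the tuple form fails on pandas >= 2.0
age_groups = (df_agegroup.groupby(["Age_Group", "Credit_Score"])[["Outstanding_Debt", "Annual_Income", "Num_Bank_Accounts", "Num_Credit_Card","Credit_Utilization_Ratio"]].sum().reset_index())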

sns.catplot(data=age_groups,
                x="Age_Group",
                y="Credit_Utilization_Ratio",
                height=7,
                aspect=1,
                col="Credit_Score",
                kind="bar",
                ci=None,
                palette='viridis'
               ).set_axis_labels("Age group", "Credit utilization", size=20, fontweight="bold")

plt.show()

In [None]:
df_nonull_uniqueCID.describe()

In [None]:
ordem1 = df_nonull_uniqueCID.groupby(['Credit_Score'])['Delay_from_due_date'].mean().sort_values().index

sns.barplot(data = df_nonull_uniqueCID,
            x='Delay_from_due_date',
            y='Credit_Score',
            ci = None,
            order = ordem1)
plt.title('Average delay vs. Credit Score')
plt.xlabel('')
plt.ylabel('')
plt.show()

In [None]:
ordem1 = df_nonull_uniqueCID.groupby(['Credit_Score'])['Num_of_Delayed_Payment'].mean().sort_values().index

sns.barplot(data = df_nonull_uniqueCID,
            x='Num_of_Delayed_Payment',
            y='Credit_Score',
            ci = None,
            order = ordem1)
plt.title('Average number of delayed payments vs. Credit Score')
plt.xlabel('')
plt.ylabel('')
plt.show()

In [None]:
ordem2 = df_nonull_uniqueCID.groupby(['Credit_Score'])['Monthly_Inhand_Salary'].mean().sort_values().index

sns.barplot(data = df_nonull_uniqueCID,
            x='Monthly_Inhand_Salary',
            y='Credit_Score',
            ci = None,
            order = ordem2)

plt.title('Salary vs. Score')
plt.show()

In [None]:
ordem2 = df_nonull_uniqueCID.groupby(['Credit_Score'])['Credit_History_age'].mean().sort_values().index

sns.barplot(data = df_nonull_uniqueCID,
            x='Credit_History_age',
            y='Credit_Score',
            ci = None,
            order = ordem2)

plt.title('Credit history vs. Score')
plt.show()

In [None]:
fig, axs = plt.subplots(1, 3, figsize = (10,10))

sns.histplot(data = df_nonull_uniqueCID[df_nonull_uniqueCID['Credit_Score'] == 'Poor'], x = 'Monthly_Inhand_Salary', ax = axs[0], color = 'skyblue')
sns.histplot(data = df_nonull_uniqueCID[df_nonull_uniqueCID['Credit_Score'] == 'Standard'], x = 'Monthly_Inhand_Salary', ax = axs[1], color = 'orange')
sns.histplot(data = df_nonull_uniqueCID[df_nonull_uniqueCID['Credit_Score'] == 'Good'], x = 'Monthly_Inhand_Salary', ax = axs[2], color = 'teal')

fig.suptitle('Salary distribution by Score (Poor, Standard, Good)', size = 15)
plt.show()

In [None]:
ordem1 = df_nonull_uniqueCID.groupby(['Occupation'])['Delay_from_due_date'].mean().sort_values().index

sns.barplot(data = df_nonull_uniqueCID,
            x='Delay_from_due_date',
            y='Occupation',
            ci = None,
            order = ordem1,
            palette='viridis')
plt.show()

In [None]:
df_plot = df_nonull_uniqueCID.groupby(['Occupation', 'Credit_Score']).size().reset_index().pivot(columns='Credit_Score', index='Occupation', values=0)

df_plot.plot(kind='bar', stacked=True)
plt.show()

In [None]:
ordem1 = df_nonull_uniqueCID.groupby(['Occupation'])['Outstanding_Debt'].mean().sort_values().index

sns.barplot(data = df_nonull_uniqueCID,
            x='Outstanding_Debt',
            y='Occupation',
            ci = None,
            order = ordem1,
            palette='viridis')
plt.show()

In [None]:
# numCols = df_total.select_dtypes([np.number]).columns

# for col in numCols:
#     fig, ax = plt.subplots(1, 2, figsize = (8,8))
#     sns.boxplot(data=df_total, y=col, x = 'Credit_Score', ax=ax[0])
#     sns.scatterplot(data=df_total,x = 'Credit_Score', s = 100, y=col, ax=ax[1], color ='#ee1199')
#     plt.show()

In [None]:
plt.figure(figsize=(5,5))

# taking the labels from value_counts() keeps every slice paired with its own class
score_share = df_nonull_uniqueCID.Credit_Score.value_counts(normalize=True)

plt.pie(score_share,
        labels=score_share.index,
        textprops={'fontsize': 21},
        colors = sns.color_palette('viridis')[1:6:2],
        autopct='%.0f%%'
        )

plt.suptitle("Score distribution")
plt.subplots_adjust(top=0.80)
plt.show();

In [None]:
# plt.figure(figsize=(20,18))
# sns.heatmap(df_total.corr(),annot=True,cmap='viridis', linewidths=1, linecolor='k', fmt='.2f')

In [None]:
plt.figure(figsize=(25,6))
sns.violinplot(x='Payment_Behaviour',y='Age',data=df_nonull_uniqueCID, hue='Credit_Score', palette='rainbow')
plt.title("Age distribution by payment behaviour")
plt.show()

In [None]:
plt.figure(figsize=(25,6))
sns.violinplot(x='Payment_Behaviour',y='Credit_Utilization_Ratio',data=df_nonull_uniqueCID, hue='Credit_Score', palette='rainbow')
plt.title("Credit utilization distribution vs. payment behaviour")
plt.show()

In [None]:
# displot creates its own figure, so a preceding plt.figure() would only leave an empty one behind
sns.displot(data=df_nonull_uniqueCID, x="Credit_Utilization_Ratio", kde=True, color='#087E8B')

In [None]:
nome_dict = {'Month': 'Month',
             'Age': 'Age',
             'Annual_Income': 'Annual income',
             'Monthly_Inhand_Salary': 'Monthly salary',
             'Num_Bank_Accounts': 'No. of accounts',
             'Num_Credit_Card': 'No. of cards',
             'Interest_Rate': 'Interest rate',
             'Num_of_Loan': 'No. of loans',
             'Delay_from_due_date': 'Payment delay',
             'Num_of_Delayed_Payment': 'No. of delayed payments',
             'Changed_Credit_Limit': 'Credit limit change',
             'Num_Credit_Inquiries': 'No. of credit inquiries',
             'Outstanding_Debt': 'Outstanding debt',
             'Credit_Utilization_Ratio': 'Credit utilization ratio',
             'Total_EMI_per_month': 'Monthly installment amount',
             'Amount_invested_monthly': 'Amount invested monthly',
             'Monthly_Balance': 'Monthly balance',
             'Credit_History_age': 'Credit history age'}

In [None]:
fig = plt.figure(figsize= (15,9))

num_cols = list(df_nonull_uniqueCID.select_dtypes(exclude='object').columns)

for i, col in enumerate(num_cols):
    ax=fig.add_subplot(4,5,i+1)      
    sns.boxplot(x=df_total[col], ax=ax)
    plt.xlabel(nome_dict[col])
    fig.tight_layout()

fig.suptitle('Variable distributions - before IQR cleaning\n', size = 24)
plt.subplots_adjust(top=0.90)
plt.show()

<a id="title-three"></a>
<h1 style='background:#3f4788; border:2; border-radius: 10px; color:white;padding:20px'><center>Data Cleaning</center></h1>

### Removing outliers

* Age - customers between 18 and 100 years old;
* IQR on the numeric variables.
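
As a reference for the IQR rule named in the list above, here is a minimal sketch on a toy series. `iqr_fences` and `toy` are hypothetical names used only for illustration; the sketch shows the textbook two-sided rule with the usual factor of 1.5, which is not necessarily the exact cut applied to the real data.

```python
import numpy as np
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) outlier fences of the IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data with one obvious outlier (100).
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
low, high = iqr_fences(toy)

# Keep only the values that fall inside both fences.
kept = toy[(toy >= low) & (toy <= high)]
print(low, high, kept.tolist())
```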

In [None]:
# applying the IQR rule to the numeric columns
# the Age column is handled separately, keeping only ages strictly between 18 and 100

print('Rows before cleaning: ', df_nonull_uniqueCID.shape[0])

# Age

indice_age = df_nonull_uniqueCID[(df_nonull_uniqueCID['Age'] <= 18) | (df_nonull_uniqueCID['Age'] >= 100)].index

df_nonull_uniqueCID.drop(indice_age, inplace=True)

# Remaining numeric columns

for i in ['Num_of_Loan', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Delay_from_due_date', 'Annual_Income',
          'Interest_Rate', 'Num_of_Delayed_Payment', 'Num_Credit_Inquiries', 'Total_EMI_per_month',
          'Amount_invested_monthly', 'Monthly_Balance']:

    q1, q3 = np.percentile(df_nonull_uniqueCID[i], [25, 75])
    
    iqr = q3 - q1
    lim_inf = q1 - 1.5 * iqr  # computed for reference only; the lower cut applied below is 0
    lim_sup = q3 + 1.5 * iqr
    indice = df_nonull_uniqueCID[(df_nonull_uniqueCID[i] <= 0) | (df_nonull_uniqueCID[i] >= lim_sup)].index
    df_nonull_uniqueCID.drop(indice, inplace=True)

print('Rows after cleaning: ', df_nonull_uniqueCID.shape[0])

In [None]:
fig = plt.figure(figsize= (15,9))

for i, col in enumerate(num_cols):
    ax=fig.add_subplot(4,5,i+1)      
    sns.boxplot(x=df_nonull_uniqueCID[col], ax=ax)
    plt.xlabel(nome_dict[col])
    fig.tight_layout()

fig.suptitle('Variable distributions - after IQR cleaning\n', size = 24)
plt.subplots_adjust(top=0.90)
plt.show()

<a id="title-four"></a>
<h1 style='background:#3f4788; border:2; border-radius: 10px; color:white;padding:20px'><center>Data Preprocessing</center></h1>

### Dropping unused variables, scaling, encoding, splits, etc.

In [None]:
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

df_processado = df_nonull_uniqueCID.copy()
df_processado = df_processado.drop(['Customer_ID', 'Name', 'SSN', 'Type_of_Loan', 'Type_of_Loan_ajustado',
                                    'Credit_History_Age', 'Credit_Score', 'Month', 'Monthly_Balance',
                                    'Occupation', 'Payment_Behaviour'], axis=1)

#Collecting the categorical and numerical variables

categorical = list(df_processado.select_dtypes(include=['object']).columns)  #Maybe create this at the start, for describe().transpose()
numerical = list(df_processado.select_dtypes(include=['int64', 'float64']).columns) #Maybe create this at the start, for describe()

# Integer (label) encoding of the categorical variables with factorize()

df_processado_categoricals = pd.DataFrame(columns = categorical, index = df_processado.index)
for col in df_processado.select_dtypes('object'):
    df_processado_categoricals[col], _ = df_processado[col].factorize()

#Scaling the numerical variables

from sklearn.preprocessing import StandardScaler
stdscaler = StandardScaler()

df_processado_numericals = pd.DataFrame(stdscaler.fit_transform(df_processado[numerical]), columns = numerical, index = df_processado_categoricals.index)

#Concatenating the encoded categoricals and the scaled numericals

df_processado_final = pd.concat([df_processado_numericals, df_processado_categoricals], axis=1)

#Ordinal Encoder on the target

from sklearn.preprocessing import OrdinalEncoder
ordenc = OrdinalEncoder()

df_processado_final['Credit_Score'] = ordenc.fit_transform(df_nonull_uniqueCID['Credit_Score'].values.reshape(-1,1)).astype(int)

#Defining X and y

X = df_processado_final.drop(['Credit_Score'], axis = 1)
y = df_processado_final['Credit_Score']

# Credit_Score is dropped here too; otherwise the target would leak into the "all features" matrix
X_todas_feats = df_nonull_uniqueCID.drop(['Customer_ID', 'Name', 'SSN', 'Type_of_Loan', 'Type_of_Loan_ajustado', 'Credit_Score'], axis=1)
for col in X_todas_feats.select_dtypes('object'):
    X_todas_feats[col], _ = X_todas_feats[col].factorize()
    
for col in X_todas_feats.select_dtypes(['int64', 'float64']):
    X_todas_feats[col] = stdscaler.fit_transform(X_todas_feats[col].values.reshape(-1,1))

#Splitting the processed dataframe

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify = y, random_state=42)

X_train_todas_feats, X_test_todas_feats, y_train_todas_feats, y_test_todas_feats = train_test_split(X_todas_feats, y, test_size=0.2, stratify = y, random_state=42)

print('Shape of the splits with selected features (X_train, X_test, y_train, y_test): ')
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print('\nShape of the splits with all features (same order): ')
print(X_train_todas_feats.shape, X_test_todas_feats.shape, y_train_todas_feats.shape, y_test_todas_feats.shape)

In [None]:
print('Target encoding: ',{'0': ordenc.categories_[0][0],
       '1': ordenc.categories_[0][1],
       '2': ordenc.categories_[0][2]})
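
In the preprocessing cell above, `StandardScaler` is fitted on all rows before `train_test_split`, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler inside a pipeline. The sketch below illustrates this on synthetic data; `X_toy`, `y_toy` and the `LogisticRegression` step are assumptions made for the example, not part of this notebook's workflow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric credit features.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"income": rng.normal(50_000, 15_000, 300),
                      "debt": rng.normal(1_500, 400, 300)})
y_toy = (X_toy["debt"] > 1_500).astype(int)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# The pipeline fits the scaler on the training rows only and reuses it at predict time.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
print(scaler.mean_, pipe.score(X_te, y_te))
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refitted within each fold.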

<a id="title-five"></a>
<h1 style='background:#3f4788; border:2; border-radius: 10px; color:white;padding:20px'><center>Feature Importance and Feature Selection</center></h1>

In [None]:
df_processado_final.columns

In [None]:
feat_imp = ['Outstanding_Debt','Delay_from_due_date','Changed_Credit_Limit','Credit_History_age',
            'Monthly_Inhand_Salary','Num_of_Delayed_Payment','Credit_Utilization_Ratio', 'Credit_Score']

sns.pairplot(df_processado_final[feat_imp], hue='Credit_Score', palette='viridis')

In [None]:
def Distribution(columne,data,i):
    fig, ax = plt.subplots(1,2, figsize = (15,5))
    font_dict = {'fontsize': 14}
    title=['Before processing','After processing']
    ax = np.ravel(ax)
    if i==1:
        sns.set(style='whitegrid')
        sns.kdeplot(data=data,x=columne, ax = ax[0],color='r').set_title(title[i])
        sns.boxplot(data=data,x=columne, ax = ax[1],palette='magma').set_title(title[i])
    else:
        sns.set(style='whitegrid')
        sns.kdeplot(data=data,x=columne, ax = ax[0],color='#2171b5').set_title(title[i])
        sns.boxplot(data=data,x=columne, ax = ax[1],color='#2171b5').set_title(title[i])
        
    ax = np.reshape(ax, (1, 2))
    plt.tight_layout()

In [None]:
Distribution(columne = 'Outstanding_Debt', data = df_nonull_uniqueCID, i = 0)

In [None]:
Distribution(columne = 'Outstanding_Debt', data = df_processado_final, i = 1)

In [None]:
sns.pairplot(df_processado_final,
             x_vars=['Monthly_Inhand_Salary','Outstanding_Debt'],
             y_vars=['Changed_Credit_Limit','Credit_Utilization_Ratio'])

<h1 style='background:#3f4788; border:2; border-radius: 10px; color:white;padding:20px'><center>Feature Selection and Feature Importance</center></h1>

In [None]:
# Statsmodels model for hypothesis testing and feature selection

# X_train = sm.add_constant(X_train)
# X_test = sm.add_constant(X_test)

model = sm.OLS(y_train, X_train)
non_reg_OLS = model.fit()
non_reg_OLS.summary()

In [None]:
#VIF
def calc_vif(data):
    vif_df = pd.DataFrame(columns=['Var', 'VIF'])
    x_var_names = data.columns
    
    for i in range(0, x_var_names.shape[0]):
        y = data[x_var_names[i]]
        x = data[x_var_names.drop(x_var_names[i])]
        r2 = sm.OLS(y, x).fit().rsquared
        vif = round(1/(1-r2),3)
        vif_df.loc[i] = [x_var_names[i], vif]
    return vif_df.sort_values(by='VIF',axis = 0, ascending=False, inplace=False)

calc_vif(X_train)
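
For reference, the quantity computed by `calc_vif` above is the variance inflation factor. Each feature is regressed on all the others and the resulting coefficient of determination is plugged into the formula below. Note that the regressions here are fitted without an intercept, in which case statsmodels reports the uncentered R².

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad R_j^2 = \text{R-squared of regressing feature } x_j \text{ on the remaining features}
```

A value of 1 means the feature is uncorrelated with the rest, and the value grows without bound as $R_j^2$ approaches 1.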

### * ExtraTreeClassifier - `model.feature_importances_` attribute

In [None]:
#ExtraTreeClassifier for feature_importances_

from sklearn.tree import ExtraTreeClassifier

xtc = ExtraTreeClassifier()
xtc.fit(X_train, y_train)
feat_importance = pd.Series(xtc.feature_importances_, index=X_train.columns).sort_values(ascending=False)

plt.figure(figsize=(20,8))
feat_importance.plot(kind='bar', color='#087E8B')
plt.xticks(fontsize=19);

### * PCA - Visualization and Feature Importance

In [None]:
# Visualizing the Principal Component Analysis (PCA)
from sklearn.decomposition import PCA

pca = PCA().fit(X_train)

plt.plot(pca.explained_variance_ratio_.cumsum(), lw=3, color='#087E8B')
plt.axhline(0.7, ls='--', color='k')
plt.axvline(6, ls='--', color='k')
plt.text(7.4, 0.72, '70% of the variance', fontsize=14)
plt.text(6.2, 0.43, '6 components', fontsize=14)
plt.title('Cumulative explained variance by number of principal components', size=18)
plt.show()

In [None]:
# number of leading components whose cumulative explained variance first exceeds 70%
n_comp_70 = np.argmax(pca.explained_variance_ratio_.cumsum() > 0.7) + 1
print('Number of principal components needed to explain more than 70% of the variance: {}'.format(n_comp_70))
print(f'{n_comp_70} principal components explain {pca.explained_variance_ratio_[:n_comp_70].sum() * 100:.2f}% of the variance')
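
The component count derived above from the cumulative sum can also be obtained by letting scikit-learn pick it, passing the variance threshold as a float `n_components`. This is a minimal sketch on synthetic data; `X_demo` is an assumption made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: five independent columns with decreasing spread.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5])

# Manual route: fit all components, then find where the cumulative ratio passes 0.7.
pca_full = PCA(svd_solver="full").fit(X_demo)
cum_ratio = pca_full.explained_variance_ratio_.cumsum()
n_manual = int(np.argmax(cum_ratio > 0.7)) + 1

# Built-in route: a float in (0, 1) asks for enough components to exceed that share.
pca_70 = PCA(n_components=0.7, svd_solver="full").fit(X_demo)

print(n_manual, pca_70.n_components_)
```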

In [None]:
loadings = pd.DataFrame(
    data=pca.components_.T * np.sqrt(pca.explained_variance_), 
    columns=[f'PC{i}' for i in range(1, len(X_train.columns) + 1)],
    index=X_train.columns
)

plt.figure(figsize=(24,14))
sns.heatmap(loadings.iloc[:,:np.argmax(pca.explained_variance_ratio_.cumsum() > 0.7)+2], annot=True, cmap='viridis', linewidths=1, linecolor='k', fmt='.2f')

In [None]:
pc1_loadings = loadings.sort_values(by='PC1', ascending=False)[['PC1']]
pc1_loadings = pc1_loadings.reset_index()
pc1_loadings.columns = ['Attribute', 'CorrelationWithPC1']

plt.bar(x=pc1_loadings['Attribute'], height=pc1_loadings['CorrelationWithPC1'], color='#087E8B')
plt.title('PCA loading scores (first principal component)', size=20)
plt.xticks(rotation='vertical')
plt.show()

In [None]:
from sklearn.feature_selection import mutual_info_classif

def make_mi_scores(X, y, discrete_features):
    # Credit_Score is a class label, so the classification variant of mutual information is used
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores, color='#087E8B')
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

discrete_features = X.dtypes == int
mi_scores = make_mi_scores(X, y, discrete_features)

print(mi_scores)
plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)
plt.axvline(0.2, color='k', linestyle='--')

### Spider-Man pointing at Spider-Man meme

<img src='img/meme-homem-aranha.JPG' width=30% height=40%>

<a id="title-six"></a>
<h1 style='background:#3f4788; border:2; border-radius: 10px; color:white;padding:20px'><center>Modeling</center></h1>

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, roc_auc_score, roc_curve, auc, accuracy_score, f1_score, precision_score, recall_score


from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

from sklearn.dummy import DummyClassifier

from skopt import BayesSearchCV

In [None]:
# Comparing cross-validated accuracy across the 7 candidate models

comparacao_modelos = pd.DataFrame(columns = ['Modelo', 'Score'])

modelos = [LogisticRegression(solver='liblinear', random_state=42),
           KNeighborsClassifier(n_neighbors=41, p=4),
           DecisionTreeClassifier(random_state=42),
           RandomForestClassifier(random_state=42),
           XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
           ExtraTreeClassifier(random_state=42),
           MLPClassifier(solver='sgd', random_state=42, max_iter=500),
           ]

for model in modelos:
    model_name = model.__class__.__name__
    scores = cross_val_score(model, X_train, y_train, cv = 5, scoring = 'accuracy')
    comparacao_modelos.loc[len(comparacao_modelos)] = [model_name, scores.mean().round(4)]

print('* Cross-validation score comparison across the models: ')
comparacao_modelos

### Baseline Model: Dummy Classifier

In [None]:
dummymodel = DummyClassifier(strategy='prior')
dummymodel.fit(X_train, y_train)

y_pred_dummy = dummymodel.predict(X_test)

In [None]:
print(accuracy_score(y_test, y_pred_dummy))
cmp1 = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_dummy), display_labels=['Good', 'Poor', 'Standard'])
fig, ax = plt.subplots(figsize=(10,10))
plt.grid(False)
cmp1.plot(ax=ax)
plt.rc('font', **font)
plt.title('Confusion matrix - Dummy Classifier', fontsize = 20)
plt.show()

In [None]:
r_auc_dummy = roc_auc_score(y_test, dummymodel.predict_proba(X_test), multi_class = 'ovo')
print('Dummy Classifier has AUROC = %.3f' % (r_auc_dummy))
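
To make the baseline concrete: with `strategy='prior'`, `DummyClassifier` ignores the features, always predicts the most frequent training class, and returns the class frequencies as probabilities. The sketch below shows this on a hand-made label vector; `y_demo` and `X_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Ten labels: class 0 twice, class 1 three times, class 2 five times.
y_demo = np.array([0] * 2 + [1] * 3 + [2] * 5)
X_demo = np.zeros((len(y_demo), 1))  # the features are ignored by the dummy model

dummy = DummyClassifier(strategy="prior").fit(X_demo, y_demo)

pred = dummy.predict(X_demo)            # always the majority class
proba = dummy.predict_proba(X_demo[:1])  # the empirical class frequencies

print(pred, proba)
```

Because every row receives the same probabilities, such a model cannot rank observations, which is why its accuracy equals the majority-class share and serves as the floor the real models should beat.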

### Model 1: Extra Trees Classifier

In [None]:
# #Bayesian hyperparameter optimization
# xtclf = ExtraTreesClassifier(n_jobs=-1)
# xt_params = {'n_estimators':[2000],
#                 'criterion': ['gini'],
#                 'bootstrap':[True]
#                 }

# bayessearch_xt = BayesSearchCV(xtclf,
#                                   xt_params,
#                                   cv=5,
#                                   refit=['accuracy', 'f1'],
#                                   n_jobs=-1,
#                                   verbose=1,
#                                   random_state=42
#                                   ).fit(X_train, y_train)

# y_pred_xt = bayessearch_xt.predict(X_test)
# bayessearch_xt.best_params_

In [None]:
# OrderedDict([('bootstrap', True),
#              ('criterion', 'gini'),
#              ('n_estimators', 2000)])

xtclf = ExtraTreesClassifier(n_jobs=-1,
                             bootstrap=True,
                             criterion='gini',
                             n_estimators=2000).fit(X_train, y_train)

y_pred_xt = xtclf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_xt))
cmp1 = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_xt), display_labels=['Good', 'Poor', 'Standard'])
fig, ax = plt.subplots(figsize=(10,10))
plt.grid(False)
cmp1.plot(ax=ax)
plt.rc('font', **font)
plt.title('Confusion matrix - Extra Trees', fontsize = 20)
plt.show()

In [None]:
r_auc_xt = roc_auc_score(y_test, xtclf.predict_proba(X_test), multi_class = 'ovo')
print('Extra Trees has AUROC = %.3f' % (r_auc_xt))

### Model 2: Random Forest

In [None]:
# # Bayesian hyperparameter optimization
# rfclassifier = RandomForestClassifier(random_state = 42, n_jobs = -1)
# rf_params = {'n_estimators': [500, 1500],
#               'max_depth': [None],
#               'criterion': ['gini']
#                }

# bayessearch_rf = BayesSearchCV(rfclassifier,
#                                rf_params,
#                                cv = 5,
#                                scoring = 'accuracy',
#                                n_jobs = -1,
#                                verbose = 1,
#                                refit = 'accuracy'
#                                ).fit(X_train, y_train)

# y_pred_rf = bayessearch_rf.predict(X_test)
# bayessearch_rf.best_params_

In [None]:
# OrderedDict([('criterion', 'gini'),
#              ('max_depth', None),
#              ('n_estimators', 1441)])

rfclf = RandomForestClassifier(random_state = 42,
                               n_jobs = -1,
                               criterion = 'gini',
                               max_depth = None,
                               n_estimators = 1441).fit(X_train, y_train)

y_pred_rf = rfclf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_rf))
cmp1 = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_rf), display_labels=['Good', 'Poor', 'Standard'])
fig, ax = plt.subplots(figsize=(10,10))
plt.grid(False)
cmp1.plot(ax=ax)
plt.rc('font', **font)
plt.title('Confusion matrix - Random Forest', fontsize = 20)
plt.show()

In [None]:
r_auc_rf = roc_auc_score(y_test, rfclf.predict_proba(X_test), multi_class = 'ovo')
print('Random Forest has AUROC = %.3f' % (r_auc_rf))

### Model 3: XGBoost

In [None]:
# # Bayesian hyperparameter optimization
# xgbclassifier = XGBClassifier(eval_metric = 'logloss', use_label_encoder = False, random_state = 42)

# xgb_params = {'n_estimators': [5, 750, 1500],
#               'max_depth': [None],
#               'gamma': [0.5, 1, 5],
#               'subsample': [0.2, 1.0],
#               'colsample_bytree': [0.6, 0.8],
#               'min_child_weight': [0.3, 0.7, 1.0],
#               'learning_rate': [0.1, 0.3]
#                }

# OrderedDict([('colsample_bytree', 0.6),
#              ('gamma', 1.0),
#              ('learning_rate', 0.1),
#              ('max_depth', None),
#              ('min_child_weight', 0.7),
#              ('n_estimators', 1500),
#              ('subsample', 1.0)])

# bayessearch_xgbc = BayesSearchCV(xgbclassifier,
#                                  xgb_params,
#                                  cv = 5,
#                                  scoring = 'accuracy',
#                                  refit = 'accuracy',
#                                  verbose = 1,
#                                  n_jobs = -1
#                                  ).fit(X_train, y_train)

# y_pred_xgbc = bayessearch_xgbc.predict(X_test)
# bayessearch_xgbc.best_params_

In [None]:
xgbclassifier = XGBClassifier(eval_metric = 'logloss',
                              use_label_encoder = False,
                              random_state = 42,
                              colsample_bytree = 0.6,
                              gamma = 1.0,
                              learning_rate = 0.1,
                              max_depth = None,
                              min_child_weight = 0.7,
                              n_estimators = 1500,
                              subsample = 1.0).fit(X_train, y_train)  # fit on the training split, not the test split

y_pred_xgbc = xgbclassifier.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_xgbc))
cmp1 = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_xgbc), display_labels=['Good', 'Poor', 'Standard'])
fig, ax = plt.subplots(figsize=(10,10))
plt.grid(False)
cmp1.plot(ax=ax)
plt.rc('font', **font)
plt.title('Confusion matrix - XGBoost', fontsize = 20)
plt.show()

In [None]:
r_auc_xgbc = roc_auc_score(y_test, xgbclassifier.predict_proba(X_test), multi_class = 'ovo')
print('XGBoost has AUROC = %.3f' % (r_auc_xgbc))

### Model 4: Multi-Layer Perceptron (scikit-learn neural network)

In [None]:
# # Bayesian hyperparameter optimization
# mlpc = MLPClassifier(random_state = 42, max_iter = 500)

# mlp_params = {'hidden_layer_sizes': [100, 200, 400],
#               'activation': ['tanh'],
#               'solver': ['sgd', 'adam'],
#               'alpha': [0.001, 0.1, 0.3],
#               'learning_rate': ['constant', 'invscaling']
#               }

# bayessearch_mlpc = BayesSearchCV(mlpc,
#                                  mlp_params,
#                                  cv = 3,
#                                  scoring = 'accuracy',
#                                  refit = 'accuracy',
#                                  verbose = 1,
#                                  n_jobs = -1
#                                  ).fit(X_train, y_train)

# y_pred_mlpc = bayessearch_mlpc.predict(X_test)
# bayessearch_mlpc.best_params_

In [None]:
mlpc = MLPClassifier(activation='tanh',
                     alpha=0.1,
                     hidden_layer_sizes=100,
                     learning_rate='invscaling',  # only used when solver='sgd'; has no effect with 'adam'
                     solver='adam',
                     random_state = 42,
                     max_iter = 500,
                     warm_start=True).fit(X_train, y_train)

y_pred_mlpc = mlpc.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_mlpc))
cmp1 = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_mlpc), display_labels=['Good', 'Poor', 'Standard'])
fig, ax = plt.subplots(figsize=(10,10))
plt.grid(False)
cmp1.plot(ax=ax)
plt.rc('font', **font)
plt.title('Confusion matrix - Multi-Layer Perceptron', fontsize = 20)
plt.show()

In [None]:
r_auc_mlp = roc_auc_score(y_test, mlpc.predict_proba(X_test), multi_class = 'ovo')
print('MLP Classifier has AUROC = %.3f' % (r_auc_mlp))

### Model 5: Stacking XGBoost with SVC

In [None]:
# # Bayesian hyperparameter optimization
# svc = SVC(random_state=42, max_iter=200)

# svc_params = {'probability': [True, False],
#               'kernel': ['rbf', 'linear', 'poly'],
#               'degree': [2, 3, 4],
#               'C': [0.1, 1.0, 10.0],
#               'gamma': ['scale', 'auto'],
#               'tol': [1e-5],
#               'shrinking': [True, False]
#               }

# bayessearch_svc = BayesSearchCV(svc,
#                                 svc_params,
#                                 cv = 3,
#                                 scoring = 'accuracy',
#                                 refit = 'accuracy',
#                                 verbose = 1,
#                                 n_jobs = -1,
#                                 random_state=42).fit(X_train, y_train)

# y_pred_svc = bayessearch_svc.predict(X_test)
# bayessearch_svc.best_params_

In [None]:
estimators = [('xgbc', XGBClassifier(eval_metric = 'logloss',
                                     use_label_encoder = False,
                                     random_state = 42,
                                     colsample_bytree = 0.6,
                                     gamma = 1.0,
                                     learning_rate = 0.1,
                                     max_depth = None,
                                     min_child_weight = 0.7,
                                     n_estimators = 1500,
                                     subsample = 1.0)),
              
              ('svc', SVC(random_state=42, 
                          C = 0.1,
                          degree = 3,
                          gamma ='scale',
                          kernel = 'poly',
                          probability = False,
                          shrinking = True,
                          tol = 1e-05))]

stackingclf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(random_state=42, max_iter=400), cv=5).fit(X_train, y_train)

y_pred_stacking = stackingclf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_stacking))
cmp1 = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_stacking), display_labels=['Good', 'Poor', 'Standard'])
fig, ax = plt.subplots(figsize=(10,10))
plt.grid(False)
cmp1.plot(ax=ax)
plt.rc('font', **font)
plt.title('Confusion matrix - Stacking XGBoost with SVC', fontsize = 20)
plt.show()

In [None]:
r_auc_stacking = roc_auc_score(y_test, stackingclf.predict_proba(X_test), multi_class = 'ovo')
print('Stacking Classifier has AUROC = %.3f' % (r_auc_stacking))
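
The stack above includes an `SVC` with `probability=False`, which has no `predict_proba`. With the default `stack_method='auto'`, `StackingClassifier` falls back to `decision_function` for such an estimator, and the final `LogisticRegression` still supplies `predict_proba` for the stack as a whole. The sketch below checks this on the Iris dataset; a `DecisionTreeClassifier` stands in for XGBoost to keep it light, and all names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

demo_stack = StackingClassifier(
    estimators=[
        # Stand-in for the XGBoost base learner (assumption made for this sketch).
        ('tree', DecisionTreeClassifier(random_state=42)),
        # Same kind of SVC as in the stack above: no probability estimates.
        ('svc', SVC(kernel='poly', C=0.1, probability=False, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=400),
    cv=5,
).fit(X_iris, y_iris)

# Method each base estimator uses to feed the final estimator.
print(demo_stack.stack_method_)

# Probabilities of the whole stack come from the final LogisticRegression.
print(demo_stack.predict_proba(X_iris[:3]).shape)
```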

## Model comparison

In [None]:
#Extra Trees: y_pred_xt
#Random Forest: y_pred_rf
#XGBoost: y_pred_xgbc
#Multi-Layer Perceptron: y_pred_mlpc
#Stacking Classifier: y_pred_stacking

In [None]:
def plot_multiclass_roc(clf, X_test, y_test, n_classes, modelname, figsize=(17, 6)):
    y_score = clf.predict_proba(X_test)

    # structures
    fpr = dict()
    tpr = dict()
    roc_auc = dict()

    # calculate dummies once
    y_test_dummies = pd.get_dummies(y_test, drop_first=False).values
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_dummies[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # roc for each class
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot([0, 1], [0, 1], 'k--')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(f'ROC curve - for {modelname}', fontdict={'fontsize': 26})
    for i in range(n_classes):
        ax.plot(fpr[i], tpr[i], label='ROC curve (area = %0.3f) for label %i' % (roc_auc[i], i))
    ax.legend(loc="best")
    ax.grid(alpha=.4)
    sns.despine()
    plt.show()
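
The function above draws one one-vs-rest ROC curve per class. When only the numbers are needed, the same per-class AUC can be computed without plotting. The sketch below is one way to do that; `per_class_auc` and the toy arrays are hypothetical names introduced here, and `label_binarize` is used in place of `pd.get_dummies` so that the column order is stated explicitly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

def per_class_auc(y_true, y_score, classes):
    """Return {class: one-vs-rest AUC} from an (n_samples, n_classes) score matrix."""
    # Column k of y_bin is 1 where the true label equals classes[k], else 0.
    y_bin = label_binarize(y_true, classes=classes)
    return {c: roc_auc_score(y_bin[:, k], y_score[:, k])
            for k, c in enumerate(classes)}

# Toy labels and probabilities (made up for illustration).
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_score_demo = np.array([
    [0.8, 0.10, 0.10],
    [0.6, 0.30, 0.10],
    [0.2, 0.70, 0.10],
    [0.5, 0.25, 0.25],  # a class-1 sample scored mostly as class 0
    [0.1, 0.20, 0.70],
    [0.2, 0.20, 0.60],
])

aucs_demo = per_class_auc(y_true_demo, y_score_demo, classes=[0, 1, 2])
print(aucs_demo)
```

In this toy data, classes 0 and 2 are perfectly separated, while class 1 loses one of its eight positive/negative score comparisons because of the fourth row.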

In [None]:
plot_multiclass_roc(dummymodel, X_test, y_test, n_classes = 3, modelname = 'Dummy Classifier', figsize = (16,10))
plot_multiclass_roc(xtclf, X_test, y_test, 3, 'Extra Trees', (16, 10))
plot_multiclass_roc(rfclf, X_test, y_test, 3, 'Random Forest', (16, 10))
plot_multiclass_roc(xgbclassifier, X_test, y_test, 3, 'XGBoost', (16, 10))
plot_multiclass_roc(mlpc, X_test, y_test, 3, 'Multi-Layer Perceptron', (16, 10))
plot_multiclass_roc(stackingclf, X_test, y_test, 3, 'Stacking', (16, 10))

#### Combined ROC curves (for the Standard class)

In [None]:
def plot_allmodels_roc(clf, X_test, y_test, figsize=(17, 6)):
    y_score = []

    # structures
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    
    # calculate dummies once
    y_test_dummies = pd.get_dummies(y_test, drop_first=False).values
    
    for i, it in zip(clf, range(len(clf))):
        y_score = i.predict_proba(X_test)[:, 2]
        fpr[it], tpr[it], _ = roc_curve(y_test_dummies[:, 2], y_score)
        roc_auc[it] = auc(fpr[it], tpr[it])

    # roc for each class
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot([0, 1], [0, 1], 'k--', label='DummyClassifier (AUC = 0.500)')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(f'ROC curve for Standard Class (OvR) - All models', fontdict={'fontsize': 26})
    
    for i, it in zip(clf, range(len(clf))):
        ax.plot(fpr[it], tpr[it], label='%s (AUC = %0.3f)' % (i.__class__.__name__, roc_auc[it]))
        
    ax.legend(loc="best")
    ax.grid(alpha=.4)
    sns.despine()
    plt.show()

In [None]:
plot_allmodels_roc([rfclf, xtclf, xgbclassifier, mlpc, stackingclf], X_test, y_test)

### Metrics comparison table

In [None]:
def perf_measure(y_actual, y_hat):
    """Return [TP, FP, TN, FN] for class index 2 ('Standard'), treated one-vs-rest."""
    # Compute the confusion matrix once (rows = true class, columns = predicted class)
    cm = confusion_matrix(y_actual, y_hat)

    TP = cm[2][2]                                    # true Standard, predicted Standard
    FP = cm[0][2] + cm[1][2]                         # other classes predicted as Standard
    TN = cm[0][0] + cm[0][1] + cm[1][0] + cm[1][1]   # other classes not predicted as Standard
    FN = cm[2][0] + cm[2][1]                         # true Standard predicted as another class

    return [TP, FP, TN, FN]
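
The function above hard-codes class index 2. The same one-vs-rest counts can be read off a confusion matrix for any class by using row and column sums. The sketch below uses only NumPy; `one_vs_rest_counts` and `cm_demo` are hypothetical names, and the matrix values are made up.

```python
import numpy as np

def one_vs_rest_counts(cm, k):
    """Return (TP, FP, TN, FN) for class k of a square confusion matrix.

    Rows are true classes and columns are predicted classes,
    matching the layout of sklearn.metrics.confusion_matrix.
    """
    cm = np.asarray(cm)
    tp = cm[k, k]
    fp = cm[:, k].sum() - tp      # predicted k, but true class differs
    fn = cm[k, :].sum() - tp      # true k, but predicted as another class
    tn = cm.sum() - tp - fp - fn  # everything that does not involve class k
    return int(tp), int(fp), int(tn), int(fn)

# Made-up 3x3 confusion matrix (illustrative values only).
cm_demo = np.array([[50,  3,  7],
                    [ 4, 60,  6],
                    [ 5,  8, 90]])

print(one_vs_rest_counts(cm_demo, 2))  # (90, 13, 117, 13)
```

The return order (TP, FP, TN, FN) follows `perf_measure`, so index 3 is again the false-negative count.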

In [None]:
comparativo_final_metricas = pd.DataFrame(columns = ['Acurácia', 'Recall', 'FN', 'F1-Score', 'AUROC'])

comparativo_final_metricas.loc['Dummy'] = [accuracy_score(y_test, y_pred_dummy).round(4),
                                           recall_score(y_test, y_pred_dummy, average='macro').round(4),
                                           perf_measure(y_test, y_pred_dummy)[3].sum(),
                                           f1_score(y_test, y_pred_dummy, average='macro').round(4),
                                           r_auc_dummy.round(4)
                                           ]

comparativo_final_metricas.loc['Extra Trees'] = [accuracy_score(y_test, y_pred_xt).round(4),
                                                 recall_score(y_test, y_pred_xt, average='macro').round(4),
                                                 perf_measure(y_test, y_pred_xt)[3].sum(),
                                                 f1_score(y_test, y_pred_xt, average='macro').round(4),
                                                 r_auc_xt.round(4)
                                                 ]

comparativo_final_metricas.loc['Random Forest'] = [accuracy_score(y_test, y_pred_rf).round(4),
                                                   recall_score(y_test, y_pred_rf, average='macro').round(4),
                                                   perf_measure(y_test, y_pred_rf)[3].sum(),
                                                   f1_score(y_test, y_pred_rf, average='macro').round(4),
                                                   r_auc_rf.round(4)
                                                   ]

comparativo_final_metricas.loc['XGBoost'] = [accuracy_score(y_test, y_pred_xgbc).round(4),
                                             recall_score(y_test, y_pred_xgbc, average='macro').round(4),
                                             perf_measure(y_test, y_pred_xgbc)[3].sum(),
                                             f1_score(y_test, y_pred_xgbc, average='macro').round(4),
                                             r_auc_xgbc.round(4)
                                             ]

comparativo_final_metricas.loc['Multi Layer Perceptron'] = [accuracy_score(y_test, y_pred_mlpc).round(4),
                                                            recall_score(y_test, y_pred_mlpc, average='macro').round(4),
                                                            perf_measure(y_test, y_pred_mlpc)[3].sum(),
                                                            f1_score(y_test, y_pred_mlpc, average='macro').round(4),
                                                            r_auc_mlp.round(4)
                                                            ]

comparativo_final_metricas.loc['Stacking'] = [accuracy_score(y_test, y_pred_stacking).round(4),
                                              recall_score(y_test, y_pred_stacking, average='macro').round(4),
                                              perf_measure(y_test, y_pred_stacking)[3].sum(),
                                              f1_score(y_test, y_pred_stacking, average='macro').round(4),
                                              r_auc_stacking.round(4)
                                              ]

In [None]:
comparativo_final_metricas = comparativo_final_metricas.sort_values(by='Acurácia', ascending=False)
comparativo_final_metricas
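
Each row of the table repeats the same metric calls with a different set of predictions, which makes copy-paste slips easy. One option is to compute a row with a small helper and build the table in a single step. The sketch below shows the idea on toy data; `metrics_row` and the English column names are assumptions made for the example, and the FN column is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

def metrics_row(y_true, y_pred, y_proba):
    """Return one row of macro-averaged metrics for a multiclass model."""
    return {
        'Accuracy': round(accuracy_score(y_true, y_pred), 4),
        'Recall': round(recall_score(y_true, y_pred, average='macro'), 4),
        'F1-Score': round(f1_score(y_true, y_pred, average='macro'), 4),
        'AUROC': round(roc_auc_score(y_true, y_proba, multi_class='ovo'), 4),
    }

# Toy labels and probabilities (made up; each row sums to 1).
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_proba_toy = np.array([
    [0.8, 0.10, 0.10],
    [0.6, 0.30, 0.10],
    [0.2, 0.70, 0.10],
    [0.5, 0.25, 0.25],
    [0.1, 0.20, 0.70],
    [0.2, 0.20, 0.60],
])
y_pred_toy = y_proba_toy.argmax(axis=1)

# One dictionary entry per model; here there is a single toy model.
rows = {'Toy model': metrics_row(y_true_toy, y_pred_toy, y_proba_toy)}
metrics_table = pd.DataFrame.from_dict(rows, orient='index')
print(metrics_table)
```

With real models, each entry would pair a model name with its own predictions and `predict_proba` output, so every metric in a row is guaranteed to come from the same model.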

#### END