# Initial Exploration

QuantumFinance has partnered with several client companies to provide the credit scores of their opted-in customers, so that those customers can be offered better payment terms.
To keep this model simpler than, and separate from, the bank's main scoring model, it must be trained on the customers' most recent transaction data.

To support internal governance and integration with other systems, the solution must include:

- A repository template for organizing the files (dataset, notebook, model, etc.)
- Experiment tracking for model training
- Model versioning
- A secure API endpoint with authentication and throttling
- API documentation

To validate the solution and serve as a reference implementation for partners, integrate a sample Streamlit application with the published API.

Dataset: https://www.kaggle.com/datasets/parisrohan/credit-score-classification

In [1]:

import pandas as pd
import numpy as np
import re
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler

In [2]:
df = pd.read_csv("../data/raw/train.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob

In [3]:
df = df.drop(columns=['ID','Customer_ID','Name','Month','SSN','Interest_Rate','Type_of_Loan','Payment_of_Min_Amount','Payment_Behaviour','Changed_Credit_Limit','Credit_Mix','Credit_History_Age','Delay_from_due_date','Outstanding_Debt'])
df.head()

Unnamed: 0,Age,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Num_of_Loan,Num_of_Delayed_Payment,Num_Credit_Inquiries,Credit_Utilization_Ratio,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance,Credit_Score
0,23,Scientist,19114.12,1824.843333,3,4,4,7.0,4.0,26.82262,49.574949,80.41529543900253,312.49408867943663,Good
1,23,Scientist,19114.12,,3,4,4,,4.0,31.94496,49.574949,118.28022162236736,284.62916249607184,Good
2,-500,Scientist,19114.12,,3,4,4,7.0,4.0,28.609352,49.574949,81.699521264648,331.2098628537912,Good
3,23,Scientist,19114.12,,3,4,4,4.0,4.0,31.377862,49.574949,199.4580743910713,223.45130972736783,Good
4,23,Scientist,19114.12,1824.843333,3,4,4,,4.0,24.797347,49.574949,41.420153086217326,341.48923103222177,Good


In [4]:
df = df.dropna(axis=0, how='any')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 73156 entries, 0 to 99999
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       73156 non-null  object 
 1   Occupation                73156 non-null  object 
 2   Annual_Income             73156 non-null  object 
 3   Monthly_Inhand_Salary     73156 non-null  float64
 4   Num_Bank_Accounts         73156 non-null  int64  
 5   Num_Credit_Card           73156 non-null  int64  
 6   Num_of_Loan               73156 non-null  object 
 7   Num_of_Delayed_Payment    73156 non-null  object 
 8   Num_Credit_Inquiries      73156 non-null  float64
 9   Credit_Utilization_Ratio  73156 non-null  float64
 10  Total_EMI_per_month       73156 non-null  float64
 11  Amount_invested_monthly   73156 non-null  object 
 12  Monthly_Balance           73156 non-null  object 
 13  Credit_Score              73156 non-null  object 
dtypes: float64(
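
Dropping every row that has any missing value reduces the data from 100000 to 73156 rows. Before doing that, it can help to see which columns account for the loss. The following is a minimal sketch on a toy frame; the helper name `missing_report` and the sample values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column (hypothetical helper)."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "share": counts / len(frame),
    })
    # Columns with the most gaps come first
    return report.sort_values("missing", ascending=False)


# Toy data standing in for the real dataset (assumed values)
toy = pd.DataFrame({
    "Monthly_Inhand_Salary": [1824.84, np.nan, np.nan, 1824.84],
    "Num_Credit_Inquiries": [4.0, 4.0, np.nan, 4.0],
    "Num_Bank_Accounts": [3, 3, 3, 3],
})

report = missing_report(toy)
rows_kept = len(toy.dropna(axis=0, how="any"))
print(report)
print(f"rows kept after dropna: {rows_kept} of {len(toy)}")
```

If one or two columns explain most of the missing rows, imputing or dropping those columns may preserve more data than dropping rows.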

## Data Cleaning

In [6]:
# Remove unwanted characters and convert the given columns to float
def limpar_e_converter_colunas(df: pd.DataFrame, colunas: list) -> pd.DataFrame:

    for coluna in colunas:
        # Strip everything except digits, dots, and commas
        # (this also drops minus signs, so '-500' becomes '500')
        df[coluna] = df[coluna].astype(str).apply(lambda x: re.sub(r'[^0-9.,]', '', x))

        # Replace commas with dots, convert to float, and round to 2 decimals
        df[coluna] = df[coluna].str.replace(',', '.', regex=False).astype(float).round(2)

    return df

In [7]:
df = limpar_e_converter_colunas(df, ['Age', 'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card','Num_of_Loan','Num_of_Delayed_Payment','Num_Credit_Inquiries','Credit_Utilization_Ratio','Total_EMI_per_month','Amount_invested_monthly','Monthly_Balance'])

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 73156 entries, 0 to 99999
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       73156 non-null  float64
 1   Occupation                73156 non-null  object 
 2   Annual_Income             73156 non-null  float64
 3   Monthly_Inhand_Salary     73156 non-null  float64
 4   Num_Bank_Accounts         73156 non-null  float64
 5   Num_Credit_Card           73156 non-null  float64
 6   Num_of_Loan               73156 non-null  float64
 7   Num_of_Delayed_Payment    73156 non-null  float64
 8   Num_Credit_Inquiries      73156 non-null  float64
 9   Credit_Utilization_Ratio  73156 non-null  float64
 10  Total_EMI_per_month       73156 non-null  float64
 11  Amount_invested_monthly   73156 non-null  float64
 12  Monthly_Balance           73156 non-null  float64
 13  Credit_Score              73156 non-null  object 
dtypes: float64(
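
Because the regular expression in `limpar_e_converter_colunas` keeps only digits, dots, and commas, a minus sign is removed along with the stray characters. A value such as `-500` (seen in `Age` in the first `df.head()` output) would therefore be stored as `500.0`. One possible alternative is sketched below: it keeps a leading minus sign and turns unparseable values into `NaN` through `pd.to_numeric(..., errors="coerce")`. The function name `to_numeric_keep_sign` and the sample strings are assumptions for illustration.

```python
import pandas as pd


def to_numeric_keep_sign(series: pd.Series) -> pd.Series:
    """Strip stray characters but keep a leading minus sign (hypothetical helper)."""
    text = series.astype(str).str.replace(",", ".", regex=False)
    # Extract the first signed decimal number found in each string
    extracted = text.str.extract(r"(-?\d+(?:\.\d+)?)", expand=False)
    # Strings without any number become NaN instead of raising an error
    return pd.to_numeric(extracted, errors="coerce").round(2)


# Assumed sample strings, not taken from the dataset
raw = pd.Series(["23", "-500", "28_", "19114.12_", "_______"])
clean = to_numeric_keep_sign(raw)
print(clean.tolist())
```

With the sign preserved, implausible values such as negative ages can be filtered or imputed explicitly instead of being turned into large positive numbers.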

In [9]:
df['Occupation'] = df['Occupation'].replace('_______', 'Não informado')

In [10]:
df['Occupation'].value_counts()

Occupation
Não informado    5158
Lawyer           4867
Engineer         4650
Mechanic         4627
Architect        4605
Accountant       4571
Developer        4563
Media_Manager    4553
Teacher          4545
Scientist        4524
Doctor           4523
Entrepreneur     4494
Journalist       4442
Musician         4373
Manager          4350
Writer           4311
Name: count, dtype: int64

In [11]:
def maper_colunas(df: pd.DataFrame, coluna: str, mapa: dict):
    df[coluna] = df[coluna].map(mapa).fillna(df[coluna])  # Keep unmapped values
    return df

mapeamento = {'Não informado': 0, 'Lawyer': 1,'Engineer':2, 'Mechanic':3, 'Architect':4, 'Accountant':5, 'Developer':6, 'Media_Manager':7, 'Teacher':8, 'Scientist':9, 'Doctor':10,
               'Entrepreneur':11,'Journalist':12,'Musician':13,'Manager':14,'Writer':15}

df = maper_colunas(df, 'Occupation', mapeamento)

In [12]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 73156 entries, 0 to 99999
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       73156 non-null  float64
 1   Occupation                73156 non-null  int64  
 2   Annual_Income             73156 non-null  float64
 3   Monthly_Inhand_Salary     73156 non-null  float64
 4   Num_Bank_Accounts         73156 non-null  float64
 5   Num_Credit_Card           73156 non-null  float64
 6   Num_of_Loan               73156 non-null  float64
 7   Num_of_Delayed_Payment    73156 non-null  float64
 8   Num_Credit_Inquiries      73156 non-null  float64
 9   Credit_Utilization_Ratio  73156 non-null  float64
 10  Total_EMI_per_month       73156 non-null  float64
 11  Amount_invested_monthly   73156 non-null  float64
 12  Monthly_Balance           73156 non-null  float64
 13  Credit_Score              73156 non-null  object 
dtypes: float64(
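
The mapping above assigns integers to occupations in order of frequency, which imposes an ordering (for example, `Writer` = 15 versus `Lawyer` = 1) that has no obvious meaning for a nominal feature. If that ordering is not wanted, one-hot encoding is a common alternative. A minimal sketch with `pd.get_dummies` on toy data follows; the `Occupation` prefix choice and the toy rows are assumptions, and this is not the approach the notebook takes.

```python
import pandas as pd

# Toy rows standing in for the real data (assumed values)
toy = pd.DataFrame({
    "Occupation": ["Scientist", "Teacher", "Scientist", "Lawyer"],
    "Num_Bank_Accounts": [3, 2, 3, 5],
})

# One indicator column per occupation; the other columns are left untouched
encoded = pd.get_dummies(toy, columns=["Occupation"], prefix="Occupation", dtype=int)
print(encoded.columns.tolist())
```

The trade-off is width: with 16 occupation values, this adds 16 indicator columns in place of one integer column.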

In [14]:
scaler = MinMaxScaler()

# Numeric columns to normalize (Credit_Score is still a string column here, so it is excluded either way)
colunas_para_normalizar = df.select_dtypes(include='number').columns.difference(['Credit_Score'])

# Apply the scaler
df[colunas_para_normalizar] = scaler.fit_transform(df[colunas_para_normalizar])

In [16]:
df.head()

Unnamed: 0,Age,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Num_of_Loan,Num_of_Delayed_Payment,Num_Credit_Inquiries,Credit_Utilization_Ratio,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance,Credit_Score
0,0.001036,0.6,0.000501,0.102087,0.001669,0.002668,0.002676,0.001592,0.00154,0.203984,0.000602,0.008042,9.3708e-25,Good
6,0.001036,0.6,0.000501,0.102087,0.001669,0.002668,0.002676,0.001819,0.00154,0.057005,0.000602,0.017834,7.3332e-25,Good
7,0.001036,0.6,0.000501,0.102087,0.001669,0.002668,0.002676,0.001365,0.00154,0.104739,0.000602,0.002479,1.0739699999999999e-24,Standard
8,0.001612,0.0,0.001151,0.183501,0.001112,0.002668,0.000669,0.00091,0.00077,0.12294,0.000229,0.010429,1.4116799999999999e-24,Standard
9,0.001612,0.533333,0.001151,0.183501,0.001112,0.002668,0.000669,0.000227,0.00077,0.606799,0.000229,0.004039,1.45338e-24,Good
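
`MinMaxScaler` maps each column to the range [0, 1] using `(x - min) / (max - min)`, so a single extreme value compresses everything else toward zero. In the output above, `Monthly_Balance` values on the order of 1e-24 for balances of a few hundred suggest a column maximum on the order of 1e26 (roughly 312.49 / 9.37e-25). The sketch below reproduces the effect on toy numbers and shows one possible mitigation, capping values at a percentile before scaling. The toy values and the 75th-percentile cutoff (chosen only because the sample is tiny) are assumptions, not the notebook's method.

```python
import numpy as np


def min_max(values: np.ndarray) -> np.ndarray:
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)


# Toy balances with one extreme value (assumed for illustration)
balances = np.array([312.49, 284.63, 331.21, 223.45, 3.3e26])

scaled_raw = min_max(balances)

# Possible mitigation: cap values at a chosen percentile before scaling
cap = np.percentile(balances, 75)
scaled_capped = min_max(np.minimum(balances, cap))

print(scaled_raw[:4])     # all four ordinary values collapse to nearly zero
print(scaled_capped[:4])  # the same values now spread across [0, 1]
```

Inspecting `df.describe()` before scaling is a simple way to spot columns where the maximum is far from the bulk of the data.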


In [18]:
# Encode the target classes as ordered integers (Poor < Standard < Good)
mapa_credit_mix = {
    'Poor': 0,
    'Standard': 1,
    'Good': 2
}
df['Credit_Score'] = df['Credit_Score'].map(mapa_credit_mix)


In [19]:
df.head()

Unnamed: 0,Age,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Num_of_Loan,Num_of_Delayed_Payment,Num_Credit_Inquiries,Credit_Utilization_Ratio,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance,Credit_Score
0,0.001036,0.6,0.000501,0.102087,0.001669,0.002668,0.002676,0.001592,0.00154,0.203984,0.000602,0.008042,9.3708e-25,2
6,0.001036,0.6,0.000501,0.102087,0.001669,0.002668,0.002676,0.001819,0.00154,0.057005,0.000602,0.017834,7.3332e-25,2
7,0.001036,0.6,0.000501,0.102087,0.001669,0.002668,0.002676,0.001365,0.00154,0.104739,0.000602,0.002479,1.0739699999999999e-24,1
8,0.001612,0.0,0.001151,0.183501,0.001112,0.002668,0.000669,0.00091,0.00077,0.12294,0.000229,0.010429,1.4116799999999999e-24,1
9,0.001612,0.533333,0.001151,0.183501,0.001112,0.002668,0.000669,0.000227,0.00077,0.606799,0.000229,0.004039,1.45338e-24,2
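
`Series.map` with a dictionary returns `NaN` for any label that is missing from the dictionary, so a typo or an unexpected class would silently produce missing targets. A small check such as the following can guard against that. The helper name `encode_labels` and the toy labels are assumptions for illustration.

```python
import pandas as pd

LABELS = {"Poor": 0, "Standard": 1, "Good": 2}


def encode_labels(labels: pd.Series, mapping: dict) -> pd.Series:
    """Map string labels to integers and fail loudly on unknown labels (hypothetical helper)."""
    encoded = labels.map(mapping)
    unknown = labels[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped labels: {list(unknown)}")
    return encoded.astype(int)


target = encode_labels(pd.Series(["Good", "Standard", "Poor", "Good"]), LABELS)
# Class counts per encoded label, useful for spotting imbalance
print(target.value_counts().sort_index())
```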


In [20]:
df.to_csv("../data/processed/train_processed.csv", index=False)
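
The processed CSV stores features that are already scaled, but this section does not save the fitted scaler itself. Since the model is meant to be served behind an API, incoming requests would need the same transformation with the same parameters. One way to handle this, sketched below under assumptions, is to persist the fitted `MinMaxScaler` with `joblib` and reload it at inference time. The file name `scaler.joblib`, the temporary directory, and the toy frames are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy training data (assumed values)
train = pd.DataFrame({"Age": [23.0, 28.0, 34.0], "Num_Bank_Accounts": [3.0, 2.0, 5.0]})

scaler = MinMaxScaler().fit(train)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scaler.joblib"  # hypothetical artifact name
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# A new request is transformed with the parameters learned from the training data
request = pd.DataFrame({"Age": [28.0], "Num_Bank_Accounts": [5.0]})
scaled_request = restored.transform(request)
print(scaled_request)
```

Saving the scaler next to the versioned model keeps preprocessing and model in step when either one is retrained.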