# Initial Exploration

QuantumFinance has partnered with several client companies to provide the credit score of their opted-in customers, so that those customers can be offered better payment terms.
To keep this model simpler than, and distinct from, the bank's main scoring model, it must be trained on the customers' most recent transaction data.

To support internal governance and integration with other systems, the solution must include:

- A repository template for organizing files (dataset, notebook, model, etc.)
- Experiment tracking for model training
- Model versioning
- A secure API endpoint with authentication and throttling
- API documentation

To validate the solution and provide a reference implementation for partners, integrate a sample Streamlit application with the published API.

Dataset: https://www.kaggle.com/datasets/parisrohan/credit-score-classification

In [23]:

import pandas as pd
import numpy as np
import re
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
import io

In [24]:
df = pd.read_csv("../data/raw/train.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob



In [25]:
df_test = pd.read_csv("../data/raw/test.csv")
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        50000 non-null  object 
 1   Customer_ID               50000 non-null  object 
 2   Month                     50000 non-null  object 
 3   Name                      44985 non-null  object 
 4   Age                       50000 non-null  object 
 5   SSN                       50000 non-null  object 
 6   Occupation                50000 non-null  object 
 7   Annual_Income             50000 non-null  object 
 8   Monthly_Inhand_Salary     42502 non-null  float64
 9   Num_Bank_Accounts         50000 non-null  int64  
 10  Num_Credit_Card           50000 non-null  int64  
 11  Interest_Rate             50000 non-null  int64  
 12  Num_of_Loan               50000 non-null  object 
 13  Type_of_Loan              44296 non-null  object 
 14  Delay_

In [26]:
# Identifier and free-text columns that are not used as model features
cols_to_drop = ['ID','Customer_ID','Name','Month','SSN','Interest_Rate','Type_of_Loan','Payment_of_Min_Amount','Payment_Behaviour']
df = df.drop(columns=cols_to_drop)
df_test = df_test.drop(columns=cols_to_drop)
df.head()

Unnamed: 0,Age,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance,Credit_Score
0,23,Scientist,19114.12,1824.843333,3,4,4,3,7.0,11.27,4.0,_,809.98,26.82262,22 Years and 1 Months,49.574949,80.41529543900253,312.49408867943663,Good
1,23,Scientist,19114.12,,3,4,4,-1,,11.27,4.0,Good,809.98,31.94496,,49.574949,118.28022162236736,284.62916249607184,Good
2,-500,Scientist,19114.12,,3,4,4,3,7.0,_,4.0,Good,809.98,28.609352,22 Years and 3 Months,49.574949,81.699521264648,331.2098628537912,Good
3,23,Scientist,19114.12,,3,4,4,5,4.0,6.27,4.0,Good,809.98,31.377862,22 Years and 4 Months,49.574949,199.4580743910713,223.45130972736783,Good
4,23,Scientist,19114.12,1824.843333,3,4,4,6,,11.27,4.0,Good,809.98,24.797347,22 Years and 5 Months,49.574949,41.420153086217326,341.48923103222177,Good
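
The preview above shows placeholder tokens such as `_` and an implausible age of `-500` in columns that pandas loaded as text. As a minimal sketch (the helper name `count_non_numeric` and the toy values are illustrative assumptions, not part of the original notebook), the snippet below counts, per column, the non-null entries that cannot be parsed as numbers:

```python
import pandas as pd


def count_non_numeric(series: pd.Series) -> int:
    """Count non-null entries that cannot be parsed as a number."""
    parsed = pd.to_numeric(series, errors="coerce")
    # An entry is "dirty" if it was present but became NaN after parsing
    return int((parsed.isna() & series.notna()).sum())


# Toy data standing in for the raw text columns (hypothetical values)
sample = pd.DataFrame({
    "Age": ["23", "-500", "unknown", "34"],
    "Changed_Credit_Limit": ["11.27", "_", "6.27", None],
})

report = {col: count_non_numeric(sample[col]) for col in sample.columns}
print(report)
```

Note that `-500` parses as a valid number, so a check like this only flags unparseable tokens; out-of-range values still need a separate rule, such as the non-positive age filter used in the cleaning step.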


## Data Cleaning

In [None]:
def limpeza_dados(df):
    # 1.1 Fix invalid values
    # --------------------------------------------
    # Replace non-positive ages with NaN (Not a Number) so they are imputed later
    print("\n--- Corrigindo idade negativa... ---")
    df['Age'] = (
        df['Age']
        .astype(str)                              # Ensure values are strings
        .apply(lambda x: re.sub(r'[^0-9-]', '', x)) # Keep only digits and minus signs
        .apply(lambda x: int(x) if x not in ['', '-'] else 0) # Convert to int (empty values become 0)
        .astype(float)                            # Float dtype so NaN can be assigned below
    )
    df.loc[df['Age'] <= 0, 'Age'] = np.nan

    # Replace the placeholder '_' in the Credit_Mix column with NaN
    print("--- Corrigindo valor '_' em Credit_Mix... ---")
    df['Credit_Mix'] = df['Credit_Mix'].replace('_', np.nan)


    # 1.2 Convert text columns to numeric
    # --------------------------------------------
    # 'Amount_invested_monthly' and 'Monthly_Balance' are stored as object (text).
    # Other text columns (e.g. 'Annual_Income', 'Changed_Credit_Limit') are not converted here.
    print("\n--- Convertendo colunas de texto para numérico... ---")
    df['Amount_invested_monthly'] = pd.to_numeric(df['Amount_invested_monthly'], errors='coerce')
    df['Monthly_Balance'] = pd.to_numeric(df['Monthly_Balance'], errors='coerce')


    # 1.3 Feature engineering on 'Credit_History_Age'
    # ----------------------------------------------
    # Extract years and months into a single numeric column (total months)
    print("--- Transformando 'Credit_History_Age' em uma feature numérica... ---")

    # Extract the year and month numbers with a regular expression
    history_age_extracted = df['Credit_History_Age'].str.extract(r'(\d+)\s*Years.*?(\d+)\s*Months')
    history_age_extracted.columns = ['Years', 'Months']

    # Convert to numeric
    history_age_extracted['Years'] = pd.to_numeric(history_age_extracted['Years'])
    history_age_extracted['Months'] = pd.to_numeric(history_age_extracted['Months'])

    # Compute the total number of months and store it in a new DataFrame column
    df['Credit_History_Total_Months'] = (history_age_extracted['Years'] * 12) + history_age_extracted['Months']

    # Drop the original text column
    df = df.drop(columns=['Credit_History_Age'])


    # 1.4 Fill missing values (imputation)
    # -------------------------------------------
    print("\n--- Preenchendo valores ausentes (NaNs)... ---")

    # Split numeric and categorical (object) columns
    numeric_cols = df.select_dtypes(include=np.number).columns
    categorical_cols = df.select_dtypes(include=['object']).columns

    # Numeric strategy: fill with the median (more robust to outliers)
    numeric_imputer = SimpleImputer(strategy='median')
    df[numeric_cols] = numeric_imputer.fit_transform(df[numeric_cols])

    # Categorical strategy: fill with the most frequent value
    categorical_imputer = SimpleImputer(strategy='most_frequent')
    df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])


    # Post-cleaning check: no null values should remain
    print("\n--- Verificação Pós-Limpeza (Não deve haver valores nulos) ---")
    print(df.isnull().sum())
    return df
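
To make the `Credit_History_Age` step concrete, here is a small worked example of the same regular expression applied to a toy Series (the sample values other than `22 Years and 1 Months` are assumptions for illustration). It shows that missing entries stay as NaN, which is why the imputation step runs afterwards:

```python
import pandas as pd

# Toy values in the same "<N> Years and <M> Months" format as the dataset
history = pd.Series(["22 Years and 1 Months", None, "0 Years and 11 Months"])

# Same pattern as in limpeza_dados: two capture groups, years and months
parts = history.str.extract(r'(\d+)\s*Years.*?(\d+)\s*Months')
parts.columns = ["Years", "Months"]

total_months = pd.to_numeric(parts["Years"]) * 12 + pd.to_numeric(parts["Months"])
print(total_months.tolist())  # [265.0, nan, 11.0]
```

The first value, 22 * 12 + 1 = 265, matches the `Credit_History_Total_Months` shown for the first training row after cleaning.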

In [28]:
df = limpeza_dados(df)
df.head()


--- Corrigindo idade negativa... ---
--- Corrigindo valor '_' em Credit_Mix... ---

--- Convertendo colunas de texto para numérico... ---
--- Transformando 'Credit_History_Age' em uma feature numérica... ---

--- Preenchendo valores ausentes (NaNs)... ---

--- Verificação Pós-Limpeza (Não deve haver valores nulos) ---
Age                            0
Occupation                     0
Annual_Income                  0
Monthly_Inhand_Salary          0
Num_Bank_Accounts              0
Num_Credit_Card                0
Num_of_Loan                    0
Delay_from_due_date            0
Num_of_Delayed_Payment         0
Changed_Credit_Limit           0
Num_Credit_Inquiries           0
Credit_Mix                     0
Outstanding_Debt               0
Credit_Utilization_Ratio       0
Total_EMI_per_month            0
Amount_invested_monthly        0
Monthly_Balance                0
Credit_Score                   0
Credit_History_Total_Months    0
dtype: int64


Unnamed: 0,Age,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance,Credit_Score,Credit_History_Total_Months
0,23.0,Scientist,19114.12,1824.843333,3.0,4.0,4,3.0,7,11.27,4.0,Standard,809.98,26.82262,49.574949,80.415295,312.494089,Good,265.0
1,23.0,Scientist,19114.12,3093.745,3.0,4.0,4,-1.0,19,11.27,4.0,Good,809.98,31.94496,49.574949,118.280222,284.629162,Good,219.0
2,33.0,Scientist,19114.12,3093.745,3.0,4.0,4,3.0,7,_,4.0,Good,809.98,28.609352,49.574949,81.699521,331.209863,Good,267.0
3,23.0,Scientist,19114.12,3093.745,3.0,4.0,4,5.0,4,6.27,4.0,Good,809.98,31.377862,49.574949,199.458074,223.45131,Good,268.0
4,23.0,Scientist,19114.12,1824.843333,3.0,4.0,4,6.0,19,11.27,4.0,Good,809.98,24.797347,49.574949,41.420153,341.489231,Good,269.0
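
The preview above still shows `_` in `Changed_Credit_Limit`, because only two text columns are converted by the cleaning function. One possible way to handle the remaining text columns is sketched below. The helper `to_numeric_clean` and the sample value with a trailing underscore are hypothetical; this is not the notebook's method, only an illustration under those assumptions:

```python
import pandas as pd


def to_numeric_clean(series: pd.Series) -> pd.Series:
    """Strip everything except digits, dot and minus, then parse as float.

    Entries that end up empty or unparseable become NaN.
    """
    cleaned = series.astype(str).str.replace(r"[^0-9.\-]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical raw values, including a bare placeholder and a trailing underscore
raw = pd.Series(["11.27", "_", "6.27", "19114.12_"])
values = to_numeric_clean(raw)
print(values.tolist())
```

Applied before imputation, a conversion like this would let the median strategy cover these columns instead of the most-frequent strategy used for text.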


In [32]:
df_test = limpeza_dados(df_test)
df_test.head()


--- Corrigindo idade negativa... ---
--- Corrigindo valor '_' em Credit_Mix... ---

--- Convertendo colunas de texto para numérico... ---
--- Transformando 'Credit_History_Age' em uma feature numérica... ---

--- Preenchendo valores ausentes (NaNs)... ---

--- Verificação Pós-Limpeza (Não deve haver valores nulos) ---
Age                            0
Occupation                     0
Annual_Income                  0
Monthly_Inhand_Salary          0
Num_Bank_Accounts              0
Num_Credit_Card                0
Num_of_Loan                    0
Delay_from_due_date            0
Num_of_Delayed_Payment         0
Changed_Credit_Limit           0
Num_Credit_Inquiries           0
Credit_Mix                     0
Outstanding_Debt               0
Credit_Utilization_Ratio       0
Total_EMI_per_month            0
Amount_invested_monthly        0
Monthly_Balance                0
Credit_History_Total_Months    0
dtype: int64


Unnamed: 0,Age,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance,Credit_History_Total_Months
0,23.0,Scientist,19114.12,1824.843333,3.0,4.0,4,3.0,7,11.27,2022.0,Good,809.98,35.030402,49.574949,236.642682,186.266702,273.0
1,24.0,Scientist,19114.12,1824.843333,3.0,4.0,4,3.0,9,13.27,4.0,Good,809.98,33.053114,49.574949,21.46538,361.444004,274.0
2,24.0,Scientist,19114.12,1824.843333,3.0,4.0,4,-1.0,4,12.27,4.0,Good,809.98,33.811894,49.574949,148.233938,264.675446,225.0
3,24.0,Scientist,19114.12,3086.305,3.0,4.0,4,4.0,5,11.27,4.0,Good,809.98,32.430559,49.574949,39.082511,343.826873,276.0
4,28.0,_______,34847.84,3037.986667,2.0,4.0,1,3.0,1,5.42,5.0,Good,605.03,25.926822,18.816215,39.684018,485.298434,327.0
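
Calling `limpeza_dados` on each dataset fits a fresh imputer per call, so the test set is filled with its own medians rather than those learned from training data. A common alternative, sketched here on toy data (the values are assumptions for illustration), is to fit the imputer once on the training set and reuse it on the test set:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for the train and test sets (hypothetical values)
train = pd.DataFrame({"Monthly_Inhand_Salary": [1000.0, np.nan, 3000.0, 5000.0]})
test = pd.DataFrame({"Monthly_Inhand_Salary": [np.nan, 2000.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)   # learns the median from train only
test_filled = imputer.transform(test)         # reuses the train median on test

print(imputer.statistics_)   # median learned from the training column
print(test_filled.ravel())
```

Keeping the fitted imputer around also makes it possible to apply the same fill values later, for example when serving predictions through the API.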


In [30]:
def tranformando_dados(df):
    # 2.1 Encoding nominal categorical variables (One-Hot Encoding)
    # --------------------------------------------------------------------
    cols_to_onehot = ['Occupation']
    print(f"\n--- Aplicando One-Hot Encoding nas colunas: {cols_to_onehot} ---")

    # `handle_unknown='ignore'` avoids errors if a new category shows up at transform time
    # `sparse_output=False` returns a dense array (easier to inspect)
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    encoded_features = ohe.fit_transform(df[cols_to_onehot])

    # Build a new DataFrame with the encoded columns, aligned to the original index
    df_encoded = pd.DataFrame(
        encoded_features,
        columns=ohe.get_feature_names_out(cols_to_onehot),
        index=df.index,
    )

    # Join the original DataFrame with the encoded columns
    df = pd.concat([df.drop(columns=cols_to_onehot), df_encoded], axis=1)


    # 2.2 Encoding ordinal categorical variables (Ordinal Encoding)
    # --------------------------------------------------------------------
    # Used for columns where order matters, such as 'Credit_Mix'.
    # The target 'Credit_Score' is not encoded here and stays as text.
    print("\n--- Aplicando Ordinal Encoding em 'Credit_Mix'")

    # Define the explicit category order
    credit_mix_order = ['Bad', 'Standard', 'Good']  # worst to best; 'Poor' is a Credit_Score label, not a Credit_Mix value

    # Instantiate the encoder with the defined categories
    ordinal_encoder = OrdinalEncoder(categories=[credit_mix_order])

    # Apply the encoding
    cols_to_ordinal = ['Credit_Mix']
    df[cols_to_ordinal] = ordinal_encoder.fit_transform(df[cols_to_ordinal])


    # ==============================================================================
    # FINAL RESULT
    # ==============================================================================
    print("\n\n--- DataFrame Final (Pronto para Machine Learning) ---")
    print(df.head())

    print("\n--- Informações Finais do DataFrame (todas as colunas devem ser numéricas) ---")
    df.info()
    return df
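
As a small illustration of how an explicit category list drives `OrdinalEncoder`, the sketch below encodes a toy `Credit_Mix` column. It assumes the ordering Bad < Standard < Good; the sample rows are made up for the example:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column (hypothetical rows); the category order below is an assumption
mix = pd.DataFrame({"Credit_Mix": ["Good", "Bad", "Standard", "Good"]})

# The position of each label in the list becomes its numeric code
encoder = OrdinalEncoder(categories=[["Bad", "Standard", "Good"]])
codes = encoder.fit_transform(mix[["Credit_Mix"]]).ravel()
print(codes.tolist())  # [2.0, 0.0, 1.0, 2.0]
```

Because the codes follow list position, the list should run monotonically from worst to best (or the reverse); otherwise the numeric feature no longer reflects the intended ranking.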

In [31]:
df = tranformando_dados(df)
df.head()


--- Aplicando One-Hot Encoding nas colunas: ['Occupation'] ---

--- Aplicando Ordinal Encoding em 'Credit_Mix'


--- DataFrame Final (Pronto para Machine Learning) ---
    Age Annual_Income  Monthly_Inhand_Salary  Num_Bank_Accounts  \
0  23.0      19114.12            1824.843333                3.0   
1  23.0      19114.12            3093.745000                3.0   
2  33.0      19114.12            3093.745000                3.0   
3  23.0      19114.12            3093.745000                3.0   
4  23.0      19114.12            1824.843333                3.0   

   Num_Credit_Card Num_of_Loan  Delay_from_due_date Num_of_Delayed_Payment  \
0              4.0           4                  3.0                      7   
1              4.0           4                 -1.0                     19   
2              4.0           4                  3.0                      7   
3              4.0           4                  5.0                      4   
4              4.0           4        

Unnamed: 0,Age,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,...,Occupation_Journalist,Occupation_Lawyer,Occupation_Manager,Occupation_Mechanic,Occupation_Media_Manager,Occupation_Musician,Occupation_Scientist,Occupation_Teacher,Occupation_Writer,Occupation________
0,23.0,19114.12,1824.843333,3.0,4.0,4,3.0,7,11.27,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,23.0,19114.12,3093.745,3.0,4.0,4,-1.0,19,11.27,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,33.0,19114.12,3093.745,3.0,4.0,4,3.0,7,_,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,23.0,19114.12,3093.745,3.0,4.0,4,5.0,4,6.27,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,23.0,19114.12,1824.843333,3.0,4.0,4,6.0,19,11.27,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [33]:
df_test = tranformando_dados(df_test)
df_test.head()


--- Aplicando One-Hot Encoding nas colunas: ['Occupation'] ---

--- Aplicando Ordinal Encoding em 'Credit_Mix'


--- DataFrame Final (Pronto para Machine Learning) ---
    Age Annual_Income  Monthly_Inhand_Salary  Num_Bank_Accounts  \
0  23.0      19114.12            1824.843333                3.0   
1  24.0      19114.12            1824.843333                3.0   
2  24.0      19114.12            1824.843333                3.0   
3  24.0      19114.12            3086.305000                3.0   
4  28.0      34847.84            3037.986667                2.0   

   Num_Credit_Card Num_of_Loan  Delay_from_due_date Num_of_Delayed_Payment  \
0              4.0           4                  3.0                      7   
1              4.0           4                  3.0                      9   
2              4.0           4                 -1.0                      4   
3              4.0           4                  4.0                      5   
4              4.0           1        

Unnamed: 0,Age,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,...,Occupation_Journalist,Occupation_Lawyer,Occupation_Manager,Occupation_Mechanic,Occupation_Media_Manager,Occupation_Musician,Occupation_Scientist,Occupation_Teacher,Occupation_Writer,Occupation________
0,23.0,19114.12,1824.843333,3.0,4.0,4,3.0,7,11.27,2022.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,24.0,19114.12,1824.843333,3.0,4.0,4,3.0,9,13.27,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,24.0,19114.12,1824.843333,3.0,4.0,4,-1.0,4,12.27,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,24.0,19114.12,3086.305,3.0,4.0,4,4.0,5,11.27,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,28.0,34847.84,3037.986667,2.0,4.0,1,3.0,1,5.42,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [34]:
df.to_csv("../data/processed/train_processed.csv", index=False)
df_test.to_csv("../data/processed/test_processed.csv", index=False)
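
Before relying on the processed files, it can help to confirm which columns are still non-numeric, since the previews above show text values surviving the pipeline. The helper below is a hypothetical sketch on toy data, not part of the original notebook:

```python
import pandas as pd


def non_numeric_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric."""
    return frame.select_dtypes(exclude="number").columns.tolist()


# Toy frame (hypothetical values): one column is still stored as text
toy = pd.DataFrame({
    "Age": [23.0, 24.0],
    "Annual_Income": ["19114.12", "34847.84"],
    "Credit_Mix": [2.0, 1.0],
})
remaining = non_numeric_columns(toy)
print(remaining)  # ['Annual_Income']
```

An empty list would indicate that every column is ready for estimators that require numeric input.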

## Data Cleaning - Test