# Ana Cristina Zanetti

## Problem description and data explanation

### Build a supervised learning model that classifies a customer's credit score as Poor, Standard, or Good.

### The model is built from a customer history file that contains the following columns:

1. ID -> Represents the unique identifier of an input record
1. Customer_ID -> Represents the unique identifier of a customer
1. Month -> Represents the month in which the record was created
1. Name -> Represents the customer's name
1. Age -> Represents the customer's age
1. SSN -> Represents the customer's social security number
1. Occupation -> Represents the customer's occupation/profession
1. Annual_Income -> Represents the customer's annual income
1. Monthly_Inhand_Salary -> Represents the customer's monthly base salary
1. Num_Bank_Accounts -> Represents the number of bank accounts the customer holds
1. Num_Credit_Card -> Represents the number of credit cards held by the customer
1. Interest_Rate -> Represents the interest rate
1. Num_of_Loan -> Represents the number of loans taken from the bank
1. Type_of_Loan -> Represents the types of loan taken by the customer
1. Delay_from_due_date -> Represents the average number of days of delay past the payment due date
1. Num_of_Delayed_Payment -> Represents the average number of payments delayed by the customer
1. Changed_Credit_Limit -> Represents the percentage change in the credit limit
1. Num_Credit_Inquiries -> Represents the number of credit inquiries
1. Credit_Mix -> Represents the classification of the credit mix
1. Outstanding_Debt -> Represents the total outstanding debt still to be paid (in USD)
1. Credit_Utilization_Ratio -> Represents the credit utilization ratio
1. Credit_History_Age -> Represents the age of the customer's credit history
1. Payment_of_Min_Amount -> Represents whether the customer paid only the minimum amount
1. Total_EMI_per_month -> Represents the monthly EMI payments (loan installments, in USD)
1. Amount_invested_monthly -> Represents the amount the customer invests each month (in USD)
1. Payment_Behaviour -> Represents the customer's payment behaviour (in USD)
1. Monthly_Balance -> Represents the customer's monthly balance (in USD)
1. Credit_Score -> Represents the credit score bracket (Poor, Standard, Good)



## Importing libraries

In [25]:
import pandas as pd
import numpy as np
import re

In [26]:
df_credito = pd.read_csv('supervisionado/dataset/train.csv')
df_credito.head()


| | ID | Customer_ID | Month | Name | Age | SSN | Occupation | Annual_Income | Monthly_Inhand_Salary | Num_Bank_Accounts | ... | Credit_Mix | Outstanding_Debt | Credit_Utilization_Ratio | Credit_History_Age | Payment_of_Min_Amount | Total_EMI_per_month | Amount_invested_monthly | Payment_Behaviour | Monthly_Balance | Credit_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0x1602 | CUS_0xd40 | January | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | _ | 809.98 | 26.82262 | 22 Years and 1 Months | No | 49.574949 | 80.41529543900253 | High_spent_Small_value_payments | 312.49408867943663 | Good |
| 1 | 0x1603 | CUS_0xd40 | February | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 809.98 | 31.94496 | NaN | No | 49.574949 | 118.28022162236736 | Low_spent_Large_value_payments | 284.62916249607184 | Good |
| 2 | 0x1604 | CUS_0xd40 | March | Aaron Maashoh | -500 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 809.98 | 28.609352 | 22 Years and 3 Months | No | 49.574949 | 81.699521264648 | Low_spent_Medium_value_payments | 331.2098628537912 | Good |
| 3 | 0x1605 | CUS_0xd40 | April | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 809.98 | 31.377862 | 22 Years and 4 Months | No | 49.574949 | 199.4580743910713 | Low_spent_Small_value_payments | 223.45130972736783 | Good |
| 4 | 0x1606 | CUS_0xd40 | May | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | Good | 809.98 | 24.797347 | 22 Years and 5 Months | No | 49.574949 | 41.420153086217326 | High_spent_Medium_value_payments | 341.48923103222177 | Good |


*Already at load time, the Monthly_Balance column shows mixed data types and needs cleaning; it should be numeric.*
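
One possible way to handle a mixed-type column like this is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a toy Series (`saldo`, with a made-up junk value) rather than the real `Monthly_Balance` data, so it only illustrates the mechanism.

```python
import pandas as pd

# Toy stand-in for a column that mixes floats, numeric strings and junk text.
# The junk value is invented for illustration; it is not taken from the dataset.
saldo = pd.Series([312.49, "284.63", "not_a_number", None], dtype="object")

# errors="coerce" turns every value that cannot be parsed into NaN
saldo_num = pd.to_numeric(saldo, errors="coerce")

print(saldo_num.dtype)         # float64
print(saldo_num.isna().sum())  # 2
```

Counting the NaN values before and after the conversion shows how many entries were unparseable, which helps decide whether coercion is acceptable.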

In [27]:
df_credito.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob

*Many numeric fields are stored with dtype object.*
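
To find which non-numeric columns actually hold numbers, one option is to measure the share of values that parse as numeric. This is a minimal sketch on a toy frame; `exemplo` and `fracao_numerica` are hypothetical names, and the values only mimic what the outputs in this notebook show.

```python
import pandas as pd

# Toy frame; the values mimic what head() and unique() display in this notebook
exemplo = pd.DataFrame({
    "Age": ["23", "23", "-500", "23"],
    "Occupation": ["Scientist", "_______", "Teacher", "Engineer"],
    "Num_Bank_Accounts": [3, 3, 3, 3],
})

def fracao_numerica(coluna):
    """Share of non-null values that can be parsed as numbers."""
    convertida = pd.to_numeric(coluna, errors="coerce")
    return float(convertida.notna().sum() / max(coluna.notna().sum(), 1))

# Only columns that are not already numeric need to be inspected
colunas_nao_numericas = [
    c for c in exemplo.columns
    if not pd.api.types.is_numeric_dtype(exemplo[c])
]
relatorio = {c: fracao_numerica(exemplo[c]) for c in colunas_nao_numericas}
print(relatorio)  # {'Age': 1.0, 'Occupation': 0.0}
```

Columns with a share close to 1.0 are candidates for numeric conversion; columns close to 0.0 are probably genuine categories.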

## All columns that uniquely identify a record or a customer are dropped because they do not contribute to the Credit Score

In [28]:
colunas_excluir = ['ID', 'Customer_ID', 'Name','SSN',]
df_credito = df_credito.drop(colunas_excluir, axis=1)

In [29]:
df_credito.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 24 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Month                     100000 non-null  object 
 1   Age                       100000 non-null  object 
 2   Occupation                100000 non-null  object 
 3   Annual_Income             100000 non-null  object 
 4   Monthly_Inhand_Salary     84998 non-null   float64
 5   Num_Bank_Accounts         100000 non-null  int64  
 6   Num_Credit_Card           100000 non-null  int64  
 7   Interest_Rate             100000 non-null  int64  
 8   Num_of_Loan               100000 non-null  object 
 9   Type_of_Loan              88592 non-null   object 
 10  Delay_from_due_date       100000 non-null  int64  
 11  Num_of_Delayed_Payment    92998 non-null   object 
 12  Changed_Credit_Limit      100000 non-null  object 
 13  Num_Credit_Inquiries      98035 non-null   fl

## Identifying the treatment the data needs

### Occupation

In [30]:
df_credito['Occupation'].unique()

array(['Scientist', '_______', 'Teacher', 'Engineer', 'Entrepreneur',
       'Developer', 'Lawyer', 'Media_Manager', 'Doctor', 'Journalist',
       'Manager', 'Accountant', 'Musician', 'Mechanic', 'Writer',
       'Architect'], dtype=object)
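
The `'_______'` entry in the output above looks like a placeholder rather than a real occupation. Assuming it marks a missing value (the notebook does not confirm this), it could be mapped to NaN as sketched below on a toy Series.

```python
import numpy as np
import pandas as pd

# Toy values; '_______' is the placeholder seen in the unique() output
ocupacao = pd.Series(["Scientist", "_______", "Teacher", "_______"])

# Assumption: the placeholder means "unknown", so it is treated as missing
ocupacao_limpa = ocupacao.replace("_______", np.nan)

print(ocupacao_limpa.isna().sum())  # 2
```

Keeping the placeholder as its own category is an equally valid choice if "unknown occupation" turns out to carry signal for the model.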

In [31]:
df_credito.describe()

| | Monthly_Inhand_Salary | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Delay_from_due_date | Num_Credit_Inquiries | Credit_Utilization_Ratio | Total_EMI_per_month |
|---|---|---|---|---|---|---|---|---|
| count | 84998.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 98035.0 | 100000.0 | 100000.0 |
| mean | 4194.17085 | 17.09128 | 22.47443 | 72.46604 | 21.06878 | 27.754251 | 32.285173 | 1403.118217 |
| std | 3183.686167 | 117.404834 | 129.05741 | 466.422621 | 14.860104 | 193.177339 | 5.116875 | 8306.04127 |
| min | 303.645417 | -1.0 | 0.0 | 1.0 | -5.0 | 0.0 | 20.0 | 0.0 |
| 25% | 1625.568229 | 3.0 | 4.0 | 8.0 | 10.0 | 3.0 | 28.052567 | 30.30666 |
| 50% | 3093.745 | 6.0 | 5.0 | 13.0 | 18.0 | 6.0 | 32.305784 | 69.249473 |
| 75% | 5957.448333 | 7.0 | 7.0 | 20.0 | 28.0 | 9.0 | 36.496663 | 161.224249 |
| max | 15204.633333 | 1798.0 | 1499.0 | 5797.0 | 67.0 | 2597.0 | 50.0 | 82331.0 |


In [41]:
# Remove the " and " connectors and the space after each comma
# Apply get_dummies with "," as the separator
# Concatenate the DataFrames and drop the original column

df_credito['Type_of_Loan'] = df_credito['Type_of_Loan'].str.replace(' and ','')
df_credito['Type_of_Loan'] = df_credito['Type_of_Loan'].str.replace(', ',',')


df_cat = df_credito['Type_of_Loan'].str.get_dummies(sep=',')


df_credito = pd.concat([df_credito, df_cat], axis=1)
df_credito = df_credito.drop('Type_of_Loan', axis=1)

df_credito


| | ID | Customer_ID | Month | Name | Age | SSN | Occupation | Annual_Income | Monthly_Inhand_Salary | Num_Bank_Accounts | ... | Credit_Score | Auto Loan | Credit-Builder Loan | Debt Consolidation Loan | Home Equity Loan | Mortgage Loan | Not Specified | Payday Loan | Personal Loan | Student Loan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0x1602 | CUS_0xd40 | January | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | Good | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0x1603 | CUS_0xd40 | February | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0x1604 | CUS_0xd40 | March | Aaron Maashoh | -500 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0x1605 | CUS_0xd40 | April | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | NaN | 3 | ... | Good | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0x1606 | CUS_0xd40 | May | Aaron Maashoh | 23 | 821-00-0265 | Scientist | 19114.12 | 1824.843333 | 3 | ... | Good | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | 0x25fe9 | CUS_0x942c | April | Nicks | 25 | 078-73-5990 | Mechanic | 39628.99 | 3359.415833 | 4 | ... | Poor | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 99996 | 0x25fea | CUS_0x942c | May | Nicks | 25 | 078-73-5990 | Mechanic | 39628.99 | 3359.415833 | 4 | ... | Poor | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 99997 | 0x25feb | CUS_0x942c | June | Nicks | 25 | 078-73-5990 | Mechanic | 39628.99 | 3359.415833 | 4 | ... | Poor | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 99998 | 0x25fec | CUS_0x942c | July | Nicks | 25 | 078-73-5990 | Mechanic | 39628.99 | 3359.415833 | 4 | ... | Standard | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
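
The multi-label encoding used for `Type_of_Loan` can be checked on a small example. The sketch below applies the same two replacements and `str.get_dummies` to a toy Series; the rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy values in the same comma/"and" format; None stands for a missing entry
emprestimos = pd.Series([
    "Auto Loan, Credit-Builder Loan, and Personal Loan",
    "Student Loan",
    None,
])

# Same normalization as in the cell above, with regex=False made explicit
normalizado = (
    emprestimos
    .str.replace(" and ", "", regex=False)
    .str.replace(", ", ",", regex=False)
)

# One indicator column per distinct loan type; missing rows become all zeros
indicadores = normalizado.str.get_dummies(sep=",")
print(indicadores)
```

Note that removing `" and "` outright relies on the connector always following a comma. A value written as `"A and B"` with no comma would be merged into a single label, so it may be worth checking the raw values for that pattern.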


In [32]:
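# shape is an attribute, not a method, so it takes no parentheses.
# The NameError below appears because this cell ran (In [32]) before df_cat was created (In [41]).
df_cat.shape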

NameError: name 'df_cat' is not defined

In [34]:
df_dummies.columns

Index(['Auto Loan', 'Credit-Builder Loan', 'Debt Consolidation Loan',
       'Home Equity Loan', 'Mortgage Loan', 'Not Specified', 'Payday Loan',
       'Personal Loan', 'Student Loan'],
      dtype='object')

In [33]:
df_credito.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 24 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Month                     100000 non-null  object 
 1   Age                       100000 non-null  object 
 2   Occupation                100000 non-null  object 
 3   Annual_Income             100000 non-null  object 
 4   Monthly_Inhand_Salary     84998 non-null   float64
 5   Num_Bank_Accounts         100000 non-null  int64  
 6   Num_Credit_Card           100000 non-null  int64  
 7   Interest_Rate             100000 non-null  int64  
 8   Num_of_Loan               100000 non-null  object 
 9   Type_of_Loan              88592 non-null   object 
 10  Delay_from_due_date       100000 non-null  int64  
 11  Num_of_Delayed_Payment    92998 non-null   object 
 12  Changed_Credit_Limit      100000 non-null  object 
 13  Num_Credit_Inquiries      98035 non-null   fl

In [35]:
import pandas as pd
import re

# Example DataFrame
data = {'coluna_com_caracteres': ['123abc', '-456def', '789ghi','-500']}
df = pd.DataFrame(data)

# Remove every character that is neither a digit nor a minus sign
def remover_nao_numericos(valor):
    return re.sub(r'[^-\d]', '', valor)

# Apply the function to the DataFrame column
df['coluna_com_caracteres'] = df['coluna_com_caracteres'].apply(remover_nao_numericos)

# Result
print(df)


  coluna_com_caracteres
0                   123
1                  -456
2                   789
3                  -500
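
The pattern `[^-\d]` has two side effects worth knowing before applying it to other columns: it removes decimal points and it keeps hyphens in any position. The sketch below shows both, together with a hypothetical alternative (`extrair_numero`, not part of the notebook) that extracts the first signed decimal number instead.

```python
import re

padrao = r"[^-\d]"  # same pattern as in the example above

print(re.sub(padrao, "", "-500"))         # -500
print(re.sub(padrao, "", "19114.12"))     # 1911412 (decimal point removed)
print(re.sub(padrao, "", "821-00-0265"))  # 821-00-0265 (inner hyphens kept)

# Hypothetical variant: return the first number found, keeping an optional
# leading minus sign and an optional decimal part
def extrair_numero(texto):
    achado = re.search(r"-?\d+(?:\.\d+)?", texto)
    return achado.group(0) if achado else ""

print(extrair_numero("-456def"))   # -456
print(extrair_numero("19114.12"))  # 19114.12
```

Which behaviour is preferable depends on the column: the original pattern is sufficient for integer fields, while decimal fields need the decimal point preserved.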


### Data cleaning functions

In [43]:
# Remove non-numeric characters (digits and the minus sign are kept).
# Note: the decimal point is removed too, so this suits integer-like columns.
def remover_nao_numericos(coluna):
    return coluna.apply(lambda valor: re.sub(r'[^-\d]', '', str(valor)))

# Convert to numeric; values that cannot be parsed become NaN
def transformar_em_numero(coluna):
    return pd.to_numeric(coluna, errors='coerce')

# Replace negative values with NaN
def limpar_valores_negativos(coluna):
    return coluna.apply(lambda valor: np.nan if valor < 0 else valor)

# Keep only values within [limite_inferior, limite_superior]; the rest become NaN
def delimitar_valores(coluna, limite_inferior, limite_superior):
    return coluna.apply(lambda valor: valor if limite_inferior <= valor <= limite_superior else np.nan)
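
The four steps above (strip characters, convert, drop negatives, bound the range) could also be combined into one vectorized helper. This is a sketch under assumptions: `limpar_coluna_inteira` is a hypothetical name, the 0 to 100 limits are arbitrary, and apart from `"-500"` the toy values are invented.

```python
import pandas as pd

def limpar_coluna_inteira(coluna, limite_inferior, limite_superior):
    """Vectorized sketch of the cleaning steps for integer-like columns."""
    # Keep only digits and minus signs, as in remover_nao_numericos
    texto = coluna.astype(str).str.replace(r"[^-\d]", "", regex=True)
    # Unparseable strings (including empty ones) become NaN
    numeros = pd.to_numeric(texto, errors="coerce")
    # between() is inclusive on both ends; values outside the range,
    # including negatives when limite_inferior >= 0, become NaN
    return numeros.where(numeros.between(limite_inferior, limite_superior))

# Toy ages: a valid value, a negative, trailing noise, a missing entry, an outlier
idade = pd.Series(["23", "-500", "28_", None, "7580"])
resultado = limpar_coluna_inteira(idade, 0, 100)
print(resultado.tolist())  # [23.0, nan, 28.0, nan, nan]
```

With a non-negative lower limit, the range check already covers the negative-value step, so a separate call to `limpar_valores_negativos` is only needed when no range is applied.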

