
# Business-Rule-Driven Financial Data Cleaning

**Module 4 - T2Ti ERP Data for AI: Manipulation and Cleaning**  
**Video 04 - Business-Rule-Driven Financial Data Cleaning**

In this notebook we perform **financial data cleaning**.

> ⚠️ Important  
> - Not every null is an error
> - Not every repetition is a duplicate
> - A categorical field ≠ a key
> - A wrong value weighs more than a missing value
> - AI needs coherence, not perfection  

This notebook will be **extended in the upcoming videos** of the module.



## Database Connection

We will use:
- MySQL
- SQLAlchemy
- PyMySQL (a stable, pure-Python driver)


In [3]:

import pandas as pd
from sqlalchemy import create_engine
import pymysql

# Registers PyMySQL as a drop-in replacement for MySQLdb (optional here, since the URL already selects the pymysql driver)
pymysql.install_as_MySQLdb()

DATABASE_URL = "mysql+pymysql://root:root@localhost/fenix"
engine = create_engine(DATABASE_URL)



## Base Dataset for the Module

This dataset represents **Accounts Receivable** and will be used through the end of the module.


In [4]:

sql = """
SELECT
    pr.id                             AS parcela_id,
    pr.numero_parcela                 AS numero_parcela,
    pr.data_emissao                   AS data_emissao,
    pr.data_vencimento                AS data_vencimento,
    pr.data_recebimento               AS data_recebimento,
    pr.valor                          AS valor_parcela,
    pr.valor_recebido                 AS valor_recebido,
    pr.valor_juro                     AS valor_juro,
    pr.valor_multa                    AS valor_multa,
    pr.valor_desconto                 AS valor_desconto,
    pr.emitiu_boleto                  AS emitiu_boleto,
    pr.boleto_nosso_numero            AS boleto_nosso_numero,
    s.situacao                        AS codigo_status,
    s.descricao                       AS descricao_status,
    lr.id                             AS lancamento_id,
    lr.valor_a_receber                AS valor_total_lancamento,
    lr.data_lancamento                AS data_lancamento,
    c.id                              AS cliente_id,
    c.nome                            AS cliente_nome,
    c.tipo                            AS cliente_tipo,
    c.limite_credito                  AS limite_credito,
    nf.codigo                         AS natureza_codigo,
    nf.descricao                      AS natureza_descricao,
    nf.tipo                           AS natureza_tipo
FROM fin_parcela_receber pr
JOIN fin_lancamento_receber lr
    ON lr.id = pr.id_fin_lancamento_receber
JOIN view_pessoa_cliente c
    ON c.id = lr.id_cliente
JOIN fin_status_parcela s
    ON s.id = pr.id_fin_status_parcela
JOIN fin_natureza_financeira nf
    ON nf.id = lr.id_fin_natureza_financeira
"""

df = pd.read_sql(sql, engine)
df_limpo = df.copy()
df.head()


Unnamed: 0,parcela_id,numero_parcela,data_emissao,data_vencimento,data_recebimento,valor_parcela,valor_recebido,valor_juro,valor_multa,valor_desconto,...,lancamento_id,valor_total_lancamento,data_lancamento,cliente_id,cliente_nome,cliente_tipo,limite_credito,natureza_codigo,natureza_descricao,natureza_tipo
0,13,1,2023-02-15,2025-03-20,2025-03-20,1800.5,1746.485,0.0,0.0,54.015,...,2,1800.5,2025-02-15,2,TESTE PESSOA FISICA,F,1000.0,2103,Despesa Comercial,D
1,57,1,2025-10-26,2025-11-25,2025-12-30,196.057167,196.057167,,,,...,21,588.171501,2025-12-25,2,TESTE PESSOA FISICA,F,1000.0,101,Venda de Mercadorias,R
2,58,2,2025-10-26,2025-12-25,,196.057167,0.0,,,,...,21,588.171501,2025-12-25,2,TESTE PESSOA FISICA,F,1000.0,101,Venda de Mercadorias,R
3,59,3,2025-10-26,2026-01-24,,196.057167,0.0,,,,...,21,588.171501,2025-12-25,2,TESTE PESSOA FISICA,F,1000.0,101,Venda de Mercadorias,R
4,270,1,2025-10-26,2025-11-25,2025-11-25,267.914668,267.914668,,,,...,60,535.829336,2025-12-25,2,TESTE PESSOA FISICA,F,1000.0,101,Venda de Mercadorias,R



## Dataset Overview

Before making any decision, we need to understand:
- Data volume
- Structure
- Column types


In [5]:
df.shape

(2677, 24)

In [6]:
df.columns

Index(['parcela_id', 'numero_parcela', 'data_emissao', 'data_vencimento',
       'data_recebimento', 'valor_parcela', 'valor_recebido', 'valor_juro',
       'valor_multa', 'valor_desconto', 'emitiu_boleto', 'boleto_nosso_numero',
       'codigo_status', 'descricao_status', 'lancamento_id',
       'valor_total_lancamento', 'data_lancamento', 'cliente_id',
       'cliente_nome', 'cliente_tipo', 'limite_credito', 'natureza_codigo',
       'natureza_descricao', 'natureza_tipo'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2677 entries, 0 to 2676
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   parcela_id              2677 non-null   int64  
 1   numero_parcela          2677 non-null   int64  
 2   data_emissao            2677 non-null   object 
 3   data_vencimento         2677 non-null   object 
 4   data_recebimento        337 non-null    object 
 5   valor_parcela           2677 non-null   float64
 6   valor_recebido          2671 non-null   float64
 7   valor_juro              9 non-null      float64
 8   valor_multa             9 non-null      float64
 9   valor_desconto          9 non-null      float64
 10  emitiu_boleto           14 non-null     object 
 11  boleto_nosso_numero     4 non-null      object 
 12  codigo_status           2677 non-null   object 
 13  descricao_status        2677 non-null   object 
 14  lancamento_id           2677 non-null   


## Missing-Value Diagnosis

Not every missing value is an error.
In financial data, many missing values are **legitimate**.


In [8]:
df.isnull().sum()

parcela_id                   0
numero_parcela               0
data_emissao                 0
data_vencimento              0
data_recebimento          2340
valor_parcela                0
valor_recebido               6
valor_juro                2668
valor_multa               2668
valor_desconto            2668
emitiu_boleto             2663
boleto_nosso_numero       2673
codigo_status                0
descricao_status             0
lancamento_id                0
valor_total_lancamento       0
data_lancamento              0
cliente_id                   0
cliente_nome                 0
cliente_tipo                 0
limite_credito               0
natureza_codigo              0
natureza_descricao           0
natureza_tipo                0
dtype: int64

In [9]:
(df.isnull().mean() * 100).sort_values(ascending=False)

boleto_nosso_numero       99.850579
valor_juro                99.663803
valor_multa               99.663803
valor_desconto            99.663803
emitiu_boleto             99.477027
data_recebimento          87.411281
valor_recebido             0.224131
parcela_id                 0.000000
valor_parcela              0.000000
numero_parcela             0.000000
data_emissao               0.000000
data_vencimento            0.000000
codigo_status              0.000000
descricao_status           0.000000
lancamento_id              0.000000
valor_total_lancamento     0.000000
data_lancamento            0.000000
cliente_id                 0.000000
cliente_nome               0.000000
cliente_tipo               0.000000
limite_credito             0.000000
natureza_codigo            0.000000
natureza_descricao         0.000000
natureza_tipo              0.000000
dtype: float64

In [10]:
df_limpo['data_recebimento'].isnull().mean() * 100

np.float64(87.41128128502055)

In [11]:
df_limpo[['valor_recebido', 'valor_juro', 'valor_multa', 'valor_desconto']].isnull().sum()

valor_recebido       6
valor_juro        2668
valor_multa       2668
valor_desconto    2668
dtype: int64

In [12]:
col_valores = ['valor_recebido', 'valor_juro', 'valor_multa', 'valor_desconto']
df_limpo[col_valores] = df_limpo[col_valores].fillna(0)

In [13]:
df_limpo[['valor_recebido', 'valor_juro', 'valor_multa', 'valor_desconto']].isnull().sum()

valor_recebido    0
valor_juro        0
valor_multa       0
valor_desconto    0
dtype: int64
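
Filling with 0 removes the nulls, but it also erases the difference between "no interest was charged" and "nothing was recorded". One way to keep that information is to store a missing-value indicator before filling. The sketch below is only an illustration on toy data (hypothetical values); the `_ausente` suffix for the indicator columns is an assumption, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with two of the notebook's monetary columns
df_exemplo = pd.DataFrame({
    "valor_recebido": [100.0, np.nan, 0.0],
    "valor_juro": [np.nan, np.nan, 2.5],
})
col_valores = ["valor_recebido", "valor_juro"]

# Record which cells were originally missing, before filling them
for col in col_valores:
    df_exemplo[f"{col}_ausente"] = df_exemplo[col].isnull()

# Same fill as in the cell above: missing monetary values become 0
df_exemplo[col_valores] = df_exemplo[col_valores].fillna(0)

print(df_exemplo)
```

With the indicator columns in place, a later step can still tell a recorded zero from a value that was never filled in.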


## Duplicate Diagnosis

Duplicates can indicate:
- A technical problem (JOIN)
- A valid business situation
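
A quick way to tell the two cases apart is to compare full-row duplicates, repeats of the key, and repeats of a descriptive column. The sketch below uses toy data (hypothetical values) and assumes `parcela_id` is the only column that must be unique.

```python
import pandas as pd

# Toy data (hypothetical values); the second and third rows are identical
df_exemplo = pd.DataFrame({
    "parcela_id": [1, 2, 2, 3],
    "cliente_id": [10, 10, 10, 11],
    "valor_parcela": [50.0, 50.0, 50.0, 80.0],
})

# Identical rows: usually a technical problem, such as a JOIN that multiplies rows
linhas_identicas = int(df_exemplo.duplicated().sum())

# Repeated key: each installment should appear only once
chaves_repetidas = int(df_exemplo.duplicated(subset=["parcela_id"]).sum())

# Repeated customer: expected, since one customer can have many installments
clientes_repetidos = int(df_exemplo.duplicated(subset=["cliente_id"]).sum())

print(linhas_identicas, chaves_repetidas, clientes_repetidos)
```

Only the first two counts point to a problem; repetition in `cliente_id` is a normal business situation.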


In [14]:
df.duplicated().sum()
#df_limpo.drop_duplicates() # do not do this here, because it can remove important rows

np.int64(0)

In [15]:

df[df.duplicated(subset=["parcela_id"], keep=False)]
#df[df.duplicated(subset=["boleto_nosso_numero"], keep=False)]
#df[df["boleto_nosso_numero"].notna()].duplicated(subset=["boleto_nosso_numero"], keep=False)


Unnamed: 0,parcela_id,numero_parcela,data_emissao,data_vencimento,data_recebimento,valor_parcela,valor_recebido,valor_juro,valor_multa,valor_desconto,...,lancamento_id,valor_total_lancamento,data_lancamento,cliente_id,cliente_nome,cliente_tipo,limite_credito,natureza_codigo,natureza_descricao,natureza_tipo


## Standardizing Categorical Data
AI is sensitive to textual variations.

In [16]:
df_limpo['descricao_status'] = df_limpo['descricao_status'].str.upper().str.strip()
# Pago | PAGO | pago | ' pago'  -> PAGO
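
To see the effect of this normalization, it helps to count distinct values before and after. A small sketch on a toy Series (the status spellings are hypothetical):

```python
import pandas as pd

# Toy status values (hypothetical) with inconsistent case and whitespace
status = pd.Series(["Pago", "PAGO", " pago", "Aberto", "aberto "])

# Before: every spelling counts as a separate category
categorias_antes = status.nunique()

# After: same normalization as in the cell above
status_padronizado = status.str.upper().str.strip()
contagem_depois = status_padronizado.value_counts().to_dict()

print(categorias_antes)
print(contagem_depois)
```

Five apparent categories collapse into two, which is what a model should see.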


## Financial Date Diagnosis

Wrong dates have a severe impact on AI models.


In [17]:

col_datas = ["data_emissao", "data_vencimento", "data_recebimento"]

# Convert the date columns in both frames; values that cannot be parsed become NaT
df[col_datas] = df[col_datas].apply(pd.to_datetime, errors="coerce")
df_limpo[col_datas] = df_limpo[col_datas].apply(pd.to_datetime, errors="coerce")

# Last expression in the cell, so the notebook actually displays it
df[col_datas].describe()
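
Because `errors="coerce"` silently turns unparseable values into `NaT`, it is worth counting how many values were lost in the conversion and then checking a simple ordering rule. A sketch on toy data (hypothetical dates, including an impossible 30 February); the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical dates); "2025-02-30" does not exist in the calendar
df_exemplo = pd.DataFrame({
    "data_emissao": ["2025-10-26", "2025-02-30", "2025-03-01"],
    "data_vencimento": ["2025-11-25", "2025-12-25", "2025-01-10"],
})
col_datas = ["data_emissao", "data_vencimento"]

invalidas_por_coluna = {}
for col in col_datas:
    convertida = pd.to_datetime(df_exemplo[col], errors="coerce")
    # Present as text, but impossible to parse
    invalidas_por_coluna[col] = int((convertida.isnull() & df_exemplo[col].notnull()).sum())
    df_exemplo[col] = convertida

# Cross-check: an installment should not fall due before it is issued
vence_antes_da_emissao = df_exemplo["data_vencimento"] < df_exemplo["data_emissao"]

print(invalidas_por_coluna)
print(int(vence_antes_da_emissao.sum()))
```

Comparisons against `NaT` evaluate to `False`, so rows with an invalid date are not counted by the ordering check and need to be reviewed separately.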



## Financial Value Diagnosis

We analyze:
- Zero values
- Negative values
- Inconsistencies between expected and received amounts
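
The three checks listed above can be collected into a single summary. A minimal sketch on toy data (hypothetical values); the labels in `resumo` are assumptions chosen for readability.

```python
import pandas as pd

# Toy data (hypothetical values) covering each case once
df_exemplo = pd.DataFrame({
    "valor_parcela": [100.0, 200.0, 0.0, 150.0],
    "valor_recebido": [100.0, 250.0, 0.0, -10.0],
})

# One count per check: zero amounts, negative amounts, received above expected
resumo = pd.Series({
    "parcela_zerada": (df_exemplo["valor_parcela"] == 0).sum(),
    "recebido_negativo": (df_exemplo["valor_recebido"] < 0).sum(),
    "recebido_maior_que_previsto": (
        df_exemplo["valor_recebido"] > df_exemplo["valor_parcela"]
    ).sum(),
})

print(resumo)
```

A non-zero count is not automatically an error; it only marks rows that deserve a look under the business rules.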


In [18]:

df[
    ["valor_parcela", "valor_recebido", "valor_juro", "valor_multa", "valor_desconto"]
].describe()


Unnamed: 0,valor_parcela,valor_recebido,valor_juro,valor_multa,valor_desconto
count,2677.0,2671.0,9.0,9.0,9.0
mean,311.254725,76.554398,0.0,0.0,19.890556
std,446.848442,374.037696,0.0,0.0,43.275155
min,9.08364,0.0,0.0,0.0,0.0
25%,82.20971,0.0,0.0,0.0,0.0
50%,182.6966,0.0,0.0,0.0,0.0
75%,388.233587,0.0,0.0,0.0,0.0
max,5886.483326,5886.483326,0.0,0.0,125.0


In [19]:
df_limpo[
    ["valor_parcela", "valor_recebido", "valor_juro", "valor_multa", "valor_desconto"]
].describe()

Unnamed: 0,valor_parcela,valor_recebido,valor_juro,valor_multa,valor_desconto
count,2677.0,2677.0,2677.0,2677.0,2677.0
mean,311.254725,76.382816,0.0,0.0,0.066871
std,446.848442,373.635681,0.0,0.0,2.631493
min,9.08364,0.0,0.0,0.0,0.0
25%,82.20971,0.0,0.0,0.0,0.0
50%,182.6966,0.0,0.0,0.0,0.0
75%,388.233587,0.0,0.0,0.0,0.0
max,5886.483326,5886.483326,0.0,0.0,125.0


In [20]:
df_limpo.describe()

Unnamed: 0,parcela_id,numero_parcela,data_emissao,data_vencimento,data_recebimento,valor_parcela,valor_recebido,valor_juro,valor_multa,valor_desconto,lancamento_id,valor_total_lancamento,cliente_id,limite_credito
count,2677.0,2677.0,2677,2677,337,2677.0,2677.0,2677.0,2677.0,2677.0,2677.0,2677.0,2677.0,2677.0
mean,1421.380276,4.045573,2025-10-23 22:04:53.163989504,2026-02-21 06:10:05.155024128,2025-11-22 02:25:16.913946624,311.254725,76.382816,0.0,0.0,0.066871,264.409414,1663.886256,10.649234,3685.991782
min,11.0,1.0,2023-01-10 00:00:00,2023-02-26 00:00:00,2025-02-15 00:00:00,9.08364,0.0,0.0,0.0,0.0,1.0,81.020352,1.0,800.0
25%,724.0,2.0,2025-10-26 00:00:00,2025-12-25 00:00:00,2025-11-25 00:00:00,82.20971,0.0,0.0,0.0,0.0,142.0,609.635641,5.0,1500.0
50%,1431.0,4.0,2025-10-26 00:00:00,2026-02-23 00:00:00,2025-11-25 00:00:00,182.6966,0.0,0.0,0.0,0.0,269.0,1154.056532,10.0,3000.0
75%,2131.0,6.0,2025-10-26 00:00:00,2026-04-24 00:00:00,2025-11-25 00:00:00,388.233587,0.0,0.0,0.0,0.0,390.0,2503.086652,15.0,5500.0
max,2834.0,10.0,2025-10-26 00:00:00,2026-08-22 00:00:00,2025-12-30 00:00:00,5886.483326,5886.483326,0.0,0.0,125.0,516.0,9600.0,21.0,7800.0
std,814.371411,2.475171,,,,446.848442,373.635681,0.0,0.0,2.631493,144.644184,1390.348083,5.825554,2349.369783



## Cross-Field Diagnosis (Business Rule)

Here we find problems that **only appear when cross-checking information**.


In [21]:

# Installments marked as paid but with no receipt date
df[
    df["descricao_status"].str.contains("Quitado", na=False) &
    df["data_recebimento"].isnull()
]


Unnamed: 0,parcela_id,numero_parcela,data_emissao,data_vencimento,data_recebimento,valor_parcela,valor_recebido,valor_juro,valor_multa,valor_desconto,...,lancamento_id,valor_total_lancamento,data_lancamento,cliente_id,cliente_nome,cliente_tipo,limite_credito,natureza_codigo,natureza_descricao,natureza_tipo
308,17,2,2025-05-11,2025-08-05,NaT,1780.0,66.5,0.0,0.0,0.0,...,5,5340.0,2025-05-30,5,MAIS UMA PESSOA FISICA,F,1800.0,101,Venda de Mercadorias,R


In [22]:

# Amounts received that exceed the installment amount
df[df["valor_recebido"] > df["valor_parcela"]]


Unnamed: 0,parcela_id,numero_parcela,data_emissao,data_vencimento,data_recebimento,valor_parcela,valor_recebido,valor_juro,valor_multa,valor_desconto,...,lancamento_id,valor_total_lancamento,data_lancamento,cliente_id,cliente_nome,cliente_tipo,limite_credito,natureza_codigo,natureza_descricao,natureza_tipo



## Classification of the Problems Found

1. Legitimate absence  
2. Registration error  
3. Process error  
4. Valid data, but bad for AI  
5. Data that should become a feature  

This classification will guide the cleaning in the next video.



## Conclusion of Video 03

In this notebook:
- We did not clean data
- We did not remove records
- We only understood the real scenario

👉 **In the next video**, we will begin the business-rule-driven cleaning.



# Feature Engineering - Finance for AI

**Module 4 - T2Ti ERP Data for AI: Manipulation and Cleaning**  
**Video 05 - Feature Engineering - Finance for AI**

From here on, we will create features (meaningful derived variables for AI).



**Feature for days late**
- positive → late
- zero → on time
- negative → paid early
- NaN → not yet paid
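
On a toy frame (hypothetical dates) the four cases look like this. The sketch uses the same subtraction as the notebook, only on data small enough to check by hand.

```python
import pandas as pd

# Toy data (hypothetical dates): all four installments fall due on the same day
df_exemplo = pd.DataFrame({
    "data_vencimento": pd.to_datetime(["2025-11-25"] * 4),
    "data_recebimento": pd.to_datetime(
        ["2025-12-30", "2025-11-25", "2025-11-20", None]
    ),
})

# Receipt date minus due date, expressed in whole days
df_exemplo["dias_atraso"] = (
    df_exemplo["data_recebimento"] - df_exemplo["data_vencimento"]
).dt.days

print(df_exemplo)
```

The rows come out as 35 (late), 0 (on time), -5 (paid early), and NaN (not yet paid).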

In [23]:
# compute the days late
df_limpo["dias_atraso"] = (df_limpo["data_recebimento"] - df_limpo["data_vencimento"]).dt.days

# show the first records to check the new column
df_limpo[["parcela_id", "data_vencimento", "data_recebimento", "dias_atraso"]].head(10)

Unnamed: 0,parcela_id,data_vencimento,data_recebimento,dias_atraso
0,13,2025-03-20,2025-03-20,0.0
1,57,2025-11-25,2025-12-30,35.0
2,58,2025-12-25,NaT,
3,59,2026-01-24,NaT,
4,270,2025-11-25,2025-11-25,0.0
5,271,2025-12-25,NaT,
6,485,2025-11-25,NaT,
7,486,2025-12-25,NaT,
8,487,2026-01-24,NaT,
9,488,2026-02-23,NaT,


**Binary Late-Payment Flag**

In [24]:
# define the binary late-payment flag
df_limpo["em_atraso"] = df_limpo["dias_atraso"] > 0

# show the first records to check the new column
df_limpo[["parcela_id", "data_vencimento", "data_recebimento", "dias_atraso", "em_atraso"]].head(10)

Unnamed: 0,parcela_id,data_vencimento,data_recebimento,dias_atraso,em_atraso
0,13,2025-03-20,2025-03-20,0.0,False
1,57,2025-11-25,2025-12-30,35.0,True
2,58,2025-12-25,NaT,,False
3,59,2026-01-24,NaT,,False
4,270,2025-11-25,2025-11-25,0.0,False
5,271,2025-12-25,NaT,,False
6,485,2025-11-25,NaT,,False
7,486,2025-12-25,NaT,,False
8,487,2026-01-24,NaT,,False
9,488,2026-02-23,NaT,,False
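
Note that `NaN > 0` evaluates to `False`, so an unpaid installment is never flagged here, even when its due date has already passed. If the model should treat those as late, one option is to measure open installments against a reference date. The sketch below assumes a hypothetical `data_referencia` (for example, the extraction date) and a hypothetical column name `em_atraso_ref`; neither comes from the notebook.

```python
import pandas as pd

# Toy data (hypothetical dates): one paid late, two still open
df_exemplo = pd.DataFrame({
    "data_vencimento": pd.to_datetime(["2025-11-25", "2025-11-25", "2026-02-23"]),
    "data_recebimento": pd.to_datetime(["2025-12-30", None, None]),
})

# Hypothetical cut-off date; in practice this could be the extraction date
data_referencia = pd.Timestamp("2026-01-01")

# Paid installments are compared with their receipt date, open ones with the cut-off
data_comparacao = df_exemplo["data_recebimento"].fillna(data_referencia)
df_exemplo["em_atraso_ref"] = data_comparacao > df_exemplo["data_vencimento"]

print(df_exemplo)
```

Whether an open, overdue installment counts as "late" is a business decision; the point is to make that choice explicit rather than let `NaN` decide it.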


**Open Installments Feature**

In [25]:
# define a feature for installments that are still open
df_limpo["parcela_em_aberto"] = df_limpo["data_recebimento"].isnull()

# show the data, now with the new feature
df_limpo[["parcela_id", "data_vencimento", "data_recebimento", "dias_atraso", "em_atraso", "parcela_em_aberto"]].head(10)

Unnamed: 0,parcela_id,data_vencimento,data_recebimento,dias_atraso,em_atraso,parcela_em_aberto
0,13,2025-03-20,2025-03-20,0.0,False,False
1,57,2025-11-25,2025-12-30,35.0,True,False
2,58,2025-12-25,NaT,,False,True
3,59,2026-01-24,NaT,,False,True
4,270,2025-11-25,2025-11-25,0.0,False,False
5,271,2025-12-25,NaT,,False,True
6,485,2025-11-25,NaT,,False,True
7,486,2025-12-25,NaT,,False,True
8,487,2026-01-24,NaT,,False,True
9,488,2026-02-23,NaT,,False,True


**Feature for the total amount received**

In [26]:
# create a new feature with the total amount received
df_limpo["valor_total_recebido"] = (
    df_limpo["valor_recebido"] + df_limpo["valor_juro"] + df_limpo["valor_multa"] - df_limpo["valor_desconto"]
)

# show the data, now with the new feature
df_limpo[["parcela_id", "parcela_em_aberto", "valor_recebido", "valor_juro", "valor_multa", "valor_desconto", "valor_total_recebido"]].head(10)

Unnamed: 0,parcela_id,parcela_em_aberto,valor_recebido,valor_juro,valor_multa,valor_desconto,valor_total_recebido
0,13,False,1746.485,0.0,0.0,54.015,1692.47
1,57,False,196.057167,0.0,0.0,0.0,196.057167
2,58,True,0.0,0.0,0.0,0.0,0.0
3,59,True,0.0,0.0,0.0,0.0,0.0
4,270,False,267.914668,0.0,0.0,0.0,267.914668
5,271,True,0.0,0.0,0.0,0.0,0.0
6,485,True,0.0,0.0,0.0,0.0,0.0
7,486,True,0.0,0.0,0.0,0.0,0.0
8,487,True,0.0,0.0,0.0,0.0,0.0
9,488,True,0.0,0.0,0.0,0.0,0.0
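
In row 0 above, `valor_recebido` (1746.485) plus `valor_desconto` (54.015) equals `valor_parcela` (1800.5). That suggests `valor_recebido` may already be net of the discount in this ERP, in which case subtracting the discount again would understate what was received. This reading is an assumption drawn from a single row, so it is worth checking before relying on either formula. A sketch of such a check, using the values of rows 0 and 1 shown above; the names `valor_devido` and `recebido_ja_liquido` are hypothetical.

```python
import numpy as np
import pandas as pd

# Values copied from rows 0 and 1 of the output above
df_exemplo = pd.DataFrame({
    "valor_parcela": [1800.5, 196.057167],
    "valor_recebido": [1746.485, 196.057167],
    "valor_juro": [0.0, 0.0],
    "valor_multa": [0.0, 0.0],
    "valor_desconto": [54.015, 0.0],
})

# Amount the customer would owe after interest, fine, and discount
valor_devido = (
    df_exemplo["valor_parcela"]
    + df_exemplo["valor_juro"]
    + df_exemplo["valor_multa"]
    - df_exemplo["valor_desconto"]
)

# True where the amount received already matches the adjusted amount due
df_exemplo["recebido_ja_liquido"] = np.isclose(df_exemplo["valor_recebido"], valor_devido)

print(df_exemplo["recebido_ja_liquido"].tolist())
```

If the check holds for most paid installments in the real data, `valor_recebido` alone may be the better measure of the total received.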


**Percentage of the Installment Received Feature**

In [27]:
# create a new feature with the percentage received relative to the installment amount
df_limpo["percentual_recebido"] = df_limpo["valor_total_recebido"] / df_limpo["valor_parcela"] * 100

# show the data, now with the new feature
df_limpo[["parcela_id", "valor_parcela", "valor_total_recebido", "percentual_recebido"]].head(10)

Unnamed: 0,parcela_id,valor_parcela,valor_total_recebido,percentual_recebido
0,13,1800.5,1692.47,94.0
1,57,196.057167,196.057167,100.0
2,58,196.057167,0.0,0.0
3,59,196.057167,0.0,0.0
4,270,267.914668,267.914668,100.0
5,271,267.914668,0.0,0.0
6,485,13.333756,0.0,0.0
7,486,13.333756,0.0,0.0
8,487,13.333756,0.0,0.0
9,488,13.333756,0.0,0.0


**Validating the Created Features**

In [28]:
# validate the created features with describe to check whether the values make sense
df_limpo[["dias_atraso", "em_atraso", "parcela_em_aberto", "valor_total_recebido", "percentual_recebido"]].describe(include="all")

Unnamed: 0,dias_atraso,em_atraso,parcela_em_aberto,valor_total_recebido,percentual_recebido
count,337.0,2677,2677,2677.0,2677.0
unique,,2,2,,
top,,False,True,,
freq,,2675,2340,,
mean,2.139466,,,76.315944,12.584137
std,37.411739,,,373.267171,33.162974
min,0.0,,,0.0,0.0
25%,0.0,,,0.0,0.0
50%,0.0,,,0.0,0.0
75%,0.0,,,0.0,0.0


# Data Preprocessing for Machine Learning (ML)

**Module 4 - T2Ti ERP Data for AI: Manipulation and Cleaning**  
**Video 06 - Data Preprocessing for Machine Learning (ML)**

**Objectives**

Teach the student to transform cleaned and enriched data into a dataset ready for Machine Learning, following good practices, avoiding data leakage, and preparing the ground for the next modules.

- ML does not understand text
- ML does not understand dates
- ML does not understand context
- ML understands numbers and patterns
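
To make those points concrete: text and dates have to become numbers before a model can use them. The sketch below works on toy data (hypothetical values) with column names from the base dataset. The category `"J"` and the derived names `mes_vencimento` and `dia_semana_vencimento` are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values); "J" is an assumed second customer type
df_exemplo = pd.DataFrame({
    "cliente_tipo": ["F", "J", "F"],
    "data_vencimento": pd.to_datetime(["2025-11-25", "2025-12-25", "2026-01-24"]),
})

# Text -> numbers: one 0/1 column per category
tipo_codificado = pd.get_dummies(
    df_exemplo["cliente_tipo"], prefix="cliente_tipo"
).astype(int)

# Date -> numbers: extract components a model can work with
df_numerico = tipo_codificado.assign(
    mes_vencimento=df_exemplo["data_vencimento"].dt.month,
    dia_semana_vencimento=df_exemplo["data_vencimento"].dt.dayofweek,  # Monday = 0
)

print(df_numerico)
```

Every column of `df_numerico` is numeric, which is the form the later scaling and splitting steps expect.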

In [30]:
features = ["dias_atraso", "em_atraso", "parcela_em_aberto", "valor_total_recebido", "percentual_recebido"]
df_modelo = df_limpo[features].copy()
df_modelo.head(10)

Unnamed: 0,dias_atraso,em_atraso,parcela_em_aberto,valor_total_recebido,percentual_recebido
0,0.0,False,False,1692.47,94.0
1,35.0,True,False,196.057167,100.0
2,,False,True,0.0,0.0
3,,False,True,0.0,0.0
4,0.0,False,False,267.914668,100.0
5,,False,True,0.0,0.0
6,,False,True,0.0,0.0
7,,False,True,0.0,0.0
8,,False,True,0.0,0.0
9,,False,True,0.0,0.0


In [32]:
df_modelo.describe(include="all")

Unnamed: 0,dias_atraso,em_atraso,parcela_em_aberto,valor_total_recebido,percentual_recebido
count,337.0,2677,2677,2677.0,2677.0
unique,,2,2,,
top,,False,True,,
freq,,2675,2340,,
mean,2.139466,,,76.315944,12.584137
std,37.411739,,,373.267171,33.162974
min,0.0,,,0.0,0.0
25%,0.0,,,0.0,0.0
50%,0.0,,,0.0,0.0
75%,0.0,,,0.0,0.0


In [33]:
df_modelo.isnull().sum()

dias_atraso             2340
em_atraso                  0
parcela_em_aberto          0
valor_total_recebido       0
percentual_recebido        0
dtype: int64

In [34]:
# fill the null days-late values with 0: without a receipt date the delay cannot be computed, so we treat these rows as 0 days late
df_modelo["dias_atraso"] = df_modelo["dias_atraso"].fillna(0)

# check again for null values
df_modelo.isnull().sum()

dias_atraso             0
em_atraso               0
parcela_em_aberto       0
valor_total_recebido    0
percentual_recebido     0
dtype: int64

In [35]:
# convert the boolean columns to integers (True = 1, False = 0) to make them easier to use in machine learning models
df_modelo["em_atraso"] = df_modelo["em_atraso"].astype(int)
df_modelo["parcela_em_aberto"] = df_modelo["parcela_em_aberto"].astype(int)

# check the final result of our model DataFrame
df_modelo.head(10)

Unnamed: 0,dias_atraso,em_atraso,parcela_em_aberto,valor_total_recebido,percentual_recebido
0,0.0,0,0,1692.47,94.0
1,35.0,1,0,196.057167,100.0
2,0.0,0,1,0.0,0.0
3,0.0,0,1,0.0,0.0
4,0.0,0,0,267.914668,100.0
5,0.0,0,1,0.0,0.0
6,0.0,0,1,0.0,0.0
7,0.0,0,1,0.0,0.0
8,0.0,0,1,0.0,0.0
9,0.0,0,1,0.0,0.0


## Data Scaling

Financial data comes in different scales, for example:

- dias_atraso:          0 - 120  (medium range)
- valor_total_recebido: 0 - 50,000 (huge range!)
- percentual_recebido:  0 - 100  (small range)

In [37]:
# first, look at df_modelo before scaling, showing only mean and std
df_modelo.describe().loc[["mean", "std"]]

Unnamed: 0,dias_atraso,em_atraso,parcela_em_aberto,valor_total_recebido,percentual_recebido
mean,0.269331,0.000747,0.874113,76.315944,12.584137
std,13.275662,0.027328,0.331784,373.267171,33.162974


In [38]:
# use sklearn's StandardScaler to standardize the numeric features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df_escalado = pd.DataFrame(
    scaler.fit_transform(df_modelo),
    columns=df_modelo.columns
)

In [39]:
# now check df_escalado to see the result of the scaling, showing only mean and std
df_escalado.describe().loc[["mean", "std"]]

Unnamed: 0,dias_atraso,em_atraso,parcela_em_aberto,valor_total_recebido,percentual_recebido
mean,0.0,7.96275e-18,-9.555300000000001e-17,-3.1851e-17,-1.59255e-17
std,1.000187,1.000187,1.000187,1.000187,1.000187
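
The standard deviation shows as 1.000187 rather than exactly 1 because two conventions meet here: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). For n rows, the ratio between the two is sqrt(n / (n - 1)). The sketch below reproduces the figure with NumPy alone, using arbitrary random values with the same number of rows as the dataset (2677); the values themselves are not from the notebook.

```python
import numpy as np
import pandas as pd

n = 2677  # number of rows in the dataset
rng = np.random.default_rng(42)
x = rng.normal(loc=300.0, scale=450.0, size=n)  # arbitrary toy values

# Standardize the way StandardScaler does: population std (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

# pandas reports the sample std (ddof=1) by default
std_pandas = float(pd.Series(z).std())
razao_teorica = float(np.sqrt(n / (n - 1)))

print(round(std_pandas, 6))
print(round(razao_teorica, 6))
```

Both lines print 1.000187, so the value in the table is a matter of convention, not a scaling error.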


In [44]:
# show describe again, now formatting the numbers without scientific notation and with 2 decimal places
pd.options.display.float_format = '{:.2f}'.format
df_escalado.describe().loc[["mean", "std"]]

Unnamed: 0,dias_atraso,em_atraso,parcela_em_aberto,valor_total_recebido,percentual_recebido
mean,0.0,0.0,-0.0,-0.0,-0.0
std,1.0,1.0,1.0,1.0,1.0


## Splitting Data Between Training and Real-World Use

In [46]:
# split the data into training and test sets with sklearn's train_test_split, holding out 20% of the data for testing
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(
    df_escalado,          # All the standardized data
    test_size=0.2,        # 20% for testing (80% for training)
    random_state=42       # Seed for reproducibility
)
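
In the cells above, the scaler is fitted on the full dataset before the split, so the mean and standard deviation it learns already include the test rows. That is a mild form of the data leakage mentioned in the objectives. A common alternative is to split first and fit the scaler on the training portion only. The sketch below follows that order, with a toy DataFrame (hypothetical random values) standing in for `df_modelo`; the names `X_train_esc` and `X_test_esc` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for df_modelo (hypothetical random values)
rng = np.random.default_rng(42)
df_exemplo = pd.DataFrame({
    "dias_atraso": rng.integers(0, 120, size=100).astype(float),
    "valor_total_recebido": rng.uniform(0, 5000, size=100),
})

# 1) Split the unscaled data first
X_train, X_test = train_test_split(df_exemplo, test_size=0.2, random_state=42)

# 2) Fit the scaler on the training rows only
scaler = StandardScaler().fit(X_train)

# 3) Apply the same learned transformation to both parts
X_train_esc = pd.DataFrame(
    scaler.transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_esc = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_esc.shape, X_test_esc.shape)
```

With this order, the training columns have mean 0 by construction, while the test columns generally will not, which is expected: the test set is treated as data the scaler has never seen.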

## What we have not done yet
- we have not trained a model
- we have not evaluated metrics
- we have not tuned hyperparameters
- we have not used deep learning

## What we already have
- a complete notebook
- real financial data, cleaned
- meaningful features
- a dataset ready for AI