## Starting the Modeling Pipeline and MLflow Tracking

This notebook starts the experiment pipeline: it loads the final datasets from the `Curated` layer and builds train and test sets with matching columns. The flow separates the predictor variables (`X`) from the target variable (`y`) and performs the initial MLflow Tracking configuration, so that every model parameter, metric, and artifact is tracked consistently and can be versioned.


In [2]:
# 🔧 STEP: PIN THE CWD TO /workspace FOR GLOBAL CONSISTENCY

import os

CWD_FIXO = "/workspace"
os.chdir(CWD_FIXO)

print("✅ Working directory set to:", os.getcwd())


✅ Working directory set to: /workspace


## Baseline Experiment with MLflow and Progress Monitoring

This step runs the first baseline experiment, using MLflow to track parameters, metrics, and artifacts. Potentially slow operations, such as fitting the model (`fit`) and computing metrics, are wrapped in explicit `tqdm` loops. When the bar wraps a single `fit` call it reports the elapsed time of that step rather than progress inside the fit; traceability of the pipeline itself comes from the MLflow logs.


In [3]:
# 🔧 STEP: LOAD THE CURATED DATA AND CONFIGURE MLFLOW TRACKING

"""
Runs:
1) Pin the working directory to /workspace (required)
2) Load 'train_curated.csv' and 'test_curated.csv'
3) Split into X_train, y_train, X_test, y_test
4) Point the MLflow Tracking URI at the internal service
"""

import os
import pandas as pd
import mlflow

# 1️⃣ Pin the CWD
CWD_FIXO = "/workspace"
os.chdir(CWD_FIXO)
print("✅ Working directory set to:", os.getcwd())

# 2️⃣ Define paths relative to the pinned CWD (/workspace)
TRAIN_PATH = os.path.join("data", "curated", "train_curated.csv")
TEST_PATH = os.path.join("data", "curated", "test_curated.csv")

# 3️⃣ Load the curated datasets
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

print("\n✅ Train shape:", train_df.shape)
print("✅ Test shape :", test_df.shape)

# 4️⃣ Split explanatory variables and target
# NOTE: "Credit_Score_Standard" is only one dummy column of the target, so the
# other Credit_Score_* dummy column stays inside X with this split.
TARGET = "Credit_Score_Standard"  # Change here if the target is a different column

X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

X_test = test_df.drop(columns=[TARGET])
y_test = test_df[TARGET]

print("\n✅ X_train:", X_train.shape)
print("✅ y_train:", y_train.shape)
print("✅ X_test :", X_test.shape)
print("✅ y_test :", y_test.shape)

# 5️⃣ Configure the internal MLflow Tracking URI (inside the container)
mlflow.set_tracking_uri("http://mlflow:5000")
print("\n✅ MLflow Tracking URI set to:", mlflow.get_tracking_uri())


✅ Working directory set to: /workspace

✅ Train shape: (100000, 6305)
✅ Test shape : (50000, 6305)

✅ X_train: (100000, 6304)
✅ y_train: (100000,)
✅ X_test : (50000, 6304)
✅ y_test : (50000,)

✅ MLflow Tracking URI set to: http://mlflow:5000


In [None]:
# STEP: Redo the Baseline: Working Directory Normalization, Curated Load, y Reconstruction, and MLflow Tracking

"""
Runs:
1) Validate and normalize the working directory (CWD).
2) Load 'train_curated.csv' and 'test_curated.csv'.
3) Rebuild y from the dummy columns.
4) Split X and y consistently.
5) Set the Tracking URI and MinIO credentials explicitly.
6) Train a DecisionTreeClassifier with a progress bar.
7) Log hyperparameters, metrics, and the artifact to MLflow.
8) Print consistent final links using 127.0.0.1.
"""

import os
import logging
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

# ✅ Silence redundant MLflow logging
logging.getLogger("mlflow").setLevel(logging.ERROR)

# 1️⃣ Validate and fix the CWD
print("Current Working Directory (before):", os.getcwd())
os.chdir('/workspace')
print("Current Working Directory (after):", os.getcwd())

# 2️⃣ Paths relative to /workspace
TRAIN_PATH = 'data/curated/train_curated.csv'
TEST_PATH = 'data/curated/test_curated.csv'

# 3️⃣ Load the datasets
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

print("\nTrain shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("\ntrain_df.head(5):")
print(train_df.head(5))

# 4️⃣ Rebuild y from the dummy columns
dummy_cols = [col for col in train_df.columns if col.startswith('Credit_Score_')]
print("\nDetected class columns:", dummy_cols)

# Rebuild y_train
# NOTE: idxmax returns the first column when every dummy in a row is 0, so
# such rows are labeled with the first class in dummy_cols.
y_train = train_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_train = train_df.drop(columns=dummy_cols)

# Rebuild y_test
y_test = test_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_test = test_df.drop(columns=dummy_cols)

print(f"\nX_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

# 5️⃣ Tracking URI and MinIO credentials
mlflow.set_tracking_uri("http://mlflow:5000")
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'

print("\nTracking URI:", mlflow.get_tracking_uri())
print("MLFLOW_S3_ENDPOINT_URL:", os.environ['MLFLOW_S3_ENDPOINT_URL'])

# 6️⃣ Create or retrieve the experiment
experiment_name = "QuantumFinance_CreditScore"
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="Baseline_DecisionTree_Curated") as run:
    params = {"max_depth": 5, "random_state": 42}
    mlflow.log_params(params)

    model = DecisionTreeClassifier(**params)

    print("\nTraining model with progress bar:")
    # The bar wraps a single fit call, so it only reports the total fit time
    for _ in tqdm(range(1), desc="Fitting model"):
        model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')

    print(f"\nAccuracy: {acc:.4f}")
    print(f"F1 Score: {f1:.4f}")

    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_score", f1)
    mlflow.sklearn.log_model(model, "model")

    # ✅ Consistent final prints, ONLY with 127.0.0.1
    print(f"\nRun ID: {run.info.run_id}")
    print(f"Open: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")
    print(f"View run Baseline_DecisionTree_Curated at: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")
    print(f"View experiment at: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}")


Current Working Directory (before): /workspace/notebooks
Current Working Directory (after): /workspace

Train shape: (100000, 6305)
Test shape: (50000, 6305)

train_df.head(5):
     Age  Annual_Income  Monthly_Inhand_Salary  Num_Bank_Accounts  \
0   23.0       19114.12            1824.843333                  3   
1   23.0       19114.12                    NaN                  3   
2 -500.0       19114.12                    NaN                  3   
3   23.0       19114.12                    NaN                  3   
4   23.0       19114.12            1824.843333                  3   

   Num_Credit_Card  Interest_Rate  Delay_from_due_date  Num_Credit_Inquiries  \
0                4              3                    3                   4.0   
1                4              3                   -1                   4.0   
2                4              3                    3                   4.0   
3                4              3                    5                   4.0   
4     

Fitting model: 100%|██████████| 1/1 [00:12<00:00, 12.90s/it]



Accuracy: 0.5496
F1 Score: 0.7093

Run ID: 86de72fd063e485fa584c1d8c0395aca
Open: http://127.0.0.1:5000/#/experiments/1/runs/86de72fd063e485fa584c1d8c0395aca
View run Baseline_DecisionTree_Curated at: http://127.0.0.1:5000/#/experiments/1/runs/86de72fd063e485fa584c1d8c0395aca
View experiment at: http://127.0.0.1:5000/#/experiments/1
🏃 View run Baseline_DecisionTree_Curated at: http://mlflow:5000/#/experiments/1/runs/86de72fd063e485fa584c1d8c0395aca
🧪 View experiment at: http://mlflow:5000/#/experiments/1
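
One caveat about the label reconstruction in the cell above: `idxmax` returns the first column whenever a row has no 1 in any dummy column, so a class that was dropped during encoding, or a test set whose label columns were filled with 0, is silently labeled with the first class. The reported metrics are consistent with this: 2 × 0.5496 / (1 + 0.5496) ≈ 0.7093, which is what the weighted F1 reduces to when the true labels contain a single class. The sketch below illustrates the effect on toy data; the helper `rebuild_labels` and the dropped class name `Good` are assumptions, not part of the original pipeline.

```python
import pandas as pd

PREFIX = "Credit_Score_"


def rebuild_labels(df, prefix=PREFIX, dropped_class="Good"):
    """Rebuild a label Series from dummy columns.

    Rows where every dummy is 0 get `dropped_class` (an assumed name for the
    category removed by drop_first) instead of the first dummy column.
    """
    dummy_cols = [c for c in df.columns if c.startswith(prefix)]
    dummies = df[dummy_cols].astype(int)
    labels = dummies.idxmax(axis=1).str.replace(prefix, "", regex=False)
    all_zero = dummies.sum(axis=1) == 0
    return labels.mask(all_zero, dropped_class)


# Toy frame: the third row has no active dummy column
toy = pd.DataFrame({
    "Credit_Score_Poor": [1, 0, 0],
    "Credit_Score_Standard": [0, 1, 0],
})

naive = toy.idxmax(axis=1).str.replace(PREFIX, "", regex=False)
fixed = rebuild_labels(toy)

print("naive:", naive.tolist())
print("fixed:", fixed.tolist())
```

Checking `(df[dummy_cols].sum(axis=1) == 0).mean()` on the test set before computing metrics would show how many rows are affected.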


## STEP: Model Improvement with Grid Search and MLflow Tracking

This block marks the transition from the baseline experiment to an incremental optimization step. It uses a grid search (implemented as a manual loop over `ParameterGrid` rather than `GridSearchCV`, so that each combination becomes its own tracked run) to explore multiple hyperparameter combinations systematically.

## Goal
- Find the most effective hyperparameter configuration for the decision tree (`DecisionTreeClassifier`).
- Log every tested combination as its own MLflow run, with its parameters, metrics, and final artifact.

## Principles applied
- Tracking URI and MinIO/S3 backend kept consistent.
- Every variation is traceable, and earlier runs are never overwritten.
- `tqdm` progress bar for visibility in slow loops.
- Consistent final prints, with 127.0.0.1 links to the MLflow UI.

## Expected result
- Metrics that are comparable between the baseline and the grid search.
- Best model saved as an artifact in the MinIO bucket.
- Next step: prepare the pipeline to register the validated model in the MLflow Registry.
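
To make "one run per combination" concrete, the sketch below expands a grid into its individual parameter dictionaries using only the standard library. It is meant to mirror what iterating over scikit-learn's `ParameterGrid` yields for the decision-tree grid used in this notebook; the helper name `expand_grid` is hypothetical.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per hyperparameter combination, with keys in sorted order."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


# Same grid as the decision-tree search in this notebook
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5, 10],
}

combos = list(expand_grid(param_grid))
print(len(combos), "runs to log")
for params in combos[:3]:
    print(params)
```

Three values for each of two parameters give nine combinations, which matches the `9` total shown by the progress bar in the grid-search output.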


In [None]:
# 🔧 STEP: Manual Grid Search with ParameterGrid, tqdm, and MLflow Tracking

"""
Runs:
1) Normalize the CWD to '/workspace'.
2) Load train_curated and test_curated.
3) Rebuild y from the dummy columns.
4) Iterate over ParameterGrid manually.
5) Advance the tqdm bar once per combination.
6) Log each combination as a separate MLflow run.
7) Print consistent links using 127.0.0.1.
"""

import os
import logging
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

# ✅ Silence redundant MLflow logging
logging.getLogger("mlflow").setLevel(logging.ERROR)

# 1️⃣ Normalize the CWD
print("CWD before:", os.getcwd())
os.chdir('/workspace')
print("CWD after:", os.getcwd())

# 2️⃣ Paths
TRAIN_PATH = 'data/curated/train_curated.csv'
TEST_PATH = 'data/curated/test_curated.csv'

# 3️⃣ Load + rebuild y
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

dummy_cols = [col for col in train_df.columns if col.startswith('Credit_Score_')]
print("Class columns:", dummy_cols)

y_train = train_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_train = train_df.drop(columns=dummy_cols)

y_test = test_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_test = test_df.drop(columns=dummy_cols)

print(f"X_train: {X_train.shape} | y_train: {y_train.shape}")

# 4️⃣ Manual ParameterGrid
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5, 10]
}
grid = ParameterGrid(param_grid)

# 5️⃣ Tracking URI + MinIO credentials
mlflow.set_tracking_uri("http://mlflow:5000")
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'

experiment_name = "QuantumFinance_CreditScore"
mlflow.set_experiment(experiment_name)

print("\nRunning manual grid search...")

# One MLflow run per hyperparameter combination
for params in tqdm(grid, desc="Runs"):
    with mlflow.start_run(run_name=f"GridSearch_Manual_{params}") as run:
        model = DecisionTreeClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("f1_score", f1)
        mlflow.sklearn.log_model(model, "model")

        print(f"\nCombination: {params}")
        print(f"Accuracy: {acc:.4f} | F1 Score: {f1:.4f}")
        print(f"Run ID: {run.info.run_id}")
        print(f"Open: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")


CWD before: /workspace
CWD after: /workspace
Class columns: ['Credit_Score_Poor', 'Credit_Score_Standard']
X_train: (100000, 6303) | y_train: (100000,)

Running manual grid search...


Runs:  11%|█         | 1/9 [00:15<02:06, 15.75s/it]


Combination: {'max_depth': 3, 'min_samples_split': 2}
Accuracy: 0.5388 | F1 Score: 0.7003
Run ID: 8347a501625844da89aaa891510faa55
Open: http://127.0.0.1:5000/#/experiments/1/runs/8347a501625844da89aaa891510faa55
🏃 View run GridSearch_Manual_{'max_depth': 3, 'min_samples_split': 2} at: http://mlflow:5000/#/experiments/1/runs/8347a501625844da89aaa891510faa55
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  22%|██▏       | 2/9 [00:27<01:34, 13.56s/it]


Combination: {'max_depth': 3, 'min_samples_split': 5}
Accuracy: 0.5388 | F1 Score: 0.7003
Run ID: 9f9ded00e4594c7197535ae840bc8bfc
Open: http://127.0.0.1:5000/#/experiments/1/runs/9f9ded00e4594c7197535ae840bc8bfc
🏃 View run GridSearch_Manual_{'max_depth': 3, 'min_samples_split': 5} at: http://mlflow:5000/#/experiments/1/runs/9f9ded00e4594c7197535ae840bc8bfc
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  33%|███▎      | 3/9 [00:38<01:13, 12.20s/it]


Combination: {'max_depth': 3, 'min_samples_split': 10}
Accuracy: 0.5388 | F1 Score: 0.7003
Run ID: e3dda2cd137b4d96999617039c09af58
Open: http://127.0.0.1:5000/#/experiments/1/runs/e3dda2cd137b4d96999617039c09af58
🏃 View run GridSearch_Manual_{'max_depth': 3, 'min_samples_split': 10} at: http://mlflow:5000/#/experiments/1/runs/e3dda2cd137b4d96999617039c09af58
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  44%|████▍     | 4/9 [00:49<00:59, 11.97s/it]


Combination: {'max_depth': 5, 'min_samples_split': 2}
Accuracy: 0.5496 | F1 Score: 0.7093
Run ID: f969ead83d354f0f99321a872160c42d
Open: http://127.0.0.1:5000/#/experiments/1/runs/f969ead83d354f0f99321a872160c42d
🏃 View run GridSearch_Manual_{'max_depth': 5, 'min_samples_split': 2} at: http://mlflow:5000/#/experiments/1/runs/f969ead83d354f0f99321a872160c42d
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  56%|█████▌    | 5/9 [01:00<00:46, 11.51s/it]


Combination: {'max_depth': 5, 'min_samples_split': 5}
Accuracy: 0.5496 | F1 Score: 0.7093
Run ID: 7a13e6e4410945479787888336840c91
Open: http://127.0.0.1:5000/#/experiments/1/runs/7a13e6e4410945479787888336840c91
🏃 View run GridSearch_Manual_{'max_depth': 5, 'min_samples_split': 5} at: http://mlflow:5000/#/experiments/1/runs/7a13e6e4410945479787888336840c91
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  67%|██████▋   | 6/9 [01:11<00:33, 11.31s/it]


Combination: {'max_depth': 5, 'min_samples_split': 10}
Accuracy: 0.5496 | F1 Score: 0.7093
Run ID: d070175a3a8c448a9565e24d7e9871a9
Open: http://127.0.0.1:5000/#/experiments/1/runs/d070175a3a8c448a9565e24d7e9871a9
🏃 View run GridSearch_Manual_{'max_depth': 5, 'min_samples_split': 10} at: http://mlflow:5000/#/experiments/1/runs/d070175a3a8c448a9565e24d7e9871a9
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  78%|███████▊  | 7/9 [01:23<00:23, 11.62s/it]


Combination: {'max_depth': 7, 'min_samples_split': 2}
Accuracy: 0.5422 | F1 Score: 0.7032
Run ID: 9c9d83b3de294b5aab2357b42a67aa06
Open: http://127.0.0.1:5000/#/experiments/1/runs/9c9d83b3de294b5aab2357b42a67aa06
🏃 View run GridSearch_Manual_{'max_depth': 7, 'min_samples_split': 2} at: http://mlflow:5000/#/experiments/1/runs/9c9d83b3de294b5aab2357b42a67aa06
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  89%|████████▉ | 8/9 [01:35<00:11, 11.69s/it]


Combination: {'max_depth': 7, 'min_samples_split': 5}
Accuracy: 0.5422 | F1 Score: 0.7032
Run ID: d2685d32bdcb42b783e358095e9f5b63
Open: http://127.0.0.1:5000/#/experiments/1/runs/d2685d32bdcb42b783e358095e9f5b63
🏃 View run GridSearch_Manual_{'max_depth': 7, 'min_samples_split': 5} at: http://mlflow:5000/#/experiments/1/runs/d2685d32bdcb42b783e358095e9f5b63
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs: 100%|██████████| 9/9 [01:46<00:00, 11.88s/it]


Combination: {'max_depth': 7, 'min_samples_split': 10}
Accuracy: 0.5424 | F1 Score: 0.7033
Run ID: 0cfc26f47de549969c8fdb2ceef4565c
Open: http://127.0.0.1:5000/#/experiments/1/runs/0cfc26f47de549969c8fdb2ceef4565c
🏃 View run GridSearch_Manual_{'max_depth': 7, 'min_samples_split': 10} at: http://mlflow:5000/#/experiments/1/runs/0cfc26f47de549969c8fdb2ceef4565c
🧪 View experiment at: http://mlflow:5000/#/experiments/1





In [None]:
# 🔧 STEP: Manual Grid Search with RandomForest, ParameterGrid, tqdm, and MLflow Tracking

"""
Runs:
1) Normalize the CWD to '/workspace'.
2) Load train_curated and test_curated.
3) Rebuild y from the dummy columns.
4) Run a manual grid search with RandomForestClassifier.
5) Use ParameterGrid + tqdm for a bar that advances once per combination.
6) Log each run to MLflow with consistent MinIO credentials.
7) Print consistent final links using 127.0.0.1.
"""

import os
import logging
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

# ✅ Silence redundant MLflow logging
logging.getLogger("mlflow").setLevel(logging.ERROR)

# 1️⃣ Normalize the CWD
print("CWD before:", os.getcwd())
os.chdir('/workspace')
print("CWD after:", os.getcwd())

# 2️⃣ Paths
TRAIN_PATH = 'data/curated/train_curated.csv'
TEST_PATH = 'data/curated/test_curated.csv'

# 3️⃣ Load + rebuild y
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

dummy_cols = [col for col in train_df.columns if col.startswith('Credit_Score_')]
print("Class columns:", dummy_cols)

y_train = train_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_train = train_df.drop(columns=dummy_cols)

y_test = test_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_test = test_df.drop(columns=dummy_cols)

print(f"X_train: {X_train.shape} | y_train: {y_train.shape}")

# 4️⃣ ParameterGrid for RandomForest
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "min_samples_split": [2, 5]
}
grid = ParameterGrid(param_grid)

# 5️⃣ Tracking URI + MinIO credentials
mlflow.set_tracking_uri("http://mlflow:5000")
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'

experiment_name = "QuantumFinance_CreditScore"
mlflow.set_experiment(experiment_name)

print("\nRunning manual grid search for RandomForest...")

# One MLflow run per hyperparameter combination
for params in tqdm(grid, desc="Runs RandomForest"):
    with mlflow.start_run(run_name=f"GridSearch_RF_{params}") as run:
        model = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("f1_score", f1)
        mlflow.sklearn.log_model(model, "model")

        print(f"\nCombination: {params}")
        print(f"Accuracy: {acc:.4f} | F1 Score: {f1:.4f}")
        print(f"Run ID: {run.info.run_id}")
        print(f"Open: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")


CWD before: /workspace
CWD after: /workspace
Class columns: ['Credit_Score_Poor', 'Credit_Score_Standard']
X_train: (100000, 6303) | y_train: (100000,)

Running manual grid search for RandomForest...


Runs RandomForest:  12%|█▎        | 1/8 [00:10<01:14, 10.66s/it]


Combination: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 50}
Accuracy: 0.1160 | F1 Score: 0.2078
Run ID: 7e1e8f44f2a14b469b513b183d4a0ae5
Open: http://127.0.0.1:5000/#/experiments/1/runs/7e1e8f44f2a14b469b513b183d4a0ae5
🏃 View run GridSearch_RF_{'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 50} at: http://mlflow:5000/#/experiments/1/runs/7e1e8f44f2a14b469b513b183d4a0ae5
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  25%|██▌       | 2/8 [00:18<00:55,  9.26s/it]


Combination: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy: 0.0560 | F1 Score: 0.1061
Run ID: 7e2a7fd6532a415790f7e0297eaef518
Open: http://127.0.0.1:5000/#/experiments/1/runs/7e2a7fd6532a415790f7e0297eaef518
🏃 View run GridSearch_RF_{'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 100} at: http://mlflow:5000/#/experiments/1/runs/7e2a7fd6532a415790f7e0297eaef518
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  38%|███▊      | 3/8 [00:26<00:42,  8.41s/it]


Combination: {'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 50}
Accuracy: 0.1160 | F1 Score: 0.2079
Run ID: f6e6ac21b469427999d1ac0fd1c65fd4
Open: http://127.0.0.1:5000/#/experiments/1/runs/f6e6ac21b469427999d1ac0fd1c65fd4
🏃 View run GridSearch_RF_{'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 50} at: http://mlflow:5000/#/experiments/1/runs/f6e6ac21b469427999d1ac0fd1c65fd4
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  50%|█████     | 4/8 [00:33<00:31,  7.93s/it]


Combination: {'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 100}
Accuracy: 0.0560 | F1 Score: 0.1061
Run ID: ce6f759ad293462098f404894ae35310
Open: http://127.0.0.1:5000/#/experiments/1/runs/ce6f759ad293462098f404894ae35310
🏃 View run GridSearch_RF_{'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 100} at: http://mlflow:5000/#/experiments/1/runs/ce6f759ad293462098f404894ae35310
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  62%|██████▎   | 5/8 [00:40<00:22,  7.56s/it]


Combination: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 50}
Accuracy: 0.2117 | F1 Score: 0.3494
Run ID: 30dbe7a0952a4d86a1eed78941ec550b
Open: http://127.0.0.1:5000/#/experiments/1/runs/30dbe7a0952a4d86a1eed78941ec550b
🏃 View run GridSearch_RF_{'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 50} at: http://mlflow:5000/#/experiments/1/runs/30dbe7a0952a4d86a1eed78941ec550b
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  75%|███████▌  | 6/8 [00:47<00:14,  7.42s/it]


Combination: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy: 0.1582 | F1 Score: 0.2732
Run ID: 3a35386b08854259abb939bffb29abb8
Open: http://127.0.0.1:5000/#/experiments/1/runs/3a35386b08854259abb939bffb29abb8
🏃 View run GridSearch_RF_{'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 100} at: http://mlflow:5000/#/experiments/1/runs/3a35386b08854259abb939bffb29abb8
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  88%|████████▊ | 7/8 [00:54<00:07,  7.32s/it]


Combination: {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 50}
Accuracy: 0.2117 | F1 Score: 0.3494
Run ID: ee503b2c1af948db890c83f7a178a5de
Open: http://127.0.0.1:5000/#/experiments/1/runs/ee503b2c1af948db890c83f7a178a5de
🏃 View run GridSearch_RF_{'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 50} at: http://mlflow:5000/#/experiments/1/runs/ee503b2c1af948db890c83f7a178a5de
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest: 100%|██████████| 8/8 [01:02<00:00,  7.78s/it]


Combination: {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 100}
Accuracy: 0.1581 | F1 Score: 0.2730
Run ID: 4693946ae8bf47a799823ab6e1844b93
Open: http://127.0.0.1:5000/#/experiments/1/runs/4693946ae8bf47a799823ab6e1844b93
🏃 View run GridSearch_RF_{'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 100} at: http://mlflow:5000/#/experiments/1/runs/4693946ae8bf47a799823ab6e1844b93
🧪 View experiment at: http://mlflow:5000/#/experiments/1





In [None]:
# 🔧 STEP: Manual Grid Search with GradientBoostingClassifier, ParameterGrid, tqdm, and MLflow

"""
Runs:
1) Normalize the CWD to '/workspace'.
2) Load 'train_clean.csv' from the processed layer.
3) Split X and y using the original target.
4) Apply pd.get_dummies() to X.
5) Use train_test_split with stratify.
6) Run a manual grid search with GradientBoostingClassifier.
7) Show a tqdm bar that advances once per combination.
8) Log each run separately to MLflow.
"""

import os
import logging
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

# ✅ Silence redundant logging
logging.getLogger("mlflow").setLevel(logging.ERROR)

# 1️⃣ Normalize the CWD
print("CWD before:", os.getcwd())
os.chdir('/workspace')
print("CWD after:", os.getcwd())

# 2️⃣ Load the data
df = pd.read_csv('data/processed/train_clean.csv')
print("\ndf shape:", df.shape)
print("\ndf.head(5):\n", df.head(5))

# 3️⃣ Split X and y
X = df.drop(columns=['Credit_Score'])
y = df['Credit_Score']

# 4️⃣ Preprocess X the same way as the curated layer
X = pd.get_dummies(X)

# 5️⃣ train_test_split + column alignment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Both splits come from the same encoded frame, so this align is only a safeguard
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

print(f"\nX_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

# 6️⃣ Configure the ParameterGrid
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1]
}
grid = ParameterGrid(param_grid)

# 7️⃣ MLflow tracking + MinIO
mlflow.set_tracking_uri("http://mlflow:5000")
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'

experiment_name = "QuantumFinance_CreditScore"
mlflow.set_experiment(experiment_name)

print("\nRunning manual grid search: GradientBoostingClassifier...")

# One MLflow run per hyperparameter combination
for params in tqdm(grid, desc="Runs GBoost"):
    with mlflow.start_run(run_name=f"GridSearch_GBoost_{params}") as run:
        model = GradientBoostingClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("f1_score", f1)
        mlflow.sklearn.log_model(model, "model")

        print(f"\nCombination: {params}")
        print(f"Accuracy: {acc:.4f} | F1 Score: {f1:.4f}")
        print(f"Run ID: {run.info.run_id}")
        print(f"Open: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")


CWD before: /workspace/notebooks
CWD after: /workspace

df shape: (100000, 28)

df.head(5):
        ID Customer_ID     Month           Name    Age          SSN Occupation  \
0  0x1602   CUS_0xd40   January  Aaron Maashoh   23.0  821-00-0265  Scientist   
1  0x1603   CUS_0xd40  February  Aaron Maashoh   23.0  821-00-0265  Scientist   
2  0x1604   CUS_0xd40     March  Aaron Maashoh -500.0  821-00-0265  Scientist   
3  0x1605   CUS_0xd40     April  Aaron Maashoh   23.0  821-00-0265  Scientist   
4  0x1606   CUS_0xd40       May  Aaron Maashoh   23.0  821-00-0265  Scientist   

   Annual_Income  Monthly_Inhand_Salary  Num_Bank_Accounts  ...  Credit_Mix  \
0       19114.12            1824.843333                  3  ...     Unknown   
1       19114.12                    NaN                  3  ...        Good   
2       19114.12                    NaN                  3  ...        Good   
3       19114.12                    NaN                  3  ...        Good   
4       19114.12         


#  Memory Footprint Diagnosis

Before fitting, this cell computes the actual memory usage of `X_train` and `y_train`.  
It documents how much RAM the kernel will consume, using `deep=True` so that the contents of object columns, their pointers, and the index are counted as well.  
This value should stay below 70% of the container's actual RAM to avoid the kernel being killed by the OOM killer.
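
The cell described here is not shown, so the following is only a minimal sketch of such a check. The helper names `footprint_bytes` and `fits_in_budget`, the toy data, and the 8 GiB RAM figure are assumptions for illustration; in the real notebook the inputs would be `X_train` and `y_train`, and the RAM value would be the container's actual limit.

```python
import pandas as pd


def footprint_bytes(*objects):
    """Sum the deep memory usage (data plus index) of DataFrames and Series."""
    total = 0
    for obj in objects:
        usage = obj.memory_usage(deep=True)
        # DataFrame.memory_usage returns a per-column Series;
        # Series.memory_usage returns a single integer.
        total += int(usage.sum()) if isinstance(obj, pd.DataFrame) else int(usage)
    return total


def fits_in_budget(used_bytes, ram_bytes, fraction=0.70):
    """True when the footprint stays below the given fraction of RAM."""
    return used_bytes < fraction * ram_bytes


# Toy stand-ins for X_train and y_train
X_toy = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})
y_toy = pd.Series(["Poor"] * 1000)

used = footprint_bytes(X_toy, y_toy)
ram_bytes = 8 * 1024**3  # hypothetical 8 GiB container limit

print(f"Footprint: {used / 1024**2:.2f} MiB")
print("Within 70% budget:", fits_in_budget(used, ram_bytes))
```

Note that this measures the DataFrames at rest; fitting a model typically allocates additional working memory on top of it.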


## Reloading the Curated Datasets (Corrected Path)

This cell reloads `train_curated.csv` and `test_curated.csv` using the correct relative path, consistent with the `/workspace/` directory structure.
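
Because the notebook's working directory alternates between `/workspace` and `/workspace/notebooks`, relative paths resolve differently from cell to cell. One possible way to avoid this, sketched below under the assumption that `/workspace` is always the project root, is to build absolute paths with `pathlib`. The names `PROJECT_ROOT` and `curated_path` are hypothetical.

```python
from pathlib import Path

# Assumed project root, matching the directory pinned earlier with os.chdir
PROJECT_ROOT = Path("/workspace")


def curated_path(filename, root=PROJECT_ROOT):
    """Build a path inside data/curated that does not depend on the CWD."""
    return root / "data" / "curated" / filename


train_path = curated_path("train_curated.csv")
test_path = curated_path("test_curated.csv")

print(train_path.as_posix())
print(test_path.as_posix())
```

The resulting paths can be passed directly to `pd.read_csv`, whichever directory the kernel was started in.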



In [None]:
# STEP: Reload the Curated Datasets

import pandas as pd

train_df = pd.read_csv("../data/curated/train_curated.csv")
test_df = pd.read_csv("../data/curated/test_curated.csv")

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")


Train shape: (100000, 6305)
Test shape: (50000, 6305)


In [None]:
print(train_df.columns.tolist())


['Age', 'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Delay_from_due_date', 'Num_Credit_Inquiries', 'Outstanding_Debt', 'Credit_Utilization_Ratio', 'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance', 'Num_of_Loan_Bin', 'Changed_Credit_Limit_Bin', 'Num_of_Delayed_Payment_Bin', 'Credit_History_Age_Bin', 'Month_Num', 'Occupation_Architect', 'Occupation_Developer', 'Occupation_Doctor', 'Occupation_Engineer', 'Occupation_Entrepreneur', 'Occupation_Journalist', 'Occupation_Lawyer', 'Occupation_Manager', 'Occupation_Mechanic', 'Occupation_Media_Manager', 'Occupation_Musician', 'Occupation_Scientist', 'Occupation_Teacher', 'Occupation_Unknown', 'Occupation_Writer', 'Type_of_Loan_Auto Loan, Auto Loan, Auto Loan, Auto Loan, Credit-Builder Loan, Credit-Builder Loan, Mortgage Loan, and Personal Loan', 'Type_of_Loan_Auto Loan, Auto Loan, Auto Loan, Auto Loan, Student Loan, and Student Loan', 'Type_of_Loan_Auto Loan, Auto Loan, A

---
REOPEN FEATURE_ENGINEERING_CURADORIA.IPYNB TO REDUCE CARDINALITY
---

---


## Controlled Extension: Restricted One-Hot Encoding Before the Cardinality Check

This block applies a **restricted One-Hot Encoding** to the variables `Month`, `Occupation_Group`, and `Payment_Behaviour`, before any new cardinality diagnosis.

The goal is to see how many columns are added and to compare the result with the previous pipeline (more than 6,300 columns) to check that the footprint stays under control.
The result is saved as `CURATED V1.1` for full traceability.
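
When train and test do not share exactly the same columns, an outer `align` creates every missing column in the other frame and fills it with 0, including a label column that exists only in train. The sketch below shows an alternative that reindexes the test frame to the train feature columns only. It uses toy data, and `TARGET = "Credit_Score"` is an assumed name for a label column present only in train.

```python
import pandas as pd

TARGET = "Credit_Score"  # assumed label column, present only in train

train = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar"],
    "Age": [23, 31, 40],
    TARGET: ["Good", "Poor", "Standard"],
})
test = pd.DataFrame({"Month": ["Jan", "Feb"], "Age": [28, 35]})

# Encode each frame separately, as in the restricted OHE step
train_enc = pd.get_dummies(train, columns=["Month"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["Month"], drop_first=True)

# Reindex test to the train feature columns only: dummies missing from test
# are added as 0, and no artificial target column is created.
feature_cols = [c for c in train_enc.columns if c != TARGET]
test_enc = test_enc.reindex(columns=feature_cols, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Keeping the label out of the test frame makes it explicit that the test set carries no ground truth, rather than leaving a column of zeros that later code could mistake for labels.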


In [None]:
# STEP: RESTRICTED ONE-HOT ENCODING V1.1 AND IMMEDIATE COMPARISON

import pandas as pd

# Paths
train_curated_v1 = '/workspace/data/curated/train_curated_v1.csv'
test_curated_v1  = '/workspace/data/curated/test_curated_v1.csv'

# Load
train_df = pd.read_csv(train_curated_v1)
test_df  = pd.read_csv(test_curated_v1)

# Columns to encode
cols_to_encode = ['Month', 'Occupation_Group', 'Payment_Behaviour']

# Apply the restricted OHE
train_encoded = pd.get_dummies(train_df, columns=cols_to_encode, drop_first=True)
test_encoded  = pd.get_dummies(test_df, columns=cols_to_encode, drop_first=True)

# Align columns so both frames have the same structure.
# NOTE: with join='outer', any column present in only one frame is created
# in the other and filled with 0.
train_encoded, test_encoded = train_encoded.align(test_encoded, join='outer', axis=1, fill_value=0)

# Compare shapes
print("\nOriginal V1 shape (train):", train_df.shape)
print("V1.1 shape after OHE (train):", train_encoded.shape)

print("\nOriginal V1 shape (test):", test_df.shape)
print("V1.1 shape after OHE (test):", test_encoded.shape)

# Save V1.1
train_encoded.to_csv('/workspace/data/curated/train_curated_v1_1.csv', index=False)
test_encoded.to_csv('/workspace/data/curated/test_curated_v1_1.csv', index=False)

print("\nCURATED V1.1 snapshots saved.")



Original V1 shape (train): (100000, 65)
V1.1 shape after OHE (train): (100000, 93)

Original V1 shape (test): (50000, 64)
V1.1 shape after OHE (test): (50000, 93)

CURATED V1.1 snapshots saved.


## Atomic Versioning: CURATED V1.1 Snapshot

This block versions `train_curated_v1_1.csv` and `test_curated_v1_1.csv` atomically with `DVC` and `Git`.  
The flow ensures full traceability: a physical file check, a coherent commit, and a push to the MinIO backend.


In [None]:
# STEP: ATOMIC VERSIONING OF CURATED V1.1

import os
import subprocess

# V1.1 paths
train_curated_v1_1 = '/workspace/data/curated/train_curated_v1_1.csv'
test_curated_v1_1  = '/workspace/data/curated/test_curated_v1_1.csv'

# Check the CWD
print("\nCurrent working directory:", os.getcwd())

# Confirm the files physically exist
print("\nChecking physical existence:")
print("TRAIN V1.1:", os.path.exists(train_curated_v1_1))
print("TEST V1.1 :", os.path.exists(test_curated_v1_1))

if not os.path.exists(train_curated_v1_1) or not os.path.exists(test_curated_v1_1):
    raise FileNotFoundError("One of the CURATED V1.1 files was not found.")

# DVC add
print("\nRunning dvc add ...")
subprocess.run(['dvc', 'add', train_curated_v1_1], check=True)
subprocess.run(['dvc', 'add', test_curated_v1_1], check=True)

# Git add of the .dvc metadata
print("\nAdding .dvc metadata to Git ...")
subprocess.run(['git', 'add', f"{train_curated_v1_1}.dvc"], check=True)
subprocess.run(['git', 'add', f"{test_curated_v1_1}.dvc"], check=True)

# Commit (message kept in Portuguese, as recorded in the repository history)
print("\nCommitting to Git ...")
subprocess.run(['git', 'commit', '-m', 'Versionamento CURATED V1.1 com OHE restrito'], check=True)

# DVC push
print("\nRunning dvc push ...")
subprocess.run(['dvc', 'push'], check=True)

# Final Git push
print("\nRunning git push ...")
subprocess.run(['git', 'push'], check=True)

print("\nCURATED V1.1 versioning completed successfully.")



Current working directory: /workspace/notebooks

Checking physical existence:
TRAIN V1.1: True
TEST V1.1 : True

Running dvc add ...

Adding .dvc metadata to Git ...

Committing to Git ...
[main 7d92073] Versionamento CURATED V1.1 com OHE restrito
 2 files changed, 10 insertions(+)
 create mode 100644 data/curated/test_curated_v1_1.csv.dvc
 create mode 100644 data/curated/train_curated_v1_1.csv.dvc

Running dvc push ...
2 files pushed

Running git push ...

CURATED V1.1 versioning completed successfully.


To github.com:WRMELO/MBA_MLOPS.git
   f76be84..7d92073  main -> main


## Footprint Diagnosis: CURATED V1.1 Snapshot

This block measures the footprint of the `CURATED V1.1` files on **disk** and in **RAM**, so the numbers stay comparable with `V1` and with the original pipeline.

Values are shown in MB.


In [None]:
# STEP: FOOTPRINT DIAGNOSIS FOR CURATED V1.1

import os
import pandas as pd

# Helper functions
def file_size(path):
    """Return the file size on disk, in MB."""
    size_bytes = os.path.getsize(path)
    return round(size_bytes / (1024 * 1024), 2)

def memory_usage_df(path):
    """Load the CSV and return its in-memory size, in MB (deep=True counts string contents)."""
    df = pd.read_csv(path)
    return round(df.memory_usage(deep=True).sum() / (1024 * 1024), 2)

# V1.1 paths
train_v1_1 = '/workspace/data/curated/train_curated_v1_1.csv'
test_v1_1  = '/workspace/data/curated/test_curated_v1_1.csv'

# Disk
train_disk = file_size(train_v1_1)
test_disk  = file_size(test_v1_1)

# RAM
train_mem = memory_usage_df(train_v1_1)
test_mem  = memory_usage_df(test_v1_1)

print("\nSize on DISK (MB):")
print(f"TRAIN V1.1: {train_disk} MB")
print(f"TEST V1.1 : {test_disk} MB")

print("\nFootprint in MEMORY (MB):")
print(f"TRAIN V1.1: {train_mem} MB")
print(f"TEST V1.1 : {test_mem} MB")



Size on DISK (MB):
TRAIN V1.1: 63.33 MB
TEST V1.1 : 30.67 MB

Footprint in MEMORY (MB):
TRAIN V1.1: 165.26 MB
TEST V1.1 : 81.38 MB


In [None]:
# STEP: RELOAD AND COMPARE THE FINAL CURATED V1.1 SHAPES

import pandas as pd

# V1.1 paths
train_curated_v1_1 = '/workspace/data/curated/train_curated_v1_1.csv'
test_curated_v1_1  = '/workspace/data/curated/test_curated_v1_1.csv'

# Reload the DataFrames
train_df_v1_1 = pd.read_csv(train_curated_v1_1)
test_df_v1_1  = pd.read_csv(test_curated_v1_1)

# Show shapes
print("\nCurrent shape of TRAIN CURATED V1.1:", train_df_v1_1.shape)
print("Current shape of TEST CURATED V1.1 :", test_df_v1_1.shape)



Current shape of TRAIN CURATED V1.1: (100000, 93)
Current shape of TEST CURATED V1.1 : (50000, 93)


## Footprint Comparison: Before, CURATED V1, and CURATED V1.1

Before supervised binning and controlled grouping were applied, the Feature Engineering pipeline produced a **very high-cardinality** dataset, reaching **6,305 columns** per sample.

- **Train shape (old)**: 100,000 rows × 6,305 columns
- **Test shape (old)**: 50,000 rows × 6,305 columns

This volume came from **indiscriminate one-hot encoding** of rare categories and of continuous variables fragmented into many distinct values, which led to out-of-memory failures (OOM Killer) even on well-provisioned machines.

After a full review, with:
- Detailed statistical diagnosis,
- Supervised binning with ranges that make business sense,
- Grouping of rare categories,
- Removal of redundancies while keeping traceability,

the footprint dropped sharply to:

- **Train shape (CURATED V1)**: 100,000 rows × 65 columns
- **Test shape (CURATED V1)**: 50,000 rows × 64 columns

To keep the model interpretable and able to capture seasonality and behavior profiles, an extension with **restricted One-Hot Encoding** was applied only to strategic variables (`Month`, `Occupation_Group`, `Payment_Behaviour`):

- **Train shape (CURATED V1.1)**: 100,000 rows × 93 columns
- **Test shape (CURATED V1.1)**: 50,000 rows × 93 columns

The total dimensionality therefore went from more than 6,300 columns to **just 93**, which enables local fitting, practical interpretability, and full traceability as required by **PROTOCOL V5.4**.
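
A back-of-the-envelope estimate helps explain the out-of-memory failures. The sketch below assumes every cell is stored as an 8-byte `float64`, which is a simplification: the real frames mix bool, object, and numeric columns, so measured values (such as the 165.26 MB reported for train V1.1) will differ. `dense_mib` is a hypothetical helper name.

```python
def dense_mib(n_rows, n_cols, bytes_per_value=8):
    """Size in MiB of a dense numeric table (assumes 8-byte float64 values)."""
    return n_rows * n_cols * bytes_per_value / (1024 * 1024)


before = dense_mib(100_000, 6_305)  # old pipeline, train split
after = dense_mib(100_000, 93)      # CURATED V1.1, train split

print(f"before: {before:,.0f} MiB")
print(f"after : {after:,.0f} MiB")
print(f"column reduction factor: {6_305 / 93:.1f}x")
```

Under this assumption the old train matrix alone needs several GiB before any model copies it, while the 93-column version fits in well under 100 MiB.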


In [None]:
# STEP: FINAL DIAGNOSIS OF RESIDUAL STRINGS IN X

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load V1.1
df = pd.read_csv('/workspace/data/curated/train_curated_v1_1.csv')

# Separate X and y
target = 'Credit_Score'
X = df.drop(columns=[target])
y = df[target]

# Check dtypes
print("\nDtypes in X before encoding:")
print(X.dtypes.value_counts())

# Identify object columns
object_cols = X.select_dtypes(include=['object']).columns.tolist()
print("\nObject columns detected:", object_cols)

# Show unique values per column for auditing
for col in object_cols:
    print(f"\nUnique values in {col}:", X[col].unique())

# Apply LabelEncoder column by column (integer codes follow sorted string order)
for col in object_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))

# Check the result
print("\nDtypes in X after encoding:")
print(X.dtypes.value_counts())

# Split once every column is numeric
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)



Dtypes in X before encoding:
bool       50
object     22
float64    13
int64       7
Name: count, dtype: int64

Object columns detected: ['Age_Binned', 'Amount_invested_monthly_Binned', 'Annual_Income_Binned', 'Changed_Credit_Limit_Binned', 'Credit_History_Age', 'Credit_History_Age_Binned', 'Credit_Mix', 'Credit_Utilization_Ratio_Binned', 'Delay_from_due_date_Binned', 'Interest_Rate_Binned', 'Monthly_Balance_Binned', 'Monthly_Inhand_Salary_Binned', 'Num_Bank_Accounts_Binned', 'Num_Credit_Card_Binned', 'Num_Credit_Inquiries_Binned', 'Num_of_Delayed_Payment_Binned', 'Num_of_Loan_Binned', 'Occupation', 'Outstanding_Debt_Binned', 'Payment_of_Min_Amount', 'Total_EMI_per_month_Binned', 'Type_of_Loan']

Unique values in Age_Binned: ['Jovem' 'Adulto' 'Idoso' 'Erro']

Unique values in Amount_invested_monthly_Binned: ['Baixo' 'Nenhum' 'Moderado' 'Alto']

Unique values in Annual_Income_Binned: ['Baixa' 'Média' 'Alta' 'Muito_Alta']

Unique values in Changed_Credit_Limit_Binned: ['Aum

---
## Supervised Baseline: CURATED V1 with MLflow Tracking

This block runs the **baseline fit** on the optimized `CURATED V1` layer.  
It uses a Decision Tree with controlled depth and logs the main metrics to **MLflow**, so the experiment is fully traceable.

Configuration:
- Target: `Credit_Score`
- Features: all numeric, binned, and grouped columns, excluding IDs and redundant text
- Tracking URI: internal (`http://mlflow:5000`), matching the container network


---
## Baseline Fit: CURATED V1.1 Snapshot with Correct Preprocessing

This block runs the **supervised baseline fit** on `CURATED V1.1`,  
using the clean `X_train` in which every `_Binned` and grouped column has been **converted to numeric values** with `LabelEncoder`.

It uses a Decision Tree (`max_depth=5`) tracked in the **same MLflow project**, keeping the local URI consistent, with the UI reachable at `http://127.0.0.1:5000`.


In [None]:
import os

# 🔐 Environment variables read by boto3 (they last only for this kernel process)
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://127.0.0.1:9000'
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'

print("✅ MinIO credentials configured for MLflow.")


✅ MinIO credentials configured for MLflow.
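
As an alternative to hard-coding secrets in a notebook cell, the settings can be read from variables that are already present in the environment (for example, injected by Docker Compose). The sketch below is only an illustration under that assumption: `minio_settings` is a hypothetical helper, and the values passed in the example are placeholders, not real credentials.

```python
import os


def minio_settings(env=None):
    """Collect the MinIO/S3 settings that MLflow's S3 client reads from the environment.

    Hypothetical helper: it raises if a variable is missing instead of
    falling back to a hard-coded secret.
    """
    env = os.environ if env is None else env
    required = ["MLFLOW_S3_ENDPOINT_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in required}


# Example with an explicit mapping (placeholder values only).
settings = minio_settings({
    "MLFLOW_S3_ENDPOINT_URL": "http://minio:9000",
    "AWS_ACCESS_KEY_ID": "example-user",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
})
print(sorted(settings))
```

Calling `minio_settings()` with no argument checks the real process environment and fails early when something is missing, before any MLflow artifact upload is attempted.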


In [None]:
# STEP: FINAL BASELINE FIT, CURATED V1.1 WITH THE CORRECT ENDPOINT

import os
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Point the S3 endpoint at the service name inside the Docker network
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'

print("✅ S3 endpoint inside the container:", os.environ['MLFLOW_S3_ENDPOINT_URL'])

# MLflow tracking configuration
mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("Baseline_Curated_V1.1")

with mlflow.start_run():
    clf = DecisionTreeClassifier(max_depth=5, random_state=42)
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_val)

    acc = accuracy_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred, average='macro')

    mlflow.log_param("model_type", "DecisionTreeClassifier")
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)

    mlflow.sklearn.log_model(clf, "model_baseline_v1_1")

print(f"\nBaseline finished | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")


✅ S3 endpoint inside the container: http://minio:9000




🏃 View run receptive-lynx-60 at: http://mlflow:5000/#/experiments/3/runs/de0cd9753dd74e6b8f7d5c26663add2f
🧪 View experiment at: http://mlflow:5000/#/experiments/3

Baseline finished | Accuracy: 0.6881 | F1 Macro: 0.6519
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


## Supervised Grid Search: CURATED V1.1 with MLflow Tracking

This block runs a **supervised Grid Search** for `DecisionTreeClassifier` on `CURATED V1.1`.  
It uses:
- The same `X_train` and `y_train`, already encoded.
- scikit-learn's `GridSearchCV`.
- Tracking of the winning hyperparameter combination and its score in the **same MLflow experiment**, so the metrics stay traceable.

The goal is to find the combination of `max_depth` and `min_samples_split` that maximizes **F1 Macro**.


In [None]:
# STEP: SUPERVISED GRID SEARCH WITH MULTICLASS SCORING, DECISION TREE

import math

import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# 1️⃣ Define the classifier and the hyperparameter grid
clf = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10, 20]
}

# 2️⃣ Use StratifiedKFold so every fold keeps the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3️⃣ Run GridSearchCV with f1_macro
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=cv,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters: {best_params}")
print(f"Best F1 Macro: {round(best_score, 4) if best_score is not None else 'N/A'}")

# 4️⃣ Log to MLflow (skip the metric if the score is NaN)
with mlflow.start_run(run_name="grid_search_decision_tree"):
    mlflow.log_params(best_params)
    if best_score is not None and not math.isnan(best_score):
        mlflow.log_metric("best_f1_macro", best_score)
    else:
        print("⚠️ Score is NaN, so the metric will not be logged to avoid a key conflict")

    # Log the fitted model
    mlflow.sklearn.log_model(
        sk_model=grid_search.best_estimator_,
        artifact_path="grid_search_model",
        input_example=X_train.iloc[:5, :]  # Optional: remove it if you prefer to avoid the warning
    )

print(f"\n🏃 GridSearch finished | Best F1 Macro: {round(best_score, 4) if best_score is not None else 'N/A'} | Parameters: {best_params}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")


Fitting 5 folds for each of 16 candidates, totalling 80 fits




Best parameters: {'max_depth': 10, 'min_samples_split': 10}
Best F1 Macro: 0.6828




🏃 View run grid_search_decision_tree at: http://mlflow:5000/#/experiments/3/runs/98db11909ed746028a61ba13a6b9609e
🧪 View experiment at: http://mlflow:5000/#/experiments/3

🏃 GridSearch finished | Best F1 Macro: 0.6828 | Parameters: {'max_depth': 10, 'min_samples_split': 10}
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000
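
The grid search cell logs only the best combination. If every candidate should appear in MLflow, one option is to flatten `cv_results_` into one record per candidate and log each as a nested run. The sketch below assumes that approach; `candidate_records` and `log_candidates` are hypothetical helpers, the metric names are arbitrary, and the scores in the example are made up to mimic the shape of `cv_results_`.

```python
def candidate_records(cv_results):
    """One record per grid candidate from a GridSearchCV.cv_results_-style mapping."""
    return [
        {"params": dict(params), "mean_f1_macro": float(mean), "std_f1_macro": float(std)}
        for params, mean, std in zip(
            cv_results["params"],
            cv_results["mean_test_score"],
            cv_results["std_test_score"],
        )
    ]


def log_candidates(records, parent_run_name="grid_search_all_candidates"):
    """Log each candidate as a nested MLflow run (needs mlflow and a reachable server)."""
    import mlflow  # imported lazily so candidate_records works without MLflow

    with mlflow.start_run(run_name=parent_run_name):
        for rec in records:
            with mlflow.start_run(nested=True):
                mlflow.log_params(rec["params"])
                mlflow.log_metric("mean_f1_macro", rec["mean_f1_macro"])
                mlflow.log_metric("std_f1_macro", rec["std_f1_macro"])


# Made-up scores that only mimic the structure of cv_results_.
fake_results = {
    "params": [
        {"max_depth": 3, "min_samples_split": 2},
        {"max_depth": 10, "min_samples_split": 10},
    ],
    "mean_test_score": [0.61, 0.68],
    "std_test_score": [0.01, 0.02],
}
records = candidate_records(fake_results)
best = max(records, key=lambda r: r["mean_f1_macro"])
print(best["params"])
```

In the notebook this would be called as `log_candidates(candidate_records(grid_search.cv_results_))` after `grid_search.fit(...)`.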


In [None]:
# 🔧 STEP: DATA PREPARATION AND SPLIT

"""
This block rebuilds X and y, runs a stratified train_test_split,
and prints shapes and classes as a consistency check.
"""

from sklearn.model_selection import train_test_split

# 1️⃣ Define X and y (adjust the name if the target is not 'Credit_Score')
X = df.drop('Credit_Score', axis=1)
y = df['Credit_Score']

# 2️⃣ Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3️⃣ Check shapes and classes
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
print(f"Classes: {y_train.unique()}")


X_train shape: (70000, 92)
X_test shape: (30000, 92)
y_train shape: (70000,)
y_test shape: (30000,)
Classes: ['Standard' 'Poor' 'Good']


In [None]:
# 🔧 STEP: ENCODING + IMPUTATION + LOGISTIC REGRESSION BASELINE

"""
Self-contained block:
1️⃣ Diagnoses data types
2️⃣ Applies OneHotEncoder to categorical columns
3️⃣ Combines everything into the final X matrix
4️⃣ Imputes NaN with the mean for numeric columns
5️⃣ Trains a multiclass Logistic Regression
6️⃣ Logs everything to MLflow
"""

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Dtype diagnosis
print("Initial diagnosis:")
print(X_train.dtypes)

# 2️⃣ Identify columns
# Note: bool columns (the one-hot dummies) match neither selector, so the
# ColumnTransformer below discards them (its default is remainder='drop').
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols   = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical: {categorical_cols}")
print(f"Numeric: {numerical_cols}")

# 3️⃣ Preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), numerical_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
])

# 4️⃣ Final pipeline with Logistic Regression
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(
        max_iter=1000, 
        solver='lbfgs',
        multi_class='multinomial'
    ))
])

# 5️⃣ Fit the pipeline
pipeline.fit(X_train, y_train)

# 6️⃣ Prediction and metrics
y_pred = pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 7️⃣ MLflow tracking
with mlflow.start_run(run_name="logistic_regression_with_encoding_imputation", experiment_id=3):
    mlflow.log_param("solver", "lbfgs")
    mlflow.log_param("multi_class", "multinomial")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_param("imputer_strategy", "mean")
    mlflow.log_param("encoding", "OneHot")
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(pipeline, "logistic_regression_pipeline")

print(f"\n✅ Logistic Regression Baseline | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")


Initial diagnosis:
Age                                    float64
Age_Binned                              object
Amount_invested_monthly                float64
Amount_invested_monthly_Binned          object
Amount_invested_monthly_Binned_High       bool
                                        ...   
Type_of_Loan_Category_Mortgage Loan       bool
Type_of_Loan_Category_Not Specified       bool
Type_of_Loan_Category_Payday Loan         bool
Type_of_Loan_Category_Personal Loan       bool
Type_of_Loan_Category_Student Loan        bool
Length: 92, dtype: object
Categorical: ['Age_Binned', 'Amount_invested_monthly_Binned', 'Annual_Income_Binned', 'Changed_Credit_Limit_Binned', 'Credit_History_Age', 'Credit_History_Age_Binned', 'Credit_Mix', 'Credit_Utilization_Ratio_Binned', 'Delay_from_due_date_Binned', 'Interest_Rate_Binned', 'Monthly_Balance_Binned', 'Monthly_Inhand_Salary_Binned', 'Num_Bank_Accounts_Binned', 'Num_Credit_Card_Binned', 'Num_Credit_Inquiries_Binned', 'Num_of_Delayed_Paym

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


🏃 View run logistic_regression_with_encoding_imputation at: http://mlflow:5000/#/experiments/3/runs/7edcbfe85b1440bba3fde06fe3b76615
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ Logistic Regression Baseline | Accuracy: 0.5415 | F1 Macro: 0.355
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


# Technical Rationale: Targeted Normalization for Scale-Sensitive Models

To preserve traceability and reuse of the **v1.1** dataset, we decided to:
- Keep the **v1.1 dataset** **unchanged** (all original columns, with no physical modification of the file or table).
- Apply **normalization only to the numeric variables**, **in memory**, using scikit-learn's `StandardScaler`.
- Treat this normalization as **temporary**: it happens **at training time**, for the models that need features on the same scale (for example Logistic Regression, SVM, KNN, Neural Networks).

**Why not normalize the whole dataset at the source?**  
Keeping the dataset unscaled makes auditing, feature debugging, and comparison of pipelines with and without preprocessing easier.

Therefore:
- Dataset **v1.1** = single, traceable base.
- Normalization = applied **in the pipeline**, on **X_train/X_test**, to the numeric columns only.
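
The rule "fit on train, apply to test" can be illustrated without any library. The sketch below reproduces the z-score formula that `StandardScaler` applies, z = (x - mean) / std with the population standard deviation; `fit_scaler` and `transform` are hypothetical helper names and the income values are made up.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Return (mean, std) learned from training values; std is the population std."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # avoid dividing by zero for constant columns
    return mu, sigma


def transform(column, mu, sigma):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]


# Illustrative values only.
train_income = [20_000.0, 40_000.0, 60_000.0]
test_income = [40_000.0, 80_000.0]

mu, sigma = fit_scaler(train_income)             # statistics come from train only
train_scaled = transform(train_income, mu, sigma)
test_scaled = transform(test_income, mu, sigma)  # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

Because the test values are transformed with the training mean and standard deviation, no information from the test split leaks into preprocessing, and the file on disk is never modified.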



In [None]:
# 🔧 STEP: STANDARD NORMALIZATION OF THE NUMERIC VARIABLES

"""
This cell applies StandardScaler only to the numeric variables of the v1.1 dataset.
"""

from sklearn.preprocessing import StandardScaler

# Example: list the confirmed numeric columns explicitly
numerical_features = [
    'Age', 'Amount_invested_monthly', 'Annual_Income', 'Changed_Credit_Limit',
    'Credit_History_Age_Months', 'Credit_Utilization_Ratio', 'Delay_from_due_date',
    'Interest_Rate', 'Monthly_Balance', 'Monthly_Inhand_Salary',
    'Num_Bank_Accounts', 'Num_Credit_Card', 'Num_Credit_Inquiries',
    'Num_of_Delayed_Payment', 'Num_of_Loan', 'Outstanding_Debt',
    'Total_EMI_per_month'
    # Adjust to match your validated list
]

# Initialize the scaler
scaler = StandardScaler()

# Fit on train, then transform train and test with the same statistics
X_train_scaled = X_train.copy()
X_test_scaled  = X_test.copy()

X_train_scaled[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test_scaled[numerical_features]  = scaler.transform(X_test[numerical_features])

print("✅ Normalization applied in memory | Shapes identical to v1.1")
print(f"X_train_scaled shape: {X_train_scaled.shape}")
print(f"X_test_scaled shape : {X_test_scaled.shape}")


✅ Normalization applied in memory | Shapes identical to v1.1
X_train_scaled shape: (70000, 92)
X_test_scaled shape : (30000, 92)


In [None]:
# 🔧 STEP: IMPUTATION + NORMALIZATION + LOGISTIC REGRESSION ON NUMERIC FEATURES

"""
Rebuilds X_train_num and X_test_num from the original dataset (v1_1),
imputes missing values (mean), normalizes with StandardScaler,
fits a multinomial Logistic Regression, and logs everything to MLflow.
"""

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Select the numeric columns
# Note: the three Month_* entries are one-hot dummy columns, not continuous variables.
cols_num = [
    'Age', 'Amount_invested_monthly', 'Annual_Income', 'Changed_Credit_Limit',
    'Credit_History_Age_Months', 'Credit_Utilization_Ratio', 'Delay_from_due_date',
    'Interest_Rate', 'Month_November', 'Month_October', 'Month_September',
    'Monthly_Balance', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts',
    'Num_Credit_Card', 'Num_Credit_Inquiries', 'Num_of_Delayed_Payment',
    'Num_of_Loan', 'Outstanding_Debt', 'Total_EMI_per_month'
]

X_train_num = X_train[cols_num].copy()
X_test_num  = X_test[cols_num].copy()

# 2️⃣ Impute NaN with the mean (fit on train only)
imputer = SimpleImputer(strategy='mean')
X_train_num_imputed = imputer.fit_transform(X_train_num)
X_test_num_imputed  = imputer.transform(X_test_num)

# 3️⃣ Normalize (fit on train only)
scaler = StandardScaler()
X_train_scaled_num = scaler.fit_transform(X_train_num_imputed)
X_test_scaled_num  = scaler.transform(X_test_num_imputed)

print(f"✅ X_train_scaled_num shape: {X_train_scaled_num.shape}")
print(f"✅ X_test_scaled_num shape : {X_test_scaled_num.shape}")

# 4️⃣ Fit the multinomial Logistic Regression
logreg = LogisticRegression(
    max_iter=1000,
    solver='lbfgs',
    multi_class='multinomial'
)
logreg.fit(X_train_scaled_num, y_train)

# 5️⃣ Prediction and metrics
y_pred = logreg.predict(X_test_scaled_num)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 6️⃣ MLflow tracking
with mlflow.start_run(run_name="logistic_regression_num_scaled", experiment_id=3):
    mlflow.log_param("solver", "lbfgs")
    mlflow.log_param("multi_class", "multinomial")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_param("imputer_strategy", "mean")
    mlflow.log_param("scaler", "StandardScaler")
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(logreg, "logistic_regression_model")

print(f"\n✅ Logistic Regression Numeric | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")


✅ X_train_scaled_num shape: (70000, 20)
✅ X_test_scaled_num shape : (30000, 20)




🏃 View run logistic_regression_num_scaled at: http://mlflow:5000/#/experiments/3/runs/78d457acce314979815f94fc80d286de
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ Logistic Regression Numeric | Accuracy: 0.5928 | F1 Macro: 0.4974
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


##  Technical Rationale: Running SVM on Normalized Numeric Variables

This block continues the experimentation with models that need data on a uniform scale.  
The **Support Vector Machine (SVM)** is sensitive to feature magnitude, so **normalization is required** to get good class separation around the decision hyperplane.

We are using:
- **Dataset version 1.2**, which contains only **numeric** variables, already **imputed** and **normalized** with `StandardScaler`.
- Arrays: `X_train_scaled_num` and `X_test_scaled_num`.

Goal:
- Produce a solid SVM baseline inside the same traceable **MLflow** flow, with versioning, consistent parameters (`kernel`, `C`), and a fair comparison against the other algorithms that also need normalization.

This run follows the **self-contained block** protocol, with a clear technical header and full logging of parameters and metrics.

After SVM, the pipeline moves on to **KNN** and **MLP**, keeping the same structure for validation.



In [None]:
# 🔧 STEP: SVM on Normalized Numeric Features

"""
This block fits the Support Vector Machine (SVM) model
using only the imputed and normalized numeric array (v1.2).
It covers fitting, prediction, metrics, and MLflow logging.
"""

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Instantiate the SVM model
svm = SVC(kernel='rbf', C=1.0)

# 2️⃣ Fit
svm.fit(X_train_scaled_num, y_train)

# 3️⃣ Prediction
y_pred = svm.predict(X_test_scaled_num)

# 4️⃣ Metrics
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 5️⃣ MLflow tracking
with mlflow.start_run(run_name="svm_num_scaled", experiment_id=3):
    mlflow.log_param("kernel", "rbf")
    mlflow.log_param("C", 1.0)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(svm, "svm_model")

print(f"\n✅ SVM Numeric | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")




🏃 View run svm_num_scaled at: http://mlflow:5000/#/experiments/3/runs/8148b15e77874ab7ac850fdb0e658e22
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ SVM Numeric | Accuracy: 0.6223 | F1 Macro: 0.4626
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


# Result Log: SVM on Normalized Numeric Data

SVM ran as planned, using `StandardScaler` to ensure a fair comparison and better separation around the hyperplane.  
- **Dataset:** `v1.2` (numeric features, imputed and normalized)  
- **Accuracy:** 0.6223  
- **F1 Macro:** 0.4626  
- **MLflow run:** [Run link](http://mlflow:5000/#/experiments/3/runs/8148b15e77874ab7ac850fdb0e658e22)

This step reinforces the need to keep standardization for scale-sensitive algorithms, and documents the versioning for full pipeline traceability.

**Next steps:**  
Continue with **K-Nearest Neighbors (KNN)** and the **MLP Classifier**, using the same `X_train_scaled_num` and `X_test_scaled_num` arrays to complete the comparison of normalization-sensitive models.
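
Accuracy (0.6223) and F1 Macro (0.4626) diverge noticeably in this run. One common reason for such a gap, assumed here rather than verified on the real predictions, is a model that favors the majority class: macro F1 averages the per-class F1 scores with equal weight, so weak minority classes pull it down. The sketch below computes both metrics by hand on toy labels; `macro_f1` and `accuracy` are illustrative helpers, not the scikit-learn functions used in the notebook.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (not the real predictions): the model leans toward "Standard".
y_true = ["Standard"] * 6 + ["Poor"] * 3 + ["Good"] * 1
y_pred = ["Standard"] * 6 + ["Standard", "Standard", "Poor"] + ["Standard"]

print(round(accuracy(y_true, y_pred), 4))
print(round(macro_f1(y_true, y_pred), 4))
```

In this toy case the per-class F1 scores are 0.8 for `Standard`, 0.5 for `Poor`, and 0.0 for `Good`, so macro F1 lands well below the 0.7 accuracy.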


# 🔧 STEP: K-Nearest Neighbors, Baseline on Normalized Numeric Features

This step runs KNN as part of the group of algorithms that need normalized data (`v1.2`).  
The goal is to evaluate KNN on the same numeric features previously scaled with `StandardScaler`, so that results are comparable across models.  
All parameters, metrics, and artifacts are tracked in MLflow, following the versioning protocol.

- **Dataset:** v1.2 (numeric features, imputed and normalized)  
- **Note:** No OneHotEncoding step here; only the numeric column list is used (it also carries three `Month_*` dummy columns)
- **Metrics:** Accuracy and F1 Macro  
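
To see why KNN needs scaled inputs, consider Euclidean distances on two features with very different ranges. The sketch below uses toy points and assumed standard deviations (20,000 for income, 2 for card count); none of these numbers come from the dataset. Only division by the standard deviation is shown, since subtracting the mean does not change distances.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally sized points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scale(point, stds):
    """Divide each coordinate by an assumed per-feature standard deviation."""
    return tuple(x / s for x, s in zip(point, stds))


# Toy points as (Annual_Income, Num_Credit_Card).
query = (50_000.0, 2.0)
a = (50_500.0, 10.0)  # close in income, very different card count
b = (52_000.0, 2.0)   # same card count, slightly different income
stds = (20_000.0, 2.0)  # assumed values for illustration

raw_a, raw_b = euclidean(query, a), euclidean(query, b)
scaled_a = euclidean(scale(query, stds), scale(a, stds))
scaled_b = euclidean(scale(query, stds), scale(b, stds))

print("raw    :", "a is nearer" if raw_a < raw_b else "b is nearer")
print("scaled :", "a is nearer" if scaled_a < scaled_b else "b is nearer")
```

On raw values the income column dominates the distance; after scaling, the card-count difference counts as well and the nearest neighbor changes.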


In [None]:
# 🔧 STEP: KNN Baseline on Normalized Numeric Features

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Define and fit KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled_num, y_train)

# 2️⃣ Prediction and metrics
y_pred = knn.predict(X_test_scaled_num)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 3️⃣ MLflow
with mlflow.start_run(run_name="knn_num_scaled", experiment_id=3):
    mlflow.log_param("n_neighbors", 5)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(knn, "knn_model")

print(f"\n✅ KNN Numeric | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")




🏃 View run knn_num_scaled at: http://mlflow:5000/#/experiments/3/runs/f532e223f98c449fae1e0ce12b6ad03b
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ KNN Numeric | Accuracy: 0.5822 | F1 Macro: 0.5407
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


# 🔧 STEP: MLP Classifier, Baseline on Normalized Numeric Features

This step applies the **Multi-layer Perceptron (MLP Classifier)** to the `v1.2` set,  
which contains only numeric variables, imputed and normalized with `StandardScaler`.  
The goal is to see how a simple neural network behaves in this setting, with traceability in MLflow.  
All hyperparameters, metrics, and artifacts are versioned.

- **Dataset:** v1.2 (numeric features, imputed and normalized)  
- **Metrics:** Accuracy and F1 Macro  
- **Note:** MLP is particularly sensitive to unscaled data.


In [None]:
# 🔧 STEP: MLP Classifier Baseline on Normalized Numeric Features

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Define and fit the MLP
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp.fit(X_train_scaled_num, y_train)

# 2️⃣ Prediction and metrics
y_pred = mlp.predict(X_test_scaled_num)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 3️⃣ MLflow
with mlflow.start_run(run_name="mlp_num_scaled", experiment_id=3):
    mlflow.log_param("hidden_layer_sizes", "(100,)")
    mlflow.log_param("max_iter", 300)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(mlp, "mlp_model")

print(f"\n✅ MLP Numeric | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")




🏃 View run mlp_num_scaled at: http://mlflow:5000/#/experiments/3/runs/aa47219a2fc14174a3fb8851ad6bc814
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ MLP Numeric | Accuracy: 0.6618 | F1 Macro: 0.6086
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


# 🔧 STEP: Ensemble Models with the Original Dataset v1.1

This step returns to the original `v1.1` dataset  
(already with **OneHotEncoding**, appropriate imputation, and no normalization)  
to train and evaluate the tree-based and ensemble models:

- **Decision Tree** (already run earlier; it serves as the comparison point)
- **Random Forest**
- **XGBoost**
- **LightGBM**

These algorithms do not require normalized data, because their splits and weights are determined by ordering relations, not by distance.
Each model will be logged to MLflow with its parameters and metrics.
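
The claim that tree splits depend on ordering rather than distance can be checked with a minimal sketch. This is not scikit-learn's implementation: it is a hypothetical single-feature split search using Gini impurity, with made-up data, showing that min-max scaling leaves the chosen partition unchanged.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_left_rows(values, labels):
    """Return the row indices sent left by the best single-threshold split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = None, None
    for cut in range(1, len(order)):
        if values[order[cut - 1]] == values[order[cut]]:
            continue  # no threshold can separate equal values
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right]))
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Hypothetical feature values and class labels.
debt = [120.0, 4500.0, 300.0, 9800.0, 7600.0, 50.0]
label = [0, 1, 0, 1, 1, 0]

# Min-max scaling is monotonic: it changes magnitudes but not the ordering.
lo, hi = min(debt), max(debt)
debt_scaled = [(v - lo) / (hi - lo) for v in debt]

same_partition = best_split_left_rows(debt, label) == best_split_left_rows(debt_scaled, label)
print(same_partition)
```

Because only the threshold value changes and not the set of rows on each side, the resulting tree makes the same predictions with or without scaling.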


# Diagnosis of Object-Type Columns

This block runs a precise technical diagnosis of the columns that are still of type `object` in `X_train` and `X_test`.  
This step is mandatory because tree-based estimators such as **Random Forest**, **XGBoost**, **LightGBM**, and **HistGradientBoosting** **do not accept categorical variables stored as `object` or `string`**: they require every input column to be **numeric**.

It is also important to ensure there are no missing values (`NaN`) before training, because even the algorithms that tolerate some `NaN` values can behave unstably or fail to find correct splits in the tree.

The procedure therefore performs three fundamental checks:
1. Maps all `object` columns in `X_train` to confirm which variables need transformation via **OrdinalEncoder**.
2. Shows the number of unique values in each column, with example categories, to detect inconsistent cardinalities or other inconsistencies.
3. Counts the `NaN` values in each `object` column, which informs the imputation strategy.

This diagnosis ensures that, in the next step, the whole imputation and encoding pipeline is built with **consistency and traceability**, in line with the project's main rule: **do not use a OneHotEncoder** that inflates dimensionality, and do not exceed the defined column limit.


In [None]:
# STEP: Diagnose object columns

# Map and inspect the object-type columns in X_train

print("\nResumo de tipos em X_train:")
print(X_train.dtypes.value_counts())

print("\nColunas do tipo object em X_train:")
object_cols_train = X_train.select_dtypes(include='object').columns.tolist()
print(object_cols_train)

for col in object_cols_train:
    print(f"\nColuna: {col}")
    uniques = X_train[col].unique()
    nunique = X_train[col].nunique()
    print(f"Valores únicos ({nunique}): {uniques[:20]}")
    if nunique > 20:
        print(f"... ({nunique - 20} valores adicionais não exibidos)")
    print(f"Qtd NaNs: {X_train[col].isna().sum()}")

print("\nVerificação final:")
print(f"Total de colunas object em X_train: {len(object_cols_train)}")



Resumo de tipos em X_train:
bool       50
object     22
float64    13
int64       7
Name: count, dtype: int64

Colunas do tipo object em X_train:
['Age_Binned', 'Amount_invested_monthly_Binned', 'Annual_Income_Binned', 'Changed_Credit_Limit_Binned', 'Credit_History_Age', 'Credit_History_Age_Binned', 'Credit_Mix', 'Credit_Utilization_Ratio_Binned', 'Delay_from_due_date_Binned', 'Interest_Rate_Binned', 'Monthly_Balance_Binned', 'Monthly_Inhand_Salary_Binned', 'Num_Bank_Accounts_Binned', 'Num_Credit_Card_Binned', 'Num_Credit_Inquiries_Binned', 'Num_of_Delayed_Payment_Binned', 'Num_of_Loan_Binned', 'Occupation', 'Outstanding_Debt_Binned', 'Payment_of_Min_Amount', 'Total_EMI_per_month_Binned', 'Type_of_Loan']

Coluna: Age_Binned
Valores únicos (4): ['Adulto' 'Jovem' 'Idoso' 'Erro']
Qtd NaNs: 0

Coluna: Amount_invested_monthly_Binned
Valores únicos (4): ['Baixo' 'Moderado' 'Alto' 'Nenhum']
Qtd NaNs: 0

Coluna: Annual_Income_Binned
Valores únicos (4): ['Baixa' 'Média' 'Alta' 'Muito_Alta'

# Transforming Credit_History_Age and Type_of_Loan_Category with Controlled Encoding

This block applies the **special** transformations that adapt the `X_train` dataset (version `v1.1`) for tree-based models without violating the cardinality and traceability guidelines.

**1. Converting `Credit_History_Age`**  
The `Credit_History_Age` column is stored as text (`'X Years and Y Months'`).  
It is converted to **whole years**, adding 1 when the months are greater than or equal to 6, following the established rule.  
The transformed value is stored in `Credit_History_Age_Years`, and the original column is dropped.
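
As a worked example of the rounding rule, here is a minimal standard-library sketch. The function name and the regular expression are assumptions made for illustration; unlike a plain string split, this variant returns `nan` for text that does not match the expected pattern instead of raising.

```python
import math
import re

# Assumed pattern for the textual format 'X Years and Y Months'.
_AGE_PATTERN = re.compile(r"(\d+)\s+Years?\s+and\s+(\d+)\s+Months?")

def credit_history_years(text):
    """Convert 'X Years and Y Months' to whole years, rounding up at 6+ months."""
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return math.nan  # keep missing values missing
    match = _AGE_PATTERN.fullmatch(text.strip())
    if match is None:
        return math.nan  # unexpected format: flag it instead of raising
    years, months = int(match.group(1)), int(match.group(2))
    return years + 1 if months >= 6 else years

print(credit_history_years("22 Years and 1 Months"))
print(credit_history_years("15 Years and 6 Months"))
print(credit_history_years(None))
```

Missing inputs stay missing after the conversion, so any `NaN` in the source column carries over to the derived column and still has to be handled later.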

**2. Simplifying `Type_of_Loan` into `Type_of_Loan_Category`**  
To avoid a category explosion, a **credit-risk hierarchy** is applied, assigning each row the **most restrictive loan type** present in its combination.  
The priority order is:
1. Payday Loan
2. Credit-Builder Loan
3. Debt Consolidation Loan
4. Personal Loan
5. Student Loan
6. Auto Loan
7. Home Equity Loan
8. Mortgage Loan
9. Not Specified

Every multi-loan combination is thus reduced to a **single** risk class. The original `Type_of_Loan` column is removed after the transformation.
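
The priority rule can be illustrated on a few sample strings. The sketch below uses only the standard library; `LOAN_PRIORITY` and `loan_category` are illustrative names, and the sample inputs are hypothetical.

```python
# Priority order copied from the list above; earlier entries win.
LOAN_PRIORITY = [
    "Payday Loan",
    "Credit-Builder Loan",
    "Debt Consolidation Loan",
    "Personal Loan",
    "Student Loan",
    "Auto Loan",
    "Home Equity Loan",
    "Mortgage Loan",
]

def loan_category(loan_string):
    """Return the highest-priority loan type mentioned in the string."""
    if loan_string is None:
        return "Not Specified"
    text = loan_string.lower()
    for loan_type in LOAN_PRIORITY:
        if loan_type.lower() in text:
            return loan_type
    return "Not Specified"

print(loan_category("Auto Loan, Mortgage Loan, and Payday Loan"))
print(loan_category("Mortgage Loan, and Student Loan"))
print(loan_category("Not Specified"))
```

The first match in priority order wins, so the position of a loan type inside the original string has no effect on the result.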

**3. Imputation and Encoding**
- Remaining `object` variables with `NaN` are imputed with the most frequent value.
- `OrdinalEncoder` is applied with `handle_unknown='use_encoded_value'` to keep train and test consistent.
- All final fields are confirmed to be **numeric**, respecting the defined column limit.

The whole pipeline keeps the data consistent for Random Forest, XGBoost, LightGBM, and HistGradientBoosting, **without excessive normalization**, and maintains traceability as required by **PROTOCOLO V5.4**.


## Loading the v1.1 Base to Start the Transformation

This block ensures that the `X_train` DataFrame is loaded in memory before any transformation is applied.  
The base used must be version **`v1.1`**, with the original columns, as recorded in the protocol.  
Without this load, the `X_train` and `X_test` variables do not exist in the Python environment and every processing step will fail.




In [None]:
# STEP: Load the CURATED v1.1 base

import pandas as pd

# Adjust the paths to match your actual versioning layout
X_train = pd.read_csv('/workspace/data/curated/train_curated_v1_1.csv')
X_test  = pd.read_csv('/workspace/data/curated/test_curated_v1_1.csv')

print("Shape X_train:", X_train.shape)
print("Shape X_test :", X_test.shape)


Shape X_train: (100000, 93)
Shape X_test : (50000, 93)


In [None]:
# STEP: Credit_History_Age conversion and Type_of_Loan_Category hierarchy

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# 1️⃣ Convert Credit_History_Age to whole years (add 1 when months >= 6)
def convert_age(text):
    if pd.isna(text):
        return np.nan
    parts = text.split(' Years and ')
    years = int(parts[0].strip())
    months = int(parts[1].replace(' Months', '').strip())
    return years + 1 if months >= 6 else years

X_train['Credit_History_Age_Years'] = X_train['Credit_History_Age'].apply(convert_age)

# Drop the original column once the derived one exists
X_train.drop(columns=['Credit_History_Age'], inplace=True)

# 2️⃣ Simplify Type_of_Loan into Type_of_Loan_Category
def map_loan_category(loan_string):
    if pd.isna(loan_string):
        return 'Not Specified'
    loan_string = loan_string.lower()
    hierarchy = [
        'Payday Loan',
        'Credit-Builder Loan',
        'Debt Consolidation Loan',
        'Personal Loan',
        'Student Loan',
        'Auto Loan',
        'Home Equity Loan',
        'Mortgage Loan'
    ]
    for loan_type in hierarchy:
        if loan_type.lower() in loan_string:
            return loan_type
    return 'Not Specified'

X_train['Type_of_Loan_Category'] = X_train['Type_of_Loan'].apply(map_loan_category)

# Drop the original Type_of_Loan column
X_train.drop(columns=['Type_of_Loan'], inplace=True)

# 3️⃣ Imputation and OrdinalEncoder for the remaining object columns
object_cols = X_train.select_dtypes(include='object').columns.tolist()
imputer = SimpleImputer(strategy='most_frequent')
X_train[object_cols] = imputer.fit_transform(X_train[object_cols])

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_train[object_cols] = encoder.fit_transform(X_train[object_cols])

# 4️⃣ Final dtype check
print("\nVerificação final:")
print(X_train.dtypes.value_counts())
print(f"Shape final: {X_train.shape}")



Verificação final:
bool       50
float64    36
int64       7
Name: count, dtype: int64
Shape final: (100000, 93)


## Replicating the Transformations on X_test with Full Consistency

This block ensures that every transformation applied to `X_train` is **replicated faithfully on `X_test`**, preserving pipeline traceability and the consistency that supervised models require.  
Consistency between train and test is **mandatory**: any divergence in imputation, encoding, or hierarchy can cause `unknown category` errors or distort the validation metrics.

**1. Converting `Credit_History_Age`**  
The same conversion to whole years is applied, using the rule that months ≥ 6 add 1 to the year.

**2. Simplifying `Type_of_Loan` into `Type_of_Loan_Category`**  
The risk hierarchy defined for `X_train` is kept, reducing multi-loan combinations to the most restrictive class.

**3. Imputation and Encoding**  
The `SimpleImputer` and the `OrdinalEncoder` **must be the same instances** fitted on `X_train`, so that each category maps to the same code in both sets.  
If unseen values appear, the encoder assigns `unknown_value=-1`, as the protocol specifies.

**4. Final check**  
The block checks that `X_test` has the same number of columns, consistent types, and no `object` columns or `NaN` values before fitting.
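
The fit-on-train, transform-on-test rule for the encoder can be sketched without scikit-learn. The helper names and category values below are hypothetical; the mapping uses sorted category order, which mirrors the default behaviour of `OrdinalEncoder`, and unseen categories fall back to `-1`.

```python
def fit_ordinal(train_values):
    """Map each category seen in training to an integer code (sorted order)."""
    return {cat: code for code, cat in enumerate(sorted(set(train_values)))}

def transform_ordinal(values, mapping, unknown_value=-1):
    """Encode values with a fitted mapping; unseen categories get unknown_value."""
    return [mapping.get(v, unknown_value) for v in values]

# Hypothetical categorical column.
train_mix = ["Good", "Standard", "Bad", "Standard"]
test_mix = ["Bad", "Excellent", "Good"]  # 'Excellent' never appeared in training

mapping = fit_ordinal(train_mix)          # learned from the training data only
encoded_test = transform_ordinal(test_mix, mapping)

print(mapping)
print(encoded_test)
```

Refitting the mapping on the test data would silently assign different codes to the same category, which is why the fitted objects are reused rather than recreated.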


In [None]:
# STEP: Consistent transformation of X_test

# Reuses convert_age, map_loan_category, object_cols, imputer and encoder
# defined and fitted in the X_train cell above, so both sets follow one rule set.

# 1️⃣ Convert Credit_History_Age to whole years
X_test['Credit_History_Age_Years'] = X_test['Credit_History_Age'].apply(convert_age)
X_test.drop(columns=['Credit_History_Age'], inplace=True)

# 2️⃣ Simplify Type_of_Loan into Type_of_Loan_Category
X_test['Type_of_Loan_Category'] = X_test['Type_of_Loan'].apply(map_loan_category)
X_test.drop(columns=['Type_of_Loan'], inplace=True)

# 3️⃣ Imputation and OrdinalEncoder: transform only, never refit on test data
X_test[object_cols] = imputer.transform(X_test[object_cols])
X_test[object_cols] = encoder.transform(X_test[object_cols])

# 4️⃣ Final check
print("\nVerificação final:")
print(X_test.dtypes.value_counts())
print(f"Shape final: {X_test.shape}")
print(f"Valores NaN em X_test: {X_test.isna().sum().sum()}")



Verificação final:
bool       46
float64    36
int64      11
Name: count, dtype: int64
Shape final: (50000, 93)
Valores NaN em X_test: 9975


## Final Diagnosis of Columns with Missing Values in X_test

This block identifies **which columns still contain `NaN` values** after the initial imputation and encoding step.  
The diagnosis is **mandatory**: `X_test` must be free of missing values before it is passed to estimators that reject `NaN`, such as `RandomForestClassifier` in scikit-learn releases without native missing-value support. XGBoost, LightGBM, and `HistGradientBoostingClassifier` handle `NaN` natively, but a single shared feature matrix keeps the comparison consistent.  

The result of this check determines whether an additional imputation is needed, either for numeric columns (`mean` or `median`) or for categorical ones (`most_frequent`).
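
For reference, the numeric imputation option mentioned above could look like the following sketch. It is an illustration only: the notebook goes on to drop the affected columns instead, and the helper names and sample values here are hypothetical.

```python
import math
from statistics import median

def fit_median(train_column):
    """Median of the observed (non-NaN) training values."""
    observed = [v for v in train_column if not math.isnan(v)]
    return median(observed)

def fill_missing(column, fill_value):
    """Replace NaN entries with a value learned on the training data."""
    return [fill_value if math.isnan(v) else v for v in column]

# Hypothetical numeric column with gaps.
train_inquiries = [2.0, 4.0, math.nan, 6.0, 3.0]
test_inquiries = [math.nan, 5.0, math.nan]

fill = fit_median(train_inquiries)               # statistic comes from train only
test_filled = fill_missing(test_inquiries, fill)

print(fill)
print(test_filled)
```

The fill value is computed from the training column and then applied to the test column, following the same train-only fitting rule used for the encoder.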


In [None]:
# STEP: Diagnose the remaining NaNs in X_test

# Columns with missing values and their counts
na_cols = X_test.isna().sum()
na_cols = na_cols[na_cols > 0]

print("\nColunas com valores NaN em X_test:")
print(na_cols)

print(f"\nTotal de valores NaN em X_test: {na_cols.sum()}")



Colunas com valores NaN em X_test:
Credit_History_Age_Months    4470
Num_Credit_Inquiries         1035
Credit_History_Age_Years     4470
dtype: int64

Total de valores NaN em X_test: 9975


# Dropping Columns with Unrecoverable NaNs

This block removes, in a traceable and consistent way, the columns that still contain `NaN` values and have no robust imputation strategy defined.  
The removal is applied **to both `X_train` and `X_test`**, so that train and test keep exactly the same columns and remain structurally consistent.

Columns removed:
- `Credit_History_Age_Months`
- `Credit_History_Age_Years`
- `Num_Credit_Inquiries`

This decision follows **PROTOCOLO V5.4** and avoids inconsistencies or bias in the results.


In [None]:
# STEP: Drop columns with unrecoverable NaNs

cols_to_drop = ['Credit_History_Age_Months', 'Credit_History_Age_Years', 'Num_Credit_Inquiries']

X_train.drop(columns=cols_to_drop, inplace=True, errors='ignore')
X_test.drop(columns=cols_to_drop, inplace=True, errors='ignore')

print("\nColunas removidas:", cols_to_drop)
print("\nShape X_train:", X_train.shape)
print("Shape X_test :", X_test.shape)



Colunas removidas: ['Credit_History_Age_Months', 'Credit_History_Age_Years', 'Num_Credit_Inquiries']

Shape X_train: (100000, 90)
Shape X_test : (50000, 90)


# Splitting the Training Set into Train and Validation + Fitting Random Forest

This block performs a stratified split of `X_train` (`CURATED v1.1`) into training and validation subsets.  
This yields held-out `accuracy` and `f1_macro` estimates before the model is applied to the final `X_test`.

**Default parameters:**
- `test_size=0.2`
- `random_state=42`

The fit is tracked in MLflow with its hyperparameters and metrics.
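
Since both metrics are reported for every model, a small worked example may help show how they differ. The sketch below computes accuracy and macro F1 by hand on made-up labels; it is an illustration of the definitions, not a replacement for `sklearn.metrics`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced labels: class 0 dominates.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0]

print(round(accuracy(y_true, y_pred), 4))
print(round(macro_f1(y_true, y_pred), 4))
```

Macro F1 weights every class equally, so errors on the two minority classes pull it below accuracy in this example. That is the reason both numbers are tracked side by side.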


In [None]:
# STEP: Split the training set and fit with validation

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
import mlflow

# 1️⃣ Stratified split
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

print("Shapes:")
print("X_train_split:", X_train_split.shape)
print("X_val_split :", X_val_split.shape)

# 2️⃣ Look up or create the experiment
experiment_name = "credit_score_ensembles"
experiment = mlflow.get_experiment_by_name(experiment_name)

if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_name)
    print(f"Novo experimento criado: ID {experiment_id}")
else:
    experiment_id = experiment.experiment_id
    print(f"Experimento existente: ID {experiment_id}")

# 3️⃣ Fitting
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_split, y_train_split)

# 4️⃣ Validation on the held-out split
y_val_pred = rf.predict(X_val_split)
acc = accuracy_score(y_val_split, y_val_pred)
f1 = f1_score(y_val_split, y_val_pred, average='macro')

# 5️⃣ MLflow tracking
with mlflow.start_run(run_name="random_forest_split_v1.1_curated", experiment_id=experiment_id):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("random_state", 42)
    mlflow.log_param("split_test_size", 0.2)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(rf, "random_forest_model")

print(f"\nRandom Forest (split) | Accuracy: {round(acc, 4)} | F1 Macro: {round(f1, 4)}")


Shapes:
X_train_split: (80000, 89)
X_val_split : (20000, 89)
Experimento existente: ID 604912714123659266





Random Forest (split) | Accuracy: 0.7836 | F1 Macro: 0.7704


## Final Random Forest Inference on X_test + Directory Creation and Export

This block runs the **final inference** of the trained and validated `RandomForestClassifier` on the `X_test` set.  
The result is saved to `/workspace/data/predictions/random_forest_predictions.csv`.

For traceability, the block makes sure the `data/predictions` directory exists, following the `cookiecutter-data-science` structure defined in the Plano Conceitual.


In [None]:
# STEP: Final inference and export to data/predictions

import os
import pandas as pd

# 1️⃣ Make sure the directory exists
pred_dir = '/workspace/data/predictions'

if not os.path.exists(pred_dir):
    os.makedirs(pred_dir)
    print(f"Diretório criado: {pred_dir}")
else:
    print(f"Diretório já existe: {pred_dir}")

# 2️⃣ Run inference on X_test
y_pred_test = rf.predict(X_test)

# 3️⃣ Build the DataFrame
df_pred = pd.DataFrame({'Credit_Score_Predicted': y_pred_test})

# 4️⃣ Export the CSV
output_path = f"{pred_dir}/random_forest_predictions.csv"
df_pred.to_csv(output_path, index=False)

print(f"\nPrevisões salvas em: {output_path}")
print("\nPrimeiras linhas da previsão:")
print(df_pred.head())


Diretório já existe: /workspace/data/predictions

Previsões salvas em: /workspace/data/predictions/random_forest_predictions.csv

Primeiras linhas da previsão:
   Credit_Score_Predicted
0                     0.0
1                     0.0
2                     0.0
3                     0.0
4                     2.0


## Fitting the XGBClassifier with the Supervised Split + Final Inference

This block runs the **supervised training** of the `XGBClassifier`, reusing the `X_train_split` / `X_val_split` split already validated with the Random Forest.  
This ensures:
- Directly comparable metrics (`accuracy` and `f1_macro`).
- MLflow tracking with the hyperparameters saved.
- Final inference on the real `X_test`, exporting `xgboost_predictions.csv` to `/data/predictions/` and keeping the flow consistent **end to end**.

**Initial configuration:**
- `n_estimators=100`
- `random_state=42`
- `use_label_encoder=False`, intended to avoid warnings. The XGBoost version used in this run no longer recognizes it and reports `Parameters: { "use_label_encoder" } are not used.`


In [None]:
# STEP: Fit XGBoost with validation + final inference

import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd
import os

# 1️⃣ Look up or create the experiment
experiment_name = "credit_score_ensembles"
experiment = mlflow.get_experiment_by_name(experiment_name)

if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_name)
    print(f"Novo experimento criado: ID {experiment_id}")
else:
    experiment_id = experiment.experiment_id
    print(f"Experimento existente: ID {experiment_id}")

# 2️⃣ Instantiate and train
xgb = XGBClassifier(
    n_estimators=100,
    random_state=42,
    use_label_encoder=False,  # ignored by this XGBoost version (see the warning in the output)
    eval_metric='mlogloss'
)

xgb.fit(X_train_split, y_train_split)

# 3️⃣ Validation on the held-out split
y_val_pred = xgb.predict(X_val_split)
acc = accuracy_score(y_val_split, y_val_pred)
f1 = f1_score(y_val_split, y_val_pred, average='macro')

# 4️⃣ MLflow tracking
with mlflow.start_run(run_name="xgboost_split_v1.1_curated", experiment_id=experiment_id):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("random_state", 42)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.xgboost.log_model(xgb.get_booster(), "xgboost_model")

print(f"\nXGBoost (split) | Accuracy: {round(acc, 4)} | F1 Macro: {round(f1, 4)}")

# 5️⃣ Final inference on X_test
y_pred_test = xgb.predict(X_test)
df_pred = pd.DataFrame({'Credit_Score_Predicted': y_pred_test})

# Make sure data/predictions exists
pred_dir = '/workspace/data/predictions'
os.makedirs(pred_dir, exist_ok=True)

output_path = f"{pred_dir}/xgboost_predictions.csv"
df_pred.to_csv(output_path, index=False)

print(f"\nPrevisões XGBoost salvas em: {output_path}")
print(df_pred.head())


Experimento existente: ID 604912714123659266


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
  xgb_model.save_model(model_data_path)



XGBoost (split) | Accuracy: 0.7504 | F1 Macro: 0.7326

Previsões XGBoost salvas em: /workspace/data/predictions/xgboost_predictions.csv
   Credit_Score_Predicted
0                       0
1                       0
2                       0
3                       0
4                       0


## Fitting the LGBMClassifier with the Supervised Split + Final Inference

This block runs the **supervised training** of the `LGBMClassifier`, reusing the same `X_train_split` / `X_val_split` split already validated with Random Forest and XGBoost.  
This ensures:
- A fair comparison of metrics (`accuracy` and `f1_macro`).
- Auditable tracking in MLflow.
- Final inference on `X_test`, exported to `/data/predictions/lightgbm_predictions.csv`.

**Initial configuration:**
- `n_estimators=100`
- `random_state=42`
- `verbosity=-1` to suppress unnecessary warnings.


In [None]:
# STEP: Fit LightGBM with validation + final inference

import mlflow
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd
import os

# 1️⃣ Look up or create the MLflow experiment
experiment_name = "credit_score_ensembles"
experiment = mlflow.get_experiment_by_name(experiment_name)

if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_name)
    print(f"Novo experimento criado: ID {experiment_id}")
else:
    experiment_id = experiment.experiment_id
    print(f"Experimento existente: ID {experiment_id}")

# 2️⃣ Instantiate and train the model
lgbm = LGBMClassifier(
    n_estimators=100,
    random_state=42,
    verbosity=-1
)

lgbm.fit(X_train_split, y_train_split)

# 3️⃣ Validation on the held-out split
y_val_pred = lgbm.predict(X_val_split)
acc = accuracy_score(y_val_split, y_val_pred)
f1 = f1_score(y_val_split, y_val_pred, average='macro')

# 4️⃣ MLflow tracking
with mlflow.start_run(run_name="lightgbm_split_v1.1_curated", experiment_id=experiment_id):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("random_state", 42)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.lightgbm.log_model(lgbm.booster_, "lightgbm_model")

print(f"\nLightGBM (split) | Accuracy: {round(acc, 4)} | F1 Macro: {round(f1, 4)}")

# 5️⃣ Final inference on X_test
y_pred_test = lgbm.predict(X_test)
df_pred = pd.DataFrame({'Credit_Score_Predicted': y_pred_test})

# Make sure data/predictions exists
pred_dir = '/workspace/data/predictions'
os.makedirs(pred_dir, exist_ok=True)

output_path = f"{pred_dir}/lightgbm_predictions.csv"
df_pred.to_csv(output_path, index=False)

print(f"\nPrevisões LightGBM salvas em: {output_path}")
print(df_pred.head())


Experimento existente: ID 604912714123659266





LightGBM (split) | Accuracy: 0.7248 | F1 Macro: 0.7055

Previsões LightGBM salvas em: /workspace/data/predictions/lightgbm_predictions.csv
   Credit_Score_Predicted
0                     0.0
1                     0.0
2                     0.0
3                     0.0
4                     0.0


## Fitting the HistGradientBoostingClassifier with the Supervised Split + Final Inference

This block runs the **supervised training** of the `HistGradientBoostingClassifier`, using the same `X_train_split` / `X_val_split` split already validated with Random Forest, XGBoost, and LightGBM.  
This keeps:
- A fair comparison of metrics (`accuracy` and `f1_macro`).
- Auditable tracking in MLflow.
- Final inference on `X_test`, exported to `/data/predictions/hgb_predictions.csv`.

**Initial configuration:**
- `max_iter=100`
- `random_state=42`
- `verbose=0` to keep the log clean.


In [None]:
# STEP: Fit HistGradientBoosting with validation + final inference

import mlflow
from sklearn.experimental import enable_hist_gradient_boosting  # needed only on scikit-learn < 1.0
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd
import os

# 1️⃣ Look up or create the MLflow experiment
experiment_name = "credit_score_ensembles"
experiment = mlflow.get_experiment_by_name(experiment_name)

if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_name)
    print(f"Novo experimento criado: ID {experiment_id}")
else:
    experiment_id = experiment.experiment_id
    print(f"Experimento existente: ID {experiment_id}")

# 2️⃣ Instantiate and train the model
hgb = HistGradientBoostingClassifier(
    max_iter=100,
    random_state=42,
    verbose=0
)

hgb.fit(X_train_split, y_train_split)

# 3️⃣ Validation on the held-out split
y_val_pred = hgb.predict(X_val_split)
acc = accuracy_score(y_val_split, y_val_pred)
f1 = f1_score(y_val_split, y_val_pred, average='macro')

# 4️⃣ MLflow tracking
with mlflow.start_run(run_name="hgb_split_v1.1_curated", experiment_id=experiment_id):
    mlflow.log_param("max_iter", 100)
    mlflow.log_param("random_state", 42)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    # MLflow has no HistGradientBoosting-specific flavor, so the model is logged with the sklearn flavor
    mlflow.sklearn.log_model(hgb, "hgb_model")

print(f"\nHistGradientBoosting (split) | Accuracy: {round(acc, 4)} | F1 Macro: {round(f1, 4)}")

# 5️⃣ Final inference on X_test
y_pred_test = hgb.predict(X_test)
df_pred = pd.DataFrame({'Credit_Score_Predicted': y_pred_test})

# Make sure data/predictions exists
pred_dir = '/workspace/data/predictions'
os.makedirs(pred_dir, exist_ok=True)

output_path = f"{pred_dir}/hgb_predictions.csv"
df_pred.to_csv(output_path, index=False)

print(f"\nPrevisões HistGradientBoosting salvas em: {output_path}")
print(df_pred.head())




Experimento existente: ID 604912714123659266





HistGradientBoosting (split) | Accuracy: 0.7368 | F1 Macro: 0.7192

Previsões HistGradientBoosting salvas em: /workspace/data/predictions/hgb_predictions.csv
   Credit_Score_Predicted
0                     0.0
1                     0.0
2                     0.0
3                     0.0
4                     0.0


# Consolidated Summary of Results: Supervised Baselines and Ensembles

This document summarizes **all validated steps** of the supervised pipeline for predicting the **Credit Score**, in line with the **Plano Conceitual**, the **Plano de Atividades - Sequencial**, and **PROTOCOLO V5.4**.  
Each model was tracked in **MLflow**, validated on the supervised split, and exported real inferences for `X_test`, ready for versioning.

---

## 📌 1️⃣ **Baseline: Decision Tree and Logistic Regression**

**Objective:** obtain a minimum performance reference.  
- **DecisionTreeClassifier**:  
  - `max_depth=5` to limit overfitting.
  - `Accuracy` ≈ initial value recorded in MLflow.
- **Logistic Regression**:  
  - Without normalization: very low F1.
  - With normalization (numeric features only): better F1, but still insufficient.

✅ These models showed that moving on to ensembles was necessary.

---

## 📌 2️⃣ **Scale-Sensitive Models: SVM and KNN**

Run **only on normalized numeric features**:
- **SVM:** `F1 Macro ≈ 0.46`
- **KNN:** `F1 Macro ≈ 0.54`

✅ Low performance at a high computational cost, with no real gain over the Decision Tree.

---

## 📌 3️⃣ **Random Forest: Curated v1.1**

- **Base:** `CURATED v1.1` (93 columns as loaded, 90 after column removal, 89 in the split used for fitting; 100% numeric, no `NaN`).
- **Supervised split:** `train_test_split` (`test_size=0.2`, `random_state=42`).
- **Metrics in MLflow:**  
  - `Accuracy`: **0.7836**
  - `F1 Macro`: **0.7704**
- **Inference:** `/data/predictions/random_forest_predictions.csv`.

---

## 📌 4️⃣ **XGBoost**

- **Same supervised split as the RF:** fully comparable.
- **Metrics:**  
  - `Accuracy`: **0.7504**
  - `F1 Macro`: **0.7326**
- **Inference:** `/data/predictions/xgboost_predictions.csv`.

---

## 📌 5️⃣ **LightGBM**

- **Identical supervised split.**
- **Metrics:**  
  - `Accuracy`: **0.7248**
  - `F1 Macro`: **0.7055**
- **Inference:** `/data/predictions/lightgbm_predictions.csv`.

---

## 📌 6️⃣ **HistGradientBoostingClassifier**

- **Identical supervised split.**
- **Metrics:**  
  - `Accuracy`: **0.7368**
  - `F1 Macro`: **0.7192**
- **Inference:** `/data/predictions/hgb_predictions.csv`.

---

## 📌 **Final Panel: Ensemble Comparison**

| Model                    | Accuracy | F1 Macro |
|--------------------------|----------|----------|
| **Random Forest**        | 0.7836   | 0.7704   |
| **XGBoost**              | 0.7504   | 0.7326   |
| **LightGBM**             | 0.7248   | 0.7055   |
| **HistGradientBoosting** | 0.7368   | 0.7192   |

All of them were:
- **Tracked** in MLflow (same experiment).
- **Auditable**, with consistent splits.
- **Saved** as CSV files in `/data/predictions/`, ready for deployment.

---
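
As a quick way to read the panel, the sketch below ranks the four ensembles by F1 Macro. The numbers are copied from the table; the dictionary layout and variable names are illustrative only.

```python
# Validation metrics copied from the comparison table above.
results = {
    "Random Forest": {"accuracy": 0.7836, "f1_macro": 0.7704},
    "XGBoost": {"accuracy": 0.7504, "f1_macro": 0.7326},
    "LightGBM": {"accuracy": 0.7248, "f1_macro": 0.7055},
    "HistGradientBoosting": {"accuracy": 0.7368, "f1_macro": 0.7192},
}

# Sort model names from best to worst F1 Macro.
ranking = sorted(results, key=lambda name: results[name]["f1_macro"], reverse=True)
best, runner_up = ranking[0], ranking[1]
gap = results[best]["f1_macro"] - results[runner_up]["f1_macro"]

print(ranking)
print(f"{best} leads {runner_up} by {gap:.4f} F1 Macro")
```

The same ordering holds for accuracy, which supports taking the Random Forest as the candidate for hyperparameter tuning.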


## Supervised Grid Search for Random Forest: Tuning the Final Model

This block applies a supervised **`GridSearchCV`** to the `RandomForestClassifier` already established as the baseline,  
looking for **more robust hyperparameters** that improve `accuracy` and `F1 Macro` without adding excessive complexity.

**Context:**
- Keeps the **same split** `X_train_split` / `X_val_split` used for all ensembles, so results stay comparable.
- The whole tuning is **tracked in MLflow** for auditing.
- Once `best_estimator_` is found, that model predicts `X_val_split` and generates a new inference for `X_test`.

**Grid Search:**
- `n_estimators`: [100, 200, 300]
- `max_depth`: [None, 10, 20]
- `max_features`: ['sqrt', 'log2']
- `min_samples_split`: [2, 5]
- `min_samples_leaf`: [1, 2]

---
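
To gauge the cost of this grid before running it, the number of candidate combinations can be counted directly. The sketch below assumes 3-fold cross-validation, matching the `cv=3` setting used in the `GridSearchCV` call later in the notebook.

```python
from itertools import product

# Same grid as listed above.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
cv_folds = 3  # assumption: matches cv=3 in the GridSearchCV call

# Every combination of one value per hyperparameter is one candidate.
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds

print(len(combinations))
print(total_fits)
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training split after the search, so the total is one fit higher than the cross-validation count.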


# 🔧 STEP: Random Forest Grid Search, Final Tuning for the Exercise

This block reruns `GridSearchCV` for the `RandomForestClassifier`, ensuring:
- A **supervised split (`X_train_split` and `X_val_split`)** consistent with the baseline.
- A balanced **hyperparameter grid**.
- **MLflow tracking** in a single `run`.
- **Persistence of the tuned model's output** as a CSV of real predictions (`X_test`) for versioning.

Everything is tracked and self-contained, consistent with **PROTOCOLO_V5.4**.

---


# 🔧 STEP: Random Forest Grid Search + Final Inference

This block:
- Stays consistent with the original pipeline: a fixed supervised split (train) and final inference on `test_curated_v1_1.csv`.
- Uses `GridSearchCV` with MLflow tracking.
- Saves `random_forest_tuned_predictions.csv` defensively.

Follows **PROTOCOLO_V5.4_UNIFICADO**.

---


# 🔧 STEP: Random Forest Grid Search with a Supervised 80/20 Split + Encoding

This block:
- Loads `train_curated_v1_1.csv` from scratch.
- Runs a supervised 80/20 `train_test_split` with `random_state=42`.
- Detects and transforms categorical (`object`) columns with `OrdinalEncoder` in the same flow.
- Runs `GridSearchCV` on the Random Forest.
- Validates on `X_val`.
- Logs everything to the same MLflow experiment as the original Random Forest.

Follows **PROTOCOLO_V5.4_UNIFICADO**.

---


In [None]:
# STEP: Supervised Random Forest grid search (split + encoding + MLflow)

import pandas as pd
import mlflow
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# 1️⃣ Load the original dataset
path_curated = '/workspace/data/curated/train_curated_v1_1.csv'
df = pd.read_csv(path_curated)

print(f"df shape: {df.shape}")
print(df.head())

# 2️⃣ Separate X and y
X = df.drop(columns=['Credit_Score'])
y = df['Credit_Score']

# 3️⃣ Global stratified 80/20 split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"X_train: {X_train.shape} | X_val: {X_val.shape}")

# 4️⃣ Detect categorical columns
object_cols = X_train.select_dtypes(include='object').columns.tolist()
print(f"Colunas categóricas detectadas: {object_cols}")

# 5️⃣ Apply a consistent OrdinalEncoder (fitted on the training split only)
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

X_train_enc = X_train.copy()
X_val_enc = X_val.copy()

X_train_enc[object_cols] = encoder.fit_transform(X_train[object_cols])
X_val_enc[object_cols] = encoder.transform(X_val[object_cols])

print("\nVerifica√ß√£o dos dtypes p√≥s-encoding:")
print(X_train_enc.dtypes.value_counts())

# 6Ô∏è‚É£ Garantir coer√™ncia
assert 'object' not in X_train_enc.dtypes.values, "Ainda existem colunas object no X_train!"
assert 'object' not in X_val_enc.dtypes.values, "Ainda existem colunas object no X_val!"

# 7️⃣ Look up or create the MLflow experiment
experiment_name = "credit_score_ensembles"
experiment = mlflow.get_experiment_by_name(experiment_name)

if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_name)
    print(f"New experiment created: ID {experiment_id}")
else:
    experiment_id = experiment.experiment_id
    print(f"Existing experiment: ID {experiment_id}")

# 8️⃣ Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2'],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# 9️⃣ Instantiate GridSearchCV
rf_base = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    estimator=rf_base,
    param_grid=param_grid,
    cv=3,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=2
)

# 🔟 Supervised fit on the already-encoded training features
grid_search.fit(X_train_enc, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")

# 1️⃣1️⃣ Evaluate on the validation set
best_rf = grid_search.best_estimator_
y_val_pred = best_rf.predict(X_val_enc)
acc = accuracy_score(y_val, y_val_pred)
f1 = f1_score(y_val, y_val_pred, average='macro')

# 1️⃣2️⃣ MLflow tracking, in the same experiment
with mlflow.start_run(run_name="random_forest_gridsearch_v1.1_supervised", experiment_id=experiment_id):
    for param_name, param_value in grid_search.best_params_.items():
        mlflow.log_param(param_name, param_value)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(best_rf, "random_forest_tuned_model_supervised")

print(f"\nRandom Forest Tuned | Accuracy (val): {round(acc, 4)} | F1 Macro (val): {round(f1, 4)}")


df shape: (100000, 93)
    Age Age_Binned  Amount_invested_monthly Amount_invested_monthly_Binned  \
0  23.0      Jovem                80.415295                          Baixo   
1  23.0      Jovem               118.280222                          Baixo   
2  33.0     Adulto                81.699521                          Baixo   
3  23.0      Jovem               199.458074                          Baixo   
4  23.0      Jovem                41.420153                         Nenhum   

   Amount_invested_monthly_Binned_High  Amount_invested_monthly_Binned_Low  \
0                                False                                True   
1                                False                                True   
2                                False                                True   
3                                False                                True   
4                                False                                True   

   Amount_invested_monthly_Binned_Moder



[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=300; total time= 1.2min
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time= 1.3min
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time= 1.3min
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=  20.3s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=  21.0s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=  20.7s
[CV] END max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=  18.4s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=300; total time= 1.2min
[CV] END max_depth=None,




Random Forest Tuned | Accuracy (val): 0.7907 | F1 Macro (val): 0.7779
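
A quick way to estimate the cost of a search before launching it is to count the candidate combinations. The sketch below uses only the standard library and the grid from the cell above; `count_fits` is a hypothetical helper, not part of scikit-learn.

```python
from math import prod

def count_fits(param_grid, cv):
    """Return (number of candidate combinations, total cross-validation fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2'],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

n_candidates, n_fits = count_fits(param_grid, cv=3)
print(n_candidates, n_fits)  # 72 216
```

The count excludes the single extra fit that `GridSearchCV` performs at the end when `refit=True` (its default) to train the best configuration on the whole training split.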


In [None]:
import mlflow
mlflow.set_tracking_uri("file:///workspace/.mlruns")


In [None]:
print("Tracking URI:", mlflow.get_tracking_uri())


Tracking URI: file:///workspace/.mlruns
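
This notebook writes the local tracking URI in two forms, `file:///workspace/.mlruns` and `file:/workspace/.mlruns`. Under standard URI parsing both carry the same path component, which the following standard-library check illustrates. It says nothing about MLflow internals; it only compares the parsed paths.

```python
from urllib.parse import urlparse

uris = ["file:///workspace/.mlruns", "file:/workspace/.mlruns"]

# Extract the filesystem path component from each URI
paths = [urlparse(uri).path for uri in uris]
print(paths)  # ['/workspace/.mlruns', '/workspace/.mlruns']
```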


# 🔧 STEP: Final Random Forest GridSearch Refit - Default Experiment

This block:
- Loads `train_curated_v1_1.csv`
- Separates `X` and `y`, encodes the categorical columns, and performs a stratified 80/20 split
- Runs `GridSearchCV` with a consistent parameter grid
- Evaluates on the validation set
- Saves the model, metrics, and parameters under a new **Run ID** in the `Default` experiment (ID 0)

---


In [None]:
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score, f1_score

# 1️⃣ Load the curated base
df = pd.read_csv('/workspace/data/curated/train_curated_v1_1.csv')
X = df.drop(columns=['Credit_Score'])
y = df['Credit_Score'].map({'Poor': 0, 'Standard': 1, 'Good': 2})

print(f"Base => X: {X.shape} | y: {y.shape} | y unique: {y.unique()}")

# 2️⃣ OrdinalEncoder for the categorical columns
# Note: the encoder is fit on the full dataset before the split, so the
# categories of the validation rows are already known when it is fit.
object_cols = X.select_dtypes(include='object').columns.tolist()
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X[object_cols] = encoder.fit_transform(X[object_cols])

print(f"✅ OrdinalEncoder applied to: {object_cols}")

# 3️⃣ Stratified 80/20 split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Split => X_train: {X_train.shape} | X_val: {X_val.shape}")

# 4️⃣ Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10],
    'max_features': ['sqrt'],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# 5️⃣ Instantiate and run GridSearchCV
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=3,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")

# 6️⃣ Evaluate on the validation set
best_rf = grid_search.best_estimator_
y_val_pred = best_rf.predict(X_val)

acc = accuracy_score(y_val, y_val_pred)
f1 = f1_score(y_val, y_val_pred, average='macro')

print(f"\n✅ Validation performance | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")

# 7️⃣ Log to MLflow (Default experiment, ID=0)
mlflow.set_experiment("Default")

with mlflow.start_run(run_name="random_forest_gridsearch_refit_v1.1"):
    for param_name, param_value in grid_search.best_params_.items():
        mlflow.log_param(param_name, param_value)
    mlflow.log_metric("accuracy_val", acc)
    mlflow.log_metric("f1_macro_val", f1)
    mlflow.sklearn.log_model(best_rf, "random_forest_tuned_model")

print("\n✅ Run saved to the MLflow Default experiment!")


Base => X: (100000, 92) | y: (100000,) | y unique: [2 1 0]
✅ OrdinalEncoder applied to: ['Age_Binned', 'Amount_invested_monthly_Binned', 'Annual_Income_Binned', 'Changed_Credit_Limit_Binned', 'Credit_History_Age', 'Credit_History_Age_Binned', 'Credit_Mix', 'Credit_Utilization_Ratio_Binned', 'Delay_from_due_date_Binned', 'Interest_Rate_Binned', 'Monthly_Balance_Binned', 'Monthly_Inhand_Salary_Binned', 'Num_Bank_Accounts_Binned', 'Num_Credit_Card_Binned', 'Num_Credit_Inquiries_Binned', 'Num_of_Delayed_Payment_Binned', 'Num_of_Loan_Binned', 'Occupation', 'Outstanding_Debt_Binned', 'Payment_of_Min_Amount', 'Total_EMI_per_month_Binned', 'Type_of_Loan']
Split => X_train: (80000, 92) | X_val: (20000, 92)
Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=  22.7s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; tota


✅ Validation performance | Accuracy: 0.7914 | F1 Macro: 0.7778


✅ Run saved to the MLflow Default experiment!


# 🔧 STEP: Final Inference with the Tuned Random Forest - Test Set without Target

This block runs inference with the optimized Random Forest model (`gridsearch v1.1`) on the test set `test_curated_v1_1.csv`, simulating a production scenario. It includes:

- Explicit configuration of the `tracking_uri` so that `.mlruns` can be located;
- Reloading the model saved under a `Run ID` tracked in MLflow;
- Refitting the `OrdinalEncoder` on the training data;
- Removing the target from the test set (if present);
- Applying the encoder and generating predictions;
- Saving the predictions to `data/predictions/`.

This block closes the inference cycle and validates that the saved model works in the containerized environment with full traceability.

---
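
Because the reloaded model expects the same features it saw during training, a cheap guard before calling `predict` is to compare the column lists of the two files. This is a sketch with a hypothetical helper, `check_feature_alignment`; the column names in the example are a small illustrative subset of the real ones.

```python
def check_feature_alignment(train_columns, test_columns, target="Credit_Score"):
    """Compare feature names, ignoring the target column if it is present."""
    expected = [c for c in train_columns if c != target]
    received = [c for c in test_columns if c != target]
    missing = [c for c in expected if c not in received]
    unexpected = [c for c in received if c not in expected]
    # Order only matters once both sides contain exactly the same names
    same_order = not missing and not unexpected and expected == received
    return {"missing": missing, "unexpected": unexpected, "same_order": same_order}

train_cols = ["Age", "Occupation", "Credit_Mix", "Credit_Score"]
test_cols = ["Age", "Credit_Mix", "Credit_Score"]

report = check_feature_alignment(train_cols, test_cols)
print(report)
```

In the notebook, the same helper could be called as `check_feature_alignment(df_train.columns, df_test.columns)` right after both CSV files are loaded.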


In [None]:
# 🔧 STEP: Final inference with the tuned Random Forest - test set without target

import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.preprocessing import OrdinalEncoder
import os

# 0️⃣ Set the tracking URI explicitly so that .mlruns is reachable
mlflow.set_tracking_uri("file:/workspace/.mlruns")
print("📁 Current Working Directory:", os.getcwd())
print("✅ Tracking URI:", mlflow.get_tracking_uri())

# 1️⃣ Reload the model from a valid run
model_uri = "runs:/4e56a5afe29a4a26b962c220fef03f5d/random_forest_tuned_model"
best_rf = mlflow.sklearn.load_model(model_uri)
print(f"✅ Model reloaded from: {model_uri}")

# 2️⃣ Refit the OrdinalEncoder on the training data
train_path = "/workspace/data/curated/train_curated_v1_1.csv"
df_train = pd.read_csv(train_path)

X_train = df_train.drop(columns=['Credit_Score'])
object_cols = X_train.select_dtypes(include='object').columns.tolist()

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(X_train[object_cols])
print(f"✅ OrdinalEncoder refit on: {object_cols}")

# 3️⃣ Load the test set and drop the target if it is present
test_path = "/workspace/data/curated/test_curated_v1_1.csv"
df_test = pd.read_csv(test_path)

if 'Credit_Score' in df_test.columns:
    df_test.drop(columns=['Credit_Score'], inplace=True)
    print("✅ Column 'Credit_Score' removed from TEST.")

# 4️⃣ Apply the encoding to the test set
X_test = df_test.copy()
X_test[object_cols] = encoder.transform(X_test[object_cols])
print("✅ OrdinalEncoder applied to TEST.")

# 5️⃣ Final inference
y_test_pred = best_rf.predict(X_test)

# 6️⃣ Save the predictions
df_pred = pd.DataFrame({'Credit_Score_Predicted': y_test_pred})
output_path = "/workspace/data/predictions/random_forest_final_test_predictions.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df_pred.to_csv(output_path, index=False)

print(f"\n✅ Predictions saved to: {output_path}")
print(df_pred.head(20))


📁 Current Working Directory: /workspace/notebooks
✅ Tracking URI: file:/workspace/.mlruns


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]
Downloading artifacts:  80%|████████  | 4/5 [00:00<00:00, 3900.77it/s]
Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 13.12it/s]


✅ Model reloaded from: runs:/4e56a5afe29a4a26b962c220fef03f5d/random_forest_tuned_model
✅ OrdinalEncoder refit on: ['Age_Binned', 'Amount_invested_monthly_Binned', 'Annual_Income_Binned', 'Changed_Credit_Limit_Binned', 'Credit_History_Age', 'Credit_History_Age_Binned', 'Credit_Mix', 'Credit_Utilization_Ratio_Binned', 'Delay_from_due_date_Binned', 'Interest_Rate_Binned', 'Monthly_Balance_Binned', 'Monthly_Inhand_Salary_Binned', 'Num_Bank_Accounts_Binned', 'Num_Credit_Card_Binned', 'Num_Credit_Inquiries_Binned', 'Num_of_Delayed_Payment_Binned', 'Num_of_Loan_Binned', 'Occupation', 'Outstanding_Debt_Binned', 'Payment_of_Min_Amount', 'Total_EMI_per_month_Binned', 'Type_of_Loan']
✅ Column 'Credit_Score' removed from TEST.
✅ OrdinalEncoder applied to TEST.

✅ Predictions saved to: /workspace/data/predictions/random_forest_final_test_predictions.csv
    Credit_Score_Predicted
0                        2
1                        2
2                        2
3                        2

# 📄 FINAL REPORT - INFERENCE STEP WITH TUNED RANDOM FOREST (CURATED V1.1)

## ✅ Step Objective

Run the **final inference on the test set** (`test_curated_v1_1.csv`) using the Random Forest model optimized with `GridSearchCV` (version 1.1), with full MLflow tracking and persistence of the results for later use in an API or Streamlit.

---

## ⚙️ Environment Settings

- Execution directory: `/workspace/notebooks`
- Explicit tracking URI: `file:/workspace/.mlruns`
- Model reloaded from MLflow:
  - `run_id`: `4e56a5afe29a4a26b962c220fef03f5d`
  - `model_uri`: `runs:/4e56a5afe29a4a26b962c220fef03f5d/random_forest_tuned_model`
- Model saved in experiment: **Default** (`experiment_id = 0`)

---

## 🔁 Procedures Executed

1. **Model reload** via `mlflow.sklearn.load_model(...)` with a persistent URI.
2. **OrdinalEncoder refit** using the categorical columns of the training set:
   - `22` columns identified and encoded with `handle_unknown='use_encoded_value'`.
3. **Test set cleanup**:
   - Removal of the `Credit_Score` column, to simulate production.
4. **Transformation and inference**:
   - Encoder applied to `X_test`
   - Predictions generated with the loaded model
5. **Result persistence**:
   - File saved to:  
     `/workspace/data/predictions/random_forest_final_test_predictions.csv`
   - Preview of the first 20 rows displayed successfully

---

## 📊 Inference Result (Sample)

| Index | Credit_Score_Predicted |
|-------|------------------------|
| 0     | 2                      |
| 1     | 2                      |
| 2     | 2                      |
| ...   | ...                    |
| 12    | 1                      |
| 19    | 1                      |

- Predicted classes: `2` (Good), `1` (Standard)
- No predictions of class `0` (Poor) were observed in this sample

---
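
The integer predictions can be mapped back to the original labels by inverting the mapping used when the target was encoded (`{'Poor': 0, 'Standard': 1, 'Good': 2}` in the refit cell). A minimal sketch, with a made-up list of predictions; `decode_predictions` is a hypothetical helper:

```python
# Mapping used when the target was encoded for training
LABEL_TO_ID = {'Poor': 0, 'Standard': 1, 'Good': 2}
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}

def decode_predictions(predicted_ids):
    """Translate integer class ids back into their string labels."""
    return [ID_TO_LABEL[int(i)] for i in predicted_ids]

sample = [2, 2, 1, 0]  # illustrative values, not real model output
decoded = decode_predictions(sample)
print(decoded)  # ['Good', 'Good', 'Standard', 'Poor']
```

Storing the decoded label next to the integer id in the predictions CSV would make the file readable without access to the training code.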

## 🧾 Compliance with Protocol V5.4

| Criterion                                  | Status |
|--------------------------------------------|--------|
| Explicit tracking URI used                 | ✅     |
| Inference traceable via `run_id`           | ✅     |
| Results persisted to a controlled path     | ✅     |
| Self-contained, traceable block            | ✅     |
| No hidden inferences or implicit paths     | ✅     |

---



# 🔧 STEP: DVC Versioning of the Final Inference - Tuned Random Forest

This block versions the prediction file `random_forest_final_test_predictions.csv`, following the protocol's standards:

- Checks that the local file exists;
- Runs `dvc add` to track the artifact;
- Adds the `.dvc` file to Git;
- Commits with a traceable message;
- Runs `dvc push` to send it to MinIO (via `remote storage`);
- Ensures that the working directory is the root level (`/workspace`), avoiding errors from incorrect relative paths.

---
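
The sequence above can be written down as a plain list of commands before anything is executed, which makes it easy to review or log. The sketch below only builds the argument lists and runs nothing; `build_versioning_commands` is a hypothetical helper and the commit message is a placeholder.

```python
def build_versioning_commands(artifact_path, commit_message):
    """Return the dvc/git commands, in order, needed to version one artifact."""
    dvc_file = artifact_path + ".dvc"
    return [
        ["dvc", "add", artifact_path],
        ["git", "add", dvc_file],
        ["git", "commit", "-m", commit_message],
        ["dvc", "push"],
    ]

pred_path = "/workspace/data/predictions/random_forest_final_test_predictions.csv"
commands = build_versioning_commands(pred_path, "Track final Random Forest predictions")

# Dry run: print each command instead of executing it
for cmd in commands:
    print(" ".join(cmd))
```

Each list can then be handed to `subprocess.run(cmd, check=True, cwd="/workspace")`, so that every command runs from the repository root.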


In [None]:
# 🔧 STEP: DVC versioning of the final inference - tuned Random Forest

import os
import subprocess

# Repository root: every command runs from here so relative paths resolve correctly
repo_root = '/workspace'

# Actual path of the file
pred_path = '/workspace/data/predictions/random_forest_final_test_predictions.csv'
dvc_file = pred_path + '.dvc'

# 1️⃣ Check that the file physically exists
print("\n✅ Checking that the prediction file exists:")
if not os.path.exists(pred_path):
    raise FileNotFoundError(f"❌ File not found: {pred_path}")
else:
    print("✔️ File found.")

# 2️⃣ Run dvc add
print("\n📦 Running dvc add...")
subprocess.run(['dvc', 'add', pred_path], check=True, cwd=repo_root)

# 3️⃣ Git add and commit of the .dvc file (commit message kept as originally recorded)
print("\n🔧 Adding .dvc to Git and creating commit:")
subprocess.run(['git', 'add', dvc_file], check=True, cwd=repo_root)
subprocess.run(['git', 'commit', '-m', '🔒 Versão rastreada: inferência final Random Forest Tuned'], check=True, cwd=repo_root)

# 4️⃣ Push to the remote backend (MinIO)
print("\n⏫ Running dvc push to MinIO...")
subprocess.run(['dvc', 'push'], check=True, cwd=repo_root)

print("\n✅ Prediction file versioned successfully and pushed to the remote!")



✅ Checking that the prediction file exists:
✔️ File found.

📦 Running dvc add...


🔧 Adding .dvc to Git and creating commit:
[main 9a5de5c] 🔒 Versão rastreada: inferência final Random Forest Tuned
 1 file changed, 5 insertions(+)
 create mode 100644 data/predictions/random_forest_final_test_predictions.csv.dvc

⏫ Running dvc push to MinIO...
1 file pushed

✅ Prediction file versioned successfully and pushed to the remote!


# 🔧 STEP: IMPROVED REBUILD WITH RANDOM FOREST + GRIDSEARCHCV

This block replaces the original model (a DecisionTreeClassifier with fixed depth),
applying a more robust pipeline with a RandomForestClassifier optimized by GridSearchCV.

All transformations follow exactly what was done in the `curated_v1_1` version:
- Supervised text conversion (`Credit_History_Age`);
- Removal of identifier columns;
- Replacement of placeholders;
- Numeric coercion and supervised encoding.

After fitting, the best model is saved together with its corresponding encoder,
ready for use by the API or for future evaluation.
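
As an illustration of the text conversion, the sketch below applies to single strings the same two regular expressions used in the cell that follows. The sample value assumes the raw format looks like `22 Years and 1 Months`; that format is an assumption, since the raw strings are not shown in this section. `history_age_to_months` is a hypothetical helper.

```python
import re

def history_age_to_months(text):
    """Convert a string such as '22 Years and 1 Months' to total months.

    Returns 0.0 when no year count is found, mirroring the fillna(0) fallback
    of the pandas version; a missing month count is treated as 0.
    """
    if not isinstance(text, str):
        return 0.0
    years = re.search(r'(\d+)\s+Years?', text)
    months = re.search(r'(\d+)\s+Months?', text)
    if years is None:
        return 0.0
    total = int(years.group(1)) * 12
    if months is not None:
        total += int(months.group(1))
    return float(total)

print(history_age_to_months("22 Years and 1 Months"))  # 265.0
print(history_age_to_months("Unknown"))                # 0.0
```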


In [12]:
# 🔧 STEP: IMPROVED REBUILD WITH RANDOM FOREST + GRIDSEARCHCV

import pandas as pd
import os
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OrdinalEncoder
from tqdm import tqdm

# 1️⃣ File paths
train_path   = "/workspace/data/curated/train_curated_v1_1.csv"
model_path   = "/workspace/models/final_model.pkl"
encoder_path = "/workspace/models/final_encoder.pkl"
os.makedirs("/workspace/models", exist_ok=True)

# 2️⃣ Load the curated dataset
df = pd.read_csv(train_path)

# 3️⃣ Remove identifier columns
cols_id = ['Customer_ID', 'Name', 'SSN', 'ID']
df.drop(columns=[col for col in cols_id if col in df.columns], errors='ignore', inplace=True)

# 4️⃣ Replace placeholders
placeholders = ['_______', '__ __ ____', '!@9#%8']
df.replace(placeholders, 'Unknown', inplace=True)

# 5️⃣ Convert "Credit_History_Age" to total months
if 'Credit_History_Age' in df.columns:
    years = df['Credit_History_Age'].str.extract(r'(\d+)\s+Years?')[0].astype(float)
    months = df['Credit_History_Age'].str.extract(r'(\d+)\s+Months?')[0].fillna(0).astype(float)
    df['Credit_History_Age'] = (years * 12 + months).fillna(0)

# 6️⃣ General numeric coercion: convert object columns whose values all look like non-negative numbers
for col in df.select_dtypes(include='object').columns:
    # regex=False makes '.' a literal dot regardless of the pandas version
    if df[col].str.replace('.', '', 1, regex=False).str.isnumeric().all():
        df[col] = pd.to_numeric(df[col], errors='coerce')

# 7️⃣ Define predictors and target
target_col = 'Credit_Score'
X = df.drop(columns=[target_col])
y = df[target_col]

# 8️⃣ Detect the remaining categorical columns
categorical_cols = X.select_dtypes(include='object').columns.tolist()

# 9️⃣ Apply the OrdinalEncoder
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X[categorical_cols] = encoder.fit_transform(X[categorical_cols])

# 🔟 Define the base model and the parameter grid
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
param_grid = {
    'n_estimators': [100],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 3, 5]
}

# 1️⃣1️⃣ Run GridSearchCV (verbose=0, so no per-fit progress is printed)
print("🔍 Running GridSearchCV...")
grid = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', verbose=0)
grid.fit(X, y)

# 1️⃣2️⃣ Save the final model and encoder
best_model = grid.best_estimator_
joblib.dump(best_model, model_path)
joblib.dump(encoder, encoder_path)

# 1️⃣3️⃣ Final evaluation on the training set (not a held-out estimate)
y_pred = best_model.predict(X)
acc = accuracy_score(y, y_pred)

print("\n✅ Optimized model saved successfully.")
print("📊 Best configuration:", grid.best_params_)
print(f"🎯 Final accuracy on the training set: {acc:.4f}")


🔍 Running GridSearchCV...

✅ Optimized model saved successfully.
📊 Best configuration: {'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'n_estimators': 100}
🎯 Final accuracy on the training set: 0.8803


In [13]:
# 🔧 STEP: REGISTER THE GRIDSEARCHCV-OPTIMIZED MODEL IN MLFLOW

"""
This cell registers in MLflow the optimized model already trained and saved as .pkl,
including its hyperparameters, accuracy metric, and a traceable artifact.
"""

import pandas as pd
import mlflow
import mlflow.sklearn
import joblib
import os
from sklearn.metrics import accuracy_score

# 1️⃣ Paths
model_path     = "/workspace/models/final_model.pkl"
train_path     = "/workspace/data/curated/train_curated_v1_1.csv"
experiment_name = "modelo_otimizado_rf"

# 2️⃣ Set the local tracking URI and make sure the experiment exists
mlflow.set_tracking_uri("file:/workspace/.mlruns")
mlflow.set_experiment(experiment_name)

# 3️⃣ Load the model and the data (the encoder is loaded in step 5)
model = joblib.load(model_path)
df = pd.read_csv(train_path)
X = df.drop(columns=["Credit_Score"])
y = df["Credit_Score"]

# 4️⃣ Reapply the transformations used at training time, for consistency
placeholders = ['_______', '__ __ ____', '!@9#%8']
X.replace(placeholders, 'Unknown', inplace=True)

if 'Credit_History_Age' in X.columns:
    years = X['Credit_History_Age'].str.extract(r'(\d+)\s+Years?')[0].astype(float)
    months = X['Credit_History_Age'].str.extract(r'(\d+)\s+Months?')[0].fillna(0).astype(float)
    X['Credit_History_Age'] = (years * 12 + months).fillna(0)

for col in X.select_dtypes(include='object').columns:
    # regex=False makes '.' a literal dot regardless of the pandas version
    if X[col].str.replace('.', '', 1, regex=False).str.isnumeric().all():
        X[col] = pd.to_numeric(X[col], errors='coerce')

# 5️⃣ Apply the previously saved encoder
encoder_path = "/workspace/models/final_encoder.pkl"
encoder = joblib.load(encoder_path)
categorical_cols = encoder.feature_names_in_.tolist()
X[categorical_cols] = encoder.transform(X[categorical_cols])

# 6️⃣ Re-evaluate so the metric can be logged
y_pred = model.predict(X)
acc = accuracy_score(y, y_pred)

# 7️⃣ Start the run and log everything
with mlflow.start_run(run_name="random_forest_otimizado") as run:
    mlflow.log_params(model.get_params())
    mlflow.log_metric("accuracy_train", acc)
    mlflow.sklearn.log_model(model, "model")

    print("✅ Model registered in MLflow with run_id:", run.info.run_id)
    print(f"📊 Accuracy on the training set: {acc:.4f}")


2025/07/22 21:02:19 INFO mlflow.tracking.fluent: Experiment with name 'modelo_otimizado_rf' does not exist. Creating a new experiment.


✅ Model registered in MLflow with run_id: db7c939c8f454742adde0499ccfcd47d
📊 Accuracy on the training set: 0.8803


In [25]:
# STEP: FULL PIPELINE ADJUSTED TO THE EXISTING COLUMNS

import pandas as pd
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, KBinsDiscretizer, FunctionTransformer
from sklearn.ensemble import RandomForestClassifier

# Custom preprocessing function
def preprocess(df):
    """Return a cleaned copy of df; the caller's frame is left untouched."""
    df = df.copy()
    placeholders = ['_______', '__ __ ____', '!@9#%8']
    df.replace(placeholders, 'Unknown', inplace=True)

    if 'Credit_History_Age' in df.columns:
        years = df['Credit_History_Age'].str.extract(r'(\d+)\s+Years?')[0].astype(float)
        months = df['Credit_History_Age'].str.extract(r'(\d+)\s+Months?')[0].fillna(0).astype(float)
        df['Credit_History_Age'] = (years * 12 + months).fillna(0)

    for col in df.select_dtypes(include='object').columns:
        # regex=False makes '.' a literal dot regardless of the pandas version
        if df[col].str.replace('.', '', 1, regex=False).str.isnumeric().all():
            df[col] = pd.to_numeric(df[col], errors='coerce')

    return df

# Columns to bin (the original columns, before binning)
cols_binning = ['Age', 'Annual_Income', 'Credit_History_Age', 'Amount_invested_monthly', 'Changed_Credit_Limit']

# Remaining categorical columns (for the OrdinalEncoder)
categorical_cols = [
    'Age_Binned',
    'Amount_invested_monthly_Binned',
    'Annual_Income_Binned',
    'Changed_Credit_Limit_Binned',
    'Credit_History_Age_Binned',
    'Credit_Mix',
    'Credit_Utilization_Ratio_Binned',
    'Delay_from_due_date_Binned',
    'Interest_Rate_Binned',
    'Monthly_Balance_Binned',
    'Monthly_Inhand_Salary_Binned',
    'Num_Bank_Accounts_Binned',
    'Num_Credit_Card_Binned',
    'Num_Credit_Inquiries_Binned',
    'Num_of_Delayed_Payment_Binned',
    'Num_of_Loan_Binned',
    'Occupation',
    'Outstanding_Debt_Binned',
    'Payment_of_Min_Amount',
    'Total_EMI_per_month_Binned',
    'Type_of_Loan'
]

# Load the curated dataset for full training
train_path = "/workspace/data/curated/train_curated_v1_1.csv"
df = pd.read_csv(train_path)

# Remove identifier columns
cols_id = ['Customer_ID', 'Name', 'SSN', 'ID']
df.drop(columns=[col for col in cols_id if col in df.columns], inplace=True)

# Initial preprocessing (runs outside the pipeline, so it must also be applied before predict)
df = preprocess(df)

# Define X and y
X = df.drop(columns=['Credit_Score'])
y = df['Credit_Score']

# Define the adjusted ColumnTransformer (no one-hot encoding)
preprocessor = ColumnTransformer(transformers=[
    ('binning', KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile'), cols_binning),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_cols)
], remainder='passthrough')

# Final pipeline: full preprocessor plus RandomForest
pipeline_completo = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(max_depth=20, max_features='sqrt', min_samples_leaf=3, n_estimators=100, random_state=42))
])

# Train the full pipeline
pipeline_completo.fit(X, y)

# Save the full final pipeline
joblib.dump(pipeline_completo, '/workspace/models/final_pipeline_completo.pkl')

print("Full adjusted pipeline saved successfully to: /workspace/models/final_pipeline_completo.pkl")




Full adjusted pipeline saved successfully to: /workspace/models/final_pipeline_completo.pkl
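
Note that `preprocess` runs outside the saved pipeline, so whoever loads `final_pipeline_completo.pkl` has to apply the same function before calling `predict`. One possible alternative, sketched below on toy data, is to wrap the cleaning step in a `FunctionTransformer` so that it travels with the pipeline. The column values and the simplified `clean` function are illustrative assumptions, not the notebook's actual configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder

def clean(df):
    """Simplified stand-in for preprocess(): replace a placeholder, returning a new frame."""
    return df.replace(['_______'], 'Unknown')

# Toy data with one placeholder value (illustrative only)
X = pd.DataFrame({
    "Age": [23, 33, 45, 52, 29, 61],
    "Occupation": ["Doctor", "_______", "Teacher", "Doctor", "Teacher", "Doctor"],
})
y = ["Good", "Poor", "Standard", "Good", "Poor", "Standard"]

encode = ColumnTransformer(
    [("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["Occupation"])],
    remainder="passthrough",
)

pipeline = Pipeline([
    ("clean", FunctionTransformer(clean)),  # cleaning now lives inside the pipeline
    ("encode", encode),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(X, y)

# Raw rows, including an unseen category, can be passed directly
new_rows = pd.DataFrame({"Age": [40], "Occupation": ["Pilot"]})
predictions = pipeline.predict(new_rows)
print(predictions)
```

One caveat with this design: a pipeline pickled with `joblib` stores only a reference to `clean`, so the function must be importable under the same name wherever the pipeline is loaded.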
