## Starting the Modeling Pipeline and MLflow Tracking

This notebook starts the experiment pipeline by loading the final datasets from the `Curated` layer and building fully aligned train and test sets. The flow separates the predictor variables (`X`) from the target variable (`y`) and performs the initial MLflow Tracking configuration, so that all model parameters, metrics, and artifacts are tracked consistently and can be versioned.


In [1]:
# STEP: Load the Curated Data and Configure MLflow Tracking

"""
Runs:
1) Validation of the working directory.
2) Loading of 'train_curated.csv' and 'test_curated.csv'.
3) Separation of X_train, y_train, X_test, y_test.
4) Configuration of the MLflow Tracking URI.
"""

import os
import pandas as pd
import mlflow

# 1) Validate the CWD
print("Current Working Directory:", os.getcwd())

# 2) Paths relative to the CWD
TRAIN_PATH = 'data/curated/train_curated.csv'
TEST_PATH = 'data/curated/test_curated.csv'

# 3) Load the datasets
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

print("\nTrain shape:", train_df.shape)
print("Test shape:", test_df.shape)

# 4) Separate X and y
# NOTE: this drops a single column; any other `Credit_Score_*` dummy column stays in X.
TARGET = 'Credit_Score_Standard'  # Adjust to your actual target

X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

X_test = test_df.drop(columns=[TARGET])
y_test = test_df[TARGET]

print("\nX_train:", X_train.shape)
print("y_train:", y_train.shape)
print("X_test:", X_test.shape)
print("y_test:", y_test.shape)

# 5) Configure the MLflow Tracking URI
mlflow.set_tracking_uri("http://mlflow:5000")  # Adjust if necessary
print("\nTracking URI configured:", mlflow.get_tracking_uri())


Current Working Directory: /workspace

Train shape: (100000, 6305)
Test shape: (50000, 6305)

X_train: (100000, 6304)
y_train: (100000,)
X_test: (50000, 6304)
y_test: (50000,)

Tracking URI configured: http://mlflow:5000


## Baseline Experiment with MLflow and Progress Monitoring

This step runs the first baseline experiment, using MLflow to track parameters, metrics, and artifacts. To follow potentially slow operations, such as fitting the model (`fit`) and computing metrics, `tqdm` is used to monitor loops explicitly. This gives real-time visibility into progress while keeping the pipeline fully traceable.
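A progress bar wrapped around a single `fit` call can only advance once, when the call returns, so in practice it reports elapsed time rather than intermediate progress. As a minimal sketch of the same idea made explicit, the helper below times any block of code. The name `timed` and its `sink` parameter are hypothetical and are not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=print):
    """Measure the wall-clock time of the wrapped block (hypothetical helper)."""
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Runs even if the wrapped block raises, so the timing is never lost.
        result["seconds"] = time.perf_counter() - start
        sink(f"{label}: {result['seconds']:.2f}s")


# Usage: wrap any long call, for example model.fit(X_train, y_train).
with timed("Fitting model") as t:
    sum(i * i for i in range(100_000))  # stand-in for the real fit call
```

The returned dictionary keeps the measured duration, so it could also be logged as an MLflow metric next to accuracy and F1 if that is useful.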


In [2]:
# STEP: Redo Baseline - Working Directory Normalization, Curated Load, y Reconstruction, and MLflow Tracking

"""
Runs:
1) Validation and normalization of the working directory (CWD).
2) Loading of 'train_curated.csv' and 'test_curated.csv'.
3) Reconstruction of y from the dummy columns.
4) Consistent separation of X and y.
5) Explicit Tracking URI and MinIO credentials.
6) DecisionTreeClassifier training with a progress bar.
7) MLflow logging of hyperparameters, metrics, and artifact.
8) Consistent final prints with 127.0.0.1 links.
"""

import os
import logging
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

# Silence redundant MLflow logger output
logging.getLogger("mlflow").setLevel(logging.ERROR)

# 1) Validate and fix the CWD
print("Current Working Directory (before):", os.getcwd())
os.chdir('/workspace')
print("Current Working Directory (after):", os.getcwd())

# 2) Paths relative to /workspace (the CWD set above)
TRAIN_PATH = 'data/curated/train_curated.csv'
TEST_PATH = 'data/curated/test_curated.csv'

# 3) Load the datasets
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

print("\nTrain shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("\ntrain_df.head(5):")
print(train_df.head(5))

# 4) Reconstruct y from the dummy columns
dummy_cols = [col for col in train_df.columns if col.startswith('Credit_Score_')]
print("\nDetected class columns:", dummy_cols)

# Rebuild y_train
# NOTE: idxmax returns the first column on ties, so a row whose dummies are
# all 0 is labeled with the first column in dummy_cols.
y_train = train_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_train = train_df.drop(columns=dummy_cols)

# Rebuild y_test
y_test = test_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_test = test_df.drop(columns=dummy_cols)

print(f"\nX_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

# 5) Tracking URI and MinIO credentials (hardcoded here for the local stack)
mlflow.set_tracking_uri("http://mlflow:5000")
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'

print("\nTracking URI:", mlflow.get_tracking_uri())
print("MLFLOW_S3_ENDPOINT_URL:", os.environ['MLFLOW_S3_ENDPOINT_URL'])

# 6) Create or retrieve the experiment
experiment_name = "QuantumFinance_CreditScore"
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="Baseline_DecisionTree_Curated") as run:
    params = {"max_depth": 5, "random_state": 42}
    mlflow.log_params(params)

    model = DecisionTreeClassifier(**params)

    print("\nTraining model with progress bar:")
    # Single-iteration bar: it only reports the elapsed time of fit()
    for _ in tqdm(range(1), desc="Fitting model"):
        model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')

    print(f"\nAccuracy: {acc:.4f}")
    print(f"F1 Score: {f1:.4f}")

    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_score", f1)
    mlflow.sklearn.log_model(model, "model")

    # Final prints use 127.0.0.1 links only
    print(f"\nRun ID: {run.info.run_id}")
    print(f"Open: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")
    print(f"View run Baseline_DecisionTree_Curated at: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")
    print(f"View experiment at: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}")


Current Working Directory (before): /workspace/notebooks
Current Working Directory (after): /workspace

Train shape: (100000, 6305)
Test shape: (50000, 6305)

train_df.head(5):
     Age  Annual_Income  Monthly_Inhand_Salary  Num_Bank_Accounts  \
0   23.0       19114.12            1824.843333                  3   
1   23.0       19114.12                    NaN                  3   
2 -500.0       19114.12                    NaN                  3   
3   23.0       19114.12                    NaN                  3   
4   23.0       19114.12            1824.843333                  3   

   Num_Credit_Card  Interest_Rate  Delay_from_due_date  Num_Credit_Inquiries  \
0                4              3                    3                   4.0   
1                4              3                   -1                   4.0   
2                4              3                    3                   4.0   
3                4              3                    5                   4.0   
4     

Fitting model: 100%|██████████| 1/1 [00:12<00:00, 12.90s/it]



Accuracy: 0.5496
F1 Score: 0.7093

Run ID: 86de72fd063e485fa584c1d8c0395aca
Open: http://127.0.0.1:5000/#/experiments/1/runs/86de72fd063e485fa584c1d8c0395aca
View run Baseline_DecisionTree_Curated at: http://127.0.0.1:5000/#/experiments/1/runs/86de72fd063e485fa584c1d8c0395aca
View experiment at: http://127.0.0.1:5000/#/experiments/1
🏃 View run Baseline_DecisionTree_Curated at: http://mlflow:5000/#/experiments/1/runs/86de72fd063e485fa584c1d8c0395aca
🧪 View experiment at: http://mlflow:5000/#/experiments/1
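The reconstruction of `y` with `idxmax(axis=1)` has one behavior worth checking before trusting the metrics: when every dummy column in a row is 0, `idxmax` returns the first column. The sketch below reproduces this on a toy frame and shows one possible guard. The toy data and the `"Unlabeled"` placeholder are assumptions for illustration only; the notebook does not show whether the curated files contain such rows.

```python
import pandas as pd

# Toy frame with the two dummy column names used in this notebook.
# The third row has no active dummy (for example a dropped reference
# class, or an unlabeled row that was filled with zeros).
dummies = pd.DataFrame(
    {
        "Credit_Score_Poor": [1, 0, 0],
        "Credit_Score_Standard": [0, 1, 0],
    }
)

# Same reconstruction as in the cell: ties resolve to the first column.
naive = dummies.idxmax(axis=1).str.replace("Credit_Score_", "", regex=False)

# Guarded variant: rows without any active dummy get an explicit label.
# "Unlabeled" is a hypothetical placeholder, not a value from the dataset.
has_label = dummies.sum(axis=1) > 0
guarded = naive.where(has_label, "Unlabeled")

print(naive.tolist())
print(guarded.tolist())
```

Counting `(~has_label).sum()` on the real `train_df` and `test_df` would show how many rows are affected, if any.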


## STEP: Model Improvement with Grid Search and MLflow Tracking

This block marks the transition from the baseline experiment to an incremental optimization stage, using a grid search to explore multiple hyperparameter combinations in a systematic, tracked way. The cells below iterate over `ParameterGrid` manually rather than calling `GridSearchCV`, so that each combination can be logged as its own run.

## Objective
- Find the most effective hyperparameter configuration for the decision tree (`DecisionTreeClassifier`).
- Log each tested combination as a separate MLflow run, with its parameters, metrics, and final artifact.

## Principles applied
- Tracking URI and MinIO/S3 backend kept consistent.
- Every variation is traceable, and previous runs are never overwritten.
- `tqdm` progress bar for visibility into long-running loops.
- Consistent final prints, with 127.0.0.1 links to the MLflow UI.

## Expected result
- Comparable metrics between the baseline and the grid search.
- Best model saved as an artifact in the MinIO bucket.
- Next step: prepare the pipeline to register the validated model in the MLflow Registry.
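The objective of finding the most effective configuration implies a selection step after the loop. The sketch below shows one way to enumerate a grid and pick the best combination using only the standard library. The helpers `expand_grid` and `best_result` are hypothetical names, and the score is a placeholder so the example runs without data; in the notebook the score would come from `f1_score` on the test predictions.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per hyperparameter combination (keys in sorted order)."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def best_result(results, metric="f1_score"):
    """Return the result with the highest value for `metric` (first one on ties)."""
    return max(results, key=lambda r: r["metrics"][metric])


param_grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 5, 10]}

results = []
for params in expand_grid(param_grid):
    # Placeholder score, NOT a real measurement: it only makes the sketch runnable.
    placeholder_f1 = 1.0 / (1 + abs(params["max_depth"] - 5))
    results.append({"params": params, "metrics": {"f1_score": placeholder_f1}})

best = best_result(results)
print(len(results), best["params"])
```

Collecting results in a list like this keeps the selection logic independent of MLflow, while each combination can still be logged as its own run inside the loop.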


In [4]:
# STEP: Manual Grid Search with ParameterGrid, tqdm, and MLflow Tracking

"""
Runs:
1) Normalizes the CWD to '/workspace'.
2) Loads train_curated and test_curated.
3) Reconstructs y from the dummy columns.
4) Iterates over ParameterGrid manually.
5) The tqdm bar advances once per combination.
6) Logs each combination as a separate MLflow run.
7) Consistent prints with 127.0.0.1 links.
"""

import os
import logging
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

# Silence redundant MLflow logger output
logging.getLogger("mlflow").setLevel(logging.ERROR)

# 1) Normalize the CWD
print("CWD before:", os.getcwd())
os.chdir('/workspace')
print("CWD after:", os.getcwd())

# 2) Paths
TRAIN_PATH = 'data/curated/train_curated.csv'
TEST_PATH = 'data/curated/test_curated.csv'

# 3) Load + reconstruct y
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

dummy_cols = [col for col in train_df.columns if col.startswith('Credit_Score_')]
print("Class columns:", dummy_cols)

y_train = train_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_train = train_df.drop(columns=dummy_cols)

y_test = test_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_test = test_df.drop(columns=dummy_cols)

print(f"X_train: {X_train.shape} | y_train: {y_train.shape}")

# 4) Manual ParameterGrid
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5, 10]
}
grid = ParameterGrid(param_grid)

# 5) Tracking URI + MinIO credentials
mlflow.set_tracking_uri("http://mlflow:5000")
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'

experiment_name = "QuantumFinance_CreditScore"
mlflow.set_experiment(experiment_name)

print("\nRunning manual grid search...")

for params in tqdm(grid, desc="Runs"):
    with mlflow.start_run(run_name=f"GridSearch_Manual_{params}") as run:
        model = DecisionTreeClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("f1_score", f1)
        mlflow.sklearn.log_model(model, "model")

        print(f"\nCombination: {params}")
        print(f"Accuracy: {acc:.4f} | F1 Score: {f1:.4f}")
        print(f"Run ID: {run.info.run_id}")
        print(f"Open: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")


CWD before: /workspace
CWD after: /workspace
Class columns: ['Credit_Score_Poor', 'Credit_Score_Standard']
X_train: (100000, 6303) | y_train: (100000,)

Running manual grid search...


Runs:  11%|█         | 1/9 [00:15<02:06, 15.75s/it]


Combination: {'max_depth': 3, 'min_samples_split': 2}
Accuracy: 0.5388 | F1 Score: 0.7003
Run ID: 8347a501625844da89aaa891510faa55
Open: http://127.0.0.1:5000/#/experiments/1/runs/8347a501625844da89aaa891510faa55
🏃 View run GridSearch_Manual_{'max_depth': 3, 'min_samples_split': 2} at: http://mlflow:5000/#/experiments/1/runs/8347a501625844da89aaa891510faa55
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  22%|██▏       | 2/9 [00:27<01:34, 13.56s/it]


Combination: {'max_depth': 3, 'min_samples_split': 5}
Accuracy: 0.5388 | F1 Score: 0.7003
Run ID: 9f9ded00e4594c7197535ae840bc8bfc
Open: http://127.0.0.1:5000/#/experiments/1/runs/9f9ded00e4594c7197535ae840bc8bfc
🏃 View run GridSearch_Manual_{'max_depth': 3, 'min_samples_split': 5} at: http://mlflow:5000/#/experiments/1/runs/9f9ded00e4594c7197535ae840bc8bfc
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  33%|███▎      | 3/9 [00:38<01:13, 12.20s/it]


Combination: {'max_depth': 3, 'min_samples_split': 10}
Accuracy: 0.5388 | F1 Score: 0.7003
Run ID: e3dda2cd137b4d96999617039c09af58
Open: http://127.0.0.1:5000/#/experiments/1/runs/e3dda2cd137b4d96999617039c09af58
🏃 View run GridSearch_Manual_{'max_depth': 3, 'min_samples_split': 10} at: http://mlflow:5000/#/experiments/1/runs/e3dda2cd137b4d96999617039c09af58
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  44%|████▍     | 4/9 [00:49<00:59, 11.97s/it]


Combination: {'max_depth': 5, 'min_samples_split': 2}
Accuracy: 0.5496 | F1 Score: 0.7093
Run ID: f969ead83d354f0f99321a872160c42d
Open: http://127.0.0.1:5000/#/experiments/1/runs/f969ead83d354f0f99321a872160c42d
🏃 View run GridSearch_Manual_{'max_depth': 5, 'min_samples_split': 2} at: http://mlflow:5000/#/experiments/1/runs/f969ead83d354f0f99321a872160c42d
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  56%|█████▌    | 5/9 [01:00<00:46, 11.51s/it]


Combination: {'max_depth': 5, 'min_samples_split': 5}
Accuracy: 0.5496 | F1 Score: 0.7093
Run ID: 7a13e6e4410945479787888336840c91
Open: http://127.0.0.1:5000/#/experiments/1/runs/7a13e6e4410945479787888336840c91
🏃 View run GridSearch_Manual_{'max_depth': 5, 'min_samples_split': 5} at: http://mlflow:5000/#/experiments/1/runs/7a13e6e4410945479787888336840c91
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  67%|██████▋   | 6/9 [01:11<00:33, 11.31s/it]


Combination: {'max_depth': 5, 'min_samples_split': 10}
Accuracy: 0.5496 | F1 Score: 0.7093
Run ID: d070175a3a8c448a9565e24d7e9871a9
Open: http://127.0.0.1:5000/#/experiments/1/runs/d070175a3a8c448a9565e24d7e9871a9
🏃 View run GridSearch_Manual_{'max_depth': 5, 'min_samples_split': 10} at: http://mlflow:5000/#/experiments/1/runs/d070175a3a8c448a9565e24d7e9871a9
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  78%|███████▊  | 7/9 [01:23<00:23, 11.62s/it]


Combination: {'max_depth': 7, 'min_samples_split': 2}
Accuracy: 0.5422 | F1 Score: 0.7032
Run ID: 9c9d83b3de294b5aab2357b42a67aa06
Open: http://127.0.0.1:5000/#/experiments/1/runs/9c9d83b3de294b5aab2357b42a67aa06
🏃 View run GridSearch_Manual_{'max_depth': 7, 'min_samples_split': 2} at: http://mlflow:5000/#/experiments/1/runs/9c9d83b3de294b5aab2357b42a67aa06
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs:  89%|████████▉ | 8/9 [01:35<00:11, 11.69s/it]


Combination: {'max_depth': 7, 'min_samples_split': 5}
Accuracy: 0.5422 | F1 Score: 0.7032
Run ID: d2685d32bdcb42b783e358095e9f5b63
Open: http://127.0.0.1:5000/#/experiments/1/runs/d2685d32bdcb42b783e358095e9f5b63
🏃 View run GridSearch_Manual_{'max_depth': 7, 'min_samples_split': 5} at: http://mlflow:5000/#/experiments/1/runs/d2685d32bdcb42b783e358095e9f5b63
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs: 100%|██████████| 9/9 [01:46<00:00, 11.88s/it]


Combination: {'max_depth': 7, 'min_samples_split': 10}
Accuracy: 0.5424 | F1 Score: 0.7033
Run ID: 0cfc26f47de549969c8fdb2ceef4565c
Open: http://127.0.0.1:5000/#/experiments/1/runs/0cfc26f47de549969c8fdb2ceef4565c
🏃 View run GridSearch_Manual_{'max_depth': 7, 'min_samples_split': 10} at: http://mlflow:5000/#/experiments/1/runs/0cfc26f47de549969c8fdb2ceef4565c
🧪 View experiment at: http://mlflow:5000/#/experiments/1





In [5]:
# STEP: Manual Grid Search with RandomForest, ParameterGrid, tqdm, and MLflow Tracking

"""
Runs:
1) Normalizes the CWD to '/workspace'.
2) Loads train_curated and test_curated.
3) Reconstructs y from the dummy columns.
4) Runs a manual grid search with RandomForestClassifier.
5) Uses ParameterGrid + tqdm for a real per-combination progress bar.
6) Logs each run in MLflow with consistent MinIO credentials.
7) Consistent final prints with 127.0.0.1 links.
"""

import os
import logging
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

# Silence redundant MLflow logger output
logging.getLogger("mlflow").setLevel(logging.ERROR)

# 1) Normalize the CWD
print("CWD before:", os.getcwd())
os.chdir('/workspace')
print("CWD after:", os.getcwd())

# 2) Paths
TRAIN_PATH = 'data/curated/train_curated.csv'
TEST_PATH = 'data/curated/test_curated.csv'

# 3) Load + reconstruct y
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)

dummy_cols = [col for col in train_df.columns if col.startswith('Credit_Score_')]
print("Class columns:", dummy_cols)

y_train = train_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_train = train_df.drop(columns=dummy_cols)

y_test = test_df[dummy_cols].idxmax(axis=1).str.replace('Credit_Score_', '')
X_test = test_df.drop(columns=dummy_cols)

print(f"X_train: {X_train.shape} | y_train: {y_train.shape}")

# 4) ParameterGrid for RandomForest
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "min_samples_split": [2, 5]
}
grid = ParameterGrid(param_grid)

# 5) Tracking URI + MinIO credentials
mlflow.set_tracking_uri("http://mlflow:5000")
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'

experiment_name = "QuantumFinance_CreditScore"
mlflow.set_experiment(experiment_name)

print("\nRunning manual grid search RandomForest...")

for params in tqdm(grid, desc="Runs RandomForest"):
    with mlflow.start_run(run_name=f"GridSearch_RF_{params}") as run:
        model = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("f1_score", f1)
        mlflow.sklearn.log_model(model, "model")

        print(f"\nCombination: {params}")
        print(f"Accuracy: {acc:.4f} | F1 Score: {f1:.4f}")
        print(f"Run ID: {run.info.run_id}")
        print(f"Open: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")


CWD before: /workspace
CWD after: /workspace
Class columns: ['Credit_Score_Poor', 'Credit_Score_Standard']
X_train: (100000, 6303) | y_train: (100000,)

Running manual grid search RandomForest...


Runs RandomForest:  12%|█▎        | 1/8 [00:10<01:14, 10.66s/it]


Combination: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 50}
Accuracy: 0.1160 | F1 Score: 0.2078
Run ID: 7e1e8f44f2a14b469b513b183d4a0ae5
Open: http://127.0.0.1:5000/#/experiments/1/runs/7e1e8f44f2a14b469b513b183d4a0ae5
🏃 View run GridSearch_RF_{'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 50} at: http://mlflow:5000/#/experiments/1/runs/7e1e8f44f2a14b469b513b183d4a0ae5
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  25%|██▌       | 2/8 [00:18<00:55,  9.26s/it]


Combination: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy: 0.0560 | F1 Score: 0.1061
Run ID: 7e2a7fd6532a415790f7e0297eaef518
Open: http://127.0.0.1:5000/#/experiments/1/runs/7e2a7fd6532a415790f7e0297eaef518
🏃 View run GridSearch_RF_{'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 100} at: http://mlflow:5000/#/experiments/1/runs/7e2a7fd6532a415790f7e0297eaef518
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  38%|███▊      | 3/8 [00:26<00:42,  8.41s/it]


Combination: {'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 50}
Accuracy: 0.1160 | F1 Score: 0.2079
Run ID: f6e6ac21b469427999d1ac0fd1c65fd4
Open: http://127.0.0.1:5000/#/experiments/1/runs/f6e6ac21b469427999d1ac0fd1c65fd4
🏃 View run GridSearch_RF_{'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 50} at: http://mlflow:5000/#/experiments/1/runs/f6e6ac21b469427999d1ac0fd1c65fd4
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  50%|█████     | 4/8 [00:33<00:31,  7.93s/it]


Combination: {'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 100}
Accuracy: 0.0560 | F1 Score: 0.1061
Run ID: ce6f759ad293462098f404894ae35310
Open: http://127.0.0.1:5000/#/experiments/1/runs/ce6f759ad293462098f404894ae35310
🏃 View run GridSearch_RF_{'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 100} at: http://mlflow:5000/#/experiments/1/runs/ce6f759ad293462098f404894ae35310
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  62%|██████▎   | 5/8 [00:40<00:22,  7.56s/it]


Combination: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 50}
Accuracy: 0.2117 | F1 Score: 0.3494
Run ID: 30dbe7a0952a4d86a1eed78941ec550b
Open: http://127.0.0.1:5000/#/experiments/1/runs/30dbe7a0952a4d86a1eed78941ec550b
🏃 View run GridSearch_RF_{'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 50} at: http://mlflow:5000/#/experiments/1/runs/30dbe7a0952a4d86a1eed78941ec550b
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  75%|███████▌  | 6/8 [00:47<00:14,  7.42s/it]


Combination: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy: 0.1582 | F1 Score: 0.2732
Run ID: 3a35386b08854259abb939bffb29abb8
Open: http://127.0.0.1:5000/#/experiments/1/runs/3a35386b08854259abb939bffb29abb8
🏃 View run GridSearch_RF_{'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 100} at: http://mlflow:5000/#/experiments/1/runs/3a35386b08854259abb939bffb29abb8
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest:  88%|████████▊ | 7/8 [00:54<00:07,  7.32s/it]


Combination: {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 50}
Accuracy: 0.2117 | F1 Score: 0.3494
Run ID: ee503b2c1af948db890c83f7a178a5de
Open: http://127.0.0.1:5000/#/experiments/1/runs/ee503b2c1af948db890c83f7a178a5de
🏃 View run GridSearch_RF_{'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 50} at: http://mlflow:5000/#/experiments/1/runs/ee503b2c1af948db890c83f7a178a5de
🧪 View experiment at: http://mlflow:5000/#/experiments/1


Runs RandomForest: 100%|██████████| 8/8 [01:02<00:00,  7.78s/it]


Combination: {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 100}
Accuracy: 0.1581 | F1 Score: 0.2730
Run ID: 4693946ae8bf47a799823ab6e1844b93
Open: http://127.0.0.1:5000/#/experiments/1/runs/4693946ae8bf47a799823ab6e1844b93
🏃 View run GridSearch_RF_{'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 100} at: http://mlflow:5000/#/experiments/1/runs/4693946ae8bf47a799823ab6e1844b93
🧪 View experiment at: http://mlflow:5000/#/experiments/1
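In every run logged so far the weighted F1 is higher than the accuracy, and the two values appear to follow a fixed relation. One plausible reading, which this notebook does not confirm, is that `y_test` contains a single class: in that case precision for that class is 1, recall equals accuracy, and the weighted F1 reduces to `2 * acc / (1 + acc)`. The sketch below only checks that arithmetic against the logged values; the function name is hypothetical.

```python
# (accuracy, weighted F1) pairs copied from the decision tree and
# random forest run logs in this notebook.
logged = [
    (0.5496, 0.7093), (0.5388, 0.7003), (0.5422, 0.7032), (0.5424, 0.7033),
    (0.1160, 0.2078), (0.1160, 0.2079), (0.0560, 0.1061), (0.2117, 0.3494),
    (0.1582, 0.2732), (0.1581, 0.2730),
]


def f1_if_single_true_class(acc):
    """F1 of the only true class: precision = 1 and recall = accuracy."""
    return 2 * acc / (1 + acc)


# Largest difference between the logged F1 and the value the relation predicts.
max_gap = max(abs(f1 - f1_if_single_true_class(acc)) for acc, f1 in logged)
print(f"largest gap: {max_gap:.4f}")
```

If the gap stays within rounding error, inspecting `y_test.value_counts()` before comparing models would be a reasonable next check.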





In [1]:
# STEP: Manual Grid Search with GradientBoostingClassifier, ParameterGrid, tqdm, and MLflow

"""
Runs:
1) Normalizes the CWD to '/workspace'.
2) Loads 'train_clean.csv' from the processed layer.
3) Separates X and y using the original target.
4) Applies pd.get_dummies() to X.
5) Uses train_test_split with stratify.
6) Runs a manual grid search with GradientBoostingClassifier.
7) tqdm bar for real progress.
8) Logs each run separately in MLflow.
"""

import os
import logging
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

# Silence redundant logger output
logging.getLogger("mlflow").setLevel(logging.ERROR)

# 1) Normalize the CWD
print("CWD before:", os.getcwd())
os.chdir('/workspace')
print("CWD after:", os.getcwd())

# 2) Load the data
df = pd.read_csv('data/processed/train_clean.csv')
print("\ndf shape:", df.shape)
print("\ndf.head(5):\n", df.head(5))

# 3) Separate X and y
X = df.drop(columns=['Credit_Score'])
y = df['Credit_Score']

# 4) Preprocess X the same way as the curated layer
X = pd.get_dummies(X)

# 5) train_test_split + column alignment
# (the align call is a no-op here, since both splits come from the same encoded X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

print(f"\nX_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

# 6) Configure the ParameterGrid
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1]
}
grid = ParameterGrid(param_grid)

# 7) MLflow tracking + MinIO
mlflow.set_tracking_uri("http://mlflow:5000")
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'

experiment_name = "QuantumFinance_CreditScore"
mlflow.set_experiment(experiment_name)

print("\nRunning manual grid search - GradientBoostingClassifier...")

for params in tqdm(grid, desc="Runs GBoost"):
    with mlflow.start_run(run_name=f"GridSearch_GBoost_{params}") as run:
        model = GradientBoostingClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("f1_score", f1)
        mlflow.sklearn.log_model(model, "model")

        print(f"\nCombination: {params}")
        print(f"Accuracy: {acc:.4f} | F1 Score: {f1:.4f}")
        print(f"Run ID: {run.info.run_id}")
        print(f"Open: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")


CWD before: /workspace/notebooks
CWD after: /workspace

df shape: (100000, 28)

df.head(5):
        ID Customer_ID     Month           Name    Age          SSN Occupation  \
0  0x1602   CUS_0xd40   January  Aaron Maashoh   23.0  821-00-0265  Scientist   
1  0x1603   CUS_0xd40  February  Aaron Maashoh   23.0  821-00-0265  Scientist   
2  0x1604   CUS_0xd40     March  Aaron Maashoh -500.0  821-00-0265  Scientist   
3  0x1605   CUS_0xd40     April  Aaron Maashoh   23.0  821-00-0265  Scientist   
4  0x1606   CUS_0xd40       May  Aaron Maashoh   23.0  821-00-0265  Scientist   

   Annual_Income  Monthly_Inhand_Salary  Num_Bank_Accounts  ...  Credit_Mix  \
0       19114.12            1824.843333                  3  ...     Unknown   
1       19114.12                    NaN                  3  ...        Good   
2       19114.12                    NaN                  3  ...        Good   
3       19114.12                    NaN                  3  ...        Good   
4       19114.12         


# Memory Footprint Diagnostic

Before fitting, this cell computes the actual memory usage of the `X_train` and `y_train` objects.  
It records how much RAM the kernel will consume, using `deep=True` so that object contents, pointers, and indexes are counted.  
This value should stay below 70% of the container's actual RAM to avoid the kernel being terminated by the OOM killer.
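The diagnostic described here can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `footprint_bytes` and `fits_budget` are hypothetical, the demo data stands in for `X_train` and `y_train`, and the 8 GiB limit is an assumed value that should be replaced by the container's real memory limit.

```python
import pandas as pd


def footprint_bytes(*objects):
    """Total deep memory usage, in bytes, of the given DataFrames and Series."""
    total = 0
    for obj in objects:
        usage = obj.memory_usage(deep=True)
        # A DataFrame returns one entry per column (plus the index);
        # a Series returns a single integer.
        total += int(usage.sum()) if isinstance(obj, pd.DataFrame) else int(usage)
    return total


def fits_budget(used_bytes, ram_bytes, max_fraction=0.70):
    """True when the footprint stays within the chosen fraction of RAM."""
    return used_bytes <= max_fraction * ram_bytes


# Small stand-ins for X_train and y_train.
X_demo = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})
y_demo = pd.Series(["Poor"] * 1000)

used = footprint_bytes(X_demo, y_demo)
ram = 8 * 1024**3  # assumed container limit of 8 GiB
print(f"{used / 1024**2:.2f} MiB", fits_budget(used, ram))
```

Note that this measures only the data held in the DataFrames; the estimator typically allocates additional memory during `fit`, so the 70% threshold leaves headroom rather than guaranteeing safety.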


## Reloading the Curated Datasets: Corrected Path

This cell reloads the `train_curated.csv` and `test_curated.csv` datasets using the correct relative path, consistent with the `/workspace/` directory structure.



In [3]:
# STEP: Reload the Curated Datasets

import pandas as pd

train_df = pd.read_csv("../data/curated/train_curated.csv")
test_df = pd.read_csv("../data/curated/test_curated.csv")

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")


Train shape: (100000, 6305)
Test shape: (50000, 6305)


In [7]:
print(train_df.columns.tolist())


['Age', 'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Delay_from_due_date', 'Num_Credit_Inquiries', 'Outstanding_Debt', 'Credit_Utilization_Ratio', 'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance', 'Num_of_Loan_Bin', 'Changed_Credit_Limit_Bin', 'Num_of_Delayed_Payment_Bin', 'Credit_History_Age_Bin', 'Month_Num', 'Occupation_Architect', 'Occupation_Developer', 'Occupation_Doctor', 'Occupation_Engineer', 'Occupation_Entrepreneur', 'Occupation_Journalist', 'Occupation_Lawyer', 'Occupation_Manager', 'Occupation_Mechanic', 'Occupation_Media_Manager', 'Occupation_Musician', 'Occupation_Scientist', 'Occupation_Teacher', 'Occupation_Unknown', 'Occupation_Writer', 'Type_of_Loan_Auto Loan, Auto Loan, Auto Loan, Auto Loan, Credit-Builder Loan, Credit-Builder Loan, Mortgage Loan, and Personal Loan', 'Type_of_Loan_Auto Loan, Auto Loan, Auto Loan, Auto Loan, Student Loan, and Student Loan', 'Type_of_Loan_Auto Loan, Auto Loan, A

---
REOPEN FEATURE_ENGINEERING_CURADORIA.IPYNB TO REDUCE CARDINALITY
---

---


## Controlled Extension: Restricted One-Hot Encoding Before the Cardinality Check

This block applies a **restricted One-Hot Encoding** to the variables `Month`, `Occupation_Group`, and `Payment_Behaviour`, before any new cardinality diagnostic.

The goal is to see how many columns are added and compare the result with the previous pipeline (more than 6,300 columns) to check whether the footprint stays under control.
The result is saved as `CURATED V1.1` for full traceability.


In [5]:
# STEP: RESTRICTED ONE-HOT ENCODING V1.1 AND IMMEDIATE COMPARISON

import pandas as pd

# Paths
train_curated_v1 = '/workspace/data/curated/train_curated_v1.csv'
test_curated_v1  = '/workspace/data/curated/test_curated_v1.csv'

# Load
train_df = pd.read_csv(train_curated_v1)
test_df  = pd.read_csv(test_curated_v1)

# Columns to encode
cols_to_encode = ['Month', 'Occupation_Group', 'Payment_Behaviour']

# Apply the restricted OHE
train_encoded = pd.get_dummies(train_df, columns=cols_to_encode, drop_first=True)
test_encoded  = pd.get_dummies(test_df, columns=cols_to_encode, drop_first=True)

# Align columns so both frames share the same structure
# NOTE: join='outer' also creates, in each frame, any column that exists
# only in the other one (filled with 0).
train_encoded, test_encoded = train_encoded.align(test_encoded, join='outer', axis=1, fill_value=0)

# Compare shapes
print("\nOriginal V1 shape (train):", train_df.shape)
print("V1.1 shape after OHE (train):", train_encoded.shape)

print("\nOriginal V1 shape (test):", test_df.shape)
print("V1.1 shape after OHE (test):", test_encoded.shape)

# Save V1.1
train_encoded.to_csv('/workspace/data/curated/train_curated_v1_1.csv', index=False)
test_encoded.to_csv('/workspace/data/curated/test_curated_v1_1.csv', index=False)

print("\nCURATED V1.1 snapshots saved.")



Original V1 shape (train): (100000, 65)
V1.1 shape after OHE (train): (100000, 93)

Original V1 shape (test): (50000, 64)
V1.1 shape after OHE (test): (50000, 93)

CURATED V1.1 snapshots saved.
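The shapes above show the test set going from one column fewer than train (64 versus 65) to the same width (93) after alignment, which means the outer join created at least one column in test and filled it with zeros. The notebook does not show which column that is. The sketch below illustrates the mechanism on toy data and one alternative that aligns features only; the column names `feature` and `target` are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"feature": [1.0, 2.0], "target": [1, 0]})
test = pd.DataFrame({"feature": [3.0]})  # no label column

# Outer alignment: test gains a synthetic, all-zero "target" column.
train_outer, test_outer = train.align(test, join="outer", axis=1, fill_value=0)
print(test_outer)

# Alternative: reindex test on the feature columns only, so a missing
# label is never fabricated.
feature_cols = [c for c in train.columns if c != "target"]
test_features = test.reindex(columns=feature_cols, fill_value=0)
print(test_features)
```

With the second approach, feature columns missing from test are still created and zero-filled, but the label column is left out when it does not exist.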


## Atomic Versioning: CURATED V1.1 Snapshot

This block versions `train_curated_v1_1.csv` and `test_curated_v1_1.csv` atomically with `DVC` and `Git`.  
The flow ensures full traceability: physical existence check, a consistent commit, and a push to the MinIO backend.


In [6]:
# STEP: ATOMIC VERSIONING OF CURATED V1.1

import os
import subprocess

# V1.1 paths
train_curated_v1_1 = '/workspace/data/curated/train_curated_v1_1.csv'
test_curated_v1_1  = '/workspace/data/curated/test_curated_v1_1.csv'

# Check the CWD
print("\nCurrent working directory:", os.getcwd())

# Confirm the files physically exist
print("\nChecking physical existence:")
print("TRAIN V1.1:", os.path.exists(train_curated_v1_1))
print("TEST V1.1 :", os.path.exists(test_curated_v1_1))

if not os.path.exists(train_curated_v1_1) or not os.path.exists(test_curated_v1_1):
    raise FileNotFoundError("One of the CURATED V1.1 files was not found.")

# DVC add
print("\nRunning dvc add ...")
subprocess.run(['dvc', 'add', train_curated_v1_1], check=True)
subprocess.run(['dvc', 'add', test_curated_v1_1], check=True)

# Git add of the .dvc metadata
print("\nAdding .dvc metadata to Git ...")
subprocess.run(['git', 'add', f"{train_curated_v1_1}.dvc"], check=True)
subprocess.run(['git', 'add', f"{test_curated_v1_1}.dvc"], check=True)

# Consistent commit (message kept as recorded in the repository history)
print("\nCommitting to Git ...")
subprocess.run(['git', 'commit', '-m', 'Versionamento CURATED V1.1 com OHE restrito'], check=True)

# DVC push
print("\nRunning dvc push ...")
subprocess.run(['dvc', 'push'], check=True)

# Final Git push
print("\nRunning git push ...")
subprocess.run(['git', 'push'], check=True)

print("\nCURATED V1.1 versioning completed successfully.")



Current working directory: /workspace/notebooks

Checking physical existence:
TRAIN V1.1: True
TEST V1.1 : True

Running dvc add ...


Adding .dvc metadata to Git ...

Committing to Git ...
[main 7d92073] Versionamento CURATED V1.1 com OHE restrito
 2 files changed, 10 insertions(+)
 create mode 100644 data/curated/test_curated_v1_1.csv.dvc
 create mode 100644 data/curated/train_curated_v1_1.csv.dvc

Running dvc push ...
2 files pushed

Running git push ...

CURATED V1.1 versioning completed successfully.


To github.com:WRMELO/MBA_MLOPS.git
   f76be84..7d92073  main -> main

## Footprint Diagnostic: CURATED V1.1 Snapshot

This block measures the **disk** and **RAM** footprint of the `CURATED V1.1` files, so the numbers stay comparable with `V1` and with the original pipeline.

Values are reported in MB.


In [7]:
# STEP: FOOTPRINT DIAGNOSTIC FOR CURATED V1.1

import os
import pandas as pd

# Helpers
def file_size(path):
    """Return the file size on disk in MB."""
    size_bytes = os.path.getsize(path)
    return round(size_bytes / (1024 * 1024), 2)

def memory_usage_df(path):
    """Load the CSV and return its in-memory size in MB (deep=True counts string contents)."""
    df = pd.read_csv(path)
    return round(df.memory_usage(deep=True).sum() / (1024 * 1024), 2)

# V1.1 paths
train_v1_1 = '/workspace/data/curated/train_curated_v1_1.csv'
test_v1_1  = '/workspace/data/curated/test_curated_v1_1.csv'

# Disk
train_disk = file_size(train_v1_1)
test_disk  = file_size(test_v1_1)

# RAM
train_mem = memory_usage_df(train_v1_1)
test_mem  = memory_usage_df(test_v1_1)

print("\nSize on DISK (MB):")
print(f"TRAIN V1.1: {train_disk} MB")
print(f"TEST V1.1 : {test_disk} MB")

print("\nFootprint in MEMORY (MB):")
print(f"TRAIN V1.1: {train_mem} MB")
print(f"TEST V1.1 : {test_mem} MB")



Size on DISK (MB):
TRAIN V1.1: 63.33 MB
TEST V1.1 : 30.67 MB

Footprint in MEMORY (MB):
TRAIN V1.1: 165.26 MB
TEST V1.1 : 81.38 MB
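
The in-memory footprint above is roughly 2.6 times the size on disk. One likely contributor, given that the measurement uses `memory_usage(deep=True)`, is the set of string columns. The sketch below, on synthetic data, shows how converting a low-cardinality string column to `category` changes the measurement. The frame is made up for illustration, and the notebook itself does not apply this conversion.

```python
import pandas as pd


def memory_mb(df):
    """Total memory in MB, including the string payloads (deep=True)."""
    return df.memory_usage(deep=True).sum() / (1024 * 1024)


# Synthetic frame (hypothetical): one low-cardinality string column.
toy = pd.DataFrame({"Age_Binned": ["Jovem", "Adulto", "Idoso", "Adulto"] * 25_000})

as_strings = memory_mb(toy)
as_category = memory_mb(toy.astype({"Age_Binned": "category"}))

print(f"strings : {as_strings:.2f} MB")
print(f"category: {as_category:.2f} MB")
```

A `category` column stores each distinct label once plus a small integer code per row, which is why it tends to be smaller when the number of distinct values is low.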


In [8]:
# STEP: RELOAD AND COMPARE THE FINAL SHAPE OF CURATED V1.1

import pandas as pd

# V1.1 paths
train_curated_v1_1 = '/workspace/data/curated/train_curated_v1_1.csv'
test_curated_v1_1  = '/workspace/data/curated/test_curated_v1_1.csv'

# Reload the DataFrames
train_df_v1_1 = pd.read_csv(train_curated_v1_1)
test_df_v1_1  = pd.read_csv(test_curated_v1_1)

# Show shapes
print("\nCurrent shape of TRAIN CURATED V1.1:", train_df_v1_1.shape)
print("Current shape of TEST CURATED V1.1 :", test_df_v1_1.shape)



Current shape of TRAIN CURATED V1.1: (100000, 93)
Current shape of TEST CURATED V1.1 : (50000, 93)


## Footprint Comparison: Before, CURATED V1, and CURATED V1.1

Before supervised binning and controlled grouping were applied, the Feature Engineering pipeline produced a very wide dataset, reaching **6,305 columns** per sample.

- **Train shape (old)**: 100,000 rows × 6,305 columns  
- **Test shape (old)**: 50,000 rows × 6,305 columns

This width came from **indiscriminate one-hot encoding** of rare categories and of continuous variables split into many distinct values, and it led to out-of-memory failures (OOM Killer) even on well-provisioned machines.

After a full review, with:
- Detailed statistical diagnostics,
- Supervised binning with ranges that make business sense,
- Grouping of rare categories,
- Removal of redundancies while keeping traceability,

the footprint dropped sharply to:

- **Train shape (CURATED V1)**: 100,000 rows × 65 columns  
- **Test shape (CURATED V1)**: 50,000 rows × 64 columns

To keep the model interpretable and let it capture seasonality and behavioural profiles, an extension with **restricted One-Hot Encoding** was applied to a few strategic variables only (`Month`, `Occupation_Group`, `Payment_Behaviour`):

- **Train shape (CURATED V1.1)**: 100,000 rows × 93 columns  
- **Test shape (CURATED V1.1)**: 50,000 rows × 93 columns

Overall, dimensionality went from more than 6,300 columns down to **93**, which makes local fitting feasible and keeps the dataset interpretable and traceable, as required by **PROTOCOLO V5.4**.
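
The size of this reduction can be checked with a few lines of arithmetic. The snippet below only restates the shapes reported above, and the helper name `column_reduction` is illustrative.

```python
def column_reduction(before, after):
    """Fraction of columns removed when going from `before` to `after`."""
    return 1 - after / before


# Column counts taken from the shapes reported in this notebook.
old_cols, v1_cols, v1_1_cols = 6305, 65, 93

print(f"Old pipeline -> V1  : {column_reduction(old_cols, v1_cols):.1%}")
print(f"Old pipeline -> V1.1: {column_reduction(old_cols, v1_1_cols):.1%}")
print(f"Columns added to train by the restricted OHE: {v1_1_cols - v1_cols}")
```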


In [11]:
# STEP: FINAL DIAGNOSTIC OF RESIDUAL STRINGS IN X

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load V1.1
df = pd.read_csv('/workspace/data/curated/train_curated_v1_1.csv')

# Split into X and y
target = 'Credit_Score'
X = df.drop(columns=[target])
y = df[target]

# Check dtypes
print("\nDtypes in X before encoding:")
print(X.dtypes.value_counts())

# Identify object columns
object_cols = X.select_dtypes(include=['object']).columns.tolist()
print("\nObject columns detected:", object_cols)

# Show the unique values of each column for auditing
for col in object_cols:
    print(f"\nUnique values in {col}:", X[col].unique())

# Apply LabelEncoder column by column.
# Note: the integer codes follow alphabetical order, not the business order
# of the bins, and astype(str) turns NaN into the literal string 'nan'.
for col in object_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))

# Check the result
print("\nDtypes in X after encoding:")
print(X.dtypes.value_counts())

# Train/validation split once everything is numeric
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)



Dtypes in X before encoding:
bool       50
object     22
float64    13
int64       7
Name: count, dtype: int64

Object columns detected: ['Age_Binned', 'Amount_invested_monthly_Binned', 'Annual_Income_Binned', 'Changed_Credit_Limit_Binned', 'Credit_History_Age', 'Credit_History_Age_Binned', 'Credit_Mix', 'Credit_Utilization_Ratio_Binned', 'Delay_from_due_date_Binned', 'Interest_Rate_Binned', 'Monthly_Balance_Binned', 'Monthly_Inhand_Salary_Binned', 'Num_Bank_Accounts_Binned', 'Num_Credit_Card_Binned', 'Num_Credit_Inquiries_Binned', 'Num_of_Delayed_Payment_Binned', 'Num_of_Loan_Binned', 'Occupation', 'Outstanding_Debt_Binned', 'Payment_of_Min_Amount', 'Total_EMI_per_month_Binned', 'Type_of_Loan']

Unique values in Age_Binned: ['Jovem' 'Adulto' 'Idoso' 'Erro']

Unique values in Amount_invested_monthly_Binned: ['Baixo' 'Nenhum' 'Moderado' 'Alto']

Unique values in Annual_Income_Binned: ['Baixa' 'Média' 'Alta' 'Muito_Alta']

Unique values in Changed_Credit_Limit_Binned: ['Aum

---
## Supervised Baseline: CURATED V1 with MLflow Tracking

This block runs the **baseline fit** on the optimized `CURATED V1` layer.  
It uses a Decision Tree with a capped depth and logs the main metrics to **MLflow**, so the experiment stays fully traceable.

Configuration:
- Target: `Credit_Score`
- Features: all numeric, binned, and grouped columns, excluding IDs and redundant text
- Tracking URI: internal (`http://mlflow:5000`), matching the container setup


---
## Baseline Fit: CURATED V1.1 Snapshot with Correct Preprocessing

This block runs the **supervised baseline fit** on `CURATED V1.1`,  
using the cleaned `X_train`, in which every `_Binned` and grouped column has been **converted to numeric values** with `LabelEncoder`.

It uses a Decision Tree (`max_depth=5`) and tracks the run in the **same MLflow project**. Inside the container the tracking URI is `http://mlflow:5000`; from the host, the UI is reachable at `http://127.0.0.1:5000`.


In [13]:
import os

# 🔐 Persistent variables for boto3
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://127.0.0.1:9000'
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'

print("✅ MinIO credentials configured for MLflow.")


✅ MinIO credentials configured for MLflow.
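
The cell above writes the MinIO credentials directly in the notebook. An alternative, assuming the container can receive these variables from its environment (for example through the service definition), is to have the notebook only validate that they are present. The sketch below is illustrative: `missing_s3_settings` is a hypothetical helper, and the demo values are placeholders.

```python
import os

REQUIRED_VARS = ("MLFLOW_S3_ENDPOINT_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_s3_settings(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demo with a plain dict standing in for os.environ (placeholder values).
demo_env = {"MLFLOW_S3_ENDPOINT_URL": "http://minio:9000", "AWS_ACCESS_KEY_ID": "example"}
missing = missing_s3_settings(demo_env)
print("Missing:", missing)
```

In a notebook, a non-empty result could be turned into an early error before any MLflow call is made.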


In [15]:
# STEP: FINAL BASELINE FIT ON CURATED V1.1 WITH THE CORRECT ENDPOINT

import os
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Point the S3 endpoint at the service name inside the Docker network
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://minio:9000'
os.environ['AWS_ACCESS_KEY_ID'] = 'wrm'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'senha_segura'

print("✅ S3 endpoint inside the container:", os.environ['MLFLOW_S3_ENDPOINT_URL'])

# MLflow tracking
mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("Baseline_Curated_V1.1")

with mlflow.start_run():
    clf = DecisionTreeClassifier(max_depth=5, random_state=42)
    clf.fit(X_train, y_train)

    # Evaluate on the validation split
    y_pred = clf.predict(X_val)

    acc = accuracy_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred, average='macro')

    mlflow.log_param("model_type", "DecisionTreeClassifier")
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)

    mlflow.sklearn.log_model(clf, "model_baseline_v1_1")

print(f"\nBaseline finished | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")


✅ S3 endpoint inside the container: http://minio:9000




🏃 View run receptive-lynx-60 at: http://mlflow:5000/#/experiments/3/runs/de0cd9753dd74e6b8f7d5c26663add2f
🧪 View experiment at: http://mlflow:5000/#/experiments/3

Baseline finished | Accuracy: 0.6881 | F1 Macro: 0.6519
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


## Supervised Grid Search: CURATED V1.1 with MLflow Tracking

This block runs a **supervised Grid Search** for `DecisionTreeClassifier` on `CURATED V1.1`.  
It uses:
- The same `X_train` and `y_train`, already encoded.
- scikit-learn's `GridSearchCV`.
- Tracking in the **same MLflow experiment**. The next code cell logs the best combination of hyperparameters, its score, and the fitted model.

The goal is to find the combination of `max_depth` and `min_samples_split` that maximizes **F1 Macro**.
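
If every hyperparameter combination should appear in MLflow, and not only the best one, the per-candidate scores can be read from `GridSearchCV.cv_results_` and logged as nested runs. The sketch below is one possible way to do it. `candidate_records` and `log_candidates` are hypothetical helpers, the scores in the demo dictionary are made up, and the MLflow part needs a reachable tracking server, so it is defined but not called here.

```python
def candidate_records(cv_results):
    """Turn a GridSearchCV.cv_results_-style dict into one record per candidate."""
    records = []
    for params, mean, std, rank in zip(
        cv_results["params"],
        cv_results["mean_test_score"],
        cv_results["std_test_score"],
        cv_results["rank_test_score"],
    ):
        records.append({
            "params": dict(params),
            "mean_f1_macro": float(mean),
            "std_f1_macro": float(std),
            "rank": int(rank),
        })
    return sorted(records, key=lambda rec: rec["rank"])


def log_candidates(records, parent_run_name="grid_search_decision_tree"):
    """Log each candidate as a nested MLflow run under one parent run."""
    import mlflow  # imported here so the rest of the sketch runs without MLflow

    with mlflow.start_run(run_name=parent_run_name):
        for rec in records:
            with mlflow.start_run(nested=True):
                mlflow.log_params(rec["params"])
                mlflow.log_metric("mean_f1_macro", rec["mean_f1_macro"])
                mlflow.log_metric("std_f1_macro", rec["std_f1_macro"])


# Hand-written stand-in for grid_search.cv_results_ (scores are made up).
fake_cv_results = {
    "params": [
        {"max_depth": 3, "min_samples_split": 2},
        {"max_depth": 5, "min_samples_split": 2},
    ],
    "mean_test_score": [0.60, 0.65],
    "std_test_score": [0.01, 0.02],
    "rank_test_score": [2, 1],
}
records = candidate_records(fake_cv_results)
print(records[0])
```

With a fitted search object, the call would be `log_candidates(candidate_records(grid_search.cv_results_))`.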


In [20]:
# STEP: SUPERVISED GRID SEARCH WITH MULTICLASS SCORING (DECISION TREE)

import math

import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# 1️⃣ Define the classifier and the hyperparameter grid
clf = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10, 20]
}

# 2️⃣ Use StratifiedKFold so every fold keeps the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3️⃣ Run GridSearchCV with f1_macro
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=cv,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters: {best_params}")
print(f"Best F1 Macro: {round(best_score, 4) if best_score is not None else 'N/A'}")

# 4️⃣ Log to MLflow (skip the metric if the score is NaN)
with mlflow.start_run(run_name="grid_search_decision_tree"):
    mlflow.log_params(best_params)
    if best_score is not None and not math.isnan(best_score):
        mlflow.log_metric("best_f1_macro", best_score)
    else:
        print("⚠️ Score is NaN; the metric will not be logged, to avoid a key conflict")

    # Log the fitted model
    mlflow.sklearn.log_model(
        sk_model=grid_search.best_estimator_,
        artifact_path="grid_search_model",
        input_example=X_train.iloc[:5, :]  # Optional: remove it if you do not want the warning
    )

print(f"\n🏃 GridSearch finished | Best F1 Macro: {round(best_score, 4) if best_score is not None else 'N/A'} | Parameters: {best_params}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")


Fitting 5 folds for each of 16 candidates, totalling 80 fits




Best parameters: {'max_depth': 10, 'min_samples_split': 10}
Best F1 Macro: 0.6828




🏃 View run grid_search_decision_tree at: http://mlflow:5000/#/experiments/3/runs/98db11909ed746028a61ba13a6b9609e
🧪 View experiment at: http://mlflow:5000/#/experiments/3

🏃 GridSearch finished | Best F1 Macro: 0.6828 | Parameters: {'max_depth': 10, 'min_samples_split': 10}
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


In [27]:
# 🔧 STEP: DATA PREPARATION AND SPLIT

"""
This block rebuilds X and y, runs a stratified train_test_split,
and prints shapes and classes as a consistency check.
"""

from sklearn.model_selection import train_test_split

# 1️⃣ Define X and y (adjust the name if the target is not 'Credit_Score').
# X is rebuilt from `df`, so the object columns are strings again here.
X = df.drop('Credit_Score', axis=1)
y = df['Credit_Score']

# 2️⃣ Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3️⃣ Check shapes and classes
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
print(f"Classes: {y_train.unique()}")


X_train shape: (70000, 92)
X_test shape: (30000, 92)
y_train shape: (70000,)
y_test shape: (30000,)
Classes: ['Standard' 'Poor' 'Good']


In [29]:
# 🔧 STEP: ENCODING + IMPUTATION + LOGISTIC REGRESSION BASELINE

"""
Self-contained block:
1️⃣ Inspects the data types
2️⃣ Applies OneHotEncoder to the categorical columns
3️⃣ Combines everything into the final X matrix
4️⃣ Imputes NaN with the mean in the numeric columns
5️⃣ Fits a multinomial Logistic Regression
6️⃣ Logs everything to MLflow
"""

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Type diagnostics
print("Initial diagnostics:")
print(X_train.dtypes)

# 2️⃣ Identify columns
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols   = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical: {categorical_cols}")
print(f"Numeric: {numerical_cols}")

# 3️⃣ Preprocessing pipeline
# Note: with the default remainder='drop', columns in neither list
# (the `bool` dummy columns) are not passed on to the classifier.
preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), numerical_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
])

# 4️⃣ Final pipeline with Logistic Regression
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(
        max_iter=1000, 
        solver='lbfgs',
        multi_class='multinomial'  # deprecated in scikit-learn >= 1.5
    ))
])

# 5️⃣ Fit the pipeline
pipeline.fit(X_train, y_train)

# 6️⃣ Prediction and metrics
y_pred = pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 7️⃣ MLflow tracking
with mlflow.start_run(run_name="logistic_regression_with_encoding_imputation", experiment_id=3):
    mlflow.log_param("solver", "lbfgs")
    mlflow.log_param("multi_class", "multinomial")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_param("imputer_strategy", "mean")
    mlflow.log_param("encoding", "OneHot")
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(pipeline, "logistic_regression_pipeline")

print(f"\n✅ Logistic Regression Baseline | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")


Initial diagnostics:
Age                                    float64
Age_Binned                              object
Amount_invested_monthly                float64
Amount_invested_monthly_Binned          object
Amount_invested_monthly_Binned_High       bool
                                        ...   
Type_of_Loan_Category_Mortgage Loan       bool
Type_of_Loan_Category_Not Specified       bool
Type_of_Loan_Category_Payday Loan         bool
Type_of_Loan_Category_Personal Loan       bool
Type_of_Loan_Category_Student Loan        bool
Length: 92, dtype: object
Categorical: ['Age_Binned', 'Amount_invested_monthly_Binned', 'Annual_Income_Binned', 'Changed_Credit_Limit_Binned', 'Credit_History_Age', 'Credit_History_Age_Binned', 'Credit_Mix', 'Credit_Utilization_Ratio_Binned', 'Delay_from_due_date_Binned', 'Interest_Rate_Binned', 'Monthly_Balance_Binned', 'Monthly_Inhand_Salary_Binned', 'Num_Bank_Accounts_Binned', 'Num_Credit_Card_Binned', 'Num_Credit_Inquiries_Binned', 'Num_of_Delayed_Paym

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


🏃 View run logistic_regression_with_encoding_imputation at: http://mlflow:5000/#/experiments/3/runs/7edcbfe85b1440bba3fde06fe3b76615
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ Logistic Regression Baseline | Accuracy: 0.5415 | F1 Macro: 0.355
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


# Technical Rationale: Targeted Normalization for Scale-Sensitive Models

To preserve traceability and reuse of the **v1.1** dataset, we decided to:
- Keep the **v1.1 dataset** **unchanged** (all original columns, with no physical change to the file or table).
- Apply **normalization only to the numeric variables**, **in memory**, using `StandardScaler` from `sklearn`.
- Treat this normalization as **temporary**: it happens **at training time**, for the models that need features on the same scale (for example Logistic Regression, SVM, KNN, and neural networks).

**Why not normalize the whole dataset at the source?**  
Keeping the dataset unscaled makes it easier to audit, to debug features, and to compare pipelines with and without preprocessing.

Therefore:
- Dataset **v1.1** = single, traceable base.
- Normalization = applied **in the pipeline**, on **X_train/X_test**, to the numeric columns only.
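
One way to keep the normalization inside the pipeline, as described above, is a `ColumnTransformer` that imputes and scales only the numeric columns and passes the rest through. This is a minimal sketch on toy data. The column names are borrowed from the notebook, but the values, the `numeric_cols` list, and the choice of `remainder="passthrough"` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values).
X_toy = pd.DataFrame({
    "Annual_Income": [20_000.0, 85_000.0, np.nan, 40_000.0, 120_000.0, 30_000.0],
    "Num_of_Loan": [1, 4, 2, 0, 5, 1],
    "Month_October": [True, False, False, True, False, True],
})
y_toy = ["Poor", "Good", "Standard", "Poor", "Good", "Standard"]
numeric_cols = ["Annual_Income", "Num_of_Loan"]

# Impute, then scale, the numeric columns only.
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer(
    [("num", numeric_steps, numeric_cols)],
    remainder="passthrough",  # keep the remaining columns untouched
)
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# The scaler statistics are learned inside fit(), from the training data only.
model.fit(X_toy, y_toy)
scaled = model.named_steps["preprocessor"].transform(X_toy)
print(scaled.shape)
print(X_toy["Annual_Income"].tolist())  # the source frame is left unchanged
```

Because the scaler lives inside the pipeline, the same fitted statistics are reused at prediction time, and the stored dataset is never modified.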



In [31]:
# 🔧 STEP: STANDARD NORMALIZATION OF THE NUMERIC VARIABLES

"""
This cell applies StandardScaler only to the numeric variables of the v1.1 dataset.
"""

from sklearn.preprocessing import StandardScaler

# Example: list the confirmed numeric columns explicitly
numerical_features = [
    'Age', 'Amount_invested_monthly', 'Annual_Income', 'Changed_Credit_Limit',
    'Credit_History_Age_Months', 'Credit_Utilization_Ratio', 'Delay_from_due_date',
    'Interest_Rate', 'Monthly_Balance', 'Monthly_Inhand_Salary',
    'Num_Bank_Accounts', 'Num_Credit_Card', 'Num_Credit_Inquiries',
    'Num_of_Delayed_Payment', 'Num_of_Loan', 'Outstanding_Debt',
    'Total_EMI_per_month'
    # Adjust to match your validated list
]

# Initialize the scaler
scaler = StandardScaler()

# Fit on train, then transform both train and test
X_train_scaled = X_train.copy()
X_test_scaled  = X_test.copy()

X_train_scaled[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test_scaled[numerical_features]  = scaler.transform(X_test[numerical_features])

print("✅ Normalization applied in memory | Shapes identical to v1.1")
print(f"X_train_scaled shape: {X_train_scaled.shape}")
print(f"X_test_scaled shape : {X_test_scaled.shape}")


✅ Normalization applied in memory | Shapes identical to v1.1
X_train_scaled shape: (70000, 92)
X_test_scaled shape : (30000, 92)


In [38]:
# 🔧 STEP: IMPUTATION + NORMALIZATION + NUMERIC LOGISTIC REGRESSION

"""
Rebuilds X_train_num and X_test_num from the original dataset (v1_1),
imputes missing values (mean), normalizes with StandardScaler,
fits a multinomial Logistic Regression, and logs everything to MLflow.
"""

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Select the numeric columns (includes three Month_* dummy columns)
cols_num = [
    'Age', 'Amount_invested_monthly', 'Annual_Income', 'Changed_Credit_Limit',
    'Credit_History_Age_Months', 'Credit_Utilization_Ratio', 'Delay_from_due_date',
    'Interest_Rate', 'Month_November', 'Month_October', 'Month_September',
    'Monthly_Balance', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts',
    'Num_Credit_Card', 'Num_Credit_Inquiries', 'Num_of_Delayed_Payment',
    'Num_of_Loan', 'Outstanding_Debt', 'Total_EMI_per_month'
]

X_train_num = X_train[cols_num].copy()
X_test_num  = X_test[cols_num].copy()

# 2️⃣ Impute NaN with the mean (fitted on train only)
imputer = SimpleImputer(strategy='mean')
X_train_num_imputed = imputer.fit_transform(X_train_num)
X_test_num_imputed  = imputer.transform(X_test_num)

# 3️⃣ Normalize (fitted on train only)
scaler = StandardScaler()
X_train_scaled_num = scaler.fit_transform(X_train_num_imputed)
X_test_scaled_num  = scaler.transform(X_test_num_imputed)

print(f"✅ X_train_scaled_num shape: {X_train_scaled_num.shape}")
print(f"✅ X_test_scaled_num shape : {X_test_scaled_num.shape}")

# 4️⃣ Fit the multinomial Logistic Regression
logreg = LogisticRegression(
    max_iter=1000,
    solver='lbfgs',
    multi_class='multinomial'
)
logreg.fit(X_train_scaled_num, y_train)

# 5️⃣ Prediction and metrics
y_pred = logreg.predict(X_test_scaled_num)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 6️⃣ MLflow tracking
with mlflow.start_run(run_name="logistic_regression_num_scaled", experiment_id=3):
    mlflow.log_param("solver", "lbfgs")
    mlflow.log_param("multi_class", "multinomial")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_param("imputer_strategy", "mean")
    mlflow.log_param("scaler", "StandardScaler")
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(logreg, "logistic_regression_model")

print(f"\n✅ Logistic Regression Numeric | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")


✅ X_train_scaled_num shape: (70000, 20)
✅ X_test_scaled_num shape : (30000, 20)




🏃 View run logistic_regression_num_scaled at: http://mlflow:5000/#/experiments/3/runs/78d457acce314979815f94fc80d286de
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ Logistic Regression Numeric | Accuracy: 0.5928 | F1 Macro: 0.4974
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


## Technical Rationale: Running the SVM on Normalized Numeric Variables

This block continues the experimentation with models that need data on a uniform scale.  
The **Support Vector Machine (SVM)** is sensitive to the magnitude of the features, so **normalization is effectively required** for the classes to be separated well by the decision boundary.

We are using:
- **Dataset version 1.2**, which holds only **numeric** variables, already **imputed** and **normalized** with `StandardScaler`.
- Arrays: `X_train_scaled_num` and `X_test_scaled_num`.

Goal:
- Produce a baseline for the SVM inside the same traceable **MLflow** flow, with versioning, consistent parameters (`kernel`, `C`), and a fair comparison with the other algorithms that also need normalization.

This run follows the **self-contained blocks** protocol, with a clear technical header and full logging of parameters and metrics.

After the SVM, the pipeline moves on to **KNN** and **MLP**, keeping the same structure for validation.



In [40]:
# 🔧 STEP: SVM on Normalized Numeric Features

"""
This block fits a Support Vector Machine (SVM)
using only the imputed and normalized numeric array (v1.2).
It covers fitting, prediction, metrics, and MLflow logging.
"""

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Instantiate the SVM model
svm = SVC(kernel='rbf', C=1.0)

# 2️⃣ Fit
svm.fit(X_train_scaled_num, y_train)

# 3️⃣ Prediction
y_pred = svm.predict(X_test_scaled_num)

# 4️⃣ Metrics
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 5️⃣ MLflow tracking
with mlflow.start_run(run_name="svm_num_scaled", experiment_id=3):
    mlflow.log_param("kernel", "rbf")
    mlflow.log_param("C", 1.0)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(svm, "svm_model")

print(f"\n✅ SVM Numeric | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")




🏃 View run svm_num_scaled at: http://mlflow:5000/#/experiments/3/runs/8148b15e77874ab7ac850fdb0e658e22
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ SVM Numeric | Accuracy: 0.6223 | F1 Macro: 0.4626
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


# Result Record: SVM on Normalized Numeric Data

The SVM ran as planned, with `StandardScaler` applied for a fair comparison and better class separation.  
- **Dataset:** `v1.2` (numeric, imputed and normalized)  
- **Accuracy:** 0.6223  
- **F1 Macro:** 0.4626  
- **MLflow run:** [Run link](http://mlflow:5000/#/experiments/3/runs/8148b15e77874ab7ac850fdb0e658e22)

This step reinforces the need to keep standardization in place for scale-sensitive algorithms, and it documents the versioning for full pipeline traceability.

**Next steps:**  
Continue with **K-Nearest Neighbors (KNN)** and the **MLP Classifier**, using the same `X_train_scaled_num` and `X_test_scaled_num` arrays to complete the comparison of normalization-sensitive models.
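
The SVM's accuracy (0.6223) is well above its F1 Macro (0.4626). F1 Macro is the unweighted mean of the per-class F1 scores, so a gap like this can appear when some classes are predicted much worse than others. The notebook does not report per-class scores, so the example below only illustrates the mechanism on made-up labels. `per_class_f1` is a hypothetical helper written in plain Python.

```python
def per_class_f1(y_true, y_pred):
    """Compute the F1 score of each class from true and predicted labels."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up labels: the majority class is predicted well, the others are not.
y_true = ["Standard"] * 6 + ["Poor"] * 3 + ["Good"] * 1
y_pred = ["Standard"] * 6 + ["Poor", "Standard", "Standard"] + ["Standard"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_by_class = per_class_f1(y_true, y_pred)
f1_macro = sum(f1_by_class.values()) / len(f1_by_class)

print(f"Accuracy: {accuracy:.3f}")
print(f"Per-class F1: {f1_by_class}")
print(f"F1 Macro: {f1_macro:.3f}")
```

On real runs, `sklearn.metrics.classification_report` gives the same per-class breakdown.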


# 🔧 STEP: K-Nearest Neighbors Baseline on Normalized Numeric Features

This step runs KNN as part of the group of algorithms that need normalized data (`v1.2`).  
The goal is to evaluate KNN on the same numeric features previously scaled with `StandardScaler`, keeping the models comparable.  
All parameters, metrics, and artifacts are tracked in MLflow, following the versioning protocol.

- **Dataset:** v1.2 (numeric features, imputed and normalized)  
- **Note:** no OneHotEncoder is applied in this step; only the numeric columns selected in `cols_num` are used
- **Metrics:** Accuracy and F1 Macro  


In [41]:
# 🔧 STEP: KNN Baseline on Normalized Numeric Features

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Define and fit the KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled_num, y_train)

# 2️⃣ Prediction and metrics
y_pred = knn.predict(X_test_scaled_num)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 3️⃣ MLflow
with mlflow.start_run(run_name="knn_num_scaled", experiment_id=3):
    mlflow.log_param("n_neighbors", 5)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(knn, "knn_model")

print(f"\n✅ KNN Numeric | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")




🏃 View run knn_num_scaled at: http://mlflow:5000/#/experiments/3/runs/f532e223f98c449fae1e0ce12b6ad03b
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ KNN Numeric | Accuracy: 0.5822 | F1 Macro: 0.5407
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


# 🔧 STEP: MLP Classifier Baseline on Normalized Numeric Features

This step applies the **Multi-layer Perceptron (MLP Classifier)** to the `v1.2` set,  
which holds only numeric variables, imputed and normalized with `StandardScaler`.  
The goal is to see how a simple neural network behaves in this setting, with the run tracked in MLflow.  
The logged hyperparameters, metrics, and artifacts are versioned.

- **Dataset:** v1.2 (numeric features, imputed and normalized)  
- **Metrics:** Accuracy and F1 Macro  
- **Note:** the MLP is particularly sensitive to unscaled data.


In [42]:
# 🔧 STEP: MLP Classifier Baseline on Normalized Numeric Features

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score
import mlflow
import mlflow.sklearn

# 1️⃣ Define and fit the MLP
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp.fit(X_train_scaled_num, y_train)

# 2️⃣ Prediction and metrics
y_pred = mlp.predict(X_test_scaled_num)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

# 3️⃣ MLflow
with mlflow.start_run(run_name="mlp_num_scaled", experiment_id=3):
    mlflow.log_param("hidden_layer_sizes", "(100,)")
    mlflow.log_param("max_iter", 300)
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.sklearn.log_model(mlp, "mlp_model")

print(f"\n✅ MLP Numeric | Accuracy: {round(acc,4)} | F1 Macro: {round(f1,4)}")
print("Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000")




🏃 View run mlp_num_scaled at: http://mlflow:5000/#/experiments/3/runs/aa47219a2fc14174a3fb8851ad6bc814
🧪 View experiment at: http://mlflow:5000/#/experiments/3

✅ MLP Numeric | Accuracy: 0.6618 | F1 Macro: 0.6086
Open the MLflow Tracking UI from outside the container at: http://127.0.0.1:5000


# 🔧 STEP: Ensemble Models on the Original v1.1 Dataset

This step goes back to the original `v1.1` dataset  
(already with the restricted **One-Hot Encoding**, appropriate imputation, and no normalization)  
to train and evaluate the tree-based and ensemble models:

- **Decision Tree** (already run earlier; kept as the reference)
- **Random Forest**
- **XGBoost**
- **LightGBM**

These algorithms do not need normalized data, because their splits depend on the ordering of feature values rather than on distances.
Each model will be logged to MLflow with its parameters and metrics.
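
The claim that tree splits depend on ordering rather than scale can be checked with a small experiment. The sketch below implements a single-feature split search by Gini impurity in plain Python (a simplification, not the scikit-learn implementation) and shows that standardizing the feature changes the threshold but not which samples go to each side. The data is made up.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, left_indices) for the lowest weighted-Gini split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        left, right = order[:k], order[k:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if best is None or score < best[0]:
            best = (score, (lo + hi) / 2, frozenset(left))
    return best[1], best[2]


# Made-up feature and labels.
income = [12_000.0, 18_000.0, 25_000.0, 60_000.0, 75_000.0, 90_000.0]
labels = ["Poor", "Poor", "Poor", "Good", "Good", "Good"]

# Standardize by hand (same idea as StandardScaler).
mean = sum(income) / len(income)
std = (sum((v - mean) ** 2 for v in income) / len(income)) ** 0.5
income_scaled = [(v - mean) / std for v in income]

thr_raw, left_raw = best_split(income, labels)
thr_scaled, left_scaled = best_split(income_scaled, labels)
print("Thresholds:", thr_raw, thr_scaled)
print("Same partition:", left_raw == left_scaled)
```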


# Diagn√≥stico das Colunas do Tipo Object

Este bloco realiza um diagn√≥stico t√©cnico preciso das colunas que ainda est√£o no tipo `object` dentro de `X_train` e `X_test`.  
A execu√ß√£o desta etapa √© obrigat√≥ria porque algoritmos baseados em √°rvore, como **Random Forest**, **XGBoost**, **LightGBM** e **HistGradientBoosting**, **n√£o aceitam vari√°veis categ√≥ricas em formato `object` ou `string`** ‚Äî eles exigem que todos os dados de entrada estejam em formato **num√©rico**.

Al√©m disso, √© importante garantir que n√£o existam valores ausentes (`NaN`) antes do treinamento, pois mesmo esses algoritmos que toleram alguns `NaN` podem apresentar comportamento inst√°vel ou inviabilizar splits corretos na √°rvore.

The procedure therefore performs three basic checks:
1. Lists all `object` columns in `X_train` to confirm which variables need to be transformed with **OrdinalEncoder**.
2. Shows the number of unique values in each column, along with sample categories, to detect unexpected cardinalities or inconsistencies.
3. Counts the `NaN` values in each `object` column, to inform the imputation strategy.

This diagnosis ensures that, in the next step, the whole imputation and encoding pipeline is built with **consistency and traceability**, in line with the project's main rule: **do not use OneHotEncoder**, which would inflate dimensionality, and do not exceed the defined column limit.
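
As a rough illustration of the kind of pipeline this diagnosis feeds into, the sketch below imputes and ordinal-encodes the `object` columns while leaving the column count unchanged. It is a minimal sketch under stated assumptions: the miniature frames `X_train_demo` and `X_test_demo`, the `most_frequent` imputation strategy, and the `-1` code for categories unseen during fitting are all illustrative choices, not settings confirmed by this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature frames standing in for X_train / X_test.
X_train_demo = pd.DataFrame({
    "Credit_Mix": pd.Series(["Good", "Bad", np.nan, "Good"], dtype=object),
    "Occupation": pd.Series(["Lawyer", "Doctor", "Lawyer", "Engineer"], dtype=object),
    "Annual_Income": [52000.0, 31000.0, 47000.0, 88000.0],
})
X_test_demo = pd.DataFrame({
    "Credit_Mix": pd.Series(["Bad", "Good"], dtype=object),
    # "Pilot" never appears in the training frame.
    "Occupation": pd.Series(["Doctor", "Pilot"], dtype=object),
    "Annual_Income": [40000.0, 61000.0],
})

# Same selection used by the diagnosis step.
object_cols = X_train_demo.select_dtypes(include="object").columns.tolist()

# Impute missing categories, then map each category to an integer code.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1)),
])

# Encoded object columns come first; the remaining columns pass through.
preprocess = ColumnTransformer(
    [("cat", categorical, object_cols)],
    remainder="passthrough",
)

# Fit on the training frame only, then reuse the fitted mapping on test.
train_encoded = preprocess.fit_transform(X_train_demo)
test_encoded = preprocess.transform(X_test_demo)

print("Train shape:", train_encoded.shape)
print("Test shape:", test_encoded.shape)
```

Each `object` column is replaced by exactly one numeric column, so the feature count stays the same, whereas one-hot encoding would add one column per category. The integer codes imply an arbitrary order, which tree-based models can usually work around with additional splits.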


In [44]:
# STEP: Diagnose object columns

# Map and inspect the object-type columns (only X_train is inspected here)

print("\nDtype summary for X_train:")
print(X_train.dtypes.value_counts())

print("\nObject-type columns in X_train:")
object_cols_train = X_train.select_dtypes(include='object').columns.tolist()
print(object_cols_train)

for col in object_cols_train:
    print(f"\nColumn: {col}")
    uniques = X_train[col].unique()    # includes NaN, if present
    nunique = X_train[col].nunique()   # excludes NaN by default
    print(f"Unique values ({nunique}): {uniques[:20]}")
    if nunique > 20:
        print(f"... ({nunique - 20} additional values not shown)")
    print(f"NaN count: {X_train[col].isna().sum()}")

print("\nFinal check:")
print(f"Total object columns in X_train: {len(object_cols_train)}")



Dtype summary for X_train:
bool       50
object     22
float64    13
int64       7
Name: count, dtype: int64

Object-type columns in X_train:
['Age_Binned', 'Amount_invested_monthly_Binned', 'Annual_Income_Binned', 'Changed_Credit_Limit_Binned', 'Credit_History_Age', 'Credit_History_Age_Binned', 'Credit_Mix', 'Credit_Utilization_Ratio_Binned', 'Delay_from_due_date_Binned', 'Interest_Rate_Binned', 'Monthly_Balance_Binned', 'Monthly_Inhand_Salary_Binned', 'Num_Bank_Accounts_Binned', 'Num_Credit_Card_Binned', 'Num_Credit_Inquiries_Binned', 'Num_of_Delayed_Payment_Binned', 'Num_of_Loan_Binned', 'Occupation', 'Outstanding_Debt_Binned', 'Payment_of_Min_Amount', 'Total_EMI_per_month_Binned', 'Type_of_Loan']

Column: Age_Binned
Unique values (4): ['Adulto' 'Jovem' 'Idoso' 'Erro']
NaN count: 0

Column: Amount_invested_monthly_Binned
Unique values (4): ['Baixo' 'Moderado' 'Alto' 'Nenhum']
NaN count: 0

Column: Annual_Income_Binned
Unique values (4): ['Baixa' 'Média' 'Alta' 'Muito_Alta'