# 🤖 TEMPLATE - MODELING NOTEBOOK

## Objective:

Train, evaluate, and compare machine learning models to solve
the defined problem, ensuring generalization, reproducibility,
and alignment with business metrics.

Criteria:
- Robust validation
- Appropriate metrics
- Clear baseline

## 📚 1. Imports and Configuration

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    cross_validate
)

from sklearn.metrics import (
    roc_auc_score,
    f1_score,
    precision_score,
    recall_score,
    classification_report
)

from src.config import TARGET_COL, ID_COL, DATA_PATH_PROCESSED, TEST_SIZE, RANDOM_STATE, N_SPLITS

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

import mlflow
import mlflow.sklearn

## 📂 2. Data Loading

In [None]:
df = pd.read_csv(DATA_PATH_PROCESSED)

X = df.drop(columns=[TARGET_COL, ID_COL], errors="ignore")
y = df[TARGET_COL]

## 🔀 3. Train / Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=TEST_SIZE,
    stratify=y,
    random_state=RANDOM_STATE
)


**📌 Why?**

- A final hold-out set, untouched during model selection, gives an honest estimate of performance
- Stratification preserves the target's class proportions in both splits
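
To make the stratification point concrete, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the 10% positive rate are made up for illustration and are not part of the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (100 positives out of 1000), illustration only.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = rng.normal(size=(1000, 3))

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# With stratify=y_demo, both splits keep roughly the original positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"overall positive rate: {y_demo.mean():.3f}")
print(f"train positive rate:   {train_rate:.3f}")
print(f"test positive rate:    {test_rate:.3f}")
```

Without `stratify`, a small or heavily imbalanced dataset can end up with a test set whose class balance differs noticeably from the training set.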

## 📉 4. Baseline Model

In [None]:
baseline_model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    random_state=RANDOM_STATE
)


In [None]:
baseline_model.fit(X_train, y_train)

y_pred = baseline_model.predict(X_test)
y_proba = baseline_model.predict_proba(X_test)[:, 1]

print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print(classification_report(y_test, y_pred))


📌 **EXPECTED INSIGHT**

Any more complex model should outperform this baseline on the same hold-out metrics.
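
As an optional sanity check beneath the baseline, one could also compare against a trivial predictor. The sketch below uses scikit-learn's `DummyClassifier`, which is an addition not used in this template, on synthetic data generated with `make_classification`. All `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (illustration only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_clusters_per_class=1, weights=[0.8, 0.2], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Trivial floor: always predicts the class prior, ignoring the features.
dummy = DummyClassifier(strategy="prior").fit(X_tr, y_tr)
# Same configuration as the notebook's baseline, minus the config constants.
logreg = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

dummy_auc = roc_auc_score(y_te, dummy.predict_proba(X_te)[:, 1])
logreg_auc = roc_auc_score(y_te, logreg.predict_proba(X_te)[:, 1])
print(f"dummy ROC-AUC:    {dummy_auc:.3f}")
print(f"baseline ROC-AUC: {logreg_auc:.3f}")
```

If the logistic regression baseline barely beats the dummy model, the features probably carry little signal, and more complex models are unlikely to help much.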

## 🔁 5. Cross-Validation

In [None]:
cv = StratifiedKFold(
    n_splits=N_SPLITS,
    shuffle=True,
    random_state=RANDOM_STATE
)

scoring = {
    "roc_auc": "roc_auc",
    "f1": "f1",
    "precision": "precision",
    "recall": "recall"
}


## 🤖 6. Candidate Models

In [None]:
models = {
    "logistic_regression": LogisticRegression(
        max_iter=1000,
        class_weight="balanced",
        random_state=RANDOM_STATE
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=300,
        random_state=RANDOM_STATE,
        class_weight="balanced"
    ),
    "xgboost": XGBClassifier(
        eval_metric="logloss",
        random_state=RANDOM_STATE
    ),
    "catboost": CatBoostClassifier(
        random_state=RANDOM_STATE,
        verbose=0
    ),
    "lightgbm": LGBMClassifier(
        random_state=RANDOM_STATE
    ),
    "knn": KNeighborsClassifier(),
    "svc": SVC(
        probability=True,
        random_state=RANDOM_STATE
    ),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(
        random_state=RANDOM_STATE,
        class_weight="balanced"
    ),
    "gradient_boosting": GradientBoostingClassifier(
        random_state=RANDOM_STATE
    ),
    "adaboost": AdaBoostClassifier(
        random_state=RANDOM_STATE
    )
}
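
KNN and SVC are sensitive to feature scale. If the processed data has not already been standardized (an assumption, since the preprocessing step is not shown in this notebook), one option is to wrap those estimators in a pipeline so the scaler is fit only on the training folds. A minimal sketch follows, in which `RANDOM_STATE = 42` stands in for the value from `src.config` and `scaled_models` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

RANDOM_STATE = 42  # placeholder for the value imported from src.config

# Each pipeline standardizes features before fitting the estimator.
scaled_models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "svc": make_pipeline(
        StandardScaler(),
        SVC(probability=True, random_state=RANDOM_STATE),
    ),
}

# Quick smoke test on synthetic data (illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
for name, pipe in scaled_models.items():
    pipe.fit(X_demo, y_demo)
    print(name, pipe.predict_proba(X_demo).shape)
```

Because the scaler lives inside the pipeline, `cross_validate` refits it on each training fold, which avoids leaking validation-fold statistics into training.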


## 📊 7. Evaluation with Cross-Validation

In [None]:
results = {}

for name, model in models.items():
    cv_results = cross_validate(
        model,
        X_train,
        y_train,
        cv=cv,
        scoring=scoring,
        n_jobs=-1
    )
    
    results[name] = {
        metric: np.mean(scores)
        for metric, scores in cv_results.items()
        if metric.startswith("test")
    }

pd.DataFrame(results).T
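
The table above reports only fold means. When models score close to each other, the spread across folds helps judge whether a difference is meaningful. The sketch below, on synthetic data with illustrative names (`cv_demo`, `summary`), records both mean and standard deviation per metric.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data, illustration only.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_out = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv_demo,
    scoring={"roc_auc": "roc_auc", "f1": "f1"},
)

# Keep only the "test_<metric>" entries and summarize each across folds.
summary = {}
for key, scores in cv_out.items():
    if key.startswith("test_"):
        metric = key[len("test_"):]
        summary[f"{metric}_mean"] = np.mean(scores)
        summary[f"{metric}_std"] = np.std(scores)

summary_df = pd.DataFrame({"logistic_regression": summary}).T
print(summary_df.round(3))
```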

## 📈 8. Final Training + Hold-out Evaluation

In [None]:
# XGBoost is the template default; swap in whichever model wins the CV comparison.
best_model = models["xgboost"]

best_model.fit(X_train, y_train)

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print(classification_report(y_test, y_pred))

## 🔥 9. Feature Importance

In [None]:
importances = best_model.feature_importances_

pd.Series(importances, index=X.columns).sort_values(ascending=False).head(15)

📌 **Insight**

- Feature importances help explain the model's decisions and validate hypotheses from the EDA.
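
`feature_importances_` exists only on tree-based estimators and reflects how the model used features during training. As a model-agnostic cross-check, scikit-learn's `permutation_importance` measures how much a held-out score drops when each feature is shuffled. This is a sketch on synthetic data, using a random forest as a stand-in for the final model. The names `X_demo` and `perm_importances` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with named columns, illustration only.
X_arr, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the mean ROC-AUC drop.
perm = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)
perm_importances = pd.Series(
    perm.importances_mean, index=X_te.columns
).sort_values(ascending=False)
print(perm_importances)
```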

## 🧪 10. MLflow Tracking

In [None]:
mlflow.set_experiment("modeling_experiment")

with mlflow.start_run():
    mlflow.log_param("model_type", "xgboost")
    mlflow.log_metric("roc_auc", roc_auc_score(y_test, y_proba))
    
    mlflow.sklearn.log_model(best_model, "model")

## 🧠 11. Decision Log

**Decisions:**
- Final model: XGBoost
- Primary metric: ROC-AUC
- Validation: Stratified K-Fold
- **Rationale:**
    - Best bias/variance trade-off
    - Better capture of non-linearities

## 🚀 12. Next Steps

**Next steps:**
- Hyperparameter tuning
- Threshold tuning
- Production monitoring
- Drift tests
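
As one possible starting point for the threshold tuning step, the sketch below picks the probability cutoff that maximizes F1 on a validation split using `precision_recall_curve`. The data is synthetic, F1 is an assumed objective (a business metric may suit the real problem better), and the threshold should be chosen on a validation set rather than on the final hold-out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, illustration only.
X_demo, y_demo = make_classification(
    n_samples=2000, weights=[0.85, 0.15], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_proba = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_proba)

# precision and recall have one more entry than thresholds; drop the final point.
denom = precision[:-1] + recall[:-1]
f1_scores = np.divide(
    2 * precision[:-1] * recall[:-1],
    denom,
    out=np.zeros_like(denom),
    where=denom > 0,
)
best_threshold = float(thresholds[int(np.argmax(f1_scores))])

# Compare F1 at the default 0.5 cutoff with F1 at the tuned cutoff.
f1_default = f1_score(y_val, (val_proba >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (val_proba >= best_threshold).astype(int))
print(f"best threshold: {best_threshold:.3f}")
print(f"F1 at 0.5: {f1_default:.3f} | F1 at tuned threshold: {f1_tuned:.3f}")
```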