# 03 - Classification Modeling

This notebook builds a supervised classification pipeline with cross-validation, hyperparameter tuning, explainability (feature importances and SHAP), and evaluation on a held-out test set (classification metrics plus ROC and precision-recall curves).

## 1) Imports and setup

In [None]:
# !pip install scikit-learn xgboost shap matplotlib seaborn numpy pandas --quiet
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score, roc_curve, auc,
                            precision_recall_curve, average_precision_score)
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import shap
import warnings; warnings.filterwarnings('ignore')
sns.set(style='whitegrid', context='talk')

## 2) Data

Replace this cell to load your own data. As an example, the cell below uses the Breast Cancer dataset bundled with scikit-learn.

In [None]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(as_frame=True)
df = data.frame.copy()
target_col = 'target'
df.head()

## 3) Train-test split

In [None]:
X = df.drop(columns=[target_col])
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape

## 4) Preprocessing

In [None]:
numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = [c for c in X.columns if c not in numeric_features]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ], remainder='drop'
)
preprocessor
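
The categorical branch above only imputes missing values, which is enough for the all-numeric example dataset. If your data has string-valued categorical columns, both models will most likely need them encoded first. The following is a minimal sketch of one option (one-hot encoding); the toy frame and the names `toy`, `toy_numeric`, `toy_categorical`, and `toy_pre` are hypothetical and not part of this notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one numeric and one string column, each with a missing value.
toy = pd.DataFrame({
    'size': [1.0, 2.0, np.nan, 4.0],
    'color': ['red', 'blue', np.nan, 'red'],
})
toy_numeric = ['size']
toy_categorical = ['color']

toy_pre = ColumnTransformer(transformers=[
    # Same numeric branch as in the notebook: median imputation, then scaling.
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), toy_numeric),
    # Categorical branch extended with a one-hot encoder after imputation.
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), toy_categorical),
], remainder='drop')

toy_out = toy_pre.fit_transform(toy)
print(toy_out.shape)  # one scaled numeric column plus one column per category
```

Note that one-hot encoding changes the number of output columns, so any feature-name list built from the input column names no longer lines up with the model's inputs.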

## 5) Base model

Choose between RandomForest and XGBoost by setting `use_xgb` in the cell below.

In [None]:
use_xgb = True  # change to False to use RandomForest

rf = RandomForestClassifier(random_state=42, n_estimators=300, max_depth=None)
xgb = XGBClassifier(
    random_state=42,
    n_estimators=500,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    eval_metric='logloss'
)

estimator = xgb if use_xgb else rf
pipe = Pipeline(steps=[('pre', preprocessor), ('clf', estimator)])
pipe

## 6) Cross-validation and hyperparameter tuning

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

if use_xgb:
    param_grid = {
        'clf__n_estimators': [300, 500],
        'clf__max_depth': [3, 4, 5],
        'clf__learning_rate': [0.03, 0.05, 0.1],
        'clf__subsample': [0.8, 1.0],
        'clf__colsample_bytree': [0.8, 1.0]
    }
else:
    param_grid = {
        'clf__n_estimators': [200, 400, 800],
        'clf__max_depth': [None, 5, 10],
        'clf__max_features': ['sqrt', 'log2', None]
    }

grid = GridSearchCV(
    pipe, param_grid=param_grid, cv=cv, n_jobs=-1, scoring='roc_auc', verbose=1
)
grid.fit(X_train, y_train)
print('Best AUC-ROC (CV):', grid.best_score_)
print('Best params:', grid.best_params_)
best_model = grid.best_estimator_
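
Before running an exhaustive grid search, it can help to estimate how many model fits it will trigger. The sketch below counts the candidates in the XGBoost grid from the cell above using scikit-learn's `ParameterGrid`; the variable names `xgb_grid`, `n_splits`, `n_candidates`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the XGBoost grid used in the tuning cell.
xgb_grid = {
    'clf__n_estimators': [300, 500],
    'clf__max_depth': [3, 4, 5],
    'clf__learning_rate': [0.03, 0.05, 0.1],
    'clf__subsample': [0.8, 1.0],
    'clf__colsample_bytree': [0.8, 1.0],
}
n_splits = 5  # matches StratifiedKFold(n_splits=5)

n_candidates = len(ParameterGrid(xgb_grid))  # 2 * 3 * 3 * 2 * 2 combinations
n_fits = n_candidates * n_splits             # one fit per candidate per fold
print(n_candidates, n_fits)  # 72 360
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training set. If the count is too high for your hardware, a smaller grid or a randomized search are common alternatives.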

## 7) Test evaluation

In [None]:
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:,1]
print(classification_report(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_proba))
print('Average Precision (PR AUC):', average_precision_score(y_test, y_proba))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc='lower right')
plt.show()

prec, rec, thr = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)
plt.figure()
plt.plot(rec, prec, label=f'AP = {ap:.3f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.legend()
plt.show()
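
The `predict` call above uses the default 0.5 probability cutoff. The arrays returned by `precision_recall_curve` can also be used to pick a different decision threshold, for example the one that maximizes F1. This is a minimal sketch on hypothetical validation labels and scores (`y_val`, `p_val`); in practice the threshold should be chosen on validation data, not on the test set, to keep the test estimate unbiased.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation labels and predicted probabilities (not from the notebook).
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_val = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90])

val_prec, val_rec, val_thr = precision_recall_curve(y_val, p_val)

# precision/recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so the arrays align with val_thr.
p, r = val_prec[:-1], val_rec[:-1]
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)

best_thr = val_thr[np.argmax(f1)]
y_hat = (p_val >= best_thr).astype(int)  # predictions at the chosen threshold
print(best_thr, y_hat)
```

F1 is only one possible criterion; a cost-based rule or a minimum-recall constraint may fit your objective better.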

## 8) Explainability (feature importance, SHAP)

In [None]:
# Feature importances (tree-based models only)
final_clf = best_model.named_steps['clf']
# These names match the transformed columns only while no transformer adds or drops columns
feature_names = numeric_features + categorical_features
if hasattr(final_clf, 'feature_importances_'):
    importances = final_clf.feature_importances_
    imp = pd.Series(importances, index=feature_names).sort_values(ascending=False).head(20)
    plt.figure(figsize=(8,5))
    sns.barplot(x=imp.values, y=imp.index)
    plt.title('Top feature importances')
    plt.show()

# SHAP (can be expensive)
try:
    explainer = shap.TreeExplainer(final_clf)
    # Transform train to model input space
    X_train_trans = best_model.named_steps['pre'].transform(X_train)
    shap_values = explainer.shap_values(X_train_trans)
    shap.summary_plot(shap_values, X_train_trans, feature_names=feature_names, show=False)
    plt.tight_layout()
    plt.show()
except Exception as e:
    print('SHAP skipped:', e)

## 9) Results export

In [None]:
# Probabilities and predictions
preds = pd.DataFrame({
    'y_true': y_test.values,
    'y_proba': y_proba,
    'y_pred': y_pred
})
preds.to_csv('predicoes_test.csv', index=False)

# Best hyperparameter set
pd.Series(grid.best_params_).to_json('best_params.json')

# Importances (if available); `imp` exists only when the model exposes feature_importances_
if 'imp' in locals():
    imp.to_csv('feature_importances_top20.csv')

print('Exported files: predicoes_test.csv, best_params.json, feature_importances_top20.csv (if applicable)')

## 10) Final notes
- Adapt the preprocessing to your variables; string-valued categorical columns need an encoder (for example `OneHotEncoder`) before either model can use them.
- For imbalanced data, consider class weights (`class_weight` for RandomForest, `scale_pos_weight` for XGBoost), decision-threshold tuning, or resampling.
- Change the scoring to match your objective (f1, average_precision, etc.).
- Because the preprocessing lives inside the Pipeline passed to GridSearchCV, the imputer and scaler are refit on each training fold, which prevents preprocessing leakage into the validation folds.
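
The notes on imbalanced data and scoring can be sketched as follows. This example compares a RandomForest with and without `class_weight='balanced'` under `average_precision` scoring; the synthetic dataset from `make_classification` and the names `X_imb`, `y_imb`, and `results` are assumptions for illustration only, and the outcome will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced problem: roughly 90% negatives, 10% positives.
X_imb, y_imb = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)
cv_imb = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for cw in (None, 'balanced'):
    clf = RandomForestClassifier(n_estimators=50, class_weight=cw, random_state=0)
    # average_precision focuses on the positive class, unlike accuracy.
    scores = cross_val_score(clf, X_imb, y_imb, cv=cv_imb, scoring='average_precision')
    results[str(cw)] = scores.mean()

for name, score in results.items():
    print(f'class_weight={name}: mean AP = {score:.3f}')
```

Class weighting does not always improve ranking metrics such as average precision; it tends to matter more for threshold-dependent metrics such as recall or F1, so it is worth comparing under the metric you actually care about.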