<VSCode.Cell id="#VSC-23eb6efe" language="markdown">
# 02 - Baseline Models
## Developing Baseline Models for Fraud Detection

---

## Objective

Establish robust performance baselines using three gradient boosting algorithms, to serve as a reference for future optimization.

## Analysis Dimensions

| LightGBM | XGBoost | CatBoost | Comparison |
|----------|---------|----------|------------|
| Baseline | Baseline | Baseline | Performance |

## Research Questions

1. Which algorithm delivers the best baseline performance?
2. How do the models behave on balanced data?
3. What is the initial gap between train and test performance?
4. Which default hyperparameters look most promising?

---

## Phase 1: Imports and Configuration

**Objective:** Load all required dependencies and the prepared data from Notebook 01.

### Required Imports:
- **Standard libraries**: pandas, numpy, sklearn, etc.
- **ML libraries**: lightgbm, xgboost, catboost
- **Utils modules**: sampling, tuning, modeling, data
- **Prepared data**: X_train_balanced, y_train_balanced, X_test_featured, y_test
- **Configuration**: config.yaml and tuning_config
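
The configuration cell in this phase indexes `CONFIG` by nested keys such as `paths.data` and `modeling.primary_metric`. A minimal sketch of a pre-flight check follows. The helper name `missing_config_keys` and the `REQUIRED_KEYS` list are hypothetical, inferred only from the keys this notebook reads, and are not part of the project's `utils` package.

```python
from typing import Any, Iterable

# Keys this notebook reads from config.yaml (assumed list, inferred from the cells)
REQUIRED_KEYS = [
    ("paths", "data"),
    ("paths", "artifacts"),
    ("modeling", "primary_metric"),
    ("modeling", "cv_folds"),
    ("random_state",),
]

def missing_config_keys(config: dict, required: Iterable[tuple] = REQUIRED_KEYS) -> list:
    """Return the dotted names of required keys that are absent from `config`."""
    missing = []
    for path in required:
        node: Any = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing

# Example with an in-memory config instead of a real config.yaml
example = {"paths": {"data": "data"}, "modeling": {"cv_folds": 5}, "random_state": 42}
print(missing_config_keys(example))  # ['paths.artifacts', 'modeling.primary_metric']
```

Running a check like this right after `yaml.safe_load` would surface a missing key as a readable list instead of a `KeyError` halfway through the path setup.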

In [1]:
# Imports: Standard Libraries

import sys
import os
import pickle
import json
import warnings
from pathlib import Path
from datetime import datetime

# Data Science
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.metrics import (
    classification_report, confusion_matrix,
    precision_recall_curve, roc_curve, auc
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Progress tracking
from tqdm import tqdm

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

print("Standard libraries imported successfully")

Standard libraries imported successfully


In [2]:
# Configuration & Paths

import yaml

# Load config from YAML (single source of truth)
with open('../../config.yaml', 'r') as f:
    CONFIG = yaml.safe_load(f)

# Notebook-specific overrides
CONFIG['notebook_mode'] = True
CONFIG['dev_mode'] = False

# Paths - use absolute paths from notebook directory
notebook_dir = Path('.').resolve().parent
DATA_DIR = notebook_dir.parent / CONFIG['paths']['data']
ARTIFACTS_DIR = notebook_dir.parent / CONFIG['paths']['artifacts']
MODELS_DIR = ARTIFACTS_DIR / 'models'
ARTIFACTS_DIR.mkdir(exist_ok=True, parents=True)
MODELS_DIR.mkdir(exist_ok=True, parents=True)

print(f"Configuration loaded")
print(f"Primary metric: {CONFIG['modeling']['primary_metric'].upper()}")
print(f"CV folds: {CONFIG['modeling']['cv_folds']}")
print(f"Random state: {CONFIG['random_state']}")
print(f"Data directory: {DATA_DIR}")
print(f"Artifacts directory: {ARTIFACTS_DIR}")
print(f"Models directory: {MODELS_DIR}")

Configuration loaded
Primary metric: PR_AUC
CV folds: 5
Random state: 42
Data directory: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\data
Artifacts directory: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts
Models directory: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\models


In [3]:
# Load Processed Data from Notebook 01

# Load training data (balanced)
X_train_balanced = pd.read_parquet(ARTIFACTS_DIR / 'X_train_processed.parquet')
y_train_balanced = pd.read_parquet(ARTIFACTS_DIR / 'y_train_processed.parquet')['target']

# Load test data (featured)
X_test_featured = pd.read_parquet(ARTIFACTS_DIR / 'X_test_processed.parquet')
y_test = pd.read_parquet(ARTIFACTS_DIR / 'y_test_processed.parquet')['target']

# Load aggregation statistics for consistent transformation
with open(ARTIFACTS_DIR / 'agg_stats.pkl', 'rb') as f:
    agg_stats = pickle.load(f)

# Load metadata for verification
with open(ARTIFACTS_DIR / 'data_prep_metadata.json', 'r') as f:
    data_metadata = json.load(f)

print("Training data loaded:")
print(f"  X_train_balanced: {X_train_balanced.shape}")
print(f"  y_train_balanced: {len(y_train_balanced)} samples ({y_train_balanced.mean():.2%} fraud)")
print("Test data loaded:")
print(f"  X_test_featured: {X_test_featured.shape}")
print(f"  y_test: {len(y_test)} samples ({y_test.mean():.2%} fraud)")
print("Aggregation stats loaded")
print("Metadata loaded")
print(f"Data prepared on: {data_metadata['timestamp'][:19]}")

Training data loaded:
  X_train_balanced: (6855, 32)
  y_train_balanced: 6855 samples (34.28% fraud)
Test data loaded:
  X_test_featured: (2938, 32)
  y_test: 2938 samples (41.35% fraud)
Aggregation stats loaded
Metadata loaded
Data prepared on: 2025-10-06T18:23:30


In [4]:
# Import Utils Modules & ML Libraries

# Add utils to path
sys.path.append(str(Path('..').resolve()))
sys.path.append(str(Path('../..').resolve()))

# Import utils modules
from utils.sampling import train_baseline_models
from utils.tuning import run_staged_tuning, print_progress
from utils.modeling import FraudMetrics, cross_validate_with_metrics, get_cv_strategy
from utils.data import save_artifact, load_artifact, check_artifact_exists

# Import ML libraries
try:
    import lightgbm as lgb
    from lightgbm import LGBMClassifier
    print("LightGBM imported")
except ImportError:
    print("LightGBM not available")

try:
    import xgboost as xgb
    from xgboost import XGBClassifier
    print("XGBoost imported")
except ImportError:
    print("XGBoost not available")

try:
    import catboost
    from catboost import CatBoostClassifier
    print("CatBoost imported")
except ImportError:
    print("CatBoost not available - will install if needed")

# Setup logging
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("Utils modules and ML libraries imported successfully")

SUCCESS: Feature Engineering module loaded successfully!




[OK] Visualization utilities loaded!
LightGBM imported
XGBoost imported
CatBoost imported
Utils modules and ML libraries imported successfully


In [5]:
# Tuning Configuration

tuning_config = {
    'n_trials_coarse': 10,
    'n_trials_fine': 20,
    'cv_folds': 3,
    'min_improvement_for_fine': 0.003,
    'random_state': CONFIG['random_state']
}

print(f"Tuning configuration set:")
print(f"  Coarse trials: {tuning_config['n_trials_coarse']}")
print(f"  Fine trials: {tuning_config['n_trials_fine']}")
print(f"  CV folds: {tuning_config['cv_folds']}")
print(f"  Min improvement: {tuning_config['min_improvement_for_fine']}")

print("\n" + "=" * 60)
print("NOTEBOOK 02 SETUP COMPLETE")
print("=" * 60)
print("Ready to train baseline models!")
print("=" * 60)

Tuning configuration set:
  Coarse trials: 10
  Fine trials: 20
  CV folds: 3
  Min improvement: 0.003

NOTEBOOK 02 SETUP COMPLETE
Ready to train baseline models!


## Phase 5: Hyperparameter Tuning via Optuna

### Optimization Strategy

| Parameter | Configuration |
|-----------|---------------|
| Algorithm | TPE (Tree-structured Parzen Estimator) |
| Trials | Stage 1: 10 trials (quick)<br>Stage 2: 20 trials (full) |
| Objective | Maximize PR-AUC |
| Pruning | MedianPruner (early stopping) |
| Validation | 3-Fold Stratified CV |
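
The objective in the table is PR-AUC. As a rough illustration of what that number measures, the sketch below computes average precision, one common step-wise estimator of the area under the precision-recall curve, in plain Python. It assumes binary labels and untied scores. The function name `average_precision` is hypothetical, and the project's `FraudMetrics` may compute PR-AUC differently (for example by trapezoidal integration).

```python
def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("y_true contains no positive samples")
    # Rank samples from highest to lowest score
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Each positive raises recall by 1/n_pos; weight the precision at that rank
            ap += (true_pos / rank) / n_pos
    return ap

labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(labels, scores), 4))  # 0.8333
```

Unlike accuracy, this score depends only on how the positives are ranked, which is why it is a common choice when the fraud class is the one that matters.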

### Tuned Models

**1. LightGBM** (Primary)
```python
search_space = {
    'learning_rate': (0.01, 0.3),      # log scale
    'max_depth': (3, 12),              # int
    'num_leaves': (20, 150),           # int
    'min_child_samples': (5, 100),     # int
    'feature_fraction': (0.5, 1.0),    # float
    'bagging_fraction': (0.5, 1.0),    # float
    'lambda_l1': (0.0, 10.0),          # L1 reg
    'lambda_l2': (0.0, 10.0),          # L2 reg
}
```
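
The inline annotations (log scale, int, float) imply a different sampling rule per parameter. One way to make those rules explicit is sketched below with the standard library only. The `SPACE` layout and the `sample_params` helper are hypothetical and are not how `utils.tuning` or Optuna represent the search space. Because log-uniform sampling needs a strictly positive lower bound, the regularization ranges that start at 0.0 are sampled uniformly here.

```python
import math
import random

# (low, high, kind) where kind is 'log', 'int', or 'float' (assumed encoding)
SPACE = {
    "learning_rate": (0.01, 0.3, "log"),
    "max_depth": (3, 12, "int"),
    "num_leaves": (20, 150, "int"),
    "min_child_samples": (5, 100, "int"),
    "feature_fraction": (0.5, 1.0, "float"),
    "bagging_fraction": (0.5, 1.0, "float"),
    "lambda_l1": (0.0, 10.0, "float"),
    "lambda_l2": (0.0, 10.0, "float"),
}

def sample_params(space, rng):
    """Draw one random configuration from `space`."""
    params = {}
    for name, (low, high, kind) in space.items():
        if kind == "log":
            # Uniform in log space, so 0.01-0.03 is as likely as 0.1-0.3
            params[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        elif kind == "int":
            params[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            params[name] = rng.uniform(low, high)
    return params

rng = random.Random(42)
trial_params = sample_params(SPACE, rng)
print(trial_params)
```

This is plain random search; TPE replaces the independent draws with proposals informed by earlier trials, but the per-parameter bounds and scales play the same role.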

**2. XGBoost** (Baseline)
```python
search_space = {
    'learning_rate': (0.01, 0.3),
    'max_depth': (3, 12),
    'min_child_weight': (1, 10),
    'subsample': (0.5, 1.0),
    'colsample_bytree': (0.5, 1.0),
    'gamma': (0.0, 5.0),
    'reg_alpha': (0.0, 10.0),
    'reg_lambda': (0.0, 10.0)
}
```

**3. CatBoost** (NEW - Categorical Optimized)
```python
search_space = {
    'learning_rate': (0.01, 0.3),
    'depth': (4, 10),
    'l2_leaf_reg': (1, 10),
    'random_strength': (0, 10),
    'bagging_temperature': (0, 1),
    'border_count': (32, 255)
}
```

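**CatBoost Advantages:**
- Native categorical feature handling (no manual encoding needed when `cat_features` is specified)
- Built-in overfitting controls (ordered boosting and an overfitting detector)
- Often performs well without extensive parameter tuning
- Well suited to AML detection data that mixes numeric and categorical features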

### Gating Mechanism

```
Stage 1 (10 trials)
├─ IF best_score < 0.75:
│  └─ ABORT (model not viable)
└─ ELSE:
   └─ Continue to Stage 2 (20 trials)
```

Aborting after Stage 1 skips 20 of the 30 planned trials, saving roughly 60% of the tuning time on unpromising experiments.
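
The gate above can be written as a small control-flow sketch. Everything here is hypothetical: `staged_search`, its arguments, and the toy objectives stand in for the project's `run_staged_tuning`, whose actual gating rule is not shown in this notebook. The 0.75 threshold and the 10/20 trial split come from the diagram.

```python
import random

def staged_search(objective, n_coarse=10, n_fine=20, min_score=0.75, seed=42):
    """Run a coarse stage, then a fine stage only if the coarse best clears `min_score`."""
    rng = random.Random(seed)
    best_score = max(objective(rng) for _ in range(n_coarse))
    if best_score < min_score:
        # Gate closed: skip Stage 2 entirely
        return {"best_score": best_score, "n_trials": n_coarse, "aborted": True}
    for _ in range(n_fine):
        best_score = max(best_score, objective(rng))
    return {"best_score": best_score, "n_trials": n_coarse + n_fine, "aborted": False}

# Toy objectives standing in for a cross-validated PR-AUC
weak = staged_search(lambda rng: rng.uniform(0.50, 0.70))
strong = staged_search(lambda rng: rng.uniform(0.80, 0.90))
print(weak["n_trials"], strong["n_trials"])  # 10 30
```

Returning `n_trials` alongside the score makes it visible afterwards which models passed the gate and which stopped after the coarse stage.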


In [11]:
# NOTEBOOK: 02_baseline_models.ipynb
# Force reload of tuning module (fixes early_stopping issue)
import importlib
import utils.tuning
importlib.reload(utils.tuning)
from utils.tuning import run_staged_tuning, print_progress

print("[OK] Tuning module reloaded (with progress callback)")

[OK] Tuning module reloaded (with progress callback)


In [12]:
# NOTEBOOK: 02_baseline_models.ipynb
# ============================================================================
# CATBOOST INSTALLATION & VERIFICATION
# ============================================================================

print("=" * 60)
print("CATBOOST SETUP")
print("=" * 60)

try:
    from catboost import CatBoostClassifier
    import catboost
    print(f"[OK] CatBoost already installed (v{catboost.__version__})")
    CATBOOST_AVAILABLE = True
except ImportError:
    print("[INFO] Installing CatBoost...")
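    import subprocess
    try:
        # Call pip through the kernel's interpreter so the package is
        # installed into the environment this notebook is running in
        subprocess.check_call(
            [sys.executable, '-m', 'pip', 'install', 'catboost'],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL
        )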
        from catboost import CatBoostClassifier
        import catboost
        print(f"[OK] CatBoost installed successfully (v{catboost.__version__})")
        CATBOOST_AVAILABLE = True
    except Exception as e:
        print(f"[ERROR] Failed to install CatBoost: {e}")
        print("[WARNING] CatBoost will be skipped in tuning")
        CATBOOST_AVAILABLE = False

if CATBOOST_AVAILABLE:
    # Test CatBoost with sample data
    print("\n[TEST] Training quick CatBoost baseline...")
    cat_test = CatBoostClassifier(
        iterations=10,
        verbose=False,
        random_state=CONFIG['random_state']
    )
    # Test CatBoost with sample data (use random sample to avoid class imbalance issues)
    sample_indices = np.random.RandomState(CONFIG['random_state']).choice(
        len(X_train_balanced), 1000, replace=False
    )
    cat_test.fit(
        X_train_balanced.iloc[sample_indices], 
        y_train_balanced.iloc[sample_indices]
    )
    test_score = cat_test.score(
        X_test_featured.iloc[:200], 
        y_test.iloc[:200]
    )
    print(f"[OK] CatBoost test successful (accuracy: {test_score:.3f})")
    print(f"[OK] Ready for hyperparameter tuning")

print("=" * 60)


CATBOOST SETUP
[OK] CatBoost already installed (v1.2.8)

[TEST] Training quick CatBoost baseline...
[OK] CatBoost test successful (accuracy: 0.785)
[OK] Ready for hyperparameter tuning


In [13]:
# NOTEBOOK: 02_baseline_models.ipynb
# Check cache
tuning_cache = ARTIFACTS_DIR / 'tuning_results.json'

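# Drop the cache only when a re-run is explicitly requested; deleting it
# unconditionally would make the cache check below unreachable
if tuning_cache.exists() and CONFIG.get('force_retune', False):
    print("[WARNING] Deleting previous tuning cache (force_retune=True)...")
    tuning_cache.unlink()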

if check_artifact_exists(tuning_cache) and not CONFIG.get('force_retune', False):
    print("[OK] Loading cached tuning results...")
    tuning_results = load_artifact(tuning_cache)
else:
    # Reload tuning module to get latest changes
    import importlib
    import utils.tuning
    importlib.reload(utils.tuning)
    from utils.tuning import run_staged_tuning, print_progress, _progress_states
    
    # Clear previous progress state (for clean display)
    _progress_states.clear()
    
    print("Running staged hyperparameter tuning...")
    print(f"Strategy: {tuning_config['n_trials_coarse']}+{tuning_config['n_trials_fine']} trials, {tuning_config['cv_folds']}-fold CV")
    print("=" * 60)
    
    tuning_results = {}
    
    # LightGBM (primary)
    lgbm_results = run_staged_tuning(
        model_name='LightGBM',
        X_train=X_train_balanced,
        y_train=y_train_balanced,
        config=tuning_config,
        metric='pr_auc',
        progress_callback=print_progress 
    )
    tuning_results['LightGBM'] = lgbm_results
    
    # XGBoost (baseline)
    xgb_results = run_staged_tuning(
        model_name='XGBoost',
        X_train=X_train_balanced,
        y_train=y_train_balanced,
        config=tuning_config,
        metric='pr_auc',
        progress_callback=print_progress
    )
    tuning_results['XGBoost'] = xgb_results
    
    # CatBoost (NEW - optimized for categorical features)
    catboost_results = run_staged_tuning(
        model_name='CatBoost',
        X_train=X_train_balanced,
        y_train=y_train_balanced,
        config=tuning_config,
        metric='pr_auc',
        progress_callback=print_progress
    )
    tuning_results['CatBoost'] = catboost_results
    
    # Save
    save_artifact(tuning_results, tuning_cache)

# Display results
print("\n" + "="*60)
print("TUNING RESULTS SUMMARY")
print("="*60)
for model, results in tuning_results.items():
    print(f"{model:>12}: {results['best_score']:.4f} ({results['n_trials']} trials)")


INFO:utils.tuning:  Fine best: 0.9007
INFO:utils.tuning:✅ Tuning complete: CatBoost → 0.9007
INFO:utils.data:✓ Saved json to C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\tuning_results.json



TUNING RESULTS SUMMARY
    LightGBM: 0.8971 (30 trials)
     XGBoost: 0.8959 (10 trials)
    CatBoost: 0.9007 (30 trials)


In [14]:
# NOTEBOOK: 02_baseline_models.ipynb
# Make sure `tuning_results` is available (load it from disk if needed)
try:
    tuning_results
    print("`tuning_results` is already in memory; nothing to do.")
except NameError:
    candidates = [ARTIFACTS_DIR / 'tuning_results.json', ARTIFACTS_DIR / 'advanced_tuning_results.json']
    loaded = False
    for p in candidates:
        if p.exists():
            tuning_results = load_artifact(p)
            print(f"Loaded tuning_results from: {p.name}")
            loaded = True
            break
    if not loaded:
        raise FileNotFoundError('No tuning file found in ARTIFACTS_DIR (tuning_results.json or advanced_tuning_results.json)')

# Short summary to confirm availability
print('\nModels in tuning_results:', list(tuning_results.keys()))
for m, info in tuning_results.items():
    bs = info.get('best_score') if isinstance(info, dict) else None
    print(f" - {m}: best_score={bs}")


`tuning_results` is already in memory; nothing to do.

Models in tuning_results: ['LightGBM', 'XGBoost', 'CatBoost']
 - LightGBM: best_score=0.8970627857818368
 - XGBoost: best_score=0.8958629387415359
 - CatBoost: best_score=0.9007345335336209


In [15]:
# NOTEBOOK: 02_baseline_models.ipynb
# ============================================================================
# LOAD TRAINED MODELS FROM DISK (NO CACHE)
# ============================================================================
print("Loading trained models saved in 'artifacts/models/'...")

# Paths of the saved models
model_paths = {
    'LightGBM': MODELS_DIR / 'lightgbm_tuned.pkl',
    'XGBoost': MODELS_DIR / 'xgboost_tuned.pkl',
    'CatBoost': MODELS_DIR / 'catboost_tuned.pkl',
    'Best': MODELS_DIR / 'best_model_tuned.pkl',
    'Final': MODELS_DIR / 'final_model_tuned.pkl'
}

# Dictionary holding the loaded models
loaded_models = {}

# Load each model if it exists
for model_name, model_path in model_paths.items():
    if check_artifact_exists(model_path):
        loaded_models[model_name] = load_artifact(model_path)
        print(f"[OK] Model {model_name} loaded from: {model_path}")
    else:
        print(f"[INFO] Model {model_name} not found at: {model_path}")

# Check whether any model could be loaded
if loaded_models:
    print(f"\n[OK] {len(loaded_models)} models loaded successfully!")
    print("Available models:", list(loaded_models.keys()))

    # Pick the best model (use 'Best' if available, otherwise 'Final')
    if 'Best' in loaded_models:
        best_model = loaded_models['Best']
        best_model_name = 'Best (from disk)'
    elif 'Final' in loaded_models:
        best_model = loaded_models['Final']
        best_model_name = 'Final (from disk)'
    else:
        print("[WARNING] No 'Best' or 'Final' model found")

else:
    print("[ERROR] No model could be loaded from disk")
    print("Run the training cells first to save the models")

print("\n[OK] Model loading complete!")

INFO:utils.data:✓ Loaded pickle from C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\models\best_model_tuned.pkl
INFO:utils.data:✓ Loaded pickle from C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\models\final_model_tuned.pkl


Loading trained models saved in 'artifacts/models/'...
[INFO] Model LightGBM not found at: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\models\lightgbm_tuned.pkl
[INFO] Model XGBoost not found at: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\models\xgboost_tuned.pkl
[INFO] Model CatBoost not found at: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\models\catboost_tuned.pkl
[OK] Model Best loaded from: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\models\best_model_tuned.pkl
[OK] Model Final loaded from: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\models\final_model_tuned.pkl

[OK] 2 models loaded successfully!
Available models: ['Best', 'Final']

[OK] Model loading complete!


In [16]:
# NOTEBOOK: 02_baseline_models.ipynb
# ============================================================================
# MODEL CACHE CHECK - Avoid retraining if models exist
# ============================================================================

# Check if final models are already trained and saved
best_model_path = MODELS_DIR / 'best_model_tuned.pkl'
cv_results_path = ARTIFACTS_DIR / 'cv_results.json'

if check_artifact_exists(best_model_path) and check_artifact_exists(cv_results_path) and not CONFIG.get('force_retrain', False):
    print("[OK] Loading pre-trained final models from cache...")
    
    # Load best model
    best_model = load_artifact(best_model_path)
    cv_results = load_artifact(cv_results_path)
    
    # Extract best model name from cv_results
    best_model_name = max(
        cv_results.keys(),
        key=lambda m: cv_results[m]['mean_metrics']['pr_auc_mean']
    )
    
    # Load all final models
    final_models = {}
    for model_name in cv_results.keys():
        model_path = MODELS_DIR / f'{model_name.lower()}_tuned.pkl'
        if check_artifact_exists(model_path):
            final_models[model_name] = load_artifact(model_path)
        else:
            print(f"[WARNING] {model_name} model not found, will need to retrain")
    
    print(f"[OK] Loaded best model: {best_model_name}")
    print(f"[OK] CV results loaded for {len(cv_results)} models")
    
    # Skip the training section below
    skip_training = True
    
else:
    print("[INFO] No cached models found or force_retrain=True, will train new models...")
    skip_training = False

INFO:utils.data:✓ Loaded pickle from C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\models\best_model_tuned.pkl


[OK] Loading pre-trained final models from cache...


INFO:utils.data:✓ Loaded json from C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts\cv_results.json


[OK] Loaded best model: CatBoost
[OK] CV results loaded for 3 models


In [17]:
# NOTEBOOK: 02_baseline_models.ipynb
# Reuse tuning_results without having to re-run the tuning cell

if not skip_training:
    print("Training final models with optimized parameters...\n")
    
    cv_strategy = get_cv_strategy('stratified', n_splits=3, random_state=CONFIG['random_state'])
    
    final_models = {}
    cv_results = {}
    
    for model_name, tuning_result in tuning_results.items():
        print(f"Evaluating {model_name}...")
        
        # Create model with best params
        best_params = tuning_result['best_params']
        
        if model_name == 'LightGBM':
            model = LGBMClassifier(
                **best_params,
                n_estimators=1000,
                class_weight='balanced',
                random_state=CONFIG['random_state'],
                verbosity=-1
            )
        elif model_name == 'XGBoost':
            model = XGBClassifier(
                **best_params,
                n_estimators=1000,
                random_state=CONFIG['random_state'],
                verbosity=0
            )
        elif model_name == 'CatBoost':
            from catboost import CatBoostClassifier
            model = CatBoostClassifier(
                **best_params,
                iterations=1000,
                random_state=CONFIG['random_state'],
                verbose=False,
                early_stopping_rounds=50
            )
        
        # Cross-validation
        cv_result = cross_validate_with_metrics(
            model, X_train_balanced, y_train_balanced,
            cv_strategy=cv_strategy,
            metric_fn=FraudMetrics.compute_all,
            verbose=False
        )
        
        cv_results[model_name] = cv_result
        
        # Train on full training set
        model.fit(X_train_balanced, y_train_balanced)
        final_models[model_name] = model
        
        # Save individual model
        save_artifact(model, MODELS_DIR / f'{model_name.lower()}_tuned.pkl', artifact_type='pickle')
        
        # Print summary
        mean_metrics = cv_result['mean_metrics']
        print(f"  PR-AUC: {mean_metrics['pr_auc_mean']:.4f} ± {mean_metrics['pr_auc_std']:.4f}")
        print(f"  Recall: {mean_metrics['recall_fraud_mean']:.4f} ± {mean_metrics['recall_fraud_std']:.4f}")
        print(f"  F1:     {mean_metrics['f1_fraud_mean']:.4f} ± {mean_metrics['f1_fraud_std']:.4f}")
        print()
    
    # Save cv_results
    save_artifact(cv_results, cv_results_path, artifact_type='json')
    
    # Select best model
    best_model_name = max(
        cv_results.keys(),
        key=lambda m: cv_results[m]['mean_metrics']['pr_auc_mean']
    )
    best_model = final_models[best_model_name]
    
    # Save best model
    save_artifact(best_model, best_model_path, artifact_type='pickle')
    
    print(f"[OK] Best model: {best_model_name} (saved to {best_model_path})")


## Phase 6: Saving Data for Downstream Notebooks

**Objective:** Save the test data in CSV format for compatibility with later notebooks in the pipeline.

### Pipeline Compatibility:
- **Notebook 09 (Threshold Optimization)**: Expects `X_test_engineered.csv` and `y_test_engineered.csv`
- **Other notebooks**: Can use either CSV or Parquet

### Saved files:
- **X_test_engineered.csv**: Test features (CSV format)
- **y_test_engineered.csv**: Test target (CSV format)
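
Because the features and the target are written to separate CSV files, a downstream notebook may want to confirm that the two files still line up row for row. A minimal sketch using only the standard library follows. `csv_row_count` and `check_aligned` are hypothetical helpers, and the demonstration writes tiny stand-in files to a temporary directory rather than touching the real artifacts.

```python
import csv
import tempfile
from pathlib import Path

def csv_row_count(path):
    """Count data rows in a CSV file, excluding the header line."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip header
        return sum(1 for _ in reader)

def check_aligned(features_path, target_path):
    """Return the shared row count, or raise if the two files disagree."""
    n_x, n_y = csv_row_count(features_path), csv_row_count(target_path)
    if n_x != n_y:
        raise ValueError(f"Row mismatch: {n_x} feature rows vs {n_y} target rows")
    return n_x

# Demonstration on stand-in files (not the real artifacts)
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "X_test_engineered.csv").write_text("f1,f2\n0.1,3\n0.4,7\n")
(tmp_dir / "y_test_engineered.csv").write_text("target\n0\n1\n")
n_rows = check_aligned(tmp_dir / "X_test_engineered.csv", tmp_dir / "y_test_engineered.csv")
print(n_rows)  # 2
```

A row-count check is deliberately cheap; it catches a truncated or stale file but does not prove that rows are in the same order.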

In [18]:
# ============================================================================
# SAVE TEST DATA IN CSV FORMAT FOR NEXT NOTEBOOKS
# ============================================================================

print("=" * 60)
print("SAVING TEST DATA FOR NEXT NOTEBOOKS")
print("=" * 60)

# Save test data in CSV format (expected by notebook 09)
X_test_featured.to_csv(ARTIFACTS_DIR / 'X_test_engineered.csv', index=False)
y_test.to_frame('target').to_csv(ARTIFACTS_DIR / 'y_test_engineered.csv', index=False)

print("Test data saved in CSV format:")
print(f"  - X_test_engineered.csv: {X_test_featured.shape}")
print(f"  - y_test_engineered.csv: {len(y_test)} samples")
print(f"Files saved to: {ARTIFACTS_DIR}")

# Verify files exist
csv_files = ['X_test_engineered.csv', 'y_test_engineered.csv']
for csv_file in csv_files:
    file_path = ARTIFACTS_DIR / csv_file
    if file_path.exists():
        size_mb = file_path.stat().st_size / (1024 * 1024)
        print(f"{csv_file}: {size_mb:.2f} MB")
    else:
        print(f"{csv_file}: NOT FOUND")

print("\n" + "=" * 60)
print("NOTEBOOK 02 COMPLETE - READY FOR NEXT NOTEBOOKS")
print("=" * 60)
print("Next notebooks can now load:")
print("• Models: best_model_tuned.pkl, lightgbm_tuned.pkl, etc.")
print("• Test data: X_test_engineered.csv, y_test_engineered.csv")
print("• Results: cv_results.json, tuning_results.json")
print("=" * 60)

SAVING TEST DATA FOR NEXT NOTEBOOKS
Test data saved in CSV format:
  - X_test_engineered.csv: (2938, 32)
  - y_test_engineered.csv: 2938 samples
Files saved to: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts
X_test_engineered.csv: 0.65 MB
y_test_engineered.csv: 0.01 MB

NOTEBOOK 02 COMPLETE - READY FOR NEXT NOTEBOOKS
Next notebooks can now load:
• Models: best_model_tuned.pkl, lightgbm_tuned.pkl, etc.
• Test data: X_test_engineered.csv, y_test_engineered.csv
• Results: cv_results.json, tuning_results.json
