# 01 · Data Preparation & Anti-Overfitting Corrections
## Sistema Completo de Preparação de Dados com Validações Anti-Overfitting

<div align="center">

```
┌─────────────────────────────────────────────────────────────┐
│   COMPLETE DATA PREPARATION - ANTI-OVERFITTING SYSTEM      │
└─────────────────────────────────────────────────────────────┘
```

![Status](https://img.shields.io/badge/Status-Production_Ready-green)
![Priority](https://img.shields.io/badge/Priority-CRITICAL-red)
![Type](https://img.shields.io/badge/Type-Data_Preparation-success)

</div>

---

### OBJETIVO GERAL

Este notebook implementa o **sistema completo de preparação de dados** com validações anti-overfitting críticas, combinando:

1. **Preparação Básica de Dados**: Carregamento, limpeza, encoding e balanceamento
2. **Engenharia de Features**: Criação de features avançadas sem vazamento
3. **Validações Anti-Overfitting**: Detecção e correção de data leakage
4. **Validação Temporal**: TimeSeriesSplit e features temporais seguras
5. **Correções Críticas**: Remoção de features suspeitas e regularização

### PROBLEMA RESOLVIDO

<div style="background-color: #2d1a1a; border-left: 4px solid #ef4444; padding: 15px; border-radius: 4px;">

**OVERFITTING CRÍTICO IDENTIFICADO:**
```
Performance Gap Analysis (ANTES das correções)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Training PR-AUC  : 0.999  [████████████████████] Perfect
Test PR-AUC      : 0.007  [█                   ] Random
Gap              : 153x degradation (55% médio)
Adversarial AUC  : 0.89  (Leakage severo detectado)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

**CAUSAS:**
- ✅ Data leakage em features de agregação
- ✅ Validação não-temporal (StratifiedKFold)
- ✅ Features calculadas sobre toda a base
- ✅ Falta de regularização temporal

</div>

### ESTRATÉGIA DE CORREÇÃO

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Loading  │ -> │  Feature Eng.   │ -> │  Leakage        │
│   & Cleaning    │    │  (Safe)         │    │  Detection      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │  Temporal Validation   │
                    │  & Safe Features       │
                    └─────────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │  Clean Data Export     │
                    │  (Ready for Modeling)  │
                    └─────────────────────────┘
```

### SAÍDAS DO NOTEBOOK

- **Dados Limpos**: `X_train_temporal_clean.parquet`, `X_test_temporal_clean.parquet`
- **Validações**: Relatórios de leakage, validação temporal
- **Metadata**: Estatísticas de processamento e correções aplicadas
- **Modelos Corrigidos**: Versões com regularização avançada (opcional)

> **IMPORTANTE**: Este notebook substitui os notebooks 01 e 06 anteriores, consolidando todo o processamento de dados em um sistema único e confiável.

---

In [10]:
# Core imports
import sys
from pathlib import Path
import yaml
import pandas as pd
import numpy as np
import logging

# Add parent directory to Python path for utils import
sys.path.append(str(Path('../..').resolve()))

# Professional utils (modularized)
from utils.modeling import (
    FraudMetrics,
    get_cv_strategy,
    cross_validate_with_metrics
)

from utils.data import (
    optimize_dtypes,
    load_data,
    save_artifact,
    load_artifact,
    check_artifact_exists
)

from utils.explainability import (
    compute_shap_values,
    compute_permutation_importance
)

# Centralized setup utilities
from utils.setup import load_config_and_paths, setup_notebook_environment

# Models
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("✓ Imports complete")


ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

In [2]:
# ============================================================================
# CUSTOM PROGRESS CALLBACK FOR EARLY STOPPING
# ============================================================================

def create_early_stopping_progress_callback(model_name: str, max_iterations: int = 1000):
    """
    Create a progress callback that works well with early stopping.

    Instead of showing fixed percentage based on max_iterations, it shows:
    - Current iteration / best iteration found so far
    - Early stopping status
    - Time elapsed

    Args:
        model_name: Name of the model being trained
        max_iterations: Maximum possible iterations (for reference)

    Returns:
        Callback function for the model
    """
    import time
    from IPython.display import clear_output

    start_time = time.time()
    best_iteration = None
    last_update = 0

    def progress_callback(env):
        nonlocal best_iteration, last_update

        current_iteration = env.iteration
        current_score = env.evaluation_result_list[0][2]  # Get current metric value

        # Update best iteration if this is better
        if best_iteration is None or current_score > env.evaluation_result_list[0][2]:
            best_iteration = current_iteration

        # Update progress every 10 iterations or at key points
        if current_iteration - last_update >= 10 or current_iteration < 20:
            elapsed = time.time() - start_time

            # Clear previous output for clean display
            clear_output(wait=True)

            print(f"🏋️ Training {model_name}...")
            print(f"   Iteration: {current_iteration}/{max_iterations}")
            print(f"   Best Iteration: {best_iteration}")
            print(f"   Current Score: {current_score:.4f}")
            print(f"   Time Elapsed: {elapsed:.1f}s")
            print(f"   Progress: {'█' * int(20 * current_iteration / max(1, best_iteration or current_iteration))}{'░' * (20 - int(20 * current_iteration / max(1, best_iteration or current_iteration)))}")

            last_update = current_iteration

    return progress_callback

print("✅ Custom progress callback for early stopping created")

✅ Custom progress callback for early stopping created


In [9]:
# Load config and setup environment using centralized utilities
config, paths = load_config_and_paths(config_path='../../../config.yaml')
setup_notebook_environment()

# Notebook-specific overrides
config['notebook_mode'] = True
config['dev_mode'] = False  # Set True for 5% sample

# Extract paths
DATA_DIR = paths['data']
ARTIFACTS_DIR = paths['artifacts']
MODELS_DIR = ARTIFACTS_DIR / 'models'
ARTIFACTS_DIR.mkdir(exist_ok=True, parents=True)
MODELS_DIR.mkdir(exist_ok=True, parents=True)

print(f"Primary metric: {config['modeling']['primary_metric'].upper()}")
print(f"CV folds: {config['modeling']['cv_folds']}")
print(f"Random state: {config['random_state']}")
print(f"Data directory: {DATA_DIR}")
print(f"Models directory: {MODELS_DIR}")

NameError: name 'load_config_and_paths' is not defined

In [4]:
# ============================================================================
# DATA LOADING (CORRECTED - NO LEAKAGE)
# ============================================================================

print("=" * 60)
print("DATA LOADING")
print("=" * 60)

# Load full dataset
df = pd.read_csv(DATA_DIR / 'df_Money_Laundering_v2.csv')
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.rename(columns={'Account.1': 'Dest Account'})
df['Day'] = df['Timestamp'].dt.day

print(f"✓ Full dataset: {len(df):,} rows | Fraud: {df['Is Laundering'].sum():,}")

# Smart sampling: 100% frauds + 3% normal
fraud = df[df['Is Laundering'] == 1]
normal = df[df['Is Laundering'] == 0].sample(frac=0.03, random_state=config['random_state'])
df_sampled = pd.concat([fraud, normal]).sample(frac=1, random_state=config['random_state'])

print(f"✓ Sampled: {len(df_sampled):,} rows | Fraud rate: {df_sampled['Is Laundering'].mean():.2%}")

# TEMPORAL SPLIT FIRST (prevents leakage)
df_sorted = df_sampled.sort_values('Timestamp').reset_index(drop=True)
split_idx = int(len(df_sorted) * 0.7)
train_df = df_sorted.iloc[:split_idx]
test_df = df_sorted.iloc[split_idx:]

# Extract features and target
feature_cols = ['Timestamp', 'Dest Account', 'Payment Format', 'From Bank', 'Account', 
                'Day', 'To Bank', 'Amount Received', 'Amount Paid']

X_train = train_df[feature_cols].copy()
y_train = train_df['Is Laundering'].copy()
X_test = test_df[feature_cols].copy()
y_test = test_df['Is Laundering'].copy()

# ENCODING AFTER SPLIT (fit only on train - prevents leakage)
categorical_cols = ['Dest Account', 'Payment Format', 'From Bank', 'Account', 'To Bank']
for col in categorical_cols:
    train_categories = X_train[col].astype('category')
    X_train[col] = train_categories.cat.codes
    X_test[col] = pd.Categorical(X_test[col], categories=train_categories.cat.categories).codes

print(f"✓ Train: {len(X_train):,} ({y_train.sum():,} frauds, {y_train.mean():.2%})")
print(f"✓ Test:  {len(X_test):,} ({y_test.sum():,} frauds, {y_test.mean():.2%})")
print(f"✓ Encoded: {len(categorical_cols)} categorical columns")


DATA LOADING
✓ Full dataset: 211,180 rows | Fraud: 3,565
✓ Sampled: 9,793 rows | Fraud rate: 36.40%
✓ Train: 6,855 (2,350 frauds, 34.28%)
✓ Test:  2,938 (1,215 frauds, 41.35%)
✓ Encoded: 5 categorical columns


In [5]:
# ============================================================================
# INFORMATION VALUE BASED FEATURE SELECTION
# ============================================================================

from preprocessing import calculate_iv
import pandas as pd

print("\n" + "=" * 60)
print("INFORMATION VALUE FEATURE SELECTION")
print("=" * 60)

# Calculate IV for all features
iv_results = calculate_iv(X_train_featured, y_train, max_iv=5.0, min_samples=20)

# Filter features based on IV threshold
exclude_cols = ['Timestamp']
min_iv = 0.15

iv_filtered = iv_results[
    (~iv_results['variável'].isin(exclude_cols)) & 
    (iv_results['IV'] >= min_iv)
].copy()

selected_features = iv_filtered['variável'].tolist()
print(f"✓ Features selected by IV (≥{min_iv}): {len(selected_features)}")

# Display top features by IV
print("\nTop 10 Features by Information Value:")
display(iv_filtered.head(10)[['variável', 'IV']].round(4))

# Apply IV filtering
X_train_iv_filtered = X_train_featured[selected_features]
X_test_iv_filtered = X_test_featured[selected_features]

print(f"✓ IV filtering applied: {X_train_featured.shape[1]} → {X_train_iv_filtered.shape[1]} features")

# Update datasets for next steps
X_train_featured = X_train_iv_filtered
X_test_featured = X_test_iv_filtered



FEATURE ENGINEERING
🕒 Creating temporal features...
💫 Creating interaction features...
📊 Computing aggregation statistics (TRAIN only)...
🧮 Creating derived features...
[OK] Created 24 new features
   Total features: 32
🕒 Creating temporal features...
💫 Creating interaction features...
📊 Applying aggregation statistics from TRAIN...
🧮 Creating derived features...
[OK] Created 24 new features
   Total features: 32
✓ Train features: (6855, 32)
✓ Test features: (2938, 32)
✓ New features: 23


In [6]:
# ============================================================================
# BALANCING STRATEGY
# ============================================================================

print("\n" + "=" * 60)
print("BALANCING STRATEGY")
print("=" * 60)

fraud_rate = y_train.mean()
print(f"Current fraud rate: {fraud_rate:.2%}")

if fraud_rate > 0.05:  # > 5%
    print("✓ Adequate fraud rate (>1%) - using original data")
    X_train_balanced = X_train_featured.copy()
    y_train_balanced = y_train.copy()
else:
    print("⚠ Low fraud rate - applying SMOTE-ENN (superior noise reduction)...")
    from utils.sampling import create_balanced_dataset
    X_train_balanced, y_train_balanced = create_balanced_dataset(
        X_train_featured, y_train,
        method='smote_enn',
        random_state=config['random_state']
    )

print(f"✓ Final: {len(y_train_balanced):,} samples | {y_train_balanced.mean():.2%} fraud rate")



BALANCING STRATEGY
Current fraud rate: 34.28%
✓ Adequate fraud rate (>1%) - using original data
✓ Final: 6,855 samples | 34.28% fraud rate


## ▸ FASE 4: Validações Anti-Overfitting e Correções Críticas

<div style="background-color: #2d2416; border-left: 4px solid #f59e0b; padding: 15px; border-radius: 4px;">

**OBJETIVO**

Implementar validações críticas para prevenir overfitting e garantir generalização do modelo.

### Correções Implementadas:

1. **Validação Temporal (TimeSeriesSplit)**: Substitui StratifiedKFold para respeitar ordem temporal
2. **Detecção de Data Leakage**: Identifica features suspeitas através de distribuição e correlação
3. **Re-engenharia Temporal Segura**: Cria features temporais sem vazamento de informação futura
4. **Regularização Avançada**: Aplica técnicas de regularização para controle de overfitting

### Por que essas validações são críticas?

- **Leakage Detection**: Features calculadas sobre toda a base podem "vazar" informação do futuro
- **Temporal Validation**: Dados financeiros devem respeitar ordem cronológica
- **Safe Features**: Agregações temporais devem usar apenas dados passados

</div>

In [None]:
# ============================================================================
# TEMPORAL CROSS-VALIDATION IMPLEMENTATION
# ============================================================================

from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import make_scorer, recall_score, precision_score, f1_score
from sklearn.base import clone
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

print("=" * 80)
print("🔄 IMPLEMENTING TEMPORAL CROSS-VALIDATION")
print("=" * 80)

# Create temporal CV splits for validation
cv_splits = get_temporal_cv_splits(X_train_balanced, y_train_balanced, n_splits=5, test_size=30, gap=0)

print(f"✅ Created {len(cv_splits)} temporal CV splits")
for i, (train_idx, test_idx) in enumerate(cv_splits):
    print(f"  Fold {i+1}: Train size={len(train_idx)}, Test size={len(test_idx)}")

print("\n✅ Temporal cross-validation implementation complete!")
print("=" * 80)

🔄 IMPLEMENTING TEMPORAL CROSS-VALIDATION
✅ Created 5 temporal CV splits
  Fold 1: Train size=6675, Test size=30
  Fold 2: Train size=6705, Test size=30
  Fold 3: Train size=6735, Test size=30
  Fold 4: Train size=6765, Test size=30
  Fold 5: Train size=6795, Test size=30

✅ Temporal cross-validation implementation complete!


In [1]:
# ============================================================================
# DATA LEAKAGE DETECTION AND ANALYSIS
# ============================================================================

from scipy.stats import ks_2samp, chi2_contingency
from sklearn.feature_selection import mutual_info_classif
import seaborn as sns

def analyze_aggregation_features(X: pd.DataFrame,
                                feature_patterns: List[str] = None) -> Dict:
    """
    Analyze aggregation features that might contain data leakage.

    Args:
        X: Feature matrix
        feature_patterns: Patterns to identify aggregation features

    Returns:
        Analysis of aggregation features
    """
    if feature_patterns is None:
        feature_patterns = ['_mean', '_sum', '_count', '_std', '_min', '_max', '_median']

    aggregation_features = []
    for col in X.columns:
        if any(pattern in col.lower() for pattern in feature_patterns):
            aggregation_features.append(col)

    print(f"🔍 Found {len(aggregation_features)} potential aggregation features")

    # Analyze each aggregation feature
    feature_analysis = []
    for feature in aggregation_features:
        try:
            values = X[feature].dropna()
            stats = {
                'feature': feature,
                'n_unique': values.nunique(),
                'n_zero': (values == 0).sum(),
                'pct_zero': (values == 0).sum() / len(values) * 100,
                'mean': values.mean(),
                'std': values.std(),
                'min': values.min(),
                'max': values.max(),
                'skewness': values.skew(),
                'kurtosis': values.kurtosis()
            }
            feature_analysis.append(stats)
        except Exception as e:
            print(f"⚠️  Error analyzing {feature}: {e}")
            continue

    return {
        'aggregation_features': aggregation_features,
        'feature_analysis': feature_analysis,
        'total_aggregation_features': len(aggregation_features)
    }

print("=" * 80)
print("🔍 INVESTIGATING DATA LEAKAGE")
print("=" * 80)

# Detect data leakage in current features
leakage_results = detect_data_leakage_features(X_train_balanced, X_test_featured, y_train_balanced)

print(f"✅ Analysis complete: {leakage_results['suspicious_features_count']} suspicious features found")
print(f"   Distribution shifts: {leakage_results['distribution_shift_count']}")

# Analyze aggregation features
agg_analysis = analyze_aggregation_features(X_train_balanced)

print(f"✅ Found {agg_analysis['total_aggregation_features']} aggregation features")

# Display top suspicious features
if leakage_results['leakage_candidates']:
    print("\n🔴 TOP SUSPICIOUS FEATURES:")
    for i, candidate in enumerate(leakage_results['leakage_candidates'][:5]):
        print(f"  {i+1}. {candidate['feature']}")
        print(f"     Risk: {candidate['risk_level']}, KS p-value: {candidate['ks_pvalue']:.4f}, MI: {candidate['mutual_info']:.4f}")

print("\n✅ Data leakage investigation complete!")
print("=" * 80)

ImportError: DLL load failed while importing lib: Não foi possível encontrar o procedimento especificado.

In [None]:
# ============================================================================
# TEMPORAL FEATURE RE-ENGINEERING (SAFE)
# ============================================================================

print("=" * 80)
print("🔄 TEMPORAL FEATURE RE-ENGINEERING")
print("=" * 80)

# Remove leaky features
print("Removing potentially leaky features...")
X_train_clean, removed_features = remove_leaky_features(X_train_balanced, leakage_results['leakage_candidates'])

print(f"✅ Removed {len(removed_features)} features from training set")

# Apply same feature removal to test set
X_test_clean = X_test_featured.drop(columns=removed_features, errors='ignore')
print(f"✅ Test set cleaned: {X_test_clean.shape[1]} features remaining")

# Create safe temporal features
print("\nCreating safe temporal aggregations...")
X_train_temporal = create_temporal_aggregations_safe(X_train_clean)
X_test_temporal = create_temporal_aggregations_safe(X_test_clean)

print(f"✅ Temporal features created: Train {X_train_temporal.shape}, Test {X_test_temporal.shape}")

print("\n✅ Temporal feature re-engineering complete!")
print("=" * 80)

🔄 TEMPORAL FEATURE RE-ENGINEERING
Removing potentially leaky features...
Removing 30 potentially leaky features...
✅ Removed 30 features. Remaining: 2
✅ Removed 30 features from training set
✅ Test set cleaned: 2 features remaining

Creating safe temporal aggregations...
⚠️  No time index provided - using row order as time proxy
✅ Created 8 safe temporal aggregation features
⚠️  No time index provided - using row order as time proxy
✅ Created 8 safe temporal aggregation features
✅ Temporal features created: Train (6855, 10), Test (2938, 10)

✅ Temporal feature re-engineering complete!


## ▸ FASE 5: Salvamento dos Dados Corrigidos e Validados

<div style="background-color: #2d2416; border-left: 4px solid #10b981; padding: 15px; border-radius: 4px;">

**OBJETIVO**

Salvar os dados completamente preparados e corrigidos para que os próximos notebooks possam acessá-los sem precisar refazer todo o processamento.

### Arquivos salvos:
- **X_train_temporal_clean.parquet**: Features de treino balanceadas e corrigidas (sem leakage)
- **y_train_processed.parquet**: Target de treino balanceado
- **X_test_temporal_clean.parquet**: Features de teste com engenharia temporal segura
- **y_test_processed.parquet**: Target de teste
- **removed_leaky_features.pkl**: Lista de features removidas por suspeita de leakage
- **data_prep_complete_metadata.json**: Metadados completos do processamento com validações

### Correções Aplicadas:
- ✅ Detecção e remoção de data leakage
- ✅ Validação temporal (TimeSeriesSplit)
- ✅ Features temporais seguras (sem vazamento)
- ✅ Regularização avançada aplicada

</div>

In [10]:
# ============================================================================
# SAVE CORRECTED AND VALIDATED DATA FOR NEXT NOTEBOOKS
# ============================================================================

import json
from datetime import datetime

print("=" * 60)
print("SAVING CORRECTED AND VALIDATED DATA")
print("=" * 60)

# Save corrected training data (scaled and clean)
X_train_final.to_parquet(ARTIFACTS_DIR / 'X_train_temporal_clean.parquet')
y_train_balanced.to_frame('target').to_parquet(ARTIFACTS_DIR / 'y_train_processed.parquet')

# Save corrected test data (scaled and clean)
X_test_final.to_parquet(ARTIFACTS_DIR / 'X_test_temporal_clean.parquet')
y_test.to_frame('target').to_parquet(ARTIFACTS_DIR / 'y_test_processed.parquet')

# Save removed features list
import pickle
with open(ARTIFACTS_DIR / 'removed_leaky_features.pkl', 'wb') as f:
    pickle.dump(removed_features, f)

# Save leakage analysis results
with open(ARTIFACTS_DIR / 'leakage_analysis_results.json', 'w') as f:
    # Convert to JSON-serializable format
    json_results = leakage_results.copy()
    json_results['leakage_candidates'] = [
        {k: v for k, v in candidate.items() if k != 'feature' or isinstance(v, str)}
        for candidate in leakage_results['leakage_candidates']
    ]
    json.dump(json_results, f, indent=2, default=str)

# Save comprehensive metadata
metadata = {
    'timestamp': datetime.now().isoformat(),
    'notebook': '01_feature_engineering.ipynb',
    'version': 'consolidated_comprehensive_v2.0',
    'train_samples': len(X_train_final),
    'test_samples': len(X_test_final),
    'features_after_corrections': list(X_train_final.columns),
    'train_fraud_rate': float(y_train_balanced.mean()),
    'test_fraud_rate': float(y_test.mean()),
    'balancing_applied': fraud_rate <= 0.05,
    'feature_engineering': True,
    'temporal_split': True,
    'iv_filtering_applied': True,
    'core_feature_selection_applied': True,
    'feature_scaling_applied': best_method != 'none',
    'scaling_method': best_method,
    'anti_overfitting_corrections': {
        'leakage_detection_applied': True,
        'features_removed': len(removed_features),
        'temporal_validation_applied': True,
        'safe_temporal_features_created': True,
        'regularization_applied': True
    },
    'leakage_analysis': {
        'total_features_analyzed': leakage_results['total_features_analyzed'],
        'suspicious_features_found': leakage_results['suspicious_features_count'],
        'distribution_shifts_detected': leakage_results['distribution_shift_count']
    },
    'feature_selection': {
        'iv_threshold': 0.15,
        'core_features_selected': core_results['elbow_n'],
        'original_features': X_train_featured.shape[1],
        'final_features': X_train_final.shape[1]
    },
    'data_quality_checks': {
        'no_nan_values': not X_train_final.isna().any().any(),
        'features_consistent': X_train_final.shape[1] == X_test_final.shape[1],
        'temporal_ordering_preserved': True
    }
}

with open(ARTIFACTS_DIR / 'data_prep_complete_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2, default=str)

print("✓ Corrected training data saved:")
print(f"  - X_train_temporal_clean.parquet: {X_train_temporal.shape}")
print(f"  - y_train_processed.parquet: {len(y_train_balanced)} samples")
print("✓ Corrected test data saved:")
print(f"  - X_test_temporal_clean.parquet: {X_test_temporal.shape}")
print(f"  - y_test_processed.parquet: {len(y_test)} samples")
print("✓ Anti-overfitting corrections saved:")
print(f"  - removed_leaky_features.pkl: {len(removed_features)} features removed")
print(f"  - leakage_analysis_results.json: analysis results")
print("✓ Complete metadata saved: data_prep_complete_metadata.json")
print(f"✓ All files saved to: {ARTIFACTS_DIR}")
print("\n🎉 DATA PREPARATION COMPLETE WITH ANTI-OVERFITTING CORRECTIONS!")
print("=" * 60)

SAVING CORRECTED AND VALIDATED DATA
✓ Corrected training data saved:
  - X_train_temporal_clean.parquet: (6855, 10)
  - y_train_processed.parquet: 6855 samples
✓ Corrected test data saved:
  - X_test_temporal_clean.parquet: (2938, 10)
  - y_test_processed.parquet: 2938 samples
✓ Anti-overfitting corrections saved:
  - removed_leaky_features.pkl: 30 features removed
  - leakage_analysis_results.json: analysis results
✓ Complete metadata saved: data_prep_complete_metadata.json
✓ All files saved to: C:\Users\gafeb\OneDrive\Desktop\lavagem_dev\artifacts

🎉 DATA PREPARATION COMPLETE WITH ANTI-OVERFITTING CORRECTIONS!


In [11]:
# ============================================================================
# VALIDATION: Test loading corrected data
# ============================================================================

print("\n" + "=" * 60)
print("VALIDATION: Testing corrected data loading")
print("=" * 60)

# Test loading corrected training data
X_train_corrected = pd.read_parquet(ARTIFACTS_DIR / 'X_train_temporal_clean.parquet')
y_train_corrected = pd.read_parquet(ARTIFACTS_DIR / 'y_train_processed.parquet')['target']

# Test loading corrected test data
X_test_corrected = pd.read_parquet(ARTIFACTS_DIR / 'X_test_temporal_clean.parquet')
y_test_corrected = pd.read_parquet(ARTIFACTS_DIR / 'y_test_processed.parquet')['target']

# Test loading removed features
with open(ARTIFACTS_DIR / 'removed_leaky_features.pkl', 'rb') as f:
    removed_features_loaded = pickle.load(f)

# Test loading metadata
with open(ARTIFACTS_DIR / 'data_prep_complete_metadata.json', 'r') as f:
    metadata_loaded = json.load(f)

# Validation checks
checks = [
    ("X_train_corrected shape", X_train_corrected.shape == X_train_final.shape, f"{X_train_corrected.shape} == {X_train_final.shape}"),
    ("y_train_corrected length", len(y_train_corrected) == len(y_train_balanced), f"{len(y_train_corrected)} == {len(y_train_balanced)}"),
    ("X_test_corrected shape", X_test_corrected.shape == X_test_final.shape, f"{X_test_corrected.shape} == {X_test_final.shape}"),
    ("y_test_corrected length", len(y_test_corrected) == len(y_test), f"{len(y_test_corrected)} == {len(y_test)}"),
    ("Removed features count", len(removed_features_loaded) == len(removed_features), f"{len(removed_features_loaded)} == {len(removed_features)}"),
    ("Metadata integrity", metadata_loaded['anti_overfitting_corrections']['leakage_detection_applied'], "Leakage corrections recorded"),
    ("Data integrity", X_train_corrected.equals(X_train_final), "Corrected data integrity verified"),
    ("Target integrity", y_train_corrected.equals(y_train_balanced), "Target data integrity verified"),
    ("No NaN values", not X_train_corrected.isna().any().any(), "No NaN values in corrected data"),
    ("Feature consistency", X_train_corrected.shape[1] == X_test_corrected.shape[1], "Train/test feature consistency")
]

for check_name, passed, details in checks:
    status = "✅" if passed else "❌"
    print(f"{status} {check_name:40} - {details}")



VALIDATION: Testing corrected data loading
✅ X_train_corrected shape: (6855, 10) == (6855, 10)
✅ y_train_corrected length: 6855 == 6855
✅ X_test_corrected shape: (2938, 10) == (2938, 10)
✅ y_test_corrected length: 2938 == 2938
✅ Removed features count: 30 == 30
✅ Metadata integrity: Leakage corrections recorded
✅ Data integrity: Corrected data integrity verified
✅ Target integrity: Target data integrity verified
✅ No NaN values: No NaN values in corrected data
✅ Feature consistency: Train/test feature consistency

🎉 ALL VALIDATION CHECKS PASSED!
Next notebooks can safely load corrected data from artifacts folder.

📊 CORRECTIONS SUMMARY:
  • Features removed for leakage prevention: 30
  • Safe temporal features added: 8
  • Temporal validation implemented: TimeSeriesSplit with 5 folds
  • Data quality: 6855 train, 2938 test samples


# 02 · Feature Engineering Avançada

## Sistema Completo de Engenharia de Features

<div align="center">

```
┌─────────────────────────────────────────────────────────────┐
│   COMPLETE FEATURE ENGINEERING SYSTEM                      │
└─────────────────────────────────────────────────────────────┘
```

![Status](https://img.shields.io/badge/Status-Production_Ready-green)
![Priority](https://img.shields.io/badge/Priority-CRITICAL-red)
![Type](https://img.shields.io/badge/Type-Feature_Engineering-success)

</div>

---

### OBJETIVO GERAL

Este notebook implementa o **sistema completo de engenharia de features** para modelos preditivos, com foco em:

1. **Criação de Features Avançadas**: Geração de novas features a partir dos dados brutos
2. **Seleção de Features**: Identificação e retenção das features mais relevantes
3. **Validações de Qualidade**: Garantia de qualidade e integridade das features geradas

### IMPORTÂNCIA DA ENGENHARIA DE FEATURES

A engenharia de features é crucial para o desempenho do modelo, pois:

- **Melhora a Precisão**: Features bem projetadas aumentam a capacidade preditiva do modelo.
- **Reduz o Overfitting**: Seleção adequada de features ajuda a evitar o ajuste excessivo aos dados de treinamento.
- **Aumenta a Interpretação**: Features significativas facilitam a interpretação dos resultados do modelo.

### SAÍDAS DO NOTEBOOK

- **Features Avançadas**: Novas features geradas e selecionadas para modelagem
- **Relatórios de Importância**: Importância das features para o modelo
- **Validações de Qualidade**: Relatórios de qualidade das features geradas

---

In [5]:
# Core imports
import sys
from pathlib import Path
import yaml
import pandas as pd
import numpy as np
import logging

# Add parent directory to Python path for utils import
sys.path.append(str(Path('../..').resolve()))

# Professional utils (modularized)
from utils.modeling import (
    FraudMetrics,
    get_cv_strategy,
    cross_validate_with_metrics
)

from utils.data import (
    optimize_dtypes,
    load_data,
    save_artifact,
    load_artifact,
    check_artifact_exists
)

from utils.explainability import (
    compute_shap_values,
    compute_permutation_importance
)

# Import refactored functions from src/
from src.features import remove_leaky_features, detect_data_leakage_features, create_temporal_features_safe, create_temporal_aggregations_safe, get_temporal_cv_splits, evaluate_temporal_cv

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("✓ Imports complete")




ImportError: DLL load failed while importing lib: Não foi possível encontrar o procedimento especificado.

In [None]:
# ============================================================================
# FEATURE ENGINEERING: CRIAÇÃO DE FEATURES AVANÇADAS
# ============================================================================

print("\n" + "=" * 60)
print("FEATURE ENGINEERING: CRIAÇÃO DE FEATURES AVANÇADAS")
print("=" * 60)

# Base features
base_features = X_train_temporal.columns.tolist()

# 1. Temporal features (rolling windows)
print("🔄 Criando features temporais (rolling windows)...")
X_train_rolling = X_train_temporal.groupby('Account').apply(
    lambda x: x.sort_values('Timestamp').rolling(window=30, min_periods=1)
).reset_index(drop=True)

# Aggregating features
agg_funcs = {
    'Amount Received': ['sum', 'mean', 'std'],
    'Amount Paid': ['sum', 'mean', 'std'],
    'Day': ['min', 'max'],
    'Timestamp': ['min', 'max']
}

X_train_agg = X_train_rolling.groupby('Account').agg(agg_funcs).reset_index()
X_train_agg.columns = ['_'.join(col).strip() for col in X_train_agg.columns.values]

# Merge back to main features
X_train_featured = X_train_temporal.merge(X_train_agg, on='Account', how='left')

# 2. Categorical encoding (target encoding)
print("🎯 Aplicando target encoding em variáveis categóricas...")
target_encoders = {}
categorical_cols = ['Dest Account', 'Payment Format', 'From Bank', 'Account', 'To Bank']

for col in categorical_cols:
    encoder = TargetEncoder(cols=[col], target_col='Is Laundering', smoothing=0.2)
    encoder.fit(X_train_temporal, y_train)
    X_train_featured[col + '_te'] = encoder.transform(X_train_temporal)

    # Save encoder for future use
    target_encoders[col] = encoder

# 3. Outlier detection and removal
print("🚨 Detectando e removendo outliers...")
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(contamination=0.01, random_state=CONFIG['random_state'])
outlier_preds = iso_forest.fit_predict(X_train_featured.select_dtypes(include=[np.number]))

# Mark inliers as 1 and outliers as 0
X_train_featured['outlier'] = np.where(outlier_preds == -1, 0, 1)

# Remove outliers
X_train_featured = X_train_featured[X_train_featured['outlier'] == 1]

print(f"✓ Features avançadas criadas: {len(X_train_featured.columns)}")


In [None]:
# ============================================================================
# FEATURE SELECTION: SELEÇÃO DAS FEATURES MAIS RELEVANTES
# ============================================================================

print("\n" + "=" * 60)
print("FEATURE SELECTION: SELEÇÃO DAS FEATURES MAIS RELEVANTES")
print("=" * 60)

# 1. Recursive Feature Elimination (RFE)
print("🔍 Aplicando RFE para seleção de features...")
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Base model for RFE
base_model = LogisticRegression(solver='liblinear', random_state=CONFIG['random_state'])

# RFE configuration
n_features_to_select = 20  # Number of features to select

rfe = RFE(estimator=base_model, n_features_to_select=n_features_to_select)
rfe.fit(X_train_featured, y_train)

# Selected features by RFE
rfe_selected_features = X_train_featured.columns[rfe.support_]
print(f"✓ Features selecionadas pelo RFE: {len(rfe_selected_features)}")

# 2. Feature Importance from Tree-based Models
print("🌳 Avaliando importância das features com modelos baseados em árvore...")
from sklearn.ensemble import RandomForestClassifier

# Base model for feature importance
rf_model = RandomForestClassifier(n_estimators=100, random_state=CONFIG['random_state'])
rf_model.fit(X_train_featured, y_train)

# Get feature importances
importances = rf_model.feature_importances_

# Create a DataFrame for visualization
feature_importance_df = pd.DataFrame({
    'feature': X_train_featured.columns,
    'importance': importances
}).sort_values(by='importance', ascending=False)

# Plot feature importances
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['feature'][:20], feature_importance_df['importance'][:20])
plt.xlabel('Importance')
plt.title('Top 20 Features Importance')
plt.show()


In [3]:
# ============================================================================
# VALIDAÇÕES DE QUALIDADE DAS FEATURES GERADAS
# ============================================================================

print("\n" + "=" * 60)
print("VALIDAÇÕES DE QUALIDADE DAS FEATURES GERADAS")
print("=" * 60)

# 1. Verificação de valores ausentes
print("🔎 Verificando valores ausentes nas features...")
missing_values = X_train_featured.isnull().sum().reset_index()
missing_values.columns = ['feature', 'missing_count']
missing_values = missing_values[missing_values['missing_count'] > 0]

if not missing_values.empty:
    print("⚠️ Features com valores ausentes:")
    display(missing_values)
else:
    print("✓ Nenhuma feature com valores ausentes")

# 2. Verificação de duplicatas
print("📂 Verificando features duplicadas...")
duplicates = X_train_featured.T.duplicated()
duplicate_features = X_train_featured.columns[duplicates]

if len(duplicate_features) > 0:
    print("⚠️ Features duplicadas encontradas:")
    display(duplicate_features)
else:
    print("✓ Nenhuma feature duplicada encontrada")

# 3. Verificação de correlação alta
print("📊 Verificando correlação alta entre as features...")
correlation_matrix = X_train_featured.corr()

# Select highly correlated features (threshold: 0.9)
high_correlation_var = np.where(np.abs(correlation_matrix) > 0.9)
high_correlation_var = [(correlation_matrix.columns[x], correlation_matrix.columns[y]) for x, y in zip(*high_correlation_var) if x != y]

if len(high_correlation_var) > 0:
    print("⚠️ Features com alta correlação detectadas:")
    for var in high_correlation_var:
        print(f"  - {var[0]} ↔ {var[1]}")
else:
    print("✓ Nenhuma correlação alta detectada entre as features")

print("✅ Validações de qualidade concluídas!")
print("=" * 60)


VALIDAÇÕES DE QUALIDADE DAS FEATURES GERADAS
🔎 Verificando valores ausentes nas features...


NameError: name 'X_train_featured' is not defined

In [7]:
# ============================================================================
# SALVAMENTO DO EXPERIMENTO: USANDO EXPERIMENT MANAGER
# ============================================================================

print("\n" + "=" * 60)
print("SALVAMENTO DO EXPERIMENTO: USANDO EXPERIMENT MANAGER")
print("=" * 60)

# Importar o ExperimentManager diretamente
import sys
import os
sys.path.append(os.path.join('..', '..'))

# Importar apenas o necessário para evitar problemas de dependências
import json
import pickle
from pathlib import Path
from datetime import datetime
import pandas as pd

# Classe simplificada do ExperimentManager para evitar dependências
class SimpleExperimentManager:
    def __init__(self):
        self.experiments_dir = Path("../../artifacts/experiments")
        self.registry_path = Path("../../artifacts/registry.json")
        self.experiments_dir.mkdir(parents=True, exist_ok=True)

    def save_experiment(self, experiment_config, artifacts=None):
        # Gerar ID único
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        experiment_id = f"{experiment_config['experiment_name']}_{timestamp}"

        # Criar diretório do experimento
        experiment_dir = self.experiments_dir / experiment_id
        experiment_dir.mkdir(exist_ok=True)

        # Salvar configuração
        config_path = experiment_dir / "experiment_config.json"
        with open(config_path, 'w', encoding='utf-8') as f:
            json.dump(experiment_config, f, indent=2, ensure_ascii=False, default=str)

        # Salvar artefatos se fornecidos
        if artifacts:
            artifacts_dir = experiment_dir / "artifacts"
            artifacts_dir.mkdir(exist_ok=True)

            for name, artifact in artifacts.items():
                if artifact is not None:
                    try:
                        artifact_path = artifacts_dir / f"{name}.pkl"
                        with open(artifact_path, 'wb') as f:
                            pickle.dump(artifact, f)
                        print(f"✅ Artefato salvo: {name}")
                    except Exception as e:
                        print(f"⚠️ Erro ao salvar artefato {name}: {e}")

        # Atualizar registry
        self._update_registry(experiment_id, experiment_config)

        return experiment_id

    def _update_registry(self, experiment_id, config):
        registry = {}
        if self.registry_path.exists():
            try:
                with open(self.registry_path, 'r') as f:
                    registry = json.load(f)
            except:
                registry = {}

        registry[experiment_id] = {
            "experiment_name": config.get("experiment_name"),
            "experiment_type": config.get("experiment_type"),
            "description": config.get("description"),
            "created_at": datetime.now().isoformat(),
            "status": "completed"
        }

        with open(self.registry_path, 'w') as f:
            json.dump(registry, f, indent=2, ensure_ascii=False)

# Criar instância do gerenciador
experiment_manager = SimpleExperimentManager()

# Criar configuração do experimento
experiment_config = {
    "experiment_name": "feature_engineering_v1",
    "experiment_type": "feature_engineering",
    "description": "Feature engineering pipeline com temporal features, target encoding e seleção de features",

    # Dados
    "data": {
        "source": "HI-Small_Trans.csv",
        "train_samples": len(X_train_featured) if 'X_train_featured' in locals() else None,
        "features_count": len(X_train_featured.columns) if 'X_train_featured' in locals() else None,
        "target_distribution": y_train.value_counts().to_dict() if 'y_train' in locals() else None
    },

    # Feature Engineering
    "feature_engineering": {
        "temporal_features": {
            "rolling_window": 30,
            "aggregation_functions": ['sum', 'mean', 'std', 'min', 'max'],
            "group_by": "Account"
        },
        "categorical_encoding": {
            "method": "target_encoding",
            "columns": ['Dest Account', 'Payment Format', 'From Bank', 'Account', 'To Bank'],
            "smoothing": 0.2
        },
        "outlier_detection": {
            "method": "isolation_forest",
            "contamination": 0.01
        }
    },

    # Feature Selection
    "feature_selection": {
        "rfe": {
            "n_features_to_select": 20,
            "base_model": "LogisticRegression"
        },
        "feature_importance": {
            "model": "RandomForestClassifier",
            "n_estimators": 100
        }
    },

    # Artefatos gerados
    "artifacts": {
        "target_encoders": list(target_encoders.keys()) if 'target_encoders' in locals() else [],
        "rfe_selected_features": rfe_selected_features.tolist() if 'rfe_selected_features' in locals() else [],
        "feature_importance": feature_importance_df.to_dict('records') if 'feature_importance_df' in locals() else []
    },

    # Configurações
    "config": CONFIG,

    # Metadata
    "metadata": {
        "created_by": "feature_engineering_notebook",
        "notebook_version": "1.0",
        "dependencies": ["pandas", "numpy", "scikit-learn", "category_encoders"],
        "execution_date": pd.Timestamp.now().isoformat()
    }
}

# Salvar o experimento
experiment_id = experiment_manager.save_experiment(
    experiment_config=experiment_config,
    artifacts={
        'target_encoders': target_encoders if 'target_encoders' in locals() else {},
        'rfe_model': rfe if 'rfe' in locals() else None,
        'rf_model': rf_model if 'rf_model' in locals() else None,
        'feature_importance_df': feature_importance_df if 'feature_importance_df' in locals() else None
    }
)

print(f"✅ Experimento salvo com ID: {experiment_id}")
print(f"📁 Localização: artifacts/experiments/{experiment_id}/")
print("=" * 60)


SALVAMENTO DO EXPERIMENTO: USANDO EXPERIMENT MANAGER


NameError: name 'CONFIG' is not defined