# 02 · Data Preprocessing
## Basic Preprocessing Pipeline for AML

<div align="center">

```
┌─────────────────────────────────────────────────────────────┐
│   BASIC DATA PREPROCESSING - AML FEATURE PIPELINE         │
└─────────────────────────────────────────────────────────────┘
```

![Status](https://img.shields.io/badge/Status-Production_Ready-green)
![Priority](https://img.shields.io/badge/Priority-HIGH-red)
![Type](https://img.shields.io/badge/Type-Preprocessing-success)

</div>

---

### OVERALL OBJECTIVE

This notebook implements the **basic preprocessing** of the data prepared in notebook 01, producing features that are ready for modeling:

1. **Loading** the temporal datasets from notebook 01
2. **Basic preprocessing** (scaling, encoding)
3. **Validation** of quality and consistency
4. **Persistence** of the preprocessed data

### DEPENDENCIES

- **Notebook 01**: `01_data_ingestion_and_split.ipynb` (temporal datasets)
- **Outputs**: Preprocessed datasets saved in `data/`

### PREPROCESSING STRATEGY

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Load Data  │ -> │   Encoding   │ -> │   Scaling    │ -> │   Validation │
│  (Temporal)  │    │ (Categorical)│    │ (Numerical)  │    │   & Save     │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
```
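
The encoding and scaling stages in the diagram can be sketched as below. This is a minimal illustration, not the notebook's actual implementation: the helper names `fit_preprocessor` and `apply_preprocessor` and the toy columns `Amount` and `Currency` are hypothetical. It assumes integer-code encoding and standardization, with all parameters learned from the training split only so that nothing from the test period leaks into the transform.

```python
import pandas as pd


def fit_preprocessor(train: pd.DataFrame) -> dict:
    """Learn encoding and scaling parameters from the training split only."""
    num_cols = [c for c in train.columns if pd.api.types.is_numeric_dtype(train[c])]
    cat_cols = [c for c in train.columns if c not in num_cols]
    # Map every category seen in training to an integer code
    categories = {
        c: {v: i for i, v in enumerate(sorted(train[c].dropna().unique()))}
        for c in cat_cols
    }
    means = train[num_cols].mean()
    # Constant columns get std = 1 to avoid division by zero
    stds = train[num_cols].std(ddof=0).replace(0, 1.0)
    return {"num_cols": num_cols, "cat_cols": cat_cols,
            "categories": categories, "means": means, "stds": stds}


def apply_preprocessor(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Apply the learned parameters to any split (train or test)."""
    out = df.copy()
    for c in params["cat_cols"]:
        # Categories not seen during fitting are encoded as -1
        out[c] = out[c].map(params["categories"][c]).fillna(-1).astype(int)
    num_cols = params["num_cols"]
    out[num_cols] = (out[num_cols] - params["means"]) / params["stds"]
    return out


# Toy data (hypothetical columns, for illustration only)
toy_train = pd.DataFrame({"Amount": [10.0, 20.0, 30.0],
                          "Currency": ["USD", "EUR", "USD"]})
toy_test = pd.DataFrame({"Amount": [20.0, 40.0],
                         "Currency": ["EUR", "BRL"]})

params = fit_preprocessor(toy_train)
toy_train_t = apply_preprocessor(toy_train, params)
toy_test_t = apply_preprocessor(toy_test, params)
```

Here `BRL` appears only in the test split, so it receives the sentinel code `-1` instead of raising an error. Whether a sentinel, a dedicated "other" bucket, or a hard failure is appropriate depends on the downstream model.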

---

## ▸ SECTION 1: Environment Setup

In [None]:
# Environment setup
import sys
import warnings
from pathlib import Path
from datetime import datetime

# Import base modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yaml
import json
import pickle

# Add utils to path
utils_path = Path('../utils')
if str(utils_path) not in sys.path:
    sys.path.insert(0, str(utils_path))

warnings.filterwarnings('ignore')

# Configuration
RANDOM_STATE = 42
TARGET_COLUMN = 'Is Laundering'

# Paths
data_dir = Path('../data')
artifacts_dir = Path('../artifacts')

print("Environment configured successfully")
print(f"Data directory: {data_dir}")
print(f"Artifacts directory: {artifacts_dir}")
print(f"Random state: {RANDOM_STATE}")

## ▸ SECTION 2: Loading the Temporal Data

<div style="background-color: #2d2416; border-left: 4px solid #f59e0b; padding: 15px; border-radius: 4px;">

**OBJECTIVE**

Load the temporal datasets prepared in notebook 01 (`01_data_ingestion_and_split.ipynb`).

</div>

In [None]:
# Load temporal datasets from notebook 01
print("Loading temporal datasets from notebook 01...")
print("=" * 50)

# Load training data
X_train_path = data_dir / 'X_train_temporal.csv'
y_train_path = data_dir / 'y_train_temporal.csv'

if X_train_path.exists() and y_train_path.exists():
    X_train = pd.read_csv(X_train_path)
    y_train = pd.read_csv(y_train_path).iloc[:, 0]  # Get target column
    
    print(f"✓ Training data loaded:")
    print(f"  Features: {X_train.shape[1]} columns, {X_train.shape[0]:,} rows")
    print(f"  Target: {len(y_train):,} samples, {y_train.mean():.2%} positive rate")
else:
    raise FileNotFoundError(f"Temporal training datasets not found in {data_dir}. Please run notebook 01 first.")

# Load test data
X_test_path = data_dir / 'X_test_temporal.csv'
y_test_path = data_dir / 'y_test_temporal.csv'

if X_test_path.exists() and y_test_path.exists():
    X_test = pd.read_csv(X_test_path)
    y_test = pd.read_csv(y_test_path).iloc[:, 0]  # Get target column
    
    print(f"✓ Test data loaded:")
    print(f"  Features: {X_test.shape[1]} columns, {X_test.shape[0]:,} rows")
    print(f"  Target: {len(y_test):,} samples, {y_test.mean():.2%} positive rate")
else:
    raise FileNotFoundError(f"Temporal test datasets not found in {data_dir}. Please run notebook 01 first.")

print("\nData loading completed successfully!")
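
The loading cell above takes the first column of each target file and does not verify that features and target line up. A small guard such as the following could catch a misaligned or malformed split early. It is a sketch only: `check_split` is a hypothetical helper, not part of the notebook, and it assumes the target is a binary 0/1 label.

```python
import pandas as pd


def check_split(X: pd.DataFrame, y: pd.Series, name: str) -> None:
    """Raise ValueError if features and target of a split are inconsistent."""
    if len(X) != len(y):
        raise ValueError(f"{name}: {len(X)} feature rows but {len(y)} target rows")
    if y.isnull().any():
        raise ValueError(f"{name}: target contains missing values")
    # Assumption: the target is a binary label encoded as 0/1
    unexpected = set(y.unique()) - {0, 1}
    if unexpected:
        raise ValueError(f"{name}: unexpected target values {sorted(unexpected)}")


# Toy example (hypothetical data, for illustration only)
X_toy = pd.DataFrame({"Amount": [10.0, 20.0, 30.0]})
y_toy = pd.Series([0, 1, 0])
check_split(X_toy, y_toy, "toy")  # passes silently
```

In the notebook this would be called as `check_split(X_train, y_train, "train")` and `check_split(X_test, y_test, "test")` right after loading.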

## ▸ SECTION 3: Exploratory Data Analysis

<div style="background-color: #2d2416; border-left: 4px solid #f59e0b; padding: 15px; border-radius: 4px;">

**OBJECTIVE**

Analyze the structure and quality of the loaded data before preprocessing.

</div>

In [None]:
# Data structure analysis
print("DATA STRUCTURE ANALYSIS")
print("=" * 40)

print(f"Training set: {X_train.shape[0]:,} rows × {X_train.shape[1]} columns")
print(f"Test set: {X_test.shape[0]:,} rows × {X_test.shape[1]} columns")

# Data types
print(f"\nFeature data types:")
dtype_counts = X_train.dtypes.value_counts()
for dtype, count in dtype_counts.items():
    print(f"  {dtype}: {count} features")

# Identify categorical and numerical features
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_features = X_train.select_dtypes(include=[np.number]).columns.tolist()

print(f"\nCategorical features ({len(categorical_features)}):")
for feat in categorical_features[:5]:  # Show first 5
    unique_vals = X_train[feat].nunique()
    print(f"  {feat}: {unique_vals} unique values")
if len(categorical_features) > 5:
    print(f"  ... and {len(categorical_features) - 5} more")

print(f"\nNumerical features ({len(numerical_features)}):")
for feat in numerical_features[:5]:  # Show first 5
    print(f"  {feat}: [{X_train[feat].min():.2f}, {X_train[feat].max():.2f}]")
if len(numerical_features) > 5:
    print(f"  ... and {len(numerical_features) - 5} more")

# Missing values check
print(f"\nMissing values analysis:")
train_missing = X_train.isnull().sum().sum()
test_missing = X_test.isnull().sum().sum()
print(f"  Training set: {train_missing} missing values")
print(f"  Test set: {test_missing} missing values")

if train_missing > 0:
    print("  Features with missing values in training:")
    missing_cols = X_train.columns[X_train.isnull().any()]
    for col in missing_cols:
        pct = X_train[col].isnull().mean() * 100
        print(f"    {col}: {pct:.1f}% missing")

# Target distribution
print(f"\nTarget distribution:")
print(f"  Training: {y_train.mean():.2%} positive rate ({y_train.sum():,} positives)")
print(f"  Test: {y_test.mean():.2%} positive rate ({y_test.sum():,} positives)")
print(f"  Rate difference: {abs(y_train.mean() - y_test.mean())*100:.2f} percentage points")
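
The analysis above inspects each split separately. Because the encoding step will be fitted on the training data, it can also help to compare the two splits directly: columns present in only one of them, and categorical values that occur only in the test period. The helper below is a hedged sketch of such a check. `compare_splits` and the toy columns are hypothetical names introduced for illustration.

```python
import pandas as pd


def compare_splits(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Report schema differences and test-only categories between two splits."""
    report = {
        "missing_in_test": sorted(set(train.columns) - set(test.columns)),
        "extra_in_test": sorted(set(test.columns) - set(train.columns)),
        "unseen_categories": {},
    }
    shared = [c for c in train.columns if c in test.columns]
    for col in shared:
        if pd.api.types.is_numeric_dtype(train[col]):
            continue  # only non-numeric columns are checked for new categories
        seen = set(train[col].dropna().unique())
        unseen = set(test[col].dropna().unique()) - seen
        if unseen:
            report["unseen_categories"][col] = sorted(unseen)
    return report


# Toy example (hypothetical data, for illustration only)
toy_train = pd.DataFrame({"Amount": [1.0, 2.0], "Currency": ["USD", "EUR"]})
toy_test = pd.DataFrame({"Amount": [3.0], "Currency": ["BRL"]})
report = compare_splits(toy_train, toy_test)
```

With a temporal split, new categories in the test period are plausible (for example, a bank or currency that first appears later), so a non-empty `unseen_categories` entry is a signal to decide how the encoder should handle them rather than necessarily a data error.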