# 3 - Synthetic Data: Feature Selection and Validation

**Purpose**: This notebook performs feature selection on synthetically generated infant movement data to validate the robustness and stability of the Enhanced Adaptive RFE algorithm. Synthetic data allows us to test the methodology with known ground truth and assess generalizability.

**Inputs**:
- Synthetic movement data: `/Volumes/secure/data/early_markers/cribsy/ipc/synth_sdv_1000_long.ipc`
  - Generated via SDV (Synthetic Data Vault) from real data distributions
  - Stored in Apache Arrow IPC format for efficient Polars processing
  - Contains 1000 synthetic infant records with realistic feature distributions
- Sample sizes: `TRAIN_N=300`, `TEST_N=100` (configurable)
- Previously saved BayesianData object (if available) for comparison

**Outputs**:
- `/Volumes/secure/data/early_markers/cribsy/pkl/db_train300_test100.pkl` - Complete BayesianData object
- `/Volumes/secure/data/early_markers/cribsy/xlsx/synthetic_train300_test100*.xlsx` - Excel reports
- Console output showing RFE progression on synthetic data

**Key Dependencies**:
- `BayesianData`: Core class with synthetic data support via `.ipc` files
- `EnhancedAdaptiveRFE`: Statistical feature selection (imported but not directly used)
- `Polars`: High-performance DataFrame library for `.ipc` file I/O
- Random seed: `RAND_STATE = 20250313` for reproducibility

**Workflow Overview**:
1. Load previously saved synthetic BayesianData (if exists) and generate report
2. Initialize fresh BayesianData with synthetic `.ipc` data
3. Apply iterative Enhanced Adaptive RFE until convergence
4. Compute Bayesian surprise and ROC metrics at each iteration
5. Export comprehensive results for comparison with real data analysis

**Synthetic Data Advantages**:
- **Known distributions**: Validate algorithm against expected statistical properties
- **Controlled sample sizes**: Test performance with varying training/test splits
- **Scalability**: Generate as much data as needed for robustness testing
- **Privacy preservation**: No PHI/PII exposure during development
- **Algorithm validation**: Compare synthetic vs. real data feature selection stability

**Performance**: Typically ~1.5 minutes per iteration (faster than real data due to smaller feature sets)

## 3.1 - Environment Setup and Prior Results Export

This section initializes the analysis environment and exports any previously generated results:

### Import Configuration
- **Polars imports**: `DataFrame` and `pl` for efficient `.ipc` file handling
  - Polars is significantly faster than Pandas for large synthetic datasets
  - Native support for the Apache Arrow IPC format in which the SDV-generated data is stored
- **Progress tracking**: `tqdm` for visual progress bars during long operations
- **Constants**: 
  - `IPC_DIR`: Directory for Apache Arrow IPC files (`/Volumes/secure/data/early_markers/cribsy/ipc/`)
  - `PKL_DIR`: Directory for pickle files

### Sample Size Configuration
```python
TRAIN_N = 300  # Training set size (normative + at-risk)
TEST_N = 100   # Test set size (held-out at-risk)
```

These parameters control the train/test split from the 1000-record synthetic dataset:
- Training data is used to compute reference distributions for Bayesian surprise
- Test data evaluates out-of-sample performance
- Different splits can be tested by modifying `TRAIN_N` and `TEST_N`
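
A minimal sketch of how such a split might be drawn, assuming the test set is sampled from the at-risk group first and the training set from all remaining records. The record layout, the `split_records` helper, and the sampling order are illustrative assumptions, not the actual `BayesianData` implementation.

```python
import random

def split_records(records, train_n, test_n, seed):
    """Draw a held-out at-risk test set, then a mixed training set.

    `records` is a list of (infant_id, is_at_risk) tuples (hypothetical layout).
    """
    rng = random.Random(seed)  # local RNG so the split is reproducible
    at_risk_ids = [rid for rid, at_risk in records if at_risk]
    if test_n > len(at_risk_ids):
        raise ValueError("not enough at-risk records for the test set")
    test_ids = set(rng.sample(at_risk_ids, test_n))

    # Training records come from everything not already held out
    remaining = [rid for rid, _ in records if rid not in test_ids]
    if train_n > len(remaining):
        raise ValueError("not enough remaining records for the training set")
    train_ids = set(rng.sample(remaining, train_n))
    return train_ids, test_ids

# Toy data: 1000 records, every fourth one flagged at-risk (made-up ratio)
records = [(i, i % 4 == 0) for i in range(1000)]
train_ids, test_ids = split_records(records, train_n=300, test_n=100, seed=20250313)
print(len(train_ids), len(test_ids), len(train_ids & test_ids))
```

Using a local `random.Random(seed)` rather than the global generator keeps the split stable even if other code consumes random numbers in between.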

### Prior Results Handling
The notebook first attempts to load and export results from a previous run:
```python
with open(PKL_DIR / f"db_train{TRAIN_N}_test{TEST_N}.pkl", "rb") as f:
    bd = pickle.load(f)
bd.write_excel_report(f"synthetic_train{TRAIN_N}_test{TEST_N}")
```

This allows:
- Quick report regeneration without re-running expensive RFE
- Comparison of results across different analysis runs
- Incremental workflow: export → analyze → modify → re-export
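
Because a prior pickle may not exist on a first run, the load step can be guarded. The sketch below assumes a hypothetical helper, `load_prior_result`, and uses a plain dictionary in a temporary directory as a stand-in for the pickled `BayesianData` object and `PKL_DIR`.

```python
import pickle
import tempfile
from pathlib import Path

def load_prior_result(path):
    """Return the unpickled object at `path`, or None if the file is absent."""
    path = Path(path)
    if not path.exists():
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Demonstration with a temporary directory standing in for PKL_DIR
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "db_train300_test100.pkl"
    missing = load_prior_result(pkl_path)      # no file yet
    with open(pkl_path, "wb") as f:
        pickle.dump({"features": ["f1", "f2"]}, f)
    restored = load_prior_result(pkl_path)     # file now present
print(missing, restored)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.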

In [1]:
from datetime import datetime
import pickle

from numpy import random
from polars import DataFrame
import polars as pl
from tqdm import tqdm
from loguru import logger

from early_markers.cribsy.common.bayes import BayesianData
from early_markers.cribsy.common.adaptive_rfe import EnhancedAdaptiveRFE, validation_report
from early_markers.cribsy.common.constants import AGE_BRACKETS, MIN_K, IPC_DIR, PKL_DIR
from early_markers.cribsy.common.constants import RAND_STATE

# Configure sample sizes for synthetic data split
TRAIN_N = 300  # Training set: normative + at-risk samples
TEST_N = 100   # Test set: held-out at-risk samples

# Set seeds at file level for reproducibility
random.seed(RAND_STATE)

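# Export report from previously saved BayesianData (only if the pickle exists)
prior_pkl = PKL_DIR / f"db_train{TRAIN_N}_test{TEST_N}.pkl"
if prior_pkl.exists():
    with open(prior_pkl, "rb") as f:
        bd = pickle.load(f)
    bd.write_excel_report(f"synthetic_train{TRAIN_N}_test{TEST_N}")
else:
    logger.debug(f"No prior results at {prior_pkl}; skipping report export.")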

## 3.2 - Synthetic Data Loading and Iterative Feature Selection

This section loads synthetic data from `.ipc` format and performs iterative feature selection:

### Synthetic Data Initialization
```python
bd = BayesianData(base_file="synth_sdv_1000_long.ipc", 
                  train_n=TRAIN_N, 
                  test_n=TEST_N, 
                  augment=True)
```

**Key parameters**:
- `base_file`: Synthetic data in Apache Arrow IPC format
  - Generated by `02_synthetic_sdv.ipynb` using SDV library
  - Long format: `infant | category | risk | feature | value`
  - The `.ipc` file is read through Polars, which supports the Arrow IPC format natively
- `train_n=300`: Randomly samples 300 records for training distribution
- `test_n=100`: Randomly samples 100 at-risk records for testing
- `augment=True`: Enables data augmentation if needed
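
To make the long format concrete, the sketch below pivots `infant | category | risk | feature | value` rows into one record per infant using only the standard library. The `long_to_wide` helper, the feature names, and all values are made up for illustration; the notebook itself relies on `BayesianData` and Polars for this step.

```python
from collections import defaultdict

def long_to_wide(rows):
    """Pivot (infant, category, risk, feature, value) rows to one dict per infant."""
    wide = defaultdict(dict)
    for infant, category, risk, feature, value in rows:
        record = wide[infant]
        record["category"] = category
        record["risk"] = risk
        record[feature] = value  # one column per feature name
    return dict(wide)

# Toy rows; the feature names and values are hypothetical
rows = [
    ("inf_001", 1, 0, "feat_a", 0.12),
    ("inf_001", 1, 0, "feat_b", 3.40),
    ("inf_002", 2, 1, "feat_a", 0.31),
    ("inf_002", 2, 1, "feat_b", 2.95),
]
wide = long_to_wide(rows)
print(wide["inf_002"])
```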

### Convergence-Based Iteration
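Unlike the real data notebook, which stops at a fixed `MIN_K=7`, this notebook uses **convergence detection** and keeps `MIN_K` only as a safety floor:
```python
while True:
    features_in = len(features)   # count before this RFE pass
    features = bd.run_adaptive_rfe(prefix, features, tot_k=tot_k)
    features_out = len(features)  # count after this RFE pass
    if features_out == features_in:
        break  # Converged: no features removed
```

**Convergence logic**:
- Stops when an RFE pass returns as many features as it was given
- The comparison must use the count taken before the pass; comparing the post-pass count with itself at the top of the next loop would always match and end the loop after a single pass
- Indicates feature set has stabilized
- More adaptive than a fixed `MIN_K` threshold alone
- Typical result: Converges at 22 features from 23 starting features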
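
The convergence idea can be exercised without the real RFE by substituting a stub selector. In this sketch, `select_until_stable` and `drop_one_noise` are hypothetical names, and the stub simply removes one "noise" feature per pass; it illustrates the stopping rule only, not the Enhanced Adaptive RFE itself.

```python
def select_until_stable(features, select_once, min_k):
    """Repeat `select_once` until a pass removes nothing or `min_k` is reached.

    `select_once` is a hypothetical stand-in for one RFE pass: it takes a
    list of feature names and returns the subset it keeps.
    """
    passes = 0
    while True:
        n_in = len(features)            # count before this pass
        features = select_once(features)
        passes += 1
        n_out = len(features)           # count after this pass
        if n_out == n_in or n_out <= min_k:
            return features, passes

def drop_one_noise(features):
    """Toy selector: remove the first feature whose name starts with 'noise'."""
    for name in features:
        if name.startswith("noise"):
            return [f for f in features if f != name]
    return list(features)

start = ["sig_a", "noise_1", "sig_b", "noise_2", "sig_c"]
kept, passes = select_until_stable(start, drop_one_noise, min_k=2)
print(kept, passes)
```

Comparing counts is sufficient here only because each pass can remove features but never add or swap them; a selector that could swap features would need a set comparison instead.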

### RFE Iteration Steps
Same 3-step process as real data analysis:
1. **Enhanced Adaptive RFE**: `bd.run_adaptive_rfe(prefix, features, tot_k)`
   - Uses model prefix `syn_trn300_tst100` for result tracking
   - Applies 50 parallel trials with noise injection
   - Statistical significance testing (α=0.05)

2. **Bayesian Surprise**: `bd.run_surprise_with_features(prefix, features, overwrite=True)`
   - Computes surprise scores using selected features
   - `overwrite=True`: Replaces previous iteration's results

3. **ROC Metrics**: `bd.compute_roc_metrics(prefix, len(features))`
   - Evaluates sensitivity, specificity, AUC
   - Indexed by feature count for comparison
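
For reference, the three ROC quantities named in step 3 can be computed from scores and labels as sketched below. This assumes that a higher surprise score means "more likely at-risk" and that labels are 1 for at-risk and 0 for normative; the function names are illustrative, and the internals of `compute_roc_metrics` are not shown in this notebook.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Classify score >= threshold as positive; labels are 1 (at-risk) or 0."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores: three at-risk infants followed by three normative ones
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
print(sens, spec, auc(scores, labels))
```

The pairwise AUC is quadratic in sample size, which is acceptable for a 100-record test set but would be replaced by a rank-based formula for larger data.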

### Performance Characteristics
- **Per iteration**: ~1.45 minutes (faster than real data)
  - Smaller feature set (23 vs. 57 features)
  - Controlled sample sizes (300+100 vs. full dataset)
- **Typical convergence**: 1 iteration for well-conditioned synthetic data

### Method Name Note
⚠️ **LEGACY METHODS**: Uses `run_adaptive_rfe()` and `compute_roc_metrics()`
- Current API: `run_rfe_on_base()`, `run_metrics_from_surprise()`
- Should be updated in future revisions

In [2]:
# Initialize BayesianData with synthetic data from .ipc file
bd = BayesianData(base_file="synth_sdv_1000_long.ipc", train_n=TRAIN_N, test_n=TEST_N, augment=True)

start_time = datetime.now()
logger.debug(f"Starting Feature Selection...")

# Model prefix for tracking synthetic data results
prefix = f"syn_trn{TRAIN_N}_tst{TEST_N}"
features = bd.base_features  # Start with all synthetic data features
tot_k = len(features)

# Initialize iteration counter
tick = 1

# Iterate until an RFE pass removes no features
while True:
    features_in = len(features)  # count before this pass

    logger.debug(f"Trial {tick}: Features in: {features_in}...")

    # Step A: Enhanced Adaptive RFE with statistical significance testing
    logger.debug("Running adaptive RFE...")
    features = bd.run_adaptive_rfe(prefix, features, tot_k=tot_k)
    features_out = len(features)  # count after this pass

    # Step B: Compute Bayesian surprise using selected features
    logger.debug("Running surprise...")
    bd.run_surprise_with_features(prefix, features, overwrite=True)

    # Step C: Evaluate ROC performance
    logger.debug("Computing ROC...")
    metrics = bd.compute_roc_metrics(prefix, features_out)

    logger.debug(f"...Trial {tick}: Features out: {features_out}.")
    tick += 1

    # Check convergence: stop if this pass removed no features.
    # The check sits after the RFE call so that features_in (pre-pass)
    # and features_out (post-pass) describe different feature sets.
    if features_out == features_in:
        break

    # Safety termination: stop if reaching minimum threshold
    if features_out <= MIN_K:
        break

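stop_time = datetime.now()
# total_seconds() covers the full duration; .seconds drops whole days
elapsed_min = (stop_time - start_time).total_seconds() / 60
logger.debug(f"Completed Feature Selection in {elapsed_min:0.2f} Minutes.")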

## 3.3 - Results Export and Persistence

This section exports comprehensive results from the synthetic data analysis:

### Excel Report Generation
```python
bd.write_excel_report(f"synthetic_train{TRAIN_N}_test{TEST_N}_rfe")
```

**Output location**: `/Volumes/secure/data/early_markers/cribsy/xlsx/synthetic_train300_test100_rfe*.xlsx`

**Report contents**:
- **Summary sheet**: Aggregate metrics across all RFE iterations
- **Detail sheets**: Per-iteration ROC curves and feature lists
- **Feature progression**: Shows which features were retained at each step
- **Performance comparison**: Sensitivity, specificity, AUC by feature count

### BayesianData Persistence
```python
with open(PKL_DIR / f"db_train{TRAIN_N}_test{TEST_N}.pkl", "wb") as f:
    pickle.dump(bd, f)
```

**Saved object contents**:
- Complete `BayesianData` instance with all computed results
- All RFE iterations and selected feature sets
- Bayesian surprise metrics for each infant
- ROC performance metrics across all iterations
- Training and test data splits

**Usage in downstream analyses**:
- Load for quick report regeneration (as shown in Section 3.1)
- Compare synthetic vs. real data feature stability
- Validate algorithm performance on known distributions
- Test different sample size configurations

### Synthetic vs. Real Data Comparison
To compare results with real data analysis:
```python
# Load real data results
with open(PKL_DIR / "bd_real.pkl", 'rb') as f:
    bd_real = pickle.load(f)

# Load synthetic data results
with open(PKL_DIR / "db_train300_test100.pkl", 'rb') as f:
    bd_synthetic = pickle.load(f)

# Compare selected features
real_features = bd_real.rfe_features("real")
synth_features = bd_synthetic.rfe_features("syn_trn300_tst100")
common_features = set(real_features) & set(synth_features)
```

### Validation Metrics
Expected outcomes for well-performing synthetic data:
- **Feature overlap**: 60-80% common features between real and synthetic
- **Convergence speed**: Faster convergence indicates clean synthetic distributions
- **ROC performance**: Similar AUC values suggest good distribution matching
- **Feature stability**: Consistent selections across multiple synthetic runs
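
The feature-overlap figure depends on which denominator is used, and this notebook does not state one. As a sketch, the hypothetical helper below reports both the share of real-data features that also appear in the synthetic selection and the Jaccard index; the feature names are placeholders.

```python
def feature_overlap(real_features, synth_features):
    """Return the shared features plus two overlap ratios."""
    real, synth = set(real_features), set(synth_features)
    common = real & synth
    union = real | synth
    return {
        "common": sorted(common),
        # Fraction of the real-data selection recovered on synthetic data
        "share_of_real": len(common) / len(real) if real else 0.0,
        # Symmetric measure: shared features over all selected features
        "jaccard": len(common) / len(union) if union else 0.0,
    }

real = ["f1", "f2", "f3", "f4", "f5"]
synth = ["f2", "f3", "f4", "f5", "f9"]
result = feature_overlap(real, synth)
print(result)
```

Reporting the denominator alongside the percentage makes overlap figures comparable across runs with different numbers of selected features.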

### Next Steps
After running this notebook:
1. Compare Excel reports: `real_*.xlsx` vs. `synthetic_*.xlsx`
2. Validate feature selection consistency
3. Assess synthetic data quality for algorithm development
4. Consider generating additional synthetic datasets with different parameters

In [3]:
# Export comprehensive Excel report with RFE results
bd.write_excel_report(f"synthetic_train{TRAIN_N}_test{TEST_N}_rfe")

# Serialize complete BayesianData object for downstream analysis and comparison
with open(PKL_DIR / f"db_train{TRAIN_N}_test{TEST_N}.pkl", "wb") as f:
    pickle.dump(bd, f)