# 1 - Real Data: Iterative Feature Selection with Enhanced Adaptive RFE

**Purpose**: This notebook performs comprehensive feature selection on real infant movement data using an iterative Enhanced Adaptive Recursive Feature Elimination (RFE) approach. It progressively reduces the feature space while computing Bayesian surprise metrics and ROC performance at each iteration.

**Inputs**:
- Real movement data loaded from `/Volumes/secure/code/early-markers/early_markers/emmacp_metrics/features_merged.pkl`
- 57 baseline movement features (position, velocity, acceleration, entropy, correlation metrics)
- Configuration constants from `early_markers.cribsy.common.constants`

**Outputs**:
- `/Volumes/secure/data/early_markers/cribsy/pkl/bd_real.pkl` - Complete BayesianData object with all models
- `/Volumes/secure/data/early_markers/cribsy/xlsx/real_*.xlsx` - Excel reports with metrics and summaries
- Console output showing feature progression and final selected features

**Key Dependencies**:
- `BayesianData`: Core class managing data loading, RFE, surprise computation, and metrics
- `EnhancedAdaptiveRFE`: Statistical feature selection with noise injection and consensus testing
- Random seed: `RAND_STATE = 20250313` for reproducibility

**Workflow Overview**:
1. Initialize BayesianData with real movement data
2. Iteratively apply Enhanced Adaptive RFE to reduce feature count
3. Compute Bayesian surprise metrics after each RFE iteration
4. Calculate ROC performance (sensitivity, specificity, AUC)
5. Continue until minimum feature threshold (`MIN_K = 7`) is reached
6. Export comprehensive results to Excel and pickle formats

**Note**: This is a computationally intensive notebook. Full execution takes approximately 20-25 minutes with 50 RFE trials per iteration.

## 1.1 - Environment Setup and Data Loading

This section initializes the analysis environment:
- Seeds NumPy's global random generator with `RAND_STATE` so that NumPy-based randomness is reproducible
- Imports core dependencies:
  - `BayesianData`: Primary data management and analysis class
  - `RAND_STATE`: Global random seed (20250313)
  - `MIN_K`: Minimum feature count threshold (7 features)
  - `PKL_DIR`: Output directory for pickle files
  - `FEATURES`: Complete list of 57 baseline movement features
- Initializes `BayesianData()` which automatically:
  - Loads `features_merged.pkl` from the raw data directory
  - Transforms risk labels (0=normal, 1=at-risk)
  - Splits data into training (category=1) and test (category=2) sets
  - Prepares both long and wide format DataFrames for analysis

The `drops` list contains features previously identified as unstable or redundant. It is not applied to the selection itself, which starts from all `FEATURES`; it is used only for a comparison printout in section 1.3.

In [1]:
from datetime import datetime
import pickle

from numpy import random
from loguru import logger

from early_markers.cribsy.common.bayes import BayesianData
from early_markers.cribsy.common.constants import FEATURES, MIN_K, PKL_DIR, RAND_STATE

# Seed NumPy's global random generator for reproducibility
random.seed(RAND_STATE)

# Initialize BayesianData - loads real data from features_merged.pkl
bd = BayesianData()

start_time = datetime.now()
logger.debug("Starting Feature Selection...")

# Feature list note: 'drops' contains previously identified unstable features
# but we use all FEATURES for comprehensive analysis
drops = ['Shoulder_IQR_vel_angle', 'Ankle_IQRaccx', 'Wrist_IQRaccx', 'Ankle_IQRvelx', 'Knee_IQR_vel_angle', 'Elbow_IQR_acc_angle', 'Shoulder_mean_angle', 'Ankle_IQRaccy', 'Shoulder_lrCorr_angle', 'Hip_entropy_angle', 'Elbow_mean_angle', 'Eye_lrCorr_x', 'Shoulder_entropy_angle', 'Knee_entropy_angle', 'Shoulder_IQR_acc_angle', 'Ankle_lrCorr_x', 'Hip_lrCorr_angle', 'Wrist_meanent', 'Wrist_IQRvelx', 'Wrist_mediany', 'Ankle_IQRvely', 'Shoulder_stdev_angle', 'Hip_IQR_acc_angle', 'Elbow_stdev_angle', 'Knee_IQR_acc_angle', 'Ankle_meanent', 'Ankle_medianx', 'Wrist_IQRy', 'Knee_lrCorr_angle', 'Hip_IQR_vel_angle', 'Elbow_IQR_vel_angle', 'Wrist_IQRaccy', 'Wrist_IQRvely', 'Elbow_lrCorr_x']

features = FEATURES  # Start with all 57 baseline features
tot_k = len(features)

## 1.2 - Iterative Feature Selection Loop

This section implements the core iterative feature selection algorithm:

### Algorithm Overview
1. **Initialize**: Start with all 57 features from `FEATURES` constant
2. **Iterate**: For each outer iteration (logged as a "trial"; typically 8 are needed to reach `MIN_K = 7`):
   - **Step A - Enhanced Adaptive RFE**: `bd.run_adaptive_rfe()`
     - Performs 50 parallel RFE trials with noise injection
     - Uses Random Forest (200 trees) with 3x5 repeated cross-validation
     - Applies binomial test (α=0.05) to identify statistically significant features
     - Retains ~90% of features that pass significance testing
     - Returns the reduced feature list; in the example progression under Performance, the share removed per iteration ranges from about 5% (57 → 54) to about 53% (15 → 7)
   
   - **Step B - Bayesian Surprise Computation**: `bd.run_surprise_with_features()`
     - Computes negative log-likelihood for each infant using selected features
     - Calculates z-scores: `z = (NLL - μ_train) / σ_train`
     - Converts to p-values: `p = 2 * SF(|z|)` where SF is survival function
     - Stores surprise metrics for downstream ROC analysis
   
   - **Step C - ROC Metrics Evaluation**: `bd.compute_roc_metrics()`
     - Computes sensitivity, specificity across p-value thresholds
     - Calculates AUC (Area Under ROC Curve)
     - Stores performance metrics indexed by feature count

3. **Termination**: Loop continues until feature count ≤ `MIN_K` (7 features)
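
The three calculations behind Steps A to C can be sketched in miniature with the standard library alone. This is an illustration, not the `BayesianData` implementation: the function names are hypothetical, the binomial test is assumed to be one-sided against a null selection rate of 0.5, the training standard deviation is assumed to be the sample standard deviation, an infant is assumed to be flagged as at-risk when `p` falls below the threshold, and all numbers are invented.

```python
import math
from statistics import mean, stdev


def binomial_tail(k, n, p0=0.5):
    """One-sided exact binomial p-value: P(X >= k) for X ~ Binomial(n, p0).

    Assumption: p0 = 0.5 is a placeholder null rate, not a documented value.
    """
    return sum(math.comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def surprise_p_values(train_nll, nll):
    """Return (z, p) pairs with z = (NLL - mu_train) / sigma_train and p = 2 * SF(|z|)."""
    mu, sigma = mean(train_nll), stdev(train_nll)
    results = []
    for value in nll:
        z = (value - mu) / sigma
        # Standard normal survival function: SF(t) = 0.5 * erfc(t / sqrt(2)),
        # so the two-sided p-value 2 * SF(|z|) equals erfc(|z| / sqrt(2)).
        p = math.erfc(abs(z) / math.sqrt(2))
        results.append((z, p))
    return results


def sensitivity_specificity(p_values, labels, threshold):
    """Sensitivity and specificity when p < threshold flags an infant (label 1 = at-risk)."""
    flagged = [p < threshold for p in p_values]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    tn = sum(1 for f, y in zip(flagged, labels) if not f and y == 0)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Step A: is a feature chosen in 32 of 50 RFE trials more often than chance?
p_consensus = binomial_tail(32, 50)
keep_feature = p_consensus < 0.05

# Step B: invented negative log-likelihoods for training and test infants
train_nll = [10.0, 11.0, 9.0, 10.5, 9.5]
test_nll = [10.2, 13.0, 9.8, 12.5]
test_labels = [0, 1, 0, 1]
p_values = [p for _, p in surprise_p_values(train_nll, test_nll)]

# Step C: one point on the ROC curve, at a p-value threshold of 0.05
sens, spec = sensitivity_specificity(p_values, test_labels, threshold=0.05)
print(f"consensus p={p_consensus:.4f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Sweeping `threshold` over a grid and collecting the resulting (1 - specificity, sensitivity) pairs traces out the ROC curve from which an AUC can be computed.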

### Method Name Note
⚠️ **IMPORTANT**: The code cells in this notebook call legacy method names that have since been renamed:
- `run_adaptive_rfe()` is now `run_rfe_on_base()`
- `compute_roc_metrics()` is now `run_metrics_from_surprise()`

This notebook predates the API update; the calls should be updated in a future revision.
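
Until the calls are migrated, one way to stay compatible with either API version is to look the method up by name at run time. The sketch below is illustrative only: `resolve_method` and both stub classes are hypothetical, and it assumes the renamed methods accept the same arguments as the legacy ones, which this notebook does not confirm.

```python
def resolve_method(obj, *names):
    """Return the first callable attribute of obj found among names."""
    for name in names:
        method = getattr(obj, name, None)
        if callable(method):
            return method
    raise AttributeError(f"{type(obj).__name__} has none of: {', '.join(names)}")


class LegacyStub:
    """Hypothetical stand-in exposing only the legacy method name."""

    def run_adaptive_rfe(self, model, features, tot_k):
        return features[:-1]  # pretend RFE dropped the last feature


class CurrentStub:
    """Hypothetical stand-in exposing only the current method name."""

    def run_rfe_on_base(self, model, features, tot_k):
        return features[:-1]


results = []
for stub in (LegacyStub(), CurrentStub()):
    # Prefer the current name and fall back to the legacy one.
    run_rfe = resolve_method(stub, "run_rfe_on_base", "run_adaptive_rfe")
    results.append(run_rfe("real", ["a", "b", "c"], 3))

print(results)
```

Listing the current name first means the fallback becomes dead code once the legacy method is removed, at which point the helper can be deleted.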

### Performance
- **Per iteration**: 2-4 minutes (depends on feature count)
- **Total runtime**: ~22 minutes for 8 iterations (57 → 54 → 48 → 45 → 40 → 34 → 24 → 15 → 7)
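
The progression and runtime quoted above can be checked with a few lines of arithmetic. The snippet uses only the numbers stated in this section; the variable names are illustrative. It shows that the share of features removed per iteration is not constant: it ranges from roughly 5% at the first step to roughly 53% at the last.

```python
progression = [57, 54, 48, 45, 40, 34, 24, 15, 7]
MIN_K = 7
TOTAL_MINUTES = 22  # approximate total runtime quoted above

# Pair each feature count with the count after the next iteration.
steps = list(zip(progression, progression[1:]))
removed_fractions = [(before - after) / before for before, after in steps]

for (before, after), fraction in zip(steps, removed_fractions):
    print(f"{before:>2} -> {after:>2}: removed {before - after:>2} ({fraction:.1%})")

# The count must fall at every step and end at or below the threshold.
assert all(after < before for before, after in steps)
assert progression[-1] <= MIN_K

minutes_per_iteration = TOTAL_MINUTES / len(steps)
print(f"{len(steps)} iterations, about {minutes_per_iteration:.2f} minutes each")
```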

In [2]:
tick = 1
while True:
    logger.debug(f"Trial {tick}: Features in: {len(features)}...")
    
    # Step A: Enhanced Adaptive RFE with statistical significance testing
    logger.debug("Running adaptive RFE...")
    features = bd.run_adaptive_rfe("real", features, tot_k)
    
    # Step B: Compute Bayesian surprise using selected features
    logger.debug("Running surprise...")
    bd.run_surprise_with_features("real", features, overwrite=True)
    
    # Step C: Evaluate ROC performance (sensitivity, specificity, AUC)
    logger.debug("Computing ROC...")
    metrics = bd.compute_roc_metrics("real", len(features))
    
    logger.debug(f"...Trial {tick}: Features out: {len(features)}.")
    tick += 1
    
    # Termination condition: stop when feature count reaches minimum threshold
    if len(features) <= MIN_K:
        break

stop_time = datetime.now()
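# total_seconds() covers the full duration; .seconds drops any whole days
elapsed_minutes = (stop_time - start_time).total_seconds() / 60
logger.debug(f"Completed Feature Selection in {elapsed_minutes:0.2f} Minutes.")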

## 1.3 - Results Export and Analysis

This section exports comprehensive results and performs feature analysis:

### Export Operations
1. **Excel Report**: `bd.write_excel_report("real")`
   - Generates `/Volumes/secure/data/early_markers/cribsy/xlsx/real_*.xlsx`
   - Contains multiple worksheets:
     - Summary: Overall performance metrics across all iterations
     - Detail: Per-iteration ROC curves and feature lists
     - Features: Complete feature selection progression
   - Uses conditional formatting to make the metrics easier to scan

2. **Pickle Serialization**: `pickle.dump(bd, f)`
   - Saves complete `BayesianData` object to `bd_real.pkl`
   - Preserves all models, metrics, and intermediate results
   - Used as input for subsequent analysis notebooks (e.g., `04_bam_sample_size.ipynb`)
   - Can be loaded with: `with open(PKL_DIR / "bd_real.pkl", 'rb') as f: bd = pickle.load(f)`
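
A pickle round trip can be sketched as follows, using a plain dictionary as a stand-in for the `BayesianData` object and a temporary directory in place of `PKL_DIR`; both substitutions are assumptions made so the example runs anywhere. Two caveats apply to the real file: unpickling requires that the `BayesianData` class be importable in the loading environment, and `pickle.load` can execute arbitrary code, so only files from a trusted source should be loaded.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the BayesianData object; the keys are invented for illustration.
stand_in = {"model": "real", "feature_counts": [54, 48, 45]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bd_real.pkl"

    # Write in binary mode, as in the export cell.
    with open(path, "wb") as fh:
        pickle.dump(stand_in, fh)

    # Read it back the way a downstream notebook would.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == stand_in)
```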

### Feature Analysis
The console output provides four summaries:

1. **All Features in Models**: Total count across all iterations (with duplicates)
2. **Deduped Features**: Unique features selected at any iteration
   - Shows which features survived at least one iteration
   - Typical result: ~54 unique features from 8 models
3. **Base Features not in Dropped**: Baseline features that are absent from the `drops` list
4. **Common Features**: Intersection analysis
   - Identifies features appearing in both base and selected sets
   - Serves as a consistency check: every selected feature should also appear in the base set

### Interpreting Results
- **High feature stability**: Features that survive many iterations are candidates for robust predictors. Survival alone is not proof, because each iteration selects from the previous iteration's output.
- **Progressive reduction**: Feature count should decrease monotonically (57→54→48→45→40→34→24→15→7)
- **Final feature set**: The 7-feature model is the smallest one this procedure evaluates (`MIN_K = 7`); compare its ROC metrics with those of the larger models before treating it as sufficient
- **Body part distribution**: Expect features from multiple body parts (Ankle, Wrist, Knee, Hip, Shoulder)
- **Metric types**: Expect a mix of position, velocity, acceleration, entropy, and correlation features
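
Feature stability and body-part distribution can both be tallied with `collections.Counter`. The sketch below is a toy example: the model keys and the per-model feature lists are invented (the feature names are borrowed from this notebook only for their naming pattern), and it assumes the body part is the text before the first underscore.

```python
from collections import Counter

# Hypothetical per-model feature lists, largest model first.
models = {
    "k4": ["Ankle_IQRvelx", "Wrist_meanent", "Hip_entropy_angle", "Knee_lrCorr_angle"],
    "k3": ["Ankle_IQRvelx", "Wrist_meanent", "Hip_entropy_angle"],
    "k2": ["Ankle_IQRvelx", "Wrist_meanent"],
}

# How many models does each feature appear in?
selection_counts = Counter(feat for feats in models.values() for feat in feats)

# Features present in every model survived the whole reduction.
in_every_model = sorted(
    feat for feat, count in selection_counts.items() if count == len(models)
)

# Body-part distribution of the smallest model.
body_parts = Counter(feat.split("_", 1)[0] for feat in models["k2"])

print(selection_counts.most_common())
print(in_every_model)
print(dict(body_parts))
```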

In [3]:
# Export comprehensive Excel report with all metrics and summaries
bd.write_excel_report("real")

# Serialize complete BayesianData object for downstream analysis
with open(PKL_DIR / "bd_real.pkl", "wb") as f:
    pickle.dump(bd, f)

# Feature Analysis: Extract all features used across all models
all_features = [feat for m in bd.metrics_names for feat in bd.metrics(m).features]
print(f"All Features in Models: {len(all_features)}")

# Deduplicate to find unique features selected at any iteration
keeps = list(set(all_features))
print(f"\nDeduped Features: {len(keeps)}")
keeps.sort()
print(f"\nDeduped:\n{keeps}")

# Compare with baseline features (excluding manually dropped features)
print(f"\nBase Features not in Dropped:\n{[f for f in bd.base_features if f not in drops]}")

# Identify features common to both selected and baseline sets
common = sorted(set(keeps) & set(bd.base_features))
print(f"\ncommon features ({len(common)}):\n{common}")