# Fast RFE Feature Selection (Optimized)

This is a performance-optimized version of `01_rfe_real.ipynb` with:
- Reduced number of RFE trials (10 instead of 50)
- Fewer Random Forest estimators (100 instead of 200)
- Configurable parallelization

**Performance Impact:**
- Original: ~30-60 minutes (50 trials × 15 folds × 200 trees = 150,000 tree fits)
- Optimized: ~5-10 minutes (10 trials × 15 folds × 100 trees = 15,000 tree fits)

**Trade-offs:**
- Faster execution (~10x fewer tree fits; roughly 6x less wall-clock time by the estimates above)
- Slightly less stable feature selection
- Still uses statistical significance testing

**For production:** Use the original parameters for final results.

## 1.1 Environment Setup and Configuration

This section initializes the environment with performance-optimized parameters for fast feature selection.

**Key Optimizations:**
- `FAST_N_TRIALS = 10`: Reduced from 50 (5x speedup in RFE trials)
- `FAST_N_ESTIMATORS = 100`: Reduced from 200 (2x speedup in Random Forest)
- Combined effect: ~10x fewer tree fits (5 × 2); the wall-clock speedup may be smaller

**Data Loading:**
- `BayesianData()` initializes with `features_merged.pkl` (default)
- Loads 1,250 infants × 56 features from `/Volumes/secure/code/early-markers/early_markers/emmacp_metrics/`
- Automatically transforms risk labels: `risk <= 1` → 0 (normal), `risk > 1` → 1 (at-risk)
- Category assignment: `category = 1` (training), `category = 2` (testing)
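
The label and split rules above can be illustrated with a small sketch. The records and field names here are made up for illustration and are not the actual `BayesianData` schema; only the thresholds (`risk > 1`) and category codes (1 = training, 2 = testing) come from the description above.

```python
# Hypothetical records; the field names are assumptions for illustration only.
records = [
    {"infant": "a", "risk": 0, "category": 1},
    {"infant": "b", "risk": 1, "category": 2},
    {"infant": "c", "risk": 2, "category": 1},
    {"infant": "d", "risk": 3, "category": 2},
]


def binarize_risk(risk: int) -> int:
    """Map risk <= 1 to 0 (normal) and risk > 1 to 1 (at-risk)."""
    return int(risk > 1)


# Binary label per infant, plus the train/test split by category code.
labels = {r["infant"]: binarize_risk(r["risk"]) for r in records}
train_ids = [r["infant"] for r in records if r["category"] == 1]
test_ids = [r["infant"] for r in records if r["category"] == 2]

print(labels)
print(train_ids, test_ids)
```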

**Random Seed:**
- `RAND_STATE = 20250313` ensures reproducibility across runs
- Set at module level before BayesianData initialization

**Feature Pre-filtering (Optional):**
- The `drops` list contains 34 features with historically low importance
- To start with 22 features instead of 56, replace `features = FEATURES` with the list comprehension shown in the trailing comment on that line
- Default: Uses all 56 features from `FEATURES` constant

**Custom `run_fast_rfe()` Function:**
- Wraps `EnhancedAdaptiveRFE` with fast parameters
- Implements convergence logic: adjusts `pct` if no features are eliminated
- Automatically calls `run_surprise_with_features()` after RFE completes
- Returns selected feature list for next iteration
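
The convergence logic described above can be sketched in isolation. This is a minimal illustration, not the `EnhancedAdaptiveRFE` implementation: `shrink_until_reduced` and `toy_select` are hypothetical names, and `toy_select` is a stand-in selector that simply refuses to drop anything while the keep fraction is above 0.75.

```python
from typing import Callable, List


def shrink_until_reduced(
    features: List[str],
    select: Callable[[List[str], float], List[str]],
    pct: float,
    step: float = 0.1,
    floor: float = 0.1,
) -> List[str]:
    """Re-run `select` with a smaller keep fraction until it drops a feature."""
    while True:
        kept = select(features, pct)
        if len(kept) < len(features):
            return kept  # at least one feature was eliminated
        pct -= step  # nothing eliminated: ask for a smaller subset
        if pct < floor:
            return kept  # give up; the caller receives the unreduced list


def toy_select(features: List[str], pct: float) -> List[str]:
    # Hypothetical selector: keeps everything while pct > 0.75.
    if pct > 0.75:
        return list(features)
    return features[: max(1, int(len(features) * pct))]


start = [f"f{i}" for i in range(10)]
result = shrink_until_reduced(start, toy_select, pct=0.9)
print(f"{len(start)} -> {len(result)} features")
```

Note that when the floor is reached without any elimination, the unreduced list is returned, so a caller that loops until the feature count drops may want its own guard against repeating the same step.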


In [None]:
from datetime import datetime
import pickle

from numpy import random
from loguru import logger

from early_markers.cribsy.common.bayes import BayesianData
from early_markers.cribsy.common.adaptive_rfe import EnhancedAdaptiveRFE
from early_markers.cribsy.common.constants import (
    AGE_BRACKETS, MIN_K, PKL_DIR, FEATURES, RAND_STATE,
    RFE_ALPHA, RFE_KEEP_PCT
)

# ============================================================================
# PERFORMANCE CONFIGURATION
# ============================================================================
# Adjust these for speed vs. accuracy trade-off

FAST_N_TRIALS = 10      # Down from 50 (5x faster)
FAST_N_ESTIMATORS = 100 # Down from 200 (2x faster)
# Combined: 10x speedup

# For production/final results, set these to match constants:
# FAST_N_TRIALS = 50
# FAST_N_ESTIMATORS = 200

# ============================================================================

# Set seeds at file level
random.seed(RAND_STATE)

bd = BayesianData()

start_time = datetime.now()
logger.info(f"Starting FAST Feature Selection (n_trials={FAST_N_TRIALS}, n_estimators={FAST_N_ESTIMATORS})...")

# Optional: Start with reduced feature set if you want even faster iteration
drops = ['Shoulder_IQR_vel_angle', 'Ankle_IQRaccx', 'Wrist_IQRaccx', 'Ankle_IQRvelx', 
         'Knee_IQR_vel_angle', 'Elbow_IQR_acc_angle', 'Shoulder_mean_angle', 'Ankle_IQRaccy', 
         'Shoulder_lrCorr_angle', 'Hip_entropy_angle', 'Elbow_mean_angle', 'Eye_lrCorr_x', 
         'Shoulder_entropy_angle', 'Knee_entropy_angle', 'Shoulder_IQR_acc_angle', 'Ankle_lrCorr_x', 
         'Hip_lrCorr_angle', 'Wrist_meanent', 'Wrist_IQRvelx', 'Wrist_mediany', 'Ankle_IQRvely', 
         'Shoulder_stdev_angle', 'Hip_IQR_acc_angle', 'Elbow_stdev_angle', 'Knee_IQR_acc_angle', 
         'Ankle_meanent', 'Ankle_medianx', 'Wrist_IQRy', 'Knee_lrCorr_angle', 'Hip_IQR_vel_angle', 
         'Elbow_IQR_vel_angle', 'Wrist_IQRaccy', 'Wrist_IQRvely', 'Elbow_lrCorr_x']

features = FEATURES  # or: [f for f in FEATURES if f not in drops]
tot_k = len(features)

# Custom RFE function with fast parameters
def run_fast_rfe(bd, model_prefix, features, tot_k):
    """Fast version of run_adaptive_rfe with custom parameters."""
    import polars as pl
    from early_markers.cribsy.common.bayes import BayesianRfeResult
    
    if bd._frames is None:
        raise AttributeError("DataFrames are not set.")
    
    frames = bd._frames.get(bd.base_model_name)
    if frames is None:
        raise AttributeError("Base DataFrames are not set.")
    
    df_raw = frames.train
    df_surprise = frames.train_surprise
    
    df_rfe = df_raw.join(df_surprise, on="infant", how="inner").sort(["infant", "feature"]).filter(pl.col("feature").is_in(features))
    
    # Generate training samples for RFE
    df_x = df_rfe.pivot(index='infant', on='feature', values='value').drop("infant").to_pandas()
    y = df_rfe.group_by('infant', maintain_order=True).agg(pl.col('z').first()).select("z").to_pandas()["z"]
    
    in_time = datetime.now()
    pct = RFE_KEEP_PCT - (1 - len(features)/tot_k) / 2
    
    while True:
        # Use FAST parameters
        selector = EnhancedAdaptiveRFE(
            n_trials=FAST_N_TRIALS,        # Fast: 10 instead of 50
            alpha=RFE_ALPHA,                # Keep same statistical threshold
            n_estimators=FAST_N_ESTIMATORS  # Fast: 100 instead of 200
        )
        selector.fit(df_x, y, pct)
        new_features = selector.get_significant_features()
        
        if len(new_features) == len(features):
            pct -= 0.1
            if pct < 0.1:
                break
            logger.debug(f"No reduction. Adjusting target pct to {pct:0.2f}")
        else:
            break
    
    out_time = datetime.now()
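    # total_seconds() keeps sub-second precision and does not wrap after 24 hours
    elapsed_min = (out_time - in_time).total_seconds() / 60
    logger.info(f"Selected {len(new_features)} features in {elapsed_min:.2f} minutes")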
    
    rfe_name = f"{model_prefix}_k_{len(new_features)}"
    if bd._rfes is None:
        bd._rfes = {}
    bd._rfes[rfe_name] = result = BayesianRfeResult(
        name=rfe_name,
        k=len(new_features),
        features=new_features,
    )
    bd.run_surprise_with_features(model_prefix, result.features)
    return result.features

# Main loop
tick = 1
iteration_times = []

while True:
    iter_start = datetime.now()
    logger.info(f"\n{'='*60}")
    logger.info(f"Iteration {tick}: Starting with {len(features)} features")
    logger.info(f"{'='*60}")
    
    # Run RFE
    features = run_fast_rfe(bd, "real", features, tot_k)
    
    # Run surprise
    logger.debug("Computing surprise scores...")
    bd.run_surprise_with_features("real", features, overwrite=True)
    
    # Compute ROC metrics
    logger.debug("Computing ROC metrics...")
    metrics = bd.compute_roc_metrics("real", len(features))
    
    iter_time = (datetime.now() - iter_start).total_seconds() / 60
    iteration_times.append(iter_time)
    logger.info(f"Iteration {tick} complete: {len(features)} features in {iter_time:.2f} minutes")
    
    tick += 1
    if len(features) <= MIN_K:
        logger.info(f"Reached minimum feature count ({MIN_K}). Stopping.")
        break

stop_time = datetime.now()
total_time = (stop_time - start_time).total_seconds() / 60
logger.info(f"\n{'='*60}")
logger.info(f"COMPLETE: Feature selection finished in {total_time:.2f} minutes")
logger.info(f"Iterations: {tick-1}")
logger.info(f"Average time per iteration: {sum(iteration_times)/len(iteration_times):.2f} minutes")
logger.info(f"Final feature count: {len(features)}")
logger.info(f"{'='*60}")

# Generate report
logger.info("Generating Excel report...")
bd.write_excel_report("real")

# Save results
logger.info("Saving results...")
with open(PKL_DIR / "bd_real_fast.pkl", "wb") as f:
    pickle.dump(bd, f)

# Summary statistics
all_features = [f for m in bd.metrics_names for f in bd.metrics(m).features]
print(f"\nAll Features in Models: {len(all_features)}")
keeps = sorted(set(all_features))  # unique feature names across all models
print(f"Deduped Features: {len(keeps)}")
print(f"\nSelected Features:\n{keeps}")

print(f"\nBase Features not in Dropped:\n{[f for f in bd.base_features if f not in drops]}")

# Features that appear both in the selected models and in the base model
common = sorted(set(keeps) & set(bd.base_features))
print(f"\nCommon features ({len(common)}):\n{common}")

## 1.2 Iterative Feature Selection Loop

The main loop iteratively refines the feature set using a 3-step algorithm until the feature count drops to `MIN_K` or below (`MIN_K` is imported from `early_markers.cribsy.common.constants`).

**Algorithm Flow (per iteration):**

1. **Feature Selection** (`run_fast_rfe`):
   - Creates `EnhancedAdaptiveRFE` selector with fast parameters
   - Runs 10 parallel trials (each with 15-fold cross-validation)
   - Uses binomial test (α = 0.05) to identify statistically significant features
   - Target: Keep 90% of features (adjusted dynamically)
   - Returns list of significant features

2. **Surprise Computation** (`run_surprise_with_features`):
   - Computes Bayesian surprise scores for each infant
   - Calculates negative log-likelihood: `minus_log_p = Σ(-log P_i)` across selected features
   - Standardizes to z-scores: `z = (minus_log_p - μ_train) / σ_train`
   - Converts to p-values: `p = 2 * SF(|z|)` where SF is survival function
   - Higher z-scores indicate greater deviation from normative patterns

3. **Metrics Evaluation** (`compute_roc_metrics`):
   - **Note**: This is the legacy method name; target API is `run_metrics_from_surprise()`
   - Computes ROC curve analysis on test data
   - Calculates AUC, sensitivity, specificity at optimal threshold
   - Stores metrics for Excel report generation
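
The surprise formulas in step 2 can be written out as a short sketch. It assumes per-feature log-probabilities are already available and that `mu_train` and `sigma_train` are the mean and standard deviation of `minus_log_p` over the training infants; how `BayesianData` actually obtains those values is not shown here. The function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def surprise_z_and_p(
    log_probs: Sequence[float], mu_train: float, sigma_train: float
) -> Tuple[float, float, float]:
    """Return (minus_log_p, z, p) for one infant.

    `log_probs` holds log P_i for each selected feature.
    """
    minus_log_p = -sum(log_probs)
    z = (minus_log_p - mu_train) / sigma_train
    # Two-sided p-value under a standard normal:
    # 2 * SF(|z|) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return minus_log_p, z, p


# An infant whose summed surprise sits 1.96 training SDs above the training mean.
minus_log_p, z, p = surprise_z_and_p([-1.5, -2.46], mu_train=2.0, sigma_train=1.0)
print(f"minus_log_p={minus_log_p:.2f}, z={z:.2f}, p={p:.3f}")
```

Using `math.erfc` avoids a SciPy dependency; `2 * scipy.stats.norm.sf(abs(z))` computes the same quantity.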

**Convergence Criteria:**
- Loop terminates when `len(features) <= MIN_K`
- Each iteration typically reduces features by 10-30%
- Fast mode: Expected 5-10 minutes total runtime
- Standard mode would take 30-60 minutes with the same convergence path

**Performance Tracking:**
- `iteration_times` list records duration of each iteration
- Average time per iteration logged at completion
- Total runtime calculated from start to stop time

**Typical Fast Mode Performance:**
- 8-10 iterations to reach MIN_K
- ~0.5-1.5 minutes per iteration
- Feature progression similar to standard mode but with slightly more variability


## 1.3 Results Export and Analysis

This section exports results and provides summary statistics for the fast feature selection run.

**Excel Report Generation:**
- `write_excel_report("real")` creates formatted workbook: `real_*.xlsx`
- Output location: `/Volumes/secure/data/early_markers/cribsy/xlsx/`
- Contains multiple sheets:
  - **Summary**: Overall metrics across all models
  - **Detail**: Per-infant results with z-scores and p-values
  - **ROC**: Sensitivity/specificity at various thresholds
  - **Features**: Selected features for each model

**Pickle Serialization:**
- Saves complete `BayesianData` object to `bd_real_fast.pkl`
- Location: `/Volumes/secure/data/early_markers/cribsy/pkl/`
- Contains all computed results:
  - RFE results (`_rfes` dictionary)
  - Surprise scores (`_surprise` dictionary)
  - ROC metrics (`_metrics` dictionary)
  - Base DataFrames (`_base`, `_base_train`, `_base_test`)
- Can be loaded in subsequent notebooks for downstream analysis
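
A minimal sketch of the save/load round trip follows. It uses a plain dictionary and a temporary directory as stand-ins for the `BayesianData` object and `PKL_DIR`; loading the real pickle additionally requires the `early_markers` package to be importable, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the BayesianData object (hypothetical structure).
results = {"rfes": {"real_k_2": ["Ankle_medianx", "Wrist_IQRy"]}, "metrics": {}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bd_real_fast.pkl"

    # Save, as the notebook does with pickle.dump(bd, f)
    with open(path, "wb") as f:
        pickle.dump(results, f)

    # Load in a later session or notebook
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == results)
```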

**Summary Statistics:**
- **All Features in Models**: Total features across all iterations (with duplicates)
- **Deduped Features**: Unique features that appeared in any model
- **Selected Features**: Sorted list of the deduplicated feature names (the union across all models, not only the final iteration)
- **Common Features**: Intersection with base model features
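
The four summary quantities reduce to a few set operations. The sketch below uses hypothetical per-model feature lists (names borrowed from this notebook's `drops` list purely as examples) in place of `bd.metrics(m).features` and `bd.base_features`.

```python
# Hypothetical inputs standing in for bd.metrics(m).features and bd.base_features.
model_features = {
    "real_k_3": ["Wrist_IQRy", "Ankle_medianx", "Eye_lrCorr_x"],
    "real_k_2": ["Wrist_IQRy", "Ankle_medianx"],
}
base_features = ["Ankle_medianx", "Hip_entropy_angle", "Wrist_IQRy"]

# All features across models, duplicates included
all_features = [f for feats in model_features.values() for f in feats]
# Unique features, sorted for stable output
deduped = sorted(set(all_features))
# Features shared with the base model
common = sorted(set(deduped) & set(base_features))

print(len(all_features), len(deduped), common)
```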

**Interpretation:**
- Fast mode results should be validated against standard mode
- Feature lists should be similar (70-90% overlap expected)
- ROC metrics may differ by ±0.02-0.05 AUC points
- For publication/production, always re-run with standard parameters (50 trials, 200 estimators)

**Next Steps:**
- Compare with `bd_real.pkl` from `01_rfe_real.ipynb` (standard mode)
- Validate feature stability across modes
- If results diverge significantly, use standard mode results
- Use fast mode primarily for development and rapid iteration
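
One way to quantify feature stability across modes is sketched below. The notebook does not define how "overlap" is measured, so this example reports two common choices as an assumption: the fraction of standard-mode features recovered by fast mode, and the Jaccard index. The function name and the example lists are hypothetical.

```python
from typing import Iterable, List, Tuple


def feature_overlap(
    fast: Iterable[str], standard: Iterable[str]
) -> Tuple[List[str], float, float]:
    """Return (shared features, fraction of standard recovered, Jaccard index)."""
    fast_set, std_set = set(fast), set(standard)
    shared = fast_set & std_set
    union = fast_set | std_set
    recovered = len(shared) / len(std_set) if std_set else 0.0
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), recovered, jaccard


# Hypothetical final feature lists from the two runs
fast_features = ["Ankle_medianx", "Wrist_IQRy", "Eye_lrCorr_x", "Wrist_mediany"]
standard_features = ["Ankle_medianx", "Wrist_IQRy", "Eye_lrCorr_x", "Hip_entropy_angle"]

shared, recovered, jaccard = feature_overlap(fast_features, standard_features)
print(f"shared={shared}, recovered={recovered:.0%}, jaccard={jaccard:.2f}")
```

In practice the two lists would come from the `BayesianData` objects unpickled from `bd_real_fast.pkl` and `bd_real.pkl`.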


## 1.4 Performance Notes

### Speed vs. Accuracy Trade-off

| Configuration | N Trials | N Estimators | Total Tree Fits (trials × 15 folds × estimators) | Approx Time | Use Case |
|---------------|----------|--------------|--------------|-------------|----------|
| **Fast** (this notebook) | 10 | 100 | 15,000 | 5-10 min | Development, iteration |
| **Standard** (original) | 50 | 200 | 150,000 | 30-60 min | Production, final results |
| **Ultra-fast** (testing) | 5 | 50 | 3,750 | 2-3 min | Quick tests only |
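
The totals in the table follow from a simple product, reproduced below as a quick check. The count is a rough proxy for cost per RFE pass; it assumes 15 folds per trial as stated in this notebook and ignores any refitting done inside each elimination step.

```python
N_FOLDS = 15  # folds per trial, as stated in this notebook

# (n_trials, n_estimators) per configuration, taken from the table above
configs = {"Fast": (10, 100), "Standard": (50, 200), "Ultra-fast": (5, 50)}

tree_fits = {
    name: trials * N_FOLDS * trees for name, (trials, trees) in configs.items()
}
for name, total in tree_fits.items():
    print(f"{name}: {total:,} tree fits")

# Reduction in tree fits when moving from standard to fast mode
ratio = tree_fits["Standard"] / tree_fits["Fast"]
print(f"Standard / Fast = {ratio:.0f}x")
```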

### Additional Optimizations

1. **Pre-filter features**: Start with a reduced feature set (uncomment the `drops` filtering)
2. **Adjust MIN_K**: Set a higher minimum feature count to stop earlier
3. **Parallel jobs**: The code uses `RFE_N_JOBS=12` by default; adjust it in `constants.py`
4. **Single iteration**: For quick testing, run only one iteration instead of the full loop

### Validation

After running the fast version:
- Features should still pass statistical significance tests (binomial test, p < 0.05)
- Check ROC metrics are reasonable
- For final publication, re-run with standard parameters
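
To make the significance check concrete, here is a sketch of a one-sided binomial test on how often a feature is selected across trials. The null selection probability `p0 = 0.5`, the one-sided form, and the function names are assumptions for illustration; the actual test inside `EnhancedAdaptiveRFE` may use a different null or counting scheme.

```python
from math import comb


def binom_sf_inclusive(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0), i.e. a one-sided p-value."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def is_significant(
    times_selected: int, n_trials: int, p0: float = 0.5, alpha: float = 0.05
) -> bool:
    """True if the feature was selected more often than chance would explain."""
    return binom_sf_inclusive(times_selected, n_trials, p0) < alpha


# With 10 trials (fast mode), list the p-value for each selection count
for k in range(7, 11):
    p_value = binom_sf_inclusive(k, 10, 0.5)
    print(f"selected {k}/10: p={p_value:.4f}, significant={is_significant(k, 10)}")
```

Under these assumptions a feature must be selected in at least 9 of 10 trials to pass at α = 0.05, which is one way to see why fewer trials make the selection less stable.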