# 02 - Bayesian Surprise and Feature Selection

**Purpose**: This notebook is the analytical core of the project. It orchestrates the main feature selection and Bayesian surprise analysis pipeline on both real and synthetic data. It iteratively reduces the feature set and evaluates model performance at each step.

**Inputs**:
- Real data, accessed by instantiating the `BayesianData` class with default parameters.
- Synthetic data (`synth_sdv_1000_long.ipc`), used in the second half of the notebook.

**Outputs**:
- `bd_real.pkl`: A pickled `BayesianData` object containing the complete state of the analysis on real data (feature lists, surprise scores, ROC metrics).
- `db_train{N}_test{N}.pkl`: A pickled `BayesianData` object for the synthetic data analysis.
- Excel reports (`real_report.xlsx`, `synthetic_...xlsx`) summarizing the performance of models at each stage of feature reduction.

### Key Sections:
1.  **Developer Notes**: Initial thoughts and a checklist from the development process.
2.  **Analysis on Real Data**: An iterative loop that:
    a. Runs `EnhancedAdaptiveRFE` to find the best features.
    b. Computes Bayesian surprise scores with the selected features.
    c. Calculates ROC metrics to evaluate performance.
    d. Reduces the feature set and repeats until a minimum number of features is reached.
3.  **Analysis on Synthetic Data**: Repeats the same iterative process on a synthetic dataset to validate the methodology.
4.  **Result Aggregation**: Consolidates and prints the final selected feature sets.
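
The surprise computation itself lives inside `BayesianData` and is not shown in this notebook. As a rough illustration only, one common formulation scores an observation by its negative log-likelihood under independent Gaussians fitted to a reference group. The sketch below assumes that formulation; the function names and toy values are hypothetical, and the project's actual method may differ.

```python
import math
from statistics import mean, stdev


def fit_reference(rows):
    """Fit an independent Gaussian (mean, std) per feature from reference rows."""
    features = list(rows[0].keys())
    return {
        f: (mean([r[f] for r in rows]), stdev([r[f] for r in rows]))
        for f in features
    }


def surprise(row, reference):
    """Sum of per-feature negative log-likelihoods; larger means more atypical."""
    total = 0.0
    for f, (mu, sigma) in reference.items():
        z = (row[f] - mu) / sigma
        total += 0.5 * z * z + math.log(sigma * math.sqrt(2 * math.pi))
    return total


# Toy reference group; the values are made up for illustration.
reference_rows = [
    {"Wrist_IQRy": 1.0, "Ankle_medianx": 0.20},
    {"Wrist_IQRy": 1.2, "Ankle_medianx": 0.25},
    {"Wrist_IQRy": 0.9, "Ankle_medianx": 0.18},
    {"Wrist_IQRy": 1.1, "Ankle_medianx": 0.22},
]
reference = fit_reference(reference_rows)

typical = {"Wrist_IQRy": 1.05, "Ankle_medianx": 0.21}
atypical = {"Wrist_IQRy": 2.50, "Ankle_medianx": 0.60}
print(surprise(typical, reference), surprise(atypical, reference))
```

Under this assumption, an observation far from the reference means receives a higher score than one close to them, which is the property the ROC step then evaluates.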

### 2.1 PEB 2025.03.27 21:06 =>
- [X] Remove features_38... nothing special.
- [X] Do not define max features statically.
- [X] Set age_bracket [1,2] using >= 6 threshold (B)
- [X] Set risk in [0,1] using threshold >= 2 (B)
- [X] Set category in [1,2] using thresholds category == 0 or risk == 0
- [X] Random sample n=40 of risk==0 && category==1 and assign category to 2
- [X] Drop infant clin_100_6
- [X] Produce a TRAIN set size of 94 and TEST set size of 54
- [X] RFE for 20 features and test

- Create synthetic data on risk in [0,1].
- Ensure synthetic training and testing data have 90/10, 85/15 ratios of risk 0/1.
- Set up cross-validation and feature selection to arrive at optimal estimates.


## Enhanced Adaptive RFE on Real Data

### 2.2 Iterative RFE and Surprise Analysis on Real Data

This is the main computational cell for the analysis of real-world data. It performs an iterative feature selection and evaluation loop:
1.  **Initialization**: Seeds NumPy's global random number generator with `RAND_STATE`, instantiates the `BayesianData` class (loading real data by default), and initializes the feature list from the `FEATURES` constant.
2.  **Iterative Loop**: Repeats until the number of features is at most `MIN_K`, a constant imported from the project's `constants` module.
3.  **Run RFE**: In each iteration, it calls `bd.run_adaptive_rfe()` to perform Enhanced Adaptive RFE on the current feature set.
4.  **Run Surprise**: The reduced feature set is then used to compute Bayesian surprise scores via `bd.run_surprise_with_features()`.
5.  **Compute Metrics**: `bd.compute_roc_metrics()` is called to evaluate the diagnostic performance of the model with the current feature set.
6.  **Log Progress**: `loguru` is used to log the number of features and progress at each trial.
7.  **Final Report**: After the loop terminates, `bd.write_excel_report()` is called to generate a comprehensive Excel spreadsheet summarizing the results from all iterations.
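
The control flow of this loop can be sketched without the project code. In the sketch below, `reduce_once` is a hypothetical stand-in for `bd.run_adaptive_rfe()` that simply drops the lowest-scoring quarter of the features, and `min_k` plays the role of `MIN_K`. The sketch also stops when a pass removes nothing, a guard that the synthetic-data loop in this notebook uses and that avoids an endless loop if selection stalls above `MIN_K`.

```python
def reduce_once(features, scores):
    """Hypothetical stand-in for one RFE pass: keep the top three quarters by score."""
    keep_n = max(1, (len(features) * 3) // 4)
    ranked = sorted(features, key=lambda f: scores[f], reverse=True)
    return ranked[:keep_n]


def iterate_selection(features, scores, min_k):
    """Shrink the feature list until it has at most min_k items or stops shrinking."""
    history = [len(features)]
    while len(features) > min_k:
        reduced = reduce_once(features, scores)
        if len(reduced) == len(features):
            # The pass removed nothing, so further passes would loop forever.
            break
        features = reduced
        history.append(len(features))
    return features, history


# Toy scores; the feature names and values are illustrative only.
toy_scores = {f"feature_{i}": float(i) for i in range(20)}
selected, history = iterate_selection(list(toy_scores), toy_scores, min_k=5)
print(history)
```

Unlike the notebook cell, which always runs at least one pass before checking `MIN_K`, this sketch checks the size first; either ordering terminates once the list is small enough.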

In [None]:
%reload_ext autoreload
%autoreload 2

from datetime import datetime

from numpy import random
from polars import DataFrame
import polars as pl
from tqdm import tqdm
from loguru import logger

from early_markers.cribsy.common.bayes import BayesianData
from early_markers.cribsy.common.adaptive_rfe import EnhancedAdaptiveRFE, validation_report
from early_markers.cribsy.common.constants import AGE_BRACKETS, MIN_K, PKL_DIR, FEATURES

from early_markers.cribsy.common.constants import RAND_STATE


# Set seeds at file level
random.seed(RAND_STATE)

bd = BayesianData()

start_time = datetime.now()
logger.debug(f"Starting Feature Selection...")
# features = bd.base_features
drops = ['Shoulder_IQR_vel_angle', 'Ankle_IQRaccx', 'Wrist_IQRaccx', 'Ankle_IQRvelx', 'Knee_IQR_vel_angle', 'Elbow_IQR_acc_angle', 'Shoulder_mean_angle', 'Ankle_IQRaccy', 'Shoulder_lrCorr_angle', 'Hip_entropy_angle', 'Elbow_mean_angle', 'Eye_lrCorr_x', 'Shoulder_entropy_angle', 'Knee_entropy_angle', 'Shoulder_IQR_acc_angle', 'Ankle_lrCorr_x', 'Hip_lrCorr_angle', 'Wrist_meanent', 'Wrist_IQRvelx', 'Wrist_mediany', 'Ankle_IQRvely', 'Shoulder_stdev_angle', 'Hip_IQR_acc_angle', 'Elbow_stdev_angle', 'Knee_IQR_acc_angle', 'Ankle_meanent', 'Ankle_medianx', 'Wrist_IQRy', 'Knee_lrCorr_angle', 'Hip_IQR_vel_angle', 'Elbow_IQR_vel_angle', 'Wrist_IQRaccy', 'Wrist_IQRvely', 'Elbow_lrCorr_x']
features = FEATURES  # bd.base_features  # [f for f in bd.base_features if f not in drops]
tot_k = len(features)
tick = 1
while True:
    logger.debug(f"Trial {tick}: Features in: {len(features)}...")
    features = bd.run_adaptive_rfe("real", features, tot_k)
    bd.run_surprise_with_features("real", features, overwrite=True)
    metrics = bd.compute_roc_metrics("real", len(features))
    logger.debug(f"...Trial {tick}: Features out: {len(features)}.")
    tick += 1
    if len(features) <= MIN_K:
        break
stop_time = datetime.now()
logger.debug(f"Completed Feature Selection in {(stop_time - start_time).total_seconds() / 60: 0.2f} Minutes.")
bd.write_excel_report("real")

### 2.3 Persist Analysis State

This cell uses `pickle` to save the entire state of the `BayesianData` object (`bd`) to a file named `bd_real.pkl`. This is a critical step for caching results, as it saves all computed metrics, feature lists, and surprise scores, allowing the analysis state to be reloaded in other notebooks without re-running the time-consuming feature selection process.
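
Reloading in another notebook is the mirror image of the save. The sketch below shows the round trip with a plain dictionary standing in for the `BayesianData` object; the file name `state_demo.pkl`, the helper names, and the placeholder values are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(state, path):
    """Serialize an analysis state object to disk."""
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_state(path):
    """Restore a previously pickled analysis state."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A plain dict stands in for the BayesianData object; values are placeholders.
state = {"features": ["Wrist_IQRy", "Ankle_medianx"], "metrics": {"n_features": 2}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "state_demo.pkl"
    save_state(state, path)
    restored = load_state(path)

print(restored == state)
```

Two caveats apply to the real object: `pickle` stores a reference to the class rather than its code, so the `early_markers` package must be importable wherever `bd_real.pkl` is loaded, and pickle files should only be loaded from trusted sources.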

In [None]:
import pickle

with open(PKL_DIR / "bd_real.pkl", "wb") as f:
    pickle.dump(bd, f)

### 2.5 Aggregate and Display Final Feature Sets

This cell consolidates the results of the iterative feature selection. It extracts the list of selected features from all the models generated during the loop, de-duplicates them to create a final consensus set (`keeps`), and prints the total counts and the full list of features for review. It also prints the base features that are not in `drops` and the features common to `keeps` and `bd.base_features`.

In [None]:
# Collect the selected features from every model produced by the loop.
all_features = [f for m in bd.metrics_names for f in bd.metrics(m).features]
print(f"All Features in Models: {len(all_features)}")
# De-duplicate and sort to get the consensus set.
keeps = sorted(set(all_features))
print(f"\nDeduped Features: {len(keeps)}")
print(f"\nDeduped:\n{keeps}")

# `drops` is defined in the real-data analysis cell.
print(f"\nBase Features not in Dropped:\n{[f for f in bd.base_features if f not in drops]}")

# Features present in both the consensus set and the base feature list.
common = sorted(set(keeps) & set(bd.base_features))
print(f"\ncommon features ({len(common)}):\n{common}")

### 2.4 Plan
1. Feed reduced feature set to SDV CTGAN model.
2. Rerun RFE w Synthetic Data.
3. Test With Real Test Data.
4. Find N based on CI


### 2.7 Write Synthetic Data Excel Report

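This cell loads the pickled `BayesianData` object from the synthetic analysis and calls `write_excel_report()` to generate the final, formatted Excel summary. It relies on `pickle`, `TRAIN_N`, and `TEST_N` from the synthetic analysis cell (the one that defines those constants and writes the pickle file), so that cell must run first. The code comment records that worksheet names were shortened to stay `_xlwt` compatible, that is, within Excel's limit of 31 characters per worksheet name.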
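
How `write_excel_report()` shortens names internally is not shown in this notebook. A minimal sketch of one way to enforce the limit follows; `shorten_sheet_name` and the example name are hypothetical.

```python
EXCEL_SHEET_NAME_MAX = 31  # Excel rejects worksheet names longer than this


def shorten_sheet_name(name, max_len=EXCEL_SHEET_NAME_MAX):
    """Return the name unchanged if it fits, otherwise truncate it to max_len."""
    if len(name) <= max_len:
        return name
    return name[:max_len]


# Hypothetical worksheet name built from the synthetic run's sample sizes.
long_name = f"synthetic_train{300}_test{100}_roc_metrics"
short_name = shorten_sheet_name(long_name)
print(short_name, len(short_name))
```

Plain truncation can make two long names collide, so a real implementation would also need to check for duplicates within a workbook.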

In [None]:
# PEB 2025.04.02 23:21 => shorten worksheet names to <= 31 chars
with open(PKL_DIR / f"db_train{TRAIN_N}_test{TEST_N}.pkl", "rb") as f:
    bd = pickle.load(f)

bd.write_excel_report(f"synthetic_train{TRAIN_N}_test{TEST_N}")

In [None]:
%reload_ext autoreload
%autoreload 2

from datetime import datetime
import pickle

from numpy import random
from polars import DataFrame
import polars as pl
from tqdm import tqdm
from loguru import logger

from early_markers.cribsy.common.bayes import BayesianData
from early_markers.cribsy.common.adaptive_rfe import EnhancedAdaptiveRFE, validation_report
from early_markers.cribsy.common.constants import AGE_BRACKETS, MIN_K, IPC_DIR, PKL_DIR

from early_markers.cribsy.common.constants import RAND_STATE

TRAIN_N = 300
TEST_N = 100

# Set seeds at file level
random.seed(RAND_STATE)

bd = BayesianData(base_file="synth_sdv_1000_long.ipc", train_n=TRAIN_N, test_n=TEST_N, augment=True)
start_time = datetime.now()
logger.debug(f"Starting Feature Selection...")
prefix = f"syn_trn{TRAIN_N}_tst{TEST_N}"
features = bd.base_features
tot_k = len(features)
tick = 1
# features_in = features_out = tot_k
features_out = tot_k + 1
while True:
    features_in = len(features)
    if features_in == features_out:
        break
    logger.debug(f"Trial {tick}: Features in: {features_in}...")
    features = bd.run_adaptive_rfe(prefix, features, tot_k=tot_k)
    features_out = len(features)
    bd.run_surprise_with_features(prefix, features, overwrite=True)
    metrics = bd.compute_roc_metrics(prefix, len(features))

    logger.debug(f"...Trial {tick}: Features out: {len(features)}.")
    tick += 1
    if len(features) <= MIN_K:
        break
stop_time = datetime.now()
logger.debug(f"Completed Feature Selection in {(stop_time - start_time).total_seconds() / 60: 0.2f} Minutes.")

with open(PKL_DIR / f"db_train{TRAIN_N}_test{TEST_N}.pkl", "wb") as f:
    pickle.dump(bd, f)


### 2.8 Describe Base DataFrame

This cell provides a quick summary of the `category` and `risk` columns in the base wide-format DataFrame, allowing for a final check of the data distribution after the synthetic analysis.

In [None]:
df_base = bd.base_wide
df_base.select("category", "risk").describe()

## Enhanced Adaptive RFE on Synthetic Data

### 2.6 Iterative RFE and Surprise Analysis on Synthetic Data

This section mirrors the analysis performed on real data, but instead uses a synthetic dataset to validate the feature selection and modeling pipeline. Key differences include:
- **Data Source**: It loads a synthetic dataset from an IPC file (`synth_sdv_1000_long.ipc`).
- **Subsampling**: It uses `train_n` and `test_n` to work with smaller subsets of the synthetic data (300 training, 100 testing samples).
- **Augmentation**: It sets `augment=True` in the `BayesianData` constructor, which may introduce additional noise features to further test the robustness of the RFE process.
- **Output Files**: The resulting `BayesianData` object and Excel report are saved with filenames that reflect the synthetic data source and sample sizes used.
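
How `BayesianData` draws its `train_n` and `test_n` subsets is not shown here. As an illustration of the idea, the sketch below draws disjoint training and testing samples with a fixed share of `risk == 1` rows, in the spirit of the 90/10 ratio mentioned in the developer notes. The helper names, the `infant` identifier column, and the toy pool are assumptions.

```python
import random


def sample_with_ratio(rows, n, positive_fraction, rng):
    """Draw n rows so that round(n * positive_fraction) of them have risk == 1."""
    n_pos = round(n * positive_fraction)
    n_neg = n - n_pos
    pos = [r for r in rows if r["risk"] == 1]
    neg = [r for r in rows if r["risk"] == 0]
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)


def split_train_test(rows, train_n, test_n, positive_fraction, seed):
    """Return disjoint train and test samples that share the same class ratio."""
    rng = random.Random(seed)
    train = sample_with_ratio(rows, train_n, positive_fraction, rng)
    train_ids = {r["infant"] for r in train}
    remaining = [r for r in rows if r["infant"] not in train_ids]
    test = sample_with_ratio(remaining, test_n, positive_fraction, rng)
    return train, test


# Toy pool of 1000 rows in which every fifth row has risk == 1.
pool = [{"infant": i, "risk": 1 if i % 5 == 0 else 0} for i in range(1000)]
train, test = split_train_test(pool, train_n=300, test_n=100, positive_fraction=0.10, seed=0)
print(len(train), len(test))
```

Passing an explicit seed keeps the split reproducible without touching any global random state.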

# ~Find Minimum Sample Size Requirement~

### 2.9 Placeholder for Minimum Sample Size Analysis

This final cell is a placeholder for a potential future analysis to determine the minimum sample size required to achieve a certain level of statistical power or confidence interval width. The commented-out code suggests an iterative approach, looping through different `train_n` and `test_n` values.
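
One way such an analysis could be framed is to search a grid of sample sizes for the smallest one whose confidence interval is narrow enough. The sketch below does this for a single proportion (for example, sensitivity) using the Wilson score interval. It is not the project's `BayesianCI` class, whose interface is not shown here, and the expected proportion and target width are illustrative assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def min_n_for_width(expected_p, max_width, candidates):
    """Return the smallest candidate n whose interval width is at most max_width."""
    for n in sorted(candidates):
        successes = round(expected_p * n)
        lo, hi = wilson_interval(successes, n)
        if hi - lo <= max_width:
            return n
    return None  # no candidate was large enough


# Illustrative assumptions: expected proportion 0.85, target width 0.20.
n_needed = min_n_for_width(expected_p=0.85, max_width=0.20, candidates=range(50, 1001, 50))
print(n_needed)
```

For sensitivity or specificity, `n` counts only the positive or negative cases in the test set, so the total sample size also depends on the class ratio.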

In [None]:
# from early_markers.cribsy.common.bayes import BayesianData, BayesianCI
#
#
# random.seed(RAND_STATE)
#
# ci_metrics = {}
#
#
# for train_n in range(1000, 50, -50):
#         for test_n in range(1000, 50, -50):
#         bd = BayesianData(base_file="synth_long.ipc", train_n=train_n, test_n=test_n)