# Feature Engineering & Train/Test Split: Resting State HR Estimation

This notebook performs **Feature Engineering (Phase 5)** and **Train/Test Split (Phase 6)** for the resting-state
heart-rate estimation dataset (phases 0 and 2). It uses the output from the EDA pipeline (`df_all.csv`) and prepares
the dataset for model training.

### Objectives

1. **Load processed windows + advanced features** produced in the EDA notebook (`df_all.csv`).
2. **Engineer relevant features**, including:
   - low-variance filtering
   - derived physiological features
   - motion-aware and SQI-based transformations
3. **Generate clean training data** for ML models.
4. **Create a reproducible train/test split**, with phase-balanced stratification.
5. **Save all outputs as `.txt` files** following the project's directory structure.
6. **Log all metadata** (sample counts, splits, notes) into `results.txt` to keep complete experiment tracking.

This notebook marks the transition from **Exploratory Data Analysis** to the **Modeling stage**, ensuring the dataset
is well-structured, consistent, and ready for machine learning pipelines.

## Parameters + Loader

In [4]:
# ============================================================
# Global Parameters: Feature Engineering + Split Notebook
# ============================================================

import os

# -------------------------
# Experiment Round Initial
# -------------------------
ROUND_NAME = "round_01"          
MODEL_NAME = "xgboost"          
NOTES = "Baseline - With no oversampling"

# -------------------------
# Base Directory Structure
# -------------------------
BASE_DIR = "/Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso"

# Input (from EDA)
DF_ALL_PATH = os.path.join(BASE_DIR, "data", "processed", "df_all.csv")

# Output directories
PROCESSED_DIR = os.path.join(BASE_DIR, "data", "processed")
SPLIT_DIR      = os.path.join(BASE_DIR, "data", "splits", ROUND_NAME)
RESULTS_DIR    = os.path.join(BASE_DIR, "results")

# Ensure folders exist
os.makedirs(PROCESSED_DIR, exist_ok=True)
os.makedirs(SPLIT_DIR, exist_ok=True)
os.makedirs(RESULTS_DIR, exist_ok=True)

# Output files (TXT preferred)
TRAIN_OUT = os.path.join(SPLIT_DIR, f"{ROUND_NAME}_train.txt")
TEST_OUT  = os.path.join(SPLIT_DIR, f"{ROUND_NAME}_test.txt")

# Experiment log file
LOG_FILE = os.path.join(RESULTS_DIR, "results.txt")

# FE output directory (kept inside data/processed)
FE_OUTPUT_DIR = os.path.join(PROCESSED_DIR, "fe")
os.makedirs(FE_OUTPUT_DIR, exist_ok=True)

# Prefix to name FE outputs (use the round name as prefix)
ROUND_PREFIX = ROUND_NAME  # e.g. "round_01"

# Example FE output file (used by the FE cell)
FE_OUTPUT_EXAMPLE = os.path.join(FE_OUTPUT_DIR, f"{ROUND_PREFIX}_fe_dataset.txt")

print("→ FE_OUTPUT_DIR:", FE_OUTPUT_DIR)
print("→ ROUND_PREFIX:", ROUND_PREFIX)
print("→ Example FE output file:", FE_OUTPUT_EXAMPLE)

print("📌 Parameters loaded:")
print(f"DF_ALL_PATH : {DF_ALL_PATH}")
print(f"SPLIT_DIR   : {SPLIT_DIR}")
print(f"RESULTS_DIR : {RESULTS_DIR}")
print(f"LOG_FILE    : {LOG_FILE}")

→ FE_OUTPUT_DIR: /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/processed/fe
→ ROUND_PREFIX: round_01
→ Example FE output file: /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/processed/fe/round_01_fe_dataset.txt
📌 Parameters loaded:
DF_ALL_PATH : /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/processed/df_all.csv
SPLIT_DIR   : /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/splits/round_01
RESULTS_DIR : /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/results
LOG_FILE    : /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/results/results.txt


In [2]:
# ============================================================
# Loader: Load df_all from EDA Output
# ============================================================

import pandas as pd

print("📥 Loading processed dataset (df_all)...")
print(f"→ Source: {DF_ALL_PATH}")

# Load file
df_all = pd.read_csv(DF_ALL_PATH)

print("\n✅ File loaded successfully!")
print(f"Shape: {df_all.shape}")
print(f"Columns ({len(df_all.columns)}): {df_all.columns.tolist()}")

# ------------------------------------------------------------
# Basic validation
# ------------------------------------------------------------

required_cols = [
    "Id", "phase", "window", "hr_true",
    "ppg_mean", "ppg_std", "imu_mean", "imu_std", "acc_rms"
]

missing = [c for c in required_cols if c not in df_all.columns]

if missing:
    print("❌ ERROR: Missing required columns in df_all:", missing)
    raise ValueError("df_all is incomplete. Please re-run EDA.")

print("\n🔍 Quick stats:")
print(df_all[["phase", "hr_true"]].describe())

print("\n📊 Phase distribution:")
print(df_all["phase"].value_counts())

# Ensure phase is integer
df_all["phase"] = df_all["phase"].astype(int)

print("\n🎉 df_all is ready for Feature Engineering!")

📥 Loading processed dataset (df_all)...
→ Source: /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/processed/df_all.csv

✅ File loaded successfully!
Shape: (792, 36)
Columns (36): ['Id', 'phase', 'window', 'hr_true', 'ppg_mean', 'ppg_std', 'ppg_min', 'ppg_max', 'ppg_range', 'imu_mean', 'imu_std', 'imu_p95', 'imu_energy', 'acc_rms', 'ppg_bp_low', 'ppg_bp_hr', 'ppg_bp_high', 'ppg_bp_hr_norm', 'ppg_f_dom', 'imu_bp_low', 'imu_bp_high', 'imu_jerk_mean', 'imu_jerk_std', 'coherence_ppg_imu', 'ppg_entropy', 'imu_entropy', 'sqi', 'fusion_ppg_imu', 'hr_candidate', 'phase_id', 'PC1', 'PC2', 'mov_bin', 'hr_err', 'low_sqi', 'cluster']

🔍 Quick stats:
            phase     hr_true
count  792.000000  792.000000
mean     1.000000   80.205300
std      1.000632   10.151956
min      0.000000   62.029278
25%      0.000000   72.112676
50%      1.000000   79.227366
75%      2.000000   85.095648
max      2.000000  126.740717

📊 Phase distribution:
phase
0    396
2   

## Feature Engineering Overview

1. **Remove identification columns** (`Id`, `window`, `phase`).
2. **Apply a variance threshold**, which removes only columns that carry essentially no information (variance below `1e-5`).
3. **Add the key derived features:**
   - `sqi_flag`
   - `motion_weight`
   - `hr_cand_weighted`
   - `ppg_hr_smooth`
   - `artifact_ratio`
4. **Keep all features.** None is discarded based on a model; only the leakage-prone columns `hr_err`, `cluster`, `PC1`, and `PC2` are dropped.
5. **Scale everything with `StandardScaler`**, except `hr_true` and `phase`.
6. **Restore `phase` at the end**, because the stratified split uses it.
7. **Save the complete feature-engineered dataset**, ready for split and training.
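
To make the derived features concrete, here is a minimal sketch that applies the same formulas used in the Feature Engineering cell of this notebook to a two-row toy frame. The frame `toy` and all of its values are hypothetical and are not taken from `df_all.csv`.

```python
import numpy as np
import pandas as pd

# Toy windows with made-up values (hypothetical, not taken from df_all.csv)
toy = pd.DataFrame({
    "sqi":          [0.9, 0.3],
    "acc_rms":      [0.0, 1.0],
    "hr_candidate": [80.0, 100.0],
    "ppg_bp_hr":    [1.0, 3.0],
    "imu_bp_high":  [0.5, 0.0],
})

# 1 when signal quality is low (sqi below 0.5), else 0
toy["sqi_flag"] = (toy["sqi"] < 0.5).astype(int)
# Log-compressed motion intensity
toy["motion_weight"] = np.log1p(toy["acc_rms"])
# HR candidate shrunk by 30% when the window is flagged as low quality
toy["hr_cand_weighted"] = toy["hr_candidate"] * (1 - 0.3 * toy["sqi_flag"])
# Log-compressed PPG power in the HR band
toy["ppg_hr_smooth"] = np.log1p(toy["ppg_bp_hr"])
# High-band IMU power relative to PPG HR-band power (epsilon avoids division by zero)
toy["artifact_ratio"] = toy["imu_bp_high"] / (toy["ppg_bp_hr"] + 1e-6)

print(toy[["sqi_flag", "motion_weight", "hr_cand_weighted", "artifact_ratio"]])
```

In this toy case the second window has `sqi = 0.3`, so it is flagged and its HR candidate is down-weighted from 100 to about 70.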

In [10]:
# ============================================================
# PHASE 5: Feature Engineering (PHASES 0 & 2), FULL FEATURES
# ============================================================

import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

print("📌 PHASE 5: FEATURE ENGINEERING (PHASES 0 & 2)")
print(f"Input df_all: {df_all.shape}")

# ============================================================
# 1. Preparation and removal of identification columns
# ============================================================

df_fe = df_all.copy()

# Keep the phase column aside (it is restored at the end)
phase_col = df_fe["phase"].values

cols_id = ["Id", "window", "phase"]
df_fe = df_fe.drop(columns=cols_id, errors="ignore")

print("→ ID columns removed. Shape:", df_fe.shape)

# ============================================================
# 2. Variance Threshold: remove low-variance features
# ============================================================

# Note: select_dtypes keeps only numeric columns, so any non-numeric
# columns (e.g. boolean or categorical) are dropped at this step.
X_numeric = df_fe.select_dtypes(include=[np.number])

selector = VarianceThreshold(threshold=1e-5)
X_var = selector.fit_transform(X_numeric)

low_var_cols = X_numeric.columns[~selector.get_support()]
print(f"→ Removed {len(low_var_cols)} low-variance columns:", list(low_var_cols))

df_fe = pd.DataFrame(
    X_var,
    columns=X_numeric.columns[selector.get_support()]
)

# Re-attach hr_true (to guarantee integrity)
df_fe["hr_true"] = df_all["hr_true"].values

# ============================================================
# 3. Derived Features: kept in the final dataset
# ============================================================

df_fe["sqi_flag"] = (df_fe["sqi"] < 0.5).astype(int)
df_fe["motion_weight"] = np.log1p(df_fe["acc_rms"])
df_fe["hr_cand_weighted"] = df_fe["hr_candidate"] * (1 - 0.3 * df_fe["sqi_flag"])
df_fe["ppg_hr_smooth"] = np.log1p(df_fe["ppg_bp_hr"])
df_fe["artifact_ratio"] = df_fe["imu_bp_high"] / (df_fe["ppg_bp_hr"] + 1e-6)

print("→ Derived features added. Shape:", df_fe.shape)

# ============================================================
# 3.1 Remove columns that could cause data leakage
# ============================================================
features_to_remove = ["hr_err", "cluster", "PC1", "PC2"]
df_fe = df_fe.drop(columns=[c for c in features_to_remove if c in df_fe], errors="ignore")

# ============================================================
# 4. Scaling: applied to ALL features except hr_true
# ============================================================

features = df_fe.drop(columns=["hr_true"]).columns.tolist()

scaler = StandardScaler()
scaled_vals = scaler.fit_transform(df_fe[features])

df_fe_scaled = pd.DataFrame(scaled_vals, columns=features)

# re-attach hr_true
df_fe_scaled["hr_true"] = df_fe["hr_true"].values

# ============================================================
# 5. Restore "phase"
# ============================================================

df_fe_scaled["phase"] = phase_col.astype(int)

print("→ Phase restored. Final shape:", df_fe_scaled.shape)

# ============================================================
# 6. Save dataset
# ============================================================

OUT_FE = os.path.join(FE_OUTPUT_DIR, f"{ROUND_PREFIX}_fe_dataset.txt")
os.makedirs(FE_OUTPUT_DIR, exist_ok=True)

df_fe_scaled.to_csv(OUT_FE, index=False)

print(f"\n💾 Saved FE dataset to:\n{OUT_FE}")
print("🎉 PHASE 5 completed successfully!")

📌 PHASE 5: FEATURE ENGINEERING (PHASES 0 & 2)
Input df_all: (792, 36)
→ ID columns removed. Shape: (792, 33)
→ Removed 0 low-variance columns: []
→ Derived features added. Shape: (792, 36)
→ Phase restored. Final shape: (792, 33)

💾 Saved FE dataset to:
/Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/processed/fe/round_01_fe_dataset.txt
🎉 PHASE 5 completed successfully!
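
The cell above fits `StandardScaler` on all 792 windows before any split, so the scaling statistics also reflect windows that later end up in the test set. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. The frame `toy` and the columns `feat_a` and `feat_b` are hypothetical stand-ins, and this is one possible arrangement rather than the pipeline used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features and a balanced phase column (0 / 2)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feat_a": rng.normal(size=40),
    "feat_b": rng.normal(loc=5.0, scale=2.0, size=40),
    "phase":  [0, 2] * 20,
})

# Split first, stratified by phase as in the notebook
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["phase"]
)

# Fit the scaler on training rows only, then apply it to both subsets
feature_cols = ["feat_a", "feat_b"]
scaler = StandardScaler().fit(train[feature_cols])

train_scaled = train.copy()
test_scaled = test.copy()
train_scaled[feature_cols] = scaler.transform(train[feature_cols])
test_scaled[feature_cols] = scaler.transform(test[feature_cols])

print(train_scaled[feature_cols].mean().round(6))
```

With this ordering the training features have zero mean by construction, while the test features are scaled with the training statistics and will generally not be exactly centered.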


## Round_01 - Split without Oversampling

In [11]:
# ============================================================
# PHASE 6: Train/Test Split (80/20), Baseline Round
# ============================================================

import os
import pandas as pd
from sklearn.model_selection import train_test_split

print("📌 PHASE 6: Train/Test Split (Baseline)")
print(f"📂 FE dataset path: {FE_OUTPUT_EXAMPLE}")

# ------------------------------------------------------------
# 1. Load Feature-Engineered Dataset (Phase 5)
# ------------------------------------------------------------
df_fe = pd.read_csv(FE_OUTPUT_EXAMPLE)

print("\n🔍 Loaded FE dataset")
print("Shape:", df_fe.shape)
print("Columns:", df_fe.columns.tolist())

# ------------------------------------------------------------
# 2. Create the 80/20 Split (stratified by phase)
# ------------------------------------------------------------

train_df, test_df = train_test_split(
    df_fe,
    test_size=0.20,
    random_state=42,
    shuffle=True,
    stratify=df_fe["phase"]
)

print("\n📊 Split Summary:")
print("   → Train:", train_df.shape)
print("   → Test :", test_df.shape)

# ------------------------------------------------------------
# 3. Save Split Files using Global Parameters
# ------------------------------------------------------------

os.makedirs(SPLIT_DIR, exist_ok=True)

train_df.to_csv(TRAIN_OUT, index=False)
test_df.to_csv(TEST_OUT, index=False)

print("\n💾 Files saved:")
print(f"   • Train → {TRAIN_OUT}")
print(f"   • Test  → {TEST_OUT}")

# ------------------------------------------------------------
# 4. Update Experiment Log
# ------------------------------------------------------------

log_entry = (
    f"\n=== SPLIT GENERATED: {ROUND_NAME} ===\n"
    f"Notes: {NOTES}\n"
    f"FE Input File: {FE_OUTPUT_EXAMPLE}\n"
    f"Train Shape : {train_df.shape}\n"
    f"Test Shape  : {test_df.shape}\n"
    f"Train File  : {TRAIN_OUT}\n"
    f"Test File   : {TEST_OUT}\n"
)

with open(LOG_FILE, "a") as f:
    f.write(log_entry)

print("\n📝 Log updated:")
print(LOG_FILE)
print("🎉 PHASE 6: Split complete!")

📌 PHASE 6: Train/Test Split (Baseline)
📂 FE dataset path: /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/processed/fe/round_01_fe_dataset.txt

🔍 Loaded FE dataset
Shape: (792, 33)
Columns: ['ppg_mean', 'ppg_std', 'ppg_min', 'ppg_max', 'ppg_range', 'imu_mean', 'imu_std', 'imu_p95', 'imu_energy', 'acc_rms', 'ppg_bp_low', 'ppg_bp_hr', 'ppg_bp_high', 'ppg_bp_hr_norm', 'ppg_f_dom', 'imu_bp_low', 'imu_bp_high', 'imu_jerk_mean', 'imu_jerk_std', 'coherence_ppg_imu', 'ppg_entropy', 'imu_entropy', 'sqi', 'fusion_ppg_imu', 'hr_candidate', 'phase_id', 'sqi_flag', 'motion_weight', 'hr_cand_weighted', 'ppg_hr_smooth', 'artifact_ratio', 'hr_true', 'phase']

📊 Split Summary:
   → Train: (633, 33)
   → Test : (159, 33)

💾 Files saved:
   • Train → /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/splits/round_01/round_01_train.txt
   • Test  → /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/

## Round_02 - Intelligent Oversampling with Modified SMOTE and Jitter on the Critical Samples That Produced the Largest Errors, Applied to the Most Important Features

*Feature Ranking - Importance Analysis*

| Rank | Feature          | Importance | Interpretation                                                                                   |
|------|------------------|------------|--------------------------------------------------------------------------------------------------|
| 1    | `fusion_ppg_imu` | 0.1525     | Hybrid feature combining cross-sensor coherence: a powerful summary of the physiological state.  |
| 2    | `ppg_f_dom`      | 0.1011     | Dominant PPG frequency, tightly linked to HR (heart rate).                                       |
| 3    | `phase_id`       | 0.0784     | The model uses the protocol phase to calibrate error (rest, exertion, etc.).                     |
| 4    | `imu_entropy`    | 0.0774     | Irregular movement / noise, signalling loss of signal quality.                                   |
| 5    | `imu_p95`        | 0.0748     | Accelerometer peaks, useful for detecting significant movement.                                  |


*Data Balancing Strategy*

| Range (BPM) | Count | Oversampling | Multiplication Factor |
|-------------|-------|--------------|-----------------------|
| 84–95 BPM   | 24    | ×2           | 2×                    |
| 95–106 BPM  | 3     | ×10          | 10×                   |
| 106–117 BPM | 2     | ×10          | 10×                   |


**Before Oversampling:**
- **84–95 BPM**: 24 samples (82.8% of the total)
- **95–106 BPM**: 3 samples (10.3% of the total)
- **106–117 BPM**: 2 samples (6.9% of the total)
- **Total**: 29 samples
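
The effect of the multipliers on these counts can be worked out directly. The sketch below assumes that a multiplier of k means the band ends up with k times its original number of samples, which matches how `smote_modified` in the next cell computes the number of synthetic rows.

```python
# Counts and multipliers copied from the table above
counts = {"84-95 BPM": 24, "95-106 BPM": 3, "106-117 BPM": 2}
multipliers = {"84-95 BPM": 2, "95-106 BPM": 10, "106-117 BPM": 10}

total_before = sum(counts.values())

# Assumption: final band size = original size * multiplier
after = {band: n * multipliers[band] for band, n in counts.items()}
total_after = sum(after.values())

for band, n in counts.items():
    share_before = 100 * n / total_before
    share_after = 100 * after[band] / total_after
    print(f"{band}: {n} -> {after[band]} samples "
          f"({share_before:.1f}% -> {share_after:.1f}% of these bands)")

print(f"Total: {total_before} -> {total_after}")
```

Under that assumption the three bands grow from 29 to 98 samples in total, and the two rarest bands together move from roughly a sixth to roughly half of the samples in these ranges.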


In [None]:
# ===============================================================
# PHASE 6: Intelligent Oversampling (Modified SMOTE) + Split
# Governance + Logging
# ===============================================================

import os
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from datetime import datetime

# ---------------------------------------------------------------
# PARAMETERS
# ---------------------------------------------------------------
ROUND_NAME = "round_02"
PREFIX     = "r02"

BASE_DIR = "/Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso"

FE_OUT = os.path.join(BASE_DIR, "data", "processed", "fe", "round_01_fe_dataset.txt")

SPLIT_DIR = os.path.join(BASE_DIR, "data", "splits", ROUND_NAME)
PREP_DIR  = os.path.join(BASE_DIR, "data", "preparation")
RESULTS_LOG = os.path.join(BASE_DIR, "results", "results.txt")

os.makedirs(SPLIT_DIR, exist_ok=True)
os.makedirs(PREP_DIR, exist_ok=True)

TRAIN_OUT = os.path.join(SPLIT_DIR, f"{ROUND_NAME}_train.txt")
TEST_OUT  = os.path.join(SPLIT_DIR, f"{ROUND_NAME}_test.txt")

OVERSAMPLE_REPORT = os.path.join(
    PREP_DIR, f"{PREFIX}_oversampling_report.txt"
)

# ---------------------------------------------------------------
# LOAD FEATURE-ENGINEERED DATASET
# ---------------------------------------------------------------
df = pd.read_csv(FE_OUT)
df = df[df["phase"].isin([0, 2])].reset_index(drop=True)

initial_size = len(df)

print("📥 Loaded FE dataset:", df.shape)

# ---------------------------------------------------------------
# 1. HR RANGES
# ---------------------------------------------------------------
R1 = df[(df.hr_true >= 84) & (df.hr_true < 95)]
R2 = df[(df.hr_true >= 95) & (df.hr_true < 106)]
R3 = df[(df.hr_true >= 106) & (df.hr_true <= 117)]

# Final size of each band relative to its original size
multipliers = {
    "84–95 BPM": 2,
    "95–106 BPM": 10,
    "106–117 BPM": 10
}

# ---------------------------------------------------------------
# 2. IMPORTANT FEATURES
# ---------------------------------------------------------------
important_features = [
    "fusion_ppg_imu",
    "ppg_f_dom",
    "phase_id",
    "imu_entropy",
    "imu_p95",
    "imu_mean",
    "imu_energy",
    "acc_rms",
    "hr_cand_weighted",
    "ppg_bp_high",
]

# Jitter is applied to the same columns that are interpolated
top_jitter_features = important_features.copy()

# ---------------------------------------------------------------
# 3. SMOTE MODIFIED
# ---------------------------------------------------------------
def smote_modified(df_group, multiplier, feature_cols, jitter_cols):
    """Return df_group plus len(df_group) * (multiplier - 1) synthetic rows.

    Each synthetic row interpolates feature_cols between a random base row
    and one of its nearest neighbours, then adds light Gaussian jitter to
    jitter_cols. All other columns (hr_true, phase, ...) are copied from
    the base row.
    """
    # Interpolation needs at least two rows
    if len(df_group) < 2:
        return df_group.copy()

    original = df_group.copy()
    n_new = len(df_group) * (multiplier - 1)

    nbrs = NearestNeighbors(
        n_neighbors=min(5, len(df_group))
    ).fit(df_group[feature_cols])

    indices = nbrs.kneighbors(df_group[feature_cols], return_distance=False)
    synthetic = []

    for _ in range(n_new):
        base_idx = np.random.randint(len(df_group))
        # Skip position 0, which is normally the base row itself
        neigh_idx = np.random.choice(indices[base_idx][1:])

        base = df_group.iloc[base_idx]
        neigh = df_group.iloc[neigh_idx]

        # Interpolation weight along the segment base -> neighbour
        alpha = np.random.uniform(0.1, 0.9)
        new_row = base.copy()

        for col in feature_cols:
            new_row[col] = base[col] + alpha * (neigh[col] - base[col])

        # Jitter: Gaussian noise with 1% of the group's per-feature std
        for col in jitter_cols:
            new_row[col] += np.random.normal(
                0, 0.01 * df_group[col].std()
            )

        synthetic.append(new_row)

    return pd.concat([original, pd.DataFrame(synthetic)], ignore_index=True)

# ---------------------------------------------------------------
# 4. APPLY OVERSAMPLING
# ---------------------------------------------------------------
# Fix the NumPy seed so the synthetic rows are reproducible
np.random.seed(42)

print("\n📊 Before Oversampling:")
print(f"84–95 BPM  : {len(R1)}")
print(f"95–106 BPM : {len(R2)}")
print(f"106–117 BPM: {len(R3)}")

R1_new = smote_modified(R1, multipliers["84–95 BPM"], important_features, top_jitter_features)
R2_new = smote_modified(R2, multipliers["95–106 BPM"], important_features, top_jitter_features)
R3_new = smote_modified(R3, multipliers["106–117 BPM"], important_features, top_jitter_features)

df_over = pd.concat([df, R1_new, R2_new, R3_new], ignore_index=True)
df_over = df_over.drop_duplicates().reset_index(drop=True)

final_size = len(df_over)

print("📦 After Oversampling:", df_over.shape)

# ---------------------------------------------------------------
# 5. TRAIN / TEST SPLIT
# ---------------------------------------------------------------
train_df, test_df = train_test_split(
    df_over,
    test_size=0.20,
    shuffle=True,
    random_state=42
)

train_df.to_csv(TRAIN_OUT, index=False)
test_df.to_csv(TEST_OUT, index=False)

# ---------------------------------------------------------------
# 6. SAVE OVERSAMPLING REPORT
# ---------------------------------------------------------------
with open(OVERSAMPLE_REPORT, "w") as f:
    f.write("=" * 70 + "\n")
    f.write("INTELLIGENT OVERSAMPLING REPORT\n")
    f.write("=" * 70 + "\n\n")

    f.write(f"Round: {ROUND_NAME}\n")
    f.write(f"Prefix: {PREFIX}\n")
    f.write(f"Timestamp: {datetime.now()}\n\n")

    f.write("INITIAL VOLUME:\n")
    f.write(f"- Samples: {initial_size}\n\n")

    f.write("OVERSAMPLING CRITERIA:\n")
    for k, v in multipliers.items():
        f.write(f"- {k}: x{v}\n")
    f.write("\nImportant features:\n")
    f.write(", ".join(important_features) + "\n\n")

    f.write("FINAL VOLUME:\n")
    f.write(f"- Samples: {final_size}\n\n")

    f.write("COMMENTARY:\n")
    f.write(
        "Intelligent oversampling applied in feature space to support round_02, using "
        "interpolation + jitter on high-importance features. "
        "Objective: densify rare HR regions with physiological realism.\n\n"
    )

    f.write("HR DISTRIBUTION (TRAIN):\n")
    f.write(pd.cut(train_df["hr_true"], bins=6).value_counts().to_string())

print("\n📝 Oversampling report saved →")
print(OVERSAMPLE_REPORT)

# ---------------------------------------------------------------
# 7. LOG SPLIT IN results.txt
# ---------------------------------------------------------------
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

header = (
    "timestamp\tmodel\tround\ttype\ttrain_samples\ttest_samples\t"
    "train_phase_counts\ttest_phase_counts\tnotes\tmetrics\tmodel_file\tpreds_file\n"
)

if not os.path.exists(RESULTS_LOG):
    with open(RESULTS_LOG, "w") as f:
        f.write(header)

with open(RESULTS_LOG, "a") as f:
    f.write(
        f"{timestamp}\t{PREFIX}\t{ROUND_NAME}\tsplit\t"
        f"{len(train_df)}\t{len(test_df)}\t"
        f"{train_df['phase'].value_counts().to_dict()}\t"
        f"{test_df['phase'].value_counts().to_dict()}\t"
        f"Intelligent oversampling (feature-space, HR-aware)\t-\t"
        f"{os.path.basename(TRAIN_OUT)}\t{os.path.basename(TEST_OUT)}\n"
    )

print("\n📌 Split logged to results.txt")
print("🎉 PHASE 6 COMPLETED WITH GOVERNANCE!")

📥 Loaded FE dataset: (792, 33)

📊 Before Oversampling:
84–95 BPM  : 174
95–106 BPM : 39
106–117 BPM: 13
📦 After Oversampling: (1434, 33)

📝 Oversampling report saved →
/Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/preparation/r2_oversampling_report.txt

📌 Split logged to results.txt
🎉 PHASE 6 COMPLETED WITH GOVERNANCE!
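
The reported size after oversampling can be checked by hand from the band counts printed above. This sketch assumes that `drop_duplicates` removes only the repeated original rows and that no synthetic row happens to be an exact duplicate.

```python
# Values copied from the cell output above
initial = 792
band_counts = {"84-95": 174, "95-106": 39, "106-117": 13}
multipliers = {"84-95": 2, "95-106": 10, "106-117": 10}

# smote_modified adds len(group) * (multiplier - 1) synthetic rows per band
synthetic = {band: n * (multipliers[band] - 1) for band, n in band_counts.items()}

expected_total = initial + sum(synthetic.values())
print(synthetic)
print("Expected rows after oversampling:", expected_total)
```

The three bands contribute 174, 351 and 117 synthetic rows, which added to the 792 original windows gives the 1434 rows reported in the output.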


## Round_03 - Oversampling of the 5 Records with the Largest Errors

What CHANGES relative to round_02

- ✔ Does NOT rework the previous oversampling
- ✔ Does NOT touch the intermediate bins
- ✔ Does NOT alter the global distribution

Target: samples with hr_true > 115.5 BPM

- ✅ Applies surgical oversampling (×10 in the cell below; the multiplier is adjustable)
- ✅ Uses interpolation + light jitter
- ✅ Works only in the space of the important features
- ✅ Keeps governance + logging + report

*Generated files*

- splits: data/splits/round_03/
- oversampling report: data/preparation/r03_oversampling_report.txt
- log: results/results.txt
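
The "interpolation + light jitter" step can be illustrated for a single synthetic sample. This is a minimal sketch with two hypothetical feature vectors; `base`, `neigh` and `feature_std` are made-up values, while the `alpha` range and the 1% jitter scale mirror the `smote_modified` function used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Two hypothetical neighbouring windows described by two scaled features
base = np.array([1.0, 2.0])
neigh = np.array([3.0, 6.0])
feature_std = np.array([0.5, 1.0])  # assumed per-feature std within the group

alpha = rng.uniform(0.1, 0.9)                  # interpolation weight
interpolated = base + alpha * (neigh - base)   # point on the segment base -> neigh
jitter = rng.normal(0.0, 0.01 * feature_std)   # light noise: 1% of each feature's std
synthetic = interpolated + jitter

print("alpha       :", round(alpha, 3))
print("interpolated:", interpolated)
print("synthetic   :", synthetic)
```

Because only the listed features are interpolated, the synthetic row keeps the `hr_true` label of its base row; the new point moves in feature space while its target stays fixed.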


In [21]:
# ===============================================================
# PHASE 6: Intelligent Oversampling (Modified SMOTE) + Split
# ROUND_03: Base + Extreme HR Refinement
# Governance + Logging
# ===============================================================

import os
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from datetime import datetime

# ---------------------------------------------------------------
# PARAMETERS
# ---------------------------------------------------------------
ROUND_NAME = "round_03"
PREFIX     = "r03"

BASE_DIR = "/Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso"

FE_OUT = os.path.join(
    BASE_DIR, "data", "processed", "fe", "round_01_fe_dataset.txt"
)

SPLIT_DIR   = os.path.join(BASE_DIR, "data", "splits", ROUND_NAME)
PREP_DIR    = os.path.join(BASE_DIR, "data", "preparation")
RESULTS_LOG = os.path.join(BASE_DIR, "results", "results.txt")

os.makedirs(SPLIT_DIR, exist_ok=True)
os.makedirs(PREP_DIR, exist_ok=True)

TRAIN_OUT = os.path.join(SPLIT_DIR, f"{ROUND_NAME}_train.txt")
TEST_OUT  = os.path.join(SPLIT_DIR, f"{ROUND_NAME}_test.txt")

OVERSAMPLE_REPORT = os.path.join(
    PREP_DIR, f"{PREFIX}_oversampling_report.txt"
)

# ---------------------------------------------------------------
# LOAD FEATURE-ENGINEERED DATASET
# ---------------------------------------------------------------
df = pd.read_csv(FE_OUT)
df = df[df["phase"].isin([0, 2])].reset_index(drop=True)

initial_size = len(df)
print("📥 Loaded FE dataset:", df.shape)

# ---------------------------------------------------------------
# 1. BASE HR RANGES (ROUND_02 STRATEGY, UNCHANGED)
# ---------------------------------------------------------------
R1 = df[(df.hr_true >= 84) & (df.hr_true < 95)]
R2 = df[(df.hr_true >= 95) & (df.hr_true < 106)]
R3 = df[(df.hr_true >= 106) & (df.hr_true <= 117)]

# Final size of each band relative to its original size
multipliers = {
    "84–95 BPM": 2,
    "95–106 BPM": 10,
    "106–117 BPM": 10
}

# ---------------------------------------------------------------
# 2. IMPORTANT FEATURES
# ---------------------------------------------------------------
important_features = [
    "fusion_ppg_imu",
    "ppg_f_dom",
    "phase_id",
    "imu_entropy",
    "imu_p95",
    "imu_mean",
    "imu_energy",
    "acc_rms",
    "hr_cand_weighted",
    "ppg_bp_high",
]

# Jitter is applied to the same columns that are interpolated
top_jitter_features = important_features.copy()

# ---------------------------------------------------------------
# 3. SMOTE-MODIFIED FUNCTION
# ---------------------------------------------------------------
def smote_modified(df_group, multiplier, feature_cols, jitter_cols):
    """Return df_group plus len(df_group) * (multiplier - 1) synthetic rows.

    Each synthetic row interpolates feature_cols between a random base row
    and one of its nearest neighbours, then adds light Gaussian jitter to
    jitter_cols. All other columns (hr_true, phase, ...) are copied from
    the base row.
    """
    # Interpolation needs at least two rows
    if len(df_group) < 2:
        return df_group.copy()

    n_new = len(df_group) * (multiplier - 1)

    nbrs = NearestNeighbors(
        n_neighbors=min(5, len(df_group))
    ).fit(df_group[feature_cols])

    indices = nbrs.kneighbors(
        df_group[feature_cols], return_distance=False
    )

    synthetic = []

    for _ in range(n_new):
        base_idx = np.random.randint(len(df_group))
        # Skip position 0, which is normally the base row itself
        neigh_idx = np.random.choice(indices[base_idx][1:])

        base  = df_group.iloc[base_idx]
        neigh = df_group.iloc[neigh_idx]

        # Interpolation weight along the segment base -> neighbour
        alpha = np.random.uniform(0.1, 0.9)
        new_row = base.copy()

        for col in feature_cols:
            new_row[col] = base[col] + alpha * (neigh[col] - base[col])

        # Jitter: Gaussian noise with 1% of the group's per-feature std
        for col in jitter_cols:
            new_row[col] += np.random.normal(
                0, 0.01 * df_group[col].std()
            )

        synthetic.append(new_row)

    return pd.concat([df_group, pd.DataFrame(synthetic)], ignore_index=True)

# ---------------------------------------------------------------
# 4. BASE OVERSAMPLING (ROUND_02 BEHAVIOR)
# ---------------------------------------------------------------
# Fix the NumPy seed so the synthetic rows are reproducible
np.random.seed(42)

print("\n📊 Before Base Oversampling:")
print(f"84–95 BPM  : {len(R1)}")
print(f"95–106 BPM : {len(R2)}")
print(f"106–117 BPM: {len(R3)}")

R1_new = smote_modified(R1, multipliers["84–95 BPM"], important_features, top_jitter_features)
R2_new = smote_modified(R2, multipliers["95–106 BPM"], important_features, top_jitter_features)
R3_new = smote_modified(R3, multipliers["106–117 BPM"], important_features, top_jitter_features)

df_over = pd.concat([df, R1_new, R2_new, R3_new], ignore_index=True)
df_over = df_over.drop_duplicates().reset_index(drop=True)

print("📦 After base oversampling:", df_over.shape)

# ===============================================================
# 4B. EXTREME HR REFINEMENT (ROUND_03 SURGICAL STEP)
# ===============================================================
EXTREME_HR_THRESHOLD = 115.5
MULTIPLIER_EXTREME   = 10   # <<< ADJUSTABLE (e.g. 5, 8, 10)

print("\n🔥 Extreme HR Refinement")
print(f"Threshold  : hr_true > {EXTREME_HR_THRESHOLD}")
print(f"Multiplier : x{MULTIPLIER_EXTREME}")

df_extreme = df_over[df_over["hr_true"] > EXTREME_HR_THRESHOLD].copy()
n_extreme_original = len(df_extreme)

print("Extreme HR samples:", n_extreme_original)

if n_extreme_original >= 2:
    df_extreme_aug = smote_modified(
        df_extreme,
        MULTIPLIER_EXTREME,
        important_features,
        top_jitter_features
    )
else:
    df_extreme_aug = df_extreme.copy()

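# Note: df_extreme_aug already contains the extreme rows taken from df_over,
# and no drop_duplicates is applied here, so those rows appear twice in df_final.
df_final = pd.concat([df_over, df_extreme_aug], ignore_index=True)
df_final = df_final.reset_index(drop=True)

print("📦 After extreme refinement:", df_final.shape)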

# ---------------------------------------------------------------
# 5. TRAIN / TEST SPLIT
# ---------------------------------------------------------------
train_df, test_df = train_test_split(
    df_final,
    test_size=0.20,
    shuffle=True,
    random_state=42
)

train_df.to_csv(TRAIN_OUT, index=False)
test_df.to_csv(TEST_OUT, index=False)

# ---------------------------------------------------------------
# 6. OVERSAMPLING REPORT
# ---------------------------------------------------------------
with open(OVERSAMPLE_REPORT, "w") as f:
    f.write("=" * 70 + "\n")
    f.write("INTELLIGENT OVERSAMPLING REPORT: ROUND_03\n")
    f.write("=" * 70 + "\n\n")

    f.write(f"Round: {ROUND_NAME}\n")
    f.write(f"Prefix: {PREFIX}\n")
    f.write(f"Timestamp: {datetime.now()}\n\n")

    f.write("INITIAL DATASET:\n")
    f.write(f"- Samples: {initial_size}\n\n")

    f.write("BASE OVERSAMPLING:\n")
    for k, v in multipliers.items():
        f.write(f"- {k}: x{v}\n")

    f.write("\nEXTREME HR REFINEMENT:\n")
    f.write(f"- Threshold : hr_true > {EXTREME_HR_THRESHOLD}\n")
    f.write(f"- Multiplier: x{MULTIPLIER_EXTREME}\n")
    f.write(f"- Extreme samples (orig): {n_extreme_original}\n\n")

    f.write("IMPORTANT FEATURES:\n")
    f.write(", ".join(important_features) + "\n\n")

    f.write("FINAL DATASET:\n")
    f.write(f"- Samples: {len(df_final)}\n\n")

    f.write("TRAIN HR DISTRIBUTION:\n")
    f.write(pd.cut(train_df["hr_true"], bins=6).value_counts().to_string())

print("\n📝 Oversampling report saved →", OVERSAMPLE_REPORT)

# ---------------------------------------------------------------
# 7. LOG SPLIT IN results.txt
# ---------------------------------------------------------------
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

header = (
    "timestamp\tmodel\tround\ttype\ttrain_samples\ttest_samples\t"
    "train_phase_counts\ttest_phase_counts\tnotes\tmetrics\tmodel_file\tpreds_file\n"
)

if not os.path.exists(RESULTS_LOG):
    with open(RESULTS_LOG, "w") as f:
        f.write(header)

with open(RESULTS_LOG, "a") as f:
    f.write(
        f"{timestamp}\t{PREFIX}\t{ROUND_NAME}\tsplit\t"
        f"{len(train_df)}\t{len(test_df)}\t"
        f"{train_df['phase'].value_counts().to_dict()}\t"
        f"{test_df['phase'].value_counts().to_dict()}\t"
        f"Base oversampling + extreme HR refinement\t-\t"
        f"{os.path.basename(TRAIN_OUT)}\t{os.path.basename(TEST_OUT)}\n"
    )

print("\n📌 Split logged to results.txt")
print("🎉 PHASE 6: ROUND_03 COMPLETED WITH GOVERNANCE!")

📥 Loaded FE dataset: (792, 33)

📊 Before Base Oversampling:
84–95 BPM  : 174
95–106 BPM : 39
106–117 BPM: 13
📦 After base oversampling: (1434, 33)

🔥 Extreme HR Refinement
Threshold  : hr_true > 115.5
Multiplier : x10
Extreme HR samples: 14
📦 After extreme refinement: (1574, 33)

📝 Oversampling report saved → /Users/edmundobrown/Documents/MLGeral/AI-HealthCare/HREstimation/repouso/data/preparation/r03_oversampling_report.txt

📌 Split logged to results.txt
🎉 PHASE 6: ROUND_03 COMPLETED WITH GOVERNANCE!
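
In rounds 02 and 03 the train/test split is applied after oversampling, so a synthetic row and the real window it was derived from can land on opposite sides of the split. One way to keep the test set made of real windows only is to split first and oversample the training portion afterwards. The sketch below shows that ordering on synthetic data; `toy`, `feat` and the simplified `oversample` helper are hypothetical stand-ins (the helper resamples rows with light noise and is not the notebook's `smote_modified`).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical dataset: one feature and an HR target
toy = pd.DataFrame({
    "feat": rng.normal(size=100),
    "hr_true": rng.uniform(60, 120, size=100),
})

# 1) Hold out the test set from real windows only
train, test = train_test_split(toy, test_size=0.2, random_state=42)


def oversample(df_group, multiplier, rng):
    """Simplified placeholder: resample rows and add light noise to 'feat'."""
    n_new = len(df_group) * (multiplier - 1)
    if n_new == 0:
        return df_group.copy()
    extra = df_group.sample(n=n_new, replace=True, random_state=0).copy()
    extra["feat"] += rng.normal(0.0, 0.01, size=n_new)
    return pd.concat([df_group, extra], ignore_index=True)


# 2) Oversample the rare HR band inside the training set only
rare = train[train["hr_true"] >= 106]
common = train[train["hr_true"] < 106]
train_aug = pd.concat([common, oversample(rare, 10, rng)], ignore_index=True)

print("Train (real)     :", len(train))
print("Train (augmented):", len(train_aug))
print("Test (real only) :", len(test))
```

With this ordering, metrics computed on `test` describe performance on unseen real windows, which makes rounds with different oversampling strategies easier to compare.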
