# Phase 0: SSL Data Preparation

**Goal**: Prepare datasets for self-supervised learning (SSL) pretraining on 4,417 unlabeled PPG signals.

**Findings from Phase -1**:
- ❌ Zero overlap between waveform subject IDs (52-4833) and MIMIC clinical CSVs (10001-44228)
- 4,417 PPG segments available (75K samples @ 125 Hz each)
- 130 unique subjects, all "Excellent" quality (mean SQI=0.958)
- No clinical labels available → use self-supervised pretraining

**Approach**: 
- Train denoising autoencoder on 4,133 unlabeled training segments
- Validate on 200 segments
- Reserve 84 high-quality segments for downstream task evaluation
- Learn signal reconstruction via multi-loss training (MSE + SSIM + FFT)
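
The multi-loss objective named in the last bullet can be sketched in NumPy as a weighted sum of three terms. This is only an illustration under stated assumptions: the function names, the single-window ("global") form of SSIM, the orthonormal FFT scaling, and the weights `w_mse`, `w_ssim`, `w_fft` are hypothetical choices, not the project's actual training loss.

```python
import numpy as np


def mse_term(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error in the time domain."""
    return float(np.mean((pred - target) ** 2))


def fft_term(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between magnitude spectra (orthonormal scaling)."""
    pred_mag = np.abs(np.fft.rfft(pred, norm="ortho"))
    target_mag = np.abs(np.fft.rfft(target, norm="ortho"))
    return float(np.mean((pred_mag - target_mag) ** 2))


def ssim_term(pred: np.ndarray, target: np.ndarray, data_range: float = 1.0) -> float:
    """1 - SSIM, with the whole segment treated as one window (assumption)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = np.mean((pred - mu_p) * (target - mu_t))
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    )
    return float(1.0 - ssim)


def combined_loss(pred, target, w_mse=1.0, w_ssim=0.1, w_fft=0.1) -> float:
    """Weighted sum of the three terms; the weights are placeholders."""
    return (
        w_mse * mse_term(pred, target)
        + w_ssim * ssim_term(pred, target)
        + w_fft * fft_term(pred, target)
    )


# Synthetic stand-in for a PPG segment sampled at 125 Hz
rng = np.random.default_rng(0)
t = np.arange(1000) / 125.0
clean = np.sin(2 * np.pi * 1.2 * t)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

loss_same = combined_loss(clean, clean)
loss_noisy = combined_loss(noisy, clean)
print(f"identical: {loss_same:.6f}, noisy: {loss_noisy:.6f}")
```

In a real training loop these terms would be written with the framework's differentiable tensor operations; the NumPy version only shows how the three terms combine.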

**Outputs**:
- `ssl_pretraining_data.parquet` (4,133 train segments)
- `ssl_validation_data.parquet` (200 validation segments)
- `ssl_test_data.parquet` (84 test segments)
- `denoised_signal_index.json` (mapping for ground truth)
- `data/processed/denoised_signals/` (precomputed wavelet-denoised ground truth)


## Setup and Configuration


In [1]:
import os
import sys
import json
from pathlib import Path
import numpy as np
import pandas as pd
from datetime import datetime
import logging
from typing import Tuple, Dict, List

# Setup paths
PROJECT_ROOT = Path.cwd().parent
os.chdir(PROJECT_ROOT)
sys.path.insert(0, str(PROJECT_ROOT / "colab_src"))

# Directories
DATA_DIR = Path("data/processed")
OUTPUT_DIR = DATA_DIR
DENOISED_SIGNALS_DIR = DATA_DIR / "denoised_signals"

# Create output directories
DENOISED_SIGNALS_DIR.mkdir(parents=True, exist_ok=True)

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("✅ Setup complete")
print(f"   Project root: {PROJECT_ROOT}")
print(f"   Data dir: {DATA_DIR}")
print(f"   Denoised signals dir: {DENOISED_SIGNALS_DIR}")


✅ Setup complete
   Project root: c:\Developments\cardiometabolic-risk-colab
   Data dir: data\processed
   Denoised signals dir: data\processed\denoised_signals


## Step 1: Load Sprint 1 Signal Data


In [2]:
# Load signal metadata
signal_metadata_path = DATA_DIR / "sprint1_metadata.parquet"
signal_metadata_df = pd.read_parquet(signal_metadata_path)

# Load signal waveforms (if available as numpy array)
signal_array_path = DATA_DIR / "sprint1_signals.npy"
if signal_array_path.exists():
    signals = np.load(signal_array_path)
    print(f"✅ Loaded signal array: {signals.shape}")
else:
    signals = None
    print("⚠️  Signal array not found. Will be loaded individually from batches.")

print(f"\n✅ Signal metadata loaded")
print(f"   Rows: {len(signal_metadata_df)}")
print(f"   Columns: {list(signal_metadata_df.columns)}")
print(f"\n   Summary statistics:")
print(f"   - Subjects: {signal_metadata_df['subject_id'].nunique()}")
print(f"   - Mean SQI: {signal_metadata_df['sqi_score'].mean():.3f}")
print(f"   - Mean SNR (dB): {signal_metadata_df['snr_db'].mean():.2f}")
print(f"\n   Sample rows:")
print(signal_metadata_df.head(3))


✅ Loaded signal array: (4417, 75000)

✅ Signal metadata loaded
   Rows: 4417
   Columns: ['record_name', 'subject_id', 'segment_idx', 'fs', 'sqi_score', 'quality_grade', 'snr_db', 'perfusion_index', 'channel_name', 'global_segment_idx', 'batch_num']

   Summary statistics:
   - Subjects: 130
   - Mean SQI: 0.958
   - Mean SNR (dB): 40.66

   Sample rows:
                record_name subject_id  segment_idx   fs  sqi_score  \
0  p00/p000052/3533390_0004    p000052            0  125   0.893482   
1  p00/p000052/3533390_0004    p000052            1  125   0.888996   
2  p00/p000052/3238451_0005    p000052            0  125   0.888845   

  quality_grade     snr_db  perfusion_index channel_name  global_segment_idx  \
0     Excellent  39.255102     3.938813e+06        PLETH                   0   
1     Excellent  38.742355     3.663113e+06        PLETH                   1   
2     Excellent  38.725109     1.637093e+06        PLETH                   2   

   batch_num  
0        1.0  
1  

## Step 2: Create Train/Val/Test Splits


In [3]:
# Strategy: rank segments by SQI and slice the ranking (top 84 = test,
# next 200 = val, remainder = train).
# Note: this is a segment-level split, not a subject-level one, so the same
# subject can appear in more than one split.
# Goal: 4133 train, 200 val, 84 test
np.random.seed(42)

total_segments = len(signal_metadata_df)
print(f"📊 Creating data splits from {total_segments} segments\n")

# Ensure high SQI segments for test set
signal_metadata_df_sorted = signal_metadata_df.sort_values('sqi_score', ascending=False).reset_index(drop=True)

# Take top 84 for test (highest quality)
test_df = signal_metadata_df_sorted.iloc[:84].copy()
remaining_df = signal_metadata_df_sorted.iloc[84:].copy()

# From remaining, take 200 for validation
val_df = remaining_df.iloc[:200].copy()
train_df = remaining_df.iloc[200:].copy()

print(f"✅ Data split created:")
print(f"   Train: {len(train_df)} segments ({100*len(train_df)/total_segments:.1f}%)")
print(f"   Val:   {len(val_df)} segments ({100*len(val_df)/total_segments:.1f}%)")
print(f"   Test:  {len(test_df)} segments ({100*len(test_df)/total_segments:.1f}%)")

# Quality metrics for each split
print(f"\n📈 Quality metrics by split:")
for split_name, split_df in [("Train", train_df), ("Val", val_df), ("Test", test_df)]:
    print(f"\n   {split_name}:")
    print(f"      Mean SQI:  {split_df['sqi_score'].mean():.3f} ± {split_df['sqi_score'].std():.3f}")
    print(f"      Mean SNR:  {split_df['snr_db'].mean():.2f} ± {split_df['snr_db'].std():.2f} dB")
    print(f"      Subjects:  {split_df['subject_id'].nunique()}")

# Verify no segment-level overlap (index values are unique after reset_index);
# this does not check for subjects shared between splits
assert len(set(train_df.index) & set(val_df.index)) == 0, "Train-val overlap!"
assert len(set(train_df.index) & set(test_df.index)) == 0, "Train-test overlap!"
assert len(set(val_df.index) & set(test_df.index)) == 0, "Val-test overlap!"
print(f"\n✅ No overlap between splits")


📊 Creating data splits from 4417 segments

✅ Data split created:
   Train: 4133 segments (93.6%)
   Val:   200 segments (4.5%)
   Test:  84 segments (1.9%)

📈 Quality metrics by split:

   Train:
      Mean SQI:  0.955 ± 0.054
      Mean SNR:  40.52 ± 3.95 dB
      Subjects:  128

   Val:
      Mean SQI:  1.000 ± 0.000
      Mean SNR:  42.63 ± 2.42 dB
      Subjects:  14

   Test:
      Mean SQI:  1.000 ± 0.000
      Mean SNR:  43.13 ± 2.76 dB
      Subjects:  10

✅ No overlap between splits
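
The split above is made on segments ranked by SQI, and the per-split subject counts (128 + 14 + 10) exceed the 130 unique subjects, so some subjects contribute segments to more than one split. If subject-level separation is wanted, one option is to assign whole subjects to a split. The sketch below is an assumption-laden alternative, not the method used in this notebook: `split_by_subject` and its fraction parameters are hypothetical, and it will not reproduce the exact 4133/200/84 counts.

```python
import numpy as np


def split_by_subject(subject_ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Assign every segment of a subject to exactly one split.

    Returns three arrays of row positions: (train, val, test).
    """
    subject_ids = np.asarray(subject_ids)
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(subject_ids))

    n_test = max(1, int(round(test_frac * len(subjects))))
    n_val = max(1, int(round(val_frac * len(subjects))))
    test_subjects = subjects[:n_test]
    val_subjects = subjects[n_test:n_test + n_val]

    test_mask = np.isin(subject_ids, test_subjects)
    val_mask = np.isin(subject_ids, val_subjects)
    train_mask = ~(test_mask | val_mask)
    return (
        np.flatnonzero(train_mask),
        np.flatnonzero(val_mask),
        np.flatnonzero(test_mask),
    )


# Synthetic stand-in: 20 subjects with 5 segments each
demo_subjects = np.repeat([f"p{n:06d}" for n in range(20)], 5)
train_idx, val_idx, test_idx = split_by_subject(demo_subjects)

train_set = set(demo_subjects[train_idx])
val_set = set(demo_subjects[val_idx])
test_set = set(demo_subjects[test_idx])
print(len(train_idx), len(val_idx), len(test_idx))
```

With the notebook's metadata, the positions could be applied as `signal_metadata_df.iloc[train_idx]`, given `signal_metadata_df['subject_id'].to_numpy()` as input.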


## Step 3: Compute Wavelet-Denoised Ground Truth


In [9]:
# Import signal processing modules
from signal_processing.denoising import WaveletDenoiser

# Initialize denoising processor
denoiser = WaveletDenoiser(wavelet='db4', level=5, threshold_method='soft')

print("🔄 Computing wavelet-denoised ground truth for all segments...\n")

# Track denoised signals and create index
denoised_index = {}
denoised_count = 0

# Process all segments
for idx, row in signal_metadata_df.iterrows():
    segment_id = row['global_segment_idx']
    record_name = row['record_name']
    
    # Get original signal from the loaded array
    if signals is not None:
        signal = signals[idx]
    else:
        # Per-segment loading from data/processed/signal_batches is not
        # implemented yet, so skip rather than fall through with `signal`
        # undefined (or stale from a previous iteration)
        print(f"   ⚠️  Signal array not loaded; skipping idx {idx}")
        continue
    
    # Denoise using wavelet decomposition
    denoised_signal = denoiser.denoise(signal)
    
    # Save denoised signal
    denoised_path = DENOISED_SIGNALS_DIR / f"{segment_id:06d}.npy"
    np.save(denoised_path, denoised_signal)
    
    # Track in index
    denoised_index[int(segment_id)] = str(denoised_path.relative_to(DATA_DIR))
    denoised_count += 1
    
    if (denoised_count + 1) % 500 == 0:
        print(f"   Processed {denoised_count}/{len(signal_metadata_df)} segments")

print(f"\n✅ Wavelet denoising complete")
print(f"   Denoised signals: {denoised_count}")
print(f"   Saved to: {DENOISED_SIGNALS_DIR}")

# Save index as JSON for fast lookup
# (JSON object keys are strings, so the integer segment IDs are stored as str)
index_path = DATA_DIR / "denoised_signal_index.json"
with open(index_path, 'w') as f:
    json.dump(denoised_index, f, indent=2)
print(f"   Index saved to: {index_path}")

🔄 Computing wavelet-denoised ground truth for all segments...

   Processed 499/4417 segments
   Processed 999/4417 segments
   Processed 1499/4417 segments
   Processed 1999/4417 segments
   Processed 2499/4417 segments
   Processed 2999/4417 segments
   Processed 3499/4417 segments
   Processed 3999/4417 segments

✅ Wavelet denoising complete
   Denoised signals: 4417
   Saved to: data\processed\denoised_signals
   Index saved to: data\processed\denoised_signal_index.json
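
`WaveletDenoiser` is a project module whose internals are not shown here. As a rough illustration of what a `db4`, level-5, soft-threshold denoiser typically does, the sketch below uses PyWavelets. The function name `wavelet_denoise` and the universal threshold with a median-based noise estimate are assumptions; the project's class may choose its threshold differently.

```python
import numpy as np
import pywt


def wavelet_denoise(signal, wavelet="db4", level=5):
    """Soft-threshold the detail coefficients and reconstruct the signal."""
    signal = np.asarray(signal, dtype=float)
    coeffs = pywt.wavedec(signal, wavelet, level=level)

    # Estimate the noise level from the finest detail coefficients (assumption)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    threshold = sigma * np.sqrt(2.0 * np.log(signal.size))

    # Keep the approximation coefficients, shrink every detail band
    shrunk = [coeffs[0]] + [
        pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]
    ]
    reconstructed = pywt.waverec(shrunk, wavelet)
    # waverec can return one extra sample for odd-length inputs
    return reconstructed[: signal.size]


# Synthetic stand-in for a PPG segment sampled at 125 Hz
rng = np.random.default_rng(0)
t = np.arange(2048) / 125.0
clean = np.sin(2 * np.pi * 1.2 * t)
noisy = clean + 0.2 * rng.standard_normal(clean.shape)

denoised = wavelet_denoise(noisy)
err_noisy = float(np.mean((noisy - clean) ** 2))
err_denoised = float(np.mean((denoised - clean) ** 2))
print(f"MSE noisy: {err_noisy:.4f}, MSE denoised: {err_denoised:.4f}")
```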


## Step 4: Save Data Splits as Parquet Files


In [10]:
# Add segment_id column for tracking
train_df['segment_id'] = train_df['global_segment_idx']
val_df['segment_id'] = val_df['global_segment_idx']
test_df['segment_id'] = test_df['global_segment_idx']

# Save parquet files
train_path = OUTPUT_DIR / "ssl_pretraining_data.parquet"
val_path = OUTPUT_DIR / "ssl_validation_data.parquet"
test_path = OUTPUT_DIR / "ssl_test_data.parquet"

train_df.to_parquet(train_path)
val_df.to_parquet(val_path)
test_df.to_parquet(test_path)

print("✅ Data splits saved to parquet:")
print(f"   Train: {train_path}")
print(f"   Val:   {val_path}")
print(f"   Test:  {test_path}")

# Verify files
print(f"\n📋 Verification:")
print(f"   Train parquet size: {train_path.stat().st_size / 1024**2:.2f} MB")
print(f"   Val parquet size:   {val_path.stat().st_size / 1024**2:.2f} MB")
print(f"   Test parquet size:  {test_path.stat().st_size / 1024**2:.2f} MB")


✅ Data splits saved to parquet:
   Train: data\processed\ssl_pretraining_data.parquet
   Val:   data\processed\ssl_validation_data.parquet
   Test:  data\processed\ssl_test_data.parquet

📋 Verification:
   Train parquet size: 0.17 MB
   Val parquet size:   0.02 MB
   Test parquet size:  0.01 MB
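
Downstream code has to pair each row's `segment_id` with its denoised target through `denoised_signal_index.json`. Two details matter when reading that file: `json.dump` stores the integer keys as strings, and the index was written on Windows, so the relative paths contain backslashes. A minimal sketch of a loader that handles both follows; `load_denoised` is a hypothetical helper, and the demo builds a tiny throwaway index rather than reading the real one.

```python
import json
import tempfile
from pathlib import Path, PureWindowsPath

import numpy as np


def load_denoised(segment_id, index, data_dir):
    """Look up a segment in the JSON index and load its denoised array."""
    # JSON object keys are always strings, even if they were written as ints
    rel = index[str(int(segment_id))]
    # Convert a Windows-style relative path to forward slashes
    rel_posix = PureWindowsPath(rel).as_posix()
    return np.load(Path(data_dir) / rel_posix)


# Throwaway demo: one denoised segment and an index written with an int key
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "denoised_signals").mkdir()
    np.save(data_dir / "denoised_signals" / "000007.npy", np.arange(5.0))

    index_path = data_dir / "denoised_signal_index.json"
    index_path.write_text(json.dumps({7: "denoised_signals\\000007.npy"}))

    index = json.loads(index_path.read_text())
    keys = list(index.keys())
    loaded = load_denoised(7, index, data_dir)

print(keys, loaded)
```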


## Phase 0 Summary


In [11]:
print("\n" + "="*80)
print("PHASE 0 COMPLETION SUMMARY")
print("="*80)

print(f"\n✅ DATA SPLITS CREATED:")
print(f"   Training:   {len(train_df):5} segments (93.6%)")
print(f"   Validation: {len(val_df):5} segments (4.5%)")
print(f"   Test:       {len(test_df):5} segments (1.9%)")

# Per-split subject counts can sum to more than the total because the split
# is made on segments, so a subject may appear in several splits
print(f"\n✅ QUALITY ASSURANCE:")
print(f"   Total unique subjects: {len(signal_metadata_df['subject_id'].unique())}")
print(f"   Train unique subjects: {len(train_df['subject_id'].unique())}")
print(f"   Val unique subjects:   {len(val_df['subject_id'].unique())}")
print(f"   Test unique subjects:  {len(test_df['subject_id'].unique())}")

print(f"\n✅ GROUND TRUTH PREPARATION:")
print(f"   Wavelet denoised signals: {denoised_count}")
print(f"   Index file: {index_path}")
print(f"   Denoised dir: {DENOISED_SIGNALS_DIR}")

print(f"\n✅ OUTPUT FILES:")
print(f"   1. ssl_pretraining_data.parquet ({train_path.stat().st_size / 1024**2:.2f} MB)")
print(f"   2. ssl_validation_data.parquet ({val_path.stat().st_size / 1024**2:.2f} MB)")
print(f"   3. ssl_test_data.parquet ({test_path.stat().st_size / 1024**2:.2f} MB)")
print(f"   4. denoised_signal_index.json ({index_path.stat().st_size / 1024:.2f} KB)")
# Caveat: Path.stat() on a directory reports the size of the directory entry
# itself, not the combined size of the files inside it, so the figure printed
# here is not the total size of the .npy files
print(f"   5. denoised_signals/*.npy ({DENOISED_SIGNALS_DIR.stat().st_size / 1024**2:.2f} MB total)")

print(f"\n" + "="*80)
print("PHASE 0 COMPLETE ✅")
print("Ready for Phase 1: Implement modular SSL components")
print("="*80)



PHASE 0 COMPLETION SUMMARY

✅ DATA SPLITS CREATED:
   Training:    4133 segments (93.6%)
   Validation:   200 segments (4.5%)
   Test:          84 segments (1.9%)

✅ QUALITY ASSURANCE:
   Total unique subjects: 130
   Train unique subjects: 128
   Val unique subjects:   14
   Test unique subjects:  10

✅ GROUND TRUTH PREPARATION:
   Wavelet denoised signals: 4417
   Index file: data\processed\denoised_signal_index.json
   Denoised dir: data\processed\denoised_signals

✅ OUTPUT FILES:
   1. ssl_pretraining_data.parquet (0.17 MB)
   2. ssl_validation_data.parquet (0.02 MB)
   3. ssl_test_data.parquet (0.01 MB)
   4. denoised_signal_index.json (184.40 KB)
   5. denoised_signals/*.npy (1.00 MB total)

PHASE 0 COMPLETE ✅
Ready for Phase 1: Implement modular SSL components
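
The `denoised_signals/*.npy` figure in the summary comes from `DENOISED_SIGNALS_DIR.stat().st_size`, which describes the directory entry rather than the files stored in it. A sketch of how the combined file size could be measured instead is shown below; `dir_size_bytes` is a hypothetical helper, and the demo runs on a temporary directory holding dummy files, not on the real outputs.

```python
import tempfile
from pathlib import Path


def dir_size_bytes(directory: Path, pattern: str = "*.npy") -> int:
    """Sum the sizes of the files in `directory` that match `pattern`."""
    return sum(p.stat().st_size for p in directory.glob(pattern) if p.is_file())


# Throwaway demo: two dummy .npy files plus one file that should be ignored
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "000000.npy").write_bytes(b"\0" * 1000)
    (demo_dir / "000001.npy").write_bytes(b"\0" * 2500)
    (demo_dir / "notes.txt").write_bytes(b"\0" * 10)
    total = dir_size_bytes(demo_dir)

print(f"{total} bytes in matching files")
```

In the notebook this would be called as `dir_size_bytes(DENOISED_SIGNALS_DIR)`.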
