# Data Preparation Tutorial

**Purpose:** A worked example of data preprocessing, using the dataset presented in the paper "Wearable Physiological Signals under Acute Stress and Exercise Conditions".

**What this notebook contains**
- Clear modular functions for loading, preprocessing, feature extraction, combination, and saving.
- Relative paths only — saves outputs under `./Individual Dataset/`.

Run the notebook cells sequentially. The main entry point is `run_complete_pipeline()`,
which creates the following files inside `./Individual Dataset/`:

```
Raw_EDA.csv
Raw_PPG.csv
Features_EDA.csv
Features_PPG.csv
Features_Combined.csv
Features_FourClass_PPG.csv
Features_FourClass_EDA.csv
Features_FourClass_Combined.csv
Features_Combined_Demographics.csv (if applicable)
Features_PPG_Demographics.csv (if applicable)
Features_EDA_Demographics.csv (if applicable)
```

## Overview

The preprocessing pipeline includes:
- Loading and organizing raw physiological data
- Segmenting data based on experimental protocols
- Extracting meaningful features from EDA and PPG signals
- Categorizing data by arousal and valence dimensions
- Creating a four-class emotional-state classification
- Saving processed datasets for further analysis

## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
import neurokit2 as nk
import scipy.stats as stats
import cvxEDA.src.cvxEDA as cvxEDA
from sklearn.preprocessing import MinMaxScaler
import pickle
import warnings
warnings.filterwarnings('ignore')
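The functions in the following sections read three module-level names, `DATA_DIRS`, `LABEL_FILES`, and `OUTPUT_DIR`, that no cell in this notebook defines. A minimal configuration sketch is shown here. The dictionary keys and the output folder come from the rest of the notebook, but `DATA_ROOT` and every folder and file name under it are placeholder assumptions that must be adapted to the local copy of the dataset.

```python
from pathlib import Path

# ASSUMPTION: hypothetical dataset root; point this at your own download.
DATA_ROOT = Path("./data")

# Keys match the condition names used by the pipeline functions.
# ASSUMPTION: the sub-folder names are placeholders.
DATA_DIRS = {
    "stress": DATA_ROOT / "STRESS",
    "aerobic": DATA_ROOT / "AEROBIC",
    "anaerobic": DATA_ROOT / "ANAEROBIC",
}

# Keys match those read by load_stress_labels().
# ASSUMPTION: the CSV file names are placeholders.
LABEL_FILES = {
    "stress_v1": DATA_ROOT / "Stress_Level_v1.csv",
    "stress_v2": DATA_ROOT / "Stress_Level_v2.csv",
}

# Output folder named in the notebook introduction.
OUTPUT_DIR = Path("./Individual Dataset")
```

Using `Path` objects matters here because the loader builds file paths with the `/` operator, for example `DATA_DIRS["stress"] / pid / "EDA.csv"`.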

## 2. Label Processing Functions

These functions process the stress level labels and create event sequences for different experimental conditions.

## Dataset Overview

The **Exercise dataset** (Hongn et al., 2025) contains physiological data from three experimental conditions:

1. **Stress Induction Protocol** (36 participants)  
2. **Aerobic Exercise** (30 participants)  
3. **Anaerobic Exercise** (31 participants)  

Data was collected using **Empatica E4** wearable devices capturing **EDA**, **PPG**, and other physiological signals.

---

## Two-Class Labeling Strategy

### Arousal Categories (High / Low)

**Stress Protocol:**
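- **High Arousal (1)**: Tasks designed to elicit stress responses, labelled high arousal when the self-reported stress score for the segment exceeds 4  
  - Stroop Test  
  - Trier Mental Challenge Test (mathematical tasks with annoying audio)  
  - Controversial opinion vocalization  
  - Backward counting from 1022 in decrements of 13  
- **Low Arousal (0)**: Baseline and rest periods, and any segment with a self-reported stress score of 4 or lower  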

**Exercise Sessions:**
- **High Arousal (1)**: Active cycling periods  
- **Low Arousal (0)**: Baseline, warm-up, cool-down, and rest periods  

---

### Valence Categories (Positive / Negative)

**Stress Protocol:**
- **Positive Valence (1)**: Low self-reported stress scores (3 or lower)  
- **Negative Valence (0)**: High self-reported stress scores (above 3)  

**Aerobic Exercise:**
- **Positive Valence (1)**: Cycling up to 85 rpm speed  
- **Negative Valence (0)**: Cycling above 85 rpm speed  

**Anaerobic Exercise:**
- **Positive Valence (1)**: Initial two sprints  
- **Negative Valence (0)**: Later sprints  
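
For the stress protocol, the label-processing code in this notebook turns each segment's self-reported stress score into the two binary labels with fixed cutoffs: arousal is 1 when the score exceeds 4, and valence is 0 when the score exceeds 3. The sketch below restates that rule. The helper name `binarise_stress_score` and the example scores are illustrative only.

```python
def binarise_stress_score(score, arousal_cutoff=4, valence_cutoff=3):
    """Return (arousal, valence) for one self-reported stress score."""
    arousal = 1 if score > arousal_cutoff else 0
    valence = 0 if score > valence_cutoff else 1
    return arousal, valence


# Made-up scores, one per segment.
scores = [1, 3, 4, 5, 8]
labels = [binarise_stress_score(s) for s in scores]

for score, (arousal, valence) in zip(scores, labels):
    print(f"score={score}: arousal={arousal}, valence={valence}")
```

Because the arousal cutoff is higher than the valence cutoff, no score can yield high arousal together with positive valence under this rule.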

---

## Four-Class Emotional State Mapping

The four-class system combines the arousal and valence dimensions into four emotional states:

### Class 0: Low Arousal, Low Valence (0, 0)
**Physiological Interpretation:** Calm-negative state  
- **Stress Protocol:** Low stress but negative affective state  
- **Aerobic Exercise:** Rest periods with negative valence (post-high intensity)  
- **Anaerobic Exercise:** Rest periods between later sprints  

---

### Class 1: Low Arousal, High Valence (0, 1)
**Physiological Interpretation:** Calm-positive state  
- **Stress Protocol:** Low stress with positive affective state  
- **Aerobic Exercise:** Warm-up and initial cycling phases  
- **Anaerobic Exercise:** Rest periods between initial sprints  

---

### Class 2: High Arousal, Low Valence (1, 0)
**Physiological Interpretation:** Agitated-negative state  
- **Stress Protocol:** High stress conditions (Stroop, TMCT, controversial topics)  
- **Aerobic Exercise:** High-intensity cycling (>85 rpm)  
- **Anaerobic Exercise:** Later sprint sessions  

---

### Class 3: High Arousal, High Valence (1, 1)
**Physiological Interpretation:** Excited-positive state  
- **Stress Protocol:** Not typically present in stress induction  
- **Aerobic Exercise:** Moderate-intensity cycling (≤85 rpm)  
- **Anaerobic Exercise:** Initial sprint sessions  
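
The class index is the pair of binary labels read as a two-bit number, `2 * arousal + valence`. A small sketch of that encoding follows. The helper name `four_class` and the `CLASS_NAMES` dictionary are illustrative and not part of the notebook.

```python
# Short names taken from the "Physiological Interpretation" lines above.
CLASS_NAMES = {
    0: "calm-negative",
    1: "calm-positive",
    2: "agitated-negative",
    3: "excited-positive",
}


def four_class(arousal, valence):
    """Encode binary (arousal, valence) labels as a class index in 0..3."""
    if arousal not in (0, 1) or valence not in (0, 1):
        raise ValueError(f"labels must be 0 or 1, got ({arousal}, {valence})")
    return 2 * arousal + valence


for arousal in (0, 1):
    for valence in (0, 1):
        label = four_class(arousal, valence)
        print(f"({arousal}, {valence}) -> class {label}: {CLASS_NAMES[label]}")
```

Rejecting unexpected inputs is a design choice of this sketch: an `if`/`else` chain that ends in a bare `else` would assign class 3 to any value it does not recognise, including missing labels.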

---

In [None]:
def load_stress_labels():
    """Load and process stress level labels from CSV files."""
    level1_csv = pd.read_csv(LABEL_FILES["stress_v1"])
    level2_csv = pd.read_csv(LABEL_FILES["stress_v2"])
    return level1_csv, level2_csv

def create_stress_bin(level1_csv, level2_csv, data_folder):
    """Create stress event sequences for each participant."""
    pids = [f.name for f in os.scandir(data_folder) if f.is_dir()]
    stress_bin = {}
    
    for pid in pids:
        if pid == "f14_b":
            continue
        elif pid == "f14_a":
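            # f14 has two recording folders (f14_a, f14_b); use f14_a and look up
            # its labels under the ID "f14". The result is also stored under "f14".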
            pid_df = level2_csv[level2_csv['Unnamed: 0'] == "f14"]
            arousal_seq, valence_seq = create_stress_sequence_v2(pid_df)
            stress_bin["f14"] = (arousal_seq, valence_seq)
        elif pid[0] == "f":
            # Version 2 participants
            pid_df = level2_csv[level2_csv['Unnamed: 0'] == pid]
            arousal_seq, valence_seq = create_stress_sequence_v2(pid_df)
            stress_bin[pid] = (arousal_seq, valence_seq)
        elif pid[0] == "S":
            # Version 1 participants
            pid_df = level1_csv[level1_csv['Unnamed: 0'] == pid]
            arousal_seq, valence_seq = create_stress_sequence_v1(pid_df)
            stress_bin[pid] = (arousal_seq, valence_seq)
        else:
            print(f"Unknown PID format: {pid}")
    
    return stress_bin

def create_stress_sequence_v1(pid_df):
    """Create stress event sequence for version 1 participants."""
    arousal_seq = [
        (3, pid_df['Baseline'].tolist()[0]), 
        (5, pid_df['Stroop'].tolist()[0]), 
        (10, pid_df['First Rest'].tolist()[0]), 
        (13, pid_df['TMCT'].tolist()[0]), 
        (18, pid_df['Second Rest'].tolist()[0]), 
        (18.5, pid_df['Real Opinion'].tolist()[0]), 
        (19, pid_df['Opposite Opinion'].tolist()[0]), 
        (19.5, pid_df['Subtract'].tolist()[0])
    ]
    arousal_seq = [(i[0], 1 if i[1] > 4 else 0) for i in arousal_seq]
    
    valence_seq = [
        (3, pid_df['Baseline'].tolist()[0]), 
        (5, pid_df['Stroop'].tolist()[0]), 
        (10, pid_df['First Rest'].tolist()[0]), 
        (13, pid_df['TMCT'].tolist()[0]), 
        (18, pid_df['Second Rest'].tolist()[0]), 
        (18.5, pid_df['Real Opinion'].tolist()[0]), 
        (19, pid_df['Opposite Opinion'].tolist()[0]), 
        (19.5, pid_df['Subtract'].tolist()[0])
    ]
    valence_seq = [(i[0], 0 if i[1] > 3 else 1) for i in valence_seq]
    
    return arousal_seq, valence_seq

def create_stress_sequence_v2(pid_df):
    """Create stress event sequence for version 2 participants."""
    arousal_seq = [
        (3, pid_df['Baseline'].tolist()[0]), 
        (6, pid_df['TMCT'].tolist()[0]), 
        (16, pid_df['First Rest'].tolist()[0]), 
        (16.5, pid_df['Real Opinion'].tolist()[0]), 
        (17, pid_df['Opposite Opinion'].tolist()[0]), 
        (27, pid_df['Second Rest'].tolist()[0]), 
        (27.5, pid_df['Subtract'].tolist()[0])
    ]
    arousal_seq = [(i[0], 1 if i[1] > 4 else 0) for i in arousal_seq]
    
    valence_seq = [
        (3, pid_df['Baseline'].tolist()[0]), 
        (6, pid_df['TMCT'].tolist()[0]), 
        (16, pid_df['First Rest'].tolist()[0]), 
        (16.5, pid_df['Real Opinion'].tolist()[0]), 
        (17, pid_df['Opposite Opinion'].tolist()[0]), 
        (27, pid_df['Second Rest'].tolist()[0]), 
        (27.5, pid_df['Subtract'].tolist()[0])
    ]
    valence_seq = [(i[0], 0 if i[1] > 3 else 1) for i in valence_seq]
    
    return arousal_seq, valence_seq
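
Each sequence above is a list of `(end time in minutes, label)` tuples, and the arousal and valence lists repeat the same schedule. As a sketch of one way to avoid that repetition, the schedule can be written once and both sequences derived from it. The names `V1_SCHEDULE` and `build_sequences` and the example scores are hypothetical; the timings and cutoffs are copied from `create_stress_sequence_v1`.

```python
# (end minute, label-file column) for version 1 participants.
V1_SCHEDULE = [
    (3, "Baseline"), (5, "Stroop"), (10, "First Rest"), (13, "TMCT"),
    (18, "Second Rest"), (18.5, "Real Opinion"),
    (19, "Opposite Opinion"), (19.5, "Subtract"),
]


def build_sequences(scores, schedule):
    """Derive arousal and valence sequences from one row of stress scores.

    `scores` is any mapping from column name to score, such as a dict or
    a pandas Series obtained with `pid_df.iloc[0]`.
    """
    arousal_seq = [(t, 1 if scores[name] > 4 else 0) for t, name in schedule]
    valence_seq = [(t, 0 if scores[name] > 3 else 1) for t, name in schedule]
    return arousal_seq, valence_seq


# Made-up scores for one participant.
example_scores = {
    "Baseline": 1, "Stroop": 6, "First Rest": 2, "TMCT": 7,
    "Second Rest": 4, "Real Opinion": 5, "Opposite Opinion": 5, "Subtract": 8,
}
arousal_seq, valence_seq = build_sequences(example_scores, V1_SCHEDULE)
print(arousal_seq)
print(valence_seq)
```

Because both lists come from the same schedule, their timings cannot drift apart, which the segmentation step relies on when it pairs segment `i` with `valence_seq[i]`.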

## 4. Data Loading and Segmentation

These functions handle loading raw physiological data and segmenting it according to experimental protocols.
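
The loader in the next cell relies on the layout of single-channel Empatica E4 exports, where the first row normally holds the session start time as a Unix timestamp, the second row holds the sampling rate in Hz, and the samples follow one per row. The sketch below parses that layout explicitly. It assumes the dataset's `EDA.csv` and `BVP.csv` files keep this format. The function name `parse_e4_single_channel` and the sample values are made up.

```python
import io


def parse_e4_single_channel(text):
    """Split an E4-style single-column export into its three parts."""
    lines = [line.strip() for line in io.StringIO(text) if line.strip()]
    start_time = float(lines[0])     # row 1: session start (Unix time)
    sampling_rate = float(lines[1])  # row 2: sampling rate in Hz
    samples = [float(value) for value in lines[2:]]
    return start_time, sampling_rate, samples


# Made-up content in the assumed layout.
example = "1644226038.0\n4.0\n0.00\n0.12\n0.25\n0.26\n"
start_time, sampling_rate, samples = parse_e4_single_channel(example)
print(start_time, sampling_rate, samples)
```

Reading the rate from the file is also a cheap way to confirm the 4 Hz (EDA) and 64 Hz (PPG) defaults used by the segmentation code.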

In [None]:
def load_physiological_data(pid, condition):
    """Load EDA and PPG data for a given participant and condition."""
    data_folder = DATA_DIRS[condition]
    
    try:
        eda_data = pd.read_csv(data_folder / pid / "EDA.csv")
        ppg_data = pd.read_csv(data_folder / pid / "BVP.csv")
        
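        # pandas uses the first row as the header (in the standard Empatica E4
        # export this is the session start timestamp), so the first remaining
        # value is the sampling-rate row; drop it to keep only the samples.
        eda_values = eda_data[eda_data.columns[0]].tolist()[1:]
        ppg_values = ppg_data[ppg_data.columns[0]].tolist()[1:]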
        
        return eda_values, ppg_values
    except FileNotFoundError:
        print(f"Data not found for {pid} in {condition}")
        return None, None

def segment_data(data, event_sequence, sampling_rate, start_offset=0):
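    """Segment data based on event sequence timings.

    Each event is (end_minute, label). Segment i runs from the end of the
    previous event (or start_offset for the first event) to its own end time;
    the last segment runs to the end of the data.
    """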
    segments = []
    
    for i in range(len(event_sequence)):
        current_time = event_sequence[i][0]
        
        if i == 0:
            start_idx = start_offset
        else:
            prev_time = event_sequence[i-1][0]
            start_idx = int(prev_time * 60 * sampling_rate)
        
        if i == len(event_sequence) - 1:
            end_idx = len(data)
        else:
            end_idx = int(current_time * 60 * sampling_rate)
        
        segment = data[start_idx:end_idx]
        segments.append(segment)
    
    return segments

def create_segmented_dataset(condition_bin, condition, eda_sampling_rate=4, ppg_sampling_rate=64):
    """Create segmented dataset for a given condition."""
    columns = ["PID", "arousal_category", "valence_category", "Data"]
    eda_rows, ppg_rows = [], []

    for pid in condition_bin.keys():
        eda_data, ppg_data = load_physiological_data(pid, condition)

        if eda_data is None or ppg_data is None:
            continue

        arousal_seq = condition_bin[pid][0]
        valence_seq = condition_bin[pid][1]

        # Apply condition-specific preprocessing
        eda_processed, ppg_processed = preprocess_condition_data(pid, condition, eda_data, ppg_data)

        # Segment data (the arousal and valence sequences share the same timings)
        eda_segments = segment_data(eda_processed, arousal_seq, eda_sampling_rate)
        ppg_segments = segment_data(ppg_processed, arousal_seq, ppg_sampling_rate)

        # Collect one row per segment
        for i, (eda_segment, ppg_segment) in enumerate(zip(eda_segments, ppg_segments)):
            arousal_cat = arousal_seq[i][1]
            valence_cat = valence_seq[i][1]

            eda_rows.append({"PID": pid, "arousal_category": arousal_cat, "valence_category": valence_cat, "Data": eda_segment})
            ppg_rows.append({"PID": pid, "arousal_category": arousal_cat, "valence_category": valence_cat, "Data": ppg_segment})

    # Build each DataFrame once instead of concatenating inside the loop
    raw_eda = pd.DataFrame(eda_rows, columns=columns)
    raw_ppg = pd.DataFrame(ppg_rows, columns=columns)

    return raw_eda, raw_ppg

def preprocess_condition_data(pid, condition, eda_data, ppg_data):
    """Apply condition-specific preprocessing to data."""
    # This function should implement the condition-specific trimming and preprocessing
    # shown in the original code for different PIDs and conditions
    
    # Placeholder implementation - extend this based on your specific preprocessing needs
    if condition == "stress":
        # Apply stress-specific preprocessing
        if pid in ["f01", "f02", "f03", "f04"]:
            # Example: Trim specific ranges for certain participants
            eda_processed = eda_data[500:6500]
            ppg_processed = ppg_data[int(500*64/4):int(6500*64/4)]
        else:
            eda_processed = eda_data
            ppg_processed = ppg_data
    else:
        eda_processed = eda_data
        ppg_processed = ppg_data
    
    return eda_processed, ppg_processed
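
The segmentation step converts minute marks into sample indices with `minute * 60 * sampling_rate`. The sketch below reproduces only that boundary arithmetic for a made-up three-event sequence over a 12-minute recording. The helper name `segment_bounds` is hypothetical and returns index pairs rather than slicing any data.

```python
def segment_bounds(event_sequence, sampling_rate, n_samples, start_offset=0):
    """Return (start_idx, end_idx) for each event, mirroring segment_data()."""
    bounds = []
    start = start_offset
    for i, (end_minute, _label) in enumerate(event_sequence):
        is_last = i == len(event_sequence) - 1
        # The last segment runs to the end of the recording.
        end = n_samples if is_last else int(end_minute * 60 * sampling_rate)
        bounds.append((start, end))
        start = end
    return bounds


events = [(3, 0), (5, 1), (10, 0)]  # made-up (end minute, label) pairs
minutes_recorded = 12

eda_bounds = segment_bounds(events, 4, minutes_recorded * 60 * 4)
ppg_bounds = segment_bounds(events, 64, minutes_recorded * 60 * 64)
print(eda_bounds)  # [(0, 720), (720, 1200), (1200, 2880)]
print(ppg_bounds)  # [(0, 11520), (11520, 19200), (19200, 46080)]
```

Two consequences are worth keeping in mind. The final segment extends past its nominal end time (minute 10 here) to the end of the recording, and any trimming applied before segmentation shifts what "minute 0" means, so the event times must be relative to the trimmed signal.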

## 5. Feature Extraction

These functions extract meaningful features from EDA and PPG signals.

In [None]:
def extract_eda_features(eda_signal, sampling_rate=4):
    """Extract features from EDA signal."""
    features = {}
    
    try:
        # Basic statistical features
        features['eda_mean'] = np.mean(eda_signal)
        features['eda_std'] = np.std(eda_signal)
        features['eda_skew'] = stats.skew(eda_signal)
        features['eda_kurtosis'] = stats.kurtosis(eda_signal)
        
        # NeuroKit2 EDA analysis
        eda_cleaned = nk.eda_clean(eda_signal, sampling_rate=sampling_rate)
        eda_decomposed = nk.eda_phasic(eda_cleaned, sampling_rate=sampling_rate)
        
        # Phasic and tonic components
        features['eda_tonic_mean'] = np.mean(eda_decomposed['EDA_Tonic'])
        features['eda_phasic_mean'] = np.mean(eda_decomposed['EDA_Phasic'])
        
        # Peak detection
        signals, info = nk.eda_peaks(eda_decomposed['EDA_Phasic'], sampling_rate=sampling_rate)
        features['eda_peaks_count'] = len(info['SCR_Peaks'])
        
    except Exception as e:
        print(f"Error in EDA feature extraction: {e}")
        # Set default values for failed extractions
        for key in ['eda_mean', 'eda_std', 'eda_skew', 'eda_kurtosis', 
                   'eda_tonic_mean', 'eda_phasic_mean', 'eda_peaks_count']:
            features[key] = 0.0
    
    return features

def extract_ppg_features(ppg_signal, sampling_rate=64):
    """Extract features from PPG signal."""
    features = {}
    
    try:
        # Basic statistical features
        features['ppg_mean'] = np.mean(ppg_signal)
        features['ppg_std'] = np.std(ppg_signal)
        features['ppg_skew'] = stats.skew(ppg_signal)
        features['ppg_kurtosis'] = stats.kurtosis(ppg_signal)
        
        # Heart rate variability features
        ppg_cleaned = nk.ppg_clean(ppg_signal, sampling_rate=sampling_rate)
        signals, info = nk.ppg_process(ppg_cleaned, sampling_rate=sampling_rate)
        
        # Heart rate features
        features['hr_mean'] = np.mean(signals['PPG_Rate'])
        features['hr_std'] = np.std(signals['PPG_Rate'])
        
        # Additional PPG features
        hrv_features = nk.ppg_analyze(signals, sampling_rate=sampling_rate)
        if not hrv_features.empty:
            for col in hrv_features.columns:
                features[f'ppg_{col}'] = hrv_features[col].iloc[0]
        
    except Exception as e:
        print(f"Error in PPG feature extraction: {e}")
        # Set default values for failed extractions
        for key in ['ppg_mean', 'ppg_std', 'ppg_skew', 'ppg_kurtosis', 'hr_mean', 'hr_std']:
            features[key] = 0.0
    
    return features

def create_feature_dataset(raw_data, feature_extraction_func, signal_type):
    """Create feature dataset from raw segmented data."""
    features_list = []
    
    for idx, row in raw_data.iterrows():
        pid = row['PID']
        arousal = row['arousal_category']
        valence = row['valence_category']
        signal_data = row['Data']
        
        # Extract features
        features = feature_extraction_func(signal_data)
        
        # Add metadata
        features['PID'] = pid
        features['arousal_category'] = arousal
        features['valence_category'] = valence
        features['four_class_category'] = create_four_class_category(arousal, valence)
        
        features_list.append(features)
    
    feature_df = pd.DataFrame(features_list)
    return feature_df

def create_four_class_category(arousal, valence):
    """Create four-class emotional state category."""
    # [low arousal low valence, low arousal high valence, high arousal low valence, high arousal high valence]
    if arousal == 0 and valence == 0:
        return 0  # Low arousal, low valence
    elif arousal == 0 and valence == 1:
        return 1  # Low arousal, high valence
    elif arousal == 1 and valence == 0:
        return 2  # High arousal, low valence
    else:  # arousal == 1 and valence == 1
        return 3  # High arousal, high valence
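
A quick way to sanity-check the statistical features is to run the same NumPy and SciPy calls on a synthetic signal whose statistics are known in advance. The sketch below uses a 1 Hz sine wave sampled at 64 Hz for exactly ten periods. The signal is purely illustrative and is not meant to resemble real PPG or EDA data.

```python
import numpy as np
import scipy.stats as stats

fs = 64                                  # samples per second
t = np.arange(10 * fs) / fs              # 10 s, i.e. ten full periods
offset, amplitude = 2.0, 0.5
signal = offset + amplitude * np.sin(2 * np.pi * 1.0 * t)

features = {
    "mean": float(np.mean(signal)),              # expected: offset
    "std": float(np.std(signal)),                # expected: amplitude / sqrt(2)
    "skew": float(stats.skew(signal)),           # expected: 0 (symmetric)
    "kurtosis": float(stats.kurtosis(signal)),   # expected: -1.5 (excess)
}

for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

Note that `np.std` computes the population standard deviation by default (`ddof=0`) and `stats.kurtosis` returns excess kurtosis, so a normal distribution would score 0 rather than 3.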

## 6. Main Processing Pipeline

This section runs the complete preprocessing pipeline.

In [None]:
def run_complete_pipeline():
    """Run the complete data preprocessing pipeline."""
    
    print("Starting data preprocessing pipeline...")
    
    # Step 1: Load and process labels
    print("Step 1: Loading stress labels...")
    level1_csv, level2_csv = load_stress_labels()
    stress_bin = create_stress_bin(level1_csv, level2_csv, DATA_DIRS["stress"])
    
    # Define aerobic and anaerobic sequences (simplified - extend as needed)
    aerobic_bin = create_aerobic_sequences(stress_bin)
    anaerobic_bin = create_anaerobic_sequences(stress_bin)
    
    # Step 2: Create segmented datasets
    print("Step 2: Creating segmented datasets...")
    
    # Process stress condition
    raw_eda_stress, raw_ppg_stress = create_segmented_dataset(stress_bin, "stress")
    
    # Process aerobic condition
    raw_eda_aerobic, raw_ppg_aerobic = create_segmented_dataset(aerobic_bin, "aerobic")
    
    # Process anaerobic condition
    raw_eda_anaerobic, raw_ppg_anaerobic = create_segmented_dataset(anaerobic_bin, "anaerobic")
    
    # Combine all conditions
    raw_eda_combined = pd.concat([raw_eda_stress, raw_eda_aerobic, raw_eda_anaerobic], ignore_index=True)
    raw_ppg_combined = pd.concat([raw_ppg_stress, raw_ppg_aerobic, raw_ppg_anaerobic], ignore_index=True)
    
    # Step 3: Extract features
    print("Step 3: Extracting features...")
    
    features_eda = create_feature_dataset(raw_eda_combined, extract_eda_features, "EDA")
    features_ppg = create_feature_dataset(raw_ppg_combined, extract_ppg_features, "PPG")
    
    # Step 4: Create combined features
    print("Step 4: Creating combined features...")
    features_combined = combine_features(features_eda, features_ppg)
    
    # Step 5: Create four-class datasets
    print("Step 5: Creating four-class datasets...")
    features_fourclass_eda = create_four_class_dataset(features_eda)
    features_fourclass_ppg = create_four_class_dataset(features_ppg)
    features_fourclass_combined = create_four_class_dataset(features_combined)
    
    # Step 6: Save all datasets
    print("Step 6: Saving datasets...")
    save_datasets(
        raw_eda_combined, raw_ppg_combined,
        features_eda, features_ppg, features_combined,
        features_fourclass_eda, features_fourclass_ppg, features_fourclass_combined
    )
    
    print("Pipeline completed successfully!")
    
    return {
        'raw_eda': raw_eda_combined,
        'raw_ppg': raw_ppg_combined,
        'features_eda': features_eda,
        'features_ppg': features_ppg,
        'features_combined': features_combined,
        'features_fourclass_eda': features_fourclass_eda,
        'features_fourclass_ppg': features_fourclass_ppg,
        'features_fourclass_combined': features_fourclass_combined
    }

def create_aerobic_sequences(stress_bin):
    """Create aerobic event sequences."""
    aerobic_bin = {}
    for pid in stress_bin.keys():
        if pid[0] == "f":
            arousal_seq = [(4.5, 0), (6.75, 0), (8.25, 1), (9.75, 1), (11.25, 1), (22.5, 1), (27, 1), (30, 0), (32, 0)]
            valence_seq = [(4.5, 1), (6.75, 1), (8.25, 1), (9.75, 1), (11.25, 1), (22.5, 0), (27, 0), (30, 0), (32, 1)]
        else:  # S participants
            arousal_seq = [(3, 0), (6, 0), (9, 1), (12, 1), (15, 1), (18, 1), (21, 1), (23, 1), (25, 1), (27, 1), (29, 1), (33, 0), (35, 0)]
            valence_seq = [(3, 1), (6, 1), (9, 1), (12, 1), (15, 1), (18, 0), (21, 0), (23, 0), (25, 0), (27, 0), (29, 0), (33, 0), (35, 1)]
        aerobic_bin[pid] = (arousal_seq, valence_seq)
    return aerobic_bin

def create_anaerobic_sequences(stress_bin):
    """Create anaerobic event sequences."""
    anaerobic_bin = {}
    for pid in stress_bin.keys():
        if pid[0] == "f":
            arousal_seq = [(4.5, 0), (9, 0), (9.75, 1), (14, 0), (14.75, 1), (18.5, 0), (19.25, 1), (23, 0), (23.75, 1), (27.5, 0), (29.5, 0)]
            valence_seq = [(4.5, 1), (9, 1), (9.75, 1), (14, 1), (14.75, 1), (18.5, 1), (19.25, 0), (23, 0), (23.75, 0), (27.5, 0), (29.5, 1)]
        else:  # S participants
            arousal_seq = [(3, 0), (3.5, 1), (7.5, 0), (8, 1), (12, 0), (12.5, 1), (16.5, 0), (18.5, 0)]
            valence_seq = [(3, 1), (3.5, 1), (7.5, 1), (8, 1), (12, 1), (12.5, 0), (16.5, 0), (18.5, 1)]
        anaerobic_bin[pid] = (arousal_seq, valence_seq)
    return anaerobic_bin

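def combine_features(features_eda, features_ppg):
    """Combine EDA and PPG features row by row."""
    key_cols = ['PID', 'arousal_category', 'valence_category', 'four_class_category']
    # Both tables are built from the same segments in the same order, so align
    # them by position. Merging on the key columns alone would pair every EDA
    # segment with every PPG segment that shares a PID and labels.
    same_length = len(features_eda) == len(features_ppg)
    if not same_length or (features_eda[key_cols].values != features_ppg[key_cols].values).any():
        raise ValueError("EDA and PPG feature rows are not aligned")
    combined = pd.concat(
        [features_eda.reset_index(drop=True),
         features_ppg.drop(columns=key_cols).reset_index(drop=True)],
        axis=1
    )
    return combined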

def create_four_class_dataset(features_df):
    """Create dataset focused on four-class classification."""
    # Keep PID, the four-class label, and the feature columns; drop the binary arousal/valence labels
    four_class_df = features_df.copy()
    four_class_df = four_class_df[['PID', 'four_class_category'] + 
                                 [col for col in four_class_df.columns 
                                  if col not in ['PID', 'arousal_category', 'valence_category', 'four_class_category']]]
    return four_class_df

def save_datasets(*datasets):
    """Save all processed datasets."""
    dataset_names = [
        "Raw_EDA", "Raw_PPG", 
        "Features_EDA", "Features_PPG", "Features_Combined",
        "Features_FourClass_EDA", "Features_FourClass_PPG", "Features_FourClass_Combined"
    ]
    
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # create the output folder if it is missing
    for name, dataset in zip(dataset_names, datasets):
        filename = OUTPUT_DIR / f"{name}.csv"
        dataset.to_csv(filename, index=False)
        print(f"Saved: {filename}")

# Run the complete pipeline
results = run_complete_pipeline()
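
Joining the EDA and PPG feature tables on `PID` and the label columns alone is ambiguous when a participant has several segments with the same labels, because a merge emits every matching pair. The toy sketch below (made-up values) contrasts a key-based merge with positional alignment. Positional alignment is only valid when both tables list the same segments in the same order.

```python
import pandas as pd

keys = ["PID", "arousal_category", "valence_category", "four_class_category"]

# Three segments for one participant; the first two share identical labels.
labels = {
    "PID": ["S01", "S01", "S01"],
    "arousal_category": [0, 0, 1],
    "valence_category": [1, 1, 0],
    "four_class_category": [1, 1, 2],
}
eda = pd.DataFrame({**labels, "eda_mean": [0.1, 0.2, 0.3]})
ppg = pd.DataFrame({**labels, "hr_mean": [60.0, 62.0, 90.0]})

# Key-based merge: the two identical-label rows match each other 2 x 2 times.
merged = pd.merge(eda, ppg, on=keys)

# Positional alignment: one output row per segment.
aligned = pd.concat([eda, ppg.drop(columns=keys)], axis=1)

print(len(merged), len(aligned))  # 5 3
```

An alternative that keeps the merge is to add an explicit per-participant segment index to both tables and include it in the keys. Passing `validate="one_to_one"` to `pd.merge` makes pandas raise an error instead of silently duplicating rows.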

## 7. Summary and Next Steps

The preprocessing pipeline has successfully:

1. **Loaded and processed** raw physiological data from multiple conditions
2. **Segmented** data according to experimental protocols
3. **Extracted meaningful features** from EDA and PPG signals
4. **Created categorical labels** for emotional states
5. **Saved organized datasets** for further analysis

### Output Files Created:
```
Individual Dataset/
├── Raw_EDA.csv
├── Raw_PPG.csv
├── Features_EDA.csv
├── Features_PPG.csv
├── Features_Combined.csv
├── Features_FourClass_PPG.csv
├── Features_FourClass_EDA.csv
└── Features_FourClass_Combined.csv
```

### Four-Class Emotional States:
- **Class 0**: Low arousal, low valence
- **Class 1**: Low arousal, high valence  
- **Class 2**: High arousal, low valence
- **Class 3**: High arousal, high valence

These datasets are now ready for machine learning model training and analysis.

## 8. Dataset Exploration

Let's explore the created datasets to understand their structure and distributions.

In [None]:
def explore_datasets(results):
    """Explore the created datasets."""
    
    print("=== Dataset Overview ===\n")
    
    for name, dataset in results.items():
        print(f"{name}:")
        print(f"  Shape: {dataset.shape}")
        if 'arousal_category' in dataset.columns:
            print(f"  Arousal distribution: {dataset['arousal_category'].value_counts().to_dict()}")
        if 'valence_category' in dataset.columns:
            print(f"  Valence distribution: {dataset['valence_category'].value_counts().to_dict()}")
        if 'four_class_category' in dataset.columns:
            print(f"  Four-class distribution: {dataset['four_class_category'].value_counts().to_dict()}")
        print()

# Explore the datasets
explore_datasets(results)
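
One detail to watch when reloading the raw files: the `Data` column holds a Python list per row, and `to_csv` writes each list as its string representation. The sketch below shows a round trip through an in-memory buffer with made-up values, then restores the lists with `ast.literal_eval`. It assumes the segments contain only plain numbers.

```python
import ast
import io

import pandas as pd

raw = pd.DataFrame({
    "PID": ["S01", "S01"],
    "arousal_category": [0, 1],
    "valence_category": [1, 0],
    "Data": [[0.1, 0.2, 0.3], [0.4, 0.5]],  # made-up samples
})

# Write to and read back from an in-memory buffer instead of a file.
buffer = io.StringIO()
raw.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

type_before = type(loaded["Data"].iloc[0])  # str: the list came back as text

# Parse each string back into a list of floats.
loaded["Data"] = loaded["Data"].apply(ast.literal_eval)
type_after = type(loaded["Data"].iloc[0])   # list

print(type_before.__name__, type_after.__name__)
```

`ast.literal_eval` rejects tokens such as `nan`, so segments with missing samples need cleaning before saving or a more tolerant parser when loading.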