## Task 3 — Data Augmentation and Feature Engineering (30 points)
> First Practice: Time‑Compression Data Augmentation

- Our first data-augmentation technique creates an additional copy of each ECG signal that is sped up (its duration is compressed by 20%). The steps are:
   - **Load originals:** read the raw signals into `X_orig`.  
   - **Compress:** for each `s` in `X_orig`, apply `time_compress(s, rate=0.8)` to build `X_comp`.  
   - **Augment:** concatenate `X_all = X_orig + X_comp` and duplicate the labels, `y_all = [y_orig; y_orig]`.  
   - **Split:** run a **stratified** train/validation split on `(X_all, y_all)` to preserve class balance.  
- Reasoning:
   - **Variation in heart rate:** Simulates faster‑than‑normal beats (at a fixed sampling rate, 0.8× the duration reads as 1.25× the heart rate), making the model more robust to timing shifts.  
   - **Increases data volume:** Doubles the dataset without collecting new recordings.  
   - **Preserves labels:** A 20% time‑compression is assumed not to change the arrhythmia class, so each copy reuses the label of its original.  
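
To make the heart-rate reasoning above concrete, here is a minimal sketch on a synthetic beat train rather than real ECG. The 300 Hz sampling rate, the Gaussian beat shape, and the helper names `synthetic_beats`, `compress`, and `estimate_bpm` are illustrative assumptions, and `np.interp` stands in for the notebook's `time_compress`. Read back at the same sampling rate, a signal resampled to 80% of its length should show an apparent rate about 1/0.8 = 1.25 times higher.

```python
import numpy as np
from scipy.signal import find_peaks

FS = 300  # assumed sampling rate in Hz (hypothetical, not taken from the dataset)

def synthetic_beats(bpm=60, seconds=10, fs=FS, width=5.0):
    """Toy 'ECG': one Gaussian bump per beat (illustrative only)."""
    n = int(seconds * fs)
    t = np.arange(n)
    period = int(round(60 * fs / bpm))            # samples between beats
    centers = np.arange(period // 2, n, period)   # beat positions
    bumps = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / width) ** 2)
    return bumps.sum(axis=1)

def compress(signal, rate=0.8):
    """Linearly resample the signal to int(len(signal) * rate) samples."""
    n = len(signal)
    new_n = int(n * rate)
    return np.interp(np.linspace(0, n - 1, new_n), np.arange(n), signal)

def estimate_bpm(signal, fs=FS):
    """Estimate the beat rate from the median spacing between detected peaks."""
    peaks, _ = find_peaks(signal, height=0.5)
    return 60.0 * fs / np.median(np.diff(peaks))

original = synthetic_beats(bpm=60)
compressed = compress(original, rate=0.8)
bpm_original = estimate_bpm(original)
bpm_compressed = estimate_bpm(compressed)

print("samples:", len(original), "->", len(compressed))
print("apparent bpm:", round(bpm_original, 1), "->", round(bpm_compressed, 1))
```

On this toy input the estimate should move from roughly 60 bpm to roughly 75 bpm. Whether a shift of that size is harmless for every arrhythmia class in the real data is a separate question that this sketch does not answer.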

> Second Practice:


In [4]:
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

In [5]:
import numpy as np
from scipy.interpolate import interp1d

def time_compress(signal, rate=0.8):
    """
    Applies time compression to a 1D time series signal.

    This technique, often referred to as time stretching/compression or time warping,
    simulates natural variations in the speed or duration of events within a time series.
    By altering the temporal dimension, it helps the model become more robust to the
    precise "sampling" or timing of the signal, enhancing its invariance to time deformation.[1]

    Args:
        signal (np.ndarray): The input 1D time series signal.
        rate (float): The compression rate. A rate < 1.0 compresses the signal
                      (makes it shorter/faster), while a rate > 1.0 stretches it
                      (makes it longer/slower). Default is 0.8 for compression.

    Returns:
        np.ndarray: The time-compressed signal.
    """
    original_length = len(signal)
    new_length = int(original_length * rate)

    # Create original time points (indices)
    original_time_points = np.arange(original_length)

    # Create new time points for the compressed signal
    # These points will span the original signal's "time" but with a new number of steps
    new_time_points = np.linspace(0, original_length - 1, new_length)

    # Create an interpolation function based on the original signal
    # 'linear' interpolation is a common and simple choice for time series.
    interpolator = interp1d(original_time_points, signal, kind='linear', fill_value="extrapolate")

    # Apply the interpolation to the new time points
    compressed_signal = interpolator(new_time_points)

    return compressed_signal
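
A quick sanity check of the resampling behaviour can be sketched as follows. This is a stand-in under stated assumptions: `resample_linear` is a hypothetical helper that repeats the same index arithmetic with `np.interp` so the block runs on its own, and the ramp input is made up for illustration.

```python
import numpy as np

def resample_linear(signal, rate):
    """Hypothetical stand-in for time_compress: resample to int(len * rate) samples."""
    n = len(signal)
    new_n = int(n * rate)
    new_t = np.linspace(0, n - 1, new_n)   # same span, different number of steps
    return np.interp(new_t, np.arange(n), signal)

ramp = np.arange(10, dtype=float)          # toy signal 0, 1, ..., 9
shorter = resample_linear(ramp, rate=0.8)  # compression: fewer samples
longer = resample_linear(ramp, rate=1.5)   # stretching: more samples

print(len(ramp), len(shorter), len(longer))
print(shorter[0], shorter[-1])
```

Because the new grid still runs from index 0 to `len(signal) - 1`, the first and last samples are kept and only the spacing between samples changes. The output length is truncated by `int`, so augmented signals are shorter than their originals and any fixed-length model input would need padding or cropping downstream.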

In [6]:
import torch
import pandas as pd
import numpy as np
from torch.utils.data import DataLoader
from src.parser import read_zip_binary
from src.train import train_model
from src.ecg_dataset import ECGDataset, prep_batch
from src.stft_baseline import BaselineSTFTModel
from sklearn.utils.class_weight import compute_class_weight
from src.split import create_stratified_split

X_orig = read_zip_binary("../data/X_train.zip")     # list of 1D NumPy arrays
y_orig = pd.read_csv("../data/y_train.csv", header=None, names=["y"])

# Make compressed copies at 80% length
X_comp = [ time_compress(s, rate=0.8) for s in X_orig ]

# Concatenate signals & labels
X_all = X_orig + X_comp
y_all = pd.concat([y_orig, y_orig], ignore_index=True)

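# Stratified split on the _augmented_ set
# Caveat: splitting after augmentation can place a signal in train and its
# compressed copy in validation, which may inflate the validation metrics.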
train_idx, val_idx = create_stratified_split(X_all, y_all, val_size=0.2, seed=343)

# Build datasets & loaders
train_dataset = ECGDataset(X_all, y_all, indices=train_idx)
val_dataset   = ECGDataset(X_all, y_all, indices=val_idx)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True,  collate_fn=prep_batch)
val_loader   = DataLoader(val_dataset,   batch_size=16, shuffle=False, collate_fn=prep_batch)

device = "cpu"

# The model is the baseline STFT model for now
model = BaselineSTFTModel().to(device)


# Balanced class weights to keep the model from collapsing onto a single class
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1, 2, 3]), y=y_all["y"])
weights = torch.tensor(weights, dtype=torch.float32).to(device)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

# The learning rate set here matters less because a OneCycleLR scheduler is used
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,
    weight_decay=1e-4  # L2 regularization to prevent overfitting
)

# train model
model, history = train_model(model, train_loader, val_loader, optimizer, loss_fn, device, num_epochs=50)


Unique predictions: (array([0, 1, 2]), array([1372,    1, 1099]))
Epoch 01 | Time: 52.2s
  Train Loss: 1.4056 | Acc: 0.3224 | F1: 0.2321
  Val   Loss: 1.3719 | Acc: 0.4814 | F1: 0.2453
Unique predictions: (array([0, 1, 2]), array([2153,    2,  317]))
Epoch 02 | Time: 51.3s
  Train Loss: 1.3922 | Acc: 0.3660 | F1: 0.2467
  Val   Loss: 1.3578 | Acc: 0.5740 | F1: 0.2380
Unique predictions: (array([0, 1, 2]), array([2168,   62,  242]))
Epoch 03 | Time: 50.8s
  Train Loss: 1.3664 | Acc: 0.3661 | F1: 0.2546
  Val   Loss: 1.3274 | Acc: 0.5688 | F1: 0.2434
Unique predictions: (array([0, 1, 2, 3]), array([1386,  408,  326,  352]))
Epoch 04 | Time: 50.7s
  Train Loss: 1.2607 | Acc: 0.4060 | F1: 0.3141
  Val   Loss: 1.1109 | Acc: 0.4680 | F1: 0.3566
Unique predictions: (array([0, 1, 2, 3]), array([1814,  238,  155,  265]))
Epoch 05 | Time: 51.4s
  Train Loss: 1.1216 | Acc: 0.4249 | F1: 0.3562
  Val   Loss: 1.0889 | Acc: 0.5534 | F1: 0.3830
Unique predictions: (array([0, 1, 2, 3]), array([1557,  4