## GENERATION (VAE notebook explained in plain language)
# Train a generative model that learns how trials of a given class look, then generate synthetic new ones.

STEP 1 — Load data for one subject and one class
• Go into Original/A01/train/0/
• Load every CSV trial
• Remove the first column (channel names)
• Keep first 1000 timepoints
• Stack them
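
The loading steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual loader: `load_class_trials` is a hypothetical helper, and it assumes each CSV has no header row, one row per channel, and the channel name in the first column.

```python
import csv
from pathlib import Path

import numpy as np


def load_class_trials(in_dir, n_channels=22, n_timepoints=1000):
    """Load every CSV in in_dir into an array of shape (n_trials, n_channels, n_timepoints)."""
    trials = []
    for csv_path in sorted(Path(in_dir).glob("*.csv")):
        with open(csv_path, newline="") as f:
            rows = list(csv.reader(f))
        # Assumption: no header row; column 0 holds the channel name, so skip it
        # and keep only the first n_timepoints numeric values of each row.
        values = np.array(
            [[float(v) for v in row[1:1 + n_timepoints]] for row in rows]
        )
        if values.shape != (n_channels, n_timepoints):
            raise ValueError(f"{csv_path.name}: unexpected shape {values.shape}")
        trials.append(values)
    if not trials:
        raise FileNotFoundError(f"no CSV trials found in {in_dir}")
    return np.stack(trials)
```

Calling `load_class_trials("Original/A01/train/0")` would then return one array holding all trials of class 0 for subject A01.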

STEP 2 — Split into train/validation
Randomly split:
80% → model training
20% → validation
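
One possible way to do the random 80/20 split with NumPy is shown below. `split_train_val` and its `seed` parameter are hypothetical names introduced for illustration; the notebook may split differently, for example inside its training function.

```python
import numpy as np


def split_train_val(X, val_fraction=0.2, seed=0):
    """Randomly split trials along axis 0 into (train, validation)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))                 # shuffled trial indices
    n_val = max(1, int(round(val_fraction * len(X))))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx]
```

Fixing the seed keeps the split reproducible between runs, which makes validation losses comparable.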

STEP 3 — Build the VAE conceptually
A VAE has two parts:
- Encoder: Takes 22×1000 trial → compresses to a small vector (e.g., 2 numbers)
- Decoder: Takes that small vector → reconstructs 22×1000 trial
The latent space (the space of those small vectors) is also pushed toward a smooth standard Gaussian distribution.
That is what the KL term in the code is doing. (In short: Loss = reconstruction_error + KL_divergence)
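
To make the loss concrete, here is a small NumPy sketch of the two terms. It assumes the encoder outputs a mean `mu` and a log-variance `log_var` per latent dimension, and it uses squared error as the reconstruction term; the notebook may use a different reconstruction loss. The `beta` weight is a hypothetical knob, not something taken from the notebook.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), one value per sample."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)


def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Mean over the batch of (squared reconstruction error + beta * KL)."""
    # Sum the squared error over every axis except the batch axis.
    recon = np.sum((x - x_recon) ** 2, axis=tuple(range(1, x.ndim)))
    kl = kl_to_standard_normal(mu, log_var)
    return float(np.mean(recon + beta * kl))
```

When `mu` is zero and `log_var` is zero the KL term vanishes, so the encoder pays a penalty only for moving away from the standard Gaussian.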

STEP 4 — Train until convergence
Train for many epochs.
Stop early if validation loss stops improving.
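
The early-stopping rule can be sketched independently of any deep-learning framework. In this sketch `run_epoch` is a hypothetical callback that trains for one epoch and returns the validation loss; `patience` and `min_delta` are assumed parameters whose values the notebook does not state.

```python
def train_with_early_stopping(run_epoch, max_epochs=500, patience=20, min_delta=0.0):
    """Call run_epoch(epoch) until the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        if val_loss < best_loss - min_delta:
            # New best: remember it (a real loop would also checkpoint the weights here).
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs in a row
    return best_epoch, best_loss
```

A full training loop would typically restore the weights saved at `best_epoch` before generating data.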

STEP 5 — Generate synthetic trials
This is the important generation step:
Instead of passing real data through the encoder,
you skip the encoder and sample random latent vectors from a standard normal distribution:
z ~ N(0, I)
Then: synthetic_trial = decoder(z)
This gives you a new 22×1000 matrix that resembles training data.
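
The sampling step might look like the sketch below. The decoder here is only a stand-in (a fixed random linear map named `toy_decode`) so that the example runs on its own; in the notebook the trained VAE decoder would take its place. `generate_trials` is a hypothetical helper name.

```python
import numpy as np


def generate_trials(decode, n_trials, latent_dim, seed=0):
    """Sample z ~ N(0, I) and decode each latent vector into a synthetic trial."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_trials, latent_dim))
    return z, decode(z)


# Stand-in decoder (assumption): a random linear map from 2 latent numbers to 22x1000 values.
_W = np.random.default_rng(1).standard_normal((2, 22 * 1000))


def toy_decode(z):
    return (z @ _W).reshape(len(z), 22, 1000)


z, synthetic = generate_trials(toy_decode, n_trials=5, latent_dim=2)
print(synthetic.shape)  # (5, 22, 1000)
```

Because the encoder is never used at this stage, the number of synthetic trials is not limited by the number of real trials.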

STEP 6 — Save synthetic data
For each synthetic trial:
• Save as CSV (numeric only, no channel-name column)
• Place it in Generated/A01/train/0/
Then repeat for classes 1, 2, 3 and for subjects A02..A09.
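
Saving can be done with `numpy.savetxt`, as in this minimal sketch. `save_synthetic_trials` is a hypothetical helper; the `trial_{i:05d}.csv` naming follows the pattern used in the code cell of this notebook.

```python
from pathlib import Path

import numpy as np


def save_synthetic_trials(trials, out_dir):
    """Write each (channels, timepoints) trial as a numeric-only CSV and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, trial in enumerate(trials):
        path = out_dir / f"trial_{i:05d}.csv"
        # No header and no channel-name column: just comma-separated numbers.
        np.savetxt(path, trial, delimiter=",")
        paths.append(path)
    return paths
```

For subject A01 and class 0 the target directory would be `Generated/A01/train/0`.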


# ============================================================
# PART 2: GENERATION (Original -> train VAE -> write Generated/<subject>/train/<class>/*.csv)
# ============================================================

def generate_all_subjects(num_synthetic_per_class: int) -> None:
    """
    For each subject and each class:
      - Load Original/<subject>/train/<class>/*.csv
      - Train a VAE on that class' trials
      - Sample latent vectors z and decode to synthetic trials
      - Save numeric-only CSVs into Generated/<subject>/train/<class>/
    """
    for subj in SUBJECTS:
        for cls in [0, 1, 2, 3]:
            # 1) Load real training trials for this subject/class
            X_real = load_trials_from_csv(
                in_dir=ORIGINAL_ROOT / subj / "train" / str(cls),
                drop_first_column=True,         # because Original has channel-name column
                expected_shape=(22, 1000),
            )

            # 2) Fit the preprocessing scaler on the real training trials only (never on test data)
            scaler = fit_minmax_scaler(X_real)            # learn min/max scaling params
            X_scaled = apply_scaler(X_real, scaler)       # scale to [0, 1]

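            # 3) Train the VAE (conceptual)
            #    Assumption: train_vae performs the 80/20 train/validation split internally
            vae = train_vae(
                X_train=X_scaled,
                latent_dim=2,                 # size of the latent vector; other values are possible
                early_stopping=True,          # stop when validation loss stops improving
            )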

            # 4) Create the destination folder
            out_dir = GENERATED_ROOT / subj / "train" / str(cls)
            out_dir.mkdir(parents=True, exist_ok=True)

            # 5) Generate synthetic trials
            for i in range(num_synthetic_per_class):
                z = sample_standard_normal(latent_dim=vae.latent_dim)  # z ~ N(0, I)
                X_syn_scaled = vae_decode(vae, z)                      # shape (22, 1000)
                X_syn = inverse_scaler(X_syn_scaled, scaler)           # back to the original scale (optional)

                # 6) Save numeric-only CSV (repo expects generated to be numeric only)
                out_file = out_dir / f"trial_{i:05d}.csv"
                write_csv_matrix(out_file, X_syn, include_channel_name_column=False)


