
# EEG Preprocessing — Single EDF (TUEP v2.0.1)

This notebook preprocesses **one** EDF file from the **TUH EEG Epilepsy Corpus (TUEP v2.0.1)** and saves clean epochs plus helpful plots for QA.

The pipeline follows these high-level steps:
- Standard clinical EEG preprocessing: 10–20 montage, average re-referencing, notch and band-pass filtering.
- ICA decomposition for artifact reduction.
- Fixed-length epoching and amplitude-based epoch rejection.
- Z-score normalization per recording.
- Saving arrays and metadata for downstream use (e.g., graph-based SSL / GNN).

> Tip: Run cells top-to-bottom. Adjust the `CONFIG` section to point to your EDF and local output folders.



## Requirements

- `mne`
- `numpy`
- `matplotlib`
- `scipy` (installed with `mne`)
- Optional: `autoreject` (not used here, but you can explore later)

```bash
pip install mne numpy matplotlib
```


In [None]:

# =========================
# CONFIG (EDIT THESE PATHS)
# =========================
import os

# Example Windows paths — change to your own
EDF_PATH = r"C:\Users\georg\Desktop\DATA\01_no_epilepsy\aaaaafiy\s002_2010\01_tcp_ar\aaaaafiy_s002_t001.edf"
SAVE_DIR = r"C:\Users\georg\Desktop\Preprocessed"
PSD_DIR  = r"C:\Users\georg\Desktop\PSD_Plots"

os.makedirs(SAVE_DIR, exist_ok=True)
os.makedirs(PSD_DIR, exist_ok=True)

# Core 10–20 + T1/T2 list we try to keep consistently across files
CORE_CHS = [
    'Fp1','Fp2','F7','F3','Fz','F4','F8',
    'T1','T3','C3','Cz','C4','T4','T2',
    'T5','P3','Pz','P4','T6','O1','Oz','O2'
]

# Processing hyperparams
FS_TARGET = 250         # resample frequency (Hz)
HP, LP = 0.5, 100.0     # bandpass
NOTCH = 60              # mains frequency
EPOCH_DUR = 2.0         # seconds
REJECT_PERCENTILE = 95  # amplitude-based epoch rejection
SEED = 42               # reproducibility for ICA


In [None]:

# ==============
# IMPORTS
# ==============
import re
import mne
import numpy as np
import pickle
import matplotlib.pyplot as plt
from mne.preprocessing import ICA
mne.set_log_level("ERROR")
print(mne.__version__)



## 1) Load EDF and keep EEG

We load the file, keep the EEG-type channels, and **clean channel names** by stripping the `EEG ` prefix and the `-LE` / `-REF` suffixes. We then keep only the channels from `CORE_CHS` that are present in this file.
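
The name cleaning can be tried on a few labels without loading any data. This is a minimal sketch that assumes labels of the form `EEG FP1-REF`, with upper-case electrode names. The helper `clean_channel_name` and the short `CANONICAL` list are hypothetical and only serve the illustration.

```python
import re

# Canonical 10-20 spellings used for matching (a short subset, for illustration)
CANONICAL = ['Fp1', 'Fp2', 'Fz', 'Cz', 'Pz', 'Oz', 'T3', 'T4']
_LOOKUP = {name.upper(): name for name in CANONICAL}


def clean_channel_name(label):
    """Strip the 'EEG ' prefix and '-LE'/'-REF' suffix, then fix capitalization."""
    name = re.sub(r'^EEG\s*', '', label.strip())
    name = re.sub(r'-(?:LE|REF)$', '', name).strip()
    # Case-insensitive match against the canonical list; unknown names pass through
    return _LOOKUP.get(name.upper(), name)


for label in ['EEG FP1-REF', 'EEG CZ-LE', 'EEG T3-REF', 'EEG EKG1-REF']:
    print(label, '->', clean_channel_name(label))
```

Matching case-insensitively matters because a plain `ch in raw.ch_names` test treats `FP1` and `Fp1` as different channels.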


In [None]:

raw_before = mne.io.read_raw_edf(EDF_PATH, preload=True, verbose='ERROR')
raw_before.pick("eeg")  # keep EEG-type channels only
print("Original channel names:", raw_before.ch_names)

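# Clean channel names: strip the "EEG " prefix and "-LE"/"-REF" suffixes, then
# restore the CORE_CHS capitalization (labels such as "EEG FP1-REF" are upper-case,
# so "FP1" would otherwise never match "Fp1" in the selection below)
canonical = {ch.upper(): ch for ch in CORE_CHS}
mapping = {}
for orig in raw_before.ch_names:
    name = re.sub(r'^(?:EEG\s*)', '', orig).replace('-LE', '').replace('-REF', '').strip()
    mapping[orig] = canonical.get(name.upper(), name)
raw_before.rename_channels(mapping)
print("Cleaned names:", raw_before.ch_names)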

# Work on a copy
raw = raw_before.copy()

# Keep only the core channels that exist in this file
present = [ch for ch in CORE_CHS if ch in raw.ch_names]
raw.pick_channels(present)
raw_before.pick_channels(present)
print(f"Picked {len(present)}/{len(CORE_CHS)} channels")



## 2) Apply 10–20 montage (+ approximate T1/T2)

We attach a standard montage for spatial consistency. T1/T2 positions are approximate and only used if present.


In [None]:

std_montage = mne.channels.make_standard_montage('standard_1020')
pos = std_montage.get_positions()['ch_pos']

# Approximate T1/T2 (anterior temporal) with the neighbouring 10-10 sites FT9/FT10.
# These are stand-ins taken from the standard montage, not measured positions.
extra_pos = {
    'T1': pos['FT9'],
    'T2': pos['FT10'],
}

all_pos = dict(pos)
all_pos.update(extra_pos)

custom_montage = mne.channels.make_dig_montage(ch_pos=all_pos, coord_frame='head')
raw.set_montage(custom_montage, match_case=False)



## 3) Re-reference, Notch, and Bandpass Filter
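
As a reminder of what the average reference does to the data, here is a small NumPy sketch on synthetic signals. It is not MNE's implementation, only the underlying arithmetic: the mean across channels is subtracted at every time point, so the channels sum to zero and the data lose one degree of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1000))           # (n_channels, n_times), synthetic

# Average reference: subtract the instantaneous mean across channels
x_avg = x - x.mean(axis=0, keepdims=True)

col_sums = x_avg.sum(axis=0)                 # numerically zero at every time point
rank = np.linalg.matrix_rank(x_avg)          # one less than the number of channels
print(np.abs(col_sums).max(), rank)
```

The reduced rank is worth keeping in mind when choosing the number of ICA components later on.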


In [None]:

raw.set_eeg_reference('average', projection=False)
raw.notch_filter(freqs=NOTCH)
raw.filter(l_freq=HP, h_freq=LP, fir_design='firwin', filter_length='auto')



## 4) ICA for Artifact Reduction

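We fit ICA (FastICA) with a fixed seed and apply it **without manual component selection**, which keeps the demo reproducible.  
Because no components are marked for exclusion, `ica.apply` leaves the signal essentially unchanged, so this cell is a scaffold rather than an actual cleaning step.  
For a production pipeline, inspect the components visually or use EOG/ECG-based detection, and list the artifact components in `ica.exclude` before applying.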
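
The effect of excluding components can be illustrated with a toy linear-mixing model. This sketch does not estimate an ICA. It assumes the mixing matrix is already known (the `mixing` values and the helper `remove_components` are made up for the example) and shows only the unmix, zero, and remix step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_times = 2000
t = np.arange(n_times) / 250.0

# Three synthetic sources; the third plays the role of an artifact
sources = np.vstack([
    np.sin(2 * np.pi * 10 * t),
    np.sign(np.sin(2 * np.pi * 3 * t)),
    5.0 * (rng.random(n_times) > 0.99),       # sparse, blink-like spikes
])
# Hypothetical, well-conditioned mixing matrix (channels x sources)
mixing = np.array([[1.0, 0.5, 0.3],
                   [0.2, 1.0, 0.4],
                   [0.3, 0.1, 1.0]])
data = mixing @ sources                       # (n_channels, n_times)


def remove_components(data, mixing, exclude):
    """Unmix the data, zero the excluded components, and mix back."""
    comps = np.linalg.inv(mixing) @ data
    comps[np.asarray(exclude, dtype=int)] = 0.0
    return mixing @ comps


unchanged = remove_components(data, mixing, exclude=[])   # nothing excluded
cleaned = remove_components(data, mixing, exclude=[2])    # artifact source removed
print(np.abs(unchanged - data).max())
```

With an empty exclusion list the reconstruction equals the input up to rounding error. Only a non-empty list changes the data.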


In [None]:

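# The average reference lowers the data rank by one, so cap at n_channels - 1
n_components = min(20, len(raw.ch_names) - 1)
ica = ICA(n_components=n_components, method='fastica', random_state=SEED)
ica.fit(raw)
# ica.exclude is empty here, so apply() leaves the signal essentially unchanged.
# To remove artifacts, inspect components (ica.plot_components(); ica.plot_sources(raw))
# and list the bad ones in ica.exclude before calling apply().
ica.apply(raw)
print("ICA applied; excluded components:", ica.exclude)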



## 5) PSD Before/After

We visualize and save PSD plots to quickly verify filtering efficacy.
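
The plots can be complemented by a single number: the power near the mains frequency relative to a neighbouring band. The sketch below uses a plain periodogram on synthetic signals. The helper `band_power` and the band limits are assumptions chosen for illustration, not part of the pipeline.

```python
import numpy as np


def band_power(x, fs, f_lo, f_hi):
    """Mean periodogram power of a 1-D signal between f_lo and f_hi (Hz)."""
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2 / x.size
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return power[mask].mean()


fs = 250.0
t = np.arange(int(10 * fs)) / fs
rng = np.random.default_rng(0)
noise = rng.standard_normal(t.size)                        # stands in for a notch-filtered channel
with_mains = noise + 2.0 * np.sin(2 * np.pi * 60.0 * t)    # same channel with 60 Hz interference

# Ratio of power around 60 Hz to power in the neighbouring 50-58 Hz band
ratio_noisy = band_power(with_mains, fs, 59, 61) / band_power(with_mains, fs, 50, 58)
ratio_clean = band_power(noise, fs, 59, 61) / band_power(noise, fs, 50, 58)
print(f"with mains: {ratio_noisy:.1f}, without: {ratio_clean:.1f}")
```

In the notebook, the same helper could be applied to one row of `raw_before.get_data()` and `raw.get_data()` with `fs = raw.info['sfreq']`. A ratio near 1 after filtering suggests the notch did its job.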


In [None]:

psd_b = raw_before.compute_psd(fmax=LP, average='mean')
fig_b = psd_b.plot(show=False)
fig_b.suptitle("PSD Before Preprocessing")
patient_id = os.path.splitext(os.path.basename(EDF_PATH))[0]
out_b = os.path.join(PSD_DIR, f"{patient_id}_PSD_before.png")
fig_b.savefig(out_b); plt.close(fig_b)

psd_a = raw.compute_psd(fmax=LP, average='mean')
fig_a = psd_a.plot(show=False)
fig_a.suptitle("PSD After Preprocessing + ICA")
out_a = os.path.join(PSD_DIR, f"{patient_id}_PSD_after.png")
fig_a.savefig(out_a); plt.close(fig_a)

print("Saved:", out_b)
print("Saved:", out_a)



## 6) Resample (and optional initial crop)

We resample to **250 Hz** (`FS_TARGET`) for modeling convenience.  
If the file is **non-epileptic** (its path contains `01_no_epilepsy`), we also crop the first 10 s, which often contain setup noise.
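
The path test can be made independent of the separator style by splitting the path into folder names first. This is a sketch with hypothetical helper names (`path_parts`, `is_no_epilepsy`, `crop_start`) and made-up example paths. The 10 s default mirrors the text above.

```python
import re


def path_parts(path):
    """Split a path on both Windows and POSIX separators."""
    return [p for p in re.split(r'[\\/]+', path) if p]


def is_no_epilepsy(path):
    """True if one of the folders in the path is exactly '01_no_epilepsy'."""
    return '01_no_epilepsy' in path_parts(path)


def crop_start(path, crop_s=10.0):
    """Seconds to drop from the start of the recording for this file."""
    return crop_s if is_no_epilepsy(path) else 0.0


# Hypothetical example paths in both separator styles
print(crop_start(r"C:\data\01_no_epilepsy\subj\rec.edf"))
print(crop_start("/data/00_epilepsy/subj/rec.edf"))
```

Comparing whole folder names avoids accidental matches on a longer folder or file name that merely contains the same substring.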


In [None]:

raw.resample(FS_TARGET)
print("Resampled to", raw.info['sfreq'], "Hz")

if "01_no_epilepsy" in EDF_PATH.replace("/", "\\"):
    print("Cropping out first 10 s (non-epilepsy file)")
    raw.crop(tmin=10.0)



## 7) Fixed-Length Epochs
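
Conceptually, fixed-length epoching cuts the continuous array into equal windows. The NumPy sketch below assumes non-overlapping windows and that a trailing partial window is discarded. MNE's own boundary handling may differ slightly, and `fixed_length_windows` is a hypothetical helper.

```python
import numpy as np


def fixed_length_windows(data, sfreq, duration):
    """Cut (n_channels, n_times) into (n_epochs, n_channels, n_samples) windows."""
    n_samples = int(round(duration * sfreq))
    n_epochs = data.shape[1] // n_samples
    trimmed = data[:, :n_epochs * n_samples]       # discard the trailing partial window
    return trimmed.reshape(data.shape[0], n_epochs, n_samples).transpose(1, 0, 2)


data = np.arange(2 * 1300, dtype=float).reshape(2, 1300)   # 2 channels, 5.2 s at 250 Hz
windows = fixed_length_windows(data, sfreq=250, duration=2.0)
print(windows.shape)
```

With `EPOCH_DUR = 2.0` and `FS_TARGET = 250`, each epoch holds 500 samples per channel.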


In [None]:

epochs = mne.make_fixed_length_epochs(raw, duration=EPOCH_DUR, overlap=0.0, preload=True)
print(epochs)



## 8) Amplitude-Based Artifact Rejection (Percentile)

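We take each epoch's largest per-channel **peak-to-peak amplitude** and reject epochs that exceed the **95th percentile** of this recording's distribution. Because the threshold is relative, roughly 5% of epochs are dropped no matter how clean or noisy the recording is.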
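
The rejection rule can be reproduced on synthetic data with plain NumPy. This sketch assumes epochs of shape `(n_epochs, n_channels, n_times)` in volts. The amplitudes and the five injected artifacts are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
epochs = rng.standard_normal((200, 4, 500)) * 20e-6    # synthetic epochs, in volts
epochs[:5] *= 10                                        # five high-amplitude "artifact" epochs

ptp = np.ptp(epochs, axis=2)              # peak-to-peak per epoch and channel
max_ptp = ptp.max(axis=1)                 # worst channel per epoch
threshold = np.percentile(max_ptp, 95)
keep = max_ptp <= threshold               # epochs above the threshold are dropped

print(keep.sum(), "of", keep.size, "epochs kept")
print("artifact epochs kept:", keep[:5].sum())
```

Here the five injected artifacts are removed, but so are a few ordinary epochs, since the percentile always cuts the same fraction. A fixed threshold in µV would behave differently on very clean or very noisy recordings.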


In [None]:

data = epochs.get_data()                           # (n_epochs, n_channels, n_times)
ptps = np.ptp(data, axis=2)                        # per-epoch per-channel
max_ptp = ptps.max(axis=1) * 1e6                   # µV, worst channel per epoch

plt.figure(figsize=(6,4))
plt.hist(max_ptp, bins=100)
plt.xlabel("Peak-to-Peak amplitude (µV)")
plt.ylabel("Count of epochs")
plt.title("Distribution of epoch P2P amplitudes")
threshold = np.percentile(max_ptp, REJECT_PERCENTILE)
plt.axvline(threshold, linestyle='--', label=f"{REJECT_PERCENTILE}th = {threshold:.1f} µV")
plt.legend(); plt.tight_layout()
plt.show()

reject_criteria = dict(eeg=threshold * 1e-6)
epochs_clean = epochs.copy().drop_bad(reject=reject_criteria)
print(f"{len(epochs_clean)} / {len(epochs)} epochs kept")



## 9) Z-Score Normalization (per recording)
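
Per-recording z-scoring uses a single mean and standard deviation for all channels, epochs, and time points. The sketch below contrasts that with a per-channel variant on synthetic data. Both helper functions are hypothetical, and the per-channel version is shown only as an alternative, not as what this notebook does.

```python
import numpy as np


def zscore_recording(epochs, eps=1e-8):
    """Z-score with one mean and one std for the whole recording."""
    return (epochs - epochs.mean()) / (epochs.std() + eps)


def zscore_per_channel(epochs, eps=1e-8):
    """Alternative: one mean/std per channel, pooled over epochs and time."""
    mean = epochs.mean(axis=(0, 2), keepdims=True)
    std = epochs.std(axis=(0, 2), keepdims=True)
    return (epochs - mean) / (std + eps)


rng = np.random.default_rng(0)
scales = np.array([1.0, 5.0, 0.2]).reshape(1, 3, 1)      # channels with different amplitudes
epochs = rng.standard_normal((50, 3, 500)) * scales

z_all = zscore_recording(epochs)
z_ch = zscore_per_channel(epochs)
print(z_all.std(axis=(0, 2)))   # channels keep their relative scale
print(z_ch.std(axis=(0, 2)))    # every channel is scaled to unit variance
```

The global version preserves amplitude differences between channels, which may or may not be desirable for the downstream model.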


In [None]:

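data_clean = epochs_clean.get_data()
# One global mean/std over all channels, epochs, and time points of this recording
mean_val, std_val = data_clean.mean(), data_clean.std()
# Note: _data is a private MNE attribute; mne.EpochsArray is the public route
epochs_clean._data = (data_clean - mean_val) / (std_val + 1e-8)
print("Z-scoring done.")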



## 10) Quick Visual Checks

- The first ten epochs, one subplot per epoch with all channels overlaid.
- All channels, vertically offset, with the first N epochs concatenated in time.


In [None]:

n_epochs_to_plot = min(10, len(epochs_clean))      # first epochs, not a random sample

ch_names = epochs_clean.ch_names
data = epochs_clean.get_data()

plt.figure(figsize=(14, 2.5 * ((n_epochs_to_plot+1)//2)))
n_cols = 2
n_rows = (n_epochs_to_plot + n_cols - 1) // n_cols
for i in range(n_epochs_to_plot):
    plt.subplot(n_rows, n_cols, i + 1)
    for ch_idx, ch_name in enumerate(ch_names):
        plt.plot(data[i, ch_idx], alpha=0.8)
    plt.title(f"Epoch {i} (z-score)")
    plt.xlabel("Timepoint"); plt.ylabel("z-score")
plt.tight_layout(); plt.show()

# Concatenate first N epochs for each channel
n_concat = min(10, len(epochs_clean))
concat = np.hstack([data[i] for i in range(n_concat)])  # (channels, n_concat * n_times)
plt.figure(figsize=(18, 8))
offset = 10
for ch in range(len(ch_names)):
    plt.plot(concat[ch] + ch * offset)
plt.yticks([ch * offset for ch in range(len(ch_names))], ch_names)
plt.xlabel("Timepoint (all epochs concatenated)"); plt.title("All channels, concatenated epochs (z)")
plt.show()



## 11) Save Arrays & Metadata
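
For downstream code, the saved files can be read back with `np.load`. The sketch below does a round trip on synthetic arrays in a temporary folder. The `_epochs.npy` and `_labels.npy` suffixes follow this notebook, while `load_recording` and the `demo_recording` stem are hypothetical.

```python
import os
import tempfile

import numpy as np


def load_recording(prefix):
    """Load the epochs and labels saved under a common file prefix."""
    epochs = np.load(f"{prefix}_epochs.npy")
    labels = np.load(f"{prefix}_labels.npy")
    if len(epochs) != len(labels):
        raise ValueError("epochs and labels disagree on the number of epochs")
    return epochs, labels


# Round trip on synthetic arrays in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "demo_recording")           # hypothetical file stem
    np.save(f"{prefix}_epochs.npy", np.zeros((8, 22, 500)))
    np.save(f"{prefix}_labels.npy", np.full(8, 1, dtype=int))
    epochs, labels = load_recording(prefix)

print(epochs.shape, labels.tolist())
```

Checking that epochs and labels have the same length at load time catches mismatched files early.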


In [None]:

patient_id = os.path.splitext(os.path.basename(EDF_PATH))[0]
label = 1 if "00_epilepsy" in EDF_PATH.replace("/", "\\") else 0
labels = np.full(len(epochs_clean), label, dtype=int)

prefix = os.path.join(SAVE_DIR, patient_id)
np.save(f"{prefix}_raw.npy", raw.get_data())              # preprocessed continuous data (volts, not z-scored)
np.save(f"{prefix}_epochs.npy", epochs_clean.get_data())  # z-scored epochs
np.save(f"{prefix}_labels.npy", labels)                   # one label per kept epoch
with open(f"{prefix}_info.pkl", "wb") as f:
    pickle.dump(raw.info, f)

print("Saved:")
print(f"  {prefix}_raw.npy")
print(f"  {prefix}_epochs.npy")
print(f"  {prefix}_labels.npy")
print(f"  {prefix}_info.pkl")
