# CleanEEG: Automated Resting-State EEG Preprocessing Tutorial
This tutorial demonstrates the complete CleanEEG preprocessing pipeline using MNE-Python and complementary libraries. Based on the DISCOVER-EEG framework, it covers all preprocessing steps with quality assessment metrics including Signal-to-Noise Ratio (SNR) and Power Spectral Density (PSD) visualization after each step.

## Table of Contents

1. Installation and Setup
2. Quality Assessment Functions
3. Loading EEG Data
4. Channel Montage Setup
5. Preprocessing Pipeline
   - Line Noise Removal (DSS)
   - Bandpass Filter
   - Downsample Data
   - Bad Channel Rejection (PREP)
   - Independent Component Analysis (ICA)
   - Bad Channel Interpolation
   - Bad Time Segments Removal (ASR)
6. Final Quality Assessment
7. Saving Results

## 1. Installation and Setup

### Option 1: Using Conda Environment (Recommended)

Create a conda environment with all dependencies:

```bash
# Create environment from the provided environment.yml
conda env create -f environment.yml
conda activate cleaneeg
```

### Option 2: Using Pip Environment

Create a virtual environment and install dependencies:

```bash
# Create virtual environment
python -m venv cleaneeg_env
source cleaneeg_env/bin/activate  # On Windows: cleaneeg_env\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### Option 3: Install in Current Environment (Jupyter/Colab)

If you're running this in Jupyter or Google Colab, you can install packages directly:

In [None]:
# Install required packages for EEG processing and visualization

# Core EEG/MEG analysis packages
!pip install mne==1.5.0             # Core package for EEG/MEG data analysis
!pip install mne-icalabel==0.5.0    # For automatic classification of ICA components

# Automatic bad-channel detection and denoising
# Version specifiers are quoted so the shell does not treat > and < as redirections
!pip install "pyprep>=0.4.0"          # For automatic bad channel detection
!pip install "meegkit>=0.1.9"         # For advanced denoising methods (DSS, ASR)

# File-format support
!pip install "pybv>=0.7.0"            # For BrainVision file support
!pip install "eeglabio>=0.0.2"        # For EEGLAB file support
!pip install "edfio>=0.1.0"           # For EDF file support
!pip install "EDFlib-Python>=1.0.8"   # For EDF+ file support
!pip install "h5py>=3.7.0"            # For HDF5 file support

# Visualization and UI
!pip install "matplotlib>=3.8,<4.0"   # For visualization

# Numerical and data-handling libraries
!pip install "numpy>=2.0.0"           # For numerical operations
!pip install "scipy>=1.10,<2.0"       # For scientific computing
!pip install "pandas>=1.5,<3.0"       # For data handling

# Utility
!pip install tqdm                   # For progress bars (sample data download)

### Verify Installation
Let's check that all required packages are installed correctly:

In [None]:
# Check if required packages are installed
import importlib

# These are import names, not PyPI names: scikit-learn imports as 'sklearn'
# and EDFlib-Python as 'EDFlib'. PyQt5, pyriemann, and sklearn are not
# installed explicitly by the pip cell above, so they may be reported as
# missing under Option 3 unless another package pulls them in.
required_packages = [
    'numpy', 'scipy', 'matplotlib', 'PyQt5', 'mne', 'mne_icalabel',
    'pyprep', 'pyriemann', 'sklearn', 'meegkit', 'pybv', 'eeglabio',
    'edfio', 'EDFlib', 'h5py', 'pandas'
]

print("Checking packages...")
missing_packages = []

for package in required_packages:
    try:
        importlib.import_module(package)
        print(f"✅ {package}")
    except ImportError:
        print(f"❌ {package} - missing")
        missing_packages.append(package)

if missing_packages:
    print(f"\n⚠️  Missing {len(missing_packages)} packages:")
    for pkg in missing_packages:
        print(f"   - {pkg}")
    print("\nInstall all packages with:")
    print("pip install -r requirements.txt")
else:
    print("\n🎉 All packages ready!")

In [None]:
# Import all necessary libraries
import mne
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import pandas as pd
from scipy import signal
import ftplib
import random
from tqdm.notebook import tqdm

# Set MNE logging level
mne.set_log_level('WARNING')

# Configure matplotlib for better plots
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

### Sample Data Download Function
If you don't have your own EEG data, we'll implement a function to download sample resting-state EEG data:

In [None]:
# Sample data download function implementation
import ftplib
import random
from pathlib import Path
from tqdm.notebook import tqdm

def is_dir(ftp: ftplib.FTP, path: str) -> bool:
    """Check if a path is a directory on the FTP server."""
    cwd = ftp.pwd()
    try:
        ftp.cwd(path)
        ftp.cwd(cwd)
        return True
    except ftplib.error_perm:
        return False

def download_remote(ftp: ftplib.FTP, remote_dir: str, local_dir: Path):
    """Recursively download files from FTP server."""
    local_dir.mkdir(parents=True, exist_ok=True)
    try:
        ftp.cwd(remote_dir)
    except ftplib.error_perm:
        return
    
    try:
        entries = list(ftp.mlsd())
    except (ftplib.error_perm, AttributeError):
        names = ftp.nlst()
        entries = [(name, {'type': 'dir' if is_dir(ftp, f"{remote_dir}/{name}") else 'file'})
                   for name in names]
    
    for name, info in tqdm(entries, desc=f"Scanning {Path(remote_dir).name}", leave=False):
        rpath = f"{remote_dir}/{name}"
        lpath = local_dir / name
        
        if info.get('type') == 'dir':
            download_remote(ftp, rpath, lpath)
        else:
            if not lpath.exists():  # Skip if file already exists
                try:
                    with open(lpath, 'wb') as f:
                        ftp.retrbinary(f"RETR {rpath}", f.write)
                except Exception as e:
                    print(f"⚠️ Failed to download {rpath}: {e}")
            else:
                print(f"📁 File already exists: {lpath.name}")

def download_sample_data(ftp_host: str,
                         ftp_base: str,
                         local_base: Path,
                         num_subjects: int = 1):
    """
    Download sample EEG data from FTP server.
    
    Parameters:
    -----------
    ftp_host : str
        FTP server hostname
    ftp_base : str
        Base directory on FTP server
    local_base : Path
        Local directory to save data
    num_subjects : int
        Number of subjects to download
    """
    print(f"🌐 Connecting to {ftp_host}...")
    
    try:
        ftp = ftplib.FTP(ftp_host)
        ftp.login()
        ftp.cwd(ftp_base)
        
        subjects = ftp.nlst()
        if len(subjects) < num_subjects:
            raise ValueError(f"Found only {len(subjects)} subjects, asked for {num_subjects}")
        
        chosen = random.sample(subjects, num_subjects)
        print(f"📥 Downloading {num_subjects} random subject(s): {chosen}\n")
        
        for subj in tqdm(chosen, desc="Subjects"):
            download_remote(ftp, f"{ftp_base}/{subj}", local_base / subj)
        
        ftp.quit()
        print(f"\n✅ Download complete. Data is in: {local_base.resolve()}")
        
    except Exception as e:
        print(f"❌ Download failed: {e}")
        print("   Will use MNE sample data instead...")
        return False
    
    return True

print("✅ Sample data download functions loaded!")

### Download Sample Data (Optional)
If you don't have your own resting-state EEG data, you can download sample data from public repositories:

In [None]:
# Download sample EEG data
sample_data_downloaded = False
sample_data_path = Path('sample_data')

print("🔍 Checking for existing sample data...")
if sample_data_path.exists() and any(sample_data_path.iterdir()):
    print(f"✅ Found existing data in {sample_data_path}")
    sample_data_downloaded = True
else:
    print("📥 No existing data found. Attempting to download sample data...")
    
    # Try to download from MPI-Leipzig LEMON dataset
    # This dataset contains high-quality resting-state EEG recordings
    try:
        sample_data_downloaded = download_sample_data(
            ftp_host='ftp.gwdg.de',
            ftp_base='/pub/misc/MPI-Leipzig_Mind-Brain-Body-LEMON/EEG_MPILMBB_LEMON/EEG_Raw_BIDS_ID',
            local_base=sample_data_path,
            num_subjects=1  # Download just one subject for this tutorial
        )
    except Exception as e:
        print(f"⚠️ FTP download failed: {e}")
        sample_data_downloaded = False

# Fallback to MNE sample data if download fails
if not sample_data_downloaded:
    print("\n🔄 Falling back to MNE sample data...")
    try:
        # Use MNE's built-in sample dataset (an audiovisual task recording with
        # MEG and EEG channels, not a resting-state recording)
        import mne
        sample_data_folder = mne.datasets.sample.data_path()
        sample_data_path = sample_data_folder / 'MEG' / 'sample'
        print(f"✅ Will use MNE sample data from: {sample_data_path}")
        sample_data_downloaded = True
    except Exception as e:
        print(f"❌ MNE sample data also failed: {e}")
        print("   Please provide your own EEG data file path in the next section.")

print(f"\n📊 Sample data status: {'Available' if sample_data_downloaded else 'Not available'}")

## 2. Quality Assessment Functions

**Purpose**: Monitor and quantify data quality improvements throughout the preprocessing pipeline.

**Why needed**: Preprocessing should improve signal quality, but it's important to verify this objectively. Signal-to-Noise Ratio (SNR) and Power Spectral Density (PSD) provide quantitative metrics to ensure each step is helping rather than hurting data quality.

**Methods**: 
- **SNR calculation**: Compares signal power in neural frequency bands to noise estimates
- **PSD visualization**: Shows how preprocessing affects the frequency content of signals
- **Progress tracking**: Documents quality changes after each preprocessing step
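
The spectral SNR idea can be illustrated on synthetic data before applying it to real recordings. The sketch below is a simplified, single-channel version; the 10 Hz test rhythm, the noise level, and the helper name `band_snr_db` are assumptions made for illustration and are not part of the pipeline.

```python
import numpy as np
from scipy import signal

def band_snr_db(x, sfreq, band, flank=2.0):
    """Mean power in `band` relative to mean power in flanking bands, in dB."""
    freqs, psd = signal.welch(x, sfreq, nperseg=int(2 * sfreq))
    low, high = band
    in_band = (freqs >= low) & (freqs <= high)
    flanks = ((freqs >= low - flank) & (freqs < low)) | \
             ((freqs > high) & (freqs <= high + flank))
    # Power ratios use 10*log10; amplitude (RMS) ratios would use 20*log10
    return 10 * np.log10(psd[in_band].mean() / psd[flanks].mean())

rng = np.random.default_rng(0)
sfreq = 250.0
t = np.arange(0, 30, 1 / sfreq)
noise = rng.standard_normal(t.size)          # broadband background
alpha = 2.0 * np.sin(2 * np.pi * 10 * t)     # hypothetical 10 Hz rhythm

snr_noise_only = band_snr_db(noise, sfreq, band=(8, 13))
snr_with_alpha = band_snr_db(noise + alpha, sfreq, band=(8, 13))
print(f"Alpha-band SNR, noise only:  {snr_noise_only:5.1f} dB")
print(f"Alpha-band SNR, with rhythm: {snr_with_alpha:5.1f} dB")
```

For white noise alone the band and its flanks carry similar power, so the ratio stays near 0 dB; adding a narrow-band rhythm raises it. The same logic explains why a band-relative SNR is only informative when the band actually contains a spectral peak.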

These functions will help us track data quality throughout the preprocessing pipeline:

In [None]:
def compute_snr(raw, freq_bands=None, method='rms'):
    """
    Compute Signal-to-Noise Ratio for EEG data.
    
    Parameters:
    -----------
    raw : mne.io.Raw
        The EEG data
    freq_bands : dict
        Dictionary of frequency bands to analyze
    method : str
        Method for SNR calculation ('rms' or 'spectral')
    
    Returns:
    --------
    snr_results : dict
        SNR values for different frequency bands
    """
    if freq_bands is None:
        freq_bands = {
            'delta': (1, 4),
            'theta': (4, 8), 
            'alpha': (8, 13),
            'beta': (13, 30),
            'gamma': (30, 100)
        }
    
    # Get data and sampling frequency
    data = raw.get_data()
    sfreq = raw.info['sfreq']
    
    snr_results = {}
    
    if method == 'spectral':
        # Compute PSD
        freqs, psd = signal.welch(data, sfreq, nperseg=int(2*sfreq))
        
        for band_name, (low_freq, high_freq) in freq_bands.items():
            # Find frequency indices
            freq_mask = (freqs >= low_freq) & (freqs <= high_freq)
            
            # Signal power in the band
            signal_power = np.mean(psd[:, freq_mask], axis=1)
            
            # Noise estimation (neighboring frequencies)
            noise_low = max(0, low_freq - 2)
            noise_high = min(freqs[-1], high_freq + 2)
            noise_mask = ((freqs >= noise_low) & (freqs < low_freq)) | \
                        ((freqs > high_freq) & (freqs <= noise_high))
            
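            if np.any(noise_mask):
                noise_power = np.mean(psd[:, noise_mask], axis=1)
                # EEG PSDs in V²/Hz can be far smaller than 1e-10, so guard against
                # division by zero with the smallest positive float instead
                snr = 10 * np.log10(signal_power / (noise_power + np.finfo(float).tiny))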
            else:
                snr = np.full(len(raw.ch_names), np.nan)
            
            snr_results[band_name] = {
                'mean_snr': np.nanmean(snr),
                'std_snr': np.nanstd(snr),
                'channel_snr': snr
            }
    
    else:  # RMS method
        # Estimate broadband noise once from high frequencies (above 80 Hz);
        # this requires a Nyquist frequency above 80 Hz
        noise_rms = None
        if sfreq > 160:
            noise_data = raw.copy().filter(80, None, verbose=False).get_data()
            noise_rms = np.sqrt(np.mean(noise_data**2, axis=1))

        for band_name, (low_freq, high_freq) in freq_bands.items():
            # Filter data to frequency band
            raw_filtered = raw.copy().filter(low_freq, high_freq, verbose=False)
            filtered_data = raw_filtered.get_data()

            # RMS of signal
            signal_rms = np.sqrt(np.mean(filtered_data**2, axis=1))

            if noise_rms is not None:
                snr = 20 * np.log10(signal_rms / (noise_rms + 1e-10))
            else:
                # Fallback: use the band's standard deviation as the noise estimate.
                # Band-passed data are close to zero-mean, so RMS ≈ std and this
                # branch returns values near 0 dB; treat it as a placeholder only.
                noise_std = np.std(filtered_data, axis=1)
                snr = 20 * np.log10(signal_rms / (noise_std + 1e-10))

            snr_results[band_name] = {
                'mean_snr': np.mean(snr),
                'std_snr': np.std(snr),
                'channel_snr': snr
            }
    
    return snr_results

def plot_psd_comparison(raw_list, labels, title="Power Spectral Density Comparison", fmax=80):
    """
    Plot PSD comparison for multiple raw objects.
    
    Parameters:
    -----------
    raw_list : list
        List of mne.io.Raw objects
    labels : list
        Labels for each raw object
    title : str
        Plot title
    fmax : float
        Maximum frequency to plot
    """
    fig, ax = plt.subplots(1, 1, figsize=(12, 6))
    
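    # Red for before, blue for after; use a colormap when more than two inputs are given
    if len(raw_list) <= 2:
        colors = ['red', 'blue'][:len(raw_list)]
    else:
        colors = plt.cm.tab10(np.linspace(0, 1, len(raw_list)))
    
    for raw, label, color in zip(raw_list, labels, colors):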
        # Compute PSD
        psd = raw.compute_psd(fmax=fmax, verbose=False)
        
        # Plot average across channels
        freqs = psd.freqs
        psd_data = psd.get_data()
        mean_psd = np.mean(psd_data, axis=0)
        
        ax.semilogy(freqs, mean_psd, label=label, color=color, linewidth=2)
    
    ax.set_xlabel('Frequency (Hz)')
    ax.set_ylabel('Power Spectral Density (V²/Hz)')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

def print_snr_summary(snr_results, step_name):
    """
    Print a formatted summary of SNR results.
    """
    print(f"\n📊 SNR Summary - {step_name}:")
    print("=" * 50)
    for band, results in snr_results.items():
        print(f"{band.capitalize():>8}: {results['mean_snr']:6.2f} ± {results['std_snr']:5.2f} dB")
    print("=" * 50)

def plot_processing_summary(processing_log):
    """
    Plot a summary of SNR changes throughout processing.
    """
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    bands = ['delta', 'theta', 'alpha', 'beta', 'gamma']
    steps = list(processing_log.keys())
    
    for i, band in enumerate(bands):
        if i < len(axes):
            snr_values = [processing_log[step][band]['mean_snr'] for step in steps]
            axes[i].plot(range(len(steps)), snr_values, 'o-', linewidth=2, markersize=8)
            axes[i].set_title(f'{band.capitalize()} Band SNR')
            axes[i].set_ylabel('SNR (dB)')
            axes[i].set_xticks(range(len(steps)))
            axes[i].set_xticklabels(steps, rotation=45, ha='right')
            axes[i].grid(True, alpha=0.3)
    
    # Remove the last subplot if we have 6 subplots but only 5 bands
    if len(bands) < len(axes):
        fig.delaxes(axes[-1])
    
    plt.tight_layout()
    plt.show()

print("✅ Quality assessment functions loaded successfully!")

## 3. Loading EEG Data
Load your EEG data from various formats supported by MNE-Python (.edf, .vhdr, .bdf, .set, .fif). This cell automatically finds EEG files in your downloaded dataset, checks for common mislabeling issues (like EOG or reference channels incorrectly marked as EEG), and filters to keep only true EEG channels for analysis.

In [None]:
## 📁 Load EEG Data

# Load EEG data (using already downloaded dataset)
print("📁 Loading EEG data...")

if sample_data_downloaded:
    # Find EEG files in downloaded data
    eeg_extensions = [".vhdr", ".edf", ".bdf", ".set", ".fif"]
    eeg_files = []
    
    for ext in eeg_extensions:
        eeg_files.extend(list(sample_data_path.rglob(f"*{ext}")))
    
    if eeg_files:
        # Use the first EEG file found
        eeg_file = eeg_files[0]
        print(f"📂 Loading: {eeg_file.name}")
        raw = mne.io.read_raw(eeg_file, preload=True)
        data_source = f"Downloaded: {eeg_file.name}"
    else:
        # No EEG files found
        print("❌ No EEG files found in downloaded data!")
        raise FileNotFoundError("Please check data download above")
else:
    print("❌ No data available!")
    raise FileNotFoundError("Please check data download above")

# Check for mislabeled channels before filtering
print("\n🔍 Checking for mislabeled channels...")

# Common patterns for non-EEG channels that might be labeled as EEG
eog_patterns = ['EOG', 'VEOG', 'HEOG', 'EOGH', 'EOGV', 'EOG1', 'EOG2', 'LHEOG', 'RHEOG']
ref_patterns = ['REF', 'GND', 'A1', 'A2', 'M1', 'M2', 'TP9', 'TP10']
other_patterns = ['ECG', 'EMG', 'RESP', 'TRIG', 'STI']

def suggest_channel_type(ch_name):
    """Return 'eog' or 'misc' for a suspicious channel name, else None."""
    ch_upper = ch_name.upper()
    if any(pattern in ch_upper for pattern in eog_patterns):
        return 'eog'
    # Short electrode-style names (A1, M2, ...) must match exactly; substring
    # matching would also flag ordinary channels such as 'A10' or 'A11'
    if ch_upper in ref_patterns or 'REF' in ch_upper or 'GND' in ch_upper:
        return 'misc'
    if any(pattern in ch_upper for pattern in other_patterns):
        return 'misc'
    return None

# Check each channel name
mislabeled_channels = []
for ch_name in raw.ch_names:
    suggested_type = suggest_channel_type(ch_name)
    if suggested_type is not None:
        mislabeled_channels.append((ch_name, suggested_type))

# Report and fix mislabeled channels
if mislabeled_channels:
    print(f"⚠️ Found {len(mislabeled_channels)} potentially mislabeled channels:")
    for ch_name, suggested_type in mislabeled_channels:
        print(f"   {ch_name} → {suggested_type}")
        raw.set_channel_types({ch_name: suggested_type})
    print("✅ Channel types corrected!")
else:
    print("✅ All channels appear correctly labeled")

# Check channel types and filter to EEG only
channel_types = set(raw.get_channel_types())
print(f"📊 Channel types found: {list(channel_types)}")

if 'eeg' in channel_types:
    n_before = len(raw.ch_names)
    raw.pick('eeg')
    n_after = len(raw.ch_names)
    print(f"📌 Keeping EEG channels: {n_before} → {n_after} channels")
else:
    print("⚠️ No EEG channels detected!")

# Show data summary
print(f"\n📋 Data Summary:")
print(f"├── Source: {data_source}")
print(f"├── Channels: {len(raw.ch_names)} EEG")
print(f"├── Duration: {raw.times[-1]:.1f} seconds")
print(f"└── Sample rate: {raw.info['sfreq']} Hz")

# Initial quality assessment
print("\n🔍 Initial quality check...")
initial_snr = compute_snr(raw)
print_snr_summary(initial_snr, "Original Data")

# Set up for processing pipeline
processing_log = {'Original': initial_snr}
raw_versions = {'Original': raw.copy()}

## 4. Channel Montage Setup

**Purpose**: Apply a standard electrode montage to provide spatial information about channel locations.

**Why needed**: Many preprocessing steps (bad channel detection, interpolation) and analyses (source localization, connectivity) require knowing where each electrode is positioned on the scalp. Without spatial information, we can't determine which channels are neighbors or create topographic maps.

**Method**: Match electrode names to standard montage templates (10-20, 10-05, etc.) that define precise coordinates for each electrode position.
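
The name-matching step can be sketched without any EEG library: compare recorded channel names against a template's labels, ignoring case, and compute the fraction that matched. The label subset and the recorded names below are illustrative assumptions, not the full template that `mne.channels.make_standard_montage` provides, and `match_channels` is a hypothetical helper.

```python
# A subset of standard 10-20 labels (illustrative; real templates contain many more)
template_labels = ['Fp1', 'Fp2', 'F7', 'F3', 'Fz', 'F4', 'F8', 'T7', 'C3', 'Cz',
                   'C4', 'T8', 'P7', 'P3', 'Pz', 'P4', 'P8', 'O1', 'O2']

# Hypothetical recording: upper-case names plus one label the template lacks
recorded_names = ['FP1', 'FP2', 'FZ', 'CZ', 'PZ', 'O1', 'O2', 'X1']

def match_channels(names, template):
    """Map recorded names to template labels, ignoring case."""
    lookup = {label.upper(): label for label in template}
    return {name: lookup[name.upper()] for name in names if name.upper() in lookup}

matched = match_channels(recorded_names, template_labels)
match_pct = 100 * len(matched) / len(recorded_names)
print(f"{len(matched)}/{len(recorded_names)} channels matched ({match_pct:.1f}%)")
```

A low match percentage usually means the recording uses a different naming scheme than the template, which is why it helps to try several templates and keep the one that places the most channels.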

In [None]:
## 🗺️ Set Electrode Montage

# Create a working copy for processing
print("🗺️ Setting up electrode montage...")
raw_processed = raw.copy()

# Try common montages to find best match
montages = ['standard_1020', 'standard_1005', 'easycap-M1', 'biosemi64']

montage_applied = False
for montage_name in montages:
    try:
        montage = mne.channels.make_standard_montage(montage_name)
        raw_processed.set_montage(montage, match_case=False, on_missing='ignore')
        
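        # Check how many channels got positions (unmatched channels may carry
        # NaN or all-zero coordinates, so test for both)
        n_positioned = sum(
            1 for ch in raw_processed.info['chs']
            if np.all(np.isfinite(ch['loc'][:3])) and np.any(ch['loc'][:3] != 0)
        )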
        match_pct = (n_positioned / len(raw_processed.ch_names)) * 100
        
        if match_pct > 50:  # At least 50% matched
            print(f"✅ Applied {montage_name}: {n_positioned}/{len(raw_processed.ch_names)} channels positioned ({match_pct:.0f}%)")
            montage_applied = True
            break
        else:
            print(f"   {montage_name}: {match_pct:.0f}% match - trying next...")
            
    except Exception as e:
        print(f"   {montage_name}: failed - {e}")
        continue

if not montage_applied:
    print("⚠️ No suitable montage found - continuing without spatial info")

# Show electrode positions if montage was applied
if montage_applied:
    try:
        raw_processed.plot_sensors(show_names=True, sphere='auto')
    except Exception:
        print("📍 Electrode positions set (visualization unavailable)")

print(f"\n📊 Setup Summary:")
print(f"├── EEG channels: {len(raw_processed.ch_names)}")
print(f"├── Montage: {'Applied' if montage_applied else 'None'}")
print(f"└── Ready for preprocessing pipeline")

# Update main processing variable
raw = raw_processed

## 5. Preprocessing Pipeline

**Overview**: This section applies the complete CleanEEG preprocessing workflow following the DISCOVER-EEG framework. Each step targets specific types of artifacts while preserving neural signals.

**Pipeline Logic**: Steps are ordered to handle the largest artifacts first (line noise, drifts) before more sophisticated analyses (ICA, ASR) that work better on cleaner data. Quality metrics after each step confirm improvements.

**Key Principle**: Every preprocessing step should improve signal quality. We'll monitor this with quantitative metrics throughout.

Now we'll apply the complete CleanEEG preprocessing pipeline, monitoring quality at each step:

### 5.1 Line Noise Removal (Denoising Source Separation)

**Purpose**: Remove electrical interference from power lines (50/60 Hz) that contaminates EEG signals.

**Why needed**: Power line noise creates strong, narrow-band artifacts that can overwhelm neural signals and distort frequency analysis. This interference comes from electrical equipment and building wiring.

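**Method**: Denoising Source Separation (DSS) finds spatial components dominated by line-frequency power and projects them out of the data. A notch filter attenuates everything in its stop band on every channel, whereas DSS can remove line noise while largely preserving neural activity at nearby frequencies.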
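
To make the idea concrete, here is a minimal DSS sketch on synthetic data using only NumPy and SciPy. It is a simplified illustration under stated assumptions (a single 50 Hz source, white background activity, a narrow band-pass as the bias filter, one component removed); `meegkit`'s `dss_line` is more elaborate, and all variable names here are hypothetical.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
sfreq, n_samples, n_chans, line_freq = 250.0, 5000, 8, 50.0
t = np.arange(n_samples) / sfreq

# Synthetic data (time x channels): background activity plus one 50 Hz source
background = rng.standard_normal((n_samples, n_chans))
gains = rng.uniform(0.5, 1.5, n_chans)              # per-channel line-noise gain
X = background + 2.0 * np.sin(2 * np.pi * line_freq * t)[:, None] * gains
X -= X.mean(axis=0)

def line_power(x):
    """Mean PSD across channels at the line frequency."""
    freqs, psd = signal.welch(x, sfreq, nperseg=int(2 * sfreq), axis=0)
    return psd[np.argmin(np.abs(freqs - line_freq))].mean()

# Bias filter: emphasize the line frequency with a narrow band-pass
sos = signal.butter(2, [line_freq - 2, line_freq + 2], btype="bandpass",
                    fs=sfreq, output="sos")
X_bias = signal.sosfiltfilt(sos, X, axis=0)

c0 = X.T @ X / n_samples              # covariance of the raw data
c1 = X_bias.T @ X_bias / n_samples    # covariance of the biased data

# Step 1: whiten the data using the eigendecomposition of c0
d, V = np.linalg.eigh(c0)
W = V / np.sqrt(d)
# Step 2: rotate so components are ordered by biased (line-frequency) power
lam, U = np.linalg.eigh(W.T @ c1 @ W)
U = U[:, np.argsort(lam)[::-1]]

to_dss = W @ U                        # channels -> DSS components
from_dss = np.linalg.inv(to_dss)      # DSS components -> channels

# Project out the component(s) most dominated by line-frequency power
n_remove = 1
Z = X @ to_dss
X_clean = X - Z[:, :n_remove] @ from_dss[:n_remove, :]

reduction_db = 10 * np.log10(line_power(X) / line_power(X_clean))
print(f"Line noise reduction: {reduction_db:.1f} dB")
```

Because only one spatial component is removed, the remaining components, including any broadband activity around 50 Hz, stay in the data. That is the practical difference from a notch filter, which would remove that frequency range from every channel.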

In [None]:
from meegkit import dss
from scipy import signal

print("⚡ Removing line noise using Denoising Source Separation (DSS)...")

# Line frequency depends on geographical location:
# - 50 Hz: Europe, Asia, Africa, Australia (most of the world)
# - 60 Hz: North America, parts of South America, some Pacific islands
# Since this dataset was collected in Germany (LEMON dataset), we use 50 Hz
line_freq = 50  # Change to 60 if your data is from North America
print(f"⚡ Removing {line_freq} Hz line noise...")

# Check current line noise level
data = raw.get_data()
sfreq = raw.info['sfreq']
freqs, psd = signal.welch(data, sfreq, nperseg=int(2*sfreq))
freq_idx = np.argmin(np.abs(freqs - line_freq))
power_before = np.mean(psd[:, freq_idx])

# Apply DSS line noise removal
try:
    print("🔧 Applying Denoising Source Separation...")
    processed_data, artifacts = dss.dss_line(
        data.T,              # DSS expects (time x channels)
        fline=line_freq,     # Target frequency
        sfreq=sfreq,         # Sampling rate
        show=False           # No plots during processing
    )
    
    # Update the raw data
    raw._data = processed_data.T  # Convert back to (channels x time)
    method_used = "DSS"
    print("✅ DSS successfully applied")
    
except Exception as e:
    print(f"⚠️ DSS failed: {e}")
    print("🔄 Using notch filter instead...")
    raw.notch_filter(freqs=[line_freq], verbose=False)
    method_used = "Notch Filter"
    print("✅ Notch filter applied")

# Verify noise reduction
data_after = raw.get_data()
freqs_after, psd_after = signal.welch(data_after, sfreq, nperseg=int(2*sfreq))
power_after = np.mean(psd_after[:, freq_idx])
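# PSD values are in V²/Hz and can be far smaller than 1e-12, so use the
# smallest positive float as the zero-division guard
reduction_db = 10 * np.log10(power_before / (power_after + np.finfo(float).tiny))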

print(f"📊 Line noise reduction: {reduction_db:.1f} dB using {method_used}")

# Store results for comparison
raw_versions['After Line Noise'] = raw.copy()

# Quality assessment
snr_after_line = compute_snr(raw)
processing_log['After Line Noise'] = snr_after_line
print_snr_summary(snr_after_line, "After Line Noise Removal")

# Plot before/after comparison
plot_psd_comparison(
    [raw_versions['Original'], raw_versions['After Line Noise']], 
    ['Original', 'After Line Noise'],
    "Power Spectral Density: Line Noise Removal"
)

### 5.2 Bandpass Filter

**Purpose**: Remove slow drifts, baseline shifts, and high-frequency noise from EEG recordings.

**Why needed**: EEG amplifiers can introduce very low-frequency drifts (<1 Hz) due to electrode movement, skin conductance changes, and amplifier instabilities. Additionally, high-frequency noise (>100 Hz) from electrical interference, muscle artifacts, and amplifier noise can contaminate the signal. These artifacts can distort analyses and make data appear non-stationary.

**Method**: A 1-100 Hz bandpass filter combines:
- **Highpass component (1 Hz)**: Removes slow drifts while preserving all neural frequencies of interest (delta waves start at ~1-4 Hz)
- **Lowpass component (100 Hz)**: Removes high-frequency noise while retaining all relevant neural oscillations including gamma activity (30-100 Hz)
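
The effect of the highpass component can be checked on a synthetic signal. The sketch below uses a plain SciPy windowed FIR design as a stand-in; MNE's `raw.filter` designs its own filter, so the filter length and the test signal here are assumptions for illustration only.

```python
import numpy as np
from scipy import signal

sfreq = 500.0
t = np.arange(0, 20, 1 / sfreq)
drift = 5.0 * np.sin(2 * np.pi * 0.1 * t)    # hypothetical slow drift (0.1 Hz)
alpha = 1.0 * np.sin(2 * np.pi * 10 * t)     # hypothetical 10 Hz rhythm
x = drift + alpha

# Linear-phase FIR bandpass, 1-100 Hz (odd length keeps the kernel symmetric)
numtaps = 1651
h = signal.firwin(numtaps, [1.0, 100.0], pass_zero=False, fs=sfreq)

# 'same' convolution with a symmetric kernel compensates the filter delay
y = signal.fftconvolve(x, h, mode='same')

core = slice(numtaps, -numtaps)              # ignore edge transients
rms_before = np.sqrt(np.mean(x[core] ** 2))
rms_residual = np.sqrt(np.mean((y[core] - alpha[core]) ** 2))
print(f"RMS of raw signal:                 {rms_before:.3f}")
print(f"RMS of (filtered - 10 Hz rhythm):  {rms_residual:.3f}")
```

The drift dominates the raw signal, yet after filtering the output is almost identical to the 10 Hz rhythm alone. The long kernel is the price of a narrow transition band near 1 Hz, which is also why edge samples are excluded from the comparison.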

In [None]:
## 🔽 Apply Bandpass Filter

# Set filter parameters
hp_freq = 1.0   # Remove slow drifts below 1 Hz
lp_freq = 100.0 # Remove noise above 100 Hz

print(f"🔽 Applying {hp_freq}-{lp_freq} Hz bandpass filter...")

# Apply bandpass filter
raw.filter(
    l_freq=hp_freq,   # High-pass cutoff  
    h_freq=lp_freq,   # Low-pass cutoff
    method='fir',     # Finite Impulse Response
    verbose=False
)

print(f"✅ Bandpass filter applied: {hp_freq}-{lp_freq} Hz")

# Store results
raw_versions['After Bandpass'] = raw.copy()

# Quality assessment
snr_after_bandpass = compute_snr(raw)
processing_log['After Bandpass'] = snr_after_bandpass
print_snr_summary(snr_after_bandpass, "After Bandpass Filter")

# Plot before/after comparison
plot_psd_comparison(
    [raw_versions['After Line Noise'], raw_versions['After Bandpass']], 
    ['After Line Noise', 'After Bandpass'],
    "Power Spectral Density: Bandpass Filter Effect"
)

### 5.3 Downsample Data

**Purpose**: Reduce data size and computational load while preserving all relevant neural information.

**Why needed**: Many EEG systems record at very high sampling rates (>1000 Hz) to avoid aliasing, but most EEG analysis only requires frequencies up to 100-200 Hz. High sampling rates create unnecessarily large files and slow processing.

**Method**: Downsample to 500 Hz, which puts the Nyquist frequency at 250 Hz, well above the 100 Hz lowpass cutoff applied in the previous step. An anti-aliasing filter is applied before decimation to prevent frequency distortion.
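
Why the anti-aliasing filter matters can be shown with a short synthetic example. The sketch uses SciPy's polyphase resampler as a stand-in for `raw.resample` (which uses its own resampling method); the 400 Hz component and the helper `power_at` are assumptions for illustration.

```python
import numpy as np
from scipy import signal

sfreq, target_sfreq = 1000.0, 500.0
t = np.arange(0, 4, 1 / sfreq)
# 10 Hz rhythm plus a 400 Hz component above the new Nyquist frequency (250 Hz)
x = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 400 * t)

naive = x[::2]                                     # decimation without filtering
filtered = signal.resample_poly(x, up=1, down=2)   # low-pass FIR, then decimation

def power_at(y, fs, freq):
    """PSD value at the frequency bin closest to `freq`."""
    freqs, psd = signal.welch(y, fs, nperseg=int(fs))
    return psd[np.argmin(np.abs(freqs - freq))]

# Without filtering, 400 Hz folds back to 500 - 400 = 100 Hz
alias_naive = power_at(naive, target_sfreq, 100)
alias_filtered = power_at(filtered, target_sfreq, 100)
print(f"Aliased power at 100 Hz, naive:    {alias_naive:.2e}")
print(f"Aliased power at 100 Hz, filtered: {alias_filtered:.2e}")
```

In the naive version the 400 Hz component reappears at 100 Hz, inside the gamma band, where it would be indistinguishable from genuine activity. With the anti-aliasing filter it is suppressed before the sample rate is reduced.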

In [None]:
## 📉 Downsample Data

# Set target sampling rate
target_sfreq = 500  # Hz - adequate for EEG analysis
original_sfreq = raw.info['sfreq']

print(f"📉 Checking sampling rate: {original_sfreq} Hz")

# Downsample if necessary
if original_sfreq > target_sfreq:
    raw.resample(target_sfreq, verbose=False)
    reduction = (1 - target_sfreq/original_sfreq) * 100
    print(f"✅ Downsampled to {target_sfreq} Hz (reduced data by {reduction:.1f}%)")
else:
    print(f"✅ No downsampling needed (already {original_sfreq} Hz)")

# Store results
raw_versions['After Downsample'] = raw.copy()

# Quality assessment
snr_after_downsample = compute_snr(raw)
processing_log['After Downsample'] = snr_after_downsample
print_snr_summary(snr_after_downsample, "After Downsampling")

# Plot before/after comparison (plot_psd_comparison limits the x-axis to fmax=80 Hz by default)
plot_psd_comparison(
    [raw_versions['After Bandpass'], raw_versions['After Downsample']], 
    ['Before Downsample', 'After Downsample'],
    "Power Spectral Density: Downsampling Effect"
)

### 5.4 Bad Channel Rejection (PREP Pipeline)

**Purpose**: Automatically identify and mark electrodes that are not recording valid neural signals.

**Why needed**: EEG electrodes can malfunction due to poor skin contact, broken wires, high impedance, or movement artifacts. Bad channels introduce noise and can distort spatial analyses, source localization, and connectivity measures.

**Method**: The PREP pipeline uses multiple statistical criteria: flat channels, channels with extreme amplitudes, poor correlation with neighbors, and channels that deviate from robust signal statistics.
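
Two of these criteria, flat channels and extreme amplitudes, can be sketched with robust statistics on synthetic data. This is a simplified illustration, not the `pyprep` implementation: the thresholds, the IQR-based scale estimate, and the synthetic recording are assumptions, and the correlation and RANSAC criteria are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n_chans, n_samples = 16, 5000
ch_names = [f"EEG{i:02d}" for i in range(n_chans)]

# Synthetic recording: similar amplitudes, except one noisy and one flat channel
scales = np.linspace(0.8, 1.2, n_chans)
data = rng.standard_normal((n_chans, n_samples)) * scales[:, None]
data[3] *= 20.0   # hypothetical high-amplitude channel
data[7] = 0.0     # hypothetical disconnected (flat) channel

def robust_sd(x, axis=None):
    """Standard deviation estimate based on the interquartile range."""
    q75, q25 = np.percentile(x, [75, 25], axis=axis)
    return 0.7413 * (q75 - q25)

# Criterion 1: flat channels (threshold is an illustrative choice)
flat = np.std(data, axis=1) < 1e-15

# Criterion 2: robust z-score of each channel's amplitude across channels
amplitude = robust_sd(data, axis=1)
z = (amplitude - np.median(amplitude)) / robust_sd(amplitude)
deviant = np.abs(z) > 5.0

bad = [name for name, is_bad in zip(ch_names, flat | deviant) if is_bad]
print("Flagged channels:", bad)
```

Median- and IQR-based statistics are used because the bad channels themselves would otherwise inflate the mean and standard deviation and hide their own deviation.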

In [None]:
## 🔍 Detect Bad Channels

print("🔍 Detecting bad channels...")

try:
    # Use PREP pipeline for comprehensive bad channel detection
    from pyprep.find_noisy_channels import NoisyChannels
    
    nd = NoisyChannels(raw, random_state=42)
    nd.find_all_bads(ransac=True, channel_wise=True)
    bad_channels_raw = nd.get_bads()
    
    # Clean up bad channels list (convert numpy strings to regular strings)
    bad_channels = [str(ch) for ch in bad_channels_raw] if bad_channels_raw else []
    
    if bad_channels:
        print(f"🚫 PREP detected {len(bad_channels)} bad channels: {bad_channels}")
        raw.info['bads'] = bad_channels
    else:
        print("✅ PREP found no bad channels")
        
except Exception as e:
    print(f"⚠️ PREP failed ({e}), using statistical method...")
    
    # Fallback: detect channels with extreme variance
    data = raw.get_data()
    channel_vars = np.var(data, axis=1)
    
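    # Flag the top/bottom 5% of channels by variance. Note that a percentile
    # cutoff always flags roughly 10% of channels, even in clean data, so
    # review the result manually before interpolating.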
    high_thresh = np.percentile(channel_vars, 95)
    low_thresh = np.percentile(channel_vars, 5)
    
    bad_channels = []
    for i, ch_name in enumerate(raw.ch_names):
        if channel_vars[i] > high_thresh or channel_vars[i] < low_thresh:
            bad_channels.append(ch_name)
    
    if bad_channels:
        print(f"🚫 Statistical method found {len(bad_channels)} bad channels: {bad_channels}")
        raw.info['bads'] = bad_channels
    else:
        print("✅ Statistical method found no bad channels")

# Summary
total_channels = len(raw.ch_names)
n_bad = len(raw.info['bads'])
n_good = total_channels - n_bad

print(f"📊 Channel Summary: {n_good}/{total_channels} good channels ({n_bad} marked as bad)")

# Visualize montage with bad channels highlighted
if n_bad > 0:
    try:
        print("🗺️ Showing electrode positions (bad channels in red)...")
        fig = raw.plot_sensors(
            kind='topomap', 
            show_names=True, 
            sphere='auto',
            title=f'Electrode Positions ({n_bad} bad channels in red)'
        )
        
        # Recolor the name labels of bad channels in red
        ax = fig.get_axes()[0]
        for text in ax.texts:
            if text.get_text() in raw.info['bads']:
                text.set_color('red')
                text.set_fontweight('bold')
        
        plt.show()
    except Exception as e:
        print(f"⚠️ Could not plot montage: {e}")

# Store bad channels for later use
bad_channels_for_interpolation = raw.info['bads'].copy()
print(f"💾 Stored {len(bad_channels_for_interpolation)} bad channels for later interpolation")

# Store results
raw_versions['After Bad Channels'] = raw.copy()

# Quality assessment
snr_after_bad = compute_snr(raw)
processing_log['After Bad Channels'] = snr_after_bad
print_snr_summary(snr_after_bad, "After Bad Channel Detection")

# Plot PSD comparison to show impact of bad channel detection
plot_psd_comparison(
    [raw_versions['After Downsample'], raw_versions['After Bad Channels']], 
    ['Before Bad Channel Detection', 'After Bad Channel Detection'],
    "Power Spectral Density: Bad Channel Detection Effect"
)
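
The percentile rule in the statistical fallback flags the same share of channels regardless of how clean the recording is. One possible alternative, sketched here under assumptions, is a robust z-score on log-variance, which flags a channel only when it deviates strongly from the median. The function name `robust_variance_outliers` and the 3.5 threshold are illustrative choices, not part of PREP or of the pipeline above.

```python
import numpy as np


def robust_variance_outliers(channel_vars, ch_names, z_thresh=3.5):
    """Return channel names whose log-variance is a robust outlier.

    Hypothetical helper: uses the median and the median absolute
    deviation (MAD) instead of fixed percentiles.
    """
    # A tiny offset keeps flat (zero-variance) channels finite in log space
    log_var = np.log10(np.asarray(channel_vars, dtype=float) + np.finfo(float).tiny)
    median = np.median(log_var)
    mad = np.median(np.abs(log_var - median))
    if mad == 0:
        # All channels (nearly) identical: nothing can be called an outlier
        return []
    # 0.6745 scales the MAD so the score is comparable to a standard z-score
    robust_z = 0.6745 * (log_var - median) / mad
    return [name for name, z in zip(ch_names, robust_z) if abs(z) > z_thresh]


# Illustrative variances: one channel is far noisier than the rest
example_vars = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0, 1.02, 0.98]
example_names = [f"Ch{i + 1}" for i in range(len(example_vars))]
flagged = robust_variance_outliers(example_vars, example_names)
```

With real data, `channel_vars` and `raw.ch_names` from the fallback could be passed in directly; on a clean recording this rule can return an empty list, which the percentile rule cannot.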

### 5.5 Independent Component Analysis (ICA)

**Purpose**: Separate mixed EEG signals into independent components and remove non-neural artifacts.

**Why needed**: EEG signals are mixtures of neural activity, muscle artifacts, heart beats, eye movements, and other noise sources. These artifacts can't always be removed by simple filtering and often overlap with neural frequencies.

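**Method**: ICA decomposes the signal into statistically independent components. ICLabel then assigns each component to one of seven classes (brain, muscle artifact, eye blink, heart beat, line noise, channel noise, or other) together with a probability. In the cell below, components labeled as one of the five artifact classes with a probability above 0.7 are removed; brain and "other" components are kept, and the signal is rebuilt from them.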
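
To make the remove-and-rebuild step concrete, here is a minimal NumPy sketch of the linear mixing model behind ICA. It assumes the mixing matrix is already known, whereas real ICA has to estimate it from the data; the sources, the matrix values, and the helper `remove_components` are all hypothetical.

```python
import numpy as np

sfreq = 250.0
n_samples = 1000
t = np.arange(n_samples) / sfreq

# Two hypothetical sources: a 10 Hz rhythm and a brief blink-like deflection
neural = np.sin(2 * np.pi * 10 * t)
blink = np.zeros(n_samples)
blink[200:230] = 5.0
sources = np.vstack([neural, blink])  # shape (n_components, n_samples)

# Hypothetical mixing matrix: how strongly each source reaches 3 channels
mixing = np.array([[1.0, 0.8],
                   [0.5, 0.3],
                   [0.2, 0.1]])
recorded = mixing @ sources  # shape (n_channels, n_samples)


def remove_components(data, mixing, exclude):
    """Project data to component space, zero some components, project back."""
    unmixing = np.linalg.pinv(mixing)
    activations = unmixing @ data
    activations[exclude] = 0.0
    return mixing @ activations


# Drop component 1 (the blink) and rebuild the channel data
cleaned = remove_components(recorded, mixing, exclude=[1])
```

After the call, `cleaned` contains only the contribution of the first source on every channel. The other source is removed across all time points without discarding any samples.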

In [None]:
## 🧠 Independent Component Analysis (ICA)

from mne.preprocessing import ICA
from mne_icalabel import label_components

print("🧠 Running ICA for artifact removal...")

try:
    # Prepare data for ICA (exclude bad channels, use average reference)
    raw_for_ica = raw.copy().pick('eeg', exclude='bads')
    raw_for_ica.set_eeg_reference('average', projection=False, verbose=False)
    
    # Fit ICA
    print("🔧 Fitting ICA components...")
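    # Note: ICLabel was trained on extended-infomax decompositions of data that were
    # band-passed at 1-100 Hz and average-referenced. FastICA is used here for speed,
    # so mne-icalabel may warn and the labels may be less reliable. For closer agreement,
    # use ICA(method='infomax', fit_params=dict(extended=True)).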
    ica = ICA(n_components=None, method='fastica', random_state=42, max_iter='auto')
    ica.fit(raw_for_ica, verbose=False)
    
    # Classify components with ICLabel
    print("🏷️ Classifying components with ICLabel...")
    ic_labels = label_components(raw_for_ica, ica, method='iclabel')
    
    # Find artifact components to exclude
    labels = ic_labels['labels']
    probabilities = ic_labels['y_pred_proba']
    artifact_types = ['muscle artifact', 'eye blink', 'heart beat', 'line noise', 'channel noise']
    
    exclude_idx = []
    for i, (label, probs) in enumerate(zip(labels, probabilities)):
        if label in artifact_types and np.max(probs) > 0.7:  # High confidence artifacts
            exclude_idx.append(i)
    
    # Show only artifact components being excluded
    if exclude_idx:
        print(f"📊 Artifact components to exclude ({len(exclude_idx)} total):")
        for i in exclude_idx[:12]:  # Show first 12 excluded components
            label = labels[i]
            confidence = np.max(probabilities[i])
            print(f"   IC{i:02d}: {label} (confidence: {confidence:.2f}) [EXCLUDED]")
        
        if len(exclude_idx) > 12:
            print(f"   ... and {len(exclude_idx)-12} more artifact components")
    else:
        print("📊 No high-confidence artifact components found")
    
    # Apply ICA artifact removal
    ica.exclude = exclude_idx
    raw = ica.apply(raw, verbose=False)
    
    print(f"✅ ICA applied - removed {len(exclude_idx)} artifact components")
    
    # Plot excluded component topographies only
    if exclude_idx:
        try:
            import matplotlib.pyplot as plt  # ensure plt is defined in this cell
            fig = ica.plot_components(picks=exclude_idx[:12],  # Show first 12 excluded
                                     title=f'Excluded ICA Components ({len(exclude_idx)} total)', 
                                     show=False)
            plt.show()
        except Exception as e:
            print(f"📊 Component visualization unavailable: {e}")
    else:
        print("📊 No artifact components to visualize")
        
except Exception as e:
    print(f"⚠️ ICA failed: {e}")
    print("Continuing without ICA artifact removal")

# Store results
raw_versions['After ICA'] = raw.copy()

# Quality assessment
snr_after_ica = compute_snr(raw)
processing_log['After ICA'] = snr_after_ica
print_snr_summary(snr_after_ica, "After ICA")

# Plot before/after comparison
plot_psd_comparison(
    [raw_versions['After Bad Channels'], raw_versions['After ICA']], 
    ['Before ICA', 'After ICA'],
    "Power Spectral Density: ICA Artifact Removal"
)

### 5.6 Bad Channel Interpolation

**Purpose**: Restore the full electrode array by estimating signals at previously identified bad channel locations.

**Why needed**: Many analyses (especially connectivity and source localization) require a complete, uniform electrode montage. Missing channels create gaps in spatial coverage and can bias results toward areas with higher electrode density.

**Method**: Spherical spline interpolation uses signals from neighboring good electrodes to estimate what the signal would have been at bad electrode locations, based on the spatial smoothness of scalp potentials.
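
The idea can be sketched in NumPy following the spherical spline formulation of Perrin et al. (1989). This is not MNE's implementation: the function names are hypothetical, the stiffness of 4 and the 50 Legendre terms are assumed values, and no regularization is applied.

```python
import numpy as np
from numpy.polynomial.legendre import legval


def _spline_g(cos_angle, stiffness=4, n_terms=50):
    """Spline kernel evaluated at the cosine of the angle between electrodes."""
    n = np.arange(1, n_terms + 1, dtype=float)
    coeffs = (2 * n + 1) / (n ** stiffness * (n + 1) ** stiffness * 4 * np.pi)
    # Prepend 0 because the series starts at n = 1 (no constant term)
    return legval(np.clip(cos_angle, -1.0, 1.0), np.concatenate([[0.0], coeffs]))


def _to_unit_sphere(pos):
    """Scale electrode positions so they lie on the unit sphere."""
    pos = np.asarray(pos, dtype=float)
    return pos / np.linalg.norm(pos, axis=1, keepdims=True)


def spherical_spline_interpolate(pos_good, pos_bad, data_good):
    """Estimate signals at pos_bad from data_good of shape (n_good, n_times)."""
    pos_good = _to_unit_sphere(pos_good)
    pos_bad = _to_unit_sphere(pos_bad)
    n_good = len(pos_good)

    # Solve [[G, 1], [1^T, 0]] [c; c0] = [v; 0] for spline weights c and offset c0
    system = np.ones((n_good + 1, n_good + 1))
    system[:n_good, :n_good] = _spline_g(pos_good @ pos_good.T)
    system[n_good, n_good] = 0.0
    rhs = np.vstack([data_good, np.zeros((1, data_good.shape[1]))])
    solution = np.linalg.solve(system, rhs)
    weights, offset = solution[:n_good], solution[n_good]

    # Evaluate the fitted spline at the bad-channel positions
    return _spline_g(pos_bad @ pos_good.T) @ weights + offset
```

Two properties make the result easy to check: a spatially constant field is reproduced exactly, and evaluating the spline at a good electrode's own position returns that electrode's signal. In practice `raw.interpolate_bads` handles positions, units, and regularization for you.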

In [None]:
## 🔧 Interpolate Bad Channels

print("🔧 Interpolating bad channels...")

# Use the bad channels identified earlier
if 'bad_channels_for_interpolation' in locals() and bad_channels_for_interpolation:
    # Set the bad channels in the current raw data
    raw.info['bads'] = bad_channels_for_interpolation
    n_bad = len(bad_channels_for_interpolation)
    
    print(f"📍 Interpolating {n_bad} bad channels: {bad_channels_for_interpolation}")
    
    try:
        # Interpolate bad channels using spherical splines
        raw.interpolate_bads(reset_bads=True, verbose=False)
        print(f"✅ Successfully interpolated {n_bad} channels")
        
    except Exception as e:
        print(f"⚠️ Interpolation failed: {e}")
        print("This usually means channel positions are missing from montage")
        # Clear the bad channels list if interpolation failed
        raw.info['bads'] = []
        
else:
    print("ℹ️ No bad channels to interpolate")

# Summary
n_remaining_bad = len(raw.info['bads'])
print(f"📊 Channel status: {len(raw.ch_names)} total, {n_remaining_bad} still marked as bad")

# Store results
raw_versions['After Interpolation'] = raw.copy()

# Quality assessment
snr_after_interp = compute_snr(raw)
processing_log['After Interpolation'] = snr_after_interp
print_snr_summary(snr_after_interp, "After Channel Interpolation")

# Plot before/after comparison
plot_psd_comparison(
    [raw_versions['After ICA'], raw_versions['After Interpolation']], 
    ['Before Interpolation', 'After Interpolation'],
    "Power Spectral Density: Channel Interpolation Effect"
)

### 5.7 Bad Time Segments Removal (Artifact Subspace Reconstruction)

**Purpose**: Automatically detect and correct brief periods of extreme artifacts that affect multiple channels simultaneously.

**Why needed**: Even after other cleaning steps, occasional periods of extreme artifacts can remain (sudden movements, cable bumps, amplifier saturation). These brief but severe artifacts can distort statistical analyses and connectivity measures.

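**Method**: ASR learns the statistics of the "normal" signal from calibration data, then reconstructs time windows in which the signal deviates from that baseline by more than a threshold expressed in standard deviations. This attenuates transient artifacts while leaving the remaining activity in place. In the cell below, the first 30 s of the recording serve as calibration data and are assumed to be reasonably clean; the cutoff is 5 standard deviations.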
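
The calibrate-then-threshold idea can be illustrated with a much simplified NumPy sketch. It only flags windows whose component amplitude exceeds limits learned from calibration data and does not reconstruct them, so it is not a substitute for `meegkit.asr.ASR`. The function name, the 0.5 s window length, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def flag_artifact_windows(calib, data, sfreq, cutoff=5.0, win_sec=0.5):
    """Flag windows of `data` whose component RMS exceeds calibration limits.

    calib, data: arrays of shape (n_channels, n_samples).
    Returns a boolean array with one entry per non-overlapping window.
    """
    baseline = calib.mean(axis=1, keepdims=True)
    # Principal axes of the calibration data (columns of `axes`)
    _, axes = np.linalg.eigh(np.cov(calib - baseline))
    win = int(win_sec * sfreq)

    def window_rms(x):
        n_win = x.shape[1] // win
        comps = axes.T @ (x[:, :n_win * win] - baseline)
        comps = comps.reshape(comps.shape[0], n_win, win)
        return np.sqrt((comps ** 2).mean(axis=2))  # (n_components, n_windows)

    # Per-component limit: mean + cutoff * std of the calibration window RMS
    calib_rms = window_rms(calib)
    limits = calib_rms.mean(axis=1) + cutoff * calib_rms.std(axis=1)
    return (window_rms(data) > limits[:, None]).any(axis=0)


# Synthetic example: clean noise plus one large burst on the first channel
rng = np.random.default_rng(0)
sfreq = 250.0
calib = rng.standard_normal((4, 5000))
data = rng.standard_normal((4, 5000))
data[0, 1250:1375] += 20.0  # burst covering the window with index 10
flags = flag_artifact_windows(calib, data, sfreq)
```

A lower `cutoff` flags more windows and a higher one fewer, which is the same trade-off the ASR cutoff controls.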

In [None]:
## ⚡ Remove Bad Time Segments (ASR)

print("⚡ Applying Artifact Subspace Reconstruction...")

try:
    from meegkit.asr import ASR
    
    # Get good EEG channels only (exclude bad channels)
    picks_eeg_good = mne.pick_types(raw.info, eeg=True, eog=False, exclude='bads')
    
    if len(picks_eeg_good) == 0:
        raise ValueError("No good EEG channels available for ASR")
    
    eeg_data = raw.get_data(picks=picks_eeg_good)
    sfreq = raw.info['sfreq']
    
    # ASR parameters
    asr_cutoff = 5  # Standard deviation cutoff
    train_duration = min(30, raw.times[-1])  # Use first 30s or less for training
    train_samples = int(train_duration * sfreq)
    
    print(f"🔧 ASR cutoff: {asr_cutoff}σ, training: {train_duration:.1f}s on {len(picks_eeg_good)} channels")
    
    # Fit ASR on the calibration segment (assumed to be reasonably clean)
    asr = ASR(sfreq=sfreq, cutoff=asr_cutoff)
    train_data = eeg_data[:, :train_samples]
    asr.fit(train_data)
    
    # Transform the entire EEG dataset
    cleaned_data = asr.transform(eeg_data)
    
    # Share of signal variance changed by ASR, averaged across channels
    modified_pct = np.mean(np.var(eeg_data - cleaned_data, axis=1) / np.var(eeg_data, axis=1)) * 100
    
    # Update only the good EEG channels in raw data
    raw._data[picks_eeg_good] = cleaned_data
    
    print(f"✅ ASR applied - {modified_pct:.1f}% of signal variance modified")
    
except Exception as e:
    print(f"⚠️ ASR failed: {e}")
    print("Continuing without ASR...")

# Store results
raw_versions['After ASR'] = raw.copy()

# Quality assessment
snr_after_asr = compute_snr(raw)
processing_log['After ASR'] = snr_after_asr
print_snr_summary(snr_after_asr, "After ASR")

# Plot before/after comparison
plot_psd_comparison(
    [raw_versions['After Interpolation'], raw_versions['After ASR']], 
    ['Before ASR', 'After ASR'],
    "Power Spectral Density: ASR Effect"
)

## 6. Final Quality Assessment

**Purpose**: Evaluate the overall effectiveness of the preprocessing pipeline and document improvements.

**Why needed**: It's crucial to verify that preprocessing actually improved data quality rather than inadvertently removing important neural signals. Quantitative metrics provide objective evidence of improvement and help optimize preprocessing parameters.

**Methods**: Compare SNR across frequency bands before and after processing, visualize PSD changes, and generate comprehensive quality reports.

Let's examine the overall improvement in data quality:

In [None]:
## 📈 Final Quality Assessment

print("📈 Final Quality Assessment")
print("=" * 60)

# Get the final processing step that was completed
final_step_names = list(raw_versions.keys())
final_step = final_step_names[-1]  # Last completed step
final_raw = raw_versions[final_step]

print(f"Final processing step: {final_step}")

# Plot processing summary showing SNR progression
plot_processing_summary(processing_log)

# Final comparison: Original vs Final processed data
plot_psd_comparison(
    [raw_versions['Original'], final_raw], 
    ['Original Data', 'Fully Processed'],
    f"Final Comparison: Original vs {final_step}"
)

# SNR improvement summary
print("\n🎯 SNR Improvement Summary:")
print("=" * 50)

original_snr = processing_log['Original']
final_snr = processing_log[final_step]

for band in ['delta', 'theta', 'alpha', 'beta', 'gamma']:
    original_val = original_snr[band]['mean_snr']
    final_val = final_snr[band]['mean_snr']
    improvement = final_val - original_val
    
    status = "↗️" if improvement > 0 else "↘️" if improvement < 0 else "→"
    print(f"{band.capitalize():>8}: {original_val:6.2f} → {final_val:6.2f} dB ({improvement:+5.2f} dB) {status}")

print("=" * 50)

# Final data summary
print("\n📊 Final Data Summary:")
print(f"├── Duration: {raw.times[-1]:.1f} seconds")
print(f"├── Sampling rate: {raw.info['sfreq']:.0f} Hz")
print(f"├── Channels: {len(raw.ch_names)} EEG")
print(f"├── Bad channels processed: {len(bad_channels_for_interpolation) if 'bad_channels_for_interpolation' in locals() else 0}")
print(f"├── Final step completed: {final_step}")
print(f"└── Pipeline completed successfully! ✅")

# Processing steps completed
print(f"\n🔄 Processing Steps Completed:")
for i, step in enumerate(final_step_names, 1):
    print(f"   {i}. {step}")

print(f"\n🎉 EEG preprocessing pipeline completed!")
print(f"Your cleaned EEG data is ready for analysis.")
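
The improvements printed above are differences in dB, which can be hard to read at a glance. Assuming `compute_snr` reports SNR as 10·log10 of a power ratio (an assumption about that function; a 20·log10 amplitude convention would need a different conversion), a change in dB translates into a multiplicative change in the power ratio. The helper name and the example values are hypothetical.

```python
def db_to_power_ratio(delta_db):
    """Convert an SNR change in dB to a multiplicative change in power ratio.

    Assumes the 10 * log10(power ratio) convention for dB.
    """
    return 10 ** (delta_db / 10)


# Hypothetical per-band improvements in dB, for illustration only
example_improvements = {'alpha': 3.0, 'gamma': -1.5}
for band, delta_db in example_improvements.items():
    print(f"{band}: {delta_db:+.1f} dB is a factor of {db_to_power_ratio(delta_db):.2f}")
```

Under that convention, +3 dB is roughly a doubling of the signal-to-noise power ratio and +10 dB is a tenfold increase.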

## 7. Saving Results

**Purpose**: Export cleaned data in multiple formats and generate comprehensive documentation of the preprocessing workflow.

**Why needed**: Different analysis software requires different file formats. Documentation ensures reproducibility and helps track what preprocessing steps were applied. Quality metrics provide evidence of data improvement for publications.

**Methods**: Save the cleaned recording in the same format as the input file (FIF natively; other formats such as BrainVision, EEGLAB, or EDF through `mne.export.export_raw`) and save a before/after comparison figure with PSD and per-band SNR as a PNG.

Save the cleaned data and generate a processing report:

In [None]:
## 💾 Save Cleaned Data

from pathlib import Path
import matplotlib.pyplot as plt

print("💾 Saving cleaned EEG data...")

# Create output directory
output_dir = Path('cleaneeg_output')
output_dir.mkdir(exist_ok=True)

# Get input file information (must exist from data loading)
if 'eeg_file' not in locals():
    print("❌ No input file information found!")
    raise ValueError("Cannot determine input file - please check data loading step")

input_file = eeg_file
input_format = input_file.suffix
print(f"📂 Input file: {input_file.name} (format: {input_format})")

# Create output filename with same base name as input
output_file = output_dir / f"{input_file.stem}_clean{input_file.suffix}"

# Save cleaned data
try:
    if input_format == '.fif':
        raw.save(output_file, overwrite=True, verbose=False)
    else:
        mne.export.export_raw(output_file, raw, fmt='auto', overwrite=True, verbose=False)
    
    print(f"✅ Cleaned data saved: {output_file.name}")
    
except Exception as e:
    print(f"❌ Failed to save data: {e}")
    output_file = None

# Create simple before/after comparison report
print("📊 Creating comparison report...")

try:
    # Get original and final data
    original_raw = raw_versions['Original']
    final_step = list(raw_versions.keys())[-1]
    final_raw = raw_versions[final_step]
    
    # Create comparison plot
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('EEG Processing Comparison: Original vs Cleaned', fontsize=16, fontweight='bold')
    
    # Original PSD
    original_psd = original_raw.compute_psd(fmax=50, verbose=False)
    original_psd.plot(axes=ax1, show=False, spatial_colors=False, average=True)
    ax1.set_title('Original Data - Power Spectral Density')
    ax1.set_xlabel('Frequency (Hz)')
    ax1.set_ylabel('Power (dB)')
    
    # Cleaned PSD  
    final_psd = final_raw.compute_psd(fmax=50, verbose=False)
    final_psd.plot(axes=ax2, show=False, spatial_colors=False, average=True)
    ax2.set_title('Cleaned Data - Power Spectral Density')
    ax2.set_xlabel('Frequency (Hz)')
    ax2.set_ylabel('Power (dB)')
    
    # SNR comparison by frequency band
    original_snr = processing_log['Original']
    final_snr = processing_log[final_step]
    
    bands = ['delta', 'theta', 'alpha', 'beta', 'gamma']
    original_values = [original_snr[band]['mean_snr'] for band in bands]
    final_values = [final_snr[band]['mean_snr'] for band in bands]
    
    x = range(len(bands))
    width = 0.35
    
    ax3.bar([i - width/2 for i in x], original_values, width, label='Original', alpha=0.7)
    ax3.bar([i + width/2 for i in x], final_values, width, label='Cleaned', alpha=0.7)
    ax3.set_xlabel('Frequency Band')
    ax3.set_ylabel('SNR (dB)')
    ax3.set_title('Signal-to-Noise Ratio by Frequency Band')
    ax3.set_xticks(x)
    ax3.set_xticklabels([b.capitalize() for b in bands])
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    # SNR improvement
    improvements = [final_values[i] - original_values[i] for i in range(len(bands))]
    colors = ['green' if imp > 0 else 'red' if imp < 0 else 'gray' for imp in improvements]
    
    ax4.bar(x, improvements, color=colors, alpha=0.7)
    ax4.set_xlabel('Frequency Band')
    ax4.set_ylabel('SNR Improvement (dB)')
    ax4.set_title('SNR Improvement After Processing')
    ax4.set_xticks(x)
    ax4.set_xticklabels([b.capitalize() for b in bands])
    ax4.axhline(y=0, color='black', linestyle='-', alpha=0.3)
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    
    # Save comparison plot with same base name as input file
    report_file = output_dir / f"{input_file.stem}_comparison.png"
    plt.savefig(report_file, dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"✅ Comparison report saved: {report_file.name}")
    
except Exception as e:
    print(f"❌ Failed to create report: {e}")

# Summary
print(f"\n🎉 Processing completed!")
print(f"📁 Output folder: {output_dir.absolute()}")
if output_file:
    print(f"📄 Cleaned data: {output_file.name}")
if 'report_file' in locals() and report_file.exists():
    print(f"📊 Comparison: {report_file.name}")

print(f"\n✅ Your cleaned EEG data is ready for analysis!")
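
The per-step SNR values could also be written to a CSV file for use outside the notebook. The sketch below assumes the `processing_log[step][band]['mean_snr']` structure used in the cells above; the function name `export_snr_log` and the column names are hypothetical.

```python
import csv
from pathlib import Path


def export_snr_log(processing_log, csv_path,
                   bands=('delta', 'theta', 'alpha', 'beta', 'gamma')):
    """Write one row per (processing step, frequency band) to a CSV file."""
    csv_path = Path(csv_path)
    with csv_path.open('w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['step', 'band', 'mean_snr_db'])
        for step, snr in processing_log.items():
            for band in bands:
                writer.writerow([step, band, snr[band]['mean_snr']])
    return csv_path
```

In this notebook it could be called as `export_snr_log(processing_log, output_dir / f"{input_file.stem}_snr.csv")`, where the file name is only a suggestion.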

## Conclusion

🎉 **Congratulations!** You have successfully completed the CleanEEG preprocessing pipeline.

### What we accomplished:

1. **Loaded and inspected** your EEG data
2. **Applied comprehensive preprocessing** following the DISCOVER-EEG framework:
   - Line noise removal using Denoising Source Separation
   - Bandpass filtering to remove slow drifts and high-frequency noise
   - Downsampling for computational efficiency
   - Automatic bad channel detection using the PREP pipeline
   - Independent Component Analysis with automatic classification
   - Bad channel interpolation
   - Bad time segment removal using Artifact Subspace Reconstruction

3. **Monitored data quality** throughout the pipeline using SNR metrics
4. **Saved cleaned data** in the same format as the input file
5. **Generated a before/after comparison figure** summarizing the PSD and SNR changes

### Next steps:

Your cleaned EEG data is now ready for:
- **Spectral analysis** (power spectral density, frequency band analysis)
- **Connectivity analysis** (coherence, phase-amplitude coupling)
- **Event-related potential analysis** (if you have event markers)
- **Machine learning applications** (classification, regression)
- **Source localization** (if you have a forward model)

### Tips for further analysis:

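- The cleaned data keeps the original channel set (bad channels are interpolated rather than dropped) and the recording duration, but at the downsampled rate
- Per-step SNR values are stored in `processing_log`, and the comparison figure summarizes the before/after difference
- SNR changes are a rough indicator of each step's effect; also inspect the PSD plots, since a higher SNR alone does not show that neural signal was preserved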
- Consider the specific requirements of your analysis when choosing output formats

**Happy analyzing!** 🧠✨