# LLM-EEG Framework - Phase 2: Data Loading & Processing

This notebook walks through Phase 2 of the LLM-EEG framework (data loading, validation,
and preprocessing) using the BCI Competition IV-2a dataset.

## Overview

**Phase 2 Components:**
- **Data Loaders**: `BCICIV2aLoader` for loading MAT files
- **Preprocessing**: Bandpass filter (8-30 Hz), Notch filter (50/60 Hz), Normalization
- **Validation**: Data structure validation and signal quality assessment
- **PyTorch Integration**: `EEGDataset` with train/val/test splitting

**Dataset Information:**
- 9 subjects (A01-A09)
- 4 motor imagery classes (left hand, right hand, feet, tongue)
- 22 EEG + 3 EOG channels
- 250 Hz sampling rate
- 288 trials per session (48 per run × 6 runs; 72 per class)
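
The figures in this list determine the array shapes used in the rest of the notebook. As a quick sanity check, here is the arithmetic, assuming 6 motor imagery runs of 48 trials each and the 4-second window that the loader in Step 3 uses by default (the window length is a loader setting, not a dataset constant):

```python
# Dataset arithmetic for one BCI IV-2a session.
# TRIAL_DURATION_S is an assumption taken from the loader default in Step 3.
N_RUNS = 6
TRIALS_PER_RUN = 48
N_CLASSES = 4
N_EEG_CHANNELS = 22
SAMPLING_RATE_HZ = 250
TRIAL_DURATION_S = 4.0

trials_per_session = N_RUNS * TRIALS_PER_RUN
trials_per_class = trials_per_session // N_CLASSES
samples_per_trial = int(TRIAL_DURATION_S * SAMPLING_RATE_HZ)
expected_shape = (trials_per_session, N_EEG_CHANNELS, samples_per_trial)

print(trials_per_session)   # 288
print(trials_per_class)     # 72
print(expected_shape)       # (288, 22, 1000)
```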

**Google Drive Dataset:**
- URL: https://drive.google.com/drive/folders/14tFFsegwr6oYF4wUuf_mjNOAgfuQ_Bwk

---

## Step 1: Environment Setup

In [None]:
# Step 1.1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Step 1.2: Clone the repository
!git clone https://github.com/erlika/llm-eeg.git
%cd llm-eeg

In [None]:
# Step 1.3: Install dependencies
!pip install -q numpy scipy mne torch scikit-learn matplotlib seaborn

In [None]:
# Step 1.4: Add src to Python path
import sys
sys.path.insert(0, '/content/llm-eeg')

# Verify import
from src.core.data_types import EEGData, TrialData
print("✅ LLM-EEG framework imported successfully!")

## Step 2: Dataset Configuration

In [None]:
# Step 2.1: Configure data paths
# Update this path to match your Google Drive folder structure
DATA_DIR = '/content/drive/MyDrive/BCI_Data/dataset_2a'  # Adjust as needed

# Alternative common paths:
# DATA_DIR = '/content/drive/MyDrive/BCI_Competition_IV_2a'
# DATA_DIR = '/content/drive/MyDrive/EEG/BCI_IV_2a'

import os
if os.path.exists(DATA_DIR):
    files = os.listdir(DATA_DIR)
    mat_files = [f for f in files if f.endswith('.mat')]
    print(f"✅ Found {len(mat_files)} MAT files in {DATA_DIR}")
    print(f"Files: {sorted(mat_files)}")
else:
    print(f"❌ Directory not found: {DATA_DIR}")
    print("Please update DATA_DIR to your dataset location")

In [None]:
# Step 2.2: Dataset constants
from src.data import (
    BCI_IV_2A_EEG_CHANNELS,
    BCI_IV_2A_EOG_CHANNELS, 
    BCI_IV_2A_SAMPLING_RATE,
    BCI_IV_2A_CLASS_MAPPING,
    BCI_IV_2A_EVENT_CODES,
    BCI_IV_2A_TRIALS_PER_SESSION
)

print("=== BCI Competition IV-2a Dataset Constants ===")
print(f"EEG Channels: {BCI_IV_2A_EEG_CHANNELS}")
print(f"EOG Channels: {BCI_IV_2A_EOG_CHANNELS}")
print(f"Sampling Rate: {BCI_IV_2A_SAMPLING_RATE} Hz")
print(f"Trials per Session: {BCI_IV_2A_TRIALS_PER_SESSION}")
print(f"\nClass Mapping: {BCI_IV_2A_CLASS_MAPPING}")
print(f"Event Codes: {BCI_IV_2A_EVENT_CODES}")

## Step 3: Custom Data Loader for BCI IV-2a

The BCI Competition IV-2a MAT files have a nested structure:
- `data`: an array with 9 elements, one per run in the session
- Each run contains `X` (signals), `y` (labels), `trial` (trial start markers), `fs` (sampling rate), etc.
- Runs 0-2 contain no motor imagery (MI) trials; runs 3-8 contain 48 trials each
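
Because the runs are concatenated into one continuous recording, each run-local trial marker has to be shifted by the number of samples in all preceding runs. The sketch below illustrates that bookkeeping with `SimpleNamespace` objects as hypothetical stand-ins for the per-run structs (the shapes and marker values are made up). It treats the markers as plain offsets; a 1-based MATLAB index would additionally need 1 subtracted.

```python
from types import SimpleNamespace

import numpy as np

# Hypothetical stand-ins for the per-run structs: two runs without trials
# followed by two runs with trials. Real files hold more runs and samples.
runs = [
    SimpleNamespace(X=np.zeros((500, 25)), y=np.array([]), trial=np.array([])),
    SimpleNamespace(X=np.zeros((400, 25)), y=np.array([]), trial=np.array([])),
    SimpleNamespace(X=np.zeros((1000, 25)), y=np.array([1, 3]), trial=np.array([100, 600])),
    SimpleNamespace(X=np.zeros((1000, 25)), y=np.array([2, 4]), trial=np.array([50, 700])),
]

events = []          # (global_sample_index, label) pairs
sample_offset = 0    # samples contributed by all earlier runs

for run in runs:
    for start, label in zip(run.trial, run.y):
        # Shift the run-local marker by the length of the preceding runs
        events.append((int(start) + sample_offset, int(label)))
    sample_offset += run.X.shape[0]

print(events)         # [(1000, 1), (1500, 3), (1950, 2), (2600, 4)]
print(sample_offset)  # 2900
```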

In [None]:
# Step 3.1: BCICIV2aLoader - Custom loader for BCI Competition IV-2a
import numpy as np
from scipy.io import loadmat
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Any
from src.core.data_types import EEGData, EventMarker

class BCICIV2aLoader:
    """
    Data loader for BCI Competition IV-2a dataset.
    
    Handles the specific MAT file structure with nested data arrays.
    Supports loading single subjects, extracting trials, and event parsing.
    
    Attributes:
        sampling_rate: Signal sampling rate (250 Hz)
        n_eeg_channels: Number of EEG channels (22)
        n_eog_channels: Number of EOG channels (3)
        include_eog: Whether to include EOG channels
        class_mapping: Mapping from class labels to names
    """
    
    def __init__(
        self,
        sampling_rate: int = 250,
        include_eog: bool = False,
        trial_duration: float = 4.0,
        trial_offset: float = 0.0
    ):
        self.sampling_rate = sampling_rate
        self.n_eeg_channels = 22
        self.n_eog_channels = 3
        self.include_eog = include_eog
        self.trial_duration = trial_duration
        self.trial_offset = trial_offset
        
        # BCI IV-2a class mapping
        self.class_mapping = {
            1: 'left_hand',
            2: 'right_hand', 
            3: 'feet',
            4: 'tongue'
        }
        
        # EEG channel names (10-20 system)
        self.eeg_channel_names = [
            'Fz', 'FC3', 'FC1', 'FCz', 'FC2', 'FC4',
            'C5', 'C3', 'C1', 'Cz', 'C2', 'C4', 'C6',
            'CP3', 'CP1', 'CPz', 'CP2', 'CP4',
            'P1', 'Pz', 'P2', 'POz'
        ]
        self.eog_channel_names = ['EOG1', 'EOG2', 'EOG3']
        
    def load(self, file_path: str) -> EEGData:
        """
        Load EEG data from a BCI IV-2a MAT file.
        
        Args:
            file_path: Path to the MAT file (e.g., A01T.mat)
            
        Returns:
            EEGData object containing signals, events, and metadata
        """
        print(f"Loading: {file_path}")
        
        # Load MAT file
        mat_data = loadmat(file_path, struct_as_record=False, squeeze_me=True)
        
        # Extract data array (9 runs)
        data_array = mat_data['data']
        
        # Concatenate signals and collect events from MI runs (3-8)
        all_signals = []
        all_events = []
        sample_offset = 0
        
        for run_idx in range(len(data_array)):
            run = data_array[run_idx]
            
            # Get signals (samples x channels)
            signals = run.X
            n_samples = signals.shape[0]
            all_signals.append(signals)
            
            # Get labels and trial markers (only in runs 3-8)
            if hasattr(run, 'y') and hasattr(run.y, '__len__') and len(run.y) > 0:
                labels = run.y
                trial_starts = run.trial
                
                for start, label in zip(trial_starts, labels):
                    event = EventMarker(
                        # MATLAB indices are 1-based; subtract 1 for 0-based NumPy indexing
                        sample=int(start) - 1 + sample_offset,
                        code=768 + int(label),  # Labels 1-4 -> event codes 769-772
                        label=self.class_mapping.get(int(label), f'class_{label}')
                    )
                    all_events.append(event)
            
            sample_offset += n_samples
        
        # Concatenate all signals
        signals = np.vstack(all_signals)  # (total_samples, channels)
        
        # Select channels
        if self.include_eog:
            n_channels = self.n_eeg_channels + self.n_eog_channels
            channel_names = self.eeg_channel_names + self.eog_channel_names
        else:
            n_channels = self.n_eeg_channels
            channel_names = self.eeg_channel_names
            signals = signals[:, :self.n_eeg_channels]
        
        # Transpose to (channels, samples)
        signals = signals.T
        
        # Extract subject info from filename
        import os
        filename = os.path.basename(file_path)
        subject_id = filename[:3] if len(filename) >= 3 else 'unknown'
        session_type = filename[3] if len(filename) > 3 else 'T'
        
        # Create EEGData object
        eeg_data = EEGData(
            signals=signals,
            sampling_rate=self.sampling_rate,
            channel_names=channel_names,
            events=all_events,
            metadata={
                'subject_id': subject_id,
                'session_type': session_type,
                'n_trials': len(all_events),
                'n_runs': len(data_array),
                'file_path': file_path
            }
        )
        
        print(f"  ✅ Loaded: {signals.shape[0]} channels, {signals.shape[1]} samples")
        print(f"  ✅ Events: {len(all_events)} trials")
        
        return eeg_data
    
    def extract_trials(
        self,
        eeg_data: EEGData,
        duration: Optional[float] = None,
        offset: Optional[float] = None
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Extract fixed-length trials from continuous EEG data.
        
        Args:
            eeg_data: EEGData object with events
            duration: Trial duration in seconds (default: 4.0)
            offset: Offset from event onset in seconds (default: 0.0)
            
        Returns:
            Tuple of (trials, labels):
                - trials: ndarray of shape (n_trials, n_channels, n_samples)
                - labels: ndarray of shape (n_trials,)
        """
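        # Use explicit None checks so a caller-supplied 0.0 is not replaced by the default
        duration = self.trial_duration if duration is None else duration
        offset = self.trial_offset if offset is None else offset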
        
        samples_per_trial = int(duration * self.sampling_rate)
        offset_samples = int(offset * self.sampling_rate)
        
        trials = []
        labels = []
        
        for event in eeg_data.events:
            # Get trial start/end
            start = event.sample + offset_samples
            end = start + samples_per_trial
            
            # Skip if out of bounds
            if start < 0 or end > eeg_data.signals.shape[1]:
                continue
            
            # Extract trial segment
            trial = eeg_data.signals[:, start:end]
            trials.append(trial)
            
            # Map event code to class label (0-3)
            if event.code >= 769:
                label = event.code - 769  # 769->0, 770->1, 771->2, 772->3
            else:
                label = event.code - 1  # Fallback
            labels.append(label)
        
        trials = np.array(trials)  # (n_trials, n_channels, n_samples)
        labels = np.array(labels)  # (n_trials,)
        
        print(f"Extracted {len(trials)} trials, shape: {trials.shape}")
        return trials, labels

print("✅ BCICIV2aLoader defined")

In [None]:
# Step 3.2: Load a single subject
import os

# Initialize loader
loader = BCICIV2aLoader(include_eog=False)

# Load subject A01 training data
subject_file = os.path.join(DATA_DIR, 'A01T.mat')
eeg_data = loader.load(subject_file)

print(f"\n=== EEG Data Summary ===")
print(f"Shape: {eeg_data.signals.shape}")
print(f"Sampling rate: {eeg_data.sampling_rate} Hz")
print(f"Channels: {len(eeg_data.channel_names)}")
print(f"Events: {len(eeg_data.events)}")
print(f"Duration: {eeg_data.signals.shape[1] / eeg_data.sampling_rate:.1f} seconds")
print(f"Metadata: {eeg_data.metadata}")

In [None]:
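# Step 3.3: Extract trials
# Note: with the default offset of 0.0, each 4 s window starts at the stored trial marker.
# If that marker is the trial start (fixation cross) rather than the cue, the cue follows
# about 2 s later, so offset=2.0 may target the motor imagery period more closely.
trials, labels = loader.extract_trials(eeg_data)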

print(f"\n=== Trial Data ===")
print(f"Trials shape: {trials.shape}")
print(f"Labels shape: {labels.shape}")

# Class distribution
unique, counts = np.unique(labels, return_counts=True)
print(f"\nClass distribution:")
for u, c in zip(unique, counts):
    class_name = loader.class_mapping.get(u + 1, f'class_{u}')
    print(f"  Class {u} ({class_name}): {c} trials")
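
Trial extraction boils down to slicing fixed-length windows out of the continuous `(channels, samples)` array. The following is a minimal standalone sketch of that idea; `slice_epochs` is a hypothetical helper written for illustration, not part of the framework. It also returns the indices of the events that were kept, which is one way to keep labels aligned when out-of-bounds windows are skipped.

```python
import numpy as np

def slice_epochs(signals, event_samples, fs, duration_s, offset_s=0.0):
    """Cut fixed-length windows from a (channels, samples) array.

    Hypothetical helper for illustration; windows that would run past either
    end of the recording are skipped.
    """
    n_win = int(duration_s * fs)
    shift = int(offset_s * fs)
    epochs, kept = [], []
    for idx, sample in enumerate(event_samples):
        start = sample + shift
        end = start + n_win
        if start < 0 or end > signals.shape[1]:
            continue
        epochs.append(signals[:, start:end])
        kept.append(idx)
    if not epochs:
        return np.empty((0, signals.shape[0], n_win)), kept
    return np.stack(epochs), kept

fs = 250
continuous = np.arange(2 * 3000, dtype=float).reshape(2, 3000)  # 2 channels, 12 s
event_samples = [0, 1000, 2500]

# A 4 s window starting 2 s after each marker; the last one runs past the end
epochs, kept = slice_epochs(continuous, event_samples, fs, duration_s=4.0, offset_s=2.0)
print(epochs.shape)  # (2, 2, 1000)
print(kept)          # [0, 1]
```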

## Step 4: Data Validation

In [None]:
# Step 4.1: Validate EEG data structure
from src.data.validators import DataValidator, ValidationResult

validator = DataValidator()

# Validate the loaded data
result = validator.validate(eeg_data)

print("=== Data Validation ===")
print(f"Valid: {result.is_valid}")

if result.errors:
    print(f"\n❌ Errors:")
    for error in result.errors:
        print(f"  - {error}")

if result.warnings:
    print(f"\n⚠️  Warnings:")
    for warning in result.warnings:
        print(f"  - {warning}")
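
The internals of `DataValidator` are not shown in this notebook. To give a rough idea of what structural validation can look like, here is a sketch under the assumption that it checks array dimensionality, channel-name consistency, finite values, and event bounds. `check_structure` is a hypothetical function, not the framework's implementation.

```python
import numpy as np

def check_structure(signals, channel_names, event_samples):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical checks for illustration, not the DataValidator implementation.
    """
    problems = []
    if signals.ndim != 2:
        problems.append(f"expected 2-D (channels, samples), got {signals.ndim}-D")
        return problems
    n_channels, n_samples = signals.shape
    if n_channels != len(channel_names):
        problems.append(f"{n_channels} signal rows but {len(channel_names)} channel names")
    if not np.isfinite(signals).all():
        problems.append("signals contain NaN or infinite values")
    out_of_range = [s for s in event_samples if not 0 <= s < n_samples]
    if out_of_range:
        problems.append(f"{len(out_of_range)} event(s) outside the recording")
    return problems

good = np.zeros((3, 100))
bad = good.copy()
bad[0, 5] = np.nan

print(check_structure(good, ["C3", "Cz", "C4"], [10, 50]))  # []
for problem in check_structure(bad, ["C3", "Cz"], [10, 150]):
    print("-", problem)
```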

In [None]:
# Step 4.2: Signal quality assessment
from src.data.validators import QualityChecker

quality_checker = QualityChecker(sampling_rate=250)

# Assess quality of trial data
quality_report = quality_checker.assess_quality(trials)

print("=== Signal Quality Report ===")
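overall_score = quality_report.get('overall_score')
# Format only when a numeric score is present; ':.3f' would raise on an 'N/A' fallback
print(f"Overall Score: {overall_score:.3f}" if overall_score is not None else "Overall Score: N/A")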
print(f"\nMetrics:")
for key, value in quality_report.items():
    if key != 'overall_score' and not isinstance(value, (list, np.ndarray)):
        print(f"  {key}: {value}")
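
How `QualityChecker` scores trials is not shown here. As a hedged illustration of common per-trial quality metrics, the sketch below flags trials that contain a flat channel or an excessive peak-to-peak amplitude. The function name and both thresholds are assumptions, and the amplitude threshold presumes microvolt units.

```python
import numpy as np

def trial_quality(trials, flat_std=1e-6, max_peak_to_peak=200.0):
    """Flag trials with flat channels or large amplitude swings.

    Hypothetical thresholds; max_peak_to_peak assumes microvolt units.
    """
    stds = trials.std(axis=2)                       # (n_trials, n_channels)
    ptp = trials.max(axis=2) - trials.min(axis=2)   # (n_trials, n_channels)
    flat = (stds < flat_std).any(axis=1)
    noisy = (ptp > max_peak_to_peak).any(axis=1)
    good = ~(flat | noisy)
    return {"flat": flat, "noisy": noisy, "good_fraction": float(good.mean())}

rng = np.random.default_rng(0)
demo = rng.normal(0.0, 10.0, size=(4, 3, 500))  # 4 trials, 3 channels, 500 samples
demo[1, 0, :] = 0.0        # trial 1: one flat channel
demo[2, 2, 100] = 1000.0   # trial 2: one large artifact

report = trial_quality(demo)
print(report["flat"].tolist())    # [False, True, False, False]
print(report["noisy"].tolist())   # [False, False, True, False]
print(report["good_fraction"])    # 0.5
```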

## Step 5: Preprocessing Pipeline

In [None]:
# Step 5.1: Create standard preprocessing pipeline
from src.preprocessing import (
    PreprocessingPipeline,
    create_standard_pipeline,
    BandpassFilter,
    NotchFilter,
    Normalization
)

# Standard pipeline: Notch -> Bandpass -> Normalization
pipeline = create_standard_pipeline(
    sampling_rate=250,
    notch_freq=50.0,        # Power line frequency (50 Hz EU, 60 Hz US)
    low_freq=8.0,           # Lower cutoff (mu rhythm)
    high_freq=30.0,         # Upper cutoff (beta rhythm)
    normalize_method='zscore',
    normalize_axis='channel'
)

print("=== Preprocessing Pipeline ===")
print(f"Steps: {[s.name for s in pipeline.get_steps()]}")
for step in pipeline.get_steps():
    print(f"  - {step.name}: {step.get_parameters()}")

In [None]:
# Step 5.2: Apply preprocessing to trials
print(f"Before preprocessing: {trials.shape}, range: [{trials.min():.2f}, {trials.max():.2f}]")

# Initialize and process
pipeline.initialize()
trials_processed = pipeline.process(trials)

print(f"After preprocessing: {trials_processed.shape}, range: [{trials_processed.min():.2f}, {trials_processed.max():.2f}]")
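
For readers who want to see what a notch, bandpass, and z-score chain does without the framework, here is a standalone sketch built on `scipy.signal`. It is not the code behind `create_standard_pipeline`; the filter order (4), the notch quality factor (30), and zero-phase filtering are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch, sosfiltfilt

def preprocess(trials, fs=250.0, notch_hz=50.0, band=(8.0, 30.0)):
    """Notch -> band-pass -> per-channel z-score on (trials, channels, samples)."""
    b, a = iirnotch(notch_hz, Q=30.0, fs=fs)
    out = filtfilt(b, a, trials, axis=-1)           # zero-phase notch
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    out = sosfiltfilt(sos, out, axis=-1)            # zero-phase band-pass
    mean = out.mean(axis=-1, keepdims=True)
    std = out.std(axis=-1, keepdims=True)
    return (out - mean) / (std + 1e-12)             # z-score each channel of each trial

fs = 250.0
t = np.arange(1000) / fs
# One trial, one channel: a 10 Hz rhythm plus 50 Hz line noise
raw = (np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 50 * t))[None, None, :]
clean = preprocess(raw, fs=fs)

spectrum = np.abs(np.fft.rfft(clean[0, 0]))
freqs = np.fft.rfftfreq(1000, d=1 / fs)
amp_10 = spectrum[np.argmin(np.abs(freqs - 10))]
amp_50 = spectrum[np.argmin(np.abs(freqs - 50))]
print(clean.shape)             # (1, 1, 1000)
print(amp_50 < 0.05 * amp_10)  # True
```

Zero-phase filtering avoids shifting the signal in time, which matters when windows are aligned to event markers; a causal filter would be needed for online use.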

In [None]:
# Step 5.3: Visualize preprocessing effects
import matplotlib.pyplot as plt

# Select a sample trial and channel
trial_idx = 0
channel_idx = 7  # C3 - important for motor imagery

fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# Time axis
time = np.arange(trials.shape[2]) / 250.0

# Raw signal
ax1 = axes[0, 0]
ax1.plot(time, trials[trial_idx, channel_idx, :], 'b-', linewidth=0.5)
ax1.set_title(f'Raw Signal - Trial {trial_idx}, Channel C3')
ax1.set_xlabel('Time (s)')
ax1.set_ylabel('Amplitude (μV)')
ax1.grid(True, alpha=0.3)

# Processed signal
ax2 = axes[0, 1]
ax2.plot(time, trials_processed[trial_idx, channel_idx, :], 'g-', linewidth=0.5)
ax2.set_title('Preprocessed Signal (Notch + Bandpass + Normalization)')
ax2.set_xlabel('Time (s)')
ax2.set_ylabel('Normalized Amplitude')
ax2.grid(True, alpha=0.3)

# Power spectrum - Raw
ax3 = axes[1, 0]
from scipy.signal import welch
freqs, psd_raw = welch(trials[trial_idx, channel_idx, :], fs=250, nperseg=256)
ax3.semilogy(freqs, psd_raw, 'b-')
ax3.axvline(x=8, color='r', linestyle='--', alpha=0.5, label='8 Hz')
ax3.axvline(x=30, color='r', linestyle='--', alpha=0.5, label='30 Hz')
ax3.axvline(x=50, color='orange', linestyle='--', alpha=0.5, label='50 Hz (line noise)')
ax3.set_title('Power Spectrum - Raw')
ax3.set_xlabel('Frequency (Hz)')
ax3.set_ylabel('PSD')
ax3.set_xlim([0, 60])
ax3.legend()
ax3.grid(True, alpha=0.3)

# Power spectrum - Processed
ax4 = axes[1, 1]
freqs, psd_proc = welch(trials_processed[trial_idx, channel_idx, :], fs=250, nperseg=256)
ax4.semilogy(freqs, psd_proc, 'g-')
ax4.axvline(x=8, color='r', linestyle='--', alpha=0.5, label='8 Hz')
ax4.axvline(x=30, color='r', linestyle='--', alpha=0.5, label='30 Hz')
ax4.set_title('Power Spectrum - Preprocessed (8-30 Hz bandpass)')
ax4.set_xlabel('Frequency (Hz)')
ax4.set_ylabel('PSD')
ax4.set_xlim([0, 60])
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
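
A numeric check can complement the spectrum plots. The sketch below estimates the fraction of signal power that falls inside the 8-30 Hz band using a plain FFT periodogram; `band_power_fraction` is a hypothetical helper, and the synthetic sine waves stand in for real trials.

```python
import numpy as np

def band_power_fraction(x, fs, low_hz, high_hz):
    """Fraction of total spectral power inside [low_hz, high_hz] (periodogram estimate)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    return float(power[in_band].sum() / power.sum())

fs = 250.0
t = np.arange(1000) / fs
mu_only = np.sin(2 * np.pi * 10 * t)               # inside the 8-30 Hz band
with_line = mu_only + np.sin(2 * np.pi * 50 * t)   # adds equal-power 50 Hz noise

print(round(band_power_fraction(mu_only, fs, 8, 30), 3))    # 1.0
print(round(band_power_fraction(with_line, fs, 8, 30), 3))  # 0.5
```

Applied to a raw and a preprocessed trial, a value close to 1.0 after preprocessing would indicate that the bandpass filter removed most out-of-band power.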

## Step 6: PyTorch Dataset Integration

In [None]:
# Step 6.1: Create PyTorch Dataset
import torch
from torch.utils.data import DataLoader
from src.datasets import EEGDataset, train_val_test_split

# Create dataset from preprocessed trials
dataset = EEGDataset(
    trials=trials_processed,
    labels=labels,
    transform=None  # Already preprocessed
)

print(f"=== PyTorch Dataset ===")
print(f"Total samples: {len(dataset)}")
print(f"Sample shape: {dataset[0][0].shape}")
print(f"Label type: {type(dataset[0][1])}")

In [None]:
# Step 6.2: Split into train/validation/test sets
train_data, val_data, test_data = train_val_test_split(
    dataset,
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15,
    random_seed=42,
    stratify=True
)

print(f"\n=== Data Split ===")
print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print(f"Test samples: {len(test_data)}")
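
Stratified splitting keeps the class proportions the same in every subset. How `train_val_test_split` implements this is not shown, so the following is only a sketch of one common approach: shuffle the indices of each class separately and cut them by ratio. `stratified_split_indices` is a hypothetical helper.

```python
import numpy as np

def stratified_split_indices(labels, train_ratio=0.7, val_ratio=0.15, seed=42):
    """Split indices per class so each subset keeps the class proportions.

    Hypothetical helper; whatever remains after train and val goes to test.
    """
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(train_ratio * len(idx))
        n_val = int(val_ratio * len(idx))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return (np.array(train, dtype=int),
            np.array(val, dtype=int),
            np.array(test, dtype=int))

demo_labels = np.repeat(np.arange(4), 72)   # 288 trials, 72 per class
train_idx, val_idx, test_idx = stratified_split_indices(demo_labels)

print(len(train_idx), len(val_idx), len(test_idx))   # 200 40 48
print(np.bincount(demo_labels[train_idx]).tolist())  # [50, 50, 50, 50]
```

Because the per-class counts are rounded down, the test set absorbs the remainder and ends up slightly larger than 15% in this example.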

In [None]:
# Step 6.3: Create DataLoaders
batch_size = 32

train_loader = DataLoader(
    train_data,
    batch_size=batch_size,
    shuffle=True,
    num_workers=0
)

val_loader = DataLoader(
    val_data,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0
)

test_loader = DataLoader(
    test_data,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0
)

print(f"\n=== DataLoaders ===")
print(f"Train batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")
print(f"Test batches: {len(test_loader)}")

# Verify batch
sample_batch, sample_labels = next(iter(train_loader))
print(f"\nSample batch shape: {sample_batch.shape}")
print(f"Sample labels shape: {sample_labels.shape}")

## Step 7: Multi-Subject Loading

In [None]:
# Step 7.1: Load multiple subjects
def load_multiple_subjects(
    data_dir: str,
    subject_ids: List[str],
    session_type: str = 'T',
    include_eog: bool = False
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Load and combine data from multiple subjects.
    
    Args:
        data_dir: Path to dataset directory
        subject_ids: List of subject IDs (e.g., ['A01', 'A02', 'A03'])
        session_type: 'T' for training, 'E' for evaluation
        include_eog: Whether to include EOG channels
        
    Returns:
        Tuple of (trials, labels, subject_ids) arrays
    """
    loader = BCICIV2aLoader(include_eog=include_eog)
    
    all_trials = []
    all_labels = []
    all_subject_ids = []
    
    for subject_id in subject_ids:
        file_path = os.path.join(data_dir, f"{subject_id}{session_type}.mat")
        
        if not os.path.exists(file_path):
            print(f"⚠️  File not found: {file_path}")
            continue
            
        eeg_data = loader.load(file_path)
        trials, labels = loader.extract_trials(eeg_data)
        
        all_trials.append(trials)
        all_labels.append(labels)
        all_subject_ids.extend([subject_id] * len(labels))
    
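    # Fail early with a clear message if no subject file could be loaded
    if not all_trials:
        raise FileNotFoundError(f"No '{session_type}' session files found in {data_dir}")

    # Concatenate
    trials = np.concatenate(all_trials, axis=0)
    labels = np.concatenate(all_labels, axis=0)
    subject_ids_array = np.array(all_subject_ids)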
    
    return trials, labels, subject_ids_array

print("✅ load_multiple_subjects function defined")

In [None]:
# Step 7.2: Load all subjects
all_subject_ids = ['A01', 'A02', 'A03', 'A04', 'A05', 'A06', 'A07', 'A08', 'A09']

# Load all subjects (may take a few minutes)
all_trials, all_labels, subject_array = load_multiple_subjects(
    DATA_DIR,
    all_subject_ids,
    session_type='T'
)

print(f"\n=== All Subjects Combined ===")
print(f"Total trials: {all_trials.shape}")
print(f"Total labels: {all_labels.shape}")
print(f"Unique subjects: {np.unique(subject_array)}")

In [None]:
# Step 7.3: Preprocess all subjects
print("Preprocessing all trials...")
all_trials_processed = pipeline.process(all_trials)
print(f"✅ Preprocessed shape: {all_trials_processed.shape}")

## Step 8: Cross-Validation Setup

In [None]:
# Step 8.1: K-Fold Cross-Validation
from src.datasets import create_cv_folds

# Create dataset from all subjects
full_dataset = EEGDataset(
    trials=all_trials_processed,
    labels=all_labels
)

# Create 5-fold CV splits
cv_folds = create_cv_folds(
    full_dataset,
    n_folds=5,
    random_seed=42,
    stratify=True
)

print(f"=== 5-Fold Cross-Validation ===")
# Fold-specific names keep the Step 6.2 splits (train_data, val_data) from being overwritten
for i, (fold_train, fold_val) in enumerate(cv_folds):
    print(f"Fold {i+1}: Train={len(fold_train)}, Val={len(fold_val)}")
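
The behavior of `create_cv_folds` is not shown in this notebook. As an illustration of stratified K-fold splitting in general, the sketch below deals the shuffled indices of each class round-robin across folds, so every fold receives a near-equal share of each class. `stratified_kfold_indices` is a hypothetical helper.

```python
import numpy as np

def stratified_kfold_indices(labels, n_folds=5, seed=42):
    """Yield (train_idx, val_idx) pairs with classes spread evenly over folds.

    Hypothetical helper: shuffled indices of each class are dealt round-robin.
    """
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(len(idx)) % n_folds
    for fold in range(n_folds):
        yield np.flatnonzero(fold_of != fold), np.flatnonzero(fold_of == fold)

demo_labels = np.repeat(np.arange(4), 72)   # 288 trials, 72 per class
folds = list(stratified_kfold_indices(demo_labels))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"Fold {i + 1}: Train={len(train_idx)}, Val={len(val_idx)}")
```

Note that K-fold over pooled subjects mixes trials from the same subject into training and validation folds, so it estimates within-subject performance rather than generalization to unseen subjects.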

In [None]:
# Step 8.2: Leave-One-Subject-Out (LOSO) Cross-Validation
def create_loso_splits(
    trials: np.ndarray,
    labels: np.ndarray,
    subject_ids: np.ndarray
) -> List[Tuple[EEGDataset, EEGDataset, str]]:
    """
    Create Leave-One-Subject-Out cross-validation splits.
    
    Args:
        trials: All trial data
        labels: All labels
        subject_ids: Subject ID for each trial
        
    Returns:
        List of (train_dataset, test_dataset, test_subject_id) tuples
    """
    unique_subjects = np.unique(subject_ids)
    splits = []
    
    for test_subject in unique_subjects:
        # Test: single subject
        test_mask = subject_ids == test_subject
        # Train: all other subjects
        train_mask = ~test_mask
        
        train_dataset = EEGDataset(
            trials=trials[train_mask],
            labels=labels[train_mask]
        )
        test_dataset = EEGDataset(
            trials=trials[test_mask],
            labels=labels[test_mask]
        )
        
        splits.append((train_dataset, test_dataset, test_subject))
    
    return splits

# Create LOSO splits
loso_splits = create_loso_splits(all_trials_processed, all_labels, subject_array)

print(f"=== Leave-One-Subject-Out CV ===")
# Fold-specific names keep the Step 6.2 splits (train_data, test_data) from being overwritten
for loso_train, loso_test, subject in loso_splits:
    print(f"Test Subject {subject}: Train={len(loso_train)}, Test={len(loso_test)}")

## Step 9: Simple Training Example

In [None]:
# Step 9.1: Define a simple CNN classifier
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleEEGNet(nn.Module):
    """
    Simple CNN for EEG classification.
    Input: (batch, channels, samples) = (batch, 22, 1000)
    Output: (batch, n_classes) = (batch, 4)
    """
    
    def __init__(self, n_channels=22, n_samples=1000, n_classes=4):
        super().__init__()
        
        # Temporal convolution
        self.conv1 = nn.Conv1d(n_channels, 32, kernel_size=25, padding=12)
        self.bn1 = nn.BatchNorm1d(32)
        self.pool1 = nn.MaxPool1d(4)
        
        # Second conv layer
        self.conv2 = nn.Conv1d(32, 64, kernel_size=10, padding=5)
        self.bn2 = nn.BatchNorm1d(64)
        self.pool2 = nn.MaxPool1d(4)
        
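        # Infer the flattened size with a dummy forward pass through the conv stack.
        # conv2 (kernel_size=10, padding=5) adds one sample, so n_samples // 16
        # is only correct for some input lengths.
        with torch.no_grad():
            dummy = torch.zeros(1, n_channels, n_samples)
            dummy = self.pool2(self.conv2(self.pool1(self.conv1(dummy))))
        self._flat_size = dummy.numel()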
        
        # Fully connected layers
        self.fc1 = nn.Linear(self._flat_size, 128)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(128, n_classes)
        
    def forward(self, x):
        # x: (batch, channels, samples)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.pool1(x)
        
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool2(x)
        
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        
        return x

# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleEEGNet(n_channels=22, n_samples=1000, n_classes=4).to(device)

print(f"=== Model ===")
print(f"Device: {device}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(model)
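
The input size of `fc1` follows from the standard output-length formulas for 1-D convolution and pooling. The sketch below traces the sequence length through the layers of `SimpleEEGNet` in plain Python (no PyTorch needed) and compares the result with the shortcut `64 * (n_samples // 16)`. The two agree for 1000 samples but not for every input length, because the second convolution (kernel 10, padding 5) lengthens the sequence by one sample.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1):
    """Output length of a 1-D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def maxpool1d_out_len(length, kernel_size):
    """Output length of max pooling with stride equal to kernel size, no padding."""
    return (length - kernel_size) // kernel_size + 1

def flat_features(n_samples, out_channels=64):
    """Number of features entering fc1 for a given input length."""
    length = conv1d_out_len(n_samples, kernel_size=25, padding=12)  # conv1: unchanged
    length = maxpool1d_out_len(length, 4)                           # pool1
    length = conv1d_out_len(length, kernel_size=10, padding=5)      # conv2: +1 sample
    length = maxpool1d_out_len(length, 4)                           # pool2
    return out_channels * length

print(flat_features(1000), 64 * (1000 // 16))  # 3968 3968
print(flat_features(1004), 64 * (1004 // 16))  # 4032 3968
```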

In [None]:
# Step 9.2: Training loop
from torch.optim import Adam
from sklearn.metrics import accuracy_score, classification_report

# Training configuration
n_epochs = 20
learning_rate = 0.001

criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=learning_rate)

# Training history
history = {
    'train_loss': [],
    'train_acc': [],
    'val_loss': [],
    'val_acc': []
}

print(f"=== Training ===")
for epoch in range(n_epochs):
    # Training phase
    model.train()
    train_loss = 0.0
    train_preds = []
    train_labels_list = []
    
    for batch_data, batch_labels in train_loader:
        batch_data = batch_data.float().to(device)
        batch_labels = batch_labels.long().to(device)
        
        optimizer.zero_grad()
        outputs = model(batch_data)
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
        _, preds = torch.max(outputs, 1)
        train_preds.extend(preds.cpu().numpy())
        train_labels_list.extend(batch_labels.cpu().numpy())
    
    train_loss /= len(train_loader)
    train_acc = accuracy_score(train_labels_list, train_preds)
    
    # Validation phase
    model.eval()
    val_loss = 0.0
    val_preds = []
    val_labels_list = []
    
    with torch.no_grad():
        for batch_data, batch_labels in val_loader:
            batch_data = batch_data.float().to(device)
            batch_labels = batch_labels.long().to(device)
            
            outputs = model(batch_data)
            loss = criterion(outputs, batch_labels)
            
            val_loss += loss.item()
            _, preds = torch.max(outputs, 1)
            val_preds.extend(preds.cpu().numpy())
            val_labels_list.extend(batch_labels.cpu().numpy())
    
    val_loss /= len(val_loader)
    val_acc = accuracy_score(val_labels_list, val_preds)
    
    # Record history
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['val_loss'].append(val_loss)
    history['val_acc'].append(val_acc)
    
    if (epoch + 1) % 5 == 0 or epoch == 0:
        print(f"Epoch {epoch+1:2d}/{n_epochs}: "
              f"Train Loss={train_loss:.4f}, Acc={train_acc:.4f} | "
              f"Val Loss={val_loss:.4f}, Acc={val_acc:.4f}")

print(f"\n✅ Training complete!")
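
When reading the training log, it helps to know the chance-level baseline. For four balanced classes, a model that predicts uniformly has an accuracy of 0.25 and a cross-entropy loss of ln(4). The sketch below computes both; `cross_entropy` here is a small illustrative function, not the PyTorch criterion.

```python
import math

def cross_entropy(probabilities, true_class):
    """Cross-entropy loss for one sample given predicted class probabilities."""
    return -math.log(probabilities[true_class])

n_classes = 4
uniform = [1.0 / n_classes] * n_classes   # a model with no class preference

chance_accuracy = 1.0 / n_classes
chance_loss = cross_entropy(uniform, true_class=0)

print(chance_accuracy)         # 0.25
print(round(chance_loss, 4))   # 1.3863
```

A validation loss that stays near 1.386 and an accuracy near 0.25 suggest the model has not learned anything beyond chance.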

In [None]:
# Step 9.3: Evaluate on test set
model.eval()
test_preds = []
test_labels_list = []

with torch.no_grad():
    for batch_data, batch_labels in test_loader:
        batch_data = batch_data.float().to(device)
        outputs = model(batch_data)
        _, preds = torch.max(outputs, 1)
        test_preds.extend(preds.cpu().numpy())
        test_labels_list.extend(batch_labels.numpy())

test_acc = accuracy_score(test_labels_list, test_preds)

print(f"=== Test Results ===")
print(f"Test Accuracy: {test_acc:.4f}")
print(f"\nClassification Report:")
print(classification_report(
    test_labels_list, 
    test_preds,
    target_names=['Left Hand', 'Right Hand', 'Feet', 'Tongue']
))

In [None]:
# Step 9.4: Plot training history
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss plot
ax1 = axes[0]
ax1.plot(history['train_loss'], label='Train Loss', marker='o')
ax1.plot(history['val_loss'], label='Val Loss', marker='s')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training & Validation Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Accuracy plot
ax2 = axes[1]
ax2.plot(history['train_acc'], label='Train Accuracy', marker='o')
ax2.plot(history['val_acc'], label='Val Accuracy', marker='s')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training & Validation Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Step 9.5: Confusion Matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(test_labels_list, test_preds)
class_names = ['Left Hand', 'Right Hand', 'Feet', 'Tongue']

plt.figure(figsize=(8, 6))
sns.heatmap(
    cm, 
    annot=True, 
    fmt='d', 
    cmap='Blues',
    xticklabels=class_names,
    yticklabels=class_names
)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title(f'Confusion Matrix (Test Accuracy: {test_acc:.2%})')
plt.tight_layout()
plt.show()
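
The confusion matrix also yields per-class recall and overall accuracy directly. As a small worked example with made-up predictions, the sketch below builds the matrix with NumPy (rows are true classes, columns are predicted classes, matching the scikit-learn convention) and derives both metrics from it. `confusion_counts` is a hypothetical helper.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs; rows are true classes, columns predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Made-up labels for illustration: two trials per class
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 3])

cm = confusion_counts(y_true, y_pred, n_classes=4)
recall = cm.diagonal() / cm.sum(axis=1)   # correct predictions per true class
accuracy = cm.trace() / cm.sum()          # correct predictions overall

print(cm.tolist())      # [[1, 1, 0, 0], [0, 2, 0, 0], [1, 0, 1, 0], [0, 0, 0, 2]]
print(recall.tolist())  # [0.5, 1.0, 0.5, 1.0]
print(accuracy)         # 0.75
```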

## Phase 2 Complete!

### Summary

Phase 2 (Data Loading & Processing) is now complete. This notebook covered:

1. **Data Loader**: `BCICIV2aLoader` for BCI Competition IV-2a MAT files
2. **Trial Extraction**: Fixed-length trial segmentation from continuous data
3. **Validation**: Data structure and signal quality validation
4. **Preprocessing**: Standard pipeline (Notch + Bandpass + Normalization)
5. **PyTorch Integration**: `EEGDataset` with train/val/test splitting
6. **Multi-Subject Loading**: Load and combine multiple subjects
7. **Cross-Validation**: K-Fold and Leave-One-Subject-Out (LOSO)
8. **Training Example**: Simple CNN classifier with training loop

### Next Steps: Phase 3 - Feature Extraction & Classification

- Common Spatial Pattern (CSP) feature extraction
- Filter Bank CSP (FBCSP)
- Deep learning classifiers (EEGNet, DeepConvNet)
- Advanced hyperparameter tuning

---

### Quick Reference

```python
# Load data
loader = BCICIV2aLoader()
eeg_data = loader.load('A01T.mat')
trials, labels = loader.extract_trials(eeg_data)

# Preprocess
pipeline = create_standard_pipeline(sampling_rate=250)
pipeline.initialize()
trials_processed = pipeline.process(trials)

# Create PyTorch dataset
dataset = EEGDataset(trials_processed, labels)
train_data, val_data, test_data = train_val_test_split(dataset)

# DataLoaders
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
```