## Supplementary Material
Deep Learning in EEG-Based BCIs: A Comprehensive Review of Transformer Models, Advantages, Challenges, and Applications


### EEGTransformer Class

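The `EEGTransformer` class implements a transformer-based architecture for classifying segmented electroencephalogram (EEG) data. It consists of a single encoder block (multi-head self-attention followed by a feed-forward network) and a linear classifier.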

#### Parameters:
- `num_channels` (int): Number of channels in the EEG dataset. It also serves as the embedding dimension of the attention layer, so it must be divisible by `num_heads`.
- `num_timepoints` (int): Number of time points per EEG segment, i.e. the sequence length.
- `output_dim` (int): Output dimensionality of the classifier layer, i.e. the number of classes.
- `hidden_dim` (int): Hidden layer dimensionality. Accepted by the constructor but not used in the implementation below.
- `num_heads` (int): Number of attention heads in the multi-head self-attention mechanism.
- `key_query_dim` (int): Dimensionality of the key/query pairs in the self-attention mechanism. Accepted by the constructor but not used in the implementation below, where `nn.MultiheadAttention` sets the per-head dimension to `num_channels // num_heads`.
- `hidden_ffn_dim` (int): Hidden layer dimensionality for the feed-forward network. Accepted by the constructor but not used in the implementation below.
- `intermediate_dim` (int): Dimensionality of the intermediate layer in the feed-forward network.
- `ffn_output_dim` (int): Output size of the feed-forward network. It must equal `num_channels`, because the feed-forward output is added to its input through a residual connection.
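
The relationships between these parameters can be checked before a model is built. The sketch below uses a hypothetical helper, `check_config`, which is not part of the original code. It assumes, as in the implementation further down, that `num_channels` is used as the attention embedding dimension and that the feed-forward output is added back to its input.

```python
def check_config(num_channels, num_heads, ffn_output_dim):
    """Hypothetical helper: list the problems found in a configuration."""
    problems = []
    # nn.MultiheadAttention requires embed_dim to be divisible by num_heads
    if num_channels % num_heads != 0:
        problems.append("num_channels must be divisible by num_heads")
    # The residual connection around the FFN needs matching dimensions
    if ffn_output_dim != num_channels:
        problems.append("ffn_output_dim must equal num_channels")
    return problems


ok = check_config(num_channels=32, num_heads=8, ffn_output_dim=32)
bad = check_config(num_channels=30, num_heads=8, ffn_output_dim=64)
print(ok)   # no problems for the values used in the usage sample
print(bad)  # both constraints are violated
```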

#### Attributes:
- `positional_encoding` (torch.Tensor): A fixed sinusoidal tensor of shape `(num_channels, num_timepoints)` that encodes the position of each time point in the sequence.
- `multihead_attn` (nn.MultiheadAttention): The multi-head self-attention layer, with `num_channels` as its embedding dimension.
- `ffn` (nn.Sequential): A feed-forward network composed of a linear transformation, a ReLU activation, and a second linear transformation.
- `norm1` and `norm2` (nn.LayerNorm): Layer normalization applied after the attention stage and after the feed-forward stage, respectively.
- `classifier` (nn.Linear): A final linear layer that maps the flattened features to `output_dim` class scores.
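
To make the attention attribute more concrete, the following sketch inspects a standalone `nn.MultiheadAttention` layer. The values `num_channels=32` and `num_heads=8` are taken from the usage sample; the batch size of 4 is an arbitrary assumption for illustration.

```python
import torch
import torch.nn as nn

num_channels, num_heads = 32, 8  # example values from the usage sample

mha = nn.MultiheadAttention(embed_dim=num_channels, num_heads=num_heads)
print(mha.head_dim)              # per-head dimension: num_channels // num_heads
print(mha.in_proj_weight.shape)  # stacked query/key/value projection weights

# The layer expects (seq_len, batch_size, embed_dim) by default
x = torch.randn(200, 4, num_channels)
attn_output, attn_weights = mha(x, x, x)
print(attn_output.shape)   # same shape as the input
print(attn_weights.shape)  # (batch_size, seq_len, seq_len), averaged over heads
```

With these values each head works on only 4 dimensions, which is worth keeping in mind when choosing `num_heads` for datasets with few channels.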

#### Methods:
- `forward(X)`: Defines the forward pass of the model.
  - `X` (torch.Tensor): The input EEG tensor of shape `(batch_size, num_channels, num_timepoints)`.

  - **Steps**:
    1. Standardize each channel of the input over the time axis (zero mean, unit variance).
    2. Add the positional encoding.
    3. Apply multi-head self-attention across time points.
    4. Apply layer normalization to the attention output and reorder it to `(batch_size, num_timepoints, num_channels)`.
    5. Pass the result through the feed-forward network, add the residual connection, and apply the second layer normalization.
    6. Flatten the resulting tensor and pass it through the classifier layer.
    7. Return the class scores of shape `(batch_size, output_dim)`.
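
The steps above can be traced with standalone layers to see how the tensor shape changes. This is a minimal sketch, not the class itself: the batch size, the FFN width of 64, and the zero-valued placeholder for the positional encoding are assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

B, C, T = 4, 32, 200  # batch size, channels, time points (example values)
X = torch.randn(B, C, T)

# Step 1: standardize each channel over time
X_hat = (X - X.mean(dim=2, keepdim=True)) / (X.std(dim=2, keepdim=True) + 1e-5)

# Step 2: add positional encoding (zeros as a placeholder of the right shape)
X_tilde = X_hat + torch.zeros(C, T)

# Step 3: self-attention expects (seq_len, batch_size, embed_dim)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=8)
seq = X_tilde.permute(2, 0, 1)                        # (T, B, C)
attn_out, _ = attn(seq, seq, seq)                     # (T, B, C)

# Step 4: layer norm over channels, then batch-first layout
X_ring = nn.LayerNorm(C)(attn_out).permute(1, 0, 2)   # (B, T, C)

# Step 5: feed-forward network with residual connection and second layer norm
ffn = nn.Sequential(nn.Linear(C, 64), nn.ReLU(), nn.Linear(64, C))
O = nn.LayerNorm(C)(ffn(X_ring) + X_ring)             # (B, T, C)

# Steps 6-7: flatten and classify
logits = nn.Linear(C * T, 2)(O.flatten(start_dim=1))  # (B, 2)
print(X_ring.shape, O.shape, logits.shape)
```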
  
### Notes:

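- The model applies layer normalization after the multi-head self-attention stage and after the feed-forward network stage. A residual connection is used around the feed-forward network only.
- Positional encoding gives the model information about sequence position. Positional encodings can in general be relative or absolute; this implementation uses a fixed (absolute) sinusoidal encoding.
- The model output is flattened to `num_timepoints * num_channels` features, which the classifier layer maps to scores for `output_dim` classes.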
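
One consequence of the flattening step is that the classifier's input size is fixed by `num_channels * num_timepoints`, so a trained model expects segments of the same length it was built for. The short sketch below illustrates this with the example values from the usage sample; the batch size of 4 is an arbitrary assumption.

```python
import torch
import torch.nn as nn

num_channels, num_timepoints, output_dim = 32, 200, 2  # example values

classifier = nn.Linear(num_channels * num_timepoints, output_dim)
n_params = sum(p.numel() for p in classifier.parameters())
print(n_params)  # 6400 * 2 weights + 2 biases = 12802

# Encoder output of shape (batch_size, num_timepoints, num_channels)
O = torch.randn(4, num_timepoints, num_channels)
logits = classifier(O.flatten(start_dim=1))
print(logits.shape)
```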

### Usage:

To use the `EEGTransformer` model, instantiate the class with the desired parameters. Then, as with any other PyTorch model, pass the input data to the model and use the returned output for training or inference.

```python
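# Sample usage (assumes `torch` is imported and `EEGTransformer` is defined as below)
model = EEGTransformer(num_channels=32, num_timepoints=200, output_dim=2,
                       hidden_dim=512, num_heads=8, key_query_dim=512,
                       hidden_ffn_dim=512, intermediate_dim=2048,
                       ffn_output_dim=32)

input_data = torch.randn(64, 32, 200)  # (batch_size, num_channels, num_timepoints)
output = model(input_data)             # class scores of shape (64, 2)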
```

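For training, pair the model with a suitable loss function and optimizer. The model returns unnormalized class scores (logits), so `nn.CrossEntropyLoss`, which is used in the training example below, is a natural choice for classification. Depending on the EEG dataset or the application requirements, the architecture can be refined further.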
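
The following sketch shows what "compatible" means for the loss function in a classification setting. It uses random logits as a stand-in for `model(input_data)`, which is an assumption made so the example runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for model(input_data): raw class scores of shape (batch_size, output_dim)
logits = torch.randn(64, 2, requires_grad=True)
# Targets are integer class indices of shape (batch_size,)
labels = torch.randint(0, 2, (64,))

# CrossEntropyLoss applies log-softmax internally, so it takes raw logits
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)
loss.backward()  # gradients flow back to the logits (and, in practice, the model)

# At inference time, softmax turns logits into class probabilities
probs = torch.softmax(logits.detach(), dim=1)
preds = probs.argmax(dim=1)
print(loss.item(), preds[:5])
```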

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

class EEGTransformer(nn.Module):
    def __init__(self, num_channels, num_timepoints, output_dim,
                 hidden_dim, num_heads, key_query_dim,
                 hidden_ffn_dim, intermediate_dim, ffn_output_dim):
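        super().__init__()
        # Note: hidden_dim, key_query_dim and hidden_ffn_dim are accepted but not
        # used below; the attention head size is num_channels // num_heads.

        # The residual connection around the FFN needs matching dimensions
        if ffn_output_dim != num_channels:
            raise ValueError("ffn_output_dim must equal num_channels")

        # Sinusoidal positional encoding of shape (num_channels, num_timepoints):
        # sine for even channel indices, cosine for odd ones
        j = torch.arange(num_channels, dtype=torch.float32).unsqueeze(1)    # (C, 1)
        k = torch.arange(num_timepoints, dtype=torch.float32).unsqueeze(0)  # (1, T)
        j_even = j - (j % 2)  # j for even rows, j - 1 for odd rows
        angle = k / (10000 ** (j_even / num_channels))
        positional_encoding = torch.where(j % 2 == 0, torch.sin(angle), torch.cos(angle))
        # A buffer follows the model across devices but is not a trainable parameter
        self.register_buffer("positional_encoding", positional_encoding)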

        # Multi-Head Self Attention
        self.multihead_attn = nn.MultiheadAttention(embed_dim=num_channels,
                                                    num_heads=num_heads)

        # Feed-Forward Network
        self.ffn = nn.Sequential(
            nn.Linear(num_channels, intermediate_dim),
            nn.ReLU(),
            nn.Linear(intermediate_dim, ffn_output_dim)
        )

        # Layer Normalization
        self.norm1 = nn.LayerNorm(num_channels)
        self.norm2 = nn.LayerNorm(num_channels)

        # Classifier
        self.classifier = nn.Linear(num_channels * num_timepoints, output_dim)

    def forward(self, X):
        # Input standardization: zero mean, unit variance per channel over time
        mean = X.mean(dim=2, keepdim=True)
        std = X.std(dim=2, keepdim=True)
        X_hat = (X - mean) / (std + 1e-5)  # epsilon to avoid division by zero

        # Add Positional Encoding
        X_tilde = X_hat + self.positional_encoding.to(X.device)

        # Reshape for multi-head self attention: (seq_len, batch_size, embed_dim)
        X_tilde = X_tilde.permute(2, 0, 1)

        # Multi-Head Self Attention
        attn_output, _ = self.multihead_attn(X_tilde, X_tilde, X_tilde)

        # Apply layer norm over the channel dimension, then reorder to
        # (batch_size, seq_len, embed_dim)
        X_ring = self.norm1(attn_output).permute(1, 0, 2)

        # Position-wise feed-forward network with residual connection
        ff_output = self.ffn(X_ring)
        O = self.norm2(ff_output + X_ring)

        # Flatten to (batch_size, seq_len * embed_dim) and classify
        O_flat = O.flatten(start_dim=1)
        output = self.classifier(O_flat)

        return output

In [None]:
# Synthetic sample data (replace with real, segmented EEG data)
num_channels = 32
num_timepoints = 200
batch_size = 64

X = torch.randn(batch_size, num_channels, num_timepoints)
y = torch.randint(0, 2, (batch_size,))  # labels in {0, 1}; L = 2 classes (binary classification)

# Model, Loss and Optimizer
model = EEGTransformer(num_channels, num_timepoints, output_dim=2,
                       hidden_dim=512, num_heads=8, key_query_dim=512,
                       hidden_ffn_dim=512, intermediate_dim=2048,
                       ffn_output_dim=num_channels)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

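# Training loop (each epoch performs one update on the full batch X)
epochs = 10
model.train()  # make sure the model is in training mode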

for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
    print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")


# Once the model is trained, it can be evaluated on unseen EEG test examples.
# Model selection techniques (e.g. cross-validation) can also be built around the training loop.
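
A minimal evaluation sketch is given below, under stated assumptions: the test data are random tensors, and a small placeholder classifier stands in for a trained `EEGTransformer` so that the example runs on its own. With real data, `model` would be the trained network and `X_test`, `y_test` a held-out set of segmented EEG examples.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_channels, num_timepoints = 32, 200

# Placeholder classifier (assumption); swap in a trained EEGTransformer
model = nn.Sequential(nn.Flatten(), nn.Linear(num_channels * num_timepoints, 2))

# Random stand-ins for a held-out test set
X_test = torch.randn(16, num_channels, num_timepoints)
y_test = torch.randint(0, 2, (16,))

model.eval()            # switch off training-only behavior such as dropout
with torch.no_grad():   # gradients are not needed for evaluation
    logits = model(X_test)
    preds = logits.argmax(dim=1)
    accuracy = (preds == y_test).float().mean().item()

print(f"Test accuracy: {accuracy:.2f}")
```

On random data such as this, the accuracy carries no meaning; the sketch only shows the mechanics of switching to evaluation mode and comparing predictions with labels.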
