# **Hands-On TMS-EEG Preprocessing**

### [Matteo De Matola](https://github.com/matteo-d-m) 

This notebook contains code and explanations for the hands-on TMS-EEG preprocessing activity of the [Brain Stimulation & Multimodal Electrophysiological Recording](https://unitn.coursecatalogue.cineca.it/insegnamenti/2025/50512_653501_96292/2011/50513/10168?annoOrdinamento=2011&coorte=2024) course, [Master's degree in Cognitive Science](https://corsi.unitn.it/en/cognitive-science), University of Trento.

We will use data from one condition and one subject of [Zazio et al. (2021)](https://www.sciencedirect.com/science/article/pii/S1388245721006714). The data are publicly available and freely reusable through [G-Node Gin](https://gin.g-node.org/) at [this](https://gin.g-node.org/AgneseZazio/ZazioMiniussiBortoletto2021) link under a [Creative Commons CC0 1.0 Universal license](https://creativecommons.org/publicdomain/zero/1.0/deed.en).

# **Introduction**

This activity will guide you through a typical TMS-EEG preprocessing pipeline. 

Given a signal generated by some system, _preprocessing_ is the act of separating the actual signal from the noise.

In the context of TMS-EEG:

- The _signal_ is the fraction of scalp voltage generated by the cerebral cortex in response to TMS pulses
- The _noise_ is the fraction of scalp voltage generated by any non-cerebral source, including (but not limited to) muscles, eyes, and electronic instrumentation (chiefly the TMS coil itself)

Preprocessing is usually a complex process that includes multiple sequential transformations of the TMS-EEG data.

Such a sequence of transformations is a _**preprocessing pipeline**_.

There is no standard preprocessing pipeline for TMS-EEG data (nor for EEG data in general). 

The problem is significant enough to write papers about it: see [Bertazzoli et al. (2021)](https://www.sciencedirect.com/science/article/pii/S1053811921005486), [Rogasch et al. (2022)](https://www.sciencedirect.com/science/article/pii/S0165027022000218?via%3Dihub), and [Brancaccio et al. (2024)](https://www.sciencedirect.com/science/article/pii/S1053811924003719).


Broadly speaking, one could classify preprocessing strategies into two categories: _conservationist approaches_ and _interventionist approaches_.

- _Conservationist approaches_ tend to conserve the signal as it is acquired, keeping transformations to a minimum at risk of not eliminating noise
    - Assumption: any transformation might delete noise, but it alters the signal in potentially undesirable ways
    - See [Delorme (2023)](https://www.nature.com/articles/s41598-023-27528-0)
- _Interventionist approaches_ tend to act heavily on the signal, under the assumption that experimental manipulations of the EEG signal will not be visible until artefacts have been removed

In practice, people tend to design pipelines that suit their experimental manipulations and the characteristics of their data. 

For example, if TMS is delivered on an area where there are large cranial muscles, data will probably be contaminated by large muscle artefacts. Therefore, it will be reasonable to design a pipeline that acts aggressively on muscle artefacts. 

In [None]:
from pathlib import Path                            # to build a bridge between Python and the filesystem 

import numpy as np                                  # to perform array operations
import matplotlib                                   # this and the following to draw plots
import matplotlib.pyplot as plt
matplotlib.use('Qt5Agg')                            # to make plots interactive

import mne                                          # to read and manipulate EEG data

from scripts import utils                           # custom functions to shorten the code in this notebook

# **1. Basic Preprocessing**

Basic preprocessing is the process of loading data, inspecting them to understand their shape and content, and applying some basic transformations that set the stage for subsequent, more complex operations.

In a typical TMS-EEG preprocessing pipeline, this would include the following steps:

1. Loading data

2. Inspecting data
    - Shape
        - Number of channels and timelength
    - Acquisition parameters
        - Sampling rate
    - Appearance
        - Are the raw data visibly dirty?
3. Adjusting data 
    - If present, drop non-EEG channels (EMG, EOG, ECG)
    - Set channel locations
    
4. Interpolating the pulse artifact

5. High-pass filtering

## **1.1 Loading Data**

This step amounts merely to reading the data files. 

EEG data files come in a variety of formats, depending on the recording system and the choices made by researchers. 

The four most common formats are (in no particular order):

1. BrainVision format: `.eeg` + `.vhdr` + `.vmrk` = one single recording
    - Typical of data acquired with Brain Products hardware, which usually ships with the BrainVision Recorder software

2. EEGLAB format: `.set` + `.fdt` = one single recording 
    - Typical of data that were previously analysed or otherwise treated with EEGLAB, the main Matlab-based EEG toolbox
    - Some recording systems (e.g., Bittium NeurOne) have a built-in option to save data in this format to facilitate EEGLAB users

3. Matlab format: `.mat`
    - Typical of recording systems that run Matlab-based software (e.g., g.tec systems)

4. European Data Format: `.edf` 

MNE-Python has functions to read data in all the formats above, except `.mat` which is probably the least common. 


In this activity we will use data acquired with Brain Products hardware and stored in BrainVision format.

As mentioned above, the BrainVision format distributes information across three different files: 

1. The `.eeg` file, which contains actual EEG time series 
2. The `.vhdr` file (also known as _header file_), which contains important metadata about _how_ the data were acquired &mdash; chiefly, the map between channel labels (e.g., `Fp1`) and their place in the recording hardware (e.g., _pin number one_)
3. The `.vmrk` file (also known as _marker file_), which contains important information about event markers &mdash; that is, _when_ an event like a TMS pulse happened and _how long_ it lasted

To read data in BrainVision format, the go-to function is `read_raw_brainvision()`, contained in module `io` (that is, input/output) of the `mne` package.

The code below performs the following operations:

- Define a `Path` object (basically, make the data directory accessible to Python)
- Identify the file of interest
- Read it using `read_raw_brainvision()`
- Assign the result to a Python variable, where EEG data will be represented as a `channels x time` matrix with accompanying metadata

In [None]:
data_dir = Path("data")
subject_and_condition = "S02C1_M1"

file_of_interest = list(data_dir.glob(pattern=f"{subject_and_condition}.vhdr"))
eeg_data = mne.io.read_raw_brainvision(vhdr_fname=file_of_interest[0])

Note that the file to read is not the EEG file itself (`.eeg`), but the header file (`.vhdr`). 

This is because the header contains metadata that are needed to correctly interpret the information from the EEG file, which would otherwise be a meaningless array of numbers. 
- The same goes for the EEGLAB format, where the file to read is the `.set` file, which contains the metadata, rather than the `.fdt` file, which contains the time series

After reading the header, `read_raw_brainvision()` proceeds to read the actual data from the corresponding EEG file. This implies that the two filenames **must** be identical (except for the extension), otherwise Python will throw an error.
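
Before calling the reader, it can be handy to check that the three files are actually sitting next to each other. The following is a minimal sketch that assumes, as stated above, that the three files share the same base name. The helper `missing_brainvision_files()` is hypothetical (it is not part of MNE), and the demonstration runs on empty placeholder files in a temporary directory rather than on the real dataset.

```python
from pathlib import Path
import tempfile

# The three extensions that make up one BrainVision recording
BRAINVISION_EXTENSIONS = (".vhdr", ".vmrk", ".eeg")


def missing_brainvision_files(vhdr_path):
    """Return the names of the sibling files that are missing next to a header file.

    Hypothetical helper: it assumes that all files share the header's base name.
    """
    vhdr_path = Path(vhdr_path)
    expected = [vhdr_path.with_suffix(ext) for ext in BRAINVISION_EXTENSIONS]
    return [path.name for path in expected if not path.exists()]


# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = Path(tmp_dir)
    (tmp_dir / "S02C1_M1.vhdr").touch()
    (tmp_dir / "S02C1_M1.eeg").touch()   # the marker file is deliberately left out
    missing = missing_brainvision_files(tmp_dir / "S02C1_M1.vhdr")

print(f"Missing files: {missing}")
```

An empty list means that the header, marker and data files are all in place.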

After reading the data, it is possible to start inspecting them by printing the name of the corresponding Python variable. 

This gives access to a first set of useful information concerning the acquisition set-up, the channels and the filters that were applied to the signal during the recording. 

In [None]:
eeg_data

As you can see, information is classified as `General`, `Acquisition`, `Channels`, and `Filters`. 

The `General` section yields no particular insights. 

Much more interesting is the `Acquisition` section, which describes the temporal characteristics of the data: the duration of the recording, the sampling rate, and the number of timepoints. 

We shall now pause for a minute and ponder the relationship between these three quantities 🤔

We can see that:

- The recording length is expressed in minutes (which are a multiple of seconds)
- The sampling rate is expressed in Hz (that is, samples per second)
- The number of timepoints is expressed in array units &mdash; in other words, it is: 
    - The number of points on the horizontal axis of the array that stores the EEG data
    - The number of columns in the EEG data matrix

<p align="center">
<img src="./files/eeg.png" width="500"/>
</p>


The three quantities (minutes, Hz, array units) have _time_ in common &mdash; therefore, it is possible to convert from one to another with simple operations. This is useful to acquire a deeper understanding of things and to recover one quantity from the other two (should that become necessary).

The following relationships are particularly useful when you have to [wrangle](https://en.wikipedia.org/wiki/Data_wrangling) with ill-defined data:

$$ \text{Number of Timepoints} = \text{Recording Length in Minutes} \cdot 60 \cdot \text{Sampling Rate} $$

$$ \text{Recording Length in Minutes} = \Bigg(\dfrac{\text{Number of Timepoints}}{\text{Sampling Rate}}\Bigg) \cdot \dfrac{1}{60}$$

where $60$ is the number of seconds in a minute. You can check for yourself that the following results are consistent with what you find in the dataset's info:

In [None]:
NUMBER_OF_TIMEPOINTS = len(eeg_data.times)
SAMPLING_RATE = eeg_data.info["sfreq"]
SECONDS_IN_A_MINUTE = 60

RECORDING_LENGTH_IN_MINUTES = NUMBER_OF_TIMEPOINTS / SAMPLING_RATE / SECONDS_IN_A_MINUTE
print(f"The recording length is: {RECORDING_LENGTH_IN_MINUTES}")

NUMBER_OF_TIMEPOINTS = RECORDING_LENGTH_IN_MINUTES*SAMPLING_RATE*SECONDS_IN_A_MINUTE
print(f"The number of timepoints is: {NUMBER_OF_TIMEPOINTS}")

The `Channels` section is also interesting, as it contains information about the number of channels and whether their position was [digitised](https://www.nature.com/articles/s41598-023-30223-9) during the recording. As you can see from the table above, no digitisation was performed. However, the list of channels is available and can be accessed as:

In [None]:
eeg_data.ch_names

## **1.2 Drop Unwanted Channels**

As you could see from the list above, the dataset contains EEG channels (`Fp1`,`Fpz`,`Fp2`... etc.) as well as non-EEG channels (`EOG` and `FDI`).

Assuming that we are not interested in analysing non-EEG channels, we can drop them. To this end, we need to identify them. This can either be done manually (which would be boring and time-consuming) or automatically. In the latter case, one can exploit the fact that the last character in EEG channel names tends to be an integer number (`Fp1`, `Fp2`, `AF7`...) or a lowercase _z_ (`Fpz`, `Cz`, `Oz`...). The following code does just that: for each channel name, it checks if its last character is an integer. If it is not (that is, if Python throws a `ValueError`), it checks if such character is a _z_. If it is not, the channel must be non-EEG and is thus appended to a list of channels that need be dropped. 

You can check for yourself that the resulting list contains only channels with a non-EEG name.

In [None]:
channels_to_drop = []

for channel_name in eeg_data.ch_names:
    try:
        int(channel_name[-1])
    except ValueError:
        if channel_name[-1] != "z": 
            channels_to_drop.append(channel_name)
print(f"Channels to drop: {channels_to_drop}")

After identifying the channels to drop, one can actually drop them as follows:

In [None]:
eeg_data = eeg_data.drop_channels(ch_names=channels_to_drop)

Re-printing the dataset info shows that the data now have 70 channels, which is two fewer than before (as expected):

In [None]:
eeg_data

## **1.3 Set Channel Locations**

As revealed by its info, the dataset contains no _Head & sensor digitization_ &mdash; in other words, we have no way to know the shape of the head nor the position of channels relative to it. 

However, this is fundamental information for some analyses (_source reconstruction_) and visualizations (_topoplots_).

While we will not perform source reconstruction, we will visualize topoplots &mdash; that is, colour-coded scalp voltage maps. 

Therefore, we need to associate each channel with coordinates that locate it on the head (the resulting set of coordinates is called a _montage_). To this end, MNE has a list of built-in coordinate sets that can be printed as follows:

In [None]:
mne.channels.get_builtin_montages()

We know from the publication that the authors used a 74-channel Brain Products cap, which should correspond to MNE's `easycap-M1` montage. 

Therefore, we can create a corresponding _montage_ object and print the resulting $x,y,z$ coordinates for each channel and the three fiducial points: nasion, left pre-auricular (LPA) and right pre-auricular (RPA): 

In [None]:
easycap_m1_montage = mne.channels.make_standard_montage(kind="easycap-M1")
easycap_m1_montage.get_positions()

Having 3D coordinates, we can project them onto a plane and build the following plot:

In [None]:
easycap_m1_montage.plot();

After having visualized the montage and ascertained that it makes sense, we can apply it to our data as follows: 

In [None]:
eeg_data.set_montage(montage=easycap_m1_montage,
                     on_missing="ignore")

We are finally ready to visualize the raw data:

In [None]:
eeg_data.plot();

The next step would be to interpolate the TMS pulse artefact. To this end, we need to locate TMS pulses on the EEG trace. Therefore, we need information about _when_ TMS pulses occurred during the experiment.

This information is given by **event markers**: timestamps that locate events of experimental interest on the EEG trace. These timestamps have three defining characteristics:

1. A name: for example, a numeric code like `54` or a short word like `tms`
2. A duration 
3. A time of occurrence 

In the BrainVision data format, marker information is stored in the `.vmrk` file and is enriched by some supplementary information. In fact, each marker is defined by:

1. An ordinal number
2. A type
3. A name
4. A time of occurrence (in units of data points)
5. A duration (in units of data points)
6. Information about which channels are involved

For example, the following row of the marker file tells us that marker number 200 (`Mk200`) is of type `Stimulus`, its name is `S 54`, it occurred at time point `3183619`, it lasted one instant (`1`), and it affected all channels (in the BrainVision convention, `0` means "all channels"):

`Mk200=Stimulus,S 54,3183619,1,0`
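
To make the structure of such a row concrete, here is a minimal sketch that splits it into the six fields listed above. The `Marker` container and the `parse_marker_line()` helper are hypothetical names introduced for illustration (MNE does this parsing internally in its own way). The sampling rate used to convert the time point into seconds is an assumption, not a value taken from the dataset.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    number: int      # ordinal number of the marker
    kind: str        # marker type, e.g. "Stimulus"
    name: str        # marker name, e.g. "S 54"
    position: int    # time of occurrence, in data points
    duration: int    # duration, in data points
    channel: int     # 0 means "all channels"


def parse_marker_line(line):
    """Split one 'MkN=type,name,position,duration,channel' row into its fields."""
    label, fields = line.strip().split("=", maxsplit=1)
    kind, name, position, duration, channel = fields.split(",")[:5]
    return Marker(number=int(label[2:]),          # drop the leading "Mk"
                  kind=kind,
                  name=name,
                  position=int(position),
                  duration=int(duration),
                  channel=int(channel))


marker = parse_marker_line("Mk200=Stimulus,S 54,3183619,1,0")
print(marker)

# Converting data points to seconds requires the sampling rate
ASSUMED_SAMPLING_RATE = 5000.0   # Hz; hypothetical value, check the dataset's info
onset_in_seconds = marker.position / ASSUMED_SAMPLING_RATE
print(f"Marker {marker.number} occurred {onset_in_seconds:.4f} s into the recording")
```

In practice there is no need to parse the marker file by hand, but seeing the fields separated helps in reading a `.vmrk` file when something looks wrong.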

MNE has a built-in function to extract marker information from EEG data structures. 

The function is called `events_from_annotations()` and its goal is to translate events data into a Python-friendly format:

1. One array with as many rows as there are events (in our case, 200) and three columns: \
one for the time of occurrence (expressed as a sample index), one that is usually zero (it stores the value that preceded the event on a trigger channel, which is rarely needed), and one for the integer code that identifies the event
2. One dictionary that maps each event name to its integer code. \
This one is of secondary importance for the sake of understanding the events, as the names are arbitrary

In [None]:
events_from_annotations, events_dict = mne.events_from_annotations(raw=eeg_data)
# Discard the first two events, keeping only those from the third onwards
events_from_annotations = events_from_annotations[2:]
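
To get a feel for the events array, the sketch below builds a toy one with NumPy and manipulates it. In MNE's convention, the first column holds the sample index of each event, the second column is usually zero, and the third column holds the integer code of the event. All the values and names below are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Toy events array: sample index, (usually zero), integer event code
toy_events = np.array([[1000, 0, 54],
                       [6000, 0, 54],
                       [11000, 0, 54],
                       [16000, 0, 12]])

# Toy dictionary mapping event names to integer codes (illustrative names)
toy_event_dict = {"Stimulus/S 12": 12, "Stimulus/S 54": 54}

# Select only the events with a given code
tms_code = toy_event_dict["Stimulus/S 54"]
tms_events = toy_events[toy_events[:, 2] == tms_code]

# The first column gives the timing of each selected event, in samples
tms_samples = tms_events[:, 0]
inter_pulse_intervals = np.diff(tms_samples)

print(f"Pulse samples: {tms_samples}")
print(f"Inter-pulse intervals (in samples): {inter_pulse_intervals}")
```

The same boolean indexing works on the real events array, which makes it easy to keep only the events of interest before epoching.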

## **1.4 Interpolate the Pulse Artefact: Rationale & Execution**

The TMS pulse leaves a large artifact on EEG traces ([Veniero et al., 2009](https://www.sciencedirect.com/science/article/pii/S1388245709003629), [Freche et al., 2018](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006177)). 

This artifact is due to the interaction between the EEG recording system and the electric field induced by the pulse, which is orders of magnitude larger than physiological fields. 

The following code creates epochs around TMS pulses, averages them, and plots the resulting TMS-evoked potential (TEP). 

In [None]:
pre_interpolation_epochs = mne.Epochs(raw=eeg_data,
                                      events=events_from_annotations,
                                      tmin=-1.1,
                                      tmax=0.5,
                                      baseline=None)
pre_interpolation_epochs_tep = pre_interpolation_epochs.average()
pre_interpolation_epochs_tep.plot();
del pre_interpolation_epochs

As you can see, the EEG traces are dominated by a large spike at the time of the TMS pulse &mdash; so large that everything else appears flat. That is the pulse artifact. 
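
Conceptually, epoching and averaging boil down to slicing the `channels x time` matrix around each pulse and taking the mean across slices. The following NumPy sketch illustrates the idea on synthetic data: the sampling rate, pulse positions, epoch limits and artifact amplitude are all made-up values, and the code is a simplification rather than a description of what `mne.Epochs` does internally.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

sampling_rate = 1000.0                       # Hz; hypothetical
n_channels, n_samples = 3, 20_000
continuous = rng.normal(size=(n_channels, n_samples))   # background "EEG"

# Hypothetical pulse positions (in samples), each with a huge artifact
pulse_samples = np.array([3000, 8000, 13000, 18000])
continuous[:, pulse_samples] += 1000.0

# Epoch limits relative to the pulse, converted from seconds to samples
tmin, tmax = -0.1, 0.5
first = int(round(tmin * sampling_rate))
last = int(round(tmax * sampling_rate))

# One slice per pulse: the result has shape (epochs, channels, time)
epochs = np.stack([continuous[:, sample + first: sample + last + 1]
                   for sample in pulse_samples])

# Averaging across epochs gives one evoked trace per channel
tep = epochs.mean(axis=0)
times = np.arange(first, last + 1) / sampling_rate

# The largest deflection sits at the time of the pulse
peak_time = times[np.abs(tep).mean(axis=0).argmax()]
print(f"Epochs shape: {epochs.shape}, TEP shape: {tep.shape}")
print(f"Largest deflection at {peak_time} s")
```

Even in this toy example the artifact dwarfs everything else in the average, which is the same picture that the real data show.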

There is currently no way to recover physiological signals from a pulse artifact. One can only delete the artifact and interpolate the empty window underneath. 

Interpolation is the act of estimating a function in some interval, given values from a preceding and a following interval. 

In other words, interpolation answers the question: "**Given the pre- and post-pulse EEG, what should the EEG around the pulse look like?**"

<p align="center">
<img src="./files/interpolation.png" width="1000"/>
</p>

There exists a wealth of interpolation methods. Historically, the TMS-EEG community has adopted the following: 

- Replacing the artifact window with zeroes
    - Gives up on guessing what happened in the brain during the pulse
    - Can affect downstream computations due to the massive presence of zeroes 
- Linear interpolation 
    - Fits a linear function in the artifact window
    - Implausible: the EEG is seldom linear
- Moving average interpolation 
    - Replaces the artifact with a moving average that starts before the pulse
    - Reasonable approach but suboptimal results
- Cubic spline interpolation 
    - Fits a cubic function between each pair of contiguous `(time, voltage)` points in the interval of interest
    - Reasonable approach and satisfying results &mdash; current state of the art  
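
The differences between some of these methods can be sketched on a synthetic signal. The example below contaminates a 10 Hz oscillation with a large offset between 2 ms pre-pulse and 5 ms post-pulse, then repairs the window with zeroes, with linear interpolation, and with a cubic spline fitted to the samples that flank the window. It uses `scipy.interpolate.CubicSpline` and made-up parameters (sampling rate, amplitudes, 10 ms flanks), so it is an illustration of the idea and not the procedure used to produce the interpolated file loaded in this notebook.

```python
import numpy as np
from scipy.interpolate import CubicSpline

sampling_rate = 5000.0                        # Hz; hypothetical
samples = np.arange(-250, 250)                # sample indices relative to the pulse
times = samples / sampling_rate               # seconds (-50 ms to +50 ms)

# Synthetic "EEG": a 10 Hz oscillation (microvolts) plus a huge pulse artifact
clean = 10.0 * np.sin(2 * np.pi * 10.0 * times)
artifact_window = (samples >= -10) & (samples <= 25)      # -2 ms to +5 ms
contaminated = clean.copy()
contaminated[artifact_window] += 5000.0

# Samples used to guess the missing segment: 10 ms on each side of the window
flanks = ~artifact_window & (samples >= -60) & (samples <= 75)

# 1. Replace the artifact with zeroes
zeroed = contaminated.copy()
zeroed[artifact_window] = 0.0

# 2. Linear interpolation between the flanking samples
linear = contaminated.copy()
linear[artifact_window] = np.interp(times[artifact_window],
                                    times[flanks], contaminated[flanks])

# 3. Cubic spline fitted to the flanking samples
spline = CubicSpline(times[flanks], contaminated[flanks])
cubic = contaminated.copy()
cubic[artifact_window] = spline(times[artifact_window])

# Compare each repair with the known clean signal inside the window
errors = {}
for name, repaired in [("zeroes", zeroed), ("linear", linear), ("cubic spline", cubic)]:
    errors[name] = np.max(np.abs(repaired[artifact_window] - clean[artifact_window]))
    print(f"{name}: maximum error = {errors[name]:.6f} microvolts")
```

On real data the true signal under the artifact is unknown, so no such error can be computed. The synthetic case only shows why a smooth curve is a more plausible guess than a flat or straight segment.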

The following code loads data that were previously interpolated with a cubic spline between 2 ms pre-pulse and 5 ms post-pulse, then segments the signal into epochs and computes the corresponding TEP for visualization.

In [None]:
post_interpolation_eeg = mne.io.read_raw(fname="data/post_2_5_interpolation_eeg.fif",
                                         preload=True)
post_interpolation_epochs = mne.Epochs(raw=post_interpolation_eeg,
                                       events=events_from_annotations,
                                       tmin=-1.1,
                                       tmax=0.5,
                                       baseline=None)
post_interpolation_tep = post_interpolation_epochs.average()
post_interpolation_tep.plot();

## **1.5 High-Pass Filtering: Rationale & Execution**

Filtering is the convolution* of a signal with a kernel whose characteristics can highlight a pattern of interest while suppressing other patterns.

For example, high-pass filtering with a cutoff frequency (or _threshold_) $\tau$ is the convolution of the signal with a kernel that highlights high-frequency components, where "high" means "higher than $\tau$". 
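
The idea of "convolving with a kernel" can be made tangible with a few lines of NumPy. The sketch below mixes a slow and a fast oscillation, then convolves the mixture with two kernels: a moving average (which keeps the slow part) and an impulse minus the moving average (which keeps the fast part). The kernel length and all signal parameters are arbitrary choices for illustration. This is a crude filter, far from the ones used in practice.

```python
import numpy as np

sampling_rate = 1000.0                                   # Hz; hypothetical
times = np.arange(0, 2, 1 / sampling_rate)

slow = np.sin(2 * np.pi * 0.5 * times)                   # 0.5 Hz component
fast = 0.2 * np.sin(2 * np.pi * 40.0 * times)            # 40 Hz component
signal = slow + fast

# Low-pass kernel: a moving average over 101 samples
kernel_length = 101
moving_average = np.ones(kernel_length) / kernel_length

# High-pass kernel: "the signal itself" (an impulse) minus its moving average
impulse = np.zeros(kernel_length)
impulse[kernel_length // 2] = 1.0
high_pass_kernel = impulse - moving_average

low_passed = np.convolve(signal, moving_average, mode="same")
high_passed = np.convolve(signal, high_pass_kernel, mode="same")

# Ignore the edges, where the kernel does not fully overlap the signal
interior = slice(kernel_length, -kernel_length)
residual = np.max(np.abs(high_passed[interior] - fast[interior]))
print(f"Largest difference between the high-passed signal and the fast component: {residual:.4f}")
```

The two outputs add up to the original signal, which shows that the two kernels simply split it into a slow and a fast part.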

We are interested in high-pass filtering the data to discard all frequencies below 0.1 Hz while preserving the others. Oscillations at very low frequencies (so-called _slow drifts_) are usually due to slow-cycling local currents that are generated at the scalp by processes like sweating or the exchange of ions between the electrode and the electrode gel, so they are not of scientific interest.
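
The effect of a 0.1 Hz high-pass filter on a slow drift can be sketched with SciPy on synthetic data. The example below adds a large 0.01 Hz drift to a 10 Hz oscillation and removes it with a 4th-order Butterworth filter applied forwards and backwards (zero phase). The filter design and all signal parameters are assumptions made for illustration. The code calls SciPy directly and is not meant to reproduce what MNE does internally.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

sampling_rate = 250.0                                    # Hz; hypothetical
times = np.arange(0, 120, 1 / sampling_rate)             # two minutes of signal

drift = 50.0 * np.sin(2 * np.pi * 0.01 * times)          # slow drift (0.01 Hz)
oscillation = 10.0 * np.sin(2 * np.pi * 10.0 * times)    # activity of interest (10 Hz)
signal = drift + oscillation

# 4th-order Butterworth high-pass filter with a 0.1 Hz cutoff,
# expressed as second-order sections for numerical stability
sos = butter(N=4, Wn=0.1, btype="highpass", fs=sampling_rate, output="sos")

# Forward-backward filtering cancels the phase shift of the filter
filtered = sosfiltfilt(sos, signal)

# Compare with the drift-free oscillation, ignoring 30 s at each edge
# (slow filters produce long-lasting transients at the boundaries)
margin = int(30 * sampling_rate)
interior = slice(margin, -margin)
residual = np.max(np.abs(filtered[interior] - oscillation[interior]))
print(f"Largest leftover drift away from the edges: {residual:.4f}")
```

Note how long the edge margin has to be: the lower the cutoff frequency, the longer the filter "remembers" the past, which is one reason why high-pass filters are applied to continuous data rather than to short epochs.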

---

*_Convolution is a mathematical operation that cannot be fully understood without some basics of linear algebra and calculus. 
However, people without a quantitative background can get a working understanding of convolution from resources like [this article](https://betterexplained.com/articles/intuitive-convolution/) at BetterExplained or the appropriate chapters from the book "Analyzing Neural Time Series Data", by Mike X Cohen (MIT Press) (available from [BUR &mdash; Rovereto University Library](https://www.biblioteca.unitn.it/en/bur-rovereto-university-library)). 
Finally, [de Cheveigné & Nelken (2019)](https://www.sciencedirect.com/science/article/pii/S0896627319301746) provide a good introduction to filters and their use in EEG._

In [None]:
post_filtering_eeg = mne.filter.filter_data(data=post_interpolation_eeg.get_data(),
                                            sfreq=post_interpolation_eeg.info["sfreq"],
                                            l_freq=0.1,
                                            h_freq=None,
                                            method="iir",
                                            iir_params=None,
                                            copy=True,
                                            phase="zero")
post_filtering_eeg = mne.io.RawArray(data=post_filtering_eeg,
                                     info=post_interpolation_eeg.info)
post_filtering_epochs = mne.Epochs(raw=post_filtering_eeg,
                                   events=events_from_annotations,
                                   tmin=-1.1,
                                   tmax=0.5,
                                   baseline=None)
post_filtering_tep = post_filtering_epochs.average()
post_filtering_tep.plot();