<a href="https://colab.research.google.com/github/IvaroEkel/Probabilistic-Machine-Learning_lecture-PROJECTS/blob/main/TEMPLATE_Probabilistic_Machine_Learning_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probabilistic Machine Learning - Project Report - fMRI Vector Embeddings

**Course:** Probabilistic Machine Learning (SoSe 2025)
**Lecturer:** [Lecturer Name]  
**Student(s) Name(s):**  Felix Filius
**GitHub Username(s):**  thehappyson
**Date:**  20.08.2025
**PROJECT-ID:** [Assigned Project ID]  

---


## 0. Preface
This project is part of ongoing research at the Max Planck Institute for Human Cognitive and Brain Sciences in Leipzig, in the research group of Dr. Nico Scherf. The work outlined in this document is my own, carried out either as part of that research engagement or specifically for this course project. Where I relied on or used the work of other researchers, I highlight this in the text and mention their contributions. Proper citations are unfortunately not possible, as none of this work has been published yet. I highly appreciate the collaboration with and guidance from the team.

All code used and referenced here is located in the notebooks inside the repository.
Some path references in the notebooks might be broken because the code was run on an HPC cluster. I acknowledge the support of the Max Planck Computing and Data Facility, where the computations were performed on the HPC systems Raven and Robin.
I advise against running the code in the notebooks again, as the memory and compute requirements are immense due to the size of the data and the models. Instead, I stored fitted models and some visuals as separate files in the repository to enable more efficient computations.

## 1. Introduction

The presented work is based on the research and data by [Finn, E.S., Corlett, P.R., Chen, G. et al. Trait paranoia shapes inter-subject synchrony in brain activity during an ambiguous social narrative](https://rdcu.be/eBHsQ). The paper investigated a potential correlation between the neural activations of patients reacting to a stimulus and their inherent character traits, in this case a continuous paranoia score.
The data used in this project is the dataset produced by the paper credited above and can be accessed here: https://openneuro.org/datasets/ds001338/versions/1.0.0
The easiest way to download the data is with the following code:



In [None]:
%%sh
# The cell magic above must be the first line of the cell; the commands can also be run directly in a shell.
# Downloads the data into a new folder inside the current working directory.
# Install the AWS CLI first if it is not installed yet (the two commands below are for macOS; adjust them for Linux or Windows).
# Install instructions can be found under https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /

# Download the data (no AWS account needed thanks to --no-sign-request)
aws s3 sync --no-sign-request s3://openneuro.org/ds001338 ds001338-download/

### Dataset overview

The dataset consists of the following components:
- Anatomical and functional MRI scans for 22 patients as .nii.gz files
- The functional scan is split into three parts due to the experiment setup
- A parameter file for the scanner that was used
- The stimuli that were used
    - The story as .txt and .doc file
    - The recording of the story as played to the subjects

The technical details of the scan in the original experiment are not relevant for the scope of this project, as a thorough neurological evaluation is not feasible without extensive domain knowledge.
The dataset consists of preprocessed neuroimaging data stored in NIfTI format (.nii.gz files), representing BOLD signal measurements across voxels in the brain during experimental tasks. A voxel is the smallest unit of brain volume the scanner can resolve; it roughly corresponds to one pixel in the 3D image. Each voxel has a measured activation level for every timepoint of the scan, and here the timesteps are 1 second each. An activation in this context means any measured activity in the brain. The primary challenge lies in extracting meaningful low-dimensional representations from this high-dimensional, noisy neural data while preserving the underlying neural dynamics and behavioral correlates.


### Motivation

Neural experiments naturally produce data with a temporal structure, because subjects are routinely exposed to stimuli and their reaction to the stimulus over time is what is actually observed and studied. Conventional dimensionality reduction methods struggle to incorporate this temporal aspect properly. For functional data like the data explored here, which measures brain function during an activity, the temporal structure is inherently important: the original experiment investigated how neural activation changes in response to different aspects of the story being presented. The CEBRA model also allows auxiliary variables to be incorporated directly into the model during training. Auxiliary data, or behavioral data in this context, is data describing the action or stimulus during the functional recording. Incorporating it directly into the model could theoretically provide better insight into the relation between stimulus and neural activation.

Investigating functional brain activity offers promising insight into our current understanding of the brain and how we perceive things. Can certain traits in a person dictate their perception of reality, and are we able to construct something like an objective reality? This research is also closely tied to prior work on brain waves and neurostimulation, as activation patterns identified in the latent space could in turn be used in experiments that try to induce certain neurological reactions.
While the current research is at an early stage and purely exploratory, without a concrete hypothesis, its potential applications and implications are easy to see, which makes this an interesting and worthwhile project.


## 2. Data Loading and Exploration


### Basic Data Exploration
The raw data cannot be explored meaningfully, for the reasons outlined per data type below.
The exploration therefore begins after some well-known, required preprocessing, so that the data can actually be interpreted with regard to our problem.

Loading the data requires specialized libraries, packages or tools, which in turn require the user to have extensive domain knowledge in radiology and/or neuroscience to carry out the required processing steps in a meaningful way.

#### Text
The story that was read to the subjects during the scan is characteristically unstructured data. By itself, the story holds limited information for our research.
For this reason the story is not reviewed or explored in its unprocessed form.

#### Neural Data
The neural data is loaded as a NIfTI file, as mentioned before. This means the data is not fully raw out of the scanner but was converted from the DICOM format into the .nii.gz format to make it usable with analysis libraries.



## 3. Data Preprocessing

### Textual Data Preprocessing
To work with the story in the context of the latent space analysis, the story was converted into a CSV file and the text was split into small chunks corresponding to individual timepoints of the recording. The timepoints were based on the time slices of the fMRI scanning procedure to ensure alignment between the text the subject heard and the functional response we observed. The file with these 'tokens' can be found in the project repository as 'tokens.csv'. This chunking and timepoint alignment had already been done manually.

Afterward the story was labelled with the help of an LLM, namely the Claude Sonnet model[^3], via API calls. Every token was sent to the API together with a system prompt explaining the task, a prompt outlining the instructions, and a sliding window of up to five neighboring tokens before and after it for story context. Labels were created in four categories, each with a confidence score. This approach will be outlined in more detail in Section 4.
The labelled and unlabelled texts were embedded using the 'BAAI/bge-large-en-v1.5' model[^4].
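
The sliding window can be sketched as follows. This is a minimal illustration and not the code from the notebooks: the helper name `build_context` and the placeholder tokens are assumptions, while the keys `target_token`, `context_before` and `context_after` are the ones referenced in the prompt template.

```python
from typing import Dict, List, Union


def build_context(tokens: List[str], index: int, window: int = 5) -> Dict[str, Union[str, List[str]]]:
    """Collect the target token and up to `window` neighbours on each side."""
    start = max(0, index - window)  # clamp at the beginning of the story
    return {
        "target_token": tokens[index],
        "context_before": tokens[start:index],
        "context_after": tokens[index + 1 : index + 1 + window],
    }


# Placeholder chunks standing in for the rows of tokens.csv (hypothetical content).
tokens = ["chunk 0", "chunk 1", "chunk 2", "chunk 3", "chunk 4"]
context_info = build_context(tokens, index=2, window=2)
print(context_info)
```

At the start and end of the story the window simply becomes shorter, which matches the `[none]` fallback in the prompt template when no context is available.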

The prompt that was used:
```
You are a semantic token labeling expert. Analyze the given token within its context and provide labels for these four categories:

1. **location** - Physical places, geographical locations, spatial references
2. **characters** - People, character names, pronouns referring to people, roles
3. **emotions** - Emotional states, feelings, mood descriptors
4. **time** - Temporal references, time periods, time-related words

**Target Token:** "{context_info['target_token']}"

**Context Before:** {' '.join(context_info['context_before'][-5:]) if context_info['context_before'] else '[none]'}
**Context After:** {' '.join(context_info['context_after'][:5]) if context_info['context_after'] else '[none]'}

For each category, provide:
- **value**: The specific semantic label (e.g., "office", "Dr. Carmen Reed", "tired", "afternoon") or "null" if not applicable
- **confidence**: Float between 0.0-1.0 indicating your confidence in the label

Make sure the character you assign is one of the following and no other:
- Dr. Carmen Reed
- Dr. John Torreson
- Antonio
- Juan Torres
- Alba
- Maria
- Linda
- Ramiro
- Boat Driver
- Alba's Mother

Remember that you are not strictly constrained to one value per label. If you assign multiple values to one category for any given token make sure it is complying with the JSON formating restrictions:
Consider the context when labeling. If the token doesn't directly contain a category but the context suggests it should be labeled (e.g., pronouns referring to characters, implicit time/location), include those labels.
Null values should be avoided and only used in scenarios where you a are completely unsure about the label. A value with a low confidence is better than a null value in most cases.

Respond in this exact JSON format:
{{
  "location": {{"value": "label_or_null", "confidence": 0.0}},
  "characters": {{"value": "label_or_null", "confidence": 0.0}},
  "emotions": {{"value": "label_or_null", "confidence": 0.0}},
  "time": {{"value": "label_or_null", "confidence": 0.0}}
}}"""
```
The strict requirement to format the response in JSON enabled easier extraction of the values and handling in downstream tasks.
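
As a rough sketch of that extraction step, a response in the format above could be parsed like this. The function name `parse_labels` and the example response are assumptions for illustration; the actual handling in the notebooks may differ.

```python
import json
from typing import Any, Dict, Tuple

CATEGORIES = ("location", "characters", "emotions", "time")


def parse_labels(raw: str) -> Dict[str, Tuple[Any, float]]:
    """Turn one JSON response into {category: (value, confidence)}."""
    data = json.loads(raw)
    parsed = {}
    for category in CATEGORIES:
        entry = data.get(category) or {}
        value = entry.get("value")
        # The prompt allows the string "null"; treat it like a JSON null.
        if value == "null":
            value = None
        parsed[category] = (value, float(entry.get("confidence", 0.0)))
    return parsed


# Hypothetical response, using the example labels from the prompt.
example = '''{
  "location": {"value": "office", "confidence": 0.8},
  "characters": {"value": "null", "confidence": 0.1},
  "emotions": {"value": "tired", "confidence": 0.6},
  "time": {"value": "afternoon", "confidence": 0.9}
}'''
labels = parse_labels(example)
print(labels["location"])
```

Mapping the string "null" to `None` keeps missing labels distinguishable from real ones in downstream tasks.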

### Neural Data Processing
The preprocessing of the neural data was handled by subject-matter experts within the research group and was already done when I joined the project. However, I still attempted it myself to identify problems and to understand the process at least roughly.

The entire workflow can be seen in the "MRI_Data_Processing.ipynb" and "cebra_demo.ipynb" notebooks. The demo notebook only contains the numerical adjustments and some raw data inspection.

The first step was the so-called alignment of the functional scan to the anatomical scan of the brain. This step was already where my personal preprocessing pipeline failed, because mapping the functional scan to a reference anatomical model requires some domain knowledge. Mapping to a reference model ensures that the gathered data is comparable across subjects and that the results are seen in the functional space and not the anatomical space. This is similar to genetics, where a human reference genome is used during analysis. Although my alignment was failing, I still carried out the temporal and spatial smoothing.
![Preprocessing effect](./raw_vs_smoothed_signal.png "Preprocessing effect")
The smoothing and normalization worked and reduced the noise considerably. However, because the initial alignment and mapping did not work as intended, I could not use my own results and had to fall back on the data already prepared by another researcher, as mentioned above.
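
To illustrate the idea of temporal smoothing followed by normalization, here is a minimal NumPy sketch on synthetic data. It is not the pipeline from the notebooks: the moving-average filter, the window width and the function name `smooth_and_zscore` are assumptions chosen for simplicity.

```python
import numpy as np


def smooth_and_zscore(signal: np.ndarray, width: int = 5) -> np.ndarray:
    """Moving-average smoothing along time, then z-scoring per voxel.

    `signal` has shape (timepoints, voxels); `width` must be odd.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the output keeps its length")
    kernel = np.ones(width) / width
    pad = width // 2
    # Repeat the edge values so every timepoint has a full window.
    padded = np.pad(signal, ((pad, pad), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, v], kernel, mode="valid") for v in range(signal.shape[1])],
        axis=1,
    )
    std = smoothed.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant voxels
    return (smoothed - smoothed.mean(axis=0)) / std


# Synthetic example: a slow oscillation plus noise for three "voxels".
rng = np.random.default_rng(0)
time = np.arange(200)
toy = np.sin(time / 20.0)[:, None] + rng.normal(scale=0.5, size=(200, 3))
cleaned = smooth_and_zscore(toy, width=5)
print(cleaned.shape)
```

After this step every voxel time series has zero mean and unit variance, which makes voxels with different baseline signal levels comparable.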

The resulting data for a single patient is a 2-dimensional matrix with the shape (1311, 139501). The first dimension (1311) is time, so there are 1311 timepoints in total per patient.
The second dimension represents the voxels of the brain. The fMRI scanner captures the entire brain once per time unit, which is once per second in this experiment, and records the BOLD signal together with the anatomy. This results in a 4D data structure: three spatial dimensions of the brain plus time, with the measured activation stored per voxel and timepoint. In the 2D matrix the spatial dimensions are flattened into a single voxel axis. A voxel is the smallest region of the brain the scanner can resolve, so one voxel corresponds to one pixel in the 3D representation.
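
The step from a 4D volume to the 2D matrix can be sketched with NumPy on a toy array. The brain mask and the assumption that time is the last axis are illustrative; how the 139501 voxels were actually selected is part of the group's preprocessing and not shown here.

```python
import numpy as np

# Toy 4D volume: 4 x 5 x 6 voxels over 10 timepoints (real data is far larger).
rng = np.random.default_rng(0)
volume = rng.normal(size=(4, 5, 6, 10))

# Hypothetical brain mask: True for voxels that lie inside the brain.
mask = np.zeros((4, 5, 6), dtype=bool)
mask[1:3, 1:4, 2:5] = True

# Boolean indexing over the three spatial axes gives (n_voxels, n_timepoints);
# transposing yields the (timepoints, voxels) layout described above.
matrix = volume[mask].T
print(matrix.shape)  # (10, 18)
```

Each column of `matrix` is the time series of one voxel, and each row is a snapshot of all selected voxels at one timepoint.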





## 4. Probabilistic Modeling Approach

In this section the different probabilistic models and their application within the scope of this project are outlined, starting with the LLM used for labelling the story, followed by the contrastive learning model, which is the main focus of the project. The dimensionality reduction methods addressed afterwards contribute less to the project outcome.
- Description of the models chosen
    - CEBRA (contrastive learning)
    - GMM
    - UMAP
    - t-SNE

- Why they are suitable for your problem
- Mathematical formulations (if applicable)

### 4.1 Claude Sonnet

### 4.2 CEBRA: "Consistent EmBeddings of high-dimensional Recordings using Auxiliary variables"


### 4.3 GMM

### 4.4 UMAP
UMAP was used twice: once to represent the latents learned by CEBRA differently, and once to construct a different latent space directly from the patient data.

### 4.5 t-SNE
t-SNE was used in the project to create a low-dimensional representation of the subjects' functional data to compare with the results of the CEBRA model. It was also applied to the embeddings learned by CEBRA to create a different visual representation of the same space.
Given the size of the matrix, applying t-SNE to the entire dataset was not feasible, as the training would likely have taken days, if memory allowed it at all; the same holds for UMAP. For this reason t-SNE used the same reduced dimensionality as UMAP (50) to enable computation. This constraint also limits the quality of the results, but they still illustrate the difference to CEBRA well enough.
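
One common way to obtain such a 50-dimensional input is a PCA projection before running t-SNE or UMAP. Whether the notebooks use PCA for this step is an assumption; the sketch below only shows the idea with plain NumPy on random stand-in data, and the function name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(data: np.ndarray, n_components: int = 50) -> np.ndarray:
    """Project the rows of `data` onto its top principal components."""
    centered = data - data.mean(axis=0)
    # Thin SVD: with few timepoints and many voxels this stays tractable.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
toy = rng.normal(size=(120, 2000))  # stand-in for the (1311, 139501) matrix
reduced = pca_reduce(toy, n_components=50)
print(reduced.shape)  # (120, 50)
```

The reduced matrix keeps one row per timepoint, so it can be passed to t-SNE or UMAP in place of the full voxel matrix.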

## 5. Model Training and Evaluation

- Training process
- Model evaluation (metrics, plots, performance)
- Cross-validation or uncertainty quantification



## 6. Results

- Present key findings
- Comparison of models if multiple approaches were used



## 7. Discussion

- Interpretation of results
- Limitations of the approach
- Possible improvements or extensions



## 8. Conclusion

The interpretation and preparation of neural data requires extensive domain knowledge, hence I focused on exploring the latent space to identify patterns in neural activation.
After initial challenges in training the models, I was able to show the expected strong temporal consistency of the neural activations throughout the observed time. The emerging complex activation pattern was captured similarly by different model architectures, which suggests robust findings, with the differences that are to be expected between architectures.
Inconsistencies in the embedding structure when labels were used indicate that this approach was flawed in the setup used here and warrants further investigation.
Going forward, training the CEBRA model with auxiliary variables will be the main focus, to investigate whether patterns can be extracted from the resulting latent space. It might also be worthwhile to identify emotional triggers within the story and compare them against the expected activation in certain brain regions.


## 9. References

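- [^1]: Finn, E.S., Corlett, P.R., Chen, G. et al. Trait paranoia shapes inter-subject synchrony in brain activity during an ambiguous social narrative. https://rdcu.be/eBHsQ
- [^2]: [CEBRA](https://cebra.ai)
- [^3]: Anthropic Claude Sonnet
- [^4]: BGE embedding model 'BAAI/bge-large-en-v1.5'
- [^5]: Dataset: https://openneuro.org/datasets/ds001338/versions/1.0.0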