A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience
TRIBE v2 is a deep multimodal brain encoding model that predicts fMRI brain responses to naturalistic stimuli (video, audio, text). It combines state-of-the-art text, audio and video models into a unified Transformer architecture that maps multimodal representations onto the cortical surface.
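The paper specifies the actual architecture; purely as a shape-level sketch (feature dimensions, fusion by summation, and the linear readout below are illustrative, and the transformer trunk is omitted), per-modality features are projected to a shared width, fused, and mapped onto cortical vertices:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, N_VERT = 100, 256, 20484  # timesteps, shared width, fsaverage5 vertices

# Stand-ins for pretrained text/audio/video features (dims illustrative):
feats = {
    "text": rng.standard_normal((T, 768)),
    "audio": rng.standard_normal((T, 1024)),
    "video": rng.standard_normal((T, 1408)),
}

# Project each modality to the shared width and fuse by summation:
proj = {m: rng.standard_normal((f.shape[1], D)) / np.sqrt(f.shape[1])
        for m, f in feats.items()}
fused = sum(f @ proj[m] for m, f in feats.items())  # (T, D)

# A linear readout maps the (transformer-processed, omitted here) fused
# representation onto cortical vertices:
readout = rng.standard_normal((D, N_VERT)) / np.sqrt(D)
preds = fused @ readout
print(preds.shape)  # (100, 20484)
```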
Load a pretrained model from HuggingFace and predict brain responses to a video:
```python
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
df = model.get_events_dataframe(video_path="path/to/video.mp4")
preds, segments = model.predict(events=df)
print(preds.shape)  # (n_timesteps, n_vertices)
```

Predictions are for the "average" subject (see the paper for details) and live on the fsaverage5 cortical mesh (~20k vertices). They are offset 5 seconds into the past to compensate for the hemodynamic lag.
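Since the predictions live on fsaverage5 (10,242 vertices per hemisphere), they can be split by hemisphere for downstream analysis. A minimal sketch, assuming left-hemisphere vertices come first (verify this ordering against the library's plotting utilities before relying on it):

```python
import numpy as np

# fsaverage5 has 10,242 vertices per hemisphere (20,484 in total).
N_VERT_HEMI = 10242
preds = np.zeros((300, 2 * N_VERT_HEMI))  # stand-in for model output

# Assumed ordering: left hemisphere first, then right.
left = preds[:, :N_VERT_HEMI]
right = preds[:, N_VERT_HEMI:]
print(left.shape, right.shape)  # (300, 10242) (300, 10242)
```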
You can also pass text_path or audio_path to model.get_events_dataframe — text is automatically converted to speech and transcribed to obtain word-level timings.
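The exact schema of the events dataframe is defined by `get_events_dataframe`; purely as an illustration (column names and values are hypothetical), word-level timings can be pictured as:

```python
import pandas as pd

# Hypothetical illustration of a word-level events table -- inspect the
# dataframe returned by get_events_dataframe for the real schema:
events = pd.DataFrame({
    "word": ["the", "quick", "brown", "fox"],
    "onset": [0.00, 0.32, 0.55, 0.81],  # seconds from stimulus start
    "duration": [0.30, 0.21, 0.24, 0.33],
})
print(len(events))
```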
For a full walkthrough with brain visualizations, see the Colab demo notebook.
Basic (inference only):
```bash
pip install -e .
```

With brain visualization:

```bash
pip install -e ".[plotting]"
```

With training dependencies (PyTorch Lightning, W&B, etc.):

```bash
pip install -e ".[training]"
```

Configure data/output paths and the Slurm partition via environment variables (or edit tribev2/grids/defaults.py directly):

```bash
export DATAPATH="/path/to/studies"
export SAVEPATH="/path/to/output"
```

Local test run:

```bash
python -m tribev2.grids.test_run
```

Grid search on Slurm:

```bash
python -m tribev2.grids.run_cortical
python -m tribev2.grids.run_subcortical
```

Repository structure:

```
tribev2/
├── main.py             # Experiment pipeline: Data, TribeExperiment
├── model.py            # FmriEncoder: Transformer-based multimodal→fMRI model
├── pl_module.py        # PyTorch Lightning training module
├── demo_utils.py       # TribeModel and helpers for inference from text/audio/video
├── eventstransforms.py # Custom event transforms (word extraction, chunking, …)
├── utils.py            # Multi-study loading, splitting, subject weighting
├── utils_fmri.py       # Surface projection (MNI / fsaverage) and ROI analysis
├── grids/
│   ├── defaults.py     # Full default experiment configuration
│   └── test_run.py     # Quick local test entry point
├── plotting/           # Brain visualization (PyVista & Nilearn backends)
└── studies/            # Dataset definitions (Algonauts2025, Lahner2024, …)
```
If you use this software, please cite it as follows:
```bibtex
@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}
```

This project is licensed under CC-BY-NC-4.0. See LICENSE for details.
See CONTRIBUTING.md for how to get involved.