This repository contains the code and experiments for the paper "How to sample the world for understanding the visual system", published as a full paper in the Computational and Cognitive Neuroscience (CCN) 2025 proceedings.
Understanding vision requires capturing the vast diversity of the visual world we experience. How can we sample this diversity in a manner that supports robust, generalizable inferences? While widely used, massive neuroimaging datasets have contributed substantially to our understanding of brain function, their ability to comprehensively capture the diversity of visual and semantic experience has remained largely untested.
Here we used a subset of 120 million natural photographs filtered from LAION-2B (LAION-natural) as a proxy for the visual world to assess visual-semantic coverage. Our analysis revealed significant representational gaps in existing datasets (NSD, THINGS), demonstrating that each covers only a subset of the space spanned by LAION-natural. Our results suggest that diverse stimulus sets are key to strong out-of-distribution generalization, and that this diversity matters more than the specific sampling strategy employed.
- Coverage Gaps: Existing large-scale neuroimaging datasets (NSD, THINGS) cover only ~60% of the visual-semantic (CLIP) space defined by LAION-natural
- Diversity Matters: Dataset diversity is more critical for out-of-distribution generalization than the specific sampling strategy used
- Sampling Strategy: While Core-Set sampling shows slight advantages, most strategies perform similarly
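The cluster-based coverage metric underlying the first finding can be illustrated with a small numpy sketch (a simplified stand-in, not the repository's `calculate_coverage.py`; array shapes and names are illustrative): k-means centers are fit on LAION-natural embeddings, and a dataset's coverage is the fraction of those clusters that contain at least one of its images.

```python
import numpy as np

def cluster_coverage(dataset_emb, centers):
    """Fraction of reference clusters hit by at least one dataset embedding.

    dataset_emb: (n, d) embeddings of the dataset under test
    centers:     (k, d) k-means centers fit on LAION-natural embeddings
    """
    # Assign each embedding to its nearest center (squared Euclidean distance)
    d2 = ((dataset_emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    hit = np.unique(d2.argmin(axis=1))
    return len(hit) / len(centers)

rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 4))                     # toy "LAION-natural" cluster centers
subset = centers[:6] + 0.01 * rng.normal(size=(6, 4))  # dataset landing in 6 of 10 clusters
print(cluster_coverage(subset, centers))               # → 0.6
```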
Create a new conda environment:

```bash
conda create -n how_to_sample python=3.10
conda activate how_to_sample
```

Install the package and its dependencies:

```bash
pip install -e .
pip install -r requirements.txt
```

To reproduce the results, you will need to download the following datasets:
- LAION-2B: Download from LAION (~2B image-text pairs)
- Natural Scenes Dataset (NSD): Download from NSD (~70k images + fMRI responses)
- THINGS Database: Download from THINGS (~26k object images)
The experiment scripts expect all datasets to be in the data/ directory (data/laion2b, data/nsd, data/things).
LAION-2B is required for some of the experiments; these are marked with a ⚠️ below. Download the dataset with output_format='files'. This will create a large number of tar files in the data/laion2b directory; the scripts use these tar files directly, so they don't need to be extracted. After extracting CLIP image features, you can use the classifier data/laion_natural_img_clf.pkl to filter the images and obtain LAION-natural.
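Applying the classifier could look like the sketch below. The pickled classifier itself ships with the repository; here a tiny logistic-regression stand-in trained on random features takes its place so the snippet is self-contained, and the feature shapes are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# In the repository you would load the shipped classifier instead:
# import pickle
# clf = pickle.load(open("data/laion_natural_img_clf.pkl", "rb"))

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))        # stand-in for CLIP image features
labels = (features[:, 0] > 0).astype(int)    # toy "natural vs. non-natural" labels
clf = LogisticRegression().fit(features, labels)

def filter_natural(clip_features, clf, threshold=0.75):
    """Keep only images the classifier scores as natural above `threshold`."""
    scores = clf.predict_proba(clip_features)[:, 1]
    return clip_features[scores > threshold]

natural = filter_natural(features, clf)
print(natural.shape)
```

The 0.75 threshold matches the cutoff used to define LAION-natural in the paper.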
As the experiments can be computationally intensive (and the datasets are quite large), we recommend running them on a compute cluster. If that is not possible, you can still easily replicate the GMM-based simulations, as well as all NSD-based experiments (which use single-subject fMRI data).
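The gist of a GMM-based simulation can be sketched in a few lines (a toy setup with made-up parameters, not the exact configuration used in the experiments): define a ground-truth mixture over a low-dimensional "stimulus space" and draw synthetic stimulus sets from it.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy ground-truth "visual world": a 2D mixture of three Gaussians
world = np.concatenate([
    rng.normal(loc=m, scale=0.5, size=(500, 2))
    for m in ([-3.0, 0.0], [0.0, 3.0], [3.0, 0.0])
])

# Fit a GMM to the world and sample a synthetic stimulus set from it
gmm = GaussianMixture(n_components=3, random_state=0).fit(world)
stimuli, components = gmm.sample(n_samples=100)
print(stimuli.shape)  # (100, 2)
```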
Update on LAION-2B: As LAION-2B is no longer publicly available, we decided to map our natural scores to relaion2B-en-safe, a version of LAION-2B that is safe to use for research. We provide "naturalness" scores for this dataset on HuggingFace: https://huggingface.co/datasets/andropar/relaion2b-natural. With this, you can directly download only the images that are natural according to our classifier (use natural_score > 0.75 to replicate the dataset from our paper).
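Filtering by naturalness score could look like the sketch below. The `natural_score` column and the 0.75 cutoff follow the description above; the `load_dataset` call is left as a comment since it requires a network connection, and the toy rows are made up.

```python
# from datasets import load_dataset
# ds = load_dataset("andropar/relaion2b-natural", split="train")

def natural_subset(rows, threshold=0.75):
    """Keep only rows whose naturalness score exceeds `threshold`."""
    return [r for r in rows if r["natural_score"] > threshold]

# Toy rows standing in for dataset entries (image URL + naturalness score)
rows = [
    {"url": "https://example.com/a.jpg", "natural_score": 0.92},
    {"url": "https://example.com/b.jpg", "natural_score": 0.40},
    {"url": "https://example.com/c.jpg", "natural_score": 0.81},
]
print(natural_subset(rows))  # keeps the two rows scoring above 0.75
```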
New: LAION-natural Embeddings: We also provide CLIP embeddings for the filtered natural subset as a ready-to-use HuggingFace dataset: https://huggingface.co/datasets/andropar/relaion2b-natural-embeddings. This allows you to access image-level CLIP embeddings (ViT-H/14) for all images with high naturalness scores, facilitating fast experimentation without recomputing features.
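The sampling experiments below compare strategies including Core-Set selection. On embeddings like these, greedy k-center (Core-Set) selection can be sketched as follows (a generic numpy implementation for illustration, not necessarily identical to the code in experiments/sampling):

```python
import numpy as np

def coreset_indices(emb, k, seed=0):
    """Greedy k-center (Core-Set) selection: repeatedly pick the point
    farthest from the already-selected set."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(emb)))]
    # Distance of every point to its nearest selected point so far
    dists = np.linalg.norm(emb - emb[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))   # stand-in for CLIP image embeddings
idx = coreset_indices(emb, k=10)
print(len(idx), len(set(idx)))    # 10 distinct indices
```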
First, extract CLIP features from all datasets:

```bash
cd experiments/coverage/extract_features

# Extract NSD features
python extract_nsd_features.py --hdf5_path /path/to/nsd_stimuli.hdf5 --output_fp outputs/nsd_clip_features.npz

# Extract THINGS features
python extract_things_features.py --image_glob "/path/to/things/images/*/*.jpg" --output_fp outputs/things_clip_features.npz

# Extract LAION features ⚠️
python extract_laion_features.py --tar_glob "/path/to/laion/tar_files/*.tar"
```

Generate k-means cluster centers from LAION-natural samples for the cluster-based analysis:

```bash
cd experiments/coverage
python cluster_laion.py
```

Calculate coverage metrics:
```bash
# Calculate NSD and THINGS coverage of LAION-natural
python calculate_coverage.py --task datasets_vs_laion_natural

# Calculate LAION-natural coverage of LAION-2B
python calculate_coverage.py --task laion_natural_vs_laion_2b
```

Visualize results:
```bash
jupyter notebook notebooks/plot_coverage_metrics.ipynb
jupyter notebook notebooks/cluster_comparison.ipynb
```

Run the out-of-distribution (OOD) accuracy experiments and plot the results:

```bash
cd experiments/ood_accuracy
jupyter notebook OOD_accuracy_GMM_LAION.ipynb
python nsd_OOD_accuracy.py
python plot_nsd_OOD_results.py
```

Run sampling experiments:
```bash
cd experiments/sampling
bash scripts/sampling_gmm.sh
bash scripts/sampling_laion.sh ⚠️
bash scripts/sampling_nsd.sh
```

Visualize results:

```bash
jupyter notebook plot_sampling_results.ipynb
```

Generate image keywords using the Gemini API:
```bash
cd experiments/concept_distribution
python get_LAION_image_keywords.py --api-key YOUR_GEMINI_API_KEY
```

Apply sampling strategies to keyword-labeled images:

```bash
python get_image_keyword_subsets.py
```

Compare concept distributions:

```bash
jupyter notebook compare_concept_distributions.ipynb
```

If you use this code or our photography classifier in your research, please cite:
```bibtex
@inproceedings{roth2025how,
  title={How to sample the world for understanding the visual system},
  author={Johannes Roth and Martin N Hebart},
  booktitle={8th Annual Conference on Cognitive Computational Neuroscience},
  year={2025},
  url={https://openreview.net/forum?id=T9k6KkZoca}
}
```
Johannes Roth
Max Planck Institute for Human Cognitive and Brain Sciences
Email: jroth@cbs.mpg.de