This repository contains the code and experiments for the paper "How to sample the world for understanding the visual system", published as a full paper in the Computational and Cognitive Neuroscience (CCN) 2025 proceedings.
Understanding vision requires capturing the vast diversity of the visual world we experience. How can we sample this diversity in a manner that supports robust, generalizable inferences? While widely used, massive neuroimaging datasets have contributed substantially to our understanding of brain function, their ability to comprehensively capture the diversity of visual and semantic experience has remained largely untested.
Here we used a subset of 120 million natural photographs filtered from LAION-2B (LAION-natural) as a proxy for the visual world to assess visual-semantic coverage. Our analysis revealed significant representational gaps in existing datasets (NSD, THINGS), demonstrating that each covers only a subset of the space spanned by LAION-natural. Our results suggest that diverse stimulus sets are key to strong out-of-distribution generalization, and that this diversity matters more than the specific sampling strategy employed.
- Coverage Gaps: Existing large-scale neuroimaging datasets (NSD, THINGS) cover only ~60% of the visual-semantic (CLIP) space defined by LAION-natural
- Diversity Matters: Dataset diversity is more critical for out-of-distribution generalization than the specific sampling strategy used
- Sampling Strategy: While Core-Set sampling shows slight advantages, most strategies perform similarly
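The cluster-based coverage metric underlying the first finding can be illustrated with a small numpy sketch (a simplified stand-in, not the repository's `calculate_coverage.py`; array shapes and names are illustrative): k-means centers are fit on LAION-natural embeddings, and a dataset's coverage is the fraction of those clusters that contain at least one of its images.

```python
import numpy as np

def cluster_coverage(dataset_emb, centers):
    """Fraction of reference clusters hit by at least one dataset embedding.

    dataset_emb: (n, d) embeddings of the dataset under test
    centers:     (k, d) k-means centers fit on LAION-natural embeddings
    """
    # Assign each embedding to its nearest center (squared Euclidean distance)
    d2 = ((dataset_emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    hit = np.unique(d2.argmin(axis=1))
    return len(hit) / len(centers)

rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 4))                     # toy "LAION-natural" cluster centers
subset = centers[:6] + 0.01 * rng.normal(size=(6, 4))  # dataset landing in 6 of 10 clusters
print(cluster_coverage(subset, centers))               # → 0.6
```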
Create a new conda environment:

```bash
conda create -n how_to_sample python=3.10
conda activate how_to_sample
```

Install the package and its dependencies:

```bash
pip install -e .
pip install -r requirements.txt
```

To reproduce the results, you will need to download the following datasets:
- LAION-2B: Download from LAION (~2B image-text pairs)
- Natural Scenes Dataset (NSD): Download from NSD (~70k images + fMRI responses)
- THINGS Database: Download from THINGS (~26k object images)
The experiment scripts expect all datasets to be in the data/ directory (data/laion2b, data/nsd, data/things).
LAION-2B is required for some of the experiments; these are marked with a ⚠️ below. Download the dataset with output_format='files'. This will create a large number of tar files in the data/laion2b directory; the scripts use these tar files directly, so they don't need to be extracted. After extracting CLIP image features, you can use the classifier data/laion_natural_img_clf.pkl to filter the images and obtain LAION-natural.
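Applying the classifier could look like the sketch below. The pickled classifier itself ships with the repository; here a tiny logistic-regression stand-in trained on random features takes its place so the snippet is self-contained, and the feature shapes are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# In the repository you would load the shipped classifier instead:
# import pickle
# clf = pickle.load(open("data/laion_natural_img_clf.pkl", "rb"))

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))        # stand-in for CLIP image features
labels = (features[:, 0] > 0).astype(int)    # toy "natural vs. non-natural" labels
clf = LogisticRegression().fit(features, labels)

def filter_natural(clip_features, clf, threshold=0.75):
    """Keep only images the classifier scores as natural above `threshold`."""
    scores = clf.predict_proba(clip_features)[:, 1]
    return clip_features[scores > threshold]

natural = filter_natural(features, clf)
print(natural.shape)
```

The 0.75 threshold matches the cutoff used to define LAION-natural in the paper.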
As the experiments can be computationally intensive (and the datasets are quite large), we recommend running them on a compute cluster. If that is not possible, you can still easily replicate the GMM-based simulations, as well as all NSD-based experiments (which use single-subject fMRI data).
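The gist of a GMM-based simulation can be sketched in a few lines (a toy setup with made-up parameters, not the exact configuration used in the experiments): define a ground-truth mixture over a low-dimensional "stimulus space" and draw synthetic stimulus sets from it.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy ground-truth "visual world": a 2D mixture of three Gaussians
world = np.concatenate([
    rng.normal(loc=m, scale=0.5, size=(500, 2))
    for m in ([-3.0, 0.0], [0.0, 3.0], [3.0, 0.0])
])

# Fit a GMM to the world and sample a synthetic stimulus set from it
gmm = GaussianMixture(n_components=3, random_state=0).fit(world)
stimuli, components = gmm.sample(n_samples=100)
print(stimuli.shape)  # (100, 2)
```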
Update on LAION-2B: As LAION-2B is no longer publicly available, we decided to map our natural scores to relaion2B-en-safe, a version of LAION-2B that is safe to use for research. We provide "naturalness" scores for this dataset on HuggingFace: https://huggingface.co/datasets/andropar/relaion2b-natural. With this, you can directly download only the images that are natural according to our classifier (use natural_score > 0.75 to replicate the dataset from our paper).
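Filtering by naturalness score could look like the sketch below. The `natural_score` column and the 0.75 cutoff follow the description above; the `load_dataset` call is left as a comment since it requires a network connection, and the toy rows are made up.

```python
# from datasets import load_dataset
# ds = load_dataset("andropar/relaion2b-natural", split="train")

def natural_subset(rows, threshold=0.75):
    """Keep only rows whose naturalness score exceeds `threshold`."""
    return [r for r in rows if r["natural_score"] > threshold]

# Toy rows standing in for dataset entries (image URL + naturalness score)
rows = [
    {"url": "https://example.com/a.jpg", "natural_score": 0.92},
    {"url": "https://example.com/b.jpg", "natural_score": 0.40},
    {"url": "https://example.com/c.jpg", "natural_score": 0.81},
]
print(natural_subset(rows))  # keeps the two rows scoring above 0.75
```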
New: LAION-natural Embeddings: We also provide CLIP embeddings for the filtered natural subset as a ready-to-use HuggingFace dataset: https://huggingface.co/datasets/andropar/relaion2b-natural-embeddings. This allows you to access image-level CLIP embeddings (ViT-H/14) for all images with high naturalness scores, facilitating fast experimentation without recomputing features.
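The sampling experiments below compare strategies including Core-Set selection. On embeddings like these, greedy k-center (Core-Set) selection can be sketched as follows (a generic numpy implementation for illustration, not necessarily identical to the code in experiments/sampling):

```python
import numpy as np

def coreset_indices(emb, k, seed=0):
    """Greedy k-center (Core-Set) selection: repeatedly pick the point
    farthest from the already-selected set."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(emb)))]
    # Distance of every point to its nearest selected point so far
    dists = np.linalg.norm(emb - emb[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))   # stand-in for CLIP image embeddings
idx = coreset_indices(emb, k=10)
print(len(idx), len(set(idx)))    # 10 distinct indices
```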
First, extract CLIP features from all datasets:

```bash
cd experiments/coverage/extract_features

# Extract NSD features
python extract_nsd_features.py --hdf5_path /path/to/nsd_stimuli.hdf5 --output_fp outputs/nsd_clip_features.npz

# Extract THINGS features
python extract_things_features.py --image_glob "/path/to/things/images/*/*.jpg" --output_fp outputs/things_clip_features.npz

# Extract LAION features ⚠️
python extract_laion_features.py --tar_glob "/path/to/laion/tar_files/*.tar"
```

Generate k-means cluster centers from LAION-natural samples for the cluster-based analysis:

```bash
cd experiments/coverage
python cluster_laion.py
```

Calculate coverage metrics:
```bash
# Calculate NSD and THINGS coverage of LAION-natural
python calculate_coverage.py --task datasets_vs_laion_natural

# Calculate LAION-natural coverage of LAION-2B
python calculate_coverage.py --task laion_natural_vs_laion_2b
```

Visualize results:
```bash
jupyter notebook notebooks/plot_coverage_metrics.ipynb
jupyter notebook notebooks/cluster_comparison.ipynb
```

Run the out-of-distribution (OOD) accuracy experiments and plot the results:

```bash
cd experiments/ood_accuracy
jupyter notebook OOD_accuracy_GMM_LAION.ipynb
python nsd_OOD_accuracy.py
python plot_nsd_OOD_results.py
```

Run sampling experiments:
```bash
cd experiments/sampling
bash scripts/sampling_gmm.sh
bash scripts/sampling_laion.sh ⚠️
bash scripts/sampling_nsd.sh
```

Visualize results:

```bash
jupyter notebook plot_sampling_results.ipynb
```

Generate image keywords using the Gemini API:
```bash
cd experiments/concept_distribution
python get_LAION_image_keywords.py --api-key YOUR_GEMINI_API_KEY
```

Apply sampling strategies to keyword-labeled images:

```bash
python get_image_keyword_subsets.py
```

Compare concept distributions:

```bash
jupyter notebook compare_concept_distributions.ipynb
```

If you use this code or our photography classifier in your research, please cite:
```bibtex
@inproceedings{roth2025how,
  title={How to sample the world for understanding the visual system},
  author={Johannes Roth and Martin N Hebart},
  booktitle={8th Annual Conference on Cognitive Computational Neuroscience},
  year={2025},
  url={https://openreview.net/forum?id=T9k6KkZoca}
}
```
Johannes Roth
Max Planck Institute for Human Cognitive and Brain Sciences
Email: jroth@cbs.mpg.de