Who's Waldo? Linking People Across Text and Images

Download links and PyTorch implementation of "Who's Waldo? Linking People Across Text and Images", ICCV 2021.

Claire Yuqing Cui*, Apoorv Khandelwal*, Yoav Artzi, Noah Snavely, Hadar Averbuch-Elor ICCV 2021

Project Page | Paper

Quick Start

1. Request access to the Who's Waldo dataset.

2. Create a new conda environment

conda create --name whos-waldo
conda activate whos-waldo
pip install -r requirements.txt
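After installing, it can help to confirm that everything listed in requirements.txt actually landed in the active environment. The following is a minimal sketch and not part of the repository: the helper names `parse_requirement_names` and `missing_packages` are hypothetical, and it assumes a plain requirements.txt with one named requirement per line (no URLs or editable installs).

```python
import re
from importlib import metadata


def parse_requirement_names(lines):
    """Return package names from requirements-style lines (hypothetical helper)."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r / -e
            continue
        # The name is everything before the first version specifier, extra, or marker.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that have no installed distribution in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_packages(parse_requirement_names(open("requirements.txt")))` inside the activated environment should return an empty list if the install succeeded.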

3. Data preprocessing

Run the following preprocessing scripts in the environment created above. First generate annotations:

python preprocess/ --output {annotation-output-dir}

Process textual information for each split:

python preprocess/ --ann {annotation-output-dir} --output {txtdb-name} --split {split}

Process visual information for each split:

python preprocess/ --output {imgdb-name} --split {split}

Note that you will need to extract features for the images before creating the imgdb. We used this repo for feature extraction, but you may find this PyTorch re-implementation easier to use.
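Because the text and image steps are repeated for every split, the commands can be generated in a loop. This is a minimal sketch under stated assumptions: the script paths are passed in as arguments (they are not spelled out above, so none are filled in here), and the split names and the `{prefix}_{split}` database naming are hypothetical conventions, not the repository's.

```python
import shlex


def build_preprocess_commands(txt_script, img_script, ann_dir,
                              txtdb_prefix, imgdb_prefix, splits):
    """Build one text and one image preprocessing command per split.

    All arguments are caller-supplied; the naming scheme is an assumption.
    """
    commands = []
    for split in splits:
        # Textual information: needs the annotation directory from the first step.
        commands.append(["python", txt_script, "--ann", ann_dir,
                         "--output", f"{txtdb_prefix}_{split}", "--split", split])
        # Visual information: requires image features to be extracted beforehand.
        commands.append(["python", img_script,
                         "--output", f"{imgdb_prefix}_{split}", "--split", split])
    return commands


def format_commands(commands):
    """Render each command as a shell-safe string for logging or copy-pasting."""
    return [shlex.join(cmd) for cmd in commands]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once feature extraction is complete; `format_commands` is only for inspecting what would be run.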

4. Set up Docker container

run with the appropriate paths for each argument.

5. Training

Create a training config file as config/train-whos-waldo.json. Inside the container, run

python --config {path to training config}

6. Inference (evaluation and visualizations)

Inside the container, run


with the appropriate arguments which can be found in


We provide a datasheet for our dataset here.


The images in our dataset are provided by Wikimedia Commons under various free licenses. These licenses permit the use, study, derivation, and redistribution of these images, sometimes with restrictions such as required attribution or copyleft. We provide source links, full license text, and attribution (when available) for all images, make no modifications to any image, and release these images under their original licenses. The associated captions are provided as a part of unstructured text in Wikimedia Commons, with rights to the original writers under the CC BY-SA 3.0 license. We modify these (as specified in our paper) and release such derivatives under the same license. We provide the rest of our dataset (i.e. detections, coreferences, and ground truth correspondences) under a CC BY-NC-SA 4.0 license. We provide our code under an MIT license.


    author    = {Cui, Yuqing and Khandelwal, Apoorv and Artzi, Yoav and Snavely, Noah and Averbuch-Elor, Hadar},
    title     = {Who's Waldo? Linking People Across Text and Images},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {1374-1384}


Our code is based on the implementation of UNITER.

