
Densely Captioned Images

This repo contains the code required to use the Densely Captioned Images dataset, as well as the complete reproduction of the paper A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions.

To use the dataset, for both training and evaluation, see the Dataset section. For reproduction, which covers data collection, evaluation against other benchmarks, model code, and more, see the Reproduction section.

Note: The first time you run any of our scripts, you will be prompted for standard data-saving locations. Using the defaults should let our download scripts work as expected; if you already have data saved or have specific target locations in mind, this is where you can override them. The resulting config files are placed in dataset/config.yaml and reproduction/config.yaml.

If you use this dataset in your work, please cite the original paper:

@misc{urbanek2023picture,
      title={A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions}, 
      author={Jack Urbanek and Florian Bordes and Pietro Astolfi and Mary Williamson and Vasu Sharma and Adriana Romero-Soriano},
      year={2023},
      eprint={2312.08578},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Dataset

Details

The Densely Captioned Images dataset, or DCI, consists of 7805 images from SA-1B, each with a complete description aiming to capture the full visual detail of what is present in the image. Much of the description is directly aligned to submasks of the image.

Water Pump example from Densely Captioned Images

An example is shown above. In the top left we see the full image of a water pump, with an associated description. The italicized section was collected as a 'standard caption', aiming to summarize the full image in about a sentence. The remainder of that first description contains details about the relationships between visible entities in the image, as well as in-depth descriptions of regions that are not covered by the submasks. All other text describing the image is associated with submasks of the image. Each submask has its own label (not pictured) and description, and may also contain further submasks. Here, for instance, we see submasks for windows and balconies contained within the submask capturing three buildings in the background.

Setup

We suggest setting up in a conda environment.

conda create -n densecaps python=3.10

Then navigate to the dataset directory and install the package:

cd dataset
pip install -e .

You can then download the dataset and our reported best models with our download script.

python dataset/densely_captioned_images/dataset/scripts/download.py 

Or download them manually from the following URLs:

RESOURCES = {
    'densely_captioned_images': {
        'url': 'https://dl.fbaipublicfiles.com/densely_captioned_images/dci.tar.gz',
        'check': '9caff10cb6324c801d9020638f49925f04de87d897ca8614c599f3c43bef3aeb',
    },
    'dci_pick1': {
        'url': 'https://dl.fbaipublicfiles.com/densely_captioned_images/dci_pick1.tar.gz',
        'check': '225b9e1c88a00a2dd62cd5b0e23d4133efbbffe8d28b6de796ba894db1a2aa6a',
    },
    'dci_pick1_nl0': {
        'url': 'https://dl.fbaipublicfiles.com/densely_captioned_images/dci_pick1_nl0.tar.gz',
        'check': '2943daaba2489492c274284631223e02933cc9436db201c23abe5939ed01d446',
    },
}
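
If you download the archives manually, you can check them against the values above. A minimal sketch, assuming the check fields are SHA-256 hex digests and that the archive sits in your working directory:

import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Compute the SHA-256 hex digest of a file, reading in chunks.
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Example: verify the main annotation archive (path is an assumption).
expected = '9caff10cb6324c801d9020638f49925f04de87d897ca8614c599f3c43bef3aeb'
assert sha256_of('dci.tar.gz') == expected, 'checksum mismatch'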

Once you have extracted the dci.tar.gz archive, which contains the annotations, you should download the images after accepting the license agreement on the SA-1B dataset page. You do not need to download the entire SA-1B dataset; you only need the archive sa_000138.tar, whose images should be extracted into the photos folder.
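
A minimal sketch of that extraction step, assuming sa_000138.tar is in your working directory and that the photos folder lives under the default DCI data location (both paths are assumptions; adjust to your configured locations):

import tarfile
from pathlib import Path

archive = Path('sa_000138.tar')  # downloaded from the SA-1B page
photos_dir = Path('data/densely_captioned_images/photos')  # assumed target; match your config

photos_dir.mkdir(parents=True, exist_ok=True)
with tarfile.open(archive) as tar:
    # SA-1B archives also ship per-image JSON metadata; keep only the images.
    members = [m for m in tar.getmembers() if m.name.endswith('.jpg')]
    tar.extractall(path=photos_dir, members=members)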

Data

CLIP-ready

We provide easy data-loader utilities for the CLIP-ready version of the Densely Captioned Images dataset, wherein all captions have LLaMA2-generated summaries and negatives that fit within the CLIP context limit.

from densely_captioned_images.dataset.impl import get_clip_ready_ds, DenseCaptionedDataset
train_ds: DenseCaptionedDataset = get_clip_ready_ds('train')
valid_ds: DenseCaptionedDataset = get_clip_ready_ds('valid')
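
As a quick sanity check, you can inspect these datasets directly; a minimal sketch, assuming DenseCaptionedDataset follows the standard PyTorch map-style Dataset interface (length and integer indexing):

from densely_captioned_images.dataset.impl import get_clip_ready_ds

train_ds = get_clip_ready_ds('train')
print(f"{len(train_ds)} training examples")

# Peek at the first example to see which fields it carries.
first = train_ds[0]
print(type(first), first)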

Full

You can preview data from the DCI dataset by running the explorer script:

pip install flask
python explorer/run_server.py <port>

This serves the complete data available from within the DenseCaptionedImage class at http://localhost:<port>:

View of DCI explorer

In this view, you can navigate to an image by index or by specific ID. The [prev] and [next] buttons can be used to step through images directly. On screen is the complete description of the image, as well as all submasks. Hovering over words in the text highlights the corresponding mask in the image, as is done here for "Top most green tree leaves".

The DenseCaptionedImage class acts as a wrapper around the stored data json, which has the following format:

{
    "image": "relative-image-path.jpg",
    "short_caption": "A standard short-form caption for the image",
    "mask_data": {
      "[mask_key]": {
        "idx": "[mask_key]", # Self-reference into mapping
        "outer_mask": "iVBORw0KGgoAAAANSUhE.....", # base64 encoding of the binary mask for this segment
        "mask_quality": 0, # one of 0, 1, or 2 for "ok", "low-quality/uninteresting", or "bad" respectively
        "label": "A short label for the given mask", # omitted if "bad" quality
        "caption": "A long descriptive caption for this given mask", # only for "ok" masks
        "parent": "other_mask_key", # either the parent mask id in the tree, or -1 if parent is the base image
        "requirements": ["list", "of", "children", "masks"] # mask IDs for children masks
        "bounds": [[0, 0], [500, 500]] # TopLeft & BottomRight coords of mask bounds
        "area": 123, # mask size in pixels 
      },
      # ...
    },
    "mask_keys": ["list", "of", "mask_keys", "into", "mask_data"],
    "extra_caption": "Additional long form caption that may catch additional information about layout or from from missing masks",
    "summaries": {
        "base": ["list", "of", "generated", "summaries"],
        # ...
        "[mask_key]": ["list", "of", "generated", "summaries"],
        # ...
    },
    "negatives": {
        # ...
        "[mask_key]": {
            # ...
            "[negative_type]": ["list", "of", "negatives", "generated", "of", "type"],
            # ...
        },
        # ...
    }
}
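
As a concrete example of working with this format, the sketch below decodes one submask into a binary array. The annotation path is an assumption (point it at one of your extracted annotation files), and the outer_mask payload is treated as a base64-encoded PNG, which the iVBORw0KGgo prefix suggests:

import base64
import io
import json

import numpy as np
from PIL import Image

# Path is an assumption; point this at one of the extracted annotation files.
with open('data/densely_captioned_images/annotations/example.json') as f:
    entry = json.load(f)

mask_key = entry['mask_keys'][0]
mask_info = entry['mask_data'][mask_key]

# Decode the base64-encoded PNG into a boolean mask array.
png_bytes = base64.b64decode(mask_info['outer_mask'])
mask = np.array(Image.open(io.BytesIO(png_bytes))) > 0

print(mask_info.get('label'), mask.shape, mask.sum(), 'pixels')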

(CLIP-ready) Densely Captioned Images Test set

The Densely Captioned Images test set comes in a few variations:

  • All Submasks: Pulls images and all subimages from the test set, and uses their first captions. Key: all_swaps
  • All Submasks Pick 5: Pulls images and all subimages from the test set, and uses their first 5 captions. Key: all_swaps_pick5
  • Base: Only pulls the 112 base images from the test set, alongside their first captions. Key: base_swaps
  • Hardest: Uses the same image set as all_swaps, but for each caption selects the hardest of the generated negatives based on CLIP score. Key: all_hardest

All tests report both the CLIP-correct (correct caption prediction compared to the rest of the batch) and Negative-correct (correct caption prediction compared to a generated negative) metrics.
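
As a conceptual sketch of those two metrics (not the repo's own evaluation code), assuming you have L2-normalized CLIP image and caption embeddings for a batch, plus embeddings for each caption's generated negative:

import torch

def clip_and_negative_correct(image_emb, caption_emb, negative_emb):
    # All inputs are [batch, dim] and L2-normalized.
    # CLIP-correct: the paired caption scores highest among all captions in the batch.
    sims = image_emb @ caption_emb.T  # [batch, batch]
    clip_correct = (sims.argmax(dim=1) == torch.arange(len(sims))).float().mean()

    # Negative-correct: the paired caption scores higher than its generated negative.
    pos_scores = (image_emb * caption_emb).sum(dim=1)
    neg_scores = (image_emb * negative_emb).sum(dim=1)
    negative_correct = (pos_scores > neg_scores).float().mean()
    return clip_correct.item(), negative_correct.item()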

Usage

You can directly reproduce the DCI results for CLIP with the following:

python dataset/densely_captioned_images/dataset/scripts/run_clip_dense_cap_eval.py 

You can also reproduce our results on the provided DCI-trained models by running the following in a Python shell from the project root.

from densely_captioned_images.dataset.scripts.run_clip_dense_cap_eval import run_dense_cap_on_lora
run_dense_cap_on_lora('models/dci_pick1')
run_dense_cap_on_lora('models/dci_pick1_nl0')

As long as your model can be wrapped in CLIPModel, you can use the run_dense_cap_on_model function instead to test your own models.
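
As an illustrative sketch only: assuming run_dense_cap_on_model lives in the same module as run_dense_cap_on_lora and accepts a Hugging Face CLIPModel (check the actual signature in the repo before running):

from transformers import CLIPModel, CLIPProcessor
from densely_captioned_images.dataset.scripts.run_clip_dense_cap_eval import (
    run_dense_cap_on_model,
)

# Any CLIP-style model that fits the CLIPModel interface should work.
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

# Illustrative call; the real function may expect different arguments.
run_dense_cap_on_model(model, processor)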

Reproduction

This section contains information about reproducing the full set of results from the paper, including training and eval sweeps across all of our datasets.

Setup

Be sure to complete the dataset setup above first, as it is a prerequisite. Then, from the root of this repository:

cd reproduction
pip install -e .
bash ./clone_dependents.sh

This script clones the dependent repos that we run evaluations from, and also applies the patches that we required to get them running.

Dataset Download

We provide dataset download scripts that install the expected datasets to the default locations. These may not work if you changed the default locations, or if the upstream resources change, but they are a good starting point. Run them from the root directory:

bash reproduction/densely_captioned_images/repro/setup_data/download_vl_checklist.sh
python reproduction/densely_captioned_images/repro/setup_data/get_aro.py

Collection

The complete collection flow can be found in reproduction/crowdsourcing, which contains its own, more complete documentation.

Training

The bulk of the training implementation is contained in the ClipAndNegTrainer class in densely_captioned_images.repro.train.trainer. This includes loss computation for both the single-caption and multi-caption cases. Usage can be observed in densely_captioned_images.repro.train.train_clip.

Example sweep scripts can be found in reproduction/densely_captioned_images/repro/train/sweeps, and wrappers for the COCO and Localized Narratives datasets are present there as well.

Evaluation

The main entry point for evaluations is reproduction/densely_captioned_images/repro/eval/run_full_evals.py, and examples can be seen at reproduction/densely_captioned_images/repro/eval/sweeps.

License

DCI code and data are CC-BY-NC licensed, as found in the LICENSE file.
