Skip to content
Go to file


Failed to load latest commit information.
Latest commit message
Commit time

ActivityNet Entities Dataset and Challenge

ActivityNet Entities Object Localization (Grounding) Challenge joins the official ActivityNet Challenge anet challenge logo this year! Looking forward to seeing you at CVPR'20!

ActivityNet Entities Object Localization Challenge 2020

ActivityNet-Entities Object Localization Task aims to evaluate how grounded or faithful a description (could be generated or ground-truth) is to the video they describe.

An object word is first identified in the description and then localized in the video in the form of a spatial bounding box. The prediction is compared against the human annotation to determine the correctness and overall localization accuracy.

Winners will be announced at ActivityNet Challenge at CVPR'20 (June 14th). Prizes 💲💲💲 include $3500 cash award ($1500 for Sub-task I and $2000 for Sub-task II), sponsored by P&G and UMich RI, and subject to changes.* Each team can have one or more members. An individual associated with multiple teams or a team with multiple accounts will be disqualified.

*A winner is valid and the monetary award will be distributed only if the winner solution is at least 5% relatively better than the baseline method GVD (under user activitynet-entities).

Important dates

Note that these dates are tentative and subject to changes if necessary.

  • March 2: Public Train/Val/Test sets are available; participants can start training.
  • April 15: Release Hidden Test set with ground truth withheld and open evaluation server.
  • May 22: Public the leaderboard.
  • May 29 June 3 (extended): Close evaluation server.
  • June 2 June 5 (extended): Deadline for submitting the report.
  • June 14: A full-day workshop at CVPR 2020.

Dataset overview

ActivityNet-Entities is based on the video description dataset ActivityNet Captions and augments it with 158k bounding box annotations, each grounding a noun phrase (NP). In this challenge, we will use pre-processed object-based annotations that link individual object words to their corresponding regions in the video. This gives 432 unique object categories.

The original dataset consists of 10k/2.5k/2.5k videos for training/validation/testing. There are 35k/8.6k/8.5k event segments & sentence descriptions and 105k/26.5k/26.1k bounding box annotations on each split. We have further collected a new hidden test set where the video descriptions are not public. This enables us to reliably evaluate both video description quality and object localization quality.

The annotation on the training and validation set is in data/anet_entities_cleaned_class_thresh50_trainval.json. data/anet_entities_skeleton.txt specifies the JSON file structure on both reference files and submission files.

dataset teaser

Challenge overview

Depending on the availability of the video description during inference, we divide the challenge into two sub-tasks:

Sub-task I: Grounding on GT Sentences (public test set). The same data as in the ANet-Entities test set, which comes from ActivityNet Captions val set. The skeleton of the file is in data/anet_entities_cleaned_class_thresh50_test_skeleton.json, where we intentionally leave out the bounding box annotation for official evaluation purposes. GT sentences are provided.

Sub-task II: Grounding on Generated Sentences (hidden test set). GT sentences are NOT provided and hence both user sentence prediction and grounding prediction are required for evaluation. The skeleton of the file is in data/anet_entities_cleaned_class_thresh50_hidden_test_skeleton.json, where no video description nor bounding box is provided.

Regarding the format of the bounding box annotation, we first uniformly sample 10 frames from each event segment and sparsely locate objects from the description in only one of the frames where the object can be clearly observed.

Pre-extracted features

Here are download links to region features on the public Train/Val/Test splits and the hidden Test split. The region coordinates for all splits are here.

Note that feature files are saved as individual *.npy files for legacy reasons. Consider merging them into batched *.h5 files (say 10-100) to speed up the data loading.

Evaluation metrics

Due to the sparsity of the annotation, we request all participants to submit the localization results on all object categories appearing in the target sentence on all 10 sampled frames. Only the prediction at the same frame as the GT annotation will be assessed and compared against the human annotation to determine the correctness (>50% IoU indicates correct and otherwise incorrect). Localization accuracy is computed per object category and then averaged by the number of unique object categories.

The evaluation metric used in Sub-task I is Localization Accuracy (with 50% IoU threshold). We have benchmarked the baseline methods below (averaged over three runs).

Method Localization Accuracy
GVD 43.45%

The evaluation metrics used in Sub-task II include F1_all, F1_loc, F1_all_per_sent, and F1_loc_per_sent (all with 50% IoU threshold). We have benchmarked the baseline methods below (averaged over three runs).

Method F1_all_per_sent F1_loc_per_sent F1_all F1_loc
GVD 17.26% 58.57% 7.46% 22.88%

(Important!) F1_all, and F1_loc are proposed as the official metrics in Zhou et al. CVPR 2019. In F1_all, a region prediction is considered correct if the object word is correctly predicted and also correctly localized. In F1_loc, we only consider correctly-predicted object words, i.e., language generation error (e.g., hallucinated objects) is ignored. However, as both metrics average accuracies over object categories, they emphasize unproportionally on description diversity, as object classes never predicted will have zero accuracy and reduce overall metric numbers. In fact, in the baseline method GVD, only about half of the object categories are predicted (that’s why we see a low F1_all score). Hacks such as increasing accuracies on the long-tail classes will significantly improve the metric scores which is not exactly the purpose of the challenge. We therefore adopt two additional metrics for this sub-task, F1_all_per_sent and F1_loc_per_sent, where the accuracies are averaged over all the sentences. Note that none of the four metrics are perfect, but at least give us a holistic view of the system performance. More details on evaluation metrics are in Sec. 5.1 and A.2 in Zhou et al. CVPR 2019.

To determine the winner, we adopt the highest score on Localization Accuracy on Sub-task I and the highest score on F1_all_per_sent on Sub-task II.

A visual demonstration of F1_all and F1_loc is shown below. The evaluation script used in our evaluation server will be identical to scripts/ Read the following section for more details.

f1 scores

Evaluation servers

For Sub-task I - GT Captions:

For Sub-task II - Generated Captions:

Please follow the example in data/anet_entities_skeleton.txt to format your submission file.

General Dataset Info

This repo hosts the dataset and evaluation scripts used in our paper Grounded Video Description (GVD). We also released the source code of GVD in this repo.

ActivityNet-Entities, is based on the video description dataset ActivityNet Captions and augments it with 158k bounding box annotations, each grounding a noun phrase (NP). Here we release the complete set of NP-based annotations as well as the pre-processed object-based annotations.


We have the following dataset files under the data directory.

  • anet_entities_skeleton.txt: Specify the expected structure of the JSON annotation files.
  • split_ids_anet_entities.json: Video IDs included in the training/validation/public testing/hidden testing splits.
  • anet_entities_cleaned_class_thresh50_trainval.json: Pre-processed dataset file with object class and bounding box annotations. For training and validation splits only.
  • anet_entities_cleaned_class_thresh50_test_skeleton.json: Object class annotation for the public testing split. This file is for evaluation server purpose and no bounding box annotation is given.
  • anet_entities_cleaned_class_thresh50_hidden_test_skeleton.json: Object class annotation for the hidden testing split. This file is for evaluation server purpose and no description nor bounding box annotation is given.
  • anet_entities_trainval.json: The raw dataset file with noun phrase and bounding box annotations. We only release the training and the validation splits for now.

Note: Both the raw dataset file and the pre-processed dataset file contain all the 12469 videos in our training and validation split (training + one half of the validation split as in ActivityNet Captions, which is based on ActivityNet 1.3). This includes 626 videos without box annotations.


Under the scripts directory, we include:

  • The evaluation script for object grounding on GT/generated captions. PyTorch (tested on 1.1 and 1.3), Stanford CoreNLP 3.9.1 and the Python wrapper are required.
  • The preprocessing scripts to obtain the NP/object annotation files.
  •, The scripts that print the dataset stats.

To evaluate attention/grounding output based upon GT sentences (metrics in paper: Attn., Grd.), run:

python scripts/ -s YOUR_SUBMISSION_FILE.JSON --eval_mode GT

To evaluate attention (same for grounding) output based upon generated sentences (metrics in paper: F1all, F1loc), similarly run:

python scripts/ -s YOUR_SUBMISSION_FILE.JSON --eval_mode gen --loc_mode $loc_mode

where setting loc_mode=all to perform evaluation on all object words while setting loc_mode=loc to perform evaluation only on correctly-predicted object words.


  1. How are the 10 frames sampled from each video clip (event)?

    We divide each clip evenly into 10 segments and sample the middle frame of each segment. We have clarified this in the skeleton file.

  2. How can I sample the frames by myself and extract feature?

    First, you may want to check if the object region feature and RGB/motion frame-wise feature we provided meet your requirement. If not, you can first download the ActivityNet videos using this web crawler or contact the dataset owners for help. An incorrect video encoding format would result in a wrong frame resolution and aspect ratio, and therefore a mismatch in the annotation. Hence, make sure you download the videos in the best mp4 format. Note that the web crawler is not appliable to the hidden test set as its video IDs are not YouTube IDs. Contact Luowei for details. Once you have the videos, you can use ffmpeg to extract the frames. We provide an example command here.

Additional resources

Ma et al. "Learning to Generate Grounded Visual Captions without Localization Supervision." arXiv 2019. paper code


Please contact if you have any trouble running the code. Please cite the following paper if you use the dataset.

  title={Grounded Video Description},
  author={Zhou, Luowei and Kalantidis, Yannis and Chen, Xinlei and Corso, Jason J and Rohrbach, Marcus},


We thank Chih-Yao Ma for his helpful discussions and contribution to the code on evaluation metrics. We also thank Elliot Klein for his contribution on the annotation collection interface.


This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

The noun phrases in these annotations are based on ActivityNet Captions, which are linked to videos in ActivityNet 1.3


A Dataset for Grounded Video Description




No releases published


No packages published

Contributors 4



You can’t perform that action at this time.