Repository for the ManyNames dataset (version 2.3) for English and Mandarin Chinese. The English version of ManyNames provides ca. 31 name annotations for each of 25K objects in images selected from VisualGenome, whereas the Chinese version provides ca. 20 name annotations for 1319 objects in images selected from ManyNames. For an illustration, see the images below.
For details about the data collection process and the information encoded in the dataset, see Silberer, Zarrieß, & Boleda, 2020 (version 1.0) and Silberer, Zarrieß, Westera, & Boleda, 2020 (version 2.0), as well as He, Liao, Liang, & Boleda, 2023 (Mandarin Chinese version). For changes in the present version see the release notes. Previous versions of the dataset can be accessed as older releases in this repository.
The data are provided in two formats:
- TSV: tab-separated text files; the first row contains the column labels, and nested data are stored as Python dictionaries (i.e., "{key: value}"). Available in this folder.
- JSON: the same dataset in .json format to facilitate access (to the nested data) outside of Python. Available in subfolder other-data.
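
Because the TSV files store nested data as Python dictionary literals, those columns need to be parsed after reading. The following is a minimal sketch, not the loader shipped with this repository: the helper name `read_manynames_tsv` and the two sample rows are hypothetical, and the sketch assumes the dictionary cells are valid Python literals.

```python
import ast
import csv
import io

# Hypothetical two-row excerpt standing in for a ManyNames TSV file;
# the values are invented for illustration only.
SAMPLE_TSV = (
    "vg_object_id\ttopname\tresponses\n"
    "1001\tdog\t{'dog': 30, 'puppy': 5}\n"
    "1002\tcar\t{'car': 28, 'vehicle': 3, 'sedan': 2}\n"
)


def read_manynames_tsv(handle, nested_columns=("responses",)):
    """Read tab-separated rows and parse the dictionary-valued columns."""
    rows = []
    # QUOTE_NONE keeps quote characters inside the dictionary literals intact.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for col in nested_columns:
            if row.get(col):
                # literal_eval parses Python literals without executing code.
                row[col] = ast.literal_eval(row[col])
        rows.append(row)
    return rows


rows = read_manynames_tsv(io.StringIO(SAMPLE_TSV))
print(rows[0]["topname"], rows[0]["responses"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` object and list every dictionary-valued column in `nested_columns`. Columns that are not parsed (for example `vg_object_id`) remain strings in this sketch.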
The columns that are included for both the English and Mandarin Chinese datasets are labelled as follows.
| Column | Type | Description |
|---|---|---|
| vg_object_id | int | The VisualGenome ID of the object (functions as the unique ID for the datapoints in ManyNames) |
| topname | str | The most frequent name produced for the object in the ManyNames data collection |
| responses | dict | Correct responses and their counts |
| domain | str | The ManyNames domain of the object, categorizing objects into people, animals_plants, vehicles, food, home, buildings, and clothing |
| N | int | The number of correct name types in the ManyNames responses (each name counts once) |
| total_responses | int | Sum count of correct responses (tokens; each subject production of a name counts once) |
| perc_top | float | The relative frequency of the topname (among correct responses), as a percentage |
| H | float | The H agreement measure from Snodgrass and Vanderwart (1980), which is the entropy over subject responses |
| link_mn | str | The url to the image with the object framed (the original VG image contains no frame) |
| vg_bbox_xywh | list | The coordinates of the object: "[left x, bottom y, width, height]"; y=0 is at the top of the image |
| vg_image_id | int | The VisualGenome ID of the image (is also unique) |
| vg_obj_name | str | The VG name of the object |
| vg_synset | str | The WordNet synset of the object, as provided by VisualGenome |
| vg_domain | str | The ManyNames domain of the VG name, which may be a superset of its WordNet category (encoded in column vg_cat of file other-data/additional-info-en.tsv). Example: The ManyNames domain food subsumes the WordNet categories food, solid food, and food, nutrient |
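
The columns topname, N, total_responses, perc_top, and H can all be derived from the responses dictionary. The sketch below illustrates the column definitions above; it is not the script used to build the dataset. The function name `naming_stats` and the example counts are hypothetical, the entropy is assumed to use a base-2 logarithm, and ties for the most frequent name are broken by dictionary order, which may differ from the released data.

```python
import math


def naming_stats(responses):
    """Derive summary statistics from a {name: count} dictionary."""
    total = sum(responses.values())
    # Most frequent name; ties resolve to the first key (an assumption).
    topname = max(responses, key=responses.get)
    # Entropy over response proportions: H = sum_i p_i * log2(1 / p_i).
    h = sum((c / total) * math.log2(total / c) for c in responses.values())
    return {
        "topname": topname,
        "N": len(responses),            # number of name types
        "total_responses": total,       # number of name tokens
        "perc_top": 100.0 * responses[topname] / total,
        "H": h,
    }


# Hypothetical counts, not taken from the dataset.
stats = naming_stats({"dog": 6, "puppy": 2})
print(stats)
```

With these example counts, perc_top is 75.0 and H is about 0.81. An object for which all subjects produce the same name has H = 0.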
The English dataset also includes the columns listed below.
| Column | Type | Description |
|---|---|---|
| incorrect | dict | Incorrect responses and their counts |
| typicality | dict | Correct responses and their image-text similarity score, calculated using BLIP2 (Li, Li, Savarese & Hoi, 2023) |
| synsets | dict | Correct responses and their WordNet synset, assigned automatically |
| informativeness | dict | Synsets of correct responses and their WordNet information content ratings |
| most_informative_synset_image | str | Synset with the highest WordNet information content rating |
| split | str | Use of the image in training vs. test vs. validation in Silberer, Zarrieß, Westera, & Boleda, 2020 |
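
As an illustration of how the dictionary-valued English columns relate to each other, the sketch below ranks names by their typicality score and selects the synset with the highest information content from an informativeness dictionary. The function names, synset labels, and scores are hypothetical; the sketch only assumes that both columns map keys to numeric scores, as described in the table above.

```python
def rank_by_typicality(typicality):
    """Return names sorted from highest to lowest typicality score."""
    return sorted(typicality, key=typicality.get, reverse=True)


def most_informative_synset(informativeness):
    """Return the synset with the highest information content rating."""
    return max(informativeness, key=informativeness.get)


# Hypothetical values, not taken from the dataset.
typicality = {"dog": 0.41, "puppy": 0.47, "animal": 0.22}
informativeness = {"dog.n.01": 7.2, "puppy.n.01": 9.8, "animal.n.01": 3.1}

print(rank_by_typicality(typicality))
print(most_informative_synset(informativeness))
```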
The Mandarin Chinese dataset also includes the columns listed below.
| Column | Type | Description |
|---|---|---|
| list | str | Lists of images assigned to participants |
| familiarity | float | Familiarity, approximated as the weighted average of the frequency of the responses in a textual corpus |
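
One possible reading of "weighted average of the frequency of the responses" is sketched below. This is an assumption-laden illustration, not the procedure from He, Liao, Liang, & Boleda (2023): it assumes that the weights are the response counts, that names missing from the corpus table are skipped, and that the frequency values are used as given (whether raw or log-transformed frequencies are used is not stated here). The function name and all values are hypothetical.

```python
def weighted_familiarity(responses, corpus_frequency):
    """Average corpus frequency of names, weighted by response counts.

    Names without a corpus frequency are ignored (an assumption).
    Returns None when no name has a known frequency.
    """
    known = {n: c for n, c in responses.items() if n in corpus_frequency}
    total = sum(known.values())
    if total == 0:
        return None
    return sum(c * corpus_frequency[n] for n, c in known.items()) / total


# Hypothetical counts and frequency values, not taken from the dataset.
responses = {"狗": 3, "小狗": 1}
corpus_frequency = {"狗": 4.0, "小狗": 2.0}
print(weighted_familiarity(responses, corpus_frequency))
```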
Note: A subset of the ManyNames data has also been annotated for Catalan within the AINA project. It is available here.
Contains the ManyNames datasets in JSON format, some files with additional information about the English ManyNames dataset (including per-subject responses), and also files with lexical information (concreteness, familiarity, imageability, age of acquisition, corpus frequency, context diversity) for each name in ManyNames, in both languages. See the README inside the folder for more information.
Contains scripts to facilitate processing ManyNames and to reproduce tables and figures from publications about ManyNames. See the README inside the folder for more information.
- For any use of ManyNames:
Silberer, C., S. Zarrieß, G. Boleda. 2020. Object Naming in Language and Vision: A Survey and a New Dataset. Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), 5792-5801.
@inproceedings{silberer2020manynames,
title = {{Object Naming in Language and Vision: A Survey and a New Dataset}},
author = {Silberer, Carina and Zarrie{\ss}, Sina and Boleda, Gemma},
booktitle = {Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)},
year = {2020},
url = {https://aclanthology.org/2020.lrec-1.710/},
pages = "5792--5801"
}
- In addition, if you refer to anything specific to version 2:
Silberer, C., S. Zarrieß, M. Westera, G. Boleda. 2020. Humans meet models on object naming: A new dataset and analysis. Proceedings of the 28th International Conference on Computational Linguistics, 1893-1905.
@inproceedings{silberer-etal-2020-humans,
title = "Humans Meet Models on Object Naming: A New Dataset and Analysis",
author = "Silberer, Carina and Zarrie{\ss}, Sina and Westera, Matthijs and Boleda, Gemma",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
year = "2020",
url = "https://aclanthology.org/2020.coling-main.172",
doi = "10.18653/v1/2020.coling-main.172",
pages = "1893--1905"
}
- In addition, if you use the data for Mandarin Chinese:
He, Y., X. Liao, J. Liang, G. Boleda. 2023. The Impact of Familiarity on Naming Variation: A Study on Object Naming in Mandarin Chinese. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), 456-475.
@inproceedings{he-etal-2023-impact,
title = "The Impact of Familiarity on Naming Variation: A Study on Object Naming in {M}andarin {C}hinese",
author = "He, Yunke and Liao, Xixian and Liang, Jialing and Boleda, Gemma",
booktitle = "Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)",
year = "2023",
url = "https://aclanthology.org/2023.conll-1.30",
doi = "10.18653/v1/2023.conll-1.30",
pages = "456--475"
}
(For more information about versions 2.1 onward, see the release notes.)
- version 2.3: Revised English singletons after identifying issues in v2.2; added WordNet synset-related information and BLIP-2 typicality scores for English names; added anonymized subject IDs for English data; enhanced lexical data for English and Mandarin Chinese.
- version 2.2: Added all singleton responses for English (i.e., responses given only once) following a manual correction procedure; added Mandarin Chinese names for a subset of the data; added lexical information for names.
- version 2.1.1: Added bounding box coordinates for ManyNames image versions. Updated image links to new domain: manynames.upf.edu.
- version 2.1: Corrections to topname and domain definitions; inclusion of some singleton responses.
- version 2.0: Integration of name verification data (for details, see Silberer, Zarrieß, Westera, & Boleda, 2020).
- version 1.0: Initial release (for details, see Silberer, Zarrieß, & Boleda, 2020).
ManyNames is licensed under a Creative Commons Attribution 4.0 International License, and based on VisualGenome at visualgenome.org.
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 715154).
