This repository contains code and data for the following paper:
@InProceedings{W18-6550,
author = "van Miltenburg, Emiel
and Elliott, Desmond
and Vossen, Piek",
title = "Talking about other people: an endless range of possibilities",
booktitle = "Proceedings of the 11th International Conference on Natural Language Generation",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "415--420",
location = "Tilburg University, The Netherlands",
url = "http://aclweb.org/anthology/W18-6550"
}
The exact code and data used for this paper (as of this commit) are captured as a release.
There are three folders:
- `Flickr30K` contains all code and data for the categorization of person labels in Flickr30k-Entities.
- `VisualGenome` contains code and data for the categorization of attributes from Visual Genome.
- `Other` contains some additional functions to compute relevant statistics.
The code has been tested with the following software. Results should not differ with other versions of Python or NLTK, but this is untested.
- Python 3.6.3
- nltk 3.2.2
We'll take the Flickr30K data as an example. The general logic is as follows:
- The `resources` folder contains all files with categories, stopwords, etc.
- The grammar is generated by running `python update_grammar.py`. This script takes the resources and compiles a grammar to match the labels with the categories.
- You can check the labels by running `python check_labels.py`. This script checks which labels are covered by the grammar. Labels that are covered are written to `grammatical.txt`. Ungrammatical labels are written to `ungrammatical.txt`. By reading the latter, we can identify (parts of) labels that should be categorized.
- After adding (parts of) labels to the category files in the `resources` folder, run `python update_grammar.py` again.
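
The loop above (compile the resources, check coverage, extend the category files) can be illustrated with a small standalone sketch. This is not the code from `update_grammar.py` or `check_labels.py`: it replaces the compiled grammar with a plain word-to-category lookup, and the category names, word lists, and function names are all hypothetical.

```python
# Hypothetical sketch: the category names and word lists below are invented
# for illustration and are not taken from the repository's resources folder.
CATEGORIES = {
    "GENDER": {"man", "woman", "boy", "girl"},
    "AGE": {"young", "old", "elderly"},
    "OCCUPATION": {"doctor", "chef"},
}
STOPWORDS = {"a", "an", "the"}


def build_lexicon(categories):
    """Invert the category word lists into a word -> category lookup."""
    return {word: name for name, words in categories.items() for word in words}


def categorize(label, lexicon, stopwords):
    """Return the category sequence for a label, or None if any word is unknown."""
    tags = []
    for word in label.lower().split():
        if word in stopwords:
            continue  # stopwords carry no category
        if word not in lexicon:
            return None  # one unknown word makes the whole label uncovered
        tags.append(lexicon[word])
    return tags


def split_labels(labels, lexicon, stopwords):
    """Partition labels into covered and uncovered ones."""
    covered, uncovered = [], []
    for label in labels:
        is_covered = categorize(label, lexicon, stopwords) is not None
        (covered if is_covered else uncovered).append(label)
    return covered, uncovered


lexicon = build_lexicon(CATEGORIES)
covered, uncovered = split_labels(
    ["a young man", "the elderly doctor", "a bearded chef"], lexicon, STOPWORDS
)
```

In this toy setup, "a bearded chef" ends up among the uncovered labels, which points to "bearded" as a word that still needs a category. Adding it to a word list and rebuilding the lexicon mirrors the step of editing the category files and regenerating the grammar.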
There are also two non-essential scripts.
- If you want to parse any labels, just import the `analyze_label` function from `label_parser.py`.
- Run `flickr_stats.py` to get some statistics about the original data. Specifically: total number of unique labels classified as PEOPLE; size of the subset of those labels that end in boy, girl, male, female, woman, or man.
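
The second statistic can be sketched as follows. This is a minimal illustration rather than the code in `flickr_stats.py`: the function name and example labels are hypothetical, and it assumes that "end in" means a plain string suffix, so a label such as "policeman" also counts.

```python
# Endings named in the description of flickr_stats.py.
SUFFIXES = ("boy", "girl", "male", "female", "woman", "man")


def count_gendered_endings(labels):
    """Return (number of unique labels, number of those ending in a listed suffix).

    Assumption: labels are compared case-insensitively and matched by plain
    string suffix, not by their last whitespace-separated word.
    """
    unique = {label.strip().lower() for label in labels if label.strip()}
    matching = {label for label in unique if label.endswith(SUFFIXES)}
    return len(unique), len(matching)


# Hypothetical example labels, not taken from Flickr30k-Entities.
example_labels = ["a man", "A man", "young woman", "policeman", "a crowd", "two children"]
total, subset = count_gendered_endings(example_labels)
```

Matching on the last word instead of the string suffix would exclude compounds like "policeman"; which of the two readings the original script uses is not stated here.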