Textual supervision for visually grounded spoken language understanding

This repository contains the instructions and code to reproduce the results from the paper:

Higy, B., Elliott, D., & Chrupała, G. (2020). Textual Supervision for Visually Grounded Spoken Language Understanding. Findings of the Association for Computational Linguistics: EMNLP 2020, Punta Cana, Dominican Republic.

Prerequisites

  • Download the Flickr8K dataset as well as the Flickr Audio Caption Corpus. Extract the audio captions into the same folder as the rest of the dataset.

  • Update the parameter flickr8k_root in the file config.yml to point to the location of the dataset. Make ~/.platalea/config.yml point to this file by running the following from the root of the repository (a consolidated setup sketch is given after this list):

      mkdir -p ~/.platalea
      ln -s "$PWD/config.yml" ~/.platalea/config.yml

  • Copy the files dataset.json and dataset_multilingual_human.json to the root folder of the Flickr8K dataset.

  • Install the requirements:

pip install -r requirements.txt
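
For concreteness, the setup steps above can be chained as in the sketch below. This is only an illustration: the dataset location /data/flickr8k and the archive name flickr_audio.tar.gz are assumptions, and the sed call assumes flickr8k_root appears as a top-level key in config.yml. Adapt the paths to your machine and run the commands from the root of this repository.

# Assumed dataset root; replace with the actual location of Flickr8K on your machine.
FLICKR8K_ROOT=/data/flickr8k
# Extract the audio captions (archive name assumed) into the same folder as the rest of the dataset.
tar xzf flickr_audio.tar.gz -C "$FLICKR8K_ROOT"
# Point flickr8k_root in config.yml at the dataset and register the file with platalea.
sed -i "s|^flickr8k_root:.*|flickr8k_root: $FLICKR8K_ROOT|" config.yml
mkdir -p ~/.platalea
ln -s "$PWD/config.yml" ~/.platalea/config.yml
# Copy the pre-processed metadata files to the dataset root and install the dependencies.
cp dataset.json dataset_multilingual_human.json "$FLICKR8K_ROOT"
pip install -r requirements.txt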

Training the models

The different models presented in the paper can be trained by running the script run.sh:

./run.sh
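
Training all of the models can take a long time. As a convenience (not part of the original instructions), the script can also be run in the background with its output captured to a log file:

# Keep training running after logout and record progress in run.log; follow it with `tail -f run.log`.
nohup ./run.sh > run.log 2>&1 &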

Reproducing the figures and tables from the paper

Figure 3 can be reproduced by running:

python -c "import results; results.plot_figure_3()"

Results presented in Tables 1-3 and 6-10 can be extracted by running:

python -c "import results; results.print_table_1()"
python -c "import results; results.print_table_2()"
python -c "import results; results.print_table_3()"
python -c "import results; results.print_table_6()"
python -c "import results; results.print_table_7()"
python -c "import results; results.print_table_8()"
python -c "import results; results.print_table_9()"
python -c "import results; results.print_table_10()"

Credits

The files dataset.json and dataset_multilingual_human.json are based on pre-processed information extracted from Flickr8K, the Flickr8K Audio Caption Corpus and F30kEnt-JP. While they are provided here for convenience, credit goes to the authors of the original datasets.

  • Flickr8K:

Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research, 47, 853–899. https://doi.org/10.1613/jair.3994

  • Flickr8K Audio Caption Corpus:

Harwath, D., & Glass, J. (2015). Deep multimodal semantic embeddings for speech and images. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 237–244. https://doi.org/10.1109/ASRU.2015.7404800

  • F30kEnt-JP:

Nakayama, H., Tamura, A., & Ninomiya, T. (2020). A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking. Proceedings of The 12th Language Resources and Evaluation Conference, 4197–4203. https://www.aclweb.org/anthology/2020.lrec-1.518