aalto-speech/dbca

Distribution-based compositionality assessment of natural language corpora

Experiments using the distribution-based compositionality assessment (DBCA) framework to split natural language corpora into training and test sets in such a way that the test set requires systematic compositional generalisation capacity.

This repository contains the experiments described in two papers, [1] and [2], both of which use the DBCA framework introduced by Keysers et al. (2020).

Instructions

The experiments consist of the following steps:

  1. tag a corpus of sentences
    • [1] uses a morphological tagger
    • [2] uses a dependency parser
  2. define the atoms and compounds
    • atoms can be, for example, lemmas and tags
    • compounds are combinations of atoms
  3. create matrices that encode the number of atoms and compounds in each sentence
  4. divide the corpus into training and test sets using the greedy algorithm
  5. evaluate NLP models on splits with different compound divergence values
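The greedy split in step 4 controls how differently compounds are distributed across the training and test sets. In the DBCA framework this divergence is one minus a Chernoff coefficient between the two normalised frequency distributions, with α ≈ 0.1 for compounds and α = 0.5 for atoms. The sketch below illustrates the measure on invented toy data (the function name and the example compounds are not from this repository):

```python
from collections import Counter

def chernoff_divergence(p: Counter, q: Counter, alpha: float) -> float:
    """1 - Chernoff coefficient between two normalised frequency
    distributions. An alpha close to 0 weights q heavily, so test-set
    compounds unseen in training push the divergence towards 1."""
    p_total = sum(p.values())
    q_total = sum(q.values())
    coeff = sum(
        (p[k] / p_total) ** alpha * (q[k] / q_total) ** (1 - alpha)
        for k in set(p) & set(q)  # terms outside the overlap are zero
    )
    return 1.0 - coeff

# Toy "corpus": each split is reduced to a bag of compound counts.
train = Counter({("dog", "nsubj"): 3, ("cat", "obj"): 2})
test_similar = Counter({("dog", "nsubj"): 2, ("cat", "obj"): 1})
test_disjoint = Counter({("horse", "obl"): 3})

print(chernoff_divergence(train, test_similar, 0.1))   # close to 0
print(chernoff_divergence(train, test_disjoint, 0.1))  # exactly 1.0
```

The greedy algorithm then adds sentences one at a time to the split that moves the compound divergence towards a chosen target value while keeping the atom divergence low.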

Dependencies

Experiments in [1]: generalising to novel morphological forms

  • run-nodalida2023.sh includes the commands to run the experiments in [1]
  • exp/subset-d-1m/data contains the 1M sentence pair dataset
  • exp/subset-d-1m/splits/*/*/*/ids_{train,test_full}.txt.gz contain the data splits with different compound divergences and different random initialisations
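The split files above store one sentence id per line, gzip-compressed, and can be read with the standard library alone. A minimal sketch (the helper name and the synthetic demo file are illustrative; real files sit under the split directories listed above):

```python
import gzip
import os
import tempfile

def read_split_ids(path: str) -> list[str]:
    """Read one sentence id per line from a gzipped split file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demo on a synthetic file; real files follow the pattern
# exp/subset-d-1m/splits/*/*/*/ids_{train,test_full}.txt.gz
tmp = os.path.join(tempfile.mkdtemp(), "ids_train.txt.gz")
with gzip.open(tmp, "wt", encoding="utf-8") as f:
    f.write("sent-001\nsent-002\n")

print(read_split_ids(tmp))  # ['sent-001', 'sent-002']
```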

Experiments in [2]: generalising to novel dependency relations
