aalto-speech/dbca

Distribution-based compositionality assessment of natural language corpora

Experiments using the distribution-based compositionality assessment (DBCA) framework to split natural language corpora into training and test sets in such a way that the test set requires systematic compositional generalisation capacity.

This repository contains the experiments described in two papers, [1] and [2], both of which use the DBCA framework introduced by Keysers et al. (2020).

Instructions

The experiments consist of the following steps:

  1. tag a corpus of sentences
    • [1] uses a morphological tagger
    • [2] uses a dependency parser
  2. define the atoms and compounds
    • atoms can be, for example, lemmas and tags
    • compounds are combinations of atoms
  3. create matrices that encode the number of atoms and compounds in each sentence
  4. divide the corpus into training and test sets using the greedy algorithm
  5. evaluate NLP models on splits with different compound divergence values
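The greedy split in step 4 controls how differently compounds are distributed across the training and test sets. In the DBCA framework this divergence is one minus a Chernoff coefficient between the two normalised frequency distributions, with α ≈ 0.1 for compounds and α = 0.5 for atoms. The sketch below illustrates the measure on invented toy data (the function name and the example compounds are not from this repository):

```python
from collections import Counter

def chernoff_divergence(p: Counter, q: Counter, alpha: float) -> float:
    """1 - Chernoff coefficient between two normalised frequency
    distributions. An alpha close to 0 weights q heavily, so test-set
    compounds unseen in training push the divergence towards 1."""
    p_total = sum(p.values())
    q_total = sum(q.values())
    coeff = sum(
        (p[k] / p_total) ** alpha * (q[k] / q_total) ** (1 - alpha)
        for k in set(p) & set(q)  # terms outside the overlap are zero
    )
    return 1.0 - coeff

# Toy "corpus": each split is reduced to a bag of compound counts.
train = Counter({("dog", "nsubj"): 3, ("cat", "obj"): 2})
test_similar = Counter({("dog", "nsubj"): 2, ("cat", "obj"): 1})
test_disjoint = Counter({("horse", "obl"): 3})

print(chernoff_divergence(train, test_similar, 0.1))   # close to 0
print(chernoff_divergence(train, test_disjoint, 0.1))  # exactly 1.0
```

The greedy algorithm then adds sentences one at a time to the split that moves the compound divergence towards a chosen target value while keeping the atom divergence low.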

Dependencies

Experiments in [1]: generalising to novel morphological forms

  • run-nodalida2023.sh includes the commands to run the experiments in [1]
  • exp/subset-d-1m/data contains the 1M sentence pair dataset
  • exp/subset-d-1m/splits/*/*/*/ids_{train,test_full}.txt.gz contain the data splits with different compound divergences and different random initialisations
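The split files above store one sentence id per line, gzip-compressed, and can be read with the standard library alone. A minimal sketch (the helper name and the synthetic demo file are illustrative; real files sit under the split directories listed above):

```python
import gzip
import os
import tempfile

def read_split_ids(path: str) -> list[str]:
    """Read one sentence id per line from a gzipped split file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demo on a synthetic file; real files follow the pattern
# exp/subset-d-1m/splits/*/*/*/ids_{train,test_full}.txt.gz
tmp = os.path.join(tempfile.mkdtemp(), "ids_train.txt.gz")
with gzip.open(tmp, "wt", encoding="utf-8") as f:
    f.write("sent-001\nsent-002\n")

print(read_split_ids(tmp))  # ['sent-001', 'sent-002']
```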

Experiments in [2]: generalising to novel dependency relations
