Summary

Latin data from the Index Thomisticus Treebank. The data are taken from the Index Thomisticus corpus by Roberto Busa SJ, which contains the complete works of Thomas Aquinas (1225–1274; Medieval Latin) and of 61 other authors related to Thomas.

History of the Releases

The UD_Latin-ITTB dataset results from the automated conversion of the Index Thomisticus Treebank from the Prague Dependency Treebank (PDT) style into the Universal Dependencies (UD) style.

Its first version was part of HamleDT and as such used the PDT style; it was automatically converted to the UD style as part of HamleDT 3.0 in 2015. In November of that year, UD v1.2 was released and included the IT-TB for the first time, with dependency relations and morphological features almost identical to those in HamleDT 3.0 but with improved part-of-speech tagging.

The original part-of-speech classification of the Index Thomisticus is tripartite: only the inflectional behaviour of words is taken into account, so it distinguishes only between nominal inflection (adjectives, nouns, pronouns and numerals, with a subclass for verbal nominal inflection such as participles), verbal inflection, and absence of inflection (adverbs, prepositions, conjunctions, etc.). In HamleDT 3.0, all nominally inflecting words had been tagged NOUN. UD v1.2 implemented a first differentiation: separate lexicons for adjectives (corresponding to PoS ADJ or NUM), nouns (NOUN) and pronouns (PRON and DET) were obtained by means of the Latin lemmatiser LEMLAT, and unrecognized words were manually disambiguated by Berta González Saavedra and Marco Passarotti. This made it possible to tag nominally inflecting words, in later versions as well; at the same time, invariable words, previously treated generically as PART, were reanalyzed as ADV, ADP, CONJ or INTJ.
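The effect of this differentiation can be inspected by tallying the UPOS column of a CoNLL-U file. The following is a minimal sketch, not part of the treebank's own tooling: the function name `count_upos` and the three-word sample are invented for illustration, and the only thing relied on is the standard ten-column, tab-separated CoNLL-U layout with UPOS in the fourth column.

```python
from collections import Counter


def count_upos(conllu_text):
    """Count UPOS tags (4th column) over the word lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence separators) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        counts[cols[3]] += 1
    return counts


# Invented three-word sample, NOT taken from the treebank.
SAMPLE = (
    "# text = homo est animal\n"
    "1\thomo\thomo\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\test\tsum\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tanimal\tanimal\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

print(count_upos(SAMPLE))
```

To run it on real data, the contents of one of the `.conllu` files (for example `la_ittb-ud-train.conllu`, assuming it is in the working directory) can be read with `open(path, encoding="utf-8").read()` and passed to `count_upos`.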

The release of UD v2.3 brings a major update and revision of the scripts that convert the Index Thomisticus Treebank into the UD style, significantly improving the overall conversion quality in terms of deprels and subtree structures as well as part-of-speech tagging and lemmatisation. Guidelines for a common annotation style across the three current Latin UD treebanks have also been put into effect.

Acknowledgments

@article{lait-ud,
  author    = {Cecchini, Flavio Massimiliano and Passarotti, Marco and Marongiu, Paola and Zeman, Daniel},
  title     = {{Challenges in Converting the \emph{Index Thomisticus} treebank into Universal Dependencies}},
  journal   = {Proceedings of the Universal Dependencies Workshop 2018 (UDW 2018)},
  year      = {2018},
  address   = {Brussels, Belgium}
}

@article{lait,
  author    = {Passarotti, Marco and Dell’Orletta, Felice},
  title     = {Improvements in parsing the index thomisticus treebank. Revision, combination and a feature model for medieval Latin},
  journal   = {Training},
  volume    = {2},
  pages     = {61--024},
  year      = {2010}
}

Changelog

2018-11-01 v2.3

  • Book three of Summa contra gentiles now completely annotated with more than 3500 new sentences and 60000 additional tokens
  • Generic major update of the conversion script:
    • ellipsis and ExD afuns are now addressed; where heuristics might fail, warnings are issued;
    • PDT-style apposition subtrees are completely restructured for UD conversion; new relation subtype appos and composite deprel advmod:cc for appositive adverbial modifiers (scilicet);
    • tagging of proper nouns with PROPN by means of a hard-coded lexicon;
    • correct treatment of xcomp and ccomp;
    • INTJ PoS removed;
    • introduction of relation subtype advmod for verbal attributes (often corresponding to PDT afun AtvV);
    • many other minor corrections, additions and improvements.
  • Applied common guidelines for Latin treebanks. The major points:
    • adverbs always get their positive degree as lemma;
    • the part-of-speech DET has been removed, except for the proto-article ly;
    • pronouns in an attributive function receive deprel det;
    • possessive pronouns in an attributive function receive PoS ADJ and deprel amod;
    • some lemmas were harmonised to a common standard.
  • Manual corrections of annotation errors in the original data (mostly regarding co-ordinations and appositions)
  • New split of dev/test/train data: dev and test contain the first 2101+2101 sentences of the Summa contra gentiles, while train contains all the remaining ones, including the concordances of forma.
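
The split sizes mentioned in this entry can be checked by counting sentences and tokens in each file. The sketch below is an illustration under stated assumptions rather than an official script: `sentence_and_token_counts` is a hypothetical helper, the two-sentence sample is invented, and sentences are assumed to be separated by blank lines as in standard CoNLL-U.

```python
def sentence_and_token_counts(conllu_text):
    """Return (sentences, tokens) for a CoNLL-U string."""
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Comment lines (sent_id, text, ...) carry no tokens.
            continue
        token_id = line.split("\t", 1)[0]
        # Ignore multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
    return sentences, tokens


# Invented two-sentence sample, NOT taken from the treebank.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\thomo\thomo\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tcurrit\tcurro\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "1\tdeus\tdeus\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\test\tsum\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

print(sentence_and_token_counts(SAMPLE))
```

Applied to the text of `la_ittb-ud-dev.conllu` and `la_ittb-ud-test.conllu`, the first element of the result should correspond to the sentence counts stated above, provided the files match this release.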

2017-03-01 v2.0

  • Converted to UD v2 guidelines.
  • Reconsidered PRON vs. DET distinction.
  • Improved advmod vs. obl distinction.

2016-05-15 v1.3

  • Fixed adverbs that were attached as nmod; they are now attached as advmod.
  • Improved conversion of AuxY.
  • PROPN are now distinguished from NOUN.
  • Larger data: almost 2000 newly annotated sentences.
  • Manual fixes of annotation errors in the old data.
  • An exceptional one-time change of the train/dev/test data split was necessary to overcome past bad design and to reflect the evolution of the treebank. Beware: UD 1.2 dev/test data have become UD 1.3 train data and vice versa; the data split is thus not backwards compatible!

=== Machine-readable metadata =================================================
Data available since: UD v1.2
License: CC BY-NC-SA 3.0
Includes text: yes
Genre: nonfiction
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Passarotti, Marco; Zeman, Daniel; González Saavedra, Berta; Cecchini, Flavio Massimiliano
Contributing: elsewhere
Contact: zeman@ufal.mff.cuni.cz
===============================================================================
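
The metadata block above consists of simple `Key: value` lines, so it can be read into a dictionary with a few lines of code. This is a minimal sketch: `parse_metadata` is a hypothetical helper written for this illustration, not an existing UD tool, and it assumes that delimiter lines start with `===` and that each key is separated from its value by the first colon.

```python
def parse_metadata(block):
    """Parse 'Key: value' lines of a README metadata block into a dict."""
    meta = {}
    for line in block.splitlines():
        # Skip the '===' delimiter lines and anything without a colon.
        if line.startswith("===") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        meta[key.strip()] = value.strip()
    return meta


# Excerpt copied from the metadata block of this README.
BLOCK = """\
=== Machine-readable metadata =================================================
Data available since: UD v1.2
License: CC BY-NC-SA 3.0
Genre: nonfiction
Contributors: Passarotti, Marco; Zeman, Daniel; González Saavedra, Berta; Cecchini, Flavio Massimiliano
===============================================================================
"""

meta = parse_metadata(BLOCK)
# Contributors are separated by semicolons; each name is 'Surname, Given name'.
contributors = [name.strip() for name in meta["Contributors"].split(";")]
print(meta["License"])
print(contributors)
```

Splitting on the first colon only keeps values that themselves contain a colon intact.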