Summary

Gold standard Universal Dependencies corpus for Ukrainian, originally developed for UD by the Institute for Ukrainian, an NGO.
[in Ukrainian]

Introduction

UD Ukrainian comprises 115K tokens in 6800 sentences of fiction, news, opinion articles, Wikipedia, legal documents, letters, posts, and comments — from the last 15 years, as well as from the first half of the 20th century.

Consider using the latest version from the ‘dev’ branch on GitHub. It contains the latest stable improvements, while the official releases can be up to 6 months old [discussion].

Acknowledgments

Major contributors: Natalia Kotsyba, Bohdan Moskalevskyi, Mykhailo Romanenko.

A large portion of the annotation was done by Halyna Samoridna, Ivanka Kosovska, Olha Lytvyn, Oksana Orlenko, and by students of the Kyiv-Mohyla Academy Department of Ukrainian Language (headed by Liudmyla Dyka): Hanna Brovko, Bohdana Matushko, Natalia Onyshchuk, Valeriia Pareviazko, Yaroslava Rychyk, Anastasiia Stetsenko, Snizhana Umanets.

We thank Prof. Larysa Masenko for guidance.

Documentation

Project homepage (in Ukrainian)

Search

You can also browse the entire treebank in Brat.

Stats

set      sentences   ~tokens
train         5290       88K
dev            647       12K
test           864       16K
TOTAL         6801      116K

See stats.xml for details.
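
To reproduce counts like these from a data file, a minimal sketch follows. It assumes standard CoNLL-U conventions (blank line between sentences, `#` comment lines, `1-2` ranges for multitokens, `1.1` ids for empty nodes) and counts syntactic words; this README does not state whether its token figures count words or surface tokens, so small differences are possible. The function name and the sample rows are illustrative, not taken from the treebank.

```python
def count_sentences_and_words(conllu_text):
    """Return (sentences, words) for a CoNLL-U string.

    Multitoken range lines (ids like "1-2") and empty nodes (ids like "1.1")
    are skipped, so only syntactic words are counted.
    """
    sentences = 0
    words = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            in_sentence = False  # a blank line closes the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        token_id = line.split("\t", 1)[0]
        if "-" in token_id or "." in token_id:
            continue  # multitoken range or empty node
        if not in_sentence:
            sentences += 1
            in_sentence = True
        words += 1
    return sentences, words


def row(*columns):
    """Join columns with tabs to form one CoNLL-U token line."""
    return "\t".join(columns)


# Made-up sample with placeholder forms; not real treebank data.
SAMPLE = "\n".join([
    "# sent_id = demo1",
    row("1-2", "ab", "_", "_", "_", "_", "_", "_", "_", "_"),
    row("1", "a", "a", "X", "_", "_", "0", "root", "_", "_"),
    row("2", "b", "b", "X", "_", "_", "1", "dep", "_", "_"),
    "",
    "# sent_id = demo2",
    row("1", "c", "c", "X", "_", "_", "0", "root", "_", "_"),
    row("1.1", "d", "d", "X", "_", "_", "_", "_", "_", "_"),
    "",
])

print(count_sentences_and_words(SAMPLE))
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to the function.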

Annotation procedure

Morphology is annotated using a 2+1 schema. Syntax is annotated in a single pass, followed by a supervisor’s check. Consistency is further enforced by ~300 validation and autofix rules (see warnings page) and by investigating errors made by a trained parser.

Data split

Data is split into train/dev/test linearly by hand at 75%/10%/15%, balancing genre and complexity. Some large documents are divided across datasets.
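
As a quick arithmetic check, the actual shares can be derived from the figures in the Stats section. The sketch below uses only those published counts; the variable and function names are illustrative. The shares come out close to, but not exactly at, the nominal 75%/10%/15%, which is expected for a hand-made split that also balances genre and complexity.

```python
# Counts copied from the Stats section of this README.
SENTENCES = {"train": 5290, "dev": 647, "test": 864}
TOKENS_K = {"train": 88, "dev": 12, "test": 16}  # approximate, in thousands


def shares(counts):
    """Return each set's share of the total, in percent, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


print(shares(SENTENCES))
print(shares(TOKENS_K))
```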

Format

UD Ukrainian data conforms to CoNLL-U format with the following specifics:

  • Sentence-level comments:
    • Document boundaries are present as # newdoc id = xxxx.
    • Sentence-level paragraph boundaries are present as # newpar id = xxxx.
    • Document titles are present as # doc_title = Назва.
    • Czech-like translit is present as # translit = ….
    • Gaps in the text are marked on the sentences following the gap as:
      • # annotation_gap for sentences not exported to CoNLL-U because the annotator was unable to parse them with confidence (e.g. new guidelines need to be created);
      • # gap for intentional gaps in texts (selected fragments).
  • XPOSTAG column contains MTE tag with U for punctuation. UPOS+FEATS contain all the information in XPOSTAG and more. XPOSTAG is intended for legacy applications.
  • DEPS column contains Enhanced Dependencies.
  • MISC column:
    • Token-level paragraph boundaries are present as NewPar=Yes.
    • Token ids are present as Id=xxxx.
    • SpaceAfter=No markers are present.
    • Form (Translit) and lemma (LTranslit) transliterations are present, except for token Id=1mnf, see issue #569.
  • Document, paragraph, sentence, and token ids are 4-character base-32 numbers. They survive treebank updates.
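
The comment and MISC conventions listed above can be read with a few lines of code. This is a minimal sketch, not part of the treebank tooling: the function names are hypothetical, and the id `0abc` and the transliteration values in the example are placeholders rather than real treebank data.

```python
def parse_comment(line):
    """Split a '# key = value' comment into (key, value).

    Bare flags such as '# gap' or '# annotation_gap' get a value of None.
    """
    body = line.lstrip("#").strip()
    if "=" in body:
        key, value = body.split("=", 1)
        return key.strip(), value.strip()
    return body, None


def parse_misc(misc):
    """Parse a MISC column such as 'Id=0abc|SpaceAfter=No' into a dict.

    A lone underscore means the column is empty.
    """
    if misc == "_":
        return {}
    result = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


# Placeholder values for illustration only.
print(parse_comment("# newdoc id = 0abc"))
print(parse_comment("# annotation_gap"))
print(parse_misc("Id=0abc|LTranslit=nazva|NewPar=Yes|Translit=Nazva"))
```

Splitting on the first `=` only keeps values that themselves contain `=` intact, which matters for free-text comments such as document titles.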

Enhanced Dependencies

  1. Ellipsis. Elided predicates are manually reconstructed with word forms and full morphological info. The treebank currently contains ~200 of them.
  2. Propagation of conjuncts. Conjoined modifiers are propagated automatically. For heterogeneous conjuncts, a relation guesser is employed. Dependents of first conjuncts are propagated only if they are manually marked as shared (40% of such annotation is done).
  3. Controlled/raised subjects. All xcomp subjects are annotated manually as nsubj:x/csubj:x. Subjects of xcomp:sp (secondary predication) are nsubj:sp/csubj:sp. The latter are also used for the subjects of advcl:sp (see #476).
  4. Relative clauses. All relative clauses are manually annotated with enhanced dependencies. This includes all types mentioned in the universal docs plus Ukrainian clauses that use personal pronouns as relativizers: вузол, що його не переріжеш “the-knot, that it.Acc not you-can-cut”.
  5. Case information. We don’t case-mark relation names because this doesn’t bring any new information [discussion].
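
To work with these annotations, the DEPS column has to be split into head/relation pairs. The sketch below assumes the standard CoNLL-U encoding (`head:relation` items joined by `|`, with `_` for an empty column); the function names and the example values are illustrative. Only the first colon separates the head from the relation, because relation names such as nsubj:x contain colons themselves, and heads stay strings because empty nodes inserted for ellipsis have ids like `5.1`.

```python
# Enhanced subject relations named in this README.
CONTROLLED_SUBJECT_RELATIONS = {"nsubj:x", "csubj:x", "nsubj:sp", "csubj:sp"}


def parse_deps(deps):
    """Parse a DEPS column into a list of (head, relation) pairs."""
    if deps == "_":
        return []
    pairs = []
    for item in deps.split("|"):
        # Split on the first colon only: relations may contain colons.
        head, _, relation = item.partition(":")
        pairs.append((head, relation))
    return pairs


def controlled_subject_heads(deps):
    """Return heads to which the token attaches as a controlled/raised subject."""
    return [
        head
        for head, relation in parse_deps(deps)
        if relation in CONTROLLED_SUBJECT_RELATIONS
    ]


# Illustrative DEPS values, not copied from the treebank.
print(parse_deps("2:nsubj|4:nsubj:x"))
print(parse_deps("5.1:obj"))
print(controlled_subject_heads("2:nsubj|4:nsubj:x"))
```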

Development

Data files are built from sources at mova-institute/zoloto, where the actual development happens.

Licensing

The data is licensed under CC BY-NC-SA 4.0 and is free for non-commercial use. For a commercial license, please contact us at org@mova.institute.

Contact

org@mova.institute

Changelog

  • 2018-11-15 v2.3

    • Added all types of enhanced dependencies except for case-marking, see Enhanced Dependencies section.
    • Closed many annotation gaps and added new texts: 100→115K.
    • Fixed ~450 annotation errors including його/її/їх PRON vs DET ambiguity.
    • Improved consistency by extending annotation guidelines to many rarer phenomena.
    • Introduced multitokens for ні́кого, ні́де etc.
    • Split words with fused пів- numerals (e.g. півкласу) to multitokens.
    • Introduced flat:abs, flat:sibl, flat:range, advmod:det, acl:adv, parataxis:rel, vocative:cl.
    • Specified acl:relcl.
    • Removed :pass subtype from relations as it currently can be inferred from the morphology.
    • Added transliteration.
    • Fixed missing # annotation_gaps.
    • Updated readme with more description, links.
  • 2018-04-15 v2.2

    • Renamed the repository from UD_Ukrainian to UD_Ukrainian-IU to match the new UD naming convention.
    • Fixed some validation errors.
    • Added a couple of new sentences.
    • Orth=Khark feature renamed to Orth=Alt.
  • 2017-11-15 v2.1

    • Quadrupled the amount of data up to 100K, mostly with nonfiction; improved consistency.
    • Re-split train/dev/test.
  • 2017-02-15 v2.0

    • Replaced v1.4 data with 25K tokens of misc genres, mostly fiction.
  • 2016-11-01 v1.4

    • An initial experimental release containing 1.6K tokens of grammar examples and fiction.

=== Machine-readable metadata =================================================
Data available since: UD v1.4
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: blog email fiction grammar-examples legal news reviews social web wiki
Lemmas: manual native
UPOS: manual native
XPOS: manual native
Features: manual native
Relations: manual native
Contributors: Kotsyba, Natalia; Moskalevskyi, Bohdan; Romanenko, Mykhailo
Contributing: elsewhere
Contact: org@mova.institute
===============================================================================