Summary

Gold-standard Universal Dependencies corpus for Ukrainian, originally developed for UD by the Institute for Ukrainian, NGO. [in Ukrainian]

Introduction

UD Ukrainian comprises 122K tokens in 7000 sentences of fiction, news, opinion articles, Wikipedia, legal documents, letters, posts, and comments — from the last 15 years, as well as from the first half of the 20th century.

Consider using the latest version on the ‘dev’ branch on GitHub. It contains the latest stable improvements, while the official releases can be up to 6 months old [discussion].

Acknowledgments

Major contributors: Natalia Kotsyba, Bohdan Moskalevskyi, Mykhailo Romanenko.

A large portion of the annotation was done by Halyna Samoridna, Ivanka Kosovska, Olha Lytvyn, Oksana Orlenko and by students of Kyiv-Mohyla Academy department of Ukrainian language (headed by Liudmyla Dyka): Hanna Brovko, Bohdana Matushko, Natalia Onyshchuk, Valeriia Pareviazko, Yaroslava Rychyk, Anastasiia Stetsenko, Snizhana Umanets.

We thank Prof. Larysa Masenko for guidance.

Documentation

Project homepage (in Ukrainian)

Search

You can also browse the entire treebank in Brat.

Stats

set    sentences  ~tokens
train  5496       92K
dev    672        13K
test   892        17K
TOTAL  7060       122K

See stats.xml for details.

Annotation procedure

Morphology is annotated using a 2+1 scheme. Syntax is annotated in a single pass followed by a supervisor’s check. Consistency is further enforced by ~300 validation and autofix rules (see warnings page) and by investigating errors made by a trained parser.

Data split

The data is split linearly by hand into train/dev/test at 75%/10%/15%, balancing genre and complexity. Some large documents are divided across datasets.
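
As a quick sanity check, the split shares can be recomputed from the figures in the Stats table. This is only an illustrative sketch: the token counts are the rounded values from that table, so the percentages are approximate, and the names `SPLITS` and `shares` are made up for this example.

```python
# Sentence counts and rounded token counts copied from the Stats table.
SPLITS = {
    "train": {"sentences": 5496, "tokens": 92_000},
    "dev": {"sentences": 672, "tokens": 13_000},
    "test": {"sentences": 892, "tokens": 17_000},
}


def shares(splits, key):
    """Return each split's share of the total for the given key, in percent."""
    total = sum(part[key] for part in splits.values())
    return {name: round(100 * part[key] / total, 1) for name, part in splits.items()}


token_shares = shares(SPLITS, "tokens")
sentence_shares = shares(SPLITS, "sentences")
print(token_shares)
print(sentence_shares)
```

Measured in tokens, the shares come out close to the stated 75%/10%/15%; measured in sentences, the train share is somewhat higher.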

Format

UD Ukrainian data conforms to CoNLL-U format with the following specifics:

  • Sentence-level comments:
    • Document boundaries as # newdoc id = ....
    • Sentence-level paragraph boundaries as # newpar id = ....
    • Document titles as # doc_title = ....
    • Document authors as # author = ....
    • Document sources as # source = ....
    • Czech-like translit is present as # translit = ....
    • Gaps in the text are marked on the sentences following the gap as:
      • # annotation_gap for sentences not exported to CoNLL-U because the annotator was unable to parse them with confidence (e.g. new guidelines need to be created);
      • # gap for intentional gaps in texts (selected fragments).
  • XPOSTAG column contains MTE tag with U for punctuation. UPOS+FEATS contain all the information in XPOSTAG and more. XPOSTAG is intended for legacy applications.
  • DEPS column contains Enhanced Dependencies.
  • MISC column:
    • Token-level paragraph boundaries as NewPar=Yes.
    • Token ids as Id=xxxx.
    • SpaceAfter=No markers are present.
    • Form (Translit) and lemma (LTranslit) transliterations are present.
    • The pipe (|) character is escaped with \p. Backslash is \\. See issue #569.
  • Document, paragraph, sentence, and token ids are 4-character base-32 numbers. They survive treebank updates.
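
The conventions above can be read with a few lines of plain Python. The following is a minimal sketch, not part of the treebank tooling: the function names are made up for this example, and the sample values (such as the id `0abc`) are hypothetical. It splits sentence-level comments into key/value pairs and undoes the MISC escaping (`\p` for the pipe, `\\` for a backslash).

```python
import re

# Escape sequences used in MISC values: \p is a literal pipe, \\ a literal backslash.
_ESCAPES = {"p": "|", "\\": "\\"}


def parse_comment(line):
    """Split '# key = value' into (key, value); bare flags like '# gap' get value None."""
    body = line.strip()
    if not body.startswith("#"):
        return None
    key, sep, value = body[1:].strip().partition("=")
    return key.strip(), (value.strip() if sep else None)


def unescape_misc(value):
    """Undo the MISC escaping in a single left-to-right pass."""
    return re.sub(r"\\([p\\])", lambda m: _ESCAPES[m.group(1)], value)


def parse_misc(field):
    """Parse a MISC column into a dict; '_' means the column is empty."""
    if field == "_":
        return {}
    attrs = {}
    # Literal pipes are escaped as \p, so splitting on '|' is safe.
    for item in field.split("|"):
        key, sep, value = item.partition("=")
        attrs[key] = unescape_misc(value) if sep else None
    return attrs


# Hypothetical sample values, for illustration only.
print(parse_comment("# newdoc id = 0abc"))
print(parse_comment("# annotation_gap"))
print(parse_misc(r"Id=0abc|SpaceAfter=No|Translit=a\pb"))
```

Handling both escapes in one regular-expression pass matters: replacing `\\` and `\p` in two separate steps could misread a sequence such as `\\p`, which stands for a backslash followed by the letter p.
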

Enhanced Dependencies

  1. Empty (null) nodes for elided predicates. Elided predicates are manually reconstructed with word forms and full morphological info. Coverage: only ~200 instances done.
  2. Propagation of incoming dependencies to conjuncts. Propagated automatically. For heterogeneous conjuncts, a relation guesser is employed. Coverage: full.
  3. Propagation of outgoing dependencies from conjuncts. Dependents of first conjuncts are propagated only if they are manually marked as shared. Coverage: ~75% of the sentences.
  4. Additional subject relations for control and raising constructions. All xcomp subjects are annotated manually as nsubj:xsubj/csubj:x. Subjects of xcomp:pred (secondary predication) are nsubj:pred/csubj:pred. The latter are also used for the subjects of advcl:pred (see #476). Coverage: full.
  5. Coreference in relative clause constructions. All relative clauses are manually annotated with enhanced dependencies. This includes all types mentioned in the universal docs plus Ukrainian clauses that use personal pronouns as relativizers: вузол, що його не переріжеш “the-knot, that it.Acc not you-can-cut”. Coverage: full.
  6. Case information. We don’t case-mark relation names because this doesn’t bring any new information [discussion].
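
Since the enhanced relations live in the DEPS column, a small reader is enough to inspect them. The sketch below assumes only the standard CoNLL-U conventions: DEPS holds `head:relation` pairs separated by `|`, `_` marks an empty value, and empty nodes carry decimal ids such as `5.1`. The function names and the sample value are hypothetical.

```python
def parse_deps(field):
    """Parse a DEPS value such as '2:nsubj|4:nsubj:xsubj' into (head, relation) pairs."""
    if field == "_":
        return []
    pairs = []
    for item in field.split("|"):
        # Split on the first colon only: relation subtypes contain colons themselves.
        head, _, relation = item.partition(":")
        # Heads stay strings because an empty node's id (e.g. '5.1') is not an integer.
        pairs.append((head, relation))
    return pairs


def is_empty_node(token_id):
    """Empty (null) nodes have decimal ids such as '5.1' in CoNLL-U."""
    return "." in token_id


# Hypothetical DEPS value, for illustration only.
print(parse_deps("2:nsubj|4:nsubj:xsubj"))
print(is_empty_node("5.1"))
```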

Development

Data files are built from sources at mova-institute/zoloto, where the actual development happens.

Licensing

The data is licensed under CC BY-NC-SA 4.0 and is free for non-commercial use. For a commercial license, please contact us at org@mova.institute.

Contact

org@mova.institute

Changelog

  • 2022-11-15 v2.9 (upcoming)

    • Reanalyzed large numerals like thousand, million, and above. See the discussion.
    • Brought back Hyph and Bull PunctTypes.
    • Renamed :sp relation subtypes to :pred.
    • Fixed errors.
    • Added sentences.
  • 2021-05-15 v2.8

    • Converted the undocumented PunctType values Ndash, Hyph, and Bull to Dash.
  • 2019-05-15 v2.4

    • Closed many annotation gaps: 116K→122K.
    • Fixed annotation errors.
    • Shared more dependents of a first conjunct.
    • Improved consistency by extending annotation guidelines to rarer phenomena.
    • Switched from ccomp to xcomp where nsubj:xsubj is a phantom object.
    • Made clauses with ADV relativizers :relcl.
    • Added Polarity=Neg for conjunctions.
    • Escaped the pipe (|) character in MISC as \p. \\ is now a backslash.
  • 2018-11-15 v2.3

    • Added all types of enhanced dependencies except for case-marking, see Enhanced Dependencies section.
    • Closed many annotation gaps and added new texts: 100K→115K.
    • Fixed ~450 annotation errors including його/її/їх PRON vs DET ambiguity.
    • Improved consistency by extending annotation guidelines to many rarer phenomena.
    • Introduced multitokens for ні́кого, ні́де etc.
    • Split words with fused пів- numerals (e.g. півкласу) to multitokens.
    • Introduced flat:abs, flat:sibl, flat:range, advmod:det, acl:adv, parataxis:rel, vocative:cl.
    • Specified acl:relcl.
    • Removed :pass subtype from relations as it currently can be inferred from the morphology.
    • Added transliteration.
    • Fixed missing # annotation_gaps.
    • Updated readme with more description, links.
  • 2018-04-15 v2.2

    • Renamed the repository from UD_Ukrainian to UD_Ukrainian-IU to match the new UD naming convention.
    • Fixed some validation errors.
    • Added a couple of new sentences.
    • Orth=Khark feature renamed to Orth=Alt.
  • 2017-11-15 v2.1

    • Quadrupled the amount of data up to 100K, mostly with nonfiction; improved consistency.
    • Re-split train/dev/test.
  • 2017-02-15 v2.0

    • Replaced v1.4 data with 25K tokens of misc genres, mostly fiction.
  • 2016-11-01 v1.4

    • An initial experimental release containing 1.6K tokens of grammar examples and fiction.

=== Machine-readable metadata =================================================
Data available since: UD v1.4
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: blog email fiction grammar-examples legal news reviews social web wiki
Lemmas: manual native
UPOS: manual native
XPOS: manual native
Features: manual native
Relations: manual native
Contributors: Kotsyba, Natalia; Moskalevskyi, Bohdan; Romanenko, Mykhailo
Contributing: elsewhere
Contact: org@mova.institute
===============================================================================