Gold standard Universal Dependencies corpus for Ukrainian, originally developed for UD by the Institute for Ukrainian, NGO. [in Ukrainian]
UD Ukrainian comprises 122K tokens in 7000 sentences of fiction, news, opinion articles, Wikipedia, legal documents, letters, posts, and comments — from the last 15 years, as well as from the first half of the 20th century.
Consider using the latest version on the ‘dev’ branch on GitHub. It contains the latest stable improvements, while the official releases are up to 6 months old [discussion].
Major contributors: Natalia Kotsyba, Bohdan Moskalevskyi, Mykhailo Romanenko.
A large portion of the annotation was made by Halyna Samoridna, Ivanka Kosovska, Olha Lytvyn, Oksana Orlenko and by students of the Kyiv-Mohyla Academy department of Ukrainian language (headed by Liudmyla Dyka): Hanna Brovko, Bohdana Matushko, Natalia Onyshchuk, Valeriia Pareviazko, Yaroslava Rychyk, Anastasiia Stetsenko, Snizhana Umanets.
We thank Prof. Larysa Masenko for guidance.
Project homepage (in Ukrainian)
You can also browse the entire treebank in Brat.
See stats.xml for details.
Morphology is annotated using a 2+1 schema. Syntax is annotated in a single pass plus a supervisor’s check. Consistency is further enforced by ~300 validation and autofix rules (see warnings page) and by investigating errors made by a trained parser.
Data is split between train/dev/test linearly by hand at 75%/10%/15% to balance genre and complexity. Some large documents are divided across datasets.
UD Ukrainian data conforms to CoNLL-U format with the following specifics:
- Sentence-level comments:
  - Document boundaries as # newdoc id = ....
  - Sentence-level paragraph boundaries as # newpar id = ....
  - Document titles as # doc_title = ....
  - Document authors as # author = ....
  - Document sources as # source = ....
  - Czech-like translit is present as # translit = ....
  - Gaps in the text are marked on the sentences following the gap as:
    - # annotation_gap for sentences not exported to CoNLL-U because the annotator was unable to parse them with confidence (e.g. new guidelines need to be created);
    - # gap for intentional gaps in texts (selected fragments).
- XPOSTAG column contains MTE tag with U for punctuation. UPOS+FEATS contain all the information in XPOSTAG and more. XPOSTAG is intended for legacy applications.
- DEPS column contains Enhanced Dependencies.
- MISC column:
  - Token-level paragraph boundaries as
  - Token ids as
  - SpaceAfter=No markers are present.
  - Form (Translit) and lemma (LTranslit) transliterations are present.
  - The pipe (|) character is escaped with \p. Backslash is \\. See issue #569.
- Document, paragraph, sentence, and token ids are 4-character base-32 numbers. They survive treebank updates.
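As a rough illustration of the conventions above, the following Python sketch reads sentence-level comments into a dictionary and undoes the MISC escaping (\p for the pipe, \\ for a backslash). The function names and the sample values are made up for this example and are not part of the treebank tooling; flag-style comments without a value are assumed to be stored as True.

```python
def parse_comments(lines):
    """Collect '# key = value' sentence-level comments into a dict."""
    meta = {}
    for line in lines:
        if not line.startswith("#"):
            continue  # token lines are ignored here
        body = line[1:].strip()
        key, sep, value = body.partition("=")
        # Assumption: comments without '=' (e.g. '# gap') are plain flags.
        meta[key.strip()] = value.strip() if sep else True
    return meta


def unescape_misc_value(value):
    """Turn backslash-p into a pipe and a doubled backslash into one backslash."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            if nxt == "p":
                out.append("|")
                i += 2
                continue
            if nxt == "\\":
                out.append("\\")
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


def parse_misc(field):
    """Split a MISC field on '|' and unescape each value."""
    if field == "_":
        return {}
    result = {}
    for item in field.split("|"):
        key, sep, value = item.partition("=")
        result[key] = unescape_misc_value(value) if sep else True
    return result


if __name__ == "__main__":
    # Hypothetical comment lines and MISC field, not taken from the treebank.
    print(parse_comments(["# newdoc id = 0a1b", "# doc_title = Example", "# gap"]))
    print(parse_misc("SpaceAfter=No|Translit=a\\pb"))
```

Splitting on the raw pipe first is safe only because a literal pipe inside a value is always written as \p, which is the point of the escaping rule.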
- Empty (null) nodes for elided predicates. Elided predicates are manually reconstructed with word forms and full morphological info. Coverage: only ~200 instances done.
- Propagation of incoming dependencies to conjuncts. Propagated automatically. For heterogeneous conjuncts, a relation guesser is employed. Coverage: full.
- Propagation of outgoing dependencies from conjuncts. Dependents of first conjuncts are propagated only if they are manually marked as shared. Coverage: ~75% of the sentences.
- Additional subject relations for control and raising constructions. All xcomp subjects are annotated manually as csubj:x. Subjects of xcomp:pred (secondary predication) are csubj:pred. The latter are also used for the subjects of advcl:pred (see #476). Coverage: full.
- Coreference in relative clause constructions. All relative clauses are manually annotated with enhanced dependencies. This includes all types mentioned in the universal docs plus Ukrainian clauses that use personal pronouns as relativizers: вузол, що його не переріжеш “the-knot, that it.Acc not you-can-cut”. Coverage: full.
- Case information. We don’t case-mark relation names because this doesn’t bring any new information [discussion].
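To work with the enhanced dependencies listed above, a consumer has to read the DEPS column and cope with empty nodes. The Python sketch below assumes the standard CoNLL-U conventions (ten tab-separated columns, DEPS as head:relation pairs joined by pipes, empty nodes with decimal ids such as 5.1). The function names and the sample rows are invented for illustration and do not come from the treebank.

```python
def parse_deps(field):
    """Parse a DEPS field such as '2:obj|3:conj' into (head, relation) pairs."""
    if field == "_":
        return []
    pairs = []
    for item in field.split("|"):
        # Split on the first colon only: relations may contain colons
        # themselves (subtypes), and heads may be empty-node ids like '5.1'.
        head, _, rel = item.partition(":")
        pairs.append((head, rel))
    return pairs


def is_empty_node(token_id):
    """Empty (null) nodes carry decimal ids, e.g. '5.1'."""
    return "." in token_id


def enhanced_edges(sentence_lines):
    """Return (head, relation, dependent) triples for one sentence."""
    edges = []
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or "-" in cols[0]:
            continue  # skip malformed rows and multiword-token ranges
        for head, rel in parse_deps(cols[8]):  # DEPS is the ninth column
            edges.append((head, rel, cols[0]))
    return edges


if __name__ == "__main__":
    # Made-up rows: token 4 is a conjunct of token 3 and also receives
    # the incoming 'obj' relation propagated from the first conjunct.
    rows = [
        "3\tx\tx\t_\t_\t_\t2\tobj\t2:obj\t_",
        "4\ty\ty\t_\t_\t_\t3\tconj\t2:obj|3:conj\t_",
    ]
    for edge in enhanced_edges(rows):
        print(edge)
```

Heads are kept as strings rather than integers so that references to empty nodes survive unchanged.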
Data files are built from sources at mova-institute/zoloto, where the actual development happens.
The data is licensed under CC BY-NC-SA 4.0 and is free for non-commercial use. For a commercial license, please contact us at email@example.com.
2022-11-15 v2.9 (upcoming)
- Reanalyzed large numerals like thousand, million, and above. See the discussion.
- Brought back :sp relation subtypes to
- Fixed errors.
- Added sentences.
- Closed many annotation gaps: 116K→122K.
- Fixed annotation errors.
- Shared more dependents of a first conjunct.
- Improved consistency by extending annotation guidelines to rarer phenomena.
- Switched from
nsubj:xsubjis a phantom object.
- Made clauses with
- Escaped the pipe (|) character in
\\ is now a backslash.
- Added all types of enhanced dependencies except for case-marking, see Enhanced Dependencies section.
- Closed many annotation gaps and added new texts: 100K→115K.
- Fixed ~450 annotation errors including його/її/їх
- Improved consistency by extending annotation guidelines to many rarer phenomena.
- Introduced multitokens for ні́кого, ні́де etc.
- Split words with fused пів- numerals (e.g. півкласу) to multitokens.
:pass subtype from relations as it currently can be inferred from the morphology.
- Added transliteration.
- Fixed missing
- Updated readme with more description, links.
- Renamed the repository from UD_Ukrainian to UD_Ukrainian-IU to match the new UD naming convention.
- Fixed some validation errors.
- Added a couple of new sentences.
Orth=Khark feature renamed to
- Quadrupled the amount of data up to 100K, mostly with nonfiction; improved consistency.
- Resplit train/dev/test.
- Replaced v1.4 data with 25K tokens of misc genres, mostly fiction.
- An initial experimental release containing 1.6K tokens of grammar examples and fiction.
=== Machine-readable metadata =================================================
Data available since: UD v1.4
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: blog email fiction grammar-examples legal news reviews social web wiki
Lemmas: manual native
UPOS: manual native
XPOS: manual native
Features: manual native
Relations: manual native
Contributors: Kotsyba, Natalia; Moskalevskyi, Bohdan; Romanenko, Mykhailo
Contributing: elsewhere
Contact: firstname.lastname@example.org
===============================================================================