Summary

The Belarusian UD treebank is based on a sample of the news texts included in the Belarusian-Russian parallel subcorpus of the Russian National Corpus (RNC), which can be searched online at http://ruscorpora.ru/search-para-be.html.

Tokenization

The low-level tokenization of the Belarusian UD treebank generally adopts the RNC standard.

  • In general, tokens are delimited by whitespace. The regexp [А-zА-яЁёУўі-]+ usually corresponds to one token.
  • Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
  • Each punctuation mark is treated as a single token, e.g. the sequence )", - becomes four tokens { ) , " , , , - }. Exceptions are conventional multi-character punctuation marks ( -- , ... , ?! , etc.) and emojis and smileys ( :) , ^_^ , etc.).
  • Conventional non-Cyrillic multi-character terms are tokenized as single tokens: °С, км2.
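
The basic rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used to build the treebank: the function names and the `MULTI_CHAR` list are hypothetical, the list holds only the marks named above, and a hyphen is kept inside a token only when it stands between two letters or digits. The special cases described below (decimal numbers, abbreviations, URLs, and so on) are not handled here.

```python
import unicodedata

# Multi-character marks kept whole (illustrative subset, not an official list).
MULTI_CHAR = ["...", "--", "?!", ":)", "^_^"]


def is_punct(ch):
    """True if the character has a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def is_inner_hyphen(chunk, i):
    """True for a hyphen that stands between two letters or digits."""
    return (
        chunk[i] == "-"
        and 0 < i < len(chunk) - 1
        and chunk[i - 1].isalnum()
        and chunk[i + 1].isalnum()
    )


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into word and punctuation tokens."""
    tokens = []
    start = 0  # start of the pending word-like run
    i = 0
    while i < len(chunk):
        multi = next((m for m in MULTI_CHAR if chunk.startswith(m, i)), None)
        if multi is not None:
            piece = multi
        elif is_punct(chunk[i]) and not is_inner_hyphen(chunk, i):
            piece = chunk[i]
        else:
            i += 1
            continue
        if start < i:
            tokens.append(chunk[start:i])  # flush the word before the mark
        tokens.append(piece)
        i += len(piece)
        start = i
    if start < len(chunk):
        tokens.append(chunk[start:])
    return tokens


def tokenize_basic(text):
    """Whitespace split first, then punctuation separation inside each chunk."""
    return [tok for chunk in text.split() for tok in split_chunk(chunk)]
```

Under these assumptions, `tokenize_basic('(з-за)", -')` separates the brackets, the quotation mark, the comma, and the free-standing hyphen, while the hyphen inside з-за stays part of the word.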

Some special cases worth mentioning:

  • Numerical expressions including decimal numbers, such as 245, 3,14, are treated as single tokens.
  • Time expressions like 20:55 are split into separate tokens (in this case, three: { 20 , : , 55 }).
  • Dates like 20.04.2012 are split into separate tokens (in this case, five: { 20 , . , 04 , . , 2012 }).
  • Special symbols before and after numerical expressions, as in $500 , 2,67% , +27°С , are tokenized separately (so the tokens are { $ , 500 } , { 2,67 , % } , { + , 27 , °С }).
  • Numerical expressions with a hyphen and Cyrillic endings (e.g. 1-ый “1st”, 3-м “3rd.Ins”), as well as adjectives and other non-numerals that contain digits (e.g. 79-гадовы “79 year old”, 500-годдзе “500th anniversary”), are treated as single tokens.
  • Other words with a hyphen are treated as single tokens, except for cases when the first part is inflected. Examples: { з-за } “because of”, { зялёна-шэрых } “green-gray”, { Санкт-Пецярбург } “St. Petersburg”, but { Ростове , - , на , - , Дону } “(in) Rostov on Don”.
  • Abbreviations are treated as single tokens; whitespace splits an abbreviation into separate tokens.
  • Abbreviations marked by a period, as in стр. “p. (page)”, П. “P. (for Peter)”, are treated as single tokens. If the abbreviation period coincides with the end-of-sentence period, it is written once as a separate token (denoting the end of the sentence), e.g. { 1914 , г , . } “year 1914”.
  • Abbreviations cannot contain a period inside, i.e. patterns like і т.д. “and so on”, да т.п. “and so forth” are split into three tokens: { і , т. , д. }, { да , т. , п. }.
  • Email addresses, URLs, and tweet-style names are treated as single tokens: {no@mail.ru}, {https://github.com}, {@anna_li}
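
Several of the special cases above can be approximated with a single ordered regular expression. The sketch below is a hedged illustration, not the treebank's actual tokenizer: `TOKEN_RE` and `tokenize_special_cases` are hypothetical names, and the patterns are assumptions chosen to reproduce the examples in this section. It does not cover abbreviations with periods, hyphenated words with an inflected first part (both need lexical or morphological knowledge), or terms such as км2. URLs are matched up to the next whitespace, so punctuation directly after a URL would be absorbed.

```python
import re

# Alternatives are tried in order, so more specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                      # URLs (up to the next whitespace)
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email addresses
  | @\w+                              # tweet-style names
  | \d+-[^\W\d_]+                     # digits, hyphen, letters: 79-гадовы
  | \d+,\d+                           # decimal numbers with a comma: 3,14
  | \d+                               # plain integers
  | °[CС]                             # degree sign plus Latin or Cyrillic C
  | [^\W\d_]+(?:-[^\W\d_]+)*          # words, including hyphenated ones
  | \S                                # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize_special_cases(text):
    """Return the tokens matched by TOKEN_RE, in order of appearance."""
    return TOKEN_RE.findall(text)
```

For instance, `tokenize_special_cases("20.04.2012")` yields five tokens because no pattern spans the periods, while `tokenize_special_cases("2,67%")` keeps the decimal number whole and separates the percent sign.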

The Belarusian UD treebank does not contain multiword tokens.

Morphology

The morphological annotation is adopted from the Russian-SynTagRus UD guidelines and is mostly compliant with the RNC morphological standard (except for "second" cases, comp2, and imper2, which were converted to the "primary" tags, and transitivity tags, which were removed). Lemmas and features were annotated manually.

Syntax

The data were labeled semi-automatically using annotation projection from Russian. For that purpose, the Russian data were annotated using UDPipe, converted into UD 2.0, and then checked manually. In 2020, a UDPipe model for Belarusian trained on a 1 mln corpus was also used. Belarusian dependency relations and labels were checked manually.

Texts

The source texts are the following:

  1. short news articles originally written in Belarusian and/or Russian and published by the telegraf.by online agency.
    Document list: http://search2.ruscorpora.ru/search.xml?env=alpha&text=meta&sort=gr_tagging&lang=ru&doc_g_number_lang=&doc_te_author=&mode=para&doc_te_header=*&author=&doc_g_birthday=&doc_l_birthday=&doc_g_created=&doc_l_created=&doc_te_translator=&doc_lang=bel&doc_lang_trans=rus&doc_g_date_date_trans=&doc_l_date_date_trans=&doc_sphere=%EF%F3%E1%EB%E8%F6%E8%F1%F2%E8%EA%E0
  2. short news articles published by http://zviazda.by/.
    Document list:
  3. fiction: short stories and poetry from Belaruskaja palichka (knihi.com)
    Authors:
      • Francishak Bahushevich
      • Janka Kupala
      • Maksim Harecki
      • Vasil' Bykov
      • Ivan Mielez
  4. a short excerpt from The Lord of the Rings by J. R. R. Tolkien, translated into Belarusian
  5. social media: messages from the Telegram channels:
  6. Belarusian Wikipedia
  7. nonfiction

Acknowledgments

We thank Uladzimir Koshchanka (Уладзімір Кошчанка, koshul@gmail.com) for providing part of the source texts, Anna Sherbakova (aniezka.sherbakova@gmail.com) for checking the POS and feature labels in two texts, Alyaxey Yaskevich and Katya Niamkovich for comments and suggestions, and Boris Orekhov for helpful scripts.

References

  • Shishkina, Yana & Olga Lyashevskaya. Sculpting enhanced dependencies for Belarusian. In: Analysis of Images, Social Networks and Texts: 10th International Conference, AIST 2021, Tbilisi, Georgia, December 16–18, 2021, Revised Selected Papers.

Changelog

  • 2023-01-15 v2.12
    • XPOSes corrected, minor updates
  • 2021-01-05 v2.8
    • New texts added (genre: wiki, nonfiction)
    • lemma, upos, feat, head, deprel manually corrected
  • 2020-11-01 v2.6
    • New texts added (genre: news, social media, fiction, poetry).
  • 2019-01-05 v2.4
    • Constructions with parataxis, appos, ccomp, xcomp, advcl, acl, nmod, and passive and depictive constructions manually fixed.
    • UPOS, FEAT manually fixed.
    • Lemmas of PROPN uppercased.
    • New texts added (genre: legal, nonfiction, fiction).
  • 2018-04-15 v2.2
    • Repository renamed from UD_Belarusian to UD_Belarusian-HSE.
  • 2017-11-15 v2.1
    • Flat / appos fixed.
    • New texts added.
  • 2017-03-01 v2.0
    • Initial UD release.
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.0
License: CC BY-SA 4.0
Includes text: yes
Genre: fiction legal news nonfiction social poetry wiki
Lemmas: manual native
UPOS: manual native
XPOS: manual native
Features: manual native
Relations: manual native
Contributors: Lyashevskaya, Olga; Peljak-Łapińska, Angelika; Petrova, Daria; Shishkina, Yana
Contributing: elsewhere
Contact: olesar@yandex.ru
===============================================================================