Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
70 lines (44 sloc) 3.58 KB

Summary

Universal Dependencies treebank based on data samples extracted from Taiga Corpus and MorphoRuEval-2017 text collections.

Introduction

UD Russian Taiga has been developed at the School of Linguistics, National Research University Higher School of Economics in Moscow (HSE/Vyshka). The selection of texts is meant to represent those registers that have not been covered by UD Russian SynTagRus and UD Russian Google Stanford Dependencies, mainly e-communication (blogs and social media). The sentences are extracted from two open data collections. Taiga Corpus (https://tatianashavrina.github.io/taiga_site/) is an open-source corpus for machine learning collected by students as part of the curriculum of the MA Program in Computational Linguistics at HSE. MorphoRuEval 2017 text collections (https://github.com/dialogue-evaluation/morphoRuEval-2017) is an output of the RuEval shared task 'Evaluation of Russian NLP: Morphological analysis, http://www.dialog-21.ru/en/evaluation/2017/morphology/).

The plain text data were tokenized, lemmatized and parsed using UDpipe (http://ufal.mff.cuni.cz/udpipe) and checked manually. Corrections were made at all levels: tokenization, lemmata, pos, features, dependency relations.

Registers

  • blogs and social media: 87% main source: vk.com

  • naïve poetry: 11%  main source: stihi.ru

  • news: 1% The tiny news collection was used to train annotators and check annotation consistency with other UD treeabnks.

Data split

v.2.2:

  • training: 50% (10K tokens, 880 sentences)

  • test: 50% (10K tokens, 884 sentences)

Acknowledgments

We are grateful to all the contributors to the original open Russian data collections and especially to Tatiana Shavrina (Taiga) and Alena Fenogenova (MorphoRuEval-2017 data set).

References

  • Lyashevskaya, Olga, Kira Droganova, Daniel Zeman, Maria Alexeeva, Tatiana Gavrilova, Nina Mustafina, and Elena Shakurova. (2016). Universal Dependencies for Russian: a New Syntactic Dependencies Tagset. In: Series: Linguistics, WP BRP 44/LNG/2016. Moscow.

  • Sorokin, Andrey, Tatiana Shavrina, Olga Lyashevskaya, Victor Bocharov, Svetlana Alexeeva, Kira Droganova, Alena Fenogenova, and Dmitry Granovsky. (2017). MorphoRuEval-2017: an Evaluation Track for the Automatic Morphological Analysis Methods for Russian. In Computational Linguistics and Intellectual Technologies, Proceedings of Dialog 2017, Moscow. No 16 (23). Vol. 1, 297-313.

  • Lyashevskaya, Olga, Victor Bocharov, Alexey Sorokin, Tatiana Shavrina, Dmitry Granovsky, and Svetlana Alexeeva. (2017). Text collections for evaluation of Russian morphological taggers. Jazykovedny Casopis, 68 (2), 2017: 258-267.

  • Shavrina, Tatiana, Olga Shapovalova. (2017) To the methodology of corpus construction for machine learning: «Taiga» syntax tree corpus and parser. In Proceedings of the International Conference "CORPORA 2017", Saint-Petersbourg, Russia.

Changelog

  • 2018-07-01 v2.2
    • First official release.
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.2
License: CC BY-SA 4.0
Includes text: yes
Genre: blog news poetry social
Lemmas: manual native
UPOS: manual native
XPOS: manual native
Features: manual native
Relations: manual native
Contributors: Lyashevskaya, Olga; Rudina, Olga
Contributing: elsewhere
Contact: olesar@yandex.ru
===============================================================================