GramEval Data

All available data is stored in CoNLL-U format:

# newdoc
# newpar
# sent_id = 1
1	Кто	кто	PRON	_	Case=Nom	3	nsubj	_	_
2	нить	нить	NOUN	_	Animacy=Inan|Case=Nom|Gender=Fem|Number=Sing	1	appos	_	_
3	настраивал	настраивать	VERB	_	Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act	0	root	_	_
4	связку	связка	NOUN	_	Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing	3	obj	_	_
5	VisualSVN	Visualsvn	PROPN	_	Foreign=Yes	4	flat:foreign	_	_
6	Server	Server	PROPN	_	Foreign=Yes	5	flat:foreign	_	_
7	+	плюс	PUNCT	_	_	6	punct	_	_
8	Trac	Trac	ADP	_	_	11	case	_	_
9	0.11	0.11	NUM	_	_	11	nummod	_	_
10	на	на	ADP	_	_	11	case	_	_
11	Windows?	Windows?	SYM	_	_	12	obl	_	_
12	Можете	мочь	VERB	_	Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act	4	acl:relcl	_	_
13	подсказать,	подсказать	ADV	_	Degree=Pos	12	advmod	_	_
14	а	а	CCONJ	_	_	19	cc	_	_
15	то	то	PRON	_	Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing	19	mark	_	_
16	заводится	заводиться	VERB	_	Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin|Voice=Mid	15	fixed	_	_
17	никак	никак	ADV	_	Degree=Pos	19	advmod	_	_
18	не	не	PART	_	_	19	advmod	_	_
19	хочет	хотеть	VERB	_	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	12	conj	_	_
20	(((	(((	PUNCT	_	_	19	punct	_	_

Enhanced dependencies and empty/secondary nodes should not be part of the output.
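The requirement above can be sketched as a small filter over CoNLL-U lines. This is a minimal illustration, not an official GramEval tool: the function name `strip_enhanced` is hypothetical, and it assumes that "empty/secondary nodes" refers to CoNLL-U empty nodes (decimal IDs such as `8.1`) and that enhanced dependencies live in the ninth column (DEPS), as in the standard CoNLL-U layout.

```python
from typing import Iterable, Iterator


def strip_enhanced(lines: Iterable[str]) -> Iterator[str]:
    """Yield CoNLL-U lines with empty nodes dropped and the DEPS column blanked.

    Hypothetical helper: assumes standard 10-column CoNLL-U input.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Comments and blank sentence separators pass through unchanged.
        if not line or line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(
                f"expected 10 tab-separated columns, got {len(cols)}: {line!r}"
            )
        # Empty nodes carry decimal IDs such as "8.1"; they only exist
        # to support the enhanced dependency graph, so they are skipped.
        if "." in cols[0]:
            continue
        # Column 9 (index 8) is DEPS, the enhanced dependency graph.
        cols[8] = "_"
        yield "\t".join(cols)
```

Basic dependencies (HEAD and DEPREL, columns 7 and 8) are left untouched, so the output remains a valid basic UD tree. Multiword token range lines (IDs such as `1-2`) pass through as-is.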

Development Data (open test sets)

Training Data

SynTagRus-UD

UD_SynTagRus

UD_SynTagRus-v02 - a harmonized version with semi-manual corrections

Russian data from the SynTagRus corpus (1.1M tokens, fiction, news, wiki, nonfiction). Source: UD Russian SynTagRus repository

Annotation:

  • automatic (ETAP3), human correction in native SynTagRus, then re-tokenized and converted automatically to UD 2.x
  • enhanced dependencies removed, minor fixes of lemmas, UPOS, features, and relations

UD Russian GSD

Russian Universal Dependencies Treebank annotated and converted by Google (96K tokens, wiki). Source: UD_Russian GSD repository

Annotation:

  • automatic (GSD), human correction

UD Russian Taiga

Samples extracted from Taiga Corpus and MorphoRuEval-2017 text collections (38K tokens, blog, social, poetry, news). Source: UD_Russian Taiga repository

Annotation:

  • manual

MorphoRuEval 2017

Russian Corpus Data with manual verification, including SynTagRus, OpenCorpora, GICR, RNC.

Annotation:

  • unified automatic morphology (AOT, Mystem, ABBYY Compreno...)
  • UDPipe

RNC 17th century

A subcorpus of the Middle Russian corpus, texts of the 17th century (4M tokens, business & law, letters, Church Slavic, hybrid).

Annotation:

  • UPOS and features: hybrid automatic with partial manual post-correction
  • lemmas: TBA
  • dependencies: automatic (UDPipe)

Additional Data

The organizers provide additional data with fully automatic annotation:

MorphoRuEval 2017

link

Russian Corpus Data with manual verification, including SynTagRus, OpenCorpora, GICR, RNC.

Annotation:

  • unified automatic morphology (AOT, Mystem, ABBYY Compreno...)
  • UDPipe

Twitter

Updated link

Corpus of Russian tweets with sentiment annotation from http://study.mokoron.com/

Annotation:

  • UDPipe pipeline (tokenization, morphology, syntax)

Wikipedia

link

Current dump of Russian Wikipedia, first 100,000 articles (to be extended)

Annotation:

  • UDPipe pipeline (tokenization, morphology, syntax)

YouTube comments

link

Comments from Russian YouTube Trends, April 2019

Annotation:

  • UDPipe pipeline (tokenization, morphology, syntax)

Lenta Ru news

link

Lenta Ru news, up to 2018

Annotation:

  • symbol unification
  • UDPipe pipeline (tokenization, morphology, syntax)

Stihi ru (Taiga)

link

Stihi ru poetry, part of the Taiga Corpus

Annotation:

  • symbol unification
  • UDPipe pipeline (tokenization, morphology, syntax)

Proza ru (Taiga)

link

Proza ru fiction, part of the Taiga Corpus

Annotation:

  • symbol unification
  • UDPipe pipeline (tokenization, morphology, syntax)

Fiction Magazines (Taiga)

link

Materials from https://magazines.gorky.media/, part of the Taiga Corpus

Annotation:

  • symbol unification
  • UDPipe pipeline (tokenization, morphology, syntax)