
Summary

The Croatian UD treebank is based on the SETimes-HR corpus.

Introduction

The sentences are partially parallel with the smaller Serbian UD treebank, which comes from the Serbian edition of SETimes. For the CoNLL 2018 shared task on parsing (and for UD release 2.2), the Croatian corpus was re-split so that corresponding sentences fall in the same section (train/dev/test) in both Croatian and Serbian. The re-split had to be done on the Croatian side because the Serbian corpus is smaller and most of it corresponds to what used to be training data in Croatian.

For the time being, sentence ids have not been changed although they contain references to train/dev/test. Therefore it is now possible that e.g. sentence id "train-s2852" occurs in the development data, not in training data. This may be changed in future releases.

Also note that the following description of data split and sources refers to the old data split. Thus, sentences 0001-3557 of the "training set" have ids "train-s1" to "train-s3557" but some of them are now in the dev file and some in the test file.
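
Because the split prefix in a sentence id no longer has to match the file it lives in, it can help to count the prefixes actually present in a file. The following is a minimal sketch, assuming the ids appear in standard CoNLL-U `# sent_id = ...` comment lines and follow the `<split>-s<number>` pattern shown above; the helper name `split_prefix_counts` is hypothetical.

```python
import io
import re
from collections import Counter

# Assumption: sentence ids look like "<split>-s<number>", e.g. "train-s2852".
SENT_ID = re.compile(r"^#\s*sent_id\s*=\s*(\w+)-s(\d+)\s*$")


def split_prefix_counts(lines):
    """Count how many sentences carry each legacy split prefix in their sent_id."""
    counts = Counter()
    for line in lines:
        match = SENT_ID.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Tiny in-memory stand-in for a CoNLL-U file (token lines omitted).
sample = io.StringIO(
    "# sent_id = train-s2852\n"
    "# text = (omitted)\n"
    "\n"
    "# sent_id = dev-s12\n"
    "# text = (omitted)\n"
)
print(split_prefix_counts(sample))
```

Passing an open file handle for one of the released `.conllu` files instead of `sample` would report how many of its sentences originate from each section of the old split.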

Training set.

Contains 7,689 sentences (169,283 tokens) from three sources:

  1. Sentences 0001-3557: Newspaper text from the Southeast European Times news website, obtained from the SETimes parallel corpus. This part of the treebank is built on top of the SETimes.HR dependency treebank of Croatian.
  2. Sentences 3558-5792: Text from various Croatian web sources.
  3. Sentences 5793-7689: Croatian news web sources.

Development set.

Contains 600 sentences (14,533 tokens) from two sources:

  1. 001-200: newspaper text from the Croatian SETimes,
  2. 201-600: Croatian news web sources.

Test set.

Contains 600 sentences (13,228 tokens) from four sources:

  1. sentences 001-100: newspaper text,
  2. sentences 101-200: Wikipedia,
  3. sentences 201-297: web sources, and
  4. sentences 298-600: Croatian news web sources.
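
The ranges listed for the old split can be turned into a small lookup that maps a sentence id to its source. This is only a sketch: the ranges are copied from the lists above, the source labels are shorthand, and it assumes that dev and test ids follow the same `<split>-s<number>` pattern as the `train-s…` ids (only the training pattern is confirmed in this README). The names `OLD_SPLIT_SOURCES` and `source_of` are hypothetical.

```python
# Ranges copied from the old-split description; labels are shorthand.
OLD_SPLIT_SOURCES = {
    "train": [
        (1, 3557, "SETimes newspaper text"),
        (3558, 5792, "various Croatian web sources"),
        (5793, 7689, "Croatian news web sources"),
    ],
    "dev": [
        (1, 200, "SETimes newspaper text"),
        (201, 600, "Croatian news web sources"),
    ],
    "test": [
        (1, 100, "newspaper text"),
        (101, 200, "Wikipedia"),
        (201, 297, "web sources"),
        (298, 600, "Croatian news web sources"),
    ],
}


def source_of(sent_id):
    """Return the source label for an id such as 'train-s2852' (old split)."""
    # Assumption: every id has the form "<split>-s<number>".
    split, _, number = sent_id.partition("-s")
    n = int(number)
    for first, last, label in OLD_SPLIT_SOURCES[split]:
        if first <= n <= last:
            return label
    raise ValueError(f"sentence number {n} is outside the old {split} ranges")


print(source_of("train-s2852"))
```

Since the lookup depends only on the id, it gives the same answer regardless of which file a sentence ended up in after the re-split.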

Details

Sentence and word segmentation was manually checked. The treebank does not include multiword tokens. No language-specific features and relations were used. The POS tags and features were converted from Multext East v4 and manually checked. The syntactic annotation was done manually.
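
The claim that the treebank has no multiword tokens can be spot-checked on any file. In the CoNLL-U format a multiword token occupies its own line with a range in the ID column (for example `1-2`), so a rough check only needs to look at the first column. The sketch below assumes tab-separated CoNLL-U input; `count_multiword_tokens` is a hypothetical helper, not part of any UD tool.

```python
def count_multiword_tokens(lines):
    """Count CoNLL-U lines whose ID column is a range such as '1-2'."""
    count = 0
    for line in lines:
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        if "-" in token_id:
            count += 1
    return count


# Illustrative rows only, not taken from the treebank.
sample = [
    "# sent_id = train-s1\n",
    "1\tOvo\tovaj\tDET\t_\t_\t0\troot\t_\t_\n",
    "\n",
]
print(count_multiword_tokens(sample))
```

For the files in this treebank the count would be expected to be zero if the statement above holds.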

Acknowledgments

When using the Croatian UD treebank, please cite the following paper:

See file LICENSE.txt for further licensing information.

Changelog

  • 2018-04-15 v2.2
    • Repository renamed from UD_Croatian to UD_Croatian-SET.
    • Data split made compatible with the parallel data in UD_Serbian-SET.
  • 2017-02-15
    • converted to UD v2 standard
      • nmod vs. obl under non-verbal predicates should be checked manually (see the ToDo attribute in the MISC column)
      • by UD guidelines, reflexive pronouns with inherently reflexive verbs are now attached as expl:pv, not compound
      • adverbial participles (converbs) are marked by VerbForm=Conv
    • a number of enhancements and bug fixes
      • all pronouns, determiners and pronominal adverbs have PronType
      • all verbs have VerbForm; all finite verbs have Mood
      • ordinal numerals are ADJ like elsewhere in UD, not NUM (but they keep NumType=Ord)
      • relative pronouns, determiners and adverbs are not attached as mark (subordinating conjunctions keep the mark relation)
      • possessive adjectives and determiners are amod and det respectively; not nmod
      • coordinating conjunctions at the beginning of sentence are attached as cc, not discourse
  • 2017-02-09
    • added new UD v1 sentences from news-hr to the dev, test, and train sets: 2600 sentences, of which the last 703 went to dev (400) and test (303), and the remainder to the train set
  • 2016-10-31
    • added 2235 new sentences to the training set, and 97 new sentences to the test set, from various Croatian web sources
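
The 2017-02-15 entry above mentions a ToDo attribute in the MISC column that flags nmod vs. obl cases for manual checking. A minimal sketch for listing the flagged tokens follows, assuming standard ten-column CoNLL-U lines with `Key=Value` pairs separated by `|` in MISC. The helper name `todo_tokens` and the value `nmod_obl` in the sample row are hypothetical; the README does not say which values the attribute takes.

```python
def todo_tokens(lines):
    """Yield (token_id, form, todo_value) for tokens whose MISC has a ToDo key."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # MISC is the tenth column; "_" means it is empty.
        if len(cols) != 10 or cols[9] == "_":
            continue
        misc = dict(item.partition("=")[::2] for item in cols[9].split("|"))
        if "ToDo" in misc:
            yield cols[0], cols[1], misc["ToDo"]


# Illustrative row only; the ToDo value is made up for the example.
sample = ["3\tgradu\tgrad\tNOUN\t_\t_\t2\tnmod\t_\tToDo=nmod_obl\n"]
for token in todo_tokens(sample):
    print(token)
```
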
=== Machine-readable metadata =================================================
Data available since: UD v1.1
License: CC BY-SA 4.0
Includes text: yes
Genre: news web wiki
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: manual native
Contributors: Agić, Željko; Ljubešić, Nikola; Zeman, Daniel
Contributing: elsewhere
Contact: zeljko.agic@gmail.com
===============================================================================