The SETimes.HR+ Croatian dependency treebank

The treebank results from an effort by the NLP group at FF Zagreb to provide free-culture language resources for Croatian.

The SETimes.HR dataset is available under the CC-BY-SA 4.0 license. Please cite Agić and Ljubešić (2014) (bib) when using this resource.

The remaining datasets are available under the CC-BY-NC-SA 4.0 license. For the time being, please cite Agić and Ljubešić (2014) (bib) when using these resources as well.

There are currently 8,579 sentences (190,735 tokens) in the training collections. All are manually split and tokenized, and then manually annotated for:

  1. parts of speech and morphological features in the Multext East v4 (MTE4) style, and for
  2. syntactic dependencies following a simplified PDT-motivated scheme (Agić & Merkler 2013), referred to as the SETimes.HR scheme.

On top of that, we also provide a Universal Dependencies (UD) annotation layer for a large part of the treebank. This layer contains:

  1. the UD POS tags, including the universal morphological features, and
  2. the UD syntactic dependencies.

The UD annotation layer for Croatian is also available through the official UD repositories.

We encode the treebank using the CoNLL-U format from the UD project. Note that columns 4 and 5 contain the coarse- and fine-grained MTE4 tags, while column 6 splits the fine-grained MTE4 tags into the corresponding attribute-value pairs. Columns 9 and 10 capture the Universal Dependencies layer of the treebank, i.e., the head:label syntactic pairs (column 9), and the universal POS tags and morphological features as attribute-value pairs (column 10).
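The column layout described above can be sketched as a small reader. This is a minimal sketch, not part of the treebank tooling: the names `Token`, `parse_pairs`, and `parse_row` are hypothetical, the sample row is invented for illustration, and the `|` and `=` separators in column 6 are assumed to follow the usual CoNLL-U convention. Column 10 is kept as a raw string because its exact layout is not spelled out here.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    idx: str                    # column 1: token index
    form: str                   # column 2: word form
    lemma: str                  # column 3: lemma
    mte4_coarse: str            # column 4: coarse-grained MTE4 tag
    mte4_fine: str              # column 5: fine-grained MTE4 tag
    mte4_feats: Dict[str, str]  # column 6: MTE4 attribute-value pairs
    head: str                   # column 7: head in the SETimes.HR scheme
    deprel: str                 # column 8: label in the SETimes.HR scheme
    ud_head: str                # column 9, part before ":"
    ud_label: str               # column 9, part after ":"
    ud_morph: str               # column 10: UD POS tag and features, unparsed


def parse_pairs(field: str) -> Dict[str, str]:
    """Split 'A=x|B=y' into a dict; '_' or an empty field gives {}.

    Assumption: pairs are '|'-separated and use '=' between attribute and value.
    """
    if not field or field == "_":
        return {}
    pairs = {}
    for item in field.split("|"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def parse_row(line: str) -> Token:
    """Parse one tab-separated token line with exactly ten columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    # Column 9 holds a head:label pair; a bare '_' leaves the label empty.
    ud_head, _, ud_label = cols[8].partition(":")
    return Token(cols[0], cols[1], cols[2], cols[3], cols[4],
                 parse_pairs(cols[5]), cols[6], cols[7],
                 ud_head, ud_label, cols[9])


# Invented sample row, for illustration only (not taken from the treebank).
sample = "1\tprimjer\tprimjer\tN\tNcmsn\tType=common|Gender=masculine\t0\tPred\t0:root\tNOUN"
token = parse_row(sample)
print(token.mte4_fine, token.mte4_feats, token.ud_head, token.ud_label)
```

Sentences without the UD layer would presumably carry a placeholder in columns 9 and 10; the sketch passes such values through unchanged rather than guessing at their meaning.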

The treebank is split into training and test sets. The training sets are in Croatian, while the test sets are further split by language (Croatian, Serbian) and domain (newswire, Wikipedia), following Agić et al. (2013a, 2013b).

The training sets are packaged in three files:

  • The first file contains 3,757 training sentences (83,637 tokens) with both annotation layers available (SETimes.HR & UD). Note that the Croatian UD release splits this dataset into 3,557 training and 200 development sentences.
  • The second file contains 2,223 sentences (49,077 tokens) from the Croatian web-based corpus described by Klubička & Ljubešić (2014). Currently, this dataset does not include the UD annotation layer.
  • The third file contains 2,599 sentences (58,021 tokens) from various news portals.
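As a quick consistency check, the per-file counts listed above add up to the overall training totals of 8,579 sentences and 190,735 tokens. The sketch below only restates the figures given in this README; the dictionary keys are descriptive labels chosen for illustration, not file names.

```python
# (sentences, tokens) per training file, as stated in this README.
# The keys are illustrative labels, not the actual file names.
training_files = {
    "newswire, both layers": (3757, 83637),
    "web corpus": (2223, 49077),
    "news portals": (2599, 58021),
}

total_sentences = sum(sentences for sentences, _ in training_files.values())
total_tokens = sum(tokens for _, tokens in training_files.values())

# Overall totals stated for the training collections.
assert total_sentences == 8579
assert total_tokens == 190735

# The UD release splits the first file into training and development parts.
assert 3557 + 200 == training_files["newswire, both layers"][0]

print(total_sentences, total_tokens)
```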

The test sets are split into five 100-sentence (~2,000-word) files:

  • {set|wiki}.{hr|sr}.test.conll, which represent the Croatian and Serbian newswire and Wikipedia test sets, and
  • a fifth file, which contains the Croatian web-based test set.

For the Croatian and Serbian newswire and Wikipedia test sets, both annotation layers are available, while for the Croatian web-based test set, only the SETimes.HR annotations are currently included.
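The pattern {set|wiki}.{hr|sr}.test.conll expands to four file names, one per domain and language pair. A minimal sketch of that expansion follows; the variable names are hypothetical, and reading `set` as newswire and `wiki` as Wikipedia is an interpretation of the pattern rather than something the file names state themselves.

```python
from itertools import product

# Assumed reading of the pattern: "set" = newswire, "wiki" = Wikipedia,
# "hr" = Croatian, "sr" = Serbian.
domains = {"set": "newswire", "wiki": "Wikipedia"}
languages = {"hr": "Croatian", "sr": "Serbian"}

# Expand {set|wiki}.{hr|sr}.test.conll into concrete file names.
test_files = [
    f"{domain}.{language}.test.conll"
    for domain, language in product(domains, languages)
]

for name in test_files:
    domain, language = name.split(".")[:2]
    print(f"{name}: {languages[language]} {domains[domain]} test set")
```

Together with the Croatian web-based test set, this accounts for the five test files.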

On top of these resources, we also provide the Apertium morphological lexicons of Croatian and Serbian mapped to the tagset used in the above datasets (available from the ReLDI project GitHub), and the tag and morphological feature mappings between MTE4r and UD (file mte4r-upos.mapping).

