GitHub - ffnlp/sethr: The SETimes.HR+ Croatian dependency treebank

The SETimes.HR+ Croatian dependency treebank

The treebank is a result of an effort in providing free-culture language resources for Croatian by the NLP group at FF Zagreb.

The SETimes.HR dataset (set.hr.conll) is available under the CC-BY-SA 4.0 license. Please cite Agić and Ljubešić (2014) (bib) when using this resource.

The remaining datasets (web.hr.conll and news.hr.conll) are available under the CC-BY-NC-SA 4.0 license. For the time being, please cite Agić and Ljubešić (2014) (bib) when using these resources as well.

There are currently 8,579 sentences (190,735 tokens) in the training collections. All are manually split and tokenized, and then manually annotated for:

parts of speech and morphological features in the Multext East v4 (MTE4) style, and for
syntactic dependencies following a simplified PDT-motivated scheme (Agić & Merkler 2013), referred to as the SETimes.HR scheme.

On top of that, we also provide a Universal Dependencies (UD) annotation layer for a large part of the treebank. This layer contains:

the UD POS tags, including the universal morphological features, and
the UD syntactic dependencies.

The UD annotation layer for Croatian is also available through the official UD repositories.

We encode the treebank using the CoNLL-U format from the UD project. Note that columns 4 and 5 contain the coarse- and fine-grained MTE4 tags, while column 6 splits the fine-grained MTE4 tags into the corresponding attribute-value pairs. Columns 9 and 10 capture the Universal Dependencies layer of the treebank, i.e., the head:label syntactic pairs (column 9), and the universal POS tags and morphological features as attribute-value pairs (column 10).
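As a rough illustration of the column layout described above, the following sketch reads one tab-separated token line into named fields. The field names, the `parse_line` and `parse_feats` helpers, and the sample line are all hypothetical and not taken from the treebank. The sketch also assumes that columns 1 to 3 hold the token ID, form, and lemma, that columns 7 and 8 hold the SETimes.HR head and label as in standard CoNLL-U, and that the attribute-value pairs in column 6 are separated by `|` and written as `Attribute=Value`. Column 10 is kept as a raw string because its exact encoding is not specified here.

```python
from typing import NamedTuple


class Token(NamedTuple):
    token_id: str       # column 1: token ID within the sentence (assumed)
    form: str           # column 2: word form (assumed)
    lemma: str          # column 3: lemma (assumed)
    mte_coarse: str     # column 4: coarse-grained MTE4 tag
    mte_fine: str       # column 5: fine-grained MTE4 tag
    mte_feats: dict     # column 6: MTE4 attribute-value pairs
    head: str           # column 7: SETimes.HR head (assumed)
    deprel: str         # column 8: SETimes.HR label (assumed)
    ud_deps: str        # column 9: UD head:label pairs
    ud_pos_feats: str   # column 10: UD POS tag and features, kept raw


def parse_feats(field):
    """Turn 'A=x|B=y' into {'A': 'x', 'B': 'y'}; '_' means no features."""
    if not field or field == "_":
        return {}
    pairs = (item.split("=", 1) for item in field.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def parse_line(line):
    """Parse one token line with exactly ten tab-separated columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    return Token(cols[0], cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), cols[6], cols[7], cols[8], cols[9])


# Made-up line for illustration only; the tag values are placeholders.
example = "1\tKuća\tkuća\tN\tNcfsn\tCase=n|Gender=f\t0\tPred\t0:root\tNOUN|Case=Nom\n"
token = parse_line(example)
print(token.form, token.mte_fine, token.mte_feats, token.ud_deps)
```

If the actual files encode column 6 or column 10 differently, only `parse_feats` and the handling of the last column would need to change.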

The treebank is split into training and test sets. The training sets are in Croatian, while the test sets are further split by language (Croatian, Serbian) and domain (newswire, Wikipedia), following Agić et al. (2013a, 2013b).

The training sets are packaged in three files:

set.hr.conll contains 3,757 training sentences (83,637 tokens) with both annotation layers available (SETimes.HR & UD). Note that in the Croatian UD dataset, this file is split into 3,557 training and 200 development sentences.
web.hr.conll contains 2,223 sentences (49,077 tokens) from the Croatian web-based corpus described by Klubička & Ljubešić (2014). Currently, this dataset does not include the UD annotation layer.
news.hr.conll contains 2,599 sentences (58,021 tokens) from various news portals.
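The sentence and token counts listed above can be checked with a short script. The sketch below assumes that sentences are separated by blank lines, that lines starting with `#` are comments, and that every other non-empty line is one token. Whether this matches the counting used for the figures above is an assumption, and `count_sentences_and_tokens` is a hypothetical helper name.

```python
def count_sentences_and_tokens(lines):
    """Count blank-line-separated sentences and token lines in CoNLL-style input."""
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Comment lines carry no tokens (assumed convention).
            continue
        tokens += 1
        in_sentence = True
    # The last sentence may not be followed by a blank line.
    if in_sentence:
        sentences += 1
    return sentences, tokens
```

The function accepts any iterable of lines, so an open file handle works directly, e.g. `count_sentences_and_tokens(open("set.hr.conll", encoding="utf-8"))`.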

The test sets are split into five 100-sentence (~2,000-word) files:

{set|wiki}.{hr|sr}.test.conll which represent the Croatian and Serbian newswire and Wikipedia test sets, and
web.hr.test.conll which contains the Croatian web-based test set.
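The brace pattern above expands to four file names, which together with the web-based test set gives the five test files. A minimal sketch of that expansion, with `list_test_files` as a hypothetical helper name:

```python
from itertools import product


def list_test_files():
    """Expand {set|wiki}.{hr|sr}.test.conll and add the web-based test set."""
    domains = ("set", "wiki")   # newswire and Wikipedia
    languages = ("hr", "sr")    # Croatian and Serbian
    files = [f"{domain}.{lang}.test.conll"
             for domain, lang in product(domains, languages)]
    files.append("web.hr.test.conll")
    return files


for name in list_test_files():
    print(name)
```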

For the Croatian and Serbian newswire and Wikipedia test sets, both annotation layers are available, while for the Croatian web-based test set, only the SETimes.HR annotations are currently included.

On top of these resources, we also provide the Apertium morphological lexicons of Croatian and Serbian mapped to the tagset used in the above datasets (available from the ReLDI project GitHub), and the tag and morphological feature mappings between MTE4r and UD (file mte4r-upos.mapping).

