Polish data.
.gitignore
CONTRIBUTING.md
LICENSE.txt
README.md
eval.log
pl_pdb-ud-dev.conllu
pl_pdb-ud-test.conllu
pl_pdb-ud-train.conllu
stats.xml


Summary

The Polish PDB-UD treebank is based on the Polish Dependency Bank 2.0 (PDB 2.0), created at the Institute of Computer Science, Polish Academy of Sciences in Warsaw. The PDB-UD treebank is an extended and corrected version of the Polish UD-SZ treebank (release 2.3).

Introduction

The PDB-UD treebank consists of 22,208 sentences (351K tokens). It contains all 8K sentences of the Polish UD-SZ treebank and a further 14K unique sentences. The additional sentences cover linguistic phenomena that did not occur or were not annotated in the UD-SZ trees (e.g. relative clauses, reported speech). The PDB-UD treebank contains enhanced graphs, i.e. trees with enhanced edges that encode the shared dependents and the shared governors of coordinated conjuncts (9,167 PDB-UD trees contain enhanced edges).
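As a rough illustration of how trees with enhanced edges could be counted, the sketch below reads CoNLL-U text and flags a sentence when any entry in the DEPS column (column 9) differs from the token's basic HEAD:DEPREL pair, or when the sentence contains an empty node. This is a heuristic under stated assumptions, not the procedure used to obtain the figure above. The helper names (`row`, `sentences`, `has_enhanced_edges`) and the two-sentence sample are made up for the example.

```python
def row(tok_id, form, head, deprel, deps):
    """Build one 10-column CoNLL-U token line (unused columns set to '_')."""
    return "\t".join([str(tok_id), form, "_", "_", "_", "_", str(head), deprel, deps, "_"])


# Made-up sample, not taken from the treebank: in the first sentence the
# subject "Jan" is shared by two coordinated verbs, so DEPS has an extra edge.
SAMPLE = "\n".join([
    "# text = Jan śpi i chrapie",
    row(1, "Jan", 2, "nsubj", "2:nsubj|4:nsubj"),
    row(2, "śpi", 0, "root", "0:root"),
    row(3, "i", 4, "cc", "4:cc"),
    row(4, "chrapie", 2, "conj", "2:conj"),
    "",
    "# text = Jan śpi",
    row(1, "Jan", 2, "nsubj", "2:nsubj"),
    row(2, "śpi", 0, "root", "0:root"),
    "",
])


def sentences(conllu_text):
    """Yield each sentence as a list of token rows (lists of columns)."""
    rows = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if rows:
                yield rows
                rows = []
        elif not line.startswith("#"):
            rows.append(line.split("\t"))
    if rows:
        yield rows


def has_enhanced_edges(rows):
    """True if DEPS holds an edge beyond the basic HEAD:DEPREL (heuristic)."""
    for cols in rows:
        tok_id, head, deprel, deps = cols[0], cols[6], cols[7], cols[8]
        if "-" in tok_id:      # multiword token range, carries no dependency
            continue
        if "." in tok_id:      # empty node, exists only in the enhanced graph
            return True
        if deps == "_":
            continue
        if any(edge != f"{head}:{deprel}" for edge in deps.split("|")):
            return True
    return False


enhanced_count = sum(has_enhanced_edges(s) for s in sentences(SAMPLE))
print(enhanced_count)
```

To try this on the released data, replace `SAMPLE` with the contents of one of the `.conllu` files, read as UTF-8.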

The morphological, syntactic and semantic annotation of the PDB-UD treebank was created by converting the PDB 2.0 data. The conversion procedure was designed and implemented by Alina Wróblewska, partly based on the conversion of the UD-SZ trees.

Data Split

The PDB-UD treebank contains sentences from the Polish National Corpus, CDSCorpus, parallel corpora (e.g. Europarl, OPUS, and Pelcra Parallel Corpus) and other sources (e.g. literature). The PDB-UD treebank is divided into three parts:

  • training: 17,766 trees (284,209 tokens),
  • test: 2,221 trees (31,488 tokens),
  • development: 2,221 trees (35,709 tokens).
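The figures in this list can be checked with a few lines of arithmetic. The snippet below sums the three parts and derives the average number of tokens per tree in each; the variable names are chosen for the example only.

```python
# Tree and token counts exactly as listed above.
SPLITS = {
    "training": (17_766, 284_209),
    "test": (2_221, 31_488),
    "development": (2_221, 35_709),
}

total_trees = sum(trees for trees, _ in SPLITS.values())
total_tokens = sum(tokens for _, tokens in SPLITS.values())
averages = {name: tokens / trees for name, (trees, tokens) in SPLITS.items()}

print(total_trees, total_tokens)  # 22208 351406
for name, avg in averages.items():
    print(f"{name}: {avg:.2f} tokens per tree")
# training: 16.00 tokens per tree
# test: 14.18 tokens per tree
# development: 16.08 tokens per tree
```

The totals match the 22,208 sentences and roughly 351K tokens stated in the introduction.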

Dependency trees are assigned to the three data sets at random, while the proportion of data from individual sources is maintained. There is one constraint on the procedure: if a sentence occurs in the test subcorpus of the Polish LFG-UD treebank, it is assigned to the PDB-UD test set. Since the sentences underlying the Polish LFG-UD trees are generally short, the average number of tokens per sentence is noticeably lower in the PDB-UD test set than in the other two sets.
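The actual splitting script is not part of this README. The following is only a minimal sketch of a source-stratified random split with a reserved test set, written to illustrate the description above. The function name `split_treebank`, the 10% development and test shares (approximated from 2,221 of 22,208 trees), the fixed seed, and the toy sentence IDs are all assumptions.

```python
import random
from collections import defaultdict


def split_treebank(sentences, reserved_for_test, seed=0, dev_share=0.1, test_share=0.1):
    """Split (sent_id, source) pairs into train/dev/test, stratified by source.

    Sentences whose ID is in `reserved_for_test` always go to the test set;
    they count towards that source's test quota.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for sent_id, source in sentences:
        by_source[source].append(sent_id)

    split = {"train": [], "dev": [], "test": []}
    for ids in by_source.values():
        forced = [i for i in ids if i in reserved_for_test]
        free = [i for i in ids if i not in reserved_for_test]
        rng.shuffle(free)
        # Fill the rest of the test quota at random after the forced sentences.
        n_test = max(round(len(ids) * test_share) - len(forced), 0)
        n_dev = round(len(ids) * dev_share)
        split["test"] += forced + free[:n_test]
        split["dev"] += free[n_test:n_test + n_dev]
        split["train"] += free[n_test + n_dev:]
    return split


# Toy input: two hypothetical sources, three sentences reserved for the test set.
toy = [(f"A-{i}", "A") for i in range(100)] + [(f"B-{i}", "B") for i in range(50)]
reserved = {"A-0", "A-1", "B-0"}
result = split_treebank(toy, reserved)
print({name: len(ids) for name, ids in result.items()})
```

Because the reserved sentences count towards the test quota, the overall proportions stay close to the targets while the constraint is still honoured.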

License/Copyright

Universal Polish Dependency Bank © 2019 by Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.

The Polish PDB-UD treebank is licensed under CC BY-NC-SA 4.0.

History Note

The earlier release of the Polish UD treebank, i.e. the UD-SZ treebank, was first converted to the Prague dependency style as part of HamleDT; it was then automatically converted to Universal Dependencies (HamleDT 3.0, 2015). The first release of Universal Dependencies to include this treebank was UD v1.2 in November 2015. It is essentially the HamleDT conversion, but the data is not identical to HamleDT 3.0 because the conversion procedure has been further improved.

References

If you use the Polish PDB-UD treebank, you are encouraged to cite this paper:

@inproceedings{pl,
  author    = {Wr{\'o}blewska, Alina},
  title     = {Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format},
  booktitle = {Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)},
  editor    = {de Marneffe, Marie-Catherine and Lynn, Teresa and Schuster, Sebastian},
  pages     = {173--182},
  publisher = {Association for Computational Linguistics},
  year      = {2018}
}

Acknowledgments

We would like to thank all contributors to the original Polish Dependency Bank 2.0. The development of the PDB-UD treebank was funded by the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure.

Changelog

  • 2019-05-15 v2.4
    • Repository renamed from UD_Polish-SZ to UD_Polish-PDB.
    • Completely new conversion of the data.
    • In addition to the texts from “Składnica zależnościowa” 0.5, new data from PDB was added.
    • License changed from GNU GPL 3.0 to CC BY-NC-SA 4.0.
  • 2018-04-15 v2.2
    • Repository renamed from UD_Polish to UD_Polish-SZ.
  • 2017-11-15 v2.1
    • Prepositional objects are now “obl:arg” instead of “obj”.
    • Instrumental phrases for demoted agents in passives are now “obl:agent”.
  • 2017-03-01 v2.0
    • Converted to UD v2 guidelines.
    • Reconsidered PRON vs. DET.
    • Improved advmod vs. obl distinction.
    • Participles moved from verbs to adjectives.
  • 2016-05-15 v1.3
    • Fixed adverbs that were attached as nmod; correct: advmod.
    • Conditional auxiliary "by" changed from PART to AUX (and features added).
    • Demonstrative, interrogative, indefinite, total and negative PRON/DET separated from NOUN/ADJ.
    • Proper nouns (PROPN) separated from NOUN.
    • Past tense / l-participles changed from VerbForm=Fin to VerbForm=Part.
    • Verbal nouns (VerbForm=Ger) retagged and relemmatized from VERB to NOUN.
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v1.2
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: fiction nonfiction news
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Wróblewska, Alina; Zeman, Daniel; Mašek, Jan; Rosa, Rudolf
Contributing: elsewhere
Contact: zeman@ufal.mff.cuni.cz
===============================================================================