Corpus for Universal Conceptual Cognitive Annotation

The UCCA Wikipedia Corpus

Version 1.2.3 (September 23, 2018)

This bundle contains 367 passages annotated according to the foundational layer of UCCA. The passages are provided as XML files in the format described below. The corpus contains 158,573 tokens in total. The bundle also includes the annotation guidelines that were given to the annotators and a metadata file.

The dataset is part of the UCCA project, developed in the NLP lab of the Hebrew University by Omri Abend and Ari Rappoport. Users of this dataset are kindly requested to cite the following publication:

Universal Conceptual Cognitive Annotation (UCCA). Omri Abend and Ari Rappoport, ACL 2013

Example passages can be graphically viewed through our web application. Please refer to our website or email (oabend@cs.huji.ac.il) for regular updates on the UCCA project and available resources.

Files included

  1. The passage files, in XML format. File names are of the form ucca_passageXXX.xml, where XXX is the passage ID. Please see the UCCA resource webpage for a software package for reading and using these files.
  2. metadata: a file that contains metadata for the passages. Specifically, it lists the source of each text (i.e., the Wikipedia article it was taken from) and the index of the annotator who did the final proofreading (2, 3, or 6).
  3. guidelines.pdf: the annotation guidelines that were given to the annotators, summarized in a document titled "UCCA in a nutshell". Concise definitions are also available through the UCCA website.
  4. short_defs.deprecated.pdf: a brief summary of the categories used by UCCA's foundational layer (used for the original annotation).
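
The file naming scheme above can be used to index the passages by ID before loading them with the UCCA software package. The following is only a minimal sketch: the helper names `passage_id` and `index_passages` are hypothetical, the directory holding the passage files is passed in by the caller, and nothing is assumed about the contents of the XML files or of the metadata file.

```python
import re
from pathlib import Path

# File names follow the pattern ucca_passageXXX.xml, where XXX is the passage ID.
PASSAGE_FILE_RE = re.compile(r"^ucca_passage(.+)\.xml$")


def passage_id(filename):
    """Return the passage ID embedded in a passage file name, or None.

    Hypothetical helper: it relies only on the naming scheme described above.
    """
    match = PASSAGE_FILE_RE.match(filename)
    return match.group(1) if match else None


def index_passages(directory):
    """Map passage ID -> file path for every passage file in `directory`.

    Hypothetical helper: files that do not match the naming scheme
    (for example the metadata file) are skipped.
    """
    index = {}
    for path in sorted(Path(directory).iterdir()):
        pid = passage_id(path.name)
        if pid is not None and path.is_file():
            index[pid] = path
    return index
```

For instance, `index_passages(some_dir)` returns a dictionary whose keys are passage IDs as strings, where `some_dir` is whichever directory holds the passage files in your copy of the corpus. IDs are kept as strings so that any leading zeros in the file names are preserved.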

XML format

See FORMAT.md.

Licensing

The texts are taken from the English Wikipedia (http://en.wikipedia.org). The specific articles they were taken from are listed in the metadata file. The Wikipedia texts, as well as the UCCA annotations, are distributed under the "Attribution-ShareAlike 3.0 Unported" license (http://creativecommons.org/licenses/by-sa/3.0/). Please follow the link for exact details.

Acknowledgements

We would like to thank Tomer Eshet for his partnership in developing the UCCA web application, and Amit Beka for his help with UCCA's development set and software tools. We would also like to thank our four annotators for their hard and thorough work.