Universal Dependencies Syntactic Graphs

The syntactic graphs that form the first layer of annotation in the dataset come from the gold UD dependency parses provided in the UD-EWT treebank, which contains sentences from the Linguistic Data Consortium's constituency-parsed English Web Treebank (EWT). UD-EWT comes with predefined training (train), development (dev), and test (test) splits, each stored in its own CoNLL-U file: en_ewt-ud-train.conllu, en_ewt-ud-dev.conllu, and en_ewt-ud-test.conllu. Henceforth, SPLIT ranges over train, dev, and test.

In UDS, each dependency-parsed sentence in UD-EWT is represented as a rooted directed graph (digraph). Each graph's identifier takes the form ewt-SPLIT-SENTNUM, where SENTNUM is the ordinal position (1-indexed) of the sentence within en_ewt-ud-SPLIT.conllu.
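The identifier scheme can be illustrated with a minimal sketch. The function name graph_ids and the two-sentence sample are hypothetical and are not part of UDS or UD-EWT; the sketch assumes only that sentences in a CoNLL-U file are separated by blank lines.

```python
import re


def graph_ids(conllu_text: str, split: str) -> list[str]:
    """Return one graph identifier per sentence block, numbered from 1."""
    # CoNLL-U separates sentences with a blank line.
    blocks = [b for b in re.split(r"\n\s*\n", conllu_text.strip()) if b.strip()]
    return [f"ewt-{split}-{sentnum}" for sentnum, _ in enumerate(blocks, start=1)]


# Hypothetical two-sentence sample in CoNLL-U format (not taken from UD-EWT).
sample = (
    "# text = Hello world\n"
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\tNN\tNumber=Sing\t1\tvocative\t_\t_\n"
    "\n"
    "# text = Go\n"
    "1\tGo\tgo\tVERB\tVB\tMood=Imp|VerbForm=Fin\t0\troot\t_\t_\n"
)

print(graph_ids(sample, "dev"))  # ['ewt-dev-1', 'ewt-dev-2']
```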

Each token in a sentence is associated with a node with identifier ewt-SPLIT-SENTNUM-syntax-TOKNUM, where TOKNUM is the token's ordinal position within the sentence (1-indexed, following the convention in UD-EWT). At minimum, each node has the following attributes.

  • position (int): the ordinal position (TOKNUM) of that node as an integer (again, 1-indexed)
  • domain (str): the subgraph this node is part of (always syntax)
  • type (str): the type of the object in the particular domain (always token)
  • form (str): the actual token
  • lemma (str): the lemma corresponding to the actual token
  • upos (str): the UD part-of-speech tag
  • xpos (str): the Penn Treebank part-of-speech tag
  • any attribute found in the features (FEATS) column of the CoNLL-U file

For information about the values that upos, xpos, and the attributes in the features column can take on, see the UD Guidelines.
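As a rough illustration of the node attributes listed above, the following sketch turns one CoNLL-U token line into a node identifier and an attribute dictionary. The helper token_node is hypothetical and is not the UDS implementation. It assumes plain integer token IDs (no multiword-token ranges or empty nodes) and assumes each entry in the features column becomes an attribute keyed by the feature name.

```python
def token_node(graph_id: str, line: str) -> tuple[str, dict]:
    """Build a (node identifier, attributes) pair from one CoNLL-U token line."""
    # CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    cols = line.rstrip("\n").split("\t")
    toknum, form, lemma, upos, xpos, feats = cols[:6]
    attrs = {
        "position": int(toknum),  # 1-indexed, as in UD-EWT
        "domain": "syntax",
        "type": "token",
        "form": form,
        "lemma": lemma,
        "upos": upos,
        "xpos": xpos,
    }
    # "_" marks an empty features column; otherwise entries look like Name=Value.
    # Assumption: the feature name is used directly as the attribute key.
    if feats != "_":
        for pair in feats.split("|"):
            name, value = pair.split("=", 1)
            attrs[name] = value
    return f"{graph_id}-syntax-{toknum}", attrs


# Hypothetical token line (not taken from UD-EWT).
node_id, attrs = token_node(
    "ewt-train-1", "2\tworld\tworld\tNOUN\tNN\tNumber=Sing\t1\tvocative\t_\t_"
)
print(node_id)  # ewt-train-1-syntax-2
print(attrs["position"], attrs["upos"], attrs["Number"])  # 2 NOUN Sing
```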

Each graph also has a special root node with identifier ewt-SPLIT-SENTNUM-root-0. This node always has a position attribute set to 0 and domain and type attributes set to root.

Edges within the graph represent the grammatical relations (dependencies) annotated in UD-EWT. These dependencies are always represented as directed edges pointing from the head to the dependent. At minimum, each edge has the following attributes.

  • domain (str): the subgraph this edge is part of (always syntax)
  • type (str): the type of the object in the particular domain (always dependency)
  • deprel (str): the UD dependency relation tag

For information about the values deprel can take on, see the UD Guidelines.
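The edge representation can be sketched in the same spirit. The helper dependency_edges is hypothetical, and storing edges in a dictionary keyed by (head identifier, dependent identifier) is a choice made for this illustration rather than a documented UDS structure. The sketch also assumes that a HEAD value of 0 in the CoNLL-U file corresponds to the special root node.

```python
def dependency_edges(graph_id: str, token_lines: list[str]) -> dict:
    """Map (head id, dependent id) pairs to edge attributes."""
    edges = {}
    for line in token_lines:
        cols = line.rstrip("\n").split("\t")
        # Column 1 is ID, column 7 is HEAD, column 8 is DEPREL (1-based).
        toknum, head, deprel = cols[0], cols[6], cols[7]
        dependent_id = f"{graph_id}-syntax-{toknum}"
        # Assumption: HEAD == 0 points at the special root node.
        if head == "0":
            head_id = f"{graph_id}-root-0"
        else:
            head_id = f"{graph_id}-syntax-{head}"
        # Edges are directed from the head to the dependent.
        edges[(head_id, dependent_id)] = {
            "domain": "syntax",
            "type": "dependency",
            "deprel": deprel,
        }
    return edges


# Hypothetical token lines (not taken from UD-EWT).
lines = [
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_",
    "2\tworld\tworld\tNOUN\tNN\tNumber=Sing\t1\tvocative\t_\t_",
]
edges = dependency_edges("ewt-test-1", lines)
for (head_id, dependent_id), edge_attrs in edges.items():
    print(head_id, "->", dependent_id, edge_attrs["deprel"])
```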