The following examples suggest extensions to CoNLL-U which allow the publication of merged UD + UA data and are already being used in several valid UD corpora. Scenarios range from basic token data (no POS tags, sentence splits etc.) to fully treebanked data, which is compatible with UD validation scripts. Some advantages of this proposal:
- Uses familiar round bracket CoNLL scorer notation, which covers multiple markables starting/ending at once (e.g.
Entity=(1-organization(2-person)
), and is backwards compatible with older scorer versions - Very compact: most basic coreference data (with just identity coreference and perhaps entity types) can fit inside a single MISC annotation, preventing clutter and confusion when other annotation types appear in MISC. Keys for annotation values are declared only once at the top.
- Human readable: easy to see where markables start and end (no need to look at token index numbers)
- No need for co-indexing multiple annotations for markables: since almost all information can fit into a basic
Entity=
annotation, multiple markables each get a single string of annotations, without multiple keys - Trivially supports spans that cross sentence boundaries (just close the span in the relevant position)
- Supports entity linking/wikification without disrupting markable IDs or scorers (since the scorer accepts any arbitrary string as an identifier)
- Supports metadata
- Compatible with ANNIS CoNLL importer, meaning corpora in this format can be directly imported to ANNIS without any conversions or merging
- Proposes representations of direct edges for bridging/split antecedent and discontinuous markables (though these are no longer compatible with the CoNLL scorer by nature)
- There are no sentence breaks so numbers are sequential for entire document
- There are no annotations, so we just use 10 columns to conform with the conll format
- Coreference is indicate by co-index opening and closing brackets (on one line for single-token entities)
- Since CoNLL-U demands capitalized key-value annotations with
=
, we useEntity=
, which can be used for NER, coref, or both (see below) - The contents of the Entity annotation are compatible with the existing conll coref scorer
# newdoc id = GUM_voyage_tulsa
1 Tulsa _ _ _ _ _ _ _ Entity=(1)
2 Tulsa _ _ _ _ _ _ _ Entity=(1)
3 is _ _ _ _ _ _ _ _
4 in _ _ _ _ _ _ _ _
5 the _ _ _ _ _ _ _ Entity=(2
6 Green _ _ _ _ _ _ _ _
7 Country _ _ _ _ _ _ _ _
8 region _ _ _ _ _ _ _ _
9 of _ _ _ _ _ _ _ _
10 Oklahoma _ _ _ _ _ _ _ Entity=(3)2)
11 . _ _ _ _ _ _ _ _
12 It _ _ _ _ _ _ _ Entity=(1)
13 is _ _ _ _ _ _ _ _
14 also _ _ _ _ _ _ _ _
15 called _ _ _ _ _ _ _ _
16 “ _ _ _ _ _ _ _ _
17 T-town _ _ _ _ _ _ _ Entity=(1)
18 ” _ _ _ _ _ _ _ _
- This format is usable for NER type annotation (or non-named, and also nested entities)
- All entities of the same type have the same label, coref is not indicated
- Note that if we have spans with conflicting hierarchy, indices must be used as above (but will never repeat, since we have no coref)
# newdoc id = GUM_voyage_tulsa
1 Tulsa _ _ _ _ _ _ _ Entity=(place)
2 Tulsa _ _ _ _ _ _ _ Entity=(place)
3 is _ _ _ _ _ _ _ _
4 in _ _ _ _ _ _ _ _
5 the _ _ _ _ _ _ _ Entity=(place
6 Green _ _ _ _ _ _ _ _
7 Country _ _ _ _ _ _ _ _
8 region _ _ _ _ _ _ _ _
9 of _ _ _ _ _ _ _ _
10 Oklahoma _ _ _ _ _ _ _ Entity=(place)place)
11 . _ _ _ _ _ _ _ _
12 It _ _ _ _ _ _ _ Entity=(place)
13 is _ _ _ _ _ _ _ _
14 also _ _ _ _ _ _ _ _
15 called _ _ _ _ _ _ _ _
16 “ _ _ _ _ _ _ _ _
17 T-town _ _ _ _ _ _ _ Entity=(place)
18 ” _ _ _ _ _ _ _ _
- Both types of information are combined
- Singletons can be expressed by an unrepeated index
- After discussion with the Coref UD team, we decided to use only the group identifier in closer elements
- Names for other key-value annotations can be specified at the top of the document following the CoNLL-U-plus
# global.
format- Here
# global.Entity
specifies the annotation names insideEntity
annotations, indicating that "place" is the value of an annotation calledentity
- The special value
GRP
indicates that this position in the string is reserved for the coreference group ID
- Here
# newdoc id = GUM_voyage_tulsa
# global.Entity = entity-GRP
1 Tulsa _ _ _ _ _ _ _ Entity=(place-1)
2 Tulsa _ _ _ _ _ _ _ Entity=(place-1)
3 is _ _ _ _ _ _ _ _
4 in _ _ _ _ _ _ _ _
5 the _ _ _ _ _ _ _ Entity=(place-2
6 Green _ _ _ _ _ _ _ _
7 Country _ _ _ _ _ _ _ _
8 region _ _ _ _ _ _ _ _
9 of _ _ _ _ _ _ _ _
10 Oklahoma _ _ _ _ _ _ _ Entity=(place-3)2)
11 . _ _ _ _ _ _ _ _
12 It _ _ _ _ _ _ _ Entity=(place-1)
13 is _ _ _ _ _ _ _ _
14 also _ _ _ _ _ _ _ _
15 called _ _ _ _ _ _ _ _
16 “ _ _ _ _ _ _ _ _
17 T-town _ _ _ _ _ _ _ Entity=(place-1)
18 ” _ _ _ _ _ _ _ _
- Entity linking or 'Wikification' can be added using the same notation, as the last portion after the separator
- Literal hyphens in the entity link identifier should be escaped as
-
if needed - The following example uses GUM's convention of Wikipedia page identifiers as entity identifiers
- The annotation names
entity
andidentity
are indicated using the# global.Entity
header
# newdoc id = GUM_voyage_tulsa
# global.Entity = entity-GRP-identity
1 Tulsa _ _ _ _ _ _ _ Entity=(place-1-Tulsa)
2 Tulsa _ _ _ _ _ _ _ Entity=(place-1-Tulsa)
3 is _ _ _ _ _ _ _ _
4 in _ _ _ _ _ _ _ _
5 the _ _ _ _ _ _ _ Entity=(place-2-Green_Country
6 Green _ _ _ _ _ _ _ _
7 Country _ _ _ _ _ _ _ _
8 region _ _ _ _ _ _ _ _
9 of _ _ _ _ _ _ _ _
10 Oklahoma _ _ _ _ _ _ _ Entity=(place-3-Oklahoma)2)
11 . _ _ _ _ _ _ _ _
12 It _ _ _ _ _ _ _ Entity=(place-1-Tulsa)
13 is _ _ _ _ _ _ _ _
14 also _ _ _ _ _ _ _ _
15 called _ _ _ _ _ _ _ _
16 “ _ _ _ _ _ _ _ _
17 T-town _ _ _ _ _ _ _ Entity=(place-1-Tulsa)
18 ” _ _ _ _ _ _ _ _
- Since conllu is built for treebanking, we can just reuse the existing conllu spec
- Line space separates sentences
- IDs restart each sentence, but unique sentence IDs can be specified (optional in conllu)
- Note that multiple annotations in the MISC column (col 10) are pipe-separated, based on the conll spec (see
SpaceAfter
below for example) - For more information about UD tagsets, morphological categories, etc. see https://universaldependencies.org/format.html
# newdoc id = GUM_voyage_tulsa
# global.Entity = entity-GRP
# sent_id = GUM_voyage_tulsa-1
# text = Tulsa
1 Tulsa Tulsa PROPN NNP Number=Sing 0 root _ Entity=(place-1)
# sent_id = GUM_voyage_tulsa-2
# text = Tulsa is in the Green Country region of Oklahoma.
1 Tulsa Tulsa PROPN NNP Number=Sing 7 nsubj _ Entity=(place-1)
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 7 cop _ _
3 in in ADP IN _ 7 case _ _
4 the the DET DT Definite=Def|PronType=Art 7 det _ Entity=(place-2
5 Green Green PROPN NNP Number=Sing 6 amod _ _
6 Country Country PROPN NNP Number=Sing 7 compound _ _
7 region region NOUN NN Number=Sing 0 root _ _
8 of of ADP IN _ 9 case _ _
9 Oklahoma Oklahoma PROPN NNP Number=Sing 7 nmod _ Entity=(place-3)2)|SpaceAfter=No
10 . . PUNCT . _ 1 punct _ _
- For fuzzy matching and other purposes, it can be useful to have a minimum or core span which systems must identify for each entity.
- In many contexts, this will be the syntactic head of the phrase, but we need some facilities in case the min span is multiple, possible discontinuous tokens
- Proposal: add min spec to entity brackets, with token indices refering to position within entity (not token IDs in sentence, in case the entity spans multiple sentences). Token indices can be comma separated, e.g.
(place-2-4,5
means a place entity, unique entity ID 2, begins on this line, and the min span is tokens 4 and 5 of this entity span.
# newdoc id = GUM_voyage_tulsa
# global.Entity = entity-GRP-MIN
# sent_id = GUM_voyage_tulsa-1
# text = Tulsa
1 Tulsa Tulsa PROPN NNP Number=Sing 0 root _ Entity=(place-1-1)
# sent_id = GUM_voyage_tulsa-2
# text = Tulsa is in the Green Country region of Oklahoma.
1 Tulsa Tulsa PROPN NNP Number=Sing 7 nsubj _ Entity=(place-1-1)
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 7 cop _ _
3 in in ADP IN _ 7 case _ _
4 the the DET DT Definite=Def|PronType=Art 7 det _ Entity=(place-2-4
5 Green Green PROPN NNP Number=Sing 6 amod _ _
6 Country Country PROPN NNP Number=Sing 7 compound _ _
7 region region NOUN NN Number=Sing 0 root _ _
8 of of ADP IN _ 9 case _ _
9 Oklahoma Oklahoma PROPN NNP Number=Sing 7 nmod _ Entity=(place-3-1)2)|SpaceAfter=No
10 . . PUNCT . _ 1 punct _ _
- Split antecedent coref is not transitive: in "Kim met Alex ... they", the plural only refers back to both entities together, and not to each mention, or other mentions of the individual entities
- However, each mentioned entity, including the aggregate one, should be represented as usual in the
Entity
field - Since more mentions can appear before and after the split antecedent, we need a way to mark which mention is the anaphor
- Proposal: Use an additional
Split
annotation referencing each antecedent link from the anaphor to its antecedent, which will appear at the first token of the anaphor, e.g.Split=1<3,2<3
means that person-3, which begins at this line, refers back toperson-1
andperson-2
based on their GRP cluster.
# sent_id = GUM_bio_bernoulli-14
# text = He is said to have had a bad relationship with his father.
# s_type = decl
1 He he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 3 nsubj:pass _ Entity=(person-1)
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 aux:pass _ _
3 said say VERB VBN Tense=Past|VerbForm=Part 0 root _ _
4 to to PART TO _ 6 mark _ _
5 have have AUX VB VerbForm=Inf 6 aux _ _
6 had have VERB VBN Tense=Past|VerbForm=Part 3 xcomp _ _
7 a a DET DT Definite=Ind|PronType=Art 9 det _ Entity=(abstract-54
8 bad bad ADJ JJ Degree=Pos 9 amod _ _
9 relationship relationship NOUN NN Number=Sing 6 obj _ _
10 with with ADP IN _ 12 case _ _
11 his his PRON PRP$ Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs 12 nmod:poss _ Entity=(person-45(person-1)
12 father father NOUN NN Number=Sing 9 nmod _ Entity=54)45)|SpaceAfter=No
13 . . PUNCT . _ 3 punct _ _
# sent_id = GUM_bio_bernoulli-15
# text = Upon both of them entering and tying for first place in a scientific contest at the University of Paris, Johann, unable to bear the "shame" of being compared Daniel's equal, banned Daniel from his house.
# s_type = decl
1 Upon upon SCONJ IN _ 5 mark _ _
2 both both PRON DT _ 5 nsubj _ Entity=(person-55|Split=1<55,45<55
3 of of ADP IN _ 4 case _ _
4 them they PRON PRP Case=Acc|Number=Plur|Person=3|PronType=Prs 2 nmod _ Entity=55)
- Since bridging is also not transitive, we again need a mechanism to indicate what is pointing to what, while assuming that entity IDs already exist in
Entity=
- For example, we need to distinguish
The car <- the door <- the handle
(a chain) fromThe car <- the door, (the car) <- the wheel
(two bridging anaphora to the same antecedent) - Proposal: use the same logic as for Split, using an annotation called
Bridge
(possibly we can add subtypes of bridging as well, similar to dependency subtypes)
# sent_id = GUM_bio_gordon-32
1 An a DET DT Definite=Ind|PronType=Art 6 det _ Discourse=joint:75->74|Entity=(abstract-142
2 incomplete incomplete ADJ JJ Degree=Pos|Polarity=Neg 6 amod _ _
3 and and CCONJ CC _ 4 cc _ _
4 faulty faulty ADJ JJ Degree=Pos 2 conj _ _
5 German German ADJ JJ Degree=Pos 6 amod _ _
6 translation translation NOUN NN Number=Sing 21 nsubj:pass _ SpaceAfter=No
7 , , PUNCT , _ 8 punct _ _
8 edited edit VERB VBN Tense=Past|VerbForm=Part 6 acl _ Discourse=elaboration:76->75
9 by by ADP IN _ 10 case _ _
10 Dr Dr PROPN NNP Number=Sing 8 obl _ Entity=(person-143
11 Moritz Moritz PROPN NNP Number=Sing 10 flat _ _
12 Posselt Posselt PROPN NNP Number=Sing 10 flat _ Entity=142)143)
13 ( ( PUNCT -LRB- _ 18 punct _ Discourse=elaboration:77->75|SpaceAfter=No
14 Tagebuch Tagebuch X FW _ 18 compound _ Entity=(abstract-142
15 des des X FW _ 18 compound _ _
16 Generals Generals X FW _ 18 compound _ _
17 Patrick Patrick PROPN NNP Number=Sing 18 compound _ _
18 Gordon Gordon PROPN NNP Number=Sing 6 appos _ Entity=142)|SpaceAfter=No
19 ) ) PUNCT -RRB- _ 18 punct _ _
20 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 21 aux:pass _ Discourse=same-unit:78->75
21 published publish VERB VBN Tense=Past|VerbForm=Part 0 root _ SpaceAfter=No
22 , , PUNCT , _ 25 punct _ _
23 the the DET DT Definite=Def|PronType=Art 25 det _ Entity=(abstract-144|Bridge=142<144
24 first first ADJ JJ Degree=Pos|NumType=Ord 25 amod _ _
25 volume volume NOUN NN Number=Sing 21 parataxis _ Entity=144)
26 at at ADP IN _ 27 case _ _
27 Moscow Moscow PROPN NNP Number=Sing 25 orphan _ Entity=(place-95)
28 in in ADP IN _ 29 case _ _
29 1849 1849 NUM CD NumType=Card 25 orphan _ Entity=(time-145)|SpaceAfter=No
30 , , PUNCT , _ 32 punct _ _
31 the the DET DT Definite=Def|PronType=Art 32 det _ Entity=(abstract-146|Bridge=142<146
32 second second ADJ JJ Degree=Pos|NumType=Ord 25 conj _ Entity=146)
33 at at ADP IN _ 34 case _ _
34 St St PROPN NNP Number=Sing 32 orphan _ Entity=(place-147
35 Petersburg Petersburg PROPN NNP Number=Sing 34 flat _ Entity=147)
Discontinuities can be expressed by repeating the same markable ID multiple times (e.g. M times in total for a span with M parts), with a suffix denoting for the Nth occurrence that it is part N/M. The suggested notation is (15-abstract[1/2]
for the opening position of a two-part abstract entity span with the group identifier 15, as in the following example:
# speaker = DrCellini
1 So so INTJ UH _ 9 discourse 9:discourse Discourse=ROOT:17
2 so so ADV RB _ 3 advmod 3:advmod _
3 far far ADV RB Degree=Pos 9 advmod 9:advmod _
4 this this DET DT Number=Sing|PronType=Dem 5 det 5:det Entity=(time-14
5 morning morning NOUN NN Number=Sing 9 obl:tmod 9:obl:tmod Entity=14)
6 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 9 nsubj 9:nsubj Entity=(person-4)|SpaceAfter=No
7 've have AUX VBP Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 9 aux 9:aux _
8 just just ADV RB _ 9 advmod 9:advmod _
9 read read VERB VBN Tense=Past|VerbForm=Part 0 root 0:root _
10 a a DET DT Definite=Ind|PronType=Art 12 det 12:det Entity=(abstract-15[1/2]
11 few few ADJ JJ Degree=Pos 12 amod 12:amod _
12 studies study NOUN NNS Number=Plur 9 obj 9:obj Entity=15[1/2])
13 ( ( PUNCT -LRB- _ 18 punct 18:punct Discourse=evaluation:18->17|SpaceAfter=No
14 it it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 18 nsubj 18:nsubj Entity=(abstract-16)|SpaceAfter=No
15 's be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 18 cop 18:cop _
16 a a DET DT Definite=Ind|PronType=Art 17 det 17:det _
17 little little ADV RB Degree=Pos 18 obl:npmod 18:obl:npmod _
18 slow slow ADJ JJ Degree=Pos 9 parataxis 9:parataxis _
19 in in ADP IN _ 21 case 21:case _
20 the the DET DT Definite=Def|PronType=Art 21 det 21:det Entity=(time-17
21 morning morning NOUN NN Number=Sing 18 obl 18:obl:in Entity=17)|SpaceAfter=No
22 ) ) PUNCT -RRB- _ 18 punct 18:punct _
23 of of ADP IN _ 26 case 26:case Discourse=same-unit:19->17|Entity=(abstract-15[2/2]
24 general general ADJ JJ Degree=Pos 26 amod 26:amod Entity=(abstract-8
25 nuclear nuclear ADJ JJ Degree=Pos 26 amod 26:amod _
26 medicine medicine NOUN NN Number=Sing 12 nmod 12:nmod:of Entity=15[2/2])8)
Metadata can be specified using a comment line following the # newdoc
declaration using a special prefix meta::
. For example:
# newdoc id = maz-00001
# global.Entity = GRP-phrase_type-complex_np-np_form-referentiality-grammatical_role-ambiguity
# meta::id = 00001
# meta::date = 30.10.2002
# meta::section = UNKNOWN
# meta::author = STEPHAN BREIDING
# meta::header = Auf Eis gelegt
# sent_id = maz-00001-1
1 Auf auf ADP APPR _ 2 case _ _
2 Eis Eis NOUN NN _ 3 obl _ _
3 gelegt legen VERB VVPP _ 0 root _ _
# sent_id = maz-00001-2
1 Dagmar Dagmar PROPN NE _ 3 nsubj _ Entity=(1-np-simple-ne-new-sbj-unambig
2 Ziegler Ziegler PROPN NE _ 1 flat _ Entity=1)
3 sitzt sitzen VERB VVFIN _ 0 root _ _
4 in in ADP APPR _ 6 case _ Entity=(2-pp-simple-defnp-discourse_cataphor-other-unambig
5 der der DET ART _ 6 det _ _
6 Schuldenfalle Schuldenfalle NOUN NN _ 3 obl _ Entity=2)
7 . -- PUNCT $. _ 3 punct _ _