Adding Named Entity annotations #562

fredrijo · 2018-08-01T07:56:34Z

The Norwegian Bokmål part of UD is based on the Norwegian Dependency Treebank (NDT). NDT has now been extended with named entity annotations and is going to be redistributed with these annotations in addition to the linguistic (syntactic and morphological) annotations.

We are interested in distributing these named entity annotations as part of UD as well, if this is desirable.

Questions:

1. Do you want to add named entity annotations to UD?
2. If you are ok with adding NE annotations (1.), should they be included in the MISC field?
3. If they should be added to the MISC field (2.), is there a preferred attribute name (cf. the paragraph about "Other Miscellaneous Attributes")? If any other treebanks in UD already have NE annotations, I assume it is preferred to use the same attribute name. If not, then I'll let you decide on a name.
4. Do you prefer any particular convention for annotating entity boundaries/scope? Currently, the annotations are in the IOB2 format, but this can be changed to e.g. token indexes or something else if desired.
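
To make the last question concrete, here is a minimal sketch of converting IOB2 tags into token-index spans. It is only an illustration under stated assumptions: the function name `iob2_to_spans` and the labels `PER` and `ORG` are hypothetical and are not taken from NDT.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert IOB2 tags to (start, end, type) spans.

    Token indexes are 1-based and inclusive, as in the CoNLL-U ID column.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags, start=1):
        # An open span ends at "O", at a new "B-", or at an "I-" of another type.
        ends_here = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != etype)
        )
        if ends_here and start is not None:
            spans.append((start, i - 1, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Stray "I-" without a preceding "B-": leniently start a span here.
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


# Hypothetical labels, for illustration only.
print(iob2_to_spans(["B-PER", "I-PER", "O", "B-ORG", "O"]))
```

Going in this direction loses nothing, since IOB2 cannot express nested or overlapping entities in the first place.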

jnivre · 2018-08-01T08:05:33Z

Adding more annotation layers to UD treebanks is of course always welcome, since it makes the treebanks more useful. However, it is still not clear to me what the best way to do this in practice is. If the information is going to be added to the CoNLL-U files, then the MISC field is the only possibility, and it would of course be preferable if this could be done in a uniform way across treebanks, but currently there is no proposal for NE (as far as I know). The other option is a kind of standoff annotation, where the annotation is stored in a separate file that references the CoNLL-U file. There is an ongoing effort to develop such a format for multiword expressions and to generalise it to a standard that can also be used for other types of annotation. This work has been stalled for a while, but the plan is to resume it in September. Perhaps it would make sense to include NE as another test case here.

fredrijo · 2018-08-01T08:23:11Z

Thank you for the answer. Treating multiword expressions and named entities in a uniform manner makes a lot of sense! I will await your decision on the standoff format then.

amir-zeldes · 2018-08-01T17:52:28Z

One way of including this information would be to use a similar convention to WebAnno TSV:

#Text=Water Off a Black Dog’s Back
2-1	25-30	Water	substance	new	_	_	
2-2	31-34	Off	_	_	_	_	
2-3	35-36	a	object[3]	new[3]	_	_	
2-4	37-42	Black	object[3]|animal[4]	new[3]|new[4]	_	_	
2-5	43-46	Dog	object[3]|animal[4]	new[3]|new[4]	_	_	
2-6	47-49	’s	object[3]	new[3]	_	_	
2-7	50-54	Back	object[3]	new[3]	_	_

Only multitoken spans get a span ID in WebAnno TSV, but one could assign an ID to all entity annotations for consistency. Using the MISC field, one could write:

1	Water	Water	PROPN	NNP	Number=Sing	0	root	_	Entity[1]=substance
2	Off	off	ADP	IN	_	7	case	_	_
3	a	a	DET	DT	Definite=Ind|PronType=Art	5	det	_	Entity[2]=animal|Entity[3]=object
4	Black	Black	PROPN	NNP	Number=Sing	5	amod	_	Entity[2]=animal|Entity[3]=object
5	Dog	Dog	PROPN	NNP	Number=Sing	7	nmod:poss	_	SpaceAfter=No|Entity[2]=animal|Entity[3]=object
6	’s	's	PART	POS	_	5	case	_	Entity[3]=object
7	Back	Back	PROPN	NNP	Number=Sing	1	nmod	_	Entity[3]=object

We have complete entity type information (including named, non-named and pronoun mentions) for UD_English-GUM as well, which we would be happy to include in the UD release. For more on WebAnno TSV see:

https://webanno.github.io/webanno/releases/3.2.2/docs/user-guide.html#sect_webannotsv
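
For illustration, the MISC encoding suggested in this comment (`Entity[n]=type` on every covered token) could be read back into spans roughly as follows. This is a sketch, not an agreed convention: the function name `entities_from_misc` and the `(token_id, misc)` input shape are assumptions made for the example.

```python
import re

# Matches one MISC attribute of the form Entity[<span id>]=<type>.
ENTITY_RE = re.compile(r"^Entity\[(\d+)\]=(.+)$")


def entities_from_misc(rows):
    """Collect entity spans from (token_id, misc) pairs.

    Returns {span_id: (entity_type, [token_ids])}.
    """
    spans = {}
    for tok_id, misc in rows:
        if misc == "_":
            continue
        for attr in misc.split("|"):
            m = ENTITY_RE.match(attr)
            if m:
                span_id, etype = int(m.group(1)), m.group(2)
                # Other MISC attributes such as SpaceAfter=No are ignored.
                spans.setdefault(span_id, (etype, []))[1].append(tok_id)
    return spans


# ID and MISC columns of the example sentence above.
rows = [
    (1, "Entity[1]=substance"),
    (2, "_"),
    (3, "Entity[2]=animal|Entity[3]=object"),
    (4, "Entity[2]=animal|Entity[3]=object"),
    (5, "SpaceAfter=No|Entity[2]=animal|Entity[3]=object"),
    (6, "Entity[3]=object"),
    (7, "Entity[3]=object"),
]
print(entities_from_misc(rows))
```

Because every covered token carries the span ID, nested spans such as the animal inside the object can be recovered without any bracket matching.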

savary · 2018-08-01T20:27:35Z

The definition of a beta version of the standoff format, mentioned by Joakim, can be found here. Note that this definition includes:

a meta format, described in the "Extended CoNLL-U format" section; it allows any initiative (like PARSEME) to add extra columns to conllu
an instantiation (called .cupt) of this meta-format for multiword expression annotations, as defined in PARSEME guidelines

.cupt was used by the PARSEME corpus of verbal MWE in a recent shared task. We are now gathering feedback on this format before we discuss it further with the UD core group.

In the future, PARSEME aims at extending .cupt so as to include all kinds of MWEs, not only verbal, and multiword named entities will also be considered. Collaborations are welcome.

arademaker · 2018-08-04T01:00:15Z

Hi @amir-zeldes, regarding your suggestion #562 (comment), why do we need to type all tokens? Why not only the head of the entity?

foxik · 2018-08-04T06:56:20Z

In a project here in Prague where we annotated NEs and NE linking, we ended up using sentence-level comments to embed the information in JSON format. We also used document-level comments that link all NEs in the document. Our goal was compatibility with the CoNLL-U format (so that the documents themselves can be processed by any CoNLL-U compatible tool), while allowing very general extensibility.

An example follows (with added line breaks for readability; all comments should be single-line):

# newdoc id = ukázková analýza
# doc_json_entities = [
    {"id":"e1","type":"Osoba","mentions":["s1#m1","s2#m1","s2#m3"],
     "labels":["Jan Novák, nar. 01.01.1901"]},
    {"id":"e2","type":"Firma","mentions":["s1#m2"],
     "labels":["Novák a synové"]},
    {"id":"e3","type":"Adresa","mentions":["s1#m3"],
     "labels":["Václavské náměstí 1, Praha 1, 110 00"]},
    {"id":"e4","type":"Auto","mentions":["s2#m2"],
     "labels":["Volvo"]},
    {"id":"e5","type":"Zbraň","mentions":["s2#m4"],
     "labels":["Glock 42"]}]
# doc_json_relations = [
    {"id":"r1","from":"e1","to":"e2","type":"StejnaVeta"},
    {"id":"r2","from":"e2","to":"e3","type":"StejnaVeta"},
    {"id":"r3","from":"e1","to":"e4","type":"StejnaVeta"},
    {"id":"r4","from":"e1","to":"e5","type":"StejnaVeta"}]
# doc_json_summaries = ["Aktiva Jana Nováka"]
# sent_id = s1
# text = Jan Novák, nar. 01.01.1901, tel. 720123456, vlastní firmu
    Novák a synové, která sídlí na adrese Václavské náměstí 1,
    Praha 1, 110 00.
# json_mentions = [
    {"id":"s1#m1","type":"Osoba","span":[1,2,3,4,5,6,7,8]},
    {"id":"s1#m2","type":"Firma","span":[16,17,18]},
    {"id":"s1#m3","type":"Adresa","span":[24,25,26,27,28,29,30,31,32]}]
1  	Jan       	Jan       	_ 	NNMS1-----A---- 	_ 	_ 	_ 	_ 	_
2  	Novák     	Novák     	_ 	NNMS1-----A---- 	_ 	_ 	_ 	_ 	SpaceAfter=No
3  	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
4  	nar       	narozený  	_ 	AAXXX----1A---8 	_ 	_ 	_ 	_ 	SpaceAfter=No
5  	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
6  	01.01     	01.01     	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
7  	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
8  	1901      	1901      	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
9  	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
10 	tel       	telefon   	_ 	NNIXX-----A---8 	_ 	_ 	_ 	_ 	SpaceAfter=No
11 	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
12 	720123456 	720123456 	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
13 	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
14 	vlastní   	vlastní   	_ 	AAFS4----1A---- 	_ 	_ 	_ 	_ 	_
15 	firmu     	firma     	_ 	NNFS4-----A---- 	_ 	_ 	_ 	_ 	_
16 	Novák     	Novák     	_ 	NNMS1-----A---- 	_ 	_ 	_ 	_ 	_
17 	a         	a         	_ 	J^------------- 	_ 	_ 	_ 	_ 	_
18 	synové    	syn       	_ 	NNMP1-----A---- 	_ 	_ 	_ 	_ 	SpaceAfter=No
19 	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
20 	která     	který     	_ 	P4FS1---------- 	_ 	_ 	_ 	_ 	_
21 	sídlí     	sídlit    	_ 	VB-S---3P-AA--- 	_ 	_ 	_ 	_ 	_
22 	na        	na        	_ 	RR--6---------- 	_ 	_ 	_ 	_ 	_
23 	adrese    	adresa    	_ 	NNFS6-----A---- 	_ 	_ 	_ 	_ 	_
24 	Václavské 	václavský 	_ 	AANS1----1A---- 	_ 	_ 	_ 	_ 	_
25 	náměstí   	náměstí   	_ 	NNNS1-----A---- 	_ 	_ 	_ 	_ 	_
26 	1         	1         	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
27 	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
28 	Praha     	Praha     	_ 	NNFS1-----A---- 	_ 	_ 	_ 	_ 	_
29 	1         	1         	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
30 	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
31 	110       	110       	_ 	C=------------- 	_ 	_ 	_ 	_ 	_
32 	00        	00        	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
33 	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_

# sent_id = s2
# text = Další majetek pana Nováka čítá vozidlo značky Volvo a Novákovu
    legálně drženou zbraň Glock 42.
# json_mentions = [
    {"id":"s2#m1","type":"Osoba","span":[4]},
    {"id":"s2#m2","type":"Auto","span":[8]},
    {"id":"s2#m3","type":"Osoba","span":[10]},
    {"id":"s2#m4","type":"Zbraň","span":[14,15]}]
1  	Další     	další     	_ 	AAIS1----1A---- 	_ 	_ 	_ 	_ 	_
2  	majetek   	majetek   	_ 	NNIS1-----A---- 	_ 	_ 	_ 	_ 	_
3  	pana      	pan       	_ 	NNMS2-----A---- 	_ 	_ 	_ 	_ 	_
4  	Nováka    	Novák     	_ 	NNMS2-----A---- 	_ 	_ 	_ 	_ 	_
5  	čítá      	čítat     	_ 	VB-S---3P-AA--- 	_ 	_ 	_ 	_ 	_
6  	vozidlo   	vozidlo   	_ 	NNNS4-----A---- 	_ 	_ 	_ 	_ 	_
7  	značky    	značka    	_ 	NNFS2-----A---- 	_ 	_ 	_ 	_ 	_
8  	Volvo     	Volvo     	_ 	NNNS1-----A---- 	_ 	_ 	_ 	_ 	_
9  	a         	a         	_ 	J^------------- 	_ 	_ 	_ 	_ 	_
10 	Novákovu  	Novákův   	_ 	AUFS4M--------- 	_ 	_ 	_ 	_ 	_
11 	legálně   	legálně   	_ 	Dg-------1A---- 	_ 	_ 	_ 	_ 	_
12 	drženou   	držený    	_ 	AAFS4----1A---- 	_ 	_ 	_ 	_ 	_
13 	zbraň     	zbraň     	_ 	NNFS4-----A---- 	_ 	_ 	_ 	_ 	_
14 	Glock     	Glock     	_ 	NNMS1-----A---- 	_ 	_ 	_ 	_ 	_
15 	42        	42        	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
16 	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_

I am not proposing standardizing this or anything, just showing what we use to extend CoNLL-U :-)
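
As a rough sketch of how such comments can be consumed, the snippet below reads a single-line `# json_mentions = ...` comment and resolves each mention's `span` to the token forms. The function name `mention_texts` and the three-token sentence are made up for the example; only the comment key and the `id`/`type`/`span` fields come from the example above.

```python
import json

PREFIX = "# json_mentions = "


def mention_texts(sentence_block: str) -> dict:
    """Map each mention id to the space-joined forms of the tokens in its span."""
    mentions, forms = [], {}
    for line in sentence_block.splitlines():
        if line.startswith(PREFIX):
            # The comment must be on a single line to be valid JSON here.
            mentions = json.loads(line[len(PREFIX):])
        elif line and not line.startswith("#"):
            # Columns may be padded with spaces, as in the example above.
            cols = [c.strip() for c in line.split("\t")]
            forms[cols[0]] = cols[1]
    return {m["id"]: " ".join(forms[str(i)] for i in m["span"]) for m in mentions}


# Hypothetical three-token sentence, for illustration only.
block = "\n".join([
    "# sent_id = s1",
    '# json_mentions = [{"id":"s1#m1","type":"Zbraň","span":[1,2]}]',
    "1\tGlock\tGlock\t_\t_\t_\t_\t_\t_\t_",
    "2\t42\t42\t_\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "3\t.\t.\t_\t_\t_\t_\t_\t_\t_",
])
print(mention_texts(block))
```

A tool that does not know about these comments simply skips them, which is the compatibility property described above.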

amir-zeldes · 2018-08-04T06:58:36Z

Hi all - @savary I don't mean to propose a new format if one is already established, I'm also happy to use whatever else is agreed on. Just suggesting that there are already some representations of spans in a CoNLL-U-like format that could be reused for this.

@arademaker - marking the head is not sufficient for at least two main reasons:

1. The exact extent of the entity annotation is not necessarily predictable from the head, since some relations typically continue an entity (e.g. an amod is included in an entity span) and some do not (e.g. nsubj to a nominal predicate is not part of the predicate entity span). In my experience it is not 100% possible to formulate rules determining the entity span from relation types, since some are ambiguous (e.g. dep, which may or may not be part of the entity) and some are treebank/language specific.
2. Some tokens can be the head of two conflicting entities, especially in coordinations: we might have two coordinated entities with different types (e.g. person + organization), and we might have a guideline that gives the entire NP a different type (e.g. mixed or abstract). In this case, the first noun will be the head of two distinct entity annotations with different spans.

amir-zeldes · 2018-08-04T07:04:26Z

Just seeing @foxik's post - for completeness I should say that we also have a different working solution at the moment. We have parallel WebAnno files next to the CoNLL-U syntax files, and both have the same tokenization, so it's easy to merge them. We didn't consider putting entities directly into the CoNLL-U files since we also have coreference for these entities, and we're already annotating them in WebAnno.

As for standoff, we also have a standoff XML representation of all annotations in the corpus (entities, coref, discourse, TEI tags...), which is expressed using PAULA XML but is automatically generated from the CoNLL-U, WebAnno and other formats, so it's never manually edited.

dan-zeman · 2018-09-17T09:59:42Z

See http://universaldependencies.org/ext-format.html for the specification of the CoNLL-U Plus file format.

gcelano · 2018-09-17T15:05:26Z

I think that keeping the original text separate from every kind of annotation, and referring to it via offsets even for sentence splitting and tokenization, is the most scalable (perhaps the only scalable) approach. Sentence splitting and tokenization are of course kinds of annotation, and one should consider that different segmentations may be needed.

One can think of different serializations for this, but PAULA XML is an effective, already available solution. PAULA is also a very relevant solution because an increasing number of texts are natively encoded in TEI-XML.

arademaker · 2018-09-18T17:12:31Z

This thread started by asking about named entity annotation, but at the http://universaldependencies.org/ext-format.html link I found only references to the PARSEME schema for VMWEs. The IOB2 format mentioned by @fredrijo has some variations, so what do people believe is the best schema for named entities in a CoNLL-U Plus file?

amir-zeldes · 2018-09-18T18:33:56Z

I've already pointed out WebAnno format as a candidate for a stand-alone format above, but if this is still an open discussion, another candidate to throw into the mix is to just use the MISC column with entity type brackets within the present CoNLL-U format.

This can also be complemented by coref IDs as used by the CoNLL-coref scorer, like so:

1	My	_	PRON	PRP$	_	2	nmod	_	Entity=(1-person)(2-person
2	sister	_	NOUN	NN	_	3	nsubj	_	Entity=2-person)
3	saw	_	VERB	VBD	_	0	root	_	_
4	herself	_	PRON	PRP	_	3	obj	_	Entity=(2-person)
5	in	_	ADP	IN	_	7	case	_	_
6	the	_	DET	DT	_	7	det	_	Entity=(3-object
7	mirror	_	NOUN	NN	_	3	obl	_	Entity=3-object)
8	.	_	PUNCT	.	_	3	punct	_	_
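
A minimal sketch of decoding this bracket notation into spans, assuming each piece has the shape `(id-type`, `id-type)` or `(id-type)` exactly as in the example above. The function name `bracket_spans` and the `(token_id, misc)` input shape are assumptions for illustration, not part of any agreed format.

```python
import re

# One piece: optional "(", numeric id, "-", type, optional ")".
PIECE_RE = re.compile(r"(\()?(\d+)-([^()]+)(\))?")


def bracket_spans(rows):
    """Decode Entity= brackets from (token_id, misc) pairs.

    Returns a list of (entity_id, entity_type, start_token, end_token).
    """
    open_spans = {}  # (entity_id, type) -> stack of start token ids
    result = []
    for tok_id, misc in rows:
        for attr in misc.split("|"):
            if not attr.startswith("Entity="):
                continue
            value = attr[len("Entity="):]
            for opens, ent_id, etype, closes in PIECE_RE.findall(value):
                key = (int(ent_id), etype)
                if opens:
                    open_spans.setdefault(key, []).append(tok_id)
                if closes:
                    # Close the most recently opened span with this id and type.
                    start = open_spans[key].pop()
                    result.append((key[0], etype, start, tok_id))
    return result


# ID and MISC columns of the example sentence above.
rows = [
    (1, "Entity=(1-person)(2-person"),
    (2, "Entity=2-person)"),
    (3, "_"),
    (4, "Entity=(2-person)"),
    (5, "_"),
    (6, "Entity=(3-object"),
    (7, "Entity=3-object)"),
    (8, "_"),
]
print(bracket_spans(rows))
```

Mentions that share an ID (here the two spans with ID 2) are coreferent, so grouping the output by ID yields the coreference chains.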

fredrijo changed the title ~~Adding Named Entity annotation~~ Adding Named Entity annotations Aug 1, 2018

dan-zeman added this to the later milestone Nov 13, 2018

arademaker mentioned this issue Jan 6, 2020

Sentence-level features? #681

Closed

nschneid mentioned this issue Jul 23, 2020

PROPN or NOUN UniversalDependencies/UD_English-EWT#91

Closed
