Adding Named Entity annotations #562

Open
fredrijo opened this Issue Aug 1, 2018 · 12 comments

fredrijo commented Aug 1, 2018

The Norwegian Bokmål part of UD is based on the Norwegian Dependency Treebank (NDT). NDT has now been extended with named entity annotations, and is going to be redistributed with these annotations in addition to the linguistic (syntactic and morphological) annotations.

We are interested in distributing these names as part of UD as well, if this is desirable.

Questions:

  1. Do you want to add named entity annotations to UD?
  2. If you are ok with adding NE annotations (1.), should they be included in the MISC field?
  3. If they should be added to the MISC field (2.), is there a preferred attribute name (cf. the paragraph about "Other Miscellaneous Attributes")? If there are any other treebanks in UD that already have NE annotations, I assume it is preferred to use the same attribute name. If not, then I'll let you decide on a name.
  4. Do you prefer any particular convention for annotating entity boundaries/scope? Currently, the annotations are in the IOB2 format, but this can be changed to e.g. token indexes or something else if desired.
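
To make the two options in question 4 concrete, here is a minimal sketch of converting IOB2 tags into token-index spans. The label names (`PER`, `ORG`) are hypothetical placeholders, not the NDT label set, and the lenient handling of a stray `I-` tag is an assumption rather than part of any agreed convention.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert IOB2 tags to (first_token, last_token, type) spans (1-based, inclusive)."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags, start=1):
        # The open span continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i - 1, label))
            start, label = None, None
        # B- always opens a span; a stray I- is leniently treated as an opening too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical tag sequence for a five-token sentence.
example_tags = ["B-PER", "I-PER", "O", "B-ORG", "O"]
example_spans = iob2_to_spans(example_tags)
print(example_spans)
```

Either representation carries the same information for flat, non-overlapping entities; IOB2 cannot express nested or overlapping spans, whereas explicit index spans can.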

@fredrijo fredrijo changed the title Adding Named Entity annotation Adding Named Entity annotations Aug 1, 2018

Contributor

jnivre commented Aug 1, 2018

Adding more annotation layers to UD treebanks is of course always welcome, since it makes the treebanks more useful. However, it is still not clear to me what the best way to do this in practice is. If the information is going to be added to the CoNLL-U files, then the MISC field is the only possibility, and it would of course be preferable if this could be done in a uniform way across treebanks, but currently there is no proposal for NE (as far as I know). The other option is to do a kind of standoff annotation, where the annotation is stored in a separate file that references the CoNLL-U file. There is an ongoing effort to develop such a format for multiword expressions and to generalise this to a standard that can also be used for other types of annotations. This work has been stalled for a while, but the plan is to resume it in September. Perhaps it would make sense to include NE as another test case here.

Author

fredrijo commented Aug 1, 2018

Thank you for the answer. Treating multiword expressions and named entities in a uniform manner makes a lot of sense! I will await your decision on the standoff format then.

Contributor

amir-zeldes commented Aug 1, 2018

One way of including this information would be to use a similar convention to WebAnno TSV:

#Text=Water Off a Black Dog’s Back
2-1	25-30	Water	substance	new	_	_	
2-2	31-34	Off	_	_	_	_	
2-3	35-36	a	object[3]	new[3]	_	_	
2-4	37-42	Black	object[3]|animal[4]	new[3]|new[4]	_	_	
2-5	43-46	Dog	object[3]|animal[4]	new[3]|new[4]	_	_	
2-6	47-49	’s	object[3]	new[3]	_	_	
2-7	50-54	Back	object[3]	new[3]	_	_	

Only multitoken spans get a span ID in WebAnno TSV, but one could assign it to all entity annotations for consistency. Using the MISC field one could write:

1	Water	Water	PROPN	NNP	Number=Sing	0	root	_	Entity[1]=substance
2	Off	off	ADP	IN	_	7	case	_	_
3	a	a	DET	DT	Definite=Ind|PronType=Art	5	det	_	Entity[2]=animal|Entity[3]=object
4	Black	Black	PROPN	NNP	Number=Sing	5	amod	_	Entity[2]=animal|Entity[3]=object
5	Dog	Dog	PROPN	NNP	Number=Sing	7	nmod:poss	_	SpaceAfter=No|Entity[2]=animal|Entity[3]=object
6	’s	's	PART	POS	_	5	case	_	Entity[3]=object
7	Back	Back	PROPN	NNP	Number=Sing	1	nmod	_	Entity[3]=object

We have complete entity type information (including named, non-named and pronoun mentions) for UD_English-GUM as well, which we would be happy to include in the UD release. For more on WebAnno TSV see:

https://webanno.github.io/webanno/releases/3.2.2/docs/user-guide.html#sect_webannotsv
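
As an illustration of how the `Entity[n]=type` MISC convention above could be read back into spans, here is a minimal sketch. The function name and the returned structure are assumptions made for this example; the attribute syntax is the one proposed in this comment, not an established UD convention.

```python
import re
from collections import defaultdict

# Matches MISC attributes of the proposed form Entity[3]=object.
ENTITY_ATTR = re.compile(r"Entity\[(\d+)\]=(.+)")


def entity_spans(conllu_lines):
    """Return {span_id: (type, [token ids])} collected from the MISC column."""
    spans = defaultdict(lambda: [None, []])
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        token_id, misc = cols[0], cols[9]
        if not token_id.isdigit() or misc == "_":
            continue  # skip multiword-token ranges, empty nodes, empty MISC
        for attr in misc.split("|"):
            match = ENTITY_ATTR.fullmatch(attr)
            if match:
                span_id, etype = int(match.group(1)), match.group(2)
                spans[span_id][0] = etype
                spans[span_id][1].append(int(token_id))
    return {span_id: (etype, ids) for span_id, (etype, ids) in spans.items()}


sample_sentence = [
    "1\tWater\tWater\tPROPN\tNNP\tNumber=Sing\t0\troot\t_\tEntity[1]=substance",
    "2\tOff\toff\tADP\tIN\t_\t7\tcase\t_\t_",
    "3\ta\ta\tDET\tDT\tDefinite=Ind|PronType=Art\t5\tdet\t_\tEntity[2]=animal|Entity[3]=object",
    "4\tBlack\tBlack\tPROPN\tNNP\tNumber=Sing\t5\tamod\t_\tEntity[2]=animal|Entity[3]=object",
    "5\tDog\tDog\tPROPN\tNNP\tNumber=Sing\t7\tnmod:poss\t_\tSpaceAfter=No|Entity[2]=animal|Entity[3]=object",
    "6\t’s\t's\tPART\tPOS\t_\t5\tcase\t_\tEntity[3]=object",
    "7\tBack\tBack\tPROPN\tNNP\tNumber=Sing\t1\tnmod\t_\tEntity[3]=object",
]
sample_spans = entity_spans(sample_sentence)
print(sample_spans)
```

Because every token carries the span ID, nested and overlapping entities are recoverable without any reference to the dependency tree.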

Contributor

savary commented Aug 1, 2018

The definition of a beta version of the standoff format, mentioned by Joakim, can be found here. Note that this definition includes:

  • a meta format, described in the "Extended CoNLL-U format" section; it allows any initiative (like PARSEME) to add extra columns to conllu
  • an instantiation (called .cupt) of this meta-format for multiword expression annotations, as defined in PARSEME guidelines

.cupt was used by the PARSEME corpus of verbal MWE in a recent shared task. We are now gathering feedback on this format before we discuss it further with the UD core group.

In the future, PARSEME aims at extending .cupt so as to include all kinds of MWEs, not only verbal, and multiword named entities will also be considered. Collaborations are welcome.

Contributor

arademaker commented Aug 4, 2018

Hi @amir-zeldes, regarding your suggestion #562 (comment), why do we need to type all tokens? Why not only the head of the entity?

Member

foxik commented Aug 4, 2018

In a project here in Prague, we annotated NEs and NE linking, and we ended up using sentence-level comments to embed the information in JSON format. We also used document-level comments which linked all NEs in the document. Our goal was compatibility with the CoNLL-U format (so that the documents themselves can be processed by any CoNLL-U compatible tool), while allowing very general extensibility.

An example follows (with added line breaks for readability; all comments should be single-line):

# newdoc id = ukázková analýza
# doc_json_entities = [
    {"id":"e1","type":"Osoba","mentions":["s1#m1","s2#m1","s2#m3"],
     "labels":["Jan Novák, nar. 01.01.1901"]},
    {"id":"e2","type":"Firma","mentions":["s1#m2"],
     "labels":["Novák a synové"]},
    {"id":"e3","type":"Adresa","mentions":["s1#m3"],
     "labels":["Václavské náměstí 1, Praha 1, 110 00"]},
    {"id":"e4","type":"Auto","mentions":["s2#m2"],
     "labels":["Volvo"]},
    {"id":"e5","type":"Zbraň","mentions":["s2#m4"],
     "labels":["Glock 42"]}]
# doc_json_relations = [
    {"id":"r1","from":"e1","to":"e2","type":"StejnaVeta"},
    {"id":"r2","from":"e2","to":"e3","type":"StejnaVeta"},
    {"id":"r3","from":"e1","to":"e4","type":"StejnaVeta"},
    {"id":"r4","from":"e1","to":"e5","type":"StejnaVeta"}]
# doc_json_summaries = ["Aktiva Jana Nováka"]
# sent_id = s1
# text = Jan Novák, nar. 01.01.1901, tel. 720123456, vlastní firmu
    Novák a synové, která sídlí na adrese Václavské náměstí 1,
    Praha 1, 110 00.
# json_mentions = [
    {"id":"s1#m1","type":"Osoba","span":[1,2,3,4,5,6,7,8]},
    {"id":"s1#m2","type":"Firma","span":[16,17,18]},
    {"id":"s1#m3","type":"Adresa","span":[24,25,26,27,28,29,30,31,32]}]
1  	Jan       	Jan       	_ 	NNMS1-----A---- 	_ 	_ 	_ 	_ 	_
2  	Novák     	Novák     	_ 	NNMS1-----A---- 	_ 	_ 	_ 	_ 	SpaceAfter=No
3  	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
4  	nar       	narozený  	_ 	AAXXX----1A---8 	_ 	_ 	_ 	_ 	SpaceAfter=No
5  	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
6  	01.01     	01.01     	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
7  	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
8  	1901      	1901      	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
9  	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
10 	tel       	telefon   	_ 	NNIXX-----A---8 	_ 	_ 	_ 	_ 	SpaceAfter=No
11 	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
12 	720123456 	720123456 	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
13 	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
14 	vlastní   	vlastní   	_ 	AAFS4----1A---- 	_ 	_ 	_ 	_ 	_
15 	firmu     	firma     	_ 	NNFS4-----A---- 	_ 	_ 	_ 	_ 	_
16 	Novák     	Novák     	_ 	NNMS1-----A---- 	_ 	_ 	_ 	_ 	_
17 	a         	a         	_ 	J^------------- 	_ 	_ 	_ 	_ 	_
18 	synové    	syn       	_ 	NNMP1-----A---- 	_ 	_ 	_ 	_ 	SpaceAfter=No
19 	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
20 	která     	který     	_ 	P4FS1---------- 	_ 	_ 	_ 	_ 	_
21 	sídlí     	sídlit    	_ 	VB-S---3P-AA--- 	_ 	_ 	_ 	_ 	_
22 	na        	na        	_ 	RR--6---------- 	_ 	_ 	_ 	_ 	_
23 	adrese    	adresa    	_ 	NNFS6-----A---- 	_ 	_ 	_ 	_ 	_
24 	Václavské 	václavský 	_ 	AANS1----1A---- 	_ 	_ 	_ 	_ 	_
25 	náměstí   	náměstí   	_ 	NNNS1-----A---- 	_ 	_ 	_ 	_ 	_
26 	1         	1         	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
27 	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
28 	Praha     	Praha     	_ 	NNFS1-----A---- 	_ 	_ 	_ 	_ 	_
29 	1         	1         	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
30 	,         	,         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_
31 	110       	110       	_ 	C=------------- 	_ 	_ 	_ 	_ 	_
32 	00        	00        	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
33 	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_

# sent_id = s2
# text = Další majetek pana Nováka čítá vozidlo značky Volvo a Novákovu
    legálně drženou zbraň Glock 42.
# json_mentions = [
    {"id":"s2#m1","type":"Osoba","span":[4]},
    {"id":"s2#m2","type":"Auto","span":[8]},
    {"id":"s2#m3","type":"Osoba","span":[10]},
    {"id":"s2#m4","type":"Zbraň","span":[14,15]}]
1  	Další     	další     	_ 	AAIS1----1A---- 	_ 	_ 	_ 	_ 	_
2  	majetek   	majetek   	_ 	NNIS1-----A---- 	_ 	_ 	_ 	_ 	_
3  	pana      	pan       	_ 	NNMS2-----A---- 	_ 	_ 	_ 	_ 	_
4  	Nováka    	Novák     	_ 	NNMS2-----A---- 	_ 	_ 	_ 	_ 	_
5  	čítá      	čítat     	_ 	VB-S---3P-AA--- 	_ 	_ 	_ 	_ 	_
6  	vozidlo   	vozidlo   	_ 	NNNS4-----A---- 	_ 	_ 	_ 	_ 	_
7  	značky    	značka    	_ 	NNFS2-----A---- 	_ 	_ 	_ 	_ 	_
8  	Volvo     	Volvo     	_ 	NNNS1-----A---- 	_ 	_ 	_ 	_ 	_
9  	a         	a         	_ 	J^------------- 	_ 	_ 	_ 	_ 	_
10 	Novákovu  	Novákův   	_ 	AUFS4M--------- 	_ 	_ 	_ 	_ 	_
11 	legálně   	legálně   	_ 	Dg-------1A---- 	_ 	_ 	_ 	_ 	_
12 	drženou   	držený    	_ 	AAFS4----1A---- 	_ 	_ 	_ 	_ 	_
13 	zbraň     	zbraň     	_ 	NNFS4-----A---- 	_ 	_ 	_ 	_ 	_
14 	Glock     	Glock     	_ 	NNMS1-----A---- 	_ 	_ 	_ 	_ 	_
15 	42        	42        	_ 	C=------------- 	_ 	_ 	_ 	_ 	SpaceAfter=No
16 	.         	.         	_ 	Z:------------- 	_ 	_ 	_ 	_ 	_

I am not proposing standardizing this or anything, just showing what we use to extend CoNLL-U :-)
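
For readers who want to try the comment-based approach shown above, here is a minimal sketch of reading a `# json_mentions = [...]` comment with the standard `json` module. It assumes the comment is on a single line, as stated above; the helper names are hypothetical and are not taken from the Prague project's tooling.

```python
import json

MENTIONS_PREFIX = "# json_mentions = "


def read_mentions(sentence_lines):
    """Return the list stored in a '# json_mentions = [...]' comment, or []."""
    for line in sentence_lines:
        if line.startswith(MENTIONS_PREFIX):
            return json.loads(line[len(MENTIONS_PREFIX):])
    return []


def mention_text(mention, forms):
    """Join the word forms covered by a mention's span (1-based token ids).

    Forms are joined with a single space; SpaceAfter=No is ignored here.
    """
    return " ".join(forms[i - 1] for i in mention["span"])


# Comments of sentence s2 from the example, written on single lines.
s2_comments = [
    "# sent_id = s2",
    '# json_mentions = [{"id":"s2#m1","type":"Osoba","span":[4]},'
    '{"id":"s2#m2","type":"Auto","span":[8]},'
    '{"id":"s2#m3","type":"Osoba","span":[10]},'
    '{"id":"s2#m4","type":"Zbraň","span":[14,15]}]',
]
s2_forms = ["Další", "majetek", "pana", "Nováka", "čítá", "vozidlo", "značky", "Volvo",
            "a", "Novákovu", "legálně", "drženou", "zbraň", "Glock", "42", "."]

s2_mentions = read_mentions(s2_comments)
for mention in s2_mentions:
    print(mention["id"], mention["type"], mention_text(mention, s2_forms))
```

Since the annotations live entirely in comment lines, a tool that ignores unknown comments still sees a valid CoNLL-U file.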

Contributor

amir-zeldes commented Aug 4, 2018

Hi all - @savary I don't mean to propose a new format if one is already established; I'm also happy to use whatever else is agreed on. I'm just suggesting that there are already some representations of spans in a CoNLL-U-like format that could be reused for this.

@arademaker - marking the head is not sufficient for at least two main reasons:

  1. The exact extent of the entity annotation is not necessarily predictable from the head, since some relations typically continue an entity (e.g. an amod is included in an entity span) and some don't (e.g. nsubj to a nominal predicate is not part of the predicate entity span). In my experience it is not 100% possible to formulate rules determining entity span from relation types, since some are ambiguous (e.g. dep, which may or may not be part of the entity) and some are treebank/language specific.

  2. Some tokens can be the head of two conflicting entities, especially for coordinations: we might have two coordinated entities with different types (e.g. person + organization), and we might have a guideline that gives the entire NP a different type (e.g. mixed or abstract). In this case, the first noun will be the head of two distinct entity annotations with different spans.

Contributor

amir-zeldes commented Aug 4, 2018

Just seeing @foxik 's post - for completeness I should say that we also have a different working solution at the moment. We have parallel WebAnno files next to the CoNLL-U syntax files, and both have the same tokenization so it's easy to merge. We didn't consider putting entities directly into the CoNLL files since we also have coreference for these entities, and we're already annotating them in WebAnno.

As for standoff, we also have a standoff XML representation of all annotations in the corpus (entities, coref, discourse, TEI tags...), which is expressed using PAULA XML but is automatically generated from the CoNLL-U, WebAnno and other formats, so it's never manually edited.

Member

dan-zeman commented Sep 17, 2018

See http://universaldependencies.org/ext-format.html for the specification of the CoNLL-U Plus file format.
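
As a rough illustration, here is a minimal sketch of reading a CoNLL-U Plus file, assuming the `# global.columns = ...` header line described in that specification. The extra column name `EXAMPLE:NE` and the tag values are hypothetical and not part of any agreed schema.

```python
def read_conllu_plus(lines):
    """Yield one dict per token line, keyed by the names listed in '# global.columns'."""
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Column names are whitespace-separated after the equals sign.
            columns = line.split("=", 1)[1].split()
        elif line and not line.startswith("#"):
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            yield dict(zip(columns, line.split("\t")))


# Hypothetical file with a reduced column set and one extra NE column.
plus_sample = [
    "# global.columns = ID FORM UPOS EXAMPLE:NE",
    "# sent_id = demo",
    "1\tGlock\tPROPN\tB-PROD",
    "2\t42\tNUM\tI-PROD",
    "",
]
plus_rows = list(read_conllu_plus(plus_sample))
print(plus_rows)
```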

Contributor

gcelano commented Sep 17, 2018

I think that keeping the original text separate from any other kind of annotation, and referring to the original text via offsets even for sentence splitting and tokenization, is the most (or only) scalable approach. Sentence splitting and tokenization are of course kinds of annotation, and one should consider that different segmentations may be needed.

One can think of different serializations for this, but PAULA XML is an effective, already available solution. PAULA is also a very relevant solution because an increasing number of texts are natively encoded in TEI-XML.
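
To illustrate the offset-based idea independently of any particular serialization, here is a minimal sketch in which tokens and entities are both stored as character offsets into the untouched raw text. The dictionary layout is an assumption made for this example and is not PAULA XML.

```python
def token_offsets(text, tokens):
    """Locate each token in the raw text, left to right; return (start, end) offsets."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if the token is not found
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


raw_text = "Water Off a Black Dog’s Back"
tokenization = ["Water", "Off", "a", "Black", "Dog", "’s", "Back"]
offsets = token_offsets(raw_text, tokenization)

# A standoff entity refers only to the raw text, not to any tokenization.
entity = {"type": "animal", "start": offsets[3][0], "end": offsets[4][1]}
print(offsets)
print(entity, raw_text[entity["start"]:entity["end"]])
```

A second tokenization of the same text would produce different token offsets, while the entity record stays valid unchanged.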

Contributor

arademaker commented Sep 18, 2018

This thread started by asking about named entity annotation, but at the http://universaldependencies.org/ext-format.html link I found only references to the PARSEME schema for VMWEs. The IOB2 format mentioned by @fredrijo has some variations, so what do people believe is the best schema for named entities in a CoNLL-U Plus file?

Contributor

amir-zeldes commented Sep 18, 2018

I've already pointed out WebAnno format as a candidate for a stand-alone format above, but if this is still an open discussion, another candidate to throw into the mix is to just use the MISC column with entity type brackets within the present CoNLL-U format.

This can also be complemented by coref IDs as used by the CoNLL-coref scorer, like so:

1	My	_	PRON	PRP$	_	2	nmod	_	Entity=(1-person)(2-person
2	sister	_	NOUN	NN	_	3	nsubj	_	Entity=2-person)
3	saw	_	VERB	VBD	_	0	root	_	_
4	herself	_	PRON	PRP	_	3	obj	_	Entity=(2-person)
5	in	_	ADP	IN	_	7	case	_	_
6	the	_	DET	DT	_	7	det	_	Entity=(3-object
7	mirror	_	NOUN	NN	_	3	obl	_	Entity=3-object)
8	.	_	PUNCT	.	_	3	punct	_	_

@dan-zeman dan-zeman added this to the later milestone Nov 13, 2018
