Adding Named Entity annotations #562
Comments
Adding more annotation layers to UD treebanks is of course always welcome, since it makes the treebanks more useful. However, it is still not clear to me what the best way to do this in practice is. If the information is going to be added to the CoNLL-U files, then the MISC field is the only possibility, and it would of course be preferable if this could be done in a uniform way across treebanks, but currently there is no proposal for NE (as far as I know). The other option is to do a kind of standoff annotation, where the annotation is stored in a separate file that references the CoNLL-U file. There is an ongoing effort to develop such a format for multiword expressions and to generalise this to a standard that can also be used for other types of annotations. This work has been stalled for a while, but the plan is to resume it in September. Perhaps it would make sense to include NE as another test case here. |
Thank you for the answer. Treating multiword expressions and named entities in a uniform manner makes a lot of sense! I will await your decision on the standoff format then. |
One way of including this information would be to use a similar convention to WebAnno TSV:
Only multitoken spans get a span ID in WebAnno TSV, but one could assign it to all entity annotations for consistency. Using the MISC field one could write:
We have complete entity type information (including named, non-named and pronoun mentions) for UD_English-GUM as well, which we would be happy to include in the UD release. For more on WebAnno TSV see: https://webanno.github.io/webanno/releases/3.2.2/docs/user-guide.html#sect_webannotsv |
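As an illustration of the idea in this comment (not the commenter's own example), here is a minimal Python sketch that gives every entity span an ID and renders it as a MISC value in a WebAnno-like `type[id]` style. The key name `Entity`, the comma used to stack nested entities, and the function name are assumptions made for this sketch, not part of any agreed UD convention.

```python
def misc_entity_values(n_tokens, spans):
    """Render entity spans as per-token MISC values.

    spans: list of (start, end_inclusive, entity_type), 0-based token indices.
    Every span gets an ID, including single-token ones.
    """
    values = [[] for _ in range(n_tokens)]
    for span_id, (start, end, etype) in enumerate(spans, start=1):
        for i in range(start, end + 1):
            values[i].append(f"{etype}[{span_id}]")
    # "Entity" is a hypothetical MISC key; "_" marks an empty MISC field.
    return ["Entity=" + ",".join(v) if v else "_" for v in values]


tokens = ["Barack", "Obama", "visited", "Paris"]
spans = [(0, 1, "person"), (3, 3, "place")]
for tok, misc in zip(tokens, misc_entity_values(len(tokens), spans)):
    print(tok, misc, sep="\t")
```

Repeating the same ID on every token of a span is what lets a reader of the file tell two adjacent entities of the same type apart.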
The definition of a beta version of the standoff format, mentioned by Joakim, can be found here. Note that this definition includes:
.cupt was used by the PARSEME corpus of verbal MWE in a recent shared task. We are now gathering feedback on this format before we discuss it further with the UD core group. In the future, PARSEME aims at extending .cupt so as to include all kinds of MWEs, not only verbal, and multiword named entities will also be considered. Collaborations are welcome. |
Hi @amir-zeldes, regarding your suggestion #562 (comment), why do we need to type all tokens? Why not only the head of the entity? |
In a project here in Prague, we annotated NEs and NE linking, and we ended up using sentence-level comments to embed the information in JSON format. We also used document-level comments which linked all NEs in the document. Our goal was compatibility with the CoNLL-U format (so that the documents themselves can be processed by any CoNLL-U compatible tool), while allowing very general extensibility. An example follows (with added line breaks for readability; all comments should be single-line):
I am not proposing standardizing this or anything, just showing what we use to extend CoNLL-U :-) |
Hi all - @savary I don't mean to propose a new format if one is already established, I'm also happy to use whatever else is agreed on. Just suggesting that there are already some representations of spans in a CoNLL-U-like format that could be reused for this. @arademaker - marking the head is not sufficient for at least two main reasons:
|
Just seeing @foxik's post - for completeness I should say that we also have a different working solution at the moment. We have parallel WebAnno files next to the CoNLL-U syntax files, and both have the same tokenization so it's easy to merge. We didn't consider putting entities directly into the CoNLL files since we also have coreference for these entities, and we're already annotating them in WebAnno. As for standoff, we also have a standoff XML representation of all annotations in the corpus (entities, coref, discourse, TEI tags...), which is expressed using PAULA XML but is automatically generated from the CoNLL-U, WebAnno and other formats, so it's never manually edited. |
See http://universaldependencies.org/ext-format.html for the specification of the CoNLL-U Plus file format. |
I think that keeping the original text separate from any other kind of annotation, and referring to the original text via offsets even for sentence splitting and tokenization, is the most (perhaps the only) scalable approach. Sentence splitting and tokenization are of course kinds of annotation, and one should consider that different segmentations may be needed. One can think of different serializations for this, but PAULA XML is an effective, already available solution. PAULA is also a very relevant solution because an increasing number of texts are natively encoded in TEI-XML. |
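The offset-based idea can be sketched in a few lines of Python: the raw text is left untouched, and tokens (or any other layer) are stored only as character ranges into it. This is a generic illustration under simple assumptions (tokens occur in order and appear verbatim in the text); it is not PAULA XML, and the function name is made up for this sketch.

```python
def token_offsets(text, tokens):
    """Locate each token in the raw text as a (start, end) character range."""
    offsets = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if a token is missing
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets


text = "Oslo is nice."
tokens = ["Oslo", "is", "nice", "."]
print(token_offsets(text, tokens))
```

A second tokenization of the same text would simply be another list of ranges, which is why competing segmentations can coexist in a standoff setup.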
This thread started by asking for named entity annotation, but at the http://universaldependencies.org/ext-format.html link I found only references to the PARSEME schema for VMWEs. The IOB2 scheme mentioned by @fredrijo has some variations, so what do people believe is the best schema for named entities in a CoNLL-U Plus file? |
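For reference, plain IOB2 tags the first token of every entity with `B-`, later tokens of the same entity with `I-`, and everything else with `O`. Below is a minimal Python sketch that assumes flat, non-overlapping spans; the function name and the type labels are illustrative only, and the variations mentioned above are not covered.

```python
def spans_to_iob2(n_tokens, spans):
    """Convert entity spans to IOB2 tags.

    spans: list of (start, end_inclusive, entity_type), 0-based,
    assumed not to overlap or nest.
    """
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{etype}"
    return tags


tokens = ["Barack", "Obama", "visited", "Paris"]
print(list(zip(tokens, spans_to_iob2(len(tokens), [(0, 1, "PER"), (3, 3, "LOC")]))))
```

One tag per token fits naturally into an extra CoNLL-U Plus column, but it cannot express nested entities without further conventions.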
I've already pointed out WebAnno format as a candidate for a stand-alone format above, but if this is still an open discussion, another candidate to throw into the mix is to just use the MISC column with entity type brackets within the present CoNLL-U format. This can also be complemented by coref IDs as used by the CoNLL-coref scorer, like so:
|
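One possible rendering of the bracket idea, offered only as a sketch: each span opens with `(type-id` on its first token and closes with `)` on its last, so nested entities stay well-formed. The exact notation in the original comment is not preserved here, so the `Entity` key, the `type-id` shape, and the function name below are assumptions.

```python
def bracket_misc(n_tokens, spans):
    """Render spans as bracketed MISC values.

    spans: list of (start, end_inclusive, entity_type, coref_id), 0-based,
    assumed to be properly nested (no crossing spans).
    """
    opens = [[] for _ in range(n_tokens)]
    closes = [0] * n_tokens
    # Longer spans open first so that brackets nest correctly.
    for start, end, etype, cid in sorted(spans, key=lambda s: (s[0], -s[1])):
        opens[start].append(f"({etype}-{cid}")
        closes[end] += 1
    out = []
    for i in range(n_tokens):
        value = "".join(opens[i]) + ")" * closes[i]
        # "Entity" is a hypothetical MISC key; "_" marks an empty field.
        out.append(f"Entity={value}" if value else "_")
    return out


tokens = ["the", "president", "of", "France"]
spans = [(0, 3, "person", 1), (3, 3, "place", 2)]
for tok, misc in zip(tokens, bracket_misc(len(tokens), spans)):
    print(tok, misc, sep="\t")
```

Unlike per-token tags, brackets only touch the first and last token of each span, and mentions sharing a coref ID can be linked across sentences.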
The Norwegian Bokmål part of UD is based on the Norwegian Dependency Treebank (NDT). NDT has now been extended with named entity annotations, and is going to be redistributed with these annotations in addition to the linguistic (syntactic and morphological) annotations.
We are interested in distributing these named entity annotations as part of UD as well, if this is desirable.
Questions: