Context: together with @AleksandrsBerdicevskis, @Turtilla and @elenavolodina, I have spent a substantial amount of time thinking about how to use UD for annotating learner productions in a cross-lingually consistent fashion. We recently condensed these thoughts into a paper. While our suggestions did not meet any substantial objections when they were presented at SyntaxFest, I believe they would benefit from further discussion, not least because they create some validation issues. So, I thought I'd distill them into a series of issues, of which this is the first one.
In learner corpus research, it is often recommended to follow so-called literal rather than distributional annotation criteria. The former implies that morphosyntactic analysis be guided solely by the observed word forms, whereas a distributional (or, as I would call it, interpretative) approach is one where context is used to figure out the learner's intended meaning and annotate accordingly.
We argue that literal criteria work well for lemmatization, morphological analysis and POS tagging, but less so for dependency annotation, which:
- is intrinsically, to some extent, "distributional", and as such it can be a good complement to literal token-level annotation, maximizing the "informativeness" of the treebank (see example 1 below)
- should, at least for ambiguous sentences, be grounded in what the annotator thinks is the intended meaning.
This is particularly relevant for learner treebanks, where nonstandard language use often causes UPOS-DEPREL mismatches and/or syntactic ambiguities.
In most cases, annotating according to this principle works just fine. Sometimes, it breaks (language-specific) guidelines that are not (yet) enforced by the validator. But in rare cases, the validator complains (examples from UD_Swedish-SweLL; feel free to add more for other languages):
- barnen kan ärva egenskaper från båda mamma och pappa ('the.children can inherit characteristics from both mom and dad'): båda is an existing adjective form, but is used as the conjunction både. We would like to use `ADJ`+`cc`, but this combination is not currently allowed by the validator
- du har lång erfarente på området som du här sökt jobb ('you have long expreince in the.area which you here looked.for job'): should be something like du har lång erfarenhet på området som du har sökt jobb inom ('you have long experience in the.area which you have looked.for job in'), which would make job into an `obl` of sökt. In the original sentence, we would like to use `obl` for both som and jobb, but this results in a "too many objects" error
- Till skillnad mot Ur : Samuel August från Sevedstorp och Hanna i Hult av Astrid Lindgren handlar den om en kärleksfull familj ('to difference against Ur: Samuel August från Sevedstorp och Hanna i Hult by Astrid Lindgren talks it about a loving family'): difficult sentence, but we are quite sure handlar has two subjects (Samuel August från Sevedstorp och Hanna i Hult and den), which is also not allowed
Of course, I'm not saying these shouldn't be flagged by the validator!
On the other hand, I argue that there should be a mechanism for the annotator to mark deliberate violations of the validation rules corresponding to nonstandard syntax in the underlying text.
In the paper, we proposed a new subtype, `:*`. However, @jnivre pointed out that this could be a problem in cases where the DEPREL at hand already has a subtype, as UDv2 does not allow subtype stacking (such as `nsubj:pass:*`).
I think this could be solved by using a new MISC item instead, although I'm not sure what a good name for it would be. To partly align with the guidelines for typos and other errors in underlying text, we could perhaps introduce CorrectDEPREL=DEPREL-THAT-WOULD-BE-USED-IN-A-CORRECTED-VERSION-OF-THE-SENTENCE :)
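To make the MISC-based idea a bit more concrete, here is a rough Python sketch of how a tool could pick up tokens that carry such a marker. Everything specific in it is an assumption for illustration: `CorrectDEPREL` is only the tentative name floated above, the helper names `parse_misc` and `deliberate_violations` are made up, and the four-token fragment is hand-written rather than taken from UD_Swedish-SweLL. It only relies on the standard CoNLL-U layout (10 tab-separated columns, MISC as `Key=Value` pairs joined by `|`).

```python
# Hypothetical sketch: list tokens flagged with a (not yet existing) CorrectDEPREL MISC key.

def parse_misc(misc: str) -> dict[str, str]:
    """Turn a CoNLL-U MISC field ('_' or 'Key=Value|Key=Value') into a dict."""
    if misc == "_":
        return {}
    pairs = (item.split("=", 1) for item in misc.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def deliberate_violations(conllu: str, key: str = "CorrectDEPREL"):
    """Return (ID, FORM, UPOS, DEPREL, value of `key`) for every token whose MISC has `key`."""
    flagged = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        misc = parse_misc(cols[9])
        if key in misc:
            flagged.append((cols[0], cols[1], cols[3], cols[7], misc[key]))
    return flagged


# Illustrative fragment only ("båda mamma och pappa"); heads and labels are assumptions.
SAMPLE = "\n".join("\t".join(row) for row in [
    ("1", "båda", "båda", "ADJ", "_", "_", "2", "cc", "_", "CorrectDEPREL=cc"),
    ("2", "mamma", "mamma", "NOUN", "_", "_", "0", "root", "_", "_"),
    ("3", "och", "och", "CCONJ", "_", "_", "4", "cc", "_", "_"),
    ("4", "pappa", "pappa", "NOUN", "_", "_", "2", "conj", "_", "SpaceAfter=No"),
])

for token in deliberate_violations(SAMPLE):
    print(token)
```

A validator could, in principle, use the same lookup to downgrade or skip specific checks for flagged tokens, but whether and how that should happen is exactly what is up for discussion here.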