Nonstandard DEPRELs for nonstandard syntax

__Context__: together with @AleksandrsBerdicevskis, @Turtilla and @elenavolodina, I have spent a substantial amount of time thinking about how to use UD for annotating learner productions in a cross-lingually consistent fashion. We recently condensed these thoughts into a [paper](https://aclanthology.org/2025.udw-1.17/). While our suggestions did not meet any substantial objections when they were presented at SyntaxFest, I believe they would benefit from further discussion, not least because they create some validation issues. So I thought I'd distill them into a series of issues, of which this is the first.

---

In learner corpus research, it is often recommended to follow so-called _literal_ rather than _distributional_ annotation criteria. The former implies that morphosyntactic analysis be guided solely by the observed word forms, whereas a distributional (or, as I would call it, _interpretative_) approach is one where context is used to figure out the learner's intended meaning and annotate accordingly.

We argue that literal criteria work well for lemmatization, morphological analysis and POS tagging, but less so for dependency annotation, which:

- is intrinsically, to some extent, "distributional", and as such it can be a good complement to literal token-level annotation, maximizing the "informativeness" of the treebank (see example 1 below)
- should, at least for ambiguous sentences, be grounded in what the annotator thinks is the intended meaning.

This is particularly relevant for learner treebanks, where nonstandard language use often causes UPOS-DEPREL mismatches and/or syntactic ambiguities. 

In most cases, annotating according to this principle works just fine. Sometimes, it breaks (language-specific) guidelines that are not (yet) enforced by the validator. But in rare cases, the validator complains (examples from UD_Swedish-SweLL; feel free to add more for other languages):

1. _barnen kan ärva egenskaper från **båda** mamma och pappa_ ('the.children can inherit characteristics from both mom and dad'): _båda_ is an existing adjective form, but is used as the conjunction _både_. We would like to use `ADJ` + `cc`, but this combination is not currently allowed by the validator
2. _du har lång erfarente på området **som du här sökt jobb**_ ('you have long expreince in the.area which you here looked.for job'): should be something like _du har lång erfarenhet på området **som du har sökt jobb inom**_ ('you have long experience in the.area which you have looked.for job in'), which would make _som_ an `obl` of _sökt_. In the original sentence, we would like to use `obj` for both _som_ and _jobb_, but this results in a "too many objects" error
3. _Till skillnad mot Ur : **Samuel August från Sevedstorp och Hanna i Hult av Astrid Lindgren** handlar **den** om en kärleksfull familj_ ('to difference against _Ur_: _Samuel August från Sevedstorp och Hanna i Hult_ by Astrid Lindgren talks it about a loving family'): difficult sentence, but we are quite sure _handlar_ has two subjects (_Samuel August från Sevedstorp och Hanna i Hult_ and _den_), which is also not allowed

Of course, I'm not saying these shouldn't be flagged by the validator! 
On the other hand, I argue that there should be a mechanism for the annotator to mark _deliberate_ violations of the validation rules corresponding to nonstandard syntax in the underlying text.
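
To make the three kinds of complaint above concrete, here is a toy Python sketch. It is an assumption-laden illustration, not the logic of the actual UD validator: the function name `toy_check`, the allowed-UPOS set for `cc` and the "at most one `nsubj`/`obj` per head" rule are simplifications I made up for the example, and the token analysis for example 1 is my own illustrative one, not copied from UD_Swedish-SweLL.

```python
from collections import Counter

# Assumptions for illustration only (NOT the official validator's rule set):
CC_UPOS = {"CCONJ"}                 # UPOS tags this toy check accepts with `cc`
SINGLETON_DEPRELS = {"nsubj", "obj"}  # relations allowed at most once per head


def toy_check(tokens):
    """tokens: list of (id, form, upos, head, deprel) tuples.

    Returns a list of human-readable complaints.
    """
    messages = []
    per_head = Counter()
    for tid, form, upos, head, deprel in tokens:
        base = deprel.split(":")[0]  # strip a subtype such as `cc:preconj`
        if base == "cc" and upos not in CC_UPOS:
            messages.append(f"{tid} {form}: {upos} + cc")
        if deprel in SINGLETON_DEPRELS:
            per_head[(head, deprel)] += 1
    for (head, deprel), n in sorted(per_head.items()):
        if n > 1:
            messages.append(f"head {head}: {n} x {deprel}")
    return messages


# Example 1, with a hypothetical analysis (heads and deprels are illustrative).
example_1 = [
    (1, "barnen", "NOUN", 3, "nsubj"),
    (2, "kan", "AUX", 3, "aux"),
    (3, "ärva", "VERB", 0, "root"),
    (4, "egenskaper", "NOUN", 3, "obj"),
    (5, "från", "ADP", 7, "case"),
    (6, "båda", "ADJ", 7, "cc"),      # literal UPOS, interpretative DEPREL
    (7, "mamma", "NOUN", 3, "obl"),
    (8, "och", "CCONJ", 9, "cc"),
    (9, "pappa", "NOUN", 7, "conj"),
]

for message in toy_check(example_1):
    print(message)
```

Examples 2 and 3 would trip the second check in the same way, since each gives one head two dependents with the same core relation.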

In the paper, we proposed a new subtype, `:*`. However, @jnivre pointed out that this could be a problem in cases where the `DEPREL` at hand already has a subtype, as UDv2 does not allow subtype stacking (such as `nsubj:pass:*`). 
I think this could be solved by using a new MISC item instead, although I'm not sure what a good name for it would be. To partly align with the [guidelines for typos and other errors in underlying text](https://universaldependencies.org/u/overview/typos.html), we could perhaps introduce `CorrectDEPREL=DEPREL-THAT-WOULD-BE-USED-IN-A-CORRECTED-VERSION-OF-THE-SENTENCE` :)
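
For the sake of discussion, here is a minimal Python sketch of how a tool could pick up such a MISC item and skip (or downgrade) the corresponding checks. Everything in it is hypothetical: `CorrectDEPREL` is only the tentative name floated above, the helper names are mine, and the sample token line is an illustration for example 1 rather than real treebank data. The last helper shows why `:*` is awkward for relations that already carry a subtype.

```python
def parse_misc(misc):
    """Parse a CoNLL-U MISC field ('_' or 'Key=Value|Key=Value') into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def is_marked_deliberate(conllu_line, key="CorrectDEPREL"):
    """True if the token line carries the (hypothetical) marker in MISC."""
    columns = conllu_line.rstrip("\n").split("\t")
    return key in parse_misc(columns[9])  # MISC is the 10th CoNLL-U column


def has_stacked_subtype(deprel):
    """True for labels with more than one subtype, e.g. `nsubj:pass:*`."""
    return deprel.count(":") > 1


# Hypothetical token line for example 1 (lemma, XPOS and FEATS left empty).
line = "6\tbåda\t_\tADJ\t_\t_\t7\tcc\t_\tCorrectDEPREL=cc"

print(is_marked_deliberate(line))
print(has_stacked_subtype("nsubj:pass:*"))
```

A validator could then treat tokens for which `is_marked_deliberate` holds as deliberate deviations, for instance by reporting them as warnings rather than errors.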

Nonstandard DEPRELs for nonstandard syntax #1178
