I was wondering whether we should try to do more to standardize/change the tokenization of English used in UD. (At the moment, tokenization decisions are largely outside the scope of the UD guidelines, apart from some general admonitions that UD is built on a lexicalist belief in words.)
At present, UD_English-EWT almost entirely follows the tokenization done by newer LDC/Penn Treebanks like OntoNotes (or, here, it simply follows the English Web Treebank). Their approach has always been a decompositional one, so [don't] ➔ [do] [n't] and [wanna] ➔ [wan] [na]. This facilitated the provision of regular syntactic analyses, though it is arguably more necessary for Penn-style constituency analyses than for UD analyses. (The two ways in which UD_English-EWT currently deviates from EWT are: (i) not escaping parentheses as -LRB- and -RRB-, and (ii) not converting characters in the source documents that are Unicode outside the ASCII range (curly quotes, dashes, etc.) into ASCII.)
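To make the decompositional style concrete, here is a minimal sketch of this kind of contraction splitting. The patterns below are my own illustrative assumptions, not the actual LDC tokenizer rules:

```python
import re

# A minimal sketch of Penn/LDC-style decompositional splitting of
# contracted tokens. These patterns are illustrative assumptions,
# not the actual LDC tokenizer rules.
SPLIT_PATTERNS = [
    re.compile(r"^(.+)(n't)$", re.IGNORECASE),                   # [don't] -> [do] [n't]
    re.compile(r"^(.+)('ll|'re|'ve|'s|'d|'m)$", re.IGNORECASE),  # ['ll], ['s], ...
    re.compile(r"^(wan|gon)(na)$", re.IGNORECASE),               # [wanna] -> [wan] [na]
]

def split_token(token):
    """Return the token split decompositionally, or unchanged as a single piece."""
    for pattern in SPLIT_PATTERNS:
        m = pattern.match(token)
        if m:
            return [m.group(1), m.group(2)]
    return [token]
```

Note how [wanna] comes out as [wan] [na], with [na] standing in for "to" – the kind of odd allomorph that this splitting approach creates.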
In contrast, most (but not all) treebank annotators of tweets and other informal, user-generated content have felt that it is infeasible, and maybe even wrong, to adopt such a decompositional approach, and so have in general kept contracted tokens and abbreviations together. For example, [wanna] stays as is, as does [smh] in [smh at your ignorance] "shaking my head at your ignorance". For the small UD TweetAAE corpus, Blodgett, Wei, and @brendano (ACL 2018) "did not attempt to split up these words into their component words (e.g. wanna → want to), since to do this well, it would require a more involved segmentation model over the dozens or even hundreds of alternate spellings each of the above can take". A similar approach was used in the early Tweebank that @nschneid contributed to. There are strong practical arguments. As well as the multiplicity of forms, you sometimes simply run out of letters (if you avoid null tokens). For example, the spelling [ima] is sometimes used, instead of [imma] or [umma], for "I am going to". There just aren't enough letters in [ima] to assign one each to the words in "I am going to". Beyond the practical arguments, an earlier paper suggested a more principled one: it is perhaps linguistically wrong to project formal English conventions onto the conversational chatter of young people. For instance, many young people now also speak (say the letters of) "wtf". (Nevertheless, in the other direction, Liu, Zhu, Che, Qin, Schneider, and Smith (NAACL 2018) did try to split up all tokens in their Tweebank to provide a UDv2-consistent version – essentially, I think, trying to match UD_English-EWT.)
It would be nice to agree on a best, or least bad, convention here. A few thoughts on that:
- Separation is linguistically motivated for the possessive ['s] and for verbal and modal auxiliary forms like ['ll], which act as phrasal clitics, not as pieces of morphology of the word that hosts them, for example in [the Queen of England's hat] or [the team that's ahead'll try to slow down play].
- But this phrasal clitic analysis isn't motivated in the same way for contracted negation or to-auxiliary forms like [isn't] and [wanna].
- It seems slightly undesirable if [don't] and [dont] don't have the same analysis, but maybe this isn't that problematic in UD. It's not really more problematic than how UD copes with other null words such as dropped copulas in the choice of [you with me] vs. [you are with me].
So, what are the possible positions? There may be more, but some points on the spectrum are:
- Stick with copying current LDC/Penn English tokenization. This offers compatibility with other LDC resources, and its decompositional nature gives more parallelism in syntax, at the cost of large difficulties in tokenization and somewhat ridiculous allomorphs (like [a] for "to").
- Only separate off the clearly phrasal clitics, as defined above. But if you do this consistently, then since ['ll] for "will" is a phrasal clitic, you still have the tricky tokenization problem of distinguishing when [ill] is "ill" (sick) and when it should be [i ll] for "I will".
- Stick with dividing tokens for the traditional contractions of formal written English as well as for the phrasal clitics above, but don't try to extend this to further, more informal forms. This would give substantial compatibility with the LDC/Penn Treebanks. That is, we'd get tokens like [do n't], [John 's], and [she s], but [gonna], [imma], and [idc]. (So the LDC's splitting of cases of [wanna] in EWT would be reversed in UD_English-EWT. Possibly we would also no longer split [cannot] into [can not] as the LDC does, though presumably we might still split [ca n't] for consistency with other "n't" forms?)
- Decide that all this token splitting is a big mistake. We'd still tokenize punctuation marks, etc., from words, but all of these things would be a single word: [don't], [John's], [shes], [gonna], [imma], [idc]. Representing the "wrong" treatment of phrasal clitics would require loosening up the taxonomy of the morphological features, since possessive case can appear on verbs, prepositions, etc., but there is some precedent for that anyway in languages with case marking spreading to different word classes, such as in Australian languages.
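To make the middle of the spectrum concrete, here is a hypothetical sketch of a policy along the lines of option 3: split the formal contractions and phrasal clitics, but keep informal fused forms whole. The word lists and rules are illustrative assumptions only, not a UD decision:

```python
import re

# Hypothetical sketch of option 3: split traditional formal-English
# contractions and phrasal clitics, but leave informal fused forms intact.
# The word list and pattern are illustrative assumptions only.
FORMAL_SPLIT = re.compile(r"^(.+?)(n't|'ll|'re|'ve|'s|'d|'m)$", re.IGNORECASE)
KEEP_WHOLE = {"gonna", "wanna", "imma", "idc", "smh"}  # informal forms kept as one token

def tokenize_word(token):
    """Split a word token per the option-3 policy, or return it whole."""
    if token.lower() in KEEP_WHOLE:
        return [token]                    # informal forms stay fused
    m = FORMAL_SPLIT.match(token)
    if m:
        return [m.group(1), m.group(2)]   # e.g. [do n't], [John 's]
    # Apostrophe-less spellings like [shes] -> [she s] would need an
    # additional lexicon-based rule, omitted here.
    return [token]
```

The boundary cases the discussion raises (e.g. [shes] without an apostrophe, or [cannot]) would still need lexicon-based handling beyond this sketch.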
Any thoughts? I'm wondering whether option 3 might be better than where we are now with option 1 (at least in the case of UD_English-EWT).