Infixes #384

Open
ermanh opened this Issue Dec 19, 2016 · 11 comments

@ermanh
ermanh commented Dec 19, 2016

The discussion on mesoclitics in Portuguese in #315 raises a similar question we had about how infixes should be treated in UD. From what I can gather from the Portuguese discussion, the solution for now would be simply to separate them like the following?

1-2	xxxINFIXxxx
1	xxxxxx
2	INFIX

It seems the only unavoidable(?) drawback is that we would always have to display the original word/sentence alongside the tree because the tree itself would have the order jumbled.

In Cantonese we have the word 鬼 gwai2 'ghost, devil' which can be infixed into a bisyllabic, non-compositional word with an emphatic function (analogous to English 'fucking' in 'abso-fucking-lutely'). The interrogative 乜嘢 mat1je5 'what' can also be infixed, with a "What do you mean?" meaning:

論盡		論鬼盡			論乜嘢盡
leon6zeon6	leon6-gwai2-zeon6	leon6-mat1je5-zeon6
'clumsy'	'downright clumsy'	'What do you mean clumsy?'

To make things even more interesting, we can have nested infixes, with 鬼 gwai2 'ghost, devil' infixed inside 乜嘢 mat1je5 'what':

論乜鬼嘢盡
leon6-mat1-gwai2-je5-zeon6
'What the heck do you mean clumsy? / Clumsy how?!'

We haven't encountered this last case in our own data yet, but in principle, should it be treated flat as in the following (rather than recursively, which I imagine would be too messy)?

1-3	論乜鬼嘢盡	leon6-mat1-gwai2-je5-zeon6
1	論盡		leon6zeon6
2	乜嘢		mat1je5
3	鬼		gwai2
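
For illustration, the flat segmentation above could be written out as CoNLL-U rows with a small script. This is only a sketch: `mwt_rows` is a hypothetical helper, not part of any UD tool, and all annotation columns (LEMMA through MISC) are left as `_` placeholders, so the output is not a fully annotated tree.

```python
def mwt_rows(surface, words, start=1):
    """Return CoNLL-U lines: one range line for the surface token,
    followed by one line per syntactic word.

    Hypothetical helper for illustration; the eight columns after FORM
    are filled with "_" because no annotation is decided here.
    """
    blank = ["_"] * 8  # LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    end = start + len(words) - 1
    rows = ["\t".join([f"{start}-{end}", surface] + blank)]
    for offset, form in enumerate(words):
        rows.append("\t".join([str(start + offset), form] + blank))
    return rows


# The nested-infix example, segmented flat: host word, then 乜嘢, then 鬼.
rows = mwt_rows("論乜鬼嘢盡", ["論盡", "乜嘢", "鬼"])
print("\n".join(rows))
```

The range line carries the surface form, so the jumbled order of the word lines does not lose the original string.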

Thanks!

@jnivre
Contributor
jnivre commented Dec 19, 2016

This is not different in principle from many other cases where the original form of the token is not obtainable by a simple concatenation of the corresponding words, like French "du" = "de le". However, the crucial question is whether the segments surrounding the infixed segments have to be regarded as one word, or whether they can be analysed with, say, the "compound" relation. In the simple case, we could then get a structure like the following:

xxx1 INFIX xxx2
compound(xxx1, xxx2)
advmod(xxx1, INFIX)

This is assuming that the infixed element is a dependent of the host. If this is not the case, we will get a non-projective tree but we will still avoid the complex word segmentation. So, is there evidence that the link between the surrounding elements is stronger than for, say, noun-noun compounds in English? If not, then I think a compound analysis would be preferable.
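
The projectivity point can be checked mechanically. The sketch below uses the usual definition (every word between a head and its dependent must be a descendant of that head); `is_projective` and the two head maps are made up for illustration, and the fourth word in the second example is a hypothetical attachment site outside the host.

```python
def is_projective(heads):
    """heads maps each 1-based word index to its head index (0 = root).

    Assumes the input is a well-formed tree (no cycles).
    """
    def dominated_by(node, ancestor):
        # Walk up the head chain from node until reaching ancestor or the root.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in heads.items():
        if head == 0:
            continue
        lo, hi = sorted((dep, head))
        # Every word strictly between head and dependent must descend from the head.
        if not all(dominated_by(k, head) for k in range(lo + 1, hi)):
            return False
    return True


# xxx1 INFIX xxx2: compound(xxx1, xxx2), advmod(xxx1, INFIX)
infix_on_host = {1: 0, 2: 1, 3: 1}

# xxx1 INFIX xxx2 OTHER: the infix attaches to OTHER instead of its host
infix_elsewhere = {1: 4, 2: 4, 3: 1, 4: 0}

print(is_projective(infix_on_host))
print(is_projective(infix_elsewhere))
```

In the second map the compound arc from xxx1 to xxx2 spans the infix, which is not dominated by xxx1, so the tree is non-projective.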

@martinpopel
Contributor

Another question is whether, when xxx1 and xxx2 appear without the INFIX, they will still be analyzed as two words (compound(xxx1, xxx2)) or just as one word.
E.g. in English, we probably don't want to analyze abso-lutely as two syntactic words just because it can take an infix in some slang. :-)

@jnivre
Contributor
jnivre commented Dec 19, 2016

Good point. It seems like a good principle would be to posit multiple words only when required by the (exceptional) circumstances.

@ermanh
ermanh commented Dec 19, 2016

The infix can come between the parts of both actual compounds and non-compounds. In the non-compound case for Cantonese, these are adjectives that are considered one word (some may be compounds historically, but some just happen to be bisyllabic and cannot be broken apart at all -- 論盡 leon6zeon6 'clumsy', which I gave above, is one example).

So it sounds like the proposal is: if there is no infix, just treat it as one word (assuming it is not a compound to begin with), but when the word occurs with the infix, split it apart and connect the two parts with compound?

Should there perhaps be a standardized sub-label? Just compound by itself (or using this label category by definition as an MWE type dependency) seems misleading if the word is non-compositional to begin with. It seems something like goeswith might have been great if it weren't dedicated to indicating tokenization errors.

Also, should the head of the not-really-compound that's split apart by an infix be the first part by default?

@dan-zeman
Member
dan-zeman commented Dec 19, 2016 edited

How large or how open is the set of possible infixes? If it is a closed set then we could also say that this is just an operation of morphology. Maybe unusual and Cantonese-specific, but still word-internal. If that is the solution then the lemma will be 論盡 for both 論盡 and 論鬼盡, and a language-specific morphological feature will be defined to annotate the infixed case. The feature could be something simple, e.g. Infix=Gwai2. I think that Finnish can serve as an example language having features of this kind, right, @fginter?

And in the interrogative case, we could combine Infix=Gwai2 with PronType=Int.
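
As a rough sketch of how such a feature would sit in the FEATS column: CoNLL-U joins `Name=Value` pairs with `|` and sorts them alphabetically by name, ignoring case. `feats_string` is a hypothetical helper, and `Infix` is only the feature name proposed here, not an existing UD feature.

```python
def feats_string(feats):
    """Serialize a feature dict into a CoNLL-U FEATS value.

    Hypothetical helper; an empty dict yields the "_" placeholder.
    """
    if not feats:
        return "_"
    # Sort by feature name, case-insensitively, then join with "|".
    ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


# "Infix" is the language-specific feature proposed in this comment.
emphatic = feats_string({"Infix": "Gwai2"})
interrogative = feats_string({"PronType": "Int", "Infix": "Gwai2"})
print(emphatic)
print(interrogative)
```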

@dan-zeman dan-zeman added this to the lg-specific v2 milestone Dec 19, 2016
@fginter
Member
fginter commented Jan 3, 2017

Not a native speaker, but I don't think Finnish has infixes of this kind. @jmnybl can correct me.

@dan-zeman
Member

I should have used more accurate wording. I did not necessarily mean Finnish has infixes. What I meant was that the Finnish UD data contains language-specific features whose values are actual Finnish morphemes, e.g. Clitic=Han (Ka, Kaan...) or Derivation=Lainen (Llinen, Minen...).

@fginter
Member
fginter commented Jan 4, 2017

Ah, sorry, I misunderstood your question. Yes, we have those features for the various derivations and clitics. The tag is pretty much the derivation suffix / clitic itself. Easy to read and remember.

@ermanh
ermanh commented Jan 6, 2017

Thanks for all the responses! Currently we're not doing features or lemmas but perhaps we might reconsider (@kimgerdes ?). The inventory of infixes is indeed quite small (all with the same emphatic function of 鬼; they're basically all swear words and unlikely to show up in corpora).

Otherwise, the best alternative (?) seems to be to follow the Portuguese mesoclitic example of

1-2	xxxINFIXxxx
1	xxxxxx
2	INFIX

with the caveat that the tree itself would unfortunately not reflect the word form/order correctly.

@jnivre
Contributor
jnivre commented Jan 6, 2017

A dependency tree is in principle unordered, and the numerical order of indices does not encode word order.

@jnivre jnivre closed this Jan 6, 2017
@jnivre jnivre reopened this Jan 6, 2017
@jnivre
Contributor
jnivre commented Jan 6, 2017

Sorry. Hit the wrong button. What I meant to say was that it is not assumed in general that concatenating the words in numerical order gives the original word order. It is precisely for discrepancies like these (among other things) that we have the multiword tokens.
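
To make this concrete, here is a minimal sketch of recovering the surface string from the multiword token line rather than from the word lines. `surface_forms` is a hypothetical helper working on (ID, FORM) pairs, and it assumes no empty nodes (IDs such as `1.1`).

```python
def surface_forms(rows):
    """rows: (ID, FORM) pairs in file order. Returns the surface tokens.

    Hypothetical helper; words covered by a range line like "1-2" are
    skipped, and the range line's own FORM is used instead.
    """
    tokens = []
    covered_until = 0
    for tok_id, form in rows:
        if "-" in tok_id:
            _, end = map(int, tok_id.split("-"))
            tokens.append(form)
            covered_until = end
        elif int(tok_id) > covered_until:
            tokens.append(form)
    return tokens


rows = [("1-2", "xxxINFIXxxx"), ("1", "xxxxxx"), ("2", "INFIX")]

# Naive concatenation of the syntactic words in index order.
naive = "".join(form for tok_id, form in rows if "-" not in tok_id)

print(surface_forms(rows))
print(naive)
```

The naive concatenation puts the infix at the end, while the range line preserves the original form.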
