The "compound" relation for nominals: what, why and when? #761

Closed
Stormur opened this issue Jan 21, 2021 · 10 comments

@Stormur
Contributor

Stormur commented Jan 21, 2021

Some recent discussions about the use of the compound relation in English (#753, #756, #757) have served as a springboard for a wider discussion on the definition and use of this relation in the annotation of UD treebanks, which deserves its own space.

My main point is that, at least as it stands now in the guidelines, compound seems poorly or unclearly defined and also not well justified, in that it just appears to be a variant of nmod/amod. What I see:

  • Though purportedly a quite common phenomenon, its distribution might be very irregular: it is used massively by some treebanks, but completely ignored by others (e.g. Latin IT-TB & LLCT, or reportedly French);
  • It seems to capture a merely superficial aspect of a modifier: parallel expressions are annotated as compound or nmod based e.g. on the position with respect to the head or presence/absence of morphological traits:
    • dog tail (compound) vs. tail of the dog (nmod) vs. the dog's tail (nmod?)
  • It should represent "an absence of internal structure", but very often constructions using compound see a clear hierarchy, a kind of "nesting":
    • criminal defense attorney -> amod(defense,criminal) nmod(attorney,defense) root(attorney)
  • The "lexicalization" of compound expressions seems to happen at a different level and not be totally predictable from syntax
  • It seems to be a case where a traditional denomination is directly carried into a syntactic UD relation, creating some sort of undesirable redundancy (not so infrequent for morphological features either, in my experience).
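To make the nesting point concrete, here is a minimal Python sketch that encodes two analyses of criminal defense attorney as (relation, head, dependent) triples and compares them. The nmod triples are the ones listed above; the compound variant and the helper names `depth` and `shape` are illustrative assumptions, not taken from any treebank or tool.

```python
# Two analyses of "criminal defense attorney" as (relation, head, dependent).
# NMOD_ANALYSIS repeats the triples given in the list above; COMPOUND_ANALYSIS
# is an assumed rendering of how a compound-using treebank might label it.
NMOD_ANALYSIS = [
    ("root", None, "attorney"),
    ("nmod", "attorney", "defense"),
    ("amod", "defense", "criminal"),
]
COMPOUND_ANALYSIS = [
    ("root", None, "attorney"),
    ("compound", "attorney", "defense"),
    ("amod", "defense", "criminal"),
]


def depth(triples, word):
    """Number of edges between `word` and the root (hypothetical helper)."""
    head_of = {dep: head for _, head, dep in triples}
    steps = 0
    while head_of[word] is not None:
        word = head_of[word]
        steps += 1
    return steps


def shape(triples):
    """Unlabelled tree shape: each dependent mapped to its head."""
    return {dep: head for _, head, dep in triples}


# "criminal" sits two levels below the root in either analysis, so the
# expression has internal hierarchy whichever label is chosen.
print(depth(NMOD_ANALYSIS, "criminal"))
print(shape(NMOD_ANALYSIS) == shape(COMPOUND_ANALYSIS))
```

Under these assumptions the two analyses differ only in one label, not in tree shape, which is the sense in which compound looks like a variant of nmod here.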

So, in the end, I would be inclined to consider most of such uses as simple *mods, as the most transversal relation which can account for the variability, and some degree of arbitrariness, of strategies in different languages. I doubt the real usefulness of a compound relation, arguing that its possible peculiar traits are better seen as a correlation with other factors (word order, morphological features...) under the umbrella of "modifiers".

Probably some particular cases remain, bordering on what is traditionally called "apposition", such as Latin Deus pater 'God father', which I don't know if some approach sees as a compound. But here the status of compound with respect to flat or appos needs to be clarified, since these already seem capable of handling such expressions (in Latin, for example, we are using flat, as for name + title).

In general, knowing that for Latin we tried to conceive a possible use for compound but ended up discarding it, and having discussed the phenomenon at length for English, it would be interesting to gather the experience of other treebanks about the use of this relation, to get a clearer picture! One of the reasons I am interested in this is to understand whether compound has a place in the annotation of Latin.

@Stormur
Contributor Author

Stormur commented Jan 21, 2021

I would like to comment here on a point from this post by @amir-zeldes, from #756.

@Stormur I don't think we should get rid of compound for several reasons:


  1. Even for Romance languages, compound analyses are sometimes used for things like Italian "centro trasfusione sangue" or French "vote sanction". Some Romance UD treebanks use the compound relation to describe these rare but interesting constructions, for example: http://match.grew.fr/?corpus=UD_French-Spoken@2.7&custom=6008786c75a16

I really do not agree with this interpretation of such word strings, at least not for Italian. Actually, they are not rare at all; in my personal experience, they are even expanding. They are also the standard for signage and newspaper headlines: it is a matter of telegraphic vs. articulated style. There is absolutely no difference between centro trasfusione sangue (ca. 'blood transfusion center') and centro di/per la trasfusione del sangue, lit. 'center of/for the transfusion of the blood'. The structure is the same, but some connectors become "implicit" and relations underspecified, because in context or from a pragmatic point of view they are clear.
Just to cite another example, in Italian ISDT I found it quite bizarre to see Consorzio credito opere pubbliche, which is just telegraphic for Consorzio di credito per le opere pubbliche, ca. 'Consortium for funds for public works', with nmod(consorzio,credito) but compound(credito,opere). The rationale here is not really understandable. If anything, I would have expected compound(consorzio,credito) and nmod(credito,opere).

By contrast, the French case looks very different, in that the two elements are coreferential: it is a vote and at the same time it is a sanction. Some might see an argument for appos here. I would also lean toward flat. I find such cases to be elusive.

@dan-zeman
Member

Some remarks on the first post by @Stormur:

  • Not all UD relations are supposed to occur in all languages, so uneven distribution is not necessarily a problem, as long as people working on various treebanks of one language can find consensual and clear guidelines for compound in their language. But I agree that compound is probably overused in some treebanks, and that more clarification at the universal level could help avoid some of the confusion we have seen.
  • It is not true that instances of compound always compete with nmod or amod. There are known and documented usages that cannot be either of them: VERB + NOUN = VERB (compound:lvc); VERB + VERB = VERB (compound:svc); VERB + ADP = VERB (compound:prt).
  • It is not true that compound "should represent an absence of internal structure". If there is no internal structure, it should be flat. If there is internal head-dependent structure, in the case of a NOUN-NOUN compound it is by default nmod but some languages may use compound instead, under certain circumstances, if those circumstances can be well defined (see Two Nominals).

@Stormur
Contributor Author

Stormur commented Jan 21, 2021

(Thanks @dan-zeman for your patience.)

Right, I have to restrict the issue to nominal compounds (I am changing the title). The field of verbal compounds, light verbs and so on is too vast to fit in here!

As for the last point, I thought I had read it on the compound page... maybe on an earlier version? Or I got confused with other multiword relations. My bad! I am removing that part.

@Stormur changed the title from The "compound" relation: what, why and when? to The "compound" relation for nominals: what, why and when? on Jan 21, 2021
@amir-zeldes
Contributor

@Stormur I agree that if we didn't have compound, then nmod would probably be the closest thing we could call these constructions, but that still doesn't mean we need to get rid of compound in my opinion. Compounds are a very established phenomenon in linguistic research, and in fact, this description:

The structure is the same, but some connectors become "implicit" and relations underspecified

is a very typical expression of what many morphologists assume is a prototypical characteristic of compounds, for example:

" .. two important aspects of compound meaning. The first of these is the idea that there is an underspecified semantic relation between the constituents..." (Bell & Schäfer 2016, "Modelling semantic transparency". Morphology 26, 57–199)

Removing compounds in UD would add a major difference between treebank analyses and common practices in linguistics, and I'm not sure what the advantage would be. If we just want to find all cases of noun modification, isn't it still easy to do so using the current scheme?

For Italian, I'd like to point out that even if these constructions are understood as telegraphic paraphrases, they are still syntactically distinct from full NP+PP in that we can mix and match determiners or add modifiers, just as in other languages discussed in the above threads:

  • Un centro per molti altri tipi di trasfusioni
  • Il centro per gli altri tipi di trasfusioni

But not:

  • x Un centro molti altri tipi trasfusioni
  • x Il centro gli altri tipi trasfusioni

etc. From a practical perspective, I think having compounds annotated even for languages like Italian can be helpful, though of course individual language guidelines are developed separately and with all sorts of higher level considerations. For example, you said you have the feeling that this type of telegraphic construction is expanding in Italian -- labeling such cases differently would make it much easier to study whether this is the case, and in what contexts / what kinds of lexical items it tends to appear.

@Stormur
Contributor Author

Stormur commented Jan 22, 2021

Compounds are a very established phenomenon in linguistic research, and in fact, this description:

The structure is the same, but some connectors become "implicit" and relations underspecified

is a very typical expression of what many morphologists assume is a prototypical characteristic of compounds, for example:

Haha, you are right, it also occurred to me while writing it! 🙂

Just to make it clear: I am absolutely not contesting the existence of a compound construct, but only its (in my opinion) undue representation/annotation as a separate deprel in UD.

First thing, some more notes about the Italian "telegraphic style", just to put things in perspective (it really is a very interesting phenomenon with many ramifications!).

For Italian, I'd like to point out that even if these constructions are understood as telegraphic paraphrases, they are still syntactically distinct from full NP+PP in that we can mix and match determiners or add modifiers, just as in other languages discussed in the above threads:

  • Un centro per molti altri tipi di trasfusioni
  • Il centro per gli altri tipi di trasfusioni

But not:

  • x Un centro molti altri tipi trasfusioni
  • x Il centro gli altri tipi trasfusioni

I am not sure of the validity of this argument to distinguish these two kinds of constructions. That is, I think that the point of view should be reversed: the telegraphic paraphrase of a sequence of nominal modifiers is only possible when some requirements are satisfied, like absence of determiners, genericity versus specificity, context, and so on. So I would rather say that it is not that such constructs do not admit those things, but that they arise when the right conditions are present. I mean, conversely it would not make much sense to say something like Consorzio di grandi crediti per alcune opere pubbliche... I think it is semantically driven.

This and the already stated complete parallelism between telegraphic and articulated forms is just to remark again the arbitrariness of the line between "normal" nominal modifiers and compound-like ones: there is a sort of transition.

@Stormur I agree that if we didn't have compound, then nmod would probably be the closest thing we could call these constructions, but that still doesn't mean we need to get rid of compound in my opinion.
...
Removing compounds in UD would add a major difference between treebank analyses and common practices in linguistics, and I'm not sure what the advantage would be. If we just want to find all cases of noun modification, isn't it still easy to do so using the current scheme?

etc. From a practical perspective, I think having compounds annotated even for languages like Italian can be helpful, though of course individual language guidelines are developed separately and with all sorts of higher level considerations. For example, you said you have the feeling that this type of telegraphic construction is expanding in Italian -- labeling such cases differently would make it much easier to study whether this is the case, and in what contexts / what kinds of lexical items it tends to appear.

So, my suggestion would be not to remove a relation for compounds altogether. I agree with, and am very sensitive to, the practical problem of retrieving particular constructions, and I always give it some thought during annotation. But here we have the mirror-image issue: were I doing research on noun modifiers, instead of focusing only on *mod-type relations, I would need to remember that a special compound relation exists separately too, then try to figure out the language-specific criteria for its definition, and in the end, for practical reasons, I would just lump everything together and make no real distinction.
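A rough Python sketch of that retrieval problem, under stated assumptions: the token tuples below are hand-built from the dog tail / tail of the dog examples in this thread (not copied from a treebank), and `noun_modifiers` is a hypothetical helper that matches relation labels exactly.

```python
# Each token: (id, form, upos, head, deprel). Hand-built illustrative data.
DOG_TAIL = [
    (1, "the", "DET", 3, "det"),
    (2, "dog", "NOUN", 3, "compound"),
    (3, "tail", "NOUN", 0, "root"),
]
TAIL_OF_DOG = [
    (1, "the", "DET", 2, "det"),
    (2, "tail", "NOUN", 0, "root"),
    (3, "of", "ADP", 5, "case"),
    (4, "the", "DET", 5, "det"),
    (5, "dog", "NOUN", 2, "nmod"),
]

NOMINAL = {"NOUN", "PROPN"}


def noun_modifiers(sentence, relations):
    """Return (dependent, head) form pairs where both words are nominal
    and the relation label is exactly one of `relations`."""
    by_id = {tok[0]: tok for tok in sentence}
    pairs = []
    for _tid, form, upos, head, deprel in sentence:
        if head == 0 or upos not in NOMINAL:
            continue
        if by_id[head][2] in NOMINAL and deprel in relations:
            pairs.append((form, by_id[head][1]))
    return pairs


# Searching for nmod alone silently misses "dog tail" ...
print(noun_modifiers(DOG_TAIL, {"nmod"}))
# ... so the query has to list compound explicitly as well.
print(noun_modifiers(DOG_TAIL, {"nmod", "compound"}))
print(noun_modifiers(TAIL_OF_DOG, {"nmod"}))
```

With the current labels, a search for nominal modifiers has to know in advance that compound must be added to the set of relations.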

So, since we kind of seem to agree that nmod is the closest thing to compound (for nominals), my suggestion would be to make them as close as possible, i.e. begin annotating today's nominal compound as a subrelation: nmod:compound (which does not yet seem to exist).

I see a major parallel with the core/oblique vs. complement/adjunct distinction here: how can we consistently and cross-linguistically decide when something is a "required argument" of a predicate? There is way too much semantic variation in it. So we just have an overarching obl and the possibility of the subrelation obl:arg. Nothing is lost if someone wants to annotate this kind of representation; it can be retrieved and distinguished from the data if one wants to; and it can be ignored by anyone interested only in the "big picture". As such, we are not bothered by possible idiosyncrasies in its annotation across different languages.

So, in my opinion a similar decision for nmod/compound would both allow for more inter-treebank consistency and keep this nuance between different kinds of nominal modifiers. In the end, it is a question of gradation.
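As a small illustration of why a subrelation keeps both options open, here is a Python sketch that splits a relation label into its universal part and its subtype. The helper names `universal` and `subtype` are made up for this example, and nmod:compound is the proposed label, not an existing one.

```python
def universal(deprel):
    """Universal part of a relation label, e.g. 'obl:arg' -> 'obl'."""
    return deprel.split(":", 1)[0]


def subtype(deprel):
    """Subtype of a relation label, or None if it has none."""
    _, _, sub = deprel.partition(":")
    return sub or None


labels = ["nmod", "nmod:compound", "obl", "obl:arg"]

# "Big picture" view: the subtype is simply ignored.
print([universal(d) for d in labels])
# Fine-grained view: the nuance is still retrievable.
print([subtype(d) for d in labels])
```

Someone comparing treebanks at the universal level would see only nmod and obl, while someone studying compounds specifically could still filter on the subtype.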

@sylvainkahane
Contributor

I agree with @Stormur that everything is confusing with compound. Of course we will not suppress the information attached to compound, but maybe we can rewrite the definition, change the name of the relation, and maybe split it.

The definition is very unclear to me: "It is used for any kind of X0 compounding". No idea what this means and there is no reference. It looks like something coming from X-bar syntax, not from dependency syntax.
What I understand is that compound must be used for particularly cohesive constructions. For instance, in English we have three constructions for a NOUN modifying another NOUN: the dog tail, the dog's tail, the tail of the dog. The most cohesive construction is the first one, and if we want to distinguish it from the others, we can use compound. But I completely agree with @Stormur that nmod:compound would be much more appropriate. This could be changed automatically very easily (with Grew or another tool).
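As a rough illustration of the "another tool" route, here is a plain-Python sketch (not Grew) that relabels compound as nmod:compound in a single CoNLL-U sentence. Restricting the change to cases where both head and dependent are NOUN or PROPN is an assumption made for this example, and the function name is hypothetical.

```python
NOMINAL = {"NOUN", "PROPN"}


def relabel_nominal_compounds(conllu):
    """Rewrite DEPREL `compound` as `nmod:compound` when head and dependent
    are both nominal. Handles one sentence; subtyped labels such as
    compound:prt are left untouched."""
    lines = conllu.splitlines()
    rows = [ln.split("\t") for ln in lines]
    # Map token ID -> UPOS for ordinary 10-column token lines.
    upos = {r[0]: r[3] for r in rows if len(r) == 10}
    out = []
    for ln, r in zip(lines, rows):
        if (len(r) == 10 and r[7] == "compound"
                and r[3] in NOMINAL and upos.get(r[6]) in NOMINAL):
            ln = "\t".join(r[:7] + ["nmod:compound"] + r[8:])
        out.append(ln)
    return "\n".join(out)


# "the dog tail", built by hand for illustration.
sample = "\n".join("\t".join(fields) for fields in [
    ["1", "the", "the", "DET", "_", "_", "3", "det", "_", "_"],
    ["2", "dog", "dog", "NOUN", "_", "_", "3", "compound", "_", "_"],
    ["3", "tail", "tail", "NOUN", "_", "_", "0", "root", "_", "_"],
])
print(relabel_nominal_compounds(sample))
```

Whether a UPOS-based filter like this is adequate for a given treebank would need to be checked case by case.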

The name: It is very important to recall that all relations in UD are syntactic relations corresponding to particular syntactic constructions. But for compound this is unclear. First, the relation is introduced on the general page for Universal Dependency Relations as a particular case of MWE. This is clearly wrong: compound is not used for MWEs, it is used for particular syntactic constructions that are particularly cohesive, but not necessarily semantically frozen (even if cohesive constructions tend to freeze). Second, the name is a term inherited from morphology, not from syntax.

The confusion is increased by the fact that compound is used for several constructions which are unrelated from the syntactic point of view, except for the fact that they are particularly cohesive. What do the N N construction, phrasal verbs in English, and serial verb constructions have in common? Nothing. Why are they gathered under the same relation? As said before, the N N construction is a particular case of nmod, the verb particle in Germanic languages is a particular case of advmod, and SVC is likely to be a particular case of conj in most of the languages concerned. (Light verb constructions should not be covered by compound because they are, at least in Indo-European languages, just regular obj or obl relations from the syntactic point of view.)

In other words, I suggest that the compound relation be redistributed over other relations. The distinction encoded with compound can be kept, but something has to change if we want the set of UD relations to be used homogeneously and our relations to correspond more or less to universal categories of syntactic constructions.

@nschneid
Contributor

Agree that compound is for constructions that are heterogeneous apart from being particularly cohesive. One way to define this, I think, is that these constructions are productive but often lead to word formation/idiosyncratic lexical meanings without a change in form. They are not the only syntactic sources of lexicalization/idiomatic-MWEs, but they are perhaps the main ones.

@amir-zeldes
Contributor

I'm not particularly opposed to renaming noun compound to nmod:compound, but I'd like to give what I think is some relevant background, as well as issues from other languages which may make this difficult in practice.

First of all, I don't think it's totally arbitrary and unrelated that these things got named compound. The major commonality is how the morphological literature treats compounding: as a word formation process involving multiple full lexical stems which normally have independent forms (not functional elements or affixes), and which results in something that behaves like one syntactic 'word', at least to some extent.

For noun compounds, this depends greatly on the language, but I think in virtually every language it means only one determiner for the whole thing (in many languages also: only one number, gender, case etc.). @sylvainkahane rightly pointed out that compound:prt involves something like an adverb, and SVCs are similar to coordination, but they are also like compounding, in that they create something like a single verb, with a unified lemma and/or the properties of one verb (e.g. tense: "I went and did it" type SVCs usually can't mix tenses: "*I went and will do it").

In terms of phrasal verbs being "one word", this is mainly meaning based in English (e.g. "pick" and "pick up" mean rather different things), but in other Germanic languages, the situation is more complicated. In German, phrasal verbs are literally analyzed as single words depending on tense and sentence position. So we have:

  • Die Sonne geht auf "The sun rises" (lit. "goes up")
  • dass die Sonne aufgeht "that the sun rises" (lit. "upgoes")
  • Der Sonnenaufgang "the sunrise" (lit. "the sunupgoing")

As these examples illustrate, it's tempting to treat the first case as the 'unusual' one, where a complex verb is split in context, but really it's a compound verb with a single lemma, meaning "rise". And in fact, the third example suggests that the complex verb is even nominalizable as a single item, retaining the idiosyncratic meaning and behaving like a single stem in a compound (single gender, case, number and definiteness as well).

So does this mean we can't call the nominal case nmod:compound and the verbal ones something else? No, not necessarily. BUT, if some language doesn't distinguish these subtypes, how could we convert something called compound automatically? You could say, well, if it modifies a noun, let's call it nmod:compound. But notice that in some cases, this might not be right. Consider:

  • They finger printed her
  • How to Ballet Dance

In the first case, we have a denominal verb derived from an actual noun compound. Should we not label "finger" as nmod:compound just because the head is a verb? This would actually go against the consensus in #753 (phrasally modified compound nouns). In the second case, we have a verbal synthetic compound (one where the modifier is semantically an argument of the head). It's clearly not a phrasal verb IMO, and of course it's also not some kind of SVC.

So while for English a conversion to nmod:compound would be fairly straightforward (since we already distinguish compound:prt), I'm not sure it would be possible to convert automatically for languages that only use an unsubtyped compound. And for more challenging non-nominal cases like the above, I'm not even sure what label they should get if the options are nmod:compound, conj:compound and advmod:compound, as I think @sylvainkahane is suggesting.
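To illustrate the conversion problem, here is a deliberately naive Python sketch that guesses a replacement label from UPOS tags alone. The rule set, the function name `guess_subtype`, and the UPOS tags assigned to the examples are all assumptions made for this illustration; the target labels are the ones floated in this thread, none of which exist yet.

```python
NOMINAL = {"NOUN", "PROPN"}


def guess_subtype(dep_upos, head_upos):
    """Naive guess at a replacement for unsubtyped `compound`, based only on
    the UPOS of dependent and head. Returns None when no rule applies."""
    if head_upos in NOMINAL:
        return "nmod:compound"
    if head_upos == "VERB" and dep_upos == "ADP":
        return "advmod:compound"   # particle-like dependent
    if head_upos == "VERB" and dep_upos == "VERB":
        return "conj:compound"     # SVC-like pair
    return None


# "dog tail": NOUN modifying NOUN, the easy case.
print(guess_subtype("NOUN", "NOUN"))
# "They finger printed her": finger (NOUN) attached to printed (VERB).
print(guess_subtype("NOUN", "VERB"))
# "How to Ballet Dance": Ballet (NOUN) attached to Dance (VERB).
print(guess_subtype("NOUN", "VERB"))
```

Under these assumed rules, both problem cases fall through without a label, which is the gap an automatic conversion would have to handle somehow.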

@arademaker
Contributor

arademaker commented Jan 28, 2021

What about "..said a managing director at one of ..."? The word managing is tagged as VERB in EWT. I found inconsistencies in the annotation of gerunds and past participles as verbs, adjectives, and nouns.

@dan-zeman dan-zeman modified the milestones: v2.8, v2.9 Jun 17, 2021
@dan-zeman dan-zeman modified the milestones: v2.9, v2.11 Jun 13, 2022
@dan-zeman dan-zeman modified the milestones: v2.11, v2.13 May 31, 2023
@nschneid
Contributor

The definition of compound is being revised in #989.
