amod vs. compound for terms like "hot dog" #756

Closed
nschneid opened this issue Jan 17, 2021 · 11 comments

@nschneid
Contributor

nschneid commented Jan 17, 2021

In English, ADJ+NOUN expressions like "hot dog" (in the idiomatic food sense) are traditionally known as compounds. Should "hot" attach as amod or compound?

As @amir-zeldes has pointed out, one distinguishing factor is stress (the Compound Stress Rule): the food is pronounced HOT dog with stress on the first word, whereas the compositional use of the adjective would be pronounced hot DOG. But the written form does not reflect this distinction.

I suspect annotators would be inclined to attach all attributive adjectives modifying a noun as amod. There is, in fact, one token of "hot dogs" in EWT which is currently amod.

A review of uses of compound with ADJ + NOUN in EWT turns up mostly phrasal modifiers like "top notch" and "high quality", which arguably function like multiword adjectives (discussion of phrasal attributive modifiers is happening in #753). Mixed results for "criminal defense lawyer/attorney": amod(defense,criminal) (2 instances), compound(defense,criminal) (1 instance). All matches of this pattern in GUM look like tagging errors.
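A review like the one described above could be approximated with a short script. The following is only a minimal sketch: it assumes the standard 10-column CoNLL-U layout, and the `SAMPLE` rows are hypothetical toy sentences written for illustration, not actual EWT or GUM data. The function names are likewise made up here.

```python
def parse_conllu(text):
    """Yield sentences as lists of token dicts from CoNLL-U text."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def adj_noun_edges(text, deprels=("amod", "compound")):
    """Collect (modifier, head, deprel) for ADJ tokens preceding a NOUN head."""
    hits = []
    for sent in parse_conllu(text):
        by_id = {tok["id"]: tok for tok in sent}
        for tok in sent:
            head = by_id.get(tok["head"])
            if head is None:
                continue  # root token or missing head
            base = tok["deprel"].split(":")[0]  # ignore subtypes
            if (tok["upos"] == "ADJ" and head["upos"] == "NOUN"
                    and tok["id"] < head["id"] and base in deprels):
                hits.append((tok["form"], head["form"], tok["deprel"]))
    return hits


# Hypothetical toy data (space-separated here, converted to tabs below).
_ROWS = [
    "1 hot hot ADJ JJ _ 2 amod _ _",
    "2 dogs dog NOUN NNS _ 0 root _ _",
    "",
    "1 criminal criminal ADJ JJ _ 2 compound _ _",
    "2 defense defense NOUN NN _ 3 compound _ _",
    "3 lawyer lawyer NOUN NN _ 0 root _ _",
]
SAMPLE = "\n".join("\t".join(row.split()) for row in _ROWS)
```

Calling `adj_noun_edges(SAMPLE, deprels=("compound",))` restricts the listing to the ADJ + NOUN pairs currently labeled compound, which are the ones an annotator would want to inspect by hand.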

Note that the term "compound" has caused confusion for end users (#551), but I am mainly concerned with having clear criteria for annotators.

(We can also include names like "White House" in this category, though per PTB guidelines for proper names "White" is tagged as PROPN (#678), so on that basis it makes sense to use compound.)

@amir-zeldes
Contributor

I think we should avoid talking about compositionality or MWEs in the context of compounds: there are compounds which follow the Compound Stress Rule (CSR) and are compositional, like "TRAIN station", and non-compositional cases which do not follow it, like "hot poTAto", which I have no doubt is amod. The underlying structure belongs to a particular syntactic construction, regardless of whether or not it ends up being lexicalized or having idiosyncratic meaning.

As for the CSR as a criterion for compound vs. amod in A+N compounds, I'm more ambivalent here. On the one hand, historically it's clear that old A+N compounds obeying CSR do not represent full adjectives as modifiers, but rather uninflected adjective stems. This is clearest when things are spelled together, like compound "BLACKbird" (a species of bird) vs. phrasal "black BIRD" (any bird that is black). Looking at Germanic languages which inflect full adjectives, the difference is clearer than in English, e.g. German Schwarzbrot "blackbread" (adj stem) vs. Schwarzes Brot (inflected amod), but these are historical/comparative facts, and they don't have to be synchronically relevant.

Even for words that have distinct adjectival forms (e.g. golden vs. gold, the latter of which can be a compound modifier), most English speakers don't seem to recognize a distinction. From my experience teaching UD to English-speaking students every year, it's hard enough just to explain that the modifier in N+N compounds is not an adjective... When we have two tokens and the text is written, it's not always possible to know whether CSR applies, so if something is clearly an ADJ synchronically, as in "hot dog", I don't even know that we should be insisting on compound from a synchronic point of view. If we do, then I think CSR is the only tenable criterion, and it implies that we know and agree about the stress for any potential case (or we decide compound applies to a closed list/when in doubt use amod). Either way, it looks to me like this distinction is now very weak in English, so we may not want to include it in guidelines.

@dan-zeman
Member

I would vote for always using amod when attaching an ADJ to a following NOUN in English, and when the result behaves like a NOUN.
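As a rough illustration, the rule proposed here can be written as a small relabeling function. This is a sketch under stated assumptions: the function name and signature are hypothetical, and the condition that the result "behaves like a NOUN" cannot be read off the UPOS tags, so it is deliberately left unchecked and would still need human judgment.

```python
def preferred_deprel(dep_upos, head_upos, dep_index, head_index, current):
    """Return amod for an ADJ attached to a following NOUN, else keep the label.

    Assumption: only edges currently labeled compound are candidates for
    relabeling. Whether the whole phrase behaves like a NOUN is NOT checked.
    """
    is_adj_before_noun = (
        dep_upos == "ADJ" and head_upos == "NOUN" and dep_index < head_index
    )
    if is_adj_before_noun and current.split(":")[0] == "compound":
        return "amod"
    return current
```

For example, `preferred_deprel("ADJ", "NOUN", 1, 2, "compound")` yields `amod`, while a NOUN + NOUN edge such as "defense attorney" keeps its existing label.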

@Stormur
Contributor

Stormur commented Jan 20, 2021

I would also always choose amod (or nmod) for these "compounds", because the structure is the same as modified noun + modifier. The crystallization of some co-occurrences into tighter lexical units is something I think we register a posteriori, at a higher or different level (similarly to proper nouns: #702). I see a clear hierarchy of modifiers in all these examples:

  • hot dog: amod(dog,hot) - admittedly the more lexicalised of these examples
    • hot dog stand: amod(dog,hot) nmod(stand,dog)
  • (high) quality system: amod(quality,high) nmod(system,quality)
  • (criminal) defense attorney: amod(defense,criminal) nmod(attorney,defense)

... and so on. And this would incidentally be the easiest for the annotators, too!
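The hierarchy in the examples above can be made visible by rendering the edges as an indented tree. This is a minimal sketch: `render_tree` and `HOT_DOG_STAND` are names invented for illustration, word forms are assumed to be unique within the phrase, and the labels encode the analysis proposed in this comment rather than what any existing treebank necessarily uses.

```python
def render_tree(edges, root):
    """Render (dependent, head, deprel) edges as an indented tree string."""
    children = {}
    for dep, head, rel in edges:
        children.setdefault(head, []).append((dep, rel))

    lines = []

    def walk(node, rel, depth):
        # One line per node, indented by its depth below the root.
        lines.append("  " * depth + f"{node} ({rel})")
        for dep, dep_rel in children.get(node, []):
            walk(dep, dep_rel, depth + 1)

    walk(root, "root", 0)
    return "\n".join(lines)


# The "hot dog stand" analysis from the list above.
HOT_DOG_STAND = [("hot", "dog", "amod"), ("dog", "stand", "nmod")]
```

`render_tree(HOT_DOG_STAND, "stand")` puts "stand" at the top, "dog" one level below it, and "hot" one level below "dog", mirroring the nesting described in the list.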

@nschneid
Contributor Author

@Stormur are you arguing that compound should be removed altogether?

@sylvainkahane
Contributor

I agree with @Stormur. We never understood how to apply compound for French and decided to remove it.

@nschneid
Contributor Author

Interesting. I think @jnivre has expressed that compound could be viewed as a subtype of nmod.

In any case, removing compound altogether would be a bold change for UD. It's not a crazy idea, but I'm not going to propose it be done for English treebanks only, since there are many nonprepositional, nonpossessive nominal modifiers in English for which it is an obvious fit ("defense attorney").

In this issue I just wanted to get some clarity around adjective cases like "hot dog". It seems like nobody objects to calling that amod.

@Stormur
Contributor

Stormur commented Jan 20, 2021

> @Stormur are you arguing that compound should be removed altogether?

In fact, as it is now (no real definition in the guidelines), the more I think of it, yes. I agree with viewing it as a kind of nmod (I was going to write this in a more detailed way in the other discussion...). I have also not yet found a case where it would be used in Latin: everything seems to be already covered by *mod, appos (arguably also a redundant subtype of nmod) and sometimes flat or conj.

@nschneid
Contributor Author

Here's a pull request for clarifying the scope of compound: #759

I do wish we had a better overall definition but I am trying to improve this one for now to say the scope should be limited based on morphosyntactic criteria.

nschneid added a commit that referenced this issue Jan 20, 2021
@amir-zeldes
Contributor

@Stormur I don't think we should get rid of compound for several reasons:

  1. English and other European languages are not the only users of this UD relation - we have 200 treebanks covering very diverse languages. We can use compound for Semitic construct states, incorporation with subtokenization analyses, compound verbs...
  2. Even for Romance languages, compound analyses are sometimes used for things like Italian "centro trasfusione sangue" or French "vote sanction". Some Romance UD treebanks use the compound relation to describe these rare but interesting constructions, for example: http://match.grew.fr/?corpus=UD_French-Spoken@2.7&custom=6008786c75a16
  3. Even restricting the discussion to the unmarked English nominal compounds, there are many genuine syntactic differences between compounds and other (usually case marked) modifiers (i.e. nmod / obl):
    • Compounds in English usually resist pluralization of the modifier - nmod does not
    • Compounds resist having a separate determiner on the modifier, but nmod does not
    • Compounds cannot exhibit reflexive binding (John conversed with himself <> * John's himself conversation)
    • More generally, compound modifiers cannot be pronominalized, but nmods can: John's conversation with Mary; the Mary conversation <> John's conversation with her; *the her conversation
    • Compound modifiers cannot normally be bound to anaphors: "I brought bread for ducks/nmod. They were very hungry". But not "* I brought duck bread. They were very hungry"

The common denominator of compounding across languages is that it creates something that at the syntactic level behaves like one word. In some languages this amounts only to limitation to a single determiner/definiteness value (e.g. Arabic or Hebrew), but otherwise the modifier can be modified. In others, the result is much more restricted, often amounting to a single noun with almost no possibility of modifying the modifier (e.g. German). The article criterion is convenient for N-N compounds, since it covers a broad range of languages, including the Romance examples ("le vote sanction", but not "*le vote une sanction" or "*un vote la sanction" etc.)
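The article criterion lends itself to a simple consistency check over annotated data: flag any compound modifier that carries its own det dependent. The sketch below is illustrative only; the function name, the `(id, form, head, deprel)` tuple layout, and both annotated examples are assumptions made here, and the second example annotates the starred (ungrammatical) string purely to show what the check would catch.

```python
def compounds_with_own_det(tokens):
    """Return forms of compound modifiers that have their own det child.

    tokens: list of (id, form, head, deprel) tuples for one sentence.
    """
    # IDs of every token that heads a determiner.
    heads_of_det = {head for (_id, _form, head, deprel) in tokens
                    if deprel.split(":")[0] == "det"}
    return [form for (tid, form, _head, deprel) in tokens
            if deprel.split(":")[0] == "compound" and tid in heads_of_det]


# "le vote sanction": a single determiner on the head noun.
OK_PHRASE = [
    (1, "le", 2, "det"),
    (2, "vote", 0, "root"),
    (3, "sanction", 2, "compound"),
]

# "*un vote la sanction": hypothetical annotation of the starred string.
BAD_PHRASE = [
    (1, "un", 2, "det"),
    (2, "vote", 0, "root"),
    (3, "la", 4, "det"),
    (4, "sanction", 2, "compound"),
]
```

An empty result, as for `OK_PHRASE`, means no compound modifier has a determiner of its own. A non-empty result does not prove an annotation error, especially in languages where the modifier can still be modified, but it marks tokens worth a second look.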

@nschneid
Contributor Author

@amir-zeldes, @Stormur: could one of you start a separate issue to consolidate the discussion of the rationale for compound? Closing this one as it pertains to amod.

@Stormur
Contributor

Stormur commented Jan 20, 2021

@nschneid you are right, sorry! The issue is expanding and it deserves its own place.


5 participants