Multiword expression nominalizations #807

nschneid · 2021-08-22T15:02:55Z

UniversalDependencies/UD_English-EWT#213 raised the issue of "to-do" acting as a noun:

it was a big to-do to find anyone who knew anything about it.

"To do" is a strange expression. I can't think of another to-infinitive verb that has been lexicalized as a noun (in both the fuss sense and the task item sense); closest I can think of is buying food "to go" (adverbial?). I guess the options are:

to/PART <-mark? compound?- do/NOUN
to/NOUN <-compound- do/NOUN

I think "do" has to be a NOUN because it is modified by a determiner and an adjective. It wouldn't make sense to call it a verb internally and a noun externally, i.e.:

to/PART <-mark- do/VERB/ExtPos=NOUN

See also #648 #753

dan-zeman · 2021-08-22T15:40:42Z

> It wouldn't make sense to call it a verb internally and a noun externally

Why not? If I understand it correctly, the whole thing works as a NOUN but “do” does not. (And “to” definitely does not.)

nschneid · 2021-08-22T15:59:12Z

Well if we had a phrase for just "to do" then it would work. But "do" also has "a" and "big" as dependents, so it is a noun with respect to those internal modifiers.

amir-zeldes · 2021-08-22T16:14:18Z

I think this is consistent with our previous discussion of phrasal compound modifiers (see #753 for example):

A/DET big/ADJ to/PART do/NOUN
det(do, a)
amod(do, big)
mark(do, to)

perrier54 · 2021-08-22T16:19:56Z

In the French corpora, we have encountered many expressions, essentially idioms and titles, whose head behaves internally with its usual POS, but externally with another POS. To deal with these double-sided words, we systematically introduced the ExtPos feature, as @nschneid suggests for his example.

In addition, for idioms, the head carries the feature Idiom=Yes and the other elements of the idiom carry the feature InIdiom=Yes.
For titles, we similarly have Title=Yes and InTitle=Yes.
We can thus distinguish the external dependents of these expressions from their internal dependents.
For the example mentioned by @nschneid, we could have:
to [upos=PART, InIdiom=Yes]
do [upos=VERB, ExtPos=NOUN, Idiom=Yes]

nschneid · 2021-08-22T17:13:44Z

To make sure I am understanding the application to the example:

7	it	it	PRON	PRP	Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs	13	expl	13:expl	_
8	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	13	cop	13:cop	_
9	a	a	DET	DT	Definite=Ind|PronType=Art	13	det	13:det	_
10	big	big	ADJ	JJ	Degree=Pos	13	amod	13:amod	_
11	to	to	PART	TO	InIdiom=Yes	13	mark	13:mark	SpaceAfter=No
12	-	-	PUNCT	HYPH	_	13	punct	13:punct	SpaceAfter=No
13	do	do	VERB	VB	ExtPos=NOUN|Idiom=Yes|Number=Sing	4	conj	4:conj:and	_

I see that the distinction between InIdiom=Yes and Idiom=Yes is necessary to distinguish the head in case one idiom is modifying another (e.g., "a no-holds-barred to-do"—we want to encode that there are 2 idioms, not 1).
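To make the head/member distinction concrete, here is a minimal Python sketch of how a consumer of the treebank might recover idiom spans from these features. It assumes the Idiom=Yes / InIdiom=Yes convention described in this thread (not a standardized UD feature), uses simplified token tuples instead of full CoNLL-U rows, and the helper names `parse_feats` and `group_idioms` are hypothetical.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string ('A=1|B=2', or '_' for none) into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def group_idioms(tokens):
    """Map each Idiom=Yes head ID to the sorted IDs of all tokens in that idiom.

    tokens: list of (id, form, feats, head) tuples, a simplified CoNLL-U row.
    """
    feats = {tid: parse_feats(f) for tid, _, f, _ in tokens}
    head = {tid: h for tid, _, _, h in tokens}
    # Every token marked Idiom=Yes starts its own group, so two adjacent
    # idioms stay separate as long as each has its own head.
    groups = {tid: [tid] for tid, f in feats.items() if f.get("Idiom") == "Yes"}
    for tid, f in feats.items():
        if f.get("InIdiom") != "Yes":
            continue
        # Climb through other InIdiom tokens until an idiom head is reached.
        h, seen = head[tid], {tid}
        while (h not in groups and h not in seen
               and feats.get(h, {}).get("InIdiom") == "Yes"):
            seen.add(h)
            h = head[h]
        if h in groups:
            groups[h].append(tid)
    return {h: sorted(ids) for h, ids in groups.items()}


# Tokens 9-13 of the example above (ID, FORM, FEATS and HEAD columns only).
tokens = [
    (9, "a", "Definite=Ind|PronType=Art", 13),
    (10, "big", "Degree=Pos", 13),
    (11, "to", "InIdiom=Yes", 13),
    (12, "-", "_", 13),
    (13, "do", "ExtPos=NOUN|Idiom=Yes|Number=Sing", 4),
]
idioms = group_idioms(tokens)
print(idioms)  # {13: [11, 13]}
```

Under this sketch the determiner and adjective are left out of the idiom because they lack InIdiom=Yes, which is the internal/external split the features are meant to encode. The hyphen is also left out here only because it carries no feature in the example; whether it should is an open annotation choice.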

Sounds like an improvement in terms of expressiveness—the question is whether it is worth it or too much complexity for people to apply in practice. :)

Also:

I'm not sure whether the morphological features on "do" should be for the noun interpretation (singular), the verb interpretation (infinitive), or both.
For those of us who work on lexical semantics it would be convenient to have a canonical form of MWEs stored somewhere; right now one can automatically concatenate lemmas in certain relations (fixed, flat), or with the Idiom/InIdiom annotation, but this is tedious and doesn't allow normalization of noncanonical word orders etc. A feature on "do" like IdiomLemma=to do might be nice. (I know PARSEME has ways to annotate verbal MWEs even if they follow conventional syntax; I am talking for now about the cases that have to be identified in syntactic annotation because they are exceptional.)
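The lemma concatenation mentioned above could look roughly like the following Python sketch. It only handles direct fixed/flat dependents of a head, the token dictionaries are a simplified stand-in for CoNLL-U rows, and the function name `mwe_lemma` is hypothetical. The example annotation of "as well as" follows the usual UD treatment of fixed expressions (first word as head).

```python
MWE_RELS = {"fixed", "flat"}


def mwe_lemma(tokens, head_id):
    """Join the lemma of head_id with the lemmas of its fixed/flat dependents.

    Members are joined in surface order, so a noncanonical word order in the
    sentence is carried over into the result unchanged.
    """
    members = [
        t for t in tokens
        if t["id"] == head_id
        # Strip any subtype (e.g. flat:name) before comparing the relation.
        or (t["head"] == head_id and t["deprel"].split(":")[0] in MWE_RELS)
    ]
    return " ".join(t["lemma"] for t in sorted(members, key=lambda t: t["id"]))


# "as well as" annotated as a fixed expression headed by the first word.
tokens = [
    {"id": 1, "lemma": "as", "head": 4, "deprel": "cc"},
    {"id": 2, "lemma": "well", "head": 1, "deprel": "fixed"},
    {"id": 3, "lemma": "as", "head": 1, "deprel": "fixed"},
    {"id": 4, "lemma": "cat", "head": 0, "deprel": "root"},
]
print(mwe_lemma(tokens, 1))  # as well as
```

Because the join follows surface order, this approach cannot normalize reordered or interrupted expressions, which is the limitation that a stored canonical form such as the proposed IdiomLemma would address.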

amir-zeldes · 2021-08-22T21:27:17Z

> not sure whether the morphological features on "do" should be for the noun interpretation

I think in the 'normal' morphological field they should match the noun interpretation, i.e. they should match the POS and deprel in prioritizing the external roles. Additional annotations could be added to reflect internal expression morphology, but just including both in FEATS would be confusing IMO.

maartenpt · 2021-08-24T15:16:25Z

But why would you want to take hyphenated lexicalized items apart into their partly etymological parts? You also find todo items without a hyphen - would you still treat that as separate lexical items? Or out-of-the-box ideas, the French tire-bouchon or even non-hyphenated ones like the Dutch achterbandventieldopjesfabrieksmedewerker - would it not be more accurate to treat to-do as a single token, or at the very least group them into a single NOUN multi-token?

nschneid · 2021-08-24T16:01:07Z

@maartenpt I think there's enough variation in spelling—even in edited genres of a single language due to stylistic variation in hyphenation—that it is impractical to have a perfectly canonical definition of what constitutes a word (and new compounds emerge over time, of course). Thus different languages may need to adopt different heuristics regarding tokenization.

A recent discussion of improving tokenization consistency for English: UniversalDependencies/UD_English-EWT#204 Note that the original Penn Treebank did not tokenize hyphens but LDC later changed the policy, presumably because compounding with hyphens is highly productive in English.

aryamanarora · 2021-09-15T01:09:38Z

> I can't think of another to-infinitive verb that has been lexicalized as a noun (in both the fuss sense and the task item sense)

Not as productive, but on Goodreads you often see to-read lists, YouTube has to-listen playlists, etc. These were probably generalised from to-do.

manning · 2021-10-30T18:11:03Z

Very late to this thread, but my comments:

I (of course) agree that in cases of zero-derivation or similar, a word can have an "internal" or original PoS and an "external" or derived PoS. Like "do" in this example that is originally a verb but is being used as a noun. This is very common in many languages including English. UD cannot represent both PoS (except by not-standardized extra features).
For UD, what is put in the UPOS column has to be the external PoS, since that is the basis of the usage in the sentence and the word's syntactic behavior.
So, here, "do" is a noun. I don't think it's mentioned in the thread, but "do" is also used standalone in a number of usages as a noun, including for a party ("We're having a do for my mother on the weekend.") or as short for hairdo ("Suzie got an elaborate bouffant do.")
As such, I think "to - do" should be treated as a compound, and so using compound is appropriate for "to".
I agree with @nschneid that there is enough variation in joining/dividing words that we should just live with it, using compound to join compound words that are written divided.
It seems to me possible within the rules of UD to treat compounds written as one word (German, Dutch, occasionally English) as an MWT and to divide up their internal structure using compound relations, though AFAIK, no one has actually attempted to do this.

nschneid · 2021-10-30T19:09:11Z

Thanks @manning! Re:

> I (of course) agree that in cases of zero-derivation or similar, a word can have an "internal" or original PoS and an "external" or derived PoS. Like "do" in this example that is originally a verb but is being used as a noun. This is very common in many languages including English. UD cannot represent both PoS (except by not-standardized extra features).
>
> For UD, what is put in the UPOS column has to be the external PoS, since that is the basis of the usage in the sentence and the word's syntactic behavior.

In many cases it seems like having to choose between internal and external POS would make the annotation very misleading, because one represents the distribution of the word within the phrase that it heads, and the other represents the distribution of the full phrase. (TBC I am talking about phrases with transparent internal syntax, not fixed or flat expressions.)

Given that some treebanks have introduced ExtPos as an additional feature, would it not make sense to standardize that? Or is there a reason to prefer that the UPOS be external and the feature be changed to IntPos? I guess I don't have a strong opinion as long as both tags are encoded somehow.
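The two options carry the same information, as the following Python sketch tries to show. ExtPos is the feature some treebanks already use, as discussed in this thread; IntPos is only the hypothetical mirror-image feature floated above, and the function `pos_pair` is an illustrative name, not part of any UD tool.

```python
def pos_pair(upos, feats, convention="ExtPos"):
    """Return (internal_pos, external_pos) for one token.

    convention="ExtPos": UPOS is the internal tag, ExtPos overrides externally.
    convention="IntPos": UPOS is the external tag, IntPos overrides internally
    (IntPos is a hypothetical feature, used here only for comparison).
    """
    if convention == "ExtPos":
        return upos, feats.get("ExtPos", upos)
    if convention == "IntPos":
        return feats.get("IntPos", upos), upos
    raise ValueError(f"unknown convention: {convention}")


# "do" in "a big to-do" under each convention.
print(pos_pair("VERB", {"ExtPos": "NOUN"}))            # ('VERB', 'NOUN')
print(pos_pair("NOUN", {"IntPos": "VERB"}, "IntPos"))  # ('VERB', 'NOUN')
# A token without either feature has the same tag on both sides.
print(pos_pair("ADJ", {}))                             # ('ADJ', 'ADJ')
```

Either way a tool can recover both tags; the practical difference is which one a tool sees if it reads only the UPOS column and ignores FEATS.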

amir-zeldes · 2021-11-02T20:31:55Z

My gut feeling is also that POS is basically a token-wise category, which often means that we are focused more on the distribution of individual words, even if the phrase they are in behaves differently. This is also what leads to the English dataset work-arounds like obl:npmod (or tmod), which have been adopted in other languages as well: things like "next week" are basically adverbial modifiers similar to "now", but because they consist of an adjective and a noun, they are tagged ADJ NOUN, and then can no longer be attached as advmod due to the validator's restrictions on that deprel. If everything got its distribution-based POS, I would expect "week" to be tagged ADV in such cases (but to be clear, I'm absolutely not arguing for that!)

dan-zeman added the dependencies, English, UPOS, and question labels Aug 22, 2021

dan-zeman added this to the v2.9 milestone Aug 22, 2021

dan-zeman modified the milestones: v2.9, v2.11 Jun 13, 2022

dan-zeman closed this as completed May 25, 2023

nschneid added this to Needs triage in MWEs, names, adpositions, particles, etc. via automation May 25, 2023
