Multiword expression nominalizations #807

Closed
nschneid opened this issue Aug 22, 2021 · 12 comments
Labels
dependencies, English, question, UPOS (Universal part-of-speech tags: definitions and examples)

@nschneid
Contributor

UniversalDependencies/UD_English-EWT#213 raised the issue of "to-do" acting as a noun:

it was a big to-do to find anyone who knew anything about it.

"To do" is a strange expression. I can't think of another to-infinitive verb that has been lexicalized as a noun (in both the fuss sense and the task item sense); closest I can think of is buying food "to go" (adverbial?). I guess the options are:

  • to/PART <-mark? compound?- do/NOUN
  • to/NOUN <-compound- do/NOUN

I think "do" has to be a NOUN because it is modified by a determiner and an adjective. It wouldn't make sense to call it a verb internally and a noun externally, i.e.:

  • to/PART <-mark- do/VERB/ExtPos=NOUN

See also #648 #753

@dan-zeman added the labels dependencies, English, UPOS (Universal part-of-speech tags: definitions and examples), and question on Aug 22, 2021
@dan-zeman added this to the v2.9 milestone on Aug 22, 2021
@dan-zeman
Member

> It wouldn't make sense to call it a verb internally and a noun externally

Why not? If I understand it correctly, the whole thing works as a NOUN but “do” does not. (And “to” definitely does not.)

@nschneid
Contributor Author

Well, if we had a phrase for just "to do", then it would work. But "do" also has "a" and "big" as dependents, so it is a noun with respect to those internal modifiers.

@amir-zeldes
Contributor

I think this is consistent with our previous discussion of phrasal compound modifiers (see #753 for example):

A/DET big/ADJ to/PART do/NOUN
det(do, a)
amod(do, big)
mark(do, to)

@perrier54
Contributor

In the French corpora, we have encountered many expressions, essentially idioms and titles, whose head behaves internally with its usual POS, but externally with another POS. To deal with these double-sided words, we systematically introduced the ExtPos feature, as @nschneid suggests for his example.

In addition, for idioms, the head carries the feature Idiom=Yes and the other elements of the idiom carry InIdiom=Yes.
For titles, we similarly have Title=Yes and InTitle=Yes.
We can thus distinguish the external dependents of these expressions from the internal ones.
For the example mentioned by @nschneid, we could have:
to [upos=PART, InIdiom=Yes]
do [upos=VERB, ExtPos=NOUN, Idiom=Yes]
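
To illustrate how the Idiom/InIdiom features could separate internal from external dependents, here is a minimal Python sketch. The `Token` class and the `split_dependents` function are hypothetical names made up for this illustration and are not part of any UD tool. The sketch only inspects direct dependents of the idiom head, so a longer idiom with nested members would need a tree traversal instead.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A pared-down token: just the fields needed for this illustration."""
    id: int
    form: str
    upos: str
    head: int
    deprel: str
    feats: dict = field(default_factory=dict)


def split_dependents(tokens, head_id):
    """Return (internal, external) forms among the direct dependents of head_id.

    A dependent counts as internal when it carries InIdiom=Yes.
    """
    internal, external = [], []
    for tok in tokens:
        if tok.head != head_id:
            continue
        if tok.feats.get("InIdiom") == "Yes":
            internal.append(tok.form)
        else:
            external.append(tok.form)
    return internal, external


# "a big to do", with the hyphen left out and "do" made the root for brevity.
tokens = [
    Token(1, "a", "DET", 4, "det"),
    Token(2, "big", "ADJ", 4, "amod"),
    Token(3, "to", "PART", 4, "mark", {"InIdiom": "Yes"}),
    Token(4, "do", "VERB", 0, "root", {"ExtPos": "NOUN", "Idiom": "Yes"}),
]

internal, external = split_dependents(tokens, head_id=4)
print("internal:", internal)
print("external:", external)
```

Under this scheme "to" comes out as the only internal dependent, while "a" and "big" attach to the expression as a whole.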

@nschneid
Contributor Author

nschneid commented Aug 22, 2021

To make sure I am understanding the application to the example:

7	it	it	PRON	PRP	Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs	13	expl	13:expl	_
8	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	13	cop	13:cop	_
9	a	a	DET	DT	Definite=Ind|PronType=Art	13	det	13:det	_
10	big	big	ADJ	JJ	Degree=Pos	13	amod	13:amod	_
11	to	to	PART	TO	InIdiom=Yes	13	mark	13:mark	SpaceAfter=No
12	-	-	PUNCT	HYPH	_	13	punct	13:punct	SpaceAfter=No
13	do	do	VERB	VB	ExtPos=NOUN|Idiom=Yes|Number=Sing	4	conj	4:conj:and	_

I see that the distinction between InIdiom=Yes and Idiom=Yes is necessary to distinguish the head in case one idiom is modifying another (e.g., "a no-holds-barred to-do"—we want to encode that there are 2 idioms, not 1).
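
The point about counting idioms can be sketched in a few lines of Python. Everything about the annotation of "no-holds-barred" here is an assumption made for illustration (in particular that "barred" is its head and that it takes ExtPos=ADJ), and `parse_feats` is a hypothetical helper, not a function from an existing library.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict ('_' means no features)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# (form, FEATS) pairs for "a no-holds-barred to-do"; hyphen tokens omitted.
# The choice of "barred" as head of the first idiom is an assumption.
phrase = [
    ("a", "Definite=Ind|PronType=Art"),
    ("no", "InIdiom=Yes"),
    ("holds", "InIdiom=Yes"),
    ("barred", "ExtPos=ADJ|Idiom=Yes"),
    ("to", "InIdiom=Yes"),
    ("do", "ExtPos=NOUN|Idiom=Yes"),
]

# Each Idiom=Yes token marks the head of one idiom.
idiom_heads = [form for form, feats in phrase
               if parse_feats(feats).get("Idiom") == "Yes"]
# InIdiom=Yes tokens are non-head members of some idiom.
idiom_members = [form for form, feats in phrase
                 if parse_feats(feats).get("InIdiom") == "Yes"]

print(len(idiom_heads), "idioms headed by", idiom_heads)
```

If every token of both expressions carried the same feature, the five adjacent idiom tokens could not be split into two groups from the features alone; which member belongs to which head would still have to be read off the dependency tree.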

Sounds like an improvement in terms of expressiveness—the question is whether it is worth it or too much complexity for people to apply in practice. :)

Also:

  • I'm not sure whether the morphological features on "do" should be for the noun interpretation (singular), the verb interpretation (infinitive), or both.
  • For those of us who work on lexical semantics it would be convenient to have a canonical form of MWEs stored somewhere; right now one can automatically concatenate lemmas in certain relations (fixed, flat), or with the Idiom/InIdiom annotation, but this is tedious and doesn't allow normalization of noncanonical word orders etc. A feature on "do" like IdiomLemma=to do might be nice. (I know PARSEME has ways to annotate verbal MWEs even if they follow conventional syntax; I am talking for now about the cases that have to be identified in syntactic annotation because they are exceptional.)
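
The lemma concatenation mentioned in the second bullet could look roughly like the following Python sketch. It reuses the three CoNLL-U rows for "to - do" from the example above; `idiom_lemma` is a hypothetical function name, and the sketch assumes all idiom members attach directly to the idiom head.

```python
def idiom_lemma(rows, head_id):
    """Join the lemmas of an idiom head and its InIdiom=Yes dependents.

    rows are tab-separated CoNLL-U token lines; lemmas are joined in
    surface order, so noncanonical word orders are NOT normalized.
    """
    parts = []
    for row in rows:
        cols = row.split("\t")
        tok_id, lemma, feats, head = int(cols[0]), cols[2], cols[5], int(cols[6])
        feat_set = set(feats.split("|"))
        is_head = tok_id == head_id and "Idiom=Yes" in feat_set
        is_member = head == head_id and "InIdiom=Yes" in feat_set
        if is_head or is_member:
            parts.append((tok_id, lemma))
    return " ".join(lemma for _, lemma in sorted(parts))


rows = [
    "11\tto\tto\tPART\tTO\tInIdiom=Yes\t13\tmark\t13:mark\tSpaceAfter=No",
    "12\t-\t-\tPUNCT\tHYPH\t_\t13\tpunct\t13:punct\tSpaceAfter=No",
    "13\tdo\tdo\tVERB\tVB\tExtPos=NOUN|Idiom=Yes|Number=Sing\t4\tconj\t4:conj:and\t_",
]

lemma = idiom_lemma(rows, head_id=13)
print(lemma)
```

Because the result simply follows surface order, a stored feature such as the proposed IdiomLemma would still be needed to get a canonical form when the idiom appears in a noncanonical order.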

@amir-zeldes
Contributor

> not sure whether the morphological features on "do" should be for the noun interpretation

I think in the 'normal' morphological field they should match the noun interpretation, i.e. they should match the POS and deprel in prioritizing the external roles. Additional annotations could be added to reflect internal expression morphology, but just including both in FEATS would be confusing IMO.

@maartenpt

But why would you want to take hyphenated lexicalized items apart into their partly etymological parts? You also find todo items without a hyphen - would you still treat that as separate lexical items? Or out-of-the-box ideas, the French tire-bouchon or even non-hyphenated ones like the Dutch achterbandventieldopjesfabrieksmedewerker - would it not be more accurate to treat to-do as a single token, or at the very least group them into a single NOUN multi-token?

@nschneid
Contributor Author

@maartenpt I think there's enough variation in spelling—even in edited genres of a single language due to stylistic variation in hyphenation—that it is impractical to have a perfectly canonical definition of what constitutes a word (and new compounds emerge over time, of course). Thus different languages may need to adopt different heuristics regarding tokenization.

A recent discussion of improving tokenization consistency for English: UniversalDependencies/UD_English-EWT#204 Note that the original Penn Treebank did not tokenize hyphens but LDC later changed the policy, presumably because compounding with hyphens is highly productive in English.

@aryamanarora
Member

> I can't think of another to-infinitive verb that has been lexicalized as a noun (in both the fuss sense and the task item sense)

Not as productive, but on Goodreads you often see to-read lists, YouTube has to-listen playlists, etc. These were probably generalised from to-do.

@manning
Contributor

manning commented Oct 30, 2021

Very late to this thread, but my comments:

  1. I (of course) agree that in cases of zero-derivation or similar, a word can have an "internal" or original PoS and an "external" or derived PoS. Like "do" in this example that is originally a verb but is being used as a noun. This is very common in many languages including English. UD cannot represent both PoS (except by not-standardized extra features).
  2. For UD, what is put in the UPOS column has to be the external PoS, since that is the basis of the usage in the sentence and the word's syntactic behavior.
  3. So, here, "do" is a noun. I don't think it's mentioned in the thread, but "do" is also used standalone in a number of usages as a noun, including for a party ("We're having a do for my mother on the weekend.") or as short for hairdo ("Suzie got an elaborate bouffant do.")
  4. As such, I think "to - do" should be treated as a compound, and so using compound is appropriate for "to".
  5. I agree with @nschneid that there is enough variation in joining/dividing words that we should just live with it, using compound to join compound words that are written divided.
  6. It seems to me possible within the rules of UD to treat compounds written as one word (German, Dutch, occasionally English) as an MWT and to divide up their internal structure using compound relations, though AFAIK, no one has actually attempted to do this.

@nschneid
Contributor Author

Thanks @manning! Re:

  1. I (of course) agree that in cases of zero-derivation or similar, a word can have an "internal" or original PoS and an "external" or derived PoS. Like "do" in this example that is originally a verb but is being used as a noun. This is very common in many languages including English. UD cannot represent both PoS (except by not-standardized extra features).

  2. For UD, what is put in the UPOS column has to be the external PoS, since that is the basis of the usage in the sentence and the word's syntactic behavior.

In many cases it seems like having to choose between internal and external POS would make the annotation very misleading, because one represents the distribution of the word within the phrase that it heads, and the other represents the distribution of the full phrase. (TBC I am talking about phrases with transparent internal syntax, not fixed or flat expressions.)

Given that some treebanks have introduced ExtPos as an additional feature, would it not make sense to standardize that? Or is there a reason to prefer that the UPOS be external and the feature be changed to IntPos? I guess I don't have a strong opinion as long as both tags are encoded somehow.

@amir-zeldes
Contributor

My gut feeling is also that POS is basically a token-wise category, which often means that we are focused more on the distribution of individual words, even if the phrase they are in behaves differently. This is also what leads to the English dataset work-arounds like obl:npmod (or tmod), which have been adopted in other languages as well: Things like "next week" are basically adverbial modifiers similar to "now", but because they consist of a noun and an adjective, they are tagged ADJ NOUN, and then can no longer be attached as advmod due to the validator's restrictions on that deprel. If everything got its distribution-based POS, I would expect "week" to be tagged ADV in such cases (but to be clear I'm absolutely not pleading for that!)
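
The interaction between token-wise POS and the advmod restriction can be sketched as below. This is a deliberately simplified stand-in for the validator: it assumes that only ADV may attach as advmod, whereas the set of tags the real validator accepts is not reproduced here. `ALLOWED_ADVMOD_UPOS` and `check_attachment` are hypothetical names.

```python
# Assumption: simplified rule; the actual validator's permitted set may be wider.
ALLOWED_ADVMOD_UPOS = {"ADV"}


def check_attachment(upos, deprel):
    """Return True if a word with this UPOS may bear this deprel (simplified)."""
    base = deprel.split(":")[0]  # strip subtypes such as :tmod or :npmod
    if base == "advmod" and upos not in ALLOWED_ADVMOD_UPOS:
        return False
    return True


# "now" is an ADV, so advmod is accepted.
print(check_attachment("ADV", "advmod"))
# "week" in "next week" is tagged NOUN, so advmod is rejected...
print(check_attachment("NOUN", "advmod"))
# ...and the workaround relation is used instead.
print(check_attachment("NOUN", "obl:tmod"))
```

With token-wise tagging, "week" stays NOUN and therefore needs a relation like obl:tmod, which is the workaround described above.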

@dan-zeman modified the milestones: v2.9, v2.11 on Jun 13, 2022
@nschneid added this to Needs triage in MWEs, names, adpositions, particles, etc. via automation on May 25, 2023

7 participants