Multiword expression nominalizations #807
Why not? If I understand it correctly, the whole thing works as a |
Well, if we had a phrase for just "to do" then it would work. But "do" also has "a" and "big" as dependents, so it is a noun with respect to those internal modifiers.
I think this is consistent with our previous discussion of phrasal compound modifiers (see #753 for example): A/DET big/ADJ to/PART do/NOUN
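For readers who want to check the tagging mechanically, here is a minimal sketch (not part of any UD tooling) that parses the word/TAG notation above and lists the dependents of "do" that call for a nominal head. The helper name `parse_tagged` and the `heads` mapping are hypothetical; the thread only states that "a" and "big" depend on "do", so attaching "to" to "do" as well is an assumption made for illustration.

```python
# Parse the word/TAG notation from the comment into (form, upos) pairs.
def parse_tagged(text):
    pairs = []
    for item in text.split():
        # Split on the last slash so a form containing "/" would stay intact.
        form, _, upos = item.rpartition("/")
        pairs.append((form, upos))
    return pairs

tokens = parse_tagged("A/DET big/ADJ to/PART do/NOUN")

# Head of each token. "A" and "big" depending on "do" follows the thread;
# attaching "to" to "do" is an assumption of this sketch.
heads = {"A": "do", "big": "do", "to": "do", "do": None}

upos = dict(tokens)

# Dependents of "do" whose tags (DET, ADJ) are typical modifiers of a noun.
nominal_modifiers = [
    form for form, head in heads.items()
    if head == "do" and upos[form] in {"DET", "ADJ"}
]

print(tokens)
print(nominal_modifiers)
```

The two modifiers found this way are the ones the argument relies on: a determiner and an adjective attach to "do", which is why it is tagged NOUN rather than VERB.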
In the French corpora, we have encountered many expressions, essentially idioms and titles, whose head behaves internally with its usual POS, but externally with another POS. To deal with these double-sided words, we systematically introduced the In addition, for the idioms, the head is provided with a feature |
To make sure I am understanding the application to the example:
I see that the distinction between Sounds like an improvement in terms of expressiveness—the question is whether it is worth it or too much complexity for people to apply in practice. :) Also:
I think in the 'normal' morphological field they should match the noun interpretation, i.e. they should match the POS and deprel in prioritizing the external roles. Additional annotations could be added to reflect internal expression morphology, but just including both in FEATS would be confusing IMO.
But why would you want to take hyphenated lexicalized items apart into their partly etymological parts? You also find "todo" items without a hyphen; would you still treat those as separate lexical items? Or consider "out-of-the-box ideas", the French tire-bouchon, or even non-hyphenated ones like the Dutch achterbandventieldopjesfabrieksmedewerker: would it not be more accurate to treat to-do as a single token, or at the very least group its parts into a single NOUN multi-token?
@maartenpt I think there's enough variation in spelling, even in edited genres of a single language due to stylistic variation in hyphenation, that it is impractical to have a perfectly canonical definition of what constitutes a word (and new compounds emerge over time, of course). Thus different languages may need to adopt different heuristics regarding tokenization. A recent discussion of improving tokenization consistency for English: UniversalDependencies/UD_English-EWT#204. Note that the original Penn Treebank did not tokenize hyphens but LDC later changed the policy, presumably because compounding with hyphens is highly productive in English.
Not as productive, but on Goodreads you often see to-read lists, YouTube has to-listen playlists, etc. Probably actually generalised from to-do.
Very late to this thread, but my comments:
Thanks @manning! Re:
In many cases it seems like having to choose between internal and external POS would make the annotation very misleading, because one represents the distribution of the word within the phrase that it heads, and the other represents the distribution of the full phrase. (TBC I am talking about phrases with transparent internal syntax, not Given that some treebanks have introduced |
My gut feeling is also that POS is basically a token-wise category, which often means that we are focused more on the distribution of individual words, even if the phrase they are in behaves differently. This is also what leads to the English dataset work-arounds like |
UniversalDependencies/UD_English-EWT#213 raised the issue of "to-do" acting as a noun:
"To do" is a strange expression. I can't think of another to-infinitive verb that has been lexicalized as a noun (in both the fuss sense and the task item sense); closest I can think of is buying food "to go" (adverbial?). I guess the options are:
I think "do" has to be a NOUN because it is modified by a determiner and an adjective. It wouldn't make sense to call it a verb internally and a noun externally, i.e.:
See also #648 #753