Long titles of works of art #664
Comments
Bonus question: nmod:tmod(The,1992) Does that look right? (this is maybe more an English specific question) |
It would be very sad to use |
Is this addressed by #381 ? |
Thanks, @nschneid , I'd forgotten about that part! OK, then it looks like we will just get some odd structures (e.g. a title that is a PP can get an additional I see in #381 I already suggested the same analysis, and at least @arademaker seemed to like it then :) In that case we will analyze them internally and use |
I was going to open my own question style issue, but now coming here as the closest recent issue. How much do the following two documentation pieces present a consistent guideline? PROPN ― universal guideline I am asking as they may seem to allow more than one way. And also, how do they in your view relate with the discussion above about the use of the Thanks! |
Adding to the former, was any of the vetting here in any way specific to long names of works of art, as opposed to names of places / organizations / etc, and to such names that aren't as excruciatingly long but still multi-word? |
The above discussion was about dependency relations, not POS tags. English-EWT uses PROPN for both words in "United States" but treats "United" as an |
... which is quite odd. If the UPOS tag is |
Fair enough. I wasn't involved in the original UDification of the corpus but my impression is that there are a lot of places where the UPOS tags and dependencies aren't quite consistent. |
I think this is just a continuation of the PTB tag |
In syntax is there a literature on complex proper names, dates, and other values? It's not the sort of thing that is taught in a typical syntax class. @jnivre @manning @dan-zeman |
I think it is a typical case where you would really like to have part-of-speech tags on two levels: one for the component words and one for the whole expression (and only the latter would always be a proper name). The same thing applies to many other types of multiword expressions. For example, you would like to be able to say that "in spite of" is a preposition and "by and large" is an adverb, but UD currently only allows you to say that they are ADP NOUN ADP and ADP CCONJ ADJ. @sylvainkahane and @kimgerdes have argued this for a long time and have incorporated it (I think) in their surface-oriented version of UD. |
Right, @sylvainkahane mentioned EXTPOS above. My STREUSLE corpus (which covers the Reviews section of EWT) adds MWE annotations with supplementary tags for MWEs like "in spite of". But it doesn't alter the word-level UPOS. |
Right. I don't think the word-level UPOS should be changed. In a case like "in spite of", they are not very interesting of course, but for the complex names they are. Presumably, the PTB practice comes from the fact that they could only represent one level and then went with the holistic analysis (assigning NNP to all component words regardless of their ordinary tag). However, this seems to be at odds with the UD recommendation, at least for syntax, where a compositional analysis is recommended if possible (rather than "flat:name"). In line with this, I think we should also keep the UPOS tags compositional (and therefore potentially incompatible with the XPOS tags). This is what we do for the Swedish treebank, when there is a discrepancy between the guidelines for UD and the legacy tags for Swedish. |
Regarding the POS for an expression, #664 (comment), we have been using the MISC field with a tag MWEPOS, as described in our paper: |
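To make the two-level idea concrete, here is a minimal sketch of reading an expression-level tag from the CoNLL-U MISC column while leaving the word-level UPOS untouched. The key name MWEPOS and the ADP NOUN ADP tags come from the comments above. Placing the tag on the first token of the expression, and leaving the other columns as `_`, are assumptions made for illustration and not necessarily the convention used in the paper.

```python
def parse_misc(misc):
    """Parse a CoNLL-U MISC column ('Key=Value|Key=Value' or '_') into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)

# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# Assumption: the expression-level tag sits on the first token only;
# columns not under discussion are left as "_".
rows = [
    "1\tin\tin\tADP\t_\t_\t_\t_\t_\tMWEPOS=ADP",
    "2\tspite\tspite\tNOUN\t_\t_\t_\t_\t_\t_",
    "3\tof\tof\tADP\t_\t_\t_\t_\t_\t_",
]

tokens = []
for row in rows:
    cols = row.split("\t")
    tokens.append({"form": cols[1], "upos": cols[3], "misc": parse_misc(cols[9])})

# Word-level tags stay compositional; the holistic tag lives in MISC.
word_level = [t["upos"] for t in tokens]
expression_level = tokens[0]["misc"].get("MWEPOS")
print(word_level, expression_level)
```

With this layout, tools that only read UPOS see the compositional tags, and tools that know about the MISC key can recover the tag of the whole expression.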
I'm pretty sure(?) the answer to this will be 'use flat', but I'm running into sometimes long titles of films in the next edition of GUM and wanted to make very sure we want these cases to be flat:
She won the César Award for Best Actress for The Old Lady Who Walked in the Sea (1992)
flat(The,Old)
flat(The,Lady)
flat(The,Who)
....
?
It seems clear to me that the title is acting as an NP, and is some kind of name, so probably flat is appropriate, but of course it also has a syntactic structure. The PTB had an interesting device to deal with such cases, which was to place the full title under a category node NP-TTL, and then analyze the internal syntax below that as usual. But we don't have that option in normal dependencies of course...
I guess my question is: do we definitely want all such titles, no matter how long or complex, to be all flat? I see how it's correct, but also worry a little that this could confuse parsers a lot.
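For reference, a minimal sketch of what the fully flat analysis would produce for this title, assuming the first token heads every later token, as in the flat(The,Old) example above. The helper name `flat_relations` is hypothetical and only illustrates the shape of the output. It is not part of any UD tool.

```python
def flat_relations(tokens):
    """Return ("flat", head, dependent) triples with the first token as head."""
    if not tokens:
        return []
    head = tokens[0]
    return [("flat", head, dep) for dep in tokens[1:]]

title = "The Old Lady Who Walked in the Sea".split()
relations = flat_relations(title)
for rel, head, dep in relations:
    print(f"{rel}({head},{dep})")
```

An eight-word title yields seven flat relations, all attached to "The", so none of the internal structure (the relative clause, the PP) survives in the tree.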