Long titles of works of art #664

Closed
amir-zeldes opened this issue Oct 27, 2019 · 16 comments


@amir-zeldes
Contributor

I'm pretty sure(?) the answer to this will be 'use flat', but I'm running into sometimes long titles of films in the next edition of GUM and wanted to make very sure we want these cases to be flat:

She won the César Award for Best Actress for The Old Lady Who Walked in the Sea (1992)

flat(The,Old)
flat(The,Lady)
flat(The,Who)
....
?

It seems clear to me that the title is acting as an NP, and is some kind of name, so probably flat is appropriate, but of course it also has a syntactic structure. The PTB had an interesting device to deal with such cases, which was to place the full title under a category node NP-TTL, and then analyze the internal syntax below that as usual. But we don't have that option in normal dependencies of course...

I guess my question is: do we definitely want all such titles, no matter how long or complex, to be entirely flat? I see how it's correct, but I also worry a little that this could confuse parsers a lot.

@amir-zeldes
Contributor Author

Bonus question:

nmod:tmod(The,1992)

Does that look right? (this is maybe more an English specific question)

@sylvainkahane
Contributor

It would be very sad to use flat here. This is regular syntax and a very common construction.
But there is clearly a problem. For the French treebanks, we decided to analyse the titles but to add two features: one feature indicating that it is a title (Type=Title) and one feature indicating that it behaves as a PROPN (EXTPOS=PROPN). This feature is also used for MWEs that we decided to analyze in SUD. This is explained in our last paper on SUD.
You can look at the 141 titles annotated in UD_French-GSD. They are clustered by the POS of their head word. If we had a flat analysis, we couldn't do that.
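
The clustering mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample rows are hand-made rather than copied from UD_French-GSD, the helper names `parse_attrs` and `title_heads_by_upos` are hypothetical, and the sketch assumes the features appear on the head token of the title as `Type=Title` and `EXTPOS=PROPN` (spelled as in this comment) in either the FEATS or the MISC column.

```python
from collections import defaultdict

# Hand-made CoNLL-U rows (10 tab-separated columns); NOT taken from a treebank.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# Assumption: the title features sit on the head token of the title.
SAMPLE = "\n".join([
    "\t".join(["1", "The", "the", "DET", "_", "_", "3", "det", "_", "_"]),
    "\t".join(["2", "Old", "old", "ADJ", "_", "_", "3", "amod", "_", "_"]),
    "\t".join(["3", "Lady", "lady", "NOUN", "_", "Type=Title|EXTPOS=PROPN",
               "0", "root", "_", "_"]),
    "",
    "\t".join(["1", "Captured", "capture", "VERB", "_", "Type=Title|EXTPOS=PROPN",
               "0", "root", "_", "_"]),
    "\t".join(["2", "By", "by", "ADP", "_", "_", "3", "case", "_", "_"]),
    "\t".join(["3", "Aliens", "alien", "NOUN", "_", "_", "1", "obl", "_", "_"]),
])


def parse_attrs(column):
    """Turn 'A=1|B=2' into {'A': '1', 'B': '2'}; '_' means the column is empty."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def title_heads_by_upos(conllu_text):
    """Group the forms of title-head tokens by their word-level UPOS."""
    clusters = defaultdict(list)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        # Look in both FEATS (column 6) and MISC (column 10).
        attrs = {**parse_attrs(cols[5]), **parse_attrs(cols[9])}
        if attrs.get("Type") == "Title":
            clusters[cols[3]].append(cols[1])
    return dict(clusters)


print(title_heads_by_upos(SAMPLE))
```

With a flat analysis there would be no syntactic head to read the UPOS from, which is the point being made here: the grouping only works because the internal structure is annotated.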

@nschneid
Contributor

flat guidelines: "names that have a regular syntactic structure, like The Lord of the Rings and Captured By Aliens, should be annotated with regular syntactic relations."

@nschneid
Contributor

Bonus question:

nmod:tmod(The,1992)

Does that look right? (this is maybe more an English specific question)

Is this addressed by #381 ?

@amir-zeldes
Contributor Author

Thanks, @nschneid, I'd forgotten about that part! OK, then it looks like we will just get some odd structures (e.g. a title that is a PP can get an additional case if it is the head of a PP...) and some differences between native and foreign (=flat) film titles, but that's analogous to personal names.

I see in #381 I already suggested the same analysis, and at least @arademaker seemed to like it then :)

In that case we will analyze them internally and use nmod:tmod like in academic citations. I also like @sylvainkahane 's feature Type=Title, not sure if we have capacity to go back and annotate all of those in GUM though. I'll keep it in mind. Thanks everyone!

@matanox

matanox commented Oct 29, 2019

I was going to open my own question-style issue, but I'm coming here instead, as this is the closest recent issue. How consistent a guideline do the following two pieces of documentation present?

PROPN ― universal guideline
PROPN ― English specific guideline

I am asking because they seem to allow more than one way. Also, how do they in your view relate to the discussion above about the use of the flat relation?

Thanks!

@matanox

matanox commented Oct 30, 2019

Adding to the former, was any of the vetting here in any way specific to long names of works of art, as opposed to names of places / organizations / etc, and to such names that aren't as excruciatingly long but still multi-word?

@nschneid
Contributor

The above discussion was about dependency relations, not POS tags. English-EWT uses PROPN for both words in "United States" but treats "United" as an amod dependent of "States".

@dan-zeman
Member

English-EWT uses PROPN for both words in "United States" but treats "United" as an amod dependent of "States".

... which is quite odd. If the UPOS tag is PROPN then the relation should be nmod.

@nschneid
Contributor

Fair enough. I wasn't involved in the original UDification of the corpus but my impression is that there are a lot of places where the UPOS tags and dependencies aren't quite consistent.
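
A rough way to surface such inconsistencies is to compare each token's relation with its UPOS tag. The following is a minimal sketch, not a real validator: the rows are hand-made for "the United States" (not copied from English-EWT), the names `EXPECTED_UPOS` and `find_mismatches` are hypothetical, and the single rule that an amod dependent should be tagged ADJ is a simplifying assumption made for this example.

```python
# Hand-made rows for "the United States"; not copied from English-EWT.
# Each row: (id, form, upos, head, deprel)
ROWS = [
    (1, "the", "DET", 3, "det"),
    (2, "United", "PROPN", 3, "amod"),
    (3, "States", "PROPN", 0, "root"),
]

# Assumption for this sketch only: an amod dependent is expected to be ADJ.
EXPECTED_UPOS = {"amod": {"ADJ"}}


def find_mismatches(rows, expected=EXPECTED_UPOS):
    """Return (id, form, upos, deprel) for tokens whose UPOS does not fit their deprel."""
    mismatches = []
    for tok_id, form, upos, _head, deprel in rows:
        allowed = expected.get(deprel)
        if allowed is not None and upos not in allowed:
            mismatches.append((tok_id, form, upos, deprel))
    return mismatches


for tok_id, form, upos, deprel in find_mismatches(ROWS):
    print(f"token {tok_id} ({form}): UPOS {upos} with relation {deprel}")
```

Either resolution discussed in this thread (retagging "United" as ADJ, or relabelling the relation as nmod) would make the flagged token disappear from the output.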

@amir-zeldes
Contributor Author

I think this is just a continuation of the PTB tag NNP guideline, which gives priority to the proper name status of the sequence, rather than the syntactic function of the adjective. We reproduced this in GUM as well for consistency, but it would be very easy to make these all ADJ based on the amod function if one wanted to. It's a pretty substantial change in the mapping of POS tags to NER categories though, which is probably one of the main applications of the PROPN/NOUN distinction.

@nschneid
Contributor

Is there a literature in syntax on complex proper names, dates, and other values? It's not the sort of thing that is taught in a typical syntax class. @jnivre @manning @dan-zeman

@jnivre
Contributor

jnivre commented Oct 30, 2019

I think it is a typical case where you would really like to have part-of-speech tags on two levels: one for the component words and one for the whole expression (and only the latter would always be proper name). The same thing applies to many other types of multiword expressions. For example, you would like to be able to say that "in spite of" and "by and large" are a preposition and an adverb, respectively, but UD currently only allows you to say that they are ADP NOUN ADP and ADP CCONJ ADJ. @sylvainkahane and @kimgerdes have argued this for a long time and have incorporated it (I think) in their surface-oriented version of UD.
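
The two-level idea can be pictured with a small data structure. This is just a sketch under assumptions, not how UD or SUD store the information: the class `Expression`, its field `ext_pos`, and the function `describe` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    """A multiword expression with POS tags on two levels (hypothetical layout)."""
    words: tuple    # surface forms of the component words
    upos: tuple     # word-level UPOS, one tag per component word
    ext_pos: str    # POS of the expression as a whole


EXAMPLES = [
    Expression(("in", "spite", "of"), ("ADP", "NOUN", "ADP"), "ADP"),
    Expression(("by", "and", "large"), ("ADP", "CCONJ", "ADJ"), "ADV"),
]


def describe(expr):
    """Render both levels side by side, e.g. 'in spite of: ADP NOUN ADP -> ADP'."""
    return f"{' '.join(expr.words)}: {' '.join(expr.upos)} -> {expr.ext_pos}"


for expr in EXAMPLES:
    print(describe(expr))
```

The word-level tags stay untouched; the expression-level tag is an extra layer on top of them.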

@nschneid
Contributor

Right, @sylvainkahane mentioned EXTPOS above.

My STREUSLE corpus (which covers the Reviews section of EWT) adds MWE annotations with supplementary tags for MWEs like "in spite of". But it doesn't alter the word-level UPOS.

@jnivre
Contributor

jnivre commented Oct 30, 2019

Right. I don't think the word-level UPOS should be changed. In a case like "in spite of", they are not very interesting of course, but for the complex names they are. Presumably, the PTB practice comes from the fact that they could only represent one level and then went with the holistic analysis (assigning NNP to all component words regardless of their ordinary tag). However, this seems to be at odds with the UD recommendation, at least for syntax, where a compositional analysis is recommended if possible (rather than "flat:name"). In line with this, I think we should also keep the UPOS tags compositional (and therefore potentially incompatible with the XPOS tags). This is what we do for the Swedish treebank, when there is a discrepancy between the guidelines for UD and the legacy tags for Swedish.

@arademaker
Contributor

Regarding the POS for an expression (#664 (comment)), we have been using the MISC field with a tag MWEPOS, as described in our paper:

http://arademaker.github.io/bibliography/depling-2017.html
