PROPN vs. PTB NNP(S), titles in names, and compound vs. nmod vs. appos #678

Closed
nschneid opened this issue Dec 11, 2019 · 41 comments

@nschneid
Contributor

We are realizing that the UDv2 guideline for PROPN is a pretty radical departure from the previous approach, which for English followed PTB guidelines.

Note that PROPN is only used for the subclass of nouns that are used as names and that often exhibit special syntactic properties (such as occurring without an article in the singular in English). When other phrases or sentences are used as names, the component words retain their original tags. For example, in Cat on a Hot Tin Roof, Cat is NOUN, on is ADP, a is DET, etc.

It makes sense to give function words their usual tags even within names.
But enforcing this policy would require reviewing all multiword names in (at least the English) treebanks to convert PROPN's into NOUN, ADJ, etc., depending on context.

Moreover, it's not entirely clear what the principle is behind the distinction. In the sentence "I'm reading Cat" (short for Cat on a Hot Tin Roof) should it be NOUN or PROPN?

A fine point is that it is not uncommon to regard words that are etymologically adjectives or participles as proper nouns when they appear as part of a multiword name that overall functions like a proper noun, for example in the Yellow Pages, United Airlines or Thrall Manufacturing Company. This is certainly the practice for the English Penn Treebank tag set.

Is this saying that an adjective in some names can be treated as PROPN but in other constructions (such as Cat on a Hot Tin Roof) it is obligatorily ADJ?

(Related: #664)

@sylvainkahane
Contributor

For some UD_French treebanks we decided to introduce an external POS for MWEs and titles. In other words, words in a title or a syntactically regular MWE receive their regular POS, and the whole expression receives an EXTPOS value on the head. For titles, EXTPOS=PROPN. We have one example of a one-word title where POS and EXTPOS are different. Moreover, titles receive a feature Type=Title, while MWEs have a feature Type=MWE.
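A minimal sketch of how such a double POS could be read off a token, assuming the values are stored as key-value features on the head word. The record layout, the field names, and the helper functions are hypothetical and only for illustration; the tags for Cat, on and a follow the guideline quoted above, while the tags for Hot, Tin and Roof and all head indices are assumptions.

```python
# Hypothetical token records for the title "Cat on a Hot Tin Roof".
# Storing ExtPos and Type in a "feats" dict on the head is an assumption
# made for this sketch, not a description of any treebank's actual format.
title = [
    {"id": 1, "form": "Cat",  "upos": "NOUN", "head": 0,
     "feats": {"ExtPos": "PROPN", "Type": "Title"}},
    {"id": 2, "form": "on",   "upos": "ADP",  "head": 6, "feats": {}},
    {"id": 3, "form": "a",    "upos": "DET",  "head": 6, "feats": {}},
    {"id": 4, "form": "Hot",  "upos": "ADJ",  "head": 6, "feats": {}},
    {"id": 5, "form": "Tin",  "upos": "NOUN", "head": 6, "feats": {}},
    {"id": 6, "form": "Roof", "upos": "NOUN", "head": 1, "feats": {}},
]


def internal_pos(token):
    """POS of the word itself, used inside the expression."""
    return token["upos"]


def external_pos(token):
    """POS the expression shows to its governor: ExtPos if set, else UPOS."""
    return token["feats"].get("ExtPos", token["upos"])


# In this isolated example the head of the title is the token with head 0.
head = next(t for t in title if t["head"] == 0)
print(head["form"], internal_pos(head), external_pos(head))
```

With this layout, words without an ExtPos value simply fall back to their regular POS, so only the head of a title or MWE needs the extra annotation.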

@amir-zeldes
Contributor

From a linguistic point of view 'being a name' is not fundamentally a morphosyntactic category in English IMO, notwithstanding article behavior (which is not 1:1). The PTB guideline (any content word in a name is NNP) is maybe not great, but at least more or less consistently applicable.

If I read the UD guideline literally, it sounds like a person named 'Violet Shoemaker' should not be ADJ+NOUN(?), but a play called 'Violet Shoemaker' should be ADJ+NOUN, which seems strange to me (what if it's about a person called Violet Shoemaker?). I think in practical terms this can't be implemented for the English corpora, since someone would have to go through each PROPN to establish whether it is a 'real' name or part of a 'work of art' like "Cat on a Hot Tin Roof" or similar. And even for Cat on a Hot Tin Roof, I agree with @nschneid that a normal annotator would probably tag "I saw Cat last night" as Cat/PROPN.

@dan-zeman dan-zeman added this to the v2.6 milestone Dec 12, 2019
@dan-zeman
Member

This seems to reiterate an old (pre-v1) discussion about whether PROPN is a useful category at all :-) At any rate, the problem seems to be limited to some languages (notably English). Czech adjectives are morphologically distinct, and as the name suggests, a proper noun is supposed to be a noun, so in Kočka na rozpálené plechové střeše “Cat on a Hot Tin Roof”, there is no question about making rozpálené plechové “hot tin” anything else than ADJ. There also does not seem to be a reason to make Kočka “Cat” a PROPN instead of a NOUN; it is written capitalized because it happens to occur at the beginning of a multi-word named entity; but it would also be capitalized if it occurred at the beginning of a sentence. And finally, there is no conflict with a previous practice. As a matter of fact, the PDT Czech tagset does not even distinguish proper and common nouns.

That being said, personal names are special even in Czech. This involves mostly surnames because given names can rarely be confused with common nouns. A surname may be derived from an adjective but it will be treated as noun (i.e., proper noun in UD) because it has a radically different distribution. And if it is (zero-)derived from a common noun, we will still tag it PROPN and not NOUN. Some surnames will behave the same way as their common sources but some will not: e.g., in Václav Kočka, the name is masculine, while the common noun kočka “cat” is feminine.

@amir-zeldes
Contributor

As I wrote above, I agree that names are not really a 'part of speech', even if they have slightly different distributions some of the time, because 1. there are lots of other categories with slightly different distributions which we don't distinguish (e.g. mass nouns) and 2. there are names that don't follow those distributions (e.g. can take articles). It's really a semantic distinction, saying whether something is a name. However, if we want to follow common practice in Computational Linguistics, support NER applications, etc. etc., then we need to have decidable guidelines.

For me, "Cat on a Hot Tin Roof" is definitely the name of a work of art, and if I use "Cat" to refer to it, that is PROPN and not NOUN. The same is true of the Mona Lisa (which even takes a 'the'): it is a PROPN because it is the name of a work of art, and for practical NER it should be included (work of art is even an NER category in OntoNotes). As soon as a noun stops pointing to its regular extension and starts acting as a name, it's fairly intuitive to explain to annotators what to do - a person called "Wolf" is PROPN, an instance of the animal is NOUN.

If we want annotators of English to treat the Mona Lisa as different from a person called Mona Lisa, I think we would find it difficult to teach to annotators, and even more problematically, we would need to go over every single NNP in the English corpora, which I think is a lost cause and not really a good idea. I like @sylvainkahane 's idea of explicitly modeling the duality of titles, but that too would require a very substantial manual annotation effort.

@sylvainkahane
Contributor

Our goal is not to annotate named entities, but we have to acknowledge that the problem of NEs sometimes intersects with the question of syntactic annotation. This happens with titles, which behave as a whole as PROPNs but can be phrases of any POS.

Introducing a double POS, one for the external behavior (towards the governor of the phrase) and one for the internal behavior (inside the phrase), is quite simple and solves many problems: the problem of titles, but also the question of MWEs that have a regular internal syntactic structure but an external behavior that corresponds neither to their internal structure nor to the POS of the head. Again, our goal is not to annotate MWEs, but when MWEs intersect with the question of syntactic annotation, we need a solution.

@nschneid
Contributor Author

@sylvainkahane Agreed, I'm increasingly seeing the need for what could be called "syntactic MWEs": not all MWEs, to be sure ("pay attention" is syntactically normal), but ones that are problematic for the simple notion of head-modifier relations as a representation of syntagmatic compositionality.

  1. MWEs with internal syntax, but unpredictable external distribution: E.g. titles of works of art. As pointed out above, these would need something like an EXTPOS feature (or refinement of the external attachment edge) to properly capture both internal and external structure.

    • Word-level POS tags are especially problematic here due to the mixed use of "proper noun" tags. While I am sympathetic to UD's goal of adopting accessible and practical terminology for things, it is also worth noting that dictionaries include multiword names like "United States" as nouns, and do not POS-tag the individual words.
  2. MWEs that lack a clear internal structure: Currently handled by flat and fixed, to an extent, but these still have a head designated by convention, and word-level POS. And there are some multiple case attachments and the like that strike me as counterintuitive.

  3. Constructions with templatic internal structure: E.g. dates, and personal name constructions where the name is preceded by a title used to address the person ("Mr. Smith"). There is a tension between treating the whole thing as flat and recognizing that parts serve separate functions and are not equally heads ("Mr." is something of a modifier of "Smith").

@amir-zeldes
Contributor

EXTPOS sounds great, but I don't see how we would do this for all existing corpora... And even if it's done for some data sets, my instinct is to leave POS alone (since they reflect widely used standards) and rather add 'titlehood' to the large spans (à la PTB NP-TTL), or also add the non-name-like information somewhere, but in any case not redefining PROPN.

Making Smith the head of Mr. Smith would require another deprel - I guess most unoriginal would be dep:title, because I think even subtyping flat would not allow RTL dependencies for that relation. Conceptually I think it's probably right for Smith to be the head, but I'm very easily convinceable that this should be left alone for the status quo.

@nschneid
Contributor Author

nschneid commented Dec 19, 2019

Making Smith the head of Mr. Smith would require another deprel - I guess most unoriginal would be dep:title, because I think even subtyping flat would not allow RTL dependencies for that relation. Conceptually I think it's probably right for Smith to be the head, but I'm very easily convinceable that this should be left alone for the status quo.

What about a new relation called formulaic, for template-like patterns that don't fall under the main syntactic relations used for normal vocabulary, but rather, specific formulas for putting together names and numbers ("Mr.", dates, units of measurement, etc.)? The technical difference would be

  • flat and fixed are for combinations where it is difficult to identify a unique head, so by convention the first word functions as the head in the tree
  • formulaic is for relations where a head can be designated based on criteria like omissibility, but none of the usual dependency relations apply

Examples (specific headedness decisions subject to discussion):

  • His Excellency Mr. John Smith, Jr.
    • flat(John, Smith)
    • formulaic(John, Mr.)
    • formulaic(John, Jr.)
    • flat(His, Excellency)
    • formulaic(John, His)
  • bus number 623 leaves at five o'clock on Sep 23 2020
    • Normal syntactic dependencies: nsubj(leaves, bus), obl:tmod(leaves, five), obl:tmod(leaves, 23)
    • "bus number 623" (which can be shortened to "bus 623"): formulaic(bus, 623), formulaic(623, number)
    • "five o'clock": formulaic(five, o'clock)
    • "Sep 23 2020": formulaic(23, Sep), formulaic(Sep, 2020)

And we could have the option of subtyping some of the formulaic relations, e.g. formulaic:month(23, Sep) vs. formulaic:year(Sep, 2020).
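As a rough sanity check on the first example, here is a minimal sketch that encodes the proposed edges for "His Excellency Mr. John Smith, Jr." as plain triples and verifies that every word has a single head and that the phrase has a single root. The relation name formulaic is only the proposal made in this comment, not an existing UD relation, and the helper functions are hypothetical.

```python
# (relation, head, dependent) triples copied from the example above.
tokens = ["His", "Excellency", "Mr.", "John", "Smith", ",", "Jr."]
edges = [
    ("flat", "John", "Smith"),
    ("formulaic", "John", "Mr."),
    ("formulaic", "John", "Jr."),
    ("flat", "His", "Excellency"),
    ("formulaic", "John", "His"),
]


def heads_of(edges):
    """Map each dependent to the list of heads it is attached to."""
    heads = {}
    for _rel, head, dep in edges:
        heads.setdefault(dep, []).append(head)
    return heads


def phrase_roots(tokens, edges, ignore=(",",)):
    """Tokens without an incoming edge; punctuation is left out of the check."""
    heads = heads_of(edges)
    return [t for t in tokens if t not in heads and t not in ignore]


def right_to_left(tokens, edges):
    """Edges whose dependent precedes its head in the word order."""
    position = {t: i for i, t in enumerate(tokens)}
    return [(rel, h, d) for rel, h, d in edges if position[d] < position[h]]


# Every attached word has exactly one head.
assert all(len(h) == 1 for h in heads_of(edges).values())
print(phrase_roots(tokens, edges))
print(right_to_left(tokens, edges))
```

In this encoding only formulaic edges point from right to left, while both flat edges keep the first word as head, which matches the technical difference described above.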

@jnivre
Contributor

jnivre commented Dec 19, 2019 via email

@arademaker
Contributor

Why can't I use appos instead of formulaic? appos(John, Mr.)?

@dan-zeman
Member

Why can't I use appos instead of formulaic? appos(John, Mr.)?

Because appos is a left-to-right relation (besides also being meant for a different type of relation, but I think nobody has succeeded in precisely defining what that different relation is, while left-to-rightness is something easily testable).
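The left-to-right test is simple enough to sketch in a few lines. The following is an illustrative check, assuming edges are given as (relation, head position, dependent position) with 1-based word positions; the function name and the set of relations (the ones named in this thread as first-word-headed) are choices made for this sketch, not the actual UD validator code.

```python
# Relations that this thread describes as always attaching later words
# to an earlier head.
LEFT_TO_RIGHT = {"appos", "flat", "fixed"}


def direction_violations(edges):
    """Return edges of left-to-right relations whose dependent precedes the head.

    Each edge is (relation, head_position, dependent_position), 1-based.
    Subtypes such as flat:name are checked against their base relation.
    """
    bad = []
    for rel, head, dep in edges:
        base = rel.split(":")[0]
        if base in LEFT_TO_RIGHT and dep < head:
            bad.append((rel, head, dep))
    return bad


# "Mr. John": Mr. is word 1, John is word 2.
print(direction_violations([("appos", 2, 1)]))  # appos(John, Mr.) is flagged
print(direction_violations([("flat", 1, 2)]))   # flat(Mr., John) passes
```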

@jnivre
Contributor

jnivre commented Dec 19, 2019 via email

@nschneid
Contributor Author

@jnivre As applied to English thus far, nmod is pretty much exclusively for "case"-marked nominals (PPs and possessives). In all of EWT I can find just 20 or so other uses of nmod.

The current definition says: "The nmod relation is used for nominal dependents of another noun or noun phrase and functionally corresponds to an attribute, or genitive complement." I don't know if "Mr." counts as an "attribute".

But assuming we were to widen nmod to include "Mr." then a reasonable question is, how to distinguish it from compound? The most productive kind of compounding in English is with a noun modifying another noun.

"Mr." feels quite different from both compounds and PPs, because it's a formulaic way of putting together names, and has a restricted distribution (it can only precede a person's name).

@amir-zeldes
Contributor

I would also prefer not to add major types - UD is complicated enough, and stability is important. If we did do a 'this is a special thing we don't have a name for', I would say dep:xyz is the way to go, so maybe dep:title. Honestly I've gotten used to flat(Mr.,Smith), so I can live with it - I agree with Nathan that nmod intuitively means something different to me (prepositional/case-bearing). This is the reason we have things like nmod:npmod in English for things that are not 'case-bearing'.

Incidentally RE the comparison to compound - that's exactly what Stanford Dependencies had, where nn was used for both compound modifiers and titles. I think the reason it didn't annoy people was that names were right-headed anyway (last dominates first name), so it looked less jarring.

@jnivre
Contributor

jnivre commented Dec 19, 2019 via email

@amir-zeldes
Contributor

The name nmod:npmod is indeed odd and has purely historical reasons, but I don't agree that the functions it stands for are interchangeable with nmod. The criterion for the nmod/obl subtypes in English is that they appear without adpositions but are not objects. If we have:

I'll jog next week

Then we don't want:

obj(jog,week)

But we can't use the preposition marking guideline to do:

obl(jog,week)

And I think it's clear we have a different construction here than prepositional modification. So the solution is to give these cases a distinct relation:

obl:npmod(jog,week)

I would expect this to be applicable to other languages with unmediated adverbial NPs (e.g. 'accusativus graecus' in classical languages)

@amir-zeldes
Contributor

Incidentally, we applied the same subtyping scheme to UD_Coptic for the same reasons:

https://corpling.uis.georgetown.edu/annis/#_q=cG9zPSJWIiAtPmRlcFtmdW5jPSJvYmw6bnBtb2QiXSBub3Jt&_c=Y29wdGljLnRyZWViYW5r&cl=5&cr=5&s=0&l=10&_seg=bm9ybV9ncm91cA

It has also been our main workaround for the validator shooting down advmod+not ADV.

@nschneid
Contributor Author

I had momentarily forgotten about nmod:npmod. I too find it extremely confusing and it has its own issue: #478

@amir-zeldes, if the nmod subtypes in English are for non-case-bearing nominal modifiers, does that suggest nmod:title would be an appropriate solution for "Mr."? Or would you say that "Mr." is not really nominal, but rather an extremely special/miscellaneous modifier?

@amir-zeldes
Contributor

It's true that the 'Mr.' modifier doesn't have an adposition, but I think it's different than the extent/spatiotemporal modifiers that typically get obl/nmod:npmod. Those are essentially adverbial, which I guess is what earns them the name 'oblique': 5 years old -> old how much? -> 5 years etc. The 'Mr.' titles are really part of the same NP they modify without instantiating a new morphosyntactic role. In that respect they're similar to appositions, except that appositions consist of two full NPs, and are capable of hosting articles if they are common in English:

The president, Jane Tanaka
Jane Tanaka, the president
a president, Jane Tanaka
Jane Tanaka, a president

Unlike:

Ms. Jane Tanaka
*Jane Tanaka Ms.
President Jane Tanaka
*Jane Tanaka President

Non-appositions can't form two NPs with definiteness properties, etc. I think titles are really a distinct construction.

@dan-zeman
Member

I agree with Joakim that nmod denotes a nominal that modifies another nominal, and it is not important whether the dependent nominal is or is not marked for case.

What is important however is that we have a reason to believe that one of the two nominals is the head and the other is dependent. Because if we don't have the reason, then they should be connected via flat.

As for nmod vs. compound specifically in English, I am not sure how exactly the distinction is or should be drawn. But since compounds are lexical, I don't think they can be used with person names; therefore, compound(Tanaka, Ms.) would seem wrong to me => it should be nmod or flat.

@nschneid
Contributor Author

What is important however is that we have a reason to believe that one of the two nominals is the head and the other is dependent. Because if we don't have the reason, then they should be connected via flat.

Agreed—I think the fact that in "Mr. Smith", "Mr." is optional while "Smith" is not is reason enough to consider "Smith" the head.

As for nmod vs. compound specifically in English, I am not sure how exactly the distinction is or should be drawn. But since compounds are lexical, I don't think they can be used with person names; therefore, compound(Tanaka, Ms.) would seem wrong to me => it should be nmod or flat.

I agree that compound seems weird in this case, but could you elaborate on what you mean by "compounds are lexical"? In English I'd say that "valley unicorn" is just as compositional as "domestic unicorn" or "unicorn of the valley".

@dan-zeman
Member

could you elaborate on what you mean by "compounds are lexical"?

A short answer should probably be "no I couldn't" — that's why I say I'm not sure how exactly the distinction should be drawn.

But if I step from exact definitions to vague intuition, then my understanding of nominal compounds is that one takes two common nouns, i.e. words denoting entities with given properties, and creates something that also denotes an entity with given properties, i.e., the result is like a common noun except that it is not written as one word. (Of course, the compound relation in UD is also used for compounds that are not nominal, but that does not seem relevant in this thread.)

Personally, I'd be quite fine if the compound relation were not used for nominal compounds and nmod were used instead, so I'd prefer other people to elaborate on this.

@nschneid
Contributor Author

Related to the "two common nouns" criterion, we should also consider combinations of a normal common noun and a normal proper noun (by "normal" I mean a noun that could serve as an NP head, unlike "Mr."):

  • "the Bush administration": compound(administration, Bush) seems reasonable to me

  • "the American singer Madonna": compound(Madonna, singer)? This is tougher, a combination of "the American singer" + "Madonna" (two full NPs)—almost an appositive but without punctuation/pause between them. Semantically, it seems to me that "the American singer" is a descriptor that elaborates on "Madonna".

@amir-zeldes
Contributor

amir-zeldes commented Dec 20, 2019

I think "the American singer Madonna" is appos(singer,Madonna), since they are independent NPs, and are also reversible:

  • Madonna the American singer
  • * administration the Bush

The criterion for compounding in English is IMO not the fact that they create reference to an entity (so does 'unicorn of the valley'), but that they have the following properties:

  • They form a single NP, evidenced by the ability to insert only one article:
    • a/the valley unicorn (single NP, single article - 'valley' is not a full NP)
    • * a/the valley a/the unicorn
    • a/the unicorn of a/the valley (two NPs, with nesting, two articles possible)
  • The modifier is no longer referenceable:
    • "... a unicorn of the valley. It (=the valley) was full of grass, so unicorns ..."
    • * "... a valley unicorn. It (=the valley) was full of grass, so unicorns ..."
  • Canonically, the modifier cannot be inflected: (this has some empirical exceptions and has been discussed extensively, e.g. by Pinker)
    • rat trap(s)
    • * rats trap(s) - only head is pluralizable
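To make the three diagnostics easier to apply side by side, here is a minimal sketch that records them as boolean judgments per phrase and lists which ones point to a compound. The class, the function, and the field names are hypothetical; the article and referenceability judgments come from the examples above, while the inflectability values for the unicorn phrases are assumptions (the comment only illustrates that criterion with rat trap).

```python
from dataclasses import dataclass


@dataclass
class NounPair:
    phrase: str
    separate_articles: bool       # can modifier and head each take their own article?
    modifier_referenceable: bool  # can a later "it" refer back to the modifier?
    modifier_inflectable: bool    # can the modifier be pluralized?


def compound_evidence(pair):
    """Return the criteria (out of the three above) that point to a compound."""
    evidence = []
    if not pair.separate_articles:
        evidence.append("single article")
    if not pair.modifier_referenceable:
        evidence.append("modifier not referenceable")
    if not pair.modifier_inflectable:
        evidence.append("modifier not inflectable")
    return evidence


examples = [
    NounPair("valley unicorn", False, False, False),
    NounPair("unicorn of the valley", True, True, True),
]
for pair in examples:
    print(pair.phrase, "->", compound_evidence(pair))
```

Returning the list of satisfied criteria rather than a yes/no answer leaves room for the empirical exceptions mentioned for the inflection test.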

@nschneid
Contributor Author

I think "the American singer Madonna" is appos(singer,Madonna), since they are independent NPs, and are also reversible:

  • Madonna the American singer

I agree it's apposition-like, but I think there's a difference between "the American singer Madonna" and "the American singer, Madonna"—the second is a prototypical apposition which adds information about a fully established referent. "The American singer Madonna" strikes me as a descriptor-NP + name-NP construction, better paraphrased as "Madonna, the American singer". (Maybe these considerations are too semantic/nuanced, though, and for simplicity we should just pretend that there's an invisible comma.)

@amir-zeldes
Contributor

amir-zeldes commented Dec 20, 2019

I think syntactically it's the same with or without the comma - the alternative is to not have the article, in which case we have the 'title' construction again:

American singer Madonna said today...

But not:

* Madonna American singer said today...

As soon as it can take 'the', it's an independent NP for me, and for me an apposition is two NPs in sequence fulfilling the same syntactic function (so both NPs are equally the subject of 'said' in "The American singer Madonna", in terms of argument structure, or you can postulate an NP containing both NPs)

@nschneid
Contributor Author

I'm OK with the appos solution for "the American singer Madonna" if others are.

Getting back to the broader question of how compound is distinguished from nmod in languages like English where there isn't a cue from the morphology/spelling: Do @amir-zeldes's criteria for compound in English sit well with people? Should we document them somewhere?

@nschneid nschneid changed the title PROPN vs. PTB NNP(S) PROPN vs. PTB NNP(S), titles in names, and compound vs. nmod vs. appos Dec 20, 2019
@jnivre
Contributor

jnivre commented Dec 20, 2019 via email

@dan-zeman
Member

They do not sit well with me because the term "NP" is not defined in UD :-) But it can be probably rephrased so that it is clear what is meant.

I would also appreciate examples from more languages than just English, for the three-way distinction nmod-compound-appos.

@amir-zeldes
Contributor

I think the criteria will vary somewhat from language to language, but you can get a 3-way distinction even in other language families. For example, in Semitic languages, construct states are often identified as compounds, but while they only allow a single article, the modifier can be inflected. I'll use Hebrew and horses instead of unicorns, since 'unicorn' is itself complex in Hebrew:

emek ha-susim 'the horses valley' (compound, note horses can be pluralized)
*ha-emek ha-susim '*the horses the valley'
ha-emek shel-ha-susim 'the valley of the horses' (nmod, two articles possible)
[ha-emek], [ha-makom shel-ha-susim] 'the valley, the place of the horses' (appos, both full NPs have articles, and the internal nmod does too)

The article+inflection criterion works for Romance too, though often these are spelled together or with a hyphen:

un/le bracelet-montre "a/the wristwatch (watch-bracelet)" (single article)

I think nmod vs. compound is motivated for Chinese and Japanese as well, where compounds are linked without adpositions, but nmod has adpositions, including for possession (de and no respectively).

@amir-zeldes
Contributor

BTW for Slavic I suspect it's more difficult to make a 3-way distinction, since the (very rare) noun-noun 'compounds' inflect with agreeing case:

Polish:
krem-żel "cream gel"

And in a sentence in genitive context (real example):
to właśnie charakterystyczne cechy kremu-żelu do twarzy
"These are the characteristic features of the face cream.GEN gel.GEN"

So formally it looks a lot like an apposition. But I think it would be impossible to put a demonstrative on both parts:

* tego kremu tego żelu
* "of this cream-this gel"

Here's an authentic example with a single demonstrative:

po użyciu tego kremu-żelu cera jest ukojona...
after using this cream-gel, the complexion is soothed

So maybe the article criterion could even be used for Polish, if these things are treated as multi token units (might be spelling dependent in practice)

@dan-zeman
Member

If kremu-żelu is tokenized as three tokens, can we say that one of the two lexical tokens is the head? It looks like flat to me.

So I actually forgot flat above, and it is potentially a four-way distinction:

  • a noun-noun flat, if there is no clear head
  • an nmod, if there are no language-specific criteria to say that it is compound
  • a noun-noun compound, if language-specific criteria exist and hold here
  • an appos; a language-specific criterion in some languages seems to be that both parts are "full NPs", whatever that means (an article/demonstrative is there or can be added?); I am not sure whether we should have apposition among the relations that "have a clear head" because in fact we always attach the second part to the first, so it is like flat.
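One possible reading of this four-way distinction as a decision procedure is sketched below. The function, its parameter names, and above all the order of the tests are assumptions made for illustration: apposition is checked first because it is unclear whether it belongs with the clearly headed relations, and the appos test is phrased as "both parts are full nominals filling the same role" so that case-marked modifiers such as "of the valley" do not qualify.

```python
def suggest_relation(clear_head, full_nominals_same_role, compound_tests_hold):
    """Suggest a relation for two adjacent nominals.

    clear_head: there is a reason to treat one nominal as the head.
    full_nominals_same_role: both parts can carry their own determiner
        and fill the same syntactic role.
    compound_tests_hold: language-specific compound criteria exist and hold.
    """
    if full_nominals_same_role:
        return "appos"
    if not clear_head:
        return "flat"
    if compound_tests_hold:
        return "compound"
    return "nmod"


# Judgments taken from examples discussed in this thread.
print(suggest_relation(True, True, False))    # the American singer Madonna
print(suggest_relation(False, False, False))  # kremu-żelu
print(suggest_relation(True, False, True))    # valley unicorn
print(suggest_relation(True, False, False))   # unicorn of the valley
```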

@amir-zeldes
Contributor

Yes, I agree the possibility of multiple determiners (including demonstratives in languages without articles) is a good way to distinguish appos from flat and compound.

The krem-żel example is more flat-like than the other compound examples since, insofar as it's a compound at all, it is an example of a copulative compound, like English singer-songwriter (which is both a singer and a songwriter, not a sub-type of songwriter). I think Slavic languages are often described as having few or no compounds, except for the type in N-o-N, which is a single token (e.g. bajkopisarz 'fairy-tale writer').

The more frequent type typologically is determinative compounds, where the compound is a semantic subtype of the head ("taxi driver", "night table"), which is clearly headed and not flat.

@nschneid nschneid added this to Names, titles, numbers, values in MWEs, names, adpositions, particles, etc. Dec 24, 2019
@dan-zeman
Member

Okay, so what about this.

@arademaker
Contributor

I didn't know about this page. We need a way to see in the website the context of a given page or what pages link to it.

@dan-zeman
Member

I didn't know about this page.

That's because the page did not exist until about twenty minutes ago :-)

@amir-zeldes
Contributor

I like the draft, but I would remove "in English, the criterion seems to be" (I'd rather specify what it is!), and I would add that in appositions, the two nominals can typically be reordered:

Appos:
Barack Obama, the President
The President, Barack Obama

Not appos:
President Barack Obama
x Barack Obama President

(I started using x for star to avoid MD bullets)

Also I would consider "Great deals, great pizza!" to be parataxis.

@dan-zeman
Member

I would remove "in English, the criterion seems to be" (I'd rather specify what it is!)

So would I. I just was not sure that it was correct, but if it is, then I am happy to replace "seems to be" by "is".

@dan-zeman
Member

English rules for apposition modified in 145521a.

Parataxis is another can of worms (but it deserves a separate thread if we are to discuss it). I would have preferred to give an example where asyndetic coordination of nominals acts as a subject / object / oblique dependent in a clause; but I did not find a good example in English. (In any case, Great deals, great pizza! is annotated as coordination in UD 2.5 English EWT: see the results of http://hdl.handle.net/11346/PMLTQ-BSUB.)

@amir-zeldes
Contributor

RE appos my only suggested change to the commit is:

"has its own determiner" > "can have its own determiner"

Thanks for putting this up!

@amir-zeldes
Contributor

Oh, and regarding the EWT query, yes, but you can easily find the opposite as well (in EWT itself, not even talking about whether that's consistent with other corpora):

http://hdl.handle.net/11346/PMLTQ-ZW7H

For example in EWT:
OK Food, Slow service
parataxis(food,service)
