PROPN vs. PTB NNP(S), titles in names, and compound vs. nmod vs. appos #678
For some UD_French treebanks we decided to introduce an external POS for MWEs and titles. In other words, words in a title or a syntactically regular MWE receive their regular POS, and the whole expression receives an EXTPOS value on the head. For titles, EXTPOS=PROPN. We have one example of a one-word title where POS and EXTPOS are different. Moreover, titles receive a feature Type=Title, while MWEs have a feature Type=MWE. |
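A minimal sketch of how a consumer might read the external POS described above. It assumes the value is stored as an `EXTPOS=...` entry in a CoNLL-U style `Key=Value|Key=Value` feature string; that storage location, the helper names `parse_feats` and `external_pos`, and the application to an English title are illustrative assumptions, not something the comment specifies.

```python
def parse_feats(feats: str) -> dict:
    """Parse a 'Key=Value|Key=Value' feature string ('_' or '' means no features)."""
    if feats in ("", "_"):
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def external_pos(upos: str, feats: str) -> str:
    """Return the EXTPOS value if the token carries one, else its regular POS."""
    return parse_feats(feats).get("EXTPOS", upos)


# Hypothetical head token of the title "Cat on a Hot Tin Roof":
# regular POS NOUN, but the whole expression behaves externally as PROPN.
title_head = ("Cat", "NOUN", "EXTPOS=PROPN|Type=Title")
# A non-head word inside the title keeps its regular POS and has no EXTPOS.
inner_word = ("on", "ADP", "_")

for form, upos, feats in (title_head, inner_word):
    print(form, upos, external_pos(upos, feats))
```

The point of the fallback is that only the head of a title or MWE needs the extra annotation; every other token is read exactly as before.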
From a linguistic point of view 'being a name' is not fundamentally a morphosyntactic category in English IMO, notwithstanding article behavior (which is not 1:1). The PTB guideline (any content word in a name is If I read the UD guideline literally, it sounds like a person named 'Violet Shoemaker' should not be ADJ+NOUN(?), but a play called 'Violet Shoemaker' should be ADJ+NOUN, which seems strange to me (what if it's about a person called Violet Shoemaker?). I think in practical terms this can't be implemented for the English corpora, since someone would have to go through each PROPN to establish whether it is a 'real' name or part of a 'work of art' like "Cat on a Hot Tin Roof" or similar. And even for Cat on a Hot Tin Roof, I agree with @nschneid that a normal annotator would probably tag "I saw Cat last night" as |
This seems to reiterate an old (pre-v1) discussion about whether That being said, personal names are special even in Czech. This involves mostly surnames because given names can rarely be confused with common nouns. A surname may be derived from an adjective but it will be treated as noun (i.e., proper noun in UD) because it has a radically different distribution. And if it is (zero-)derived from a common noun, we will still tag it |
As I wrote above, I agree that names are not really a 'part of speech', even if they have slightly different distributions some of the time, because 1. there are lots of other categories with slightly different distributions which we don't distinguish (e.g. mass nouns) and 2. there are names that don't follow those distributions (e.g. can take articles). It's really a semantic distinction, saying whether something is a name. However, if we want to follow common practice in Computational Linguistics, support NER applications, etc. etc., then we need to have decidable guidelines. For me, "Cat on a Hot Tin Roof" is definitely the name of a work of art, and if I use "Cat" to refer to it, that is PROPN and not NOUN. The same is true of the Mona Lisa (which even takes a 'the'): it is a PROPN because it is the name of a work of art, and for practical NER it should be included (work of art is even an NER category in OntoNotes). As soon as a noun stops pointing to its regular extension and starts acting as a name, it's fairly intuitive to explain to annotators what to do - a person called "Wolf" is PROPN, an instance of the animal is NOUN. If we want annotators of English to treat the Mona Lisa as different from a person called Mona Lisa, I think we would find it difficult to teach to annotators, and even more problematically, we would need to go over every single |
Our goal is not to annotate named entities, but we have to acknowledge that sometimes the problem of NEs intersects with the question of syntactic annotation. It happens with titles, which behave as a whole as PROPNs but can be phrases of any POS. Introducing a double POS, one for the external behavior (towards the governor of the phrase) and one for the internal behavior (inside the phrase), is quite simple and solves many problems: the problem of titles, but also the question of MWEs that have a regular internal syntactic structure but an external behavior which does not correspond to their internal structure and to the POS of the head. Again, our goal is not to annotate MWEs, but when MWEs intersect with the question of syntactic annotation, we need a solution. |
@sylvainkahane Agreed, I'm increasingly seeing the need for what could be called "syntactic MWEs": not all MWEs, to be sure ("pay attention" is syntactically normal), but ones that are problematic for the simple notion of head-modifier relations as a representation of syntagmatic compositionality.
|
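EXTPOS sounds great, but I don't see how we would do this for all existing corpora... And even if it's done for some data sets, my instinct is to leave POS alone (since they reflect widely used standards) and rather add 'titlehood' to the large spans (à la PTB |
Making Smith the head of Mr. Smith would require another deprel - I guess most unoriginal would be dep:title, because I think even subtyping flat would not allow RTL dependencies for that relation. Conceptually I think it's probably right for Smith to be the head, but I'm very easily convinceable that this should be left alone for the status quo.
What about a new relation called formulaic, for template-like patterns that don't fall under the main syntactic relations used for normal vocabulary, but rather, specific formulas for putting together names and numbers ("Mr.", dates, units of measurement, etc.)? The technical difference would be
* flat and fixed are for combinations where it is difficult to identify a unique head, so by convention the first word functions as the head in the tree
* formulaic is for relations where a head can be designated based on criteria like omissibility, but none of the usual dependencies apply
Examples (specific headedness decisions subject to discussion):
* His Excellency Mr. John Smith, Jr.
  * flat(John, Smith)
  * formulaic(John, Mr.)
  * formulaic(John, Jr.)
  * flat(His, Excellency)
  * formulaic(John, His)
* bus number 623 leaves at five o'clock on Sep 23 2020
  * Normal syntactic dependencies: nsubj(leaves, bus), obl:tmod(leaves, five), obl:tmod(leaves, 23)
  * "bus number 623" (which can be shortened to "bus 623"): formulaic(bus, 623), formulaic(623, number)
  * "five o'clock": formulaic(five, o'clock)
  * "Sep 23 2020": formulaic(23, Sep), formulaic(Sep, 2020)
And we could have the option of subtyping some of the formulaic relations, e.g. formulaic:month(23, Sep) vs. formulaic:year(Sep, 2020). |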
I think the simple solution is to use nmod (possibly with subtyping). If we assume that "Smith" is the head, then "Mister" is a noun modifying that head. This is what we do for similar constructions in Swedish. UD by necessity has to be rather coarse-grained, so I think we should be careful not to start proliferating relations.
Joakim
Sent from my iPhone
On 19 Dec. 2019 at 20:35, Nathan Schneider <notifications@github.com> wrote:
Making Smith the head of Mr. Smith would require another deprel - I guess most unoriginal would be dep:title, because I think even subtyping flat would not allow RTL dependencies for that relation. Conceptually I think it's probably right for Smith to be the head, but I'm very easily convinceable that this should be left alone for the status quo.
What about a new relation called formulaic, for template-like patterns that don't fall under the main syntactic relations used for normal vocabulary, but rather, specific formulas for putting together names and numbers ("Mr.", dates, units of measurement, etc.)? The technical difference would be
* flat and fixed are for combinations where it is difficult to identify a unique head, so by convention the first word functions as the head in the tree
* formulaic is for relations where a head can be designated based on criteria like omissibility, but none of the usual dependencies apply
Examples (specific headedness decisions subject to discussion):
* His Excellency Mr. John Smith, Jr.
  * flat(John, Smith)
  * formulaic(John, Mr.)
  * formulaic(John, Jr.)
  * flat(His, Excellency)
  * formulaic(John, His)
* bus number 623 leaves at five o'clock on Sep 23 2020
  * Normal syntactic dependencies: nsubj(leaves, bus), obl:tmod(leaves, five), obl:tmod(leaves, 23)
  * "bus number 623" (which can be shortened to "bus 623"): formulaic(bus, 623), formulaic(623, number)
  * "five o'clock": formulaic(five, o'clock)
  * "Sep 23 2020": formulaic(23, Sep), formulaic(Sep, 2020)
And we could have the option of subtyping some of the formulaic relations, e.g. formulaic:month(23, Sep) vs. formulaic:year(Sep, 2020).
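To make the proposed analysis of "His Excellency Mr. John Smith, Jr." easier to inspect, here is a small sketch that stores the listed relations as (relation, head, dependent) triples and checks that every word has at most one head and that exactly one word is left without a head. The function name `find_root` and the shortcut of identifying words by their form (which only works because no form repeats in this phrase) are assumptions made for illustration; `formulaic` is the relation proposed in this thread, not an existing UD relation.

```python
# (relation, head, dependent) triples copied from the example in the proposal.
EDGES = [
    ("flat", "John", "Smith"),
    ("formulaic", "John", "Mr."),
    ("formulaic", "John", "Jr."),
    ("flat", "His", "Excellency"),
    ("formulaic", "John", "His"),
]


def find_root(edges):
    """Return the only word without a head; raise if a word has two heads
    or if the number of headless words is not exactly one."""
    head_of = {}
    for _rel, head, dep in edges:
        if dep in head_of:
            raise ValueError(f"{dep!r} has more than one head")
        head_of[dep] = head
    words = set(head_of) | set(head_of.values())
    headless = words - set(head_of)
    if len(headless) != 1:
        raise ValueError(f"expected exactly one headless word, got {sorted(headless)}")
    return headless.pop()


print(find_root(EDGES))  # the word every other word ultimately attaches to
```

Under the proposal, "John" is the word that is left without a head inside the phrase, so it would receive whatever relation the whole name has in the sentence.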
|
Why can't I use appos instead of formulaic? appos(John, Mr.)? |
Because appos is a left-to-right relation (besides also being meant for a different type of relation, but I think nobody has succeeded in precisely defining what that different relation is, while left-to-rightness is something easily testable). |
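Since the direction constraint is described as easily testable, here is a minimal sketch of such a test. It assumes tokens are given as dictionaries with 1-based `id`, `head` (0 for the root) and `deprel` keys; that representation and the function name `backward_appos` are illustrative assumptions, not the actual UD validator.

```python
def backward_appos(tokens):
    """Return the ids of appos dependents that precede their head (right-to-left)."""
    return [
        tok["id"]
        for tok in tokens
        # Compare only the universal part of the label, so subtypes also count.
        if tok["deprel"].split(":")[0] == "appos" and tok["head"] > tok["id"]
    ]


# Hypothetical analysis appos(Smith, Mr.): the dependent precedes its head.
mr_smith = [
    {"id": 1, "form": "Mr.", "head": 2, "deprel": "appos"},
    {"id": 2, "form": "Smith", "head": 0, "deprel": "root"},
]

# Left-to-right apposition: "president" follows and attaches to "Smith".
smith_president = [
    {"id": 1, "form": "Smith", "head": 0, "deprel": "root"},
    {"id": 2, "form": ",", "head": 3, "deprel": "punct"},
    {"id": 3, "form": "president", "head": 1, "deprel": "appos"},
]

print(backward_appos(mr_smith))         # ids that violate the direction constraint
print(backward_appos(smith_president))  # empty when every appos points rightward
```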
This would indeed be classified as apposition in many traditional grammars (including the Swedish tradition), but the UD concept of apposition is narrower and restricted to things like "president of X" in "Mr. Smith, president of X". Personally, I would be in favor of treating appos as a subtype of nmod rather than a universal relation.
Joakim
Sent from my iPhone
On 19 Dec. 2019 at 21:52, Alexandre Rademaker <notifications@github.com> wrote:
Why can't I use appos instead of formulaic? appos(John, Mr.)?
|
@jnivre As applied to English thus far, The current definition says: "The But assuming we were to widen "Mr." feels quite different from both compounds and PPs, because it's a formulaic way of putting together names, and has a restrictive distribution (it can only precede a person's name). |
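I would also prefer not to add major types - UD is complicated enough, and stability is important. If we did do a 'this is a special thing we don't have a name for', I would say dep:xyz is the way to go, so maybe dep:title. Honestly I've gotten used to flat(Mr.,Smith), so I can live with it - I agree with Nathan that nmod intuitively means something different to me (prepositional/case-bearing). This is the reason we have things like nmod:npmod in English for things that are not 'case-bearing'. Incidentally RE the comparison to compound - that's exactly what Stanford Dependencies had, where nn was used for both compound modifiers and titles. I think the reason it didn't annoy people was that names were right-headed anyway (last dominates first name), so it looked less jarring. |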
I’ve always considered nmod:npmod a strange quirk of the English treebank. :) Even the name is a weird mix of dependency and phrase structure syntax.
In a cross-linguistic perspective, I think it makes more sense to use nmod for any kind of nominal modification, regardless of whether it is accompanied by head marking, dependent marking, or no marking.
Compounding is different because it occurs at the lexical level, but I agree that English orthography makes it hard to separate in practice. It is easier in Swedish and German, where compounds are written with no internal spaces.
Joakim
Sent from my iPhone
On 19 Dec. 2019 at 22:35, Amir Zeldes <notifications@github.com> wrote:
I would also prefer not to add major types - UD is complicated enough, and stability is important. If we did do a 'this is a special thing we don't have a name for', I would say dep:xyz is the way to go, so maybe dep:title. Honestly I've gotten used to flat(Mr.,Smith), so I can live with it - I agree with Nathan that nmod intuitively means something different to me (prepositional/case-bearing). This is the reason we have things like nmod:npmod in English for things that are not 'case-bearing'
Incidentally RE the comparison to compound - that's exactly what Stanford Dependencies had, where nn was used for both compound modifiers and titles. I think the reason it didn't annoy people was that names were right-headed anyway (last dominates first name), so it looked less jarring.
|
The name I'll jog next week Then we don't want: obj(jog,week) But we can't use the preposition marking guideline to do: obl(jog,week) And I think it's clear we have a different construction here than prepositional modification. So the solution is to give these cases a distinct relation: obl:npmod(jog,week) I would expect this to be applicable to other languages with unmediated adverbial NPs (e.g. 'accusativus graecus' in classical languages) |
Incidentally, we applied the same subtyping scheme to UD_Coptic for the same reasons: It has also been our main workaround for the validator shooting down advmod+not ADV. |
I had momentarily forgotten about @amir-zeldes, if the |
It's true that the 'Mr.' modifier doesn't have an adposition, but I think it's different from the extent/spatiotemporal modifiers that typically get obl/nmod:npmod. Those are essentially adverbial, which I guess is what earns them the name 'oblique': 5 years old -> old how much? -> 5 years, etc. The 'Mr.' titles are really part of the same NP they modify without instantiating a new morphosyntactic role. In that respect they're similar to appositions, except that appositions consist of two full NPs, and are capable of hosting articles if they are common in English: "The president, Jane Tanaka", unlike "Ms. Jane Tanaka". Non-appositions can't form two NPs with definiteness properties, etc. I think titles are really a distinct construction. |
I agree with Joakim that What is important however is that we have a reason to believe that one of the two nominals is the head and the other is dependent. Because if we don't have the reason, then they should be connected via As for |
Agreed—I think the fact that in "Mr. Smith", "Mr." is optional while "Smith" is not is reason enough to consider "Smith" the head.
I agree that compound seems weird in this case, but could you elaborate on what you mean by "compounds are lexical"? In English I'd say that "valley unicorn" is just as compositional as "domestic unicorn" or "unicorn of the valley". |
A short answer should probably be "no I couldn't" — that's why I say I'm not sure how exactly the distinction should be drawn. But if I step from exact definitions to vague intuition, then my understanding of nominal compounds is that one takes two common nouns, i.e. words denoting entities with given properties, and creates something that also denotes an entity with given properties, i.e., the result is like a common noun except that it is not written as one word. (Of course, the Personally, I'd be quite fine if the |
Related to the "two common nouns" criterion, we should also consider combinations of a normal common noun and a normal proper noun (by "normal" I mean the noun that could serve as an NP head, unlike "Mr.".):
|
I think "the American singer Madonna" is
The criterion for compounding in English is IMO not the fact that they create reference to an entity (so does 'unicorn of the valley'), but that they have the following properties:
|
I agree it's apposition-like, but I think there's a difference between "the American singer Madonna" and "the American singer, Madonna"—the second is a prototypical apposition which adds information about a fully established referent. "The American singer Madonna" strikes me as a descriptor-NP + name-NP construction, better paraphrased as "Madonna, the American singer". (Maybe these considerations are too semantic/nuanced, though, and for simplicity we should just pretend that there's an invisible comma.) |
I think syntactically it's the same with or without the comma - the alternative is to not have the article, in which case we have the 'title' construction again: American singer Madonna said today... But not:
As soon as it can take 'the', it's an independent NP for me, and for me an apposition is two NPs in sequence fulfilling the same syntactic function (so both NPs are equally the subject of 'said' in "The American singer Madonna", in terms of argument structure, or you can postulate an NP containing both NPs) |
I'm OK with the Getting back to the broader question of how |
The criteria look good to me. The key point is that they form a single NP or, rather, that they form a compound nominal head. That is what I meant by saying that it occurs at the lexical level, not the phrasal level.
|
They do not sit well with me because the term "NP" is not defined in UD :-) But it can probably be rephrased so that it is clear what is meant. I would also appreciate examples from more languages than just English, for the three-way distinction |
I think the criteria will vary somewhat from language to language, but you can get a 3-way distinction even in other language families. For example, in Semitic languages, construct states are often identified as compounds, but while they only allow a single article, the modifier can be inflected. I'll use Hebrew and horses instead of unicorns, since 'unicorn' is itself complex in Hebrew:
The article+inflection criterion works for Romance too, though often these are spelled together or with a hyphen: un/le bracelet-montre "a/the wristwatch (watch-bracelet)" (single article) I think nmod vs. compound is motivated for Chinese and Japanese as well, where compounds are linked without adpositions, but nmod has adpositions, including for possession (de and no respectively). |
BTW for Slavic I suspect it's more difficult to make a 3-way distinction, since the (very rare) noun-noun 'compounds' inflect with agreeing case: Polish: And in a sentence in genitive context (real example): So formally it looks a lot like an apposition. But I think it would be impossible to put a demonstrative on both parts:
Here's an authentic example with a single demonstrative: po użyciu tego kremu-żelu cera jest ukojona... So maybe the article criterion could even be used for Polish, if these things are treated as multi-token units (might be spelling-dependent in practice) |
If kremu-żelu is tokenized as three tokens, can we say that one of the two lexical tokens is the head? It looks like So I actually forgot
|
Yes, I agree the possibility of multiple determiners (including demonstratives in languages without articles) is a good way to distinguish appos from flat and compound. The krem-żel example is more flat-like than the other compound examples since, insofar as it's a compound at all, it is an example of a copulative compound, like English singer-songwriter (which is both a singer and a songwriter, not a sub-type of songwriter). I think Slavic languages are often described as having few or no compounds, except for the type in N-o-N, which is a single token (e.g. bajkopisarz 'fairy-tale writer'). The more frequent type typologically is determinative compounds, where the compound is a semantic subtype of the head ("taxi driver", "night table"), which is clearly headed and not flat. |
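The criteria discussed so far can be summarized as a rough decision sketch. This is only one reading of the thread, not an agreed guideline: the function name `classify_pair`, its two boolean inputs, and the decision to leave compound vs. nmod undecided are all assumptions made for illustration.

```python
def classify_pair(multiple_determiners_possible: bool, head_identifiable: bool) -> str:
    """Suggest a relation for two adjacent nominals, following the thread's criteria.

    multiple_determiners_possible: each nominal could carry its own determiner
        (or demonstrative, in languages without articles).
    head_identifiable: there is a reason to treat one nominal as the head,
        e.g. the other one is omissible or the whole is a subtype of one part.
    """
    if multiple_determiners_possible:
        return "appos"
    if not head_identifiable:
        return "flat"
    # The thread does not settle how to choose between these two.
    return "compound or nmod"


# Copulative type such as krem-żel: a single demonstrative, no clear head.
print(classify_pair(False, False))
# Determinative type such as "taxi driver": a single determiner, clearly headed.
print(classify_pair(False, True))
# Two nominals that can each take their own determiner.
print(classify_pair(True, True))
```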
Okay, so what about this. |
I didn't know about this page. We need a way to see in the website the context of a given page or what pages link to it. |
That's because the page did not exist until about twenty minutes ago :-) |
I like the draft, but I would remove "in English, the criterion seems to be" (I'd rather specify what it is!), and I would add that in appositions, the two nominals can typically be reordered: Appos: Not appos: (I started using Also I would consider "Great deals, great pizza!" to be |
So would I. I just was not sure that it was correct, but if it is, then I am happy to replace "seems to be" by "is". |
English rules for apposition modified in 145521a. Parataxis is another can of worms (but it deserves a separate thread if we are to discuss it). I would have preferred to give an example where asyndetic coordination of nominals acts as a subject / object / oblique dependent in a clause; but I did not find a good example in English. (In any case, Great deals, great pizza! is annotated as coordination in UD 2.5 English EWT: see the results of http://hdl.handle.net/11346/PMLTQ-BSUB.) |
RE appos my only suggested change to the commit is: "has its own determiner" > "can have its own determiner" Thanks for putting this up! |
Oh, and regarding the EWT query, yes, but you can easily find the opposite as well (in EWT itself, not even talking about whether that's consistent with other corpora): http://hdl.handle.net/11346/PMLTQ-ZW7H For example in EWT: |
We are realizing that the UDv2 guideline for PROPN is a pretty radical departure from the previous approach, which for English followed PTB guidelines. It makes sense to give function words their usual tags even within names.
But enforcing this policy would require reviewing all multiword names in (at least the English) treebanks to convert PROPNs into NOUN, ADJ, etc., depending on context.
Moreover, it's not entirely clear what the principle is behind the distinction. In the sentence "I'm reading Cat" (short for Cat on a Hot Tin Roof) should it be NOUN or PROPN?
Is this saying that an adjective in some names can be treated as PROPN but in other constructions (such as Cat on a Hot Tin Roof) it is obligatorily ADJ?
(Related: #664)