changed common nouns with UPOS PROPN to NOUN, #94
a few lemmas corrected (americans -> american, etc)
@@ -238,7 +238,7 @@
14 . . PUNCT . _ 13 punct 13:punct _

# sent_id = newsgroup-groups.google.com_misc.consumers_a534e32067078b08_ENG_20060116_030800-0012
# text = 1. Blowback in Iraq
Was this on purpose? I guess it would be okay to remove multiple spaces but the multiple spaces are present in the original text.
More or less by accident. There was a non-breaking space that is not recorded in the SpacesAfter= key (MISC column). I probably should have edited SpacesAfter.
@dan-zeman do you have thoughts on this? Should spaces be normalized?
No strong position. Is the validator happy with this? Technically it should not complain because if MISC does not say "SpaceAfter=No", there can be one or more whitespace characters between two tokens. But I would not be surprised if the validator or some other script did not expect this and simply assumed that there is always just one chr(32). SpacesAfter is optional and many scripts probably do not look for it.
Preserving the original whitespace characters in the text attribute sounds good initially, but in fact it is not possible to preserve everything: there cannot be a line break, for instance.
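For reference, a minimal sketch of the kind of check discussed above: rebuild the sentence from FORM and MISC, then compare it with `# text` after collapsing whitespace. It assumes tab-separated CoNLL-U token lines, skips multiword-token ranges and empty nodes, and looks only at `SpaceAfter=No` (not `SpacesAfter`). The helpers `row`, `rebuild_text` and `normalize` are hypothetical names, not part of the UD validator, and the rows are toy data with only FORM and MISC filled in.

```python
import re

def row(tok_id, form, misc="_"):
    """Build a toy 10-column CoNLL-U line; only ID, FORM and MISC are filled in."""
    return "\t".join([str(tok_id), form] + ["_"] * 7 + [misc])

def rebuild_text(token_lines):
    """Join FORM values, adding one space unless MISC contains SpaceAfter=No."""
    parts = []
    for line in token_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        form, misc = cols[1], cols[9]
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()

def normalize(text):
    """Collapse any whitespace run (including U+00A0) to a single space."""
    return re.sub(r"\s+", " ", text).strip()

# Toy input: a non-breaking space followed by a regular space after "1."
raw_text = "1.\u00a0 Blowback in Iraq"
tokens = [row(1, "1."), row(2, "Blowback"), row(3, "in"), row(4, "Iraq")]

print(raw_text == rebuild_text(tokens))             # exact comparison
print(normalize(raw_text) == rebuild_text(tokens))  # whitespace-normalized comparison
```

A script that compares the raw strings would flag this sentence, while one that normalizes whitespace first would accept it, which is the ambiguity raised in this thread.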
@@ -387,7 +387,7 @@
 10 really really ADV RB _ 11 advmod 11:advmod _
 11 attended attend VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root _
 12 by by ADP IN _ 13 case 13:case _
-13 protestants protestants PROPN NNPS Number=Plur 11 obl 11:obl:by _
+13 protestants protestant NOUN NNS Number=Plur 11 obl 11:obl:by _
Protestant/Catholic should be PROPN, right?
The UD guidelines say "A proper noun is a noun (or nominal content word) that is the name (or part of the name) of a specific individual, place, or object". So is a plural like "protestants" still a proper noun? I thought not, which is why I proposed retagging it as NOUN.
I would not make a distinction based on singular vs. plural. "Protestants" could refer to all individuals, or a subset of individuals, who identify as Protestant. Likewise, a plural person name ("John Smiths") should still be PROPN.
But everything could refer to an individual, e.g. "the president said..." in a given context can refer to a president of a given country. Is this tagged PROPN as well?
I suppose there's some gray area with offices like "president". I would probably tag it as NOUN unless it's a title within a specific person's name ("President Barack Obama"). UD strives to make tagging decisions on a fairly lexical basis and not require a ton of contextual interpretation about referents.
@@ -27,7 +27,7 @@
 # text = using the metro or the air france bus
 1 using use VERB VBG VerbForm=Ger 0 root 0:root _
 2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _
-3 metro metro PROPN NNP Number=Sing 1 obj 1:obj _
+3 metro metro NOUN NN Number=Sing 1 obj 1:obj _
I think this could be PROPN if referring to a specific city's transit service
I'm not sure. If you know that the sentence is about the Parisian metro, "metro" could be seen as PROPN. But there is also a metro in Marseille, Rennes, and Montréal, so isn't it just a different word for underground/subway/tube?
Not all cities with a subway call it a metro. But many do. So I think either interpretation is valid here.
I was probably guided by the French word "métro", which is used for any kind of subway train system ("le métro londonien", "le métro de Berlin", ...). Maybe in English the word "metro" is used more for the Parisian one? So let's keep it PROPN here, since the context is clearly Paris?
@@ -1009,7 +1009,7 @@
 22 - - PUNCT HYPH _ 25 punct 25:punct SpaceAfter=No
 23 planet planet NOUN NN Number=Sing 25 compound 25:compound SpaceAfter=No
 24 - - PUNCT HYPH _ 25 punct 25:punct SpaceAfter=No
-25 earth earth PROPN NNP Number=Sing 7 obl 7:obl:on SpaceAfter=No
+25 earth earth NOUN NN Number=Sing 7 obl 7:obl:on SpaceAfter=No
Earth can be considered a PROPN I think (if talking about the planet as a specific astronomical body)
I was hesitating to retag "Earth" (as for "Moon" in another sentence). I guess you are right
@@ -3370,7 +3370,7 @@
 5 truth truth NOUN NN Number=Sing 3 nmod 3:nmod:of SpaceAfter=No
 6 , , PUNCT , _ 3 punct 3:punct _
 7 ala ala ADP IN _ 8 case 8:case _
-8 satanism satanism PROPN NNP Number=Sing 3 nmod 3:nmod:ala SpaceAfter=No
+8 satanism satanism NOUN NN Number=Sing 3 nmod 3:nmod:ala SpaceAfter=No
As the name of a religion, I would call it a proper name (by analogy to Judaism etc.).
Again, I'm not sure about satanism (and other religions). If we tag these as PROPN, why shouldn't "communism" or "neo-liberalism" be tagged as PROPN as well?
Names of religions (but not necessarily political ideologies) are conventionally capitalized and treated as proper nouns in English. If the PROPN guidelines don't address this specifically I think the English corpora should at least try to be consistent with one another and default to precedents such as PTB.
OK, convinced, I was not aware of this. I'll reset this to PROPN
BTW, also treated as proper nouns in English are months and days of the week. I know these are not capitalized in French. It would be a major undertaking to develop detailed and crosslinguistically uniform policies for NOUN vs. PROPN, but for now we should default to language-specific conventions. :)
Yes, I was aware of that, that's why I didn't touch them :-). Indeed, it would be nice to have some universal policies, but I agree to stick to language-specific conventions for the time being. I saw the discussions on film titles and how to tag them. Even within a single language it's not always obvious.
+1 for staying as close as possible to PTB guidelines (better to have a questionable but consistent guideline than every English corpus being slightly different from every other one...)
I changed/committed satanism, earth, protestants and catholics back to PROPN following the discussions here.
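One way to review edits like these is to list nominal tokens whose UPOS and PTB tag disagree. The sketch below assumes the pairing visible in the diffs above (PROPN with NNP/NNPS, NOUN with NN/NNS) and tab-separated CoNLL-U lines. The `EXPECTED_XPOS` table and `mismatches` function are illustrative names, the second sample row is deliberately made inconsistent, and a real treebank may contain legitimate exceptions.

```python
# Assumed UPOS -> PTB tag pairing for nominals, as seen in the diffs in this PR.
EXPECTED_XPOS = {
    "PROPN": {"NNP", "NNPS"},
    "NOUN": {"NN", "NNS"},
}

def mismatches(conllu_lines):
    """Yield (ID, FORM, UPOS, XPOS) for nominal tokens whose two tags disagree."""
    for line in conllu_lines:
        # Skip blank lines and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, upos, xpos = cols[0], cols[1], cols[3], cols[4]
        allowed = EXPECTED_XPOS.get(upos)
        if allowed is not None and xpos not in allowed:
            yield tok_id, form, upos, xpos

rows = [
    "13\tprotestants\tprotestant\tNOUN\tNNS\tNumber=Plur\t11\tobl\t11:obl:by\t_",
    # Deliberately inconsistent row for illustration (UPOS changed, XPOS not).
    "3\tmetro\tmetro\tPROPN\tNN\tNumber=Sing\t1\tobj\t1:obj\t_",
]

for hit in mismatches(rows):
    print(hit)
```

Only the second row is reported, since its UPOS was changed without the matching XPOS.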
@@ -3370,7 +3370,7 @@
 5 truth truth NOUN NN Number=Sing 3 nmod 3:nmod:of SpaceAfter=No
 6 , , PUNCT , _ 3 punct 3:punct _
 7 ala ala ADP IN _ 8 case 8:case _
-8 satanism satanism PROPN NNP Number=Sing 3 nmod 3:nmod:ala SpaceAfter=No
+8 satanism satanism PPROPN NNP Number=Sing 3 nmod 3:nmod:ala SpaceAfter=No
typo: PPROPN
LGTM apart from the typo!
Corrected, thanks!
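A typo like this could be caught before review by checking the UPOS column against the 17 universal POS tags. This is a minimal sketch, assuming tab-separated CoNLL-U input; `invalid_upos` is a hypothetical helper, and the official UD validator remains the reference check.

```python
# The 17 universal POS tags defined by UD.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def invalid_upos(conllu_lines):
    """Return (line number, FORM, UPOS) for tokens whose UPOS is not a universal tag."""
    bad = []
    for lineno, line in enumerate(conllu_lines, start=1):
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10:
            continue
        # Multiword-token range lines (e.g. "1-2") carry "_" as UPOS; skip them.
        if "-" in cols[0]:
            continue
        if cols[3] not in UPOS_TAGS:
            bad.append((lineno, cols[1], cols[3]))
    return bad

sample = [
    "7\tala\tala\tADP\tIN\t_\t8\tcase\t8:case\t_",
    "8\tsatanism\tsatanism\tPPROPN\tNNP\tNumber=Sing\t3\tnmod\t3:nmod:ala\tSpaceAfter=No",
]
print(invalid_upos(sample))
```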
According to issue #91: changed common nouns with UPOS PROPN to NOUN and corrected a few lemmas (americans -> american, etc.). Nouns like "americans" or "caucasians" remain PROPN.