changed common nouns with UPOS PROPN to NOUN, #94

Merged
merged 3 commits into from
Aug 30, 2020

Conversation

jheinecke
Contributor

Following issue #91: changed common nouns tagged UPOS PROPN to NOUN and corrected a few lemmas (americans -> american, etc.). Nouns like "americans" or "caucasians" remain PROPN.
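
A minimal sketch of how candidates for this kind of retagging could be listed. It assumes tab-separated CoNLL-U input and uses a purely hypothetical heuristic (PROPN tokens whose form is written entirely in lowercase); the function name and the `example-1` sentence ID are illustrative, and nothing here is claimed to be the procedure actually used for this PR. Every hit still needs a manual decision.

```python
def lowercase_propn_candidates(conllu_text):
    """Yield (sent_id, token_id, form, lemma) for lowercase tokens tagged PROPN."""
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
            continue
        if not line or line.startswith("#"):
            continue  # blank separator or other comment line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        token_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        if upos == "PROPN" and form.islower():
            yield sent_id, token_id, form, lemma


# Token rows taken from one of the diffs in this PR; the sent_id is made up.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "\t".join("1 using use VERB VBG VerbForm=Ger 0 root 0:root _".split()),
    "\t".join("2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _".split()),
    "\t".join("3 metro metro PROPN NNP Number=Sing 1 obj 1:obj _".split()),
])

for hit in lowercase_propn_candidates(SAMPLE):
    print(hit)
```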

a few lemmas corrected (americans -> american, etc)
@@ -238,7 +238,7 @@
14 . . PUNCT . _ 13 punct 13:punct _

# sent_id = newsgroup-groups.google.com_misc.consumers_a534e32067078b08_ENG_20060116_030800-0012
# text = 1.  Blowback in Iraq
Member

Was this on purpose? I guess it would be okay to remove multiple spaces but the multiple spaces are present in the original text.

Contributor Author

More or less by accident. There was a non-breaking space that is not recorded in the SpacesAfter= key (MISC column). I probably should have edited SpacesAfter instead.

Member

@dan-zeman do you have thoughts on this? Should spaces be normalized?

Member

No strong position. Is the validator happy with this? Technically it should not complain because if MISC does not say "SpaceAfter=No", there can be one or more whitespace characters between two tokens. But I would not be surprised if the validator or some other script did not expect this and simply assumed that there is always just one chr(32). SpacesAfter is optional and many scripts probably do not look for it.

Preserving the original whitespace characters in the text attribute sounds good initially, but in fact it is not possible to preserve everything: there cannot be a line break, for instance.
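
The relationship between the text attribute and the MISC column can be sketched roughly as follows. This is only an illustration, not the validator's logic: it assumes tab-separated rows, a single space as the default separator, and the backslash escapes of the UDPipe convention for SpacesAfter (`\s`, `\t`, `\n`, `\r`, `\p`, `\\`). The helper `row` and the tokenization of the sample sentence are hypothetical.

```python
import re

# Escapes assumed inside SpacesAfter= (UDPipe convention).
_ESCAPES = {"s": " ", "t": "\t", "n": "\n", "r": "\r", "p": "|", "\\": "\\"}


def spaces_after(misc):
    """Return the whitespace that follows a token, given its MISC column."""
    attrs = dict(item.split("=", 1) for item in misc.split("|") if "=" in item)
    if attrs.get("SpaceAfter") == "No":
        return ""
    if "SpacesAfter" in attrs:
        # Unescape sequences such as \s; unknown escapes keep the bare character.
        return re.sub(r"\\(.)",
                      lambda m: _ESCAPES.get(m.group(1), m.group(1)),
                      attrs["SpacesAfter"])
    return " "  # nothing recorded: assume exactly one space


def rebuild_text(token_lines):
    """Concatenate FORM columns with the whitespace recorded in MISC."""
    parts = []
    for line in token_lines:
        cols = line.split("\t")
        parts.append(cols[1] + spaces_after(cols[9]))
    return "".join(parts).rstrip()  # whitespace after the last token is dropped


def row(idx, form, misc="_"):
    # Hypothetical minimal row: only ID, FORM and MISC are filled in.
    return "\t".join([str(idx), form, "_", "_", "_", "_", "_", "_", "_", misc])


tokens = [row(1, "1.", r"SpacesAfter=\s\s"), row(2, "Blowback"),
          row(3, "in"), row(4, "Iraq")]
print(rebuild_text(tokens) == "1.  Blowback in Iraq")
```

Comparing `rebuild_text(...)` with the `# text` comment of each sentence would flag cases where the text holds whitespace that MISC does not account for, such as the non-breaking space mentioned above.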

@@ -387,7 +387,7 @@
10 really really ADV RB _ 11 advmod 11:advmod _
11 attended attend VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root _
12 by by ADP IN _ 13 case 13:case _
- 13 protestants protestants PROPN NNPS Number=Plur 11 obl 11:obl:by _
+ 13 protestants protestant NOUN NNS Number=Plur 11 obl 11:obl:by _
Contributor

Protestant/Catholic should be PROPN right?

Contributor Author

The UD guidelines say "A proper noun is a noun (or nominal content word) that is the name (or part of the name) of a specific individual, place, or object". So is a plural like "protestants" still a proper noun? I thought not, which is why I propose to retag it as NOUN.

Contributor

I would not make a distinction based on singular vs. plural. "Protestants" could refer to all individuals, or a subset of individuals, who identify as Protestant. Likewise, a plural person name ("John Smiths") should still be PROPN.

Contributor Author

But everything could refer to an individual, e.g. "the president said..." in a given context can refer to a president of a given country. Is this tagged PROPN as well?

Contributor

I suppose there's some gray area with offices like "president". I would probably tag it as NOUN unless it's a title within a specific person's name ("President Barack Obama"). UD strives to make tagging decisions on a fairly lexical basis and not require a ton of contextual interpretation about referents.

@@ -27,7 +27,7 @@
# text = using the metro or the air france bus
1 using use VERB VBG VerbForm=Ger 0 root 0:root _
2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _
- 3 metro metro PROPN NNP Number=Sing 1 obj 1:obj _
+ 3 metro metro NOUN NN Number=Sing 1 obj 1:obj _
Contributor

I think this could be PROPN if referring to a specific city's transit service

Contributor Author

I'm not sure. If you know that the sentence is about the Parisian metro, "metro" could be seen as PROPN, but there is also a metro in Marseille, Rennes, Montréal, and elsewhere, so isn't it just a different word for underground/subway/tube?

Contributor

Not all cities with a subway call it a metro. But many do. So I think either interpretation is valid here.

Contributor Author

I was probably guided by the French word "métro", which is used for any kind of subway system ("le métro londonien", "le métro de Berlin" ...). Maybe in English the word "metro" is used mostly for the Parisian one? So let's keep it PROPN here, since the context is clearly Paris?

@@ -1009,7 +1009,7 @@
22 - - PUNCT HYPH _ 25 punct 25:punct SpaceAfter=No
23 planet planet NOUN NN Number=Sing 25 compound 25:compound SpaceAfter=No
24 - - PUNCT HYPH _ 25 punct 25:punct SpaceAfter=No
- 25 earth earth PROPN NNP Number=Sing 7 obl 7:obl:on SpaceAfter=No
+ 25 earth earth NOUN NN Number=Sing 7 obl 7:obl:on SpaceAfter=No
Contributor

@nschneid, Aug 25, 2020

Earth can be considered a PROPN I think (if talking about the planet as a specific astronomical body)

Contributor Author

I was hesitating to retag "Earth" (as for "Moon" in another sentence). I guess you are right

@@ -3370,7 +3370,7 @@
5 truth truth NOUN NN Number=Sing 3 nmod 3:nmod:of SpaceAfter=No
6 , , PUNCT , _ 3 punct 3:punct _
7 ala ala ADP IN _ 8 case 8:case _
- 8 satanism satanism PROPN NNP Number=Sing 3 nmod 3:nmod:ala SpaceAfter=No
+ 8 satanism satanism NOUN NN Number=Sing 3 nmod 3:nmod:ala SpaceAfter=No
Contributor

As the name of a religion, I would call it a proper name (by analogy to Judaism etc.).

Contributor Author

Again, I'm not sure about satanism (and other religions). If we tag these as PROPN, why shouldn't "communism" or "neo-liberalism" be tagged as PROPN as well?

Contributor

Names of religions (but not necessarily political ideologies) are conventionally capitalized and treated as proper nouns in English. If the PROPN guidelines don't address this specifically I think the English corpora should at least try to be consistent with one another and default to precedents such as PTB.

Contributor Author

OK, convinced, I was not aware of this. I'll reset this to PROPN

Contributor

BTW, also treated as proper nouns in English are months and days of the week. I know these are not capitalized in French. It would be a major undertaking to develop detailed and crosslinguistically uniform policies for NOUN vs. PROPN, but for now we should default to language-specific conventions. :)

Contributor Author

Yes, I was aware of that, which is why I didn't touch them :-). Indeed, it would be nice to have some universal policies, but I agree to stick to language-specific conventions for the time being. I saw the discussions on film titles and how to tag them. Even within a single language it's not always obvious.

Contributor

+1 for staying as close as possible to PTB guidelines (better to have a questionable but consistent guideline than every English corpus being slightly different from every other one...)
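
One way to keep the UPOS and PTB-style XPOS columns in step when retagging is a small consistency check. The sketch below assumes tab-separated rows and a hypothetical pairing (NOUN with NN/NNS, PROPN with NNP/NNPS) that mirrors the diffs in this PR; it is not an official mapping, and a treebank may legitimately contain other combinations.

```python
# Assumed pairing between UPOS and PTB XPOS for nouns (illustrative only).
EXPECTED_XPOS = {"NOUN": {"NN", "NNS"}, "PROPN": {"NNP", "NNPS"}}


def upos_xpos_mismatches(token_lines):
    """Return (id, form, upos, xpos) for rows whose XPOS does not fit the UPOS."""
    mismatches = []
    for line in token_lines:
        cols = line.split("\t")
        upos, xpos = cols[3], cols[4]
        allowed = EXPECTED_XPOS.get(upos)
        if allowed is not None and xpos not in allowed:
            mismatches.append((cols[0], cols[1], upos, xpos))
    return mismatches


# First row is from this PR; the second is a made-up, half-edited variant.
ok_row = "\t".join(
    "13 protestants protestant NOUN NNS Number=Plur 11 obl 11:obl:by _".split())
bad_row = "\t".join(
    "13 protestants protestant NOUN NNPS Number=Plur 11 obl 11:obl:by _".split())

print(upos_xpos_mismatches([ok_row, bad_row]))
```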

@jheinecke
Contributor Author

I changed/committed satanism, earth, protestants and catholics back to PROPN following the discussions here.

@@ -3370,7 +3370,7 @@
5 truth truth NOUN NN Number=Sing 3 nmod 3:nmod:of SpaceAfter=No
6 , , PUNCT , _ 3 punct 3:punct _
7 ala ala ADP IN _ 8 case 8:case _
- 8 satanism satanism PROPN NNP Number=Sing 3 nmod 3:nmod:ala SpaceAfter=No
+ 8 satanism satanism PPROPN NNP Number=Sing 3 nmod 3:nmod:ala SpaceAfter=No
Contributor

typo: PPROPN
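
A typo like this can be caught mechanically by checking the UPOS column against the 17 universal POS tags. The following is a minimal sketch assuming tab-separated rows; the function name is illustrative, and it is not a substitute for the official UD validator.

```python
# The 17 universal part-of-speech tags defined by UD.
UPOS_TAGS = frozenset(
    "ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON "
    "PROPN PUNCT SCONJ SYM VERB X".split()
)


def invalid_upos(token_lines):
    """Return (id, form, upos) for word rows whose UPOS is not a known tag."""
    problems = []
    for line in token_lines:
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # multiword-token ranges and empty nodes are not checked here
        if cols[3] not in UPOS_TAGS:
            problems.append((cols[0], cols[1], cols[3]))
    return problems


# The row from the diff above, rebuilt with tab separators.
typo_row = "\t".join(
    "8 satanism satanism PPROPN NNP Number=Sing 3 nmod 3:nmod:ala SpaceAfter=No".split())

print(invalid_upos([typo_row]))
```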

@nschneid
Contributor

LGTM apart from the typo!

@jheinecke
Contributor Author

corrected, thanks!
Sorry for the typo (shouldn't edit on weekends :-)

@nschneid nschneid merged commit 2d6abf1 into dev Aug 30, 2020
@jheinecke jheinecke deleted the commonnouns branch August 31, 2020 08:26
5 participants