NNPS lemmas shouldn't be plural #147
Comments
As with all things, there are always a lot of edge cases: Americans - this seems obvious, should be American |
I worry that if we tag it as morphologically plural—NNPS, Number=Plur—it gets tricky to use name-specific semantic criteria for whether the lemma should be singular or plural. "United States" has a holistic meaning that is idiosyncratic beyond its compositional meaning—but it does involve multiple states. So it seems unfair to make lemmatization decisions for a single word based on whether the name as a whole can be singular. Another way to put it is that pluralization is used in deriving the name, not that the name as a whole is plural. Still plural at the word level, so I think the lemma should be singular. "Falls" is an exception because even as a common noun it is always plural in that sense: https://en-word.net/lemma/waterfall So I would not remove the "s" in the lemma. (Cf. "pants".) |
Now that I am checking, though, I see "United States" is tagged with Number=Sing! If that is the policy then it's not being considered a "real" plural and thus the lemma shouldn't change. |
GUM tags Star Wars as NNPS and then shortens the lemma to War, so that would be consistent, at least. They do lemmatize the party "Neo Democrats" as "Neo Democrats", though, so maybe there are situations where Democrats should be left with the "s", or maybe GUM needs to change it. They leave the "s" on Chatham Motors, though, so company names seem to be unchanged. |
This is already corrected in the source repo but not yet propagated to UD: https://github.com/amir-zeldes/gum/blob/dev/_build/src/xml/GUM_voyage_chatham.xml#L787 It should be Democrat and Motor. |
See also list of known plural lemmas in GUM here: https://github.com/amir-zeldes/gum/blob/dev/_build/utils/validate.py#L624 |
That part of the validation script is referring to NNS, though, not NNPS.
United States is still tagged United_NNP States_NNPS? What does that do to
the lemma of States?
|
Yes, looking at GUM, it is mostly United_NNP States_NNPS. Honestly I think this is more intuitive than the EWT way, and that the lemma should be "State". |
That's right, it should be NNPS/State in GUM without exception now. The validation exceptions for plural proper noun lemmas are right above that line, here: https://github.com/amir-zeldes/gum/blob/dev/_build/utils/validate.py#L620 |
That says NPS, though, not NNPS. Am I missing something that transforms
NNPS into NPS?
|
It's the same. GUM has four sets of POS tags, PTB, AMALGAM/TT, upos and claws5. NPS corresponds to PTB NNPS. |
Alright, I guess we're doing this...
Securities -> Security?
Abacus Technologies -> Abacus Technology?
Marvel Consultants -> Marvel Consultant?
Risk Managers Conference -> Manager?
Your suggestion to introduce the concept discussed with one of the Lays is
welcomed ... Lay? I'm guessing there's multiple people named Lay from this
context, although I don't know
the International Fund for Animal Welfare and their friends 'The
Bateleurs' ... same thing, probably, so Bateleur?
Comets, the basketball team? Does this become Comet or stay Comets?
Enron Net Works -> Work ? That seems awkward
The India Diaries -> The India Diary?
Suns Systems -> Sun System? Not to mention Suns is probably a typo anyway
Lunar Transportation Systems -> Lunar Transportation System?
Five Guys -> Five Guy?
Printers ' Row -> Printer ' Row?
the Philippines -> the Philippines, unchanged?
Chicago Botanical Gardens -> Chicago Botanical Garden?
Conyers is unchanged?
Los Angeles Movers -> Mover?
Bright Futures -> Bright Future?
Family Bagels -> Family Bagel?
|
Keep Philippines, Conyers. "Net Works": is that supposed to be like works? In general I would say, if the noun could ever be used in the singular with the relevant sense, then it should be lemmatized as singular even in names. |
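A minimal sketch of how that rule of thumb could be applied mechanically, assuming a hand-maintained exception list. The function name `nnps_lemma` and the set `KEEP_PLURAL` are hypothetical, not part of EWT or GUM tooling; the exceptions are only the names argued for in this thread (Philippines, Conyers, Falls, Quarters, Times), and the suffix stripping is deliberately naive.

```python
# Hypothetical helper, not taken from any treebank tooling.
# Exceptions are names this thread argues should keep their final "s".
KEEP_PLURAL = {"Philippines", "Conyers", "Falls", "Quarters", "Times"}


def nnps_lemma(form: str) -> str:
    """Return a singular lemma for an NNPS token unless it is a listed exception."""
    if form in KEEP_PLURAL:
        return form
    # "Securities" -> "Security", "Technologies" -> "Technology"
    if form.endswith("ies") and len(form) > 4:
        return form[:-3] + "y"
    # "States" -> "State", "Motors" -> "Motor"; leave "-ss" endings alone
    if form.endswith("s") and not form.endswith("ss"):
        return form[:-1]
    return form


for word in ["States", "Securities", "Democrats", "Philippines", "Conyers"]:
    print(word, "->", nnps_lemma(word))
```

A sketch like this only proposes candidates. Sense-dependent cases and borderline names such as "Beatles" still need a human decision.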
Net Works
Beats me
Comes from here:
http://www.enron-mail.com/email/skilling-j/discussion_threads/Enron_Net_Works_1.html
I suspect it's normally supposed to be Enron Networks, but I actually don't
know and haven't done a ton of research
|
New York Times .... Times?
United Nations -> Nation, NNPS?
Florida Keys?
Captain's Quarters?
Motorsports
Beatles
Middle Ages
Dominion Posts should be split into two tokens I assume. Do we do that
kind of thing? Same with bislas, if so. Hancocks, Birdies, Bridies
Levi`s - weird punctuation, but either way, should this be tokenized with a
possessive?
Most lowercase words already seem okay, but...
papers? maybe this goes to paper regardless of the context?
…On Mon, Apr 19, 2021 at 12:22 AM John Bauer ***@***.***> wrote:
> Net Works
Beats me
Comes from here:
http://www.enron-mail.com/email/skilling-j/discussion_threads/Enron_Net_Works_1.html
I suspect it's normally supposed to be Enron Networks, but I actually
don't know and haven't done a ton of research
|
"quarters" is definitely a noun with a special plural meaning: https://en-word.net/lemma/quarters I would say the same for Times, which has taken on a special meaning for newspaper titles in the plural. Beatles is borderline for me: the name of the musical group is plural, and the singular can only refer to an individual member of the particular group. (I guess the name was partly a pun on "Beat", and the reason the name was plural in the first place is because that is a common pattern with band names—e.g. "The Temptations", "The Stones", etc.) Nations, Keys, Ages, Motorsports can all have singular lemmas I think. There is a dialect difference in general with "sports" (U.S.) vs. "sport" (UK) when referring to athletic competitions in general, though U.S. English still has expressions like "for sport". Probably safe to always strip the -s in the lemma. (Similarly, "math" (U.S.) vs. "maths" (UK).)
If used in the sense of 'documentation' (e.g., of one's immigration status) I think it is legitimate to keep the lemma plural: https://en-word.net/lemma/papers If used in the sense of 'newspapers' or individual pieces of paper then it should be singular.
Yeah at least some of these look like they are actually possessive. Please open a separate issue or pull request for those. |
> > Most lowercase words already seem okay, but... papers? maybe this goes to
> > paper regardless of the context?
> If used in the sense of 'documentation' (e.g., of one's immigration
> status) I think it is legitimate to keep the lemma plural:
> https://en-word.net/lemma/papers If used in the sense of 'newspapers' or
> individual pieces of paper then it should be singular.
Settlement papers? Execution papers (presumably for a deal, not a
person)? I put both as "paper" for now
Anyway, I think I'm done, modulo whatever changes you need
… |
Big improvement, thanks @AngledLuffa! |
e.g. "Americans" is sometimes the lemma rather than "American"
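One way to find such cases is to scan a CoNLL-U file for tokens whose XPOS is NNPS and whose lemma is identical to the surface form. The sketch below relies only on the standard 10-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, ...). The function name `find_plural_nnps_lemmas` and the sample rows are illustrative assumptions, not actual treebank content.

```python
def find_plural_nnps_lemmas(conllu_lines):
    """Collect (id, form, lemma) for NNPS tokens whose lemma still equals the form."""
    hits = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip sentence breaks and comment lines such as "# text = ..."
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, lemma, _upos, xpos = cols[:5]
        if xpos == "NNPS" and lemma.lower() == form.lower() and lemma.endswith("s"):
            hits.append((tok_id, form, lemma))
    return hits


# Made-up rows for illustration only.
sample = [
    "# text = Americans Americans",
    "1\tAmericans\tAmericans\tPROPN\tNNPS\tNumber=Plur\t0\troot\t_\t_",
    "2\tAmericans\tAmerican\tPROPN\tNNPS\tNumber=Plur\t1\tconj\t_\t_",
]
hits = find_plural_nnps_lemmas(sample)
print(hits)
```

The hits are candidates for review rather than errors, since some names (e.g. Philippines) are meant to keep a plural lemma.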