NNPS lemmas shouldn't be plural #147

Closed
nschneid opened this issue Apr 16, 2021 · 18 comments

@nschneid
Contributor

e.g. "Americans" is sometimes the lemma rather than "American"

@AngledLuffa
Contributor

As with all things, there are always a lot of edge cases:

Americans - this seems obvious, should be American
Democrats / Republicans / Nazis / Communists / etc -> Democrat / Nazi / etc
Texas / Chris / Devries -> not changed
United States / Iguazu Falls -> not changed? There's no such thing as a United State or an Iguazu Fall
Channel Islands / Rocky Mountains -> this time I can see an argument for there being a single Channel Island
EBS Ventures / General Motors / Dixie Chicks / Beatles -> left unchanged
Fridays / Friday's (with this intended to be plural, not possessive) -> Friday
Book / movie titles: Seven Habits of Highly Effective People, Star Wars, Three Little Pigs -> unchanged?
Group titles: Joint Chiefs of Staff, House of Representatives, Conference of Mayors -> unchanged?
Family names used as plurals: Elliotts instead of Elliott, Gateses instead of Gates -> Elliott & Gates
We've been asking everyday if anyone had heard if the "Babies Elliott" had arrived. -> Baby?
Deemed ISDAs -> Deemed ISDA?
US Army Corps of Engineers -> leave it Engineers or make it Engineer?
Chicken McNuggets -> McNugget or McNuggets?

@nschneid
Contributor Author

> United States / Iguazu Falls -> not changed? There's no such thing as a United State or an Iguazu Fall
> Book / movie titles: Seven Habits of Highly Effective People, Star Wars, Three Little Pigs -> unchanged?
> Group titles: Joint Chiefs of Staff, House of Representatives, Conference of Mayors -> unchanged?
> US Army Corps of Engineers -> leave it Engineers or make it Engineer?

I worry that if we tag it as morphologically plural—NNPS, Number=Plur—it gets tricky to use name-specific semantic criteria for whether the lemma should be singular or plural.

"United States" has a holistic meaning that is idiosyncratic beyond its compositional meaning—but it does involve multiple states. So it seems unfair to make lemmatization decisions for a single word based on whether the name as a whole can be singular.

Another way to put it is that pluralization is used in deriving the name, not that the name as a whole is plural. Still plural at the word level, so I think the lemma should be singular.

"Falls" is an exception because even as a common noun it is always plural in that sense (https://en-word.net/lemma/waterfall), so I would not remove the "s" in the lemma. (Cf. "pants".)

@nschneid
Contributor Author

Now that I am checking, though, I see "United States" is tagged with Number=Sing! If that is the policy then it's not being considered a "real" plural and thus the lemma shouldn't change.
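
One way to run this kind of check is to tally how a given form is annotated across a CoNLL-U file. The sketch below is a minimal illustration, not a tool used by the treebank maintainers: the function name `annotation_counts` is hypothetical, and the sample rows are invented for the example rather than copied from any treebank.

```python
from collections import Counter


def annotation_counts(conllu_text, form):
    """Count (XPOS, Number, LEMMA) combinations for one surface form."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # CoNLL-U token lines have exactly 10 tab-separated columns.
        if len(cols) != 10 or cols[1] != form:
            continue
        lemma, xpos, feats = cols[2], cols[4], cols[5]
        number = None
        for feat in feats.split("|"):
            if feat.startswith("Number="):
                number = feat.split("=", 1)[1]
        counts[(xpos, number, lemma)] += 1
    return counts


# Invented sample rows (not real treebank data); unused columns are "_".
SAMPLE = "\n".join([
    "# text = United States",
    "\t".join(["1", "United", "United", "PROPN", "NNP", "Number=Sing", "_", "_", "_", "_"]),
    "\t".join(["2", "States", "States", "PROPN", "NNP", "Number=Sing", "_", "_", "_", "_"]),
])

print(annotation_counts(SAMPLE, "States"))
```

Running it over a whole treebank file would show at a glance whether "States" is consistently singular or split between NNP and NNPS.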

@AngledLuffa
Contributor

GUM tags Star Wars as NNPS and then shortens the lemma to War, so that would be consistent, at least.

They do lemmatize the party "Neo Democrats" as "Neo Democrats", though, so maybe there are situations where Democrats should be left with the "s", or maybe GUM needs to change it.

They leave the "s" on Chatham Motors, though, so company names seem to be unchanged.

@amir-zeldes
Contributor

This is already corrected in the source repo but not yet propagated to UD:

https://github.com/amir-zeldes/gum/blob/dev/_build/src/xml/GUM_voyage_chatham.xml#L787

It should be Democrat and Motor.

@amir-zeldes
Contributor

See also list of known plural lemmas in GUM here:

https://github.com/amir-zeldes/gum/blob/dev/_build/utils/validate.py#L624

@AngledLuffa
Contributor

AngledLuffa commented Apr 16, 2021 via email

@nschneid
Contributor Author

Yes, looking at GUM, it is mostly United_NNP States_NNPS. Honestly I think this is more intuitive than the EWT way, and that the lemma should be "State".

@amir-zeldes
Contributor

That's right, it should be NNPS/State in GUM without exception now. The validation exceptions for plural proper noun lemmas are right above that line, here:

https://github.com/amir-zeldes/gum/blob/dev/_build/utils/validate.py#L620
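
For readers who have not opened the linked file, the general shape of such a check can be sketched as follows. This is an assumption about how an allow-list validation might look, not a copy of GUM's `validate.py`: the names `ALLOWED_PLURAL_LEMMAS` and `check_nnps_lemmas` are hypothetical, and the allow-list holds a single placeholder entry taken from this thread.

```python
# Hypothetical allow-list; the real exceptions live in GUM's validate.py.
ALLOWED_PLURAL_LEMMAS = {"Falls"}


def check_nnps_lemmas(tokens, allowed=ALLOWED_PLURAL_LEMMAS):
    """Flag NNPS tokens whose lemma still ends in "s" and is not allow-listed.

    `tokens` is an iterable of (form, xpos, lemma) tuples.
    """
    problems = []
    for form, xpos, lemma in tokens:
        if xpos == "NNPS" and lemma.endswith("s") and lemma not in allowed:
            problems.append(f"{form}: NNPS lemma '{lemma}' looks plural")
    return problems


tokens = [
    ("Democrats", "NNPS", "Democrats"),  # flagged
    ("States", "NNPS", "State"),         # fine: lemma is singular
    ("Falls", "NNPS", "Falls"),          # fine: allow-listed
]
for message in check_nnps_lemmas(tokens):
    print(message)
```

Keeping the exceptions in one explicit set means every plural lemma that survives is a deliberate decision rather than an oversight.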

@AngledLuffa
Contributor

AngledLuffa commented Apr 17, 2021 via email

@amir-zeldes
Contributor

It's the same. GUM has four sets of POS tags: PTB, AMALGAM/TT, upos, and claws5. NPS corresponds to PTB NNPS.

@AngledLuffa
Contributor

AngledLuffa commented Apr 19, 2021 via email

@nschneid
Contributor Author

Keep Philippines, Conyers.

"Net Works": is that supposed to be like works?

In general I would say, if the noun could ever be used in the singular with the relevant sense, then it should be lemmatized as singular even in names.
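
As a rough sketch of how that rule could be applied mechanically, assuming a hand-maintained exception set: the names `KEEP_PLURAL`, `OVERRIDES`, and `suggest_lemma` are hypothetical, the entries are examples drawn from this thread, and the suffix stripping is deliberately naive. Deciding whether a noun "could ever be used in the singular with the relevant sense" still needs a human.

```python
# Hypothetical exception set: names from this thread that keep their final "s".
KEEP_PLURAL = {"Philippines", "Conyers", "Falls"}

# Hypothetical overrides for forms where stripping one "s" gives the wrong lemma.
OVERRIDES = {"Gateses": "Gates", "Babies": "Baby"}


def suggest_lemma(form: str) -> str:
    """Suggest a lemma for a token tagged NNPS (naive, for manual review)."""
    if form in KEEP_PLURAL:
        return form
    if form in OVERRIDES:
        return OVERRIDES[form]
    if len(form) > 1 and form.endswith("s"):
        return form[:-1]
    return form


for form in ["Americans", "States", "Philippines", "Gateses"]:
    print(form, "->", suggest_lemma(form))
```

The output is a suggestion list for review, not a replacement for case-by-case judgment on names like "Beatles".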

@AngledLuffa
Contributor

AngledLuffa commented Apr 19, 2021 via email

@AngledLuffa
Contributor

AngledLuffa commented Apr 23, 2021 via email

@nschneid
Contributor Author

"quarters" is definitely a noun with a special plural meaning: https://en-word.net/lemma/quarters

I would say the same for Times, which has taken on a special meaning for newspaper titles in the plural.

Beatles is borderline for me: the name of the musical group is plural, and the singular can only refer to an individual member of the particular group. (I guess the name was partly a pun on "Beat", and the reason the name was plural in the first place is because that is a common pattern with band names—e.g. "The Temptations", "The Stones", etc.)

Nations, Keys, Ages, Motorsports can all have singular lemmas I think. There is a dialect difference in general with "sports" (U.S.) vs. "sport" (UK) when referring to athletic competitions in general, though U.S. English still has expressions like "for sport". Probably safe to always strip the -s in the lemma. (Similarly, "math" (U.S.) vs. "maths" (UK).)

> Most lowercase words already seem okay, but... papers? maybe this goes to paper regardless of the context?

If used in the sense of 'documentation' (e.g., of one's immigration status), I think it is legitimate to keep the lemma plural: https://en-word.net/lemma/papers. If used in the sense of 'newspapers' or individual pieces of paper, then it should be singular.
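
The sense-dependent choice can be sketched as a lookup keyed on both the word and its sense. The sense labels below ("documentation", "lodging", "newspaper") are hypothetical shorthand, not identifiers from en-word.net, and `lemma_for_sense` is an illustrative name.

```python
# Hypothetical (word, sense) pairs whose lemma stays plural.
PLURAL_ONLY_SENSES = {
    ("papers", "documentation"),
    ("quarters", "lodging"),
}


def lemma_for_sense(form: str, sense: str) -> str:
    """Pick a lemma for a lowercase plural noun given a sense label."""
    word = form.lower()
    if (word, sense) in PLURAL_ONLY_SENSES:
        return word
    # Naive fallback: strip one final "s".
    return word[:-1] if word.endswith("s") else word


print(lemma_for_sense("papers", "documentation"))
print(lemma_for_sense("papers", "newspaper"))
```

In practice the sense is not annotated in the treebank, so a lookup like this only helps once an annotator has decided which reading applies.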

> Dominion Posts should be split into two tokens I assume. Do we do that kind of thing? Same with bislas, if so. Hancocks, Birdies, Bridies Levi`s - weird punctuation, but either way, should this be tokenized with a possessive?

> Levi`s - weird punctuation, but either way, should this be tokenized with a possessive?

Yeah, at least some of these look like they are actually possessive. Please open a separate issue or pull request for those.

@AngledLuffa
Contributor

AngledLuffa commented Apr 24, 2021 via email

nschneid pushed a commit that referenced this issue Apr 24, 2021
@nschneid
Contributor Author

Big improvement, thanks @AngledLuffa!
