Overzealous lemmatisation of -ss nouns #903

Closed
adam-ra opened this issue Mar 22, 2017 · 3 comments
Labels
bug Bugs and behaviour differing from documentation help wanted (easy) Contributions welcome! (also suited for spaCy beginners)

Comments

adam-ra commented Mar 22, 2017

The final -s is stripped even though the tag assigned is a singular noun (NN).
Some examples: sleepiness, incompleteness, loss, ass (unless they are recognised as proper nouns, which happens often if they are sentence-first).

A similar thing happens to nouns with other suffixes, e.g. anus → anu.

Seen in spaCy 1.7.2, model en_depent_web_md-1.2.1.

@honnibal
Member

Damn. I know exactly what must have caused this :(

There's a base-form check in the lemmatizer: if a word is listed as a base form, it shouldn't be lemmatized. Obviously this check is broken for nouns.
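To illustrate where such a check sits, here is a minimal sketch in plain Python. It is not spaCy's code: `NOUN_RULES`, `strip_suffix` and `lemmatize_noun` are hypothetical names, and the rule table is a toy one chosen only to reproduce the examples from the report.

```python
# Toy suffix rules as (suffix, replacement) pairs; not spaCy's real rule tables.
NOUN_RULES = [("ies", "y"), ("ses", "s"), ("s", "")]


def strip_suffix(word, rules=NOUN_RULES):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def lemmatize_noun(word, is_base_form):
    """Skip the suffix rules entirely when the word is already a base form."""
    if is_base_form:
        return word
    return strip_suffix(word)


# Check skipped (the reported behaviour): singular nouns lose their final -s.
print(lemmatize_noun("loss", is_base_form=False))    # los
print(lemmatize_noun("anus", is_base_form=False))    # anu
# Check applied: singular nouns pass through, plurals are still reduced.
print(lemmatize_noun("loss", is_base_form=True))     # loss
print(lemmatize_noun("losses", is_base_form=False))  # loss
```

If the check never fires for nouns, every call behaves like the first two lines, which matches the symptoms described above.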

@honnibal honnibal added bug Bugs and behaviour differing from documentation help wanted (easy) Contributions welcome! (also suited for spaCy beginners) labels Mar 22, 2017
@honnibal
Member

Here: https://github.com/explosion/spaCy/blob/master/spacy/lemmatizer.py#L52

This looks up the enum symbols for the verb forms, but misses the enum symbols for the nouns. We just need to work out which morphological features indicate that a noun is a base form, and list them here.
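A rough sketch of what listing those features could look like, assuming the check is driven by a per-POS set of feature symbols. The `Sym` enum, its numeric IDs and `BASE_FORM_SYMBOLS` are made up for illustration and do not mirror the linked file.

```python
from enum import IntEnum


class Sym(IntEnum):
    """Hypothetical feature symbols; the numeric IDs are arbitrary."""
    VerbForm_inf = 1
    VerbForm_fin = 2
    Number_sing = 3
    Number_plur = 4


# Features that mark a word as already being in its base form, per part of speech.
BASE_FORM_SYMBOLS = {
    "verb": frozenset({Sym.VerbForm_inf}),
    "noun": frozenset({Sym.Number_sing}),  # the kind of entry described as missing
}


def is_base_form(pos, features):
    """Return True if any feature of the token marks it as a base form for this POS."""
    return bool(BASE_FORM_SYMBOLS.get(pos, frozenset()) & set(features))
```

With only the `"verb"` entry present, `is_base_form("noun", [Sym.Number_sing])` would return `False` and singular nouns would fall through to the suffix rules.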

We also need a regression test. I've got meetings today and most of tomorrow, so I'm hoping someone else can get the fix up? 🙇‍♂️
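A regression test for this could be sketched along these lines, using the words from the report. `lemmatize_noun` here is a hypothetical stand-in so the snippet runs on its own; a real test would call the actual lemmatizer instead.

```python
# Words reported in the issue, all tagged as singular nouns (NN).
SINGULAR_NOUNS = ["sleepiness", "incompleteness", "loss", "ass", "anus"]


def lemmatize_noun(word, number):
    """Hypothetical stand-in for the lemmatizer under test."""
    if number == "sing":
        # A singular noun is already a base form: leave it untouched.
        return word
    for suffix, replacement in (("ses", "s"), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def test_issue903_singular_nouns_keep_final_s():
    for word in SINGULAR_NOUNS:
        assert lemmatize_noun(word, "sing") == word


def test_issue903_plurals_still_lemmatized():
    # Guard against "fixing" the bug by disabling noun lemmatisation altogether.
    assert lemmatize_noun("losses", "plur") == "loss"
```

The second test is there so that a fix cannot pass simply by never lemmatising nouns.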

honnibal added a commit that referenced this issue Mar 25, 2017
The morphology class was calling the lemmatizer inconsistently,
with some string-valued attributes. This caused Issue #903.
lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018