Lemma are not (always) lowercased in spacy 2.1 #3256

Closed
thomasopsomer opened this Issue Feb 11, 2019 · 5 comments

@thomasopsomer
Contributor

commented Feb 11, 2019

How to reproduce the behaviour

The behaviour of lemma_ changed between 2.0 and 2.1:

doc = nlp("Wells Fargo Outages Hit Online and Mobile Banking.")
[x.lemma_ for x in doc]
# ["Wells", "Fargo", "Outages", "hit", "Online", "and", "Mobile", "Banking"]

whereas in 2.0, every lemma was lowercased.

Your Environment

  • Operating System:
  • Python Version Used:
  • spaCy Version Used: spacy-nightly==2.1.0a6
  • Environment Information:
@theudas

commented Feb 11, 2019

The issue is this rule in lemmatizer.py:

    elif univ_pos in (PROPN, "PROPN"):
        return [string]
    else:
        return [string.lower()]

so every PROPN will not get lowercased.

for token in doc:
    print(token.lemma_, token.pos_, token.tag_)

Wells PROPN NNP
Fargo PROPN NNP
Outages PROPN NNPS
Hit PROPN NNP
Online PROPN NNP
and CCONJ CC
Mobile PROPN NNP
Banking PROPN NNP
. PUNCT .

So the question is, why are they all tagged as PROPN?
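For readers who want to see the effect of the quoted rule in isolation, here is a minimal standalone sketch. It is not spaCy's actual code: `casing_rule` is a hypothetical name, and the POS label is simplified to a plain string.

```python
# Standalone sketch of the casing rule quoted above; not spaCy's actual code.
# `casing_rule` is a hypothetical name and the POS label is a plain string.
PROPN = "PROPN"


def casing_rule(string, univ_pos):
    """Keep the original case for proper nouns, lowercase everything else."""
    if univ_pos == PROPN:
        return [string]
    return [string.lower()]


print(casing_rule("Wells", "PROPN"))  # ['Wells']
print(casing_rule("Hit", "VERB"))     # ['hit']
```

Under this rule the casing of the result depends entirely on the tag, so a capitalized headline word that is mis-tagged as PROPN keeps its capital letter.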

@thomasopsomer

Contributor Author

commented Feb 11, 2019

Hm, that's because the models are not very good on capitalized text, so in this case the tagger thinks almost all words are proper nouns. That isn't new, though; it was already the case in 2.0.x. What changed is the new 'PROPN' rule, which alters the lemmatization behaviour.

I don't know what the best solution is. I'm used to getting lowercased tokens when asking for lemmas, but maybe that's a bad habit :)

@honnibal

Member

commented Feb 17, 2019

In v2.1 we've been aiming for better compatibility with the Universal Dependencies data. In their scheme, the lemmas of proper nouns are capitalised, so we've switched over to preserving their case. I know this sort of change can be surprising. Sorry it wasn't communicated clearly.
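Code that relied on the 2.0 behaviour could lowercase the lemmas itself after the fact. The sketch below is one possible way to do that, not an official recommendation from the thread. `lowercased_lemmas` is a hypothetical helper, and the tokens are stand-in objects so the snippet runs without a model; with spaCy one would pass the `doc` instead.

```python
from types import SimpleNamespace


def lowercased_lemmas(tokens):
    """Return each token's lemma lowercased, regardless of its POS tag."""
    return [token.lemma_.lower() for token in tokens]


# Stand-in tokens that only mimic the `lemma_` attribute of spaCy tokens.
doc = [SimpleNamespace(lemma_=w) for w in ["Wells", "Fargo", "Outages", "hit"]]
print(lowercased_lemmas(doc))  # ['wells', 'fargo', 'outages', 'hit']
```

This discards the case information that v2.1 now preserves for proper nouns, so it only suits pipelines that treat lemmas as case-insensitive keys.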

@honnibal honnibal closed this Feb 17, 2019

@thomasopsomer

Contributor Author

commented Feb 21, 2019

Sounds right, thanks for the explanation :)

@lock

commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 23, 2019
