Lemma_ for "I" returns weird value: -PRON- #962

ericzhao28 opened this Issue Apr 7, 2017 · 11 comments


7 participants

ericzhao28 commented Apr 7, 2017


I noticed something weird when finding the lemma_ of tokens.
When I find the lemma_ for the token for 'cakes': nlp("cakes")[0].lemma_, I get what is expected: 'cake'.
The same thing applies for nlp("i")[0].lemma_ which gives 'i'. However, I get some weird behavior when I use an uppercase "I", as in "I am hungry".

>>> import spacy
>>> nlp = spacy.load('en')
>>> print(nlp("I")[0].lemma_)
-PRON-

I'm not sure if this is intended behavior, or a bug. If it's a bug, is this something that's been encountered before?

I'm running spacy 1.7.3 on osx.

  • spaCy version: 1.7.3
  • Platform: Darwin-16.4.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en


f11r commented Apr 7, 2017

This is expected behavior. See #906 and #898 (comment).

@honnibal I think the amount of confusion/problems caused by this (#952, #898, #906) warrants reconsidering this decision for the 2.0 release. The Universal Dependencies project seems to go with "I" as the lemma (taken from

2	I	I	PRON	PRP	Case=Nom|Number=Sing|Person=1|PronType=Prs	4	nsubj	_	_

@honnibal honnibal added the performance label Apr 7, 2017




honnibal commented Apr 7, 2017

@f11r You're probably right.

The behaviour here is inconsistent though --- so there's a mistake either way.



adam-ra commented Apr 12, 2017

I'll repost my argument against '-PRON-' lemmas here to make it visible to other interested participants: lemmas should arguably be part of the language.
I'm not a lexicographer or a linguist, but looking at the definitions, I'm almost certain that this is the case. There is a practical reason too: lemmatisation may be used directly for looking up items in external lexical resources. Using an artificial lemma is a guarantee that nothing will be found.
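
To make the look-up concern concrete, here is a minimal sketch. The `LEXICON` dictionary and the `lookup` helper are hypothetical stand-ins for an external lexical resource, not part of spaCy, and the entries are illustrative only.

```python
# Hypothetical toy lexicon standing in for an external lexical resource.
# The entries are illustrative only.
LEXICON = {
    "i": "first-person singular pronoun",
    "cake": "a sweet baked food",
}


def lookup(lemma):
    """Return the lexicon entry for a lemma, or None if there is no entry."""
    return LEXICON.get(lemma.lower())


print(lookup("cake"))    # found: a real word form
print(lookup("I"))       # found: a real word form
print(lookup("-PRON-"))  # None: the placeholder is not a word of the language
```

Any resource keyed on real word forms would miss in the same way unless the placeholder is mapped back to a word first.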




honnibal commented Apr 13, 2017

The look-up argument is decisive: the -PRON- lemma will be reversed in spaCy 2.

It sucks to change this, but it's better to be correct going forward.

Thanks @adam-ra for your input on this



adam-ra commented Apr 13, 2017

@honnibal I guess it's never easy, any decision will make some users happy and upset others. But you are the benevolent dictator here ;)
Thanks for the discussions and making it all transparent!

@ines ines added this to 📌 To Do in 💫 spaCy v2.0 Apr 16, 2017

@ines ines moved this from 📌 To Do to 💡 Idea in 💫 spaCy v2.0 May 9, 2017

@ines ines added this to Lemmatizer in 💫 spaCy v2.0 stable Sep 14, 2017

@honnibal honnibal removed this from Lemmatizer in 💫 spaCy v2.0 stable Oct 11, 2017

@ines ines removed this from 💡 Idea in 💫 spaCy v2.0 Oct 25, 2017



crystosis commented Oct 31, 2017

@ines So, is this still considered a bug to be fixed in a 2.x release?



adam-ra commented Nov 3, 2017

@crystosis en_core_web_sm-2.0.0a7 still produces -PRON- lemmas




honnibal commented Nov 4, 2017

@crystosis @adam-ra

We've really gone back and forth on this (as you can see from the issue being moved around on our board...)

The thing is, all the alternatives really are worse, especially when you get to contractions and fused tokens. One alternative would be to have each pronoun be its own lemma...But then in the Universal Dependencies data, we get fused tokens where there's only one character for the pronoun. It's really not nice to have no lemma for these, but often getting the correct lemma would require a very difficult decision about the case, gender or other features of the word.

The other consideration is that we're really trying to have as few distinct types of changes in v2 as we can. The models are different, and so is loading and training, and so are the pipelines. Going from 0 changes to the annotation scheme to "just one" change seems quite undesirable. It's another type of thing for people to think about when they're upgrading.

So: lacking a better alternative, we prefer to keep the much unloved -PRON- lemma in v2. I'm sorry we haven't communicated clearly on this.



nateGeorge commented Nov 6, 2017

I agree with @adam-ra, but I guess as a workaround you can do [w.lemma_ if w.lemma_ != '-PRON-' else w.lower_ for w in d], where d is the object returned by nlp().
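
A minimal sketch of that workaround as a reusable function follows. To keep it runnable without a model installed, it uses a hypothetical `FakeToken` stand-in that exposes only `lemma_` and `lower_`. The helper name `lemmas_with_pronoun_fallback` is likewise made up for illustration, and the lemma values are hand-written rather than produced by spaCy.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, exposing only the two
# attributes the workaround needs.
FakeToken = namedtuple("FakeToken", ["lemma_", "lower_"])

PRON_LEMMA = "-PRON-"


def lemmas_with_pronoun_fallback(doc):
    """Return one lemma per token, substituting the lowercased
    token text wherever the lemma is the -PRON- placeholder."""
    return [tok.lower_ if tok.lemma_ == PRON_LEMMA else tok.lemma_ for tok in doc]


# Hand-written values for "I am hungry", for illustration only.
doc = [
    FakeToken("-PRON-", "i"),
    FakeToken("be", "am"),
    FakeToken("hungry", "hungry"),
]
print(lemmas_with_pronoun_fallback(doc))  # ['i', 'be', 'hungry']
```

With a real pipeline, the same function should accept the result of nlp("I am hungry") directly, since it only reads `lemma_` and `lower_`. Note that the fallback only lowercases the surface form, so "me" and "I" still end up as different strings.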



adam-ra commented Nov 6, 2017

@honnibal I understand the rationale and respect your decision. It's not a significant practical fuss for me either.

Perhaps it's worth mentioning that if you decide to support more languages, issues like this will crop up, and some of them will have a much broader scope than just pronouns. For instance, in most (I guess all) Slavic languages adjectives inflect for gender, and if you want them to have lemmas, you need to arbitrarily select one form (this patriarchal world traditionally prefers masculine forms).



lock bot commented Jun 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018
