
Lemma_ for "I" returns weird value: -PRON- #962

Closed
ericzhao28 opened this issue Apr 7, 2017 · 11 comments

Comments

Contributor

@ericzhao28 ericzhao28 commented Apr 7, 2017

Hey,

I noticed something weird when finding the lemma_ of tokens.
When I find the lemma_ for the token for 'cakes': nlp("cakes")[0].lemma_, I get what is expected: 'cake'.
The same applies to nlp("i")[0].lemma_, which gives 'i'. However, I get some weird behavior when I use an uppercase "I", as in "I am hungry".

>>> import spacy
>>> nlp = spacy.load('en')
>>> nlp("I")[0].lemma_
'-PRON-'

I'm not sure if this is intended behavior, or a bug. If it's a bug, is this something that's been encountered before?

I'm running spaCy 1.7.3 on macOS.

  • spaCy version: 1.7.3
  • Platform: Darwin-16.4.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en
Contributor

@f11r f11r commented Apr 7, 2017

This is expected behavior. See https://spacy.io/docs/api/annotation#lemmatization, #906 and #898 (comment).

@honnibal I think the amount of confusion/problems caused by this (#952, #898, #906) warrants reconsidering this decision for the 2.0 release. The Universal Dependencies project seems to go with "I" as the lemma (taken from https://raw.githubusercontent.com/UniversalDependencies/UD_English/master/en-ud-dev.conllu):

2	I	I	PRON	PRP	Case=Nom|Number=Sing|Person=1|PronType=Prs	4	nsubj	_	_
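For readers unfamiliar with the CoNLL-U line quoted above: it is tab-separated with ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), and the lemma is the third field. A minimal parse, in plain Python with no spaCy required:

```python
# The UD_English dev line cited above, with literal tabs between columns.
row = "2\tI\tI\tPRON\tPRP\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t4\tnsubj\t_\t_"

cols = row.split("\t")
form, lemma = cols[1], cols[2]
print(form, lemma)  # I I
```

This is the point being made: UD annotates the lemma of "I" as "I" itself, not an artificial placeholder.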
Member

@honnibal honnibal commented Apr 7, 2017

@f11r You're probably right.

The behaviour here is inconsistent though --- so there's a mistake either way.


@adam-ra adam-ra commented Apr 12, 2017

I'll repost my argument against '-PRON-' lemmas here to make it visible to other interested participants: lemmas should arguably be part of the language.
I'm not a lexicographer or a linguist, but looking at the definitions, I'm almost certain that this is the case. There are practical reasons too: lemmatisation may be used directly for looking up items in external lexical resources, and using an artificial lemma guarantees that nothing will be found.

Member

@honnibal honnibal commented Apr 13, 2017

The look-up argument is decisive: the -PRON- lemma will be reversed in spaCy 2.

It sucks to change this, but it's better to be correct going forward.

Thanks @adam-ra for your input on this


@adam-ra adam-ra commented Apr 13, 2017

@honnibal I guess it's never easy, any decision will make some users happy and upset others. But you are the benevolent dictator here ;)
Thanks for the discussions and making it all transparent!


@crystosis crystosis commented Oct 31, 2017

@ines So, is this still considered a bug to be fixed in a 2.x release ?


@adam-ra adam-ra commented Nov 3, 2017

@crystosis en_core_web_sm-2.0.0a7 still produces -PRON- lemmas

Member

@honnibal honnibal commented Nov 4, 2017

@crystosis @adam-ra

We've really gone back and forth on this (as you can see from the issue being moved around on our board...)

The thing is, all the alternatives really are worse, especially when you get to contractions and fused tokens. One alternative would be to have each pronoun be its own lemma. But then in the Universal Dependencies data, we get fused tokens where there's only one character for the pronoun. It's really not nice to have no lemma for these, but often getting the correct lemma would require a very difficult decision about the case, gender or other features of the word.

The other consideration is that we're really trying to have as few distinct types of changes in v2 as we can. The models are different, and so is loading and training, and so are the pipelines. Going from 0 changes to the annotation scheme to "just one" change seems quite undesirable. It's another type of thing for people to think about when they're upgrading.

So: lacking a better alternative, we prefer to keep the much unloved -PRON- lemma in v2. I'm sorry we haven't communicated clearly on this.


@nateGeorge nateGeorge commented Nov 6, 2017

I agree with @adam-ra, but as a workaround you can do w.lemma_ if w.lemma_ != '-PRON-' else w.lower_ for w in d, where d is a Doc (the object returned by calling nlp()).
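The workaround above can be packaged as a small helper. A minimal sketch: resolve_lemma and FakeToken are illustrative names, not spaCy API, and the named tuple only mimics the two Token attributes (.lemma_ and .lower_) the comprehension uses, so the example runs without loading a model:

```python
from collections import namedtuple

def resolve_lemma(token):
    """Return token.lemma_, substituting the lowercased surface form
    whenever the lemmatizer emits the artificial '-PRON-' lemma."""
    return token.lower_ if token.lemma_ == "-PRON-" else token.lemma_

# Stand-in for spacy.tokens.Token; a real Token exposes the same
# .lemma_ / .lower_ string attributes used here.
FakeToken = namedtuple("FakeToken", ["lemma_", "lower_"])

# "I am hungry" with spaCy 1.x/2.x-style lemmas.
doc = [FakeToken("-PRON-", "i"), FakeToken("be", "am"), FakeToken("hungry", "hungry")]
print([resolve_lemma(w) for w in doc])  # ['i', 'be', 'hungry']
```

With a real pipeline you would apply the same comprehension to the Doc returned by nlp(), since Token provides identical attributes.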


@adam-ra adam-ra commented Nov 6, 2017

@honnibal I understand the rationale and respect your decision. It's not a significant practical fuss for me either.

Perhaps it's worth mentioning that if you decide to support more languages, issues like this will crop up, and some of them will have a much broader scope than just pronouns. For instance, in most (I guess all) Slavic languages, adjectives inflect for gender, and if you want them to have lemmas, you need to arbitrarily select one form (this patriarchal world traditionally prefers masculine forms).


@lock lock bot commented Jun 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018