Lemma_ for "I" returns weird value: -PRON- #962
Comments
This is expected behavior. See https://spacy.io/docs/api/annotation#lemmatization, #906 and #898 (comment). @honnibal I think the amount of confusion/problems caused by this (#952, #898, #906) warrants reconsidering this decision for the 2.0 release. The Universal Dependencies project seems to go with "I" as the lemma (taken from https://raw.githubusercontent.com/UniversalDependencies/UD_English/master/en-ud-dev.conllu):
|
@f11r You're probably right. The behaviour here is inconsistent though --- so there's a mistake either way. |
I'll repost my argument against '-PRON-' lemmas here to make it visible to other interested participants: lemmas should arguably be part of the language. |
The look-up argument is decisive: the It sucks to change this, but it's better to be correct going forward. Thanks @adam-ra for your input on this |
@honnibal I guess it's never easy, any decision will make some users happy and upset others. But you are the benevolent dictator here ;) |
@ines So, is this still considered a bug to be fixed in a 2.x release? |
@crystosis en_core_web_sm-2.0.0a7 still produces |
We've really gone back and forth on this (as you can see from the issue being moved around on our board...) The thing is, all the alternatives really are worse, especially when you get to contractions and fused tokens. One alternative would be to have each pronoun be its own lemma... But then in the Universal Dependencies data, we get fused tokens where there's only one character for the pronoun. It's really not nice to have no lemma for these, but often getting the correct lemma would require a very difficult decision about the case, gender or other features of the word. The other consideration is that we're really trying to have as few distinct types of changes in v2 as we can. The models are different, and so is loading and training, and so are the pipelines. Going from 0 changes to the annotation scheme to "just one" change seems quite undesirable. It's another type of thing for people to think about when they're upgrading. So: lacking a better alternative, we prefer to keep the much unloved |
I agree with @adam-ra , but I guess as a workaround, you can do |
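The exact snippet from the comment above did not survive, so the following is only a minimal sketch of one possible workaround, not necessarily the one the commenter meant. It assumes you are happy to fall back to the lowercased surface form whenever the lemma is the `-PRON-` placeholder; the helper name `readable_lemma` is hypothetical and not part of spaCy.

```python
# Placeholder lemma that spaCy 1.x/2.x assigns to pronouns (as discussed in this thread).
PRON_PLACEHOLDER = "-PRON-"


def readable_lemma(lemma, text, placeholder=PRON_PLACEHOLDER):
    """Return `lemma`, or the lowercased surface form `text` when the
    lemma is the pronoun placeholder.

    Hypothetical helper for illustration; not a spaCy API.
    """
    if lemma == placeholder:
        return text.lower()
    return lemma


# Assumed usage with an already loaded pipeline `nlp`:
#   doc = nlp("I am hungry")
#   lemmas = [readable_lemma(tok.lemma_, tok.text) for tok in doc]
```

Falling back to the lowercased text keeps "I" and "i" consistent with each other, but it does not merge different forms of the same pronoun (for example "me" and "I"), so it is a cosmetic fix rather than real pronoun lemmatization.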
@honnibal I understand the rationale and respect your decision. It's not a significant practical fuss for me either. Perhaps it's worth mentioning that if you decide to support more languages, issues like this will crop up, and some of them will have a much broader scope than just pronouns. For instance, in most (I guess all) Slavic languages, adjectives inflect for gender, and if you want them to have lemmas, you need to arbitrarily select one form (this patriarchal world traditionally prefers masculine forms). |
Hey,
I noticed something weird when finding the lemma_ of tokens.
When I find the lemma_ for the token for 'cakes':
nlp("cakes")[0].lemma_
, I get what is expected: 'cake'. The same thing applies for
nlp("i")[0].lemma_
which gives 'i'. However, I get some weird behavior when I use an uppercase "I", as in "I am hungry". I'm not sure if this is intended behavior or a bug. If it's a bug, is this something that's been encountered before?
I'm running spaCy 1.7.3 on OS X.