Lemmas are not (always) lowercased in spaCy 2.1 #3256
How to reproduce the behaviour
There is a change of behaviour with the lemmatizer in spaCy 2.1:

```python
doc = nlp("Wells Fargo Outages Hit Online and Mobile Banking.")
[x.lemma_ for x in doc]
# ["Wells", "Fargo", "Outages", "hit", "Online", "and", "Mobile", "Banking"]
```

whereas in 2.0, every lemma was lowercased.
The issue is this rule in lemmatizer.py:

so every PROPN will not get lowercased.
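For illustration, the v2.1 rule behaves roughly like this (a hypothetical sketch, not spaCy's actual implementation; the function name and signature are made up, and lowercasing stands in for the real rule/lookup lemmatization):

```python
def lemmatize(string, univ_pos):
    """Sketch of the v2.1 rule: proper-noun lemmas keep their casing."""
    if univ_pos == "PROPN":
        # v2.1: preserve capitalisation for proper nouns
        return string
    # other POS tags: lemmatized (lowercased) as in v2.0
    return string.lower()
```

So a sentence tagged almost entirely as `PROPN` comes back almost entirely unchanged, which is exactly the output shown above.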
```python
for token in doc:
    print(token.text, token.pos_, token.tag_)
# Wells PROPN NNP
# ...
```
So the question is, why are they all tagged as PROPN?
Hmm, it's because the models are not very good on capitalised text, so in this case the tagger thinks almost all the words are proper nouns. But that's not new; it was already the case in 2.0.x. However, this new rule for `PROPN` changes the behaviour of the lemmatization.
I don't know what the best solution is. I'm used to getting lowercased tokens when asking for lemmas, but maybe that's a bad habit :)
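If you want the old v2.0-style output back, one workaround (just a sketch; it post-processes the lemma strings rather than changing the lemmatizer) is to lowercase the lemmas downstream, e.g. `[t.lemma_.lower() for t in doc]` on a real `Doc`:

```python
# Lemmas as returned by v2.1 for the example sentence above
lemmas = ["Wells", "Fargo", "Outages", "hit", "Online", "and", "Mobile", "Banking"]

# Lowercase downstream to recover the v2.0-style behaviour
lowered = [lemma.lower() for lemma in lemmas]
print(lowered)
# ['wells', 'fargo', 'outages', 'hit', 'online', 'and', 'mobile', 'banking']
```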
In v2.1 we've been aiming for better compatibility with the Universal Dependencies data. In their scheme, for proper nouns the lemmas are capitalised --- so we've switched over to preserving them. I know this sort of change can be surprising. Sorry it wasn't communicated clearly.