
NER doesn't identify lowercase entities #701

Closed
bluefuzz01 opened this issue Dec 21, 2016 · 10 comments

bluefuzz01 commented Dec 21, 2016

As the title suggests, entities in lower case are not recognized as entities, and neither are entities in upper case. The model seems to only recognize entities in title/proper case:

e.g. United States, but not united states or UNITED STATES

Are there any plans to improve detection for these instances? Has anyone attempted this problem yet? If so, what did you do to deal with these cases?

Thanks!

Your Environment

  • Operating System: Windows 7
  • Python Version Used: 2.7.12
  • spaCy Version Used: 1.4.0 (1.6.0 as of Jan 20, 2017)
@honnibal
Member

Hi,

We're working on NER models that are less case-sensitive, but in the meantime there are a few ways to exert rule-based control over the NER to fix these cases. For single tokens, you could use the tokenizer exceptions as follows:

>>> import spacy
>>> from spacy.attrs import ORTH, LEMMA, TAG, ENT_TYPE, ENT_IOB
>>> nlp = spacy.load('en')
>>> nlp.tokenizer.add_special_case('india', [{ORTH: 'india', LEMMA: 'India', TAG: 'NNP', ENT_TYPE: 'GPE', ENT_IOB: 3}])
>>> doc = nlp(u'there are many innovative companies in india.')
>>> [(w.text, w.tag_, w.ent_type_) for w in doc]
[('there', 'EX', ''), ('are', 'VBP', ''), ('many', 'JJ', ''), ('innovative', 'JJ', ''), ('companies', 'NNS', ''), ('in', 'IN', ''), ('india', 'NNP', 'GPE'), ('.', '.', '')]

You can read more about the tokenizer exceptions here: https://spacy.io/docs/usage/customizing-tokenizer

The tokenizer exceptions solution works well for single words, but doesn't help you with something like 'south korea'. For that you could use the rule matcher: https://spacy.io/docs/usage/rule-based-matching . Remember to add an on_match callback to actually assign the entities --- the matcher itself just identifies the spans; you still need to set the attributes.
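
For 'south korea' that might look roughly like this. This is only a sketch against the 1.x Matcher API; the exact signatures of add_entity, the on_match callback and the doc.ents setter vary between versions, so check the docs above for your version:

>>> from spacy.attrs import LOWER
>>> from spacy.matcher import Matcher
>>> def mark_entity(matcher, doc, i, matches):
...     # the matcher only identifies the span; assign the entity ourselves
...     ent_id, label, start, end = matches[i]
...     doc.ents = list(doc.ents) + [(label, start, end)]
>>> matcher = Matcher(nlp.vocab)
>>> matcher.add_entity('SouthKorea', on_match=mark_entity)
>>> matcher.add_pattern('SouthKorea', [{LOWER: 'south'}, {LOWER: 'korea'}], label='GPE')
>>> doc = nlp(u'i would love to visit south korea.')
>>> matcher(doc)
>>> [(ent.text, ent.label_) for ent in doc.ents]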

The problem in general is that both the tagger and entity recogniser make use of several feature functions that are case sensitive. This is good in general, but can be problematic for certain text types. Here's a suggestion I've been thinking about for a while, but haven't played with yet. It probably takes a little bit of tuning.

The relevant feature functions ask about the word's "shape", whether it's upper case, whether it's lower case, and its distributional similarity cluster. We can redefine the values of these features for specific words, and thereby trick the models into making a different decision. To do this, first look up the word in spaCy's vocabulary, to get the relevant Lexeme object:

>>> india = nlp.vocab[u'india']
>>> India = nlp.vocab[u'India']
>>> india.is_lower = India.is_lower
>>> india.shape = India.shape
>>> india.is_upper = India.is_upper
>>> india.cluster = India.cluster

For a more systematic approach, we can find all words that are usually title-cased:

probs = {w.prob: w.orth for w in nlp.vocab}
usually_titled = [w for w in nlp.vocab if w.is_title and probs.get(w.lower, -10000) < probs.get(w.orth, -10000)]

You probably want some margin on the probability, but for now we'll just take everything that's more common in title-case than in lower-case.
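
For example, with the probs dict keyed by orth ID (as corrected further down the thread), a margin might look like this; the value of MARGIN is a guess and needs tuning:

>>> MARGIN = 2.0  # require the title-cased form to be 2 log-prob units more probable
>>> probs = {w.orth: w.prob for w in nlp.vocab}
>>> usually_titled = [w for w in nlp.vocab
...                   if w.is_title and probs.get(w.lower, -10000) + MARGIN < probs.get(w.orth, -10000)]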

Now we iterate over these usually titled words, look up the lower-case version, and rewrite the features:

>>> for lex in usually_titled:
...     lower = nlp.vocab[lex.lower]
...     lower.shape = lex.shape
...     lower.is_title = lex.is_title
...     lower.cluster = lex.cluster
...     lower.is_lower = lex.is_lower

At first glance, this appears to work:

>>> doc = nlp(u'south korea is a state in asia.')
>>> for word in doc:
...   print(word.text, word.tag_, word.ent_type_)

If you give this a try, please let us all know how you go :)

@ines ines added the 🌙 nightly Discussion and contributions related to nightly builds label Dec 23, 2016
@bluefuzz01
Author

Thanks for such a great reply. I got re-tasked to another project temporarily but will be coming back to this soon. I'm also tempted to see how many false positives I'd get if I simply title-cased a query before passing it to the NER.
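
Something like this, just to see what happens; note that .title() also capitalizes ordinary words like 'show', which is exactly where the false positives would come from:

>>> doc = nlp(u'show me flights to united states'.title())
>>> [(ent.text, ent.label_) for ent in doc.ents]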

@bluefuzz01
Author

@honnibal
So I'm back to working on this, and thanks again. I like how you gave a few different methods. It really helps, especially for learning the capabilities of the API.

Regarding the last option, in short: we're comparing the smoothed log probability estimates of each title-cased word and its lower-case version in the vocab. If the probability of the lower-case version is less than that of the title-case version, we assume the word is more likely to be title-cased. Next, we update the token attributes relevant for NER classification on the lower-case version to match the title-case version, so the NER will think it's an entity. Neat! I'll have to get back to you on how well this works in my domain.

Can you tell me more about the smoothed log probability estimate? I see how it is defined as a property in the lexeme code, but I'm interested in knowing how it is calculated. Couldn't find that part.

A couple of minor code changes:

probs = {w.orth: w.prob for w in nlp.vocab}
usually_titled = [w for w in nlp.vocab if w.is_title and probs.get(w.lower, -10000) < probs.get(w.orth, -10000)]

for lex in usually_titled:
    lower = nlp.vocab[lex.lower]
    lower.shape = lex.shape
    lower.is_title = lex.is_title
    lower.cluster = lex.cluster
    lower.is_lower = lex.is_lower

doc = nlp(u'south korea is a state in asia.')
for word in doc:
    print(word.text, word.tag_, word.ent_type_)


ghost commented Feb 7, 2017

Finding the probability of the lower-case form relative to the title-case form, and then updating the token attributes to mark it as an entity, makes sense, but it fails when generalised over a big set of data, like names of persons. Would it be fine to train a model on a training set that includes the same set of lines in lower, upper, and title case?


honnibal commented Feb 7, 2017

@Spawnakshay If you have the training data yourself, then yes, forcing it to lower-case makes sense. The complication is that I can't ship you the training data I'm using, because of licensing constraints.
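
If you do train on your own data, a minimal sketch of that kind of augmentation, assuming examples are (text, entity_offsets) pairs with character-based offsets (which are unchanged by case changes for ASCII text):

def augment_case(examples):
    # examples: iterable of (text, entity_offsets) pairs, where
    # entity_offsets are (start_char, end_char, label) tuples
    for text, entities in examples:
        for variant in (text, text.lower(), text.upper(), text.title()):
            yield variant, entities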

@bluefuzz01: The log probability was estimated from counts over the Reddit comment corpus, 2009-2015 (~80b tokens), smoothed using Simple Good-Turing estimation (Gale & Sampson's paper "Good-Turing Frequency Estimation Without Tears"). The smoothing implementation is in the preshed.counter module.
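
For intuition, the Turing estimate adjusts the count of a word seen r times to r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct words seen exactly r times, and the log probability is then log(r* / N). An illustrative sketch, not preshed's actual implementation (which also smooths the N_r values):

import math
from collections import Counter

def turing_log_probs(counts):
    # counts: word -> raw frequency r
    freq_of_freq = Counter(counts.values())  # r -> N_r
    total = float(sum(counts.values()))
    log_probs = {}
    for word, r in counts.items():
        n_r1 = freq_of_freq.get(r + 1, 0)
        # the "simple" variant fits a log-log regression to the N_r values
        # so N_{r+1} is never zero; here we just fall back to the raw count
        r_star = (r + 1) * n_r1 / float(freq_of_freq[r]) if n_r1 else r
        log_probs[word] = math.log(r_star / total)
    return log_probs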


ines commented Apr 16, 2017

The new version 1.8.0 comes with bug fixes to the NER training procedure and a new save_to_directory() method. It should now be much easier to update the models yourself to fix the errors that occur on your data.
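
For example, a condensed sketch along the lines of the v1 training docs; the example text, entity offsets and output path are made up:

import spacy
from spacy.gold import GoldParse

nlp = spacy.load('en')
# entities as (start_char, end_char, label) offsets into the raw text
train_data = [(u'show me flights to south korea', [(19, 30, 'GPE')])]

for itn in range(5):
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.tagger(doc)  # the entity recognizer uses POS features
        nlp.entity.update(doc, gold)

nlp.entity.model.end_training()
nlp.save_to_directory('/path/to/model')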

We've also updated the docs with more information on training in general and on NER training in particular.

I hope this helps!

@ines ines closed this as completed Apr 16, 2017
@ines ines removed the 🌙 nightly Discussion and contributions related to nightly builds label May 7, 2017

arjunmenon commented Jun 11, 2017

To anyone who checks this part of the issue tracker: one easy way to mitigate this is to run your poorly-formatted text through a truecaser first, then apply the NER.
I have found it to be highly effective.
Maybe the spaCy guys, @ines, @honnibal, can add this to their core. More industrial strength. ;)
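
A minimal sketch of that pipeline; the truecase() helper here is a crude, hypothetical stand-in (real truecasers use sequence models), reusing the vocab-probability idea from earlier in the thread:

import spacy

nlp = spacy.load('en')

def truecase(text):
    # title-case a word when spaCy's vocab says the title-cased form is
    # more probable; assumes unseen forms get a low default probability
    words = []
    for w in text.split():
        if nlp.vocab[w.title()].prob > nlp.vocab[w.lower()].prob:
            words.append(w.title())
        else:
            words.append(w)
    return ' '.join(words)

doc = nlp(truecase(u'show me flights to south korea'))
print([(ent.text, ent.label_) for ent in doc.ents])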


arezae commented Sep 16, 2017

A truecaser seems like a good solution for the case issue in NER with spaCy, but the model is big and might not be good for real-time applications. @arjunmenon, do you have a smaller (but less accurate) model?

@arjunmenon

Hey @arezae
You can simply copy-paste 4-5 paragraphs of text from a wiki, a news site, or whatever category of text represents your data.
It will still work very well.
But if you are facing issues there, let me know what kind of model you want and I will help you out.

PS - sorry for getting back late; I couldn't keep track of this issue.


lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018