
NER doesn't identify lowercase entities #701

Closed
bluefuzz01 opened this issue Dec 21, 2016 · 10 comments

bluefuzz01 commented Dec 21, 2016

As the title suggests, entities in lower case are not recognized as entities, and neither are entities in upper case. The model seems to only recognize entities in title/proper case:

e.g. United States, but not united states or UNITED STATES

Are there any plans to improve detection for these instances? Has anyone attempted this problem yet? If so, what did you do to deal with these cases?

Thanks!

Your Environment

  • Operating System: Windows 7
  • Python Version Used: 2.7.12
  • spaCy Version Used: 1.4.0 (1.6.0 as of Jan 20, 2017)
@honnibal
Member

Hi,

We're working on NER models that are less case-sensitive, but in the meantime there are a few ways to exert rule-based control over the NER to fix these cases. For single tokens, you could use the tokenizer exceptions as follows:

>>> import spacy
>>> from spacy.attrs import ORTH, LEMMA, TAG, ENT_TYPE, ENT_IOB
>>> nlp = spacy.load('en')
>>> nlp.tokenizer.add_special_case('india', [{ORTH: 'india', LEMMA: 'India', TAG: 'NNP', ENT_TYPE: 'GPE', ENT_IOB: 3}])
>>> doc = nlp(u'there are many innovative companies in india.')
>>> [(w.text, w.tag_, w.ent_type_) for w in doc]
[('there', 'EX', ''), ('are', 'VBP', ''), ('many', 'JJ', ''), ('innovative', 'JJ', ''), ('companies', 'NNS', ''), ('in', 'IN', ''), ('india', 'NNP', 'GPE'), ('.', '.', '')]

You can read more about the tokenizer exceptions here: https://spacy.io/docs/usage/customizing-tokenizer

The tokenizer exceptions solution works well for single words, but doesn't help you with something like 'south korea'. For that you could use the rule matcher: https://spacy.io/docs/usage/rule-based-matching . Remember to add an on_match callback to actually assign the entities --- the matcher itself just identifies the spans; you still need to set the attributes.
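
For 'south korea' that might look roughly like this. This is only a sketch against the 1.x Matcher API; the exact signatures of add_entity, the on_match callback and the doc.ents setter vary between versions, so check the docs above for your version:

>>> from spacy.attrs import LOWER
>>> from spacy.matcher import Matcher
>>> def mark_entity(matcher, doc, i, matches):
...     # the matcher only identifies the span; assign the entity ourselves
...     ent_id, label, start, end = matches[i]
...     doc.ents = list(doc.ents) + [(label, start, end)]
>>> matcher = Matcher(nlp.vocab)
>>> matcher.add_entity('SouthKorea', on_match=mark_entity)
>>> matcher.add_pattern('SouthKorea', [{LOWER: 'south'}, {LOWER: 'korea'}], label='GPE')
>>> doc = nlp(u'i would love to visit south korea.')
>>> matcher(doc)
>>> [(ent.text, ent.label_) for ent in doc.ents]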

The problem in general is that both the tagger and entity recogniser make use of several feature functions that are case sensitive. This is good in general, but can be problematic for certain text types. Here's a suggestion I've been thinking about for a while, but haven't played with yet. It probably takes a little bit of tuning.

The relevant feature functions ask about the word's "shape", whether it's upper case, whether it's lower case, and its distributional similarity cluster. We can redefine the values of these features for specific words, and thereby trick the models into making a different decision. To do this, first look up the word in spaCy's vocabulary, to get the relevant Lexeme object:

>>> india = nlp.vocab[u'india']
>>> India = nlp.vocab[u'India']
>>> india.is_lower = India.is_lower
>>> india.shape = India.shape
>>> india.is_upper = India.is_upper
>>> india.cluster = India.cluster

For a more systematic approach, we can find all words that are usually title-cased:

probs = {w.prob: w.orth for w in nlp.vocab}
usually_titled = [w for w in nlp.vocab if w.is_title and probs.get(w.lower, -10000) < probs.get(w.orth, -10000)]

You probably want some margin on the probability, but for now we'll just take everything that's more common in title-case than in lower-case.
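
For example, with the probs dict keyed by orth ID (as corrected further down the thread), a margin might look like this; the value of MARGIN is a guess and needs tuning:

>>> MARGIN = 2.0  # require the title-cased form to be 2 log-prob units more probable
>>> probs = {w.orth: w.prob for w in nlp.vocab}
>>> usually_titled = [w for w in nlp.vocab
...                   if w.is_title and probs.get(w.lower, -10000) + MARGIN < probs.get(w.orth, -10000)]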

Now we iterate over these usually titled words, look up the lower-case version, and rewrite the features:

>>> for lex in usually_titled:
...     lower = nlp.vocab[lex.lower]
...     lower.shape = lex.shape
...     lower.is_title = lex.is_title
...     lower.cluster = lex.cluster
...     lower.is_lower = lex.is_lower

At first glance, this appears to work:

>>> doc = nlp(u'south korea is a state in asia.')
>>> for word in doc:
...   print(word.text, word.tag_, word.ent_type_)

If you give this a try, please let us all know how you go :)

@ines ines added the 🌙 nightly Discussion and contributions related to nightly builds label Dec 23, 2016
@bluefuzz01
Author

Thanks for such a great reply. I got re-tasked to another project temporarily but will be coming back to this soon. I'm also tempted to see how many false positives I'd get if I simply title-cased a query before passing it to the NER.
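
Something like this, just to see what happens; note that .title() also capitalizes ordinary words like 'show', which is exactly where the false positives would come from:

>>> doc = nlp(u'show me flights to united states'.title())
>>> [(ent.text, ent.label_) for ent in doc.ents]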

@bluefuzz01
Author

@honnibal
So I'm back to working on this, and thanks again. I like how you gave a few different methods. It really helps, especially for learning the capabilities of the API.

Regarding the last option, in short: we're comparing the smoothed log probability estimates of each title-cased word and its lower-case version in the vocab. If the probability of the lower-case version is less than that of the title-case version, we assume the word is more likely to be title-cased. Next, we update the token attributes relevant for NER classification on the lower-case version to match the title-case version, so the NER will think it's an entity. Neat! I'll have to get back to you on how well this works in my domain.

Can you tell me more about the smoothed log probability estimate? I see how it is defined as a property in the lexeme code, but I'm interested in knowing how it is calculated. Couldn't find that part.

A couple of minor code changes:

probs = {w.orth: w.prob for w in nlp.vocab}
usually_titled = [w for w in nlp.vocab if w.is_title and probs.get(w.lower, -10000) < probs.get(w.orth, -10000)]

for lex in usually_titled:
    lower = nlp.vocab[lex.lower]
    lower.shape = lex.shape
    lower.is_title = lex.is_title
    lower.cluster = lex.cluster
    lower.is_lower = lex.is_lower

doc = nlp(u'south korea is a state in asia.')
for word in doc:
    print(word.text, word.tag_, word.ent_type_)


ghost commented Feb 7, 2017

Finding the probability of the lower-case form relative to the title-case form, and then updating the token attributes to mark it as an entity, makes sense, but it fails when generalised over a big set of data, like names of persons. Would it be fine to train a model on a training set that includes the same set of lines in lower, upper, and title case?


honnibal commented Feb 7, 2017

@Spawnakshay If you have the training data yourself, then yes, forcing it to lower-case makes sense. The complication is that I can't ship you the training data I'm using, because of licensing constraints.
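
If you do train on your own data, a minimal sketch of that kind of augmentation, assuming examples are (text, entity_offsets) pairs with character-based offsets (which are unchanged by case changes for ASCII text):

def augment_case(examples):
    # examples: iterable of (text, entity_offsets) pairs, where
    # entity_offsets are (start_char, end_char, label) tuples
    for text, entities in examples:
        for variant in (text, text.lower(), text.upper(), text.title()):
            yield variant, entities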

@bluefuzz01: The log probability was estimated from counts over the Reddit comment corpus, 2009-2015 (~80b tokens), smoothed using Simple Good-Turing estimation (Gale & Sampson's paper "Good-Turing Frequency Estimation Without Tears"). The smoothing implementation is in the preshed.counter module.
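
For intuition, the Turing estimate adjusts the count of a word seen r times to r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct words seen exactly r times, and the log probability is then log(r* / N). An illustrative sketch, not preshed's actual implementation (which also smooths the N_r values):

import math
from collections import Counter

def turing_log_probs(counts):
    # counts: word -> raw frequency r
    freq_of_freq = Counter(counts.values())  # r -> N_r
    total = float(sum(counts.values()))
    log_probs = {}
    for word, r in counts.items():
        n_r1 = freq_of_freq.get(r + 1, 0)
        # the "simple" variant fits a log-log regression to the N_r values
        # so N_{r+1} is never zero; here we just fall back to the raw count
        r_star = (r + 1) * n_r1 / float(freq_of_freq[r]) if n_r1 else r
        log_probs[word] = math.log(r_star / total)
    return log_probs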


ines commented Apr 16, 2017

The new version 1.8.0 comes with bug fixes to the NER training procedure and a new save_to_directory() method. It should now be much easier to update the models yourself to fix the errors that occur on your data.
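
For example, a condensed sketch along the lines of the v1 training docs; the example text, entity offsets and output path are made up:

import spacy
from spacy.gold import GoldParse

nlp = spacy.load('en')
# entities as (start_char, end_char, label) offsets into the raw text
train_data = [(u'show me flights to south korea', [(19, 30, 'GPE')])]

for itn in range(5):
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.tagger(doc)  # the entity recognizer uses POS features
        nlp.entity.update(doc, gold)

nlp.entity.model.end_training()
nlp.save_to_directory('/path/to/model')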

We've also updated the docs with more information on training in general and on NER training in particular.

I hope this helps!

@ines ines closed this as completed Apr 16, 2017
@ines ines removed the 🌙 nightly Discussion and contributions related to nightly builds label May 7, 2017

arjunmenon commented Jun 11, 2017

To anyone who checks this part of the issue tracker: one easy way to mitigate this is to run your poorly-formatted text through a truecaser first, then apply the NER.
I have found it to be highly effective.
Maybe the spaCy guys, @ines, @honnibal, can add this to their core. More industrial strength. ;)
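
A minimal sketch of that pipeline; the truecase() helper here is a crude, hypothetical stand-in (real truecasers use sequence models), reusing the vocab-probability idea from earlier in the thread:

import spacy

nlp = spacy.load('en')

def truecase(text):
    # title-case a word when spaCy's vocab says the title-cased form is
    # more probable; assumes unseen forms get a low default probability
    words = []
    for w in text.split():
        if nlp.vocab[w.title()].prob > nlp.vocab[w.lower()].prob:
            words.append(w.title())
        else:
            words.append(w)
    return ' '.join(words)

doc = nlp(truecase(u'show me flights to south korea'))
print([(ent.text, ent.label_) for ent in doc.ents])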


arezae commented Sep 16, 2017

A truecaser seems like a good solution for the case issue in NER with spaCy, but the model is big and might not be good for real-time applications. @arjunmenon, do you have a smaller (but less accurate) model?

@arjunmenon

Hey @arezae
You can simply copy-paste 4-5 paragraphs of text from a wiki, a news site, or whatever category of text represents your data.
It will still work very well.
But if you are facing issues there, let me know what kind of model you want and I will help you out.

PS - sorry for getting back late; I couldn't keep track of this issue.


lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018