Different ent_iob behavior after adding EntityRuler to pipeline #4267

Closed
jenojp opened this issue Sep 10, 2019 · 7 comments · Fixed by #4307
Labels
bug (Bugs and behaviour differing from documentation), feat / ner (Feature: Named Entity Recognizer)

Comments

@jenojp
Contributor

jenojp commented Sep 10, 2019

I'm not totally sure of the expected behavior, but after adding an EntityRuler to a pipeline, non-entity tokens get an .ent_iob value of 0 rather than the 2 they get when only the EntityRecognizer is used. This affects what doc.is_nered returns.

How to reproduce the behaviour

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
## ['tagger', 'parser', 'ner']

doc = nlp("fgfgdghgdh")
print(doc.is_nered)
## True

for token in doc:
    print(token.ent_iob)
## 2

# add an entity ruler and run again
ruler = EntityRuler(nlp)
patterns = [{"label":"SOFTWARE", "pattern":"spacy"}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
print(nlp.pipe_names)
## ['tagger', 'parser', 'ner', 'entity_ruler']

doc = nlp("fgfgdghgdh")
print(doc.is_nered)
## False

for token in doc:
    print(token.ent_iob)
## 0
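
For reference, spaCy documents the integer ent_iob codes as 0 (no entity tag set), 1 (inside an entity), 2 (outside an entity) and 3 (begins an entity). The sketch below is plain Python, not spaCy code. The helper `looks_nered` is hypothetical and assumes that is_nered is True as soon as any token carries a non-zero code, which matches the output above but is an assumption about the internals.

```python
# Integer ent_iob codes and their string forms, as documented by spaCy.
IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}


def looks_nered(ent_iob_values):
    """Hypothetical stand-in for doc.is_nered (assumption, not spaCy's code):
    True if at least one token has a non-missing entity tag."""
    return any(code != 0 for code in ent_iob_values)


print(looks_nered([2]))  # True: one token tagged "outside", as with only the NER
print(looks_nered([0]))  # False: one token with a missing tag, as after the EntityRuler
```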

Your Environment

Info about spaCy

  • spaCy version: 2.1.8
  • Platform: Darwin-18.7.0-x86_64-i386-64bit
  • Python version: 3.6.8
svlandeg added the bug, feat / matcher and feat / ner labels and removed the feat / matcher label on Sep 10, 2019
svlandeg added a commit to svlandeg/spaCy that referenced this issue Sep 10, 2019
@svlandeg
Member

Thanks for the report! I do think this is a bug. From the docs about EntityRuler:

If it’s added before the "ner" component, the entity recognizer will respect the existing entity spans and adjust its predictions around it. This can significantly improve accuracy in some cases. If it’s added after the "ner" component, the entity ruler will only add spans to the doc.ents if they don’t overlap with existing entities predicted by the model.

In this case it's added after the ner, so it should respect the NER annotations. Even if the NER finds no entities, I would still expect, as a user, that doc.is_nered stays True.
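
The "added after ner" rule quoted from the docs can be illustrated in plain Python. This is only a sketch of the idea, not the EntityRuler implementation: the function name is hypothetical, and spans are assumed to be (start, end, label) tuples with an exclusive end index.

```python
def add_non_overlapping(existing, candidates):
    """Keep all existing spans and add a candidate only if it shares
    no token with a span that has already been kept."""
    kept = list(existing)
    for start, end, label in candidates:
        # Two spans are disjoint if one ends before the other starts.
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


model_ents = [(0, 2, "ORG")]                          # spans predicted by the model
rule_ents = [(1, 3, "SOFTWARE"), (4, 5, "SOFTWARE")]  # spans matched by the rules
print(add_non_overlapping(model_ents, rule_ents))
# [(0, 2, 'ORG'), (4, 5, 'SOFTWARE')]
```

The first rule-based span overlaps the model's span on token 1 and is dropped; the second is added. Nothing in this rule requires touching the tags of tokens outside any span.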

@adrianeboyd
Contributor

Huh, setting doc.ents does this:

spaCy/spacy/tokens/doc.pyx

Lines 547 to 550 in 669a7d3

for i in range(self.length):
    self.c[i].ent_type = 0
    self.c[i].ent_kb_id = 0
    self.c[i].ent_iob = 0  # Means missing.

Maybe the resetting loop should preserve 2 in cases where it's 2? I'm not sure if there are some side effects I'm not thinking of...
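
A minimal sketch of that idea in plain Python, operating on a list of integer codes rather than on the Cython token structs. The function name and constants are hypothetical, and this is not the change that was eventually merged.

```python
MISSING, INSIDE, OUTSIDE, BEGIN = 0, 1, 2, 3


def reset_ent_iob(ent_iob_values):
    """Reset entity tags before new spans are written, but keep tokens
    that were already tagged OUTSIDE instead of marking them as missing."""
    return [OUTSIDE if code == OUTSIDE else MISSING for code in ent_iob_values]


print(reset_ent_iob([BEGIN, INSIDE, OUTSIDE, MISSING]))
# [0, 0, 2, 0]
```

Tokens that belonged to an old entity still become 0 here, so a doc whose tokens were all inside entities would lose every non-missing tag after the reset.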

@svlandeg
Member

svlandeg commented Sep 10, 2019

Ha, yea, I was just looking at the same. One approach would be to set the default ent_iob to 2 instead of 0 if doc.is_nered, see here, but that makes another unit test fail...
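
Roughly, that approach could look like the following plain-Python sketch. The function name is hypothetical, and the is_nered check is approximated as "any token has a non-zero code", which is an assumption rather than a copy of spaCy's logic.

```python
def reset_with_default(ent_iob_values):
    """Reset all entity tags, choosing the default from the previous state:
    2 (outside) if the doc already had entity annotations, else 0 (missing)."""
    was_nered = any(code != 0 for code in ent_iob_values)
    default = 2 if was_nered else 0
    return [default] * len(ent_iob_values)


print(reset_with_default([2, 2, 3, 1]))  # [2, 2, 2, 2]
print(reset_with_default([0, 0, 0]))     # [0, 0, 0]
```

With this default, a token outside the newly set spans reads as "O" rather than as missing whenever the doc was previously annotated.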

@svlandeg
Member

@honnibal: I guess this discussion has been had in the past... So is it in fact desired behaviour that setting doc.ents resets everything to 0 (empty string)? What about the original case discussed in this issue: would you expect token.ent_iob to be reset to 0 (empty string) there as well?

@adrianeboyd
Contributor

I don't think it's possible to have the behavior in test_doc_add_entities_set_ents_iob and the correct is_nered, at least not without major changes to is_nered. I'm not sure why we would want the behavior in test_doc_add_entities_set_ents_iob, though?

@svlandeg
Member

OK, it turns out this requires a more in-depth analysis and solution ;-) Will look into it ASAP.
