
EntityRuler causes NER entities to go missing #3775

Closed
hahuang65 opened this issue May 23, 2019 · 3 comments

@hahuang65 commented May 23, 2019

How to reproduce the behaviour

I've got a simple NLP setup with a single pattern for the EntityRuler. It looks for a series of capitalized words, followed by a corporate stopword.

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)

capitalized_word = "([A-Z][a-z]+)"
corporate_stopwords = "([Ii]nc|[Cc]orp|[Cc]o)"

patterns = [
    {"label": "COMPANY", "pattern": [{"TEXT": {"REGEX": capitalized_word}, "OP": "+"}, {"TEXT": {"REGEX": corporate_stopwords}}]}
]
ruler.add_patterns(patterns)

nlp.add_pipe(ruler, before="ner")
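As an aside, the two regexes above can be sanity-checked with plain `re` before wiring them into the pipeline. This is just an illustrative sketch: spaCy's token-level `REGEX` operator behaves like an unanchored search against each token's text, which is why `"Inc."` matches, and also why an unrelated word like `"conjunction"` satisfies the stopword pattern.

```python
import re

capitalized_word = r"([A-Z][a-z]+)"
corporate_stopwords = r"([Ii]nc|[Cc]orp|[Cc]o)"

# The token-level REGEX match is unanchored, like re.search on the token text:
assert re.search(capitalized_word, "Acme")
assert re.search(corporate_stopwords, "Inc.")
assert re.search(corporate_stopwords, "corp")

# Caveat: an unanchored pattern also fires on ordinary words containing "co",
# e.g. "conjunction" - anchoring with ^...$ would make the pattern stricter.
assert re.search(corporate_stopwords, "conjunction")
assert not re.search(corporate_stopwords, "announced")
```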

If I run this piece of text against it, it looks like my pattern is working:

doc = nlp("Acme Inc. have announced a new product in conjunction with FooBar Baz corp. This rounds out the market segment created by Stark Co., a company started by Tony Stark.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Acme Inc.', 'COMPANY'), ('FooBar Baz corp', 'COMPANY'), ('Stark Co.', 'COMPANY')]

Except for one thing: if I remove the EntityRuler, the same doc has a PERSON entity that's missing when the EntityRuler is used:

[('Acme Inc.', 'ORG'), ('FooBar Baz', 'ORG'), ('Stark Co.', 'ORG'), ('Tony Stark', 'PERSON')]

Environment

  • spaCy version: 2.1.4
  • Platform: Darwin-17.7.0-x86_64-i386-64bit
  • Python version: 3.7.3
@ines (Member) commented May 24, 2019

Thanks for the report!

One possible explanation here is that the presence of a new pre-defined entity changes the predictions in a way that "Tony Stark" is no longer predicted as a person. When you add the entity ruler before the statistical entity recognizer, it will still predict the remaining entities, given what's already there. So in theory, it's possible that the interpretation of "Tony Stark" as ["O", "O"] is now more likely than the interpretation as ["B-PERSON", "L-PERSON"].

But it might still be worth investigating, to make sure nothing else is going on here.

Btw, not sure how well this generalises for your use case, but setting overwrite_ents=True on the EntityRuler and adding it after="ner" seems to produce the correct result for me:

[('Acme Inc.', 'COMPANY'), ('FooBar Baz corp', 'COMPANY'), ('Stark Co.', 'COMPANY'), ('Tony Stark', 'PERSON')]

Just make sure to call nlp.vocab.strings.add("COMPANY") so that the entity label COMPANY is in the string store.
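The overwrite behaviour can be sketched without a statistical model, using a blank pipeline. This is an illustrative example only: the sentence and pattern are made up, the doc's PERSON span is set by hand to stand in for what an earlier component would have predicted, and the `EntityRuler(nlp, overwrite_ents=True)` constructor shown is the v2-era API.

```python
import spacy
from spacy.pipeline import EntityRuler
from spacy.tokens import Span

nlp = spacy.blank("en")
nlp.vocab.strings.add("COMPANY")

# overwrite_ents=True lets the ruler's matches replace overlapping
# entities that are already set on the doc.
ruler = EntityRuler(nlp, overwrite_ents=True)
ruler.add_patterns([
    {"label": "COMPANY", "pattern": [{"LOWER": "stark"}, {"TEXT": {"REGEX": "^[Cc]o"}}]}
])

doc = nlp("Stark Co. was founded recently.")
# Simulate an earlier component having tagged just "Stark" as a PERSON:
doc.ents = [Span(doc, 0, 1, label="PERSON")]

# The ruler's longer COMPANY match overwrites the overlapping PERSON span
# instead of being blocked by it:
doc = ruler(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

With overwrite_ents=False (the default), the pre-existing PERSON span would block the overlapping COMPANY match instead.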

@hahuang65 (Author) commented May 24, 2019

Thanks for such a quick response! The workaround you suggested seems to work fine for us, so thank you for suggesting that.

There was one thing you said that I didn't quite understand:

So in theory, it's possible that the interpretation of "Tony Stark" as ["O", "O"] is now more likely than the interpretation as ["B-PERSON", "L-PERSON"].

Could you elaborate a little on that? I'm not particularly sure what ["O", "O"] and ["B-PERSON", "L-PERSON"] mean.

@ines (Member) commented May 30, 2019

Could you elaborate a little on that? I'm not particularly sure what ["O", "O"] and ["B-PERSON", "L-PERSON"] mean.

Sorry if this was unclear. Under the hood, named entities are represented using the token-based BILUO scheme where B = beginning, I = inside an entity, L = last token of an entity, U = unit (single-token entity) and O = outside an entity. For instance, if "Tony Stark" is a PERSON, the sentence "I am Tony Stark" could be represented as ["O", "O", "B-PERSON", "L-PERSON"].
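The encoding can be reproduced with spaCy's offsets-to-BILUO helper. A small sketch, with a fallback import because the helper moved between major versions (spacy.gold in v2, spacy.training in v3+):

```python
import spacy

try:
    from spacy.training import offsets_to_biluo_tags  # spaCy v3+
except ImportError:
    from spacy.gold import biluo_tags_from_offsets as offsets_to_biluo_tags  # spaCy v2

nlp = spacy.blank("en")
doc = nlp("I am Tony Stark")

# "Tony Stark" covers characters 5-15 of the text and is labelled PERSON:
tags = offsets_to_biluo_tags(doc, [(5, 15, "PERSON")])
print(tags)  # ['O', 'O', 'B-PERSON', 'L-PERSON']
```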

When the entity recognizer "recognizes" named entities, it'll essentially try to predict those tags plus entity labels for each token. All sequences it predicts need to be valid – for instance, ["O", "I-ORG", "O"] wouldn't be allowed, because a token that's inside an entity (I) needs to have a beginning (B) and last token (L), and can't be surrounded by tokens outside an entity (O). For more details on how the transition-based NER system works, you might find this part of @honnibal's video on spaCy's NER model helpful.

So, to go back to the example: As the entity recognizer steps through the tokens, it's essentially predicting the entity labels for each token, given the previous tokens.

"John Doe works at ACME with Tony Stark"
------------------------------------------
["?", "?", "?", "?", "?", "?", "?", "?"]  <--- start
["B-PERSON", "?", "?", "?", "?", "?", "?", "?"]  <--- predicted first token – next one can only be I-PERSON or L-PERSON
["B-PERSON", "L-PERSON", "?", "?", "?", "?", "?", "?"]  <--- ended entity, next one can only be O, U-[SOMETHING] or B-[SOMETHING]
["B-PERSON", "L-PERSON", "O", "?", "?", "?", "?", "?"]  <--- "works" is outside an entity
["B-PERSON", "L-PERSON", "O", "O", "?", "?", "?", "?"]  <--- "at" is outside an entity
["B-PERSON", "L-PERSON", "O", "O", "U-ORG", "?", "?", "?"]  <--- "ACME" is single-token (unit) entity
["B-PERSON", "L-PERSON", "O", "O", "U-ORG", "O", "?", "?"]  <--- "with" is outside an entity
...

Next, it'd be predicting the entity label and BILUO tag for "Tony". This can either be O, B-[SOMETHING] or U-[SOMETHING]. The correct label would be B-PERSON, but it's possible that the model thinks that O is more likely here, given the previously assigned tags.
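The "which tags are allowed next" constraint from the walkthrough can be sketched in a few lines of plain Python. The helper names below are hypothetical (not spaCy API), and the check is simplified – it only validates BILUO structure, not that the entity label stays consistent between B/I/L tags:

```python
def allowed_next(prev):
    """Return the BILUO tag prefixes that may legally follow `prev`."""
    if prev.startswith(("B-", "I-")):
        # An entity is open: we may only continue it (I) or close it (L).
        return {"I-", "L-"}
    # After O, L-*, or U-* no entity is open: stay outside or start fresh.
    return {"O", "B-", "U-"}

def is_valid(tags):
    prev = "O"  # a sequence implicitly starts outside an entity
    for tag in tags:
        prefix = tag if tag == "O" else tag[:2]
        if prefix not in allowed_next(prev):
            return False
        prev = tag
    # The sequence must not end with an entity still open.
    return not prev.startswith(("B-", "I-"))

print(is_valid(["B-PERSON", "L-PERSON", "O"]))  # True
print(is_valid(["O", "I-ORG", "O"]))            # False - I needs a B before it
```

This mirrors the step in the walkthrough where, after "with" (O), the tag for "Tony" can only be O, B-[SOMETHING] or U-[SOMETHING].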
