EntityRecognizer throws IndexError when used in pipeline with Transformer and custom span getter #9719

Open
tomateit opened this issue Nov 22, 2021 · 2 comments
Labels
feat / ner Feature: Named Entity Recognizer feat / transformer Feature: Transformer

tomateit commented Nov 22, 2021

EntityRecognizer throws IndexError when used in pipeline with Transformer and custom span getter during training:

File "/home/---/---/research_spacy_ru/.venv/lib/python3.8/site-packages/spacy/language.py", line 1122, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "spacy/pipeline/transition_parser.pyx", line 416, in spacy.pipeline.transition_parser.Parser.update
  File "spacy/ml/parser_model.pyx", line 293, in spacy.ml.parser_model.ParserStepModel.finish_steps
  File "spacy/ml/parser_model.pyx", line 456, in spacy.ml.parser_model.precompute_hiddens.begin_update.backward
  File "/home/---/---/research_spacy_ru/.venv/lib/python3.8/site-packages/spacy/ml/_precomputable_affine.py", line 49, in backward
    Xf = X[ids]
IndexError: index 221 is out of bounds for axis 0 with size 221

How to reproduce the behaviour

I created a custom span_getter: https://gist.github.com/tomateit/06e53b108f764e7240ea7ae8e2e830fd
It sizes each span by the number of word pieces its words produce rather than by the number of words, so that spans fit the transformer window better.
The pipeline works with this function; the exception is thrown only for some documents.
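
To make the idea concrete, here is a minimal sketch of sizing windows by word-piece count. It is not the code from the gist: the function name `window_by_wordpieces`, its parameters, and the greedy strategy are assumptions chosen for illustration, and it works on plain lists of per-token word-piece counts instead of spaCy `Doc` objects.

```python
from typing import List, Tuple


def window_by_wordpieces(piece_counts: List[int], max_pieces: int) -> List[Tuple[int, int]]:
    """Greedily group tokens into (start, end) ranges (end exclusive).

    piece_counts[i] is the number of word pieces token i is split into.
    Hypothetical helper for illustration only, not part of spaCy.
    """
    if max_pieces < 1:
        raise ValueError("max_pieces must be at least 1")
    spans: List[Tuple[int, int]] = []
    start = 0   # first token of the window being built
    total = 0   # word pieces accumulated in that window
    for i, n in enumerate(piece_counts):
        # Close the current window if this token would overflow it.
        # A single token larger than max_pieces still gets its own window.
        if total + n > max_pieces and i > start:
            spans.append((start, i))
            start = i
            total = 0
        total += n
    # Flush the last, possibly partial, window.
    if start < len(piece_counts):
        spans.append((start, len(piece_counts)))
    return spans


spans = window_by_wordpieces([1, 2, 1, 3, 1], max_pieces=4)
```

With the counts above, the first three tokens use four word pieces, so the fourth token starts a new window. The ranges are contiguous and end-exclusive, so together they cover every token exactly once.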

I plug it into a simple transformer + ner pipeline like this: https://github.com/tomateit/natasha-spacy/blob/transformer-pipeline/project/config_trf.cfg
(in my tests I disabled all components except the transformer and NER)
The error is raised at this line: https://github.com/explosion/spaCy/blob/master/spacy/ml/_precomputable_affine.py#L49
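
The traceback reports index 221 on an axis of size 221, i.e. one position past the end. The thread does not establish the cause, so the following is only a sanity check one could run on a span getter's output while debugging. It assumes spans are represented as end-exclusive `(start, end)` token ranges; the helper name `uncovered_tokens` is hypothetical.

```python
from typing import List, Tuple


def uncovered_tokens(n_tokens: int, spans: List[Tuple[int, int]]) -> List[int]:
    """Return the indices of tokens that no span covers.

    Raises ValueError if a span is empty or reaches outside [0, n_tokens].
    Hypothetical debugging helper, not part of spaCy or spacy-transformers.
    """
    covered = [False] * n_tokens
    for start, end in spans:
        if start < 0 or end > n_tokens or start >= end:
            raise ValueError(f"bad span ({start}, {end}) for {n_tokens} tokens")
        for i in range(start, end):
            covered[i] = True
    return [i for i, is_covered in enumerate(covered) if not is_covered]


missing = uncovered_tokens(5, [(0, 3)])  # tokens 3 and 4 are left out
```

Running a check like this over the documents that trigger the exception would show whether the spans leave gaps or overrun the document, which helps narrow down whether the span getter is involved at all.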

Your Environment

  • spaCy version: 3.2.0
  • Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • spacy-transformers = "^1.0.6"
  • torch = "1.9.1"
  • Operating System: Ubuntu 20.04
  • Python Version Used: Python 3.8.10
  • spaCy Version Used: 3.2.0
@polm polm added feat / ner Feature: Named Entity Recognizer feat / transformer Feature: Transformer labels Nov 24, 2021
polm (Contributor) commented Jan 4, 2022

Thanks for the report and sorry it's taken us a long time to follow up on this. Unfortunately, because the issue is happening deep in the spaCy internals and your custom code isn't very simple, it's hard to be sure what's going on here.

Can you create a small example we can run to reproduce the problem? A repo like the one you linked to with a project file would be great, but that repo's project file doesn't seem to work and doesn't use Transformers anyway.

@polm polm added the more-info-needed This issue needs more information label Jan 4, 2022
tomateit (Author) commented Jan 5, 2022

Thanks for your reply.
I reproduced the behavior based on one of the spaCy tutorials: https://github.com/tomateit/tutorial_spacy_custom_span_getter
The only changes I made are:

  • I added my span getter (with more comments to make its algorithm clearer)
  • I altered the config to use my transformer of choice

The error persists.
P.S. The repo I linked in my first message does use a transformer config; in the project file it is invoked by "train_trf" rather than "train", so that both configs can be used.

@no-response no-response bot removed the more-info-needed This issue needs more information label Jan 5, 2022