EntityRecognizer throws IndexError when used in pipeline with Transformer and custom span getter #9719

tomateit · 2021-11-22T10:15:25Z

EntityRecognizer throws an IndexError during training when used in a pipeline with a Transformer and a custom span getter:

File "/home/---/---/research_spacy_ru/.venv/lib/python3.8/site-packages/spacy/language.py", line 1122, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "spacy/pipeline/transition_parser.pyx", line 416, in spacy.pipeline.transition_parser.Parser.update
  File "spacy/ml/parser_model.pyx", line 293, in spacy.ml.parser_model.ParserStepModel.finish_steps
  File "spacy/ml/parser_model.pyx", line 456, in spacy.ml.parser_model.precompute_hiddens.begin_update.backward
  File "/home/---/---/research_spacy_ru/.venv/lib/python3.8/site-packages/spacy/ml/_precomputable_affine.py", line 49, in backward
    Xf = X[ids]
IndexError: index 221 is out of bounds for axis 0 with size 221

How to reproduce the behaviour

I created a custom span_getter: https://gist.github.com/tomateit/06e53b108f764e7240ea7ae8e2e830fd
It adjusts the number of words per span according to the number of word pieces they produce, so that each span fits better into the transformer window.
The pipeline works with this function; the exception is thrown only for some documents.
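
To make the idea concrete, here is a minimal plain-Python sketch of a word-piece-budget windowing step. It is not the code from the gist and does not use the spaCy or spacy-transformers APIs; the names `piece_budget_windows`, `toy_count_pieces`, and `max_pieces` are hypothetical, and the piece counter is a toy stand-in for a real tokenizer.

```python
from typing import Callable, List, Sequence, Tuple


def piece_budget_windows(
    words: Sequence[str],
    count_pieces: Callable[[str], int],
    max_pieces: int,
) -> List[Tuple[int, int]]:
    """Split `words` into consecutive (start, end) windows whose total
    word-piece count stays within `max_pieces` where possible."""
    if max_pieces < 1:
        raise ValueError("max_pieces must be at least 1")
    windows: List[Tuple[int, int]] = []
    start = 0
    used = 0
    for i, word in enumerate(words):
        n = count_pieces(word)
        # Close the current window before it would exceed the budget.
        # A single word larger than the budget still gets its own window,
        # so no word is ever dropped.
        if i > start and used + n > max_pieces:
            windows.append((start, i))
            start = i
            used = 0
        used += n
    if start < len(words):
        windows.append((start, len(words)))
    return windows


def toy_count_pieces(word: str) -> int:
    # Hypothetical stand-in for a real tokenizer: one piece per 4 characters.
    return max(1, -(-len(word) // 4))


words = ["EntityRecognizer", "throws", "IndexError", "during", "training"]
windows = piece_budget_windows(words, toy_count_pieces, max_pieces=6)
print(windows)  # [(0, 2), (2, 4), (4, 5)]
```

In this sketch the windows are contiguous and together cover every word index exactly once, which makes it easy to check that no token is left out of a span.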

I plug it into a simple transformer + NER pipeline like this: https://github.com/tomateit/natasha-spacy/blob/transformer-pipeline/project/config_trf.cfg
(In my tests I disabled every component except the transformer and NER.)
The error is raised at this line: https://github.com/explosion/spaCy/blob/master/spacy/ml/_precomputable_affine.py#L49
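
For reference, the message in the traceback says that an index equal to the axis length was requested: an axis of size 221 accepts indices 0 through 220, so 221 is one past the end. How such an index arises inside the parser model is not established in this thread. The sketch below only illustrates the bounds rule in plain Python; `out_of_bounds` is a hypothetical helper, not part of spaCy or NumPy.

```python
from typing import Iterable, List


def out_of_bounds(ids: Iterable[int], size: int) -> List[int]:
    """Return the indices that are invalid for an axis of length `size`.

    Valid indices run from -size to size - 1 (negative ones count from the end).
    """
    return [i for i in ids if not -size <= i < size]


# Numbers taken from the traceback: an axis of size 221 and an index of 221.
bad = out_of_bounds([0, 220, 221], size=221)
print(bad)  # [221]
```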

Your Environment

spaCy version: 3.2.0
Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.29
Python version: 3.8.10
spacy-transformers = "^1.0.6"
torch = "1.9.1"

Operating System: Ubuntu 20.04
Python Version Used: Python 3.8.10
spaCy Version Used: 3.2.0


polm · 2022-01-04T07:47:54Z

Thanks for the report and sorry it's taken us a long time to follow up on this. Unfortunately, because the issue is happening deep in the spaCy internals and your custom code isn't very simple, it's hard to be sure what's going on here.

Can you create a small example we can run to reproduce the problem? A repo like the one you linked to with a project file would be great, but that repo's project file doesn't seem to work and doesn't use Transformers anyway.

tomateit · 2022-01-05T11:06:03Z

Thanks for your reply.
I reproduced the behavior based on one of the spaCy tutorials: https://github.com/tomateit/tutorial_spacy_custom_span_getter
The only changes I made are:

- I added my span getter (with more comments to make its algorithm clearer)
- I altered the config to use my transformer of choice

The error remains.
P.S. The repo I linked in my first message does use a transformer config. In the project file it is run with "train_trf" rather than "train", so that both configs can be used.

polm added the labels "feat / ner" (Feature: Named Entity Recognizer) and "feat / transformer" (Feature: Transformer) on Nov 24, 2021

polm added the label "more-info-needed" (This issue needs more information) on Jan 4, 2022

no-response bot removed the label "more-info-needed" (This issue needs more information) on Jan 5, 2022
