Training a parser for custom semantics with Doc.retokenize applied to Named Entities example #5921

You have to consider the difference between training your pipeline and applying it. Adding a custom merging component before the parser is useful when applying the pipeline as a whole, because the parser will then only see the one merged token.
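
Such a merging component might look like the following sketch. The component name `merge_new_york` and the hard-coded bigram are illustrative, not from the original post; it simply shows `Doc.retokenize` merging a span before the parser would see it:

```python
import spacy
from spacy.language import Language

@Language.component("merge_new_york")
def merge_new_york(doc):
    # Illustrative component: merge every "new york" bigram
    # into a single token.
    with doc.retokenize() as retokenizer:
        spans = [
            doc[i : i + 2]
            for i in range(len(doc) - 1)
            if doc[i].lower_ == "new" and doc[i + 1].lower_ == "york"
        ]
        for span in spans:
            retokenizer.merge(span)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("merge_new_york")  # in a full pipeline: add before the parser
doc = nlp("I like new york in autumn")
print([t.text for t in doc])  # "new" and "york" are now one token
```

When applying the whole pipeline, the parser placed after this component only ever sees the merged token.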

However, during training this custom component is not run on your texts: when you train the parser, all other components are disabled and the raw texts are fed as input. This means that the training step will still see "new york" as two tokens and will want annotations for both, resulting in this out-of-bounds indexing error.

During training, you could remedy this by explicitly telling the parser which (merged) words appear in the texts:

"f…
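The snippet above is truncated, but the idea is to supply the merged tokenization in the training annotations. A minimal sketch of such training data, where the head indices and dependency labels are illustrative rather than from the original post:

```python
# Gold annotations use the *merged* tokenization: "new york" is one word,
# so head indices and dep labels refer to five tokens, not six.
TRAIN_DATA = [
    (
        "I like new york in autumn",
        {
            "words": ["I", "like", "new york", "in", "autumn"],
            "heads": [1, 1, 1, 2, 3],
            "deps": ["nsubj", "ROOT", "dobj", "prep", "pobj"],
        },
    ),
]
```

With the words given this way, the annotations line up with the tokenization the parser is trained on, avoiding the out-of-bounds error.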

Answer selected by ines
This discussion was converted from issue #5921 on December 11, 2020 00:03.