Training a parser for custom semantics with Doc.retokenize applied to Named Entities example #5921
-
In the docs about Training a parser for custom semantics there is a tip that says: "To achieve even better accuracy, try merging multi-word tokens and entities specific to your domain into one token before parsing your text. You can do this by running the entity recognizer or rule-based matcher to find relevant spans, and merging them using Doc.retokenize. You could even add your own custom pipeline component to do this automatically – just make sure to add it before='parser'." I tried to implement this using a custom pipeline component before the parser, but couldn't make it work. I am probably doing something wrong. I would like to know if someone has already done this, or if an example of this could be included in the docs. Here is what I tried (which is probably wrong; it is basically the docs example with my new component added before the parser, and with "london" (in the training set) and "berlin" (in the dev set) changed to "new york"):
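A custom merging component along these lines might look like the following sketch. To be clear, this is not the poster's actual code (which is not shown here); the component name and the hard-coded "new york" check are illustrative, and the spaCy v3 registration API is assumed:

```python
import spacy
from spacy.language import Language

@Language.component("merge_new_york")
def merge_new_york(doc):
    # Naive sketch: find the first "new york" and merge it into one token.
    # Doc.retokenize applies the merge when the context manager exits.
    with doc.retokenize() as retokenizer:
        for i in range(len(doc) - 1):
            if doc[i].lower_ == "new" and doc[i + 1].lower_ == "york":
                retokenizer.merge(doc[i : i + 2])
                break  # merged spans must not overlap; stop after one match
    return doc

nlp = spacy.blank("en")
# In a full pipeline this would be: nlp.add_pipe("merge_new_york", before="parser")
nlp.add_pipe("merge_new_york")
doc = nlp("i like new york")
print([t.text for t in doc])  # → ['i', 'like', 'new york']
```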
I get the following error (which I suspect occurs because it somehow isn't merging the two tokens into one):
Appreciate any help.

Your Environment
-
You have to consider the difference between training your pipeline and applying it. Adding a custom merging component before the parser is useful when applying the pipeline as a whole, because then the parser will only see the one merged token.

However, during training this custom component is not run on your texts, because you disable all other components when training the parser and feed it the raw texts as input. This means that the training step will still see "new york" as two tokens, and it will want annotations for both, resulting in the out-of-bounds indexing error you saw.

During training, you can remedy this by explicitly telling the parser which are the (merged) words in the texts:
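A sketch of what that could look like with the spaCy v3 training API (the sentence and the head/dep annotations here are made up for illustration, not taken from the original thread): the gold-standard tokenization, with "new york" already merged, is passed via the "words" key when building the Example, and spaCy aligns it to the raw tokenization:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# The predicted side uses the raw tokenization ("new" and "york" separate)...
doc = nlp.make_doc("i like new york")

# ...while the reference side declares the merged tokenization via "words",
# so heads/deps are annotated for 3 tokens instead of 4.
example = Example.from_dict(
    doc,
    {
        "words": ["i", "like", "new york"],
        "heads": [1, 1, 1],
        "deps": ["nsubj", "ROOT", "dobj"],
    },
)
print([t.text for t in example.reference])  # → ['i', 'like', 'new york']
```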
-
Thank you @svlandeg! I'll try that!