Prevent base English models from producing "subtok" parse? #4099
Labels
bug: Bugs and behaviour differing from documentation
feat / parser: Feature: Dependency Parser
models: Issues related to the statistical models
How to reproduce the behaviour
Using the base English en_core_web_sm model, the dependency parser can assign the "subtok" label to tokens. This seems like unexpected behavior, and it's definitely problematic, since training the parser on examples with "subtok" labels causes crashes.
To reproduce:
Granted, the above isn't a particularly well-formed English sentence, but I would still not expect the "subtok" label. Is there a flag we can set that prevents the model from generating this label?
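In the absence of such a flag, one possible stopgap is to detect parses that contain the "subtok" label and keep them out of the parser's training data. The sketch below is only an illustration, not a fix for the underlying behaviour. It works on plain (token text, dependency label) pairs so it runs without a model. The helper names `subtok_positions` and `drop_subtok_examples` are hypothetical, and the sample parses are made up rather than real model output.

```python
from typing import Iterable, List, Sequence, Tuple

SUBTOK = "subtok"  # the dependency label reported in this issue

# One parsed token as (token text, dependency label).
# With spaCy these pairs could be built as:
#   parse = [(t.text, t.dep_) for t in nlp(text)]
TokenDep = Tuple[str, str]


def subtok_positions(parse: Sequence[TokenDep]) -> List[int]:
    """Return the indices of tokens labelled 'subtok' (hypothetical helper)."""
    return [i for i, (_text, dep) in enumerate(parse) if dep == SUBTOK]


def drop_subtok_examples(
    parses: Iterable[Sequence[TokenDep]],
) -> List[Sequence[TokenDep]]:
    """Keep only parses with no 'subtok' label (hypothetical helper)."""
    return [parse for parse in parses if not subtok_positions(parse)]


# Made-up parses for illustration only; not actual en_core_web_sm output.
flagged = [("piece", "ROOT"), ("of", "subtok"), ("cake", "pobj")]
clean = [("a", "det"), ("cake", "ROOT")]

print(subtok_positions(flagged))
print(len(drop_subtok_examples([flagged, clean])))
```

Filtering like this only avoids the training crash by discarding affected examples. It does not stop the model from predicting the label in the first place.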
Also, running merge_subtokens afterwards isn't a solution because I need "of" to be a separate token.
Your Environment
Info about spaCy