
Prevent base English models from producing "subtok" parse? #4099

Closed
dxiao2003 opened this issue Aug 8, 2019 · 4 comments
Labels
bug Bugs and behaviour differing from documentation feat / parser Feature: Dependency Parser models Issues related to the statistical models

Comments


dxiao2003 commented Aug 8, 2019

How to reproduce the behaviour

Using the base English en_core_web_sm model, it's possible for the dependency parser to produce tokens with the "subtok" label. This seems like unexpected behavior and it's definitely problematic since training the parser on examples with "subtok" labels causes crashes.

To reproduce:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> d = nlp("I'd rather have one Sharks with no crunch mustard instead of 205S")
>>> d[11].dep_
'subtok'

Granted, the above isn't a particularly well-formed English sentence, but I still would not expect the "subtok" label. Is there a flag we can set to prevent the model from generating this label?

Also, running merge_subtokens afterwards isn't a solution because I need "of" to be a separate token.
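For context, merge_subtokens collapses each run of tokens connected by "subtok" arcs into a single token, which is why "of" would stop being a separate token here. A minimal pure-Python sketch of that merging idea (the tokens and dep labels below are hypothetical, chosen to mirror this sentence; this is not spaCy's actual implementation):

```python
def merge_subtok_runs(tokens, deps):
    """Merge each maximal run of consecutive 'subtok' tokens into the
    token that follows the run (the run's head)."""
    merged, i = [], 0
    while i < len(tokens):
        j = i
        # Extend over consecutive subtok arcs, which point at the next token.
        while j < len(deps) and deps[j] == "subtok":
            j += 1
        merged.append("".join(tokens[i:j + 1]))
        i = j + 1
    return merged

# Hypothetical parse where "of" carries a spurious subtok arc to "205S":
tokens = ["mustard", "instead", "of", "205S"]
deps = ["prep", "advmod", "subtok", "nummod"]
print(merge_subtok_runs(tokens, deps))  # -> ['mustard', 'instead', 'of205S']
```

As the sketch shows, merging after the fact fuses "of" into its neighbour, so it cannot recover "of" as its own token.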

Your Environment

Info about spaCy

  • spaCy version: 2.1.8
  • Platform: Linux-4.15.0-1040-aws-x86_64-with-debian-9.9
  • Python version: 3.7.3
@svlandeg svlandeg added the feat / parser Feature: Dependency Parser label Aug 9, 2019
@adrianeboyd (Contributor)

The subtok dependency labels are a known problem (#3830) with the current models, but I think the models need to be retrained to fix it, which as far as I know is planned for 2.2. No simple/quick solution here, sorry.

The fact that including a subtok label in your data causes crashes when training the parser is a separate issue that should definitely be addressed.

@ines ines added bug Bugs and behaviour differing from documentation models Issues related to the statistical models labels Aug 9, 2019
@dxiao2003 (Author)

I think I understand why training is crashing. If I understand correctly, subtok-labelled tokens produced by the built-in models always have the next token as their head. But sometimes we have to massage our examples and introduce intervening tokens, so that the subtok token's head ends up two or three tokens away. I'm guessing this is what's causing training to fail.
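If that is indeed the constraint, one possible pre-processing workaround (a sketch under that assumption, not tested against spaCy's internals) is to relabel any "subtok" arc whose head is not the immediately following token before training; heads are given as absolute token indices, mirroring how spaCy's gold annotations represent them, and the `fallback` label is a hypothetical choice:

```python
def sanitize_subtoks(deps, heads, fallback="dep"):
    """Relabel 'subtok' arcs whose head is not the next token, on the
    assumption that the parser only supports subtok with head == i + 1.
    `heads` are absolute token indices."""
    fixed = []
    for i, (dep, head) in enumerate(zip(deps, heads)):
        if dep == "subtok" and head != i + 1:
            dep = fallback  # hypothetical replacement label
        fixed.append(dep)
    return fixed

deps = ["subtok", "subtok", "ROOT"]
heads = [2, 2, 2]  # first subtok points two tokens ahead, not one
print(sanitize_subtoks(deps, heads))  # -> ['dep', 'subtok', 'ROOT']
```

This keeps well-formed subtok arcs intact and only rewrites the ones that would violate the next-token-head assumption.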

ines (Member) commented Oct 2, 2019

Just released v2.2, which fixes the subtok issue!

@ines ines closed this as completed Oct 2, 2019

lock bot commented Nov 1, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Nov 1, 2019