
Prevent base English models from producing "subtok" parse? #4099

Closed
dxiao2003 opened this issue Aug 8, 2019 · 4 comments
Labels
bug Bugs and behaviour differing from documentation feat / parser Feature: Dependency Parser models Issues related to the statistical models

Comments


dxiao2003 commented Aug 8, 2019

How to reproduce the behaviour

Using the base English en_core_web_sm model, it's possible for the dependency parser to produce tokens with the "subtok" label. This seems like unexpected behavior and it's definitely problematic since training the parser on examples with "subtok" labels causes crashes.

To reproduce:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> d = nlp("I'd rather have one Sharks with no crunch mustard instead of 205S")
>>> d[11].dep_
'subtok'

Granted, the above isn't a particularly well-formed English sentence, but I still would not expect the "subtok" label. Is there a flag we can set to prevent the model from generating this label?

Also, running merge_subtokens afterwards isn't a solution because I need "of" to be a separate token.
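For context, merge_subtokens collapses each run of tokens connected by "subtok" arcs into a single token, which is why "of" would stop being a separate token here. A minimal pure-Python sketch of that merging idea (the tokens and dep labels below are hypothetical, chosen to mirror this sentence; this is not spaCy's actual implementation):

```python
def merge_subtok_runs(tokens, deps):
    """Merge each maximal run of consecutive 'subtok' tokens into the
    token that follows the run (the run's head)."""
    merged, i = [], 0
    while i < len(tokens):
        j = i
        # Extend over consecutive subtok arcs, which point at the next token.
        while j < len(deps) and deps[j] == "subtok":
            j += 1
        merged.append("".join(tokens[i:j + 1]))
        i = j + 1
    return merged

# Hypothetical parse where "of" carries a spurious subtok arc to "205S":
tokens = ["mustard", "instead", "of", "205S"]
deps = ["prep", "advmod", "subtok", "nummod"]
print(merge_subtok_runs(tokens, deps))  # -> ['mustard', 'instead', 'of205S']
```

As the sketch shows, merging after the fact fuses "of" into its neighbour, so it cannot recover "of" as its own token.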

Your Environment

Info about spaCy

  • spaCy version: 2.1.8
  • Platform: Linux-4.15.0-1040-aws-x86_64-with-debian-9.9
  • Python version: 3.7.3
@svlandeg svlandeg added the feat / parser Feature: Dependency Parser label Aug 9, 2019
@adrianeboyd (Contributor)

The subtok dependency labels are a known problem (#3830) with the current models, but I think the models need to be retrained to fix it, which as far as I know is planned for 2.2. No simple/quick solution here, sorry.

The fact that including a subtok label in your data causes crashes when training the parser is a separate issue that should definitely be addressed.

@ines ines added bug Bugs and behaviour differing from documentation models Issues related to the statistical models labels Aug 9, 2019
@dxiao2003 (Author)

I think I understand why training is crashing. If I understand correctly, subtok-labelled tokens produced by the built-in models always have the next token as their head. But sometimes we have to massage our examples and introduce intervening tokens, so that the subtok token's head ends up two or three tokens away. I'm guessing this is what's causing training to fail.
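If that is indeed the constraint, one possible pre-processing workaround (a sketch under that assumption, not tested against spaCy's internals) is to relabel any "subtok" arc whose head is not the immediately following token before training; heads are given as absolute token indices, mirroring how spaCy's gold annotations represent them, and the `fallback` label is a hypothetical choice:

```python
def sanitize_subtoks(deps, heads, fallback="dep"):
    """Relabel 'subtok' arcs whose head is not the next token, on the
    assumption that the parser only supports subtok with head == i + 1.
    `heads` are absolute token indices."""
    fixed = []
    for i, (dep, head) in enumerate(zip(deps, heads)):
        if dep == "subtok" and head != i + 1:
            dep = fallback  # hypothetical replacement label
        fixed.append(dep)
    return fixed

deps = ["subtok", "subtok", "ROOT"]
heads = [2, 2, 2]  # first subtok points two tokens ahead, not one
print(sanitize_subtoks(deps, heads))  # -> ['dep', 'subtok', 'ROOT']
```

This keeps well-formed subtok arcs intact and only rewrites the ones that would violate the next-token-head assumption.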

ines (Member) commented Oct 2, 2019

Just released v2.2, which fixes the subtok issue!

@ines ines closed this as completed Oct 2, 2019

lock bot commented Nov 1, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Nov 1, 2019