Can't train_test_split with test_size < 0.32 #4946

Thanks for the report. This is an interesting case related to subtle behavior around character offsets and tokenization. spaCy doesn't warn about or explain what's going on here well enough, and at the very least this behavior needs to be more transparent for users. I think it would be even better to have some additional settings for how to handle misaligned data, too, but that would be a larger change.

I suspect the underlying problem is that you have a lot of cases where the entity boundaries in your annotation don't line up with spaCy's tokenization. An obvious example would be something like this, where my annotation says that "ome i" is an entity:

"Rome is a city.",
{"entities": …
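
To make the misalignment concrete, here is a minimal sketch of the check in plain Python. It is not spaCy's implementation: the regex tokenizer and the helper names `token_boundaries` and `is_aligned` are assumptions made for illustration, and the offsets `(1, 6)` are simply the characters covering "ome i" in the sentence above. In spaCy itself, `Doc.char_span` should behave similarly and return `None` for a span that does not start and end on token boundaries.

```python
import re


def token_boundaries(text):
    """Return the sets of character offsets where tokens start and end.

    Assumption: a simple regex tokenizer (runs of word characters, or single
    punctuation marks) stands in for spaCy's real tokenizer.
    """
    starts, ends = set(), set()
    for match in re.finditer(r"\w+|[^\w\s]", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def is_aligned(text, start, end):
    """True if the span [start, end) begins and ends on token boundaries."""
    starts, ends = token_boundaries(text)
    return start in starts and end in ends


text = "Rome is a city."
print(text[0:4], is_aligned(text, 0, 4))  # "Rome" sits on token boundaries
print(text[1:6], is_aligned(text, 1, 6))  # "ome i" cuts through two tokens
```

Running a check like this over the whole training set before splitting it would show how many annotations are affected.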

Answer selected by ines
Labels
feat / training Feature: Training utils, Example, Corpus and converters

This discussion was converted from issue #4946 on December 10, 2020 23:50.