Support offset mapping alignment for fast tokenizers #338
Conversation
Switch to offset mapping-based alignment for fast tokenizers. With this change, slow vs. fast tokenizers will not give identical results with `spacy-transformers`.

Additional modifications:
* Update package setup for Cython
* Update CI for compiled package
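For illustration, a minimal sketch of the offset-mapping idea with a Hugging Face fast tokenizer. This is a standalone example, not the actual `spacy-transformers` alignment code; the tokenizer choice and the overlap logic are assumptions:

```python
import spacy
from transformers import AutoTokenizer

# Standalone sketch: align wordpieces to spaCy tokens via character offsets,
# the same general idea as offset-mapping-based alignment for fast tokenizers.
nlp = spacy.blank("en")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

text = "spaCy and transformers work together."
doc = nlp(text)
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=True)

# For each spaCy token, collect the wordpiece indices whose character spans
# overlap the token's span. Special tokens report (0, 0) offsets and are skipped.
alignment = []
for token in doc:
    tok_start, tok_end = token.idx, token.idx + len(token.text)
    wp_idxs = [
        i
        for i, (start, end) in enumerate(enc["offset_mapping"])
        if start != end and start < tok_end and end > tok_start
    ]
    alignment.append((token.text, wp_idxs))

print(alignment)
```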
Concerns:
Nice to finally get this properly working!
I suppose we'd want this as a minor release - should we create a develop branch for it? Or just remember to have the next one be 1.2.
Yes, this would need to be v1.2 because it will change the behavior of trained models.
In terms of model performance and speed, this implementation appears to be on par with the existing one. I'm trying to find a reasonable test case where the actual alignments would make a difference, but most things I've tried so far were a wash. The main known cases are with roberta/gpt2 with the …
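To see the kind of difference in question, a quick way to inspect the offsets a gpt2-style byte-level BPE fast tokenizer reports for tokens that carry a leading space (a hypothetical inspection snippet, not a test from this PR):

```python
from transformers import AutoTokenizer

# Print each token alongside the character span its offset mapping reports.
# Whether the leading space of tokens like "Ġworld" is included in the span
# is exactly where character-based alignment can differ from text-based alignment.
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
text = "Hello world"
enc = tokenizer(text, return_offsets_mapping=True)

for tok_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
    token = tokenizer.convert_ids_to_tokens(tok_id)
    print(f"{token!r:12} {(start, end)} -> {text[start:end]!r}")
```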
Looks good with just one minor comment. I don't see any immediate opportunities for optimization, but we should profile it as part of a deployed pipeline at some point.
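For example, a rough end-to-end timing sketch over a batch of texts; the pipeline name and texts here are placeholders, not part of this PR:

```python
import time
import spacy

# Rough end-to-end timing of a transformer pipeline; swap in whatever
# deployed pipeline and representative texts make sense.
nlp = spacy.load("en_core_web_trf")
texts = ["This is a sample sentence."] * 1000

start = time.perf_counter()
docs = list(nlp.pipe(texts, batch_size=64))
elapsed = time.perf_counter() - start
print(f"{len(docs)} docs in {elapsed:.2f}s ({len(docs) / elapsed:.1f} docs/s)")
```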