
Support offset mapping alignment for fast tokenizers #338

Merged · 16 commits into explosion:develop, Sep 1, 2022

Conversation

adrianeboyd (Collaborator)

Switch to offset mapping-based alignment for fast tokenizers. With this change, slow vs. fast tokenizers will not give identical results with `spacy-transformers`.

Additional modifications:

  • Update package setup for Cython
  • Update CI for compiled package

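For context, here is a minimal sketch of what offset mapping-based alignment looks like, not the PR's actual Cython implementation: a fast tokenizer can report character offsets for each wordpiece via `return_offsets_mapping=True`, and those offsets can be matched against spaCy token character spans. The checkpoint name and the overlap rule below are illustrative assumptions.

```python
# Minimal sketch of offset-mapping-based alignment; NOT the PR's actual
# Cython implementation. Assumes a fast HuggingFace tokenizer and the
# roberta-base checkpoint (illustrative choices).
import spacy
from transformers import AutoTokenizer

nlp = spacy.blank("en")
doc = nlp("spaCy is written in Cython.")
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

encoded = tokenizer(doc.text, return_offsets_mapping=True)
for tok_id, (start, end) in zip(encoded["input_ids"], encoded["offset_mapping"]):
    if start == end:
        # special tokens like <s> / </s> have empty offsets
        continue
    # collect spaCy tokens whose character spans overlap [start, end)
    aligned = [t.i for t in doc if t.idx < end and t.idx + len(t) > start]
    print(tokenizer.convert_ids_to_tokens(tok_id), (start, end), aligned)
```

Working from character offsets rather than string-matching the token texts sidesteps the remapped token strings that byte-level BPE tokenizers produce.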
adrianeboyd (Collaborator, Author) commented on Jul 26, 2022

Concerns:

adrianeboyd added the feat / alignment label on Jul 26, 2022
svlandeg self-requested a review on August 10, 2022
svlandeg (Member) left a comment


Nice to finally get this properly working!

I suppose we'd want this as a minor release - should we create a develop branch for it? Or just remember to have the next one be 1.2.

Resolved review threads on spacy_transformers/align.py, spacy_transformers/_align.pyx, and requirements.txt.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
adrianeboyd (Collaborator, Author) commented on Aug 12, 2022

Yes, this would need to be v1.2 because it will change the behavior of trained models.

svlandeg changed the base branch from master to develop on August 16, 2022
adrianeboyd (Collaborator, Author) commented

In terms of model performance and speed, this implementation appears to be on par with the existing spacy-alignments alignment. In general I think it should be able to be a little bit faster; my first guess (as mentioned above) is now here in the renamed version:

https://github.com/explosion/spacy-transformers/pull/338/files#diff-b326f7a5839c4589ffcac094a207251f14e6235d150b35a7ae07459776bcc19aR104

I'm trying to find a reasonable test case where the actual alignments would make a difference, but most things I've tried so far have been a wash. The main known cases are with roberta/gpt2 and the bytes_to_unicode mapping. At least users looking at the alignments would be less concerned about it?
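To illustrate the bytes_to_unicode issue mentioned above (a sketch under assumptions, not code from the PR; the gpt2 checkpoint and example string are illustrative): byte-level BPE remaps raw UTF-8 bytes to printable characters, e.g. a leading space becomes "Ġ", so token strings cannot be substring-matched against the original text, while the offset mapping reports character spans into it directly.

```python
# Sketch of why gpt2/roberta token strings are hard to align by string
# matching (assumes the gpt2 checkpoint; illustrative only).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
encoded = tokenizer("naïve text", return_offsets_mapping=True)

# Token strings contain bytes_to_unicode-remapped characters (e.g. "Ġ" for
# a leading space) and don't substring-match "naïve text" directly.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# The offset mapping gives (start, end) character spans into the original
# text, which is what the new alignment works from.
print(encoded["offset_mapping"])
```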

shadeMe (Collaborator) left a comment


Looks good with just one minor comment. I don't see any immediate opportunities for optimization, but we should profile it as part of a deployed pipeline at some point.
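A rough way to do that profiling later might look like the following end-to-end timing sketch (an assumption-laden illustration: it presumes the en_core_web_trf package is installed, and times the whole pipeline rather than the alignment in isolation).

```python
# Rough end-to-end timing sketch for profiling a deployed transformer
# pipeline (assumes en_core_web_trf is installed; illustrative only).
import time

import spacy

nlp = spacy.load("en_core_web_trf")
texts = ["spaCy is written in Cython."] * 100
start = time.perf_counter()
for _ in nlp.pipe(texts, batch_size=32):
    pass
print(f"{time.perf_counter() - start:.2f}s for {len(texts)} docs")
```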

Resolved review thread on spacy_transformers/align.pyx.
svlandeg merged commit f6a1a02 into explosion:develop on Sep 1, 2022