Support offset mapping alignment for fast tokenizers #338
Conversation
Switch to offset mapping-based alignment for fast tokenizers. With this change, slow vs. fast tokenizers will not give identical results with `spacy-transformers`.

Additional modifications:
* Update package setup for Cython
* Update CI for compiled package
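For illustration, a minimal sketch of the offset-mapping idea with a Hugging Face fast tokenizer. This is a standalone example, not the actual `spacy-transformers` alignment code; the tokenizer choice and the overlap logic are assumptions:

```python
import spacy
from transformers import AutoTokenizer

# Standalone sketch: align wordpieces to spaCy tokens via character offsets,
# the same general idea as offset-mapping-based alignment for fast tokenizers.
nlp = spacy.blank("en")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

text = "spaCy and transformers work together."
doc = nlp(text)
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=True)

# For each spaCy token, collect the wordpiece indices whose character spans
# overlap the token's span. Special tokens report (0, 0) offsets and are skipped.
alignment = []
for token in doc:
    tok_start, tok_end = token.idx, token.idx + len(token.text)
    wp_idxs = [
        i
        for i, (start, end) in enumerate(enc["offset_mapping"])
        if start != end and start < tok_end and end > tok_start
    ]
    alignment.append((token.text, wp_idxs))

print(alignment)
```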
Concerns:
Nice to finally get this properly working!
I suppose we'd want this as a minor release - should we create a develop branch for it? Or just remember to have the next one be 1.2.
Yes, this would need to be v1.2 because it will change the behavior of trained models.
In terms of model performance and speed, this implementation appears to be on par with the existing one. I'm trying to find a reasonable test case where the actual alignments would make a difference, but most things I've tried so far were a wash. The main known cases are with roberta/gpt2 with the …
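To see the kind of difference in question, a quick way to inspect the offsets a gpt2-style byte-level BPE fast tokenizer reports for tokens that carry a leading space (a hypothetical inspection snippet, not a test from this PR):

```python
from transformers import AutoTokenizer

# Print each token alongside the character span its offset mapping reports.
# Whether the leading space of tokens like "Ġworld" is included in the span
# is exactly where character-based alignment can differ from text-based alignment.
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
text = "Hello world"
enc = tokenizer(text, return_offsets_mapping=True)

for tok_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
    token = tokenizer.convert_ids_to_tokens(tok_id)
    print(f"{token!r:12} {(start, end)} -> {text[start:end]!r}")
```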
Looks good with just one minor comment. I don't see any immediate opportunities for optimization, but we should profile it as part of a deployed pipeline at some point.
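For example, a rough end-to-end timing sketch over a batch of texts; the pipeline name and texts here are placeholders, not part of this PR:

```python
import time
import spacy

# Rough end-to-end timing of a transformer pipeline; swap in whatever
# deployed pipeline and representative texts make sense.
nlp = spacy.load("en_core_web_trf")
texts = ["This is a sample sentence."] * 1000

start = time.perf_counter()
docs = list(nlp.pipe(texts, batch_size=64))
elapsed = time.perf_counter() - start
print(f"{len(docs)} docs in {elapsed:.2f}s ({len(docs) / elapsed:.1f} docs/s)")
```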