Fix alignment algorithm #123

tamuhey · 2020-01-02T17:50:29Z

From #87

Fix the method below using pytokenizations, a library I created for this purpose.

spacy-transformers/spacy_transformers/pipeline/wordpiecer.py

Line 108 in 378d6aa

def _align(self, segment, wp_tokens, *, offset=0):

This library is based on a "shortest edit script" algorithm and can handle noisy tokenizations. For example:

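import tokenizations

a = ["げん", "ご"]
b = ["けんこ"]  # all voiced sound marks are dropped (げ -> け, ご -> こ)
a2b = [[0], [0]]  # each token in a maps to token 0 in b
b2a = [[0, 1]]  # token 0 in b maps to tokens 0 and 1 in a
assert tokenizations.get_alignments(a, b) == (a2b, b2a)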

(I think this library would also be useful for the spacy.align function; see explosion/spaCy#4554.)
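To make the idea of aligning noisy tokenizations concrete, here is a minimal standard-library sketch. It is not the pytokenizations implementation: it swaps in `difflib.SequenceMatcher` (longest matching blocks rather than a true shortest edit script), and it assumes that normalizing with NFKD, dropping combining marks, and lowercasing is enough to absorb the noise. The names `sketch_alignments`, `_normalize`, and `_char_to_token` are hypothetical and exist only for this illustration.

```python
import difflib
import unicodedata


def _normalize(token):
    # Decompose characters, drop combining marks (e.g. げ -> け), and lowercase.
    # Assumption: this is sufficient normalization for the illustration.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def _char_to_token(tokens):
    # For each character of the concatenated text, record the owning token index.
    return [i for i, tok in enumerate(tokens) for _ in tok]


def sketch_alignments(a, b):
    """Return (a2b, b2a) token alignments derived from character-level matches."""
    a_norm = [_normalize(t) for t in a]
    b_norm = [_normalize(t) for t in b]
    a_owner = _char_to_token(a_norm)
    b_owner = _char_to_token(b_norm)
    a2b = [set() for _ in a]
    b2a = [set() for _ in b]
    matcher = difflib.SequenceMatcher(
        None, "".join(a_norm), "".join(b_norm), autojunk=False
    )
    # Every matched character pair links the two tokens that contain it.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i = a_owner[block.a + k]
            j = b_owner[block.b + k]
            a2b[i].add(j)
            b2a[j].add(i)
    return [sorted(s) for s in a2b], [sorted(s) for s in b2a]


result = sketch_alignments(["げん", "ご"], ["けんこ"])
```

On the example above, `result` should match the `a2b` and `b2a` values shown for `tokenizations.get_alignments`. Characters that have no counterpart on the other side (such as the `##` prefix of a wordpiece) simply contribute no links.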

tamuhey · 2020-01-05T14:15:31Z

This now passes https://github.com/explosion/spacy-transformers/blob/master/examples/test_wordpiece_alignment.py

tamuhey · 2020-01-06T01:56:43Z

I created a PR in spaCy: explosion/spaCy#4878

tamuhey added 13 commits January 3, 2020 00:27

fix alignment algorithm

66622f4

add sentencizer

00f4e59

modified: tests/test_wordpiecer.py

7859e19

found error

dbf1c49

fail test

5681daa

remove length constraint

37fc114

head controll char

d78ca3d

restore length check

e3ef7c0

update library

b26d0be

fix head attach

d149b7e

restore head attach

6238648

fix handling boundary

a3d8bf6

pass example/test_wordpiece_alignment.py

ed4f003

tamuhey mentioned this pull request Jan 6, 2020

Update gold.align with pytokenizations explosion/spaCy#4878

Closed

3 tasks

honnibal merged commit 8fbcbd5 into explosion:master Apr 21, 2020

tamuhey deleted the apply-pytokenizations branch May 12, 2020 02:07
