Fix wordpiece indexer truncation #2931

maksymbevza · 2019-06-07T12:25:25Z

The problem

The offsets sometimes contain an extra item because the condition in the truncation if check is wrong.

What happens

Consider the case of _truncate_long_sequences = True with one start token ([CLS]) and one end token ([SEP]), max_pieces == 4 and window_length == 2.

Suppose the sentence is encoded into the following wordpieces:
["the", "quick", "##est"]

In this case ["the"] fits without any cut, but ["quick", "##est"] does not. The old code produced the ending offsets [1, 3], but the wordpieces after the cut are ["[CLS]", "the", "quick", "[SEP]"], so position 3 is invalid because it points to "[SEP]". In other cases the old code fails outright with an index out of bounds error, when an offset goes beyond the size of the tensor.

The fix

By adding len(token) - 1 to the current offset in the check, we make sure that the last wordpiece of the current token still fits in the sentence that is cut on line 208.
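To make the difference concrete, here is a minimal sketch of the two checks for ending offsets. It is a simplified illustration, not the indexer's actual code: the function name `ending_offsets`, the `fixed` flag, and the loop structure are hypothetical, and it assumes exactly one start piece ([CLS]) at index 0.

```python
from typing import List


def ending_offsets(token_pieces: List[List[str]],
                   window_length: int,
                   fixed: bool = True) -> List[int]:
    """Return, for each kept token, the index of its last wordpiece.

    Hypothetical helper for illustration only. Assumes a single start
    piece at index 0, so real wordpieces occupy indices 1..window_length.
    """
    offsets = []
    offset = 0  # index of the last piece placed so far (the start piece)
    for pieces in token_pieces:
        if fixed:
            # Fixed check: account for all pieces of the current token,
            # so the token is dropped if its last piece would not fit.
            out_of_window = offset + len(pieces) - 1 >= window_length
        else:
            # Old check: only looks at where the previous token ended.
            out_of_window = offset >= window_length
        if out_of_window:
            break
        offset += len(pieces)
        offsets.append(offset)
    return offsets


tokens = [["the"], ["quick", "##est"]]
print(ending_offsets(tokens, window_length=2, fixed=False))  # [1, 3]
print(ending_offsets(tokens, window_length=2, fixed=True))   # [1]
```

With the old check, the second token is admitted because the previous token ended inside the window, even though its own last piece lands on index 3. With the adjusted check, the token is skipped and only the offset for "the" remains.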

joelgrus · 2019-06-07T17:15:04Z

thanks for this, I think this looks good, but can you

1. give the tests more descriptive names, and
2. add some comments to them

so that it's more obvious what behavior exactly they're testing and what would cause them to fail?

maksymbevza · 2019-06-10T15:21:22Z

@joelgrus, I've added comments for the tests.
PTAL

matt-gardner · 2019-06-14T15:35:48Z

@joelgrus, ping.

maksymbevza · 2019-06-20T11:44:45Z

@joelgrus Could you please take a look?

joelgrus

looks good, thanks for the fix

* Fix wordpiece indexer
* Add comments for test and count pieces accumulated

Fix wordpiece indexer

2d21b55

kl2806 requested a review from joelgrus June 7, 2019 22:28

Add comments for test and count pieces accumulated

c667424

Merge branch 'master' into maks/fix-wordpiece-indexer

d9d3438

joelgrus approved these changes Jun 20, 2019

Merge branch 'master' into maks/fix-wordpiece-indexer

53ec5b3

joelgrus merged commit 7e08298 into allenai:master Jun 20, 2019

maksymbevza deleted the maks/fix-wordpiece-indexer branch June 21, 2019 07:44

reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019

Fix wordpiece indexer truncation (allenai#2931)

34ea8c7

