Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Fix wordpiece indexer truncation #2931

Merged
merged 4 commits into from
Jun 20, 2019

Conversation

maksymbevza
Copy link
Contributor

The problem

Sometimes offsets contain an extra item, because of the wrong condition in the if check.

What happens

Let's consider the case of _truncate_long_sequences = True with one start=[CLS] and end=[SEP] token, max_pieces == 4 and window_length == 2.

Let's say that our sentence is being encoded into following subtokens:
["the", "quick", "##est"]

For this case ["the"] will fit without any cut, but ["quick", "##est"] will not. The old code produced following ending offsets: [1, 3], but wordpieces after the cut are the following ["[CLS]", "the", "quick", "[SEP]"] and position 3 is invalid in this case cause it points to "[SEP]". But sometimes it also fails with index out of bound error, when offset goes beyond the size of the tensor.

The fix

By adding len(token) - 1 to current offset we make sure that the last wordpiece of the current token will fit in the sentence cut on line 208

@joelgrus
Copy link
Contributor

joelgrus commented Jun 7, 2019

thanks for this, I think this looks good, but can you

  • give the tests more descriptive names, and
  • add some comments to them

so that it's more obvious what behavior exactly they're testing and what would cause them to fail?

@kl2806 kl2806 requested a review from joelgrus June 7, 2019 22:28
@maksymbevza
Copy link
Contributor Author

@joelgrus , I've added comments for the tests
PTAL

@matt-gardner
Copy link
Contributor

@joelgrus, ping.

@maksymbevza
Copy link
Contributor Author

@joelgrus Could you please take a look?

Copy link
Contributor

@joelgrus joelgrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, thanks for the fix

@joelgrus joelgrus merged commit 7e08298 into allenai:master Jun 20, 2019
@maksymbevza maksymbevza deleted the maks/fix-wordpiece-indexer branch June 21, 2019 07:44
reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
* Fix wordpiece indexer

* Add comments for test and count pieces accumulated
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants