Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply tokenizer truncation before post-processor #307

Merged
merged 1 commit into from
Dec 14, 2023

Conversation

jonatanklosko
Copy link
Member

Closes #306.

Comment on lines +28 to +31
Tokenizer.set_truncation(tokenizer,
max_length: upper_bound_length,
direction: opts[:truncate_direction]
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we apply the tokenizer it does:

  1. Truncate the sequence if configured (accounting for the number of tokens added by 2.)
  2. Applies post-processor (in particular templating that adds leading trailing tokens)
  3. Applies padding if configured

So for example, lets say we have sequence that tokenizes to [10, 20, 30, 40] and the processor that adds 0 at the beginning and 1 at the end. And let's say we want to enforce length 4. If we tokenize without truncation, it's going to give us [0, 10, 20, 30, 40, 1], and then we manually truncate to [0, 10, 20, 30], which looses the added end token. On the other hand if we configure truncation it's going to compute [10, 20, 30, 40], then first truncate to 4 - 2 addeditional tokens, so [10, 20], and then add the tokens [0, 10, 20, 1].


Now the issue is that Tokenizer.set_truncation copies the underlying tokenizer. It is relatively cheap (bumping some refcounts and copying added_vocabulary, not the full vocabulary). That said ideally we would configure the tokenizer once. So what we can do deprecate options like :length, :truncate_direction, :pad_direction, and instead have a separate function to configure that. @josevalim wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you prefer!

@jonatanklosko jonatanklosko merged commit 93b1430 into main Dec 14, 2023
2 checks passed
@jonatanklosko jonatanklosko deleted the jk-tokenizer-truncation branch December 14, 2023 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Token Truncation differs from Transformers implementation
2 participants