bbpe: Add token LRU cache and avoid string lookups #57

Merged (3 commits) on Apr 12, 2024

danieldk
Contributor

Description

This change combines several performance improvements:

  • Cache token -> ids lookups in an LRU cache of size 8192.
  • Represent merges as a mapping (piece_id, piece_id) -> piece_id rather than a mapping of strings. This speeds up the C++ code considerably, so encoding is faster even when the cache is fresh or ineffective (few hits).
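
The two ideas above can be sketched in a few lines of Python. This is only an illustration, not the code from this PR: the real implementation is byte-level and lives in C++/Cython, whereas this toy works on characters. The vocabulary, the names `PIECE_TO_ID`, `MERGES`, `MERGE_RANK`, `apply_merges` and `encode_token`, and the use of insertion order as merge priority are all assumptions made for the example.

```python
from functools import lru_cache

# Hypothetical toy vocabulary: piece string -> piece id.
PIECE_TO_ID = {"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4}

# Merges keyed by id pairs rather than string pairs:
# (left_piece_id, right_piece_id) -> merged_piece_id.
MERGES = {
    (0, 1): 3,  # "l" + "o" -> "lo"
    (3, 2): 4,  # "lo" + "w" -> "low"
}

# Assumption for this sketch: earlier entries have higher priority.
MERGE_RANK = {pair: rank for rank, pair in enumerate(MERGES)}


def apply_merges(ids):
    """Repeatedly apply the best-ranked merge using only integer lookups."""
    ids = list(ids)
    while len(ids) > 1:
        best = None  # (rank, position) of the best adjacent pair so far
        for i in range(len(ids) - 1):
            rank = MERGE_RANK.get((ids[i], ids[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no adjacent pair can be merged
        i = best[1]
        # Replace the pair with its merged id; no strings are built.
        ids[i:i + 2] = [MERGES[(ids[i], ids[i + 1])]]
    return ids


@lru_cache(maxsize=8192)  # same capacity as mentioned in the description
def encode_token(token):
    """Map a token to a tuple of piece ids, caching the result per token."""
    initial = [PIECE_TO_ID[ch] for ch in token]
    return tuple(apply_merges(initial))
```

Strings are touched only once per token, when the initial ids are looked up; the merge loop itself compares and hashes integer pairs. Repeated tokens skip the loop entirely through the cache, and `encode_token.cache_info()` reports hits and misses.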

Also add a small benchmark.
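
For readers who want to try a comparison of this kind locally, a micro-benchmark could look roughly like the sketch below. It is not the benchmark added in this PR; `slow_encode` is a hypothetical stand-in for a real encoder, and `bench` and the token list are made up for illustration.

```python
import time
from functools import lru_cache


def slow_encode(token):
    """Hypothetical stand-in for an expensive token -> ids encoder."""
    return tuple(ord(ch) for ch in token)


# The same function wrapped in an LRU cache with 8192 entries.
cached_encode = lru_cache(maxsize=8192)(slow_encode)


def bench(encode, tokens, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for token in tokens:
            encode(token)
        best = min(best, time.perf_counter() - start)
    return best


# A small input with many repeated tokens, which favours the cache.
tokens = ["the", "quick", "brown", "fox", "the", "the"] * 1000
uncached = bench(slow_encode, tokens)
cached = bench(cached_encode, tokens)
print(f"uncached: {uncached:.6f}s, cached: {cached:.6f}s")
```

Timings depend on the machine and on how repetitive the input is, so a meaningful run should use text that resembles the real workload.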

I think the merges files could also use a clang-format, but let's do that in a separate PR to keep the diff clean.

Types of change

Performance improvements.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@danieldk danieldk added the enhancement New feature or request label Apr 11, 2024
@danieldk danieldk requested a review from shadeMe April 11, 2024 14:18
Collaborator

@shadeMe shadeMe left a comment

Just a couple of minor fixes.

Review comments on curated_tokenizers/merges.hh (2 threads, outdated and resolved)
danieldk and others added 2 commits April 12, 2024 13:28
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
@danieldk danieldk merged commit 8aba85e into explosion:main Apr 12, 2024
6 checks passed
@danieldk danieldk deleted the feature/bbpe-optimizations branch April 12, 2024 11:48