
Group all tokenizers under a single module and configure upfront #310

Merged
jonatanklosko merged 2 commits into main from jk-tokenizer on Dec 15, 2023

Conversation

jonatanklosko
Member

Closes #143.

Instead of having one module per tokenizer (which we generated with a macro, since they were the same except for some defaults), we now have a single Bumblebee.Text.PreTrainedTokenizer module with a :type field, somewhat similar to how we have models with multiple architectures. We use the type to pick the right set of defaults.
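As a minimal sketch of what this looks like for a loaded tokenizer (the repo name and the exact `:type` value are illustrative assumptions, not taken from this PR):

```elixir
# Illustrative sketch: every tokenizer now loads as the same struct, and the
# :type field selects the per-model defaults (repo and :bert value assumed).
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

%Bumblebee.Text.PreTrainedTokenizer{type: type} = tokenizer
# `type` would be an atom such as :bert for a BERT checkpoint
```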

Also, instead of passing options to Bumblebee.apply_tokenizer, they now need to be set on the tokenizer itself via Bumblebee.configure. I added a deprecation notice, and for serving users the change is handled transparently either way. This is primarily a small optimisation, as mentioned in #307 (comment), but it also aligns with how featurizers work.
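A rough usage sketch of the change (the `:length` option and the input text are just examples, not something this PR introduces):

```elixir
texts = ["Hello world"]

# Before (now deprecated): options passed on every call
# inputs = Bumblebee.apply_tokenizer(tokenizer, texts, length: 128)

# After: options set once on the tokenizer via Bumblebee.configure
tokenizer = Bumblebee.configure(tokenizer, length: 128)
inputs = Bumblebee.apply_tokenizer(tokenizer, texts)
```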

jonatanklosko merged commit 8b52612 into main on Dec 15, 2023
2 checks passed
jonatanklosko deleted the jk-tokenizer branch on December 15, 2023 at 16:03
Development

Successfully merging this pull request may close these issues: Simplify tokenizer modules.