
Group all tokenizers under a single module and configure upfront #310

Merged
jonatanklosko merged 2 commits into main from jk-tokenizer on Dec 15, 2023

Conversation

jonatanklosko
Member

Closes #143.

Instead of having one module per tokenizer (which we generated with a macro, since they were the same except for some defaults), we now have a single Bumblebee.Text.PreTrainedTokenizer module with a :type field, somewhat similar to how we have models with multiple architectures. We use the type to pick the right set of defaults.
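As a minimal sketch of what this looks like for a loaded tokenizer (the repo name and the exact `:type` value are illustrative assumptions, not taken from this PR):

```elixir
# Illustrative sketch: every tokenizer now loads as the same struct, and the
# :type field selects the per-model defaults (repo and :bert value assumed).
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

%Bumblebee.Text.PreTrainedTokenizer{type: type} = tokenizer
# `type` would be an atom such as :bert for a BERT checkpoint
```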

Also, instead of passing options to Bumblebee.apply_tokenizer, they now need to be set on the tokenizer itself via Bumblebee.configure. I added a deprecation notice, and for serving users the change is handled transparently either way. This is primarily a small optimisation, as mentioned in #307 (comment), but it also aligns with how featurizers work.
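A rough usage sketch of the change (the `:length` option and the input text are just examples, not something this PR introduces):

```elixir
texts = ["Hello world"]

# Before (now deprecated): options passed on every call
# inputs = Bumblebee.apply_tokenizer(tokenizer, texts, length: 128)

# After: options set once on the tokenizer via Bumblebee.configure
tokenizer = Bumblebee.configure(tokenizer, length: 128)
inputs = Bumblebee.apply_tokenizer(tokenizer, texts)
```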

jonatanklosko merged commit 8b52612 into main on Dec 15, 2023
2 checks passed
jonatanklosko deleted the jk-tokenizer branch on December 15, 2023 at 16:03
Development

Successfully merging this pull request may close these issues: Simplify tokenizer modules.