travismorton
Contributor

This allows a pretrained tokenizer to be loaded with additional special tokens. That makes it possible to create a Whisper tokenizer, which is based on the GPT-2 tokenizer with new special tokens added. Usage looks like:

{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2", additional_special_tokens: ["<|special|>"])
# or directly from file
{:ok, tokenizer} = Tokenizers.Tokenizer.from_file("tokenizer.json", additional_special_tokens: ["<|special|>"])
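To check that a token passed this way was actually registered, something like the following could be run after loading. This is only a sketch: it assumes the `tokenizers` package is installed and that the `"gpt2"` files can be fetched, and it uses the existing `Tokenizers.Tokenizer.token_to_id/2`, `Tokenizers.Tokenizer.encode/2` and `Tokenizers.Encoding.get_tokens/1` functions. The `special` variable and the sample input are just for illustration.

```elixir
# Sketch only: assumes the `tokenizers` package is available and that
# the "gpt2" tokenizer can be downloaded.
alias Tokenizers.{Tokenizer, Encoding}

# Illustrative token, same as in the usage above.
special = "<|special|>"

{:ok, tokenizer} =
  Tokenizer.from_pretrained("gpt2", additional_special_tokens: [special])

# Look up the id that was assigned to the new special token.
IO.inspect(Tokenizer.token_to_id(tokenizer, special), label: "special token id")

# Encode text containing the token and print the resulting token strings.
{:ok, encoding} = Tokenizer.encode(tokenizer, "hello " <> special)
IO.inspect(Encoding.get_tokens(encoding), label: "tokens")
```

If the option took effect, the special token should show up as a single entry in the printed token list instead of being split into pieces.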

Full Whisper tokenizer example here

I'm not sure what the preferred way of passing opts through to the tokenizer is, so I'm interested in feedback there. I left the opts open-ended so that more can be added in the future, for example for adding (non-special) tokens or other values that modify the initial state.

@josevalim josevalim merged commit f0540fc into elixir-nx:main Dec 15, 2022
@josevalim
Contributor

💚 💙 💜 💛 ❤️

@travismorton travismorton deleted the feature/add_special_tokens branch December 15, 2022 16:26