Gpt2 tokenizer does not support different vocab_size #148

Closed
rektomar opened this issue Aug 16, 2023 · 1 comment
Comments

rektomar commented Aug 16, 2023

I want to reduce vocab_size for the gpt2 tokenizer, but the loaded tokenizer still has the full vocabulary size.

```julia
using Transformers.HuggingFace

config = hgf"gpt2:config"
vocab_size = 2053
new_config = HuggingFace.HGFConfig(config; vocab_size = vocab_size, bos_token_id = vocab_size - 1, eos_token_id = vocab_size - 1)
te = HuggingFace.load_tokenizer("gpt2"; config = new_config)
```

```julia
julia> te.vocab
Vocab{String, SizedArray}(size = 50257, unk = <unk>, unki = 0)
```
chengchingwen (Owner) commented

vocab_size is only used to modify the model's embedding table size; it does not affect the tokenizer. It's also unclear what the correct behavior should be when a smaller (or larger) value is specified.
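
For illustration, here is a minimal sketch of what the vocab_size override is meant for on the model side. It assumes `load_model` accepts a task symbol and a `config` keyword in the same way `load_tokenizer` does, and it does not address how pretrained embedding weights of the original size would be handled:

```julia
# Hedged sketch: the overridden vocab_size targets the model's embedding table,
# not the tokenizer. Assumes `load_model("gpt2", :lmheadmodel; config = ...)`
# is available, mirroring `load_tokenizer`; resizing against the pretrained
# checkpoint may not be supported and is out of scope here.
using Transformers.HuggingFace

config = hgf"gpt2:config"
new_config = HuggingFace.HGFConfig(config; vocab_size = 2053)

model = HuggingFace.load_model("gpt2", :lmheadmodel; config = new_config)
# The token-embedding table would follow new_config.vocab_size (2053 entries),
# while a tokenizer loaded separately still reports all 50257 tokens.
```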

A workaround for reducing the vocabulary size is to create a new tokenizer with a smaller vocabulary by directly copying a subset of the original vocabulary, but this would cause most tokens to become the unknown token. Personally, I think the better approach is to construct/train your own tokenizer with the smaller vocabulary size.
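
To make the subset-copy workaround concrete, here is a rough sketch. The field access `te.vocab.list` and the `TextEncodeBase.Vocab(list, unk)` constructor are assumptions about the current API, and rebuilding a full GPT-2 text encoder around the smaller vocabulary is not shown:

```julia
# Rough sketch of the subset-copy workaround (with the drawback noted above).
# Assumptions: the encoder exposes its token list as `te.vocab.list`, and
# TextEncodeBase provides `Vocab(list, unk)` and `lookup`; verify both against
# your installed versions.
using Transformers.HuggingFace
using TextEncodeBase

te = HuggingFace.load_tokenizer("gpt2")

all_tokens = collect(te.vocab.list)   # full 50257-entry token list
small_tokens = all_tokens[1:2053]     # keep only the first 2053 tokens
small_vocab = TextEncodeBase.Vocab(small_tokens, "<unk>")

# Any token outside the kept subset now maps to the unknown token.
TextEncodeBase.lookup(small_vocab, "hello")
```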
