Gpt2 tokenizer does not support different vocab_size #148

Closed
rektomar opened this issue Aug 16, 2023 · 1 comment
Comments

rektomar commented Aug 16, 2023

I want to reduce vocab_size for the gpt2 tokenizer, but the loaded tokenizer still has the full vocabulary size.

```julia
using Transformers.HuggingFace

config = hgf"gpt2:config"
vocab_size = 2053
new_config = HuggingFace.HGFConfig(config; vocab_size = vocab_size, bos_token_id = vocab_size - 1, eos_token_id = vocab_size - 1)
te = HuggingFace.load_tokenizer("gpt2"; config = new_config)
```

```julia
julia> te.vocab
Vocab{String, SizedArray}(size = 50257, unk = <unk>, unki = 0)
```
chengchingwen (Owner) commented

vocab_size is only used to modify the model's embedding table size; it does not affect the tokenizer. It's also unclear what the correct behavior should be when a smaller (or larger) value is specified.
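
For illustration, here is a minimal sketch of what the vocab_size override is meant for on the model side. It assumes `load_model` accepts a task symbol and a `config` keyword in the same way `load_tokenizer` does, and it does not address how pretrained embedding weights of the original size would be handled:

```julia
# Hedged sketch: the overridden vocab_size targets the model's embedding table,
# not the tokenizer. Assumes `load_model("gpt2", :lmheadmodel; config = ...)`
# is available, mirroring `load_tokenizer`; resizing against the pretrained
# checkpoint may not be supported and is out of scope here.
using Transformers.HuggingFace

config = hgf"gpt2:config"
new_config = HuggingFace.HGFConfig(config; vocab_size = 2053)

model = HuggingFace.load_model("gpt2", :lmheadmodel; config = new_config)
# The token-embedding table would follow new_config.vocab_size (2053 entries),
# while a tokenizer loaded separately still reports all 50257 tokens.
```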

A workaround for reducing the vocabulary size is to create a new tokenizer with a smaller vocabulary by directly copying a subset of the original vocabulary, but this would cause most tokens to become the unknown token. Personally, I think the better approach is to construct/train your own tokenizer with the smaller vocabulary size.
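
To make the subset-copy workaround concrete, here is a rough sketch. The field access `te.vocab.list` and the `TextEncodeBase.Vocab(list, unk)` constructor are assumptions about the current API, and rebuilding a full GPT-2 text encoder around the smaller vocabulary is not shown:

```julia
# Rough sketch of the subset-copy workaround (with the drawback noted above).
# Assumptions: the encoder exposes its token list as `te.vocab.list`, and
# TextEncodeBase provides `Vocab(list, unk)` and `lookup`; verify both against
# your installed versions.
using Transformers.HuggingFace
using TextEncodeBase

te = HuggingFace.load_tokenizer("gpt2")

all_tokens = collect(te.vocab.list)   # full 50257-entry token list
small_tokens = all_tokens[1:2053]     # keep only the first 2053 tokens
small_vocab = TextEncodeBase.Vocab(small_tokens, "<unk>")

# Any token outside the kept subset now maps to the unknown token.
TextEncodeBase.lookup(small_vocab, "hello")
```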
