v1.3.4

@AbrahamSanders AbrahamSanders released this 11 Apr 07:42
· 6 commits to main since this release
  • codec_bpe.extend_tokenizer now accepts an arbitrary list of tokens, passed via the --additional_special_tokens argument, and appends them to the existing tokenizer's list of additional special tokens.

For example:

python -m codec_bpe.extend_tokenizer \
    --existing_tokenizer mistralai/Mistral-7B-v0.1 \
    --codec_bpe_tokenizer output/encodec_bpe_4cb_30k \
    --additional_special_tokens "<audio>" "</audio>" # optional
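
To illustrate how a space-separated token list like the one above can be read from the command line and appended to an existing list, here is a minimal Python sketch. It is not codec_bpe's actual source: the helper names parse_args and merge_special_tokens, the use of nargs="*", the skipping of duplicates, and the "<pad>" starting token are all assumptions made for illustration.

```python
import argparse


def parse_args(argv):
    """Parse a command line shaped like the example above (hypothetical parser)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--existing_tokenizer", required=True)
    parser.add_argument("--codec_bpe_tokenizer", required=True)
    # nargs="*" collects zero or more space-separated values into a list,
    # so omitting the flag entirely leaves the default empty list.
    parser.add_argument("--additional_special_tokens", nargs="*", default=[])
    return parser.parse_args(argv)


def merge_special_tokens(existing, new):
    """Append new tokens to the existing ones, preserving order.

    Skipping duplicates is an assumption of this sketch, not documented behavior.
    """
    merged = list(existing)
    for token in new:
        if token not in merged:
            merged.append(token)
    return merged


args = parse_args([
    "--existing_tokenizer", "mistralai/Mistral-7B-v0.1",
    "--codec_bpe_tokenizer", "output/encodec_bpe_4cb_30k",
    "--additional_special_tokens", "<audio>", "</audio>",
])

# "<pad>" stands in for whatever special tokens the existing tokenizer already has.
merged = merge_special_tokens(["<pad>"], args.additional_special_tokens)
print(merged)
```

Appending rather than replacing keeps any special tokens the existing tokenizer already defines, which matches the "append" wording in the release note.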