Minor improvements in GPT2 tokenizer #3567

goerch · 2023-10-10T05:34:23Z

Fixes our bugs in #3502.

@cmp-nct : the small fixes in bpe_gpt2_preprocess could be relevant for you?

ggerganov

Does this make the tokenization of wikitext match with the Python, or is there still the 4-digit tokenization discrepancy?

goerch · 2023-10-10T11:03:25Z

> Does this make the tokenization of wikitext match with the Python, or is there still the 4-digit tokenization discrepancy?

The 4-digit tokenization discrepancy could be a bug in HF fast tokenizers.

ggerganov · 2023-10-10T11:11:07Z

Thank you for the thorough investigation. After @cebtenzzre's review, we can merge this.

apage43 · 2023-10-10T16:53:44Z

The remaining discrepancy from HF is caused by different pretokenization: #3502 (comment)
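To illustrate how a pretokenization difference alone can change the final tokens, here is a minimal sketch. It assumes the well-known GPT-2 split pattern, approximated with Python's standard `re` module (`[^\W\d_]` stands in for `\p{L}`, `\d` for `\p{N}`). The function `pretokenize_chunked_digits` and its 3-digit chunk size are hypothetical, chosen only to show the effect; they are not taken from llama.cpp or from the HF tokenizers code.

```python
import re

# GPT-2 style split pattern, approximated with the standard-library `re`
# module: [^\W\d_] stands in for \p{L} and \d for \p{N}.
GPT2_SPLIT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[^\W\d_]+"              # optional space + run of letters
    r"| ?\d+"                    # optional space + run of digits
    r"| ?(?:[^\s\w]|_)+"         # optional space + run of other symbols
    r"|\s+(?!\S)"                # whitespace not followed by a non-space
    r"|\s+"                      # any remaining whitespace
)


def pretokenize(text):
    """Split text into pretokens before any BPE merges are applied."""
    return GPT2_SPLIT.findall(text)


def pretokenize_chunked_digits(text, max_digits=3):
    """Hypothetical variant: also cut long digit runs into short chunks."""
    out = []
    for piece in pretokenize(text):
        digits = piece.lstrip(" ")
        if digits.isdigit() and len(digits) > max_digits:
            # Keep the leading space (if any) attached to the first chunk.
            prefix = piece[: len(piece) - len(digits)]
            chunks = [
                digits[i:i + max_digits]
                for i in range(0, len(digits), max_digits)
            ]
            chunks[0] = prefix + chunks[0]
            out.extend(chunks)
        else:
            out.append(piece)
    return out


if __name__ == "__main__":
    sample = "In 1999 it's ok"
    print(pretokenize(sample))
    print(pretokenize_chunked_digits(sample))
```

BPE merges never cross pretoken boundaries, so two implementations with identical vocabularies and merge tables can still disagree on a 4-digit number if only one of them splits the digit run first.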

…example

* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggerganov#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggerganov#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggerganov#3592)
  ci : check if there is enough VRAM (ggerganov#3596)
  server : add completion mode (no chat) (ggerganov#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggerganov#3588)
  main : fix session loading bug (ggerganov#3400)
  server : add parameter -tb N, --threads-batch N (ggerganov#3584)
  common : fix mirostat state when using multiple sequences (ggerganov#3543)
  batched : add bench tool (ggerganov#3545)
  examples : add batched.swift + improve CI for swift (ggerganov#3562)
  Add MPT model to supported models in README.md (ggerganov#3574)
  Minor improvements in GPT2 tokenizer (ggerganov#3567)
  readme : add bloom (ggerganov#3570)
  llm : add bloom models (ggerganov#3553)
  swift : improvements and fixes (ggerganov#3564)
  llm : add MPT support (ggerganov#3417)
  infill. : fix tokenization (ggerganov#3508)
  ...

Fixing minor bugs in bpe_gpt2_preprocess

8d0c575

goerch requested a review from ggerganov October 10, 2023 05:34

goerch changed the title ~~Minor improvements for #3502~~ Minor improvements in GPT2 tokenizer Oct 10, 2023

Don't add bos token in test

d0c8d14

goerch requested a review from cebtenzzre October 10, 2023 05:48

ggerganov approved these changes Oct 10, 2023

cebtenzzre removed their request for review October 10, 2023 16:56

goerch merged commit 233fc1c into ggerganov:master Oct 10, 2023
39 checks passed

goerch deleted the fix-3502 branch October 22, 2023 15:20

bobqianic mentioned this pull request Feb 11, 2024

Fix bpe_gpt2_preprocess #5446

Closed
