Fix bpe_gpt2_preprocess #5446

Closed · wants to merge 12 commits

Conversation

@bobqianic (Contributor)

This is a byproduct of the work on whisper.cpp. Could anyone run a few tests to verify whether it resolves the issue detailed in #3502? @ggerganov

In the development of whisper.cpp, I've performed tests on a relatively large dataset, and the results have been quite promising. You can find more details about this testing at ggerganov/whisper.cpp#1854.

@ggerganov (Owner)

There is a tokenization test that is failing with this change:

make -j tests && ./tests/test-tokenizer-0-falcon ./models/ggml-vocab-falcon.gguf

...
src: '
 ='
res: '
 ='
tok: 193 204 40 
main : failed test:    '
 ='
main : detokenized to: '
 =' instead of '
 ='
main : expected tokens:   1212,     40, 
main : got tokens:         193,    204,     40, 

I.e., the string \n = used to tokenize to [1212, 40], and now it tokenizes to [193, 204, 40]. Can you confirm which is correct?
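
One way to settle which result is correct: compare against the reference Falcon tokenizer on Hugging Face, which is presumably where the test's expected tokens came from. A minimal sketch, assuming the transformers package and tiiuae/falcon-7b as the reference checkpoint:

```python
# Sketch: ask the reference Falcon tokenizer how "\n =" should tokenize.
# Assumes `transformers` is installed and that tiiuae/falcon-7b is the
# checkpoint the GGUF vocab was converted from; swap in the right one if not.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
ids = tok.encode("\n =")
print(ids)                              # reference token ids
print(tok.convert_ids_to_tokens(ids))  # the corresponding token strings
```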

@bobqianic

This comment was marked as resolved.

@bobqianic (Contributor, Author) commented Feb 11, 2024

It's odd, but this PR seems to undo the bug fix introduced in PR #3567, which aimed to address the issue reported in Issue #3502.

I'll check whether reverting to the version before PR #3567 eliminates the error encountered in whisper.cpp.

Edit: After rolling back to the version prior to PR #3567, the situation became even worse: the error rate surged to 0.507%.

@bobqianic (Contributor, Author)

We might need some additional code to ensure our tokenizer aligns with the Falcon tokenizer's behavior; the pull request itself doesn't seem to have any major issues. Echoing @apage43's comments in this discussion: to match the Falcon tokenizer exactly, we should develop dedicated functions for that purpose rather than altering the bpe_gpt2_preprocess function. @ggerganov
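
For context, bpe_gpt2_preprocess is meant to reproduce the pre-tokenization split from OpenAI's GPT-2 encoder. A minimal Python sketch of that split, using the third-party regex module (the stdlib re lacks \p{...} classes):

```python
# The GPT-2 pre-tokenization pattern from OpenAI's encoder.py.
# Requires the third-party `regex` module (pip install regex).
import regex

GPT2_SPLIT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# BPE merges never cross pre-token boundaries, so this split determines
# which multi-character tokens can be produced at all.
print(GPT2_SPLIT.findall("\n ="))  # -> ['\n', ' =']
```

Under a faithful GPT-2 split, "\n =" becomes the pre-tokens '\n' and ' =', so no single token can cover both the newline and the following space; a tokenizer whose pre-tokenizer differs, like Falcon's, would need its own split function as suggested above.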

@bobqianic (Contributor, Author) commented Feb 11, 2024

This time I'll run stricter tests: the test set is wikitext1 (500 MiB), roughly 50 times larger than the previous one, and I'll compare the output against OpenAI's tokenizer to verify the correctness of bpe_gpt2_preprocess.

Edit: Great news. No errors were detected!
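
A sketch of that kind of corpus-level check, using tiktoken for OpenAI's published GPT-2 encoding; llama_tokens_for is a hypothetical stand-in for however llama.cpp's tokenizer output is obtained (e.g. bindings, or a dump from the test-tokenizer binaries), and the corpus path is likewise illustrative:

```python
# Sketch: line-by-line comparison against OpenAI's GPT-2 tokenizer.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def llama_tokens_for(line: str) -> list[int]:
    # Hypothetical helper: return llama.cpp's token ids for `line`,
    # e.g. via llama-cpp-python or a dump from the test-tokenizer binaries.
    raise NotImplementedError

mismatches = total = 0
with open("wikitext.txt", encoding="utf-8") as f:  # illustrative corpus path
    for line in f:
        total += 1
        if enc.encode(line) != llama_tokens_for(line):
            mismatches += 1

print(f"{mismatches}/{total} lines mismatched")
```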

@ggerganov (Owner)

Ok, got it. We should, however, fix the test: maybe just use the new set of tokens and add a short comment explaining why there is a difference (or just link to this discussion).

@bobqianic

This comment was marked as resolved.

@cebtenzzre marked this pull request as draft on February 12, 2024 at 16:37
@ggerganov (Owner)

Btw, #5464 claims to fix some issues in the BPE preprocess function - might be useful

@bobqianic (Contributor, Author)

> Btw, #5464 claims to fix some issues in the BPE preprocess function - might be useful

I attempted to build the PR you mentioned, but unfortunately it didn't succeed:

C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(415): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(415): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(419): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(419): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(444): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(444): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(448): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(448): error C2015: too many characters in constant

@bobqianic marked this pull request as ready for review on February 14, 2024 at 01:18
@bobqianic (Contributor, Author)

Good: nearly all of Falcon's tokenizer tests now fail, which is expected. As discussed above, an exact GPT-2 pre-tokenizer cannot match Falcon's, so these failures indicate that our implementation of bpe_gpt2_preprocess is correct.

@bobqianic closed this on Feb 16, 2024