Fix bpe_gpt2_preprocess #5446

Closed · wants to merge 12 commits

Conversation

@bobqianic (Contributor)

This is a byproduct of the work on whisper.cpp. Could anyone run a few tests to verify whether it resolves the issue detailed in #3502? @ggerganov

In the development of whisper.cpp, I've performed tests on a relatively large dataset, and the results have been quite promising. You can find more details about this testing at ggerganov/whisper.cpp#1854.

@ggerganov (Owner)

There is a tokenization test that is failing with this change:

make -j tests && ./tests/test-tokenizer-0-falcon ./models/ggml-vocab-falcon.gguf

...
src: '
 ='
res: '
 ='
tok: 193 204 40 
main : failed test:    '
 ='
main : detokenized to: '
 =' instead of '
 ='
main : expected tokens:   1212,     40, 
main : got tokens:         193,    204,     40, 

I.e., the string \n = used to tokenize to [1212, 40], and now it tokenizes to [193, 204, 40]. Can you confirm which is correct?
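
One way to settle which result is correct: compare against the reference Falcon tokenizer on Hugging Face, which is presumably where the test's expected tokens came from. A minimal sketch, assuming the transformers package and tiiuae/falcon-7b as the reference checkpoint:

```python
# Sketch: ask the reference Falcon tokenizer how "\n =" should tokenize.
# Assumes `transformers` is installed and that tiiuae/falcon-7b is the
# checkpoint the GGUF vocab was converted from; swap in the right one if not.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
ids = tok.encode("\n =")
print(ids)                              # reference token ids
print(tok.convert_ids_to_tokens(ids))  # the corresponding token strings
```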

@bobqianic

This comment was marked as resolved.

@bobqianic (Contributor, Author) commented Feb 11, 2024

It's odd, but this PR seems to undo the bug fix introduced in PR #3567, which aimed to address the issue reported in Issue #3502.

I'll check whether reverting to the version before PR #3567 eliminates the error encountered in whisper.cpp.

Edit: After rolling back to the version prior to PR #3567, the situation became even worse: the error rate surged to 0.507%.

@bobqianic (Contributor, Author)

We might need some additional code to ensure our tokenizer aligns with the Falcon tokenizer's behavior; the pull request itself doesn't seem to have any major issues. Echoing @apage43's comments in this discussion: to match the Falcon tokenizer exactly, we should develop dedicated functions for that purpose rather than altering the bpe_gpt2_preprocess function. @ggerganov
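
For context, bpe_gpt2_preprocess is meant to reproduce the pre-tokenization split from OpenAI's GPT-2 encoder. A minimal Python sketch of that split, using the third-party regex module (the stdlib re lacks \p{...} classes):

```python
# The GPT-2 pre-tokenization pattern from OpenAI's encoder.py.
# Requires the third-party `regex` module (pip install regex).
import regex

GPT2_SPLIT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# BPE merges never cross pre-token boundaries, so this split determines
# which multi-character tokens can be produced at all.
print(GPT2_SPLIT.findall("\n ="))  # -> ['\n', ' =']
```

Under a faithful GPT-2 split, "\n =" becomes the pre-tokens '\n' and ' =', so no single token can cover both the newline and the following space; a tokenizer whose pre-tokenizer differs, like Falcon's, would need its own split function as suggested above.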

@bobqianic (Contributor, Author) commented Feb 11, 2024

This time I'll run stricter tests: the test set is wikitext1 (500 MiB), roughly 50 times larger than the previous one, and I'll compare the output against OpenAI's tokenizer to verify the correctness of bpe_gpt2_preprocess.

Edit: Great news. No errors were detected!
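
A sketch of that kind of corpus-level check, using tiktoken for OpenAI's published GPT-2 encoding; llama_tokens_for is a hypothetical stand-in for however llama.cpp's tokenizer output is obtained (e.g. bindings, or a dump from the test-tokenizer binaries), and the corpus path is likewise illustrative:

```python
# Sketch: line-by-line comparison against OpenAI's GPT-2 tokenizer.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def llama_tokens_for(line: str) -> list[int]:
    # Hypothetical helper: return llama.cpp's token ids for `line`,
    # e.g. via llama-cpp-python or a dump from the test-tokenizer binaries.
    raise NotImplementedError

mismatches = total = 0
with open("wikitext.txt", encoding="utf-8") as f:  # illustrative corpus path
    for line in f:
        total += 1
        if enc.encode(line) != llama_tokens_for(line):
            mismatches += 1

print(f"{mismatches}/{total} lines mismatched")
```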

@ggerganov (Owner)

Ok, got it. We should, however, fix the test: maybe just use the new set of tokens and add a short comment explaining why there is a difference (or just link to this discussion).

@bobqianic

This comment was marked as resolved.

@cebtenzzre marked this pull request as draft on February 12, 2024 at 16:37
@ggerganov (Owner)

Btw, #5464 claims to fix some issues in the BPE preprocess function - might be useful

@bobqianic (Contributor, Author)

> Btw, #5464 claims to fix some issues in the BPE preprocess function - might be useful

I attempted to build the PR you mentioned, but unfortunately it didn't succeed:

C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(415): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(415): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(419): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(419): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(444): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(444): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(448): error C2015: too many characters in constant
C:\Users\qianp\CLionProjects\testcpp\unicode_v2.h(448): error C2015: too many characters in constant

@bobqianic marked this pull request as ready for review on February 14, 2024 at 01:18
@bobqianic (Contributor, Author)

Good: nearly all of Falcon's tokenizer tests now fail, which is expected. As discussed above, an exact GPT-2 pre-tokenizer cannot match Falcon's, so these failures indicate that our implementation of bpe_gpt2_preprocess is correct.

@bobqianic closed this on Feb 16, 2024