Changed convert-pt-to-ggml.py to use .tiktoken tokenizer files #725

Merged (1 commit) on Apr 14, 2023

ivan-gorin
Contributor

Fixes my issue #724.
This changes how the vocabulary is read and converted to ggml. The loading code is the same as in the original tiktoken library.
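
For readers unfamiliar with the format, here is a minimal sketch of how a .tiktoken vocabulary can be read. It assumes the layout used by the tiktoken library's loader, where each non-empty line holds a base64-encoded token and its integer rank separated by a space. The function name `parse_tiktoken_vocab` and the sample data are hypothetical and are not taken from the script in this PR.

```python
import base64


def parse_tiktoken_vocab(contents: str) -> dict:
    """Parse .tiktoken text: one 'base64(token) rank' pair per line."""
    vocab = {}
    for line in contents.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        # Tokens are raw bytes, so they are stored base64-encoded in the file.
        vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab


# Tiny hand-made sample, not taken from a real vocabulary file.
sample = "\n".join(
    f"{base64.b64encode(tok).decode()} {i}"
    for i, tok in enumerate([b"!", b"he", b" world"])
)
vocab = parse_tiktoken_vocab(sample)
```

Keeping the tokens as bytes avoids any lossy round trip through text for tokens that are not valid UTF-8 on their own.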

The resulting ggml files match the ones downloaded from Hugging Face. The multilingual model is exactly the same; the English-only model differs by 17 bytes because of a difference in Whisper's vocab files.

-rw-rw-r--  1 ivan ivan 77691713 Mar 22 11:09 ggml-tiny.bin
-rw-rw-r--  1 ivan ivan 77691713 Apr  6 06:13 ggml-tiny.bin.new
-rw-rw-r--  1 ivan ivan 77704715 Mar 22 11:08 ggml-tiny.en.bin
-rw-rw-r--  1 ivan ivan 77704698 Apr  6 06:14 ggml-tiny.en.bin.new
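
Equal sizes alone do not prove two files are identical, so a byte-wise comparison is a reasonable extra check. The sketch below uses Python's standard `filecmp` module. The helper name `same_bytes` and the temporary demo files are hypothetical stand-ins for the real model files.

```python
import filecmp
import os
import tempfile


def same_bytes(path_a: str, path_b: str) -> bool:
    """Return True only if both files have identical contents."""
    # shallow=False forces a content comparison instead of trusting os.stat().
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demo on small temporary files standing in for the real .bin files.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.bin")
    b = os.path.join(tmp, "b.bin")
    c = os.path.join(tmp, "c.bin")
    for path, data in ((a, b"ggml-demo"), (b, b"ggml-demo"), (c, b"ggml-DEMO")):
        with open(path, "wb") as f:
            f.write(data)
    identical = same_bytes(a, b)  # same contents
    different = same_bytes(a, c)  # same size, different contents
```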

The old vocab.json contains 50257 tokens, the last being <|endoftext|> at index 50256. The new gpt2.tiktoken contains only 50256 tokens; <|endoftext|> has been removed. That accounts for the difference: 4 bytes for the int storing the string length + 13 bytes for the string itself = 17 bytes. It doesn't seem to affect the model output. I'm not sure why this token was in vocab.json previously, since it is probably a special token.
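
The arithmetic above can be checked with a short sketch. It assumes, as the description implies, that each vocabulary entry is written as a 4-byte integer length followed by the raw token bytes. The helper `write_token` and the explicit little-endian `"<i"` format are illustrative assumptions, not the converter's actual code.

```python
import io
import struct


def write_token(out, token: bytes) -> int:
    """Write a length-prefixed token and return the number of bytes written."""
    header = struct.pack("<i", len(token))  # 4-byte length (assumed layout)
    out.write(header)
    out.write(token)
    return len(header) + len(token)


eot = "<|endoftext|>".encode("utf-8")  # 13 bytes
buf = io.BytesIO()
written = write_token(buf, eot)

# Size difference between the old and new ggml-tiny.en.bin listed above.
size_diff = 77704715 - 77704698
```

Here `written` and `size_diff` are both 17, which is consistent with the single missing token explaining the entire difference.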

I have only tested with the tiny models, but the only change is in the tokenizer, so it should work for all the others.

Owner

@ggerganov ggerganov left a comment

Tested with medium and it produces the same model as before.
Thank you very much!

@ggerganov ggerganov merged commit 62b51c3 into ggerganov:master Apr 14, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023