Make convert-pt-to-ggml.py backwards compatible with older vocab.json tokenizer files #1001

Merged (2 commits) on Jun 25, 2023

akashmjn
Contributor

Patches the script to detect which type of tokenizer files are present (tiktoken or the older hf_transformers vocab.json) and convert accordingly.
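The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `find_tokenizer`, the `assets_dir` parameter, and the assumed file layout (`<name>.tiktoken` for the newer format, `<name>/vocab.json` for the older one) are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_tokenizer(assets_dir: Path, multilingual: bool) -> Optional[Tuple[str, Path]]:
    """Return (tokenizer_type, path) for the first tokenizer file found, else None.

    Assumed layout (hypothetical, for illustration only):
      <assets_dir>/<name>.tiktoken      newer tiktoken vocabulary
      <assets_dir>/<name>/vocab.json    older hf_transformers vocabulary
    """
    name = "multilingual" if multilingual else "gpt2"
    candidates = [
        ("tiktoken", assets_dir / f"{name}.tiktoken"),
        ("hf_transformers", assets_dir / name / "vocab.json"),
    ]
    # Prefer the newer format, fall back to the older one.
    for kind, path in candidates:
        if path.is_file():
            return kind, path
    return None
```

A caller would branch on the returned type to pick the matching parsing routine, and exit with an error when the result is `None`.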

Mostly borrows from #725 as a reference. For the same reason mentioned in that PR, converted multilingual checkpoints continue to match exactly, while .en checkpoints have a 17-byte difference compared to the ggml files downloaded by whisper.cpp.

The script now also produces exactly the same files regardless of which type of tokenizer files is used for conversion:

-rw-r--r--  1 Akash  staff  487614184 Jun 10 21:30 ggml-small.en.hf.bin
-rw-r--r--  1 Akash  staff  487614184 Jun 10 21:28 ggml-small.en.tiktoken.bin
-rw-r--r--  1 Akash  staff  487601967 Jun 10 21:32 ggml-small.hf.bin
-rw-r--r--  1 Akash  staff  487601967 Jun 10 21:35 ggml-small.tiktoken.bin
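The listing shows matching sizes for each pair, which is consistent with identical output but does not prove it. A byte-level check could look like the sketch below; the helper names `sha256_of` and `same_bytes` are made up for this example and are not part of the conversion script.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(a: Path, b: Path) -> bool:
    """True if both files have the same size and the same SHA-256 digest."""
    return a.stat().st_size == b.stat().st_size and sha256_of(a) == sha256_of(b)
```

For the files above, this would be called as `same_bytes(Path("ggml-small.hf.bin"), Path("ggml-small.tiktoken.bin"))`, and likewise for the `.en` pair.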

ggerganov merged commit 3ec7bff into ggerganov:master on Jun 25, 2023
ggerganov
Owner

Thank you!

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
….json tokenizer files (ggerganov#1001)

* patch checkpoint convert script to keep compatibility with older hf_transformers whisper tokenizer

* typo fix
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
….json tokenizer files (ggerganov#1001)

* patch checkpoint convert script to keep compatibility with older hf_transformers whisper tokenizer

* typo fix