@jokkebk (Contributor) commented Oct 23, 2022

I had issues on Windows 10 using the premade .bin files so I tried to use models/convert-pt-to-ggml.py to convert my existing OpenAI Whisper .pt files. However, it throws an error:

>python convert-pt-to-ggml.py ..\..\..\models\tiny.en.pt ..\..\..\whisper .
hparams: {'n_mels': 80, 'n_vocab': 51864, 'n_audio_ctx': 1500, 'n_audio_state': 384, 'n_audio_head': 6, 'n_audio_layer': 4, 'n_text_ctx': 448, 'n_text_state': 384, 'n_text_head': 6, 'n_text_layer': 4}
Traceback (most recent call last):
  File "D:\programs\OpenAI-whisper\fork\whisper.cpp\models\convert-pt-to-ggml.py", line 238, in <module>
    tokens = json.load(f)
  File "C:\Users\Joonas\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\Joonas\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 926: character maps to <undefined>

This is likely because Python on my Windows machine does not default to UTF-8 when opening text files, but to the legacy cp1252 code page.
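For anyone who wants to check whether their setup is affected, here is a minimal sketch that prints the encoding `open()` falls back to when no `encoding` argument is given. It assumes a Python version like the 3.10 in the traceback, where that fallback is the locale's preferred encoding; the variable name `default_encoding` is only illustrative.

```python
import locale

# Encoding that open() falls back to when no encoding= argument is passed
# (assumption: a Python version like 3.10, where the locale decides this).
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)
```

If this prints `cp1252` or another legacy code page rather than a UTF-8 variant, reading `vocab.json` without an explicit encoding can fail as shown above.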

The fix was easy: add encoding="utf8" to the open(dir_tokenizer + "/vocab.json", "r") call. It works perfectly. It should fix other Windows setups as well, and I can't see any drawbacks on other systems, since OpenAI Whisper's vocab.json is most definitely UTF-8.
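A small, self-contained sketch of the idea follows. It is not the actual code from convert-pt-to-ggml.py: the helper `load_vocab`, the temporary directory, and the two-entry sample vocabulary are all hypothetical stand-ins. The sample contains a character whose UTF-8 encoding includes byte 0x81, the same byte the traceback complains about.

```python
import json
import os
import tempfile


def load_vocab(path):
    # Pass the encoding explicitly so the result does not depend on the
    # machine's locale (hypothetical helper, not part of the script).
    with open(path, "r", encoding="utf8") as f:
        return json.load(f)


# Hypothetical stand-in for vocab.json: U+0141 is encoded in UTF-8 as the
# bytes 0xC5 0x81, and 0x81 has no mapping in Python's cp1252 codec.
sample = {"\u0141": 0, "hello": 1}
tmp_dir = tempfile.mkdtemp()
vocab_path = os.path.join(tmp_dir, "vocab.json")
with open(vocab_path, "w", encoding="utf8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Reading with an explicit UTF-8 encoding round-trips the data.
tokens = load_vocab(vocab_path)

# Forcing cp1252 simulates the failing Windows default.
try:
    with open(vocab_path, "r", encoding="cp1252") as f:
        json.load(f)
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

print(tokens == sample, cp1252_failed)
```

Because the encoding is stated explicitly, the same call should behave identically on systems that already default to UTF-8.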

This is my first pull request ever, so apologies if my fork+add upstream+create branch+commit+push+pull request process is wrong. :)

@ggerganov (Member) commented

Thank you for the contribution!

> This is my first pull request ever, so apologies if my fork+add upstream+create branch+commit+push+pull request process is wrong. :)

Everything looks fine!

@ggerganov ggerganov merged commit 3d37ad5 into ggml-org:master Oct 23, 2022
anandijain pushed a commit to anandijain/whisper.cpp that referenced this pull request Apr 28, 2023
Add enconding parameter to vocab.json opening to fix errors
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
Add enconding parameter to vocab.json opening to fix errors