Language selection #2

Closed
ArtyomZemlyak opened this issue Sep 26, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@ArtyomZemlyak

Thank you for sharing this implementation.
It gives a steep increase in performance relative to torch on the CPU.

You may already know this, but I found how to enable recognition of a specific language.
We can simply put this at line 2012 of main.cpp:

std::vector<whisper_vocab::id> prompt = { vocab.token_sot, vocab.token_lang, vocab.token_task };  

These three tokens are formed here:
https://github.com/openai/whisper/blob/8cf36f3508c9acd341a45eb2364239a3d81458b9/whisper/tokenizer.py#L324-L331

For a specific use in main.cpp, you can simply hard-code the desired language token index. But for regular users, it would be great to have a way to specify which language they want to see in the output.
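As a rough illustration of the three-token prompt described above, here is a minimal sketch. It is not code from whisper.cpp: the helper names (`lang_token`, `build_prompt`), the numeric token ids, and the language ordering are assumptions based on the multilingual Whisper vocabulary and should be checked against the tokenizer linked above.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using token_id = int;

// Assumed ids for the multilingual Whisper vocabulary (not taken from this thread).
constexpr token_id k_token_sot        = 50258; // <|startoftranscript|>
constexpr token_id k_token_translate  = 50358; // <|translate|>
constexpr token_id k_token_transcribe = 50359; // <|transcribe|>

// Hypothetical helper: language tokens are assumed to directly follow
// <|startoftranscript|>, in the order of the tokenizer's language list.
// Only the first few languages are listed here.
inline token_id lang_token(const std::string & lang) {
    static const std::map<std::string, int> k_lang_index = {
        {"en", 0}, {"zh", 1}, {"de", 2}, {"es", 3},
        {"ru", 4}, {"ko", 5}, {"fr", 6}, {"ja", 7},
    };
    const auto it = k_lang_index.find(lang);
    if (it == k_lang_index.end()) {
        throw std::invalid_argument("unknown language: " + lang);
    }
    return k_token_sot + 1 + it->second;
}

// Builds the initial decoder prompt: { sot, language, task }.
inline std::vector<token_id> build_prompt(const std::string & lang, bool translate) {
    const std::vector<token_id> prompt = {
        k_token_sot,
        lang_token(lang),
        translate ? k_token_translate : k_token_transcribe,
    };
    assert(prompt.size() == 3);
    return prompt;
}
```

Under these assumptions, `build_prompt("fr", true)` would ask the model to translate French audio, while `build_prompt("fr", false)` would ask for a French transcript.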

@ggerganov added the enhancement (New feature or request) label Sep 26, 2022
@ggerganov
Owner

ggerganov commented Sep 26, 2022

Thanks for looking into this. I will definitely add support for language selection.
I wasn't 100% sure if it's just the starting tokens that I need to modify to make it work with other languages, but it looks like this is the case.

I will probably add a CLI argument to be able to select the language.

Regarding the performance:
Yes, I suppose the speed for the smaller models should be better compared to torch. For the bigger models however, my matrix multiplication cannot match the performance of the original implementation. I measured about 2-3 times slower performance for 1024 x 4096 matrix sizes on my M1 MacBook.

Also, my sampling strategy is very basic - this is another thing that makes whisper.cpp go faster, but of course, the results won't be as good as a proper beam search implementation.
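For readers unfamiliar with the term, a very basic sampling strategy usually means greedy decoding: at every step, take the single most likely token. The sketch below shows that idea only; `greedy_sample` is a hypothetical name and this is not the sampling code from whisper.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// One greedy decoding step: return the index (token id) of the largest logit.
// A beam search would instead keep several candidate sequences alive and
// compare their accumulated scores, which costs more compute per step.
inline std::size_t greedy_sample(const std::vector<float> & logits) {
    assert(!logits.empty());
    const auto best = std::max_element(logits.begin(), logits.end());
    return static_cast<std::size_t>(std::distance(logits.begin(), best));
}
```

Greedy decoding needs only one decoder pass per token, which is part of why it is fast, but it cannot recover from an early wrong choice the way a beam search can.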

@ArtyomZemlyak
Author

I can share the latest tests.

Accuracy:

  • Little difference on "ideal" recordings.
  • But a rather strong degradation on poor recordings with a lot of noise.

Device

  • CPU: i7 11800H
  • GPU: RTX 3080 Laptop
  • Python: 3.8.12
  • torch: 1.8.2

Torch

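| Model  | Time, s | CPU/GPU | RAM, GB | VRAM, GB | DISK, GB |
|--------|---------|---------|---------|----------|----------|
| tiny   | 488     | CPU     | 0.5     |          | 0.074    |
| base   | 564     | CPU     | 1       |          | 0.142    |
| small  | *3      | CPU     | 2.5     |          | 0.472    |
| medium | *20     | CPU     | 6       |          | 1.492    |
| large  | *30     | CPU     | 10      |          | 3.014    |
| tiny   | 24      | GPU     | 3.4     | 2.7      | 0.074    |
| base   | 29      | GPU     | 3.4     | 2.7      | 0.142    |
| small  | 41      | GPU     | 3.6     | 3.5      | 0.472    |
| medium | 89      | GPU     | 4.3     | 6.1      | 1.492    |
| large  | -       | GPU     | -       | -        | 3.014    |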

C++ ggml

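| Model | Time, s | CPU/GPU | RAM, GB | VRAM, GB | DISK, GB |
|-------|---------|---------|---------|----------|----------|
| tiny  | 18      | CPU     | 0.4     |          | 0.076    |
| large | 520     | CPU     | 2.5     |          | 3.022    |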

@ggerganov
Owner

Much appreciated!

A few things:

  • Your CPU has 8 cores, so make sure to use -t 8 to run whisper.cpp with 8 threads. By default it uses 4.
  • What does the *30 mean for the large / CPU run with Torch?
  • How long is the audio recording that you used?
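To illustrate the thread-count point, here is a small sketch of how a default of 4 threads and an explicit `-t` value could be reconciled with the hardware. The function names and the choice to cap the value at the reported hardware thread count are assumptions for illustration, not the logic used by whisper.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// requested <= 0 means no -t value was given, so the default is used.
// hw_threads == 0 means the hardware thread count is unknown, so no cap is applied.
inline int pick_n_threads(int requested, unsigned hw_threads, int default_threads = 4) {
    assert(default_threads >= 1);
    const int wanted = requested > 0 ? requested : default_threads;
    if (hw_threads == 0) {
        return wanted;
    }
    return std::max(1, std::min(wanted, static_cast<int>(hw_threads)));
}

// Convenience wrapper that queries the machine it runs on.
inline int pick_n_threads_auto(int requested) {
    return pick_n_threads(requested, std::thread::hardware_concurrency());
}
```

On a CPU that reports 16 hardware threads, `pick_n_threads(0, 16)` gives 4 and `pick_n_threads(8, 16)` gives 8, which mirrors the difference between the default run and a run with -t 8.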

@ArtyomZemlyak
Author

  • Yes, -t 8 was used for this test.
  • Time = 488*3, 488*20, 488*30. I did not want to wait for these tests to finish, since it was already clear that processing was very slow, so I just noted roughly how many times slower each model is than the tiny model.
  • 7 files, 200 seconds in total (for ggml, I subtracted the model loading time from this result).
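The numbers in this exchange can be put on a common scale with a real-time factor (processing time divided by audio duration). The sketch below is only worked arithmetic on the figures quoted in this thread; the function names are made up for illustration.

```cpp
#include <cassert>

// Expands the "*N" shorthand: estimated time as a multiple of the tiny model's time.
inline double estimated_time_s(double tiny_time_s, double multiplier) {
    return tiny_time_s * multiplier;
}

// Real-time factor: values below 1.0 mean faster than real time.
inline double real_time_factor(double processing_s, double audio_s) {
    assert(audio_s > 0.0);
    return processing_s / audio_s;
}
```

With 200 seconds of audio, the ggml tiny run (18 s) has a real-time factor of 0.09 and the ggml large run (520 s) has 2.6, while the "*30" estimate for Torch large on CPU corresponds to 488 * 30 = 14640 s.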

ggerganov added a commit that referenced this issue Sep 28, 2022
- Achieved big performance improvement + memory usage reduction
- Can now translate / transcribe different languages
@ggerganov
Owner

I just added an option to be able to select a language.
For example, the following command will translate French audio to English using the small model:

./download-ggml-model.sh small
./main -m models/ggml-small.bin -f fr0.wav --language fr --translate
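For illustration, flags like the ones in this command could be parsed roughly as follows. This is a sketch, not the argument parser from main.cpp: `cli_params`, `parse_args`, and the default language `en` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct cli_params {
    std::string model;
    std::string file;
    std::string language = "en"; // assumed default
    bool translate = false;      // transcribe unless --translate is given
};

// Parses the flags that appear in the command above; anything else is rejected.
inline cli_params parse_args(const std::vector<std::string> & args) {
    cli_params p;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string & a = args[i];
        // Returns the value following the current flag, or throws if it is missing.
        auto next = [&]() -> const std::string & {
            if (i + 1 >= args.size()) {
                throw std::invalid_argument("missing value for " + a);
            }
            return args[++i];
        };
        if      (a == "-m")          { p.model = next(); }
        else if (a == "-f")          { p.file = next(); }
        else if (a == "--language")  { p.language = next(); }
        else if (a == "--translate") { p.translate = true; }
        else { throw std::invalid_argument("unknown argument: " + a); }
    }
    return p;
}
```

In this sketch, leaving out `--translate` keeps the task as transcription in the selected language.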

Additionally, I was able to reduce the memory usage at runtime by using flash attention and 16-bit float key/value memory. Inference speed has also improved as a result.
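The saving from 16-bit key/value memory can be estimated with simple arithmetic: the cache holds one key and one value vector per layer and per position, so halving the element size halves the cache. The sketch below only covers that part (not flash attention), and the layer, context, and state sizes used in the example are assumed values for a large decoder, not figures from this thread.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store keys and values for n_layer layers, n_ctx positions,
// and vectors of width n_state, with the given element size in bytes.
inline std::uint64_t kv_cache_bytes(std::uint64_t n_layer,
                                    std::uint64_t n_ctx,
                                    std::uint64_t n_state,
                                    std::uint64_t bytes_per_elem) {
    assert(bytes_per_elem == 2 || bytes_per_elem == 4); // f16 or f32
    return 2 * n_layer * n_ctx * n_state * bytes_per_elem; // factor 2: K and V
}
```

With the assumed sizes of 32 layers, 448 positions, and a state width of 1280, this gives 146,800,640 bytes for 32-bit floats and 73,400,320 bytes for 16-bit floats.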

@ArtyomZemlyak
Author

New test on the same device.

| Model | Time, s | CPU/GPU |
|-------|---------|---------|
| tiny  | 16      | CPU     |
| base  | 32      | CPU     |
| large | 472     | CPU     |

That is roughly 10% faster for the large model.
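The improvement can be checked against the earlier ggml results (18 s for tiny, 520 s for large) with a one-line formula. The helper name below is hypothetical.

```cpp
#include <cassert>

// Relative reduction in processing time, in percent.
inline double percent_faster(double old_s, double new_s) {
    assert(old_s > 0.0);
    return 100.0 * (old_s - new_s) / old_s;
}
```

For the large model, going from 520 s to 472 s is a reduction of about 9.2%; for the tiny model, 18 s to 16 s is about 11.1%.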

@frankiedrake

frankiedrake commented Jan 5, 2023

@ggerganov A language question again :) Would it be possible to specify an array of languages, one per file, so that different files can use different languages?

@ggerganov
Owner

It's relatively easy to add this functionality.
Feel free to open an issue with a feature request.
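One possible shape for such a feature is sketched below: pair each input file with its own language, or reuse a single language for all files. This is purely hypothetical; `pair_files_with_languages` and its rules are assumptions, not an existing whisper.cpp option.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Returns (file, language) pairs. Accepts either one language for all files
// or exactly one language per file; any other combination is rejected.
inline std::vector<std::pair<std::string, std::string>>
pair_files_with_languages(const std::vector<std::string> & files,
                          const std::vector<std::string> & langs) {
    if (langs.size() != 1 && langs.size() != files.size()) {
        throw std::invalid_argument("expected one language, or one language per file");
    }
    std::vector<std::pair<std::string, std::string>> jobs;
    jobs.reserve(files.size());
    for (std::size_t i = 0; i < files.size(); ++i) {
        jobs.emplace_back(files[i], langs.size() == 1 ? langs[0] : langs[i]);
    }
    return jobs;
}
```

Each resulting pair could then be processed with its own language setting.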

anandijain pushed a commit to anandijain/whisper.cpp that referenced this issue Apr 28, 2023
@jacob-salassi jacob-salassi mentioned this issue May 15, 2023
10 tasks
ggerganov added a commit that referenced this issue Aug 27, 2023
* Fix MSVC compile error C3688

Instead of simply using 'add_compile_options(/utf-8)' to address the MSVC compile error C3688, a better approach would be to handle it in a way that prevents passing '/utf-8' to NVCC.

* Significantly improve inference quality

In the function `log_mel_spectrogram_worker_thread`, there's an array out-of-bounds issue occurring during the calculation of complex number moduli. This issue is causing disruptions in the FFT spectrum, which, in turn, is reducing the quality of inference.

* Significantly improve inference quality

At last, I've pinpointed the actual source of the problem. Given that the frequency spectrum generated from real input data is symmetrical around the Nyquist frequency, there's a for-loop within the `log_mel_spectrogram_worker_thread` function that attempts to fold the frequency spectrum. Regrettably, a bug within this for-loop is causing a frame shift in the frequency spectrum. The previous attempt to remedy this, which involved using `fft_size + 1` when calculating the modulus, was merely a band-aid solution and did not address the underlying issue.

* Addressed a few minor issues

Fixed the issue of `fft_out` continuously expanding. Resolved the fallback caused by using 'break' instead of `fft_in[j] = 0`.

* Significantly improve inference quality 

Thanks for your patience everyone. It's finally sorted out. Now, the right side of the FFT spectrum is being flipped over to the left, and the amplitudes at corresponding positions on the left and right are added together (the spectrum on the left needs to be shifted by one position), then the average is calculated. FFT_OUT[0] is no longer discarded, making full use of the limited space to pack in more information.

* Add annotation and performance improvement

* Calculate FFT only when fft_in are not all zero

* Some minor performance improvement

* Fixed a bug impacting inference quality

* The first version after all the analysis is completed.

* Fix some bugs and add debug mode

* Fixed several bugs

* Temporarily disable speed-up mode and add debug mode.

* Add debug mode

* Disable speed-up mode and add debug mode

* Fix CI error (#1)

* Fix error

* Fix error

* Fixed several bugs including [BLANK_AUDIO] problem

* Remove Hard-coded hann window

* Some Final Fix (#2)

* Fix error

* Fix error

* Probably the last commit

* Probably the last commit

* whisper : minor coding style changes

* whisper : remove debug from public API

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
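
The spectrum folding described in the commit message above can be read as follows: for real input, bins k and N - k mirror each other, so the right half is flipped onto the left half and matching positions are averaged, while bin 0 is kept. The sketch below is one interpretation of that description, not the committed code; `fold_spectrum` is a hypothetical name and the input is assumed to hold one power value per FFT bin.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// power[k] holds |X[k]|^2 for k = 0 .. N-1, with N even.
// Returns N/2 + 1 values: bin 0 (DC), the averaged mirrored pairs, and the Nyquist bin.
inline std::vector<float> fold_spectrum(const std::vector<float> & power) {
    const std::size_t n = power.size();
    assert(n >= 2 && n % 2 == 0);
    std::vector<float> out(n / 2 + 1);
    out[0]     = power[0];     // DC has no mirror partner
    out[n / 2] = power[n / 2]; // Nyquist bin mirrors onto itself
    for (std::size_t k = 1; k < n / 2; ++k) {
        // Bin k on the left pairs with bin n - k on the right (offset by one position).
        out[k] = 0.5f * (power[k] + power[n - k]);
    }
    return out;
}
```

Pairing bin k with bin N - 1 - k instead would shift every pair by one position, which is the kind of off-by-one misalignment the commit message describes as a frame shift.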
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023
vonstring pushed a commit to vonstring/whisper.cpp that referenced this issue Nov 7, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this issue Dec 16, 2023
jettoblack pushed a commit to jettoblack/whisper.cpp that referenced this issue Feb 8, 2024
@alpezajosip

Is there a way to set it up so that it doesn't translate to English? I am trying to get it to transcribe Croatian audio to Croatian text, but right now it translates it to English.

Thank you :)
