
Hallucination on silence #1724

Open
pprobst opened this issue Jan 4, 2024 · 23 comments

Labels
bug Something isn't working

@pprobst
Contributor

pprobst commented Jan 4, 2024

Hello! In some experiments, I've noticed that in audio files that have silence at the end (even ~1s of silence), whispercpp sometimes transcribes "bullshit" text from nonexistent speech. This does not happen when I'm using the evaluate/predict functions from transformers, or transcribe from whisperx (although the latter uses VAD), which makes me think there's a parameter or something in whispercpp that may be making it prone to hallucination in these cases. Note that I'm using a converted fine-tuned base model (h5 to ggml).
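
For reference, the h5 → ggml conversion can be done with whisper.cpp's models/convert-h5-to-ggml.py script, roughly as follows (paths are placeholders; the script also needs a local clone of the openai/whisper repo):

python models/convert-h5-to-ggml.py /path/to/hf-finetuned-model /path/to/openai-whisper /path/to/output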

I'm using the latest 1.5.3 version, but this also happened in 1.5.2.

An example below:

λ ./main -f 1635687465_8386435.ogg -l pt -m ../eval/ggml-model.bin -pc

whisper_init_from_file_with_params_no_state: loading model from '../eval/ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050 6GB Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =   147.46 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   14.86 MB
whisper_init_state: compute buffer (encode) =   85.99 MB
whisper_init_state: compute buffer (cross)  =    4.78 MB
whisper_init_state: compute buffer (decode) =   96.48 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |

main: processing '1635687465_8386435.wav' (118886 samples, 7.4 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = pt, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:06.300]   ponto parágrafo planos musculares com aspecto habitual a faixa etária
[00:00:06.300 --> 00:00:36.300]   subcutâneo de l cinco e l cinco e l cinco l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco


whisper_print_timings:     load time =   116.86 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     9.17 ms
whisper_print_timings:   sample time =   325.28 ms /  1212 runs (    0.27 ms per run)
whisper_print_timings:   encode time =   120.70 ms /     2 runs (   60.35 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   555.86 ms /  1208 runs (    0.46 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1176.76 ms

The transcription in [00:00:00.000 --> 00:00:06.300] ponto parágrafo planos musculares com aspecto habitual a faixa etária is correct. But after that there is only about 1 s of silence. After transcribing the first segment, it "hangs" for a second and then hallucinates.

(Note that the audio file passed above is OGG, but in code I convert it to 16 kHz mono WAV with ffmpeg, which is why the log reports a .wav file.)
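
A typical conversion along those lines would be (a sketch; whisper.cpp's main expects 16 kHz mono 16-bit WAV):

ffmpeg -i 1635687465_8386435.ogg -ar 16000 -ac 1 -c:a pcm_s16le 1635687465_8386435.wav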

pprobst changed the title from "Transcription on silence" to "Hallucination on silence" on Jan 4, 2024
@bobqianic
Collaborator

Indeed, I've noticed that as well. I'll need some time to look into it more thoroughly.

bobqianic added the bug (Something isn't working) label on Jan 4, 2024
@pprobst
Contributor Author

pprobst commented Jan 4, 2024

Also: when the audio contains repeated sounds, whispercpp also tends to hallucinate. Example:

Ground-truth: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro"

Prediction: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro ínteg"

@mrfragger

mrfragger commented Jan 4, 2024

I pretty much remove all silence segments from the audio before transcribing, to avoid hallucinations, and also filter out hiss. The command below removes any silence of at least 3 seconds (stop_duration=3):

[ ! -d output ] && mkdir output ; for f in *.mp3 ; do ffmpeg -hide_banner -i "$f" -c:a libopus -b:a 32k -af "silenceremove=start_periods=1:stop_periods=-1:start_threshold=-50dB:stop_threshold=-50dB:start_silence=1:start_duration=0:stop_duration=3:detection=peak,highpass=200,lowpass=3000,afftdn,volume=12dB,dynaudnorm" output/"${f%.*}.opus" ; done

@pprobst
Contributor Author

pprobst commented Jan 7, 2024

Hey guys. I had a good time today benchmarking and comparing different inference backends on the transcription of 3000 Brazilian Portuguese audio files of varying quality. While I had good results in terms of WER (word error rate; lower is better) with HuggingFace's ASR pipeline and whisperX (about 3%), I struggled to achieve acceptable results with faster-whisper or whispercpp, which had a ~4x worse WER (about 13%). Furthermore, activating VAD in faster-whisper had minimal impact.

Then, since whisperX uses faster-whisper for its inference, I compared which parameters differed between them. After some tests, I achieved a 4x reduction in WER in faster-whisper by setting without_timestamps=True. Since my use case has no need for timestamps, this is OK for me.

I proceeded to repeat the same procedure in whispercpp, by changing the following default in whisper_full_default_params from false to true,

/*.no_timestamps =*/ false,

and likewise achieved a 4x reduction in WER, with not a single hallucination like the ones shown above.

I wonder why computing timestamps makes Whisper more prone to hallucinations.
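
For anyone who prefers not to patch the library default, here is a minimal sketch of the equivalent through the C API (model path, language, and the helper name are illustrative; audio loading is elided):

#include "whisper.h"
#include <cstdio>
#include <vector>

// pcmf32: 16 kHz mono float PCM, e.g. decoded with ffmpeg beforehand
int transcribe_no_timestamps(const std::vector<float> & pcmf32) {
    struct whisper_context * ctx = whisper_init_from_file_with_params(
        "ggml-model.bin", whisper_context_default_params());
    if (ctx == nullptr) return 1;

    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.language      = "pt";
    wparams.no_timestamps = true; // skip timestamp decoding entirely

    const int ret = whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size());
    if (ret == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
    return ret;
}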

@pprobst
Contributor Author

pprobst commented Jan 7, 2024

Also: it may be a good idea to make -nt in main.cpp not only skip printing timestamps, but also skip computing them:

wparams.no_timestamps = params.no_timestamps;
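
For context, a sketch of how this would sit in examples/main/main.cpp next to the existing print flag (the surrounding assignment is assumed from the 1.5.x source):

wparams.print_timestamps = !params.no_timestamps; // existing: affects printing only
wparams.no_timestamps    =  params.no_timestamps; // proposed: skip computing them too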

@bobqianic
Collaborator

After some tests, I achieved a 4x reduction in WER in faster-whisper by setting without_timestamps=True

That's really interesting. Have you experimented with OpenAI's official implementation of Whisper? It also generates timestamps.

https://github.com/openai/whisper

@pprobst
Contributor Author

pprobst commented Jan 7, 2024

That's really interesting. Have you experimented with OpenAI's official implementation of Whisper? It also generates timestamps.

https://github.com/openai/whisper

I have not, but it makes sense to experiment with it. I'll probably do it in the next few days.

@ggerganov
Owner

Also: maybe it's a good idea to make it so that -nt in main.cpp not only does not print timestamps, but also does not compute them:

wparams.no_timestamps = params.no_timestamps;

Yes, this should be updated. The reason is that the "do not compute timestamps" option was added only recently; before that, timestamps were always computed but simply not displayed. Now we can disable them properly.

@pprobst
Contributor Author

pprobst commented Jan 8, 2024

I still have to figure out how to load my fine-tuned model using the official OpenAI implementation. Still, preliminary results on the same dataset using the multilingual base model showed that setting word_timestamps=False and without_timestamps=True when calling the transcribe function improved WER from 64% to 54%.

@Sing303

Sing303 commented Jan 15, 2024

If you set the context to 0, does the problem go away? Parameter: -mc 0
For me, the problems disappear. Maybe timestamps get into the context and break the "brain" of the model?
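
For reference, that flag as passed to main (file and model names are placeholders):

λ ./main -f audio.wav -m ggml-model.bin -l pt -mc 0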

@pprobst
Contributor Author

pprobst commented Jan 15, 2024

If you set the context to 0, does the problem go away? Parameter: -mc 0. For me, the problems disappear. Maybe timestamps get into the context and break the "brain" of the model?

It does not solve the issue, and the WER increases slightly. I tried a ton of parameters, and the only one that solved the issue was completely disabling timestamps.

@Sing303

Sing303 commented Jan 16, 2024

@pprobst Could you provide a link to the file you are testing this problem on?

@pprobst
Contributor Author

pprobst commented Jan 16, 2024

@pprobst Could you provide a link to the file you are testing this problem on?

Unfortunately, it's a private dataset that I have no permission to share 🫠
Although I have not replicated the experiment in other datasets, I believe the drop in accuracy when computing timestamps can occur in any dataset.

@bobqianic
Collaborator

Give my latest PR #1768 a try. It's still a WIP, but if you compile it yourself, it should significantly reduce the hallucinations towards the end of the audio file.

@jettoblack

@bobqianic I'm trying this new build now, and maybe it is better at the end, but I still see many hallucinations when there are long, completely silent gaps in the middle of files: whisper.cpp just repeats the previous segment over and over, with a 2-3 s duration each time, until the speech resumes. I have samples I can send you privately via email/discord/etc., but I'd rather not post them on a public site, if that's OK with you. If necessary, I'll try to come up with some public samples that reproduce the issue.

@bobqianic
Collaborator

@bobqianic I'm trying this new build now, and maybe it is better at the end, but I still see many hallucinations when there are long, completely silent gaps in the middle of files […]

Discord: bob20231894

@jettoblack

@bobqianic I'm trying this new build now, and maybe it is better at the end, but I still see many hallucinations when there are long, completely silent gaps in the middle of files […]

Discord: bob20231894

OK, thanks. I sent you a friend request on Discord.

@mrfragger

openai/whisper#1962
Two PRs on openai/whisper, #1808 and #1963, seem the most promising with regard to drastically reducing hallucinations.

@bygreencn

@ggerganov any plans to implement #1838 (Skip silence around hallucinations)?

@bobqianic
Collaborator

@ggerganov any plans to implement #1838 (Skip silence around hallucinations)?

#1768 (comment)

@bobqianic
Collaborator

bobqianic commented Feb 10, 2024

Hey guys. I had a good time today benchmarking and comparing different inference backends […]

I wonder why computing timestamps makes Whisper more prone to hallucinations.

It's likely true. This is because the approach Whisper uses to transcribe audio with and without timestamps varies significantly. When transcribing without timestamps, it processes the audio in 30-second segments, sequentially moving from one chunk to the next. However, when transcribing with timestamps, it operates differently. It first determines whether a segment is complete. If so, it proceeds to the next 30-second segment. If not, it adjusts its position based on the last timestamp token before resuming transcription. For instance, say there's a 30-second segment, and the decoder encounters ...[TT_1264] (incomplete). Since timestamp tokens advance in 0.02-second steps, TT_1264 corresponds to 1264 × 0.02 = 25.28 seconds. So instead of transcribing from 30 to 60 seconds, it would adjust to start at 25.28 seconds within the segment and then transcribe from 25.28 to 55.28 seconds.

This is likely to result in repetition. Additionally, we must now include timestamp tokens in our context, which is sized at 448; half of this is reserved for the prompt, limiting the longest sequence we can generate to 224 tokens. Consequently, the amount of actual text that fits within the context window is reduced, leading to diminished performance.
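
A toy illustration of the seek rule described above (not whisper.cpp's actual code; the token value is the one from the example):

#include <cstdio>

int main() {
    const double token_step_s = 0.02; // each Whisper timestamp token advances 0.02 s
    const double window_s     = 30.0; // Whisper decodes audio in 30 s windows

    double seek_s = 0.0;      // start of the current window
    const int last_tt = 1264; // last timestamp token of an incomplete segment (TT_1264)

    // A complete segment would advance a full window: 0 -> 30 s, then 30 -> 60 s.
    // An incomplete segment instead resumes at the last timestamp:
    seek_s += last_tt * token_step_s; // 1264 * 0.02 = 25.28 s

    printf("next window: %.2f -> %.2f s\n", seek_s, seek_s + window_s); // 25.28 -> 55.28 s
    return 0;
}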

@pprobst
Contributor Author

pprobst commented Feb 10, 2024

Very interesting! I'm thankful you took the time to investigate this further.

@RazeBerry

Same problem here! Whispercpp (I am not sure about regular Whisper) has substantial difficulty picking up a conversation after a long period of silence!

pprobst added a commit to iarahealth/whisper.cpp that referenced this issue Apr 8, 2024