Performance Xeon #16 (Closed)

ArtyomZemlyak opened this issue Oct 3, 2022 · 7 comments
Labels: performance (CPU and memory usage - results and comparisons)

@ArtyomZemlyak

Performance report.
Meaning of V2 and V3: V2 is the code before this commit, V3 is the code after it.

  • CPU: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
  • Task: 200 s of audio (7 different files of varying quality)

V2 -t

| model | T, s | -t (CPU threads) |
|-------|------|------------------|
| tiny  | 64   | 1  |
| tiny  | 21   | 4  |
| tiny  | 21   | 8  |
| tiny  | 80   | 16 |
| tiny  | 175  | 24 |
| base  | 42   | 8  |
| base  | 93   | 16 |
| small | 110  | 8  |
| small | 190  | 16 |
| large | 420  | 8  |
| large | 537  | 16 |

V3 -t

| model | T, s | -t (CPU threads) |
|-------|------|------------------|
| tiny  | 84   | 1  |
| tiny  | 32   | 4  |
| tiny  | 28   | 8  |
| tiny  | 56   | 16 |
| tiny  | 86   | 24 |
| base  | 58   | 8  |
| base  | 125  | 16 |
| small | 104  | 8  |
| small | 177  | 16 |
| large | 570  | 8  |
| large | 850  | 16 |

V2 parallel

  • Uses parallel jobs in a bash script
  • 7 parallel jobs; -t is specified per job

| model | T, s | -t (CPU threads per job) |
|-------|------|--------------------------|
| tiny  | 17   | 1 |
| tiny  | 9    | 2 |
| tiny  | 5    | 4 |
| base  | 56   | 1 |
| base  | 25   | 2 |
| base  | 16   | 4 |
| small | 155  | 1 |
| small | 86   | 2 |
| small | 53   | 4 |
| large | 788  | 1 |
| large | 428  | 2 |
| large | 260  | 4 |

Encode vs decode time (V2 vs V3), tiny model

V2

  • File 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 452.00 MB
main:     load time =    84.28 ms
main:      mel time =   118.88 ms
main:   sample time =    46.91 ms
main:   encode time =   531.27 ms / 132.82 ms per layer
main:   decode time =  3730.47 ms
main:    total time =  6181.17 ms
  • File 2
main:     load time =    80.49 ms
main:      mel time =    97.64 ms
main:   sample time =    13.85 ms
main:   encode time =   533.10 ms / 133.27 ms per layer
main:   decode time =  1036.91 ms
main:    total time =  2348.79 ms

V3

  • File 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 244.00 MB
main:     load time =   241.68 ms
main:      mel time =   656.11 ms
main:   sample time =  1202.84 ms
main:   encode time =  1736.55 ms / 434.14 ms per layer
main:   decode time =  8354.48 ms
main:    total time = 12211.61 ms
  • File 2
main:     load time =   243.57 ms
main:      mel time =   541.42 ms
main:   sample time =   209.42 ms
main:   encode time =  2901.70 ms / 725.42 ms per layer
main:   decode time =  1588.76 ms
main:    total time =  5501.20 ms
@ArtyomZemlyak (Author)

Interesting performance drop for -t > 8.

@ggerganov (Owner)

Interesting performance drop for -t > 8.

Yes, I've noticed that. I have two guesses:

  • The computation is memory bound, so at some point increasing the number of threads does not help because the memory bandwidth is saturated.
  • There is a problem in my thread synchronization implementation - currently I use "busy-waiting" on atomic variables, which (as you probably noticed) keeps the CPUs at 100% all the time. This is much faster than locking mutexes, but I am not sure whether it has negative side effects for a large number of threads. Needs some investigation.

The last section V3 is surprising - I don't expect the encode time to be different for different files, given that they are the same length. Something is not right there.

The "parallel" idea is very interesting - I never realised that we can split the file in chunks and run multiple whisper.cpp processes in parallel. This might be a very efficient approach for multi-core systems.
Can you provide some more information about your parallel approach? How did you split the audio?

I think we have to provide an offset argument to main so that the transcription can start at a given offset into the audio file.
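
For illustration, a rough sketch of that idea in bash, assuming main gains a time-offset option (and ideally a matching duration option). The flag names --offset-t and --duration used below are assumptions and may not match the actual CLI:

CHUNK_MS=60000                     # 60 s of audio per task
N_TASKS=4
MODEL=../models/ggml-model-tiny.bin
FILE=../audio/long.wav             # hypothetical long input file

# give each background job a different start offset into the same file
for i in $(seq 0 $((N_TASKS - 1))); do
    ./main --language ru -t 4 -m $MODEL -f $FILE \
           --offset-t $((i * CHUNK_MS)) --duration $CHUNK_MS &
done
wait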

@ArtyomZemlyak (Author)

In my previous example it's just parallel jobs in a bash script:

start=$SECONDS

export MODEL=tiny
# export MODEL=base
# export MODEL=small
# export MODEL=large

export THREADS=4
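# launch 7 transcription jobs in the background and wait for all of them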

./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker1.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker2.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker_frag1.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/gokov1.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/gokov2.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/fragmen1t.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/very_bad_sample.wav &

wait

duration=$(( SECONDS - start ))

echo ""
echo "TOTAL_TIME:"
echo $duration

But if we need the same effect on a single real audio file, we can try two approaches:

  1. VAD (voice activity detection): find all chunks where voice exists.
  2. Split the found chunks into smaller chunks (if they are long, > 30 s) and hand them to different processes.

But we also need to synchronize the output timing: remember the start time of each chunk and add it to the timestamps in the resulting output. A rough sketch of this idea follows.
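
A minimal sketch, assuming ffmpeg is available and that some VAD step has already produced the start/length (in seconds) of each voiced segment. The file names and segment values below are made up for illustration:

FILE=../audio/long.wav
MODEL=../models/ggml-model-tiny.bin
SEGMENTS="0:25 40:30 95:20"        # start:length pairs (s) from a VAD tool

i=0
for seg in $SEGMENTS; do
    start=${seg%:*}
    len=${seg#*:}
    # cut the segment into its own 16 kHz mono WAV chunk
    ffmpeg -y -loglevel error -ss "$start" -t "$len" -i "$FILE" -ar 16000 -ac 1 "chunk_$i.wav"
    # remember the chunk's start time so its timestamps can be shifted
    # back onto the original timeline when merging the outputs
    echo "$i $start" >> offsets.txt
    ./main --language ru -t 4 -m "$MODEL" -f "chunk_$i.wav" > "chunk_$i.txt" &
    i=$((i + 1))
done
wait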

@ArtyomZemlyak (Author)

Or we can simply run multiple whisper.cpp processes and transcribe several audio files at the same time. This fits the case where we don't need the fastest recognition of a single file, but want as many audio-seconds recognized per processing-hour as possible (a one-line sketch follows).
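
For that throughput-oriented case, something like GNU parallel (if installed) keeps a fixed number of jobs running without writing out each background command by hand. The paths below are placeholders:

# keep 7 whisper.cpp jobs running at a time, 4 threads each, one per file
parallel -j 7 ./main --language ru -t 4 -m ../models/ggml-model-tiny.bin -f {} ::: ../audio/*.wav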

ggerganov added the performance label (CPU and memory usage - results and comparisons) on Oct 5, 2022
ggerganov added a commit that referenced this issue Oct 7, 2022
Allows to start processing the input audio at some offset from the
beginning. Useful for splitting a long job into multiple tasks.
@kevin01881

@ggerganov Thanks very much sir for making whisper.cpp!! It is pure insanity that I can run a model that requires 12 GB of VRAM on my ultra-slow PC that is pushing 8 years old (i7-5500U). You are a wizard.

This shows how poorly most of today's models are written as far as efficiency goes. It truly makes one wonder what else we could be running on CPUs that currently requires RTX 3090s or even T4/A100s.

So far, I have successfully run on this ancient computer: Facebook Research Demucs (stock, no optimized port), Stable Diffusion (OpenVINO port), and, thanks to your C++ port, now Whisper as well.

@ggerganov (Owner) commented Oct 24, 2022

Hi @kevin01881 and thanks for the kind words.
Btw, testing on AMD CPUs, I find that whisper.cpp performance is comparable to (maybe slightly faster than) the stock PyTorch implementation. Just make sure to run the PyTorch version with the greedy decoder to make the comparison even. I don't have an Intel CPU though, so I am not sure how it compares there.

But yeah, on M1 I think we still have a big edge - probably 2 or 3 times faster (I haven't done a proper benchmark yet).
Probably this will be the case until PyTorch has proper support for Arm processors.

Btw, on this note, someone reported that on the M1 Max it is efficient to split the job into multiple runs with fewer threads [0].
I guess we should have a built-in option in whisper.cpp to split the job into N tasks and run multiple inferences in parallel - similar to what @ArtyomZemlyak did earlier in this thread.

[0] openai/whisper#208 (reply in thread)

@i-am-neo

@ArtyomZemlyak Careful with the output you get when fragmenting audio for parallel inference jobs.
See openai/whisper#440

cc @ggerganov

anandijain pushed a commit to anandijain/whisper.cpp that referenced this issue Apr 28, 2023
Allows to start processing the input audio at some offset from the
beginning. Useful for splitting a long job into multiple tasks.