Performance Xeon #16 (Closed)

ArtyomZemlyak opened this issue Oct 3, 2022 · 7 comments
Labels: performance (CPU and memory usage - results and comparisons)

@ArtyomZemlyak

Performance report.
Meaning of V2 and V3: V2 is the code before this commit, V3 is the code after it.

  • CPU: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
  • Task: 200 s of audio (7 different files of varying quality)

V2 -t

| model | T, s | -t (CPU threads) |
|-------|------|------------------|
| tiny  | 64   | 1  |
| tiny  | 21   | 4  |
| tiny  | 21   | 8  |
| tiny  | 80   | 16 |
| tiny  | 175  | 24 |
| base  | 42   | 8  |
| base  | 93   | 16 |
| small | 110  | 8  |
| small | 190  | 16 |
| large | 420  | 8  |
| large | 537  | 16 |

V3 -t

| model | T, s | -t (CPU threads) |
|-------|------|------------------|
| tiny  | 84   | 1  |
| tiny  | 32   | 4  |
| tiny  | 28   | 8  |
| tiny  | 56   | 16 |
| tiny  | 86   | 24 |
| base  | 58   | 8  |
| base  | 125  | 16 |
| small | 104  | 8  |
| small | 177  | 16 |
| large | 570  | 8  |
| large | 850  | 16 |

V2 parallel

  • Uses parallel jobs in a bash script
  • 7 parallel jobs; -t is specified per job

| model | T, s | -t (CPU threads per job) |
|-------|------|--------------------------|
| tiny  | 17   | 1 |
| tiny  | 9    | 2 |
| tiny  | 5    | 4 |
| base  | 56   | 1 |
| base  | 25   | 2 |
| base  | 16   | 4 |
| small | 155  | 1 |
| small | 86   | 2 |
| small | 53   | 4 |
| large | 788  | 1 |
| large | 428  | 2 |
| large | 260  | 4 |

Encode vs decode time (V2 vs V3), tiny model

V2

  • File 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 452.00 MB
main:     load time =    84.28 ms
main:      mel time =   118.88 ms
main:   sample time =    46.91 ms
main:   encode time =   531.27 ms / 132.82 ms per layer
main:   decode time =  3730.47 ms
main:    total time =  6181.17 ms
  • File 2
main:     load time =    80.49 ms
main:      mel time =    97.64 ms
main:   sample time =    13.85 ms
main:   encode time =   533.10 ms / 133.27 ms per layer
main:   decode time =  1036.91 ms
main:    total time =  2348.79 ms

V3

  • File 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 244.00 MB
main:     load time =   241.68 ms
main:      mel time =   656.11 ms
main:   sample time =  1202.84 ms
main:   encode time =  1736.55 ms / 434.14 ms per layer
main:   decode time =  8354.48 ms
main:    total time = 12211.61 ms
  • File 2
main:     load time =   243.57 ms
main:      mel time =   541.42 ms
main:   sample time =   209.42 ms
main:   encode time =  2901.70 ms / 725.42 ms per layer
main:   decode time =  1588.76 ms
main:    total time =  5501.20 ms
@ArtyomZemlyak (Author)

Interesting performance drop for -t > 8.

@ggerganov (Owner)

Interesting performance drop for -t > 8.

Yes, I've noticed that. I have two guesses:

  • The computation is memory bound, so at some point increasing the number of threads does not help because the memory bandwidth is saturated.
  • There is a problem in my thread synchronization implementation - currently I use "busy-waiting" on atomic variables, which (as you probably noticed) keeps the CPUs at 100% all the time. This is much faster than locking mutexes, but I am not sure whether it has negative side effects for a large number of threads. Needs some investigation.

The last section V3 is surprising - I don't expect the encode time to be different for different files, given that they are the same length. Something is not right there.

The "parallel" idea is very interesting - I never realised that we can split the file in chunks and run multiple whisper.cpp processes in parallel. This might be a very efficient approach for multi-core systems.
Can you provide some more information about your parallel approach? How did you split the audio?

I think we have to provide an offset argument to main so that the transcription can start at a given offset into the audio file.
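
For illustration, a rough sketch of that idea in bash, assuming main gains a time-offset option (and ideally a matching duration option). The flag names --offset-t and --duration used below are assumptions and may not match the actual CLI:

CHUNK_MS=60000                     # 60 s of audio per task
N_TASKS=4
MODEL=../models/ggml-model-tiny.bin
FILE=../audio/long.wav             # hypothetical long input file

# give each background job a different start offset into the same file
for i in $(seq 0 $((N_TASKS - 1))); do
    ./main --language ru -t 4 -m $MODEL -f $FILE \
           --offset-t $((i * CHUNK_MS)) --duration $CHUNK_MS &
done
wait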

@ArtyomZemlyak (Author)

In my previous example it's just parallel jobs in a bash script:

start=$SECONDS

export MODEL=tiny
# export MODEL=base
# export MODEL=small
# export MODEL=large

export THREADS=4
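# launch 7 transcription jobs in the background and wait for all of them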

./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker1.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker2.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker_frag1.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/gokov1.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/gokov2.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/fragmen1t.wav &
./main  --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/very_bad_sample.wav &

wait

duration=$(( SECONDS - start ))

echo ""
echo "TOTAL_TIME:"
echo $duration

But if we need the same effect on a single real audio file, we can try two approaches:

  1. VAD (voice activity detection): find all chunks where voice exists.
  2. Split the found chunks into smaller chunks (if they are long, > 30 s) and hand them to different processes.

But we also need to synchronize the output timing: remember the start time of each chunk and add it to the timestamps in the resulting output. A rough sketch of this idea follows.
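
A minimal sketch, assuming ffmpeg is available and that some VAD step has already produced the start/length (in seconds) of each voiced segment. The file names and segment values below are made up for illustration:

FILE=../audio/long.wav
MODEL=../models/ggml-model-tiny.bin
SEGMENTS="0:25 40:30 95:20"        # start:length pairs (s) from a VAD tool

i=0
for seg in $SEGMENTS; do
    start=${seg%:*}
    len=${seg#*:}
    # cut the segment into its own 16 kHz mono WAV chunk
    ffmpeg -y -loglevel error -ss "$start" -t "$len" -i "$FILE" -ar 16000 -ac 1 "chunk_$i.wav"
    # remember the chunk's start time so its timestamps can be shifted
    # back onto the original timeline when merging the outputs
    echo "$i $start" >> offsets.txt
    ./main --language ru -t 4 -m "$MODEL" -f "chunk_$i.wav" > "chunk_$i.txt" &
    i=$((i + 1))
done
wait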

@ArtyomZemlyak (Author)

Or we can simply run multiple whisper.cpp processes and transcribe several audio files at the same time. This fits the case where we don't need the fastest recognition of a single file, but want as many audio-seconds recognized per processing-hour as possible (a one-line sketch follows).
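
For that throughput-oriented case, something like GNU parallel (if installed) keeps a fixed number of jobs running without writing out each background command by hand. The paths below are placeholders:

# keep 7 whisper.cpp jobs running at a time, 4 threads each, one per file
parallel -j 7 ./main --language ru -t 4 -m ../models/ggml-model-tiny.bin -f {} ::: ../audio/*.wav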

ggerganov added the performance label (CPU and memory usage - results and comparisons) on Oct 5, 2022
ggerganov added a commit that referenced this issue Oct 7, 2022
Allows to start processing the input audio at some offset from the
beginning. Useful for splitting a long job into multiple tasks.
@kevin01881

@ggerganov Thanks very much sir for making whisper.cpp!! It is pure insanity that I can run a model that requires 12 GB of VRAM on my ultra-slow PC that is pushing 8 years old (i7-5500U). You are a wizard.

This shows how poorly most of today's models are written as far as efficiency goes. It truly makes one wonder what else we could be running on CPUs that currently requires RTX 3090s or even T4/A100s.

So far, I have successfully run on this ancient computer: Facebook Research Demucs (stock, no optimized port), Stable Diffusion (OpenVINO port), and, thanks to your C++ port, now Whisper as well.

@ggerganov (Owner) commented Oct 24, 2022

Hi @kevin01881 and thanks for the kind words.
Btw, testing on AMD CPUs, I find that whisper.cpp performance is comparable to (maybe slightly faster than) the stock PyTorch implementation. Just make sure to run the PyTorch version with the greedy decoder to make the comparison even. I don't have an Intel CPU though, so I am not sure how it compares there.

But yeah, on M1 I think we still have a big edge - probably 2 or 3 times faster (I haven't done a proper benchmark yet).
Probably this will be the case until PyTorch has proper support for Arm processors.

Btw, on this note, someone reported that on the M1 Max it is efficient to split the job into multiple runs with fewer threads [0].
I guess we should have a built-in option in whisper.cpp to split the job into N tasks and run multiple inferences in parallel - similar to what @ArtyomZemlyak did earlier in this thread.

[0] openai/whisper#208 (reply in thread)

@i-am-neo

@ArtyomZemlyak Careful with the output you get when fragmenting audio for parallel inference jobs.
See openai/whisper#440

cc @ggerganov

anandijain pushed a commit to anandijain/whisper.cpp that referenced this issue Apr 28, 2023
Allows to start processing the input audio at some offset from the
beginning. Useful for splitting a long job into multiple tasks.