Performance Xeon #16
Interesting performance drop for t > 8
Yes, I've noticed that. I have 2 guesses:
The "parallel" idea from the last section is very interesting - I never realised that we could split the file into chunks and run multiple jobs. I think we have to provide an …
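For context, the drop for t > 8 can be reproduced with a simple sweep over thread counts. This is only a sketch: the model and audio paths are reused from the examples in this thread, and the actual `./main` invocation is commented out so the loop runs anywhere.

```shell
# Sweep thread counts and report wall-clock time for each run.
# Uncomment the ./main line to run the real benchmark
# (paths are placeholders taken from this thread).
for t in 1 2 4 8 12 16; do
  start=$SECONDS
  # ./main --language ru -t $t -m ../models/ggml-model-tiny.bin -f ../audio/cuker1.wav > /dev/null
  echo "threads=$t elapsed=$(( SECONDS - start ))s"
done
```

Plotting the elapsed time against `t` should show the knee past 8 threads.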
In my previous example it's just parallel jobs in a bash script:

```sh
start=$SECONDS

export MODEL=tiny
# export MODEL=base
# export MODEL=small
# export MODEL=large
export THREADS=4

./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker1.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker2.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/cuker_frag1.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/gokov1.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/gokov2.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/fragmen1t.wav &
./main --language ru -t $THREADS -m ../models/ggml-model-$MODEL.bin -f ../audio/very_bad_sample.wav &
wait

duration=$(( SECONDS - start ))
echo ""
echo "TOTAL_TIME:"
echo $duration
```

But if we need the same effect on real audio, we can try 2 approaches:
But we need to synchronize the timestamps in the output: remember the timing of each chunk and add those offsets to the resulting output.
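That synchronization step can be sketched as a small shell filter. This is a hypothetical helper (not part of the repo), and it assumes the transcript lines use whisper.cpp's default `[hh:mm:ss.mmm --> hh:mm:ss.mmm] text` layout:

```shell
# shift_ts OFFSET_SECONDS < chunk.txt
# Adds a fixed offset to both timestamps on each transcript line,
# so the outputs of several chunks can be concatenated in order.
shift_ts() {
  awk -v off="$1" '
    /^\[/ {
      # Fixed-width fields: "[t1 --> t2]rest"
      t1 = substr($0, 2, 12); t2 = substr($0, 19, 12); rest = substr($0, 32)
      split(t1, a, ":"); split(t2, b, ":")
      printf "[%s --> %s]%s\n",
             fmt(a[1]*3600 + a[2]*60 + a[3] + off),
             fmt(b[1]*3600 + b[2]*60 + b[3] + off), rest
      next
    }
    { print }  # pass non-timestamp lines through untouched
    function fmt(t,  h, m) {
      h = int(t / 3600); m = int((t - h*3600) / 60)
      return sprintf("%02d:%02d:%06.3f", h, m, t - h*3600 - m*60)
    }'
}

# Example: this chunk started 60 s into the original file.
echo "[00:00:01.000 --> 00:00:03.500] hello" | shift_ts 60
# -> [00:01:01.000 --> 00:01:03.500] hello
```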
Or we can just run multiple apps for …
@ggerganov Thanks very much, sir, for making whisper.cpp!! It is pure insanity that I can run a model that requires 12 GB of VRAM on my ultra-slow PC that is pushing 8 years old (i7-5500U). You are a wizard. This shows how most of today's models are written very poorly as far as efficiency goes. Truly makes one wonder what else we could be running on CPUs that currently requires RTX 3090s or even T4/A100s. So far, I successfully ran on this ancient computer: Facebook Research Demucs (stock, no optimized port), Stable Diffusion (OpenVINO port), and thanks to your C++ port, now Whisper as well.
Hi @kevin01881, and thanks for the kind words. But yeah, on M1 I think we still have a big edge - probably 2 or 3 times faster (I haven't done a proper benchmark yet). Btw, on this note, someone reported that on M1 Max it is efficient to split the job into multiple runs with fewer threads [0].
@ArtyomZemlyak Careful with the output you get when fragmenting audio for parallel inference jobs. cc @ggerganov
Allows processing of the input audio to start at some offset from the beginning. Useful for splitting a long job into multiple tasks.
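Combined with the `--duration` flag (also in milliseconds), this makes the chunked-parallel idea possible without cutting the wav file first. A rough sketch, where the total length, paths, and task count are made-up examples and the real `./main` invocation is commented out:

```shell
# Split one long file into 4 tasks using offsets instead of separate
# wav files. TOTAL_MS and the paths are placeholders.
TOTAL_MS=600000          # pretend the audio is 10 minutes long
TASKS=4
CHUNK_MS=$(( TOTAL_MS / TASKS ))
for i in 0 1 2 3; do
  OFFSET_MS=$(( i * CHUNK_MS ))
  echo "task $i: --offset-t $OFFSET_MS --duration $CHUNK_MS"
  # ./main -t 4 -m ../models/ggml-model-tiny.bin -f ../audio/long.wav \
  #        --offset-t $OFFSET_MS --duration $CHUNK_MS > chunk$i.txt &
done
# wait   # then merge chunk*.txt, shifting timestamps by each task's offset
```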
Performance report. Meaning of V2 and V3: V2 is before this commit, V3 after it.

[Charts: V2 -t sweep; V3 -t sweep; V2 parallel; Encode vs Decode time (V2 vs V3), tiny model; V2; V3]