
Add AVX acceleration #617

Merged
merged 3 commits into from Mar 31, 2023

Conversation

@perserk (Contributor) commented Mar 30, 2023

My old Xeon E5-2670 doesn't have AVX2 support. So I have added AVX acceleration to quantize_row_q4_0() and ggml_vec_dot_q4_0().
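
For context, the scalar path being vectorized here looks roughly like the sketch below. This is a paraphrase, not the exact ggml.c code; the struct and function names are illustrative, and it assumes the q4_0 layout of the time: blocks of QK = 32 floats stored as one f32 scale plus 16 bytes of packed 4-bit values.

#include <math.h>
#include <stdint.h>

#define QK 32

typedef struct {
    float   d;          // block scale
    uint8_t qs[QK/2];   // 32 quants packed as 4-bit nibbles
} block_q4_0;

// Paraphrased scalar q4_0 quantization (k must be a multiple of QK).
static void quantize_row_q4_0_scalar(const float *x, block_q4_0 *y, int k) {
    const int nb = k / QK;
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f;                              // absolute max of the block
        for (int l = 0; l < QK; l++) {
            const float v = fabsf(x[i*QK + l]);
            if (v > amax) amax = v;
        }

        const float d  = amax / ((1 << 3) - 1);         // map values into [-7, 7]
        const float id = d ? 1.0f/d : 0.0f;
        y[i].d = d;

        for (int l = 0; l < QK; l += 2) {               // two 4-bit quants per byte
            const uint8_t vi0 = (uint8_t)((int8_t)roundf(x[i*QK + l + 0]*id) + 8);
            const uint8_t vi1 = (uint8_t)((int8_t)roundf(x[i*QK + l + 1]*id) + 8);
            y[i].qs[l/2] = vi0 | (vi1 << 4);
        }
    }
}

The AVX changes vectorize the absolute-max search, the scaling, and the nibble packing (and the matching dot product) using 128-bit and 256-bit intrinsics that do not require AVX2.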

Here is the result before the change:

./main -m ./models/ggml-alpaca-7b-q4.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 4 -p 'Below is an instruction that describes a task. Write a response that appropriately completes the request. Tell me about alpacas.'
main: seed = 1680149433
llama_model_load: loading model from './models/ggml-alpaca-7b-q4.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml ctx size = 4273.34 MB
llama_model_load: mem required  = 6065.34 MB (+ 1026.00 MB per state)
llama_model_load: loading model part 1/1 from './models/ggml-alpaca-7b-q4.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 30 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.200000, top_k = 10000, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 256, n_predict = 128, n_keep = 0


 Below is an instruction that describes a task. Write a response that appropriately completes the request. Tell me about alpacas. Alpacas are a species of South American camelid that are bred primarily for their fleece. They are smaller than llamas, and have a finer, softer fleece that is lighter in weight and warmer in nature. Alpacas are shorn once a year, in the summer, and the fleece is then used for a variety of products, including clothing, home furnishings, and crafts. Alpacas are also bred for their meat, which is lean and flavorful. What is the difference between an alpaca and a ll
llama_print_timings:        load time =  4579.85 ms
llama_print_timings:      sample time =   451.16 ms /   128 runs   (    3.52 ms per run)
llama_print_timings: prompt eval time = 99743.01 ms /    29 tokens ( 3439.41 ms per token)
llama_print_timings:        eval time = 459193.73 ms /   127 runs   ( 3615.70 ms per run)
llama_print_timings:       total time = 565421.86 ms

and after:

./main -m ./models/ggml-alpaca-7b-q4.bin --color -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 4 -p 'Below is an instruction that describes a task. Write a response that appropriately completes the request. Tell me about alpacas.'
main: seed = 1680151620
llama_model_load: loading model from './models/ggml-alpaca-7b-q4.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml ctx size = 4273.34 MB
llama_model_load: mem required  = 6065.34 MB (+ 1026.00 MB per state)
llama_model_load: loading model part 1/1 from './models/ggml-alpaca-7b-q4.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 30 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.200000, top_k = 10000, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 256, n_predict = 128, n_keep = 0


 Below is an instruction that describes a task. Write a response that appropriately completes the request. Tell me about alpacas. Alpacas are domesticated animals that are related to camels and are native to South America. They are typically kept as livestock and are known for their wool, which is very soft and silky. Alpacas are shy and typically flee when they see humans, but they can also be very friendly and curious. They typically live in herds of 5 to 20 animals and can live up to 20 years in captivity. Alpacas are an important source of income for many South American families, as their wool can be sold for various products, such as cl
llama_print_timings:        load time =  5761.50 ms
llama_print_timings:      sample time =   464.30 ms /   128 runs   (    3.63 ms per run)
llama_print_timings: prompt eval time = 13906.61 ms /    29 tokens (  479.54 ms per token)
llama_print_timings:        eval time = 79917.90 ms /   127 runs   (  629.27 ms per run)
llama_print_timings:       total time = 101737.53 ms

@anzz1 (Contributor) commented Mar 30, 2023

That is a very significant increase!
Great work! 👍

@sw (Collaborator) commented Mar 30, 2023

For the vector dot-product, have you looked at _mm_maddubs_epi16? The AVX2 equivalent did not make much difference, but here it might. See here for an explanation of how to deal with the signs: https://astojanov.github.io/projects/clover/

In the inner loop, after subtracting 8:

...
    // Get absolute values of x vectors
    const __m128i ax = _mm_sign_epi8(bx, bx);

    // Sign the values of the y vectors
    const __m128i sy = _mm_sign_epi8(by, bx);

    // Perform multiplication and create 16-bit values
    const __m128i dot = _mm_maddubs_epi16(ax, sy);

    const __m128i ones = _mm_set1_epi16(1);
    i32[j] = _mm_madd_epi16(ones, dot);
}

Beware: this is completely untested.
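
For reference, here is a standalone sketch of the same sign/maddubs trick on 16 values, assuming inputs already in [-8, 7] (i.e. after subtracting 8 from the nibbles). Names and the test data are illustrative, not from ggml.c; build with SSSE3 enabled (e.g. -mssse3).

// Standalone check of the sign/maddubs trick (SSSE3); illustrative only.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int8_t x[16], y[16];
    for (int i = 0; i < 16; i++) { x[i] = (int8_t)(i - 8); y[i] = (int8_t)(7 - i); }

    const __m128i bx = _mm_loadu_si128((const __m128i *)x);
    const __m128i by = _mm_loadu_si128((const __m128i *)y);

    const __m128i ax  = _mm_sign_epi8(bx, bx);                  // |x|
    const __m128i sy  = _mm_sign_epi8(by, bx);                  // y carrying x's sign
    const __m128i dot = _mm_maddubs_epi16(ax, sy);              // u8*s8, pairs -> 16-bit
    const __m128i i32 = _mm_madd_epi16(_mm_set1_epi16(1), dot); // widen to 4 x int32

    int32_t tmp[4];
    _mm_storeu_si128((__m128i *)tmp, i32);
    const int simd = tmp[0] + tmp[1] + tmp[2] + tmp[3];

    int ref = 0;
    for (int i = 0; i < 16; i++) ref += x[i]*y[i];

    printf("simd = %d, ref = %d\n", simd, ref);                 // should match
    return simd != ref;
}

The trick works because |x| * sign(y, x) equals x * y element-wise, and |x| always fits in the unsigned operand that _mm_maddubs_epi16 expects.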

@RiccaDS commented Mar 30, 2023

Great job! In my case I am using alpaca.cpp, so I just modified ggml.c according to @perserk's changes; I therefore don't know what difference that might make. I saw an improvement, although not as dramatic as the OP's. I'm on an i7-2630QM with 16GB RAM. I hope I implemented it correctly. When I have time I'll try this on llama.cpp.
BEFORE: [collapsed timing output not captured]
AFTER: [collapsed timing output not captured]

@Green-Sky (Collaborator)

@RiccaDS why would you use alpaca.cpp? AFAIK llama.cpp has all the features of alpaca.cpp and more.

@RiccaDS commented Mar 30, 2023

@Green-Sky Merely a matter of first approach. I heard about alpaca and llama a few days ago. From my brief reading I understood that Alpaca was more ChatGPT-like, while llama was more like the auto-completion core that Alpaca is based on. But I might have understood it incorrectly; my knowledge of these models is pretty limited, let alone of AI in general. I will try out llama too, thanks.

@perserk (Contributor, Author) commented Mar 30, 2023

@sw Thanks for the link and the patch. I changed the code and it got a bit faster, or at least no slower.

@sw (Collaborator) commented Mar 30, 2023

It's looking good, almost too good ;-) because your AVX acceleration is almost as fast as the AVX2 we currently have.

What I mean by that: if I have an AVX2-capable CPU and enable it with the -mavx2 flag, but then disable the AVX2 code paths that have an AVX alternative in your PR, I don't see much difference in performance at all.

#elif 0 //defined(__AVX2__)
... // don't use AVX2 explicitly
#elif defined(__AVX__)
... // but use AVX intrinsics and let the compiler do its magic

This means we could just have a plain AVX implementation that would work on your Xeon but also be just as fast on AVX2 processors as before. Can anyone else confirm or disprove this?

This would be preferable because the code is getting quite unwieldy with all the different processor optimizations.

@ggerganov (Owner) commented Mar 30, 2023

@sw

I think this is very possible, and it can help explain the observations in #603.
The deprecated mad routines were apparently utilizing the CPU and memory on x86 much more efficiently, I would guess about as efficiently as an optimized AVX2 implementation. But then I removed them and effectively replaced them with a non-optimal AVX2 dot product implementation (the one that we currently have), which is probably at the level of an optimized AVX dot product (as we observe in this PR).

In short, I believe the AVX2 code has room for improvement.
My suspicion is that QK == 64 would have been more suitable for AVX2, but since Apple Silicon is the highest priority, I chose QK == 32.

@slaren (Collaborator) commented Mar 30, 2023

I tried adapting the same technique used here to AVX2 (src) and it performs a little better:

Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.11, 3.59, 2.64
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
BM_ggml_vec_dot_q4_0_avx             679 ns          679 ns      1032721
BM_ggml_vec_dot_q4_0_avx2            590 ns          590 ns      1185302
BM_ggml_vec_dot_q4_0_avx2_new        525 ns          525 ns      1329767

While it is not exactly twice as fast, there is still a significant performance advantage in using AVX2.

This change reduces eval times for me just a bit, but in the perplexity calculation the difference is bigger, from 43 seconds/pass to 35 seconds/pass.
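
A minimal sketch of what lifting the same trick to 256-bit AVX2 intrinsics could look like is below. This is not necessarily the exact code behind the numbers above; the function name is illustrative, and the caller would still scale the integer partial sums by d0*d1 per block.

#include <immintrin.h>

// Hedged sketch: 32-element dot-product core with the sign/maddubs trick on AVX2.
// bx/by are assumed to hold signed nibble values in [-8, 7] after unpacking.
static inline __m256i dot_q4_core_avx2(__m256i bx, __m256i by) {
    const __m256i ax  = _mm256_sign_epi8(bx, bx);            // |x|
    const __m256i sy  = _mm256_sign_epi8(by, bx);            // y carrying x's sign
    const __m256i dot = _mm256_maddubs_epi16(ax, sy);        // u8*s8, pairs -> 16-bit
    return _mm256_madd_epi16(_mm256_set1_epi16(1), dot);     // 8 x int32 partial sums
}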

@ggerganov (Owner)

@slaren

If you observe good results with this and #642, go ahead and merge both PRs. I won't have much time to look today.

Btw, here is also something worth looking into: #205 (comment)

@slaren slaren merged commit 02c5b27 into ggerganov:master Mar 31, 2023
16 of 22 checks passed
Nuked88 pushed a commit to Nuked88/llama.http that referenced this pull request Mar 31, 2023
* ggml : add AVX quantize_row_q4_0()

* ggml : add AVX ggml_vec_dot_q4_0()

* ggml : refactor AVX part of ggml_vec_dot_q4_0()

ggerganov#617 (comment)