
Error: inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on x86_64 - better support for different x86_64 CPU instruction extensions #196

Closed
xiliuya opened this issue Mar 16, 2023 · 35 comments
Labels
bug (Something isn't working), build (Compilation issues), hardware (Hardware related), performance (Speed related topics)

Comments

@xiliuya

xiliuya commented Mar 16, 2023

When I compile with make, the following error occurs:

inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’: target specific option mismatch
   52 | _mm256_cvtph_ps (__m128i __A)

The error is reported when executing cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -msse3 -c ggml.c -o ggml.o.
It does not occur when executing cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -msse3 -c ggml.c -o ggml.o.
Must -mavx be used together with -mf16c?


OS: Arch Linux x86_64
Kernel: 6.1.18-1-lts

@gjmulder gjmulder changed the title from "compilation error with -mavx" to "compilation error with -mavx - inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’" on Mar 16, 2023
@gjmulder gjmulder added duplicate This issue or pull request already exists hardware Hardware related build Compilation issues labels Mar 16, 2023
@gjmulder
Collaborator

Looks like a duplicate of #107. Can you please confirm you're running on native x86_64 and not emulated?

@xiliuya
Author

xiliuya commented Mar 16, 2023

Looks like a duplicate of #107. Can you please confirm you're running on native x86_64 and not emulated?

Yes, it is native x86_64, not a virtual environment such as Docker.
The same error also occurs on other machines when I execute cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -msse3 -c ggml.c -o ggml.o.

@gjmulder gjmulder added duplicate This issue or pull request already exists and removed duplicate This issue or pull request already exists labels Mar 16, 2023
@gjmulder gjmulder changed the title from "compilation error with -mavx - inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’" to "compilation error with -mavx - inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on Arch Linux x86_64" on Mar 16, 2023
@gjmulder
Collaborator

If it is Arch I'm guessing you're using a very recent g++ version. I know it compiles with g++10 under Debian and Ubuntu. We haven't collected data on other g++ versions.

@gjmulder gjmulder added the question Further information is requested label Mar 16, 2023
@xiliuya
Author

xiliuya commented Mar 16, 2023

If it is Arch I'm guessing you're using a very recent g++ version. I know it compiles with g++10 under Debian and Ubuntu. We haven't collected data on other g++ versions.

I tried gcc-10 on Arch Linux, executing gcc-10 -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -c ggml.c -o ggml.o.
The same error occurs:

/usr/lib/gcc/x86_64-pc-linux-gnu/10.4.0/include/f16cintrin.h:52:1: error: inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’: target specific option mismatch
   52 | _mm256_cvtph_ps (__m128i __A)
      | ^~~~~~~~~~~~~~~

The gcc version is as follows:

llama.cpp (master|✔) $ gcc-10 --version
gcc-10 (Arch Linux 10.4.0-1) 10.4.0
Copyright © 2020 Free Software Foundation, Inc.

I'm not sure if this is specific to Arch Linux.

@Green-Sky
Collaborator

Green-Sky commented Mar 16, 2023

_mm256_cvtph_ps requires the F16C extension (see here).

You need to add -mf16c to the build command
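
For reference, here is a minimal standalone reproduction (my own sketch, not code from ggml.c). Assuming GCC or Clang on x86_64, it builds and runs with cc -O2 -mavx -mf16c repro.c, but fails with the same kind of always_inline / target-feature error when only -mavx is passed:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* 1.0, 2.0, 3.0, 4.0 encoded as IEEE half-precision bit patterns */
    unsigned short half[8] = { 0x3c00, 0x4000, 0x4200, 0x4400, 0, 0, 0, 0 };
    /* _mm256_cvtph_ps is an F16C intrinsic, so this line is what needs -mf16c */
    __m256 v = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)half));
    float out[8];
    _mm256_storeu_ps(out, v);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}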

@gjmulder
Collaborator

Are the g++ defaults for Arch different from those for Debian-derived distros, or is it something else?

@Green-Sky
Collaborator

I am using the provided Makefile, which sets those flags for you: https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L92

@ReOT20

ReOT20 commented Mar 16, 2023

I have the same issue using the provided Makefile: Ubuntu 22.04 LTS, gcc 11.3.0, Xeon E5-2690.

@j-f1
Collaborator

j-f1 commented Mar 16, 2023

I believe that CPU supports only AVX, not AVX2, and llama.cpp requires AVX2 (edit: this is wrong, see below). The Makefile should probably check for this and emit an error message instead of this random-sounding compiler error.

@gjmulder gjmulder changed the title compilation error with -mavx - inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on Arch Linux x86_64 compilation error with -mavx - inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on x86_64 Mar 16, 2023
@gjmulder gjmulder added bug Something isn't working and removed duplicate This issue or pull request already exists labels Mar 16, 2023
@xiliuya
Author

xiliuya commented Mar 16, 2023

I believe that CPU supports only AVX, not AVX2. llama.cpp requires AVX2.

No. When I execute cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -msse3 -c ggml.c -o ggml.o to generate the .o file, make completes and the resulting binary runs normally.

@gjmulder gjmulder removed the question Further information is requested label Mar 16, 2023
@j-f1
Collaborator

j-f1 commented Mar 16, 2023

Try running:

grep -o 'avx2' /proc/cpuinfo

If it doesn’t print avx2 then AVX2 is not supported.
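
As a side note, here is a sketch of how the same check could be done from code instead of grepping /proc/cpuinfo (just an illustration, not something the Makefile does today; recent GCC and Clang accept these feature names, older compilers may not):

#include <stdio.h>

int main(void) {
    /* __builtin_cpu_supports returns non-zero if the running CPU has the feature */
    printf("avx:  %d\n", __builtin_cpu_supports("avx")  ? 1 : 0);
    printf("avx2: %d\n", __builtin_cpu_supports("avx2") ? 1 : 0);
    printf("f16c: %d\n", __builtin_cpu_supports("f16c") ? 1 : 0);
    return 0;
}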

@gjmulder gjmulder changed the title from "compilation error with -mavx - inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on x86_64" to "Error: inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on x86_64 due to Makefile incorrectly identifying AVX as AVX2" on Mar 16, 2023
@xiliuya
Author

xiliuya commented Mar 16, 2023

Try running:


grep -o 'avx2' /proc/cpuinfo

If it doesn’t print avx2 then AVX2 is not supported.

My CPU does not support AVX2, but it runs normally using the method above (compiling with -msse3 only, without -mavx).

@xiliuya
Author

xiliuya commented Mar 16, 2023

I am using the provided Makefile, which sets those flags for you: https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L92

I understand, but my CPU does not support F16C, only AVX.
Should the Makefile be modified?

@Green-Sky
Collaborator

We should set defines for each feature flag and decide which code to use inside ggml.c at a more granular level.
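
As a rough sketch of what I mean (hypothetical macro names, not the real ggml.c ones), each path would be gated on the feature macro the compiler defines for it (-mf16c defines __F16C__, -mavx defines __AVX__) instead of everything hanging off AVX:

#include <stdio.h>

#if defined(__F16C__)
    #define FP16_PATH "F16C intrinsics"
#elif defined(__AVX__)
    #define FP16_PATH "AVX math with a scalar FP16<->FP32 fallback"
#else
    #define FP16_PATH "fully scalar code"
#endif

int main(void) {
    /* prints which FP16 conversion path the compile-time flags selected */
    printf("selected FP16 conversion path: %s\n", FP16_PATH);
    return 0;
}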

@xiliuya
Author

xiliuya commented Mar 16, 2023

We should set defines for each feature flag and decide which code to use inside ggml.c at a more granular level.

I made a patch, and make now succeeds:

diff --git a/Makefile b/Makefile
index 1601079..cf4a536 100644
--- a/Makefile
+++ b/Makefile
@@ -90,6 +90,8 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
                F16C_M := $(shell grep "f16c " /proc/cpuinfo)
                ifneq (,$(findstring f16c,$(F16C_M)))
                        CFLAGS += -mf16c
+               else ifneq (,$(findstring avx,$(AVX1_M)))
+                       CFLAGS := $(filter-out -mavx,$(CFLAGS))
                endif
                SSE3_M := $(shell grep "sse3 " /proc/cpuinfo)
                ifneq (,$(findstring sse3,$(SSE3_M)))

@gjmulder gjmulder changed the title from "Error: inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on x86_64 due to Makefile incorrectly identifying AVX as AVX2" to "Error: inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on x86_64 - better support for different x86_64 CPU instruction extensions" on Mar 16, 2023
@polkovnikov

@gjmulder @xiliuya

I have this reported issue on my CPU as well. It has AVX but no F16C (and no AVX2); it is a quite old, 10-15 year old Intel laptop CPU.

It is probably the case that some old CPUs have AVX while having no F16C.

I hit this compilation issue on Windows with the latest Clang 16 when passing -march=native. As you know, -march=native tells the compiler to use all the CPU features of the current machine, and here that means AVX without F16C.

Compilation was fixed and the program worked (although not very fast) after I implemented these conversion functions myself and placed the following code inside the #elif defined(__AVX__) section of ggml.c:

#if defined(__AVX__) && !defined(__F16C__)
// Scalar fallbacks for the two F16C intrinsics, for CPUs with AVX but no F16C.
__m256 _mm256_cvtph_ps(__m128i x) {
    ggml_fp16_t const * src = (ggml_fp16_t const *)&x;
    float dst[8];
    for (int i = 0; i < 8; ++i)
        dst[i] = GGML_FP16_TO_FP32(src[i]);
    return *(__m256 *)&dst;
}
__m128i _mm256_cvtps_ph(__m256 x, int imm) {
    float const * src = (float const *)&x;
    ggml_fp16_t dst[8];
    for (int i = 0; i < 8; ++i)
        dst[i] = GGML_FP32_TO_FP16(src[i]);
    return *(__m128i *)&dst;
}
#endif

If any C/C++ gurus know a faster AVX implementation of these functions, please share it here.

For now I suggest that a volunteer put the fix above into the main branch, if the code above is alright.

@gjmulder
Collaborator

It would be great if @xiliuya and @polkovnikov could work together to create a pull request with your patches, so we can support a wider range of CPUs.

@xiliuya
Author

xiliuya commented Mar 18, 2023

#if defined(__AVX__) && !defined(__F16C__)
// Scalar fallbacks for the two F16C intrinsics, for CPUs with AVX but no F16C.
__m256 _mm256_cvtph_ps(__m128i x) {
    ggml_fp16_t const * src = (ggml_fp16_t const *)&x;
    float dst[8];
    for (int i = 0; i < 8; ++i)
        dst[i] = GGML_FP16_TO_FP32(src[i]);
    return *(__m256 *)&dst;
}
__m128i _mm256_cvtps_ph(__m256 x, int imm) {
    float const * src = (float const *)&x;
    ggml_fp16_t dst[8];
    for (int i = 0; i < 8; ++i)
        dst[i] = GGML_FP32_TO_FP16(src[i]);
    return *(__m128i *)&dst;
}
#endif

Yes, it works, but on my machine it doesn't perform as well as the build without AVX.
Test command:
./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello Bob" -t 15 -n 128
Before adding code:

main: seed = 1679115214
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 15 / 16 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

main: prompt: ' Hello Bob'
main: number of tokens in prompt = 3
     1 -> ''
 15043 -> ' Hello'
  7991 -> ' Bob'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


 Hello Bob. Congratulations on your book "The Buzz About Bees". It is a very interesting read for adults and children alike because it shares lots of information about beekeeping in an entertainment way that makes you want to know more!
I would like congratulate you again, also with the good reviews from Amazon.com customers so far who have commented on your book saying they learned something new or were entertained by it (see below). I am impressed how quickly people started leaving comments when your publisher contacted them to ask for their thoughts after reading this "amazingly interesting"

main: mem per token = 14549044 bytes
main:     load time =  1696.78 ms
main:   sample time =   146.30 ms
main:  predict time = 279482.84 ms / 2149.87 ms per token
main:    total time = 289818.09 ms

After adding code:

main: seed = 1679115582
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 15 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

main: prompt: ' Hello Bob'
main: number of tokens in prompt = 3
     1 -> ''
 15043 -> ' Hello'
  7991 -> ' Bob'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


 Hello Bob! I love your site, and read it every day. But today's article about the H-Sphere seemed a bit confusing to me...
http://www.mightyneato.com/2015_498673926368-1b6a8fbdcbeeebdaaeccd2fe2caecac?utm_source=buffer&amp;utm_content&ltp=Buffer &gtp=Facebook
I would like to ask for clarification on some points: 1. When you say "this

main: mem per token = 14549044 bytes
main:     load time =  1679.01 ms
main:   sample time =   145.62 ms
main:  predict time = 519427.31 ms / 3995.59 ms per token
main:    total time = 537130.75 ms

@polkovnikov

@xiliuya You get two different answers for three reasons:

  1. LLaMA uses a random seed each time you run it, so it may produce different results even for the same program and prompt.

  2. Your query "Hello Bob" could mean anything, and it looks to me like both answers are reasonable.

  3. The Intel F16C intrinsics support several rounding modes, while my replacement uses generic code that may round differently. This changes the last bit of the float16 mantissa, which can lead the network to different answers.

Please try a more descriptive query: for example, instead of "Hello Bob", try "Building a website can be done in 10 simple steps:". Then check whether, both before and after adding my code, the answers are good enough, i.e. whether they really describe the steps for building a real website.

@xiliuya
Author

xiliuya commented Mar 18, 2023

@xiliuya You get two different answers for three reasons:

1. LLaMA uses a random seed each time you run it, so it may produce different results even for the same program and prompt.

2. Your query "Hello Bob" could mean anything, and it looks to me like both answers are reasonable.

3. The Intel F16C intrinsics support several rounding modes, while my replacement uses generic code that may round differently. This changes the last bit of the float16 mantissa, which can lead the network to different answers.

Please try a more descriptive query: for example, instead of "Hello Bob", try "Building a website can be done in 10 simple steps:". Then check whether, both before and after adding my code, the answers are good enough, i.e. whether they really describe the steps for building a real website.

I have been testing, and there is no problem with the generated output. The problem is that generation has slowed down since AVX was turned on.

Does turning on AVX actually make the output better?

@polkovnikov

@xiliuya Turning on AVX changes only the speed of computation, not the quality of the answer. I don't know why the AVX version is 1.5 times slower in your case.

Maybe it's because AVX itself is not used that much, while F16C is used more, and my code emulates F16C through generic algorithms, which are slow. It could be that my generic version is somehow slower than the generic version in the non-AVX code path, which would explain the slowdown.

@xiliuya
Author

xiliuya commented Mar 18, 2023

@xiliuya Turning on AVX changes only the speed of computation, not the quality of the answer. I don't know why the AVX version is 1.5 times slower in your case.

Maybe it's because AVX itself is not used that much, while F16C is used more, and my code emulates F16C through generic algorithms, which are slow. It could be that my generic version is somehow slower than the generic version in the non-AVX code path, which would explain the slowdown.

There is nothing wrong with your code.
I tried turning off the relevant definitions and got the same result:

diff --git a/ggml.c b/ggml.c
index 535c7b7..c6cc145 100644
--- a/ggml.c
+++ b/ggml.c
@@ -850,8 +850,9 @@ void dequantize_row_q4_1(const void * restrict x, float * restrict y, int k) {
 
 #elif defined(__AVX__)
 
-#define GGML_SIMD
-
+#if defined(__F16C__)
+    #define GGML_SIMD
+#endif
 // F32 AVX
 
 #define GGML_F32_STEP 32
@@ -908,8 +909,12 @@ void dequantize_row_q4_1(const void * restrict x, float * restrict y, int k) {
 #define GGML_F32Cx8             __m256
 #define GGML_F32Cx8_ZERO        _mm256_setzero_ps()
 #define GGML_F32Cx8_SET1(x)     _mm256_set1_ps(x)
-#define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
-#define GGML_F32Cx8_STORE(x, y) _mm_storeu_si128((__m128i *)(x), _mm256_cvtps_ph(y, 0))
+
+#if  defined(__F16C__)
+    #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
+    #define GGML_F32Cx8_STORE(x, y) _mm_storeu_si128((__m128i *)(x), _mm256_cvtps_ph(y, 0))
+#endif
+
 #define GGML_F32Cx8_FMA         GGML_F32x8_FMA
 #define GGML_F32Cx8_ADD         _mm256_add_ps
 #define GGML_F32Cx8_MUL         _mm256_mul_ps
@@ -918,8 +923,10 @@ void dequantize_row_q4_1(const void * restrict x, float * restrict y, int k) {
 #define GGML_F16_VEC                GGML_F32Cx8
 #define GGML_F16_VEC_ZERO           GGML_F32Cx8_ZERO
 #define GGML_F16_VEC_SET1           GGML_F32Cx8_SET1
-#define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
-#define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx8_STORE(p, r[i])
+#if  defined(__F16C__)
+    #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
+    #define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx8_STORE(p, r[i])
+#endif
 #define GGML_F16_VEC_FMA            GGML_F32Cx8_FMA
 #define GGML_F16_VEC_ADD            GGML_F32Cx8_ADD
 #define GGML_F16_VEC_MUL            GGML_F32Cx8_MUL

@polkovnikov

@xiliuya The problem with your last patch is that it removes the use of AVX (or any other SIMD) entirely: if the GGML_SIMD macro is not defined, only generic non-SIMD code is used everywhere.

In my patch, only the two conversion functions are implemented as generic code, while the rest of AVX is still used, for example _mm256_mul_ps().

So if I were to compare my patch and yours, I would choose mine, as it should be faster.

But other experts may disagree; we need more ideas here about ways to improve.
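
As a quick illustration of that point (my own sketch, separate from the patch): plain AVX math such as _mm256_mul_ps builds and runs with only -mavx; it is specifically the FP16 conversion intrinsics that need -mf16c.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* builds with: cc -O2 -mavx avx_only.c  (no -mf16c required) */
    __m256 a = _mm256_set1_ps(1.5f);
    __m256 b = _mm256_set1_ps(2.0f);
    __m256 c = _mm256_mul_ps(a, b);  /* AVX-only intrinsic */
    float out[8];
    _mm256_storeu_ps(out, c);
    printf("%f\n", out[0]);  /* prints 3.000000 */
    return 0;
}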

@gjmulder gjmulder added the performance Speed related topics label Mar 18, 2023
@gjmulder
Collaborator

gjmulder commented Mar 18, 2023

Good work guys. I am not a C++ programmer...

I am however interested in performance. I'd ideally want the most performant CPU code for any arch.

@xiliuya
Author

xiliuya commented Mar 19, 2023

Good work guys. I am not a C++ programmer...

I am however interested in performance. I'd ideally want the most performant CPU code for any arch.

Thank you.

Here are CPU flame graphs for the two builds:

  1. Using AVX:

[flame graph image: main]

  2. Without AVX:

[flame graph image: main_no]

@tommybutler

CFLAGS := $(filter-out -mavx,$(CFLAGS))

We should set defines for each feature flag and decide which code to use inside ggml.c at a more granular level.

I made a patch, and make now succeeds:

diff --git a/Makefile b/Makefile
index 1601079..cf4a536 100644
--- a/Makefile
+++ b/Makefile
@@ -90,6 +90,8 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
                F16C_M := $(shell grep "f16c " /proc/cpuinfo)
                ifneq (,$(findstring f16c,$(F16C_M)))
                        CFLAGS += -mf16c
+               else ifneq (,$(findstring avx,$(AVX1_M)))
+                       CFLAGS := $(filter-out -mavx,$(CFLAGS))
                endif
                SSE3_M := $(shell grep "sse3 " /proc/cpuinfo)
                ifneq (,$(findstring sse3,$(SSE3_M)))

This patch allowed me to successfully run the make command.

@beaclnd

beaclnd commented Mar 26, 2023

I ran into the same error when executing make on Ubuntu 22.04 x86_64 inside a virtual machine launched by VirtualBox.

In the virtual machine, /proc/cpuinfo is missing the F16C and FMA CPU flags, but contains AVX, AVX2 and SSE3. I then found that my host machine in fact supports all of the above flags, including F16C and FMA (checked with Coreinfo). So I just modified the Makefile to add the missing flags like this:

diff --git a/Makefile b/Makefile
index 98a2d85..1b0f28c 100644
--- a/Makefile
+++ b/Makefile
@@ -80,6 +80,8 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
                        CFLAGS += -mavx2
                endif
        else ifeq ($(UNAME_S),Linux)
+               CFLAGS += -mfma
+               CFLAGS += -mf16c
                AVX1_M := $(shell grep "avx " /proc/cpuinfo)
                ifneq (,$(findstring avx,$(AVX1_M)))
                        CFLAGS += -mavx

After that, I can run make without error and execute the compiled binary as below:
./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -p 'Hello there'

which gave this result:

main: seed = 1679824080
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml ctx size = 4273.34 MB
llama_model_load: mem required  = 6065.34 MB (+ 1026.00 MB per state)
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0


 Hello there and welcome to my personal web-site. I'm a graphic designer, illustrator & photographer based in London, UK. This site is primarily used as an online portfolio showcasing my work but I also use it to blog about what I am doing in my spare time (and the occasional rant!)
I'm available for freelance projects and always open to new opportunities. Please feel free to contact me with any project proposals or enquiries via my details page, email address is there as well as a Twitter link. [end of text]

llama_print_timings:        load time =  4220.08 ms
llama_print_timings:      sample time =    90.68 ms /   117 runs   (    0.78 ms per run)
llama_print_timings: prompt eval time =   673.42 ms /     3 tokens (  224.47 ms per token)
llama_print_timings:        eval time = 34385.98 ms /   116 runs   (  296.43 ms per run)
llama_print_timings:       total time = 40892.13 ms

BTW, I also noticed that CMakeLists.txt currently always enables F16C, FMA, AVX and AVX2 for x86 Linux, so I could just use CMake directly:

cmake .
make
cp bin/* ./

That also works well for me.

@polkovnikov

@beaclnd92 The problem with your solution is that it simply enables the F16C feature unconditionally. My old CPU has AVX but no F16C, so your solution works for some CPUs but doesn't work for mine.

@beaclnd

beaclnd commented Mar 27, 2023

@beaclnd92 The problem with your solution is that it simply enables the F16C feature unconditionally. My old CPU has AVX but no F16C, so your solution works for some CPUs but doesn't work for mine.

Yeah, it probably only works for a guest virtual machine backed by a physical CPU that has the required SIMD features. In general it gives better performance than not enabling them.

@RiccaDS

RiccaDS commented Mar 29, 2023

Hi, I have the same inlining problem when using -mavx, running Linux Mint on an i7-2630QM with 16 GB RAM, a pretty old (13-year-old) laptop. The problem is that I'm not able to get AVX to be used, even though I know the CPU supports it. I tried @polkovnikov's patch; it lets make finish, but AVX is still not used and the prompt reply is really slow. Any ideas?

@slaren
Collaborator

slaren commented Mar 29, 2023

@RiccaDS The problem with F16C and AVX should have been fixed by #563. Please update to the current master and recompile. If it still fails, please copy the entire build output here.

@slaren
Collaborator

slaren commented Mar 30, 2023

Fixed in #563

@slaren slaren closed this as completed Mar 30, 2023
@polkovnikov

@slaren @ggerganov I found a bad mistake in #563 and posted a comment there. If you have the access/desire to modify the code, please fix the issue I commented on.

@RiccaDS

RiccaDS commented Mar 30, 2023

I can confirm that I now have AVX1 activated. However, there is no apparent improvement in performance, even with the fix suggested by @polkovnikov. Probably AVX1 alone is not sufficient for good performance. I'm on an i7-2630QM with 16 GB RAM, clearly not a quantum computer, but honestly I hoped for better: it takes minutes to complete a single line of reply. Thank you for your work, by the way.

@slaren
Collaborator

slaren commented Mar 30, 2023

@RiccaDS You can try merging #617; that should significantly boost AVX1 performance.
