clang build fails with inline ASM on NEON64 (Apple M1) #96

Closed · Tracked by #90
mscdex opened this issue Jul 17, 2022 · 7 comments

mscdex (Contributor) commented Jul 17, 2022

clang must not be allocating l3 in a contiguous register? While building 3eab8e6, the compiler errors are:

In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:32:44: error: registers must be sequential
                "and  %[t3].16b, v14.16b,   %[n63].16b \n\t"
                                                         ^
<inline asm>:10:40: note: instantiated into assembly here
        tbl v12.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v3.16b
                                              ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:32:44: error: unknown token in expression
                "and  %[t3].16b, v14.16b,   %[n63].16b \n\t"
                                                         ^
<inline asm>:10:48: note: instantiated into assembly here
        tbl v12.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v3.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:32:44: error: invalid operand
                "and  %[t3].16b, v14.16b,   %[n63].16b \n\t"
                                                         ^
<inline asm>:10:48: note: instantiated into assembly here
        tbl v12.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v3.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:35:75: error: registers must be sequential
                "tbl v12.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t0].16b \n\t"
                                                                                        ^
<inline asm>:11:40: note: instantiated into assembly here
        tbl v13.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v2.16b
                                              ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:35:75: error: unknown token in expression
                "tbl v12.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t0].16b \n\t"
                                                                                        ^
<inline asm>:11:48: note: instantiated into assembly here
        tbl v13.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v2.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:35:75: error: invalid operand
                "tbl v12.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t0].16b \n\t"
                                                                                        ^
<inline asm>:11:48: note: instantiated into assembly here
        tbl v13.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v2.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:36:75: error: registers must be sequential
                "tbl v13.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t1].16b \n\t"
                                                                                        ^
<inline asm>:12:40: note: instantiated into assembly here
        tbl v14.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v1.16b
                                              ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:36:75: error: unknown token in expression
                "tbl v13.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t1].16b \n\t"
                                                                                        ^
<inline asm>:12:48: note: instantiated into assembly here
        tbl v14.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v1.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:36:75: error: invalid operand
                "tbl v13.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t1].16b \n\t"
                                                                                        ^
<inline asm>:12:48: note: instantiated into assembly here
        tbl v14.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v1.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:37:75: error: registers must be sequential
                "tbl v14.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t2].16b \n\t"
                                                                                        ^
<inline asm>:13:40: note: instantiated into assembly here
        tbl v15.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v0.16b
                                              ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:37:75: error: unknown token in expression
                "tbl v14.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t2].16b \n\t"
                                                                                        ^
<inline asm>:13:48: note: instantiated into assembly here
        tbl v15.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v0.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:37:75: error: invalid operand
                "tbl v14.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t2].16b \n\t"
                                                                                        ^
<inline asm>:13:48: note: instantiated into assembly here
        tbl v15.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v0.16b
                                                      ^
mscdex (Contributor, Author) commented Jul 18, 2022

Apparently it only happens when compiling with something like -O0.

aklomp (Owner) commented Jul 18, 2022

So the problem is that the following set of four registers, which together form the lookup table, are not sequentially numbered:

{v5.16b, v6.16b, v7.16b, v16.16b}

That sucks, because as you mention, the code goes to great lengths to load that table into four hardcoded sequential registers: v8, v9, v10 and v11.

For some unclear reason, the compiler chooses to rename those registers when returning from the function. I was really hoping that any reasonable compiler would never do that, because the hardcoded registers are already taken and the table stays live for the duration of the encoder.

Yet here we are. My little gambit failed.
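
For reference, the gambit is roughly the pattern sketched below. This is an illustrative paraphrase, not the library's exact source: the table is loaded into the hardcoded registers v8..v11 via explicit register variables, and the encoder loop's asm then refers to those values through %[l0]..%[l3] operands, trusting the compiler to leave them where they were put.

#include <arm_neon.h>
#include <stdint.h>

/* Sketch of the table-load side of the gambit. The ld1 instruction fills
 * v8..v11 in one go; the register variables tell the compiler where the
 * four vectors live. The encoder loop later names these values as asm
 * operands and hopes they are still in sequential registers -- which is
 * exactly what clang at -O0 does not guarantee. */
static inline uint8x16x4_t
load_64byte_table_sketch (const uint8_t *p)
{
	register uint8x16_t t0 __asm__ ("v8");
	register uint8x16_t t1 __asm__ ("v9");
	register uint8x16_t t2 __asm__ ("v10");
	register uint8x16_t t3 __asm__ ("v11");

	__asm__ (
		"ld1 {v8.16b, v9.16b, v10.16b, v11.16b}, [%[src]] \n\t"
		: "=w" (t0), "=w" (t1), "=w" (t2), "=w" (t3)
		: [src] "r" (p)
		: "memory"
	);

	return (uint8x16x4_t) { .val = { t0, t1, t2, t3 } };
}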

Testing a fix sucks, because I don't have an ARM64 machine that I can test on, and even then I'm not sure that I can reproduce the bug.

The silver lining is that clang should not be affected by the codegen bug that GCC has for vld1q_u8_x4. So we should hopefully be able to use that instead...

Could you try changing line 28 to this:

#if defined(BASE64_NEON64_USE_ASM) && !defined(__clang__)
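
For illustration, here is a minimal sketch of the intrinsic-based load that clang would then fall back to. This is an assumption about the intended shape, not the library's actual fallback code:

#include <arm_neon.h>
#include <stdint.h>

/* With a guard like the one above, clang skips the inline-asm load and a
 * plain intrinsic can do the work: vld1q_u8_x4 loads all 64 bytes in one
 * go and returns the table as a single uint8x16x4_t, with no hand-written
 * register choices involved. */
static inline uint8x16x4_t
load_64byte_table (const uint8_t *p)
{
	return vld1q_u8_x4(p);
}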

aklomp (Owner) commented Jul 18, 2022

Another thing to try is to add the always_inline attribute to the function:

__attribute__((always_inline))
static inline uint8x16x4_t
load_64byte_table (const uint8_t *p)
{
#ifdef BASE64_NEON64_USE_ASM

I believe that -O0 can turn off inlining, and that may mean that the compiler can't make the reasonable inference that it should not rename the registers.

mscdex (Contributor, Author) commented Jul 18, 2022

Both suggestions result in the same compiler errors.

FWIW I don't have an arm64 device handy either, so I just installed and used clang (v14) with an aarch64 sysroot (https://developer.arm.com/-/media/Files/downloads/gnu-a/10.3-2021.07/binrel/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu.tar.xz).

mscdex (Contributor, Author) commented Jul 18, 2022

FWIW, here's the command line I'm using (from the project root, on Linux) to test:

clang-14 -DHAVE_NEON64=1 -I./include -I./lib -O0 -I/tmp/aarch64-none-linux-gnu/libc/usr/include -target arm64-linux-gnu -c lib/arch/neon64/codec.c -o base64_neon64.codec.o

aklomp (Owner) commented Jul 18, 2022

Thanks for linking to the sysroot and for sharing your script! Those will be useful in the future. I was able to reproduce the bug and also confirm your conclusion that my proposed fixes don't work.

This looks like a nasty bug. Even when I inline the table-loading code into the encoder loop, the bug appears. Even when I don't create a uint8x16x4_t, but pass the t0-t3 registers (which should surely be in v8-v11...) directly to the inline assembly, the bug manifests itself.

I'm unsure how to fix this, other than to rewrite the whole encoder logic in assembly. (That is something I was planning to do anyway, because it would let me interleave loads and stores more naturally.)

Maybe the best fix for the time being is indeed the one you pushed: to just disable inline asm for clang when not optimizing.
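
A hypothetical sketch of such a guard follows (the macro name is invented, and this is not necessarily the change that was pushed). It relies on the fact that both GCC and clang define __OPTIMIZE__ at -O1 and above, but not at -O0:

/* Illustration only: keep the NEON64 inline-asm path unless we are on
 * clang with optimization disabled, where __OPTIMIZE__ is not defined. */
#if defined(BASE64_NEON64_USE_ASM) && \
	(!defined(__clang__) || defined(__OPTIMIZE__))
#  define BASE64_NEON64_ASM_ENABLED 1
#endif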

aklomp mentioned this issue Jul 20, 2022
aklomp added a commit that referenced this issue Jul 20, 2022
Convert the full encoding loop to an inline assembly implementation for
systems that can use inline assembly.

The motivation for this work is that when optimization was turned off on
recent versions of clang, the encoding table would not be loaded into
sequential registers (issue #96). This happened despite taking pains to
ensure that the compiler would use an explicit set of registers
(v8-v11).

Finding ourselves at the bottom of our bag of tricks, and faced with a
real bug, we were left with no option but to reimplement the _entire_
encoding loop in inline assembly. It is the only way to get full control
over the loading of the encoding table. Thankfully, aarch64 assembly is
not very difficult to write by hand.

In making this change, we can/should add some optimizations in the loop
unrolling for rounds >= 8. The unrolled loop should optimize pipeline
efficiency by interleaving memory operations (like loads and stores)
with data operations (like table lookups). The best way to achieve this
is to blend the unrolled loops such that one loop prefetches the
registers needed in the next loop.

To make that possible without duplicating massive amounts of code, we
abstract the various assembly blocks into preprocessor macros and
instantiate them as needed. This mixing of the preprocessor with inline
assembly is perhaps a bit gnarly, but I think the usage is simple enough
that the advantages (code reuse) outweigh the disadvantages.

Code was tested on a Debian VM running under QEMU. Unfortunately this
does not let us see how the actual bare metal performance
increases/decreases.

aklomp (Owner) commented Jul 20, 2022

Yesterday I set up a small AArch64 Debian VM using qemu-system-aarch64 to do quick prototyping on the AArch64 platform. I was hoping that it would be relatively simple to rewrite the entire NEON64 encoding loop in inline assembly, and it turns out I was right. AArch64 assembly is pretty approachable. I managed to implement the entire loop in inline assembly, including proper interleaving and pipelining of the 8x unrolled loop. All tests pass, and I'm reasonably happy with the cleanness of the code.

I've created a new issue (#98) for this enhancement and also pushed a testing branch, issue98.

This was the nuclear option, but also the only solution I saw for fixing this bug. I was not hopeful that I could find any more tricks to get the compiler to generate the correct code by itself.

aklomp added a commit that referenced this issue Jul 28, 2022
Convert the full encoding loop to an inline assembly implementation for
compilers that support inline assembly.

The motivation for this change is issue #96: when optimization is turned
off on recent versions of clang, the encoding table is sometimes not
loaded into sequential registers. This happens despite taking pains to
ensure that the compiler uses an explicit set of registers for the load
(v8..v11).

This leaves us with few options besides rewriting the full encoding loop
in inline assembly. Only then can we be absolutely certain that the
right registers are used. Thankfully, AArch64 assembly is not very
difficult to write by hand.

In making this change, we optimize the unrolled loops for rounds >= 8 by
interleaving memory operations (loads, stores) with data operations
(arithmetic, table lookups). Splitting these two classes of instructions
avoids pipeline stalls and data dependencies. The current loop iteration
also prefetches the data needed in the next iteration.

To allow that without duplicating massive amounts of code, we abstract
the various assembly blocks into preprocessor macros and instantiate
them as needed. This mixing of the preprocessor with inline assembly is
perhaps a bit gnarly, but I think the usage is simple enough that the
advantages (code reuse) outweigh the disadvantages.

Code was tested on a Debian VM running under QEMU. Unfortunately,
testing in a VM does not let us measure the actual performance impact.
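
To make the macro approach concrete, here is a toy sketch of the composition mechanism only. The macro names and instructions are invented and are not taken from the repository: each step expands to a string fragment of the assembly template, adjacent string literals are concatenated into a single asm block, and fragments from two rounds can be interleaved freely.

#include <stdint.h>

/* Toy only: the real macros emit NEON loads, tbl lookups and stores.
 * These just mutate two integer registers so the example stays small
 * and verifiable. */
#define STEP_INC(r)  "add  " r ", " r ", #1\n\t"
#define STEP_DBL(r)  "lsl  " r ", " r ", #1\n\t"

static inline uint64_t
toy_interleaved (uint64_t a, uint64_t b)
{
	__asm__ (
		/* Work on 'a' interleaved with work on 'b', the way the real
		 * loop blends the data ops of one round with the loads of the
		 * next. */
		STEP_INC("%[a]")
		STEP_DBL("%[b]")
		STEP_INC("%[a]")
		STEP_DBL("%[b]")
		: [a] "+r" (a), [b] "+r" (b)
	);
	return a + b;
}
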
aklomp closed this as completed in dd7a2b5 on Jul 28, 2022