Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x86-64: Enable Intel CET #2992

Merged
merged 2 commits into from
Jan 20, 2022
Merged

x86-64: Enable Intel CET #2992

merged 2 commits into from
Jan 20, 2022

Conversation

hjl-tools
Copy link
Contributor

@hjl-tools hjl-tools commented Jan 11, 2022

Intel Control-flow Enforcement Technology (CET):

https://en.wikipedia.org/wiki/Control-flow_integrity#Intel_Control-flow_Enforcement_Technology

requires that on Linux, all linker input files are marked as CET enabled
in .note.gnu.property section. For high-level language source codes,
.note.gnu.property section is added by compiler with the -fcf-protection
option. For assembly sources, include <cet.h> to add .note.gnu.property
section.

Copy link
Contributor

@felixhandte felixhandte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'd never heard of Intel CET, but it looks harmless enough...

Could you take a minute to describe what benefit this inclusion provides?

Note, GCC says:

[...] the recommended use of the [__has_include] operator is as follows:

#if defined __has_include
#  if __has_include (<stdatomic.h>)
#    include <stdatomic.h>
#  endif
#endif

The first #if test succeeds only when the operator is supported by the version of GCC (or another compiler) being used. Only when that test succeeds is it valid to use __has_include as a preprocessor operator. As a result, combining the two tests into a single expression as shown below would only be valid with a compiler that supports the operator but not with others that don’t.

#if defined __has_include && __has_include ("header.h")   /* not portable */
…
#endif

Please rewrite the test to match that style.

@hjl-tools
Copy link
Contributor Author

Ok. I'd never heard of Intel CET, but it looks harmless enough...

Could you take a minute to describe what benefit this inclusion provides?

Note, GCC says:

Done.

[...] the recommended use of the [__has_include] operator is as follows:

#if defined __has_include
#  if __has_include (<stdatomic.h>)
#    include <stdatomic.h>
#  endif
#endif

The first #if test succeeds only when the operator is supported by the version of GCC (or another compiler) being used. Only when that test succeeds is it valid to use __has_include as a preprocessor operator. As a result, combining the two tests into a single expression as shown below would only be valid with a compiler that supports the operator but not with others that don’t.

#if defined __has_include && __has_include ("header.h")   /* not portable */
…
#endif

Please rewrite the test to match that style.

Done.

@felixhandte
Copy link
Contributor

Thanks for applying the patch so quickly!

I've tried to do some reading to understand what CET does.

If I understand correctly, it impacts (all?) control flow instructions, in different ways. It looks like no modification of the source is required for call and ret instructions, which in CET mode just do internal checks against a shadow stack that we don't need to think about. And this file makes no calls anyways, and only normal rets.

However, it sounds to me like (some?) jmps require that the destination is marked with an _CET_ENDBR instruction. Actually now I'm reading more and it seems like it only applies to jmps with non-immediate targets? Of which we don't have any.

It looks like those are the only things tracked by CET? So it sounds to me like our current source is compatible with CET as-is.

The criteria I'm trying to evaluate overall here are:

  1. Is this a no-op when CET isn't available/enabled?
    Looks good on that front.
  2. Is this correct when it is available/enabled?
    Based on the above, it seems like it should work.
  3. Are the above tested by our test suite such that we don't accidentally introduce a breakage in the future?
    W.r.t. breaking zstd: I see in man gcc that "In Ubuntu 19.10 and later versions, -fcf-protection is enabled by default". So, it sounds like our existing tests may be building and running with CET? It would be good to confirm this, and see that our test suite will actually produce a fault if we try adding an unmarked indirect jmp.
    W.r.t. staying CET-enabled: We should also add a test, along the lines of the testing in Mark Huffman Decoder Assembly noexecstack on All Architectures #2964, to make sure that we don't accidentally lose CET-compatibility (by forgetting to include this header in some other assembly source) in the future.
  4. Does this have a performance impact?
    The whole reason that this block is in assembly is because it is extremely performance-critical, -optimized, and -sensitive. We frequently see that even perturbing the alignment of our hot paths produces significant performance regressions.
    But again, if I understand this correctly, it seems like this shouldn't change the generated code at all. (Though, does running the same instructions under CET have a performance impact? Is it possible that part of the win we saw with switching to the assembly huffman implementation came from losing CET?) Maybe we should check to be sure.

So it seems like this is close to good to go. Please correct me if I'm misunderstanding or missing anything.

And if you're able to help us make sure we have good test coverage for this, that would also be great!

@hjl-tools
Copy link
Contributor Author

Thanks for applying the patch so quickly!

I've tried to do some reading to understand what CET does.

If I understand correctly, it impacts (all?) control flow instructions, in different ways. It looks like no modification of the source is required for call and ret instructions, which in CET mode just do internal checks against a shadow stack that we don't need to think about. And this file makes no calls anyways, and only normal rets.

However, it sounds to me like (some?) jmps require that the destination is marked with an _CET_ENDBR instruction. Actually now I'm reading more and it seems like it only applies to jmps with non-immediate targets? Of which we don't have any.

It looks like those are the only things tracked by CET? So it sounds to me like our current source is compatible with CET as-is.

The criteria I'm trying to evaluate overall here are:

  1. Is this a no-op when CET isn't available/enabled?
    Looks good on that front.

Correct.

  1. Is this correct when it is available/enabled?
    Based on the above, it seems like it should work.

Correct.

  1. Are the above tested by our test suite such that we don't accidentally introduce a breakage in the future?
    W.r.t. breaking zstd: I see in man gcc that "In Ubuntu 19.10 and later versions, -fcf-protection is enabled by default". So, it sounds like our existing tests may be building and running with CET? It would be good to confirm this, and see that our test suite will actually produce a fault if we try adding an unmarked indirect jmp.
    W.r.t. staying CET-enabled: We should also add a test, along the lines of the testing in Mark Huffman Decoder Assembly noexecstack on All Architectures #2964, to make sure that we don't accidentally lose CET-compatibility (by forgetting to include this header in some other assembly source) in the future.

On Linux/x86, you can use

$ LDFLAGS="-Wl,-z,cet-report=error" CFLAGS="-O2 -fcf-protection" make -j8
make[1]: Entering directory '/export/gnu/import/git/github/zstd/lib'
make[1]: Entering directory '/export/gnu/import/git/github/zstd/programs'
compiling multi-threaded dynamic library 1.5.1
/usr/local/bin/ld: obj/conf_9d28cd3298aed57cd94d6115242a56d4/dynamic/huf_decompress_amd64.o: error: missing IBT and SHSTK properties
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:161: obj/conf_9d28cd3298aed57cd94d6115242a56d4/dynamic/libzstd.so.1.5.1] Error 1
make[1]: *** [Makefile:148: libzstd.so.1.5.1] Error 2
make[1]: Leaving directory '/export/gnu/import/git/github/zstd/lib'
make: *** [Makefile:63: lib-release] Error 2
make: *** Waiting for unfinished jobs....
==> building with threading support
==> building zstd with .gz compression support
==> building zstd with .xz/.lzma compression support
==> no liblz4, building zstd without .lz4 support
LINK obj/conf_9f586c118439b7633512d66a7741fe07/zstd
/usr/local/bin/ld: obj/conf_9f586c118439b7633512d66a7741fe07/huf_decompress_amd64.o: error: missing IBT and SHSTK properties
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:162: obj/conf_9f586c118439b7633512d66a7741fe07/zstd] Error 1
make[1]: *** [Makefile:150: zstd] Error 2
make[1]: Leaving directory '/export/gnu/import/git/github/zstd/programs'
make: *** [Makefile:67: zstd-release] Error 2
[hjl@gnu-tgl-3 zstd]$ 
  1. Does this have a performance impact?
    The whole reason that this block is in assembly is because it is extremely performance-critical, -optimized, and -sensitive. We frequently see that even perturbing the alignment of our hot paths produces significant performance regressions.
    But again, if I understand this correctly, it seems like this shouldn't change the generated code at all. (Though, does running the same instructions under CET have a performance impact? Is it possible that part of the win we saw with switching to the assembly huffman implementation came from losing CET?) Maybe we should check to be sure.

There is no performance impact.

So it seems like this is close to good to go. Please correct me if I'm misunderstanding or missing anything.

And if you're able to help us make sure we have good test coverage for this, that would also be great!

Let me try.

@hjl-tools
Copy link
Contributor Author

So it seems like this is close to good to go. Please correct me if I'm misunderstanding or missing anything.
And if you're able to help us make sure we have good test coverage for this, that would also be great!

Let me try.

See: #2994

Intel Control-flow Enforcement Technology (CET):

https://en.wikipedia.org/wiki/Control-flow_integrity#Intel_Control-flow_Enforcement_Technology

requires that on Linux, all linker input files are marked as CET enabled
in .note.gnu.property section.  For high-level language source codes,
.note.gnu.property section is added by compiler with the -fcf-protection
option.  For assembly sources, include <cet.h> to add .note.gnu.property
section.
@felixhandte
Copy link
Contributor

Ok, so the only remaining topic was to see if I could get a CET-violating instruction to trigger a fault. I tried the following patch on top of this PR:

--- a/lib/decompress/huf_decompress_amd64.S
+++ b/lib/decompress/huf_decompress_amd64.S
@@ -303,7 +303,8 @@ HUF_decompress4X1_usingDTable_internal_bmi2_asm_loop:
     movq 16(%rsp), %ip3
 
     /* Re-compute olimit */
-    jmp .L_4X1_compute_olimit
+    lea .L_4X1_compute_olimit(%rip), %rax
+    jmp *%rax
 
 #undef GET_NEXT_DELT
 #undef DECODE_FROM_DELT
@@ -535,7 +536,8 @@ HUF_decompress4X2_usingDTable_internal_bmi2_asm_loop:
 
     cmp %op3, 48(%rsp)
     ja .L_4X2_loop_body
-    jmp .L_4X2_compute_olimit
+    lea .L_4X2_compute_olimit(%rip), %rax
+    jmp *%rax
 
 #undef DECODE
 #undef RELOAD_BITS

I built this with:

V=1 CC=gcc-11 MOREFLAGS="-fcf-protection=full" make -j zstd

As I understand it, this should trigger a fault when run under CET. But no such error occurs when I run it.

Googling shows that hardware CET support is only present starting in Tiger Lake chips. And my workstation is a Haswell... I'm assuming that GitHub's virtual environments don't support CET either. Is there any way we can simulate running under CET to validate that our binary works? Maybe under QEMU or something?

@hjl-tools
Copy link
Contributor Author

Ok, so the only remaining topic was to see if I could get a CET-violating instruction to trigger a fault. I tried the following patch on top of this PR:

--- a/lib/decompress/huf_decompress_amd64.S
+++ b/lib/decompress/huf_decompress_amd64.S
@@ -303,7 +303,8 @@ HUF_decompress4X1_usingDTable_internal_bmi2_asm_loop:
     movq 16(%rsp), %ip3
 
     /* Re-compute olimit */
-    jmp .L_4X1_compute_olimit
+    lea .L_4X1_compute_olimit(%rip), %rax
+    jmp *%rax
 
 #undef GET_NEXT_DELT
 #undef DECODE_FROM_DELT
@@ -535,7 +536,8 @@ HUF_decompress4X2_usingDTable_internal_bmi2_asm_loop:
 
     cmp %op3, 48(%rsp)
     ja .L_4X2_loop_body
-    jmp .L_4X2_compute_olimit
+    lea .L_4X2_compute_olimit(%rip), %rax
+    jmp *%rax
 
 #undef DECODE
 #undef RELOAD_BITS

I built this with:

V=1 CC=gcc-11 MOREFLAGS="-fcf-protection=full" make -j zstd

As I understand it, this should trigger a fault when run under CET. But no such error occurs when I run it.

You need to run it under a CET enabled kernel on TGL to see the CP fault.
You can use my 5.15 CET kernel:

https://github.com/hjl-tools/linux/tree/hjl/cet%2Flinux-5.15.y

Googling shows that hardware CET support is only present starting in Tiger Lake chips. And my workstation is a Haswell... I'm assuming that GitHub's virtual environments don't support CET either. Is there any way we can simulate running under CET to validate that our binary works? Maybe under QEMU or something?

@felixhandte
Copy link
Contributor

Short of going out and buying one, it looks like I don't have access to a Tiger Lake or Alder Lake system. Certainly we don't have continuous testing infrastructure on such a platform.

I'm not really sure how best to proceed. On the one hand, I have high confidence that this patch is correct at present. And on that basis I'm inclined to merge it. But I'm uncomfortable enabling a feature whose purpose is to blow up on users' machines under conditions that are (absent CET) otherwise valid, when we have no way to test that our code does not meet those conditions.

It's not totally unthinkable that some day in the future someone will want to ship an indirect jump, and we will have forgotten that this is a thing. Do you have any suggestions?

@hjl-tools
Copy link
Contributor Author

Short of going out and buying one, it looks like I don't have access to a Tiger Lake or Alder Lake system. Certainly we don't have continuous testing infrastructure on such a platform.

I'm not really sure how best to proceed. On the one hand, I have high confidence that this patch is correct at present. And on that basis I'm inclined to merge it. But I'm uncomfortable enabling a feature whose purpose is to blow up on users' machines under conditions that are (absent CET) otherwise valid, when we have no way to test that our code does not meet those conditions.

It's not totally unthinkable that some day in the future someone will want to ship an indirect jump, and we will have forgotten that this is a thing. Do you have any suggestions?

Please try Intel SDE on Linux:

https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html

You can pass "-tgl -cet" to SDE to test CET.

@felixhandte
Copy link
Contributor

@hjl-tools, ah this looks exactly like what I'm looking for. I can't seem to get it to work though.

I verified (I'm pretty sure) that my built binary has CET enabled:

$ readelf -x .note.gnu.property ./zstd

Hex dump of section '.note.gnu.property':
  0x004002c8 04000000 10000000 05000000 474e5500 ............GNU.
  0x004002d8 020000c0 04000000 03000000 00000000 ................

But when I invoke sde it runs and exits normally:

$ .../sde-external-9.0.0-2021-11-07-lin/sde -tgl -cet -cet_abort -- ./zstd -b1
 1#Synthetic 50%     :  10000000 ->   3154223 (x3.170),  348.1 MB/s,  872.3 MB/s

I can verify that the indirect jump I inserted above is being executed:

$ gdb zstd
[...]
Reading symbols from zstd...
(gdb) b huf_decompress_amd64.S:306
Breakpoint 1 at 0x48ff29: file ../lib/decompress/huf_decompress_amd64.S, line 306.
(gdb) b huf_decompress_amd64.S:539
Breakpoint 2 at 0x49045b: file ../lib/decompress/huf_decompress_amd64.S, line 539.
(gdb) r -b1
Starting program: /data/users/felixh/zstd/zstd -b1
warning: Loadable section ".note.gnu.property" outside of ELF segments
 |-Synthetic 50%     :  10000000 ->   3154223 (x3.170),  515.8 MB/s   
Breakpoint 2, _HUF_decompress4X2_usingDTable_internal_bmi2_asm_loop () at ../lib/decompress/huf_decompress_amd64.S:539
539	    lea .L_4X2_compute_olimit(%rip), %rax
(gdb) si
540	    jmp *%rax
(gdb) si
412	    movq %rdx, 0(%rsp)
(gdb) 

What am I doing wrong?

@felixhandte
Copy link
Contributor

felixhandte commented Jan 12, 2022

Aha! I figured it out. This document describes that -cet only turns on stack checks and you have to add more flags to turn on indirect jump checks. I now successfully see a failure:

$ .../sde-external-9.0.0-2021-11-07-lin/sde -tgl -cet -cet-endbr-exe -cet-stderr -cet-abort  -- ./zstd -b1
[...]
Control flow ENDBRANCH error detected at IP: 0x000000000048ffe8 INS:  mov qword ptr [rsp], rdx
     Last branch IP: 0x0000000000490462 INS:  jmp rax

$

Excellent! I will look into turning this into a test we can run continuously. And then this likely clears us to merge this PR.

@felixhandte
Copy link
Contributor

@hjl-tools, I'm trying to come up with a test we can run continuously in GitHub and once again I'm seeing behavior I don't understand. When I invoke zstd through SDE, it seems like SDE is doing nothing and zstd is being run directly:

https://github.com/felixhandte/zstd/runs/479429041

I would have expected:

  1. SDE to fail since this branch includes the violating indirect jump.
  2. SDE to at least print something. When I run it locally I first see output like:
IMG: [...]/zstd Address: 0x00050769f:0x000400000 added to the checked image map
IMG: /lib64/ld-linux-x86-64.so.2 Address: 0x7ff14c6b471b:0x7ff14c68c000 added to the legacy images list
IMG: [vdso] Address: 0x7ffdaadf500a:0x7ffdaadf4000 added to the legacy images
IMG: [vdso] Address: 0x7ffdaadf500a:0x7ffdaadf4000 added to the legacy images list
IMG: /lib/x86_64-linux-gnu/libz.so.1 Address: 0x7ff035d940af:0x7ff035b78000 added to the legacy images list
IMG: /lib/x86_64-linux-gnu/libpthread.so.0 Address: 0x7ff035ab247f:0x7ff035894000 added to the legacy images list
IMG: /lib/x86_64-linux-gnu/libc.so.6 Address: 0x7ff03587cadf:0x7ff03548c000 added to the legacy images list
  1. Zstd to run much slower than normal, since it's supposed to be under at least partial emulation. But I see it running at normal speeds.

Do you have any insight into what's going wrong?

@hjl-tools
Copy link
Contributor Author

@hjl-tools, I'm trying to come up with a test we can run continuously in GitHub and once again I'm seeing behavior I don't understand. When I invoke zstd through SDE, it seems like SDE is doing nothing and zstd is being run directly:

https://github.com/felixhandte/zstd/runs/479429041

I would have expected:

  1. SDE to fail since this branch includes the violating indirect jump.
  2. SDE to at least print something. When I run it locally I first see output like:
IMG: [...]/zstd Address: 0x00050769f:0x000400000 added to the checked image map
IMG: /lib64/ld-linux-x86-64.so.2 Address: 0x7ff14c6b471b:0x7ff14c68c000 added to the legacy images list
IMG: [vdso] Address: 0x7ffdaadf500a:0x7ffdaadf4000 added to the legacy images
IMG: [vdso] Address: 0x7ffdaadf500a:0x7ffdaadf4000 added to the legacy images list
IMG: /lib/x86_64-linux-gnu/libz.so.1 Address: 0x7ff035d940af:0x7ff035b78000 added to the legacy images list
IMG: /lib/x86_64-linux-gnu/libpthread.so.0 Address: 0x7ff035ab247f:0x7ff035894000 added to the legacy images list
IMG: /lib/x86_64-linux-gnu/libc.so.6 Address: 0x7ff03587cadf:0x7ff03548c000 added to the legacy images list
  1. Zstd to run much slower than normal, since it's supposed to be under at least partial emulation. But I see it running at normal speeds.

Do you have any insight into what's going wrong?

How can I reproduce it?

@felixhandte
Copy link
Contributor

You can grab my attempt here:

git fetch git@github.com:felixhandte/zstd.git pull/6/head:cet-action-test

You can then open a PR with that code against a zstd repo (mine, for example) to run the actions.

@hjl-tools
Copy link
Contributor Author

You can grab my attempt here:

git fetch git@github.com:felixhandte/zstd.git pull/6/head:cet-action-test

You can then open a PR with that code against a zstd repo (mine, for example) to run the actions.

I reproduced it. I reported it to our SDE developers.

@hjl-tools
Copy link
Contributor Author

You can grab my attempt here:

git fetch git@github.com:felixhandte/zstd.git pull/6/head:cet-action-test

You can then open a PR with that code against a zstd repo (mine, for example) to run the actions.

Please check felixhandte#7
You need to pass "-cet -cet-raise 0", instead of "-tgl", to SDE, to catch missing ENDBR.

@felixhandte
Copy link
Contributor

@hjl-tools, that worked! Excellent!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants