Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miscompilation of exception handling with NDK r16 using -Oz #573

Closed
b-spencer opened this issue Nov 21, 2017 · 7 comments
Closed

Miscompilation of exception handling with NDK r16 using -Oz #573

b-spencer opened this issue Nov 21, 2017 · 7 comments
Labels
Milestone

Comments

@b-spencer
Copy link

Description

When using Android NDK r16, C++ code that throws an exception does not always work. Our analysis so far points to an error in the exception handling tables when compiled with -Oz. The same code works as expected with NDK r15c or when -Os is used instead of -Oz.

I have prepared a repository that demonstrates the issue. This repository also includes a PDF that details the exact nature of this specific miscompilation.

https://github.com/b-spencer/android-ndk-r16-miscompilation

Output with r15c or r16 with -Os:

bytes: 8080
expected1
bytes: 8080
expected2

Output with r16 with -Oz (the default):

bytes: 8080
expected1
bytes: fa33
$$$$$$$$$$$$$$$$ FAILED
expected2

Environment Details

  • NDK Version: 16.0.4442984
  • Clang Version: Android clang version 5.0.300080 (based on LLVM 5.0.300080)
  • Build sytem: ndk-build
  • Host OS: Linux Debian 9
  • Compiler: clang
  • ABI: armeabi
  • STL: c++_shared
  • NDK API level: android-24
  • Device API level: android-24
@rprichard
Copy link
Collaborator

I reduced the test case a bit more here: https://github.com/rprichard/android-ndk-r16-miscompilation. In particular, std::string isn't needed, so I removed it.

We're miscompiling test_fn.cpp, which is just this C++ code:

extern "C" void impl(int *pi);

extern "C" void work()
{
  int i = 10;
  impl(&i);
}

As with the original submission, run the test case with ./1-build.sh && ./2-run.sh:

$ ./1-build.sh  && ./2-run.sh 
...
ERROR: (1, 10) != (1, 2)

The problem would likely reproduce in a non-Android ARM environment -- in that case, just compile the jni/test.cpp and jni/test_fn.cpp files together. The only dependency is stdio.h.

The assembly from test_fn.cpp shows the same problem described in the original PDF analysis:

  • the prologue incidentally saves r5 and r6 (it just needs to reserve two stack locations)
  • the function reuses the two words it had used for r5 and r6
  • the epilogue throws away the first two words into r2 and r3, which is fine, but...
  • if an exception is thrown, the overwritten values are instead placed into r5 and r6, overwriting the caller's values
	.globl	work                    @ -- Begin function work
	.p2align	2
	.type	work,%function
	.code	16                      @ @work
	.thumb_func
work:
	.fnstart
@ BB#0:
	.save	{r5, r6, r7, lr}
	push	{r5, r6, r7, lr}
	.setfp	r7, sp, #8
	add	r7, sp, #8
	ldr	r0, .LCPI0_0
.LPC0_0:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r0, [r0]
	str	r0, [sp, #4]
	movs	r0, #10
	str	r0, [sp]
	mov	r0, sp
	bl	impl
	ldr	r0, .LCPI0_1
.LPC0_1:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r0, [r0]
	ldr	r1, [sp, #4]
	subs	r0, r0, r1
	it	eq
	popeq	{r2, r3, r7, pc}
	bl	__stack_chk_fail
	.p2align	2
@ BB#1:
.LCPI0_0:
.Ltmp0:
	.long	__stack_chk_guard(GOT_PREL)-((.LPC0_0+4)-.Ltmp0)
.LCPI0_1:
.Ltmp1:
	.long	__stack_chk_guard(GOT_PREL)-((.LPC0_1+4)-.Ltmp1)
.Lfunc_end0:
	.size	work, .Lfunc_end0-work
	.fnend
                                        @ -- End function

I also tried the test case with a newer NDK compiler, and it still failed:

Android clang version 5.0.1 (https://us3-mirror-android.googlesource.com/toolchain/clang 00e4a5a67eb7d626653c23780ff02367ead74955) (https://us3-mirror-android.googlesource.com/toolchain/llvm ef376ecb7d9c1460216126d102bb32fc5f73800d) (based on LLVM 5.0.1svn)

@rprichard
Copy link
Collaborator

Here's a example of the problem using the Godbolt Compiler Explorer (https://godbolt.org/g/rJRr75) and a very recent upstream Clang (clang version 6.0.0 (trunk 318735)). It looks like r318735 is about a day old.

The example at that link saves r10:

        .save   {r9, r10, r11, lr}
        push    {r9, r10, r11, lr}
        .setfp  r11, sp, #8
        add     r11, sp, #8

... but then overwrites it with the value of the i local variable:

        str     r0, [sp, #4]

As long as an exception isn't thrown, it avoids restoring the overwritten value into r10:

        mov     sp, r11
        pop     {r11, lr}
        bx      lr

But if an exception is thrown, then I think the .save directive would result in the value of i being written into r10, and r10 is a callee-save/non-volatile register.

@stephenhines
Copy link
Collaborator

I created an upstream LLVM bug and CCed some of our connections there - https://bugs.llvm.org/show_bug.cgi?id=35379. My suspicion is that an optimization for -Oz uses bogus registers in the save mask to create free space for local variables. In the restore path, this space is skipped over completely before restoring the real registers. Unfortunately, exceptions mess this up because the unwinder will look at the .save directive, which specifies the bogus registers (and then will ultimately clobber them).

@DanAlbert
Copy link
Member

DanAlbert commented Nov 30, 2017

FYI, will be making an r16b to revert the Os -> Oz switch (will be going back to Os).

Will probably be r18 by the time we have a clang with the fix for this miscompile (bug was only just discovered, so has to get fixed and then we need to update Android's Clang, which requires a lot of validation time).

@b-spencer
Copy link
Author

b-spencer commented Dec 21, 2017

BTW, I've confirmed that r16b fixes this problem on the original code (by using its default of -Os).

@DanAlbert DanAlbert added this to the r18 milestone Mar 16, 2018
miodragdinic pushed a commit to MIPS/ndk that referenced this issue Apr 17, 2018
Test: ./checkbuild.py && ./run_tests.py
Bug: android/ndk#573
Change-Id: I75362443a3f4c293e5b674e6b9e01f6792c9951c
@rprichard
Copy link
Collaborator

I retested https://github.com/rprichard/android-ndk-r16-miscompilation, and it passed now. The upstream bug was fixed in https://reviews.llvm.org/rL321996. It's included in the r18 canary clang (4751641 based on r328903).

The NDK build systems still default to -Os rather than -Oz. Should we switch it back?

@DanAlbert
Copy link
Member

The NDK build systems still default to -Os rather than -Oz. Should we switch it back?

On it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants
@DanAlbert @rprichard @b-spencer @stephenhines and others