"native stack overflow" detection is sometimes inappropriate #2105

Closed
yamt opened this issue Apr 4, 2023 · 30 comments

Comments

@yamt
Collaborator

yamt commented Apr 4, 2023

check_stack_boundary uses the number of parameters and locals to calculate the required stack:

aot_func = func_ctxes[func_idx - import_func_count]->aot_func;
callee_cell_num =
    aot_func->param_cell_num + aot_func->local_cell_num + 1;
if (!check_stack(comp_ctx, func_ctx, callee_cell_num))
    goto fail;

unfortunately, the estimation is not always appropriate.

for example, the code attached to this issue is compiled to wasm bytecode like:

/opt/wasi-sdk-20.0/bin/clang++ -O0 -fno-exceptions a.cxx

(don't ask me why -O0 is used here. how to build a wasm module is not important to the rest of this issue.)

000d7b func[14] <main>:
 000d7c: c1 21 7f                   | local[3..4291] type=i32
 000d7f: 23 80 80 80 80 00          | global.get 0 <__stack_pointer>
 000d85: 21 02                      | local.set 2
 000d87: 41 10                      | i32.const 16
 000d89: 21 03                      | local.set 3
 000d8b: 20 02                      | local.get 2
 000d8d: 20 03                      | local.get 3
 000d8f: 6b                         | i32.sub
 000d90: 21 04                      | local.set 4
 000d92: 20 04                      | local.get 4
 000d94: 24 80 80 80 80 00          | global.set 0 <__stack_pointer>

note that the function has a ton of locals.
check_stack_boundary uses the number of locals to estimate the stack usage.

there are at least a few issues:

  • the estimation seems to assume that explicit allocas are the major source of stack consumption. actually, it depends: if there are not enough machine registers, llvm places temporary variables on the stack. you can see it in the --opt-level=0 output below.
  • the estimation assumes later optimization passes don't change the stack consumption much. actually they do: in the above example code, the optimized output of the function only consumes a small amount of stack. you can see it in the --opt-level=3 output below.

amd64 wamrc --opt-level=0 output of the function:

0000000000000270 <aot_func#3>:
     270: 48 81 ec 98 4e 01 00          subq    $85656, %rsp            # imm = 0x14E98
     277: 41 89 f0                      movl    %esi, %r8d
     27a: 48 89 bc 24 70 4e 01 00       movq    %rdi, 85616(%rsp)
     282: 48 89 f8                      movq    %rdi, %rax
     285: 48 83 c0 10                   addq    $16, %rax

amd64 wamrc --opt-level=3 output of the function:

0000000000000170 <aot_func#3>:
     170: 55                            pushq   %rbp
     171: 41 57                         pushq   %r15
     173: 41 56                         pushq   %r14
     175: 41 55                         pushq   %r13
     177: 41 54                         pushq   %r12
     179: 53                            pushq   %rbx
     17a: 50                            pushq   %rax
     17b: 49 89 fd                      movq    %rdi, %r13
     17e: 48 8b 47 10                   movq    16(%rdi), %rax
     182: 8b 88 a8 01 00 00             movl    424(%rax), %ecx
     188: 83 c1 f0                      addl    $-16, %ecx

a.zip

@wenyongh
Contributor

wenyongh commented Apr 4, 2023

@yamt Seems there are two issues: one is that the estimated size of native stack used by the callee may be too small for wamrc --opt-level=0, which may cause a stack overflow core dump; the other is that the estimated size of native stack used by the callee may be too large for wamrc --opt-level=3, which doesn't cause a core dump, but it wastes a lot of space, right? Do you have a good idea to fix them?

@yamt
Collaborator Author

yamt commented Apr 5, 2023

@yamt Seems there are two issues: one is that the estimated size of native stack used by the callee may be too small for wamrc --opt-level=0, which may cause a stack overflow core dump; the other is that the estimated size of native stack used by the callee may be too large for wamrc --opt-level=3, which doesn't cause a core dump, but it wastes a lot of space, right?

right.

Do you have a good idea to fix them?

no good idea right now.

unless llvm itself has a nice feature to report the stack usage (i dunno if it does),
i suspect we have to investigate the optimized IRs or even the machine code.

even if we have a precise estimation, it might be tricky to feed it to the generated check code. (can we change the constant after optimization passes?)

@yamt
Collaborator Author

yamt commented Apr 13, 2023

another example of underestimation.

wasm binary: many_stack5.wasm.zip
its source: https://github.com/yamt/toywasm/blob/f917a189f3970959fb5f2ea72083cc62b88829c6/wat/many_stack.wat.in#L26

wamrc -o o --stack-bounds-checks=1 many_stack5.wasm

callee

0000000000000010 <aot_func#1>:
      10: 55                            pushq   %rbp
      11: 41 57                         pushq   %r15
      13: 41 56                         pushq   %r14
      15: 41 55                         pushq   %r13
      17: 41 54                         pushq   %r12
      19: 53                            pushq   %rbx
      1a: 48 81 ec 08 15 00 00          subq    $5384, %rsp             # imm = 0x1508

caller

000000000000d460 <aot_func#2>:
    d460: 53                            pushq   %rbx
    d461: 48 83 ec 10                   subq    $16, %rsp
    d465: 48 8b 47 10                   movq    16(%rdi), %rax
    d469: 48 8b 4f 20                   movq    32(%rdi), %rcx
    d46d: 48 83 c1 08                   addq    $8, %rcx
    d471: 48 8d 54 24 0f                leaq    15(%rsp), %rdx
    d476: 48 39 ca                      cmpq    %rcx, %rdx
    d479: 73 13                         jae     0xd48e <aot_func#2+0x2e>
    d47b: 48 89 c7                      movq    %rax, %rdi
    d47e: be 0b 00 00 00                movl    $11, %esi
    d483: e8 00 00 00 00                callq   0xd488 <aot_func#2+0x28>
                000000000000d484:  R_X86_64_PLT32       aot_set_exception_with_id-0x4
    d488: 48 83 c4 10                   addq    $16, %rsp
    d48c: 5b                            popq    %rbx
    d48d: c3                            retq
    d48e: 48 8b 98 58 01 00 00          movq    344(%rax), %rbx
    d495: be 01 00 00 00                movl    $1, %esi
    d49a: e8 00 00 00 00                callq   0xd49f <aot_func#2+0x3f>
                000000000000d49b:  R_X86_64_PLT32       aot_func#1-0x4
    d49f: 89 c0                         movl    %eax, %eax
    d4a1: c7 04 03 00 00 00 00          movl    $0, (%rbx,%rax)
    d4a8: 48 83 c4 10                   addq    $16, %rsp
    d4ac: 5b                            popq    %rbx
    d4ad: c3                            retq

@yamt
Collaborator Author

yamt commented Apr 27, 2023

i'm thinking of making it 2-pass.

  1. the first pass calculates stack usage for each function. something like wamrc: Add --stack-usage option #2158
  2. the second pass uses the info when generating the check functions.

an obvious downside is that it would double the compilation time.
what do you think?

@yamt
Collaborator Author

yamt commented May 2, 2023

a related topic:

right now, wamrc injects stack checking logic in the caller.

however, i think it's simpler to perform the stack check in the callee because
the caller-side check is difficult/impossible for call_indirect and imported functions.
(while it doesn't matter as long as imported functions are always native functions,
i guess the situation should change in the future.)

i think it can be done with a wrapper function like:

  aot_func#111 // wrapper function
    perform the stack check, probably using the info from https://github.com/bytecodealliance/wasm-micro-runtime/pull/2158

    if not enough stack, set_exception and return

    otherwise,
    tail call aot_func_body#111

  aot_func_body#111 noinline
    // real implementation

the stack consumption of the wrapper functions is not a big problem
because it should be constant.

what do you think?

@yamt
Collaborator Author

yamt commented May 2, 2023

i did a bit of research on a few seemingly related llvm features.

  • probe-stack

    llvm-level attribute to instrument alloca with a call.

    i guess it's primarily for windows' __chkstk.
    http://msdn.microsoft.com/en-us/library/ms648426.aspx
    recent versions of apple clang seem to use an equivalent by default. (___chkstk_darwin)

    it seems available for multiple platforms including x86.
    but probably not for e.g. xtensa.

    while this probe can be used to detect overflow,
    i can't think of a straightforward way to implement wamr-specific
    recovery using this probe. (eg. set_exception and make the caller return)

  • StackClashProtector

    when moving the stack pointer, avoid skipping the guard page.
    probably we should enable this.
    but it doesn't help where OS_ENABLE_HW_BOUND_CHECK is not available.

  • segmented stack

    https://llvm.org/docs/SegmentedStacks.html
    dynamically grow stack with some support from runtime (__morestack) and linker.
    interesting but probably a bit off topic.

  • stack protector

    a canary-based corruption detector.
    i'm not interested right now because it can't prevent corruption.

@wenyongh
Contributor

wenyongh commented May 4, 2023

a related topic:

right now, wamrc injects stack checking logic in the caller.

however, i think it's simpler to perform the stack check in the callee because the caller-side check is difficult/impossible for call_indirect and imported functions. (while it doesn't matter as long as imported functions are always native functions, i guess the situation should change in the future.)

i think it can be done with a wrapper function like:

  aot_func#111 // wrapper function
    perform the stack check, probably using the info from https://github.com/bytecodealliance/wasm-micro-runtime/pull/2158

    if not enough stack, set_exception and return

    otherwise,
    tail call aot_func_body#111

  aot_func_body#111 noinline
    // real implementation

the stack consumption of the wrapper functions is not a big problem because it should be constant.

what do you think?

Hi, yes, I agree to add the check in the callee, but we had better not implement it with a wrapper function since the function call may impact the performance a lot. Instead, supposing we can get the current stack pointer, we may add checks in both the callee and the caller:
(1) in the callee, suppose that the stack pointer is always updated at the beginning of the AOT code, i.e., code like subq $16, %rsp always comes before the check code we add; then in the check code we can just check whether the sp is smaller than the stack min (or stack boundary), and if so, report failure and return. The generated machine code may be like:

    subq    $16, %rsp
    movq    %rsp, %reg
    cmpq    stack_min, %reg
    jl      error
    ...
error:
    ret

(2) in the caller, when calling functions, we can just check whether there is enough stack space to pass the function arguments, or to ensure that the call instruction is safely executed, since at the beginning of the callee the stack pointer will be checked again after it is subtracted. The necessary stack space to pass the function arguments depends on the calling conventions of the target, like:
https://en.wikipedia.org/wiki/X86_calling_conventions
It may be a little complex to calculate the size precisely for each ABI, so maybe we can use an estimated maximum size.

Adding two checks may degrade the performance, so maybe we can merge them: in the loader, record the maximum size needed to call the functions (we know the func type to call), and then, at the beginning of each AOT function, check whether sp - max_size_needed_to_call_func < sp_min.

As for how to get the stack pointer, I am not very sure; there may be some methods. One is to use the LLVM intrinsic llvm.read_register:
https://llvm.org/docs/LangRef.html#llvm-read-register-llvm-read-volatile-register-and-llvm-write-register-intrinsics
Another may be like the current implementation: use LLVMBuildAlloca to allocate memory from the stack.

@yamt
Collaborator Author

yamt commented May 8, 2023

a related topic:
right now, wamrc injects stack checking logic in the caller.
however, i think it's simpler to perform the stack check in the callee because the caller-side check is difficult/impossible for call_indirect and imported functions. (while it doesn't matter as long as imported functions are always native functions, i guess the situation should change in the future.)
i think it can be done with a wrapper function like:

  aot_func#111 // wrapper function
    perform the stack check, probably using the info from https://github.com/bytecodealliance/wasm-micro-runtime/pull/2158

    if not enough stack, set_exception and return

    otherwise,
    tail call aot_func_body#111

  aot_func_body#111 noinline
    // real implementation

the stack consumption of the wrapper functions is not a big problem because it should be constant.
what do you think?

Hi, yes, I agree to add the check in the callee, but we had better not implement it with a wrapper function since the function call may impact the performance a lot.

i suppose it isn't that expensive. on x86, it can be a jmp.
but it needs a benchmark/evaluation anyway.

Instead, supposing we can get the current stack pointer, we may add checks in both the callee and the caller: (1) in the callee, suppose that the stack pointer is always updated at the beginning of the AOT code, i.e., code like subq $16, %rsp always comes before the check code we add; then in the check code we can just check whether the sp is smaller than the stack min (or stack boundary), and if so, report failure and return. The generated machine code may be like:

    subq    $16, %rsp
    movq    %rsp, %reg
    cmpq    stack_min, %reg
    jl      error
    ...
error:
    ret

i feel it's problematic (or at least tricky) to check after the subtraction.
e.g. it needs sigaltstack

(2) in the caller, when calling functions, we can just check whether there is enough stack space to pass the function arguments, or to ensure that the call instruction is safely executed, since at the beginning of the callee the stack pointer will be checked again after it is subtracted. The necessary stack space to pass the function arguments depends on the calling conventions of the target, like: https://en.wikipedia.org/wiki/X86_calling_conventions It may be a little complex to calculate the size precisely for each ABI, so maybe we can use an estimated maximum size.

at this point, i'm not much concerned about function arguments
because we only support a moderate number of parameters/results.

Adding two checks may degrade the performance, so maybe we can merge them: in the loader, record the maximum size needed to call the functions (we know the func type to call), and then, at the beginning of each AOT function, check whether sp - max_size_needed_to_call_func < sp_min.

why in the loader?

As for how to get the stack pointer, I am not very sure; there may be some methods. One is to use the LLVM intrinsic llvm.read_register: https://llvm.org/docs/LangRef.html#llvm-read-register-llvm-read-volatile-register-and-llvm-write-register-intrinsics Another may be like the current implementation: use LLVMBuildAlloca to allocate memory from the stack.

i suppose LLVMBuildAlloca or llvm.frameaddress is a bit more portable than llvm.read_register.

@wenyongh
Contributor

wenyongh commented May 8, 2023

Hi, yes, I agree to add the check in the callee, but we had better not implement it with a wrapper function since the function call may impact the performance a lot.

i suppose it isn't that expensive. on x86, it can be a jmp. but it needs a benchmark/evaluation anyway.

It should be cmp + jcc instructions, and yes, we had better test it with benchmarks.

Instead, supposing we can get the current stack pointer, we may add checks in both the callee and the caller: (1) in the callee, suppose that the stack pointer is always updated at the beginning of the AOT code, i.e., code like subq $16, %rsp always comes before the check code we add; then in the check code we can just check whether the sp is smaller than the stack min (or stack boundary), and if so, report failure and return. The generated machine code may be like:

    subq    $16, %rsp
    movq    %rsp, %reg
    cmpq    stack_min, %reg
    jl      error
    ...
error:
    ret

i feel it's problematic (or at least tricky) to check after the subtraction. e.g. it needs sigaltstack

IIUC, the subq sp instruction is part of the entry code of the generated function, which normally pushes some registers and then subtracts the sp. After the entry code, the compiler then generates the code for the LLVM IRs created by the developer. If you have concerns, we can try using LLVMBuildAlloca or llvm.frameaddress as you mentioned below; the address of a variable allocated by LLVMBuildAlloca should be assigned after the sp is subtracted, since it would be unreasonable for the compiler to hand a stack variable to the developer first and only then reserve stack space for its internal usage.
Not sure why we need sigaltstack?

(2) in the caller, when calling functions, we can just check whether there is enough stack space to pass the function arguments, or to ensure that the call instruction is safely executed, since at the beginning of the callee the stack pointer will be checked again after it is subtracted. The necessary stack space to pass the function arguments depends on the calling conventions of the target, like: https://en.wikipedia.org/wiki/X86_calling_conventions It may be a little complex to calculate the size precisely for each ABI, so maybe we can use an estimated maximum size.

at this point, i'm not much concerned about function arguments because we only support a moderate number of parameters/results.

Adding two checks may degrade the performance, so maybe we can merge them: in the loader, record the maximum size needed to call the functions (we know the func type to call), and then, at the beginning of each AOT function, check whether sp - max_size_needed_to_call_func < sp_min.

why in the loader?

There may be many function calls inside a function, and the callees' function types differ; if we want to calculate the required stack size to pass the callees' arguments, we should calculate them first and take the maximum size. For example, on Linux x86-64, function A may call function B(i32, i32, i32, i32, i32, i32) and C(i32, i32, i32, i32, i32, i32, i32); since the first 6 int arguments are passed in registers (RDI, RSI, RDX, RCX, R8, R9), no stack space is required to call B and 8 bytes are needed to call C. On Linux x86-32, all these arguments are passed on the stack: B requires 24 bytes, C requires 28 bytes. And it is more complex when considering float registers and more ABIs.
If we don't want to calculate the size precisely, we may assume that all the arguments are passed on the stack, and estimate a maximum size needed.

As for how to get the stack pointer, I am not very sure; there may be some methods. One is to use the LLVM intrinsic llvm.read_register: https://llvm.org/docs/LangRef.html#llvm-read-register-llvm-read-volatile-register-and-llvm-write-register-intrinsics Another may be like the current implementation: use LLVMBuildAlloca to allocate memory from the stack.

i suppose LLVMBuildAlloca or llvm.frameaddress is a bit more portable than llvm.read_register.

Agree.

@yamt
Collaborator Author

yamt commented May 8, 2023

Hi, yes, I agree to add the check in the callee, but we had better not implement it with a wrapper function since the function call may impact the performance a lot.

i suppose it isn't that expensive. on x86, it can be a jmp. but it needs a benchmark/evaluation anyway.

It should be cmp + jcc instructions, and yes, we had better test it with benchmarks.

the cmp + jcc part is necessary regardless of a wrapper function.
i meant that the extra overhead of the wrapper function could be as small as a jmp.

Instead, supposing we can get the current stack pointer, we may add checks in both the callee and the caller: (1) in the callee, suppose that the stack pointer is always updated at the beginning of the AOT code, i.e., code like subq $16, %rsp always comes before the check code we add; then in the check code we can just check whether the sp is smaller than the stack min (or stack boundary), and if so, report failure and return. The generated machine code may be like:

    subq    $16, %rsp
    movq    %rsp, %reg
    cmpq    stack_min, %reg
    jl      error
    ...
error:
    ret

i feel it's problematic (or at least tricky) to check after the subtraction. e.g. it needs sigaltstack

IIUC, the subq sp instruction is part of the entry code of the generated function, which normally pushes some registers and then subtracts the sp. After the entry code, the compiler then generates the code for the LLVM IRs created by the developer. If you have concerns, we can try using LLVMBuildAlloca or llvm.frameaddress as you mentioned below; the address of a variable allocated by LLVMBuildAlloca should be assigned after the sp is subtracted, since it would be unreasonable for the compiler to hand a stack variable to the developer first and only then reserve stack space for its internal usage. Not sure why we need sigaltstack?

i meant it's safer to check before actually subtracting the stack pointer.

if you subtract the stack pointer before the check, it might point to a non-stack area.
if the thread gets a signal, the signal handler can clobber that non-stack area. (w/o sigaltstack)

(2) in the caller, when calling functions, we can just check whether there is enough stack space to pass the function arguments, or to ensure that the call instruction is safely executed, since at the beginning of the callee the stack pointer will be checked again after it is subtracted. The necessary stack space to pass the function arguments depends on the calling conventions of the target, like: https://en.wikipedia.org/wiki/X86_calling_conventions It may be a little complex to calculate the size precisely for each ABI, so maybe we can use an estimated maximum size.

at this point, i'm not much concerned about function arguments because we only support a moderate number of parameters/results.

Adding two checks may degrade the performance, so maybe we can merge them: in the loader, record the maximum size needed to call the functions (we know the func type to call), and then, at the beginning of each AOT function, check whether sp - max_size_needed_to_call_func < sp_min.

why in the loader?

There may be many function calls inside a function, and the callees' function types differ; if we want to calculate the required stack size to pass the callees' arguments, we should calculate them first and take the maximum size. For example, on Linux x86-64, function A may call function B(i32, i32, i32, i32, i32, i32) and C(i32, i32, i32, i32, i32, i32, i32); since the first 6 int arguments are passed in registers (RDI, RSI, RDX, RCX, R8, R9), no stack space is required to call B and 8 bytes are needed to call C. On Linux x86-32, all these arguments are passed on the stack: B requires 24 bytes, C requires 28 bytes. And it is more complex when considering float registers and more ABIs. If we don't want to calculate the size precisely, we may assume that all the arguments are passed on the stack, and estimate a maximum size needed.

i'm not familiar enough with llvm to say if max(A, B, C) is always the right calculation.
i suspect llvm can produce machine code which consumes sum(A, B, C).
it's a trade-off between the stack consumption and the number of stack-restoring instructions.

As for how to get the stack pointer, I am not very sure; there may be some methods. One is to use the LLVM intrinsic llvm.read_register: https://llvm.org/docs/LangRef.html#llvm-read-register-llvm-read-volatile-register-and-llvm-write-register-intrinsics Another may be like the current implementation: use LLVMBuildAlloca to allocate memory from the stack.

i suppose LLVMBuildAlloca or llvm.frameaddress is a bit more portable than llvm.read_register.

Agree.

@wenyongh
Contributor

wenyongh commented May 8, 2023

IIUC, the subq sp instruction is part of the entry code of the generated function, which normally pushes some registers and then subtracts the sp. After the entry code, the compiler then generates the code for the LLVM IRs created by the developer. If you have concerns, we can try using LLVMBuildAlloca or llvm.frameaddress as you mentioned below; the address of a variable allocated by LLVMBuildAlloca should be assigned after the sp is subtracted, since it would be unreasonable for the compiler to hand a stack variable to the developer first and only then reserve stack space for its internal usage. Not sure why we need sigaltstack?

i meant it's safer to check before actually subtract the stack pointer.

if you subtract the stack pointer before the check, it might point to non-stack area. if the thread gets a signal, the signal handler can clobber the non-stack area. (w/o sigaltstack)

My understanding is that the sub instruction is generated before the check instructions we add; we cannot add checks before the sub instruction, as that behavior is controlled by the LLVM compiler. And at that point the machine code has just subtracted the sp; it hasn't actually accessed the local variables yet. It may have pushed several necessary callee-saved registers, but we can check the size at the caller to make those push operations safe.

There may be many function calls inside a function, and the callees' function types differ; if we want to calculate the required stack size to pass the callees' arguments, we should calculate them first and take the maximum size. For example, on Linux x86-64, function A may call function B(i32, i32, i32, i32, i32, i32) and C(i32, i32, i32, i32, i32, i32, i32); since the first 6 int arguments are passed in registers (RDI, RSI, RDX, RCX, R8, R9), no stack space is required to call B and 8 bytes are needed to call C. On Linux x86-32, all these arguments are passed on the stack: B requires 24 bytes, C requires 28 bytes. And it is more complex when considering float registers and more ABIs. If we don't want to calculate the size precisely, we may assume that all the arguments are passed on the stack, and estimate a maximum size needed.

i'm not familiar enough with llvm to say if max(A, B, C) is always the right calculation. i suspect llvm can produce machine code which consumes sum(A, B, C). it's a trade-off between the stack consumption and the number of stack-restoring instructions.

Which arguments are passed in int/float registers and which are stored into stack space depends on the target calling convention; we have handled such operations in wasm_runtime_invoke_native and fast-jit. For x86, we can refer to https://en.wikipedia.org/wiki/X86_calling_conventions.

Anyway, it is only necessary to calculate max(A, B, C) when we want to add a check at the beginning of a function to improve performance. If we add checks in both the caller (before calling A/B/C) and the callee (at the beginning of A/B/C), there is no need to calculate it in the wasm loader; instead, we may calculate the required size in the caller (before calling A/B/C).

@yamt
Collaborator Author

yamt commented May 10, 2023

IIUC, the subq sp instruction is part of the entry code of the generated function, which normally pushes some registers and then subtracts the sp. After the entry code, the compiler then generates the code for the LLVM IRs created by the developer. If you have concerns, we can try using LLVMBuildAlloca or llvm.frameaddress as you mentioned below; the address of a variable allocated by LLVMBuildAlloca should be assigned after the sp is subtracted, since it would be unreasonable for the compiler to hand a stack variable to the developer first and only then reserve stack space for its internal usage. Not sure why we need sigaltstack?

i meant it's safer to check before actually subtracting the stack pointer.
if you subtract the stack pointer before the check, it might point to a non-stack area. if the thread gets a signal, the signal handler can clobber that non-stack area. (w/o sigaltstack)

My understanding is that the sub instruction is generated before the check instructions we add; we cannot add checks before the sub instruction, as that behavior is controlled by the LLVM compiler.

you are right that it's difficult (or probably impossible) to control these frame-generating instructions
at the IR level. the only thing i could find was the probe-stack attribute i mentioned above.
that's why i ended up with the wrapper-function idea mentioned above.

And at that point the machine code has just subtracted the sp; it hasn't actually accessed the local variables yet. It may have pushed several necessary callee-saved registers, but we can check the size at the caller to make those push operations safe.

if you use sigaltstack, it's probably safe.
i still feel it's a bit fragile though.

There may be many function calls inside a function, and the callees' function types differ; if we want to calculate the required stack size to pass the callees' arguments, we should calculate them first and take the maximum size. For example, on Linux x86-64, function A may call function B(i32, i32, i32, i32, i32, i32) and C(i32, i32, i32, i32, i32, i32, i32); since the first 6 int arguments are passed in registers (RDI, RSI, RDX, RCX, R8, R9), no stack space is required to call B and 8 bytes are needed to call C. On Linux x86-32, all these arguments are passed on the stack: B requires 24 bytes, C requires 28 bytes. And it is more complex when considering float registers and more ABIs. If we don't want to calculate the size precisely, we may assume that all the arguments are passed on the stack, and estimate a maximum size needed.

i'm not familiar enough with llvm to say if max(A, B, C) is always the right calculation. i suspect llvm can produce machine code which consumes sum(A, B, C). it's a trade-off between the stack consumption and the number of stack-restoring instructions.

Which arguments are passed in int/float registers and which are stored into stack space depends on the target calling convention; we have handled such operations in wasm_runtime_invoke_native and fast-jit. For x86, we can refer to https://en.wikipedia.org/wiki/X86_calling_conventions.

Anyway, it is only necessary to calculate max(A, B, C) when we want to add a check at the beginning of a function to improve performance. If we add checks in both the caller (before calling A/B/C) and the callee (at the beginning of A/B/C), there is no need to calculate it in the wasm loader; instead, we may calculate the required size in the caller (before calling A/B/C).

@yamt
Collaborator Author

yamt commented May 10, 2023

i'm thinking of generating something like this:

IR

declare i32 @dummy(i32 *)
declare i1 @llvm.expect.i1(i1, i1)
@bound = global i64 2147483648 ; a dummy value for testing

define i32 @func_wrapper(i32, i32) {
b1:
	%sp = alloca i8, i32 1
	%nextsp = getelementptr i8, i8* %sp, i32 -400007 ; -(100000*4)+1
	%bound_i = load i64, i64* @bound
	%bound_p = inttoptr i64 %bound_i to i8*
	%cmp = icmp ult i8* %nextsp, %bound_p
	%cmp2 = call i1 @llvm.expect.i1(i1 %cmp, i1 0)
	br i1 %cmp2, label %b3, label %b2
b2:
	%c = tail call i32 @func_body(i32 %0, i32 %1)
	ret i32 %c
b3:
	ret i32 1
}

define i32 @func_body(i32, i32) noinline {
	%p = alloca i32, i32 100000
	%c = call i32 @dummy(i32* %p)
	ret i32 %c
}

the corresponding amd64 -O3 output

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_func_wrapper                   ## -- Begin function func_wrapper
	.p2align	4, 0x90
_func_wrapper:                          ## @func_wrapper
	.cfi_startproc
## %bb.0:                               ## %b1
	leaq	-400008(%rsp), %rax
	cmpq	_bound(%rip), %rax
	jb	LBB0_1
## %bb.2:                               ## %b2
	jmp	_func_body                      ## TAILCALL
LBB0_1:                                 ## %common.ret
	movl	$1, %eax
	retq
	.cfi_endproc
                                        ## -- End function
	.globl	_func_body                      ## -- Begin function func_body
	.p2align	4, 0x90
_func_body:                             ## @func_body
	.cfi_startproc
## %bb.0:
	subq	$400008, %rsp                   ## imm = 0x61A88
	.cfi_def_cfa_offset 400016
	leaq	8(%rsp), %rdi
	callq	_dummy
	addq	$400008, %rsp                   ## imm = 0x61A88
	retq
	.cfi_endproc
                                        ## -- End function
	.section	__DATA,__data
	.globl	_bound                          ## @bound
	.p2align	3
_bound:
	.quad	2147483648                      ## 0x80000000

.subsections_via_symbols

@wenyongh
Contributor

Do you mean rename the original aot_func#n to aot_func#n_body, then add a new aot_func#n to call aot_func#n_body, and add the sp check in the new aot_func#n? This seems to add the check in the callee, and it introduces another jmp instruction:

        cmpq	_bound(%rip), %rax
	jb	LBB0_1
## %bb.2:                               ## %b2
	jmp	_func_body                      ## TAILCALL
LBB0_1:
        movl	$1, %eax
	retq

Note that the most likely case is cmpq + jb + jmp, 3 instructions. I think the ideal result is like:

        cmpq	_bound(%rip), %rax
	jnb	_func_body            ## TAILCALL
        movl	$1, %eax
	retq

Which only has 2 instructions cmpq + jnb.

Could you try changing

        %cmp = icmp ult i8* %nextsp, %bound_p
	%cmp2 = call i1 @llvm.expect.i1(i1 %cmp, i1 0)
	br i1 %cmp2, label %b3, label %b2

to

        %cmp = icmp uge i8* %bound_p, %nextsp
	%cmp2 = call i1 @llvm.expect.i1(i1 %cmp, i1 0)
	br i1 %cmp2, label %b2, label %b3

And try again?

@wenyongh
Contributor

wenyongh commented May 10, 2023

But I'm wondering why we add the check in this way: will it prevent accessing unexpected stack space in func_body? Note that _func_body is also another complete function; will it prevent LLVM -O0 from generating machine code similar to what you pasted at the beginning of this issue:

amd64 wamrc --opt-level=0 output of the function:

0000000000000270 <aot_func#3>:
     270: 48 81 ec 98 4e 01 00          subq    $85656, %rsp            # imm = 0x14E98
     277: 41 89 f0                      movl    %esi, %r8d
     27a: 48 89 bc 24 70 4e 01 00       movq    %rdi, 85616(%rsp)
     282: 48 89 f8                      movq    %rdi, %rax
     285: 48 83 c0 10                   addq    $16, %rax

@yamt
Collaborator Author

yamt commented May 10, 2023

Do you mean rename the original aot_func#n to aot_func#n_body, then add a new aot_func#n to call aot_func#n_body, and add the sp check in the new aot_func#n? This seems to add the check in the callee, and it introduces another jmp instruction:

yes.

        cmpq	_bound(%rip), %rax
	jb	LBB0_1
## %bb.2:                               ## %b2
	jmp	_func_body                      ## TAILCALL
LBB0_1:
        movl	$1, %eax
	retq

Note that the most likely case is cmpq + jb + jmp, 3 instructions. I think the ideal result is like:

        cmpq	_bound(%rip), %rax
	jnb	_func_body            ## TAILCALL
        movl	$1, %eax
	retq

Which only has 2 instructions cmpq + jnb.

the latest llvm with https://reviews.llvm.org/D140931 will produce such code.

Could you try changing

        %cmp = icmp ult i8* %nextsp, %bound_p
	%cmp2 = call i1 @llvm.expect.i1(i1 %cmp, i1 0)
	br i1 %cmp2, label %b3, label %b2

to

        %cmp = icmp uge i8* %bound_p, %nextsp
	%cmp2 = call i1 @llvm.expect.i1(i1 %cmp, i1 0)
	br i1 %cmp2, label %b2, label %b3

And try again?

i don't think the current version (apple clang 14.0.0) can produce a tail call with jcc.

@yamt
Collaborator Author

yamt commented May 10, 2023

But I'm wondering why we add the check in this way: will it prevent accessing unexpected stack space in func_body? Note that _func_body is also another complete function; will it prevent LLVM -O0 from generating machine code similar to what you pasted at the beginning of this issue:

because it's the simplest way i could think of to perform the check before actually moving the stack pointer.
the wrapper function can use the info from #2158
(it likely requires 2-pass compilation.)

amd64 wamrc --opt-level=0 output of the function:

0000000000000270 <aot_func#3>:
     270: 48 81 ec 98 4e 01 00          subq    $85656, %rsp            # imm = 0x14E98
     277: 41 89 f0                      movl    %esi, %r8d
     27a: 48 89 bc 24 70 4e 01 00       movq    %rdi, 85616(%rsp)
     282: 48 89 f8                      movq    %rdi, %rax
     285: 48 83 c0 10                   addq    $16, %rax

@wenyongh
Contributor

But I'm wondering why we add the check in this way: will it prevent accessing unexpected stack space in func_body? Note that _func_body is also another complete function; will it prevent LLVM -O0 from generating machine code similar to what you pasted at the beginning of this issue:

because it's the simplest way i could think of to perform the check before actually moving the stack pointer. the wrapper function can use the info from #2158 (it likely requires 2-pass compilation.)

2-pass compilation requires double the time, and developers already complain a lot about the compilation time of the aot compiler, so we had better not add another pass. Can we just modify (hack) the immediate value in the lea-like instruction after the object file is generated and the stack usage info is obtained?

leaq	-400008(%rsp), %rax  => For example, put UINT32_MAX as the initial value and change it to 400008 later

And I have another question:
Does it still need the check in the caller before calling other functions (the current implementation)? Does the generated stack usage include the stack size needed to pass function arguments? Or shall we just remove the check, since we have already reserved some space before the stack boundary (stack min) and can always assume that space is enough to pass function arguments?

@wenyongh
Contributor

Which only has 2 instructions cmpq + jnb.

the latest llvm with https://reviews.llvm.org/D140931 will produce such code.

i don't think the current version (apple clang 14.0.0) can produce a tail call with jcc.

So the latest llvm supports it? We have upgraded llvm to 15.0 for LLVM JIT/AOT; does llvm-15.0 support that? Not sure why the version for apple is clang-14.0.0.

@yamt
Collaborator Author

yamt commented May 11, 2023

But I'm wondering why we add the check in this way: will it prevent accessing unexpected stack space in func_body? Note that _func_body is also another complete function; will it prevent LLVM -O0 from generating machine code similar to what you pasted at the beginning of this issue:

because it's the simplest way i could think of to perform the check before actually moving the stack pointer. the wrapper function can use the info from #2158 (it likely requires 2-pass compilation.)

2-pass compilation requires double the time, and developers already complain a lot about the compilation time of the aot compiler, so we had better not add another pass. Can we just modify (hack) the immediate value in the lea-like instruction after the object file is generated and the stack usage info is obtained?

i'm afraid that it's difficult to modify the immediate.
a simpler but slower alternative would be to make the check code load the value from a table.

leaq	-400008(%rsp), %rax  => For example, put UINT32_MAX as the initial value and change it to 400008 later

And I have another question: Does it still need the check in the caller before calling other functions (the current implementation)? Does the generated stack usage include the stack size needed to pass function arguments? Or shall we just remove the check, since we have already reserved some space before the stack boundary (stack min) and can always assume that space is enough to pass function arguments?

as we currently have a hard limit (64) for the number of parameters/results,
reserving fixed amount (2 * 64 * sizeof(v128) bytes) is good enough.

@yamt
Collaborator Author

yamt commented May 11, 2023

Which only has 2 instructions cmpq + jnb.

the latest llvm with https://reviews.llvm.org/D140931 will produce such code.
i don't think the current version (apple clang 14.0.0) can produce a tail call with jcc.

So the latest llvm supports it? We have upgraded llvm to 15.0 for LLVM JIT/AOT, does llvm-15.0 support that?

i don't think llvm 16 has it.
i guess the change will be included in llvm 17.

llvm 35d218e92740fb49ad5e2be4c700aa38c1133809 (the head of main branch as of yesterday)
produced the following code for the same IR.

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_func_wrapper                   ## -- Begin function func_wrapper
	.p2align	4, 0x90
_func_wrapper:                          ## @func_wrapper
	.cfi_startproc
## %bb.0:                               ## %b1
	leaq	-400008(%rsp), %rax
	cmpq	_bound(%rip), %rax
	jae	_func_body                      ## TAILCALL
## %bb.1:                               ## %common.ret
	movl	$1, %eax
	retq
	.cfi_endproc
                                        ## -- End function
	.globl	_func_body                      ## -- Begin function func_body
	.p2align	4, 0x90
_func_body:                             ## @func_body
	.cfi_startproc
## %bb.0:
	subq	$400008, %rsp                   ## imm = 0x61A88
	.cfi_def_cfa_offset 400016
	leaq	8(%rsp), %rdi
	callq	_dummy
	addq	$400008, %rsp                   ## imm = 0x61A88
	retq
	.cfi_endproc
                                        ## -- End function
	.section	__DATA,__data
	.globl	_bound                          ## @bound
	.p2align	3, 0x0
_bound:
	.quad	2147483648                      ## 0x80000000

.subsections_via_symbols

Not sure why the version for apple is clang-14.0.0.

by "the current version", i meant the version i happened to use for the experiment above. (xcode clang)
not the version used by wamr's official builds.

@wenyongh
Contributor

2-pass compilation requires double the time, and developers already complain a lot about the compilation time of the aot compiler, so we had better not add another pass. Can we just modify (hack) the immediate value in the lea-like instruction after the object file is generated and the stack usage info is obtained?

i'm afraid that it's difficult to modify the immediate. a simpler but slower alternative would be to make the check code load the value from a table.

Yes, we may define a global uint32 array with length aot_func_count, load the stack usage of aot_func i from array[i], and when emitting the aot file, get the stack usage and write the data to the array, which should have been converted into a special data section in the object file.
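Sketched in Python for illustration (the names and layout here are hypothetical, not WAMR's actual data structures): the compiler emits a placeholder entry per aot function, the aot-file writer patches in the measured usage, and the prologue check becomes a comparison against a table lookup.

```python
UINT32_MAX = 0xFFFFFFFF  # placeholder meaning "not yet patched"

class StackUsageTable:
    """Hypothetical model of the per-function stack usage array."""

    def __init__(self, aot_func_count):
        # emitted by the compiler with placeholder values
        self.sizes = [UINT32_MAX] * aot_func_count

    def patch(self, func_idx, measured_bytes):
        # filled in when the object file is rewritten into the aot file
        self.sizes[func_idx] = measured_bytes

def check_native_stack(table, func_idx, sp, stack_boundary):
    """Prologue check: does this function's frame stay above the boundary?"""
    need = table.sizes[func_idx]
    assert need != UINT32_MAX, "stack usage was never patched"
    return sp - need >= stack_boundary

table = StackUsageTable(4)
table.patch(3, 85656)  # e.g. the measured frame size of aot_func#3 above
print(check_native_stack(table, 3, sp=0x200000, stack_boundary=0x100000))  # True
```

The indirection through the array is what makes this slower than an immediate, as noted above, but it avoids rewriting machine code after compilation.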

And I have another question: Does it still need the check in the caller before calling other functions (the current implementation)? Does the generated stack usage include the stack size needed to pass function arguments? Or shall we just remove the check, since we have already reserved some space before the stack boundary (stack min) and can always assume that space is enough to pass function arguments?

as we currently have a hard limit (64) for the number of parameters/results, reserving fixed amount (2 * 64 * sizeof(v128) bytes) is good enough.

2 * 64 * sizeof(v128) = 128 * 16 = 2048 bytes may be a little large; by default the guard stack size for non-uvwasi mode is 1024:
https://github.com/bytecodealliance/wasm-micro-runtime/blob/main/core/config.h#L405
And a developer may set it smaller, so it might not be able to meet the requirements of some developers. How about limiting the minimum guard stack size to 256 bytes, calculating the max size needed to pass the parameters/results according to their number and type, and then: if max_size < 256, ignore the check, else add the check? I think in most cases we don't need to add this check, so it won't impact performance and also won't increase the footprint.
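That heuristic can be sketched like this (a hypothetical helper; the real criterion would follow the target calling convention rather than this simple slot model):

```python
MIN_GUARD_SIZE = 256  # proposed lower bound on the guard region, in bytes

def needs_param_stack_check(param_sizes, result_sizes, slot_size=8):
    """Return True if an explicit pre-call stack check is required,
    i.e. the worst-case bytes for passing parameters and results may
    not fit in the guaranteed guard region."""
    def slots(sizes):
        # each value occupies at least one aligned stack slot
        return sum(-(-s // slot_size) * slot_size for s in sizes)

    max_size = slots(param_sizes) + slots(result_sizes)
    return max_size >= MIN_GUARD_SIZE

# a typical small signature needs no check...
print(needs_param_stack_check([4, 4, 4, 4], [4]))  # False
# ...while many v128 parameters exceed 256 bytes
print(needs_param_stack_check([16] * 20, []))      # True
```

Under this model most signatures stay well below 256 bytes, matching the claim that the check would rarely be emitted.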

i don't think llvm 16 has it. i guess the change will be included in llvm 17.

llvm 35d218e92740fb49ad5e2be4c700aa38c1133809 (the head of main branch as of yesterday) produced the following code for the same IR.

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_func_wrapper                   ## -- Begin function func_wrapper
	.p2align	4, 0x90
_func_wrapper:                          ## @func_wrapper
	.cfi_startproc
## %bb.0:                               ## %b1
	leaq	-400008(%rsp), %rax
	cmpq	_bound(%rip), %rax
	jae	_func_body                      ## TAILCALL
## %bb.1:                               ## %common.ret
	movl	$1, %eax
	retq
	.cfi_endproc
                                        ## -- End function
	.globl	_func_body                      ## -- Begin function func_body
	.p2align	4, 0x90
_func_body:                             ## @func_body
	.cfi_startproc
## %bb.0:
	subq	$400008, %rsp                   ## imm = 0x61A88
	.cfi_def_cfa_offset 400016
	leaq	8(%rsp), %rdi
	callq	_dummy
	addq	$400008, %rsp                   ## imm = 0x61A88
	retq
	.cfi_endproc
                                        ## -- End function
	.section	__DATA,__data
	.globl	_bound                          ## @bound
	.p2align	3, 0x0
_bound:
	.quad	2147483648                      ## 0x80000000

.subsections_via_symbols

Not sure why the version for apple is clang-14.0.0.

by "the current version", i meant the version i happened to use for the experiment above. (xcode clang) not the version used by wamr's official builds.

Got it, thanks, we may upgrade the LLVM version for WAMR in the future. Currently it slightly impacts performance but should be acceptable.

@yamt
Collaborator Author

yamt commented May 15, 2023

i'm afraid that we need to use different mechanisms to obtain stack sizes for jit and aot.

for jit, the only way i can think of right now is to use an extra codegen pass.
something like this: #2216

for aot, given that we support external compilers (WAMRC_LLC_COMPILER/WAMRC_ASM_COMPILER)
and non-ELF (COFF) output,
i guess we need to parse the -fstack-usage output file.
(note: the .stack_sizes section is ELF-only)

what do you think?
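For reference, `-fstack-usage` emits a line-oriented `.su` file: a colon-separated source location ending in the function name, a byte count, and a `static`/`dynamic`/`bounded` qualifier, separated by tabs. A minimal parser sketch (illustrative only, not the WAMR implementation):

```python
def parse_stack_usage(text):
    """Parse clang/gcc -fstack-usage (.su) output into
    {function_name: (bytes, qualifier)}."""
    usage = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # e.g. "a.c:12:5:aot_func#3\t85656\tstatic"
        location, size, qualifier = line.rsplit("\t", 2)
        # the function name is the last colon-separated field
        name = location.rsplit(":", 1)[-1]
        usage[name] = (int(size), qualifier)
    return usage

su_text = "a.c:12:5:aot_func#3\t85656\tstatic\n"
print(parse_stack_usage(su_text))  # {'aot_func#3': (85656, 'static')}
```

Splitting from the right keeps the parser robust against colons inside paths, and the qualifier field may itself contain a comma-separated combination such as `dynamic,bounded`.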

@wenyongh
Copy link
Contributor

It sounds good to me that JIT obtains the stack usage in that way.

For AOT, can we get the stack usage file after emitting the object file:
https://github.com/bytecodealliance/wasm-micro-runtime/blob/main/core/iwasm/compilation/aot_emit_aot_file.c#L2707-L2709
And then parse the data and write the stack usage into the special data section when converting the object file into the aot file?

I found that the stack usage file can be generated together with the aot file with PR #2158, and wonder whether the above is feasible.

@yamt
Collaborator Author

yamt commented May 15, 2023

For AOT, can we get the stack usage file after emitting the object file: https://github.com/bytecodealliance/wasm-micro-runtime/blob/main/core/iwasm/compilation/aot_emit_aot_file.c#L2707-L2709 And then parse the data and write the stack usage into the special data section when converting the object file into the aot file?

I found that the stack usage file can be generated together with the aot file with PR #2158, and wonder whether the above is feasible.

yes.
it's what i meant by "i guess we need to parse the -fstack-usage output file."

@wenyongh
Contributor

Yes, got it, sounds good, thanks.

@yamt
Collaborator Author

yamt commented Jun 1, 2023

as we currently have a hard limit (64) for the number of parameters/results, reserving fixed amount (2 * 64 * sizeof(v128) bytes) is good enough.

2 * 64 * sizeof(v128) = 128 * 16 = 2048 bytes may be a little large; by default the guard stack size for non-uvwasi mode is 1024: https://github.com/bytecodealliance/wasm-micro-runtime/blob/main/core/config.h#L405 And a developer may set it smaller, so it might not be able to meet the requirements of some developers.

i made a miscalculation.
64 was the number of cells, not values.
so, actually it was 2 * 64 * sizeof(i32) = 512 bytes.

@yamt
Collaborator Author

yamt commented Jun 8, 2023

as we currently have a hard limit (64) for the number of parameters/results, reserving fixed amount (2 * 64 * sizeof(v128) bytes) is good enough.

2 * 64 * sizeof(v128) = 128 * 16 = 2048 bytes may be a little large; by default the guard stack size for non-uvwasi mode is 1024: https://github.com/bytecodealliance/wasm-micro-runtime/blob/main/core/config.h#L405 And a developer may set it smaller, so it might not be able to meet the requirements of some developers.

i made a miscalculation. 64 was the number of cells, not values. so, actually it was 2 * 64 * sizeof(i32) = 512 bytes.

well, actually, the pointer size matters for results.
also, i32 parameters pushed onto the stack use 64 bits each on x86-64.

i have implemented a hopefully better estimation in #2244
(aot_estimate_stack_usage_for_function_call)

@yamt
Collaborator Author

yamt commented Jun 13, 2023

but it needs a benchmark/evaluation anyway.

#2244 (comment)

wenyongh pushed a commit that referenced this issue Jun 21, 2023
Move the native stack overflow check from the caller to the callee because the
former doesn't work for call_indirect and imported functions.

Make the stack usage estimation more accurate. Instead of making a guess from
the number of wasm locals in the function, use LLVM's idea of the stack size
of each MachineFunction. The former is inaccurate because a) it doesn't reflect
optimization passes, and b) wasm locals are not the only reason to use the stack.

To use the post-compilation stack usage information without requiring 2-pass
compilation or machine-code imm rewriting, introduce a global array to store
the stack consumption of each function:
For JIT, use a custom IRCompiler with an extra pass to fill the array.
For AOT, use `clang -fstack-usage` equivalent because we support external llc.

Re-implement function call stack usage estimation to reflect the real calling
conventions better. (aot_estimate_stack_usage_for_function_call)

Re-implement stack estimation logic (--enable-memory-profiling) based on the new
machinery.

Discussions: #2105.
@yamt
Collaborator Author

yamt commented Jul 6, 2023

i tested a few examples in this PR with the latest wamrc. it produced more reasonable estimation than before. let's close.

@yamt yamt closed this as completed Jul 6, 2023
victoryang00 pushed a commit to victoryang00/wamr-aot-gc-checkpoint-restore that referenced this issue May 27, 2024