"native stack overflow" detection is sometimes inappropriate #2105
@yamt Seems there are two issues: one is that the estimated size of native stack used by the callee may be too small for
Right.
No good idea right now, unless LLVM itself has a nice feature to report the stack usage (I don't know if it does). Even if we have a precise estimation, it might be tricky to tell it to the generated check code. (Can we change the constant after optimization passes?)
Another example of underestimation. wasm binary: many_stack5.wasm.zip
callee:
caller:
I'm thinking of making it 2-pass.
An obvious downside is that it would double the compilation time.
A related topic: right now, wamrc injects the stack checking logic in the caller. However, I think it's simpler to perform the stack check in the callee, because it can be done with a wrapper function. The stack consumption of the wrapper functions is not a big problem. What do you think?
I did a bit of research on a few seemingly related LLVM features.
Hi, yes, I agree with adding the check in the callee, but we had better not implement it with a wrapper function, since the extra function call may impact the performance a lot. Instead, supposing we can get the current stack pointer, we may add checks in both the callee and the caller:

(1) in the beginning of the callee, after the stack pointer is subtracted, check it against the boundary:

```
subq $16, %rsp
mov  %rsp, %reg
cmp  %reg, stack_min
jl   error
...
error:
ret
```

(2) in the caller, when calling functions, we can just check whether there is enough stack space to pass the function arguments, or to ensure that the call instruction is safely executed, since then in the beginning of the callee, the stack pointer will be checked again after it is subtracted. The necessary stack space to pass the function arguments depends on the calling conventions of the target.

Adding two checks may degrade the performance; maybe we can merge them together: in the loader, record the maximum size needed to call the functions (we know the func type to call), and then, in the beginning of each AOT function, check whether there is enough space. For how to get the stack pointer, I am not very sure; there may be some methods, one is to use the LLVM intrinsic
I suppose it isn't that expensive. On x86, it can be a
I feel it's problematic (or at least tricky) to check after the subtraction.
At this point, I'm not much concerned about function arguments.
Why in the loader?
I suppose
It should be
IIUC, the
There may be many function calls inside a function, and the callees' function types differ; if we want to calculate the required stack size to pass the callees' arguments, we should calculate them all first and take the maximum. For example, on Linux x86-64, function A may call B(i32, i32, i32, i32, i32, i32) and C(i32, i32, i32, i32, i32, i32, i32); since the first 6 integer arguments are passed in registers (RDI, RSI, RDX, RCX, R8, R9), no stack space is required to call B and 8 bytes are needed to call C. And on Linux x86-32, all these arguments are passed on the stack: B requires 24 bytes and C requires 28 bytes. It gets more complex when considering float registers and other ABIs.
Agree.
the
I meant it's safer to check before actually subtracting the stack pointer. If you subtract the stack pointer before the check, it might point to a non-stack area.
I'm not familiar enough with LLVM to say if
My understanding is that the sub instruction is generated before the check instructions added by us; we cannot add checks before the sub instruction, since that behavior is controlled by the LLVM compiler. And at that point the machine code has just subtracted the sp; it hasn't actually accessed the local variables yet. It may have pushed several necessary callee-saved registers, but we can check the size at the caller to make these push operations safe.
Which arguments are passed in int/float registers and which are stored into stack space depends on the target calling convention; we have handled such operations in wasm_runtime_invoke_native and fast-jit. For x86, we can refer to https://en.wikipedia.org/wiki/X86_calling_conventions. Anyway, it is only necessary to calculate
You are right, it's difficult (or probably impossible) to control these frame-generating instructions.
If you use sigaltstack, it's probably safe.
I'm thinking of generating something like this IR:
the corresponding amd64:
Do you mean renaming the original function?

```
cmpq _bound(%rip), %rax
jb   LBB0_1
## %bb.2:                ## %b2
jmp  _func_body          ## TAILCALL
LBB0_1:
movl $1, %eax
retq
```

Note that the most compact form would be:

```
cmpq _bound(%rip), %rax
jnb  _func_body          ## TAILCALL
movl $1, %eax
retq
```

which only has 2 instructions for the check. Could you try changing

```
%cmp = icmp ult i8* %nextsp, %bound_p
%cmp2 = call i1 @llvm.expect.i1(i1 %cmp, i1 0)
br i1 %cmp2, label %b3, label %b2
```

to

```
%cmp = icmp uge i8* %bound_p, %nextsp
%cmp2 = call i1 @llvm.expect.i1(i1 %cmp, i1 0)
br i1 %cmp2, label %b2, label %b3
```

and try again?
But I am wondering why we add the check in this way; will it prevent accessing the unexpected stack space? In the amd64 wamrc --opt-level=0 output of the function:

```
0000000000000270 <aot_func#3>:
270: 48 81 ec 98 4e 01 00       subq  $85656, %rsp      # imm = 0x14E98
277: 41 89 f0                   movl  %esi, %r8d
27a: 48 89 bc 24 70 4e 01 00    movq  %rdi, 85616(%rsp)
282: 48 89 f8                   movq  %rdi, %rax
285: 48 83 c0 10                addq  $16, %rax
```
Yes.
The latest LLVM with https://reviews.llvm.org/D140931 will produce such code.
I don't think the current version (apple clang 14.0.0) can produce a tail call with jcc.
Because it's the simplest way I could think of to perform the check before actually moving the stack pointer.
2-pass compilation requires double the time, and developers already complain a lot about the compilation time of the AOT compiler, so we had better not add another pass. Can we just modify (hack) the immediate value in leaq -400008(%rsp), %rax? For example, put UINT32_MAX as the initial value and change it to 400008 later. And I have another question:
So the latest LLVM supports it? We have upgraded LLVM to 15.0 for LLVM JIT/AOT; does llvm-15.0 support that? Not sure why the version for Apple is clang-14.0.0.
I'm afraid that it's difficult to modify the immediate.
As we currently have a hard limit (64) for the number of parameters/results,
I don't think LLVM 16 has it. LLVM 35d218e92740fb49ad5e2be4c700aa38c1133809 (the head of the main branch as of yesterday) does.
By "the current version", I meant the version I happened to use for the experiment above (xcode clang).
Yes, we may define a global uint32 array with length aot_func_count, load the stack usage of aot_func i from array[i], and, when emitting the aot file, get the stack usage and write the data to the array, which should have been converted into a special data section in the object file.
2 * 64 * sizeof(v128) = 128 * 16 = 2048 bytes; it may be a little large, since by default the guard stack size for non-uvwasi mode is 1024:
Got it, thanks; we may upgrade the LLVM version for WAMR in the future. Currently it slightly impacts performance, but that should be acceptable.
I'm afraid that we need to use different mechanisms to obtain stack sizes for JIT and AOT. For JIT, the only way I can think of right now is to use an extra codegen pass. For AOT, given that we support external compilers, something like -fstack-usage seems more suitable. What do you think?
It is good to me that JIT obtains stack usage in that way. For AOT, can we get the stack usage file after emitting the object file? I found that the stack usage file can be generated together with the aot file with PR #2158, and doubt that the above is feasible.
Yes.
Yes, got it, sounds good, thanks.
I made a miscalculation.
Well, actually, the pointer size matters for results. I have implemented a hopefully better estimation in #2244.
Move the native stack overflow check from the caller to the callee, because the former doesn't work for call_indirect and imported functions.

Make the stack usage estimation more accurate: instead of guessing from the number of wasm locals in the function, use LLVM's idea of the stack size of each MachineFunction. The former is inaccurate because (a) it doesn't reflect optimization passes, and (b) wasm locals are not the only reason to use the stack.

To use the post-compilation stack usage information without requiring 2-pass compilation or machine-code imm rewriting, introduce a global array to store the stack consumption of each function:
- For JIT, use a custom IRCompiler with an extra pass to fill the array.
- For AOT, use a `clang -fstack-usage` equivalent, because we support external llc.

Re-implement function call stack usage estimation to reflect the real calling conventions better (aot_estimate_stack_usage_for_function_call).

Re-implement the stack estimation logic (--enable-memory-profiling) based on the new machinery.

Discussions: #2105.
I tested a few examples in this PR with the latest wamrc. It produced more reasonable estimation than before. Let's close.
check_stack_boundary uses the number of parameters and locals to calculate the required stack:
wasm-micro-runtime/core/iwasm/compilation/aot_emit_function.c
Lines 945 to 950 in 6af8785
Unfortunately, the estimation is not always appropriate. For example, the code attached to this issue is compiled to wasm bytecode like:

(Don't ask me why -O0 is used here; how to build a wasm module is not important to the rest of this issue.) Note that the function has a ton of locals. check_stack_boundary uses the number of locals to estimate the stack usage. There are at least a few issues: it assumes allocas are the major source of stack consumption; actually, it depends. If there are not enough machine registers, llvm places temporary variables on the stack. You can see it in the --opt-level=0 and --opt-level=3 outputs below.

amd64 wamrc --opt-level=0 output of the function:

amd64 wamrc --opt-level=3 output of the function:

a.zip