Wrong GSCookie range recorded in GCInfo #13041

Closed
k15tfu opened this issue Jul 8, 2019 · 16 comments · Fixed by #40637
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

Comments

@k15tfu
Contributor

k15tfu commented Jul 8, 2019

Hi! We use DoStackSnapshot() in a signal handler, but sometimes CheckGSCookies() fails like:

Assert failure(PID 10030 [0x0000272e], Thread: 10063 [0x274f]): !"About to FailFast. set ComPlus_AssertOnFailFast=0 if this is expected"
    File: ../../../src/vm/jithelpers.cpp Line: 5157
    Image: /home/user/p/dotnet-products/2/Profiler/test/data/RealApplications/Mandelbrot/bin/Debug/netcoreapp2.1/linux-x64/Mandelbrot

I found that it happens only when checking GSCookie in a particular slot = fffffdd8 (in hex),
and only when relOffset (in EECodeManager::GetGSCookieAddr()) is close to the end of its valid region (i.e. gcInfoDecoder.GetGSCookieValidRangeEnd()).
In my case, it failed many times with different relOffset: 0x451, 0x44f, 0x451, 0x451, 0x454, 0x451, 0x453, 0x454, whereas the valid range was 0x59 -- 0x455.

Here is the disassembly of the top stack frame before the signal handler:

EECM::GetGSCookieAddr relOfs af validRange 59-455
EECM::GetGSCookieAddr ret *00007F855D1EA498 = 9abcdef012345678 GSCookieSlot = fffffdd8 SP = 00007F855D1EA6C0
(gdb) x/251i $pc-0xaf
   0x7f8595aae8e0:      push   %rbp  <-- $pc - 0xaf
   0x7f8595aae8e1:      push   %r13
   0x7f8595aae8e3:      push   %r12
   0x7f8595aae8e5:      sub    $0x210,%rsp
   0x7f8595aae8ec:      lea    0x220(%rsp),%rbp
   0x7f8595aae8f4:      mov    %rcx,%r12
   0x7f8595aae8f7:      mov    %rdi,%r13
   0x7f8595aae8fa:      lea    -0x210(%rbp),%rdi
   0x7f8595aae901:      mov    $0x6e,%ecx
   0x7f8595aae906:      xor    %eax,%eax
   0x7f8595aae908:      rep stos %eax,%es:(%rdi)
   0x7f8595aae90a:      mov    %r12,%rcx
   0x7f8595aae90d:      mov    %r13,%rdi
   0x7f8595aae910:      mov    %rsp,-0x18(%rbp)
   0x7f8595aae914:      mov    -0xc5dc03(%rip),%rax        # 0x7f8594e50d18
   0x7f8595aae91b:      mov    %rax,-0x218(%rbp)  <-- write GSCookie 0x9abcdef012345678
   0x7f8595aae922:      mov    %edi,-0x1c(%rbp)
   0x7f8595aae925:      mov    %rsi,-0x30(%rbp)
   0x7f8595aae929:      mov    %rdx,-0x28(%rbp)
   0x7f8595aae92d:      mov    %rcx,-0x38(%rbp)
   0x7f8595aae931:      mov    %r8,-0x48(%rbp)
   0x7f8595aae935:      mov    %r9,-0x40(%rbp)
   0x7f8595aae939:      mov    -0xc5dc30(%rip),%rax        # 0x7f8594e50d10  <-- start of GSCookie valid range (+0x59)
   0x7f8595aae940:      cmpl   $0x0,(%rax)
   0x7f8595aae943:      je     0x7f8595aae94a
   0x7f8595aae945:      callq  0x7f85955f9010
   0x7f8595aae94a:      nop
   0x7f8595aae94b:      lea    -0x30(%rbp),%rdi
   0x7f8595aae94f:      callq  0x7f85959c7930
   0x7f8595aae954:      mov    %eax,-0x128(%rbp)
   0x7f8595aae95a:      cmpl   $0x0,-0x128(%rbp)
   0x7f8595aae961:      sete   %dil
   0x7f8595aae965:      movzbl %dil,%edi
   0x7f8595aae969:      mov    %edi,-0x68(%rbp)
   0x7f8595aae96c:      cmpl   $0x0,-0x68(%rbp)
   0x7f8595aae970:      je     0x7f8595aae9c4
   0x7f8595aae972:      nop
   0x7f8595aae973:      mov    -0x1c(%rbp),%edi
   0x7f8595aae976:      mov    %edi,-0x174(%rbp)
   0x7f8595aae97c:      movdqu -0x48(%rbp),%xmm0
   0x7f8595aae981:      movdqu %xmm0,-0x170(%rbp)
   0x7f8595aae989:      mov    -0x174(%rbp),%edi
=> 0x7f8595aae98f:      mov    -0x170(%rbp),%rdx  <-- $pc
   0x7f8595aae996:      mov    -0x168(%rbp),%rcx
   0x7f8595aae99d:      mov    0x10(%rbp),%r8
   0x7f8595aae9a1:      mov    $0xffffffff,%esi
   0x7f8595aae9a6:      callq  0x7f8595ab0c70
   0x7f8595aae9ab:      mov    %eax,-0x160(%rbp)
   0x7f8595aae9b1:      mov    -0x160(%rbp),%edi
   0x7f8595aae9b7:      movzbl %dil,%edi
   0x7f8595aae9bb:      mov    %edi,-0x6c(%rbp)
   0x7f8595aae9be:      nop
   0x7f8595aae9bf:      jmpq   0x7f8595aaed0f
   0x7f8595aae9c4:      movdqu -0x30(%rbp),%xmm0
   0x7f8595aae9c9:      movdqu %xmm0,-0x188(%rbp)
   0x7f8595aae9d1:      mov    -0x188(%rbp),%rdi
   0x7f8595aae9d8:      mov    -0x180(%rbp),%rsi
   0x7f8595aae9df:      lea    -0x58(%rbp),%rdx
   0x7f8595aae9e3:      callq  0x7f8595ab21b0
   0x7f8595aae9e8:      mov    %eax,-0x12c(%rbp)
   0x7f8595aae9ee:      mov    -0x12c(%rbp),%edi
   0x7f8595aae9f4:      movzwl %di,%edi
   0x7f8595aae9f7:      mov    %edi,-0x4c(%rbp)
   0x7f8595aae9fa:      mov    -0x38(%rbp),%rdi
   0x7f8595aae9fe:      callq  0x7f8595a6fdc0
   0x7f8595aaea03:      mov    %rax,-0x138(%rbp)
   0x7f8595aaea0a:      mov    -0x138(%rbp),%rdi
   0x7f8595aaea11:      mov    %rdi,-0x60(%rbp)
   0x7f8595aaea15:      mov    -0x4c(%rbp),%edi
   0x7f8595aaea18:      and    $0xffdf,%edi
   0x7f8595aaea1e:      mov    %edi,-0x64(%rbp)
   0x7f8595aaea21:      cmpl   $0x47,-0x64(%rbp)
   0x7f8595aaea25:      jne    0x7f8595aaea2d
   0x7f8595aaea27:      cmpl   $0x0,-0x58(%rbp)
   0x7f8595aaea2b:      jle    0x7f8595aaea41
   0x7f8595aaea2d:      cmpl   $0x44,-0x64(%rbp)
   0x7f8595aaea31:      sete   %dil
   0x7f8595aaea35:      movzbl %dil,%edi
   0x7f8595aaea39:      mov    %edi,-0x13c(%rbp)
   0x7f8595aaea3f:      jmp    0x7f8595aaea4b
   0x7f8595aaea41:      movl   $0x1,-0x13c(%rbp)
   0x7f8595aaea4b:      mov    -0x13c(%rbp),%edi
   0x7f8595aaea51:      movzbl %dil,%edi
   0x7f8595aaea55:      mov    %edi,-0x70(%rbp)
   0x7f8595aaea58:      cmpl   $0x0,-0x70(%rbp)
   0x7f8595aaea5c:      je     0x7f8595aaeaba
   0x7f8595aaea5e:      nop
   0x7f8595aaea5f:      mov    -0x1c(%rbp),%edi
   0x7f8595aaea62:      mov    %edi,-0x19c(%rbp)
   0x7f8595aaea68:      mov    -0x58(%rbp),%edi
   0x7f8595aaea6b:      mov    %edi,-0x1a0(%rbp)
   0x7f8595aaea71:      movdqu -0x48(%rbp),%xmm0
   0x7f8595aaea76:      movdqu %xmm0,-0x198(%rbp)
   0x7f8595aaea7e:      mov    -0x19c(%rbp),%edi
   0x7f8595aaea84:      mov    -0x1a0(%rbp),%esi
   0x7f8595aaea8a:      mov    -0x198(%rbp),%rdx
   0x7f8595aaea91:      mov    -0x190(%rbp),%rcx
   0x7f8595aaea98:      mov    0x10(%rbp),%r8
   0x7f8595aaea9c:      callq  0x7f8595ab0c70
   0x7f8595aaeaa1:      mov    %eax,-0x15c(%rbp)
   0x7f8595aaeaa7:      mov    -0x15c(%rbp),%edi
   0x7f8595aaeaad:      movzbl %dil,%edi
   0x7f8595aaeab1:      mov    %edi,-0x6c(%rbp)
   0x7f8595aaeab4:      nop
   0x7f8595aaeab5:      jmpq   0x7f8595aaed0f
   0x7f8595aaeaba:      cmpl   $0x58,-0x64(%rbp)
   0x7f8595aaeabe:      sete   %dil
   0x7f8595aaeac2:      movzbl %dil,%edi
   0x7f8595aaeac6:      mov    %edi,-0x74(%rbp)
   0x7f8595aaeac9:      cmpl   $0x0,-0x74(%rbp)
   0x7f8595aaeacd:      je     0x7f8595aaeb40
   0x7f8595aaeacf:      nop
   0x7f8595aaead0:      mov    -0x1c(%rbp),%edi
   0x7f8595aaead3:      mov    %edi,-0x1b4(%rbp)
   0x7f8595aaead9:      mov    -0x4c(%rbp),%edi
   0x7f8595aaeadc:      add    $0xffffffdf,%edi
   0x7f8595aaeadf:      movzwl %di,%edi
   0x7f8595aaeae2:      mov    %edi,-0x1b8(%rbp)
   0x7f8595aaeae8:      mov    -0x58(%rbp),%edi
   0x7f8595aaeaeb:      mov    %edi,-0x1bc(%rbp)
   0x7f8595aaeaf1:      movdqu -0x48(%rbp),%xmm0
   0x7f8595aaeaf6:      movdqu %xmm0,-0x1b0(%rbp)
   0x7f8595aaeafe:      mov    -0x1b4(%rbp),%edi
   0x7f8595aaeb04:      mov    -0x1b8(%rbp),%esi
   0x7f8595aaeb0a:      mov    -0x1bc(%rbp),%edx
   0x7f8595aaeb10:      mov    -0x1b0(%rbp),%rcx
   0x7f8595aaeb17:      mov    -0x1a8(%rbp),%r8
   0x7f8595aaeb1e:      mov    0x10(%rbp),%r9
   0x7f8595aaeb22:      callq  0x7f8595ab0620
   0x7f8595aaeb27:      mov    %eax,-0x158(%rbp)
   0x7f8595aaeb2d:      mov    -0x158(%rbp),%esi
   0x7f8595aaeb33:      movzbl %sil,%esi
   0x7f8595aaeb37:      mov    %esi,-0x6c(%rbp)
   0x7f8595aaeb3a:      nop
   0x7f8595aaeb3b:      jmpq   0x7f8595aaed0f
   0x7f8595aaeb40:      nop
   0x7f8595aaeb41:      xor    %esi,%esi
   0x7f8595aaeb43:      lea    -0xf8(%rbp),%rdi
   0x7f8595aaeb4a:      xorps  %xmm0,%xmm0
   0x7f8595aaeb4d:      movdqu %xmm0,(%rdi)
   0x7f8595aaeb51:      movdqu %xmm0,0x10(%rdi)
   0x7f8595aaeb56:      movdqu %xmm0,0x20(%rdi)
   0x7f8595aaeb5b:      movdqu %xmm0,0x30(%rdi)
   0x7f8595aaeb60:      movdqu %xmm0,0x40(%rdi)
   0x7f8595aaeb65:      movdqu %xmm0,0x50(%rdi)
   0x7f8595aaeb6a:      movdqu %xmm0,0x60(%rdi)
   0x7f8595aaeb6f:      mov    %rsi,0x70(%rdi)
   0x7f8595aaeb73:      mov    %si,0x78(%rdi)
   0x7f8595aaeb77:      lea    -0xf8(%rbp),%rsi
   0x7f8595aaeb7e:      mov    -0x1c(%rbp),%edi
   0x7f8595aaeb81:      callq  0x7f8595ab08b0
   0x7f8595aaeb86:      nop
   0x7f8595aaeb87:      nop
   0x7f8595aaeb88:      mov    $0x40,%edi
   0x7f8595aaeb8d:      mov    %edi,%edi
   0x7f8595aaeb8f:      test   %rdi,%rdi
   0x7f8595aaeb92:      je     0x7f8595aaebc5
   0x7f8595aaeb94:      mov    %rdi,%rdx
   0x7f8595aaeb97:      add    $0xf,%rdx
   0x7f8595aaeb9b:      and    $0xfffffffffffffff0,%rdx
   0x7f8595aaeb9f:      neg    %rdx
   0x7f8595aaeba2:      add    %rsp,%rdx
   0x7f8595aaeba5:      jb     0x7f8595aaeba9
   0x7f8595aaeba7:      xor    %edx,%edx
   0x7f8595aaeba9:      test   %esp,(%rsp)
   0x7f8595aaebac:      mov    %rsp,%rsi
   0x7f8595aaebaf:      sub    $0x1000,%rsi
   0x7f8595aaebb6:      mov    %rsi,%rsp
   0x7f8595aaebb9:      cmp    %rdx,%rsp
   0x7f8595aaebbc:      jae    0x7f8595aaeba9
   0x7f8595aaebbe:      mov    %rdx,%rsp
   0x7f8595aaebc1:      lea    (%rsp),%rdi
   0x7f8595aaebc5:      mov    %rsp,-0x18(%rbp)
   0x7f8595aaebc9:      mov    %rdi,-0x120(%rbp)
   0x7f8595aaebd0:      lea    -0x150(%rbp),%rdi
   0x7f8595aaebd7:      xorps  %xmm0,%xmm0
   0x7f8595aaebda:      movdqu %xmm0,(%rdi)
   0x7f8595aaebde:      lea    -0x150(%rbp),%rdi
   0x7f8595aaebe5:      mov    -0x120(%rbp),%rsi
   0x7f8595aaebec:      mov    $0x20,%edx
   0x7f8595aaebf1:      callq  0x7f85959cb5b0
   0x7f8595aaebf6:      lea    -0x118(%rbp),%rdi
   0x7f8595aaebfd:      mov    %rdi,-0x1d8(%rbp)
   0x7f8595aaec04:      movdqu -0x150(%rbp),%xmm0
   0x7f8595aaec0c:      movdqu %xmm0,-0x1d0(%rbp)
   0x7f8595aaec14:      mov    -0x1d8(%rbp),%rdi
   0x7f8595aaec1b:      mov    -0x1d0(%rbp),%rsi
   0x7f8595aaec22:      mov    -0x1c8(%rbp),%rdx
   0x7f8595aaec29:      callq  0x7f8595b897c0
   0x7f8595aaec2e:      nop
   0x7f8595aaec2f:      cmpl   $0x0,-0x4c(%rbp)
   0x7f8595aaec33:      setne  %dil
   0x7f8595aaec37:      movzbl %dil,%edi
   0x7f8595aaec3b:      mov    %edi,-0x124(%rbp)
   0x7f8595aaec41:      cmpl   $0x0,-0x124(%rbp)
   0x7f8595aaec48:      je     0x7f8595aaec70
   0x7f8595aaec4a:      nop
   0x7f8595aaec4b:      lea    -0x118(%rbp),%rdi
   0x7f8595aaec52:      lea    -0xf8(%rbp),%rsi
   0x7f8595aaec59:      mov    -0x4c(%rbp),%edx
   0x7f8595aaec5c:      mov    -0x58(%rbp),%ecx
   0x7f8595aaec5f:      mov    -0x60(%rbp),%r8
   0x7f8595aaec63:      xor    %r9d,%r9d
   0x7f8595aaec66:      callq  0x7f8595ab25b0
   0x7f8595aaec6b:      nop
   0x7f8595aaec6c:      nop
   0x7f8595aaec6d:      nop
   0x7f8595aaec6e:      jmp    0x7f8595aaecc1
   0x7f8595aaec70:      nop
   0x7f8595aaec71:      lea    -0x118(%rbp),%rdi
   0x7f8595aaec78:      mov    %rdi,-0x1f0(%rbp)
   0x7f8595aaec7f:      lea    -0xf8(%rbp),%rdi
   0x7f8595aaec86:      mov    %rdi,-0x1f8(%rbp)
   0x7f8595aaec8d:      movdqu -0x30(%rbp),%xmm0
   0x7f8595aaec92:      movdqu %xmm0,-0x1e8(%rbp)
   0x7f8595aaec9a:      mov    -0x1f0(%rbp),%rdi
   0x7f8595aaeca1:      mov    -0x1f8(%rbp),%rsi
   0x7f8595aaeca8:      mov    -0x1e8(%rbp),%rdx
   0x7f8595aaecaf:      mov    -0x1e0(%rbp),%rcx
   0x7f8595aaecb6:      mov    -0x60(%rbp),%r8
   0x7f8595aaecba:      callq  0x7f8595ab2c40
   0x7f8595aaecbf:      nop
   0x7f8595aaecc0:      nop
   0x7f8595aaecc1:      lea    -0x118(%rbp),%rdi
   0x7f8595aaecc8:      mov    %rdi,-0x210(%rbp)
   0x7f8595aaeccf:      movdqu -0x48(%rbp),%xmm0
   0x7f8595aaecd4:      movdqu %xmm0,-0x208(%rbp)
   0x7f8595aaecdc:      mov    -0x210(%rbp),%rdi
   0x7f8595aaece3:      mov    -0x208(%rbp),%rsi
   0x7f8595aaecea:      mov    -0x200(%rbp),%rdx
   0x7f8595aaecf1:      mov    0x10(%rbp),%rcx
   0x7f8595aaecf5:      callq  0x7f8595b89a10
   0x7f8595aaecfa:      mov    %eax,-0x154(%rbp)
   0x7f8595aaed00:      mov    -0x154(%rbp),%eax
   0x7f8595aaed06:      movzbl %al,%eax
   0x7f8595aaed09:      mov    %eax,-0x6c(%rbp)
   0x7f8595aaed0c:      nop
   0x7f8595aaed0d:      jmp    0x7f8595aaed0f
   0x7f8595aaed0f:      mov    -0x6c(%rbp),%eax
   0x7f8595aaed12:      lea    -0xc5e001(%rip),%rdi        # 0x7f8594e50d18
   0x7f8595aaed19:      mov    (%rdi),%rdi
   0x7f8595aaed1c:      cmp    %rdi,-0x218(%rbp)
   0x7f8595aaed23:      je     0x7f8595aaed2a
   0x7f8595aaed25:      callq  0x7f85955f9060
   0x7f8595aaed2a:      nop
   0x7f8595aaed2b:      lea    -0x10(%rbp),%rsp
   0x7f8595aaed2f:      pop    %r12  <-- CheckGSCookies() aborts if signal comes here (+0x44f)
   0x7f8595aaed31:      pop    %r13  <-- or here (+0x451)
   0x7f8595aaed33:      pop    %rbp  <-- or here (+0x453)
   0x7f8595aaed34:      retq  <-- or here (+0x454)
   0x7f8595aaed35:      int3  <-- end of GSCookie valid range (+0x455)
(gdb) x/g 0x7f8594e50d18
0x7f8594e50d18: 0x9abcdef012345678

If the signal arrives while the thread is on one of these tail instructions and the handler then uses more than 0x218 bytes of stack (ours does), the old GSCookie value is overwritten, and CheckGSCookies() aborts with a GSCookie mismatch when we call DoStackSnapshot(). So GetGSCookieValidRangeEnd() evidently returns an incorrect offset, because the range includes code that runs after the stack frame has been freed.
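
For readers following the arithmetic, here is a small C++ sketch that redoes it with the numbers from the trace and disassembly above. The helper names (SignExtend32, CookieAddress, BytesBelowSP) are made up for illustration and are not runtime APIs. It also assumes that GSCookieSlot is a signed 32-bit offset applied to the SP value printed in the trace, which is what the reported numbers suggest.

```cpp
#include <cassert>
#include <cstdint>

namespace cookie_math_sketch {

// Values copied from the EECM::GetGSCookieAddr trace in the report.
constexpr std::uint64_t kReportedSP         = 0x00007F855D1EA6C0ULL;
constexpr std::uint32_t kGSCookieSlotRaw    = 0xfffffdd8u;
constexpr std::uint64_t kReportedCookieAddr = 0x00007F855D1EA498ULL;

// Hypothetical helper: interpret a raw 32-bit value as a signed offset.
constexpr std::int64_t SignExtend32(std::uint32_t raw) {
    return raw >= 0x80000000u ? static_cast<std::int64_t>(raw) - 0x100000000LL
                              : static_cast<std::int64_t>(raw);
}

// Assumption: the slot is an offset relative to the SP shown in the trace.
constexpr std::uint64_t CookieAddress(std::uint64_t sp, std::uint32_t slotRaw) {
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(sp) + SignExtend32(slotRaw));
}

// The cookie is stored at -0x218(%rbp). After `lea -0x10(%rbp),%rsp` the stack
// pointer is rbp - 0x10, so the slot is 0x208 bytes below rsp, and every
// completed pop moves rsp another 8 bytes away from it.
constexpr std::uint64_t BytesBelowSP(unsigned completedPops) {
    return (0x218 - 0x10) + 8ULL * completedPops;
}

}  // namespace cookie_math_sketch
```

Under these assumptions the slot sits 0x208 bytes below rsp at +0x44f, 0x210 at +0x451, 0x218 at +0x453 and 0x220 at +0x454, so anything that pushes a few hundred bytes onto the same stack at those points can overwrite it.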

As far as I can see, @hoyosjs ran into the same problem in dotnet/coreclr@32741b9, so in my case we would probably have to backport that patch to all supported versions and disable the GSCookie check for ProfToEEInterfaceImpl::DoStackSnapshot(). P.S. I'm ready to prepare a PR and test it locally.

category:correctness
theme:gc-info
skill-level:intermediate
cost:medium

@k15tfu k15tfu changed the title Wrong usage CheckGSCookies() when calling ProfToEEInterfaceImpl::DoStackSnapshot() Wrong usage of CheckGSCookies() when calling ProfToEEInterfaceImpl::DoStackSnapshot() Jul 8, 2019
@tommcdon
Member

tommcdon commented Jul 8, 2019

@davmason

@davmason
Member

Hi @k15tfu, can you tell me more about your scenario? I would like to understand what you are trying to accomplish to make sure I give the right advice.

Right now the only officially supported ways to call DoStackSnapshot on Linux are the following

  1. inside a profiler callback on the same thread you are executing on
  2. After suspending the runtime via ICorProfilerInfo10::SuspendRuntime

There are OS differences that prevent us from giving identical DoStackSnapshot behavior on Linux versus what is capable on Windows.

I'm not convinced that allowing the profiler to skip the GSCookie check is the right way to go with this, and even if I was convinced there is likely not enough time to make that change safely.

That being said, my goal here is to find an acceptable workaround to make your scenario work.

@k15tfu
Contributor Author

k15tfu commented Jul 12, 2019

@davmason, hi!

can you tell me more about your scenario?

I am calling DoStackSnapshot for current thread from a signal handler, without use of ICorProfilerInfo10::SuspendRuntime. It looks correct, right?

There are OS differences that prevent us from giving identical DoStackSnapshot behavior on Linux versus what is capable on Windows.

And what are the differences? Is there anything I should know? :)

I'm not convinced that allowing the profiler to skip the GSCookie check is the right way to go with this, and even if I was convinced there is likely not enough time to make that change safely.

I know it's not the best thing we can do, because the root of this problem is incorrect epilog info, but I think that is a more complicated fix, especially since your debugger team has already run into this issue and decided to add the SKIP_GSCOOKIE_CHECK flag.

So that patch is already used in 3.0, and the only thing we need to do for the profiler is to use the SKIP_GSCOOKIE_CHECK flag in DoStackSnapshot (which looks easy to me).

@janvorli
Member

@k15tfu are you aware that from a signal handler you can only call code that uses a very limited subset of runtime functions (the so-called async-signal-safe functions), which are listed in http://man7.org/linux/man-pages/man7/signal-safety.7.html? Otherwise you can get crashes, hangs and other unexpected issues. Even accessing thread-local storage variables, taking locks or allocating memory is prohibited in general.
To give you some idea of the reason for that, imagine a case where the interrupted thread was executing memory allocator code and was in the middle of updating some state of the allocator. If the signal handler code then did a memory allocation (which obviously happens on the same thread), the allocator would be in an inconsistent state and could, for example, return the same memory block that it has already returned to someone else, leak memory or do something even worse.
The DoStackSnapshot definitely does some of the prohibited things, for example taking locks.

@k15tfu
Contributor Author

k15tfu commented Jul 12, 2019

@janvorli Yes, I know, and the only things I use are libunwind (which is safe), and DoStackSnapshot.

Bottom line: DoStackSnapshot should not be called from signal handlers because it's not safe, and can cause ... any kind of errors. Is it correct?

If yes, the only ways a sampling profiler can be implemented are:

  • use SuspendRuntime followed by DoStackSnapshot
  • suspend a thread in a signal handler (using sleep for ex), and use DoStackSnapshot from another thread

Are there other options?

And I think that with option 2 we would avoid the signal handler limitations, but this bug would still occur. So the issue is still relevant.

@janvorli
Member

Is it correct?

Yes.

Are there other options?

I think that another option might be to have a separate thread / set of threads dedicated to calling DoStackSnapshot. The signal handler would synchronize with that thread, e.g. using a pipe, an eventfd or some other means that is async-signal-safe. It would essentially wait in the handler until the DoStackSnapshot call completes and then continue running.
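
A minimal C++ sketch of that handshake, assuming Linux/POSIX, two pipes and SIGUSR1 as the sampling signal. Everything in it (the namespace, RunDemo, TakeSampleOnWorker) is hypothetical illustration code. The worker only bumps a counter where a real profiler would do its stack walk, and the sketch deliberately does not call DoStackSnapshot, since whether that is supported for a thread parked in a handler is a separate question.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <csignal>
#include <thread>
#include <unistd.h>

namespace handshake_sketch {

int g_request[2];               // signal handler -> worker thread
int g_reply[2];                 // worker thread -> signal handler
std::atomic<int> g_samples{0};

// Hypothetical stand-in for the real stack walk done off the signal path.
void TakeSampleOnWorker() { g_samples.fetch_add(1); }

// The handler only calls write() and read(), which are async-signal-safe.
extern "C" void OnSampleSignal(int) {
    const int savedErrno = errno;
    char token = 's';
    if (write(g_request[1], &token, 1) == 1) {
        // Wait until the worker reports that the sample is complete.
        while (read(g_reply[0], &token, 1) == -1 && errno == EINTR) {}
    }
    errno = savedErrno;
}

void WorkerLoop() {
    char token;
    for (;;) {
        const ssize_t n = read(g_request[0], &token, 1);
        if (n == -1 && errno == EINTR) continue;
        if (n != 1 || token == 'q') break;          // 'q' asks the worker to quit
        TakeSampleOnWorker();
        if (write(g_reply[1], &token, 1) != 1) break;
    }
}

// Delivers `count` signals to the calling thread; returns the samples taken.
int RunDemo(int count) {
    g_samples = 0;
    if (pipe(g_request) != 0 || pipe(g_reply) != 0) return -1;

    // Start the worker with SIGUSR1 blocked so it is never handled there.
    sigset_t block, previous;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &block, &previous);
    std::thread worker(WorkerLoop);
    pthread_sigmask(SIG_SETMASK, &previous, nullptr);

    struct sigaction action {};
    action.sa_handler = OnSampleSignal;
    sigemptyset(&action.sa_mask);
    sigaction(SIGUSR1, &action, nullptr);

    // raise() returns only after the handler has run to completion.
    for (int i = 0; i < count; ++i) raise(SIGUSR1);

    const char quit = 'q';
    if (write(g_request[1], &quit, 1) != 1) { worker.detach(); return -1; }
    worker.join();
    std::signal(SIGUSR1, SIG_DFL);
    close(g_request[0]); close(g_request[1]);
    close(g_reply[0]);   close(g_reply[1]);
    return g_samples.load();
}

}  // namespace handshake_sketch
```

Calling handshake_sketch::RunDemo(3) delivers three signals to the calling thread, and each handler invocation blocks until the worker has finished its sample. The interrupted thread stays parked inside the handler for the whole sample, so its stack is stable while the other thread inspects it.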

@k15tfu
Contributor Author

k15tfu commented Jul 12, 2019

@janvorli Do you agree that in this case the GSCookie check would still fail, and that this bug is still relevant?

@janvorli
Member

Yes, it seems like a JIT bug. @k15tfu can you please share which function you hit this issue with?
@dotnet/jit-contrib, it seems the reported valid range for GSCookie in GCInfo incorrectly includes function epilog in some functions.

@k15tfu
Contributor Author

k15tfu commented Jul 17, 2019

@janvorli I don't know the exact function, but I can share the project where it happens: Mandelbrot.zip , and try it with netcoreapp2.1 / Debug:

$ dotnet build -c Debug -f netcoreapp2.1
$ dotnet publish -c Debug -f netcoreapp2.1 -r linux-x64 --self-contained true

CheckGSCookies() fails in threads running Writer.CreatePgmAscii and Writer.CreatePpmAscii (probably it's the same issue).

Here is the clrstack output, in case it is useful:

Process 12553 stopped
* thread #16, name = 'Mandelbrot', stop reason = signal SIGABRT
    frame #0: 0x00007f6f37770e97 libc.so.6`__GI_raise(sig=2) at raise.c:51
(lldb) f 32
frame #32: 0x00007f6ebd23ed31
->  0x7f6ebd23ed31: popq   %r13
    0x7f6ebd23ed33: popq   %rbp
    0x7f6ebd23ed34: retq
    0x7f6ebd23ed35: int3
(lldb) p/x $sp
(unsigned long) $0 = 0x00007f6e77ffd678
(lldb) clrstack
OS Thread Id: 0x31d5 (16)
        Child SP               IP Call Site
00007F6E77FFE768 00007f6f37770e97 [GCFrame: 00007f6e77ffe768]
00007F6E77FFEB38 00007f6f37770e97 [DebuggerU2MCatchHandlerFrame: 00007f6e77ffeb38]
(lldb) x/10i $pc-20
    0x7f6ebd23ed1d: 39 bd e8 fd ff ff  cmpl   %edi, -0x218(%rbp)
    0x7f6ebd23ed23: 74 05              je     0x7f6ebd23ed2a
    0x7f6ebd23ed25: e8 36 a3 b4 ff     callq  0x7f6ebcd89060
    0x7f6ebd23ed2a: 90                 nop
    0x7f6ebd23ed2b: 48 8d 65 f0        leaq   -0x10(%rbp), %rsp
    0x7f6ebd23ed2f: 41 5c              popq   %r12
->  0x7f6ebd23ed31: 41 5d              popq   %r13
    0x7f6ebd23ed33: 5d                 popq   %rbp
    0x7f6ebd23ed34: c3                 retq
    0x7f6ebd23ed35: cc                 int3

@k15tfu
Contributor Author

k15tfu commented Jul 24, 2019

Bottom line: DoStackSnapshot should not be called from signal handlers because it's not safe, and can cause ... any kind of errors. Is it correct?

Btw, @janvorli, one way to work with unsafe functions from a signal handler is to guarantee that the signal will not be delivered while such functions are running, which is why http://man7.org/linux/man-pages/man7/signal-safety.7.html suggests blocking signals before calling them. So, if I don't do anything printf-related in my program, I can safely use printf() in signal handler. What if I do all the DoStackSnapshot() work in the signal handler but, to avoid possible problems, check whether the signal arrived during any of the unsafe functions? I already skip all native frames while looking for the topmost managed frame, so I can also check the name/module of these frames and return early if I find something unsafe.
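
To make the "block signals around unsafe calls" idea concrete, here is a rough C++ sketch using pthread_sigmask. The ScopedSignalBlock guard and the two demo functions are hypothetical names invented for this example, and SIGUSR1 stands in for whatever signal a profiler would use. It only shows the masking mechanism and says nothing about whether DoStackSnapshot is safe to call under it.

```cpp
#include <cassert>
#include <csignal>

namespace masking_sketch {

volatile std::sig_atomic_t g_handlerRuns = 0;

extern "C" void CountSignal(int) { g_handlerRuns = g_handlerRuns + 1; }

// RAII guard: blocks `signo` on the calling thread for the guard's lifetime.
class ScopedSignalBlock {
public:
    explicit ScopedSignalBlock(int signo) {
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, signo);
        pthread_sigmask(SIG_BLOCK, &set, &old_);
    }
    ~ScopedSignalBlock() { pthread_sigmask(SIG_SETMASK, &old_, nullptr); }
    ScopedSignalBlock(const ScopedSignalBlock&) = delete;
    ScopedSignalBlock& operator=(const ScopedSignalBlock&) = delete;

private:
    sigset_t old_;
};

// Returns how often the handler ran while the protected region was executing.
int RunsInsideProtectedRegion() {
    struct sigaction action {};
    action.sa_handler = CountSignal;
    sigemptyset(&action.sa_mask);
    sigaction(SIGUSR1, &action, nullptr);

    g_handlerRuns = 0;
    int inside = 0;
    {
        ScopedSignalBlock guard(SIGUSR1);
        raise(SIGUSR1);            // stays pending while the signal is blocked
        inside = g_handlerRuns;    // the "unsafe" work would happen here
    }                              // unblocked: the pending signal is delivered
    return inside;
}

// Handler count observed after the protected region has ended.
int RunsAfterProtectedRegion() { return g_handlerRuns; }

}  // namespace masking_sketch
```

The handler does not run inside the guarded scope and runs once as soon as the guard is destroyed. This only helps for code the program itself wraps in such a guard.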

@janvorli
Member

So, if I don't do anything printf-related in my program, I can safely use printf() in signal handler.

How do you know that DoStackSnapshot doesn't end up calling unsafe functions itself? How do you know it would not allocate memory using malloc, for example? I think it is too dangerous to assume such things. Moreover, even if we don't do anything unsafe today, it is too easy to add some code somewhere in the stack walker that will break this assumption.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@tommcdon tommcdon added the tracking This issue is tracking the completion of other related issues. label May 6, 2020
@tommcdon tommcdon modified the milestones: 5.0, Future May 6, 2020
@tommcdon tommcdon modified the milestones: Future, 6.0.0 Jul 8, 2020
@jkotas
Member

jkotas commented Jul 9, 2020

@sergiy-k reported this problem as well offline. His repro was not using profiler interfaces, but used a different atypical environment.

This is a problem in the JIT. The JIT makes the following simplifying assumption about GS cookie reporting:

        // The code offset ranges assume that the GS Cookie slot is initialized in the prolog, and is valid
        // through the remainder of the method.  We will not query for the GS Cookie while we're in an epilog,
        // so the question of where in the epilog it becomes invalid is moot.

The simplifying assumption is not correct. The GS cookie can be queried in the epilog in certain rare cases.

@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-Diagnostics-coreclr labels Jul 9, 2020
@jkotas jkotas changed the title Wrong usage of CheckGSCookies() when calling ProfToEEInterfaceImpl::DoStackSnapshot() Wrong GSCookie range recorded in GCInfo Jul 9, 2020
@jkotas jkotas removed this from the 6.0.0 milestone Jul 9, 2020
@jkotas jkotas removed the tracking This issue is tracking the completion of other related issues. label Jul 9, 2020
@AndyAyersMS
Member

cc @dotnet/jit-contrib -- we should consider fixing this for 5.0.

@sandreenko
Contributor

sandreenko commented Jul 31, 2020

I am planning to fix this issue by excluding the epilog block from the live range, like:

        gcInfoEncoderWithLog->SetGSCookieStackSlot(offset, prologSize, methodSize - epilogSize);

however, we don't have correct epilog (or epilogs) size right now and I am working on a change to calculate that.

@jkotas I think it would be a sufficient fix. Our last use of the GS cookie comes before we jump to an epilog, so we can shorten its lifetime. Am I right, or do we need to know the precise live range of the GS cookie (including part of the prolog and part of the epilog)?
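
As a rough illustration of what narrowing the range would change for the offsets in the original report, here is a C++ sketch. The CodeRange type is hypothetical. The range is treated as half-open, the method size is taken to equal the reported range end 0x455, and the epilog is assumed to start at the `lea -0x10(%rbp),%rsp` instruction (+0x44b); none of these three assumptions is confirmed by the JIT sources. It also models a single epilog only.

```cpp
#include <cassert>
#include <cstdint>

namespace range_sketch {

// Hypothetical model of a GS cookie valid range [begin, end).
struct CodeRange {
    std::uint32_t begin;  // first code offset where the cookie is reported valid
    std::uint32_t end;    // one past the last such offset (assumed half-open)

    constexpr bool Contains(std::uint32_t relOffset) const {
        return relOffset >= begin && relOffset < end;
    }
};

constexpr std::uint32_t kPrologSize = 0x59;           // reported range start
constexpr std::uint32_t kMethodSize = 0x455;          // reported range end
constexpr std::uint32_t kEpilogSize = 0x455 - 0x44b;  // assumed epilog length

// Range as recorded today versus the range with the epilog excluded.
constexpr CodeRange kReported{kPrologSize, kMethodSize};
constexpr CodeRange kProposed{kPrologSize, kMethodSize - kEpilogSize};

}  // namespace range_sketch
```

With these numbers the failing offsets 0x44f, 0x451, 0x453 and 0x454 fall inside kReported but outside kProposed, while the 0xaf offset from the first trace is inside both.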

In the meantime, I need to find a way to repro the issue to test the fix. @sergiy-k, would it be hard to use your case for a local repro?

@jkotas
Member

jkotas commented Jul 31, 2020

I think it is ok to stop reporting the GS cookie after the check that validated it did not get corrupted. I assume that it is what you meant by the last use.

@sandreenko
Contributor

I assume that it is what you meant by the last use.

Yes, thanks.

@sandreenko sandreenko assigned jkotas and unassigned sandreenko Aug 10, 2020
jkotas added a commit to jkotas/runtime that referenced this issue Aug 10, 2020
The information about end of GS cookie scope recorded in GC info is not accurate and it cannot even be made accurate without redesign that is not worth it. Detect end of GS cookie scope by comparing it with current SP instead.

Fixes dotnet#13041
jkotas added a commit that referenced this issue Aug 11, 2020
The information about end of GS cookie scope recorded in GC info is not accurate and it cannot even be made accurate without redesign that is not worth it. Detect end of GS cookie scope by comparing it with current SP instead.

Fixes #13041
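
The idea in the commit message, comparing the cookie slot with the current SP, can be sketched in C++ roughly as follows. This is not the code from the actual change: IsCookieSlotLive is a hypothetical predicate, and the addresses are derived from the trace and disassembly in the original report (cookie stored at -0x218(%rbp), `lea 0x220(%rsp),%rbp` in the prolog, `lea -0x10(%rbp),%rsp` in the epilog).

```cpp
#include <cassert>
#include <cstdint>

namespace sp_check_sketch {

// Hypothetical predicate: the stack grows down, so a slot below the current
// SP belongs to memory that the frame has already released.
constexpr bool IsCookieSlotLive(std::uint64_t cookieAddr, std::uint64_t currentSP) {
    return cookieAddr >= currentSP;
}

// Addresses reconstructed from the report.
constexpr std::uint64_t kCookieAddr   = 0x00007F855D1EA498ULL;  // from the trace
constexpr std::uint64_t kFramePointer = kCookieAddr + 0x218;    // cookie at rbp-0x218
constexpr std::uint64_t kBodySP       = kFramePointer - 0x220;  // rbp = rsp+0x220 in the prolog
constexpr std::uint64_t kEpilogSP     = kFramePointer - 0x10;   // rsp after the epilog lea

}  // namespace sp_check_sketch
```

In the method body the slot is 8 bytes above rsp, so the predicate holds. After the epilog's lea it is 0x208 bytes below rsp, so the predicate fails no matter which epilog instruction is executing, and no code offset range is needed to decide it.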
@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020
9 participants