ASPNet service crashing on linux-arm64 when targeted to net9.0 #110442

Open
pavel-faltynek opened this issue Dec 5, 2024 · 18 comments

@pavel-faltynek

pavel-faltynek commented Dec 5, 2024

Description

The ASP.NET service crashes with SIGSEGV when compiled for linux-arm64 and targeting net9.0.
It does not crash on Windows at all, nor on either Windows or Linux when targeting net8.0.

Reproduction Steps

Unfortunately, I have no repro steps (other than just running the service and sending a few HTTP requests to it).
On Ubuntu 22.04.5 LTS (GNU/Linux 5.15.0-125-generic aarch64) it appears to crash at random.
Installed dotnet: 9.0.101.

Expected behavior

Don't crash even on linux-arm64, please 😁

Actual behavior

A crash report is available; after unpacking it with apport-unpack /var/crash/_usr_lib_dotnet_dotnet.1000.crash ~/crash, it yields a dump. As the dump possibly contains sensitive data, I can share it for inspection only via some "more secure" channel, if needed. I'm far from being an expert here, but I can open it in WinDbg and observe the following:

(12f144.12f14a): Signal SIGSEGV (Segmentation fault) code SEGV_MAPERR (Address not mapped to object) at 0xffbe9093ca2a
*** WARNING: Unable to verify timestamp for libcoreclr.so
libcoreclr!alloc_context::init_alloc_count [inlined in libcoreclr!SVR::GCHeap::FixAllocContext+0x14]:
0000ffff`850eca54 79406428 ldrh        w8,[x1,#0x32]

Under the "Stack" section, there is:

[0x0]   libcoreclr!alloc_context::init_alloc_count   (Inline Function)   (Inline Function)   
[0x1]   libcoreclr!SVR::GCHeap::FixAllocContext+0x14   0xffff44246540   0xffff84fd94f8   
[0x2]   libcoreclr!GCToEEInterface::GcEnumAllocContexts+0x54   0xffff44246570   0xffff850a6210   
[0x3]   libcoreclr!SVR::gc_heap::fix_allocation_contexts+0x28   (Inline Function)   (Inline Function)   
[0x4]   libcoreclr!SVR::gc_heap::garbage_collect+0x60   0xffff442465a0   0xffff850a3f44   
[0x5]   libcoreclr!SVR::gc_heap::gc_thread_function+0xca8   0xffff44246670   0xffff850a329c   
[0x6]   libcoreclr!SVR::gc_heap::gc_thread_stub+0x38   0xffff442466f0   0xffff84fdc420   
[0x7]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::operator()+0x3c   (Inline Function)   (Inline Function)   
[0x8]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::__invoke+0x50   0xffff44246710   0xffff852cafb0   
[0x9]   libcoreclr!CorUnix::CPalThread::ThreadEntry+0x1b8   0xffff44246740   0xffff8556d5c8   
[0xa]   libc_so+0x7d5c8   0xffff44246800   0x0   

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@dotnet-issue-labeler (bot) added the needs-area-label label ("An area label is needed to ensure this gets routed to the appropriate area owners") Dec 5, 2024
@vcsjones added the area-GC-coreclr label and removed the needs-area-label label Dec 5, 2024
@dotnet-policy-service (bot) added the untriaged label ("New issue has not been triaged by the area owner") Dec 5, 2024
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@mangod9 removed the untriaged label Dec 5, 2024
@mangod9 added this to the 10.0.0 milestone Dec 5, 2024
@mangod9
Member

mangod9 commented Dec 5, 2024

Thanks for reporting the issue. Does it repro frequently? Please share a dump (multiple if available) via email if possible.

@pavel-faltynek
Author

pavel-faltynek commented Dec 5, 2024

Yes, it fails periodically (but I have no clue what the exact trigger is). We have systemd-managed restarts as well as a load balancer checking the service health over HTTP requests (so the service is active almost immediately after startup). At first I thought the first "non-health" (bigger) request crashed it, but this might not be the case.

I will (try to) send you the dumps over mail.
(Mail attachment failed, sent via alternative method).

@pavel-faltynek
Author

pavel-faltynek commented Dec 6, 2024

I have tried the trivial dotnet new webapi and dotnet new webapiaot templates directly on the target system (curling over HTTP/HTTPS), and additionally added an explicit request that invokes GC.Collect(), but saw no issue at all. Everything was compiled and published on the target system, which leads me to a question:

Might the behavior somehow depend (even unintentionally) on the platform the application is built on? We build/publish on Windows machines and then deploy the result to Linux servers:
dotnet publish $project --configuration $configuration --framework $framework --runtime $runtime --no-self-contained --output $output
For the reported case, $framework = 'net9.0', $runtime = 'linux-arm64'.
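
For illustration only, here is a sketch of how that command line expands for the reported case. Only the framework and runtime values come from the report; the project path, configuration, and output directory below are hypothetical placeholders. The snippet prints the command rather than running it.

```shell
# Values taken from the report.
framework='net9.0'
runtime='linux-arm64'

# Hypothetical placeholders (not from the report).
project='./src/MyService/MyService.csproj'
configuration='Release'
output='./publish/linux-arm64'

# Assemble the publish command as a string and print it instead of executing it.
publish_cmd="dotnet publish $project --configuration $configuration --framework $framework --runtime $runtime --no-self-contained --output $output"
echo "$publish_cmd"
```

Because --no-self-contained is passed, the published output relies on the .NET runtime installed on the target machine (9.0.101 in the report) rather than bundling its own.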

@mangod9
Member

mangod9 commented Dec 6, 2024

Thanks for sharing the dump. It appears that Thread::GetAllocContext is returning an invalid context. In .NET 9, this part of the code was touched in #103055 and #103607.

@jkoritzinsky, do you think any of those changes would cause a race on arm64?

@pavel-faltynek, it would be helpful if you could share a few more dumps. I assume all of them AV with the same stack? Any other specific details and/or a repro would also be helpful.

@jkoritzinsky
Member

I know we had to do some fixes for process shutdown (on Windows) in #103877.

My first guess would be that this is happening because the alloc context for a thread was destroyed while the Thread object for that thread is still around. However, the Thread object cleans itself up in co-op mode, so it shouldn't be possible to race with the GC there.

Maybe there's a corresponding shutdown issue for Linux?

I can't think of anything else without looking at the dump myself.

@mangod9
Member

mangod9 commented Dec 7, 2024

In this particular case, it doesn't look like it's occurring during shutdown. I have shared the dump with you offline.

@jkotas
Member

jkotas commented Dec 9, 2024

The crash dump shows that the runtime was not notified about a thread shutting down. It is most likely a managed/native interop problem (e.g. a bug in interop corrupting the unmanaged heap). I can see from the crash dump that your service uses a number of NuGet packages; the interop bug might be in one of them.

It may be useful to run it on a checked build of CoreCLR to see whether that gives us any extra insights. Could you please give it a try?

@pavel-faltynek
Author

@jkotas, it doesn't seem I can access this. I have tried to randomly accept "account and project creation" in dev.azure.com (after logging in to my "live/microsoft" account), but these portals are overcomplicated for me (meaning I'm not patient enough to get oriented). Any chance you could just share it? Thank you.

@pavel-faltynek
Author

pavel-faltynek commented Dec 9, 2024

@mangod9, generating more dumps might be the easier part. As far as I can remember all of the ones I loaded in WinDbg had the same call stack.

I have added:

  • dump number 3 (just redeployed and waited for a crash, so no explicit request, only balancer health checks), which looks a bit different:
[0x0]   libcoreclr!Object::RawSetMethodTable   (Inline Function)   (Inline Function)   
[0x1]   libcoreclr!SVR::CObjectHeader::SetFree+0x10   (Inline Function)   (Inline Function)   
[0x2]   libcoreclr!SVR::gc_heap::make_unused_array+0x5c   0xffff619964d0   0xffffa406cb30   
[0x3]   libcoreclr!SVR::gc_heap::fix_allocation_context+0x64   (Inline Function)   (Inline Function)   
[0x4]   libcoreclr!SVR::GCHeap::FixAllocContext+0xf0   0xffff61996540   0xffffa3f594f8   
[0x5]   libcoreclr!GCToEEInterface::GcEnumAllocContexts+0x54   0xffff61996570   0xffffa4026210   
[0x6]   libcoreclr!SVR::gc_heap::fix_allocation_contexts+0x28   (Inline Function)   (Inline Function)   
[0x7]   libcoreclr!SVR::gc_heap::garbage_collect+0x60   0xffff619965a0   0xffffa4023f44   
[0x8]   libcoreclr!SVR::gc_heap::gc_thread_function+0xca8   0xffff61996670   0xffffa402329c   
[0x9]   libcoreclr!SVR::gc_heap::gc_thread_stub+0x38   0xffff619966f0   0xffffa3f5c420   
[0xa]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::operator()+0x3c   (Inline Function)   (Inline Function)   
[0xb]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::__invoke+0x50   0xffff61996710   0xffffa424afb0   
[0xc]   libcoreclr!CorUnix::CPalThread::ThreadEntry+0x1b8   0xffff61996740   0xffffa44ed5c8   
[0xd]   libc_so+0x7d5c8   0xffff61996800   0x0   
  • dumps number 4 and 5 (which look "standard"):
[0x0]   libcoreclr!alloc_context::init_alloc_count   (Inline Function)   (Inline Function)   
[0x1]   libcoreclr!SVR::GCHeap::FixAllocContext+0x14   0xffff559a6540   0xffff97f694f8   
[0x2]   libcoreclr!GCToEEInterface::GcEnumAllocContexts+0x54   0xffff559a6570   0xffff98036210   
[0x3]   libcoreclr!SVR::gc_heap::fix_allocation_contexts+0x28   (Inline Function)   (Inline Function)   
[0x4]   libcoreclr!SVR::gc_heap::garbage_collect+0x60   0xffff559a65a0   0xffff98033f44   
[0x5]   libcoreclr!SVR::gc_heap::gc_thread_function+0xca8   0xffff559a6670   0xffff9803329c   
[0x6]   libcoreclr!SVR::gc_heap::gc_thread_stub+0x38   0xffff559a66f0   0xffff97f6c420   
[0x7]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::operator()+0x3c   (Inline Function)   (Inline Function)   
[0x8]   libcoreclr!<unnamed-namespace>::CreateNonSuspendableThread::$_0::__invoke+0x50   0xffff559a6710   0xffff9825afb0   
[0x9]   libcoreclr!CorUnix::CPalThread::ThreadEntry+0x1b8   0xffff559a6740   0xffff984fd5c8   
[0xa]   libc_so+0x7d5c8   0xffff559a6800   0x0   

@pavel-faltynek
Author

pavel-faltynek commented Dec 9, 2024

Regarding the NuGet note: there is one package that differs between the net8.0 and net9.0 targets, Microsoft.AspNetCore.Mvc.Testing, which explicitly states that it is not compatible with net8.0 (in its latest version). However, I believe it does not contribute to the standard service runtime (it is test-related only).

Additionally, one code update was made for net9.0: instead of creating X509Certificate2 directly from pfx storage, we now use X509CertificateLoader (there is a new SYSLIB0057 obsoletion warning). I have checked that reverting this update does not "remedy" the problem (we had some issues with pfx in the past on mobile platforms, so I wanted to be sure it's not the cause).

EDIT: Forgot the Microsoft.AspNetCore.Authentication.JwtBearer which also went from 8.0.10 to 9.0.0 (with no impact on behavior when reverting back).

@jkotas
Member

jkotas commented Dec 9, 2024

> Any chance you could just share it? Thank you.

I have shared the checked build at https://github.com/jkotas/scratch/

@pavel-faltynek
Author

pavel-faltynek commented Dec 10, 2024

Thank you, @jkotas. I have added dump 8 executed against the checked CLR.
Additionally, there is an assert in the system log:

Assert failure(PID 1546607 [0x0017996f], Thread: 1546612 [0x179974]): (size >= Align (min_obj_size))
     File: /__w/1/s/src/coreclr/gc/gc.cpp:7896
     Image: /usr/lib/dotnet/dotnet

It doesn't seem to be a recurring thing. I found only a single instance (even though the service crashed many times).

@mangod9
Member

mangod9 commented Dec 10, 2024

> It doesn't seem to be a recurring thing. I found only a single instance (even though the service crashed many times).

So, just to clarify: you hit the assert only once, but the service was still crashing with FixAllocContext on the stack? Did you happen to capture a dump with the assert?

@askovpen

Same bug on the Docker runtime image (x86_64) with tag :9.0. With 9.0-alpine it works perfectly.

@pavel-faltynek
Author

> So, just to clarify: you hit the assert only once, but the service was still crashing with FixAllocContext on the stack? Did you happen to capture a dump with the assert?

Right. It was a single-shot assert, and unfortunately there is no documented relationship to the dump(s). So I have no clue whether dump 8 is connected to the assert in any way.

@mangod9
Member

mangod9 commented Dec 11, 2024

> Same bug on the Docker runtime image (x86_64) with tag :9.0. With 9.0-alpine it works perfectly.

Do you happen to have a standalone repro?

@askovpen

> > Same bug on the Docker runtime image (x86_64) with tag :9.0. With 9.0-alpine it works perfectly.
>
> Do you happen to have a standalone repro?

heisenbug
