Random segmentation fault in managed code on 32-bit ARM linux with dotnet 8.0 #102396
Tagging subscribers to this area: @dotnet/ncl
Is it possible to check if your Linux OS flavor uses 64-bit …? I have seen our test suite occasionally crash on ARM32 Debian 12 (on main/9.0), possibly the same issue (but we disabled runs on that specific platform due to #101444 (comment), so I did not investigate further).
Looking at the result of a …
Unfortunately, I am not sure that this test necessarily works. It still depends on the …
Is there something I can try to run to more conclusively answer the …
I think this may be the same issue that I have been tracking down on our ARM32 (NXP i.MX6Q) embedded platform running Linux. The fault can be reproduced using an unmodified dotnet 8 webapi sample program.
For me the segmentation fault only seems to occur on the first web request, and is much more likely to happen with multiple simultaneous requests. I've used this script running on the device to automate the process:

```bash
#!/bin/bash
PROGRAM=crash
fail=0
RUNS=100
for run in $(seq 1 $RUNS); do
    echo "Run $run ($fail failures)"
    ./$PROGRAM > /dev/null &
    sleep 5
    curl -k --parallel --parallel-immediate --parallel-max 50 "http://localhost:5000/weatherforecast?[1-16]" > /dev/null
    sleep 5
    PID=$(pidof $PROGRAM)
    if [ "$PID" == "" ]; then
        echo "CRASHED!"
        ((fail++))
    else
        kill $PID
        sleep 2
    fi
done
echo "Program crashes $fail / $RUNS"
```

With a single request, the failure rate is only about 3%, but it goes up to about 25% with 4 or more requests. .NET 7 shows the same fault, but .NET 6 works without any problems. The kernel (5.4.147) is compiled with CONFIG_64BIT_TIME, but glibc is only version 2.31. I can collect a coredump if that will help. Sometimes I get the AccessViolationException instead of a segmentation fault:
This looks like a GC hole.
Tagging subscribers to this area: @mangod9
Hey @maf1024 @cw-ametek, would you be able to provide a dump of the failure? If this consistently repros on a certain platform, it might be worth trying with …
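For readers unfamiliar with the option being discussed: assuming the suggestion refers to the runtime's HeapVerify setting (the follow-up comment mentions HeapVerify by name — this is an assumption, the exact variable name was elided above), it is enabled through an environment variable before launching the app:

```shell
# Assumption: the suggestion above refers to the runtime's HeapVerify
# knob, set via an environment variable. With it enabled, the GC
# validates the heap around collections, catching corruption closer
# to its source (at a significant performance cost).
export DOTNET_HeapVerify=1      # COMPlus_HeapVerify=1 on older runtimes
echo "HeapVerify setting: $DOTNET_HeapVerify"
# ./crash                       # placeholder for the repro binary
```

With verification on, a corrupting write is more likely to fail fast inside the GC's own checks rather than as a random downstream segfault.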
@mangod9 Buried in my original wall of text was a link to one sample of a crash dump (hosted on my free MSDN credits 😊). I'll try HeapVerify soon and see if it makes the failures more consistent. Would it also possibly yield a more useful crash dump? (I'm unfamiliar with that option, but it sounds like it does more exhaustive policing of memory.)
Ok, we will investigate. However, I just realized that I had a typo in the env var …
The dumps are not easily diagnosable since this is on a custom distro. I assume you are not able to repro on standard Debian/Ubuntu/Alpine?
I don't have any other 32-bit ARM hardware to even test it out on. I'm open to suggestions for virtualized approaches to repro in a more standardized way.
Also, for what it's worth, …
We noticed the same issue while porting our .NET applications from Mono to .NET 7/8. Since we could not find a quick solution, we stayed on .NET 6 at first, as .NET 6 does not show the same crashes. Note: our applications run on NXP i.MX6 dual/quad cores (Cortex-A9, ARM32) with a custom Yocto-based Linux (kernel 5.15). We initially observed random seg faults while running "dotnet test" (we run unit tests on our embedded target hardware), but were able to reproduce those seg faults also by running PowerShell Core; for example, a simple command like … seg faults within a few minutes. Even simple applications like … are affected.

The repro sample posted by @cw-ametek (#102396 (comment)) might be easier to debug. We verified it on our hw/os combination; it crashes almost immediately.

We tried different hardware and custom operating system versions in the past. For example, a Raspberry Pi 3 with Ubuntu 22.04 Server Edition (32-bit) does not show these crashes. Other hardware based on Cortex-A15 does not seg fault either. Is this related to Cortex-A9 only? We are not sure yet.

Disabling ReadyToRun and tiered compilation, and using different GC settings (server GC, without concurrent/background GC, etc.), does not change anything. However, reducing the number of CPU cores helps. For example, on a dual-core system the crashes don't happen as often, and binding the process to a single CPU core via taskset reduces the probability of these seg faults to almost zero.

That said, would it help if we provide crash dumps? We are also on a custom Linux distro, but we can provide the necessary symbols for debugging those dumps if needed. Can we do anything else? As a company we have a high interest in getting this fixed.
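The single-core workaround described above can be sketched with `taskset` from util-linux (a sketch; `./myapp` is a placeholder name, not the poster's actual binary):

```shell
# Hedged sketch of the single-core workaround: pin the process to one
# CPU core with taskset (util-linux). "./myapp" is a placeholder for
# the actual application binary.
pin_to_core() {
  core=$1; shift
  taskset -c "$core" "$@"   # run the command restricted to the given core
}
# Usage (placeholder binary):
# pin_to_core 0 ./myapp
# An already-running process can be restricted too:
# taskset -cp 0 <pid>
```

Pinning to one core serializes the threads onto a single CPU, which is consistent with the reported symptom (a race or cache-coherency issue between cores becoming nearly impossible to hit).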
I'm using dotnet on thousands of ARM32 IoT devices, with dotnet process uptimes measured in months, without problems. My CPU is an Allwinner H3 (4x ARM Cortex-A7) on custom Yocto Linux. I also have monitoring of the nodes, so I can see if there is a process restart or any problem. I migrated from Mono to dotnet 5.0 and am now using the latest 8.0 with kernel 5.15.35. My app is not small, about 1M lines of code plus external libraries.
Could you please share a stack trace of the crash, to start with?
Sure! Here are a few managed-only stack traces that we recorded in the past (various applications).
We see very erratic exception patterns. It looks like a kind of memory corruption to me. Right now, a colleague is trying to get managed+native stack traces via lldb+SOS from one of the crash dumps, but we are facing some problems: the thread causing the crash (seg fault) does not show any backtrace. Maybe the dump is corrupted, too. We are working on it and will come back to you. Side note: I have to hurry off to vacation; my colleagues will take over.
We see the same random segmentation faults with our cross-built armel dotnet 8.0 runtime.
@lvorpahl-nokia Are you able to collect a crash dump and open it under lldb with the SOS extension? The next step in diagnosing these crashes is to run the VerifyHeap SOS command on the crash dump to see whether the GC heap is corrupted.
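For anyone following along, the suggested workflow looks roughly like this (a sketch; the dump and binary paths are placeholders, and the one-time SOS install step assumes the `dotnet-sos` global tool):

```
# One-time setup: install SOS support for lldb
#   dotnet tool install -g dotnet-sos && dotnet-sos install

# Open the crash dump under lldb (paths are placeholders):
lldb --core ./coredump ./dotnet

(lldb) clrstack          # managed stack of the faulting thread
(lldb) verifyheap        # walk the GC heap and report corruption
(lldb) dumpheap -stat    # heap summary, useful context if corrupted
```

If `verifyheap` reports corrupted objects, that points at a GC hole rather than a fault in the reporter's application code.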
The Marvell Armada XP SoC is based on Sheeva PJ4B-MP cores, which are probably modified Cortex-A8. The other CPU cores with this problem were Cortex-A9; I don't know if that is relevant. A8 and A9 are older designs based on the ARMv7 ISA. I have no problems with Cortex-A7 cores. Could there be a connection with the different FPU (A8 and A9 have VFPv3)? More recent designs use VFPv4 (A7/A15/A17...).
@michaldobrodenka, hmm, that is an interesting point. Thank you! Looks like, by default, …
@jkotas Unfortunately no, our toolchain has no lldb support. I apologize that I cannot contribute more than reporting that we see the crashes on our CPU. We compiled …
Sorry for the off-topic question, but what is the reason you are compiling coreclr yourself? Could there still be VFPv4 assembly somewhere? Probably only a small chance :(
We are running an armel Linux, and there is no official armel build as far as I know.
I'm getting lost in these ARM architectures and abbreviations, but I'm compiling dotnet for ARMv6 with mono-vm, and it's usable on a Raspberry Pi 1. I'm using it as a docker container to build a self-contained executable which then runs on Raspberry Pi compute modules. The only downside is that every timer stops working after UInt32.MaxValue milliseconds (49.7 days). If you want to try it: https://hub.docker.com/r/taphome/dotnet-armv6
And you can still use dotnet with the Mono runtime on ARMv7; it might help. Use …
I have run the …. I can confirm that an Allwinner H3 with its four Cortex-A7 cores does not crash (at least not within 3 days of running that command in a loop). On boards with an i.MX6Q (Cortex-A9 r2p10) and a Zynq 7020 (Cortex-A9 r3p0), dotnet does crash after less than a minute. I even tried i.MX6Q boards from different vendors to rule out that the board is at fault.

Since the type of ARMv7 core clearly makes a difference, maybe it is a known Cortex-A9 erratum? strace tells me that dotnet performs about 12500 cacheflush operations for this simple test. Maybe one of them goes wrong because of erratum 764369? The workaround in the kernel is enabled, but the description of the erratum says the error might occur anyway under "extremely rare and complex timing conditions". Does someone have connections to ARM to verify whether this is a CPU core bug?
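The strace measurement above can be reproduced roughly like this (a sketch; `cacheflush` is an ARM-only syscall, so this only works on ARM hosts, and `./crash` is a placeholder binary name):

```shell
# Hedged sketch: count cacheflush syscalls made by the repro binary.
# The cacheflush syscall exists only on ARM, so this will not run on
# x86 hosts; "./crash" is a placeholder for the actual test program.
count_cacheflush() {
  strace -f -e trace=cacheflush "$1" 2>&1 | grep -c 'cacheflush('
}
# count_cacheflush ./crash
```

The runtime issues a cacheflush after each chunk of JIT-compiled code to keep the instruction cache coherent, which is exactly the operation erratum 764369 concerns, so a high count here means many chances for the erratum to trigger.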
Any combination of "setarch linux32/linux64 -R program" and DOTNET_ReadyToRun=0 still crashes dotnet on the i.MX6DL.
Description
I'm encountering random segmentation faults (and sometimes AccessViolationException and NullReferenceException) when running a dotnet 8.0 console app on 32-bit ARM Linux. For me it seems to mainly occur when the app attempts to connect to a SignalR hub as a client.
Reproduction Steps
I have pushed a small repro app pair here.
Note that, in addition to the crashing console app, it also contains a trivially simple SignalR webapp that I've deployed to Azure when reproducing. (Note the "[CHANGE THIS TO YOUR WEBAPP URL]" line in the console app.)
The console app code just tries to make a SignalR connection over websockets and send/receive a few MessagePack messages before exiting.
When running the console app repeatedly, it has a random chance of encountering the issue, after something like 10 to 100 attempts (a helper bash loop script is included).
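The helper loop itself is not shown here; a sketch of what such a loop could look like (an assumption about its shape, not the actual script from the repro repository; `./ConsoleApp` is a placeholder name):

```shell
# Hedged sketch of a repro loop: run the app repeatedly and count the
# runs that die with SIGSEGV (exit status 139 = 128 + signal 11).
# "./ConsoleApp" is a placeholder for the real repro binary.
count_segfaults() {
  runs=$1; shift
  crashes=0
  for i in $(seq 1 "$runs"); do
    "$@" > /dev/null 2>&1
    if [ $? -eq 139 ]; then       # process terminated by SIGSEGV
      crashes=$((crashes + 1))
    fi
  done
  echo "$crashes / $runs runs segfaulted"
}
# Usage: count_segfaults 100 ./ConsoleApp
```

Checking the exit status rather than grepping output distinguishes a genuine segfault from an app-level failure that exits normally.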
Expected behavior
Can run repeatedly without crashing.
Actual behavior
Approximately 5% of the time it fails with simply:

Segmentation fault

I have uploaded a core crash dump file of one of the segfault occurrences here.
On some occasions it randomly throws a NullReferenceException, with additional stack trace info:
Regression?
It seems to work fine (even with thousands of attempts) when changing the csproj back to dotnet 6.0 instead of 8.0.
(I can't test dotnet 7.0 on this ARM device due to the higher glibc requirement that 7.0 has.)
Known Workarounds
Downgrading the csproj to dotnet 6.0
Configuration
Custom Linux OS running on a 32-bit ARM IoT device.
Problem seems specific to 32-bit ARM. Ran repeatedly on a different custom ARM64 device with no issues.
Other information
On some rare occasions with very similar code (from my team's proprietary app), it encountered this AccessViolationException, which I suspect is stemming from the same cause: