ARM segmentation fault: JIT/Regression/CLR-x86-JIT/V1-M11-Beta1/b31878/b31878/b31878.sh #6334

Closed
swgillespie opened this issue Jul 15, 2016 · 24 comments

Comments

@swgillespie
Contributor

See http://dotnet-ci.cloudapp.net/job/dotnet_coreclr/job/master/job/arm_emulator_cross_debug_ubuntu_prtest/511/console:

12:14:41 FAILED   - JIT/Regression/CLR-x86-JIT/V1-M11-Beta1/b31878/b31878/b31878.sh
12:14:41                BEGIN EXECUTION
12:14:41                /home/coreclr/Windows_NT.x64.Debug/Tests/coreoverlay/corerun b31878.exe
12:14:41 
               ./b31878.sh: line 117: 10078 Segmentation fault      (core dumped) $_DebuggerFullPath "$CORE_ROOT/corerun" b31878.exe $CLRTestExecutionArguments
12:14:41                Expected: 100
12:14:41                Actual: 139
12:14:41                END EXECUTION - FAILED
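
For readers decoding the log above: when a shell reports an exit status above 128, it generally means the process was killed by a signal whose number is the status minus 128, so the `Actual: 139` here corresponds to signal 11 (SIGSEGV). A minimal Python sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical and not part of the CoreCLR test scripts.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-reported exit status.

    Assumption: the status comes from a POSIX shell, where a value
    above 128 means "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {code}"


# 139 is the "Actual" value in the log; 100 is the "Expected" value.
print(describe_exit_code(139))  # killed by signal 11 (SIGSEGV)
print(describe_exit_code(100))  # exited normally with status 100
```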
@jashook
Contributor

jashook commented Jul 15, 2016

/cc @RussKeldorph these are the tests reported by Samsung as failing?

@RussKeldorph
Contributor

@jashook Looks different than #6332.

@parjong
Contributor

parjong commented Jul 18, 2016

@jashook @RussKeldorph After PR dotnet/coreclr#6021 was merged, there are many regressions in the CoreCLR ARM (softfp) Release build:

Here is the result from bdfce9e:

=======================
     Test Results
=======================
# CoreCLR Bin Dir  : 
# Tests Discovered : 9870
# Passed           : 8953
# Failed           : 579
# Skipped          : 338
=======================

Here is the result from bdfce9e without PR dotnet/coreclr#6021 (the PR was reverted manually):

=======================
     Test Results
=======================
# CoreCLR Bin Dir  : 
# Tests Discovered : 9870
# Passed           : 9504
# Failed           : 28
# Skipped          : 338
=======================

[bdfce9ed7fb][TC 387d9fc0a Release] [include-aab8856ce03].txt
[bdfce9ed7fb][TC 387d9fc0a Release] [revert-aab8856ce03].txt

Interestingly, b31878 passes on my side.
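
To make the comparison above concrete, the following Python sketch parses the two summary blocks and computes the difference in failures. The function name `parse_summary` and the regular expression are illustrative assumptions, not part of runtest.sh; the numbers are copied from the results above.

```python
import re

# Summary lines copied from the two runs above.
WITH_PR = """\
# Tests Discovered : 9870
# Passed           : 8953
# Failed           : 579
# Skipped          : 338
"""

WITHOUT_PR = """\
# Tests Discovered : 9870
# Passed           : 9504
# Failed           : 28
# Skipped          : 338
"""


def parse_summary(text):
    """Return a dict such as {"Failed": 579} from '# Name : N' lines."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"#\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


with_pr = parse_summary(WITH_PR)
without_pr = parse_summary(WITHOUT_PR)

# Sanity check: passed + failed + skipped should equal the discovered total.
for run in (with_pr, without_pr):
    assert run["Passed"] + run["Failed"] + run["Skipped"] == run["Tests Discovered"]

extra_failures = with_pr["Failed"] - without_pr["Failed"]
print(f"Additional failures with the PR included: {extra_failures}")  # 579 - 28 = 551
```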

@swgillespie
Contributor Author

On the PR that triggered this failure, this test looked like it passed on the Release build but segfaulted on the Debug build here, which could be why it passed on a Release build for @parjong.

@danmoseley
Member

Seen again http://dotnet-ci.cloudapp.net/job/dotnet_coreclr/job/master/job/arm_emulator_cross_debug_ubuntu_prtest/245/consoleFull#2096352037c7af6a29-b465-404a-b249-90a3911e4354

10:13:31 FAILED   - JIT/Regression/CLR-x86-JIT/V1-M11-Beta1/b31878/b31878/b31878.sh
10:13:31                BEGIN EXECUTION
10:13:31                /bindings/tmp/arm32_ci_temp/coreclr/Windows_NT.x64.Debug/Tests/coreoverlay/corerun b31878.exe
10:13:31 
               qemu: uncaught target signal 11 (Segmentation fault) - core dumped
10:13:31                ./b31878.sh: line 117: 56638 Segmentation fault      (core dumped) $_DebuggerFullPath "$CORE_ROOT/corerun" b31878.exe $CLRTestExecutionArguments
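
The `qemu: uncaught target signal 11` line in the log above is a distinctive marker of this failure. As a rough triage aid, a console log could be scanned for failed tests that are followed by that marker. The sketch below is an assumption-laden illustration: the function name `qemu_segfault_tests` is hypothetical, and the parsing assumes the `FAILED   - <test path>` layout seen in the excerpt.

```python
import re

QEMU_SIGNATURE = "qemu: uncaught target signal 11 (Segmentation fault) - core dumped"

# Matches "FAILED   - <test path>" but not "END EXECUTION - FAILED".
FAILED_RE = re.compile(r"FAILED\s+-\s+(\S+)")

# Shortened excerpt of the console output quoted above.
SAMPLE_LOG = """\
10:13:31 FAILED   - JIT/Regression/CLR-x86-JIT/V1-M11-Beta1/b31878/b31878/b31878.sh
10:13:31                BEGIN EXECUTION
               qemu: uncaught target signal 11 (Segmentation fault) - core dumped
10:13:31                END EXECUTION - FAILED
"""


def qemu_segfault_tests(log_text):
    """Return failed tests whose output contains the qemu segfault signature."""
    hits = []
    current = None  # most recent test reported as FAILED
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            current = match.group(1)
        elif QEMU_SIGNATURE in line and current is not None and current not in hits:
            hits.append(current)
    return hits


for test in qemu_segfault_tests(SAMPLE_LOG):
    print("qemu segfault in:", test)
```

Running this over several CI logs would show whether the marker clusters on one test or is spread across many.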

@danmoseley
Member

@RussKeldorph @swgillespie is there a way I can get the dump off that box for you?

@swgillespie
Contributor Author

swgillespie commented Sep 19, 2016

In theory, the dump should have been uploaded to the dumpling service, although looking at the logs I don't see anything indicating that the upload occurred.

@danmoseley
Member

Can I go ahead and reset the CI?

@swgillespie
Contributor Author

Yeah, I'd say it's fine.

@RussKeldorph
Contributor

@swgillespie Where can we find out more about this "dumpling" service? I wouldn't be surprised if it doesn't "just work" for the ARM emulator jobs.

/cc @jashook

@swgillespie
Contributor Author

@RussKeldorph I don't know the specifics, but as I understand it dumpling is an HTTP endpoint that can receive crash dumps (dotnet/coreclr#6083) and view them (http://aka.ms/dumpling). @adityamandaleeka or @bryanAR might know more about how it works vis-a-vis ARM and other platforms.

@RussKeldorph
Contributor

@hqueue @leemgs @myungjoo @parjong @wateret @sjsinju This issue is causing spurious failures in our PR and rolling tests. Can you investigate? Note that we believe @CarolEidt fixed the ARM regression introduced by dotnet/coreclr#6021 before resubmitting her change.

You may want to try adding the new --limitedDumpGeneration option to the ARM CI runtest.sh so you can use dumpling as @swgillespie suggests above.

@parjong
Contributor

parjong commented Oct 17, 2016

@RussKeldorph The issue seems to be related to the emulator-based testing environment. I tried to reproduce it on a Raspberry Pi 3 and other ARM soft-fp devices, but could not.

@sjsinju @wateret Could you describe the current ARM CI setup in detail?

@sjsinju
Contributor

sjsinju commented Oct 17, 2016

I checked the current status of the ARM CI debug builds. It was hard to find a failure in the b31878 test case, but I found failures at the links below.

http://dotnet-ci.cloudapp.net/job/dotnet_coreclr/job/master/job/arm_emulator_cross_debug_ubuntu_prtest/988/
and
http://dotnet-ci.cloudapp.net/job/dotnet_coreclr/job/master/job/arm_emulator_cross_debug_ubuntu_prtest/990/

are PRs that only changed the documentation in 'linux-instructions.md', yet each produced different test failures. The release build of the same PR on the CI cloud was successful.

http://dotnet-ci.cloudapp.net/job/dotnet_coreclr/job/master/job/arm_emulator_cross_release_ubuntu_prtest/960/

That seems to be the same problem as this issue. I think the segmentation fault is not caused by b31878 alone but hits arbitrary test cases at random, and only in the debug CI, as @swgillespie said. So I think we have to investigate the debug build.

@RussKeldorph
Contributor

@sjsinju Sorry for not being clear. Yes, we believe this failure is not specific to b31878. It seems to happen nondeterministically in many, if not all, tests currently running in ARM CI. I believe this is another example (in the release build): http://dotnet-ci.cloudapp.net/job/dotnet_coreclr/job/master/job/arm_emulator_cross_release_ubuntu/232/console.

Due to the difficulty of reproducing, it may be better to enable capturing the dump in the CI rather than attempting to reproduce locally. If the --limitedDumpGeneration switch I mentioned above works in the emulator, that might be the easiest thing to try.

sjsinju referenced this issue in sjsinju/coreclr Oct 19, 2016
To determine the cause of the random test failures (#6298),
the runtest option '--limitedDumpGeneration' is added.
sjsinju referenced this issue in sjsinju/coreclr Nov 2, 2016
To determine the cause of the random test failures (#6298),
we checked whether the segmentation faults come from the mounted rootfs and multi-threaded processing.

So I changed the root-fs to an archived root-fs and run the tests with the --sequential option.
jkotas referenced this issue in dotnet/coreclr Nov 4, 2016
…7946)

* ARM-CI : Use archived root-fs and run tests with --sequential option

To determine the cause of the random test failures (#6298),
we checked whether the segmentation faults come from the mounted rootfs and multi-threaded processing.

So I changed the root-fs to an archived root-fs and run the tests with the --sequential option.

* change to original clang version
sjsinju referenced this issue in sjsinju/coreclr Nov 7, 2016
To determine the cause of the random test failures (#6298),
we checked whether the segmentation faults come from the mounted rootfs and multi-threaded processing.

So I changed the root-fs to the archived root-fs and run the tests with the --sequential option.

PS. The location of the root-fs folder was changed from '/opt', as written in the reverted commit (dotnet#7991), to '/mnt' to resolve an out-of-space issue.
jkotas referenced this issue in dotnet/coreclr Nov 10, 2016
* ARM-CI : Fix segmentation faults on running tests

To determine the cause of the random test failures (#6298),
we checked whether the segmentation faults come from the mounted rootfs and multi-threaded processing.

So I changed the root-fs to the archived root-fs and run the tests with the --sequential option.

PS. The location of the root-fs folder was changed from '/opt', as written in the reverted commit (#7991), to '/mnt' to resolve an out-of-space issue.
@sjsinju
Contributor

sjsinju commented Nov 14, 2016

PR dotnet/coreclr#8019, which runs the tests using an archived root-fs with the sequential option, was merged. But it does not seem to be enough to resolve this issue. Although the frequency has decreased, the segmentation fault still occurs (local tests were all successful).

Additional investigation is needed.

@RussKeldorph
Contributor

Per dotnet/coreclr#11069, I unfortunately recommend adding retry logic to runtest.sh. It should be enabled for ARM32 only, and I would prefer that a test only be retried if its output matches a very specific pattern, e.g. qemu: uncaught target signal 11 (Segmentation fault) - core dumped that is unique to this failure. In no case should a single test be retried more than three times. Please also add clear logging to indicate a retry is taking place so we can determine how often this continues to hit in the future.
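
The requirements above (ARM32 only, retry only on the specific qemu pattern, at most three retries, clear logging) can be sketched as follows. This is a Python illustration of the policy, not the runtest.sh change itself: `run_with_retry`, `run_test`, and `is_arm32` are hypothetical names, and the pass code of 100 is taken from the `Expected: 100` line in the logs.

```python
QEMU_SIGNATURE = "qemu: uncaught target signal 11 (Segmentation fault) - core dumped"
MAX_RETRIES = 3   # "In no case should a single test be retried more than three times."
PASS_CODE = 100   # Assumption: taken from "Expected: 100" in the CI log.


def run_with_retry(run_test, is_arm32, log=print):
    """Run one test, retrying only on the qemu segfault signature.

    run_test is a callable returning (exit_code, output); how a test is
    actually launched is outside the scope of this sketch.
    Returns (exit_code, output, retries_used).
    """
    exit_code, output = run_test()
    retries = 0
    while (
        is_arm32
        and exit_code != PASS_CODE
        and QEMU_SIGNATURE in output
        and retries < MAX_RETRIES
    ):
        retries += 1
        log(f"RETRY {retries}/{MAX_RETRIES}: qemu segfault signature seen, rerunning test")
        exit_code, output = run_test()
    return exit_code, output, retries


# Usage with a fake test that hits the qemu failure twice and then passes.
attempts = []

def fake_test():
    attempts.append(1)
    if len(attempts) < 3:
        return 139, QEMU_SIGNATURE
    return PASS_CODE, "test passed"

code, _, used = run_with_retry(fake_test, is_arm32=True)
print(f"final exit code {code} after {used} retries")
```

A failure whose output lacks the signature is returned immediately, so genuine regressions are not masked by the retry.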

@hqueue
Member

hqueue commented Apr 27, 2017

I think retry logic can be implemented in two ways.

  • (1) Update runtest.sh with a new option that retries failed tests when a qemu-related error is observed, and have the ARM32 CI use that option.
  • (2) Keep the existing runtest.sh and implement all retry logic in arm32_ci_test.sh.

(1) may be preferred over (2) in general (right?)

What do you think?

@RussKeldorph
Contributor

I'm less concerned about how it's implemented and more concerned about the requirements. I strongly prefer retrying at the lowest level possible (e.g. individual test cases) rather than retrying all the tests if only one fails spuriously. I assumed that meant going with your option (1), but I suppose you could achieve the same goal in other ways. I think if a single invocation of runtest.sh is running multiple tests, however, you probably need to modify runtest.sh.

I'm not sure we need a new script option to enable retry on failure. Retries are a huge hack that we should remove eventually, and I'm not a fan of adding unnecessary complexity for hacks. I would just enable retry only when the --testDirFile option is used, since that currently only happens in the scenario we care about.

@hqueue
Member

hqueue commented May 4, 2017

> I strongly prefer retrying at the lowest level possible (e.g. individual test cases) rather than retrying all the tests if only one fails spuriously.

I definitely agree with this idea.

> I would just enable retry only when the --testDirFile option is used, since that currently only happens in the scenario we care about.

I had a similar idea in mind. The minor problems with this approach are that (1) it may take more time to set up the test environment if we invoke runtest.sh again with only the failed tests, and (2) there will be multiple runtest.sh results, and only the result of the last execution will show at the end of CI.

If this doesn't matter, then I also think this 2nd approach (exploiting the --testDirFile option) is a reasonable choice, because this is a temporary hack for ARM CI only, as you said.

I will prepare retry logic using the 2nd approach.
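
One way to picture the 2nd approach: after a run, collect the directories of the failed tests whose output contains the qemu signature and write them to a file that a second runtest.sh invocation could receive through --testDirFile. The Python sketch below only illustrates that selection step. The helper names are hypothetical, and the file layout (one test directory per line) is an assumption that should be checked against runtest.sh.

```python
import posixpath

QEMU_SIGNATURE = "qemu: uncaught target signal 11 (Segmentation fault) - core dumped"


def retry_directories(failed_outputs):
    """Pick test directories worth rerunning.

    failed_outputs maps a failed test script path to its captured output.
    Only failures showing the qemu segfault signature are selected.
    """
    dirs = []
    for script, output in failed_outputs.items():
        if QEMU_SIGNATURE in output:
            directory = posixpath.dirname(script)
            if directory not in dirs:
                dirs.append(directory)
    return dirs


def write_retry_file(dirs, path):
    """Write one directory per line (assumed --testDirFile layout)."""
    with open(path, "w") as handle:
        for directory in dirs:
            handle.write(directory + "\n")


# Example: one spurious qemu failure and one unrelated failure.
failed = {
    "JIT/Regression/CLR-x86-JIT/V1-M11-Beta1/b31878/b31878/b31878.sh": QEMU_SIGNATURE,
    "some/other/test/test.sh": "Expected: 100\nActual: 1",  # hypothetical test path
}
print(retry_directories(failed))
```

Only the first entry is selected for a rerun; the second is left as a real failure.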

@hqueue
Member

hqueue commented Jul 14, 2017

related issue dotnet/coreclr#6573

@BruceForstall
Member

@RussKeldorph Is there a reason this issue needs to be kept open? You added a reference to the issue recently. But it's not clear that the "real" issue here is the failure it was opened for.

@RussKeldorph
Contributor

@BruceForstall I'm pretty confident the problem is a bug in the version of QEMU we rely on. I don't know if a newer QEMU would fix it or if that's even an option. The issue is less pressing since we don't have default-triggered CI jobs using QEMU anymore, but I believe the bug is still relevant until we either fix QEMU or stop using it altogether. @jashook mentioned a possible way to test armel using Docker on an ARM or ARM64 Linux box, but I think it's just an idea at this point.

jake-ruyi referenced this issue in jeikabu/nng.NETCore Feb 20, 2019
- dotnet crashes with:
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

- Seems to be known issue:
https://github.com/dotnet/coreclr/issues/6298
jake-ruyi referenced this issue in jeikabu/nng.NETCore Feb 21, 2019
* Dockerfiles for arm32v7

* Add support for arm32.  Dockerfile using qemu crashes.

- dotnet crashes with:
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

- Seems to be known issue:
https://github.com/dotnet/coreclr/issues/6298
@RussKeldorph
Contributor

Closing stale. Will revisit if/when this environment is running regularly again.

jeikabu referenced this issue in jeikabu/nng.NETCore Jan 10, 2020
* Dockerfiles for arm32v7

* Add support for arm32.  Dockerfile using qemu crashes.

- dotnet crashes with:
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

- Seems to be known issue:
https://github.com/dotnet/coreclr/issues/6298
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 30, 2020