Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion failures crypto_macos.cc: `expected: CHECK_IS_BLOCKING(SIGPROF)` #36908

Closed
mkustermann opened this issue May 9, 2019 · 11 comments

Comments

Projects
None yet
4 participants
@mkustermann
Copy link
Member

commented May 9, 2019

Since around 5th of May our flakiness dashboard shows a lot of flaky crashes due to hitting assertions of this form (from build):

/===================================================\
| language_2/try_catch5_test succeeded again (Pass) |
\===================================================/

--- Command "vm_compile_to_kernel []" (took 863ms):
DART_CONFIGURATION=DebugSIMDBC64 /b/s/w/ir/pkg/vm/tool/gen_kernel --no-aot --platform=xcodebuild/DebugSIMDBC64/vm_platform_strong.dill -o /b/s/w/ir/xcodebuild/DebugSIMDBC64/generated_compilations/dartk/tests_language_2_try_catch5_test/out.dill /b/s/w/ir/tests/language_2/try_catch5_test.dart --packages=/b/s/w/ir/.packages -Ddart.developer.causal_async_stacks=true

exit code:
0

--- Command "vm" (took 06.000368s):
DART_CONFIGURATION=DebugSIMDBC64 xcodebuild/DebugSIMDBC64/dart --hot-reload-test-mode --optimization-counter-threshold=10 --no-background-compilation --ignore-unrecognized-flags --packages=/b/s/w/ir/.packages /b/s/w/ir/xcodebuild/DebugSIMDBC64/generated_compilations/dartk/tests_language_2_try_catch5_test/out.dill

exit code:
-6

stderr:
../../runtime/bin/crypto_macos.cc: 28: error: expected: CHECK_IS_BLOCKING(SIGPROF)
version=2.3.0-edge.df7db795dcc9eff14bbf99a9f0f9dec0cb3e3563 (Tue May 7 20:15:18 2019 +0000) on "macos_dbc"
thread=4355, isolate=(null)(0x0)
  pc 0x0000000106ffef74 fp 0x0000700001f61b10 dart::Profiler::DumpStackTrace(void*)
-- End of DumpStackTrace

--- Re-run this test:
python tools/test.py -n dartk-reload-mac-debug-simdbc64 language_2/try_catch5_test
@mkustermann

This comment has been minimized.

Copy link
Member Author

commented May 9, 2019

There are cores available for the failure above, see the "isolated out" links in the build.

Looking at one particular core it has the following stack trace:

(lldb) bt
* thread #5, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff6461db66 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff647e8080 libsystem_pthread.dylib`pthread_kill + 333
    frame #2: 0x00007fff645791ae libsystem_c.dylib`abort + 127
    frame #3: 0x0000000106c6826c dart`dart::Assert::Fail(char const*, ...) + 140
    frame #4: 0x0000000106c6fc07 dart`dart::bin::Crypto::GetRandomBytes(long, unsigned char*) + 519
    frame #5: 0x00000001070185d6 dart`dart::Random::Random() + 54
    frame #6: 0x00000001070b8d23 dart`dart::Thread::Thread(dart::Isolate*) + 483
    frame #7: 0x00000001070c1085 dart`dart::ThreadRegistry::GetFreeThreadLocked(dart::Isolate*, bool) + 245
    frame #8: 0x0000000106ef5c82 dart`dart::Isolate::ScheduleThread(bool, bool) + 258
    frame #9: 0x00000001070bb7bb dart`dart::Thread::EnterIsolateAsHelper(dart::Isolate*, dart::Thread::TaskKind, bool) + 43
    frame #10: 0x000000010724ea13 dart`dart::ConcurrentMarkTask::Run() + 35
    frame #11: 0x00000001070c0b91 dart`dart::ThreadPool::Worker::Loop() + 161
    frame #12: 0x00000001070c0753 dart`dart::ThreadPool::Worker::Main(unsigned long) + 435
    frame #13: 0x0000000106ff8fba dart`dart::ThreadStart(void*) + 138
    frame #14: 0x00007fff647e5661 libsystem_pthread.dylib`_pthread_body + 340
    frame #15: 0x00007fff647e550d libsystem_pthread.dylib`_pthread_start + 377
    frame #16: 0x00007fff647e4bf9 libsystem_pthread.dylib`thread_start + 13

Looking at the code:

// thread.h
class Thread : public ThreadState {
  ...
 private:
  Random thread_random_;
  ...
// random.cc
Random::Random() {
  uint64_t seed = FLAG_random_seed;
  if (seed == 0) {
    Dart_EntropySource callback = Dart::entropy_source_callback();
    if (callback != NULL) {
      if (!callback(reinterpret_cast<uint8_t*>(&seed), sizeof(seed))) {
      ...
    ...
  ...
}
bool Crypto::GetRandomBytes(intptr_t count, uint8_t* buffer) {                           
  ThreadSignalBlocker signal_blocker(SIGPROF);                                           
  intptr_t fd =                                                                          
      TEMP_FAILURE_RETRY_NO_SIGNAL_BLOCKER(open("/dev/urandom", O_RDONLY));              
  if (fd < 0) {                                                                          
    return false;                                                                        
  }                                                                                      
  intptr_t bytes_read = 0;                                                               
  do {                                                                                   
    int res = TEMP_FAILURE_RETRY_NO_SIGNAL_BLOCKER(                 // <-- this is where SIGPROF is not blocked                     
        read(fd, buffer + bytes_read, count - bytes_read));                              
  ...

It's a bit unclear how this could ever happen, since we block SIGPROF inside inside Crypto::GetRandomBytes() so the two TEMP_FAILURE_RETRY_NO_SIGNAL_BLOCKER should never have SIGPROF enabled.

Yet we hit the assertion above.

The addition of Thread::thread_random_ in a270999 caused us to hit this particular in the concurrent marking task. Though the the underlying issue is not quite clear to me yet.

/cc @sjindel-google @rmacnak-google

@mkustermann

This comment has been minimized.

Copy link
Member Author

commented May 9, 2019

If we can trust the data on the flakiness dashboard, then it is a little suspicious that a270999 was landed in April but the heavy flakiness started on 2nd of May.

@dcharkes

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

If we can trust the data on the flakiness dashboard

All results from the last two months on language_2/vm/math_vm_test:

n bot result count earliest last
31 dartk-reload-mac-debug-simdbc64 Crash 4 2019-05-03 12:38:22 UTC 2019-05-07 20:15:18 UTC
32 dartk-reload-mac-debug-simdbc64 Pass 746 2019-03-11 22:32:31 UTC 2019-05-09 23:32:11 UTC

And there are results in the database in between April 5th and May 2nd.

(I'm unsure why language_2/try_catch5_test doesn't show up with crash results in the results database.)

@dcharkes

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

Okay, only 1 of the 6 (1 original + 5 deflake) results make it into the results database (because they have the same primary key. This is why things with a single flake don't show up in the results database.

Moreover, after 100 non-flaky subsequent runs, a flaky test is forgiven - disappearing from the flakiness dashboard. So actually, 2nd of May is just exactly 100 builds ago. So this flakiness might have actually been happening since a270999.

@sortie

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

I ran this script to track down the issue:

for i in $(seq 0 1000); do rev=$(expr 2549 - $i); if gsutil.py cat gs://dart-test-results/builders/vm-kernel-mac-debug-simdbc64/$rev/logs.json | grep -q SIGPROF; then echo $rev has SIGPROF; else echo $rev does not have SIGPROF; fi; done)

Which located this build as the first containing the issue: https://ci.chromium.org/p/dart/builders/ci.sandbox/vm-kernel-mac-debug-simdbc64/2470

Which has blamelist 97122d1..0377617, out of which this commit seems the most likely:

https://dart-review.googlesource.com/c/sdk/+/100988

Which does indeed mention SIGPROF.

@dcharkes

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

Thanks for helping me dig through our cloud deflaking data @sortie!

@dcharkes

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

// Spawned threads inherit their spawner's signal mask. We sometimes spawn
// threads for running Dart code from a thread that is blocking SIGPROF.
// This function explicitly unblocks SIGPROF so the profiler continues to
// sample this thread.
static void UnblockSIGPROF() {
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGPROF);
  int r = pthread_sigmask(SIG_UNBLOCK, &set, NULL);
  USE(r);
  ASSERT(r == 0);
  ASSERT(!CHECK_IS_BLOCKING(SIGPROF));
}

The commit adds this method on MacOS and calls it on pthread spawning. This causes us to hit this crash.

However, the Linux version already had this code, which was added back in 2016 #26416 https://codereview.chromium.org/1953143002.

The question is why does MacOS behave different from Linux.

@dcharkes

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

Some quick Googling revealed this: https://lesteryu.com/signal-delivery-behavior-on-os-x/.

An online article discussed a subtle behavior on OS X, NetBSD, and OpenBSD delivering the SIGPROF signal to multithreaded programs: http://research.swtch.com/macpprof. In a nutshell, the current implementations in those systems cannot do it right to deliver the signal to a running thread, which most other systems would do.

@dcharkes

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

The flakiness dashboard also contains a group of tests that fails more frequently, with SIGPROF.

2019-05-02 18:02:02 co19_2/LibTest/convert/AsciiCodec/AsciiCodec_A01_t01 dartk-mac-debug-simdbc64 Crash,Pass Pass 3 46       36    
2019-05-02 18:02:02 co19_2/LibTest/convert/AsciiCodec/decode_A02_t01 dartk-mac-debug-simdbc64 Crash,Pass Crash 1 51       31    

etc.

python tools/test.py -n dartk-mac-debug-simdbc64 co19_2/LibTest/convert/AsciiDecoder/convert_A01_t01

Somewhat reliably reproduces the problem on my macbook (50-50ish).

Reverting 5393ce7 reliably prevents the crashes by not enabling SIGPROF on MacOS in the first place.

@dcharkes

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

Fix the profiler for threads spawned from the Mac eventhandler.

@rmacnak-google This might not be properly supported on MacOS. (Unfortunately, the CL 3 years ago does not mention why it was only added to Linux.)

You can semi-reliably reproduce this with python tools/test.py -n dartk-mac-debug-simdbc64 co19_2/LibTest/convert/AsciiDecoder/convert_A01_t01

@dcharkes

This comment has been minimized.

Copy link
Contributor

commented May 22, 2019

I looked at this again with @mraleph yesterday. The sigprof signal seems to become unblocked at random moments.

I double checked, on master reverting the part of @rmacnak-google's change that enables the sigprof signal on mac threads makes the AsciiDecoder convert test succeed (sigprof.patch).

dart-bot pushed a commit that referenced this issue May 23, 2019

[vm] Partially disable 5393ce7.
Avoid flakily failures setting the thread signal mask.

Bug: #36908
Change-Id: I3d66214189f3276365b4c1cc333847cb4f65c94f
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/103400
Commit-Queue: Ryan Macnak <rmacnak@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.