intermittent segfault / abort / stuck processes when setting schedulers_online system flag starting in OTP 20 #4809

Closed
zerth opened this issue May 7, 2021 · 6 comments · Fixed by #4980
Assignees: rickard-green
Labels: bug (Issue is reported as a bug), team:VM (Assigned to OTP team VM)

zerth commented May 7, 2021

Describe the bug
When adjusting the number of schedulers online via erlang:system_flag/2 on a Linux x86_64 (Broadwell/Skylake) build, a segfault in erl_process.c:add2runq sometimes occurs:

(gdb) frame 0
#0  add2runq (enqueue=<optimized out>, prio=2, proc=0x7f0aa9c9b988, proxy=0x0, state=<optimized out>)
    at beam/erl_process.c:6664
6664	        if (ERTS_RUNQ_IX_IS_DIRTY(runq->ix))
(gdb) p runq
$6 = (ErtsRunQueue *) 0xffffffffffffffff
(gdb) etp runq
-1.
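
The runq value gdb prints above, 0xffffffffffffffff, is the all-ones bit pattern, which is why etp renders it as -1, and reading runq->ix through it is what faults at line 6664. A minimal C sketch of how such a pointer can be recognised before it is dereferenced follows. FakeRunQueue and both helper functions are hypothetical stand-ins for illustration; this is a debugging aid only, not code from ERTS and not the fix for this issue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ErtsRunQueue; only the field read in frame 0 is modelled. */
typedef struct {
    int ix;
} FakeRunQueue;

/* Returns 1 when every bit of the pointer is set, i.e. the pattern gdb
 * printed as 0xffffffffffffffff on the 64-bit host in the report. */
int is_all_ones_pointer(const void *p)
{
    return (uintptr_t)p == UINTPTR_MAX;
}

/* Reads ix only when the pointer is neither NULL nor all-ones.
 * Returns 0 on success and -1 when the pointer is rejected. */
int read_ix_checked(const FakeRunQueue *runq, int *ix_out)
{
    assert(ix_out != NULL);
    if (runq == NULL || is_all_ones_pointer(runq)) {
        return -1;
    }
    *ix_out = runq->ix;
    return 0;
}
```

A check like this only reports that the pointer is already invalid; it says nothing about why the run queue pointer ended up in that state.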

In a debug emulator, an assertion in erl_process.c:erts_set_schedulers_online fails instead and the emulator aborts:

(gdb) frame 2
#2  0x000000000079e370 in erl_assert_error (
    expr=0x884b58 "schdlr_sspnd.changer == ((Eterm)(((0) << 6) + ((0x0 << 4) | ((0x2 << 2) | 0x3))))",
    func=0x889a00 <__func__.27610> "erts_set_schedulers_online", file=0x881be7 "beam/erl_process.c", line=8003)
    at sys/unix/sys.c:956
956	    abort();
(gdb) up
#3  0x00000000004b46fb in erts_set_schedulers_online (p=0x7fa44aa040a8, plocks=1, new_no=5, old_no=0x7fa4481baaf0,
    dirty_only=0) at beam/erl_process.c:8003
8003		    ASSERT(schdlr_sspnd.changer == am_false);
(gdb) etp schdlr_sspnd.changer
true.
(gdb) etp schdlr_sspnd.chngq
#Cp<(nil)>.
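
The long expression in the assertion message is the preprocessor expansion of the right-hand side of the ASSERT at line 8003, that is, am_false. A minimal C sketch, assuming an unsigned long can stand in for Eterm and using hypothetical helper names, evaluates the expression as printed: the result is 0xb, with nothing above the low 6 bits.

```c
#include <assert.h>

/* Assumption: a word-sized unsigned integer stands in for ERTS's Eterm. */
typedef unsigned long Eterm;

/* The expression copied verbatim from the assertion message. */
Eterm asserted_value(void)
{
    return ((Eterm)(((0) << 6) + ((0x0 << 4) | ((0x2 << 2) | 0x3))));
}

/* Hypothetical helper: split a term into the bits above the low 6 bits
 * and the low 6 bits themselves, mirroring the "<< 6" in the expression. */
void split_term(Eterm t, Eterm *upper, Eterm *low6)
{
    *upper = t >> 6;
    *low6 = t & 0x3f;
}

/* Sanity check that can be called from a test harness or from main(). */
void check_asserted_value(void)
{
    Eterm upper, low6;
    split_term(asserted_value(), &upper, &low6);
    assert(asserted_value() == 0xb); /* (0x2 << 2) | 0x3 = 8 | 3 = 11 */
    assert(upper == 0);              /* the (0) << 6 part contributes nothing */
    assert(low6 == 0xb);
}
```

So the assertion expects changer to hold this constant, while etp shows it holding the atom true at the moment of the abort.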

To Reproduce
The following intermittently aborts (debug emulator) or segfaults (standard emulator) on a Linux x86_64 Broadwell or Skylake host under various OTP versions since 20. I can't reproduce it on an OS X Intel host:

$ erl -emu_type debug -eval "
Q = fun K () ->
      timer:sleep(rand:uniform(1000)),
      erlang:system_flag(schedulers_online,
                         rand:uniform(erlang:system_info(schedulers))),
      K()
    end,
[spawn_link(Q) || _ <- lists:seq(1,100)],
receive
after 10000 ->
    init:stop()
end."
Erlang/OTP 24 [RELEASE CANDIDATE 3] [erts-12.0] [source] [64-bit] [smp:24:24] [ds:24:24:10] [async-threads:1] [jit] [type-assertions] [debug-compiled] [lock-checking]

Eshell V12.0  (abort with ^G)
1> beam/erl_process.c:8003:erts_set_schedulers_online() Assertion failed: schdlr_sspnd.changer == ((Eterm)(((0) << 6) + ((0x0 << 4) | ((0x2 << 2) | 0x3))))
                                       Aborted (core dumped)

In runs where the standard emulator does not segfault, the Q processes seem to stop getting scheduled (I haven't checked whether they are deadlocked).

Expected behavior
The number of online schedulers is adjusted randomly for a while followed by a graceful emulator shutdown.

Affected versions
I couldn't get git bisect run working properly, so I attempted a manual bisect based on tags. I was able to reproduce starting with OTP-20.0-rc1, but not with OTP-19.3.6.9. Under OTP-24.0-rc3, the above aborts in the debug emulator but doesn't segfault in the standard emulator. In the standard emulator, the Q processes instead appear to become stuck (the number of schedulers online stops changing).

@zerth zerth added the bug Issue is reported as a bug label May 7, 2021
@rickard-green rickard-green self-assigned this May 7, 2021
@rickard-green rickard-green added the team:VM Assigned to OTP team VM label May 7, 2021

zerth commented Jun 7, 2021

@rickard-green: any suggestions as to next steps?

rickard-green commented

@zerth Sorry I haven't had the time to dig deeper into this yet. I'm hoping to be able to have a look at this next week.

rickard-green commented

@zerth I've made a fix in pull request #4980. It seems to fix your issue, but it has not been very well tested yet and I want to look more at the issue. The branch in the pull request is based on OTP-22.3.4, but should merge cleanly to any later OTP versions. Please test it.


zerth commented Jun 18, 2021

@rickard-green: thanks, it looks like #4980 resolves the issue!

With the patch applied to OTP-23.3.4.4 and OTP-24.0.2, the repro example above runs cleanly in both the standard and debug emulators. Without the patch, one of the following still occurs:

  1. schedulers_online value stops changing / processes get stuck (standard+debug)
  2. emulator aborts with assertion (debug)

Will look into enabling more extensive local testing.

rickard-green added a commit that referenced this issue Jun 19, 2021
…aint

* rickard/schedulers-online-fix/GH-4809/OTP-17500:
  Fix erlang:system_flag(schedulers_online, _)

rickard-green commented

@zerth I force-pushed some changes to PR #4980. It is essentially the same fix with some cosmetic changes, more asserts, and a new test case. This change will be included in the next OTP 24, 23, and 22 patches.

u3s pushed a commit that referenced this issue Jun 28, 2021
…aint-24

* rickard/schedulers-online-fix/GH-4809/OTP-17500:
  Fix erlang:system_flag(schedulers_online, _)

rickard-green commented

#4980 has been released in the OTP 24.0.3 patch now.

IngelaAndin pushed a commit that referenced this issue Jul 22, 2021
…aint-23

* rickard/schedulers-online-fix/GH-4809/OTP-17500:
  Fix erlang:system_flag(schedulers_online, _)
bjorng pushed a commit that referenced this issue Sep 3, 2021
…aint-22

* rickard/schedulers-online-fix/GH-4809/OTP-17500:
  Fix erlang:system_flag(schedulers_online, _)