erts_internal:await_result/1 can hang indefinitely #5359
Comments
The hanging call to […] If you are able to inspect the process that is hanging, please check the messages in the message queue of the hanging process and post the result here. What does the suffix […]
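For reference, inspecting the hanging process might look something like the following shell session. This is a sketch, not from the original thread: the pid is a placeholder for the stuck process, and `pid/3` is the standard shell helper.

```erlang
%% Attach a remote shell to the affected node, then (substitute the
%% pid of the process stuck in erts_internal:await_result/1):
1> Pid = pid(0, 1234, 0).
2> erlang:process_info(Pid, current_stacktrace).
3> erlang:process_info(Pid, messages).
```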
I'll add that note to our bug, though given that we've seen this issue in that call only once ever, I wouldn't hold out much hope that we'll ever have anything to report!
Apologies, looks like that's a versioning thing from the Erlang Solutions builds. We're using […]
This indeed looks like #5078 - trying 24.1 might resolve this.
Would #5078 also affect […]? Edit: Yes, I just can't read.
We'll give the upgrade a go. Will re-open if we find an example on the newer version. |
Describe the bug
We've been seeing sporadic hangs on our production nodes for some time, which we've traced down to `erts_internal:await_result/1` never returning -- twice in `erlang:cancel_timer/1` and once in `erlang:process_info(_, total_heap_size)`. We've upgraded through 24.0.3 to pull in OTP-17472 (#4932), and the issue has continued to manifest (i.e. it was neither caused, nor fixed, by this patch).

In all recorded cases, this has occurred inside our internal `poller` behaviour (which is just a light wrapper around `gen_server` to disallow resetting of `Timeout`, hence its manipulation of timers). The hang occurs in `erlang:cancel_timer/1` with a timer created using `erlang:start_timer(Time, self(), tick)`. Anecdotally, it may have also occurred inside that `start_timer` call (I can't find any solid proof of that; it seems like the example we have was a stack interpreted against updated code, so this is likely to be a red herring).

Processes in this state can be freely killed, but don't appear to be able to make progress otherwise.
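For context, a minimal sketch of the timer pattern described above (the module and names are hypothetical, not our actual `poller` code):

```erlang
-module(poller_sketch).
-behaviour(gen_server).

-export([start_link/1, stop/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% A gen_server that drives itself with erlang:start_timer/3 rather
%% than the gen_server Timeout, so the interval cannot be reset by
%% unrelated messages arriving in between ticks.
start_link(IntervalMs) ->
    gen_server:start_link(?MODULE, IntervalMs, []).

stop(Pid) ->
    gen_server:call(Pid, stop).

init(IntervalMs) ->
    TRef = erlang:start_timer(IntervalMs, self(), tick),
    {ok, #{interval => IntervalMs, tref => TRef}}.

handle_call(stop, _From, State = #{tref := TRef}) ->
    %% The synchronous cancel is where the hang was observed:
    %% erlang:cancel_timer/1 waits internally for the result.
    _ = erlang:cancel_timer(TRef),
    {stop, normal, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info({timeout, TRef, tick}, State = #{interval := I, tref := TRef}) ->
    %% Do the periodic work here, then re-arm the timer.
    NewTRef = erlang:start_timer(I, self(), tick),
    {noreply, State#{tref := NewTRef}};
handle_info(_Other, State) ->
    %% Ignore stale timer messages and anything else.
    {noreply, State}.
```

As an aside, `erlang:cancel_timer/2` accepts `{async, true}`, which turns the cancellation into an asynchronous request rather than a synchronous wait for the result; whether that actually sidesteps this hang depends on where the race lies, so treat it as a possible mitigation rather than a fix.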
No reproduction details, as this is noticed only about once every few months on a multi-node and fairly busy system. I suspect it's a very tight race.
Affected versions
Just listing those where the issue was observed:
- […] (hang in `erlang:cancel_timer/1`; we upgraded to 23.0.2-1 as a result of seeing an emulator error, in the hope that that would help).
- […] (one hang in `erlang:process_info(Pid, total_heap_size)` on 2021-09-25 and one in `erlang:cancel_timer/1` on 2021-11-09).

Not all events may have been noticed or reported.