erts_internal:await_result/1 can hang indefinitely #5359

Closed
bucko909 opened this issue Nov 9, 2021 · 5 comments

bucko909 commented Nov 9, 2021

Describe the bug

We've been seeing sporadic hangs on our production nodes for some time, which we've traced to erts_internal:await_result/1 never returning: twice in erlang:cancel_timer/1 and once in erlang:process_info(_, total_heap_size). We've upgraded through 24.0.3 to pull in OTP-17472 (#4932), and the issue has continued to manifest (i.e. it was neither caused nor fixed by this patch).

In all recorded cases, this has occurred inside our internal poller behaviour (which is just a light wrapper around gen_server to disallow resetting of Timeout, hence its manipulation of timers). The hang occurs in erlang:cancel_timer/1 with a timer created using erlang:start_timer(Time, self(), tick). Anecdotally, it may have also occurred inside that start_timer call (I can't find any solid proof of that; it seems like the example we have was a stack interpreted against updated code, so this is likely to be a red herring).
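For readers unfamiliar with the pattern, here is a minimal sketch of a gen_server that arms a timer with erlang:start_timer(Time, self(), tick) and cancels it with erlang:cancel_timer/1. It is not the internal poller behaviour described above; the module name `poller_sketch`, the `reset` cast, the `ticks/1` helper and the state record are all hypothetical and exist only for illustration.

```erlang
-module(poller_sketch).
-behaviour(gen_server).

-export([start_link/1, ticks/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical state: tick interval (ms), current timer ref, tick count.
-record(state, {interval, timer, ticks = 0}).

start_link(IntervalMs) ->
    gen_server:start_link(?MODULE, IntervalMs, []).

%% Returns how many ticks have fired so far.
ticks(Pid) ->
    gen_server:call(Pid, ticks).

init(IntervalMs) ->
    {ok, arm(#state{interval = IntervalMs})}.

handle_call(ticks, _From, State) ->
    {reply, State#state.ticks, State}.

%% Restart the interval: cancel the running timer, then arm a new one.
handle_cast(reset, State) ->
    {noreply, arm(disarm(State))}.

%% start_timer/3 delivers {timeout, TimerRef, Msg}; only the current
%% timer reference is accepted, stale timeouts are dropped.
handle_info({timeout, Ref, tick}, #state{timer = Ref, ticks = N} = State) ->
    {noreply, arm(State#state{ticks = N + 1})};
handle_info(_Stale, State) ->
    {noreply, State}.

arm(#state{interval = Ms} = State) ->
    State#state{timer = erlang:start_timer(Ms, self(), tick)}.

disarm(#state{timer = undefined} = State) ->
    State;
disarm(#state{timer = Ref} = State) ->
    %% Synchronous cancel: the call reported as hanging in this issue.
    _ = erlang:cancel_timer(Ref),
    State#state{timer = undefined}.
```

Calling `gen_server:cast(Pid, reset)` on a process started with `poller_sketch:start_link(1000)` exercises the cancel-then-rearm path.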

Processes in this state can be freely killed, but don't appear to be able to make progress otherwise.

No reproduction details, as this occurs noticeably about once every few months on a multi-node and fairly busy system. I suspect it's a very tight race.

Affected versions

Just listing those where the issue was observed:

  • 22.2.6-1 (running between 2020-02-10 and 2020-06-03 -- one report on 2020-06-03 in erlang:cancel_timer/1; we upgraded to 23.0.2-1 as a result of seeing an emulator error in the hope that that would help).
  • 24.0.5-1 (running since 2021-08-02 -- one report in erlang:process_info(Pid, total_heap_size) on 2021-09-25 and one in erlang:cancel_timer/1 on 2021-11-09).

Not all events may have been noticed or reported.

@bucko909 bucko909 added the bug Issue is reported as a bug label Nov 9, 2021
@rickard-green rickard-green self-assigned this Nov 9, 2021
@rickard-green rickard-green added the team:VM Assigned to OTP team VM label Nov 9, 2021
@rickard-green

The hanging call to process_info() could be caused by this bug fixed in OTP-24.0.6:


  OTP-17548    Application(s): erts
               Related Id(s): OTP-10391, PR-5078

               A call to the process_info() BIF could end up hanging
               for ever due to a bug introduced when the new selective
               receive optimization was introduced in OTP 24.0. Note
               that this bug only effects process_info().

If you are able to inspect the process that is hanging, please check messages in the message queue of the hanging process and post the result here.
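One possible way to collect that information is sketched below. The module name `hang_inspect` and the function `snapshot/2` are hypothetical; the process_info/2 items used (status, current_function, current_stacktrace, message_queue_len, messages) are standard. Because the release note above says process_info() itself could hang on affected releases, the sketch makes the call from a throwaway helper process and gives up after a timeout rather than blocking the caller.

```erlang
-module(hang_inspect).
-export([snapshot/2]).

%% Collects the message queue of Pid plus some context, without
%% risking that the calling process itself blocks forever.
snapshot(Pid, TimeoutMs) when is_pid(Pid) ->
    Items = [status, current_function, current_stacktrace,
             message_queue_len, messages],
    Parent = self(),
    Tag = make_ref(),
    {Helper, MRef} = spawn_monitor(fun() ->
        Parent ! {Tag, erlang:process_info(Pid, Items)}
    end),
    receive
        {Tag, Info} ->
            %% Info is a list of {Item, Value}, or undefined if Pid is dead.
            erlang:demonitor(MRef, [flush]),
            {ok, Info};
        {'DOWN', MRef, process, Helper, Reason} ->
            {error, Reason}
    after TimeoutMs ->
        %% The helper may be stuck in process_info/2; kill it and move on.
        exit(Helper, kill),
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.
```

From a remote shell attached to the affected node, `hang_inspect:snapshot(Pid, 5000)` would return either `{ok, Info}` or `{error, timeout}`.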

What does the suffix -1 in the version numbers mean? We never assign version numbers like that.

bucko909 commented Nov 9, 2021

I'll add that note to our bug, though given that we've only ever seen this issue in that call once, I wouldn't hold out much hope that we'll ever have anything to report!

> What does the suffix -1 in the version numbers mean? We never assign version numbers like that.

Apologies, looks like that's a versioning thing from the Erlang Solutions builds. We're using erlang-base=1:24.0.5-1 for example. It's probably internal packaging differences -- or just to allow for them -- so I imagine the base code is the same as the -1-stripped version.

max-au commented Nov 11, 2021

This indeed looks like #5078 - trying 24.1 might resolve this.

bucko909 commented Nov 11, 2021

Would #5078 also affect cancel_timer (which is the majority of examples we have)?

Edit: Yes, I just can't read.

@bucko909

We'll give the upgrade a go. Will re-open if we find an example on the newer version.
