timer:apply_after(...) timers may never be executed if there is a problem spawning a process #7606
Comments
May I ask what you need this for? It sounds like you are considering it normal to hit the process limit, and that sounds wrong to me. If you do hit the process limit, a non-firing timer is likely to be one of your lesser problems. Processes are spawned all the time all over the place, and it is universally assumed that "spawn (on the same node) just works".
The timer was first refactored in #4811, which made it into OTP 25. But yes, the previous implementation worked the same way in this regard. |
Hi @juhlig, in my use case we do not expect to hit the process limit. A separate issue caused us to hit the system process limit, and that issue was resolved while the system was running. My expectation was that the system would fully recover. However, code relying on timer:apply_after(...) calls did not work as expected, as the functions were never run. As far as I can tell, the rest of the system recovered, but I did not have access to the running environment, only a set of logs. Given Erlang's claim to have built-in support for fault-tolerance, I would have expected Erlang itself to tolerate hitting the process limit and for the system to recover. I appreciate that hitting the system limit is certainly not normal, and again, it is not expected in our application, but it does not feel correct to me that a running Erlang node can hit a condition that cannot be recovered from even after the condition is removed, particularly as the default behaviour for Erlang is to set a process limit.
I appreciate that this assumption exists, and I don't necessarily expect that it will change with this issue, but we hit this in the application that I work on and thought it worth sharing. The issue I hit existed solely within my application code due to the use of timer:apply_after(...). It would be quite interesting to see if the Erlang runtime system itself may fail to recover if it hits the system process limit. If it does fail to recover, I suspect that the only recourse at the moment would be to restart the (OS) process.
The important words here are "support for" and "-tolerance".
And for good reason, as otherwise the risk would be to render a system completely clogged and unresponsive or inaccessible, or to hit an OS or even machine limit which is likely much harsher as to its consequences. To be sure, I'm not trying to lecture you here, I'm sure you know and understand ;) |
@BenHuddleston @IngelaAndin @rickard-green @bjorng: As for timers that are guaranteed to do the apply eventually, even when a spawn at the appointed time fails, my thinking is this.

Can we do it? I suppose so, but as we have no way (that I'm aware of) to be notified when the process count falls below the limit, the IMO best way would be to simply try and retry spawning until it eventually succeeds. This should not block other timers from firing, though, so either we reschedule the timer with a small offset, or we do it in a separate process. Given that starting such a process ad hoc is not possible at that moment, we would need a side process pre-spawned when the timer server starts, for example something like this:

```erlang
init(_) ->
    ...
    LateSpawnLoop = spawn_link(fun() -> late_spawn_loop(queue:new()) end),
    ....

late_spawn_loop(Pending0) ->
    {Pending1, Timeout} =
        case queue:out(Pending0) of
            {{value, {_, {M, F, A}}}, Rest} ->
                try spawn(M, F, A) of
                    _ -> {Rest, 0}
                catch
                    %% Still at the process limit: keep the entry and retry.
                    error:system_limit -> {Pending0, 0};
                    %% Any other failure: drop the entry.
                    _:_ -> {Rest, 0}
                end;
            %% queue:out/1 returns {empty, Q} for an empty queue.
            {empty, _} ->
                {Pending0, infinity}
        end,
    Pending2 =
        receive
            {cancel, CancelTRef} ->
                queue:delete_with(fun({TRef, _}) -> TRef =:= CancelTRef end, Pending1);
            {enqueue, Timer} ->
                queue:in(Timer, Pending1)
        after Timeout ->
            Pending1
        end,
    late_spawn_loop(Pending2).
```

(Only a rough draft of course. Especially the timeout after a failed spawn should probably not be `0`. If the normal ...)

Should we do it? Probably not, at least not unconditionally, feasibility of the approach outlined above aside. If the system runs at the process limit for a while, the pending spawns could build up and be executed in a sudden surge once the process count abates, which in turn may well drive the system up to the process limit again. In essence, the system could become constantly clogged. Also, it may not even be wanted by users, i.e. in their setup a non-firing timer may be better than a (very) late-firing one.
I agree that not being able to recover is a problem. I would expect a failing |
By "the calling process" you mean the process that called |
The process that called I'm surprised this isn't the case, I thought it was already like this. |
The `timer` module would silently ignore the failure to spawn a process when a timer had expired and it was time to spawn the process. Since there is not really anything that can be done in this situation, be sure to terminate the system so that the problem is noticed. Note that other parts of OTP don't attempt to cope with failure to spawn processes. Fixes erlang#7606
Hm, those are two separate things I guess ^^; For one, But since you mention supervisors, I think you are alluding to this passage in the OP?
This part has been removed in 44d5789. What it did was have the supervisor set a zero-timeout timer to call a function which did nothing but send a message back to the supervisor. That was pretty pointless, as the supervisor could just send the message to itself directly. I dug up the surrounding PR, #1001, and even the original author didn't remember what he was thinking at the time.
I misunderstood the problem, thought it was about starting the timer process, but the PR with the fix makes me understand now. |
👍
Though I somehow doubt that crashing the entire node is what @BenHuddleston expected as a solution XD |
After discussing potential solutions with @rickard-green, we reached the conclusion that the only sensible thing to do is to terminate the runtime system if the limit for the number of processes has been reached. The linked pull request does exactly that.
Thanks all for looking at this.
It is certainly not what I expected when I raised this issue, but it did cross my mind when you mentioned the assumption that "spawn (on the same node) just works" and this seems reasonable to me :) |
…8759 timer: Don't silently ignore errors when spawning processes
Describe the bug
timer:apply_after(...) timers may never be executed if there is a problem spawning a process.
To Reproduce
Expected behavior
The timer eventually fires when the issue causing us to hit the spawn error (system limit) is removed, i.e:
Affected versions
All, as far as I am aware. The code was refactored in Erlang/OTP 26 - f9460ed - but the previous code looks to hit the same issue.
Prior to Erlang/OTP 19 only:
Quite interestingly, timer:apply_after(...) was used in supervisor.erl prior to Erlang/OTP 19 if a restart of a process failed. Were a user to hit the process system_limit, and a process were to crash, then the supervisor would attempt to restart it and that would fail. It would set up the timer, which would also fail to spawn a process, and the child would never be restarted. The timer:apply_after(...) call was removed in this change - 44d5789.
Additional context
I've only been able to hit this after hitting the max process limit. I'm unsure if there are other scenarios in which spawn may fail. Perhaps inability to allocate more memory?
Run test (run erl with +P 1024 to lower the max process count or this may take a while):
Test Code:
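A minimal sketch of such a test, written for illustration and not the reporter's original code. The module name `timer_limit_repro` and the helpers `notify/1` and `fill/1` are hypothetical. It assumes the node was started with a low limit (for example `erl +P 1024`) so that the process table fills before the 500 ms timer is due, and that a local `spawn` at the limit raises `error:system_limit`.

```erlang
-module(timer_limit_repro).
-export([run/0, notify/1]).

%% Called by the timer; tells the test process that the timer fired.
notify(Parent) ->
    Parent ! fired,
    ok.

run() ->
    %% Start the timer server up front so it exists before the limit is hit.
    ok = timer:start(),
    {ok, _TRef} = timer:apply_after(500, ?MODULE, notify, [self()]),
    %% Exhaust the process table, then stay at the limit past the due time.
    Pids = fill([]),
    timer:sleep(1000),
    %% Remove the cause of the spawn failures.
    [exit(Pid, kill) || Pid <- Pids],
    %% Give a recovered timer a chance to fire late.
    receive
        fired -> fired
    after 2000 ->
        not_fired
    end.

%% Spawn idle processes until the runtime refuses to create more.
fill(Acc) ->
    try spawn(fun() -> receive stop -> ok end end) of
        Pid -> fill([Pid | Acc])
    catch
        error:system_limit -> Acc
    end.
```

Going by the discussion in this thread, `timer_limit_repro:run()` would be expected to return `not_fired` on affected releases, while a release containing the linked fix would be expected to terminate the runtime system instead. Neither outcome has been verified here.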