Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

timer:apply_after(...) timers may never be executed if there is a problem spawning a process #7606

Closed
BenHuddleston opened this issue Aug 30, 2023 · 12 comments · Fixed by #7645
Assignees
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments

@BenHuddleston
Copy link

BenHuddleston commented Aug 30, 2023

Describe the bug
timer:apply_after(...) timers may never be executed if there is a problem spawning a process.

To Reproduce

  1. Hit the max process system limit.
  2. Setup a timer via timer:apply_after(...)
  3. Wait for timer to fire (it never does as it cannot spawn a process to run the function)

Expected behavior
The timer eventually fires when the issue causing us to hit the spawn error (system limit) is removed, i.e:

  1. Hit the max process system limit
  2. Setup a timer via timer:apply_after(...)
  3. Wait for some time for timer to fire (it will not)
  4. Fix max process system limit issue (kill some processes)
  5. Timer fires

Affected versions
All, as far as I am aware. The code was refactored in Erlang/OTP 26 - f9460ed - but the previous code looks to hit the same issue.

Prior to Erlang/OTP 19 only:
Quite interestingly, timer:apply_after(...) was used in supervisor.erl prior to Erlang/OTP 19 if a restart of a process failed. Were a user to hit the process system_limit, and a process were to crash, then the supervisor would attempt to restart it and that would fail. It would setup the timer which would also fail to spawn a process and the child would never be restarted. The timer:apply_after(...) call was removed in this change - 44d5789.

Additional context

I've only been able to hit this after hitting the max process limit, I'm unsure if there are other scenarios in which spawn may fail. Perhaps inability to allocate more memory?

Run test (run erl with +P 1024 to lower the max process count or this may take a while):

$ erl +P 1024
Erlang/OTP 26 [erts-14.0.2] [source] [64-bit] [smp:10:10] [ds:10:10:10] [async-threads:1] [jit] [dtrace]

Eshell V14.0.2 (press Ctrl+G to abort, type help(). for help)
1> compile:file(timer_test), l(timer_test), timer_test:simple_test().

Test Code:

-module(timer_test).

-include_lib("eunit/include/eunit.hrl").

-export([init/1, proc_wait/0, send_message/2]).

send_message(Pid, Message) ->
  Pid ! Message,
  ok.

proc_wait() ->
  Pid = spawn_link(fun () -> receive stop -> ok end end),
  {ok, Pid}.

init([]) ->
  {ok, {{one_for_one, 3, 10}, []}}.

simple_test() ->
  %% Sanity test, we can setup a timer, it will fire, a message will be sent to us and we can process it
  timer:apply_after(1, timer_test, send_message, [self(), sanity]),
  receive
    sanity -> ok
  end,

  %% Spin up a bunch of waiting processes to hit the process limit
  {ok, MaxProcSup} = supervisor:start_link(timer_test, []),
  lists:foreach(
    fun (X) ->
      supervisor:start_child(MaxProcSup,
                             #{id => X,
                               start => {timer_test, proc_wait, []},
                               shutdown => brutal_kill, restart => temporary})
    end,
    lists:seq(1, erlang:system_info(process_limit))),

  %% Setup our timer and wait for the message
  timer:apply_after(2, timer_test, send_message, [self(), test_message]),

  %% We expect to get hit the after clause after 10s as we the timer doesn't properly fire
  receive
    test_message ->
      %% Never expect to hit this
      ?assert(false)
  after 5000 ->
    %% Remove all the processes causing us to hit the system limit
    ChildProcesses = supervisor:which_children(MaxProcSup),
    lists:foreach(
      fun({_, ChildPid, _,_}) ->
        ChildPid ! stop
      end, ChildProcesses),

    %% New timer spawn sanity test
    timer:apply_after(1, timer_test, send_message, [self(), sanity2]),
    receive
      sanity2 -> ok
    end,

    receive
      test_message -> ok
    after 5000 ->
      %% Still not seen the original timer fire, fail the test
      ?assert(false)
    end
  end.
@BenHuddleston BenHuddleston added the bug Issue is reported as a bug label Aug 30, 2023
@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Sep 1, 2023
@juhlig
Copy link
Contributor

juhlig commented Sep 7, 2023

May I ask what you need this for? It sounds like you are considering it normal to hit the process limit, and that sounds wrong to me. If you do hit the process limit, a non-firing timer is likely to be one of your lesser problems. Processes are spawned all the time all over the place, and it is universally assumed that "spawn (on the same node) just works", there are usually no defenses against it being otherwise. In other words, if it doesn't work, something is Seriously Wrong and all hell will break loose.

The code was refactored in Erlang/OTP 26 - f9460ed - but the previous code looks to hit the same issue.

The timer was first refactored in #4811, which made it into OTP 25. But yes, the previous implementation worked the same way in this regard.

@BenHuddleston
Copy link
Author

May I ask what you need this for? It sounds like you are considering it normal to hit the process limit, and that sounds wrong to me. If you do hit the process limit, a non-firing timer is likely to be one of your lesser problems

Hi @juhlig, in my use case we do not expect to hit the process limit. A separate issue caused us to hit the system process limit that was resolved while the system was running. My expectation was that the system would fully recover, however, code relying on timer:apply_after(...) calls did not work as expected as the functions were never ran. As far as I can tell, the rest of the system recovered, but I did not have access to the running environment, only a set of logs.

Erlang is a general-purpose programming language with built-in support for concurrency, distribution and fault tolerance.

Given Erlang's claim to have built-in support for fault-tolerance I would have expected Erlang itself to tolerate hitting the process limit and for the system to recover. I appreciate that hitting the system limit is certainly not normal, and again, it is not expected in our application, but it does not feel correct to me that a running Erlang node can hit such a condition, that, when removed, cannot be recovered from, particularly as the default behaviour for Erlang is to set a process a limit.

Processes are spawned all the time all over the place, and it is universally assumed that "spawn (on the same node) just works", there are usually no defenses against it being otherwise.

I appreciate that this assumption exists, and I don't necessarily expect that it will change with this issue, but we hit this in the application that I work on and thought it worth sharing.

This issue I hit existed solely within my application code due to the use of timer:apply_after(...). It would be be quite interesting to see if the Erlang runtime system itself may fail to recover if it hits the system process limit. If it does fail to recover, I suspect that the only recourse at the moment would be to restart the (OS) process.

@juhlig
Copy link
Contributor

juhlig commented Sep 11, 2023

Given Erlang's claim to have built-in support for fault-tolerance

The important words here are "support for" and "-tolerance".
Just using Erlang doesn't make just anything you write fault tolerant, it just gives you many tools to build such systems. In the same sense, the fact that Erlang has built-in support for concurrency and distribution doesn't automatically make whatever you write in Erlang run concurrent and distributed, it just gives you the tools to write systems that do.
Also, fault-tolerant doesn't mean fault-proof, just as a house may be called tolerant to certain sizes of certain natural disasters that would be otherwise devastating to humans without a house to shelter in, but it is not proof against any size of any disaster.

particularly as the default behaviour for Erlang is to set a process a limit

And for good reason, as otherwise the risk would be to render a system completely clogged and unresponsive or inaccessible, or to hit an OS or even machine limit which is likely much harsher as to its consequences.

To be sure, I'm not trying to lecture you here, I'm sure you know and understand ;)

@juhlig
Copy link
Contributor

juhlig commented Sep 11, 2023

@BenHuddleston @IngelaAndin @rickard-green @bjorng:

As for timers that are guaranteed to do the apply eventually, even in the condition when a spawn at the appointed times fails, my thinking is this.

Can we do it? I suppose so, but as we have no way (that I'm aware of) to be notified when the process limit falls below the limit, the IMO best way would be to simply try and retry spawning until it eventually succeeds. This should not block other timers from firing, though, so either we reschedule the timer with a small offset, or we do it in a separate process. Given that starting such a process ad hoc is at that moment not possible, we would need a side process pre-spawned when the timer server starts for that, for example something like this:

init(_) ->
    ...
    LateSpawnLoop = spawn_link(fun() -> late_spawn_loop(queue:new()) end),
    ....

late_spawn_loop(Pending0) ->
    {Pending1, Timeout} = case queue:out(Pending0) of
                              {{value, {_, {M, F, A}}}, Pending1} ->
                                  try spawn(M, F, A) of
                                      _ -> {Pending1, 0}
                                      catch
                                          exit:system_limit -> {Pending0, 0};
                                          _:_ -> {Pending1, 0}
                                  end;
                              empty -> {Pending0, infinity}
                          end,
    Pending2 = receive
                   {cancel, CancelTRef} -> queue:delete_with(fun({TRef, _}) -> TRef =:= CancelTRef end, Pending1);
                   {enqueue, Timer} -> queue:in(Timer, Pending1)
               after Timeout -> Pending1
               end,
    late_spawn_loop(Pending2).

(Only a rough draft of course. Especially the timeout after a failed spawn should probably be not 0.)

If the normal spawn the timer server does itself fails, it could send the timer off to this side process, to put on constant retrial until it succeeds.

Should we do it? Probably not, at least not unconditionally, feasibility of the approach outlined above aside. If the condition of the system running at the process limit lasts for a while, the pending spawns could build up, and be executed in a sudden surge once the process count abates, which in turn may as well drive the system up to the process limit again. In essence, the system could become constantly clogged. Also, it may not even be wanted by users, ie in their setup a non-firing timer may be better than a (very) late-firing one.
So IMO, if we are to implement this, it should be done as an additional, optional, non-default feature in an own, separate set of apply_after-like functions.

@essen
Copy link
Contributor

essen commented Sep 11, 2023

I agree that not being able to recover is a problem. I would expect a failing spawn to trigger an exception in the calling process so that it can be handled. Silent failure can only lead to further issues down the line.

@juhlig
Copy link
Contributor

juhlig commented Sep 11, 2023

I agree that not being able to recover is a problem. I would expect a failing spawn to trigger an exception in the calling process so that it can be handled. Silent failure can only lead to further issues down the line.

By "the calling process" you mean the process that called timer:apply_after?

@essen
Copy link
Contributor

essen commented Sep 11, 2023

The process that called spawn or spawn_link. So in the case of timer:apply_after the supervisor will fail to start the child and propagate an error which should be propagated back to the caller of timer:apply_after so they know the timer was not set.

I'm surprised this isn't the case, I thought it was already like this.

bjorng added a commit to bjorng/otp that referenced this issue Sep 11, 2023
The `timer` module would silently ignore the failure to a spawn a
process when a timer had expired and it was time to spawn process.

Since there is not really anything that can be done in this situation,
be sure to terminate the system so that the problem is noticed.

Fixes erlang#7606
bjorng added a commit to bjorng/otp that referenced this issue Sep 11, 2023
The `timer` module would silently ignore the failure to a spawn a
process when a timer had expired and it was time to spawn process.

Since there is not really anything that can be done in this situation,
be sure to terminate the system so that the problem is noticed. Note
that that other parts of OTP don't attempt to cope with failure to
spawn processes.

Fixes erlang#7606
@juhlig
Copy link
Contributor

juhlig commented Sep 11, 2023

Hm, those are two separate things I guess ^^;

For one, timer:apply_after sends a message to the timer_server process, which will tl;dr send a timed message to itself via erlang:start_timer. When it receives that message later, it calls spawn(M, F, Args) (in a catch or try ... catch, I guess this is to defend against invalid MFArgs, not against hitting the process limit). Ie the process doing the spawn is the timer server, not the process calling timer:apply_after. Setting the timer works, executing it is what fails.

But since you mention supervisors, I think you are alluding to this passage in the OP?

Quite interestingly, timer:apply_after(...) was used in supervisor.erl prior to Erlang/OTP 19 if a restart of a process failed...

This part has been removed in 44d5789. What it did was the supervisor setting a zero-timeout timer to call a function which did nothing but send a message back to the supervisor, which was pretty pointless, the supervisor could just send the message to itself directly. I dug up the surrounding PR, #1001, even the original author didn't remember what he was thinking at the time.

@bjorng bjorng linked a pull request Sep 11, 2023 that will close this issue
@essen
Copy link
Contributor

essen commented Sep 11, 2023

I misunderstood the problem, thought it was about starting the timer process, but the PR with the fix makes me understand now.

@juhlig
Copy link
Contributor

juhlig commented Sep 11, 2023

I misunderstood the problem, thought it was about starting the timer process

👍

but the PR with the fix makes me understand now.

Though I somehow doubt that crashing the entire node is what @BenHuddleston expected as a solution XD

@bjorng
Copy link
Contributor

bjorng commented Sep 11, 2023

After discussing potential solutions with @rickard-green, we reached the conclusion that the only sensible thing to do is to terminate the runtime system if it the limit for the number of processes has been reached. The linked pull request does exactly that.

@BenHuddleston
Copy link
Author

Thanks all for looking at this.

@juhlig,

Though I somehow doubt that crashing the entire node is what @BenHuddleston expected as a solution XD

It is certainly not what I expected when I raised this issue, but it did cross my mind when you mentioned the assumption that "spawn (on the same node) just works" and this seems reasonable to me :)

bjorng added a commit that referenced this issue Sep 12, 2023
…8759

timer: Don't silently ignore errors when spawning processes
@bjorng bjorng closed this as completed in 10c49bc Sep 12, 2023
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants