Implement synchronous init failure for gen processes #6843

RaimoNiskanen · 2023-02-10T10:31:47Z

Implement a function proc_lib:init_fail/2,3 to call from proc_lib started processes when they fail to initialize instead of proc_lib:init_ack/1,2. The new function sends a {nack,,} to the spawner that then uses a process monitor on the spawned process to await the monitor 'DOWN' message before returning a failure to the spawner.
This avoids a race when using proc_lib:init_ack/1,2 which
allows the spawner to get and return the failure return value, and restart a new server instance while the old instance not yet has been scheduled for exit and cleanup. The old server instance may hold a VM resource such as a named ETS table that the new server instance fails to create.
Use the new init fail function in Kernel and Stdlib.

josevalim

Beautiful. 💯

RaimoNiskanen · 2023-02-10T11:12:13Z

I propose a function proc_lib:init_fail/2,3 that besides sending a {nack,,} also exits the process, so the proc_lib code can rely on the spawned process exiting and use infinity when waiting for the monitor {'DOWN',,,,}.

Documentation is missing.
A test case that verifies that cyclic restarts of a server frees resources as expected is missing.
Updating all OTP code outside Kernel and Stdlib remains to do.

Interesting question

We maybe should take the opportunity to change the behaviour of purging {'EXIT',,} messages from the parent's message queue. Now it is quite unpredictable (I did not change that to prevent test cases in Stdlib; application_SUITE from failing).

Example: proc_lib:start_link/*

The init() function ends in proc_lib:init_fail/2,3 and returns an error back to the parent calling proc_lib:start_link/*, but the {'EXIT', Server, Reason} remains in the parent's message queue.
The init() function crashes, proc_lib gets the {'EXIT', Server, Reason} message and returns {error,Reason}. No {'EXIT',,} message remains in the parent's message queue.
The init() function times out. proc_lib kills it and flushes the {'EXIT',,} message, so it does not remain in the parent's message queue.

There is no way for the parent calling proc_lib:start_link/* to see from the return value if there is an {'EXIT',,} message lingering in the message queue.

proc_lib:start_monitor/* with a link option behaves roughly the same, except that a crashing init() function leaves an {'EXIT',,} message in the parent's message queue. The same goes for proc_lib:start/* with a link` option.

If I understood the code, that is...

I think that a logical behaviour would be to never leave an {'EXIT',,} message in the parent's message queue if the return value from proc_lib:spawn*/* that creates a link indicates that the server failed to start, regardless of if init() crashes, returns a failure, or times out.

Changing this would be subtly backwards incompatible, though.

Would it be worth it to do that in OTP-26.0?

github-actions · 2023-02-10T11:20:07Z

CT Test Results

      3 files   142 suites 1h 38m 36s ⏱️
3 193 tests 2 921 ✔️ 269 💤 3 ❌
3 695 runs 3 376 ✔️ 316 💤 3 ❌

For more details on these failures, see this check.

Results for commit 381648e.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

josevalim · 2023-02-10T11:27:15Z

The init() function ends in proc_lib:init_fail/2,3 and returns an error back to the parent calling proc_lib:start_link/*, but the {'EXIT', Server, Reason} remains in the parent's message queue.

You could argue for both cases.

A crash during init does not leak messages, per your account, but the current error replies do, which means any code that is calling proc_lib already has to take into account EXIT messages being there or not
Even with this PR, it is still totally fine for a process to reply to init with {error, _} and then continue to go do some work, effectively exiting only later (which means EXIT messages will still come later)

In other words, I think it will always be unpredictable?

I would say that never removing EXIT from the message queue would be the predictable thing to do but it is also annoying and it causes the caller to do more work if you leave them there.

RaimoNiskanen · 2023-02-10T12:58:18Z

A crash during init() does not leak a message for start_link, but for start and start_monitor when the link option is used, and we can change that. I think the current difference is annoying.

Current error replies (proc_lib:init_ack(SomeError)) leaks a message, as does my new proc_lib:init_fail(Return, Exception), but the new one can be changed to not do that.

A server that returns an error via proc_lib:init_ack/1,2 and then goes on running can be classified as broken.

We can achieve a point where servers that are not broken (use init_ack/1,2 and init_fail/2,3) in a correct way never leaks an exit message. All gen_* servers would be not broken since we can fix them.

josevalim · 2023-02-10T13:09:15Z

A server that returns an error via proc_lib:init_ack/1,2 and then goes on running can be classified as broken.

It does not necessarily need to run but it may want to do some clean up that is expensive and therefore does not want to block the caller. Would that be a possibility?

RaimoNiskanen · 2023-02-10T13:22:32Z

Then the caller cannot know when the server can be restarted, and the link may fire any time after returning the error. Wouldn't this be a special case of server start so you should return something like {error, Pid, hold_on}, which from proc_lib's point of view would be a successful server start followed by a controlled delayed exit.

josevalim · 2023-02-10T13:26:35Z

I see, thanks! So in that case I agree that we could always guarantee EXIT to be removed and also make it predictable (with less work on the caller). 👍

KennethL · 2023-02-13T08:01:29Z

lib/stdlib/src/gen_server.erl

+	    proc_lib:init_ack(Starter, {ok, self()}),
+	    loop(
+              Parent, Name, State, CbCache, infinity,
+              HibernateAfterTimeout, Debug);
+        {ok, {ok, State, TimeoutOrHibernate}}
          when ?is_timeout(TimeoutOrHibernate);
               TimeoutOrHibernate =:= hibernate ->
-	    proc_lib:init_ack(Starter, {ok, self()}), 	    
-	    loop(Parent, Name, State, CbCache, TimeoutOrHibernate, HibernateAfterTimeout, Debug);
+	    proc_lib:init_ack(Starter, {ok, self()}),
+	    loop(
+              Parent, Name, State, CbCache, TimeoutOrHibernate,
+              HibernateAfterTimeout, Debug);
 	{ok, {ok, State, {continue, _}=Continue}} ->
-	    proc_lib:init_ack(Starter, {ok, self()}), 	    
-	    loop(Parent, Name, State, CbCache, Continue, HibernateAfterTimeout, Debug);
+	    proc_lib:init_ack(Starter, {ok, self()}),
+	    loop(
+              Parent, Name, State, CbCache, Continue,
+              HibernateAfterTimeout, Debug);


What is the difference? I can only see changes in indentation and line breaks.

Yes. Those are only indentation and line break changes.
There were trailing whitespaces and lines longer than 80, which annoyed me...

RaimoNiskanen · 2023-02-13T15:36:34Z

To summarize how the proc_lib:start*/* functions behave:

If the init() function calls proc_lib:ack(Value):
-> Value is returned by all start functions.

Note: When this is used to indicate server start failure:
1. there is no synchronization with the server process exit so if the exiting server holds an unique VM resource a restart of the server may fail to allocate the resource.
2. if the caller traps exits and a start function that creates a link is used, the failing server will cause an {'EXIT',_,_} link message some time after Value is returned.
If the init() function times out as in does not call the ack function above and does not exit; when the start function Timeout expires (which may be never), the started server is killed with exit(Pid, kill):
-> If the caller doesn't trap exits and and a start function that sets a process link was use then the caller is killed by the link with reason killed.
-> Otherwise {error, timeout} is returned.
Note: Only start_link waits for the started server to exit before returning, so for other start functions that set a process link some code may be executed before the link kills the caller, or when no link is set a server restart may fail if the exiting server holds an unique VM resource needed by the new server instance.

Currently

If the init() function exits with reason Reason (without calling the ack function, before the start function Timeout expires):
-> If the caller doesn't trap exits, a start function that sets a process link was used, and Reason =/= normal, then the caller is killed by the process link with reason Reason.
-> If the caller doesn't trap exits, start_link was used and Reason =:= normal the caller hangs until start_link Timeout expires (which may be never) and then returns {error, timeout}.
-> Otherwise {error, Reason} is returned.
Note: If a start function that sets a link other than start_link was used, then there is an {'EXIT',_,_} message lingering in the caller's message queue.

Proposed change

If the init() function exits with reason Reason (without calling the ack function, before the start function Timeout expires):
-> If the caller doesn't trap exits, a start function that sets a process link was used, and Reason =/= normal then the caller is killed by the process link with reason Reason.
-> Otherwise {error, Reason} is returned.
There is never a lingering {'EXIT',_,_} message.
The init() function should only call proc_lib:init_ack(Value) to indicate a successful server start. There is no change in how it behaves, only in how it should be used.
If the init() function times out, all start functions wait for the server to exit before returning.
If the init() function calls the new function proc_lib:init_fail(Value, Reason):
-> If the caller doesn't trap exits, a start function that sets a process link was used, and Reason =/= normal then the caller is killed by the link with reason Reason.
-> Otherwise the start function returns Value, when the failed server process has exited.
Note: The changed gen_* modules (including supervisor) that use this new function will no longer leave an {'EXIT',_,_} message in the caller's message queue when a start function that sets a process link is used.

RaimoNiskanen · 2023-02-14T14:46:04Z

Sorry for my brainfarting, but I am trying to clarify this to me and for an OTP Technical Board soon to be, where I want to discuss if we dare to clean up the exit message behaviour for the major release OTP-26.0.

Summary of the summary

A proc_lib:start*/* function that sets a link will, of course, when the server exits, be it during init/1 or any time later, trigger the link, which will kill a caller that does not trap exits (unless the exit reason is normal which does not kill the caller). This is fundamental link behaviour and not something we can or will change for these start functions.

The second commit (first after the one that introduces test cases) changes how synchronous the start functions are:

If the init/1 function does not produce a reply within the start function Timeout the started process is killed with exit(Pid, kill). The code is changed to wait for the started process to exit before returning from the start function.
A new function proc_lib:init_fail/2,3 is introduced and used from all gen_* modules which facilitates the proc_lib start functions to in the same way wait for the started process to exit before returning from the start function.
A bug is fixed for when the caller of proc_lib:start_link/* doesn't trap exits and the init/1 function misbehaves in that it simply exits withnormal without calling proc_lib:init_ack/* nor proc_lib:init_fail/*, then the caller will not detect the exit and wait for the start Timeout which may be infinity... The bug is fixed by using a process monitor also forproc_lib:start_link/*.

These changes, I think, can be regarded as bug fixes, so we should simply do them.

The third commit changes when {'EXIT',_,_} messages delivered to the caller due to a process link are consumed (only applicable when the caller traps exits, and a start function that sets a process link is used). Without this change:

Exit messages are consumed when the init/1 function times out,
and when the init/1 function exits if proc_lib:start_link/* is used.
Exit messages are thereby not consumed when a server calls proc_lib:init_fail/2,3 or the no longer recommended to use when failing in init/1 - proc_lib:init_ack/1,2,
or when the init/1 function exits and proc_lib:start/* or proc_lib:start_monitor/* with the 'link' option is used.

The change is to always consume the {'EXIT',_,_} message however the init/1 function fails to start the server (provided that the failure indication is produced with proc_lib:init_fail/* and not with proc_lib:init_ack/*).

RaimoNiskanen · 2023-02-15T09:55:49Z

Added documentation

* Implement a function `proc_lib:init_fail/2,3` to call from `proc_lib` started processes when they fail to initialize instead of `proc_lib:init_ack/1,2`. The new function sends a `{nack,,}` to the spawner that then uses a process monitor on the spawned process to await the monitor `'DOWN'` message before returning a failure to the spawner. This avoids a race when using `proc_lib:init_ack/1,2` which allows the spawner to get and return the failure return value, and restart a new server instance while the old instance not yet has been scheduled for exit and cleanup. The old server instance may hold a VM resource such as a named ETS table that the new server instance fails to create. * Use the new init fail function in Kernel and Stdlib. * Fix a bug when `proc_lib:start_link/*` is called from a process that does _not_ trap exits and the `init/1` function simply exits with reason normal without calling `proc_lib:init_fail/2,3 or `proc_lib:init_ack/1,2`. Then the caller would wait for the start `Timeout` which could be `infinity`. This is solved by using a process monitor also for `proc_lib:start_link/*`.

RaimoNiskanen · 2023-02-15T11:26:13Z

Rebased to a later 'master'

RaimoNiskanen · 2023-02-15T13:59:14Z

This is the documentation that I have updated:

proc_lib:init_ack/1,2
proc_lib:init_fail/2,3 (added)
proc_lib:start/3,4,5
proc_lib:start_link/3,4,5
proc_lib:start_monitor/3,4,5
Design Principles; sys and proc_lib; Special Processes; Starting the Process, just at the end, before Debugging

These applications seem they should be updated to use proc_lib:init_fail/2,3:
diameter, inets, mnesia, reltool, snmp, ssh, tftp, wx.

After fixing how proc_lib:start_link/* handles exit(normal) it is, from an init/1 function, equivalent to call exit(Reason) instead of proc_lib:init_fail({error, Reason}, {exit, Reason}).

RaimoNiskanen · 2023-02-16T21:12:44Z

"Simplify error propagation" and "Clarify documentation" are new, the others are rebased.

See GitHub PR#6843. When `init()` function exits, return `{error, EXIT_reason}`. A link, of course, still may kill the caller. Never leave an {`EXIT',_,_} message lingering after an error return.

RaimoNiskanen · 2023-02-17T09:29:02Z

Documented the new behaviour of consuming 'EXIT' messages

lib/stdlib/doc/src/proc_lib.xml

rickard-green · 2023-02-21T12:49:54Z

lib/stdlib/doc/src/proc_lib.xml

+
+    <func>
+      <name since="OTP-26.0">init_fail(Ret, Exception) -> no_return()</name>
+      <name since="OTP-26.0">init_fail(Parent, Ret, Exception) -> no_return()</name>


Do we really need to be able to raise every possible exception there is here? Wouldn't raising an exit exception with an exit Reason as last argument instead of Exception be enough?

Using the class throw here seems really fishy. I would actually say it is plain wrong.

We probably do not need to raise every possible exception here, but; this function should be called from a process started with the very general proc_lib:start{_link,monitor}/3,4,5, which expects the started process to run to an end in any manner; any value or exception.

So if there is user code that today, for some reason throws some exception, say it wants the logging in the top level catch in proc_lib:init_p/3,5; to make it possible for that code to use the new proc_lib:init_fail/2,3 and raise any exception, I think it should handle also the the really fishy in this context class throw.

Should a process unconditionally exit by, for example process_flag(trap_exit, false), exit(self(), suicide), it will circumvent the top level catch in init_p(). Should it catch the exception from init_fail() it would mess up the exception it itself requested. Both cases, I think, are misuses that we cannot guard against. The proc_lib:start process simply has to run to an end.

RaimoNiskanen added this to the OTP-26.0 milestone Feb 10, 2023

RaimoNiskanen requested a review from rickard-green February 10, 2023 10:31

RaimoNiskanen self-assigned this Feb 10, 2023

RaimoNiskanen linked an issue Feb 10, 2023 that may be closed by this pull request

ETS table name conflict racing if it is created in a supervised gen_server:init #6339

Closed

RaimoNiskanen mentioned this pull request Feb 10, 2023

ETS table name conflict racing if it is created in a supervised gen_server:init #6339

Closed

RaimoNiskanen marked this pull request as draft February 10, 2023 10:48

josevalim approved these changes Feb 10, 2023

View reviewed changes

RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from 220ac3b to fe36d8e Compare February 10, 2023 11:12

KennethL reviewed Feb 13, 2023

View reviewed changes

rickard-green added team:VM Assigned to OTP team VM team:PS Assigned to OTP team PS and removed team:VM Assigned to OTP team VM labels Feb 13, 2023

RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from fe36d8e to a6e8664 Compare February 14, 2023 11:00

RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch 2 times, most recently from 88611ab to e1124c9 Compare February 15, 2023 09:55

RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from e1124c9 to a372035 Compare February 15, 2023 11:12

RaimoNiskanen added 2 commits February 15, 2023 12:25

Test looping over gen_server init failure

193cda6

RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from a372035 to 28241ad Compare February 15, 2023 11:26

RaimoNiskanen added 3 commits February 16, 2023 22:10

Simplify error propagation

b32780f

After fixing how proc_lib:start_link/* handles exit(normal) it is, from an init/1 function, equivalent to call exit(Reason) instead of proc_lib:init_fail({error, Reason}, {exit, Reason}).

Update doc with proc_lib:init_fail/2,3

99cee51

Clarify documentation

d3f04e3

RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from 972f88e to 6eea2eb Compare February 16, 2023 21:11

Change to a more coherent behaviour for init_fail and exit in init

2f5efa0

See GitHub PR#6843. When `init()` function exits, return `{error, EXIT_reason}`. A link, of course, still may kill the caller. Never leave an {`EXIT',_,_} message lingering after an error return.

RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from 6eea2eb to 2f5efa0 Compare February 17, 2023 09:28

RaimoNiskanen marked this pull request as ready for review February 17, 2023 09:29

Fix bug - start_link should not return an unwrapped error Reason

897bd2e

RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from 7d81867 to 897bd2e Compare February 17, 2023 13:29

rickard-green approved these changes Feb 21, 2023

View reviewed changes

Update documentation after review

381648e

RaimoNiskanen merged commit b0afc29 into erlang:master Feb 23, 2023

RaimoNiskanen mentioned this pull request Feb 24, 2023

Use synchronous init fail in OTP applications #6919

Merged

justincy mentioned this pull request Jun 16, 2023

Tests failing with OTP 26 whatyouhide/redix#251

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement synchronous init failure for gen processes #6843

Implement synchronous init failure for gen processes #6843

RaimoNiskanen commented Feb 10, 2023

josevalim left a comment

RaimoNiskanen commented Feb 10, 2023

github-actions bot commented Feb 10, 2023 •

edited

Loading

josevalim commented Feb 10, 2023 •

edited

Loading

RaimoNiskanen commented Feb 10, 2023

josevalim commented Feb 10, 2023

RaimoNiskanen commented Feb 10, 2023

josevalim commented Feb 10, 2023

KennethL Feb 13, 2023

RaimoNiskanen Feb 13, 2023

RaimoNiskanen commented Feb 13, 2023 •

edited

Loading

RaimoNiskanen commented Feb 14, 2023 •

edited

Loading

RaimoNiskanen commented Feb 15, 2023

RaimoNiskanen commented Feb 15, 2023

RaimoNiskanen commented Feb 15, 2023 •

edited

Loading

RaimoNiskanen commented Feb 16, 2023

RaimoNiskanen commented Feb 17, 2023

rickard-green Feb 21, 2023

RaimoNiskanen Feb 21, 2023

Implement synchronous init failure for gen processes #6843

Implement synchronous init failure for gen processes #6843

Conversation

RaimoNiskanen commented Feb 10, 2023

josevalim left a comment

Choose a reason for hiding this comment

RaimoNiskanen commented Feb 10, 2023

Interesting question

github-actions bot commented Feb 10, 2023 • edited Loading

CT Test Results

Artifacts

josevalim commented Feb 10, 2023 • edited Loading

RaimoNiskanen commented Feb 10, 2023

josevalim commented Feb 10, 2023

RaimoNiskanen commented Feb 10, 2023

josevalim commented Feb 10, 2023

KennethL Feb 13, 2023

Choose a reason for hiding this comment

RaimoNiskanen Feb 13, 2023

Choose a reason for hiding this comment

RaimoNiskanen commented Feb 13, 2023 • edited Loading

Currently

Proposed change

RaimoNiskanen commented Feb 14, 2023 • edited Loading

Summary of the summary

RaimoNiskanen commented Feb 15, 2023

RaimoNiskanen commented Feb 15, 2023

RaimoNiskanen commented Feb 15, 2023 • edited Loading

RaimoNiskanen commented Feb 16, 2023

RaimoNiskanen commented Feb 17, 2023

rickard-green Feb 21, 2023

Choose a reason for hiding this comment

RaimoNiskanen Feb 21, 2023

Choose a reason for hiding this comment

github-actions bot commented Feb 10, 2023 •

edited

Loading

josevalim commented Feb 10, 2023 •

edited

Loading

RaimoNiskanen commented Feb 13, 2023 •

edited

Loading

RaimoNiskanen commented Feb 14, 2023 •

edited

Loading

RaimoNiskanen commented Feb 15, 2023 •

edited

Loading