Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement synchronous init failure for gen processes #6843

Merged

Conversation

RaimoNiskanen
Copy link
Contributor

  • Implement a function proc_lib:init_fail/2,3 to call from proc_lib started processes when they fail to initialize instead of proc_lib:init_ack/1,2. The new function sends a {nack,,} to the spawner that then uses a process monitor on the spawned process to await the monitor 'DOWN' message before returning a failure to the spawner.
    This avoids a race when using proc_lib:init_ack/1,2 which
    allows the spawner to get and return the failure return value, and restart a new server instance while the old instance not yet has been scheduled for exit and cleanup. The old server instance may hold a VM resource such as a named ETS table that the new server instance fails to create.

  • Use the new init fail function in Kernel and Stdlib.

Copy link
Contributor

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful. 💯

@RaimoNiskanen
Copy link
Contributor Author

I propose a function proc_lib:init_fail/2,3 that besides sending a {nack,,} also exits the process, so the proc_lib code can rely on the spawned process exiting and use infinity when waiting for the monitor {'DOWN',,,,}.

  • Documentation is missing.
  • A test case that verifies that cyclic restarts of a server frees resources as expected is missing.
  • Updating all OTP code outside Kernel and Stdlib remains to do.

Interesting question

We maybe should take the opportunity to change the behaviour of purging {'EXIT',,} messages from the parent's message queue. Now it is quite unpredictable (I did not change that to prevent test cases in Stdlib; application_SUITE from failing).

Example: proc_lib:start_link/*

  • The init() function ends in proc_lib:init_fail/2,3 and returns an error back to the parent calling proc_lib:start_link/*, but the {'EXIT', Server, Reason} remains in the parent's message queue.
  • The init() function crashes, proc_lib gets the {'EXIT', Server, Reason} message and returns {error,Reason}. No {'EXIT',,} message remains in the parent's message queue.
  • The init() function times out. proc_lib kills it and flushes the {'EXIT',,} message, so it does not remain in the parent's message queue.

There is no way for the parent calling proc_lib:start_link/* to see from the return value if there is an {'EXIT',,} message lingering in the message queue.

proc_lib:start_monitor/* with a link option behaves roughly the same, except that a crashing init() function leaves an {'EXIT',,} message in the parent's message queue. The same goes for proc_lib:start/* with a link` option.

If I understood the code, that is...

I think that a logical behaviour would be to never leave an {'EXIT',,} message in the parent's message queue if the return value from proc_lib:spawn*/* that creates a link indicates that the server failed to start, regardless of if init() crashes, returns a failure, or times out.

Changing this would be subtly backwards incompatible, though.

Would it be worth it to do that in OTP-26.0?

@RaimoNiskanen RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from 220ac3b to fe36d8e Compare February 10, 2023 11:12
@github-actions
Copy link
Contributor

github-actions bot commented Feb 10, 2023

CT Test Results

       3 files     142 suites   1h 38m 36s ⏱️
3 193 tests 2 921 ✔️ 269 💤 3
3 695 runs  3 376 ✔️ 316 💤 3

For more details on these failures, see this check.

Results for commit 381648e.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@josevalim
Copy link
Contributor

josevalim commented Feb 10, 2023

The init() function ends in proc_lib:init_fail/2,3 and returns an error back to the parent calling proc_lib:start_link/*, but the {'EXIT', Server, Reason} remains in the parent's message queue.

You could argue for both cases.

  1. A crash during init does not leak messages, per your account, but the current error replies do, which means any code that is calling proc_lib already has to take into account EXIT messages being there or not

  2. Even with this PR, it is still totally fine for a process to reply to init with {error, _} and then continue to go do some work, effectively exiting only later (which means EXIT messages will still come later)

In other words, I think it will always be unpredictable?

I would say that never removing EXIT from the message queue would be the predictable thing to do but it is also annoying and it causes the caller to do more work if you leave them there.

@RaimoNiskanen
Copy link
Contributor Author

A crash during init() does not leak a message for start_link, but for start and start_monitor when the link option is used, and we can change that. I think the current difference is annoying.

Current error replies (proc_lib:init_ack(SomeError)) leaks a message, as does my new proc_lib:init_fail(Return, Exception), but the new one can be changed to not do that.

A server that returns an error via proc_lib:init_ack/1,2 and then goes on running can be classified as broken.

We can achieve a point where servers that are not broken (use init_ack/1,2 and init_fail/2,3) in a correct way never leaks an exit message. All gen_* servers would be not broken since we can fix them.

@josevalim
Copy link
Contributor

A server that returns an error via proc_lib:init_ack/1,2 and then goes on running can be classified as broken.

It does not necessarily need to run but it may want to do some clean up that is expensive and therefore does not want to block the caller. Would that be a possibility?

@RaimoNiskanen
Copy link
Contributor Author

Then the caller cannot know when the server can be restarted, and the link may fire any time after returning the error. Wouldn't this be a special case of server start so you should return something like {error, Pid, hold_on}, which from proc_lib's point of view would be a successful server start followed by a controlled delayed exit.

@josevalim
Copy link
Contributor

I see, thanks! So in that case I agree that we could always guarantee EXIT to be removed and also make it predictable (with less work on the caller). 👍

Comment on lines +919 to +934
proc_lib:init_ack(Starter, {ok, self()}),
loop(
Parent, Name, State, CbCache, infinity,
HibernateAfterTimeout, Debug);
{ok, {ok, State, TimeoutOrHibernate}}
when ?is_timeout(TimeoutOrHibernate);
TimeoutOrHibernate =:= hibernate ->
proc_lib:init_ack(Starter, {ok, self()}),
loop(Parent, Name, State, CbCache, TimeoutOrHibernate, HibernateAfterTimeout, Debug);
proc_lib:init_ack(Starter, {ok, self()}),
loop(
Parent, Name, State, CbCache, TimeoutOrHibernate,
HibernateAfterTimeout, Debug);
{ok, {ok, State, {continue, _}=Continue}} ->
proc_lib:init_ack(Starter, {ok, self()}),
loop(Parent, Name, State, CbCache, Continue, HibernateAfterTimeout, Debug);
proc_lib:init_ack(Starter, {ok, self()}),
loop(
Parent, Name, State, CbCache, Continue,
HibernateAfterTimeout, Debug);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference? I can only see changes in indentation and line breaks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Those are only indentation and line break changes.
There were trailing whitespaces and lines longer than 80, which annoyed me...

@rickard-green rickard-green added team:VM Assigned to OTP team VM team:PS Assigned to OTP team PS and removed team:VM Assigned to OTP team VM labels Feb 13, 2023
@RaimoNiskanen
Copy link
Contributor Author

RaimoNiskanen commented Feb 13, 2023

To summarize how the proc_lib:start*/* functions behave:

  • If the init() function calls proc_lib:ack(Value):
    -> Value is returned by all start functions.

    Note: When this is used to indicate server start failure:

    1. there is no synchronization with the server process exit so if the exiting server holds an unique VM resource a restart of the server may fail to allocate the resource.
    2. if the caller traps exits and a start function that creates a link is used, the failing server will cause an {'EXIT',_,_} link message some time after Value is returned.
  • If the init() function times out as in does not call the ack function above and does not exit; when the start function Timeout expires (which may be never), the started server is killed with exit(Pid, kill):
    -> If the caller doesn't trap exits and and a start function that sets a process link was use then the caller is killed by the link with reason killed.
    -> Otherwise {error, timeout} is returned.
    Note: Only start_link waits for the started server to exit before returning, so for other start functions that set a process link some code may be executed before the link kills the caller, or when no link is set a server restart may fail if the exiting server holds an unique VM resource needed by the new server instance.

Currently

  • If the init() function exits with reason Reason (without calling the ack function, before the start function Timeout expires):
    -> If the caller doesn't trap exits, a start function that sets a process link was used, and Reason =/= normal, then the caller is killed by the process link with reason Reason.
    -> If the caller doesn't trap exits, start_link was used and Reason =:= normal the caller hangs until start_link Timeout expires (which may be never) and then returns {error, timeout}.
    -> Otherwise {error, Reason} is returned.
    Note: If a start function that sets a link other than start_link was used, then there is an {'EXIT',_,_} message lingering in the caller's message queue.

Proposed change

  • If the init() function exits with reason Reason (without calling the ack function, before the start function Timeout expires):
    -> If the caller doesn't trap exits, a start function that sets a process link was used, and Reason =/= normal then the caller is killed by the process link with reason Reason.
    -> Otherwise {error, Reason} is returned.
    There is never a lingering {'EXIT',_,_} message.

  • The init() function should only call proc_lib:init_ack(Value) to indicate a successful server start. There is no change in how it behaves, only in how it should be used.

  • If the init() function times out, all start functions wait for the server to exit before returning.

  • If the init() function calls the new function proc_lib:init_fail(Value, Reason):
    -> If the caller doesn't trap exits, a start function that sets a process link was used, and Reason =/= normal then the caller is killed by the link with reason Reason.
    -> Otherwise the start function returns Value, when the failed server process has exited.
    Note: The changed gen_* modules (including supervisor) that use this new function will no longer leave an {'EXIT',_,_} message in the caller's message queue when a start function that sets a process link is used.

@RaimoNiskanen RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from fe36d8e to a6e8664 Compare February 14, 2023 11:00
@RaimoNiskanen
Copy link
Contributor Author

RaimoNiskanen commented Feb 14, 2023

Sorry for my brainfarting, but I am trying to clarify this to me and for an OTP Technical Board soon to be, where I want to discuss if we dare to clean up the exit message behaviour for the major release OTP-26.0.

Summary of the summary

A proc_lib:start*/* function that sets a link will, of course, when the server exits, be it during init/1 or any time later, trigger the link, which will kill a caller that does not trap exits (unless the exit reason is normal which does not kill the caller). This is fundamental link behaviour and not something we can or will change for these start functions.

The second commit (first after the one that introduces test cases) changes how synchronous the start functions are:

  • If the init/1 function does not produce a reply within the start function Timeout the started process is killed with exit(Pid, kill). The code is changed to wait for the started process to exit before returning from the start function.
  • A new function proc_lib:init_fail/2,3 is introduced and used from all gen_* modules which facilitates the proc_lib start functions to in the same way wait for the started process to exit before returning from the start function.
  • A bug is fixed for when the caller of proc_lib:start_link/* doesn't trap exits and the init/1 function misbehaves in that it simply exits withnormal without calling proc_lib:init_ack/* nor proc_lib:init_fail/*, then the caller will not detect the exit and wait for the start Timeout which may be infinity... The bug is fixed by using a process monitor also forproc_lib:start_link/*.

These changes, I think, can be regarded as bug fixes, so we should simply do them.

The third commit changes when {'EXIT',_,_} messages delivered to the caller due to a process link are consumed (only applicable when the caller traps exits, and a start function that sets a process link is used). Without this change:

  • Exit messages are consumed when the init/1 function times out,
    and when the init/1 function exits if proc_lib:start_link/* is used.
  • Exit messages are thereby not consumed when a server calls proc_lib:init_fail/2,3 or the no longer recommended to use when failing in init/1 - proc_lib:init_ack/1,2,
    or when the init/1 function exits and proc_lib:start/* or proc_lib:start_monitor/* with the 'link' option is used.

The change is to always consume the {'EXIT',_,_} message however the init/1 function fails to start the server (provided that the failure indication is produced with proc_lib:init_fail/* and not with proc_lib:init_ack/*).

@RaimoNiskanen RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch 2 times, most recently from 88611ab to e1124c9 Compare February 15, 2023 09:55
@RaimoNiskanen
Copy link
Contributor Author

Added documentation

@RaimoNiskanen RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from e1124c9 to a372035 Compare February 15, 2023 11:12
* Implement a function `proc_lib:init_fail/2,3` to call from
`proc_lib` started processes when they fail to initialize instead
of `proc_lib:init_ack/1,2`.  The new function sends a `{nack,,}`
to the spawner that then uses a process monitor on the spawned
process to await the monitor `'DOWN'` message before returning
a failure to the spawner.
  This avoids a race when using `proc_lib:init_ack/1,2` which
allows the spawner to get and return the failure return value,
and restart a new server instance while the old instance not
yet has been scheduled for exit and cleanup.  The old server
instance may hold a VM resource such as a named ETS table
that the new server instance fails to create.

* Use the new init fail function in Kernel and Stdlib.

* Fix a bug when `proc_lib:start_link/*` is called from
a process that does _not_ trap exits and the `init/1` function
simply exits with reason normal without calling
`proc_lib:init_fail/2,3 or `proc_lib:init_ack/1,2`.  Then
the caller would wait for the start `Timeout` which could
be `infinity`.  This is solved by using a process monitor
also for `proc_lib:start_link/*`.
@RaimoNiskanen RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from a372035 to 28241ad Compare February 15, 2023 11:26
@RaimoNiskanen
Copy link
Contributor Author

Rebased to a later 'master'

@RaimoNiskanen
Copy link
Contributor Author

RaimoNiskanen commented Feb 15, 2023

This is the documentation that I have updated:

These applications seem they should be updated to use proc_lib:init_fail/2,3:
diameter, inets, mnesia, reltool, snmp, ssh, tftp, wx.

After fixing how proc_lib:start_link/* handles exit(normal)
it is, from an init/1 function, equivalent to call exit(Reason)
instead of proc_lib:init_fail({error, Reason}, {exit, Reason}).
@RaimoNiskanen RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from 972f88e to 6eea2eb Compare February 16, 2023 21:11
@RaimoNiskanen
Copy link
Contributor Author

"Simplify error propagation" and "Clarify documentation" are new, the others are rebased.

See GitHub PR#6843.

When `init()` function exits, return `{error, EXIT_reason}`.
A link, of course, still may kill the caller.

Never leave an {`EXIT',_,_} message lingering after an error return.
@RaimoNiskanen RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from 6eea2eb to 2f5efa0 Compare February 17, 2023 09:28
@RaimoNiskanen
Copy link
Contributor Author

Documented the new behaviour of consuming 'EXIT' messages

@RaimoNiskanen RaimoNiskanen marked this pull request as ready for review February 17, 2023 09:29
@RaimoNiskanen RaimoNiskanen force-pushed the raimo/gen-sync-start-fail/GH-6339 branch from 7d81867 to 897bd2e Compare February 17, 2023 13:29
lib/stdlib/doc/src/proc_lib.xml Outdated Show resolved Hide resolved
lib/stdlib/doc/src/proc_lib.xml Outdated Show resolved Hide resolved

<func>
<name since="OTP-26.0">init_fail(Ret, Exception) -> no_return()</name>
<name since="OTP-26.0">init_fail(Parent, Ret, Exception) -> no_return()</name>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to be able to raise every possible exception there is here? Wouldn't raising an exit exception with an exit Reason as last argument instead of Exception be enough?

Using the class throw here seems really fishy. I would actually say it is plain wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably do not need to raise every possible exception here, but; this function should be called from a process started with the very general proc_lib:start{_link,monitor}/3,4,5, which expects the started process to run to an end in any manner; any value or exception.

So if there is user code that today, for some reason throws some exception, say it wants the logging in the top level catch in proc_lib:init_p/3,5; to make it possible for that code to use the new proc_lib:init_fail/2,3 and raise any exception, I think it should handle also the the really fishy in this context class throw.

Should a process unconditionally exit by, for example process_flag(trap_exit, false), exit(self(), suicide), it will circumvent the top level catch in init_p(). Should it catch the exception from init_fail() it would mess up the exception it itself requested. Both cases, I think, are misuses that we cannot guard against. The proc_lib:start process simply has to run to an end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team:PS Assigned to OTP team PS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ETS table name conflict racing if it is created in a supervised gen_server:init
4 participants