Potential process leakage of tls_dyn_connection_sup processes #6244

josevalim · 2022-08-25T17:07:49Z

Describe the bug

I believe there is a race condition in the ssl application that can lead to a leak of tls_dyn_connection_sup processes.

We are assuming this because, in a high-performance system, we are retrieving all instances of tls_dyn_connection_sup and some of them have either no children or 1 child. One hour later, we still see the same number of leaked processes with no or 1 child.

If someone calls ssl:connect, the following call trace might happen:

ssl:connect -> tls_socket:connect
tls_socket:connect -> ssl_gen_statem:connect
ssl_gen_statem:connect -> tls_gen_connection:start_fsm

tls_gen_connection:start_fsm has the following code:

start_fsm(Role, Host, Port, Socket,
          {#{erl_dist := false, sender_spawn_opts := SenderOpts}, _, Trackers} = Opts,
          User, {CbModule, _, _, _, _} = CbInfo,
          Timeout) ->
    try
        {ok, DynSup} = tls_connection_sup:start_child([]),
        {ok, Sender} = tls_dyn_connection_sup:start_child(DynSup, sender, [[{spawn_opt, SenderOpts}]]),
        {ok, Pid} = tls_dyn_connection_sup:start_child(DynSup, receiver,
                                                       [Role, Sender, Host, Port, Socket,
                                                        Opts, User, CbInfo]),
        {ok, SslSocket} = ssl_gen_statem:socket_control(?MODULE, Socket, [Pid, Sender], CbModule, Trackers),
        ssl_gen_statem:handshake(SslSocket, Timeout)
    catch
        error:{badmatch, {error, _} = Error} ->
            Error
    end;

I assume the issue is with the code above: if the current process exits for any reason (such as an exit signal arriving over a process link), it can die before tls_dyn_connection_sup:start_child(DynSup, receiver, ...) is invoked. My understanding is that the receiver child (the second one) is the one monitoring the user process and is therefore the one responsible for triggering the significant cleanup in the supervisor. However, if the receiver is never started because the caller died first, both tls_dyn_connection_sup and tls_sender will leak (and we see both leaking).

To Reproduce

Unfortunately I do not have an easy way to reproduce this issue, due to the race, but I believe the call trace above is reasonable?

Potential solutions

Feel free to ignore this section. :)

One of the possible solutions is to change tls_dyn_connection_sup to start both children as part of its initialization and then call supervisor:which_children to retrieve their PIDs. A potential downside of this approach is that tls_dyn_connection_sup takes longer to start, increasing the odds of tls_connection_sup itself becoming a bottleneck? It may be necessary to make tls_connection_sup a pool of processes that are randomly routed to.

Affected versions

OTP 25.

IngelaAndin · 2022-08-26T13:07:09Z

Would https://github.com/IngelaAndin/otp/pull/new/ingela/ssl/dy-sup-leak/GH-6244 make the leak go away?

josevalim · 2022-08-26T14:27:54Z

Hi @IngelaAndin, I don't think so. :( The issue is caused by process links, so in your branch the calling process can still be killed through a link at any moment before the exit command is executed. I believe the try/catch doesn't play a role at all, because it doesn't catch exits from links.
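To illustrate the point about try/catch and links, here is a minimal sketch. The module name link_exit_demo and its functions are hypothetical and are not part of the ssl application; the blocking receive merely stands in for the sequence of start_child calls.

```erlang
-module(link_exit_demo).
-export([run/0]).

%% Spawns a worker that sits inside a try/catch while a linked peer dies.
%% Returns {ExitReason, WasCaughtByTry}.
run() ->
    Parent = self(),
    {Worker, Ref} = spawn_monitor(fun() -> worker(Parent) end),
    receive
        {'DOWN', Ref, process, Worker, Reason} ->
            %% Anything the worker sent was sent before it died, so it
            %% would already be in our mailbox by now.
            Caught = receive {caught, _} -> true after 0 -> false end,
            {Reason, Caught}
    after 1000 ->
        timeout
    end.

worker(Parent) ->
    %% A linked peer that terminates abnormally right away.
    _Peer = spawn_link(fun() -> exit(peer_died) end),
    try
        %% Stand-in for the start_child calls (assumption for the demo).
        receive after infinity -> ok end
    catch
        Class:Reason ->
            %% Never reached: an exit signal from a link is not an
            %% exception raised inside this process.
            Parent ! {caught, {Class, Reason}}
    end.
```

Because the worker does not trap exits, the exit signal from the peer terminates it directly and the catch clause never runs, so link_exit_demo:run() should return {peer_died, false}.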

The solution I had in mind was something like:

    SenderArgs = [[{spawn_opt, SenderOpts}]],
    ReceiverArgs = [Role, Sender, Host, Port, Socket, Opts, User, CbInfo],
    {ok, DynSup} = tls_connection_sup:start_child(SenderArgs, ReceiverArgs),
    [{_, Sender, _, _}, {_, Receiver, _, _}] = supervisor:which_children(DynSup),

But, as per above, I am worried tls_connection_sup may become a bottleneck. This could be addressed by starting a handful of tls_connection_sup and picking one based on erlang:phash/2.
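The routing idea could look roughly like the sketch below. The module name sup_pool and the representation of the pool as a tuple are assumptions made for illustration, and the sketch uses erlang:phash2/2 (the currently recommended hash BIF) rather than erlang:phash/2.

```erlang
-module(sup_pool).
-export([pick/2]).

%% Pool is a non-empty tuple of supervisor names or pids (hypothetical
%% layout). Key is whatever should be hashed, typically self().
pick(Key, Pool) when is_tuple(Pool), tuple_size(Pool) > 0 ->
    %% erlang:phash2(Term, Range) returns an integer in 0..Range-1,
    %% so add 1 to obtain a valid tuple index.
    Index = erlang:phash2(Key, tuple_size(Pool)) + 1,
    element(Index, Pool).
```

A caller would then use something like sup_pool:pick(self(), Pool) to choose which supervisor to ask for a new connection tree, which spreads start_child calls over several processes while always routing the same key to the same supervisor.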

IngelaAndin · 2022-08-29T11:42:35Z

@josevalim

Having one child could be a valid case as the receiver process can linger if it has received data that has not yet been read by the controlling process. Having no children except for a moment exactly after the dynamic supervisor is spawned seems strange.

None of the processes will be linked to the process executing the start_fsm code. The sender process will be very idle until the handshake has been completed, so it is also highly unlikely that it would crash before the receiver process is started.

I can think of one scenario that could cause the line
{ok, SslSocket} = ssl_gen_statem:socket_control(?MODULE, Socket, [Pid, Sender], CbModule, Trackers),
to fail: you do a so-called upgrade of a TCP socket in your server, and either the listen socket is not set to {active, false} or the mode of the accepted socket is changed to some active mode before it is passed to the ssl API.

None of the ssl processes wait for the handshake to complete before returning from their init functions, and which_children is executed on a connection-specific process, so if I understand your solution correctly it would probably not be too much of a bottleneck. But at the moment I do not understand what problem it solves.

When you see only one child alive, do you know if it is a sender child or a receiver child?

sneako · 2022-08-29T12:09:32Z

Hi @IngelaAndin! Jose has been helping us track down this bug in our system. I just checked the tls_dyn_connection_sup processes that have only one child, and in every case that child is a sender.

josevalim · 2022-08-29T12:11:31Z

"But at the moment I do not understand what problem it solves."

My understanding is that the receiver process monitors the process that calls start_fsm. So if we guarantee the receiver starts alongside the sender, then we guarantee the monitoring happens, and that the receiver will eventually shut down and bring the sender and its parent down. Or have I misunderstood the monitor call? :)

IngelaAndin · 2022-08-29T13:35:46Z

@josevalim I think you got the monitoring part right :) What I do not understand is how the receiver process fails to be linked to the supervisor. Exactly which version of OTP are you running? This sounds like something that could have been fixed by 0598f76.

sneako · 2022-08-29T13:36:36Z

We are running 25.0.2

IngelaAndin · 2022-08-29T14:55:06Z

Well ok, then that should not be the problem! What log_level do you have? Would it be possible to turn on info log_level for ssl on a system where this happens?

josevalim · 2022-08-29T15:04:27Z

@IngelaAndin what I believe is happening is that there is another process, linked to the current one, and this other process terminates. When it terminates, it causes the current process to immediately exit. If it immediately exits after this line:

 {ok, DynSup} = tls_connection_sup:start_child([]),

Then neither the sender nor the receiver process is started, and the supervisor is left hanging. The same applies if it exits right after the sender starts: there is no receiver, and therefore both the supervisor and the sender leak forever.

The issue is not that the receiver doesn’t link, but rather that it never starts! :)

So my proposal is to start all three processes atomically, under the same start_child call.
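A minimal sketch of that proposal, under stated assumptions: the module conn_tree_sup, its placeholder workers, and the supervisor flags are all hypothetical and only show the shape of the idea, not the ssl application's actual supervisor. Both children are declared in init/1, so by the time start_link/0 returns they are either both running or the supervisor has failed to start. How the receiver would learn the sender's pid is deliberately left out.

```erlang
-module(conn_tree_sup).
-behaviour(supervisor).
-export([start_link/0, children/1]).
-export([init/1, start_worker/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Both children are part of the initial child list, so there is no
%% window in which the supervisor exists with only some of them.
init([]) ->
    SupFlags = #{strategy => one_for_all, intensity => 0, period => 3600},
    Child = fun(Id) ->
                    #{id => Id,
                      start => {?MODULE, start_worker, [Id]},
                      restart => temporary,
                      type => worker}
            end,
    {ok, {SupFlags, [Child(sender), Child(receiver)]}}.

%% Placeholder worker standing in for the real sender and receiver.
start_worker(_Id) ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

%% Look the children up by id instead of relying on list order.
children(Sup) ->
    Kids = supervisor:which_children(Sup),
    {sender, Sender, _, _} = lists:keyfind(sender, 1, Kids),
    {receiver, Receiver, _, _} = lists:keyfind(receiver, 1, Kids),
    {Sender, Receiver}.
```

Matching on the child id in children/1 avoids depending on the order in which supervisor:which_children/1 happens to return its entries.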

sneako · 2022-08-29T15:59:22Z

We are actually already running with log_level info. This is an Elixir application, configured with

config :logger,
  level: :info

and on a remote shell:

iex(app@10.4.235.41)1> Logger.level()
:info

I do not see any log messages related to the ssl application.

IngelaAndin · 2022-08-30T05:42:26Z

@josevalim Oh, I see. Let me think some more about that! @sneako thanks for the info.

If the "User" process, the process starting the TLS connection, gets killed in the middle of spawning the dynamic connection tree make sure we do not leave any processes behind. Close erlang#6244

IngelaAndin · 2022-08-30T08:46:33Z

@josevalim I force-pushed a new proposed solution to ingela/ssl/dy-sup-leak/GH-6244. What do you think?

josevalim · 2022-08-30T09:45:37Z

@IngelaAndin perfect :)

Btw, can any of the child processes return {error, _} when starting? When I checked, I believe all of them start immediately and return {ok, pid}. If that's the case, then you could skip the try/catch?

IngelaAndin · 2022-08-30T15:07:05Z

Hmm... well, the sender process init should never crash, but the receiver init is more complex. I will polish the solution a little and make a PR (hopefully) tomorrow.

IngelaAndin · 2022-09-01T14:23:24Z

OK, so making a test case was a bit tricky. I now have one that does not fail with the fix and fails fairly often on maint, but not every time. I think it will have to do.

/OTP-18233 ssl: Avoid partial connection trees

josevalim added the bug Issue is reported as a bug label Aug 25, 2022

IngelaAndin added the team:PS Assigned to OTP team PS label Aug 26, 2022

IngelaAndin added the in progress label Aug 26, 2022

IngelaAndin self-assigned this Aug 26, 2022

IngelaAndin mentioned this issue Sep 1, 2022

ssl: Avoid partial connection trees #6270

Merged

IngelaAndin added a commit that referenced this issue Sep 8, 2022

Merge pull request #6270 from IngelaAndin/ingela/ssl/dy-sup-leak/GH-6244

9bd57f7

/OTP-18233 ssl: Avoid partial connection trees

IngelaAndin closed this as completed in 03cfa8c Sep 8, 2022

IngelaAndin added a commit that referenced this issue Sep 8, 2022

Merge branch 'ingela/maint-24/ssl/dy-sup-leak/GH-6244/OTP-18233' into…

6f4c7b0

… maint

IngelaAndin pushed a commit that referenced this issue Sep 13, 2022

Merge branch 'ingela/maint-24/ssl/dy-sup-leak/GH-6244/OTP-18233' into…

b1b2138

… maint-24 * ingela/maint-24/ssl/dy-sup-leak/GH-6244/OTP-18233: ssl: Avoid partial connection trees
