Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential process leakage of tls_dyn_connection_sup processes #6244

Closed
josevalim opened this issue Aug 25, 2022 · 15 comments
Closed

Potential process leakage of tls_dyn_connection_sup processes #6244

josevalim opened this issue Aug 25, 2022 · 15 comments
Assignees
Labels
bug Issue is reported as a bug in progress team:PS Assigned to OTP team PS

Comments

@josevalim
Copy link
Contributor

josevalim commented Aug 25, 2022

Describe the bug

I believe there is a race condition in the ssl application that can lead to a leak of tls_dyn_connection_sup processes.

We are assuming this because, in a high-performance system, we are retrieving all instances of tls_dyn_connection_sup and some of them have either no children or 1 child. One hour later, we still see the same number of leaked processes with no or 1 child.

If someone calls ssl:connect, the following call trace might happen:

ssl:connect -> tls_socket:connect
tls_socket:connect -> ssl_gen_statem:connect
ssl_gen_statem:connect -> tls_gen_connection:start_fsm

tls_gen_connection:start_fsm has the following code:

start_fsm(Role, Host, Port, Socket,
          {#{erl_dist := false, sender_spawn_opts := SenderOpts}, _, Trackers} = Opts,
    User, {CbModule, _, _, _, _} = CbInfo, 
    Timeout) -> 
    try
        {ok, DynSup} = tls_connection_sup:start_child([]),
        {ok, Sender} = tls_dyn_connection_sup:start_child(DynSup, sender, [[{spawn_opt, SenderOpts}]]),
        {ok, Pid} = tls_dyn_connection_sup:start_child(DynSup, receiver, [Role, Sender, Host, Port, Socket,
                                                             Opts, User, CbInfo]),
        {ok, SslSocket} = ssl_gen_statem:socket_control(?MODULE, Socket, [Pid, Sender], CbModule, Trackers),
        ssl_gen_statem:handshake(SslSocket, Timeout)
    catch
  error:{badmatch, {error, _} = Error} ->
      Error
    end;

I assume the issue is with the code above: if the current process crashes for any reason (such as a broken process link), it can fail before tls_dyn_connection_sup:start_child(DynSup, receiver, ...) is invoked. My undestanding is that the receiver child (the second one) is the one monitoring the user process and therefore is the one responsible for triggering the significant cleanup in the supervisor. However, if the child process is never started due to a broken link, both tls_dyn_connection_sup and tls_sender will leak (and we see both leaking).

To Reproduce

Unfortunately I do not have an easy way to reproduce this issue, due to the race, but I believe the call trace above is reasonable?

Potential solutions

Feel free to ignore this section. :)

One of the possible solutions is to change tls_dyn_connection_sup to start both children as part of its initialization and then call supervisor:which_children to retrieve their PIDs. A potential downside of this approach is that tls_dyn_connection_sup takes longer to start, increasing the odds of tls_connection_sup itself becoming a bottleneck? It may be necessary to make tls_connection_sup a pool of processes that are randomly routed to.

Affected versions

OTP 25.

@josevalim josevalim added the bug Issue is reported as a bug label Aug 25, 2022
@IngelaAndin IngelaAndin added the team:PS Assigned to OTP team PS label Aug 26, 2022
@IngelaAndin
Copy link
Contributor

@josevalim
Copy link
Contributor Author

Hi @IngelaAndin, I don't think so. :( The issue is caused by process links, so in your branch, the link can still break at any moment before the exit command is executed. I believe the try/catch doesn't play a factor at all, because it doesn't catch exits from links.

The solution I had in mind was something like:

    SenderArgs = [[{spawn_opt, SenderOpts}]],
    ReceiverArgs = [Role, Sender, Host, Port, Socket, Opts, User, CbInfo],
    {ok, DynSup} = tls_connection_sup:start_child(SenderArgs, ReceiverArgs),
    [{_, Sender, _, _}, {_, Receiver, _, _}] = supervisor:which_children(DynSup),

But, as per above, I am worried tls_connection_sup may become a bottleneck. This could be addressed by starting a handful of tls_connection_sup and picking one based on erlang:phash/2.

@IngelaAndin
Copy link
Contributor

IngelaAndin commented Aug 29, 2022

@josevalim

Having one child could be a valid case as the receiver process can linger if it has received data that has not yet been read by the controlling process. Having no children except for a moment exactly after the dynamic supervisor is spawned seems strange.

None of the processes will be linked to the process executing the start_fsm code. The sender process will be very idle until the handshake has been completed so the risk that it could crash before the receiver process started is also highly unlikely.

I can think of one scenario that could cause the line
{ok, SslSocket} = ssl_gen_statem:socket_control(?MODULE, Socket, [Pid, Sender], CbModule, Trackers),
to fail, it is if you do a so-called upgrade of a from a TCP socket in your server and the listen socket is not set to {active, false}, or the mode of the accept socket is changed to some active mode before passed to the ssl API.

None of the ssl processes will wait for the handshake to complete to return from their init functions and which_children is
executed on a connection-specific process so if I understand your solution correctly it would probably not be too much of a bottleneck. But at the moment I do not understand what problem it solves.

When you see only one child alive, do you know if it is a sender child or a receiver child?

@sneako
Copy link
Contributor

sneako commented Aug 29, 2022

Hi @IngelaAndin ! Jose has been helping us track down this bug within our system. I just checked the children of our tls_dyn_connection_sup which only have one child and they are all senders.

@josevalim
Copy link
Contributor Author

josevalim commented Aug 29, 2022

But at the moment I do not understand what problem it solves.

My understanding is that the receiver process monitors the process that calls start_fsm. So if we guarantee the receiver starts alongside the sender, then we guarantee the monitoring happens, and that the receiver will eventually shutdown and bring the sender and its parent down. Or have I misunderstood the monitor call? :)

@IngelaAndin
Copy link
Contributor

@josevalim I think you got the monitoring part correctly :) What I do not understand is how the receiver process fails to be linked to the supervisor. Exactly which version of OTP are you running? This sound like something that could have been fixed by 0598f76

@sneako
Copy link
Contributor

sneako commented Aug 29, 2022

We are running 25.0.2

@IngelaAndin
Copy link
Contributor

Well ok, then that should not be the problem! What log_level do you have? Would it be possible to turn on info log_level for ssl on a system where this happens?

@josevalim
Copy link
Contributor Author

josevalim commented Aug 29, 2022

@IngelaAndin what I believe is happening is that there is another process, linked to the current one, and this other process terminates. When it terminates, it causes the current process to immediately exit. If it immediately exits after this line:

 {ok, DynSup} = tls_connection_sup:start_child([]),

Then neither sender and receiver process are started and it leaves the supervisor hanging. The same if it exits right after the sender starts: there is receiver and therefore both supervisor and sender leak forever.

The issue is not that the receiver doesn’t link, but rather that it never starts! :)

So my proposal is to start all three processes atomically, under the same start_child call.

@sneako
Copy link
Contributor

sneako commented Aug 29, 2022

We are actually already running with log_level info. This is an Elixir application, configured with

config :logger,
  level: :info

and on a remote shell:

iex(app@10.4.235.41)1> Logger.level()
:info

I do not see any log messages related to the ssl application.

@IngelaAndin
Copy link
Contributor

@josevalim Oh, I see. Let me think some more about that! @sneako thanks for the info.

IngelaAndin added a commit to IngelaAndin/otp that referenced this issue Aug 30, 2022
If the "User" process, the process starting the TLS connection, gets
killed in the middle of spawning the dynamic connection tree make sure
we do not leave any processes behind.

Close erlang#6244
@IngelaAndin
Copy link
Contributor

@josevalim I force pushed a new solution suggestion to ingela/ssl/dy-sup-leak/GH-6244 What do you think?

@josevalim
Copy link
Contributor Author

@IngelaAndin perfect :)

Btw, can any of the child process return {error, _} when starting? When I checked I believe all of them start immediately and return {ok, pid}. If that's the case, then you could skip the try/catch?

@IngelaAndin
Copy link
Contributor

Humm ... well sender process init should never crash but receiver init is more complex. I will polish the solution a little and make a PR (hopefully) tomorrow.

IngelaAndin added a commit to IngelaAndin/otp that referenced this issue Aug 31, 2022
If the "User" process, the process starting the TLS connection, gets
killed in the middle of spawning the dynamic connection tree make sure
we do not leave any processes behind.

Close erlang#6244
IngelaAndin added a commit to IngelaAndin/otp that referenced this issue Aug 31, 2022
If the "User" process, the process starting the TLS connection, gets
killed in the middle of spawning the dynamic connection tree make sure
we do not leave any processes behind.

Close erlang#6244
IngelaAndin added a commit to IngelaAndin/otp that referenced this issue Sep 1, 2022
If the "User" process, the process starting the TLS connection, gets
killed in the middle of spawning the dynamic connection tree make sure
we do not leave any processes behind.

Close erlang#6244
@IngelaAndin
Copy link
Contributor

IngelaAndin commented Sep 1, 2022

Ok so making a test case was a bit tricky, I now have one that does not fail with the fix and fails fairly often on maint, but not every time. I think it will have to do.

IngelaAndin added a commit to IngelaAndin/otp that referenced this issue Sep 2, 2022
If the "User" process, the process starting the TLS connection, gets
killed in the middle of spawning the dynamic connection tree make sure
we do not leave any processes behind.

Close erlang#6244
IngelaAndin added a commit to IngelaAndin/otp that referenced this issue Sep 2, 2022
If the "User" process, the process starting the TLS connection, gets
killed in the middle of spawning the dynamic connection tree make sure
we do not leave any processes behind.

Close erlang#6244
IngelaAndin added a commit to IngelaAndin/otp that referenced this issue Sep 6, 2022
If the "User" process, the process starting the TLS connection, gets
killed in the middle of spawning the dynamic connection tree make sure
we do not leave any processes behind.

Close erlang#6244
IngelaAndin added a commit that referenced this issue Sep 8, 2022
/OTP-18233

ssl: Avoid partial connection trees
IngelaAndin added a commit that referenced this issue Sep 8, 2022
If the "User" process, the process starting the TLS connection, gets
killed in the middle of spawning the dynamic connection tree make sure
we do not leave any processes behind.

Close #6244
IngelaAndin pushed a commit that referenced this issue Sep 13, 2022
… maint-24

* ingela/maint-24/ssl/dy-sup-leak/GH-6244/OTP-18233:
  ssl: Avoid partial connection trees
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug in progress team:PS Assigned to OTP team PS
Projects
None yet
Development

No branches or pull requests

3 participants