Potential process leakage of tls_dyn_connection_sup processes #6244
Comments
Would https://github.com/IngelaAndin/otp/pull/new/ingela/ssl/dy-sup-leak/GH-6244 make the leak go away? |
Hi @IngelaAndin, I don't think so. :( The issue is caused by process links, so in your branch, the link can still break at any moment before the The solution I had in mind was something like:
But, as per above, I am worried |
Having one child could be a valid case as the receiver process can linger if it has received data that has not yet been read by the controlling process. Having no children except for a moment exactly after the dynamic supervisor is spawned seems strange. None of the processes will be linked to the process executing the start_fsm code. The sender process will be very idle until the handshake has been completed so the risk that it could crash before the receiver process started is also highly unlikely. I can think of one scenario that could cause the line None of the ssl processes will wait for the handshake to complete to return from their init functions and which_children is When you see only one child alive, do you know if it is a sender child or a receiver child? |
Hi @IngelaAndin ! Jose has been helping us track down this bug within our system. I just checked the children of our |
My understanding is that the receiver process monitors the process that calls start_fsm. So if we guarantee the receiver starts alongside the sender, then we guarantee the monitoring happens, and that the receiver will eventually shutdown and bring the sender and its parent down. Or have I misunderstood the monitor call? :) |
@josevalim I think you got the monitoring part right :) What I do not understand is how the receiver process fails to be linked to the supervisor. Exactly which version of OTP are you running? This sounds like something that could have been fixed by 0598f76 |
We are running |
Well ok, then that should not be the problem! What log_level do you have? Would it be possible to turn on info log_level for ssl on a system where this happens? |
@IngelaAndin what I believe is happening is that there is another process, linked to the current one, and this other process terminates. When it terminates, it causes the current process to immediately exit. If it immediately exits after this line:
Then neither the sender nor the receiver process is started, and the supervisor is left hanging. The same happens if it exits right after the sender starts: there is no receiver, and therefore both the supervisor and the sender leak forever. The issue is not that the receiver doesn’t link, but rather that it never starts! :) So my proposal is to start all three processes atomically, under the same start_child call. |
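To make the "start everything atomically" idea concrete, here is a minimal sketch. It is not the code from the ssl application or from the eventual fix: the module name conn_tree_sketch, the placeholder sender and receiver processes, and the choice of supervisor flags are all assumptions made for illustration. It relies only on the documented supervisor behaviour, including the auto_shutdown flag and significant children available since OTP 24.

```erlang
-module(conn_tree_sketch).
-behaviour(supervisor).

%% Hypothetical module: a stand-in for a per-connection supervisor.
-export([start_link/1, pids/1, init/1]).
-export([sender_start/0, receiver_start/1]).

%% Starts the supervisor together with both children in one step.
%% If any child fails to start, the whole start fails and nothing is left behind.
start_link(User) ->
    supervisor:start_link(?MODULE, User).

init(User) ->
    %% any_significant: the supervisor shuts itself down as soon as
    %% one significant child terminates.
    SupFlags = #{strategy => one_for_one,
                 intensity => 0,
                 auto_shutdown => any_significant},
    Sender = #{id => sender,
               start => {?MODULE, sender_start, []},
               restart => temporary,
               significant => true},
    Receiver = #{id => receiver,
                 start => {?MODULE, receiver_start, [User]},
                 restart => temporary,
                 significant => true},
    {ok, {SupFlags, [Sender, Receiver]}}.

%% Looks up the child pids after the fact, as suggested in the issue.
pids(Sup) ->
    Children = supervisor:which_children(Sup),
    {sender, Sender, _, _} = lists:keyfind(sender, 1, Children),
    {receiver, Receiver, _, _} = lists:keyfind(receiver, 1, Children),
    {Sender, Receiver}.

%% Placeholder sender: idles until told to stop or shut down by the supervisor.
sender_start() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

%% Placeholder receiver: monitors the user process and exits when it goes down,
%% which in turn triggers the supervisor's automatic shutdown.
receiver_start(User) ->
    {ok, spawn_link(fun() ->
                            Ref = erlang:monitor(process, User),
                            receive
                                {'DOWN', Ref, process, _Pid, _Reason} -> ok
                            end
                    end)}.
```

Because both children are declared in init/1, there is no window in which the supervisor exists with zero or one child. Note that start_link/1 links the supervisor to its caller, so in a real tree it would be started under a parent supervisor rather than directly from the user process.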
We are actually already running with log_level info. This is an Elixir application, configured with
config :logger, level: :info
and on a remote shell:
iex(app@10.4.235.41)1> Logger.level()
:info
I do not see any log messages related to the |
@josevalim Oh, I see. Let me think some more about that! @sneako thanks for the info. |
If the "User" process, the process starting the TLS connection, gets killed in the middle of spawning the dynamic connection tree make sure we do not leave any processes behind. Close erlang#6244
@josevalim I force pushed a new solution suggestion to ingela/ssl/dy-sup-leak/GH-6244 What do you think? |
@IngelaAndin perfect :) Btw, can any of the child process return |
Humm ... well sender process init should never crash but receiver init is more complex. I will polish the solution a little and make a PR (hopefully) tomorrow. |
Ok, so making a test case was a bit tricky. I now have one that does not fail with the fix and fails fairly often on maint, but not every time. I think it will have to do. |
… maint-24 * ingela/maint-24/ssl/dy-sup-leak/GH-6244/OTP-18233: ssl: Avoid partial connection trees
Describe the bug
I believe there is a race condition in the ssl application that can lead to a leak of tls_dyn_connection_sup processes.
We are assuming this because, in a high-performance system, we are retrieving all instances of tls_dyn_connection_sup and some of them have either no children or 1 child. One hour later, we still see the same number of leaked processes with no or 1 child.
If someone calls ssl:connect, the following call trace might happen:
tls_gen_connection:start_fsm has the following code:
I assume the issue is with the code above: if the current process crashes for any reason (such as a broken process link), it can fail before tls_dyn_connection_sup:start_child(DynSup, receiver, ...) is invoked. My understanding is that the receiver child (the second one) is the one monitoring the user process and therefore is the one responsible for triggering the significant cleanup in the supervisor. However, if the child process is never started due to a broken link, both tls_dyn_connection_sup and tls_sender will leak (and we see both leaking).
To Reproduce
Unfortunately I do not have an easy way to reproduce this issue, due to the race, but I believe the call trace above is reasonable?
Potential solutions
Feel free to ignore this section. :)
One of the possible solutions is to change tls_dyn_connection_sup to start both children as part of its initialization and then call supervisor:which_children to retrieve their PIDs. A potential downside of this approach is that tls_dyn_connection_sup takes longer to start, increasing the odds of tls_connection_sup itself becoming a bottleneck? It may be necessary to make tls_connection_sup a pool of processes that are randomly routed to.
Affected versions
OTP 25.
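The check described under "Describe the bug" (listing per-connection supervisors that have fewer than two children) could be sketched roughly as follows. This is a diagnostic aid, not code from the report: the module name dyn_sup_check is hypothetical, and it assumes the parent supervisor's children are themselves declared with type supervisor. It uses only supervisor:which_children/1 and supervisor:count_children/1.

```erlang
-module(dyn_sup_check).

%% Hypothetical helper module for spotting partially populated supervisors.
-export([active_children/1, suspects/1]).

%% Number of live children of supervisor Sup, or 'gone' if Sup has
%% already exited by the time it is asked.
active_children(Sup) ->
    try supervisor:count_children(Sup) of
        Counts -> proplists:get_value(active, Counts)
    catch
        exit:_ -> gone
    end.

%% Returns {Pid, ActiveCount} for every child supervisor of ParentSup
%% that currently has fewer than two live children.
%% Assumption: the children of interest have child spec type 'supervisor'.
suspects(ParentSup) ->
    [{Pid, N} || {_Id, Pid, supervisor, _Mods} <- supervisor:which_children(ParentSup),
                 is_pid(Pid),
                 N <- [active_children(Pid)],
                 is_integer(N),
                 N < 2].
```

Assuming tls_connection_sup is a locally registered supervisor, calling dyn_sup_check:suspects(tls_connection_sup) twice, some time apart, and comparing the pids would show which entries persist. A supervisor seen only once may simply be in the middle of starting or stopping.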