Excon's socket cache doesn't play well with chunked response callbacks #640
Comments
Thanks for the detailed report. Would it make sense to wrap the usage you describe in threads? Then we could take advantage of the thread-safe behavior, instead of needing to try and avoid it. If the other docker connection is in a separate thread, it should get its own socket. Does that make sense?

I realize that, strictly speaking, the usage you describe is not actually parallel and doesn't otherwise require threads (you in effect pause stream processing, do something, and then return to it), but I think this could potentially be an easy fix. The current behavior does not accommodate incomplete streaming, as you suggest, but it does tend to reuse sockets even if you don't reuse connections (which was the intention). Not saying we could never change it; it's more of a nice-to-have, since people who really want reuse can maintain and reuse connections. What do you think?
I don't think threads are necessary or desirable here. If you're dropping into a separate thread just to use the per-thread socket caching, that's quite error-prone (as someone who spends far too much time elbow-deep in mutexes and condition variables, I find it all quite headache-inducing). On the other hand, if you're thinking of using threads to multi-process chunks in parallel, that would be surprising and potentially undesirable behaviour to me if I were using Excon directly in a situation analogous to the way `docker-api` uses it.

I don't think the "reuse sockets, even if you don't reuse connections" behaviour is a good thing. If I have two separate instances of `Excon::Connection`, I wouldn't expect them to share a socket.

On the documentation front, I note that the README says that you must pass `:persistent => true` to get connection reuse.
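To make that surprise concrete, here is a small sketch (assuming the per-thread cache layout described in the issue text below, where sockets live under `Thread.current[:_excon_sockets]`; the host is a placeholder):

```ruby
require 'excon'

# Two distinct connection objects to the same destination, used in the same
# thread. The host is a placeholder; persistent: true keeps sockets around.
a = Excon.new('http://example.com', persistent: true)
b = Excon.new('http://example.com', persistent: true)

a.get(path: '/')
b.get(path: '/')

# With the per-thread cache being discussed, both connections end up using the
# one socket cached under this thread's entry for the destination, so the
# thread-local cache holds a single shared socket rather than one per object.
p Thread.current[:_excon_sockets]
```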
Thanks for the detailed feedback, I appreciate you taking the time to discuss further. I certainly understand your reticence around using threads; they can be a REAL pain. Sorry that the weirdness made things so difficult.

I think it was more a seemingly-not-so-bad coincidence that people would usually get socket reuse even if they were "doing it wrong", rather than a feature that was important or that we need to preserve. As you point out, the fact that different connections might point to the same socket is surprising (it just isn't something that, in practice, people are likely to run into). The docker case is unusual in that regard, in that one request is in progress and another one wants to be added in the same thread. It is a perfectly legitimate use case, just not one I had previously considered or encountered.

As for improving the caching, it seems we have some options.
Which is all to say, I'm coming around to your original suggestion (or at least something very like it), but I'd like to have some confidence it won't come back to haunt us. And I'm not sure how best to garner that confidence. What do you think?
I guess you'll "leak" either way, insofar as both threads and connection objects can die before the associated cache entries disappear, leaving hash entries that don't refer to anything useful and which will never (in a naive implementation) get cleaned up: a thread can exit while the connections it used live on, or a connection object can be garbage collected while the thread that used it keeps running.
Of the two, I think the one that'll happen less often is the existence of a single long-lived connection object which is passed around a lot of different threads. So, changing the line that sets up `Thread.current[:_excon_sockets]` so that the cache instead hangs off the connection and is keyed by the current thread would do the job well enough to be going on with. It's an improvement on what's there now, and probably leaks no more in real-world usage than the current implementation, insofar as there's no way I can see for cached socket entries to be cleaned up when the connection that created them goes out of scope, if the connection isn't closed or errors out before the object gets cleaned up -- but I'm not a Ruby GC expert, so perhaps there's some magic behind the scenes that triggers a cleanup code path.
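A minimal sketch of the proposed layout, with illustrative names (this is not Excon's actual source): the cache hangs off the connection object and, when thread safety is enabled, is further keyed by the current thread.

```ruby
# Illustrative sketch only; class, method, and variable names are made up
# rather than copied from Excon.
class SketchConnection
  def initialize(thread_safe_sockets: true)
    @thread_safe_sockets = thread_safe_sockets
    @socket_cache = {}
  end

  # The hash in which this connection's sockets are cached.
  def sockets
    if @thread_safe_sockets
      # Per connection *and* per thread: threads sharing this connection get
      # separate sockets, and separate connections in one thread no longer
      # share one. Entries for dead threads linger until the connection
      # itself is garbage collected -- the "leak" discussed above.
      @socket_cache[Thread.current.object_id] ||= {}
    else
      @socket_cache
    end
  end
end
```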
Yeah, that sounds pretty accurate. It may be that we could do some weird additional stuff to try and force explicit cleanup, but I'd rather not introduce that headache speculatively (i.e. sure, I'd investigate if it actually became a problem). Would you be up for helping with that small patch? I think as long as tests pass I feel relatively comfortable anyway. As you said, it seems unlikely to be much worse than the current state for GC, and if it really blows up somewhere we can always revert. Thanks for taking the time to work through this with me, I appreciate it.
Yep, I can whip up a PR.
Make thread-safe socket cache per-connection [Fixes #640]
Thanks!
This is the cause of the memory leak that we experienced recently.

Every Excon connection has a cache of sockets that ensures that a socket is only used by a specific thread during its lifetime. This means that one Excon connection might hold any number of open sockets to the same server. In our case, for example, if Puma is running with 4 threads and we have a connection pool of 4 Excon connections to the PVB API, there will be a maximum of 16 persistent HTTP connections.

The issue is that the implementation of the cache doesn't clear old sockets for threads that are no longer running (a problem when Puma is configured to autoscale threads). We inadvertently solved the memory leak by configuring Puma to use a fixed number of threads.

The implementation of this cache was changed just before we started having memory leaks, to fix a use case of Excon that we don't use. The old implementation didn't have a memory leak with our use case. Although this makes `PrisonVisits::Client` not thread-safe, that's fine in our use case because:

- we use a pool of connections, and a connection is only used by one thread at a time;
- it means that we don't hold sockets open unnecessarily for each Excon connection.

I've tested the value of this setting locally by load testing the server with concurrent requests:

- with the cache enabled and Puma autoscaling threads, it showed the memory leak when spacing the load tests one minute apart;
- with the cache disabled and Puma autoscaling threads, memory usage remained stable and there were no threading issues.

See the Excon [discussion](excon/excon#640) about the recent change to the cache, where this memory leak issue is mentioned.
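As a rough illustration of the setting described above (the URL is a placeholder, not the real PVB API endpoint):

```ruby
require 'excon'

# Illustrative only: give each pooled connection its own socket cache by
# turning off Excon's per-thread cache, since each connection is only ever
# used by one thread at a time.
connection = Excon.new(
  'https://api.example.com',
  persistent: true,
  thread_safe_sockets: false
)
```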
I'm using Excon via `docker-api`, and if you ask for a stream of events using that library, it provides the event stream in a chunked response and, for each event, it calls a user-provided block. In that block, you may want to query `dockerd` for details about the object on which the event was triggered.

Unfortunately, that additional request will completely stomp on the event stream response (that's still "in progress"), because Excon only ever uses a single fd for every connection to a specific destination (host/port, or -- in my case -- Unix socket). Even with separate `Excon::Connection` objects (which `Docker::Connection` creates for every request it makes, even on the same `Docker::Connection`), the problem persists because the default Excon behaviour is to use a per-thread socket cache, which shares sockets between objects in the same thread.

There is a workaround I've found that works for my specific case: turn off `thread_safe_sockets`. In that case, the socket cache is per-connection-object, rather than per-thread, and it works OK for my specific case. It isn't a general solution, however, because if you pass around a single Excon connection to multiple threads that you also use for chunked response processing, you need that per-thread cache separation. (I'm mildly amused that fixing one concurrency problem (multi-threaded access) breaks another sort of concurrency (callbacks), but then, I'm easily amused.)

I suspect that the solution is to make the thread-safe cache also per-connection; unfortunately, just tacking `[self.object_id]` onto `Thread.current[:_excon_sockets]` will end up bloating up processes with lots of dead, cached sockets. I think indexing `@_excon_sockets` by the thread ID (i.e. `@_excon_sockets[Thread.current.object_id][@socket_key]`) will work, but I'm not confident enough about that to submit a PR blind. If you think it's the right approach, I can put up the trivial PR for it.
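For illustration, here is a minimal sketch of the problem scenario written against Excon directly rather than through docker-api; the socket path and API paths are assumptions about a typical dockerd setup, not taken from the issue.

```ruby
require 'excon'
require 'json'

# Sketch of the scenario above: stream /events as a chunked response and,
# inside the per-chunk callback, make another request to the same daemon.
# With the default per-thread socket cache, the nested request can end up
# on the same socket as the still-in-progress event stream.
events  = Excon.new('unix:///', socket: '/var/run/docker.sock')
details = Excon.new('unix:///', socket: '/var/run/docker.sock')

streamer = lambda do |chunk, _remaining_bytes, _total_bytes|
  event = JSON.parse(chunk)
  # The nested request below is what stomps on the event stream.
  details.request(method: :get, path: "/containers/#{event['id']}/json")
end

events.request(method: :get, path: '/events', response_block: streamer)
```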