Connection Pool Not Releasing Connections #29
Comments
Closing of connections doesn't get called in commands, because there are valid reasons to leave them open to the database so multiple instructions can be sent back and forth, which is what pooled connections do. Can you show me another command that is closing or releasing the connection? If you want connections to be closed with each call to the server, you don't want to use a connection pool. Connection pools are designed to be intelligent about opening and closing connections. You want to use the normal connection provider, which opens and closes connections based on the session. See how the PooledConnectionProvider differs from the NormalConnectionProvider: https://github.com/managedfusion/fluentcassandra/blob/master/src/Connections/PooledConnectionProvider.cs
Thank you for responding. I didn't express clearly that the problem is not that the connection is not being closed, but rather that it is not being freed for reuse. Close is merely the name of the method in PooledConnectionProvider; Close() there does not actually close the connection, but rather moves it from _usedConnections to the _freeConnections queue. As I tried to show in the code above, for every call to family.Get(mykey).FirstOrDefault() two connections are added to _usedConnections and only one of them is moved back to _freeConnections. After 100 iterations, _usedConnections.Count = 100 and I get the message "No connection could be made, timed out trying to acquire a connection from the connection pool." Excuse me if I am doing something incorrectly. If you still think there is no problem, I will dig into the codebase deeper.
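To make the bookkeeping concrete, here is a toy pool with the same used/free structure (hypothetical names mirroring `_usedConnections` and `_freeConnections`; an illustration only, not FluentCassandra's actual code). One missed `Close()` per iteration permanently loses a slot and eventually exhausts the pool:

```csharp
using System;
using System.Collections.Generic;

// Toy connection pool, for illustration: Open() moves a connection to the
// used list, Close() returns it to the free queue without actually closing
// the socket. Any caller that never calls Close() leaks a slot.
class ToyPool
{
    private readonly Queue<int> _freeConnections = new Queue<int>();
    private readonly List<int> _usedConnections = new List<int>();

    public ToyPool(int size)
    {
        for (var i = 0; i < size; i++) _freeConnections.Enqueue(i);
    }

    public int UsedCount => _usedConnections.Count;

    public int Open()
    {
        if (_freeConnections.Count == 0)
            throw new TimeoutException("timed out trying to acquire a connection from the connection pool");
        var conn = _freeConnections.Dequeue();
        _usedConnections.Add(conn);
        return conn;
    }

    public void Close(int conn)
    {
        // "Close" here only releases the slot back to the free queue.
        _usedConnections.Remove(conn);
        _freeConnections.Enqueue(conn);
    }
}
```

With a pool of size N, opening two connections per iteration and closing only one exhausts the pool after N leaked opens, which matches the `_usedConnections.Count = 100` observation above.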
@kguner Either way you should dig in. Open source software is a collaborative process; it relies not only on people like you contributing issues, but also on digging in. FluentCassandra is like most open source projects, managed by one person. You seem to have an excellent grasp of the problem, so your guess would be as good as mine as to why Close() isn't being called. You can set a break point to see if the code is ever hit, but my guess is that it is, and the List.Remove() method isn't working correctly. Here is a little background to get you started:

1. When the IQueryable is enumerated, this method is called: https://github.com/managedfusion/fluentcassandra/blob/master/src/BaseCassandraColumnFamily.cs#L145
2. Then the command is executed: https://github.com/managedfusion/fluentcassandra/blob/master/src/BaseCassandraColumnFamily.cs#L112
3. Then the session is disposed of: https://github.com/managedfusion/fluentcassandra/blob/master/src/BaseCassandraColumnFamily.cs#L112
4. Then the connection is closed, as long as it isn't null: https://github.com/managedfusion/fluentcassandra/blob/master/src/CassandraSession.cs#L192

It is not that I don't want to track this down myself, I just don't have the time. Open source software such as FluentCassandra is driven by need and time, and without both, it doesn't get developed. It sounds like you have the need, so if you have the time to put in, it will be greatly appreciated.
Thank you for the breakdown. I will look into this when I get the chance.
I ran into the exact same problem. The problem is that in the CassandraSession class, the GetClient() function creates the _connection but returns _connection.Client, and hence a connection is leaked. Instead, the _connection should be created in ExecuteOperation. I've tested that this fixes the problem.
@edweip3 There are cases where GetClient is used and it doesn't go through ExecuteOperation. The connection should stick around as long as the session, regardless of whether one ExecuteOperation is called or 30 are. Also, as long as the CassandraSession is disposed of, the connection is Closed against the provider, and that should return it back to the pool. I am not quite sure how creating the connection in ExecuteOperation would solve this. Can you please share your code? I don't know if I buy that the connection being leaked through GetClient is causing the issue.
@nberardi I think I figured out what is wrong with the code. The problem is in the CassandraContext ExecuteOperation function. In the sample code given above by @kguner, CassandraContext is constructed from the ConnectionBuilder, not a CassandraSession, so in the CassandraContext.ExecuteOperation function we create a local session and then dispose it at the end of the function call. So by the end of this call from the sample code: dynamic c = family.Get(sequence).FirstOrDefault(); a local session was created and disposed. But on the next line: this causes a connection to be created, because with iterators the actual call to get the data (i.e. the Execute() function on an operation) is delayed until the data is accessed through the iterator. In most of the Execute() functions of an operation, Session.GetClient() is called, and hence the connection leak. Moving the instantiation of the _connection in CassandraSession from GetClient to ExecuteOperation helped because the instantiation of _connection is then no longer delayed for execution, so .NET knows how to free it properly when the time comes. Perhaps this isn't the best solution, but it is working for me.
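The deferral described here is easy to reproduce with nothing but an iterator method; a minimal self-contained sketch (toy types, not the actual FluentCassandra classes):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for a session whose Execute() is an iterator (illustration only).
class LazySession : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;

    public IEnumerable<int> Execute()
    {
        // Nothing below runs until the caller actually enumerates the result.
        if (Disposed)
            throw new ObjectDisposedException(nameof(LazySession));
        yield return 42; // real code would use the pooled connection here
    }
}

static class DeferredExecutionDemo
{
    public static string Run()
    {
        IEnumerable<int> rows;
        using (var session = new LazySession())
            rows = session.Execute(); // builds the iterator; no "server call" yet

        try
        {
            // The first MoveNext() happens here, after Dispose() already ran.
            return rows.First().ToString();
        }
        catch (ObjectDisposedException)
        {
            return "executed after dispose";
        }
    }
}
```

The body of `Execute()` only runs when the result is enumerated, which in the real library happens after `ExecuteOperation` has already disposed its local session, so any connection acquired at that point never gets released by the session's disposal.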
That makes total sense, I knew those iterators would bite me in the butt eventually. I guess the easy way is to just get rid of the iterators. Luckily they are only used in a few places.
It seems the reason the pool handling gets tricky is because

Of course, now there seems to be a memory leak and I die with an out-of-memory error after about 26k operations, so maybe I'm just shifting the problem around. I assume it's because I am keeping the context around for each worker thread and doing lots of operations, so there must be something holding on to references in the session that would normally get released on a session dispose.

UPDATE: Yep, it's mutation trackers. Not knowing their purpose, I don't know when or if they should be flushed. But since they're part of the Context, not the session, I would say it's unrelated, and I'll investigate and start a separate issue. So, unless there is specific reasoning why a context should not hold on to a session during its lifetime, my recommendation would be tying the session to the context, which should solve the iterator issue, since those would be invalid if called outside of a context anyhow.
Sorry to keep beating on this, but using a

In order to flush the mutators, I was running 100 threads that would create a new context after every 1000 records. This had a sustained throughput of about 1700 records/second (supercolumn appends), i.e. ~1-2 context dispose/creates per second, and I hit pool exhaustion again after about 200k records. I backed the context re-creation off to only after every 5000 records, and that time it got to 630k records but hit pool exhaustion once again. Looking at the pool, early on I saw that all 200 connections were in the used part of the pool, but it ran for quite a while after that, so it isn't that every session creation used and lost a connection.

UPDATE: It's getting late. I should do more testing before posting. When you provide a session to a context, the context considers it an outside session and does not dispose it, which means that the only way the connection gets released is by the finalizer getting hit. Since I was putting enough pressure on the GC, the finalizer did hit frequently, which is why I didn't exhaust the pool quickly, but not enough pressure to keep the pool from getting exhausted eventually. Looks like a per-context session is a viable solution (and quite a bit more efficient due to less lock contention and session churn). Here's how I've changed it locally for testing, and so far with 200 threads, each recreating the context after 1000 creates, for over 1M records I have not encountered any more pool issues:

```csharp
/// <summary>
///
/// </summary>
/// <param name="session"></param>
public CassandraContext(CassandraSession session)
    : this(session, true) { }

/// <summary>
///
/// </summary>
/// <param name="connectionBuilder"></param>
public CassandraContext(IConnectionBuilder connectionBuilder)
    : this(new CassandraSession(connectionBuilder), false) { }

private CassandraContext(CassandraSession session, bool isOutsideSession)
{
    ThrowErrors = true;
    _trackers = new List<IFluentMutationTracker>();
    _session = session;
    _isOutsideSession = isOutsideSession;
    ConnectionBuilder = _session.ConnectionBuilder;
    Keyspace = new CassandraKeyspace(ConnectionBuilder.Keyspace, this);
}
```

And in ExecuteOperation, I've removed the concept of a local session.
The pool obviously needs to be redesigned. I just haven't had much time to figure out a good solution.
If you don't have objections to using a session per context, I'd gladly clean up my changes and send you a pull.
Can you send your pull request? We'd gladly use it.
We already support a session per context. You just create a session and pass it into the context.
Correct, but then you are responsible for creating the session and passing it to the context and disposing of it afterwards. My local change abolishes the concept of a local session per execution and instead creates a session from the provided ConnectionBuilder at context creation, disposing it at context disposal (unless it was manually provided, which leaves the disposal responsibility with the caller). I'll clean up my change and let you decide whether you like it. It's served me well for my stress testing over the weekend.
I looked at your source. The problem with what you have done is that it leaves the connection open for possibly a very long time if somebody, say, puts it in a static, or more realistically a global variable, like in a desktop app. What I am doing here is no different than how other frameworks like EF function. I think the problem is with the connection pool. I am not really interested in modifying the behavior of Context, because if the pool released connections correctly under pressure, the current Context wouldn't have any problem.
I guess I view the CassandraContext as more analogous to getting a database connection from a pool, where once you have a connection you keep it until you release it. Considering that Cassandra traffic is generally lots of small, short calls, creating a session for each call seems like an undue burden, both performance-wise and GC-pressure-wise.
The CassandraSession is more analogous to getting a database connection from a pool; at least that was the way it was designed, because that is where the connection is stored. Secondly, CassandraContext was designed as a secondary add-on to CassandraSession, which provides all the niceties that make FluentCassandra easy to use. If I had to equate this to the SQL Server interface, I would say:

CassandraSession = SqlConnection
CassandraContext = DbContext

In that you can do anything you want with SqlConnection, but if you want all the nice add-ons such as LINQ and everything else that comes with the Entity Framework, you have to wrap SqlConnection with DbContext. The same is true with CassandraSession and CassandraContext: you can use CassandraSession and all the raw RPC calls against Cassandra without ever spinning up a CassandraContext, but if you want all the nice add-ons such as LINQ and everything else, you need to wrap it with CassandraContext. That is how I designed the two.
Ok, but even DbContext doesn't create a new SqlConnection per DB action. It either gets a connection or creates a LazyInternalContext with a LazyInternalConnection, i.e. once a connection is initialized for a DbContext, it stays with that one connection. I can certainly change my code to not initialize the session at context creation time but on first execute, if you prefer. It just seems that creating sessions per execute call is more trouble than it's worth, given the performance aspects and the problems with iterators accessing disposed sessions since they leak outside the execute scope.
Ok, pull submitted: #73. Once I went for lazy initializing, the ctor changes went away and the only difference is the removal of the local session from Execute.
You should check out EnsureConnection, because there are 4 cases where the connection is opened and closed. Before heading down this route, which changes functionality for the many users that have been using this library for 2+ years, we should try to determine why you feel this change is needed. The connection pool manages the state of the connection with the server, and sessions are basically free to set up, so whether we create one or 100 shouldn't matter as long as the connection is being returned to the pool efficiently. For non connection pools, the user always has the option to pass in a session. But the safe route is to create a new one rather than relying on the session and connection still being connected.
Ok, I take it all back. I was not aware that DbContext could transparently re-connect and even re-acquire a transaction, so I've been projecting my assumptions onto CassandraContext. It's not a philosophy I agree with, but that's just something I will have to come to terms with. What precipitated this proposed change was the issues I ran into. My initial goal with my Cassandra R&D was ensuring that it would meet our performance requirements, and instead of testing Cassandra, I spent most of a day troubleshooting why the library would run out of connections as soon as I applied any load to it, and the connection per execution stood out as the culprit. However, even if you fix the pool and address the issue of iterators, under load the lock contention on the pool does make session creation a noticeable overhead, so I would still suggest that CassandraContext at least move to an EnsureConnection approach of only creating a new session when an old one is bad. I can tackle that if you'd like.
I have just updated the Thrift library in version 1.1.9 to version 0.9.0, and they have made some nice strides to actually dispose of used connections after they are used, so your churn may increase. Here are the changes:

One thing that you should note is that I have not yet incorporated the Thrift
@ilushka85 https://github.com/eplowe/fluentcassandra/tree/connreleaseissue -- So, the change itself was rather simple; from my initial testing all seems well with the world. I removed the concept of a local session in CassandraContext. If ExecuteOperation runs and _session is null, it creates one (I am thinking this might be better to put in the constructor anyway). I added a CloseClient() method to CassandraSession, and in the finally block of CassandraContext::ExecuteOperation I call that method. Depending on the pool type being used, that will result in the connection being physically closed or the connection being released back into the pool. This allows a context that lives in, say, a static variable or a long-running loop to not hold on to the connection for the lifetime of the context.
OK. I see your point of view. I'll make some changes to keep in line with it.
I would need actual unit tests showing this vs the current system. I tried

> Well, the session is still technically local to the context, it's just that
Understood completely. I'll see if I can get the unit tests written tonight.
If we force a ToList() in the Execute method of the classes that inherit from QueryableColumnFamilyOperation, that will force eager loading instead of the expected lazy loading. The lazy loading causes us to end up in the situation where GetEnumerator() is deferred and CassandraContext::ExecuteOperation() cannot dispose the session properly, because its _connection field was still null on the first pass and CassandraSession::GetClient() is called later, with no way to call CassandraSession::Dispose() again. I've gone through it for a couple of hours now and all my tests are showing expected results. From what I see we would need to change 6 classes, e.g.:

```csharp
public override IEnumerable<FluentColumnFamily> Execute()
{
    return GetFamilies(ColumnFamily).ToList();
}
```

IMHO this would be the best route, as functionality doesn't change and we have business as normal. If we don't go that route, we need to explicitly call a CloseClient/Dispose method in those methods, and that would mean the session has effectively been disposed twice, which feels bad to me.
https://github.com/eplowe/fluentcassandra/tree/connreleaseissue -- I've pushed a new set of changes up. In my tests connections were properly released to the pool, unit tests still ran, and the local session is still a local session. I've also added a little tweak to memoize the enumerator on the first run and then use a cached version, so if you were doing something like calling Any() to see if anything existed and then enumerating afterwards, it won't make another call out to Cassandra. Let me know what you guys think.
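The memoization tweak can be sketched generically (a simplified, hypothetical helper, not the code in the branch): cache items as they are pulled from the source, so a second enumeration replays the cache instead of re-running the query:

```csharp
using System.Collections;
using System.Collections.Generic;

// Simplified sketch (not thread-safe): caches items as they are pulled so
// enumerating twice (e.g. Any() followed by a foreach) only runs the
// underlying query once.
class MemoizedEnumerable<T> : IEnumerable<T>
{
    private readonly IEnumerator<T> _source;
    private readonly List<T> _cache = new List<T>();
    private bool _exhausted;

    public MemoizedEnumerable(IEnumerable<T> source)
    {
        _source = source.GetEnumerator();
    }

    public IEnumerator<T> GetEnumerator()
    {
        var i = 0;
        while (true)
        {
            // Replay anything already cached before pulling from the source.
            if (i < _cache.Count) { yield return _cache[i++]; continue; }
            if (_exhausted || !_source.MoveNext()) { _exhausted = true; yield break; }
            _cache.Add(_source.Current);
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```

Wrapping an operation's result in something like this keeps lazy semantics for the first pass while guaranteeing that later passes never touch the connection again.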
@eplowe I listed an example fix in issue #82 that could be done by adding the following to all execute operations: `var temp = output.GetEnumerator();` How is this different? I think ideally we need a way to keep lazy loading and still properly return connections to the connection pool.
@ilushka85 A lot. The way you're doing it is simply trying to fix the

It also allows you to do some interesting things, e.g. caching the iterator

Now, we could implement a CloseClient() in CassandraSession, similar to
@eplowe I agree that your method has its benefits; I previously called mine just a hack to get things to work. But I still feel that both yours and mine have one key flaw, and that's the requirement of calling GetEnumerator when it's unintended. I think that to preserve the original and intended functionality, a call should not execute on the server unless an operation is done to cause it. Calling ToList() means we are forcing it.
I agree, but I also tend to believe that if you're requesting it, you want it.
Does anyone else want to chime in on this? Nick?
I have wrestled with this myself, and since these are RPC commands where the data is either returned or it isn't, the commands really shouldn't be lazy. If anything should be lazy, it should be the fluent objects such as Column, Record, or Family. I personally think all "yield return" should be removed from the RPC-based commands. And for the case of the upcoming binary CQL access, they should implement an async model. Let's leave the lazy loading to the fluent objects where necessary.
So based on what you said, that's what I am doing: converting all the RPC commands away from `yield return`.
@eplowe sounds good
@nberardi when do you plan to merge this into the main fork?
It was merged.
I didn't do a pull request for that branch. I can send one if you want.
I need one. It is a pain to do it without one. Plus, a pull request starts a discussion.
@eplowe any luck with sending the pull request? I am sending the current changes to NuGet on Dec 1.
We will need to see if pull request #90 fixes this issue. I would love to close this ticket.
Based on my testing everything seemed kosher. Then again, I just started digging my nails into the code base recently, so you might find a use case I didn't take into account. Btw, I am on the plane to Vegas for the AWS re:Invent conf. You going?
I wish, but no. :)
Closing because everything seems stable. Thanks for the work everybody.
We are still encountering issues where Cassandra is dropping / closing / timing out connections; after this happens many times, the connection pool will never return a connection despite the cluster being functional.
Some of the errors you listed will blacklist a server. The default recovery time is 30 seconds. If you get enough of those in that window, you could exhaust your pool. I would look into why Cassandra is dropping / closing / timing out connections at such a constant rate as to cause that to happen. You may need to tune your cluster/servers, scale out, etc.
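The blacklist mechanics being described can be sketched generically (hypothetical names, not the actual FluentCassandra implementation; the 30-second window matches the default mentioned above): a server that errors is removed from rotation and only becomes usable again after the recovery window, so a sustained burst of failures can leave nothing to hand out:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of a server blacklist with a recovery window.
class ServerBlacklist
{
    private readonly Dictionary<string, DateTime> _blacklisted = new Dictionary<string, DateTime>();
    private readonly TimeSpan _recovery;

    public ServerBlacklist(TimeSpan recovery)
    {
        _recovery = recovery;
    }

    // Record the moment a server produced a dropped/closed/timed-out connection.
    public void ReportFailure(string server, DateTime now) => _blacklisted[server] = now;

    // A blacklisted server becomes usable again only once the recovery
    // window has elapsed since its last reported failure.
    public IEnumerable<string> Usable(IEnumerable<string> servers, DateTime now) =>
        servers.Where(s => !_blacklisted.TryGetValue(s, out var t) || now - t >= _recovery);
}
```

If failures arrive faster than servers recover, `Usable` stays empty and every pool acquisition times out, which is the symptom being described. If the recovery check never runs at all, the list stays empty forever.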
@eplowe What happens is that a timeout occurs on one secondary index (we are trying to figure out why), and after this it corrects itself, but by the time it does, FluentCassandra will never grant a connection... it's like the servers never come off the blacklist. Basically, after we have the initial set of timeouts, the only errors we get are timeouts waiting for a connection from the pool.
I would step through it. Like I said, the default recovery timer is 30 seconds.
@eplowe I will try this, but it will be very hard to reproduce in our test environment. It appears the recovery method never runs, since we can leave the application running for 24 hours and the only errors that happen are that it can't get a connection from the pool. Restarting the app at any point then makes it work fine.
Are you running the most recent release of fluentcassandra?
Description: in PooledConnectionProvider.cs _usedConnections is being filled up (100) and not being released.
Usage (my code):
in MultiGetColumnFamilySlice.Execute() (comments added)