FluentCassandra Thrift transport throws exception after idling due to Cassandra connection closed #125

Closed
gs-jackal opened this Issue Mar 22, 2013 · 16 comments

Comments

Projects
None yet
5 participants
@gs-jackal
Contributor

gs-jackal commented Mar 22, 2013

Got the following exception when my application idles for a while without activity. Is there a way for FluentCassandra or exposed method to check the connection status? So perhaps I can do a refresh or reconnect? Thanks

[ERROR] FATAL UNHANDLED EXCEPTION: FluentCassandra.Operations.CassandraOperationException: Cannot read, Remote side has closed ---> FluentCassandra.Thrift.Transport.TTransportException: Cannot read, Remote side has closed
  at FluentCassandra.Thrift.Transport.TTransport.ReadAll (System.Byte[] buf, Int32 off, Int32 len) [0x00000] in <filename unknown>:0 
  at FluentCassandra.Thrift.Transport.TFramedTransport.ReadFrame () [0x00000] in <filename unknown>:0 
  at FluentCassandra.Thrift.Transport.TFramedTransport.Read (System.Byte[] buf, Int32 off, Int32 len) [0x00000] in <filename unknown>:0 
  at FluentCassandra.Thrift.Transport.TTransport.ReadAll (System.Byte[] buf, Int32 off, Int32 len) [0x00000] in <filename unknown>:0 
  at FluentCassandra.Thrift.Protocol.TBinaryProtocol.ReadAll (System.Byte[] buf, Int32 off, Int32 len) [0x00000] in <filename unknown>:0 
  at FluentCassandra.Thrift.Protocol.TBinaryProtocol.ReadI32 () [0x00000] in <filename unknown>:0 
  at FluentCassandra.Thrift.Protocol.TBinaryProtocol.ReadMessageBegin () [0x00000] in <filename unknown>:0 
  at FluentCassandra.Apache.Cassandra.Cassandra+Client.recv_execute_cql3_query () [0x00000] in <filename unknown>:0 
  at FluentCassandra.Apache.Cassandra.Cassandra+Client.execute_cql3_query (System.Byte[] query, Compression compression, ConsistencyLevel consistency) [0x00000] in <filename unknown>:0 
  at FluentCassandra.Operations.CassandraClientWrapper.execute_cql3_query (System.Byte[] query, Compression compression, ConsistencyLevel consistency) [0x00000] in <filename unknown>:0 
  at FluentCassandra.Operations.ExecuteCqlQuery.Execute () [0x00000] in <filename unknown>:0 
  at FluentCassandra.Operations.Operation`1[System.Collections.Generic.IEnumerable`1[FluentCassandra.Linq.ICqlRow]].TryExecute (IEnumerable`1& result) [0x00000] in <filename unknown>:0 
  --- End of inner exception stack trace ---
  at FluentCassandra.CassandraSession.ExecuteOperation[IEnumerable`1] (FluentCassandra.Operations.Operation`1 action, Nullable`1 throwOnError) [0x00000] in <filename unknown>:0 
  at FluentCassandra.CassandraContext.ExecuteOperation[IEnumerable`1] (FluentCassandra.Operations.Operation`1 action, Nullable`1 throwOnError) [0x00000] in <filename unknown>:0 
  at FluentCassandra.CassandraContext.ExecuteQuery (FluentCassandra.Types.UTF8Type cqlQuery, System.String cqlVersion) [0x00000] in <filename unknown>:0 
@nberardi

This comment has been minimized.

Show comment Hide comment
@nberardi

nberardi Mar 22, 2013

Contributor

Are you using a Pooled connection? Because it is designed to handle this,
if you just have one server, you are probably going through a Normal
connection which has a simple connection design.

Contributor

nberardi commented Mar 22, 2013

Are you using a Pooled connection? Because it is designed to handle this,
if you just have one server, you are probably going through a Normal
connection which has a simple connection design.

@gs-jackal

This comment has been minimized.

Show comment Hide comment
@gs-jackal

gs-jackal Mar 22, 2013

Contributor

Yes, I'm using a pooled connection already but still getting that error. I'm running two Cassandra nodes. If I purposely keep the connection active by running unnecessary CQL every few seconds then it won't happen. Only happens when I allow it to be idle for sometime

Contributor

gs-jackal commented Mar 22, 2013

Yes, I'm using a pooled connection already but still getting that error. I'm running two Cassandra nodes. If I purposely keep the connection active by running unnecessary CQL every few seconds then it won't happen. Only happens when I allow it to be idle for sometime

@ghost ghost assigned nberardi Apr 15, 2013

@sdether

This comment has been minimized.

Show comment Hide comment
@sdether

sdether Dec 13, 2013

Contributor

I think the problem is likely because GetClient uses _connection.IsOpen to determine whether the existing connnection can be used for building the CassandraClientWrapper. This in turn relies on TcpClient.Connected way down in TSocket. In my own RPC library MindTouch.Clacks I discovered the hard way that Connected is only useful for determining the connection has failed after a network operation has failed, i.e. it does not actively check and I had to resort to a rather more expensive way to test:

if(!_socket.Connected) {
  return false;
}
return !(_socket.Poll(10, SelectMode.SelectRead) && _socket.Available == 0);

Although TSocket is part of Thrift and we can't change that, we already use TSocketWithConnectTimeout instead anyhow to get around another TcpClient limitation, so i can put that code in there. I will try on a branch and see if it introduces any problems and if not I'll let you review the change to see if this is something we want to accept as the normal behavior.

Contributor

sdether commented Dec 13, 2013

I think the problem is likely because GetClient uses _connection.IsOpen to determine whether the existing connnection can be used for building the CassandraClientWrapper. This in turn relies on TcpClient.Connected way down in TSocket. In my own RPC library MindTouch.Clacks I discovered the hard way that Connected is only useful for determining the connection has failed after a network operation has failed, i.e. it does not actively check and I had to resort to a rather more expensive way to test:

if(!_socket.Connected) {
  return false;
}
return !(_socket.Poll(10, SelectMode.SelectRead) && _socket.Available == 0);

Although TSocket is part of Thrift and we can't change that, we already use TSocketWithConnectTimeout instead anyhow to get around another TcpClient limitation, so i can put that code in there. I will try on a branch and see if it introduces any problems and if not I'll let you review the change to see if this is something we want to accept as the normal behavior.

@nberardi

This comment has been minimized.

Show comment Hide comment
@nberardi

nberardi Dec 13, 2013

Contributor

Has anybody checked for an update on the Thrift library. It has probably been almost a year since it was last updated.

Contributor

nberardi commented Dec 13, 2013

Has anybody checked for an update on the Thrift library. It has probably been almost a year since it was last updated.

@sdether

This comment has been minimized.

Show comment Hide comment
@sdether

sdether Dec 13, 2013

Contributor

I just looked at the current source and it seems like there's been fixes. Their implementation of TSocket does include connect time out now, but still relies on Connected for IsOpen. If my change actually fixes this bug, I'll try to get it into the upstream Thrift and we can pick it up here via Thrift

Contributor

sdether commented Dec 13, 2013

I just looked at the current source and it seems like there's been fixes. Their implementation of TSocket does include connect time out now, but still relies on Connected for IsOpen. If my change actually fixes this bug, I'll try to get it into the upstream Thrift and we can pick it up here via Thrift

@nberardi

This comment has been minimized.

Show comment Hide comment
@nberardi

nberardi Dec 13, 2013

Contributor

Should probably also regenerate the Cassandra API for Thrift while doing it. It was probably last done about a year ago.

Contributor

nberardi commented Dec 13, 2013

Should probably also regenerate the Cassandra API for Thrift while doing it. It was probably last done about a year ago.

@igorberger

This comment has been minimized.

Show comment Hide comment
@igorberger

igorberger Dec 13, 2013

I don't know if this is the same problem, but we have a local patch to clean up the socket in case of errors. Unfortunately, it's inside Thrift. We've added "this.Close();" to TTransport.ReadAll():

public int ReadAll(byte[] buf, int off, int len)
{
int got = 0;
int ret = 0;

while (got < len)
{
    ret = Read(buf, off + got, len - got);
    if (ret <= 0)
    {
        this.Close();
        throw new TTransportException("Cannot read, Remote side has closed");
    }
    got += ret;
}

return got;

}

I don't know if this is the same problem, but we have a local patch to clean up the socket in case of errors. Unfortunately, it's inside Thrift. We've added "this.Close();" to TTransport.ReadAll():

public int ReadAll(byte[] buf, int off, int len)
{
int got = 0;
int ret = 0;

while (got < len)
{
    ret = Read(buf, off + got, len - got);
    if (ret <= 0)
    {
        this.Close();
        throw new TTransportException("Cannot read, Remote side has closed");
    }
    got += ret;
}

return got;

}

@sdether

This comment has been minimized.

Show comment Hide comment
@sdether

sdether Dec 14, 2013

Contributor

I made the connect change, which led to some other issues, because apparently the code path that would reconnect if IsOpen is false has never been exercised. I.e. if you try to re-open the socket on a socket that is marked Connected false, it'll just throw, so the client should really be thrown away and a new instance used.

However we wrap the TTransport in an IConnection, which keeps some state, so re-opening the socket transparently underneath will just lead to failure, since the IConnection won't set the Keyspace, since it's state says it's already done that.

The way i've solved this is that the PooledConnectionProvider now gets a connection from the pool and checks IsOpen. If that is false, it disposes the connection throws it out of the pool can recursively tries again. That solves the problem and the Session now recovers from an intermittent connection failure.

The only problem with this code is that IsOpen used to rely on Socket.Connected which was basically a free call (just reading a status field), while the proper implementation polls the socket, i.e. not free. And it does this now not only for every single command that is executed, but in the preparation to make that call we call IsOpen 5 times. So overall for people hitting the API hard, this is likely to be a noticeable change in performance.

Not sure what to do about it. The change to TSocket.IsOpen is clearly correct, since Socket.Connected just does not do what people assume it does, but the way IsOpen is used so freely would make this change in Thrift a problem. Any thoughts?

Contributor

sdether commented Dec 14, 2013

I made the connect change, which led to some other issues, because apparently the code path that would reconnect if IsOpen is false has never been exercised. I.e. if you try to re-open the socket on a socket that is marked Connected false, it'll just throw, so the client should really be thrown away and a new instance used.

However we wrap the TTransport in an IConnection, which keeps some state, so re-opening the socket transparently underneath will just lead to failure, since the IConnection won't set the Keyspace, since it's state says it's already done that.

The way i've solved this is that the PooledConnectionProvider now gets a connection from the pool and checks IsOpen. If that is false, it disposes the connection throws it out of the pool can recursively tries again. That solves the problem and the Session now recovers from an intermittent connection failure.

The only problem with this code is that IsOpen used to rely on Socket.Connected which was basically a free call (just reading a status field), while the proper implementation polls the socket, i.e. not free. And it does this now not only for every single command that is executed, but in the preparation to make that call we call IsOpen 5 times. So overall for people hitting the API hard, this is likely to be a noticeable change in performance.

Not sure what to do about it. The change to TSocket.IsOpen is clearly correct, since Socket.Connected just does not do what people assume it does, but the way IsOpen is used so freely would make this change in Thrift a problem. Any thoughts?

@sdether

This comment has been minimized.

Show comment Hide comment
@sdether

sdether Dec 14, 2013

Contributor

To put some numbers behind a noticeable change in performance is, i ran some benchmarks and against my dev instance: An insert currently takes an average of 0.61ms. After adding the socket poll to check the connection that rises to 3.2ms (5x slower), so clearly not acceptable.

Ideally I'd just check IsOpen when getting a connection from the pool, but that would be a major change, so instead I've opted to only pass the call down to the transport every so often -- using 1000ms, not sure how to make this configurable, since Connection is fairly deep down. With that change there is no performance decrease and I still have a fairly high chance of detecting a connection failure and recovering appropriately.

I'm going to put this code into our production use and see whether it really takes care of our intermittent connection problems before submitting a pull. This change is in the custom TSocketConnectionTimeout and I'm not sure if Thrift will accept my change since it does change the performance of IsOpen so drastically, but i'll try.

Contributor

sdether commented Dec 14, 2013

To put some numbers behind a noticeable change in performance is, i ran some benchmarks and against my dev instance: An insert currently takes an average of 0.61ms. After adding the socket poll to check the connection that rises to 3.2ms (5x slower), so clearly not acceptable.

Ideally I'd just check IsOpen when getting a connection from the pool, but that would be a major change, so instead I've opted to only pass the call down to the transport every so often -- using 1000ms, not sure how to make this configurable, since Connection is fairly deep down. With that change there is no performance decrease and I still have a fairly high chance of detecting a connection failure and recovering appropriately.

I'm going to put this code into our production use and see whether it really takes care of our intermittent connection problems before submitting a pull. This change is in the custom TSocketConnectionTimeout and I'm not sure if Thrift will accept my change since it does change the performance of IsOpen so drastically, but i'll try.

@Aaronontheweb

This comment has been minimized.

Show comment Hide comment
@Aaronontheweb

Aaronontheweb Jan 8, 2014

Contributor

@sdether how well has this fix been working for you in production thus far? Still considering a pull request?

Contributor

Aaronontheweb commented Jan 8, 2014

@sdether how well has this fix been working for you in production thus far? Still considering a pull request?

@sdether

This comment has been minimized.

Show comment Hide comment
@sdether

sdether Jan 8, 2014

Contributor

It's spent the holidays on our staging deployment and has been in production for the last week and we've not seen the connection errors occur in that time. I'm hoping to make some time to try create a pull request with the IsOpen change against Thrift, although i'm not optimistic about its acceptance, since as a drop-in replacement it changes performance behavior of that call rather drastically. So if that is rejected, i'l create a pull request against pull request with the change in our custom TSocketConnectionTimeout, which is how i'm running it at MindTouch right now.

Contributor

sdether commented Jan 8, 2014

It's spent the holidays on our staging deployment and has been in production for the last week and we've not seen the connection errors occur in that time. I'm hoping to make some time to try create a pull request with the IsOpen change against Thrift, although i'm not optimistic about its acceptance, since as a drop-in replacement it changes performance behavior of that call rather drastically. So if that is rejected, i'l create a pull request against pull request with the change in our custom TSocketConnectionTimeout, which is how i'm running it at MindTouch right now.

@Aaronontheweb

This comment has been minimized.

Show comment Hide comment
@Aaronontheweb

Aaronontheweb Jan 8, 2014

Contributor

@sdether I'll defer to @nberardi but I'd go for the customer TSocketConnectionTimeout changes instead given your concerns about performance.

Contributor

Aaronontheweb commented Jan 8, 2014

@sdether I'll defer to @nberardi but I'd go for the customer TSocketConnectionTimeout changes instead given your concerns about performance.

@nberardi

This comment has been minimized.

Show comment Hide comment
@nberardi

nberardi Jan 8, 2014

Contributor

Same here, I am alright with the custom version.

The only problem with modifying this is that it can be easily copied over
with a new version of Thrift. Any solution, other than remembering not to
do so?

Nick Berardi
(484) 302-0125
Sent on the go from my phone.

On Jan 8, 2014, at 5:43 PM, Aaron Stannard notifications@github.com wrote:

@sdether https://github.com/sdether I'll defer to
@nberardihttps://github.com/nberardibut I'd go for the customer
TSocketConnectionTimeout changes instead given
your concerns about performance.


Reply to this email directly or view it on
GitHubhttps://github.com/fluentcassandra/fluentcassandra/issues/125#issuecomment-31884280
.

Contributor

nberardi commented Jan 8, 2014

Same here, I am alright with the custom version.

The only problem with modifying this is that it can be easily copied over
with a new version of Thrift. Any solution, other than remembering not to
do so?

Nick Berardi
(484) 302-0125
Sent on the go from my phone.

On Jan 8, 2014, at 5:43 PM, Aaron Stannard notifications@github.com wrote:

@sdether https://github.com/sdether I'll defer to
@nberardihttps://github.com/nberardibut I'd go for the customer
TSocketConnectionTimeout changes instead given
your concerns about performance.


Reply to this email directly or view it on
GitHubhttps://github.com/fluentcassandra/fluentcassandra/issues/125#issuecomment-31884280
.

@Aaronontheweb

This comment has been minimized.

Show comment Hide comment
@Aaronontheweb

Aaronontheweb Jan 8, 2014

Contributor

@nberardi @sdether IMHO, we should regenerate the Thrift bindings first THEN add this. Given the movement towards the native protocol within the Cassandra project I doubt we're going to be seeing many major updates to the Thrift implementation soon.

Contributor

Aaronontheweb commented Jan 8, 2014

@nberardi @sdether IMHO, we should regenerate the Thrift bindings first THEN add this. Given the movement towards the native protocol within the Cassandra project I doubt we're going to be seeing many major updates to the Thrift implementation soon.

@nberardi

This comment has been minimized.

Show comment Hide comment
@nberardi

nberardi Jan 8, 2014

Contributor

Agreed.

Nick Berardi
(484) 302-0125
Sent on the go from my phone.

On Jan 8, 2014, at 5:58 PM, Aaron Stannard notifications@github.com wrote:

@nberardi https://github.com/nberardi
@sdetherhttps://github.com/sdetherIMHO, we should regenerate the
Thrift bindings first THEN add this. Given
the movement towards the native protocol within the Cassandra project I
doubt we're going to be seeing many major updates to the Thrift
implementation soon.


Reply to this email directly or view it on
GitHubhttps://github.com/fluentcassandra/fluentcassandra/issues/125#issuecomment-31885620
.

Contributor

nberardi commented Jan 8, 2014

Agreed.

Nick Berardi
(484) 302-0125
Sent on the go from my phone.

On Jan 8, 2014, at 5:58 PM, Aaron Stannard notifications@github.com wrote:

@nberardi https://github.com/nberardi
@sdetherhttps://github.com/sdetherIMHO, we should regenerate the
Thrift bindings first THEN add this. Given
the movement towards the native protocol within the Cassandra project I
doubt we're going to be seeing many major updates to the Thrift
implementation soon.


Reply to this email directly or view it on
GitHubhttps://github.com/fluentcassandra/fluentcassandra/issues/125#issuecomment-31885620
.

@sdether

This comment has been minimized.

Show comment Hide comment
@sdether

sdether Jan 8, 2014

Contributor

Currently the custom version is already living outside the Thrift source tree, so if we re-generate Thrift, the custom code will stick around. Main issue there is that the custom TSocket has to manually merge the Thrift changes to TSocket in. I will do this and provide a pull request after you guys pull the latest thrift

Contributor

sdether commented Jan 8, 2014

Currently the custom version is already living outside the Thrift source tree, so if we re-generate Thrift, the custom code will stick around. Main issue there is that the custom TSocket has to manually merge the Thrift changes to TSocket in. I will do this and provide a pull request after you guys pull the latest thrift

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment