Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Black Listing To Aggressive #51

Closed
edweip3 opened this Issue · 57 comments

7 participants

@edweip3

In the RoundRobinServerManager, if a connection cannot be made to a server, it is added to the blacklist. Once it's in the blacklist, it is there for the life time of RoundRobinServerManager, which is a problem if a connection pool is used, and the CassandraContext is held open for a long time.

@nberardi nberardi was assigned
@nberardi
Owner

Please mock up a change and submit a pull request this is probably a valuable addition to have.

@lhowlader

I am currently working on implementing the Circuit Breaker pattern along with unit tests. Once it is implemented and tested, we will not need to blacklist a server. I believe the Circuit Breaker pattern is relevant to this issue. Please feel free to comment if you have considered the pattern and have any insight on why it should not be used. Current plan is to be able to configure the number of attempts that trip the breaker. Also, confugrable min (millisec) and max (millsec) params that allows the retries (to see if Breaker in Open state can be Closed) progrssively between min and max. Once max is reached, the Circuit Breaker for the node will publish an event that can be subsribed to indicating that the node require special attention/intervention.

@nberardi
Owner

I think this will be a great addition. I never really liked the blacklist, so I am happy to hear somebody is trying to get rid of it.

@dlbromen

Update:
@Smoles and myself have taken over this functionality from lhowlander. There will probably be a lot of discussion about what we've done so please let me know if you'd like to move it to another issue or a different communication channel.

Some of the design principles:
1. Do not blacklist a server forever.
2. Blacklist on a configurable number of failures (not just one).
3. Automatically retry a server after a configurable amount of time. If it fails again, reblacklist. If it succeeds, re-enable it.

@dlbromen

Sorry, accidently hit submit...

  1. Don't open the floodgates when retrying a server. Just try a single connection
  2. When re-enabling a server, make it available for use (especially true when using Pooling=True)
  3. Don't consider everything a failure. Set of specific I/O related exceptions.

.. and there are probably more.

We've got something staged up, here: https://github.com/W3i/fluentcassandra/tree/circuit_breaker

Let us know how you would like to proceed. Thank you.

@ilushka85

Is this ready for prime time?

@eplowe

I've implemented something similar that I am currently using in production but also takes it a step further and classify's various exception types as retry/clienthealthy -- so while you might get a connection that's good from CassandraSession.GetClient(), it might go down between the time it was opened successfully and when a command is executed against it. If we have more than one server in the list try to move to another one up to a max retry count.

@ilushka85

eplowe: do you have the code you can submit for it? I;d like to take a look and possibly use it in our scenario as well.

@eplowe
@nberardi
Owner

If any of you guys want to send a pull request, I would be happy to take it. I know this is an issue that we need to get resolved soon. Eventually I would like to make the a ServiceManager that is ring aware so that the RoundRobin can be retired.

@tjake

I was also thinking of the circuit breaker pattern. I'll try out @dlbromen's patch

@ilushka85

eplowe did you get a chance to upload your changes?

@eplowe
@eplowe

https://github.com/eplowe/fluentcassandra/tree/blacklisting-enhancements

Let me know what you think. So far from my testing, it appears to be done what I expect. With that said, there may be some edge cases not accounted for.

@nberardi
Owner

Should we really be using RoundRobin anymore? It was created at a time when there were no mechanisms to describe the nodes in a cluster. But now, but we now have that ability with the describe_ring function.

https://github.com/managedfusion/fluentcassandra/blob/master/src/Operations/CassandraClientWrapper.cs#L154

So I guess my question is should we keep putting lipstick on this pig, or move to something that is better suited for Cassandra moving forward?

@eplowe

Agreed 100% --- but for now either dlbromens patch, or my patch can act as a stop-gap until this new mechanism is in place.

@eplowe

With that said, I'd be more than happy to work on an auto discovery mechanism.

@tjake

What does autodiscover have todo with round robin? Wouldn't you use autodiscover to get the list of servers to round robin? Also auto discover means you need to be sure you don't connect across DCs.

Some clients like Astynax try to connect to the one of the replicas so you cut down on coordinators hops. But even then you would still need some kind of round robin across replicas.

We've been running with @dlbromen's patch and so far so good.

@eplowe
@ilushka85

@eplowe @dlbromen @tjake Are you guys using connection pool in your environment? If yes are you seeing it properly release connections?

@eplowe
@ilushka85

@eplowe well the connections should be left open... the question is if they get returned properly. This is an issue we have had that I have commented on and had a slight but improper fix for this issue.

@eplowe
@ilushka85

heres a simple test if you debug this and say after the tenth time thru the loop look at the connection pool you will see ten connections are in use and not returned to freed.

var builder = new ConnectionBuilder(ConfigurationManager.AppSettings["ResultlyDataConnectionString"]);

    var db = new CassandraContext(builder);



    CassandraColumnFamily usersColumnFamily = new CassandraColumnFamily<UTF8Type>(db, "Users");


    for (int x = 0; x < 100; x++)
    {
        var rows = usersColumnFamily.Get().TakeKeys(25);
        string lastKey = rows.Last().Key;


        Console.WriteLine(lastKey);
    }
@eplowe
@ilushka85

@eplowe i posted a hack that fixes this in my other message:

My crude attempt at fixing this was inserting the following snippets in each of the operation function classes

var temp = output.GetEnumerator();
if (Session.WasDisposed)
{
Session.WasDisposed = false;
Session.Dispose();
}

Any better ideas on how to do the above? it has to do with how .net doesnt really execute code till later.

IN addition i noticed connections having an exception do not get returned to pool or closed in original code.

@dlbromen

I haven't had a chance to catch up with the recent activity here, but can you please take the Connection releasing discussion to issue #29. Thank you.

@ilushka85

also current round robin method and connection pool means that second and subsequent queries you execute are not hit on next server in round robin but on the same one open connection ... so if your code only requires the use of one connection and you are using a pool all executions will happen on that one server regardless of how many you have defined.

@eplowe
@ilushka85

@eplowe Ok. Waiting for your input there.

@ilushka85

@eplowe any thoughts on my comment above about your round robin replacement changes not taking into account using other servers when using a pool?

@eplowe
@ilushka85

@eplowe using your blacklisting enhancements with enough timeout exceptions using the pool the connections still do not get reset and all subsequent connections fail as pool is exhausted.

@eplowe
@ilushka85
@eplowe
@eplowe
@ilushka85

@eplowe I have not been able to reproduce since we corrected what was causing the timeout issue in the first place but will continue to test it.

We have found a bug in cassandra with secondary indexes just so you are aware that we reported to datastax who is working on a solution. Basically timeouts occur if you rely on secondary indexes and perform deletes on the column with the secondary index. If you delete enough of them a timeout will occur during querys against the secondary index while it skips over tombstones as they still exist in the secondary index. this is what lead to our timeouts.

In any case.... Were you able to make any more progress on your changes with auto discovery etc?

@nberardi
Owner

Any chance we can get a patch? Was hoping to wrap up the next release before Thanksgiving.

@eplowe
@nberardi
Owner

@eplowe thanks for the TimedOutException

RE: auto discovery, this functionality should be moved to another connection manager, since for the time being we still need to support the current methodology of server lists that are provided. Right now I am favoring if in the connection string the server is just a server:* then we will use auto discovery, anything else will be the old method. How does that sound?

@eplowe
@nberardi
Owner

You are right, I like your method better. I think auto discover should be turned on by default, what is your thoughts?

@eplowe
@eplowe
@nberardi
Owner

Did you send a pull request?

@eplowe
@nberardi
Owner

Thanks for the commit from issue #88, I think that closes the issue for now. At least until we open the code back up for #29, regarding auto searching of servers.

@nberardi nberardi closed this
@nberardi
Owner
@eplowe

Just looked at the refactoring -- You should stop the timer when you try to recover, and the recovery time IMO is too long. You could have a situation where you're having some network hiccups and given the amount of traffic to your application you blacklist all your servers and now you have 30 seconds until we attempt to check if they are back up or not. 30 seconds when you're getting getting hammered with traffic feels like years.

@nberardi
Owner
@eplowe

http://msdn.microsoft.com/en-us/library/yz1c7148%28v=vs.80%29.aspx

Timer.Change Method (Int32, Int32)

dueTime

The amount of time to delay before the invoking the callback method specified when the Timer was constructed, in milliseconds. Specify Timeout.Infinite to prevent the timer from restarting. Specify zero (0) to restart the timer immediately. 

period

The time interval between invocations of the callback method specified when the Timer was constructed, in milliseconds. Specify Timeout.Infinite to disable periodic signaling. 

Also, there is a slight bug -- the reason I was passing the two states on the exception was to know what to do based on the exception. But you are doing things like blacklisting a connection and retrying for TimedOutExceptions -- which is wrong, a timeout doesn't necessarily indicate an unhealthy client. Just that the particular command timed out. You're also not taking into account IOException and NotFoundException -- which are important. That is a pretty easy fix.

@eplowe

Also, I just noticed this as well, and when we do remove a server from the blacklist we need to add it back to _serverQueue.

@eplowe
@nberardi
Owner

I know what they say, but how do you think it actually work under the covers? It is essentially this.

while(true) {
    if (TickCount == CallbackTickCount)
        Callback();
}

I do realize this is an over simplification, but essentially this is how all timers work. Also on MSDN they do make this note on http://msdn.microsoft.com/en-us/library/system.threading.timer.aspx

System.Threading.Timer is a simple, lightweight timer that uses callback methods and is served by thread pool threads. It is not recommended for use with Windows Forms, because its callbacks do not occur on the user interface thread. System.Windows.Forms.Timer is a better choice for use with Windows Forms. For server-based timer functionality, you might consider using System.Timers.Timer, which raises events and has additional features.

Also here is the difference between System.Timers.Timer and System.Threading.Timer

http://stackoverflow.com/questions/1416803/system-timers-timer-vs-system-threading-timer

RE: IOException and NotFoundException

I didn't actually see you implement them anywhere, so I assumed it was a mistake.

https://github.com/eplowe/fluentcassandra/blob/de1be9790a4d27c77a98a692e10916d2be16625e/src/Operations/Operation.cs

Also it has been good to work with you, I just like to maintain some sort of order so the codebase doesn't get overrun with personal one-off changes that only benefit one entity.

@eplowe
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.