
RFE: New Balancer Type - Failover #754

Borkason opened this Issue Mar 24, 2013 · 8 comments

1 participant

Cherokee Project member

Original author: (November 26, 2010 09:25:26)

This is a request for enhancement of the balancers (currently Round Robin or IP Hash). It might also fit elsewhere in Cherokee, adjacent to the existing balancer pool management.

This new type is intended to provide a failover / backup server that fulfills requests only while the machine(s) in the backend pool are failing.

The idea here is rather simple.

We typically have this:

Cherokee (balancer) --> ColdFusion
or Cherokee (balancer) --> PHP

We tend to run single backends because our existing code base tracks sessions, log-ins, etc.

We want just-in-case balancing that kicks in ONLY when the main application server fails. Thus this new balancer type.

When a failure of the app server is detected, Cherokee would kick requests over to this failover machine for processing. Once the app server returns, Cherokee would resume sending traffic to the original main backend app server.
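To make the request concrete, a cherokee.conf fragment for such a balancer might look roughly like the sketch below. The key names, rule/source numbers, and the `failover` balancer identifier are illustrative assumptions for this RFE, not verified configuration syntax:

```
# Hypothetical sketch, not verified cherokee.conf syntax:
# route matching requests through a reverse proxy whose balancer
# uses source 1 and falls back to source 2 only while 1 is failing.
vserver!10!rule!100!handler = proxy
vserver!10!rule!100!handler!balancer = failover
vserver!10!rule!100!handler!balancer!source!1 = 1   # main app server
vserver!10!rule!100!handler!balancer!source!2 = 2   # backup server
source!1!type = host
source!1!host = 127.0.0.1:8000
source!2!type = host
source!2!host = 127.0.0.1:8001
```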

This is particularly important to us, and would be used as described above, for SEARCH SPIDERS. They pummel us with many requests per second, can burst into the 20-50 requests-per-second range, and sustain that for hours, at times in excess of 24 hours. Eventually the app server dedicated to these spiders falls behind and fails, which creates a cascading failure. Sometimes it recovers gracefully a minute or two later (a mini outage); other times it doesn't, and that creates a large outage/downtime for these requests.

Our workaround outside of Cherokee is to run pen or a similar balancing product on the other side of Cherokee, detect the failure there, and have it pipe the requests elsewhere. It's a pain to manage, monitor, and document for the future. The fewer parts, the better.

If someone is willing to build this into Cherokee, I'll offer a bounty to pay for it, at least in part. I'm unsure of the complexity/time investment.

Feel free to contact me via email to discuss/clarify.


Cherokee Project member

From on November 27, 2010 20:43:26
You can limit the rate at which most search engines hit your site by putting a crawl-delay in your robots.txt file:

User-agent: *
Crawl-delay: 10

would limit it to 1 hit every 10 seconds

Cherokee Project member

From on November 29, 2010 05:30:53
The Robots specification is sooooo loosely adhered to that it's useless today.

We have lots of URL prefixes like, etc., which spiders view as different sites per se.

As traffic grows, there is always a need to deal with exponential growth/burst situations.

The other nice thing about isolating the spiders as I've described is that the heavy overhead of their not retaining any session data from request to request is confined to machine(s) where we can trim the sessions at a much more rapid rate to recoup memory. Mixing spiders with real users in this scenario is a very bad idea: spiders make page generation times balloon, and can create a working backlog and a proactive application failure stoppage.

Cherokee Project member

From alobbs on December 04, 2010 16:27:36
Implemented it. Could you please give it a try?

Cherokee Project member

From alobbs on December 22, 2010 09:04:01
There has been no feedback on this Request for Enhancement in almost 20 days. I assume it works fine.

Cherokee Project member

From on December 22, 2010 09:45:02
I am sorry, I didn't notice that feature.

Thanks for the note on this.

I'll get the latest source compiled today and experiment with it. Exciting!

Cherokee Project member

From on December 22, 2010 12:53:55
I am up and running with the latest trunk version now, and I've changed the sources to use the failover type.

If I have 2 sources, does failover stick everyone on the first one, with the second used only when the first is determined to have failed? What happens if we have, say, three or more sources in failover?

Also, what is deemed a failure at present? Is there underlying detection of standard webserver error codes, failure to respond within a certain time, etc.?

Cherokee Project member

From on December 22, 2010 14:55:37
AFAICT, failure == failure to connect

Regarding the behavior:

Cherokee Project member

From on December 22, 2010 17:42:57
Yes, a failure is a failure to connect. The response is not analyzed as long as there is an actual response.

Cherokee will always use the first available option in the sources list. If the first one is down, it uses the second until the first backend server is back on-line. If the second fails as well, it uses the third, and so on, and so forth.
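The selection rule described above can be sketched in a few lines of Python. This is an illustration of the semantics (first source in the list that accepts a connection wins; failure means failure to connect), not Cherokee's actual C implementation, and the function and parameter names are made up:

```python
import socket

def pick_source(sources, timeout=1.0):
    """Return the first (host, port) pair that accepts a TCP
    connection, mirroring the failover balancer's ordering rule."""
    for host, port in sources:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)   # first healthy source wins
        except OSError:
            continue                  # failure == failure to connect
    return None                       # every source is down
```

Once the first source is reachable again, the same scan naturally routes traffic back to it, which matches the "resume sending traffic to the original backend" behavior requested in the issue.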

@Borkason Borkason closed this Mar 24, 2013