Consul health check is ok, but fabio cannot connect to upstream #393
I'm not sure I understand your problem. fabio needs to be able to connect to a consul agent, which is usually localhost:8500.
fabio generates three routes through consul, but the network is broken between one of those route targets and fabio's host. fabio still sends some requests to that route until it hits the http proxy timeout error. How can I deal with this situation?
You're right, there is no control over this behavior right now. fabio assumes that the consul cluster is vital and won't drop routes on consul outages. It cannot reasonably infer what the right answer to this situation is and therefore does nothing. However, I get your point and it may make sense to add an option to control this behavior while leaving the default at the current behavior. Then you could choose to have fabio drop the routes after a certain amount of time when a connection loss occurs.
The current workaround would be to detect that the connection to consul is gone and restart fabio. However, if fabio cannot connect to consul at all it won't start the listeners, so you won't even get a response.
You could implement it like nginx does with its max_fails parameter: add a max_failed parameter, and when fabio has failed max_failed times for a route, delete the route.
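A minimal sketch of that idea in Go (fabio's language); the Tracker type, the threshold handling and the "drop the route at the threshold" decision are illustrative assumptions, not fabio's actual implementation:

```go
// failtrack.go - hypothetical nginx-style "max_fails" counter: after a
// configured number of consecutive proxy failures a target's route is
// considered a candidate for removal.
package failtrack

import "sync"

// Tracker counts consecutive proxy failures per upstream target.
type Tracker struct {
	mu       sync.Mutex
	maxFails int            // assumed threshold, e.g. 3
	fails    map[string]int // consecutive failures keyed by "host:port"
}

func New(maxFails int) *Tracker {
	return &Tracker{maxFails: maxFails, fails: map[string]int{}}
}

// Fail records a failed request to target and reports whether the
// threshold was reached, i.e. whether the route should be deleted.
func (t *Tracker) Fail(target string) (drop bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.fails[target]++
	return t.fails[target] >= t.maxFails
}

// OK resets the counter after a successful request.
func (t *Tracker) OK(target string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.fails, target)
}
```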
OK, I'll have a look. Out of curiosity and since you are the first one to request this feature: How come the network between fabio and your consul cluster is broken for extended periods of time? I realize that everything can break but usually when your consul network connectivity is broken you have bigger things to worry about.
I can imagine a similar but maybe slightly different situation where fabio and the consul cluster it is attached to are healthy. The consul health checks are passing because the agents responsible for those checks can reach the services, but fabio cannot reach those services. Something like:

fabio (10.1.0.12) -> consul localhost:8500

The consul agent nodes are communicating properly with the server nodes and updating the catalog health data, but fabio on 10.1.0.12 cannot route directly to the service running on 172.16.0.22 for whatever reason. I believe that situation would result in fabio timing out when trying to route traffic to that service. I have no idea how that situation should be handled or if that is even fabio's responsibility.
@leprechau I have the same problem as you. For now I have written a python script to check that the network is ok. If it isn't, I delete the route through my script (route del command).
Hmm, that sounds like a case for a circuit breaker. I think fabio should do that to protect itself.
Just wanted to chime in here that this would be a very important feature for our environment; we really like the ability to frequently replace our ec2 infrastructure on a rolling basis. Right now, we have a management-type VPC in which the consul servers live, but the machines where fabio runs are our "staging" and "production" app servers.

We see regular issues when we're doing rolling replacement of our "management" VPC ec2 instances, thus knocking out consul, and, concurrently, we're doing rolling replacements of either "staging" or "production" app servers. Since fabio's sole knowledge of node up/down status comes from consul, if the consul connection is interrupted AND one of the "neighbor" machines is also down, fabio cheerfully continues to route traffic to a downed instance.

Maybe this wouldn't be such a big deal if I knew how to make traffic coming into fabio only get passed to the upstream endpoint running on the same host as fabio... but I don't know how to do that 😄 Instead, our traffic flows are ALB -> fabio_listener -> one-of-several app servers that are members of the relevant consul service. So, yeah, consul goes away, and app server "B" is being switched out... traffic coming into fabio on app server "C" or "A" will get routed to "B" 33% of the time.
So the issue is that according to consul fabio can route traffic to a service but then discovers that it can't for whatever reason.
A narrow view of the problem would only look at "connection refused" or "i/o timeout" errors caused by faulty network configuration. This would be a protection mechanism for fabio, since Consul should reflect the true state of the cluster. Ideally, a circuit breaker for this failure should be per destination.

The challenge is where to store and maintain the circuit breakers. A simple approach would be to have a map in the http proxy and store a CB per destination. Another approach would be to attach the CBs to the routes themselves.

I think the right approach is to automate what @beyondkmp is doing, which is generating route del commands. In that case the http/tcp/udp proxy could just notify the CB monitor of failures, which would maintain the state and generate routing table updates which are then fed back into fabio. Since network issues could be local to a fabio instance, I would not try to sync them across the fabio cluster.
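A rough sketch of that per-destination map approach in Go; the Monitor type, its channels and the trip/recover logic are assumptions drawn from this comment, not the code that was eventually written:

```go
// breaker.go - hypothetical per-destination circuit breakers kept in a map.
// When a breaker trips, the destination is reported so its routes can be
// removed from the routing table; when it recovers, the routes come back.
package breaker

import "sync"

type state int

const (
	closed state = iota // destination is healthy, traffic flows
	open                // destination is failing, its routes should be dropped
)

type breaker struct {
	state    state
	failures int
}

type Monitor struct {
	mu        sync.Mutex
	maxFails  int // assumed threshold: consecutive failures before tripping
	breakers  map[string]*breaker
	Tripped   chan string // destinations whose routes should be deleted
	Recovered chan string // destinations whose routes can be restored
}

func NewMonitor(maxFails int) *Monitor {
	return &Monitor{
		maxFails:  maxFails,
		breakers:  map[string]*breaker{},
		Tripped:   make(chan string, 16),
		Recovered: make(chan string, 16),
	}
}

// Fail is called by the proxy on "connection refused" or "i/o timeout"
// errors for dst ("host:port") and trips the breaker after maxFails errors.
// A monitor goroutine is assumed to drain Tripped/Recovered and turn them
// into routing table updates.
func (m *Monitor) Fail(dst string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	b := m.breakers[dst]
	if b == nil {
		b = &breaker{}
		m.breakers[dst] = b
	}
	b.failures++
	if b.state == closed && b.failures >= m.maxFails {
		b.state = open
		m.Tripped <- dst
	}
}

// OK is called after a successful request and closes an open breaker again.
func (m *Monitor) OK(dst string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	b := m.breakers[dst]
	if b == nil {
		return
	}
	if b.state == open {
		m.Recovered <- dst
	}
	b.state = closed
	b.failures = 0
}
```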
This patch adds a circuit breaker monitor which issues route table fragments to delete routes for which the circuit breaker is open. The current implementation supports only the HTTP path. TCP and WS will follow. The integration test isn't testing the behavior automatically yet but can be used to eyeball the behavior. This needs to be improved. Also missing is the code to GC the circuit breakers that aren't tripped over time but that should be easy to implement in this code. Fixes #393
I've started working on this with a single-threaded circuit breaker monitor which gets OK/FAIL events from the proxies. The monitor will then issue a routing table fragment which deletes all routes that match the breaker. Once a breaker becomes ready again the fragment is dropped and the routes are restored.

The config language was updated to support deleting all routes for a given destination, e.g.:
Example
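A hypothetical fragment of the kind the monitor might issue (the service name, path and address are made up, and the exact syntax of the new destination-wide delete is an assumption, not confirmed in this thread):

```
# hypothetical routing table fragment from the circuit breaker monitor:
# remove the route(s) that point at the unreachable destination
route del myservice /myservice http://172.16.0.22:8080/
```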
With this approach the proxy, the routing table and the circuit breakers are decoupled and can be updated independently. The monitor needs to be integrated into the TCP path and the WebSocket path and there needs to be GC of OK breakers which are no longer used. The trickier part is to write a decent integration test. Please let me know if you have feedback.
This looks great, Frank! This certainly addresses our situation of a Consul that may "go away" with concurrent proxy outages, but also handles spurious network disconnections between the monitor and proxies. Would it make sense to have the ok/fail thresholds configurable for the monitor? Also, I'm assuming "once a breaker becomes ready again" means that we've gotten updated information from Consul?
I will make the breakers configurable. As usual there is the challenge of making this configurable per route or globally, but since we're talking about basic networking errors I will start with a global config value. The breakers are independent of Consul, since the idea is that Consul knows the state of the cluster. Once Consul becomes available again fabio will already pick up the new routing table, but it may take some time for the breakers to recover.
Hello, I have the same issue here: fabio (container), the API (containers) and the Consul agent (process) are on the same host. Health checks are done via docker-exec health checks from the agent to the containers. fabio and the API use the same overlay network.

Sometimes we have network issues on the overlay and fabio cannot access some containers. For example, we have 4 instances of our containers and 2 cannot be reached by fabio. We can see errors in the log ("no backend" errors etc.). If we try a ping or curl from the fabio container to the API we get the same error. BUT the consul health check is OK, because it is a docker exec that runs a script in the container itself (like a curl localhost/health).

So if the consul health check is OK but the network fails between fabio and the containers, the route stays the same and fabio keeps sending traffic to a backend that cannot be reached. If fabio could blacklist backends that cannot be reached, even when the health check is OK, that would be a very important feature for application availability.
Consul and fabio are not on the same host. The consul check for the upstream passes, but fabio cannot connect to the upstream because the network between fabio and the upstream is broken. Can fabio check the upstream itself?