Consul health check is ok, but fabio cannot connect to upstream #393

Open
beyondkmp opened this issue Nov 23, 2017 · 16 comments

Comments

@beyondkmp

beyondkmp commented Nov 23, 2017

Consul and fabio are not on the same host. The consul health check for the upstream passes, but fabio cannot connect to the upstream because the network between fabio and the upstream is broken. Can fabio check the upstream itself?

@magiconair
Contributor

I'm not sure I understand your problem. fabio needs to be able to connect to a consul agent which is usually localhost:8500. Maybe you can elaborate?

@beyondkmp
Author

Fabio has generated three routes through consul, but the network is broken between one of those route targets and fabio's host. Fabio still sends requests to that target until it hits the http proxy timeout error. How can I deal with this situation?

@magiconair
Contributor

You're right, there is no control over this behavior right now. Fabio assumes that the consul cluster is vital and won't drop routes on consul outages. It cannot reasonably infer what the right answer to this situation is, and therefore it does nothing. However, I get your point and it may make sense to add an option to control this behavior, leaving the default at the current behavior. Then you could choose to have fabio drop the routes after a certain amount of time when a connection loss occurs.

@magiconair
Contributor

The current workaround would be to detect that the connection to consul is gone and restart fabio. However, if fabio cannot connect to consul at all it won't start the listeners, so you won't even get a 404 Not Found.

@beyondkmp
Author

You could implement it like nginx and add a max_failed parameter. Then, once fabio has failed to reach a target about max_failed times, delete the route.

@magiconair
Contributor

OK, I'll have a look. Out of curiosity and since you are the first one to request this feature: How come the network between fabio and your consul cluster is broken for extended periods of time? I realize that everything can break but usually when your consul network connectivity is broken you have bigger things to worry about.

@aaronhurt
Member

aaronhurt commented Nov 27, 2017

I can imagine a similar but maybe slightly different situation where fabio and the consul cluster it is attached to are healthy. The consul health checks are passing because the agents responsible for those checks can reach the services but fabio cannot reach those services.

Something like:

fabio (10.1.0.12) -> consul localhost:8500
service (172.16.0.22) -> consul localhost:8500

The consul agent nodes are communicating properly with the server nodes and updating the catalog health data but fabio on 10.1.0.12 cannot route directly to the service running on 172.16.0.22 for whatever reason.

I believe that situation would result in fabio timing out when trying to route traffic to that service. I have no idea how that situation should be handled or if that is even fabio's responsibility.

@beyondkmp
Author

@leprechau I have the same problem as you. For now I've written a python script that checks whether the network is ok; if it isn't, the script deletes the route (with a route del command).

@magiconair
Contributor

Hmm, that sounds like a case for a circuit breaker. I think fabio should do that to protect itself.

@jcomeaux

Just wanted to chime in here that this would be a very important feature for our environment; we really like the ability to frequently replace our ec2 infrastructure on a rolling basis. Right now, we have a management-type VPC in which the consul servers live, but fabio runs on our "staging" and "production" app servers. We see regular issues when we're doing rolling replacement of our "management" VPC ec2 instances (thus knocking out consul) while, concurrently, we're doing rolling replacements of either "staging" or "production" app servers. Since fabio's sole knowledge of node up/down status comes from consul, if the consul connection is interrupted AND one of the "neighbor" machines is also down, fabio cheerfully continues to route traffic to a downed instance.

Maybe this wouldn't be such a big deal if I knew how to make traffic coming into fabio only get passed to the upstream endpoint running on the same host as fabio...but I don't know how to do that 😄

Instead, our traffic flows are ALB -> fabio_listener -> one-of-several app servers that are members of the relevant consul service. So, yeah, consul goes away, and app server "B" is being switched out...traffic coming into fabio on app server "C" or "A" will get routed to "B" 33% of the time.

@magiconair
Contributor

magiconair commented Dec 24, 2017

So the issue is that according to consul fabio can route traffic to a service but then discovers that it can't for whatever reason.


~~Then I'd need some code for the backoff algorithm.~~

The http code returns 502 on i/o timeout and https://github.com/rubyist/circuitbreaker provides all the CB functions. The question is how to integrate them. (see below)
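For context, a minimal sketch of what combining the two could look like for a single, hard-coded upstream, assuming the `NewThresholdBreaker`/`Call` interface that library documents. The target address, port and wiring are illustrative only, not fabio's actual proxy code:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"

	circuit "github.com/rubyist/circuitbreaker"
)

func main() {
	// Hypothetical single upstream; fabio has many targets per route.
	target := "http://172.16.0.22:8000"

	// Trip after 5 consecutive failures; the library handles half-open retries.
	cb := circuit.NewThresholdBreaker(5)
	client := &http.Client{Timeout: 3 * time.Second}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		err := cb.Call(func() error {
			resp, err := client.Get(target) // a real proxy would forward the full request
			if err != nil {
				return err // "connection refused" / "i/o timeout" counts as a failure
			}
			defer resp.Body.Close()
			_, err = io.Copy(w, resp.Body)
			return err
		}, 0) // 0 = rely on the client timeout rather than a breaker timeout

		if err != nil {
			// Breaker open or request failed: answer 502 instead of hanging again.
			http.Error(w, "bad gateway", http.StatusBadGateway)
		}
	})

	log.Fatal(http.ListenAndServe(":9999", nil))
}
```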

@magiconair
Contributor

A narrow view of the problem would only look at "connection refused" or "i/o timeout" errors because of faulty network configuration. This would be a protection mechanism for fabio since Consul should reflect the true state of the cluster.

Ideally, a circuit breaker for this failure should be per host:port and not per route since a service can have many routes. However, the challenges described below apply to any CB integration.

The challenge is where to store and maintain the circuit breakers.

A simple approach would be to have a map in the http proxy and store a CB per host:port. However, that would add a global lock that all requests would have to go through which may limit throughput. The other issue is that fabio would still pick routes to unreachable hosts since the route.Picker does not have access to that map. fabio could retry the lookup until it finds a suitable target but that may not terminate. This approach would then have to be replicated for the TCP proxy as well.
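To make the trade-off concrete, here is a sketch of that first approach (not fabio code; the type and function names are made up): a registry keyed by host:port behind a single mutex, which is exactly the lock every request would contend on:

```go
package proxy

import (
	"sync"

	circuit "github.com/rubyist/circuitbreaker"
)

// breakerRegistry sketches the "map in the http proxy" approach: one mutex
// guards a map keyed by host:port, and every request has to take that lock
// to look up (or lazily create) its breaker -- the contention point noted above.
type breakerRegistry struct {
	mu       sync.Mutex
	breakers map[string]*circuit.Breaker
}

func newBreakerRegistry() *breakerRegistry {
	return &breakerRegistry{breakers: make(map[string]*circuit.Breaker)}
}

func (r *breakerRegistry) get(hostPort string) *circuit.Breaker {
	r.mu.Lock()
	defer r.mu.Unlock()
	cb, ok := r.breakers[hostPort]
	if !ok {
		cb = circuit.NewThresholdBreaker(5) // trip after 5 consecutive failures
		r.breakers[hostPort] = cb
	}
	return cb
}
```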

Another approach would be to attach the CBs to the route.Target so that the route.Picker can filter routes to dead hosts. This would eliminate the lock contention and fabio would serve only valid routes. However, the route.Target objects are disposed every time the routing table is recreated (i.e. on every Consul state change) and with that the CB state would be lost. Also, the CBs would be per route (i.e. host:port/path) and would have to trigger for each route individually unless there is some code that scans the routing table and updates all matching CBs.

I think the right approach is to automate what @beyondkmp is doing: generating route del commands and appending them to the routing table.

In that case the http/tcp/udp proxy could just notify the CB monitor of failures. The monitor would maintain the state and generate routing table updates which are then fed back into fabio. Since network issues could be local to a fabio instance I would not try to sync them across the fabio cluster.
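A rough sketch of such a monitor, with hypothetical names and without recovery, half-open probing or GC of idle state, emitting the `route del * * http://host:port` form described in the follow-up below:

```go
package monitor

import "fmt"

// event is a hypothetical OK/FAIL notification sent by the http/tcp/udp proxies.
type event struct {
	hostPort string // e.g. "1.2.3.4:8000"
	failed   bool
}

// run counts consecutive failures per host:port and emits a routing table
// fragment ("route del * * http://host:port") whenever a target crosses the
// threshold. The fragments would be appended to the routing table, mirroring
// the manual route del workaround mentioned above.
func run(events <-chan event, fragments chan<- string, maxFails int) {
	fails := map[string]int{}
	for ev := range events {
		if !ev.failed {
			fails[ev.hostPort] = 0
			continue
		}
		fails[ev.hostPort]++
		if fails[ev.hostPort] == maxFails {
			fragments <- fmt.Sprintf("route del * * http://%s", ev.hostPort)
		}
	}
}
```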

magiconair added a commit that referenced this issue Dec 29, 2017
This patch adds a circuit breaker monitor which issues route table
fragments to delete routes for which the circuit breaker is open.

The current implementation supports only the HTTP path. TCP and WS will
follow.

The integration test isn't testing the behavior automatically yet
but can be used to eyeball the behavior. This needs to be improved.

Also missing is the code to GC the circuit breakers that aren't
tripped over time but that should be easy to implement in this
code.

Fixes #393
@magiconair
Contributor

I've started working on this. There is a single-threaded circuit breaker monitor which gets OK/FAIL events from the proxies. The monitor will then issue a routing table fragment which deletes all routes that match the breaker. Once a breaker becomes ready again the route del command will be removed.

The config language was updated to support deleting all routes for a given destination, e.g.:

# delete all routes for 'host:port'
route del * * http://host:port

Example

# given the routes
route add svc /foo http://1.2.3.4:8000
route add svc /foo http://5.6.7.8:8000
route del * * http://1.2.3.4:8000

# results in 
route add svc /foo http://5.6.7.8:8000

With this approach the proxy, the routing table and the circuit breakers are decoupled and can be updated independently.

The monitor needs to be integrated into the TCP path and the Websocket path and there needs to be GC of OK breakers which are no longer used. The trickier part is to write a decent integration test.

Please let me know if you have feedback.

@jcomeaux

jcomeaux commented Dec 29, 2017

This looks great Frank! This certainly addresses our situation of a Consul that may "go away" with concurrent proxy outages, but also handles spurious network disconnections between the monitor and proxies.

Would it make sense to have the ok/fail thresholds configurable for the monitor? Also, I'm assuming "once a breaker becomes ready again" means that we've gotten updated information from Consul?

@magiconair
Contributor

I will make the breakers configurable. As usual there is the challenge of making this configurable per route or globally but since we're talking basic networking errors I will start with a global config value.

The breakers are independent from Consul since the idea is that Consul knows the state of the cluster. Once Consul becomes available again fabio will already pick up the new routing table but it may take some time for the breakers to recover.

@deuch

deuch commented May 17, 2018

Hello,

I have the same issue here:

Fabio (container), the API (containers) and the Consul agent (process) are all on the same host.

Health checks are done via docker exec health checks from the agent into the containers.

Fabio and the API use the same overlay network. Sometimes we have network issues on the overlay and fabio cannot reach some of the containers. For example, we have 4 instances of our containers and 2 cannot be reached by fabio. We can see errors in the log ("no backend" errors etc ...). If we try a ping or curl from the fabio container to the API we get the same error.

BUT the consul health check is OK ... because it's a docker exec that runs a script inside the container itself (like a curl localhost/health).

So if the consul health check is OK but the network between fabio and the containers has failed, the route stays the same and fabio keeps sending traffic to a backend that cannot be reached ...

So, a way for fabio to blacklist backends that cannot be reached, even if the health check is OK, would be a very important feature for application availability ...
