race condition between consul-template and restarting haproxy #69

Closed
Kosta-Github opened this issue Jul 3, 2015 · 18 comments

@Kosta-Github
Contributor

It looks like there is a race condition between consul-template and restarting haproxy. I noticed on several cluster nodes that multiple haproxy processes are running, e.g.:

administrator@ECS-7b314f01:~$ docker exec -it panteras_panteras_1 /bin/bash
root@ECS-7b314f01:/opt# ps -aux | grep haproxy

root        15  0.1  0.0  19616  1684 ?        S    08:51   0:00 /bin/bash /opt/consul-template/haproxy_watcher.sh
root        16  0.3  0.3  12724  7748 ?        Sl   08:51   0:01 consul-template -consul=10.50.80.41:8500 -template template.conf:/etc/haproxy/haproxy.cfg:/opt/consul-template/haproxy_reload.sh -template=./keepalived.conf:/etc/keepalived/keepalived.conf:./keepalived_reload.sh
root      1142  0.0  0.0  21076  1608 ?        Ss   08:53   0:00 /usr/sbin/haproxy -p /tmp/haproxy.pid -f /etc/haproxy/haproxy.cfg -sf 1133 1114 1097
root      1479  0.0  0.0  21076  1600 ?        Ss   08:54   0:00 /usr/sbin/haproxy -p /tmp/haproxy.pid -f /etc/haproxy/haproxy.cfg -sf 1474 1142 1114
root      1505  0.0  0.0  21076  1400 ?        Ss   08:54   0:00 /usr/sbin/haproxy -p /tmp/haproxy.pid -f /etc/haproxy/haproxy.cfg -sf 1498 1479 1142
root      2406  0.0  0.0  10468   940 ?        S+   08:58   0:00 grep --color=auto haproxy

Note that there is more than one pid listed for the -sf option of haproxy!

My guess is that this happens when several service changes propagate through consul-template within very short intervals, e.g., when scaling a marathon app. This triggers the haproxy reload command multiple times in parallel, each invocation picking up one or more pids via the pidof call, which results in the situation shown above.

As a result, haproxy on that node will/might run with an outdated haproxy config.

Right now, I don't have a good idea how to fix that...
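One common way to avoid this kind of race (just a minimal sketch, not the actual PanteraS script; the script name and lock file path are assumptions) would be to serialize the reload command with flock, so that parallel consul-template triggers cannot both run pidof and start haproxy at the same time:

#!/bin/bash
# hypothetical haproxy_reload.sh: take an exclusive lock so only one reload runs at a time
(
  flock -x 200
  # with the lock held, pidof sees exactly the currently running haproxy instances
  /usr/sbin/haproxy -p /tmp/haproxy.pid -f /etc/haproxy/haproxy.cfg -sf $(pidof /usr/sbin/haproxy)
) 200>/var/lock/haproxy_reload.lock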

@Kosta-Github
Contributor Author

OK, after browsing the internet a little more and reading up on this, I came to the conclusion that multiple running haproxy instances might not be a problem, since the older ones should terminate themselves once their last open connection gets closed.

But then there might be another issue behind the observed behavior:

As soon as the set of running services changes and a new haproxy config file is generated and reloaded, I get HTTP connection errors for a very short period (~1 sec). And this happens across services/apps, e.g., when scaling/deploying a service_a I also get connection errors for an already running service_b.

Any ideas?

@sielaq
Contributor

sielaq commented Jul 4, 2015

Regarding multiple HAproxy processes: this is normal behavior.
Regarding the 2nd question: we have not had this problem at all; so far an HAproxy reload has never caused any issue.

Do you have the problem with every running service when HAproxy reloads?
Do you have the problem only when scaling down, or also when scaling up?

@Kosta-Github
Contributor Author

I just set up a very simple node.js based service (it just echoes the HTTP headers passed to it) for testing the fail-over behavior of HAProxy.

This service gets stressed by another node.js script, which runs 10 requests in parallel in a loop against the echo service above.

As soon as the HAProxy config file gets changed (due to the deployment of a new service, scaling an existing service up or down, ...), I get some failing requests reported, e.g.:

...
 stress-test 4: remote address: ::ffff:10.50.80.218, service host: 0e0a26b73ff8, duration: 32
 stress-test 5: remote address: ::ffff:10.50.80.218, service host: 0e0a26b73ff8, duration: 32
 stress-test 3: remote address: ::ffff:10.50.80.218, service host: 0e0a26b73ff8, duration: 32
 stress-test 6: remote address: ::ffff:10.50.80.218, service host: 0e0a26b73ff8, duration: 31
 stress-test 8: request error: Error: read ECONNRESET, duration: 27
 stress-test 7: request error: Error: read ECONNRESET, duration: 27
 stress-test 9: request error: Error: read ECONNRESET, duration: 27
 stress-test 1: request error: Error: read ECONNRESET, duration: 30
 stress-test 0: remote address: ::ffff:10.50.80.218, service host: 0e0a26b73ff8, duration: 29
 stress-test 2: remote address: ::ffff:10.50.80.218, service host: 0e0a26b73ff8, duration: 32
 stress-test 3: remote address: ::ffff:10.50.80.218, service host: 0e0a26b73ff8, duration: 30
...

or sometimes connection timeouts like this:

...
  stress-test 0: remote address: ::ffff:10.50.80.218, service host: 1088cc6d638d, duration: 27
  stress-test 0: remote address: ::ffff:10.50.80.218, service host: 4382a069d20c, duration: 30
  stress-test 0: remote address: ::ffff:10.50.80.218, service host: 6e97e38e6d33, duration: 34
  stress-test 2: request error: Error: connect ETIMEDOUT, duration: 20099
  stress-test 3: request error: Error: connect ETIMEDOUT, duration: 20099
  stress-test 9: request error: Error: connect ETIMEDOUT, duration: 20099
  stress-test 5: request error: Error: connect ETIMEDOUT, duration: 20099
  stress-test 8: request error: Error: connect ETIMEDOUT, duration: 20099
  stress-test 6: request error: Error: connect ETIMEDOUT, duration: 21000
  stress-test 4: request error: Error: connect ETIMEDOUT, duration: 20099
  stress-test 1: request error: Error: connect ETIMEDOUT, duration: 21000
  stress-test 7: request error: Error: connect ETIMEDOUT, duration: 20099
  stress-test 0: remote address: ::ffff:10.50.80.218, service host: 980e24577611, duration: 28
  stress-test 2: remote address: ::ffff:10.50.80.218, service host: 0e0a26b73ff8, duration: 28
  stress-test 3: remote address: ::ffff:10.50.80.218, service host: 1088cc6d638d, duration: 30
...

@sielaq
Contributor

sielaq commented Jul 6, 2015

I think this is kind of an HAproxy design problem: for a few milliseconds the port is unavailable. It is especially exposed if you test without keep-alive. You should rather simulate browser behavior and test with apache ab (from the apache2-utils package), which supports keep-alive:

ab -n 10000 -c 10 -k http://echo.service.my_company.com

This will give you more reliable results.

We are considering migrating to a different load balancer. One of our internal developers wrote a golang consul-based router; we are waiting to open source it (it takes some time to open source any internal tool).

@sielaq
Contributor

sielaq commented Jul 6, 2015

If this is still bothering you, change this in generate_yml.sh:

HAPROXY_RELOAD_COMMAND="iptables -I INPUT -p tcp --dport 80 --syn -j DROP; sleep 1; /usr/sbin/haproxy -p /tmp/haproxy.pid -f /etc/haproxy/haproxy.cfg -sf $(pidof /usr/sbin/haproxy); iptables -D INPUT -p tcp --dport 80 --syn -j DROP || true"

I just saw that this option cannot be overridden from an external ENV variable, man man man

@Kosta-Github
Contributor Author

Thx for the pointers; will have a closer look tomorrow in the office.

I hoped/thought that HAProxy could handle those cases out-of-the-box... :-(

Do you have experience with nginx? Can it handle that reload case more gracefully?

@sielaq
Contributor

sielaq commented Jul 7, 2015

Yes, we have experience with nginx, apache, and varnish.
Nginx could be a better choice, varnish too (but varnish has only implemented it since version 4).
The solution with iptables should also do the trick with HAproxy; the 1st method (from the first link) is better, but much more complicated. The 2nd one, with tc and nl-qdisc-add, could be easy to implement.
And worth mentioning: there is no faster LB than HAproxy.

IMHO none of those methods fully fits a consul cluster, because you always need consul-template "glue code"; that's why some of my colleagues decided to write their own router.

@sielaq
Contributor

sielaq commented Jul 7, 2015

I have just tested the solution with nl-qdisc-add, which buffers all the traffic for a few milliseconds, and it works really well.
I think I can adapt this for PanteraS - this will definitely satisfy you.
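For reference, the plug-qdisc buffering approach roughly looks like the sketch below; the device name, qdisc handles, and exact nl-qdisc-add flags are assumptions based on the commonly documented technique, not the PanteraS code.

# plug (buffer) new packets on the tc class that port 80 traffic is filtered into
nl-qdisc-add --dev=eth0 --parent=1:4 --id=40: --update plug --buffer
# reload haproxy while incoming SYNs are held in the buffer
/usr/sbin/haproxy -p /tmp/haproxy.pid -f /etc/haproxy/haproxy.cfg -sf $(pidof /usr/sbin/haproxy)
# release the buffered packets to the freshly started haproxy
nl-qdisc-add --dev=eth0 --parent=1:4 --id=40: --update plug --release-indefinite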

@Kosta-Github
Contributor Author

Yes, definitely! :-)

Thank you for all your suggestions and hard work!

sielaq added a commit to sielaq/PanteraS that referenced this issue Jul 7, 2015
sielaq added a commit that referenced this issue Jul 7, 2015
#69 - HAproxy reload gets more smooth
@sielaq
Contributor

sielaq commented Jul 7, 2015

fixed

@sielaq sielaq closed this as completed Jul 7, 2015
@Kosta-Github
Contributor Author

thx

Kosta-Github pushed a commit to Kosta-Github/PanteraS that referenced this issue Jul 7, 2015
@Kosta-Github
Contributor Author

This change didn't fix the issue for me; I am observing the exact same errors as noted above... :-(

@sielaq sielaq reopened this Jul 7, 2015
@sielaq
Contributor

sielaq commented Jul 7, 2015

Yeah, I forgot to include one part; I was wondering where to put it and then forgot about it.

sielaq added a commit to sielaq/PanteraS that referenced this issue Jul 8, 2015
sielaq added a commit to sielaq/PanteraS that referenced this issue Jul 8, 2015
sielaq added a commit to sielaq/PanteraS that referenced this issue Jul 8, 2015
@sielaq
Contributor

sielaq commented Jul 8, 2015

Oh man, this is not working as expected; depending on the buffer size it can even slow things down a lot.
I'm starting to think about implementing the best option: two HAproxy daemons and an iptables switch.

@sielaq
Contributor

sielaq commented Jul 8, 2015

I have a PoC that uses iptables to switch between two HAproxy instances,
and this one is really working fine. I have tested it well;
it works much better than queuing/buffering.
I will push the changes tomorrow - I still have to make some cosmetic changes.
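For illustration, such a switch could look roughly like the sketch below (the port numbers and rule position are assumptions; the actual PanteraS implementation may differ): two haproxy instances listen on internal ports, and a NAT rule decides which one receives the traffic arriving on port 80.

# instance A on 8001, instance B on 8002 (assumed ports); start by routing port 80 to A
iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8001
# on a config change: restart the idle instance B with the new config...
/usr/sbin/haproxy -p /tmp/haproxy_b.pid -f /etc/haproxy/haproxy.cfg -sf $(cat /tmp/haproxy_b.pid 2>/dev/null)
# ...then flip the redirect rule (rule #1 in the nat PREROUTING chain) over to B
iptables -t nat -R PREROUTING 1 -p tcp --dport 80 -j REDIRECT --to-ports 8002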

sielaq added a commit that referenced this issue Jul 9, 2015
@sielaq
Contributor

sielaq commented Jul 9, 2015

Kindly verify whether this is working fine now.

@Kosta-Github
Contributor Author

This seems to work now and also with my keepalived PR (#68)...

Thanks for that.
