Haproxy module #1545

Merged
merged 14 commits into from Jan 11, 2017

Projects

None yet

4 participants

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

Works via url and via unix socket (actually there is a bug in base.py that prevents to obtain all data from socket, i will make a pr with fix later)

Should be tested.
Works ok in my lab env.
If url is specified plugin uses UrlService, if no url and socket is specified - SocketService

@l2isbad l2isbad referenced this pull request Jan 11, 2017
Closed

new haproxy module #784

l2isbad added some commits Jan 11, 2017
@l2isbad l2isbad add workaround for bug in base.py
7ffe05b
@l2isbad l2isbad add ;norefresh option to url check
420a018
@jurgenhaas
Contributor

Nice work! It is doing what it should on my 4 proxy hosts.

l2isbad added some commits Jan 11, 2017
@l2isbad l2isbad configuration for haproxy plagin added
bf4a53c
@l2isbad l2isbad add haproxy plugin to Makefiles
b48f542
@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

@jurgenhaas Any suggestion?

  1. Option to add via conf file 'Servers' family (bin, bout, scur, qcur for all backend servers) ?
  2. More health charts? (more health charts = more alarms) (i see there is server uptime, check time in ms, downtime in ms) ?
  3. What about default alarm?

Also i think preferred method to poll is via socket. Can you check if it is working properly? (you can see poll method in error.log -python.d INFO: haproxy_via_url Plugin was started succesfully. We are using UrlService)

@jurgenhaas
Contributor
  1. Option to add via conf file 'Servers' family (bin, bout, scur, qcur for all backend servers) ?

Not sure if that should also be added. You could argue that monitoring the health of the "Servers" should be done by NetData on those hosts and not through HaProxy.

  1. More health charts? (more health charts = more alarms) (i see there is server uptime, check time in ms, downtime in ms) ?

A pretty good reference is from Datadogs and you may want to follow their guide with the various aspects.

  1. What about default alarm?

Looks like the datadogs article has some advise for this too.

@l2isbad
Contributor
l2isbad commented Jan 11, 2017

Well the article is very good, but i dont use haproxy. so i'm lazy to read it)
I think i will add an alarm for up->down server and im done here.
All other stuff can be added later if someone will need to.

l2isbad added some commits Jan 11, 2017
@l2isbad l2isbad change context name 53efa35
@l2isbad l2isbad haproxy backend server alarm added
275f97b
@l2isbad l2isbad add haproxy alarm to makefile
43bd866
@ktsaou
Member
ktsaou commented Jan 11, 2017

nice work @l2isbad . Let me know when you want this merged.

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

If there are no suggestions i think it is ready.
The only tester is @jurgenhaas and he said the plugin is ok

@ktsaou ktsaou merged commit 642dc0e into firehol:master Jan 11, 2017

2 checks passed

codeclimate no new or fixed issues
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@ktsaou
Member
ktsaou commented Jan 11, 2017

merged!

@ktsaou
Member
ktsaou commented Jan 11, 2017

hm... I forgot the wiki...
@l2isbad can you update it?

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

sure! but later

@jurgenhaas
Contributor

Sorry, I haven't tested the alarms before and just got to that now. I realized that all my backends are reported as being down and I wonder why. They all work well and none of them is down, but the chart reports a value=1 for ALL of them. Not sure what the plugin expects and why it thinks they are down. Any idea?

@l2isbad
Contributor
l2isbad commented Jan 11, 2017

@jurgenhaas
Code is
https://github.com/firehol/netdata/blob/master/python.d/haproxy.chart.py#L163
What is your backend server status?
Is it not UP?

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

I got it! if there is no check

  server server1_1 127.0.0.1:81
  server server1_2 127.0.0.2:81 check
  server server1_3 127.0.0.3:81 check

status is not UP

@jurgenhaas want to make a PR to fix this? (change UP to DOWN)

@jurgenhaas
Contributor

Looking at the attached CSV I would say they are UP but still the plugin shows a 1 for every backend in the health chart. No idea what that is, you?

stat.csv.txt

@l2isbad
Contributor
l2isbad commented Jan 11, 2017

no check guy
screenshot_20170112_034321

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

all those servers are not UP in your CSV

@jurgenhaas
Contributor

The servers are not checked, but the backends are on UP. And I thought that status was on backends at not servers. I'm confused.

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

Well i am haproxy noob i thought backend (!= real server) = group of servers. If server from group is DOWN - alert. Seems logical to me

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

Anyway
Can you change
!= 'UP' to == 'DOWN'
here
https://github.com/firehol/netdata/blob/master/python.d/haproxy.chart.py#L163
and check after this

@jurgenhaas
Contributor

I have changed that line 163 to this instead:

return server['# pxname'] == back_ends[_]['# pxname'] and server['status'] != 'UP' and server['status'] != 'no check'

That's working too.

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

We need to catch all DOWN servers so the right decision is
server['status'] == 'DOWN'

it would be great if you did a PR to fix bug you found it!

@l2isbad
Contributor
l2isbad commented Jan 11, 2017

i am wrong too tired don't see all your code (
still one check > two checks
so i vote for == DOWN

@jurgenhaas
Contributor

That's true, on the other hand, the kown states are "UP" and "no check" and those 2 are OK for us. If the status report would provide anything other than those 2, we would not recognize that as a potential problem. That's why I thought this explicit method would be more safe.

@l2isbad
Contributor
l2isbad commented Jan 11, 2017

It makes sense. I am ok with that!

@l2isbad l2isbad deleted the l2isbad:haproxy_module branch Jan 11, 2017
@ktsaou
Member
ktsaou commented Jan 11, 2017

Please also put a calc line on the alarm definition, to be sure what $this is evaluated to.

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited
lookup: average -10s
calc: $his
crit: $this > 0

like that?
I thought that this is not necessary since there is no calculation

This expression is evaluated just after the lookup. Its purpose is to apply some calculation before using the value looked up from the db.

@ktsaou
Member
ktsaou commented Jan 11, 2017

hm... I can't recall what is the default for $this. I think that if neither calc nor lookup is given, the alarm is considered invalid and is not processed.

Can you see the alarm badge? What value does it report? Is it green? (if it is black or grey, it is not evaluated).

I don't know what this chart shows, so I cannot give you directions. Can you post an image of it?

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

lookup is given
https://github.com/firehol/netdata/blob/master/conf.d/health.d/haproxy.conf#L5
works ok in my lab env

Logic is very simple

  1. If all servers are up - sum of all dimensions = 0 (number of failed servers)
  2. if one of servers is down - sum > 0
    alert!
@l2isbad
Contributor
l2isbad commented Jan 11, 2017

screenshot_20170112_062501

@ktsaou
Member
ktsaou commented Jan 11, 2017

hm... interesting...
Could you please also the chart?

@l2isbad
Contributor
l2isbad commented Jan 11, 2017 edited

The chart is one the same screen (on the right side - backend1 value 1, backend2 value 2, sum = 3)

@l2isbad
Contributor
l2isbad commented Jan 11, 2017
@ktsaou
Member
ktsaou commented Jan 11, 2017

@l2isbad then it works ok as-is (I didn't recall that the default $this is the sum of all dimensions). Sorry...

@robbat2
robbat2 commented Jan 15, 2017

@l2isbad does this correctly handle the nbprocs > 1 case? Any comments on using it with that?

@l2isbad
Contributor
l2isbad commented Jan 15, 2017 edited

@robbat2 What do you mean? Connect to several stats page (or sockets?) and graph summary stats? The short answer is no.
The plugin was made on @jurgenhaas and @hameno requests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment