SPECS: Central netdata Health Monitoring Server #1527

Open
ktsaou opened this Issue Jan 8, 2017 · 5 comments

@ktsaou
Member
ktsaou commented Jan 8, 2017 edited

I plan to add a central netdata health server.

These are the high level requirements, as I think of them. Feel free to comment.

each netdata

  1. Each netdata will be configured to communicate with one netdata health server.

  2. Each netdata will update the health server periodically, at fixed intervals (e.g. every 30 seconds).

  3. Each netdata will propagate all alarm status changes to the health server, as they happen (i.e. immediately, without any hysteresis).
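A minimal sketch of the agent-side behaviour described in the three points above, assuming a hypothetical `HealthReporter` class, a caller-supplied `send` transport, and an illustrative message layout (none of these exist in netdata): periodic updates go out at a fixed interval, while alarm status changes are sent the moment they are recorded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class HealthReporter:
    """Hypothetical agent-side reporter; the message fields are illustrative."""
    host: str
    send: Callable[[dict], None]   # transport to the health server (assumed)
    interval: float = 30.0         # seconds between periodic updates
    _last_update: float = field(default=float("-inf"))
    _statuses: Dict[str, str] = field(default_factory=dict)

    def tick(self, now: float) -> bool:
        """Send a periodic update if the fixed interval has elapsed."""
        if now - self._last_update < self.interval:
            return False
        self.send({"type": "update", "host": self.host, "at": now,
                   "alarms": dict(self._statuses)})
        self._last_update = now
        return True

    def set_status(self, alarm: str, status: str, now: float) -> bool:
        """Record an alarm status and propagate real changes immediately."""
        old: Optional[str] = self._statuses.get(alarm)
        if old == status:
            return False           # no change, nothing to propagate
        self._statuses[alarm] = status
        self.send({"type": "alarm", "host": self.host, "at": now,
                   "alarm": alarm, "old": old, "new": status})
        return True
```

Passing `now` explicitly keeps the sketch independent of the wall clock; a real agent would call `tick()` from its main loop and `set_status()` from the health engine.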


the users

  1. Users are known at the health server.

  2. Each user should be able to choose the alarm notification methods he/she prefers per alarm role, per individual alarm, and per alarm level (CRITICAL, WARNING, CLEAR).

    So, user A could say that he wants to receive emails for dba alarms, twilio messages for sysadmin alarms, and no notifications for proxyadmin alarms, and, specifically for alarm X from host H, a slack notification for warnings and a pagerduty notification if it is critical.
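The "user A" example can be sketched as a lookup where a per-alarm override wins over the role default. The data layout, the `methods_for` helper, and the fallback rule (an alarm level without an override, such as CLEAR for alarm X, falls back to the role setting) are assumptions for illustration, not an agreed design.

```python
from typing import Dict, List, Tuple

# Hypothetical preferences of "user A", keyed by alarm role.
ROLE_METHODS: Dict[str, List[str]] = {
    "dba": ["email"],
    "sysadmin": ["twilio"],
    "proxyadmin": [],            # explicitly no notifications
}

# Hypothetical overrides keyed by (host, alarm, level); they win over roles.
ALARM_OVERRIDES: Dict[Tuple[str, str, str], List[str]] = {
    ("H", "X", "WARNING"): ["slack"],
    ("H", "X", "CRITICAL"): ["pagerduty"],
}


def methods_for(host: str, alarm: str, role: str, level: str) -> List[str]:
    """Return the notification methods for one alarm event of this user."""
    override = ALARM_OVERRIDES.get((host, alarm, level))
    if override is not None:
        return override          # most specific setting wins
    return ROLE_METHODS.get(role, [])   # unknown role: notify nobody
```

The health server would evaluate such a lookup once per user for every alarm transition, which is how one alarm can end up notifying many users through different channels.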


the health server

  1. The health server will maintain a small database for each of the servers it monitors.

  2. The health server will send alarm notifications to users according to their settings. This means each alarm may notify many users with different settings each.

  3. The health server will provide a dashboard featuring all the netdata servers the user knows. So, a high level overview of all the netdata servers a user knows will be available at the health server (each netdata could also provide a link on its dashboard).

  4. The health server will provide the following additional alarms for all servers monitored:

    • server down/unreachable
    • server rebooted/restarted
    • netdata restarted
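One way the health server could derive these three alarms from the periodic updates is sketched below. It assumes each update carries the host boot time and the netdata start time (hypothetical fields), and that a server counts as down after a configurable number of missed intervals; a reboot is reported alone, since it implies a netdata restart as well.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Update:
    host: str
    at: float           # when the health server received the update
    boot_time: float    # host boot timestamp (assumed field)
    agent_start: float  # netdata start timestamp (assumed field)


class Watchdog:
    """Hypothetical server-side tracker for the three additional alarms."""

    def __init__(self, interval: float = 30.0, missed: int = 3) -> None:
        self.timeout = interval * missed   # silence tolerated before "down"
        self.last: Dict[str, Update] = {}

    def receive(self, u: Update) -> List[str]:
        """Store the update and return any restart alarms it reveals."""
        prev = self.last.get(u.host)
        self.last[u.host] = u
        if prev is None:
            return []                      # first contact, nothing to compare
        if u.boot_time > prev.boot_time:
            return ["server rebooted/restarted"]
        if u.agent_start > prev.agent_start:
            return ["netdata restarted"]
        return []

    def down_hosts(self, now: float) -> List[str]:
        """Hosts that have been silent for longer than the timeout."""
        return sorted(h for h, u in self.last.items()
                      if now - u.at > self.timeout)
```

Note that `down_hosts()` cannot distinguish a server that is down from one that is merely unreachable from the health server, which matches the combined "down/unreachable" alarm above.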

Notes

  1. As I see it, this central health monitoring system is highly related to the netdata registry. Probably the registry is best suited for this additional functionality.

  2. I would really like to have this functionality available for everyone who has installed netdata, even for IoT devices or single server installations. Like a global public netdata health monitoring server!

  3. To understand my view: at some point in the future I would love to have alarms configured by the users themselves, directly on the netdata dashboards. This will allow each user to have his/her own alarms, without bothering other users. So, netdata alarms could become personalized and completely dynamic.

@l2isbad
Contributor
l2isbad commented Jan 8, 2017

It would be cool if we could chart a 'health summary' on the health monitoring server (number of monitored instances in up/down state, number of alarms in WARN/CRIT state).
It would be easy to do with additional API calls (e.g. one that returns all the currently raised alarms for all monitored instances).
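As a rough illustration of what such a summary call might compute, assuming the health server keeps a per-instance record with an `up` flag and a mapping of alarm names to statuses (a hypothetical layout, not an existing netdata API):

```python
from collections import Counter
from typing import Dict


def health_summary(instances: Dict[str, dict]) -> Dict[str, int]:
    """Count instances by up/down state and raised alarms by level."""
    summary = Counter(up=0, down=0, WARNING=0, CRITICAL=0)
    for state in instances.values():
        summary["up" if state["up"] else "down"] += 1
        for status in state["alarms"].values():
            if status in ("WARNING", "CRITICAL"):
                summary[status] += 1   # CLEAR and other states are not charted
    return dict(summary)


# Example input: two reachable instances and one that is down.
example = {
    "web1": {"up": True, "alarms": {"cpu": "WARNING", "disk": "CLEAR"}},
    "db1": {"up": True, "alarms": {"disk": "CRITICAL", "ram": "WARNING"}},
    "iot7": {"up": False, "alarms": {}},
}
print(health_summary(example))
```

The four counters map directly onto the dimensions suggested above, so each call could feed one point of a 'health summary' chart.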

@emagaliff

This could be very helpful for #505

@ktsaou
Member
ktsaou commented Jan 12, 2017

> This could be very helpful for #505

Yes, it could! The only problem with ephemeral nodes is how the health server / registry will clean up, i.e. remove servers that are now offline.

@emagaliff

I don't think they should automatically clean up - that should be up to a server administrator to do when needed. Servers can be down for a variety of reasons and triggering an automatic cleanup can remove servers that are only down temporarily.

@ktsaou
Member
ktsaou commented Jan 12, 2017

You are right, but check #1311 for another view of the same problem... :)
