instrument dotmesh with prometheus #305
for now, works
Things to monitor, raised in the engineering retro 2018-04-27:
Add alerts to Slack when error rates, zpool fullness, etc. exceed allowable limits.
Alaric needs to talk to Priya about recording operator state into Prometheus! The following log statements should probably become Prometheus metrics:

```go
glog.V(1).Infof("%d healthy-looking dotmeshes exist to run on %d nodes; %d of them seem to be actually running; %d dotmeshes need deleting, and %d out of %d undotted nodes are temporarily suspended",
	dottedNodeCount, len(validNodes),
	runningPodCount,
	len(dotmeshesToKill),
	len(suspendedNodes), len(undottedNodes))
glog.V(1).Infof("%d/%d nodes might just be running or getting there, minimum target is %d",
	clusterPopulation, len(validNodes),
	clusterMinimumPopulation)
```
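A minimal sketch of how those counts could be exposed to Prometheus, using only the standard library to render the text exposition format. The metric names and values below are invented for illustration (they are not existing dotmesh metrics), and a real implementation would more likely use the official `client_golang` library:

```go
package main

import (
	"fmt"
	"strings"
)

// renderMetrics renders integer gauge values in the Prometheus text
// exposition format, in a fixed order so output is deterministic.
func renderMetrics(counts map[string]int, order []string) string {
	var b strings.Builder
	for _, name := range order {
		fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
		fmt.Fprintf(&b, "%s %d\n", name, counts[name])
	}
	return b.String()
}

func main() {
	// Hypothetical values corresponding to the counters in the log
	// statements above.
	order := []string{
		"dotmesh_dotted_nodes",
		"dotmesh_valid_nodes",
		"dotmesh_running_pods",
		"dotmesh_dotmeshes_to_kill",
		"dotmesh_suspended_nodes",
	}
	counts := map[string]int{
		"dotmesh_dotted_nodes":      4,
		"dotmesh_valid_nodes":       5,
		"dotmesh_running_pods":      4,
		"dotmesh_dotmeshes_to_kill": 0,
		"dotmesh_suspended_nodes":   1,
	}
	// Serving this string from an HTTP /metrics handler is all a
	// Prometheus scraper needs.
	fmt.Print(renderMetrics(counts, order))
}
```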
Status:
Alerts are now active on https://cloud.weave.works/twilight-meteor-77/org/alerting/rules, and these are configured to ping #operations on the Dotmesh Slack. The current thresholds are set to x3 and will need to be tweaked based on observing the alerts over a period of time. Alerts:
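For reference, routing alerts into a Slack channel like #operations is configured on the Alertmanager side. A minimal receiver configuration might look like the following sketch (the webhook URL is a placeholder, not a real endpoint):

```yaml
# alertmanager.yml (sketch): send all alerts to #operations on Slack.
route:
  receiver: slack-operations

receivers:
  - name: slack-operations
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder webhook
        channel: '#operations'
```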
When one is operating a dotmesh cluster, as we are at dothub.com, it's useful to know how full the zpools are.
Expose Prometheus metrics from dotmesh servers to share zpool fullness and maybe some other things; then we can hook them up to a Prometheus service of our choice (in our case, Weave Cloud), which can then set up alerts into Slack.
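A zpool-fullness alert of the kind described could be expressed as a Prometheus alerting rule. This is a sketch only: `dotmesh_zpool_capacity_ratio` is a hypothetical metric name, and the threshold and durations are examples rather than the values used in production:

```yaml
# Sketch of a Prometheus alerting rule for zpool fullness.
groups:
  - name: dotmesh
    rules:
      - alert: ZpoolNearlyFull
        expr: dotmesh_zpool_capacity_ratio > 0.85  # hypothetical metric
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "zpool on {{ $labels.instance }} is more than 85% full"
```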