
instrument dotmesh with prometheus #305

Closed
lukemarsden opened this issue Feb 23, 2018 · 5 comments

@lukemarsden
Collaborator

lukemarsden commented Feb 23, 2018

When operating a dotmesh cluster, as we are at dothub.com, it's useful to know how full the zpools are.

Expose Prometheus metrics from the dotmesh servers reporting:

  • zpool fullness
  • number of registered users

and perhaps some other things. We can then hook them up to a Prometheus service of our choice (in our case, Weave Cloud), which can in turn send alerts to Slack.

@lukemarsden
Collaborator Author

lukemarsden commented Feb 23, 2018

For now,

$ kubectl exec -n dotmesh -ti dotmesh-7kqsx zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
pool  9.94G   969M  8.99G         -     6%     9%  1.00x  ONLINE  -

works.
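For scraping this automatically, the CAP column is the interesting one. A sketch (not dotmesh's actual code) of parsing `zpool list -H -o name,cap` output — `-H` is zpool's scripted mode, which drops the header and tab-separates fields — into a pool-to-percentage map that could feed the gauge; in the running container the command itself would be invoked via os/exec.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseZpoolCap extracts the CAP column from `zpool list -H -o name,cap`
// output and returns a map of pool name -> used capacity percent.
func parseZpoolCap(out string) (map[string]float64, error) {
	caps := make(map[string]float64)
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		// CAP is rendered like "9%"; strip the suffix before parsing.
		pct, err := strconv.ParseFloat(strings.TrimSuffix(fields[1], "%"), 64)
		if err != nil {
			return nil, fmt.Errorf("bad CAP value %q: %v", fields[1], err)
		}
		caps[fields[0]] = pct
	}
	return caps, nil
}

func main() {
	caps, _ := parseZpoolCap("pool\t9%\n")
	fmt.Println(caps["pool"])
}
```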

@alaric-dotmesh
Contributor

alaric-dotmesh commented Apr 27, 2018

Things to monitor, raised in the engineering retro 2018-04-27:

  • Etcd requests/responses per second
  • RPC requests per second and error rates
  • HTTP requests per second and error rates
  • Number of transitions into backoff per second
  • Zpool fullness
  • Number of registered users

Add alerts to Slack when error rates, zpool fullness, etc. exceed allowable limits.

@alaric-dotmesh
Contributor

Alaric needs to talk to Priya about recording operator state into Prometheus!

The following log statements should probably become Prometheus metrics:

	glog.V(1).Infof("%d healthy-looking dotmeshes exist to run on %d nodes; %d of them seem to be actually running; %d dotmeshes need deleting, and %d out of %d undotted nodes are temporarily suspended",
		dottedNodeCount, len(validNodes),
		runningPodCount,
		len(dotmeshesToKill),
		len(suspendedNodes), len(undottedNodes))
	glog.V(1).Infof("%d/%d nodes might just be running or getting there, minimum target is %d",
		clusterPopulation, len(validNodes),
		clusterMinimumPopulation)

@lukemarsden
Collaborator Author

lukemarsden commented May 29, 2018

Status:

  • Need to add etcd request/response metrics
  • Need to add alerts to Slack (and test them!) 👍

@prisamuel
Contributor

Alerts are now active on https://cloud.weave.works/twilight-meteor-77/org/alerting/rules and are configured to ping #operations on the Dotmesh Slack. The current thresholds are set to x 3 and will need to be tweaked based on observing the alerts over a period of time.

Alerts:

# Alerts based on application level prometheus instrumentation
ALERT SlowRPCRequests
  IF          sum(rate(dm_rpc_req_duration_seconds{}[1m])) by (rpc_method, status_code, path) > 3
  FOR         5m
  LABELS      { severity="warning" }
  ANNOTATIONS {
    summary = "High RPC request latency on dothub",
    impact = "RPC Requests to the dothub cluster are very slow",
  }
  
ALERT NoRPCRequests
  IF          sum(rate(dm_rpc_req_duration_seconds_count[5m])) by (rpc_method, status_code, path) == 0
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "No RPC requests detected over a 5min period",
    impact = "Dothub cluster appears to be down, no Ping or any other RPC requests detected",
  }


ALERT ErrorRateTooHigh
  IF          sum(rate(dm_req_total{status_code!="200"}[5m])) by (status_code) > 10
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "More than 10 failed http requests to dotmesh server in 5mins",
    impact = "Dothub cluster appears to have issues, experiencing high error rate",
  }


ALERT SlowHttpRequests
  IF          sum(rate(dm_req_duration_seconds[5m])) by (url, status_code) > 3
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "High request latency on dothub",
    impact = "Requests to the dothub cluster are very slow",
  }
      
ALERT ZPoolsFull
  IF          max(dm_zpool_usage_percentage) > 89
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "Zpools are approaching full capacity on the production cluster",
    impact = "Fire! The cluster will go down if the zpools are not expanded immediately!",
  }
ALERT TooManyBackoffStateTransitions
  IF          max(rate(dm_state_transition_total{to="backoff"}[1m])) > 10
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "> 10 transitions/minute into backoff state for over 5mins",
    impact = "Fire! FSMachines are going into backoff state.",
  }
  
