
instrument dotmesh with prometheus #305

Closed
lukemarsden opened this issue Feb 23, 2018 · 5 comments

@lukemarsden
Collaborator

lukemarsden commented Feb 23, 2018

When operating a dotmesh cluster, as we are at dothub.com, it's useful to know how full the zpools are.

Expose Prometheus metrics from the dotmesh servers reporting:

  • zpool fullness
  • number of registered users

and perhaps some other things. We can then hook them up to a Prometheus service of our choice (in our case, Weave Cloud), which can in turn send alerts to Slack.

@lukemarsden
Collaborator Author

lukemarsden commented Feb 23, 2018

For now,

$ kubectl exec -n dotmesh -ti dotmesh-7kqsx zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
pool  9.94G   969M  8.99G         -     6%     9%  1.00x  ONLINE  -

works.
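For scraping this automatically, the CAP column is the interesting one. A sketch (not dotmesh's actual code) of parsing `zpool list -H -o name,cap` output — `-H` is zpool's scripted mode, which drops the header and tab-separates fields — into a pool-to-percentage map that could feed the gauge; in the running container the command itself would be invoked via os/exec.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseZpoolCap extracts the CAP column from `zpool list -H -o name,cap`
// output and returns a map of pool name -> used capacity percent.
func parseZpoolCap(out string) (map[string]float64, error) {
	caps := make(map[string]float64)
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		// CAP is rendered like "9%"; strip the suffix before parsing.
		pct, err := strconv.ParseFloat(strings.TrimSuffix(fields[1], "%"), 64)
		if err != nil {
			return nil, fmt.Errorf("bad CAP value %q: %v", fields[1], err)
		}
		caps[fields[0]] = pct
	}
	return caps, nil
}

func main() {
	caps, _ := parseZpoolCap("pool\t9%\n")
	fmt.Println(caps["pool"])
}
```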

@alaric-dotmesh
Contributor

alaric-dotmesh commented Apr 27, 2018

Things to monitor, raised in the engineering retro 2018-04-27:

  • Etcd requests/responses per second
  • RPC requests per second and error rates
  • HTTP requests per second and error rates
  • Number of transitions into backoff per second
  • Zpool fullness
  • Number of registered users

Add alerts to Slack when error rates, zpool fullness, etc. exceed allowable limits.

@alaric-dotmesh
Contributor

Alaric needs to talk to Priya about recording operator state into Prometheus!

The following log statements should probably become Prometheus metrics:

	glog.V(1).Infof("%d healthy-looking dotmeshes exist to run on %d nodes; %d of them seem to be actually running; %d dotmeshes need deleting, and %d out of %d undotted nodes are temporarily suspended",
		dottedNodeCount, len(validNodes),
		runningPodCount,
		len(dotmeshesToKill),
		len(suspendedNodes), len(undottedNodes))
	glog.V(1).Infof("%d/%d nodes might just be running or getting there, minimum target is %d",
		clusterPopulation, len(validNodes),
		clusterMinimumPopulation)

@lukemarsden
Collaborator Author

lukemarsden commented May 29, 2018

Status:

  • Need to add etcd request/response metrics
  • Need to add alerts to Slack (and test them!) 👍

@prisamuel
Contributor

Alerts are now active on https://cloud.weave.works/twilight-meteor-77/org/alerting/rules and are configured to ping #operations on the Dotmesh Slack. The current thresholds are set to x 3 and will need to be tweaked based on observing the alerts over a period of time.

Alerts:

# Alerts based on application level prometheus instrumentation
ALERT SlowRPCRequests
  IF          sum(rate(dm_rpc_req_duration_seconds{}[1m])) by (rpc_method, status_code, path) > 3
  FOR         5m
  LABELS      { severity="warning" }
  ANNOTATIONS {
    summary = "High RPC request latency on dothub",
    impact = "RPC Requests to the dothub cluster are very slow",
  }
  
ALERT NoRPCRequests
  IF          sum(rate(dm_rpc_req_duration_seconds_count[5m])) by (rpc_method, status_code, path) == 0
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "No RPC requests detected over a 5min period",
    impact = "Dothub cluster appears to be down, no Ping or any other RPC requests detected",
  }


ALERT ErrorRateTooHigh
  IF          sum(rate(dm_req_total{status_code!="200"}[5m])) by (status_code) > 10
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "More than 10 failed http requests to dotmesh server in 5mins",
    impact = "Dothub cluster appears to have issues, experiencing high error rate",
  }


ALERT SlowHttpRequests
  IF          sum(rate(dm_req_duration_seconds[5m])) by (url, status_code) > 3
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "High request latency on dothub",
    impact = "Requests to the dothub cluster are very slow",
  }
      
ALERT ZPoolsFull
  IF          max(dm_zpool_usage_percentage) > 89
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "Zpools are approaching full capacity on the production cluster",
    impact = "Fire! The cluster will go down if the zpools are not expanded immediately!",
  }
ALERT TooManyBackoffStateTransitions
  IF          max(rate(dm_state_transition_total{to="backoff"}[1m])) > 10
  FOR         5m
  LABELS      { severity="error" }
  ANNOTATIONS {
    summary = "> 10 transitions/minute into backoff state for over 5mins",
    impact = "Fire! FSMachines are going into backoff state.",
  }
  
