
Production monitoring and alerting #2564

Merged
merged 3 commits into master from production-monitoring on Mar 1, 2018

Conversation

jseldess
Contributor

@jseldess jseldess commented Feb 27, 2018

  • Document production monitoring and alerting:
    • Add monitoring and alerting page.
    • Add monitoring and alerting to production checklist.
    • Mention other monitoring systems that can work with the Prometheus endpoint.
  • Clean up deployment docs:
    • Recategorize deployment as manual (including cloud) and orchestrated.
    • Rename some files and links.

Fixes #21.
Fixes #2305.
Fixes #1792.
Furthers #2120.

@cockroach-teamcity
Member

This change is Reviewable

@jseldess jseldess requested review from bdarnell and a-robinson and removed request for bdarnell February 27, 2018 18:28
@a-robinson
Contributor

Reviewed 12 of 12 files at r1.
Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


v1.1/production-monitoring.md, line 3 at r1 (raw file):

---
title: Production Monitoring
summary:

So we aren't documenting any of this for v1.1? Why create the empty page at all?


v2.0/deploy-on-premises.md, line 48 at r1 (raw file):

{% include prod_deployment/secure-test-cluster.md %}

## Step 6. Set up HAProxy load balancers

I'm kind of surprised there isn't already reusable content about this. Don't worry about it if there isn't, though.


v2.0/deploy-on-premises.md, line 94 at r1 (raw file):

	Field | Description
	------|------------
	`timout connect`<br>`timeout client`<br>`timeout server` | Timeout values that should be suitable for most deployments.

s/timout/timeout/


v2.0/deploy-on-premises.md, line 95 at r1 (raw file):

	------|------------
	`timout connect`<br>`timeout client`<br>`timeout server` | Timeout values that should be suitable for most deployments.
	`bind` | The port that HAProxy listens on. This is the port clients will connect to and thus needs to be allowed by your network configuration.<br><br>This tutorial assumes HAProxy is running on a separate machine from CockroachDB nodes. If you run HAProxy on the same machine as a node (not recommended), you'll need to change this port, as `26257` is also used for inter-node communication.

s/also used for inter-node communication/likely already being used by the CockroachDB node.
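
For readers following along, a minimal sketch of what the `haproxy.cfg` fields discussed above could look like; the timeouts, the bind port, and the node addresses are placeholders, not values taken from this PR:

~~~ shell
# Sketch only: write a minimal haproxy.cfg containing the fields called out
# in the table above. All values are illustrative placeholders.
cat > haproxy.cfg <<'EOF'
defaults
    mode               tcp
    # Timeout values that should be suitable for most deployments.
    timeout connect    10s
    timeout client     1m
    timeout server     1m

listen psql
    # The port clients connect to; change it if HAProxy shares a machine
    # with a CockroachDB node, since 26257 would already be taken.
    bind :26257
    mode tcp
    balance roundrobin
    server cockroach1 <node1-address>:26257
    server cockroach2 <node2-address>:26257
    server cockroach3 <node3-address>:26257
EOF
~~~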


v2.0/monitoring-and-alerting.md, line 17 at r1 (raw file):

The [built-in Admin UI](admin-ui-overview.html) gives you essential metrics about a cluster's health, such as the number of live, dead, and suspect nodes, the number of unavailable ranges, and the queries per second and service latency across the cluster. It is accessible from every node at `http://<host>:<http-port>`, or `http://<host>:8080` by default.

{{site.data.alerts.callout_danger}}Because the Admin UI is built into CockroachDB, if a cluster becomes unavailable, the Admin UI becomes unavailable as well. Therefore, it's essential to plan additional methods of monitoring cluster health as described below.{{site.data.alerts.end}}

If we're being really precise, I'd say that "much of the Admin UI becomes unavailable" or "most of the Admin UI becomes unavailable", since a few of the summary fields and a lot of the debug pages still work. That's pretty nit-picky, though, so feel free to leave as is.


v2.0/monitoring-and-alerting.md, line 24 at r1 (raw file):

~~~ shell
$ curl -i http://localhost:8080/_status/vars

Any particular reason to use the -i? I think the output would be clearer without the -i stuff at the top.
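
For comparison, a sketch of the same request without `-i`, which prints only the response body; the metric names come from elsewhere in this review and the values are illustrative:

~~~ shell
# Without -i, curl prints only the body: metric names and values in the
# Prometheus text format, one per line. The values shown are illustrative.
$ curl http://localhost:8080/_status/vars

sql_conns 12
sql_query_count 3810
sys_uptime 86400
~~~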


v2.0/monitoring-and-alerting.md, line 95 at r1 (raw file):

#### /&#95;admin/v1/health

If a node is unable to communicate with a majority of the other nodes in the cluster, the  `http://<node-host>:<http-port>/_admin/v1/health` endpoint returns an HTTP `503 Service Unavailable` status response code with an error:

For v2.0, we're going to prefer pointing people to the /health?ready=1 endpoint, as @bdarnell and I have been discussing today on cockroachdb/cockroach#22911. It's still a work in progress, but the plan is for it to be mostly the same as today's /_admin/v1/health except that it will also return an error if the node is draining.
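
A hedged sketch of what checking that endpoint might look like once cockroachdb/cockroach#22911 settles; the exact URL and behavior were still in flux at the time of this review:

~~~ shell
# Sketch only: poll the readiness endpoint discussed above and print just
# the HTTP status code. Per the discussion, a healthy, non-draining node is
# expected to return 200, and an unready or draining node an error status.
$ curl -s -o /dev/null -w '%{http_code}\n' 'http://localhost:8080/health?ready=1'
200
~~~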


v2.0/monitoring-and-alerting.md, line 140 at r1 (raw file):

- With the `--ranges` flag, you get granular range and replica details, including unavailability and under-replication.
- With the `--stats` flag, you get granular disk usage details.
- With the `--all` flag, you get all of the above.

There's also a --decommission option.
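
As a usage sketch combining the flags above; the host and certs flags are placeholders, not values from this PR:

~~~ shell
# Sketch: --all includes the range, stats, and decommission details in one
# report. Host and certificate settings below are placeholders.
$ cockroach node status --all --certs-dir=certs --host=<address of any node>
~~~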


v2.0/monitoring-and-alerting.md, line 142 at r1 (raw file):

- With the `--all` flag, you get all of the above.

## Events to Alert On

I'd recommend adding in the InstanceFlapping alert from https://github.com/cockroachdb/cockroach/blob/master/monitoring/rules/alerts.rules
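
For reference, one way to pull that rules file into a Prometheus setup; the target directory is a placeholder, and the raw URL is derived from the repository link above, so confirm it before relying on it:

~~~ shell
# Sketch: fetch the alert rules file referenced above and place it where your
# Prometheus instance loads rule files from (directory is a placeholder).
$ curl -o /etc/prometheus/rules/alerts.rules \
    https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/rules/alerts.rules
~~~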


v2.0/monitoring-and-alerting.md, line 167 at r1 (raw file):

- **Rule:** Send an alert when a node is not executing SQL despite having connections.

- **How to detect:** The `sql_conns` metric in the node's `_status/vars` output will be great than `0` while the the `sql_query_count` metric will be `0`. You can also break this down by statement type using `sql_select_count`, `sql_insert_count`, `sql_update_count`, and `sql_delete_count`.

s/great/greater
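
A rough sketch of how that condition could be spot-checked by hand against the `_status/vars` output; the grep pattern assumes the metric names appear unlabeled at the start of a line, and the output values are illustrative:

~~~ shell
# Rough sketch: pull the two metrics discussed above and compare them by hand.
# A monitoring system would evaluate this condition continuously instead.
$ curl -s http://localhost:8080/_status/vars | grep -E '^(sql_conns|sql_query_count) '
sql_conns 12
sql_query_count 0
~~~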


Comments from Reviewable

@jseldess
Contributor Author

Review status: 4 of 28 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.


v2.0/monitoring-and-alerting.md, line 17 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

If we're being really precise, I'd say that "much of the Admin UI becomes unavailable" or "most of the Admin UI becomes unavailable", since a few of the summary fields and a lot of the debug pages still work. That's pretty nit-picky, though, so feel free to leave as is.

Done.


v2.0/monitoring-and-alerting.md, line 24 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Any particular reason to use the -i? I think the output would be clearer without the -i stuff at the top.

Done. Also for the other examples.


v2.0/monitoring-and-alerting.md, line 140 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

There's also a --decommission option.

Done.


v2.0/monitoring-and-alerting.md, line 142 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I'd recommend adding in the InstanceFlapping alert from https://github.com/cockroachdb/cockroach/blob/master/monitoring/rules/alerts.rules

Done.


v2.0/monitoring-and-alerting.md, line 167 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/great/greater

Done.


v1.1/production-monitoring.md, line 3 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

So we aren't documenting any of this for v1.1? Why create the empty page at all?

I was starting with 2.0 only. Left this in by accident.

Will apply to 1.1 once you're happy with the content.


v2.0/deploy-on-premises.md, line 48 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I'm kind of surprised there isn't already reusable content about this. Don't worry about it if there isn't, though.

The on-premises (previously manual) tutorials are the only two that feature these HAProxy steps, and the secure and insecure steps are a little different. I agree, though, that there's opportunity to reuse content. Going to punt for now.


v2.0/deploy-on-premises.md, line 94 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/timout/timeout/

Done.


v2.0/deploy-on-premises.md, line 95 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/also used for inter-node communication/likely already being used by the CockroachDB node.

Done.


Comments from Reviewable

@jseldess
Contributor Author

v2.0/monitoring-and-alerting.md, line 95 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

For v2.0, we're going to prefer pointing people to the /health?ready=1 endpoint, as @bdarnell and I have been discussing today on cockroachdb/cockroach#22911. It's still a work in progress, but the plan is for it to be mostly the same as today's /_admin/v1/health except that it will also return an error if the node is draining.

OK, thanks, @a-robinson. Still need to look into this.


Comments from Reviewable

@jseldess
Contributor Author

@robert-s-lee, please take a look as well when you have time.

@a-robinson
Contributor

:lgtm: other than still needing to update the health-checking section.


Reviewed 1 of 14 files at r1, 35 of 35 files at r2.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


v2.0/monitoring-and-alerting.md, line 140 at r2 (raw file):

- **Rule:** Send an alert if a node has restarted more than 5 times in 10 minutes.

- **How to detect:** Calculate this using the `sys_uptime` metric in the node's `_status/vars` output, which gives you the seconds that the `cockroach` process has been running.

s/using the sys_uptime metric/using the number of times the sys_uptime metric was reset back to zero/
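
A sketch of how that reset-counting could be expressed as a Prometheus query, run here against the HTTP API; the Prometheus address, the job label, and the threshold are assumptions, not values taken from this PR:

~~~ shell
# Sketch: count sys_uptime resets over the last 10 minutes via the Prometheus
# HTTP API. The address, job label, and threshold are assumptions.
$ curl -s -G 'http://<prometheus-host>:9090/api/v1/query' \
    --data-urlencode 'query=resets(sys_uptime{job="cockroachdb"}[10m]) > 5'
~~~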


Comments from Reviewable

- Recategorize deployment as manual (including cloud) and orchestrated.
- Rename some files and links.
@jseldess
Contributor Author

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


v2.0/monitoring-and-alerting.md, line 95 at r1 (raw file):

Previously, jseldess wrote…

OK, thanks, @a-robinson. Still need to look into this.

Done. PTAL.


v2.0/monitoring-and-alerting.md, line 140 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/using the sys_uptime metric/using the number of times the sys_uptime metric was reset back to zero/

Done.


Comments from Reviewable

- Add monitoring and alerting page.
- Add monitoring and alerting to production checklist.
- Mention other monitoring systems that can work with prometheus endpoint.
@jseldess jseldess force-pushed the production-monitoring branch 2 times, most recently from 8e1188e to e20e034 on February 28, 2018 19:47
@jseldess jseldess merged commit 02cec5e into master Mar 1, 2018
@jseldess jseldess deleted the production-monitoring branch March 1, 2018 10:20
jseldess pushed a commit that referenced this pull request Mar 3, 2018
In #2564, I renamed Manual Deployment to On-Premises Deployment but neglected to update the clock synch include, which was targeting content based on page title.