
Production monitoring and alerting #2564

Merged
merged 3 commits into master from production-monitoring on Mar 1, 2018

Conversation

jseldess
Contributor

@jseldess jseldess commented Feb 27, 2018

  • Document production monitoring and alerting:
    • Add monitoring and alerting page.
    • Add monitoring and alerting to production checklist.
    • Mention other monitoring systems that can work with the Prometheus endpoint.
  • Clean up deployment docs:
    • Recategorize deployment as manual (including cloud) and orchestrated.
    • Rename some files and links.

Fixes #21.
Fixes #2305.
Fixes #1792.
Furthers #2120.

@cockroach-teamcity
Member

This change is Reviewable

@jseldess jseldess requested review from bdarnell and a-robinson and removed request for bdarnell February 27, 2018 18:28
@a-robinson
Contributor

Reviewed 12 of 12 files at r1.
Review status: all files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


v1.1/production-monitoring.md, line 3 at r1 (raw file):

---
title: Production Monitoring
summary:

So we aren't documenting any of this for v1.1? Why create the empty page at all?


v2.0/deploy-on-premises.md, line 48 at r1 (raw file):

{% include prod_deployment/secure-test-cluster.md %}

## Step 6. Set up HAProxy load balancers

I'm kind of surprised there isn't already reusable content about this. Don't worry about it if there isn't, though.


v2.0/deploy-on-premises.md, line 94 at r1 (raw file):

	Field | Description
	------|------------
	`timout connect`<br>`timeout client`<br>`timeout server` | Timeout values that should be suitable for most deployments.

s/timout/timeout/


v2.0/deploy-on-premises.md, line 95 at r1 (raw file):

	------|------------
	`timout connect`<br>`timeout client`<br>`timeout server` | Timeout values that should be suitable for most deployments.
	`bind` | The port that HAProxy listens on. This is the port clients will connect to and thus needs to be allowed by your network configuration.<br><br>This tutorial assumes HAProxy is running on a separate machine from CockroachDB nodes. If you run HAProxy on the same machine as a node (not recommended), you'll need to change this port, as `26257` is also used for inter-node communication.

s/also used for inter-node communication/likely already being used by the CockroachDB node.
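
For readers following along, a minimal sketch of what the `haproxy.cfg` fields discussed above could look like; the timeouts, the bind port, and the node addresses are placeholders, not values taken from this PR:

~~~ shell
# Sketch only: write a minimal haproxy.cfg containing the fields called out
# in the table above. All values are illustrative placeholders.
cat > haproxy.cfg <<'EOF'
defaults
    mode               tcp
    # Timeout values that should be suitable for most deployments.
    timeout connect    10s
    timeout client     1m
    timeout server     1m

listen psql
    # The port clients connect to; change it if HAProxy shares a machine
    # with a CockroachDB node, since 26257 would already be taken.
    bind :26257
    mode tcp
    balance roundrobin
    server cockroach1 <node1-address>:26257
    server cockroach2 <node2-address>:26257
    server cockroach3 <node3-address>:26257
EOF
~~~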


v2.0/monitoring-and-alerting.md, line 17 at r1 (raw file):

The [built-in Admin UI](admin-ui-overview.html) gives you essential metrics about a cluster's health, such as the number of live, dead, and suspect nodes, the number of unavailable ranges, and the queries per second and service latency across the cluster. It is accessible from every node at `http://<host>:<http-port>`, or `http://<host>:8080` by default.

{{site.data.alerts.callout_danger}}Because the Admin UI is built into CockroachDB, if a cluster becomes unavailable, the Admin UI becomes unavailable as well. Therefore, it's essential to plan additional methods of monitoring cluster health as described below.{{site.data.alerts.end}}

If we're being really precise, I'd say that "much of the Admin UI becomes unavailable" or "most of the Admin UI becomes unavailable", since a few of the summary fields and a lot of the debug pages still work. That's pretty nit-picky, though, so feel free to leave as is.


v2.0/monitoring-and-alerting.md, line 24 at r1 (raw file):

~~~ shell
$ curl -i http://localhost:8080/_status/vars

Any particular reason to use the -i? I think the output would be clearer without the -i stuff at the top.
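
For comparison, a sketch of the same request without `-i`, which prints only the response body; the metric names come from elsewhere in this review and the values are illustrative:

~~~ shell
# Without -i, curl prints only the body: metric names and values in the
# Prometheus text format, one per line. The values shown are illustrative.
$ curl http://localhost:8080/_status/vars

sql_conns 12
sql_query_count 3810
sys_uptime 86400
~~~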


v2.0/monitoring-and-alerting.md, line 95 at r1 (raw file):

#### /&#95;admin/v1/health

If a node is unable to communicate with a majority of the other nodes in the cluster, the  `http://<node-host>:<http-port>/_admin/v1/health` endpoint returns an HTTP `503 Service Unavailable` status response code with an error:

For v2.0, we're going to prefer pointing people to the /health?ready=1 endpoint, as @bdarnell and I have been discussing today on cockroachdb/cockroach#22911. It's still a work in progress, but the plan is for it to be mostly the same as today's /_admin/v1/health except that it will also return an error if the node is draining.
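
A hedged sketch of what checking that endpoint might look like once cockroachdb/cockroach#22911 settles; the exact URL and behavior were still in flux at the time of this review:

~~~ shell
# Sketch only: poll the readiness endpoint discussed above and print just
# the HTTP status code. Per the discussion, a healthy, non-draining node is
# expected to return 200, and an unready or draining node an error status.
$ curl -s -o /dev/null -w '%{http_code}\n' 'http://localhost:8080/health?ready=1'
200
~~~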


v2.0/monitoring-and-alerting.md, line 140 at r1 (raw file):

- With the `--ranges` flag, you get granular range and replica details, including unavailability and under-replication.
- With the `--stats` flag, you get granular disk usage details.
- With the `--all` flag, you get all of the above.

There's also a --decommission option.
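
As a usage sketch combining the flags above; the host and certs flags are placeholders, not values from this PR:

~~~ shell
# Sketch: --all includes the range, stats, and decommission details in one
# report. Host and certificate settings below are placeholders.
$ cockroach node status --all --certs-dir=certs --host=<address of any node>
~~~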


v2.0/monitoring-and-alerting.md, line 142 at r1 (raw file):

- With the `--all` flag, you get all of the above.

## Events to Alert On

I'd recommend adding in the InstanceFlapping alert from https://github.com/cockroachdb/cockroach/blob/master/monitoring/rules/alerts.rules
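
For reference, one way to pull that rules file into a Prometheus setup; the target directory is a placeholder, and the raw URL is derived from the repository link above, so confirm it before relying on it:

~~~ shell
# Sketch: fetch the alert rules file referenced above and place it where your
# Prometheus instance loads rule files from (directory is a placeholder).
$ curl -o /etc/prometheus/rules/alerts.rules \
    https://raw.githubusercontent.com/cockroachdb/cockroach/master/monitoring/rules/alerts.rules
~~~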


v2.0/monitoring-and-alerting.md, line 167 at r1 (raw file):

- **Rule:** Send an alert when a node is not executing SQL despite having connections.

- **How to detect:** The `sql_conns` metric in the node's `_status/vars` output will be great than `0` while the the `sql_query_count` metric will be `0`. You can also break this down by statement type using `sql_select_count`, `sql_insert_count`, `sql_update_count`, and `sql_delete_count`.

s/great/greater
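
A rough sketch of how that condition could be spot-checked by hand against the `_status/vars` output; the grep pattern assumes the metric names appear unlabeled at the start of a line, and the output values are illustrative:

~~~ shell
# Rough sketch: pull the two metrics discussed above and compare them by hand.
# A monitoring system would evaluate this condition continuously instead.
$ curl -s http://localhost:8080/_status/vars | grep -E '^(sql_conns|sql_query_count) '
sql_conns 12
sql_query_count 0
~~~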


Comments from Reviewable

@jseldess
Contributor Author

Review status: 4 of 28 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.


v2.0/monitoring-and-alerting.md, line 17 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

If we're being really precise, I'd say that "much of the Admin UI becomes unavailable" or "most of the Admin UI becomes unavailable", since a few of the summary fields and a lot of the debug pages still work. That's pretty nit-picky, though, so feel free to leave as is.

Done.


v2.0/monitoring-and-alerting.md, line 24 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Any particular reason to use the -i? I think the output would be clearer without the -i stuff at the top.

Done. Also for the other examples.


v2.0/monitoring-and-alerting.md, line 140 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

There's also a --decommission option.

Done.


v2.0/monitoring-and-alerting.md, line 142 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I'd recommend adding in the InstanceFlapping alert from https://github.com/cockroachdb/cockroach/blob/master/monitoring/rules/alerts.rules

Done.


v2.0/monitoring-and-alerting.md, line 167 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/great/greater

Done.


v1.1/production-monitoring.md, line 3 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

So we aren't documenting any of this for v1.1? Why create the empty page at all?

I was starting with 2.0 only. Left this in by accident.

Will apply to 1.1 once you're happy with the content.


v2.0/deploy-on-premises.md, line 48 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I'm kind of surprised there isn't already reusable content about this. Don't worry about it if there isn't, though.

The on-premises (previously manual) tutorials are the only two that feature these HAProxy steps, and the secure and insecure steps are a little different. I agree, though, that there's opportunity to reuse content. Going to punt for now.


v2.0/deploy-on-premises.md, line 94 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/timout/timeout/

Done.


v2.0/deploy-on-premises.md, line 95 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/also used for inter-node communication/likely already being used by the CockroachDB node.

Done.


Comments from Reviewable

@jseldess
Contributor Author

v2.0/monitoring-and-alerting.md, line 95 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

For v2.0, we're going to prefer pointing people to the /health?ready=1 endpoint, as @bdarnell and I have been discussing today on cockroachdb/cockroach#22911. It's still a work in progress, but the plan is for it to be mostly the same as today's /_admin/v1/health except that it will also return an error if the node is draining.

OK, thanks, @a-robinson. Still need to look into this.


Comments from Reviewable

@jseldess
Contributor Author

@robert-s-lee, please take a look as well when you have time.

@a-robinson
Contributor

:lgtm: other than still needing to update the health-checking section.


Reviewed 1 of 14 files at r1, 35 of 35 files at r2.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


v2.0/monitoring-and-alerting.md, line 140 at r2 (raw file):

- **Rule:** Send an alert if a node has restarted more than 5 times in 10 minutes.

- **How to detect:** Calculate this using the `sys_uptime` metric in the node's `_status/vars` output, which gives you the seconds that the `cockroach` process has been running.

s/using the sys_uptime metric/using the number of times the sys_uptime metric was reset back to zero/
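
A sketch of how that reset-counting could be expressed as a Prometheus query, run here against the HTTP API; the Prometheus address, the job label, and the threshold are assumptions, not values taken from this PR:

~~~ shell
# Sketch: count sys_uptime resets over the last 10 minutes via the Prometheus
# HTTP API. The address, job label, and threshold are assumptions.
$ curl -s -G 'http://<prometheus-host>:9090/api/v1/query' \
    --data-urlencode 'query=resets(sys_uptime{job="cockroachdb"}[10m]) > 5'
~~~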


Comments from Reviewable

- Recategorize deployment as manual (including cloud) and orchestrated.
- Rename some files and links.
@jseldess
Contributor Author

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


v2.0/monitoring-and-alerting.md, line 95 at r1 (raw file):

Previously, jseldess wrote…

OK, thanks, @a-robinson. Still need to look into this.

Done. PTAL.


v2.0/monitoring-and-alerting.md, line 140 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/using the sys_uptime metric/using the number of times the sys_uptime metric was reset back to zero/

Done.


Comments from Reviewable

- Add monitoring and alerting page.
- Add monitoring and alerting to production checklist.
- Mention other monitoring systems that can work with prometheus endpoint.
@jseldess jseldess force-pushed the production-monitoring branch 2 times, most recently from 8e1188e to e20e034 on February 28, 2018 19:47
@jseldess jseldess merged commit 02cec5e into master Mar 1, 2018
@jseldess jseldess deleted the production-monitoring branch March 1, 2018 10:20
jseldess pushed a commit that referenced this pull request Mar 3, 2018
In #2564, I renamed Manual Deployment to On-Premises Deployment but neglected to update the clock synch include, which was targeting content based on page title.