Production monitoring and alerting #2564
Reviewed 12 of 12 files at r1.
v1.1/production-monitoring.md, line 3 at r1 (raw file):
So we aren't documenting any of this for v1.1? Why create the empty page at all?
v2.0/deploy-on-premises.md, line 48 at r1 (raw file):
I'm kind of surprised there isn't already reusable content about this. Don't worry about it if there isn't, though.
v2.0/deploy-on-premises.md, line 94 at r1 (raw file):
s/timout/timeout/
v2.0/deploy-on-premises.md, line 95 at r1 (raw file):
s/also used for inter-node communication/likely already being used by the CockroachDB node.
v2.0/monitoring-and-alerting.md, line 17 at r1 (raw file):
If we're being really precise, I'd say that "much of the Admin UI becomes unavailable" or "most of the Admin UI becomes unavailable", since a few of the summary fields and a lot of the debug pages still work. That's pretty nit-picky, though, so feel free to leave as is.
v2.0/monitoring-and-alerting.md, line 24 at r1 (raw file):
Any particular reason to use the
v2.0/monitoring-and-alerting.md, line 95 at r1 (raw file):
For v2.0, we're going to prefer pointing people to the
v2.0/monitoring-and-alerting.md, line 140 at r1 (raw file):
There's also a
v2.0/monitoring-and-alerting.md, line 142 at r1 (raw file):
I'd recommend adding in the InstanceFlapping alert from https://github.com/cockroachdb/cockroach/blob/master/monitoring/rules/alerts.rules
v2.0/monitoring-and-alerting.md, line 167 at r1 (raw file):
s/great/greater
Comments from Reviewable
Review status: 4 of 28 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.
v2.0/monitoring-and-alerting.md, line 17 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.
v2.0/monitoring-and-alerting.md, line 24 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. Also for the other examples.
v2.0/monitoring-and-alerting.md, line 140 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.
v2.0/monitoring-and-alerting.md, line 142 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.
v2.0/monitoring-and-alerting.md, line 167 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.
v1.1/production-monitoring.md, line 3 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I was starting with 2.0 only. Left this in by accident. Will apply to 1.1 once you're happy with the content.
v2.0/deploy-on-premises.md, line 48 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
The on-premises (previously manual) tutorials are the only two that feature these HAProxy steps, and the secure and insecure steps are a little different. I agree, though, that there's an opportunity to reuse content. Going to punt for now.
v2.0/deploy-on-premises.md, line 94 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.
v2.0/deploy-on-premises.md, line 95 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.
v2.0/monitoring-and-alerting.md, line 95 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
OK, thanks, @a-robinson. Still need to look into this.
@robert-s-lee, please take a look as well when you have time.
0aebab0 to c532161
c532161 to be40b1c
be40b1c to bdf99f6
other than still needing to update the health-checking section.
Reviewed 1 of 14 files at r1, 35 of 35 files at r2.
v2.0/monitoring-and-alerting.md, line 140 at r2 (raw file):
s/using the
- Recategorize deployment as manual (including cloud) and orchestrated.
- Rename some files and links.
bdf99f6 to f4ee783
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.
v2.0/monitoring-and-alerting.md, line 95 at r1 (raw file): Previously, jseldess wrote…
Done. PTAL.
v2.0/monitoring-and-alerting.md, line 140 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.
- Add monitoring and alerting page.
- Add monitoring and alerting to production checklist.
- Mention other monitoring systems that can work with the Prometheus endpoint.
f4ee783 to 8b5d6ea
8e1188e to e20e034
e20e034 to 270e95c
In #2564, I renamed Manual Deployment to On-Premises Deployment but neglected to update the clock synch include, which was targeting content based on page title.
Fixes #21.
Fixes #2305.
Fixes #1792.
Furthers #2120.