-
Couldn't load subscription status.
- Fork 126
feature/add-client-etcd-monitoring #613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature/add-client-etcd-monitoring #613
Conversation
WalkthroughThis pull request removes the obsolete Grafana dashboard configuration for etcd3 by deleting Changes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
hack/download-dashboards.sh (1)
73-73: Reminder: Resolve TODOs and Check File Path FormatA TODO comment remains on line 73 alongside a file path containing a double slash (
dashboards//kubernetes-cluster/nodes/ntp.json). Consider addressing the TODO and verifying that all file paths are correctly formed to avoid potential issues during dashboard downloads.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
dashboards/control-plane/kube-etcd3.json(0 hunks)hack/download-dashboards.sh(2 hunks)packages/extra/etcd/Chart.yaml(1 hunks)packages/extra/etcd/templates/etcd-cluster.yaml(1 hunks)packages/extra/etcd/templates/podscrape.yaml(1 hunks)packages/extra/etcd/templates/prometheus-rules.yaml(1 hunks)packages/extra/monitoring/dashboards.list(1 hunks)packages/extra/versions_map(1 hunks)
💤 Files with no reviewable changes (1)
- dashboards/control-plane/kube-etcd3.json
✅ Files skipped from review due to trivial changes (3)
- packages/extra/etcd/Chart.yaml
- packages/extra/monitoring/dashboards.list
- packages/extra/versions_map
🔇 Additional comments (3)
packages/extra/etcd/templates/podscrape.yaml (1)
1-11: Review: New VMPodScrape Resource ConfigurationThe YAML defines a new VMPodScrape resource for scraping etcd pod metrics. The API version (
operator.victoriametrics.com/v1beta1), kind, metadata, and selector are correctly specified. Verify that your cluster’s operator supports this API version and consider whether "example-pod-scrape" should be renamed for production use.packages/extra/etcd/templates/etcd-cluster.yaml (1)
43-48: Review: Metrics Container Addition in EtcdCluster Pod TemplateThe addition of the dedicated metrics container under the podTemplate is clear. Configuring the container with name
etcdand exposing themetricsport on containerPort2381(TCP) aligns with the new monitoring strategy. Ensure this container definition integrates well with the rest of your workload specifications.hack/download-dashboards.sh (1)
71-71: Review: Updated Dashboard JSON File ReferenceThe reference in line 71 has been updated from
kube-etcd3.jsontokube-etcd.json, which is consistent with the removal of the outdated Grafana dashboard configuration. Confirm that all downstream processes are updated to use the new filename.
| - alert: etcdHighNumberOfFailedGRPCRequests | ||
| annotations: | ||
| summary: "etcd cluster '{{`{{ $labels.job }}`}}': '{{`{{ $value }}`}}' of requests for '{{`{{ $labels.grpc_method }}`}}' failed on etcd instance '{{`{{ $labels.instance }}`}}'." | ||
| expr: | | ||
| 100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method) | ||
| / | ||
| sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method) | ||
| > 1 | ||
| for: 10m | ||
| labels: | ||
| severity: warning | ||
|
|
||
| - alert: etcdHighNumberOfFailedGRPCRequests | ||
| annotations: | ||
| summary: "etcd cluster '{{`{{ $labels.job }}`}}': '{{`{{ $value }}`}}' of requests for '{{`{{ $labels.grpc_method }}`}}' failed on etcd instance '{{`{{ $labels.instance }}`}}'." | ||
| expr: | | ||
| 100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method) | ||
| / | ||
| sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method) | ||
| > 5 | ||
| for: 5m | ||
| labels: | ||
| severity: critical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: Duplicate GRPC Alert Rules Detected
Two alert rules named etcdHighNumberOfFailedGRPCRequests are defined with different thresholds and severity levels (one warning and one critical). Duplicate alert names can cause configuration conflicts or unexpected behavior in Prometheus. Consider renaming one of these alerts or consolidating the logic into a single rule with tiered thresholds.
| - alert: etcdHighNumberOfFailedHTTPRequests | ||
| annotations: | ||
| summary: "'{{`{{ $value }}`}}' of requests for '{{`{{ $labels.method }}`}}' failed on etcd instance '{{`{{ $labels.instance }}`}}'." | ||
| expr: | | ||
| sum(rate(etcd_http_failed_total{job=~".*etcd.*", code!="404"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~".*etcd.*"}[5m])) BY (method) > 0.01 | ||
| for: 10m | ||
| labels: | ||
| severity: warning | ||
|
|
||
| - alert: etcdHighNumberOfFailedHTTPRequests | ||
| annotations: | ||
| summary: "'{{`{{ $value }}`}}' of requests for '{{`{{ $labels.method }}`}}' failed on etcd instance '{{`{{ $labels.instance }}`}}'." | ||
| expr: | | ||
| sum(rate(etcd_http_failed_total{job=~".*etcd.*", code!="404"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~".*etcd.*"}[5m])) BY (method) > 0.05 | ||
| for: 10m | ||
| labels: | ||
| severity: critical | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: Duplicate HTTP Alert Rules Detected
Similarly, there are duplicate etcdHighNumberOfFailedHTTPRequests alerts with differing thresholds and severities. To maintain clarity and ensure correct alert evaluation, use unique alert names or combine the thresholds logically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 3m для etcdNotEnoughMembers многовато, а так аппрув
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Summary by CodeRabbit
VMPodScraperesource for improved pod metrics collection.