-
Couldn't load subscription status.
- Fork 126
Import Piraeus dashboard and alerts. #658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Warning Rate limit exceeded@kvaps has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 11 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (2)
WalkthroughThis pull request introduces a set of monitoring configuration changes. A new entry for the Linstor dashboard has been added to the Changes
Possibly related PRs
Suggested reviewers
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
packages/system/piraeus-operator/alerts/piraeus-datastore.yaml (2)
7-54: Comprehensive LINSTOR alert rules with appropriate severities.The LINSTOR alerts cover critical failure modes (controller offline, satellite not online, storage pool errors) and warning conditions (error rates, capacity). The descriptions include helpful troubleshooting commands.
Note that line 54 uses
severity: warnwhile other alerts usewarning. Consider standardizing to eitherwarnorwarningacross all rules for consistency.- severity: warn + severity: warning
55-116: Well-defined DRBD alert rules covering key failure scenarios.The DRBD alerts cover important scenarios such as connectivity issues, device state problems, quorum loss, and resynchronization issues. The rules use appropriate PromQL expressions to detect these conditions.
There are some minor typos in the alert descriptions:
- DRBD device "{{ $labels.name }}" on "{{ $labels.node }}" is unintenionally diskless. + DRBD device "{{ $labels.name }}" on "{{ $labels.node }}" is unintentionally diskless.Also, as mentioned earlier, there's inconsistency in severity labels (
warnvswarning). Consider standardizing them.- severity: warn + severity: warning
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
dashboards/storage/piraeus-dashboard.json(1 hunks)packages/extra/monitoring/dashboards.list(1 hunks)packages/system/piraeus-operator/alerts/piraeus-datastore.yaml(1 hunks)packages/system/piraeus-operator/templates/alerts.yaml(1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.35.1)
packages/system/piraeus-operator/templates/alerts.yaml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
🔇 Additional comments (15)
packages/extra/monitoring/dashboards.list (1)
39-39: LGTM: Successfully added the Piraeus dashboard to monitoring list.The new dashboard entry is properly added to the dashboards list with the correct format, consistent with the existing pattern (under the storage category).
packages/system/piraeus-operator/templates/alerts.yaml (1)
1-7: Helm template for dynamically including alert files.This template efficiently aggregates all YAML files from the alerts directory to create a unified alerts configuration. The approach is elegant and maintainable, making it easy to add new alert rules in separate files.
Regarding the YAMLlint error, this is a false positive - the syntax
{{- $files := .Files.Glob ... -}}is valid Helm template syntax where the hyphen trims whitespace.🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
dashboards/storage/piraeus-dashboard.json (12)
1-25: Well-structured dashboard configuration with proper organization.The dashboard is properly set up with appropriate metadata, templating, and organization. The LINSTOR row is clearly defined with the right positioning.
26-85: Space Usage gauge with appropriate thresholds.The space usage gauge is well-configured with a sensible threshold (80%) and proper unit display (percent). The panel captures important storage pool metrics.
86-254: Storage stats panels with good visualization choices.The Total, Free, and Used storage stats are appropriately visualized with suitable color coding (text, dark-green, dark-red) that clearly distinguishes the metrics visually.
255-474: Resource and node count statistics are well organized.The panel layout for nodes, storage pools, resource definitions, and resources provides a concise overview of the system status with appropriate visualization options.
475-621: Effective visualization of scrape duration and error reporting.The time series visualization for scrape duration and the colored background for error reports provide clear visibility into system performance and potential issues.
622-798: Clear status indicators for resource health.The status indicators for offline nodes, failed storage pools, and failed resources use appropriate color coding and thresholds to highlight potential issues.
799-1106: Detailed I/O metrics with appropriate visualizations.The write rate statistics use stepped line graphs which are appropriate for monitoring these types of metrics, and the filtering to show the most active volumes is a good optimization.
1107-1204: Read rate metrics with consistent styling.The read rate visualizations maintain consistency with the write rate panels, using appropriate units and stacked displays for the by-node view.
1205-1378: DRBD section with appropriate gauges and metrics.The DRBD section uses gauges effectively to show resource counts and includes scrape duration metrics for monitoring performance.
1379-1704: Comprehensive DRBD status indicators and metrics.The out-of-sync data visualizations and status indicators for disconnected nodes, quorum issues, and storage failures provide a complete view of DRBD health.
1705-2132: Detailed tables for DRBD resource issues.The tables showing disconnected resources, out-of-sync resources, resources without quorum, and storage failures provide actionable information with appropriate sorting and column configuration.
2133-2193: Dashboard configuration with appropriate refresh rate and templating.The dashboard is correctly configured with a 30-second refresh rate, appropriate time range defaults, and templating variables for datasource and node selection.
packages/system/piraeus-operator/alerts/piraeus-datastore.yaml (1)
1-6: Properly structured PrometheusRule resource.The YAML file correctly defines a PrometheusRule with appropriate API version and metadata, following Kubernetes best practices.
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🧹 Nitpick comments (2)
dashboards/storage/linstor.json (2)
1282-1378: Graph Panel “Scrape Duration” (Second Instance): Interval Consistency
This instance uses the queryavg_over_time(scrape_duration_seconds{job="linstor-node", node=~"$node"}[$__rate_interval])which is similar to the earlier panel. Confirm that both versions are needed or if they should be consolidated.
1643-1704: Stat Panel “out-of-sync”: Duplicative or Complementary?
This panel usescount(count by (name) (drbd_peerdevice_outofsync_bytes{node=~"$node"} > 0)) OR on() vector(0)which appears very similar to the earlier “out-of-sync data” panel. Please confirm if both panels are necessary or if they should be merged.
🛑 Comments failed to post (1)
dashboards/storage/linstor.json (1)
740-798:
⚠️ Potential issueStat Panel “Failed Resources”: Query Expression Complexity – Potential Issue
The expressioncount(((linstor_volume_state{node=~"$node"} != 1) != 4) != -1) OR on() vector(0)contains multiple nested
!=comparisons that are likely not yielding the intended numeric result.Suggested Revision: If the intent is to count volumes in a “failed” state, consider using a direct label filter (for example, if a state label exists, e.g.,
state="failed"):- "expr": "count(((linstor_volume_state{node=~\"$node\"} != 1) != 4) != -1) OR on() vector(0)" + "expr": "count(linstor_volume_state{node=~\"$node\", state=\"failed\"}) OR on() vector(0)"Please review and adjust the query to accurately capture the failed resources metric.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.{ "datasource": "${datasource}", "description": "", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 1 } ] } }, "overrides": [] }, "gridPos": { "h": 3, "w": 4, "x": 8, "y": 14 }, "id": 49, "options": { "colorMode": "background", "graphMode": "none", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.2.6", "targets": [ { "exemplar": true, "expr": "count(linstor_volume_state{node=~\"$node\", state=\"failed\"}) OR on() vector(0)", "legendFormat": "", "refId": "A" } ], "title": "Failed Resources", "type": "stat" }

Summary by CodeRabbit