2026-04-21: 6 ACP KB article(s) #144
Changes from all commits
7cde20d
2bb33d8
c143a20
047ba6b
79eac74
9aa691f
d1b0ce6
c74fdac
1363d55
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| --- | ||
| kind: | ||
| - Troubleshooting | ||
| products: | ||
| - Alauda Container Platform | ||
| ProductsVersion: | ||
| - 4.1.0,4.2.x | ||
| --- | ||
| ## Issue | ||
|
|
||
| etcd performance degrades due to insufficient storage or network backend capabilities, producing log messages similar to the following: | ||
|
|
||
| ``` | ||
| etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms) | ||
| etcdserver: server is likely overloaded | ||
| etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute | ||
| wal: sync duration of xxxx s, expected less than 1s | ||
| ``` | ||
|
|
||
| These warnings indicate the storage subsystem or network cannot keep up with etcd's latency requirements. | ||
|
|
||
| ## Root Cause | ||
|
|
||
| etcd is highly sensitive to storage and network performance. Any bottleneck in the backend infrastructure — slow disk I/O, high network latency, packet drops, or CPU saturation — directly impacts the ability of the etcd cluster to process writes and maintain leader-heartbeat deadlines. A request should normally complete in under 50 ms; durations exceeding 200 ms trigger warnings in the logs. | ||
|
|
||
| ## Resolution | ||
|
|
||
| ### Identify the Bottleneck | ||
|
|
||
| Three common causes of etcd slowness: | ||
|
|
||
| 1. **Slow storage** — Disk I/O latency exceeds acceptable thresholds | ||
| 2. **CPU overload** — Control-plane nodes are overcommitted | ||
| 3. **Database size growth** — The etcd data file has grown beyond optimal size | ||
|
|
||
| ### Check Storage Performance with fio | ||
|
|
||
| Run an I/O benchmark on each control-plane node to validate disk performance: | ||
|
|
||
| ```bash | ||
| fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \ | ||
| --rw=write --iodepth=1 --fdatasync=1 --runtime=30 --time_based | ||
| ``` | ||
|
|
||
| The 99th percentile fdatasync latency must be under **10 ms**. | ||
|
|
||
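As a rough illustration of applying that 10 ms threshold, the sketch below compares a p99 sync latency against the limit. It assumes fio was run with `--output-format=json` and that the p99 value, in nanoseconds, was already extracted from the report. The function name `p99_under_10ms` is hypothetical, and the JSON path in the usage comment is an assumption about fio's report layout that may differ between fio versions.

```bash
# Hypothetical helper: succeeds (exit 0) when the given p99 sync latency,
# expressed in nanoseconds, is below the 10 ms threshold stated above.
p99_under_10ms() {
  p99_ns="$1"
  threshold_ns=10000000   # 10 ms = 10,000,000 ns
  [ "$p99_ns" -lt "$threshold_ns" ]
}

# Example usage (assumed JSON path; verify against your fio version):
#   p99=$(fio ... --output-format=json | jq '.jobs[0].sync.lat_ns.percentile["99.000000"]')
#   p99_under_10ms "$p99" && echo "disk OK" || echo "disk too slow for etcd"
```

Keeping the comparison in integer nanoseconds avoids floating-point handling in the shell.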
| ### Monitor Key etcd Metrics | ||
|
|
||
| Use Prometheus to track the following metrics: | ||
|
|
||
| | Metric | Threshold | Meaning | | ||
| |---|---|---| | ||
| | `etcd_disk_wal_fsync_duration_seconds_bucket` (p99) | < 10 ms | WAL write latency | | ||
| | `etcd_disk_backend_commit_duration_seconds_bucket` (p99) | < 25 ms | Backend commit latency | | ||
| | `etcd_network_peer_round_trip_time_seconds_bucket` (p99) | < 50 ms | Peer-to-peer network RTT | | ||
| | `etcd_mvcc_db_total_size_in_bytes` | < 2 GB (default quota) | Database size | | ||
|
|
||
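The `_bucket` series in the table are histograms, so a p99 has to be computed with `histogram_quantile`. The sketch below builds such a PromQL expression for any of the histogram metrics listed. The helper name `p99_query`, the 5-minute rate window, and the grouping by `instance` are assumptions for illustration, not values taken from the article.

```bash
# Hypothetical helper: prints a PromQL expression for the per-instance p99
# of an etcd latency histogram over a 5-minute window (window is an assumption).
p99_query() {
  printf 'histogram_quantile(0.99, sum by (instance, le) (rate(%s[5m])))\n' "$1"
}

# Example usage against a port-forwarded Prometheus (address is an assumption):
#   curl -sG 'http://localhost:9090/api/v1/query' \
#     --data-urlencode "query=$(p99_query etcd_disk_wal_fsync_duration_seconds_bucket)"
```

Grouping by `instance` keeps one result per etcd member, so a single slow disk is not averaged away by healthy members.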
| ### Network Health | ||
|
|
||
| High network latency or packet drops between etcd members destabilize the cluster. Monitor network RTT and investigate any persistent packet loss on the control-plane network interface. | ||
|
|
||
| ### Database Defragmentation | ||
|
|
||
| If the database size approaches the quota, perform manual defragmentation: | ||
|
|
||
| ```bash | ||
| kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \ | ||
| --endpoints=https://127.0.0.1:2379 \ | ||
| --cacert=/etc/kubernetes/pki/etcd/ca.crt \ | ||
| --cert=/etc/kubernetes/pki/etcd/server.crt \ | ||
| --key=/etc/kubernetes/pki/etcd/server.key | ||
|
Comment on lines +67 to +71
Contributor
🧩 Analysis chain

🌐 Web query:
💡 Result: For Kubernetes static-pod etcd in kubeadm style, etcdctl should use the /etc/kubernetes/pki/etcd/peer.crt and /etc/kubernetes/pki/etcd/peer.key certificate/key pair (along with /etc/kubernetes/pki/etcd/ca.crt) for client-auth operations like etcdctl endpoint health and etcdctl defrag. Using /etc/kubernetes/pki/etcd/server.crt and server.key is incorrect for etcdctl client operations, as these are server-side certificates for etcd listening on client port 2379. The peer certificates are used for client authentication when one etcd member communicates with others over the client port. The healthcheck-client.crt/key is specifically for Kubernetes liveness probes and not intended for general etcdctl use. Example command: ETCDCTL_API=3 etcdctl \ --endpoints=https://127.0.0.1:2379 \ --cacert=/etc/kubernetes/pki/etcd/ca.crt \ --cert=/etc/kubernetes/pki/etcd/peer.crt \ --key=/etc/kubernetes/pki/etcd/peer.key \ endpoint health This applies to both single-node local etcd and stacked HA etcd static pods managed by kubeadm. Citations:
Use peer certificates for `etcdctl`.
|
||
| ``` | ||
|
|
||
| ## Diagnostic Steps | ||
|
|
||
| Check etcd logs for latency warnings: | ||
|
|
||
| ```bash | ||
| kubectl logs -n kube-system etcd-<node-name> --tail=100 | grep -E "took too long|heartbeat|overloaded" | ||
| ``` | ||
|
|
||
| Query etcd metrics directly via the Prometheus endpoint: | ||
|
|
||
| ```bash | ||
| kubectl exec -n kube-system etcd-<node-name> -- wget -qO- http://127.0.0.1:2381/metrics 2>/dev/null \ | ||
| | grep -E "etcd_disk_wal_fsync|etcd_disk_backend_commit|etcd_mvcc_db_total_size" | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,111 @@ | ||||||||||||||||||||||
| --- | ||||||||||||||||||||||
| kind: | ||||||||||||||||||||||
| - Troubleshooting | ||||||||||||||||||||||
| products: | ||||||||||||||||||||||
| - Alauda Container Platform | ||||||||||||||||||||||
| ProductsVersion: | ||||||||||||||||||||||
| - 4.1.0,4.2.x | ||||||||||||||||||||||
| --- | ||||||||||||||||||||||
| ## Issue | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| When troubleshooting node-level problems, increasing the kubelet log verbosity helps identify the root cause. The default log level (`2`) may not provide enough detail for complex issues such as pod scheduling failures, volume mount errors, or container runtime communication problems. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ## Root Cause | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| The kubelet supports configurable log verbosity levels ranging from `0` (least verbose) to `10` (most verbose). The default level is `2`, which provides basic operational information. Higher levels expose progressively more diagnostic data, but consume additional CPU, disk I/O, and memory on the node. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ## Resolution | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ### Log Level Reference | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| | Level Range | Purpose | | ||||||||||||||||||||||
| |---|---| | ||||||||||||||||||||||
| | 0 | Critical errors only | | ||||||||||||||||||||||
| | 1–2 | Default operational output | | ||||||||||||||||||||||
| | 3–4 | Debug-level information, suitable for most troubleshooting | | ||||||||||||||||||||||
| | 5–8 | Trace-level output, verbose internal state dumps | | ||||||||||||||||||||||
| | 9–10 | Maximum verbosity, rarely needed | | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ### Persistent Configuration (Mutable Host OS) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| On mutable host OSes (standard Linux distributions with a writable `/etc`), set the kubelet log level persistently by adding or modifying the `--v` flag via a systemd drop-in file: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```bash | ||||||||||||||||||||||
| sudo mkdir -p /etc/systemd/system/kubelet.service.d/ | ||||||||||||||||||||||
| sudo tee /etc/systemd/system/kubelet.service.d/10-log-level.conf <<EOF | ||||||||||||||||||||||
| [Service] | ||||||||||||||||||||||
| Environment="KUBELET_LOG_LEVEL=4" | ||||||||||||||||||||||
| ExecStart= | ||||||||||||||||||||||
| ExecStart=/usr/bin/kubelet \$KUBELET_KUBECONFIG_ARGS \$KUBELET_CONFIG_ARGS \$KUBELET_LOG_LEVEL | ||||||||||||||||||||||
|
Comment on lines +35 to +39
Contributor
🧩 Analysis chain

🏁 Script executed:
#!/bin/bash
# Verify expected kubelet flag format and effective command on a node
systemctl show kubelet -p ExecStart
ps -ef | grep '[k]ubelet' | grep -o -- '--v=[0-9]\+'

Repository: alauda/knowledge
Length of output: 148

🏁 Script executed:
# Locate and examine the target documentation file
find . -name "Configure_Kubelet_Log_Level_Verbosity.md" -type f

Repository: alauda/knowledge
Length of output: 120

🏁 Script executed:
# Once located, read the file to see the full context (lines 1-50)
cat -n docs/en/solutions/Configure_Kubelet_Log_Level_Verbosity.md | head -50

Repository: alauda/knowledge
Length of output: 2155

🌐 Web query:
💡 Result: The kubelet supports the -v or --v flag (type: int) to control log verbosity level using Kubernetes' klog logging system. This sets the threshold for V(LEVEL) statements in the code, where higher values enable more detailed logging. Log format is primarily "text" (klog native format), configurable via --logging-format (default: text; json supported in some components). Example text log line at high verbosity: I1025 00:15:15.525108 1 httplog.go:79] GET /api/v1/namespaces/... (1.512ms) 200 ... Increasing --v=N logs more events (less severe); --v=0 logs only critical events. Recommended production default: --v=2. OpenShift docs provide detailed levels: - --v=0: Always visible to operator - --v=1: Reasonable default - --v=2: Steady state info, important changes (recommended default) - --v=3: Extended changes - --v=4: Debug - --v=6: Requested resources - --v=7: HTTP headers - --v=8: HTTP contents Use --vmodule=pattern=N for per-file/module overrides (text format only). To set: command-line flag kubelet --v=4, config file (logging.verbosity: 4), or environment KUBELET_LOG_LEVEL=4 (systemd drop-in). Restart kubelet after changes. Many flags deprecated; prefer --config file. Citations:
🏁 Script executed:
# Search for kubelet systemd configuration patterns in the repo
rg -i "kubelet.*environment\|kubelet.*extra.*args" --type=md --type=yaml --type=json -A 3 -B 1

Repository: alauda/knowledge
Length of output: 42

🏁 Script executed:
# Check for any references to KUBELET_LOG_LEVEL, KUBELET_EXTRA_ARGS in documentation
rg "KUBELET_LOG_LEVEL|KUBELET_EXTRA_ARGS|KUBELET_KUBECONFIG_ARGS"

Repository: alauda/knowledge
Length of output: 404

🌐 Web query:
💡 Result: The KUBELET_LOG_LEVEL environment variable sets the kubelet log verbosity level (equivalent to the --v flag) in kubeadm-managed systemd services. It is not part of standard upstream Kubernetes kubeadm configuration but is used in distributions like OpenShift/OKD. In kubeadm setups, kubelet logging is primarily configured via: - The --v flag added to KUBELET_KUBEADM_ARGS in /var/lib/kubelet/kubeadm-flags.env (generated by kubeadm init/join). - Or in the structured config.yaml (/var/lib/kubelet/config.yaml) under logging.verbosity. To set log level in standard kubeadm: 1. Edit /var/lib/kubelet/kubeadm-flags.env and append --v=N to KUBELET_KUBEADM_ARGS (e.g., KUBELET_KUBEADM_ARGS="--v=4 ..."). 2. Or edit /var/lib/kubelet/config.yaml: logging: verbosity: N. 3. Run: systemctl daemon-reload && systemctl restart kubelet. For OpenShift-style override (compatible with kubeadm): Create /etc/systemd/system/kubelet.service.d/10-logging.conf: [Service] Environment="KUBELET_LOG_LEVEL=4" Then: systemctl daemon-reload && systemctl restart kubelet. Verbosity levels: 0 (minimal), 1-4 (debug), 5-8 (trace). Default is typically 2 in OpenShift examples. Kubeadm's /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf supports Environment= and EnvironmentFile=, so KUBELET_LOG_LEVEL works if the kubelet binary recognizes it (likely maps to --v). Citations:
Persistent snippet passes an invalid kubelet argument format at lines 37–39.

Suggested fix:
-Environment="KUBELET_LOG_LEVEL=4"
+Environment="KUBELET_EXTRA_ARGS=--v=4"
ExecStart=
-ExecStart=/usr/bin/kubelet \$KUBELET_KUBECONFIG_ARGS \$KUBELET_CONFIG_ARGS \$KUBELET_LOG_LEVEL
+ExecStart=/usr/bin/kubelet \$KUBELET_KUBECONFIG_ARGS \$KUBELET_CONFIG_ARGS \$KUBELET_EXTRA_ARGS
coderabbitai[bot] marked this conversation as resolved.
|
||||||||||||||||||||||
| EOF | ||||||||||||||||||||||
| sudo systemctl daemon-reload | ||||||||||||||||||||||
| sudo systemctl restart kubelet | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ### Persistent Configuration (Immutable OS Nodes) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| On immutable-OS nodes — MicroOS, or any setup where `/etc` is backed by a read-mostly overlay that is reset on node upgrades or rollbacks — direct file edits under `/etc/systemd/system/kubelet.service.d/` **will not survive the next node update**. You may see the desired verbosity right after the change, then lose it silently when the node image is replaced. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Persist the change through ACP's Immutable Infrastructure mechanism instead: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| - Define the drop-in file as part of the node configuration managed by ACP (under `configure/clusters/nodes`). The platform renders and re-applies it every time a node boots, so the override survives OS upgrades and rollbacks. | ||||||||||||||||||||||
| - Trigger a rolling apply on the target node pool. ACP will cordon/drain, restart the kubelet with the new verbosity, and resume scheduling. | ||||||||||||||||||||||
| - Revert the same way — update the node configuration to remove the override; do not `rm` the file directly on the node, because the mutation will be lost at the next reconcile. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| If the cluster spans both mutable and immutable nodes, scope the change to a node group / pool so that only the intended nodes carry the higher verbosity. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ### One-Time Change (Single Node) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| For temporary debugging on a single mutable-OS node without touching the persistent configuration, override the kubelet arguments directly on that node: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```bash | ||||||||||||||||||||||
| sudo systemctl edit kubelet | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Add the following to raise verbosity to level 4: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```ini | ||||||||||||||||||||||
| [Service] | ||||||||||||||||||||||
| Environment="KUBELET_EXTRA_ARGS=--v=4" | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Then reload and restart: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```bash | ||||||||||||||||||||||
| sudo systemctl daemon-reload | ||||||||||||||||||||||
| sudo systemctl restart kubelet | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| On immutable-OS nodes, prefer the Immutable Infrastructure flow above even for short investigations: running `systemctl edit` on a single node works until that node is re-imaged, at which point the change is gone without warning. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| > **Important:** Revert the log level back to the default (`2`) after collecting the necessary logs. Extended operation at high verbosity places significant load on node resources. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ## Diagnostic Steps | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Verify the current kubelet log level by inspecting the running process: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```bash | ||||||||||||||||||||||
| ps aux | grep kubelet | grep -o '\-\-v=[0-9]*' | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
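The same extraction can be wrapped in a small function so it works on any captured command line, for example one copied from `ps` output or from `systemctl show kubelet -p ExecStart`. This is a minimal sketch: `kubelet_verbosity` is a hypothetical name, and it assumes that when `--v` appears more than once the last occurrence is the one that takes effect. An empty result only means the flag is absent from the command line; the verbosity may still be set elsewhere, such as in the kubelet configuration file.

```bash
# Hypothetical helper: prints the numeric value of the last --v=N flag found
# in the given kubelet command line, or nothing if the flag is absent.
kubelet_verbosity() {
  printf '%s\n' "$1" \
    | grep -o -e '--v=[0-9][0-9]*' \
    | tail -n 1 \
    | cut -d= -f2
}

# Example usage:
#   kubelet_verbosity "$(ps -o args= -C kubelet)"
```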
| Gather kubelet logs from a specific node: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```bash | ||||||||||||||||||||||
| kubectl get nodes | ||||||||||||||||||||||
| kubectl debug node/<node-name> --image=busybox -- cat /host/var/log/kubelet.log | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
|
Comment on lines +95 to +96
Contributor
🧩 Analysis chain

🌐 Web query:
💡 Result: On Kubernetes and OpenShift nodes managed by systemd, the recommended way to collect kubelet logs is using journalctl -u kubelet. For OpenShift, additional commands like oc adm node-logs --role= -u kubelet are also recommended. The file /var/log/kubelet.log is not guaranteed to exist. On Linux nodes using systemd (the default for modern distributions), kubelet writes logs to journald by default, not to a .log file in /var/log. A .log file only appears if systemd is not present. Citations:
On modern systemd-based nodes (the default for Kubernetes/OpenShift), kubelet writes logs to journald, not to a file in `/var/log`.

Suggested doc fix:
-kubectl debug node/<node-name> --image=busybox -- cat /host/var/log/kubelet.log
+kubectl debug node/<node-name> --image=busybox -- chroot /host journalctl -u kubelet.service --since "1 hour ago"
|
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Alternatively, SSH into the node and use journalctl: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```bash | ||||||||||||||||||||||
| ssh <node-address> | ||||||||||||||||||||||
| sudo journalctl -b -f -u kubelet.service | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| To collect logs from all nodes at once: | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ```bash | ||||||||||||||||||||||
| for n in $(kubectl get nodes --no-headers | awk '{print $1}'); do | ||||||||||||||||||||||
| ssh "$n" "sudo journalctl -u kubelet.service --since '1 hour ago'" > "${n}.kubelet.log" | ||||||||||||||||||||||
| done | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,118 @@ | ||
| --- | ||
| kind: | ||
| - Information | ||
| products: | ||
| - Alauda Container Platform | ||
| ProductsVersion: | ||
| - 4.1.0,4.2.x | ||
| --- | ||
| ## Issue | ||
|
|
||
| When etcd auto-defragmentation is disabled, the database accumulates unused space over time. Without proactive monitoring, the etcd data file can grow to the point where cluster performance degrades. A mechanism is needed to alert operators when manual defragmentation becomes necessary. | ||
|
|
||
| ## Resolution | ||
|
|
||
| Create a custom `PrometheusRule` resource that triggers alerts based on the ratio of unused space within the etcd database. | ||
|
|
||
| ### Prerequisites | ||
|
|
||
| Ensure the Prometheus Operator is deployed and the `PrometheusRule` CRD is available in the cluster: | ||
|
|
||
| ```bash | ||
| kubectl get crd prometheusrules.monitoring.coreos.com | ||
| ``` | ||
|
|
||
| ### Create the Alert Rules | ||
|
|
||
| Apply the following `PrometheusRule` manifest. Adjust the namespace to match the monitoring stack configuration (commonly `monitoring` or `kube-system`): | ||
|
|
||
| ```yaml | ||
| apiVersion: monitoring.coreos.com/v1 | ||
| kind: PrometheusRule | ||
| metadata: | ||
| name: etcd-defragmentation-alerts | ||
| namespace: monitoring | ||
| spec: | ||
| groups: | ||
| - name: etcd-defragmentation.rules | ||
| rules: | ||
| - alert: EtcdDefragIsAdvised | ||
| annotations: | ||
| summary: >- | ||
| Etcd database unused space exceeds 35%. | ||
| Consider running defragmentation. | ||
| description: >- | ||
| The etcd database has more than 35% unused space | ||
| and the total size exceeds 400 MB. Schedule a | ||
| defragmentation during a maintenance window. | ||
| expr: >- | ||
| avg(etcd_db_total_size_in_bytes) > 419430400 | ||
| and | ||
| ( | ||
| (avg(etcd_mvcc_db_total_size_in_bytes) | ||
| - avg(etcd_mvcc_db_total_size_in_use_in_bytes)) | ||
| * 100 | ||
| / avg(etcd_mvcc_db_total_size_in_bytes) | ||
| ) > 35 | ||
|
Comment on lines +49 to +56
Contributor
🧩 Analysis chain

🌐 Web query:
💡 Result: Yes, it is recommended to evaluate etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes per instance without global aggregation like avg to avoid masking fragmentation on a single etcd member. Citations:
Remove global `avg` aggregation.

Per Prometheus and etcd best practices, evaluate fragmentation metrics per instance without global aggregation.

Suggested PromQL adjustment:
- avg(etcd_db_total_size_in_bytes) > 419430400
+ etcd_db_total_size_in_bytes > 419430400
and
(
- (avg(etcd_mvcc_db_total_size_in_bytes)
- - avg(etcd_mvcc_db_total_size_in_use_in_bytes))
+ (etcd_mvcc_db_total_size_in_bytes
+ - etcd_mvcc_db_total_size_in_use_in_bytes)
* 100
- / avg(etcd_mvcc_db_total_size_in_bytes)
+ / etcd_mvcc_db_total_size_in_bytes
) > 35
@@
- avg(etcd_db_total_size_in_bytes) > 629145600
+ etcd_db_total_size_in_bytes > 629145600
and
(
- (avg(etcd_mvcc_db_total_size_in_bytes)
- - avg(etcd_mvcc_db_total_size_in_use_in_bytes))
+ (etcd_mvcc_db_total_size_in_bytes
+ - etcd_mvcc_db_total_size_in_use_in_bytes)
* 100
- / avg(etcd_mvcc_db_total_size_in_bytes)
+ / etcd_mvcc_db_total_size_in_bytes
) > 40

Also applies to: lines 71–78
|
||
| labels: | ||
| severity: warning | ||
|
|
||
| - alert: EtcdDefragIsNeeded | ||
| annotations: | ||
| summary: >- | ||
| Etcd database unused space exceeds 40%. | ||
| Defragmentation is strongly recommended. | ||
| description: >- | ||
| The etcd database has more than 40% unused space | ||
| and the total size exceeds 600 MB. Perform | ||
| defragmentation as soon as possible to avoid | ||
| performance degradation. | ||
| expr: >- | ||
| avg(etcd_db_total_size_in_bytes) > 629145600 | ||
| and | ||
| ( | ||
| (avg(etcd_mvcc_db_total_size_in_bytes) | ||
| - avg(etcd_mvcc_db_total_size_in_use_in_bytes)) | ||
| * 100 | ||
| / avg(etcd_mvcc_db_total_size_in_bytes) | ||
| ) > 40 | ||
| labels: | ||
| severity: critical | ||
| ``` | ||
|
|
||
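The arithmetic behind the two rules can be checked by hand with a small sketch. The function below takes a total database size and an in-use size (both in bytes), computes the unused percentage as `(total - in_use) * 100 / total`, and applies the same thresholds as the manifest above (400 MB with 35%, 600 MB with 40%). The name `defrag_advice` is hypothetical, and using a single `total` value for both the size check and the ratio is a simplification of the rule expressions.

```bash
# Hypothetical helper: prints "<unused-percent> <level>" for one etcd member,
# where level is ok, warning, or critical per the thresholds in the manifest.
defrag_advice() {
  awk -v total="$1" -v in_use="$2" 'BEGIN {
    if (total <= 0) { print "0.0 ok"; exit }   # avoid division by zero
    unused = (total - in_use) * 100 / total
    if (total > 629145600 && unused > 40)       level = "critical"
    else if (total > 419430400 && unused > 35)  level = "warning"
    else                                         level = "ok"
    printf "%.1f %s\n", unused, level
  }'
}

# Example usage with values read from the metrics endpoint:
#   defrag_advice "$total_bytes" "$in_use_bytes"
```

For example, a 1,000,000,000-byte database with 500,000,000 bytes in use has 50% unused space and exceeds 600 MB, so it falls in the critical band.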
| ### Verify the Rules Are Loaded | ||
|
|
||
| ```bash | ||
| kubectl get prometheusrule -n monitoring | ||
| kubectl describe prometheusrule etcd-defragmentation-alerts -n monitoring | ||
| ``` | ||
|
|
||
| ### Perform Defragmentation When Alerted | ||
|
|
||
| When the alert fires, run defragmentation on each etcd member: | ||
|
|
||
| ```bash | ||
| kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \ | ||
| --endpoints=https://127.0.0.1:2379 \ | ||
| --cacert=/etc/kubernetes/pki/etcd/ca.crt \ | ||
| --cert=/etc/kubernetes/pki/etcd/server.crt \ | ||
| --key=/etc/kubernetes/pki/etcd/server.key | ||
| ``` | ||
|
|
||
| Process one member at a time to maintain quorum throughout the operation. | ||
|
|
||
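One way to keep the member-by-member order explicit is to generate the commands first and review them before running anything. The sketch below only prints one `kubectl exec` line per node name and executes nothing. `print_defrag_commands` is a hypothetical name; the pod naming and certificate paths are copied from the command above and may need adjusting for a given cluster.

```bash
# Hypothetical helper: prints (does not run) one defrag command per node name,
# in the order given, so each member can be handled separately.
print_defrag_commands() {
  for node in "$@"; do
    printf 'kubectl exec -n kube-system etcd-%s -- etcdctl defrag' "$node"
    printf ' --endpoints=https://127.0.0.1:2379'
    printf ' --cacert=/etc/kubernetes/pki/etcd/ca.crt'
    printf ' --cert=/etc/kubernetes/pki/etcd/server.crt'
    printf ' --key=/etc/kubernetes/pki/etcd/server.key\n'
  done
}

# Example usage (node names are placeholders):
#   print_defrag_commands cp-1 cp-2 cp-3
```

Running the printed lines one at a time, and confirming the member is healthy before moving on, keeps quorum intact throughout.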
| ## Diagnostic Steps | ||
|
|
||
| Check current etcd database size and usage: | ||
|
|
||
| ```bash | ||
| kubectl exec -n kube-system etcd-<node-name> -- wget -qO- http://127.0.0.1:2381/metrics 2>/dev/null \ | ||
| | grep -E "etcd_mvcc_db_total_size_in_bytes|etcd_mvcc_db_total_size_in_use_in_bytes|etcd_db_total_size_in_bytes" | ||
| ``` | ||
|
|
||
| Verify the PrometheusRule is being evaluated: | ||
|
|
||
| ```bash | ||
| kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 & | ||
| curl -s 'http://localhost:9090/api/v1/rules' | python3 -m json.tool | grep -A5 "defrag" | ||
| ``` | ||
Add a language identifier to the fenced log block.
The code fence is unlabeled; this triggers MD040 and reduces readability in rendered docs.
Suggested fix
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 13-13: Fenced code blocks should have a language specified
(MD040, fenced-code-language)