alauda · jing2uo · Apr 22, 2026
diff --git a/docs/en/solutions/Backend_Performance_Requirements_for_etcd.md b/docs/en/solutions/Backend_Performance_Requirements_for_etcd.md
@@ -0,0 +1,87 @@
+---
+kind:
+   - Troubleshooting
+products:
+   - Alauda Container Platform
+ProductsVersion:
+   - 4.1.0,4.2.x
+---
+## Issue
+
+etcd performance degrades due to insufficient storage or network backend capabilities, producing log messages similar to the following:
+
+```
+etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
+etcdserver: server is likely overloaded
+etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute
+wal: sync duration of xxxx s, expected less than 1s
+```
+
+These warnings indicate the storage subsystem or network cannot keep up with etcd's latency requirements.
+
+## Root Cause
+
+etcd is highly sensitive to storage and network performance. Any bottleneck in the backend infrastructure — slow disk I/O, high network latency, packet drops, or CPU saturation — directly impacts the ability of the etcd cluster to process writes and maintain leader-heartbeat deadlines. A request should normally complete in under 50 ms; durations exceeding 200 ms trigger warnings in the logs.
+
+## Resolution
+
+### Identify the Bottleneck
+
+Three common causes of etcd slowness:
+
+1. **Slow storage** — Disk I/O latency exceeds acceptable thresholds
+2. **CPU overload** — Control-plane nodes are overcommitted
+3. **Database size growth** — The etcd data file has grown beyond optimal size
+
+### Check Storage Performance with fio
+
+Run an I/O benchmark on each control-plane node to validate disk performance:
+
+```bash
+fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
+    --rw=write --iodepth=1 --fdatasync=1 --runtime=30 --time_based
+```
+
+The 99th percentile fdatasync latency must be under **10 ms**.
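
As a rough aid for reading the benchmark result, the sketch below compares a p99 sync latency against the 10 ms limit stated above. The function name `check_fsync_p99` is hypothetical, and the input is assumed to be in nanoseconds (recent fio versions report latencies in nanoseconds in their JSON output; the exact field to read depends on the fio version).

```bash
# Hypothetical helper: classify a p99 fdatasync latency given in nanoseconds.
# The limit is the 10 ms requirement from this document (10 ms = 10,000,000 ns).
check_fsync_p99() {
  local p99_ns="$1"
  local limit_ns=10000000
  if [ "$p99_ns" -lt "$limit_ns" ]; then
    echo "OK: p99 fdatasync = $((p99_ns / 1000)) us (< 10 ms)"
  else
    echo "SLOW: p99 fdatasync = $((p99_ns / 1000)) us (>= 10 ms)"
  fi
}

# Made-up sample values, not measurements.
check_fsync_p99 2343936
check_fsync_p99 15000000
```

The integer comparison keeps the check dependency-free; the value itself still has to be copied from the fio report by hand or extracted with a JSON tool.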
+
+### Monitor Key etcd Metrics
+
+Use Prometheus to track the following metrics:
+
+| Metric | Threshold | Meaning |
+|---|---|---|
+| `etcd_disk_wal_fsync_duration_seconds_bucket` (p99) | < 10 ms | WAL fsync latency |
+| `etcd_disk_backend_commit_duration_seconds_bucket` (p99) | < 25 ms | Backend commit latency |
+| `etcd_network_peer_round_trip_time_seconds_bucket` (p99) | < 50 ms | Peer-to-peer network RTT |
+| `etcd_mvcc_db_total_size_in_bytes` | < 2 GB (default quota) | Database size |
+
+### Network Health
+
+High network latency or packet drops between etcd members destabilize the cluster. Monitor network RTT and investigate any persistent packet loss on the control-plane network interface.
+
+### Database Defragmentation
+
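+If the database size approaches the quota, defragment the members manually. Run the command against one member at a time, because a member blocks reads and writes while it defragments: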
+
+```bash
+kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
+  --cert=/etc/kubernetes/pki/etcd/server.crt \
+  --key=/etc/kubernetes/pki/etcd/server.key
+```
+
+## Diagnostic Steps
+
+Check etcd logs for latency warnings:
+
+```bash
+kubectl logs -n kube-system etcd-<node-name> --tail=100 | grep -E "took too long|heartbeat|overloaded"
+```
+
+Query etcd metrics directly via the Prometheus endpoint:
+
+```bash
+kubectl exec -n kube-system etcd-<node-name> -- wget -qO- http://127.0.0.1:2381/metrics 2>/dev/null \
+  | grep -E "etcd_disk_wal_fsync|etcd_disk_backend_commit|etcd_mvcc_db_total_size"
+```
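
The raw metric is reported in bytes, often in exponent notation, which is awkward to compare against the quota. A minimal sketch of a converter follows; the function name `etcd_db_size_mib` and the sample values are hypothetical, and the input is assumed to be Prometheus text-format output such as the metrics endpoint above returns.

```bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print the etcd database size in MiB (1 MiB = 1048576 bytes).
etcd_db_size_mib() {
  awk '$1 == "etcd_mvcc_db_total_size_in_bytes" { printf "%.1f\n", $2 / 1048576 }'
}

# Sample input with made-up values, shaped like the endpoint's output.
etcd_db_size_mib <<'EOF'
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd_mvcc_db_total_size_in_bytes 1.31072e+08
etcd_mvcc_db_total_size_in_use_in_bytes 6.5536e+07
EOF
```

In practice the output of the `kubectl exec ... wget` command above could be piped into the function instead of the sample text.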
diff --git a/docs/en/solutions/Configure_Kubelet_Log_Level_Verbosity.md b/docs/en/solutions/Configure_Kubelet_Log_Level_Verbosity.md
@@ -0,0 +1,111 @@
+---
+kind:
+   - Troubleshooting
+products:
+   - Alauda Container Platform
+ProductsVersion:
+   - 4.1.0,4.2.x
+---
+## Issue
+
+When troubleshooting node-level problems, increasing the kubelet log verbosity helps identify the root cause. The default log level (`2`) may not provide enough detail for complex issues such as pod startup failures, volume mount errors, or container runtime communication problems.
+
+## Root Cause
+
+The kubelet supports configurable log verbosity levels ranging from `0` (least verbose) to `10` (most verbose). The default level is `2`, which provides basic operational information. Higher levels expose progressively more diagnostic data, but consume additional CPU, disk I/O, and memory on the node.
+
+## Resolution
+
+### Log Level Reference
+
+| Level Range | Purpose |
+|---|---|
+| 0 | Critical errors only |
+| 1–2 | Default operational output |
+| 3–4 | Debug-level information, suitable for most troubleshooting |
+| 5–8 | Trace-level output, verbose internal state dumps |
+| 9–10 | Maximum verbosity, rarely needed |
+
+### Persistent Configuration (Mutable Host OS)
+
+On mutable host OSes (standard Linux distributions with a writable `/etc`), set the kubelet log level persistently by adding or modifying the `--v` flag via a systemd drop-in file:
+
+```bash
+sudo mkdir -p /etc/systemd/system/kubelet.service.d/
+sudo tee /etc/systemd/system/kubelet.service.d/10-log-level.conf <<EOF
+[Service]
+Environment="KUBELET_EXTRA_ARGS=--v=4"
+ExecStart=
+ExecStart=/usr/bin/kubelet \$KUBELET_KUBECONFIG_ARGS \$KUBELET_CONFIG_ARGS \$KUBELET_EXTRA_ARGS
+EOF
+sudo systemctl daemon-reload
+sudo systemctl restart kubelet
+```
+
+### Persistent Configuration (Immutable OS Nodes)
+
+On immutable-OS nodes — MicroOS, or any setup where `/etc` is backed by a read-mostly overlay that is reset on node upgrades or rollbacks — direct file edits under `/etc/systemd/system/kubelet.service.d/` **will not survive the next node update**. You may see the desired verbosity right after the change, then lose it silently when the node image is replaced.
+
+Persist the change through ACP's Immutable Infrastructure mechanism instead:
+
+- Define the drop-in file as part of the node configuration managed by ACP (under `configure/clusters/nodes`). The platform renders and re-applies it every time a node boots, so the override survives OS upgrades and rollbacks.
+- Trigger a rolling apply on the target node pool. ACP will cordon/drain, restart the kubelet with the new verbosity, and resume scheduling.
+- Revert the same way — update the node configuration to remove the override; do not `rm` the file directly on the node, because the mutation will be lost at the next reconcile.
+
+If the cluster spans both mutable and immutable nodes, scope the change to a node group / pool so that only the intended nodes carry the higher verbosity.
+
+### One-Time Change (Single Node)
+
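+For temporary debugging on a single mutable-OS node, override the kubelet arguments directly on that node. `systemctl edit` saves the override as a drop-in file (`override.conf`) that stays in effect until it is deleted, so remove it once debugging is complete: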
+
+```bash
+sudo systemctl edit kubelet
+```
+
+Add the following to raise verbosity to level 4. The setting takes effect only if the unit's `ExecStart` line expands `$KUBELET_EXTRA_ARGS`:
+
+```ini
+[Service]
+Environment="KUBELET_EXTRA_ARGS=--v=4"
+```
+
+Then reload and restart:
+
+```bash
+sudo systemctl daemon-reload
+sudo systemctl restart kubelet
+```
+
+On immutable-OS nodes, prefer the Immutable Infrastructure flow above even for short investigations: running `systemctl edit` on a single node works until that node is re-imaged, at which point the change is gone without warning.
+
+> **Important:** Revert the log level back to the default (`2`) after collecting the necessary logs. Extended operation at high verbosity places significant load on node resources.
+
+## Diagnostic Steps
+
+Verify the current kubelet log level by inspecting the running process:
+
+```bash
+ps aux | grep kubelet | grep -o '\-\-v=[0-9]*'
+```
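
When the check has to be repeated across several nodes, the same extraction can be wrapped in a small function. This is only a sketch: `kubelet_verbosity` is a hypothetical name, and it assumes the verbosity is passed as `--v=N` on the command line, as in the drop-in files above.

```bash
# Hypothetical helper: print the last --v=N value found in a kubelet
# command line, or "unset" when the flag is absent.
kubelet_verbosity() {
  local v
  v=$(printf '%s\n' "$1" | grep -oE -- '--v=[0-9]+' | tail -n 1 | cut -d= -f2 || true)
  echo "${v:-unset}"
}

# Made-up command lines for illustration.
kubelet_verbosity "/usr/bin/kubelet --config=/var/lib/kubelet/config.yaml --v=4"
kubelet_verbosity "/usr/bin/kubelet --config=/var/lib/kubelet/config.yaml"
```

Taking the last match mirrors the usual flag-parsing behaviour where a later occurrence overrides an earlier one; a verbosity set through a configuration file rather than the command line would not be detected.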
+
+Gather kubelet logs from a specific node:
+
+```bash
+kubectl get nodes
+kubectl debug node/<node-name> --image=busybox -- chroot /host journalctl -u kubelet.service --since "1 hour ago"
+```
+
+Alternatively, SSH into the node and use journalctl:
+
+```bash
+ssh <node-address>
+sudo journalctl -b -f -u kubelet.service
+```
+
+To collect logs from all nodes at once:
+
+```bash
+for n in $(kubectl get nodes --no-headers | awk '{print $1}'); do
+  ssh "$n" "sudo journalctl -u kubelet.service --since '1 hour ago'" > "${n}.kubelet.log"
+done
+```
diff --git a/docs/en/solutions/Create_PrometheusRule_Alerts_for_etcd_Defragmentation.md b/docs/en/solutions/Create_PrometheusRule_Alerts_for_etcd_Defragmentation.md
@@ -0,0 +1,118 @@
+---
+kind:
+   - Information
+products:
+   - Alauda Container Platform
+ProductsVersion:
+   - 4.1.0,4.2.x
+---
+## Issue
+
+When etcd auto-defragmentation is disabled, the database accumulates unused space over time. Without proactive monitoring, the etcd data file can grow to the point where cluster performance degrades. A mechanism is needed to alert operators when manual defragmentation becomes necessary.
+
+## Resolution
+
+Create a custom `PrometheusRule` resource that triggers alerts based on the ratio of unused space within the etcd database.
+
+### Prerequisites
+
+Ensure the Prometheus Operator is deployed and the `PrometheusRule` CRD is available in the cluster:
+
+```bash
+kubectl get crd prometheusrules.monitoring.coreos.com
+```
+
+### Create the Alert Rules
+
+Apply the following `PrometheusRule` manifest. Adjust the namespace to match the monitoring stack configuration (commonly `monitoring` or `kube-system`):
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: etcd-defragmentation-alerts
+  namespace: monitoring
+spec:
+  groups:
+    - name: etcd-defragmentation.rules
+      rules:
+        - alert: EtcdDefragIsAdvised
+          annotations:
+            summary: >-
+              Etcd database unused space exceeds 35%.
+              Consider running defragmentation.
+            description: >-
+              The etcd database has more than 35% unused space
+              and the total size exceeds 400 MB. Schedule a
+              defragmentation during a maintenance window.
+          expr: >-
+            avg(etcd_mvcc_db_total_size_in_bytes) > 419430400
+            and
+            (
+              (avg(etcd_mvcc_db_total_size_in_bytes)
+               - avg(etcd_mvcc_db_total_size_in_use_in_bytes))
+              * 100
+              / avg(etcd_mvcc_db_total_size_in_bytes)
+            ) > 35
+          labels:
+            severity: warning
+
+        - alert: EtcdDefragIsNeeded
+          annotations:
+            summary: >-
+              Etcd database unused space exceeds 40%.
+              Defragmentation is strongly recommended.
+            description: >-
+              The etcd database has more than 40% unused space
+              and the total size exceeds 600 MB. Perform
+              defragmentation as soon as possible to avoid
+              performance degradation.
+          expr: >-
+            avg(etcd_mvcc_db_total_size_in_bytes) > 629145600
+            and
+            (
+              (avg(etcd_mvcc_db_total_size_in_bytes)
+               - avg(etcd_mvcc_db_total_size_in_use_in_bytes))
+              * 100
+              / avg(etcd_mvcc_db_total_size_in_bytes)
+            ) > 40
+          labels:
+            severity: critical
+```
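
To see how the two expressions behave for a given pair of values, the arithmetic can be reproduced offline. The sketch below is illustrative only: `defrag_advice` is a hypothetical name, it takes the total and in-use sizes in bytes as arguments, and it reports only the more severe of the two alerts even though both rules can be active at once.

```bash
# Hypothetical helper: apply the size and unused-space thresholds from the
# rules above to a total size and an in-use size, both in bytes.
defrag_advice() {
  awk -v total="$1" -v used="$2" 'BEGIN {
    if (total <= 0) { print "no data"; exit }
    pct = (total - used) * 100 / total
    if (total > 629145600 && pct > 40)      level = "EtcdDefragIsNeeded"
    else if (total > 419430400 && pct > 35) level = "EtcdDefragIsAdvised"
    else                                    level = "none"
    printf "%.1f%% unused -> %s\n", pct, level
  }'
}

# Made-up sizes: 700 MiB total / 400 MiB in use, then 500 MiB / 300 MiB.
defrag_advice 734003200 419430400
defrag_advice 524288000 314572800
```

A small database never triggers either alert regardless of its unused ratio, because both rules also require the total size to exceed a minimum.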
+
+### Verify the Rules Are Loaded
+
+```bash
+kubectl get prometheusrule -n monitoring
+kubectl describe prometheusrule etcd-defragmentation-alerts -n monitoring
+```
+
+### Perform Defragmentation When Alerted
+
+When the alert fires, run defragmentation on each etcd member:
+
+```bash
+kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
+  --cert=/etc/kubernetes/pki/etcd/server.crt \
+  --key=/etc/kubernetes/pki/etcd/server.key
+```
+
+Process one member at a time to maintain quorum throughout the operation.
+
+## Diagnostic Steps
+
+Check current etcd database size and usage:
+
+```bash
+kubectl exec -n kube-system etcd-<node-name> -- wget -qO- http://127.0.0.1:2381/metrics 2>/dev/null \
+  | grep -E "etcd_mvcc_db_total_size_in_bytes|etcd_mvcc_db_total_size_in_use_in_bytes|etcd_db_total_size_in_bytes"
+```
+
+Verify the PrometheusRule is being evaluated:
+
+```bash
+kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
+curl -s 'http://localhost:9090/api/v1/rules' | python3 -m json.tool | grep -A5 "defrag"
+```