From 55864a7202e331700eb20efcb6ef38e4f5fd35e8 Mon Sep 17 00:00:00 2001
From: Pavel Tishkov
Date: Thu, 22 May 2025 14:42:06 +0300
Subject: [PATCH 1/2] fix(module): fix vm, vmop alerts

Signed-off-by: Pavel Tishkov
---
 monitoring/prometheus-rules/vm.state.yaml   | 39 +++++++++++++++++++++
 monitoring/prometheus-rules/vm.yaml         | 20 -----------
 monitoring/prometheus-rules/vmop.state.yaml | 38 ++++++++++++++++++++
 monitoring/prometheus-rules/vmop.yaml       | 17 ---------
 4 files changed, 77 insertions(+), 37 deletions(-)
 create mode 100644 monitoring/prometheus-rules/vm.state.yaml
 delete mode 100644 monitoring/prometheus-rules/vm.yaml
 create mode 100644 monitoring/prometheus-rules/vmop.state.yaml
 delete mode 100644 monitoring/prometheus-rules/vmop.yaml

diff --git a/monitoring/prometheus-rules/vm.state.yaml b/monitoring/prometheus-rules/vm.state.yaml
new file mode 100644
index 0000000000..09cda00db4
--- /dev/null
+++ b/monitoring/prometheus-rules/vm.state.yaml
@@ -0,0 +1,39 @@
+- name: virtualization.vm.state
+  rules:
+    - alert: D8VirtualizationVirtualMachineFirmwareOutOfDate
+      expr: d8_virtualization_virtualmachine_firmware_up_to_date == 0
+      labels:
+        severity_level: "8"
+        tier: application
+      for: 60m
+      annotations:
+        plk_protocol_version: "1"
+        plk_markup_format: "markdown"
+        plk_create_group_if_not_exists__d8_virtualization_virtualmachine_firmware_out_of_date: "D8VirtualizationVirtualMachineFirmwareOutOfDate,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
+        plk_grouped_by__d8_virtualization_virtualmachine_firmware_out_of_date: "D8VirtualizationVirtualMachineFirmwareOutOfDate,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
+        summary: VirtualMachine have out of date firmware.
+        description: |
+          The virtual machine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has outdated firmware.
+          Outdated firmware may expose the VM to security vulnerabilities or compatibility issues after virtualization updates.
+          ### Why This Happens
+          Firmware (QEMU/KVM) used by a VM is tied to the version provided by the node where the VM is running. After updating the virtualization module (via Deckhouse), new firmware becomes available, but already running VMs continue using the old version until restarted or migrated.
+          ### Diagnosis
+          Inspect the VM status to confirm the firmware issue:
+          ```bash
+          d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status}"
+          ```
+          ### Recommended Actions
+          To apply the latest firmware:
+          1. **Schedule maintenance** and inform relevant teams/users.
+          2. Choose one of the following options depending on your setup:
+          #### Option A: Evict the VM to another node (live migration):
+          ```bash
+          d8 v -n {{ $labels.namespace }} evict vm {{ $labels.name }}
+          ```
+          > Requires live migration support.
+          #### Option B: Reboot the VM:
+          ```bash
+          d8 v -n {{ $labels.namespace }} restart vm {{ $labels.name }}
+          ```
+          > Simpler, but causes downtime unless guest OS supports ACPI shutdown/restart.
+          3. After migration or reboot, the VM will use the updated firmware automatically.
diff --git a/monitoring/prometheus-rules/vm.yaml b/monitoring/prometheus-rules/vm.yaml
deleted file mode 100644
index f33fcf2985..0000000000
--- a/monitoring/prometheus-rules/vm.yaml
+++ /dev/null
@@ -1,20 +0,0 @@
-- name: kubernetes.virtualization.vm
-  rules:
-    - alert: D8VirtualizationVMFirmwareOutOfDate
-      expr: count(d8_virtualization_virtualmachine_firmware_up_to_date == 0) > 0
-      labels:
-        severity_level: "6"
-        tier: cluster
-      for: 30m
-      annotations:
-        plk_protocol_version: "1"
-        plk_markup_format: "markdown"
-        plk_create_group_if_not_exists__d8_virtualization_vm_firmware_out_of_date: "D8VirtualizationVMFirmwareOutOfDate,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
-        plk_grouped_by__d8_virtualization_vm_firmware_out_of_date: "D8VirtualizationVMFirmwareOutOfDate,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
-        summary: Have found virtual machines that require firmware upgrades.
-        description: |
-          To find VirtualMachines that have outdated firmware run the following command:
-          ```
-          kubectl get vm -A -o json | jq -r '.items[] | select(any(.status.conditions[]?; .type == "FirmwareUpToDate" and .status == "False")) | "\(.metadata.namespace)/\(.metadata.name)"'
-          ```
-          The VirtualMachine firmware is updated automatically after a new module is installed. To perform the procedure manually, evict the VM to a new node, or reboot it.
diff --git a/monitoring/prometheus-rules/vmop.state.yaml b/monitoring/prometheus-rules/vmop.state.yaml
new file mode 100644
index 0000000000..3d12152136
--- /dev/null
+++ b/monitoring/prometheus-rules/vmop.state.yaml
@@ -0,0 +1,38 @@
+- name: virtualization.vmop.state
+  rules:
+    - alert: D8VirtualizationVirtualMachineOperationStuckInProgressPhase
+      expr: d8_virtualization_virtualmachineoperation_status_phase{phase="InProgress"} == 1
+      labels:
+        severity_level: "9"
+        tier: application
+      for: 60m
+      annotations:
+        plk_protocol_version: "1"
+        plk_markup_format: "markdown"
+        plk_create_group_if_not_exists__d8_virtualization_vmop_stuck_in_progress_phase: "D8VirtualizationVirtualMachineOperationStuckInProgressPhase,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
+        plk_grouped_by__d8_virtualization_vmop_stuck_in_progress_phase: "D8VirtualizationVirtualMachineOperationStuckInProgressPhase,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
+        summary: The VirtualMachineOperation stuck in InProgress phase for a long time.
+        description: |
+          The `VirtualMachineOperation` object `{{ $labels.name }}` in namespace `{{ $labels.namespace }}` has been stuck in the `InProgress` phase for more than 60 minutes.
+          This may indicate that the operation (e.g., restart, evict, stop, start) was not completed successfully and is now stalled.
+          ### Possible Causes
+          - The underlying virtual machine is unreachable or in an inconsistent state.
+          - Node issues (e.g., network problems, node downtime).
+          ### Diagnosis
+          1. Get details of the affected VirtualMachineOperation:
+          ```bash
+          d8 k -n {{ $labels.namespace }} get vmop {{ $labels.name }} -o wide
+          ```
+          2. Check related VM status:
+          ```bash
+          d8 k -n {{ $labels.namespace }} get vm -o jsonpath="{.status}"
+          ```
+          ### Recommended Actions
+          If the operation can be safely retried, delete the `VirtualMachineOperation` object:
+          ```bash
+          d8 k -n {{ $labels.namespace }} delete vmop {{ $labels.name }}
+          ```
+          Then re-initiate the required action (e.g., restart, evict, etc).
+          ```bash
+          d8 v
+          ```
diff --git a/monitoring/prometheus-rules/vmop.yaml b/monitoring/prometheus-rules/vmop.yaml
deleted file mode 100644
index 466dacb392..0000000000
--- a/monitoring/prometheus-rules/vmop.yaml
+++ /dev/null
@@ -1,17 +0,0 @@
-- name: kubernetes.virtualization.vmop
-  rules:
-    - alert: D8VirtualizationVMOPStuckInPorgressState
-      expr: d8_virtualization_virtualmachineoperation_status_phase{phase="InProgress"} == 1
-      labels:
-        severity_level: "9"
-        tier: cluster
-      for: 30m
-      annotations:
-        plk_protocol_version: "1"
-        plk_markup_format: "markdown"
-        plk_create_group_if_not_exists__d8_virtualization_vmop_stuck_in_progress_state: "D8VirtualizationVmopStuckInProgressState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
-        plk_grouped_by__d8_virtualization_vmop_stuck_in_progress_state: "D8VirtualizationVmopStuckInProgressState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
-        summary: The VMOP with phase InProgress for a long time.
-        description: |
-          The recommended course of action:
-          Find VMOPs whose phase is "InProgress" and sort by creation time: `kubectl get vmop -A -o jsonpath="{range .items[?(@.status.phase=='InProgress')].metadata}{.namespace}{'\t'}{.name}{'\t'}{.creationTimestamp}{'\n'}{end}" --sort-by=.metadata.creationTimestamp`

From f1c44221f1f6dfed2b1255e9bbc9fbed55e7c2dd Mon Sep 17 00:00:00 2001
From: Pavel Tishkov
Date: Fri, 23 May 2025 19:17:52 +0300
Subject: [PATCH 2/2] fix(module): fix typo

Signed-off-by: Pavel Tishkov
---
 monitoring/prometheus-rules/vm.state.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/monitoring/prometheus-rules/vm.state.yaml b/monitoring/prometheus-rules/vm.state.yaml
index 09cda00db4..6646c5c937 100644
--- a/monitoring/prometheus-rules/vm.state.yaml
+++ b/monitoring/prometheus-rules/vm.state.yaml
@@ -28,12 +28,12 @@
           2. Choose one of the following options depending on your setup:
           #### Option A: Evict the VM to another node (live migration):
           ```bash
-          d8 v -n {{ $labels.namespace }} evict vm {{ $labels.name }}
+          d8 v -n {{ $labels.namespace }} evict {{ $labels.name }}
           ```
           > Requires live migration support.
           #### Option B: Reboot the VM:
           ```bash
-          d8 v -n {{ $labels.namespace }} restart vm {{ $labels.name }}
+          d8 v -n {{ $labels.namespace }} restart {{ $labels.name }}
           ```
           > Simpler, but causes downtime unless guest OS supports ACPI shutdown/restart.
           3. After migration or reboot, the VM will use the updated firmware automatically.