Merged
39 changes: 39 additions & 0 deletions monitoring/prometheus-rules/vm.state.yaml
@@ -0,0 +1,39 @@
- name: virtualization.vm.state
rules:
- alert: D8VirtualizationVirtualMachineFirmwareOutOfDate
expr: d8_virtualization_virtualmachine_firmware_up_to_date == 0
labels:
severity_level: "8"
tier: application
for: 60m
annotations:
plk_protocol_version: "1"
plk_markup_format: "markdown"
plk_create_group_if_not_exists__d8_virtualization_virtualmachine_firmware_out_of_date: "D8VirtualizationVirtualMachineFirmwareOutOfDate,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
plk_grouped_by__d8_virtualization_virtualmachine_firmware_out_of_date: "D8VirtualizationVirtualMachineFirmwareOutOfDate,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
summary: VirtualMachine has out-of-date firmware.
description: |
The virtual machine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has outdated firmware.
Outdated firmware may expose the VM to security vulnerabilities or compatibility issues after virtualization updates.
### Why This Happens
Firmware (QEMU/KVM) used by a VM is tied to the version provided by the node where the VM is running. After updating the virtualization module (via Deckhouse), new firmware becomes available, but already running VMs continue using the old version until restarted or migrated.
### Diagnosis
Inspect the VM status to confirm the firmware issue:
```bash
d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status}"
```
### Recommended Actions
To apply the latest firmware:
1. **Schedule maintenance** and inform relevant teams/users.
2. Choose one of the following options depending on your setup:
#### Option A: Evict the VM to another node (live migration):
```bash
d8 v -n {{ $labels.namespace }} evict {{ $labels.name }}
```
> Requires live migration support.
#### Option B: Reboot the VM:
```bash
d8 v -n {{ $labels.namespace }} restart {{ $labels.name }}
```
> Simpler, but causes downtime while the VM restarts; a graceful shutdown requires ACPI support in the guest OS.
3. After migration or reboot, the VM will use the updated firmware automatically.
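The alert expression above can be checked offline with `promtool test rules`. The following is a minimal sketch and not part of this PR: the test file name and the sample series (`vm-a`, `vm-b`, `default`) are hypothetical, and the `name` and `namespace` labels are assumed from the `$labels` references in the description. It exercises only the PromQL expression, not the `for: 60m` clause or the annotations.

```yaml
# vm.state_test.yaml (hypothetical file name)
# Run with: promtool test rules vm.state_test.yaml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Hypothetical VM whose firmware is out of date (metric value 0).
      - series: 'd8_virtualization_virtualmachine_firmware_up_to_date{name="vm-a", namespace="default"}'
        values: '0x10'
      # Hypothetical VM whose firmware is current (metric value 1).
      - series: 'd8_virtualization_virtualmachine_firmware_up_to_date{name="vm-b", namespace="default"}'
        values: '1x10'
    promql_expr_test:
      # Same expression as the alert rule; only vm-a should match.
      - expr: d8_virtualization_virtualmachine_firmware_up_to_date == 0
        eval_time: 5m
        exp_samples:
          - labels: 'd8_virtualization_virtualmachine_firmware_up_to_date{name="vm-a", namespace="default"}'
            value: 0
```

Testing the full alert (including `for: 60m`) would additionally need a `rule_files` entry and an `alert_rule_test` section listing the expected labels and annotations.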
20 changes: 0 additions & 20 deletions monitoring/prometheus-rules/vm.yaml

This file was deleted.

38 changes: 38 additions & 0 deletions monitoring/prometheus-rules/vmop.state.yaml
@@ -0,0 +1,38 @@
- name: virtualization.vmop.state
rules:
- alert: D8VirtualizationVirtualMachineOperationStuckInProgressPhase
expr: d8_virtualization_virtualmachineoperation_status_phase{phase="InProgress"} == 1
labels:
severity_level: "9"
tier: application
for: 60m
annotations:
plk_protocol_version: "1"
plk_markup_format: "markdown"
plk_create_group_if_not_exists__d8_virtualization_vmop_stuck_in_progress_phase: "D8VirtualizationVirtualMachineOperationStuckInProgressPhase,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
plk_grouped_by__d8_virtualization_vmop_stuck_in_progress_phase: "D8VirtualizationVirtualMachineOperationStuckInProgressPhase,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
summary: VirtualMachineOperation has been stuck in the InProgress phase for more than 60 minutes.
description: |
The `VirtualMachineOperation` object `{{ $labels.name }}` in namespace `{{ $labels.namespace }}` has been stuck in the `InProgress` phase for more than 60 minutes.
This may indicate that the operation (e.g., restart, evict, stop, start) was not completed successfully and is now stalled.
### Possible Causes
- The underlying virtual machine is unreachable or in an inconsistent state.
- Node issues (e.g., network problems, node downtime).
### Diagnosis
1. Get details of the affected VirtualMachineOperation:
```bash
d8 k -n {{ $labels.namespace }} get vmop {{ $labels.name }} -o wide
```
2. Check related VM status:
```bash
d8 k -n {{ $labels.namespace }} get vm <vm-name> -o jsonpath="{.status}"
```
### Recommended Actions
If the operation can be safely retried, delete the `VirtualMachineOperation` object:
```bash
d8 k -n {{ $labels.namespace }} delete vmop {{ $labels.name }}
```
Then re-initiate the required action (e.g., restart or evict):
```bash
d8 v -n {{ $labels.namespace }} <operation> <vm-name>
```
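The `InProgress` selector in the alert expression above can be sanity-checked with a promtool unit test. This is a sketch under assumptions, not part of this PR: the file name and the objects `op-a` and `op-b` are hypothetical, and it assumes the metric exposes one series per phase with value 1 for the current phase. As before, it covers only the expression, not the `for: 60m` clause.

```yaml
# vmop.state_test.yaml (hypothetical file name)
# Run with: promtool test rules vmop.state_test.yaml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Hypothetical operation currently in the InProgress phase.
      - series: 'd8_virtualization_virtualmachineoperation_status_phase{name="op-a", namespace="default", phase="InProgress"}'
        values: '1x10'
      # Hypothetical operation in another phase; the selector must exclude it.
      - series: 'd8_virtualization_virtualmachineoperation_status_phase{name="op-b", namespace="default", phase="Completed"}'
        values: '1x10'
    promql_expr_test:
      # Same expression as the alert rule; only op-a should match.
      - expr: d8_virtualization_virtualmachineoperation_status_phase{phase="InProgress"} == 1
        eval_time: 5m
        exp_samples:
          - labels: 'd8_virtualization_virtualmachineoperation_status_phase{name="op-a", namespace="default", phase="InProgress"}'
            value: 1
```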
17 changes: 0 additions & 17 deletions monitoring/prometheus-rules/vmop.yaml

This file was deleted.
