
EtcdBackupCronJobStatusFailed fires when not needed #40

Open
matthias4217 opened this issue Nov 21, 2022 · 1 comment
matthias4217 commented Nov 21, 2022

We use this Helm chart on our OKD cluster, and it's been working nicely so far. Recently, however, the EtcdBackupCronJobStatusFailed alert has started firing while a backup job is still running.

I don't understand why I've only started receiving these alerts recently. We recently updated the cluster, which is the most likely cause, but the issue might stem from something else.

Cluster version: 4.11.0-0.okd-2022-11-05-030711

Solutions or workarounds

I've tried replacing `kube_job_status_succeeded{namespace="infra-backup-etcd"} == 0` with `kube_job_status_failed{namespace="infra-backup-etcd"} > 0`, but I'm not sure whether that would still raise an alert when a job fails to be scheduled at all.

Another option would be `kube_job_status_succeeded{namespace="infra-backup-etcd"} + kube_job_status_active{namespace="infra-backup-etcd"} != 1`, which only fires when a job has neither succeeded nor is still running. It seems to work fine, though there could still be cases I've missed.
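For reference, the second workaround could be expressed as a Prometheus alerting rule roughly like the following. This is only a sketch: the group name, `for:` duration, labels, and annotation text are assumptions and would need to be adapted to the chart's actual rule file.

```yaml
groups:
  - name: etcd-backup
    rules:
      # Fire only when a backup job has neither succeeded nor is still active,
      # so an in-progress job no longer triggers the alert.
      - alert: EtcdBackupCronJobStatusFailed
        expr: >
          kube_job_status_succeeded{namespace="infra-backup-etcd"}
          + kube_job_status_active{namespace="infra-backup-etcd"} != 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "etcd backup job in {{ $labels.namespace }} has not completed successfully"
```

The `for: 15m` clause (an assumption here) would additionally suppress short-lived false positives while a job transitions between states.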

hairmare (Contributor) commented Jan 5, 2023

I don't think we've been able to reproduce this at any point. It might be 4.11-specific; we don't currently run that version in production because we stick with LTS versions, so it's less tested.
