
EtcdBackupCronJobStatusFailed fires when not needed #40

Open
matthias4217 opened this issue Nov 21, 2022 · 1 comment
matthias4217 commented Nov 21, 2022

We use this Helm chart on our OKD cluster, and it's been working nicely so far. Recently, however, the EtcdBackupCronJobStatusFailed alert has started firing while a backup job is still running.

I don't understand why I've only started receiving these alerts recently. We recently updated the cluster, which is the most likely cause, but the issue might stem from something else.

Cluster version: 4.11.0-0.okd-2022-11-05-030711

Solutions or workarounds

I've tried replacing `kube_job_status_succeeded{namespace="infra-backup-etcd"} == 0` with `kube_job_status_failed{namespace="infra-backup-etcd"} > 0`, but I'm not sure whether that would still raise an alert when a job fails to be scheduled at all.

Another option would be `kube_job_status_succeeded{namespace="infra-backup-etcd"} + kube_job_status_active{namespace="infra-backup-etcd"} != 1`, which only fires when a job has neither succeeded nor is still running. It seems to work fine, though there could still be cases I've missed.
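For reference, the second workaround could be expressed as a Prometheus alerting rule roughly like the following. This is only a sketch: the group name, `for:` duration, labels, and annotation text are assumptions and would need to be adapted to the chart's actual rule file.

```yaml
groups:
  - name: etcd-backup
    rules:
      # Fire only when a backup job has neither succeeded nor is still active,
      # so an in-progress job no longer triggers the alert.
      - alert: EtcdBackupCronJobStatusFailed
        expr: >
          kube_job_status_succeeded{namespace="infra-backup-etcd"}
          + kube_job_status_active{namespace="infra-backup-etcd"} != 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "etcd backup job in {{ $labels.namespace }} has not completed successfully"
```

The `for: 15m` clause (an assumption here) would additionally suppress short-lived false positives while a job transitions between states.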

hairmare (Contributor) commented Jan 5, 2023

I don't think we've been able to reproduce this at any point. It might be 4.11-specific; we don't currently run that version in production because we stick with LTS versions, so it's less tested.
