
Conversation

nemacysts
Member

We've seen an increase in cases where the Watch (for a reason I don't quite understand yet - some of what I've been reading suggests there's maybe something re: k8s API caches and whatnot expiring) starts erroring out with exceptions like:

```
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 1642912881 (1644329743
```

While in the past this seems to have generally fixed itself with the pre-existing restart code, in the aforementioned cases we've seen this loop get stuck in a restart loop that we either never recover from until a human intervenes OR that takes long enough to recover that we start losing events.

This diff ensures that when we restart the watch loop, we restart as if we were starting a fresh Watch. This does mean that we'll re-process existing events, since the current behavior of a `Watch().stream()` includes returning the current state, but that's fine: while GH issues mention that this isn't necessarily ordered, our state machine/transition logic should ensure that out-of-order events don't actually hurt us - at worst, we'll simply waste time.

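The restart-fresh behavior described above can be sketched generically: when the server reports the saved resourceVersion as expired (HTTP 410), discard it and let the next watch start from the server's current state rather than retrying with the stale version. A minimal pure-Python sketch, where `stream` is a hypothetical callable standing in for `kubernetes.watch.Watch().stream()` and `ExpiredError` stands in for an `ApiException` with status 410 (both names are assumptions for illustration, not the actual task_processing code):

```python
class ExpiredError(Exception):
    """Stands in for kubernetes.client.exceptions.ApiException with status 410."""


def watch_loop(stream, handle_event):
    """Run a watch, restarting from scratch whenever the resourceVersion expires.

    `stream(resource_version)` is a hypothetical callable that yields
    (event, resource_version) pairs and raises ExpiredError on HTTP 410.
    """
    resource_version = None  # None => fresh watch; the server replays current state
    while True:
        try:
            for event, resource_version in stream(resource_version):
                handle_event(event)
            return  # stream ended normally
        except ExpiredError:
            # The key fix: drop the stale resourceVersion instead of retrying
            # with it, so the next iteration behaves like a brand-new Watch.
            resource_version = None
```

Re-processing the replayed current-state events is the cost of this approach; as noted above, idempotent state-transition logic makes that merely wasted time rather than a correctness problem.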
@nemacysts nemacysts marked this pull request as ready for review March 25, 2025 20:20
@nemacysts
Member Author

I tested this on infrastage, but this HTTP 410 issue doesn't seem to really happen in that cluster (and it's not frequent enough in any dev clusters for me to be able to temporarily test this there)

@nemacysts nemacysts merged commit f0f659a into master Mar 25, 2025
2 checks passed
nemacysts added a commit to Yelp/Tron that referenced this pull request Mar 25, 2025
This brings in Yelp/task_processing#223, which
should ensure that we don't get stuck in a funky restart loop if k8s
tells us our watch's resourceVersion is too old
nemacysts added a commit to Yelp/Tron that referenced this pull request Mar 26, 2025
This brings in Yelp/task_processing#223, which
should ensure that we don't get stuck in a funky restart loop if k8s
tells us our watch's resourceVersion is too old