
Conversation

nemacysts
Member

We've seen an increase in cases where the Watch (for a reason I don't quite understand yet - some of what I've been reading suggests there's maybe something re: k8s API caches and whatnot expiring) starts erroring out with exceptions like:

```
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 1642912881 (1644329743
```

While in the past this seems to have generally fixed itself with the pre-existing restart code, in the aforementioned cases we've seen this loop get stuck in a restart loop that we either never recover from until a human intervenes OR that takes long enough to recover that we start losing events.

This diff ensures that when we restart the watch loop, we restart as if we were starting a fresh Watch. This does mean that we'll re-process existing events, since the current behavior of a `Watch().stream()` includes returning the current state, but that's fine: while GH issues mention that this isn't necessarily ordered, our state machine/transition logic should ensure that out-of-order events don't actually hurt us - at worst, we'll simply waste time.

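The restart-fresh behavior described above can be sketched generically: when the server reports the saved resourceVersion as expired (HTTP 410), discard it and let the next watch start from the server's current state rather than retrying with the stale version. A minimal pure-Python sketch, where `stream` is a hypothetical callable standing in for `kubernetes.watch.Watch().stream()` and `ExpiredError` stands in for an `ApiException` with status 410 (both names are assumptions for illustration, not the actual task_processing code):

```python
class ExpiredError(Exception):
    """Stands in for kubernetes.client.exceptions.ApiException with status 410."""


def watch_loop(stream, handle_event):
    """Run a watch, restarting from scratch whenever the resourceVersion expires.

    `stream(resource_version)` is a hypothetical callable that yields
    (event, resource_version) pairs and raises ExpiredError on HTTP 410.
    """
    resource_version = None  # None => fresh watch; the server replays current state
    while True:
        try:
            for event, resource_version in stream(resource_version):
                handle_event(event)
            return  # stream ended normally
        except ExpiredError:
            # The key fix: drop the stale resourceVersion instead of retrying
            # with it, so the next iteration behaves like a brand-new Watch.
            resource_version = None
```

Re-processing the replayed current-state events is the cost of this approach; as noted above, idempotent state-transition logic makes that merely wasted time rather than a correctness problem.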
@nemacysts nemacysts marked this pull request as ready for review March 25, 2025 20:20
@nemacysts
Member Author

I tested this on infrastage, but this HTTP 410 issue doesn't seem to really happen in that cluster (and it's not frequent enough in any dev clusters for me to be able to temporarily test this there)

@nemacysts nemacysts merged commit f0f659a into master Mar 25, 2025
2 checks passed
nemacysts added a commit to Yelp/Tron that referenced this pull request Mar 25, 2025
This brings in Yelp/task_processing#223, which
should ensure that we don't get stuck in a funky restart loop if k8s
tells us our watch's resourceVersion is too old
nemacysts added a commit to Yelp/Tron that referenced this pull request Mar 26, 2025
This brings in Yelp/task_processing#223, which
should ensure that we don't get stuck in a funky restart loop if k8s
tells us our watch's resourceVersion is too old