Dynamic Per Resource Timeouts #19991

tommyp1ckles · 2022-05-27T16:47:14Z

To prevent k8s resource types that take a long time to sync from prematurely crashing agent Pods upon init. Adding per resource timeouts that do not timeout sync unless the watcher has exceeded the timeout period after the last informer event of that type to be received.
That is, each watcher records the time of the last event, per API resource type. If the K8s-Synced-Timeout is reached, only timeout if event of that type was not received during the timeout period.

As well, to facilitate diagnosing such issues, adding more labels to EventsTS metrics and refactoring all uses of this metric into a single GaugeVec.

Fixes: #18776

tommyp1ckles · 2022-06-01T15:14:58Z

/test

tommyp1ckles · 2022-06-08T20:33:48Z

nm about the GKE tests - missed the memo about those not working. Everything else seems to be ok test wise.

tommyp1ckles · 2022-06-08T23:12:05Z

/test

tommyp1ckles · 2022-06-08T23:14:45Z

I recently discovered the below handy git tricks, which might help for addressing review comments

# commit changes linked to original commit, auto complete can be used for easier selection
git commit --fixup <the original commit> 

# rebase with autosquash linked commits for clean commit history.
git rebase master -i --autosquash

Thanks yeah, I've been doing a lot of interactive rebasing. I just ended up mixing changes into commits which made reintegrating them back into the original very trick as they would depend on changes made later in somewhat unrelated commits.
Definitely need to try to keep my commits smaller and more self contained in the future...

tommyp1ckles · 2022-06-08T23:21:46Z

@christarazi I'll squash down the last commit once you're done reviewing.

pkg/k8s/synced/resources.go

nathanjsweet

K8s API changes look good.

tommyp1ckles · 2022-06-14T16:34:15Z

@tklauser can I get another review when you have a moment.

tklauser · 2022-06-15T10:07:11Z

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:

Click to show.

Test Name

K8sPolicyTest Basic Test checks ICMP policies

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.23-kernel-net-next' failed:

Click to show.

Test Name

K8sPolicyTest Basic Test checks ICMP policies

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-net-next so I can create one.

tommyp1ckles · 2022-06-15T15:26:59Z

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' hit: #20217 (91.88% similarity)

While waiting for init of k8s subsystem, timeouts will be calculated from either the start time, or the time of the last received event. Resources that may take longer to sync but do make process by receiving events will be less likely to crash the Pod. Fixes: cilium#18776 Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>

To allow tracking k8s events per resource/action type, add event scope and action labels to events_ts metrics. Refactored all events_ts metrics into a single gauge vector. API metrics will be labelled with the url resource path and request type. Containerd metrics removed due to not being used. Also added test to ensure that cache with no controller doesn't wait. Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>

tommyp1ckles · 2022-06-15T17:40:29Z

/test

tommyp1ckles · 2022-06-15T17:45:18Z

ICMP test keeps failing, rebasing to see if recent datapath changes fix that.

tommyp1ckles · 2022-06-16T01:25:59Z

1.23-kernel-net-next failing seems unrelated to branch

tommyp1ckles · 2022-06-16T01:26:11Z

/test

tklauser · 2022-06-16T08:45:08Z

@tommyp1ckles FYI, you can re-trigger individual tests using the trigger phrases mentioned in brackets next to their name. For example, to re-trigger just k8s-1.23-kernel-net-next you can use /test-1.23-net-next. This will avoid running all other tests and potentially hitting other flakes.

tommyp1ckles · 2022-06-16T18:47:02Z

/test-1.23-net-next

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 27, 2022

tommyp1ckles changed the title ~~Pr/tommyp1ckles/dynamic per resource timeouts~~ Dynamic Per Resource Timeouts May 31, 2022

tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch 10 times, most recently from 1b04d6e to a33f4b8 Compare May 31, 2022 23:26

tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch 5 times, most recently from b2e9b54 to 3ecc331 Compare June 1, 2022 23:33

tommyp1ckles marked this pull request as ready for review June 1, 2022 23:34

tommyp1ckles requested a review from a team June 1, 2022 23:34

tommyp1ckles requested review from a team as code owners June 1, 2022 23:34

tommyp1ckles requested review from a team June 1, 2022 23:34

tommyp1ckles requested review from a team as code owners June 1, 2022 23:34

tommyp1ckles requested review from ldelossa, sayboras and tklauser June 1, 2022 23:34

tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch from 706e7fd to 4234e7e Compare June 8, 2022 23:11

tommyp1ckles requested a review from christarazi June 8, 2022 23:12

christarazi approved these changes Jun 9, 2022

View reviewed changes

pkg/k8s/synced/resources.go Show resolved Hide resolved

tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch 3 times, most recently from 1e1aa6f to 5d25fcf Compare June 10, 2022 00:10

nathanjsweet approved these changes Jun 13, 2022

View reviewed changes

tklauser approved these changes Jun 15, 2022

View reviewed changes

tklauser removed the request for review from nebril June 15, 2022 10:07

maintainer-s-little-helper bot mentioned this pull request Jun 15, 2022

CI: K8sPolicyTest Basic Test checks ICMP policies #20217

Closed

tommyp1ckles added 2 commits June 15, 2022 17:39

tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch from 5d25fcf to 559ee6f Compare June 15, 2022 17:40

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 16, 2022

kkourt merged commit 41d77e1 into cilium:master Jun 17, 2022

aanm mentioned this pull request Jun 22, 2022

Prepare for release v1.12.0-rc3 #20279

Merged

tommyp1ckles mentioned this pull request Sep 16, 2022

CFP: Improve Cilium Metrics Documentation #21343

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dynamic Per Resource Timeouts #19991

Dynamic Per Resource Timeouts #19991

tommyp1ckles commented May 27, 2022 •

edited

Loading

tommyp1ckles commented Jun 1, 2022

tommyp1ckles commented Jun 8, 2022

tommyp1ckles commented Jun 8, 2022

tommyp1ckles commented Jun 8, 2022 •

edited

Loading

tommyp1ckles commented Jun 8, 2022 •

edited

Loading

nathanjsweet left a comment

tommyp1ckles commented Jun 14, 2022

tklauser commented Jun 15, 2022 •

edited by maintainer-s-little-helper bot

Loading

Test Name

Failure Output

Test Name

Failure Output

tommyp1ckles commented Jun 15, 2022 •

edited by maintainer-s-little-helper bot

Loading

tommyp1ckles commented Jun 15, 2022

tommyp1ckles commented Jun 15, 2022

tommyp1ckles commented Jun 16, 2022

tommyp1ckles commented Jun 16, 2022

tklauser commented Jun 16, 2022

tommyp1ckles commented Jun 16, 2022

Dynamic Per Resource Timeouts #19991

Dynamic Per Resource Timeouts #19991

Conversation

tommyp1ckles commented May 27, 2022 • edited Loading

tommyp1ckles commented Jun 1, 2022

tommyp1ckles commented Jun 8, 2022

tommyp1ckles commented Jun 8, 2022

tommyp1ckles commented Jun 8, 2022 • edited Loading

tommyp1ckles commented Jun 8, 2022 • edited Loading

nathanjsweet left a comment

Choose a reason for hiding this comment

tommyp1ckles commented Jun 14, 2022

tklauser commented Jun 15, 2022 • edited by maintainer-s-little-helper bot Loading

Test Name

Failure Output

Test Name

Failure Output

tommyp1ckles commented Jun 15, 2022 • edited by maintainer-s-little-helper bot Loading

tommyp1ckles commented Jun 15, 2022

tommyp1ckles commented Jun 15, 2022

tommyp1ckles commented Jun 16, 2022

tommyp1ckles commented Jun 16, 2022

tklauser commented Jun 16, 2022

tommyp1ckles commented Jun 16, 2022

tommyp1ckles commented May 27, 2022 •

edited

Loading

tommyp1ckles commented Jun 8, 2022 •

edited

Loading

tommyp1ckles commented Jun 8, 2022 •

edited

Loading

tklauser commented Jun 15, 2022 •

edited by maintainer-s-little-helper bot

Loading

tommyp1ckles commented Jun 15, 2022 •

edited by maintainer-s-little-helper bot

Loading