Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic Per Resource Timeouts #19991

Conversation

tommyp1ckles
Copy link
Contributor

@tommyp1ckles tommyp1ckles commented May 27, 2022

To prevent k8s resource types that take a long time to sync from prematurely crashing agent Pods upon init. Adding per resource timeouts that do not timeout sync unless the watcher has exceeded the timeout period after the last informer event of that type to be received.
That is, each watcher records the time of the last event, per API resource type. If the K8s-Synced-Timeout is reached, only timeout if event of that type was not received during the timeout period.

As well, to facilitate diagnosing such issues, adding more labels to EventsTS metrics and refactoring all uses of this metric into a single GaugeVec.

Fixes: #18776

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 27, 2022
@tommyp1ckles tommyp1ckles changed the title Pr/tommyp1ckles/dynamic per resource timeouts Dynamic Per Resource Timeouts May 31, 2022
@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch 10 times, most recently from 1b04d6e to a33f4b8 Compare May 31, 2022 23:26
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch 5 times, most recently from b2e9b54 to 3ecc331 Compare June 1, 2022 23:33
@tommyp1ckles tommyp1ckles marked this pull request as ready for review June 1, 2022 23:34
@tommyp1ckles tommyp1ckles requested a review from a team June 1, 2022 23:34
@tommyp1ckles tommyp1ckles requested review from a team as code owners June 1, 2022 23:34
@tommyp1ckles tommyp1ckles requested review from a team June 1, 2022 23:34
@tommyp1ckles tommyp1ckles requested review from a team as code owners June 1, 2022 23:34
@tommyp1ckles
Copy link
Contributor Author

nm about the GKE tests - missed the memo about those not working. Everything else seems to be ok test wise.

@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch from 706e7fd to 4234e7e Compare June 8, 2022 23:11
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles
Copy link
Contributor Author

tommyp1ckles commented Jun 8, 2022

I recently discovered the below handy git tricks, which might help for addressing review comments

# commit changes linked to original commit, auto complete can be used for easier selection
git commit --fixup <the original commit> 

# rebase with autosquash linked commits for clean commit history.
git rebase master -i --autosquash

Thanks yeah, I've been doing a lot of interactive rebasing. I just ended up mixing changes into commits which made reintegrating them back into the original very trick as they would depend on changes made later in somewhat unrelated commits.
Definitely need to try to keep my commits smaller and more self contained in the future...

@tommyp1ckles
Copy link
Contributor Author

tommyp1ckles commented Jun 8, 2022

@christarazi I'll squash down the last commit once you're done reviewing.

pkg/k8s/synced/resources.go Show resolved Hide resolved
@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch 3 times, most recently from 1e1aa6f to 5d25fcf Compare June 10, 2022 00:10
Copy link
Member

@nathanjsweet nathanjsweet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K8s API changes look good.

@tommyp1ckles
Copy link
Contributor Author

@tklauser can I get another review when you have a moment.

@tklauser
Copy link
Member

tklauser commented Jun 15, 2022

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:

Click to show.

Test Name

K8sPolicyTest Basic Test checks ICMP policies

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.23-kernel-net-next' failed:

Click to show.

Test Name

K8sPolicyTest Basic Test checks ICMP policies

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-net-next so I can create one.

@tklauser tklauser removed the request for review from nebril June 15, 2022 10:07
@tommyp1ckles
Copy link
Contributor Author

tommyp1ckles commented Jun 15, 2022

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' hit: #20217 (91.88% similarity)

While waiting for init of k8s subsystem, timeouts will be calculated
from either the start time, or the time of the last received event.
Resources that may take longer to sync but do make process by receiving
events will be less likely to crash the Pod.

Fixes: cilium#18776

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
To allow tracking k8s events per resource/action type,
add event scope and action labels to events_ts metrics.
Refactored all events_ts metrics into a single gauge vector.
API metrics will be labelled with the url resource path and
request type.
Containerd metrics removed due to not being used.
Also added test to ensure that cache with no controller doesn't wait.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles tommyp1ckles force-pushed the pr/tommyp1ckles/dynamic-per-resource-timeouts branch from 5d25fcf to 559ee6f Compare June 15, 2022 17:40
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles
Copy link
Contributor Author

ICMP test keeps failing, rebasing to see if recent datapath changes fix that.

@tommyp1ckles
Copy link
Contributor Author

1.23-kernel-net-next failing seems unrelated to branch

@tommyp1ckles
Copy link
Contributor Author

/test

@tklauser
Copy link
Member

@tommyp1ckles FYI, you can re-trigger individual tests using the trigger phrases mentioned in brackets next to their name. For example, to re-trigger just k8s-1.23-kernel-net-next you can use /test-1.23-net-next. This will avoid running all other tests and potentially hitting other flakes.

@tommyp1ckles
Copy link
Contributor Author

/test-1.23-net-next

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 16, 2022
@kkourt kkourt merged commit 41d77e1 into cilium:master Jun 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make Cilium more robust when there are many resources to sync from Kubernetes upon startup
8 participants