test: Bump timeout of service plumbing check #23439

Merged

Conversation

pchaigno
Member

When restarting Cilium, we check a number of things to ensure it's ready, including that the kube-dns service is correctly plumbed (in the agent and in the datapath's maps).

This check is executed in a loop with a 5s timeout. All of the kube-dns checks, including that one, are executed in a loop with a 4min timeout.

To check the service plumbing, we shell out twice: once to retrieve the agent state and once to dump the BPF map contents. Each of these commands can take up to a few seconds, especially when running locally, where we typically execute a kubectl exec inside an SSH command.

As a result of those commands taking a few seconds to execute, the inner loop regularly times out at 5s. That means we retry until an attempt completes in under 5s. What could have taken 7s now sometimes takes several tens of seconds because we have to retry. Locally, this can get even worse and we sometimes hit the 4min timeout of the outer loop because the inner loop never succeeds in less than 5s.

To avoid this whole mess, we can simply bump the inner loop's timeout to 10s. As per the above, this should (counterintuitively) reduce the total runtime of the restart checks.
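
To make the arithmetic concrete, here is a rough model of the two nested loops. It is only a sketch and not the actual test helper: the function name `total_runtime`, the per-attempt durations, and the choice to ignore any sleep between retries are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def total_runtime(attempt_durations: Iterable[float],
                  inner_timeout: float,
                  outer_timeout: float = 240.0) -> Optional[float]:
    """Hypothetical model of the retry behavior described above.

    Each entry in attempt_durations is how long one plumbing check would
    take if it were allowed to finish. An attempt that exceeds
    inner_timeout is cut off and costs exactly inner_timeout seconds.
    Returns the total elapsed time on success, or None if the outer
    timeout is reached first (or the attempts run out).
    """
    elapsed = 0.0
    for duration in attempt_durations:
        if elapsed >= outer_timeout:
            return None  # outer loop gave up
        if duration <= inner_timeout:
            return elapsed + duration  # this attempt finished in time
        elapsed += inner_timeout  # attempt was cut off, retry
    return None


# Assumed durations (in seconds) for four consecutive attempts.
durations = [7.0, 6.0, 5.5, 4.0]

print(total_runtime(durations, inner_timeout=5.0))     # 5 + 5 + 5 + 4 = 19.0
print(total_runtime(durations, inner_timeout=10.0))    # first attempt succeeds: 7.0
print(total_runtime([6.0] * 100, inner_timeout=5.0))   # never under 5s: None
```

With these assumed numbers, the 5s timeout turns a 7s check into 19s of retries, while the 10s timeout lets the first attempt finish. The last call mirrors the local case where no attempt ever completes in under 5s and the 4min outer timeout is hit.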

Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. labels Jan 29, 2023
@pchaigno
Member Author

/test-vagrant

@pchaigno pchaigno marked this pull request as ready for review January 30, 2023 18:35
@pchaigno pchaigno requested a review from a team as a code owner January 30, 2023 18:35
@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 30, 2023
@qmonnet qmonnet merged commit a63eb25 into cilium:master Jan 31, 2023
@pchaigno pchaigno deleted the test-bump-timeout-svc-plumbing-check branch January 31, 2023 13:32