Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Detect missed tail calls on upgrade/downgrade test #13097

Merged
merged 1 commit into from Sep 7, 2020

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented Sep 4, 2020

Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period.

This pull request adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened.

This new test was tested by:

We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.8, as it will otherwise fail.

Related: #13015, #13051, #13052

@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Sep 4, 2020
@pchaigno pchaigno requested a review from a team as a code owner September 4, 2020 20:48
@pchaigno pchaigno force-pushed the pr/pchaigno/test-missed-tail-calls-upgrades branch from 7c8ee3b to a34aa92 Compare September 4, 2020 20:51
Connectivity disruptions caused by missed tail calls were recently
reported at #13015. It was caused by an incorrect handling of a call map
rename. We didn't detect it because we don't have code to specifically
detect missed tail calls during the upgrade/downgrade test; the test
only fails if the connectivity is broken during a long enough period.

This commit adds a new function to retrieve the sum of 'Missed tail calls'
metrics across all Cilium pods. It is then used in the test after both
the upgrade and the subsequent downgrade to check that no drops due to
missed tail calls happened.

This new test was tested by:
- backporting to v1.8 and checking that missed tail calls are indeed
  detected.
- backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches
  and checking that no more tail calls were detected.

We need to wait for both #13052 and #13051 to be merged and backported
before we can backport this test to v1.7 and v1.8, as it will otherwise
fail.

Related: #13015, #13051, #13052
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno force-pushed the pr/pchaigno/test-missed-tail-calls-upgrades branch from a34aa92 to 31a91aa Compare September 7, 2020 06:12
@pchaigno
Copy link
Member Author

pchaigno commented Sep 7, 2020

test-me-please

@pchaigno
Copy link
Member Author

pchaigno commented Sep 7, 2020

Only the net-next build is failing, with known-flake #13062. All other required builds are passing so marking as ready to merge.

@pchaigno pchaigno added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. needs-backport/1.7 labels Sep 7, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.10 Sep 7, 2020
@aanm aanm merged commit e9b3844 into master Sep 7, 2020
@aanm aanm deleted the pr/pchaigno/test-missed-tail-calls-upgrades branch September 7, 2020 17:53
@pchaigno
Copy link
Member Author

pchaigno commented Sep 8, 2020

The v1.7 fix is merged. The v1.8 fix backport is going to be merged with #13100. So we can merge the test in v1.8 during the next round of backports.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.4 Sep 8, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.4 Sep 9, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.4 Sep 9, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.4 Sep 9, 2020
@joestringer joestringer moved this from Needs backport from master to Backport done to v1.7 in 1.7.10 Sep 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
No open projects
1.7.10
Backport done to v1.7
1.8.4
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

5 participants