Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: Fix handling of policy call map on upgrades #13051

Merged
merged 1 commit into from Sep 7, 2020

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented Sep 2, 2020

The name of the policy call map on the bpffs was changed between 1.7 and 1.8 by commit 5d6b669. Commit 6bada02 then added code to delete the old map name on initialization of the agent.

We cannot simply delete the old policy call map because it might still be used by endpoints (until they are regenerated) and the base programs (until they are reloaded). However, if we rename the map in the bpffs, it shouldn't affect BPF programs that are using it and they will then pick up the new name on reload.

The reverse renaming operation is needed in 1.7 to allow for downgrades from 1.8.

Updates: #13015
Fixes: #10626
Fixes: #10845

@pchaigno pchaigno added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. upgrade-impact This PR has potential upgrade or downgrade impact. labels Sep 2, 2020
@pchaigno pchaigno force-pushed the pr/pchaigno/fix-upgrade-policycallmap branch from eac130c to e8b2e70 Compare September 2, 2020 20:39
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.3 Sep 3, 2020
@aanm aanm added this to Needs backport from master in 1.8.4 Sep 4, 2020
@aanm aanm removed this from Needs backport from master in 1.8.3 Sep 4, 2020
The name of the policy call map on the bpffs was changed between 1.7 and
1.8 by commit 5d6b669. Commit 6bada02 then added code to delete the old
map name on initialization of the agent.

We cannot simply delete the old policy call map because it might still
be used by endpoints (until they are regenerated) and the base programs
(until they are reloaded). However, if we rename the map in the bpffs,
it shouldn't affect BPF programs that are using it and they will then
pick up the new name on reload.

The reverse renaming operation is needed in 1.7 to allow for downgrades
from 1.8.

Fixes: 5d6b669 ("maps/policymap: Rename policy call map to clarify intent")
Fixes: 6bada02 ("daemon: Remove old policy call map")
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno marked this pull request as ready for review September 4, 2020 20:24
@pchaigno pchaigno requested a review from a team September 4, 2020 20:24
@pchaigno pchaigno force-pushed the pr/pchaigno/fix-upgrade-policycallmap branch from e8b2e70 to 2b06041 Compare September 4, 2020 20:25
pchaigno added a commit that referenced this pull request Sep 4, 2020
Connectivity disruptions caused by missed tail calls were recently
reported at #13015. It was caused by an incorrect handling of a call map
rename. We didn't detect it because we don't have code to specifically
detect missed tail calls during the upgrade/downgrade test; the test
only fails if the connectivity is broken during a long enough period.

This commit adds a new function to retrieve the sum of 'Missed tail calls'
metrics across all Cilium pods. It is then used in the test after both
the upgrade and the subsequent downgrade to check that no drops due to
missed tail calls happened.

This new test was tested by:
- backporting to v1.8 and checking that missed tail calls are indeed
  detected.
- backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches
  and checking that no more tail calls were detected.

We need to wait for both #13052 and #13051 to be merged and backported
before we can backport this test to v1.7 and v1.8, as it will otherwise
fail.

Related: #13015, #13051, #13052
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno
Copy link
Member Author

pchaigno commented Sep 4, 2020

I manually tested this fix by backporting it to v1.8 with #13097, the extended K8sUpdates test. No new drops due to missed tail calls were detected after the upgrade. With #13097 but without this fix, drops due to missed tail calls were detect (~60).

@pchaigno
Copy link
Member Author

pchaigno commented Sep 7, 2020

test-me-please

pchaigno added a commit that referenced this pull request Sep 7, 2020
Connectivity disruptions caused by missed tail calls were recently
reported at #13015. It was caused by an incorrect handling of a call map
rename. We didn't detect it because we don't have code to specifically
detect missed tail calls during the upgrade/downgrade test; the test
only fails if the connectivity is broken during a long enough period.

This commit adds a new function to retrieve the sum of 'Missed tail calls'
metrics across all Cilium pods. It is then used in the test after both
the upgrade and the subsequent downgrade to check that no drops due to
missed tail calls happened.

This new test was tested by:
- backporting to v1.8 and checking that missed tail calls are indeed
  detected.
- backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches
  and checking that no more tail calls were detected.

We need to wait for both #13052 and #13051 to be merged and backported
before we can backport this test to v1.7 and v1.8, as it will otherwise
fail.

Related: #13015, #13051, #13052
Signed-off-by: Paul Chaignon <paul@cilium.io>
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 7, 2020
@nebril nebril merged commit bb615ea into master Sep 7, 2020
@nebril nebril deleted the pr/pchaigno/fix-upgrade-policycallmap branch September 7, 2020 08:39
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.4 Sep 7, 2020
aanm pushed a commit that referenced this pull request Sep 7, 2020
Connectivity disruptions caused by missed tail calls were recently
reported at #13015. It was caused by an incorrect handling of a call map
rename. We didn't detect it because we don't have code to specifically
detect missed tail calls during the upgrade/downgrade test; the test
only fails if the connectivity is broken during a long enough period.

This commit adds a new function to retrieve the sum of 'Missed tail calls'
metrics across all Cilium pods. It is then used in the test after both
the upgrade and the subsequent downgrade to check that no drops due to
missed tail calls happened.

This new test was tested by:
- backporting to v1.8 and checking that missed tail calls are indeed
  detected.
- backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches
  and checking that no more tail calls were detected.

We need to wait for both #13052 and #13051 to be merged and backported
before we can backport this test to v1.7 and v1.8, as it will otherwise
fail.

Related: #13015, #13051, #13052
Signed-off-by: Paul Chaignon <paul@cilium.io>
fristonio pushed a commit that referenced this pull request Sep 9, 2020
[ upstream commit e9b3844 ]

Connectivity disruptions caused by missed tail calls were recently
reported at #13015. It was caused by an incorrect handling of a call map
rename. We didn't detect it because we don't have code to specifically
detect missed tail calls during the upgrade/downgrade test; the test
only fails if the connectivity is broken during a long enough period.

This commit adds a new function to retrieve the sum of 'Missed tail calls'
metrics across all Cilium pods. It is then used in the test after both
the upgrade and the subsequent downgrade to check that no drops due to
missed tail calls happened.

This new test was tested by:
- backporting to v1.8 and checking that missed tail calls are indeed
  detected.
- backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches
  and checking that no more tail calls were detected.

We need to wait for both #13052 and #13051 to be merged and backported
before we can backport this test to v1.7 and v1.8, as it will otherwise
fail.

Related: #13015, #13051, #13052
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.4 Sep 9, 2020
fristonio pushed a commit that referenced this pull request Sep 9, 2020
[ upstream commit e9b3844 ]

Connectivity disruptions caused by missed tail calls were recently
reported at #13015. It was caused by an incorrect handling of a call map
rename. We didn't detect it because we don't have code to specifically
detect missed tail calls during the upgrade/downgrade test; the test
only fails if the connectivity is broken during a long enough period.

This commit adds a new function to retrieve the sum of 'Missed tail calls'
metrics across all Cilium pods. It is then used in the test after both
the upgrade and the subsequent downgrade to check that no drops due to
missed tail calls happened.

This new test was tested by:
- backporting to v1.8 and checking that missed tail calls are indeed
  detected.
- backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches
  and checking that no more tail calls were detected.

We need to wait for both #13052 and #13051 to be merged and backported
before we can backport this test to v1.7 and v1.8, as it will otherwise
fail.

Related: #13015, #13051, #13052
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
aanm pushed a commit that referenced this pull request Sep 9, 2020
[ upstream commit e9b3844 ]

Connectivity disruptions caused by missed tail calls were recently
reported at #13015. It was caused by an incorrect handling of a call map
rename. We didn't detect it because we don't have code to specifically
detect missed tail calls during the upgrade/downgrade test; the test
only fails if the connectivity is broken during a long enough period.

This commit adds a new function to retrieve the sum of 'Missed tail calls'
metrics across all Cilium pods. It is then used in the test after both
the upgrade and the subsequent downgrade to check that no drops due to
missed tail calls happened.

This new test was tested by:
- backporting to v1.8 and checking that missed tail calls are indeed
  detected.
- backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches
  and checking that no more tail calls were detected.

We need to wait for both #13052 and #13051 to be merged and backported
before we can backport this test to v1.7 and v1.8, as it will otherwise
fail.

Related: #13015, #13051, #13052
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
aditighag pushed a commit that referenced this pull request Sep 10, 2020
[ upstream commit e9b3844 ]

Connectivity disruptions caused by missed tail calls were recently
reported at #13015. It was caused by an incorrect handling of a call map
rename. We didn't detect it because we don't have code to specifically
detect missed tail calls during the upgrade/downgrade test; the test
only fails if the connectivity is broken during a long enough period.

This commit adds a new function to retrieve the sum of 'Missed tail calls'
metrics across all Cilium pods. It is then used in the test after both
the upgrade and the subsequent downgrade to check that no drops due to
missed tail calls happened.

This new test was tested by:
- backporting to v1.8 and checking that missed tail calls are indeed
  detected.
- backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches
  and checking that no more tail calls were detected.

We need to wait for both #13052 and #13051 to be merged and backported
before we can backport this test to v1.7 and v1.8, as it will otherwise
fail.

Related: #13015, #13051, #13052
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. upgrade-impact This PR has potential upgrade or downgrade impact.
Projects
No open projects
1.8.4
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

4 participants