New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
daemon: Fix handling of policy call map on upgrades #13051
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
pchaigno
added
sig/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
upgrade-impact
This PR has potential upgrade or downgrade impact.
labels
Sep 2, 2020
pchaigno
force-pushed
the
pr/pchaigno/fix-upgrade-policycallmap
branch
from
September 2, 2020 20:39
eac130c
to
e8b2e70
Compare
The name of the policy call map on the bpffs was changed between 1.7 and 1.8 by commit 5d6b669. Commit 6bada02 then added code to delete the old map name on initialization of the agent. We cannot simply delete the old policy call map because it might still be used by endpoints (until they are regenerated) and the base programs (until they are reloaded). However, if we rename the map in the bpffs, it shouldn't affect BPF programs that are using it and they will then pick up the new name on reload. The reverse renaming operation is needed in 1.7 to allow for downgrades from 1.8. Fixes: 5d6b669 ("maps/policymap: Rename policy call map to clarify intent") Fixes: 6bada02 ("daemon: Remove old policy call map") Signed-off-by: Paul Chaignon <paul@cilium.io>
pchaigno
force-pushed
the
pr/pchaigno/fix-upgrade-policycallmap
branch
from
September 4, 2020 20:25
e8b2e70
to
2b06041
Compare
pchaigno
added a commit
that referenced
this pull request
Sep 4, 2020
Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io>
christarazi
approved these changes
Sep 4, 2020
borkmann
approved these changes
Sep 4, 2020
test-me-please |
pchaigno
added a commit
that referenced
this pull request
Sep 7, 2020
Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io>
maintainer-s-little-helper
bot
added
the
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
label
Sep 7, 2020
maintainer-s-little-helper
bot
moved this from Needs backport from master
to Backport pending to v1.8
in 1.8.4
Sep 7, 2020
aanm
pushed a commit
that referenced
this pull request
Sep 7, 2020
Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io>
fristonio
pushed a commit
that referenced
this pull request
Sep 9, 2020
[ upstream commit e9b3844 ] Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
maintainer-s-little-helper
bot
moved this from Backport pending to v1.8
to Backport done to v1.8
in 1.8.4
Sep 9, 2020
fristonio
pushed a commit
that referenced
this pull request
Sep 9, 2020
[ upstream commit e9b3844 ] Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
aanm
pushed a commit
that referenced
this pull request
Sep 9, 2020
[ upstream commit e9b3844 ] Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
aditighag
pushed a commit
that referenced
this pull request
Sep 10, 2020
[ upstream commit e9b3844 ] Connectivity disruptions caused by missed tail calls were recently reported at #13015. It was caused by an incorrect handling of a call map rename. We didn't detect it because we don't have code to specifically detect missed tail calls during the upgrade/downgrade test; the test only fails if the connectivity is broken during a long enough period. This commit adds a new function to retrieve the sum of 'Missed tail calls' metrics across all Cilium pods. It is then used in the test after both the upgrade and the subsequent downgrade to check that no drops due to missed tail calls happened. This new test was tested by: - backporting to v1.8 and checking that missed tail calls are indeed detected. - backporting the fixes on the v1.7 (#13052) and v1.8 (#13051) branches and checking that no more tail calls were detected. We need to wait for both #13052 and #13051 to be merged and backported before we can backport this test to v1.7 and v1.8, as it will otherwise fail. Related: #13015, #13051, #13052 Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
sig/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
upgrade-impact
This PR has potential upgrade or downgrade impact.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The name of the policy call map on the bpffs was changed between 1.7 and 1.8 by commit 5d6b669. Commit 6bada02 then added code to delete the old map name on initialization of the agent.
We cannot simply delete the old policy call map because it might still be used by endpoints (until they are regenerated) and the base programs (until they are reloaded). However, if we rename the map in the bpffs, it shouldn't affect BPF programs that are using it and they will then pick up the new name on reload.
The reverse renaming operation is needed in 1.7 to allow for downgrades from 1.8.
Updates: #13015
Fixes: #10626
Fixes: #10845