-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpf: improve DumpReliablyWithCallback #9972
Conversation
Release note label not set, please set the appropriate release note. |
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice tidyup, one main note below on the number of iterations. Did you mean to switch it to N iteration failures rather than N iterations total?
No, I didn't 🙈 Lemme fix that. |
629cf2e
to
676b21e
Compare
test-me-please |
1 similar comment
test-me-please |
Provisioning failure above: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/16703/ Re-running CI. |
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
Privileged unit tests run actually failed in CI: |
That's interesting. I had multiple runs locally and none failed. It's not clear what the error is in the CI though. Timeout? I'll try decreasing the map size. |
676b21e
to
e3c598a
Compare
test-me-please |
@rolinh the privileged unit tests failed again |
Unfortunately, there isn't much useful information from the CI logs (or I didn't find it). Thus, I ran the privileged tests on my dev machine 100+ times without hitting a failure ( Then I tried running the privileged tests on a dev VM (Ubuntu with kernel
Then another attempt, stuck again:
This time, we have I managed to reproduce this issue several times and the only constant I see is that the I'm probably missing something at this point. Could someone else try to reproduce the failure and have a look at it? |
I tried to reproduce this with the latest 4.9 kernel (4.9.212 at the moment) and was unable even after 20+ runs whereas with kernel 4.9.160, the issue manifests every other test run. This test is probably triggering the data race fixed by this kernel commit which was introduced in 4.9.177. In order to make sure, I should check if I can reproduce with kernel 4.9.176 and then with kernel 4.9.177. Do we simply bump the VM kernel to 4.9.212 at this point? |
This should resolve BPF_MAP_GET_NEXT_KEY getting stuck issue mentioned in [1]. [1]: cilium/cilium#9972 (comment) Signed-off-by: Martynas Pumputis <m@lambda.lt>
It's coming cilium/packer-ci-build#184! |
So I did that and the result is:
This confirms the issue. Let's wait for our CI images to use the updated kernel (thanks @brb !) and re-run the CI tests. |
e3c598a
to
13c0c96
Compare
Seems that we're waiting on #10001 ? |
BPF_MAP_GET_NEXT_KEY returns ENOENT when all keys have been iterated. This commit modifies GetNextKeyFromPointers() (and thus GetFirstKey()) to return io.EOF when the BPF_MAP_GET_NEXT_KEY syscall returns ENOENT in order to signify that there are no more elements to iterate over. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
GetFirstKey()/GetNextKeyFromPointers() return io.EOF when there are no more elements to iterate over the map. Any other error is actually an error. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
DumpReliablyWithCallback() attempts to dump the map without acquiring a lock. If the map is being updated at the same time it is being iterated over its elements by the dump function, the iteration loop backtracks in order to catch the change(s). However, an upper bound limit for backtracking is required in order to avoid hanging by iterating "forever" in a worst case scenario. This commit improves this function by returning when io.EOF is returned by GetNextKeyFromPointer or when the max lookup limit has been reached (which is now set to 4 times the map capacity as opposed to only the map's capacity as in the previous implementation). This should make the function more reliable on concurrent map modifications. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
13c0c96
to
e18253a
Compare
Rebased against master to pickup #10001. |
test-me-please |
Commit 8b56e2f064cd02add5f1197318d898f8964007e6 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/contributing/#developer-s-certificate-of-origin |
8b56e2f
to
e18253a
Compare
Sorry, I inadvertently clicked the "Update branch" button which merged master into this branch. Forced pushed to revert. |
DumpReliablyWithCallback()
should now be more resilient although it may still fail under heavy concurrent update situations, in which case it returns an appropriate error.The upper bound limit was set to
4 * stats.MaxEntries
as suggested by @borkmann, which is probably a reasonable default.See also the discussion at cilium/ebpf#11
Fixes: #9811
This change is