Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/bpf: fix bug where second value can be skipped. #26583

Merged
merged 2 commits into from Jul 14, 2023

Conversation

tommyp1ckles
Copy link
Contributor

@tommyp1ckles tommyp1ckles commented Jun 30, 2023

DumpReliablyWithCallback will skip a value callback in some situations.

In cases where the initial key is deleted just after being retrieved, there is no previous key to fallback on. The reliable dump will attempt to use the current nextKey (that was based on the deleted current key).

The local currentKey and nextKey Key types are being passed to NextKey which eventually writes the nextKey output to the nextKeys pointer location (via the bpf syscall).

The currentValue was simply being assinged by the equals operator, which was copying the underlying interface pointer.

Thus in this situation, the next iteration attempt was passing the same pointer twice to NextKey causing the currentKey to be set to the next key a second time - skipping a map element.

Fixes #26491

Validating the fix

Because the test has 256 possible keys the particular start condition for this failure seems to be quite rare in CI, by reducing this to 5 it happens much more often (presumably 20% of the time?) - making this much easier to reproduce.

I tested this change by continuously running the DumpReliablyWithCallback (modified with maxSize=5) test to reproduce, and after the fix to validate (ran it 1000 times post fix without failure).

Fix bug where bpf map entries may not be reliably dumped or garbage collected when the map is actively being updated.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 30, 2023
@tommyp1ckles tommyp1ckles marked this pull request as ready for review June 30, 2023 17:59
@tommyp1ckles tommyp1ckles requested a review from a team as a code owner June 30, 2023 17:59
@tommyp1ckles tommyp1ckles requested review from rgo3 and ti-mo June 30, 2023 17:59
@tommyp1ckles tommyp1ckles added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Jun 30, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 30, 2023
@tommyp1ckles tommyp1ckles added needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch release-blocker/1.14 This issue will prevent the release of the next version of Cilium. labels Jun 30, 2023
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-dump-reliably-skipping-value branch from bd9c7cd to 1b1db9a Compare June 30, 2023 18:32
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-dump-reliably-skipping-value branch 2 times, most recently from 369bebe to e51df1c Compare July 1, 2023 00:06
@joestringer
Copy link
Member

Minor nit: This is marked as a bug but "second value can be skipped" isn't very user-friendly. Perhaps "Fix bug where bpf map entries may not be reliably dumped or garbage collected, particularly affecting connection tracking and load balancing maps" would demonstrate the impact more clearly.

@tommyp1ckles
Copy link
Contributor Author

@joestringer Yeah makes sense - I'll make the user impact clearer.

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-dump-reliably-skipping-value branch from e51df1c to 74f7184 Compare July 6, 2023 06:23
@tommyp1ckles
Copy link
Contributor Author

/test

pkg/bpf/map_linux.go Outdated Show resolved Hide resolved
@tommyp1ckles tommyp1ckles marked this pull request as draft July 7, 2023 21:54
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-dump-reliably-skipping-value branch from fae0fb2 to 632bc09 Compare July 9, 2023 03:41
DumpReliablyWithCallback will skip a value callback in some situations.
This may result in incorrect cilium map dumps or garbage collection.

In situations where the initial key is deleted just after being retrieved, there is no previous key to fallback on.
The reliable dump will attempt to use the current nextKey (that was based on the deleted current
key).

The local currentKey and nextKey Key types are being passed to NextKey
which eventually writes the nextKey output to the nextKeys pointer
location (via the bpf syscall).

The currentValue was simply being assinged by the equals operator, which
was copying the underlying interface pointer.

Thus in this situation, the next iteration attempt was passing the same
pointer twice to NextKey causing the currentKey to be set to the next
key a second time - skipping a map element.

Fixes cilium#26491

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-dump-reliably-skipping-value branch from 632bc09 to f6f862d Compare July 9, 2023 03:41
@tommyp1ckles tommyp1ckles marked this pull request as ready for review July 9, 2023 03:43
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles requested a review from rgo3 July 9, 2023 03:43
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-dump-reliably-skipping-value branch 2 times, most recently from 09fffa1 to b38237d Compare July 9, 2023 15:36
@tommyp1ckles
Copy link
Contributor Author

/test

Copy link
Contributor

@rgo3 rgo3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of nits/comments, otherwise LGTM.

pkg/bpf/map_linux_test.go Outdated Show resolved Hide resolved
pkg/bpf/map_linux_test.go Outdated Show resolved Hide resolved
pkg/bpf/map_linux_test.go Outdated Show resolved Hide resolved
pkg/bpf/map_linux_test.go Show resolved Hide resolved
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-dump-reliably-skipping-value branch from b38237d to 2eb6b98 Compare July 12, 2023 14:50
@tommyp1ckles
Copy link
Contributor Author

Thanks for the review @rgo3 😄, ptal

@tommyp1ckles tommyp1ckles requested a review from rgo3 July 12, 2023 14:51
This test is an improvement on the previous TestDumpReliablyWithCallback test.
The goal of this one is to provide more robust testing of the reliable dump mechanism.
Specifically, it does the following:

 1. Creates a map with a small number of entries, populate it with [1, maxEntries)
 2. Start a goroutine that continuously dumps the map and checks that the dump contains
    all odd elements in the range [1, maxEntries).
 3. At the same time, start another goroutine that continuously deletes and re-adds even
    elements in the map.

The motivation here is to provide a test that better catches regressions in code that is inherently prone to race-condition.

This creates a situation where we have interleaved updates and dumps, we're interested
in ensuring that each dump contains all odd elements in the range [1, maxEntries).
This will catch bugs and regressions related where elements are skipped in the dump.

For example, while running this without the fix: 74f71841e9c037ddd10bedc3128f3b28cb023597 this will fail a majority of the time.
Following this fix, this test should always pass.

This was tested locally by running this several times with a million iterations - both with the fix and without.
For practical purposes we will lower the number of iterations to 1000 to avoid slowing down the test suite too much.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-dump-reliably-skipping-value branch from 2eb6b98 to 84ce10e Compare July 12, 2023 14:58
@tommyp1ckles
Copy link
Contributor Author

/tesrt

@tommyp1ckles
Copy link
Contributor Author

/test

@aanm aanm merged commit 6acedee into cilium:main Jul 14, 2023
65 checks passed
@aanm aanm mentioned this pull request Jul 14, 2023
6 tasks
@aanm aanm added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 14, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport pending to v1.14 in 1.14.0 Jul 14, 2023
@gandro gandro added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jul 17, 2023
@aanm aanm moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.0 Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. kind/bug This is a bug in the Cilium logic. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.14.0
Backport done to v1.14
5 participants