pkg/bpf: fix bug where second value can be skipped. #26583
Minor nit: This is marked as a |
@joestringer Yeah makes sense - I'll make the user impact clearer.
DumpReliablyWithCallback will skip a value callback in some situations. This may result in incorrect cilium map dumps or garbage collection.
In situations where the initial key is deleted just after being retrieved, there is no previous key to fall back on. The reliable dump will attempt to use the current nextKey (that was based on the deleted current key).
The local currentKey and nextKey Key types are passed to NextKey, which eventually writes the nextKey output to nextKey's pointer location (via the bpf syscall).
The currentValue was simply being assigned with the equals operator, which copied the underlying interface pointer.
Thus, in this situation, the next iteration passed the same pointer twice to NextKey, causing currentKey to be set to the next key a second time and skipping a map element.
Fixes cilium#26491
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Just a couple of nits/comments, otherwise LGTM.
Thanks for the review @rgo3 😄, ptal
This test is an improvement on the previous TestDumpReliablyWithCallback test. The goal of this one is to provide more robust testing of the reliable dump mechanism. Specifically, it does the following:
1. Creates a map with a small number of entries and populates it with [1, maxEntries).
2. Starts a goroutine that continuously dumps the map and checks that the dump contains all odd elements in the range [1, maxEntries).
3. At the same time, starts another goroutine that continuously deletes and re-adds even elements in the map.
The motivation here is to provide a test that better catches regressions in code that is inherently prone to race conditions. This creates a situation where updates and dumps are interleaved, and we're interested in ensuring that each dump contains all odd elements in the range [1, maxEntries). This will catch bugs and regressions where elements are skipped in the dump.
For example, when run without the fix (74f71841e9c037ddd10bedc3128f3b28cb023597), this test fails a majority of the time. Following this fix, the test should always pass.
This was tested locally by running it several times with a million iterations, both with the fix and without. For practical purposes we will lower the number of iterations to 1000 to avoid slowing down the test suite too much.
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
DumpReliablyWithCallback will skip a value callback in some situations.
In cases where the initial key is deleted just after being retrieved, there is no previous key to fall back on. The reliable dump will attempt to use the current nextKey (that was based on the deleted current key).
The local currentKey and nextKey Key types are passed to NextKey, which eventually writes the nextKey output to nextKey's pointer location (via the bpf syscall).
The currentValue was simply being assigned with the equals operator, which copied the underlying interface pointer.
Thus, in this situation, the next iteration passed the same pointer twice to NextKey, causing currentKey to be set to the next key a second time and skipping a map element.
Fixes #26491
Validating the fix
Because the test has 256 possible keys, the particular start condition for this failure seems to be quite rare in CI. Reducing this to 5 makes it happen much more often (presumably 20% of the time?), which makes the failure much easier to reproduce.
I tested this change by continuously running the DumpReliablyWithCallback test (modified with maxSize=5) to reproduce the failure, and again after the fix to validate it (1000 runs post-fix without failure).