ARROW-11595: [C++][NIGHTLY:test-conda-cpp-valgrind] Avoid branching on potentially indeterminate values in GenerateBitsUnrolled #9471
Unless I'm missing something, it seems we should avoid indeterminate bits in the comparison output.
The indeterminate bits result from applying comparisons to values under null bits; if we wanted to avoid doing those comparisons we'd lose some performance, since we'd no longer decouple comparison from null handling.
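To make the decoupling concrete, here is a minimal sketch, assuming plain `std::vector` storage rather than Arrow buffers. `BoolResult` and `EqualDecoupled` are hypothetical names used for illustration only and are not part of the Arrow API. The comparison runs over every slot without consulting validity, and nulls are handled separately by AND-ing the two validity bitmaps.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: LSB-first value bitmap plus validity bitmap.
struct BoolResult {
  std::vector<uint8_t> values;
  std::vector<uint8_t> validity;
};

// Hypothetical helper, not Arrow code. Assumes both inputs have the same
// length and that their validity bitmaps have the same number of bytes.
BoolResult EqualDecoupled(const std::vector<int32_t>& left,
                          const std::vector<uint8_t>& left_validity,
                          const std::vector<int32_t>& right,
                          const std::vector<uint8_t>& right_validity) {
  const std::size_t n = left.size();
  BoolResult out;
  out.values.assign((n + 7) / 8, 0);
  out.validity.resize(left_validity.size());

  // Step 1: compare every slot, including slots that are null in an input.
  for (std::size_t i = 0; i < n; ++i) {
    out.values[i / 8] |= static_cast<uint8_t>((left[i] == right[i]) << (i % 8));
  }
  // Step 2: a result slot is valid only if both input slots are valid.
  for (std::size_t b = 0; b < out.validity.size(); ++b) {
    out.validity[b] = static_cast<uint8_t>(left_validity[b] & right_validity[b]);
  }
  return out;
}
```

Whatever bit step 1 writes for a null slot is masked by the validity bit from step 2, which is why readers of the output should never need to inspect it.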
How were the values under null bits generated? If they're indeterminate it basically means they might leak private data.
Consider the following test case:

    TEST(TestCompareKernel, AdHoc) {
      int32_t definitely_not_init[4];
      Int32Array null_int32s(4, Buffer::Wrap(definitely_not_init, 4), *AllocateEmptyBitmap(4));
      auto null_bools = CallFunction("equal", {
        null_int32s.data(),
        ArrayFromJSON(int32(), "[1,2,3,4]"),
      })->array_as<BooleanArray>();
      ASSERT_EQ(null_bools->null_count(), 4);
    }

We don't make guarantees about
You should fix |
d4608a9 to 356c300
In the first place, IIUC the above test case is legal within the Arrow format: 'Array slots which are null are not required to have a particular value; any "masked" memory can have any value'. Given an array with indeterminate values underneath a null slot, the compare kernel is indeed expected to produce indeterminate bits. However, this should not trigger valgrind unless one of those indeterminate bits is branched on, which should never happen since they are also masked by null bits and so may not be accessed.
As for why this test case generated indeterminate values: when casting to pre-allocatable types like
Two things need to be distinguished here: 1) the format spec does not mandate any specific value for null-masked value slots 2) that should not allow an implementation to leak private data in null-masked value slots.
By "don't initialize the values buffer", I take it that we're allocating an uninitialized values buffer. The problem is that the allocator may (and often will) recycle previously allocated memory. This previously allocated memory could contain anything: for example an authorization token, an S3 password or a private SSH key, if the application engages in such activities. Then the uninitialized buffer can be sent as-is via Arrow IPC, and the previously allocated data is leaked.
Handling this doesn't seem like the responsibility of the arrow library; if a buffer is allocated for storage of sensitive data then doesn't the burden fall to whoever allocated it to ensure that buffer is overwritten before freeing it?
I don't think you'll find a lot of software that takes care to secure-erase an S3 private key after having used it. I'm not sure the AWS SDK for C++ even does it.
We can think of other concerns when using uninitialized buffers. For example, let's say you call
Another concern yet is that several runs of the same program will produce non-deterministic output, which is annoying if you try to validate output files using e.g. a checksum (think reproducible builds, but for data).
All in all, I think there are good reasons to initialize null-masked value slots deterministically. The main annoyance is that I can't think of a way to test systematically for it (apart from relying on Valgrind errors, but that will only catch a subset of cases). cc @emkornfield @wesm for opinions.
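As an illustration of what initializing null-masked value slots deterministically could look like, here is a minimal sketch, assuming `std::vector` storage and an LSB-first validity bitmap. `BitIsSet` and `ZeroNullSlots` are hypothetical helpers written for this example, not Arrow functions. Note that the only branch is on the validity bit, which is always initialized, so the values under null bits are overwritten without ever being read.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: reads bit i of an LSB-first bitmap.
inline bool BitIsSet(const std::vector<uint8_t>& bitmap, std::size_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical helper, not Arrow code: overwrites every value whose validity
// bit is 0 so that the buffer contents no longer depend on recycled memory.
void ZeroNullSlots(std::vector<int32_t>& values,
                   const std::vector<uint8_t>& validity) {
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (!BitIsSet(validity, i)) {
      values[i] = 0;  // write only; the previous contents are never inspected
    }
  }
}
```

With slots zeroed this way, two runs that produce the same logical array would also produce the same bytes, which is what a checksum-based validation would need.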
@pitrou in order to address the failing valgrind job, could we merge this? At this point I think your recommendations are valid but not necessarily an indictment against this patch. Furthermore, it'd be appropriate to move the discussion of conventions for values under null bits to the mailing list.
@bkietz No problem on the principle, though it would be nice to check that performance isn't significantly reduced.
@ursabot please benchmark
Benchmark runs are scheduled for baseline = 356c300 and contender = 36352c4. Results will be available as each benchmark for each run completes:
Thanks @bkietz, there doesn't appear to be any actual regression in those results.
…n potentially indeterminate values in GenerateBitsUnrolled Comparison kernels generate an output bitmap for all array values, including those masked by a null bit. This should be fine since the indeterminate bits are also masked in the output but valgrind still triggers on the branching in GenerateBitsUnrolled. Fix: replace branching with equivalent multiplication. Closes apache#9471 from bkietz/11595-GenerateBitsUnrolled-trig Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Comparison kernels generate an output bitmap for all array values, including those masked by a null bit. This should be fine since the indeterminate bits are also masked in the output but valgrind still triggers on the branching in GenerateBitsUnrolled.
Fix: replace branching with equivalent multiplication.
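A minimal sketch of the idea, not the actual `GenerateBitsUnrolled` code: `PackBitsBranchless` is a hypothetical helper, and `g` stands for any callable returning `bool`. Since `bool` converts to 0 or 1, multiplying it by the bit mask yields either the mask or zero, so the generated bit is combined into the output byte arithmetically instead of being tested in a conditional.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: packs `length` generated bits into an LSB-first bitmap.
template <typename Generator>
std::vector<uint8_t> PackBitsBranchless(std::size_t length, Generator&& g) {
  std::vector<uint8_t> bitmap((length + 7) / 8, 0);
  for (std::size_t i = 0; i < length; ++i) {
    const uint8_t bit_mask = static_cast<uint8_t>(1u << (i % 8));
    // Branching form, for comparison:
    //   if (g()) bitmap[i / 8] |= bit_mask;
    // Branchless form: g() is 0 or 1, so the product is 0 or bit_mask.
    bitmap[i / 8] |= static_cast<uint8_t>(g() * bit_mask);
  }
  return bitmap;
}
```

Whether a compiler emits a conditional jump for either form depends on the target and optimization level, so this should be read as an illustration of the source-level change rather than a guarantee about the generated code.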