ARROW-11595: [C++][NIGHTLY:test-conda-cpp-valgrind] Avoid branching on potentially indeterminate values in GenerateBitsUnrolled #9471

bkietz · 2021-02-11T14:51:20Z

Comparison kernels generate an output bitmap for all array values, including those masked by a null bit. This should be fine since the indeterminate bits are also masked in the output but valgrind still triggers on the branching in GenerateBitsUnrolled.

Fix: replace branching with equivalent multiplication.

…n potentially indeterminate values in GenerateBitsUnrolled

github-actions · 2021-02-11T14:51:42Z

https://issues.apache.org/jira/browse/ARROW-11595

pitrou · 2021-02-11T15:05:01Z

Unless I'm missing something, it seems we should avoid indeterminate bits in the comparison output.

bkietz · 2021-02-11T15:13:49Z

The indeterminate bits result from applying comparisons to values under null bits; if we wanted to avoid doing those comparisons we'd lose some performance since we'd no longer decouple comparison from null handling

pitrou · 2021-02-11T15:19:47Z

How were the values under null bits generated? If they're indeterminate it basically means they might leak private data.

bkietz · 2021-02-11T15:45:03Z

Consider the following test case:

TEST(TestCompareKernel, AdHoc) {
  int32_t definitely_not_init[4];
  Int32Array null_int32s(4, Buffer::Wrap(definitely_not_init, 4), *AllocateEmptyBitmap(4));

  auto null_bools = CallFunction("equal", {
    null_int32s.data(),
    ArrayFromJSON(int32(), "[1,2,3,4]"),
  })->array_as<BooleanArray>();

  ASSERT_EQ(null_bools->null_count(), 4);
}

We don't make guarantees about null_bools->Value(2); its value is allowed to be indeterminate. If we want to take responsibility for never accessing values under null bits that will be a much larger project

pitrou · 2021-02-11T15:46:43Z

You should fix null_int32s to have initialized values. The fact that the compare kernel can return indeterminate bits for indeterminate inputs is not a problem; the problem is the existence of indeterminate input.

bkietz · 2021-02-16T16:16:17Z

In the first place IIUC the above test case is legal within the arrow format: 'Array slots which are null are not required to have a particular value; any "masked" memory can have any value'.

Given an array with indeterminate values underneath a null slot, it is indeed expected that the compare kernel will produce indeterminate bits. However this should not trigger valgrind unless one of those indeterminate bits is branched on, which should never happen since they are also masked by null bits and so may not be accessed.

As for why this test case generated indeterminate values: when casting to pre-allocatable types like uint8 from null, we don't initialize the values buffer. We could change this and ensure that casting from null will always produce well-defined values under null bits, but that seems unnecessary.

pitrou · 2021-02-16T16:27:24Z

Two things need to be distinguished here: 1) the format spec does not mandate any specific value for null-masked value slots 2) that should not allow an implementation to leak private data in null-masked value slots.

when casting to pre-allocatable types like uint8 from null, we don't initialize the values buffer

By "don't initialize the values buffer", I take it that we're allocating an uninitialized values buffer. The problem is that the allocator may (and often will) recycle previously allocated memory. This previously allocated memory could contain anything - for example an authorization token, a S3 password or a private SSH key, if the application engages in such activities. Then the uninitialized buffer can be sent as-is via Arrow IPC, and the previously allocated data is leaked.

bkietz · 2021-02-16T16:45:15Z

Handling this doesn't seem like the responsibility of the arrow library; if a buffer is allocated for storage of sensitive data then doesn't the burden fall to whoever allocated it to ensure that buffer is overwritten before freeing it?

pitrou · 2021-02-16T16:54:21Z

I don't think you'll find a lot of software that takes care to secure-erase a S3 private key after having used it. I'm not sure the AWS SDK for C++ even does it.

We can think of other concerns when using uninitialized buffers. For example, let's say you call MakeArrayOfNulls and it gives you back a values buffer full of (stale) random data. You then send it using Arrow IPC with compression enabled. The values buffer, while irrelevant, will be poorly compressed because it has random data instead of having been, say, zero-initialized.

Another concern yet is that several runs of the same program will produce non-deterministic output. Which is annoying if you try to validate output files using e.g. a checksum (think reproducible builds, but for data).

All in all, I think there are good reasons to initialize null-masked value slots deterministically. The main annoyance is that I can't think of a way to test systematically for it (apart from relying on Valgrind errors, but that will only catch a subset of cases).

cc @emkornfield @wesm for opinions.

bkietz · 2021-02-20T19:27:09Z

@pitrou in order to address the failing valgrind job, could we merge this? At this point I think your recommendations are valid but not necessarily an indictment against this patch. Furthermore, it'd be appropriate to move the discussion of conventions for values under null bits to the mailing list.

pitrou · 2021-02-20T21:16:31Z

@bkietz No problem on the principle, though it would be nice to check that performance isn't significantly reduced.

bkietz · 2021-02-20T22:14:00Z

@ursabot please benchmark

ursabot · 2021-02-20T23:01:09Z

Benchmark runs are scheduled for baseline = 356c300 and contender = 36352c4. Results will be available as each benchmark for each run completes:
[Finished] ursa-dgx1: https://conbench.ursa.dev/compare/runs/26b28204-28b2-4d2d-8dea-b6a09280d5da...6a476f91-24fe-4ccd-af4a-0f6bbf219668/
[Finished] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/3ca8c81c-fdca-4be5-a85c-470aa26ac48a...717941f3-94fb-4c1b-a356-f8d9049559ef/

pitrou · 2021-02-22T13:01:21Z

Thanks @bkietz , there doesn't appear to be any actual regression in those results.

…n potentially indeterminate values in GenerateBitsUnrolled Comparison kernels generate an output bitmap for all array values, including those masked by a null bit. This should be fine since the indeterminate bits are also masked in the output but valgrind still triggers on the branching in GenerateBitsUnrolled. Fix: replace branching with equivalent multiplication. Closes apache#9471 from bkietz/11595-GenerateBitsUnrolled-trig Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

ARROW-11595: [C++][NIGHTLY:test-conda-cpp-valgrind] Avoid branching o…

36352c4

…n potentially indeterminate values in GenerateBitsUnrolled

bkietz requested a review from pitrou February 11, 2021 14:51

github-actions bot added the Component: C++ label Feb 11, 2021

jorgecarleitao force-pushed the master branch from d4608a9 to 356c300 Compare February 14, 2021 12:09

bkietz mentioned this pull request Feb 19, 2021

ARROW-8919: [C++][Compute][Dataset] Add Function::DispatchBest to accomodate implicit casts #9294

Closed

pitrou approved these changes Feb 22, 2021

View reviewed changes

pitrou closed this in 0b838cc Feb 22, 2021

bkietz deleted the 11595-GenerateBitsUnrolled-trig branch February 22, 2021 16:21

asfimport mentioned this pull request Feb 22, 2021

[C++][NIGHTLY:test-conda-cpp-valgrind] GenerateBitsUnrolled triggers valgrind on uninit inputs #27464

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-11595: [C++][NIGHTLY:test-conda-cpp-valgrind] Avoid branching on potentially indeterminate values in GenerateBitsUnrolled #9471

ARROW-11595: [C++][NIGHTLY:test-conda-cpp-valgrind] Avoid branching on potentially indeterminate values in GenerateBitsUnrolled #9471

bkietz commented Feb 11, 2021

github-actions bot commented Feb 11, 2021

pitrou commented Feb 11, 2021

bkietz commented Feb 11, 2021

pitrou commented Feb 11, 2021

bkietz commented Feb 11, 2021

pitrou commented Feb 11, 2021

bkietz commented Feb 16, 2021

pitrou commented Feb 16, 2021

bkietz commented Feb 16, 2021

pitrou commented Feb 16, 2021

bkietz commented Feb 20, 2021

pitrou commented Feb 20, 2021

bkietz commented Feb 20, 2021

ursabot commented Feb 20, 2021 •

edited

Loading

pitrou commented Feb 22, 2021

ARROW-11595: [C++][NIGHTLY:test-conda-cpp-valgrind] Avoid branching on potentially indeterminate values in GenerateBitsUnrolled #9471

ARROW-11595: [C++][NIGHTLY:test-conda-cpp-valgrind] Avoid branching on potentially indeterminate values in GenerateBitsUnrolled #9471

Conversation

bkietz commented Feb 11, 2021

github-actions bot commented Feb 11, 2021

pitrou commented Feb 11, 2021

bkietz commented Feb 11, 2021

pitrou commented Feb 11, 2021

bkietz commented Feb 11, 2021

pitrou commented Feb 11, 2021

bkietz commented Feb 16, 2021

pitrou commented Feb 16, 2021

bkietz commented Feb 16, 2021

pitrou commented Feb 16, 2021

bkietz commented Feb 20, 2021

pitrou commented Feb 20, 2021

bkietz commented Feb 20, 2021

ursabot commented Feb 20, 2021 • edited Loading

pitrou commented Feb 22, 2021

ursabot commented Feb 20, 2021 •

edited

Loading