ARROW-13136: [C++] Add coalesce function #10608
Conversation
Force-pushed from 392e979 to 59bdbf7.
Hmm, this has some Valgrind errors - taking a look.
@github-actions crossbow submit conda-cpp-valgrind
@github-actions crossbow submit test-conda-cpp-valgrind
Revision: 08c6375. Submitted crossbow builds: ursacomputing/crossbow @ actions-560
@bkietz do you have time to review this? Do we want to add a benchmark here?
Thanks for doing this!
Yes, we definitely want a benchmark here. A few other comments:
kernel.null_handling = NullHandling::COMPUTED_NO_PREALLOCATE;
kernel.mem_allocation = MemAllocation::PREALLOCATE;
I think we should always be able to preallocate the validity bitmap in addition to the data/offsets buffer, which will enable the preallocate_contiguous_ optimization for fixed width types.
Suggested change:

kernel.null_handling = NullHandling::COMPUTED_PREALLOCATE;
kernel.mem_allocation = MemAllocation::PREALLOCATE;
if (var width type) {
  kernel.can_write_into_slices = false;
}
std::shared_ptr<Array> temp_output;
RETURN_NOT_OK(builder.Finish(&temp_output));
Suggested change:

ARROW_ASSIGN_OR_RAISE(auto temp_output, builder.Finish());
Performance is fairly meh.
Trying an approach based on VisitSetBitRunsVoid may be beneficial, and/or manually hoisting the scalar-vs-array detection into separate CopyScalarValues and CopyArrayValues helpers.
Alright, this buys us ~50% more performance by specializing the common case.
IIUC this would require a varargs version of …
Basically, there was a lot of overhead from the fallback loop of "for offset in range(block size), if bit is set, copy one element", because 1) the 'copy one element' function used CopyBitmap, which has a ton of overhead when copying a single bit, and 2) unboxing the array on every iteration was costly (e.g. the profiler showed that even Buffer::data()'s check for whether the buffer is on-CPU was hot). I've now specialized things to avoid most of that overhead. The reason I wanted something like VisitSetBitRunsVoid was to go a step further and always try to perform block copies instead of falling back to one-element-at-a-time copies. But yes, that then requires combining two bitmaps with AndNot (we want runs of bits where !output_valid & input_valid).
I would kind of prefer to get all these kernels merged and consolidated before I start trying to micro-optimize them, though, given they've been around for a while and all use similar helper code (which is now starting to diverge slightly once I look at optimizing).
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for baseline = 9c6d417 and contender = e32cf48. Results will be available as each benchmark for each run completes.
This LGTM, but could you rewrite a bit for readability?
if ((datum.is_scalar() && datum.scalar()->is_valid) ||
    (datum.is_array() && !datum.array()->MayHaveNulls())) {
  BitBlockCounter counter(out_valid, out_offset, batch.length);
  int64_t offset = 0;
  while (offset < batch.length) {
    const auto block = counter.NextWord();
    if (block.NoneSet()) {
      CopyValues<Type>(datum, offset, block.length, out_valid, out_values,
                       out_offset + offset);
    } else if (!block.AllSet()) {
      for (int64_t j = 0; j < block.length; ++j) {
        if (!BitUtil::GetBit(out_valid, out_offset + offset + j)) {
          CopyValues<Type>(datum, offset + j, 1, out_valid, out_values,
                           out_offset + offset + j);
        }
      }
    }
    offset += block.length;
  }
  break;
} else if (datum.is_array()) {
Could you de-nest some of this branching by extracting some functions and intersperse some whitespace and comments? This is a little difficult to read. Something like:
Suggested change (replacing the block quoted above):

if ((datum.is_scalar() && datum.scalar()->is_valid) ||
    (datum.is_array() && !datum.array()->MayHaveNulls())) {
  // all-valid scalar or array
  CopyValuesAllValid<Type>(datum, batch.length, out_valid, out_values, out_offset);
  break;
}
// null scalar; skip
if (datum.is_scalar()) continue;
Broke up the main function a bit.
Rebased (wow, that was more painful than I wanted it to be).
Rebased again to fix the conflict with the make_struct change. |
LGTM, thanks!