
ARROW-13136: [C++] Add coalesce function #10608

Closed

Conversation

@lidavidm (Member) commented Jun 28, 2021

No description provided.


@lidavidm lidavidm force-pushed the arrow-13136 branch 2 times, most recently from 392e979 to 59bdbf7 Compare June 30, 2021 15:13
@lidavidm lidavidm marked this pull request as draft July 1, 2021 18:12
@lidavidm lidavidm marked this pull request as ready for review July 1, 2021 20:25
@lidavidm (Member, Author) commented Jul 2, 2021

Hmm, this has some Valgrind errors - taking a look.

@lidavidm (Member, Author) commented Jul 2, 2021

@github-actions crossbow submit conda-cpp-valgrind

@lidavidm (Member, Author) commented Jul 2, 2021

@github-actions crossbow submit test-conda-cpp-valgrind

@github-actions (bot) commented Jul 2, 2021

Revision: 08c6375

Submitted crossbow builds: ursacomputing/crossbow @ actions-560

Task status: test-conda-cpp-valgrind (GitHub Actions)

@lidavidm (Member, Author):

@bkietz do you have time to review this? Do we want to add a benchmark here?

@bkietz bkietz requested review from pitrou and bkietz and removed request for pitrou July 12, 2021 15:07
@bkietz (Member) left a comment

Thanks for doing this!

Yes, we definitely want a benchmark here. A few other comments:

cpp/src/arrow/compute/kernels/codegen_internal.cc (resolved)
cpp/src/arrow/util/bit_block_counter.h (resolved)
Comment on lines 981 to 1639
kernel.null_handling = NullHandling::COMPUTED_NO_PREALLOCATE;
kernel.mem_allocation = MemAllocation::PREALLOCATE;
@bkietz (Member):

I think we should always be able to preallocate the validity bitmap in addition to the data/offsets buffer, which will enable the preallocate_contiguous_ optimization for fixed width types.

Suggested change
kernel.null_handling = NullHandling::COMPUTED_NO_PREALLOCATE;
kernel.mem_allocation = MemAllocation::PREALLOCATE;
kernel.null_handling = NullHandling::COMPUTED_PREALLOCATE;
kernel.mem_allocation = MemAllocation::PREALLOCATE;
if (var width type) {
kernel.can_write_into_slices = false;
}
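
The if (var width type) guard above is pseudocode; one concrete spelling, assuming the registration code has the output type id in hand (ConfigureKernel is a hypothetical helper; is_base_binary_like is the real trait from arrow/type_traits.h), might be:

#include "arrow/compute/kernel.h"
#include "arrow/type_traits.h"

void ConfigureKernel(arrow::compute::ScalarKernel* kernel, arrow::Type::type type_id) {
  kernel->null_handling = arrow::compute::NullHandling::COMPUTED_PREALLOCATE;
  kernel->mem_allocation = arrow::compute::MemAllocation::PREALLOCATE;
  // Variable-width outputs (binary/string) cannot be written into preallocated
  // slices, since their offsets/data buffers grow as values are appended.
  if (arrow::is_base_binary_like(type_id)) {
    kernel->can_write_into_slices = false;
  }
}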

cpp/src/arrow/compute/kernels/scalar_if_else.cc (resolved)
Comment on lines 967 to 968
std::shared_ptr<Array> temp_output;
RETURN_NOT_OK(builder.Finish(&temp_output));
@bkietz (Member):

Suggested change
std::shared_ptr<Array> temp_output;
RETURN_NOT_OK(builder.Finish(&temp_output));
ARROW_ASSIGN_OR_RAISE(auto temp_output, builder.Finish());

@lidavidm (Member, Author):

Performance is fairly meh: perf shows a decent amount of time spent in CopyValues/CopyOneValue just re-examining the Datum variant, calling Buffer::data(), etc.

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
CoalesceBench64/1048576/0    19254992 ns     19255104 ns           37 bytes_per_second=1.62295G/s
CoalesceBench64/1048576/99   19281752 ns     19281409 ns           37 bytes_per_second=1.62058G/s

Trying an approach based on VisitSetBitRunsVoid may be beneficial, and/or manually hoisting the scalar-vs-array detection and having separate CopyScalarValues and CopyArrayValues.
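
For context, a rough sketch of the VisitSetBitRunsVoid idea for a fixed-width type like int64 (CopyValidRuns is a hypothetical name; a real kernel would also have to merge in the output's validity bitmap):

#include <cstring>
#include "arrow/array/data.h"
#include "arrow/util/bit_run_reader.h"

void CopyValidRuns(const arrow::ArrayData& input, int64_t* out_values) {
  const auto* in_values = input.GetValues<int64_t>(1);  // offset already applied
  if (input.buffers[0] == nullptr) {
    // No validity bitmap: a single block copy suffices.
    std::memcpy(out_values, in_values, input.length * sizeof(int64_t));
    return;
  }
  // Visit contiguous runs of set validity bits so each run becomes one
  // memcpy instead of a per-bit loop.
  arrow::internal::VisitSetBitRunsVoid(
      input.buffers[0]->data(), input.offset, input.length,
      [&](int64_t position, int64_t run_length) {
        std::memcpy(out_values + position, in_values + position,
                    run_length * sizeof(int64_t));
      });
}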

@lidavidm (Member, Author):

Alright, this buys us ~50% more performance by specializing the common case:

--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
CoalesceBench64/1048576/0           13335778 ns     13335766 ns           52 bytes_per_second=2.34332G/s
CoalesceBench64/1048576/99          13562011 ns     13561826 ns           51 bytes_per_second=2.30404G/s
CoalesceNonNullBench64/1048576/0    13552482 ns     13552397 ns           48 bytes_per_second=2.30587G/s
CoalesceNonNullBench64/1048576/99   13572235 ns     13572002 ns           51 bytes_per_second=2.30232G/s
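
Roughly, the specialization amounts to unboxing the Datum once per run instead of once per element; a minimal sketch with a hypothetical name (FillFrom), for int64 and ignoring validity:

#include <algorithm>
#include <cstring>
#include "arrow/datum.h"
#include "arrow/scalar.h"
#include "arrow/util/checked_cast.h"

void FillFrom(const arrow::Datum& datum, int64_t offset, int64_t length,
              int64_t* out_values) {
  if (datum.is_scalar()) {
    // Unbox the scalar once, then a tight fill loop.
    const auto& scalar =
        arrow::internal::checked_cast<const arrow::Int64Scalar&>(*datum.scalar());
    std::fill(out_values + offset, out_values + offset + length, scalar.value);
  } else {
    // Unbox the array once, then a single block copy.
    const auto* in_values = datum.array()->GetValues<int64_t>(1);
    std::memcpy(out_values + offset, in_values + offset, length * sizeof(int64_t));
  }
}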

@bkietz (Member) commented Jul 13, 2021

> Trying an approach based on VisitSetBitRunsVoid

IIUC this would require a varargs version of OptionalBitBlockCounter or Bitmap::VisitWords, which would probably be generally useful as varargs compute functions continue to proliferate.
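
For reference, the existing Bitmap::VisitWords (arrow/util/bitmap.h) already visits a compile-time-fixed set of bitmaps a 64-bit word at a time; the varargs version would lift that count to runtime. A sketch of the fixed-N form, assuming the visitor receives one std::array<uint64_t, N> per step (CountFillableWords is a hypothetical name):

#include <array>
#include <cstdint>
#include "arrow/util/bitmap.h"

int64_t CountFillableWords(const arrow::internal::Bitmap& out_valid,
                           const arrow::internal::Bitmap& in_valid) {
  int64_t words = 0;
  arrow::internal::Bitmap bitmaps[] = {out_valid, in_valid};
  arrow::internal::Bitmap::VisitWords(bitmaps, [&](std::array<uint64_t, 2> w) {
    // Count words holding bits that are valid in the input but not yet
    // set in the output validity bitmap.
    if ((~w[0] & w[1]) != 0) ++words;
  });
  return words;
}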

@lidavidm (Member, Author):

Basically, there was a lot of overhead in the fallback loop of "for each offset in the block, if the bit is set, copy one element", because 1) the 'copy one element' function used CopyBitmap, which has a ton of overhead when copying a single bit, and 2) unboxing the array on every iteration was costly (the profiler showed that even Buffer::data()'s check for whether the buffer is on-CPU was hot). I've now specialized things to avoid most of that overhead.

The reason I wanted something like VisitSetBitRunsVoid was to go a step further and always perform block copies instead of falling back to one-element-at-a-time copies. But yes, then it needs to be able to combine two bitmaps with AndNot (we want runs of bits where !output_valid & input_valid).
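
One way to get those runs today, at the cost of a temporary bitmap, is to materialize input_valid & ~output_valid with BitmapAndNot (alongside BitmapAnd/BitmapOr in arrow/util/bitmap_ops.h) and then walk its set-bit runs; a fused, allocation-free version is what the varargs visitor above would enable. A sketch (VisitFillableRuns is a hypothetical name):

#include <cstdint>
#include <memory>
#include "arrow/buffer.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/bit_run_reader.h"
#include "arrow/util/bitmap_ops.h"

arrow::Status VisitFillableRuns(const uint8_t* out_valid, const uint8_t* in_valid,
                                int64_t length) {
  // input_valid & ~output_valid: the slots this input could still fill.
  ARROW_ASSIGN_OR_RAISE(
      std::shared_ptr<arrow::Buffer> fillable,
      arrow::internal::BitmapAndNot(arrow::default_memory_pool(),
                                    in_valid, /*left_offset=*/0,
                                    out_valid, /*right_offset=*/0,
                                    length, /*out_offset=*/0));
  arrow::internal::VisitSetBitRunsVoid(
      fillable->data(), /*offset=*/0, length,
      [](int64_t position, int64_t run_length) {
        // Block-copy input[position, position + run_length) into the output here.
      });
  return arrow::Status::OK();
}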

@lidavidm (Member, Author):

I would kind of prefer to get all these kernels merged and consolidated before trying to micro-optimize them, though, given that they've been around for a while and all use similar helper code (which is now starting to diverge slightly as I look at optimizing).

@lidavidm (Member, Author):

@ursabot please benchmark lang=C++

@ursabot (bot) commented Jul 14, 2021

Benchmark runs are scheduled for baseline = 9c6d417 and contender = e32cf48. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Provided benchmark filters do not have any benchmark groups to be executed on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2 (mimalloc)
[Skipped ⚠️ Only ['Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x (mimalloc)
[Finished ⬇️0.53% ⬆️0.05%] ursa-thinkcentre-m75q (mimalloc)
Supported benchmarks:
ursa-i9-9960x: langs = Python, R
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@bkietz (Member) left a comment

This LGTM, but could you rewrite a bit for readability?

Comment on lines 835 to 1489
if ((datum.is_scalar() && datum.scalar()->is_valid) ||
(datum.is_array() && !datum.array()->MayHaveNulls())) {
BitBlockCounter counter(out_valid, out_offset, batch.length);
int64_t offset = 0;
while (offset < batch.length) {
const auto block = counter.NextWord();
if (block.NoneSet()) {
CopyValues<Type>(datum, offset, block.length, out_valid, out_values,
out_offset + offset);
} else if (!block.AllSet()) {
for (int64_t j = 0; j < block.length; ++j) {
if (!BitUtil::GetBit(out_valid, out_offset + offset + j)) {
CopyValues<Type>(datum, offset + j, 1, out_valid, out_values,
out_offset + offset + j);
}
}
}
offset += block.length;
}
break;
} else if (datum.is_array()) {
@bkietz (Member):

Could you de-nest some of this branching by extracting some functions and interspersing some whitespace and comments? This is a little difficult to read. Something like:

Suggested change
if ((datum.is_scalar() && datum.scalar()->is_valid) ||
(datum.is_array() && !datum.array()->MayHaveNulls())) {
BitBlockCounter counter(out_valid, out_offset, batch.length);
int64_t offset = 0;
while (offset < batch.length) {
const auto block = counter.NextWord();
if (block.NoneSet()) {
CopyValues<Type>(datum, offset, block.length, out_valid, out_values,
out_offset + offset);
} else if (!block.AllSet()) {
for (int64_t j = 0; j < block.length; ++j) {
if (!BitUtil::GetBit(out_valid, out_offset + offset + j)) {
CopyValues<Type>(datum, offset + j, 1, out_valid, out_values,
out_offset + offset + j);
}
}
}
offset += block.length;
}
break;
} else if (datum.is_array()) {
if ((datum.is_scalar() && datum.scalar()->is_valid) ||
(datum.is_array() && !datum.array()->MayHaveNulls())) {
// all-valid scalar or array
CopyValuesAllValid<Type>(datum, batch.length, out_valid, out_values, out_offset);
break;
}
// null scalar; skip
if (datum.is_scalar()) continue;

@lidavidm (Member, Author):

Broke up the main function a bit.

@lidavidm (Member, Author):

Rebased (wow, that was more painful than I wanted it to be).

@lidavidm (Member, Author):

Rebased again to fix the conflict with the make_struct change.

@bkietz (Member) left a comment

LGTM, thanks!
