GH-35749: [C++] Handle run-end encoded filters in compute kernels #35750
The code-moving commits in this PR are better reviewed by themselves in this separate PR: #35751
Very neat PR @felipecrv !
/// Pre-conditions guaranteed by the callers:
/// - i and j are valid indices into the values buffer
/// - the values in i and j are valid
bool CompareValuesAt(int64_t i, int64_t j) const {
I don't understand what this is doing in the REE utils. It essentially represents value access in primitive arrays.
Comparing values is a common operation in run-end encoding kernels. I would happily move it out of here if you have a suggestion.
const bool valid = bit_util::GetBit(filter_is_valid, i);
const bool emit = !valid || bit_util::GetBit(filter_selection, i);
if (emit) {
  emit_segment(it.logical_position(), it.run_length(), valid);
I'm not sure whether you tried to time these new kernels, but emit_segment being a std::function will come with its own overhead (I'm not sure by how much).
Another approach would be to give emit_segment a batch of ranges:
struct REEFilterSegment {
int64_t position;
int64_t segment_length;
bool filter_valid; // false means emit none
};
using EmitREEFilterSegment =
std::function<void(const REEFilterSegment*, int32_t num_segments)>;
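A minimal sketch of the batching idea above, not code from this PR: the visitor buffers segments in a small fixed-size array and calls the std::function once per batch, so the indirect-call cost is spread over many segments. SimpleREEFilter, VisitSegmentsBatched, kBatchSize and the emit_null flag are hypothetical names introduced for illustration, and the comment on filter_valid reflects this sketch's reading of the field. It assumes C++17 for std::optional.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

struct REEFilterSegment {
  int64_t position;
  int64_t segment_length;
  bool filter_valid;  // in this sketch: false when the run's filter value is null
};

using EmitREEFilterSegments =
    std::function<void(const REEFilterSegment*, int32_t num_segments)>;

// Hypothetical stand-in for an REE boolean filter: run i covers the logical
// range [run_ends[i-1], run_ends[i]) and std::nullopt models a null value.
struct SimpleREEFilter {
  std::vector<int64_t> run_ends;
  std::vector<std::optional<bool>> values;
};

// Collects selected runs into a stack buffer and flushes it in batches.
void VisitSegmentsBatched(const SimpleREEFilter& filter, bool emit_null,
                          const EmitREEFilterSegments& emit) {
  constexpr int32_t kBatchSize = 64;  // assumed batch size, not tuned
  REEFilterSegment batch[kBatchSize];
  int32_t n = 0;
  int64_t start = 0;
  for (std::size_t i = 0; i < filter.run_ends.size(); ++i) {
    const int64_t end = filter.run_ends[i];
    const bool valid = filter.values[i].has_value();
    const bool selected = valid ? *filter.values[i] : emit_null;
    if (selected) {
      batch[n++] = REEFilterSegment{start, end - start, valid};
      if (n == kBatchSize) {  // buffer full: one indirect call for 64 segments
        emit(batch, n);
        n = 0;
      }
    }
    start = end;
  }
  if (n > 0) emit(batch, n);  // flush the partial last batch
}
```

Whether this pays off depends on how short the runs are in practice, which is the point raised in the next comment.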
Of course, ideally a REE-encoded array has long enough runs to make REE encoding worthwhile...
I developed the REExREE kernels first and measured the binary-size impact of using template-param lambdas: my compilation unit was the biggest in the whole project, so I decided to migrate to std::function to save multiple MBs in binary size.
The PlainxREE kernel is simpler, so maybe the impact wouldn't be so bad.
> Of course, ideally a REE-encoded array has long enough runs to make REE encoding worthwhile...
Exactly. So I think it's better to start with std::function to avoid inflating the library size. If this gains adoption, we can revisit the kernels later.
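To make the trade-off concrete, here is a small sketch (not taken from the PR) of the same run visitor written both ways. VisitRunsTemplated, VisitRunsErased and the SumLengths helpers are hypothetical names. The templated visitor is instantiated once per distinct lambda type, which is where the binary-size growth comes from; the std::function version is compiled once, at the cost of an indirect call per segment.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Template-parameter callback: every call site with a different lambda
// produces a separate instantiation of this function.
template <typename EmitSegment>
void VisitRunsTemplated(const std::vector<int64_t>& run_ends, EmitSegment&& emit) {
  int64_t start = 0;
  for (int64_t end : run_ends) {
    emit(start, end - start);
    start = end;
  }
}

using EmitSegmentFn = std::function<void(int64_t position, int64_t length)>;

// Type-erased callback: a single compiled copy shared by all callers.
void VisitRunsErased(const std::vector<int64_t>& run_ends, const EmitSegmentFn& emit) {
  int64_t start = 0;
  for (int64_t end : run_ends) {
    emit(start, end - start);
    start = end;
  }
}

int64_t SumLengthsTemplated(const std::vector<int64_t>& run_ends) {
  int64_t total = 0;
  VisitRunsTemplated(run_ends, [&](int64_t, int64_t length) { total += length; });
  return total;
}

int64_t SumLengthsErased(const std::vector<int64_t>& run_ends) {
  int64_t total = 0;
  VisitRunsErased(run_ends, [&](int64_t, int64_t length) { total += length; });
  return total;
}
```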
@@ -239,6 +309,43 @@ struct Selection {
}
};

if (is_ree_filter) {
Status status;
Why not let emit_segment return a Status instead?
When I started with the numeric kernels there was no way for emit_segment to fail. Making it return Status will create overhead in VisitPlainxREEFilterOutputSegments when it's dealing with primitives.
Should I worry about the overhead of checking for the status returned by std::function in the context of primitive kernels?
That's a good question. Ideally there should be no overhead returning a successful Status, but in practice there is (we try to measure it in type_benchmark.cc). I'll let you choose what is best here.
I added a bool return type that I check in the VisitPlainxREEFilterOutputSegments loop.
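The pattern described here can be sketched roughly as follows. This is an illustration only: MiniStatus is a hypothetical stand-in for arrow::Status so the snippet compiles without Arrow, and VisitRuns and AppendSegments are made-up names. The callback returns a plain bool that the visiting loop checks, while any error detail is stored in a status object captured by the caller's lambda.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for arrow::Status: an empty message means OK.
struct MiniStatus {
  std::string message;
  bool ok() const { return message.empty(); }
};

using EmitSegment =
    std::function<bool(int64_t position, int64_t length, bool filter_valid)>;

// Visits each run and stops as soon as the callback returns false.
void VisitRuns(const std::vector<int64_t>& run_ends, const EmitSegment& emit) {
  int64_t start = 0;
  for (int64_t end : run_ends) {
    if (!emit(start, end - start, /*filter_valid=*/true)) return;
    start = end;
  }
}

// A fallible consumer: the lambda records the failure in `status` and
// signals it to the visitor through the bool return value.
MiniStatus AppendSegments(const std::vector<int64_t>& run_ends, int64_t max_length,
                          int64_t* total) {
  MiniStatus status;
  VisitRuns(run_ends, [&](int64_t position, int64_t length, bool /*filter_valid*/) {
    if (length > max_length) {
      status.message = "segment at " + std::to_string(position) + " is too long";
    } else {
      *total += length;
    }
    return status.ok();
  });
  return status;
}
```

Kernels that cannot fail simply return true from the lambda, so the loop only pays for a bool check.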
Looks mostly good to me now, just a number of minor comments.
};
Status status;
VisitPlainxREEFilterOutputSegments(
    filter, true, null_selection,
Also include parameter name here.
I reviewed all calls to VisitPlainxREEFilterOutputSegments and added the label.
status = emit_segment(position, segment_length, filter_valid);
return status.ok();
});
RETURN_NOT_OK(std::move(status));
The std::move is a bit pedantic here IMHO...
GenericToStatus takes an r-value and that avoids code-bloat -- no need to generate memory allocation code to copy the Status string. This is multiplied by the number of template instances.
#define ARROW_RETURN_NOT_OK(status) \
do { \
::arrow::Status __s = ::arrow::internal::GenericToStatus(status); \
ARROW_RETURN_IF_(!__s.ok(), __s, ARROW_STRINGIFY(status)); \
} while (false)
Status::Status(const Status& s)
: state_((s.state_ == NULLPTR) ? NULLPTR : new State(*s.state_)) {}
vs
Status::Status(Status&& s) noexcept : state_(s.state_) { s.state_ = NULLPTR; }
being inlined.
Which yes, is a pedantic std::move, but should I remove it? :D
No need to remove it. I'm merely pointing out that it's not necessary to micro-optimize this particular end of function :-)
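For readers following the copy-versus-move point in this thread, a self-contained sketch of the difference is below. MiniStatus and ToStatus are hypothetical stand-ins that mimic the two Status constructors quoted above and a pair of lvalue/rvalue pass-through overloads; the real Arrow types are not used, and the copy counter exists only to make the effect observable.

```cpp
#include <string>
#include <utility>

static int g_state_copies = 0;  // counts heap copies of the error state

struct State {
  std::string msg;
};

class MiniStatus {
 public:
  MiniStatus() = default;
  explicit MiniStatus(std::string msg) : state_(new State{std::move(msg)}) {}
  // Copying a failed status allocates and copies the error state.
  MiniStatus(const MiniStatus& s)
      : state_(s.state_ == nullptr ? nullptr : new State(*s.state_)) {
    if (state_ != nullptr) ++g_state_copies;
  }
  // Moving only steals the pointer, so no allocation code is generated.
  MiniStatus(MiniStatus&& s) noexcept : state_(s.state_) { s.state_ = nullptr; }
  MiniStatus& operator=(const MiniStatus&) = delete;
  ~MiniStatus() { delete state_; }
  bool ok() const { return state_ == nullptr; }

 private:
  State* state_ = nullptr;
};

// Hypothetical pass-through overloads: lvalues come back by reference,
// rvalues are moved out.
inline const MiniStatus& ToStatus(const MiniStatus& s) { return s; }
inline MiniStatus ToStatus(MiniStatus&& s) { return std::move(s); }

int CopiesWhenPassedAsLvalue() {
  g_state_copies = 0;
  MiniStatus failed("boom");
  MiniStatus s = ToStatus(failed);  // copy constructor runs
  (void)s;
  return g_state_copies;
}

int CopiesWhenMoved() {
  g_state_copies = 0;
  MiniStatus failed("boom");
  MiniStatus s = ToStatus(std::move(failed));  // move constructor only
  (void)s;
  return g_state_copies;
}
```

As the reviewer notes, the saving only matters on the error path, so at the end of a function it is a micro-optimization rather than a requirement.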
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
This reverts commit a0bb513.
I rebased and force-pushed (instead of merging) to see if the macOS script that failed in CI now works.
Thanks a lot @felipecrv !
Conbench analyzed the 6 benchmark runs on commit. There were 30 benchmark results indicating a performance regression.
The full Conbench report has more details.
Rationale for this change
Boolean arrays (bitmaps) used to represent filters in Arrow take 1 bit per boolean value. If the filter contains long runs, the filter can be run-end encoded and save even more memory.
Using POPCNT, a bitmap can be scanned efficiently for runs of fewer than 64 logical values, but a run-end encoded array gives the length of each run directly, and a run can go beyond the word size.
These two observations make the case that, for the right dataset, REE filters can be more efficiently processed in compute kernels.
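The contrast in the rationale can be illustrated with a small counting sketch. It is not the PR's GetFilterOutputSize: CountSelectedBitmap and CountSelectedREE are hypothetical names, nulls are ignored, and std::bitset::count is used as a portable stand-in for a POPCNT instruction. The bitmap version touches one 64-bit word per 64 rows, while the REE version does one addition per run regardless of run length.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain bitmap filter: one bit per row, counted one 64-bit word at a time.
int64_t CountSelectedBitmap(const std::vector<uint64_t>& words) {
  int64_t count = 0;
  for (uint64_t w : words) {
    count += static_cast<int64_t>(std::bitset<64>(w).count());
  }
  return count;
}

// REE filter: run i covers [run_ends[i-1], run_ends[i]) with value values[i],
// so a selected run contributes its whole length in a single step.
int64_t CountSelectedREE(const std::vector<int64_t>& run_ends,
                         const std::vector<bool>& values) {
  int64_t count = 0;
  int64_t start = 0;
  for (std::size_t i = 0; i < run_ends.size(); ++i) {
    if (values[i]) count += run_ends[i] - start;
    start = run_ends[i];
  }
  return count;
}
```

For a filter with a run of a million selected rows, the bitmap loop visits about 15,625 words while the REE loop visits a single run.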
What changes are included in this PR?
- GetFilterOutputSize can count number of emits from a REE filter
- GetTakeIndices can produce an array of logical indices from a REE filter
- "array_filter" can handle REE filters
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
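As an illustration of the second change listed under "What changes are included in this PR?", the sketch below expands an REE boolean filter into the logical indices it selects. It is a simplified model rather than the PR's GetTakeIndices: TakeIndicesFromREE is a hypothetical name, the filter is passed as two std::vectors, and null filter values are not modeled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Run i covers logical positions [run_ends[i-1], run_ends[i]); every
// position inside a run whose value is true becomes a take index.
std::vector<int64_t> TakeIndicesFromREE(const std::vector<int64_t>& run_ends,
                                        const std::vector<bool>& values) {
  std::vector<int64_t> indices;
  int64_t start = 0;
  for (std::size_t i = 0; i < run_ends.size(); ++i) {
    if (values[i]) {
      for (int64_t pos = start; pos < run_ends[i]; ++pos) {
        indices.push_back(pos);
      }
    }
    start = run_ends[i];
  }
  return indices;
}
```

For example, run ends {2, 5, 7} with values {true, false, true} select logical indices 0, 1, 5 and 6.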