-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43733: [C++] Fix Scalar boolean handling in row encoder #43734
Conversation
|
I'm not familiar with this module so hope maintainer can tell me where should I add test for this case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a test case to reproduce the expected bug and to verify the fix.
FYI, you can refer to related existing tests in |
namespace arrow::compute::internal { | ||
|
||
// GH-43733: Test that the key encoder can handle boolean scalar values well. | ||
TEST(TestKeyEncoder, BooleanScalar) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've add a new test since testing this in compute/row...
is far from original KeyEncoder logic. Tell me if you have better idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry that I pointed you to the wrong place because I didn't realize the RowEncoder
staff (used for naive hash join/aggregation implementation) was not related to the row
stuff (used for swiss join/fast aggregation).
You did right. Here is good.
29b8fff
to
3efe551
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to row_encoder_internal.cc to make it consistent with row_encoder_internal.h
namespace arrow::compute::internal { | ||
|
||
// GH-43733: Test that the key encoder can handle boolean scalar values well. | ||
TEST(TestKeyEncoder, BooleanScalar) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry that I pointed you to the wrong place because I didn't realize the RowEncoder
staff (used for naive hash join/aggregation implementation) was not related to the row
stuff (used for swiss join/fast aggregation).
You did right. Here is good.
@@ -156,7 +156,7 @@ TEST(RowTableOffsetOverflow, LARGE_MEMORY_TEST(Encode)) { | |||
auto value, ::arrow::gen::Constant( | |||
std::make_shared<BinaryScalar>(std::string(length_per_binary, 'X'))) | |||
->Generate(1)); | |||
values.push_back(std::move(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why bother change this place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minor update
@@ -137,4 +137,5 @@ add_arrow_compute_test(aggregate_test | |||
# ---------------------------------------------------------------------- | |||
# Utilities | |||
|
|||
add_arrow_compute_test(kernel_utility_test SOURCES codegen_internal_test.cc) | |||
add_arrow_compute_test(kernel_utility_test SOURCES codegen_internal_test.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems weird to add row encoder test as part of the kernel utility suite. If you do a search, you can find that row encoder is not referenced in compute/kernels
directory at all. (The only call sites are in compute/row/grouper.cc
and acero/hash_join*.cc
.)
Maybe we should consider moving it out to compute
directory. And we can make this test part of arrow-compute-internals-test
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it into compute/row
, since it's so weird in compute/kernel
9ace195
to
d0b9d85
Compare
5d2415f
to
dcca744
Compare
}; | ||
auto data_payload = reset_and_get_payload_ptrs(); | ||
ASSERT_OK(key_encoder.Encode(ExecValue{&scalar}, batch_length, data_payload)); | ||
data_payload = reset_and_get_payload_ptrs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary? The payload pointers shouldn't change since those are std::array
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The elements of data_payload
are changed by Encode
internally, thus the resetting is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the data_payload would be advanced in key_encoder
ASSERT_OK_AND_ASSIGN( | ||
auto array_data, | ||
key_encoder.Decode(data_payload, batch_length, ::arrow::default_memory_pool())); | ||
ASSERT_EQ(batch_length, array_data->length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also validate array_data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I change to validate array is equal to Array from scalar now
BooleanKeyEncoder key_encoder; | ||
SCOPED_TRACE("scalar " + scalar.ToString()); | ||
constexpr int64_t batch_length = 10; | ||
std::array<int32_t, batch_length> lengths{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this lengths
and the subsequent AddLength
call don't contribute anything useful to the test. We should either check the content of the lengths
or just remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would check this is 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for fixing this!
Thank you @mapleFU ! |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit a380d69. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 20 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
See #43733
What changes are included in this PR?
Separate Null and Valid handling when BooleanKeyEncoder::Encode meets a Null
This patch also does a migration:
compute/kernel
tocompute/row
Are these changes tested?
Yes
Are there any user-facing changes?
No