Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43733: [C++] Fix Scalar boolean handling in row encoder #43734

Merged
merged 14 commits into from
Aug 19, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Aug 17, 2024

Rationale for this change

See #43733

What changes are included in this PR?

Separate Null and Valid handling when BooleanKeyEncoder::Encode meets a Null

This patch also does a migration:

  • row_encoder.cc -> row_encoder_internal.cc
  • move row_encoder_internal{.cc|.h} from compute/kernel to compute/row

Are these changes tested?

Yes

Are there any user-facing changes?

No

Copy link

⚠️ GitHub issue #43733 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member Author

mapleFU commented Aug 17, 2024

I'm not familiar with this module so hope maintainer can tell me where should I add test for this case?

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a test case to reproduce the expected bug and to verify the fix.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 17, 2024
@zanmato1984
Copy link
Contributor

I think we need a test case to reproduce the expected bug and to verify the fix.

FYI, you can refer to related existing tests in compare_test.cc and row_test.cc.

namespace arrow::compute::internal {

// GH-43733: Test that the key encoder can handle boolean scalar values well.
TEST(TestKeyEncoder, BooleanScalar) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've add a new test since testing this in compute/row... is far from original KeyEncoder logic. Tell me if you have better idea

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry that I pointed you to the wrong place because I didn't realize the RowEncoder staff (used for naive hash join/aggregation implementation) was not related to the row stuff (used for swiss join/fast aggregation).

You did right. Here is good.

@mapleFU mapleFU force-pushed the boolean-encoder-valid-and-true branch from 29b8fff to 3efe551 Compare August 18, 2024 04:59
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to row_encoder_internal.cc to make it consistent with row_encoder_internal.h

namespace arrow::compute::internal {

// GH-43733: Test that the key encoder can handle boolean scalar values well.
TEST(TestKeyEncoder, BooleanScalar) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry that I pointed you to the wrong place because I didn't realize the RowEncoder staff (used for naive hash join/aggregation implementation) was not related to the row stuff (used for swiss join/fast aggregation).

You did right. Here is good.

@@ -156,7 +156,7 @@ TEST(RowTableOffsetOverflow, LARGE_MEMORY_TEST(Encode)) {
auto value, ::arrow::gen::Constant(
std::make_shared<BinaryScalar>(std::string(length_per_binary, 'X')))
->Generate(1));
values.push_back(std::move(value));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why bother change this place?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor update

@@ -137,4 +137,5 @@ add_arrow_compute_test(aggregate_test
# ----------------------------------------------------------------------
# Utilities

add_arrow_compute_test(kernel_utility_test SOURCES codegen_internal_test.cc)
add_arrow_compute_test(kernel_utility_test SOURCES codegen_internal_test.cc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird to add row encoder test as part of the kernel utility suite. If you do a search, you can find that row encoder is not referenced in compute/kernels directory at all. (The only call sites are in compute/row/grouper.cc and acero/hash_join*.cc.)

Maybe we should consider moving it out to compute directory. And we can make this test part of arrow-compute-internals-test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it into compute/row, since it's so weird in compute/kernel

@mapleFU mapleFU force-pushed the boolean-encoder-valid-and-true branch from 9ace195 to d0b9d85 Compare August 19, 2024 08:17
cpp/src/arrow/CMakeLists.txt Outdated Show resolved Hide resolved
cpp/src/arrow/compute/CMakeLists.txt Show resolved Hide resolved
cpp/src/arrow/compute/row/row_encoder_internal.cc Outdated Show resolved Hide resolved
@mapleFU mapleFU force-pushed the boolean-encoder-valid-and-true branch from 5d2415f to dcca744 Compare August 19, 2024 11:09
cpp/src/arrow/compute/row/row_encoder_internal_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/row/row_encoder_internal_test.cc Outdated Show resolved Hide resolved
};
auto data_payload = reset_and_get_payload_ptrs();
ASSERT_OK(key_encoder.Encode(ExecValue{&scalar}, batch_length, data_payload));
data_payload = reset_and_get_payload_ptrs();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? The payload pointers shouldn't change since those are std::array.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The elements of data_payload are changed by Encode internally, thus the resetting is needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the data_payload would be advanced in key_encoder

ASSERT_OK_AND_ASSIGN(
auto array_data,
key_encoder.Decode(data_payload, batch_length, ::arrow::default_memory_pool()));
ASSERT_EQ(batch_length, array_data->length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also validate array_data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I change to validate array is equal to Array from scalar now

BooleanKeyEncoder key_encoder;
SCOPED_TRACE("scalar " + scalar.ToString());
constexpr int64_t batch_length = 10;
std::array<int32_t, batch_length> lengths{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this lengths and the subsequent AddLength call don't contribute anything useful to the test. We should either check the content of the lengths or just remove them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would check this is 2

cpp/src/arrow/compute/row/row_encoder_internal_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/row/row_encoder_internal_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/row/row_encoder_internal_test.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for fixing this!

@pitrou pitrou changed the title GH-43733: [C++] Compute: Seperate Null and Valid in BooleanKeyEncoder GH-43733: [C++] Fix Scalar boolean handling in row encoder Aug 19, 2024
@pitrou pitrou merged commit a380d69 into apache:main Aug 19, 2024
40 of 41 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Aug 19, 2024
@pitrou
Copy link
Member

pitrou commented Aug 19, 2024

Thank you @mapleFU !

@mapleFU mapleFU deleted the boolean-encoder-valid-and-true branch August 19, 2024 17:56
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit a380d69.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 20 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants