GH-34911: [C++] Add first and last aggregator #34912

icexelloss · 2023-04-05T21:42:45Z

Rationale for this change

This PR adds "first" and "last" aggregators and supports using them with Acero's segmented aggregation.

What changes are included in this PR?

Numeric Scalar Aggregator (bool, int types, floating types)
Numeric Hash Aggregator (bool, int types, floating types)
Docstring
Non-Numeric Scalar Aggregator (string, binary, fixed binary, temporal)
Non-Numeric Hash Aggregator (string, binary, fixed binary, temporal)
Add ordered flag in aggregate kernels
Implement and test skip null
Update compute.rst

Are these changes tested?

Compute Kernel Test (Scalar Kernels, all supported datatypes)
Hash Aggregate Test (Hash Kernels, all supported datatypes)
Segmented Aggregation Test (Both Scalar and Hash Kernels)

Are there any user-facing changes?

Yes. This adds the First and Last aggregators.

github-actions · 2023-04-05T21:43:06Z

Closes: [C++] Add first and last aggregation #34911

icexelloss · 2023-04-05T21:47:28Z

This is a work in progress, but I want to put it up because I got segmented aggregation to work with numeric types. I will finish the remaining items (see the check boxes in the description).

icexelloss · 2023-04-05T21:48:30Z

@westonpace I pretty much followed the way the min/max aggregator is implemented, so hopefully there are no surprises here. Still, I would appreciate it if you could take a look at whether this general approach is correct.

westonpace

A few general questions.

Also, how does first compare with something like LIMIT 1?

Do you think there is value in instead implementing something like nth_value? E.g. https://docs.snowflake.com/en/sql-reference/functions/nth_value

westonpace · 2023-04-10T21:55:44Z

cpp/src/arrow/compute/kernels/aggregate_basic_internal.h

+      for (int64_t i = 0; i < arr.length(); i++) {
+        local.MergeOne(arr.GetView(i));
+      }


Wouldn't you break as soon as you encounter a value? Why do you need to iterate the entire array? I suppose if you want both first AND last then you might need to iterate from both directions. Something like...

if (!has_first) {
  int index = 0;
  while (index < length) {
    if (arr[index] != null || !skip_nulls) {
      has_first = true;
      first = arr[index];
      break;
    } else {
      index++;
    }
  }
}
// No need to check has_last here since we always assume the current batch is replacing the last
int index = length - 1;
while (index >= 0) {
  if (arr[index] != null || !skip_nulls) {
    last = arr[index];
    break;
  } else {
    index--;
  }
}

Also, it appears that last carries quite a bit more cost than first. Imagine you were searching for first and skip_nulls=false. All you need to do is look at one value and you can skip all future batches.

Given this I'm not sure if we want to combine first/last into a single kernel. Or at least, make it possible in some way to skip data if last isn't needed.

I updated this to be close to what you have. Can you take a look if that looks fine to you?
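For illustration, the scan suggested above can be written out as a small standalone sketch. This is not the kernel code from the PR: `Batch`, `FirstLastScan`, and `ConsumeBatch` are hypothetical names, and a `std::vector<std::optional<int64_t>>` stands in for an Arrow array, with `std::nullopt` modelling a null slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for an Arrow array: std::nullopt models a null slot.
using Batch = std::vector<std::optional<int64_t>>;

// Hypothetical running state, not the kernel's actual state type.
struct FirstLastScan {
  bool has_first = false;
  bool has_last = false;
  std::optional<int64_t> first;
  std::optional<int64_t> last;
};

// Consume one batch, touching only as many slots as needed from each end.
void ConsumeBatch(const Batch& batch, bool skip_nulls, FirstLastScan* state) {
  if (!state->has_first) {
    // Scan forward and stop at the first acceptable slot.
    for (std::size_t i = 0; i < batch.size(); ++i) {
      if (batch[i].has_value() || !skip_nulls) {
        state->first = batch[i];
        state->has_first = true;
        break;
      }
    }
  }
  // A later batch supersedes the previous "last", so scan backward
  // unconditionally and stop at the first acceptable slot.
  for (std::size_t i = batch.size(); i-- > 0;) {
    if (batch[i].has_value() || !skip_nulls) {
      state->last = batch[i];
      state->has_last = true;
      break;
    }
  }
}

// Usage example: two batches consumed in order with skip_nulls = true.
void ScanExample() {
  FirstLastScan s;
  ConsumeBatch({std::nullopt, 3, 4, std::nullopt}, /*skip_nulls=*/true, &s);
  ConsumeBatch({7, 8}, /*skip_nulls=*/true, &s);
  assert(s.first.value() == 3);
  assert(s.last.value() == 8);
}
```

Once `has_first` is set, the forward scan is skipped for every later batch, which is the asymmetry between first and last pointed out above.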

westonpace · 2023-04-10T21:57:02Z

cpp/src/arrow/compute/kernels/hash_aggregate.cc

+struct NullSentinel {
+  static constexpr CType value() { return std::numeric_limits<CType>::min(); }
+};
+
+template <>
+struct NullSentinel<float> {
+  static constexpr float value() { return std::numeric_limits<float>::infinity(); }
+};
+
+template <>
+struct NullSentinel<double> {
+  static constexpr double value() { return std::numeric_limits<double>::infinity(); }
+};


Maybe UninitializedSentinel?

I removed this and ended up reusing AntiExtrema for the sentinel values
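As a side note on why the diff specializes the floating-point cases, here is a minimal sketch under the name suggested in review. `UninitializedSentinel` and `MakeGroupSlots` are hypothetical and are not Arrow code; the sketch does not reproduce the `AntiExtrema` helper that was ultimately reused.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical trait using the name floated in review; not Arrow's code.
template <typename CType>
struct UninitializedSentinel {
  static constexpr CType value() { return std::numeric_limits<CType>::min(); }
};

// For floating point, numeric_limits<T>::min() is the smallest positive
// normal value rather than the most negative one, so the primary template
// would pick an ordinary-looking number. Infinity is used instead.
template <>
struct UninitializedSentinel<float> {
  static constexpr float value() { return std::numeric_limits<float>::infinity(); }
};

template <>
struct UninitializedSentinel<double> {
  static constexpr double value() { return std::numeric_limits<double>::infinity(); }
};

// Hypothetical helper: pre-fill one slot per group with the sentinel.
template <typename CType>
std::vector<CType> MakeGroupSlots(std::size_t num_groups) {
  return std::vector<CType>(num_groups, UninitializedSentinel<CType>::value());
}

// Usage example.
void SentinelExample() {
  assert(UninitializedSentinel<int32_t>::value() ==
         std::numeric_limits<int32_t>::min());
  assert(std::isinf(UninitializedSentinel<double>::value()));
  assert(std::numeric_limits<float>::min() > 0.0f);
  assert(MakeGroupSlots<int64_t>(3).size() == 3);
}
```

A sentinel of this kind is still a legal data value, so it only marks a slot as untouched by convention; any real "has a value been seen" decision needs separate bookkeeping.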

icexelloss · 2023-04-12T18:43:32Z

Also, how does first compare with something like LIMIT 1?

I see them as different things. To me, limit means "get partial results from the full results", while first is an aggregation function that requires ordered input and can be used with group by and window aggregations. limit can only be used (IIRC) as the last statement in the query plan, while first can appear in the middle.
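The distinction can be sketched with plain containers. This is only an illustration with hypothetical names (`Row`, `FirstPerGroup`, `Limit`), not Acero code: first yields one value per group, while a limit truncates the whole result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Row = std::pair<std::string, int>;  // (group key, value)

// "first" as an aggregation: one output value per group, taken from the
// earliest row of that group in input order.
std::map<std::string, int> FirstPerGroup(const std::vector<Row>& rows) {
  std::map<std::string, int> out;
  for (const auto& [key, value] : rows) {
    out.emplace(key, value);  // emplace leaves an existing entry untouched
  }
  return out;
}

// LIMIT n: keeps the first n rows of the whole result, ignoring groups.
std::vector<Row> Limit(const std::vector<Row>& rows, std::size_t n) {
  const auto count = static_cast<std::ptrdiff_t>(std::min(n, rows.size()));
  return std::vector<Row>(rows.begin(), rows.begin() + count);
}

// Usage example.
void FirstVersusLimitExample() {
  const std::vector<Row> rows = {{"a", 1}, {"b", 2}, {"a", 3}, {"b", 4}};
  const auto firsts = FirstPerGroup(rows);
  assert(firsts.size() == 2);
  assert(firsts.at("a") == 1);
  assert(firsts.at("b") == 2);
  assert(Limit(rows, 1).size() == 1);
}
```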

Do you think there is value in instead implementing something like nth_value? E.g. https://docs.snowflake.com/en/sql-reference/functions/nth_value

I think there is value. I can look into whether this is easy to do.

icexelloss · 2023-04-18T21:26:28Z

cpp/src/arrow/compute/kernels/hash_aggregate.cc

+  }
+
+  template <typename T = Type>
+  enable_if_base_binary<T, Status> MakeOffsetsValues(


This is copied from the binary version of the grouped min/max kernel. I will try to refactor this to a base class.

I was not able to address this because the templates make it hard to refactor and share the logic between "MinMax" and "FirstLast" for binary types. If that's OK, I would like to leave it as a follow-up, to refactor this properly and avoid adding more complexity to this PR.

icexelloss · 2023-04-27T16:03:29Z

@westonpace I addressed all the comments I think.

For skip_nulls=false, it's actually non-trivial to support, and I ended up adding three more flags (first_is_null, last_is_null, has_any_values) to support this. Please take a look.
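To make the role of the three flags concrete, here is a rough sketch of a mergeable state. Only the flag names `first_is_null`, `last_is_null`, and `has_any_values` come from the comment above; `has_values`, the method names, and the exact semantics are assumptions for illustration and may differ from the linked implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical state; only the three flag names come from the PR comment.
struct FirstLastState {
  int64_t first = 0;            // first non-null value seen
  int64_t last = 0;             // last non-null value seen
  bool has_values = false;      // assumption: a non-null value was seen
  bool first_is_null = false;   // the very first row seen was null
  bool last_is_null = false;    // the most recent row seen was null
  bool has_any_values = false;  // at least one row (null or not) was seen

  void Consume(const std::optional<int64_t>& v) {
    if (!has_any_values) {
      first_is_null = !v.has_value();
      has_any_values = true;
    }
    last_is_null = !v.has_value();
    if (v.has_value()) {
      if (!has_values) {
        first = *v;
        has_values = true;
      }
      last = *v;
    }
  }

  // Merge a state covering rows that come strictly after this one.
  void MergeFrom(const FirstLastState& later) {
    if (!later.has_any_values) return;
    if (!has_any_values) {
      *this = later;
      return;
    }
    last_is_null = later.last_is_null;
    if (later.has_values) {
      if (!has_values) {
        first = later.first;
        has_values = true;
      }
      last = later.last;
    }
  }

  std::optional<int64_t> First(bool skip_nulls) const {
    if (!has_values || (!skip_nulls && first_is_null)) return std::nullopt;
    return first;
  }

  std::optional<int64_t> Last(bool skip_nulls) const {
    if (!has_values || (!skip_nulls && last_is_null)) return std::nullopt;
    return last;
  }
};

// Usage example: rows [null, 1] followed by rows [2, null].
void FirstLastStateExample() {
  FirstLastState a, b;
  a.Consume(std::nullopt);
  a.Consume(1);
  b.Consume(2);
  b.Consume(std::nullopt);
  a.MergeFrom(b);
  assert(a.First(/*skip_nulls=*/true).value() == 1);
  assert(a.Last(/*skip_nulls=*/true).value() == 2);
  assert(!a.First(/*skip_nulls=*/false).has_value());
  assert(!a.Last(/*skip_nulls=*/false).has_value());
}
```

In this sketch the state records the same information regardless of the option, and the choice between the two null behaviors is deferred to `First` and `Last`, so one state type serves both settings.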

https://github.com/apache/arrow/pull/34912/files#diff-395ffe24a47c8284e800ec4bc812075ac9efe77d8e24f430c5a4bbe2b5809940R343

ianmcook · 2023-04-27T22:00:19Z

The additions to compute.rst look good to me, thanks!

westonpace

This matches the null behavior I would expect. Thanks for fixing that. This looks good, appreciate the persistence.

ursabot · 2023-04-28T18:24:05Z

Benchmark runs are scheduled for baseline = 05a61d6 and contender = 34bfbd9. 34bfbd9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️3.8% ⬆️2.21%] test-mac-arm
[Failed ⬇️19.69% ⬆️0.0% ⚠️ Contender and baseline run contexts do not match] ursa-i9-9960x
[Finished ⬇️3.1% ⬆️1.26% ⚠️ Contender and baseline run contexts do not match] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 34bfbd93 ec2-t3-xlarge-us-east-2
[Failed] 34bfbd93 test-mac-arm
[Failed] 34bfbd93 ursa-i9-9960x
[Finished] 34bfbd93 ursa-thinkcentre-m75q
[Finished] 05a61d6f ec2-t3-xlarge-us-east-2
[Finished] 05a61d6f test-mac-arm
[Failed] 05a61d6f ursa-i9-9960x
[Finished] 05a61d6f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-04-28T18:26:11Z

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm
ursa-i9-9960x

…d segmentation fault (#35384)

### Rationale for this change

The recent change (#34912) calculates the max concurrency using `plan->query_context()->executor()->GetCapacity()`. This is later used to initialize the kernel states. However, this is different than what we used to use. The previous method used was `plan->query_context()->max_concurrency()` which is slightly different (if the aggregate node IS run in parallel then we initialize one state for each CPU thread, one for each I/O thread, and one for the calling user thread).

This is unfortunately a bit complicated as `max_concurrency` would not be a good indicator to use when determining if the plan is running in parallel or not. So we need to query both properties and use them in their respective spots.

### What changes are included in this PR?

Now, `max_concurrency` is used to figure out how many thread local states need to be initialized and `GetCapacity` is used to figure out if there are multiple CPU threads or not.

### Are these changes tested?

The bug was caught by the benchmarks which is a bit concerning. Most of the CI have a very small number of CPU threads and don't experience much concurrency and so I think we just didn't see this pattern. Or possibly, this pattern is only experienced in the legacy way that pyarrow launches exec plans.

### Are there any user-facing changes?

No.

* Closes: #35383

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
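The split described in that commit message can be sketched as follows. Everything here is a hypothetical stand-in: `FakeQueryContext` flattens the real `query_context()` and `executor()` objects into one struct, and the formula for `max_concurrency()` is an assumption based only on the "one state per CPU thread, one per I/O thread, and one for the calling thread" description.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the plan's query context and its executor; only
// the two method names come from the commit message.
struct FakeQueryContext {
  std::size_t cpu_threads;
  std::size_t io_threads;

  // Models executor()->GetCapacity(): the number of CPU threads.
  std::size_t GetCapacity() const { return cpu_threads; }

  // Assumption: CPU threads + I/O threads + the calling user thread.
  std::size_t max_concurrency() const { return cpu_threads + io_threads + 1; }
};

// Hypothetical result of configuring an aggregate node.
struct AggregateSetup {
  std::size_t num_thread_local_states;
  bool runs_in_parallel;
};

// Each property is used for the question it actually answers.
AggregateSetup ConfigureAggregate(const FakeQueryContext& ctx) {
  return AggregateSetup{ctx.max_concurrency(), ctx.GetCapacity() > 1};
}

// Usage example.
void ConfigureAggregateExample() {
  const AggregateSetup parallel = ConfigureAggregate(FakeQueryContext{4, 8});
  assert(parallel.num_thread_local_states == 13);
  assert(parallel.runs_in_parallel);

  const AggregateSetup serial = ConfigureAggregate(FakeQueryContext{1, 8});
  assert(serial.num_thread_local_states == 10);
  assert(!serial.runs_in_parallel);
}
```

Sizing the state array from `GetCapacity()` alone would give 4 slots in the first case, so in this model a thread index from an I/O thread or the calling thread would land out of bounds, which is consistent with the segmentation fault the commit describes.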

### Rationale for this change

This PR adds "first" and "last" aggregator and support using those with Acero's segmented aggregation.

### What changes are included in this PR?

- [x] Numeric Scalar Aggregator (bool, int types, floating types)
- [x] Numeric Hash Aggregator (bool, int types, floating types)
- [x] Docstring
- [x] Non-Numeric Scalar Aggregator (string, binary, fixed binary, temporal)
- [x] Non-Numeric Hash Aggregator (string, binary, fixed binary, temporal)
- [x] Add `ordered` flag in aggregate kernels
- [x] Implement and test skip null
- [x] Update compute.rst

### Are these changes tested?

- [x] Compute Kernel Test (Scalar Kernels, all supported datatypes)
- [x] Hash Aggregate Test (Hash Kernels, all supported datatypes)
- [x] Segmented Aggregation Test (Both Scalar and Hash Kernels)

### Are there any user-facing changes?

Yes. Added First and Last aggregator.

Authored-by: Li Jin <ice.xelloss@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>


icexelloss requested a review from westonpace as a code owner April 5, 2023 21:42

github-actions bot added the Component: C++ and awaiting committer review labels Apr 5, 2023

github-actions bot added the awaiting changes label and removed the awaiting committer review label Apr 5, 2023

westonpace reviewed Apr 10, 2023

github-actions bot added the awaiting change review label and removed the awaiting changes label Apr 14, 2023

icexelloss force-pushed the acero-first-last-agg-2 branch 2 times, most recently from c0a8463 to e8ddf80 Compare April 17, 2023 14:47

icexelloss changed the title ~~GH-34911: [C++] [WIP] Add first and last aggregator~~ GH-34911: [C++] Add first and last aggregator Apr 18, 2023