GH-35515: [C++][Python] Add non decomposable aggregation UDF #35514

icexelloss · 2023-05-09T15:16:13Z

Rationale for this change

A non-decomposable aggregation is one that cannot be split into consume/merge/finalize steps. This is often the case when the logic is written with external Python libraries (numpy, pandas, statsmodels, etc.) and either cannot be decomposed or is not worth the effort to decompose (these are often one-off functions rather than reusable ones). This PR implements support for non-decomposable aggregation UDFs.
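To make the distinction concrete, here is a minimal pure-Python sketch (not Arrow code; `MeanState`, `decomposed_mean` and `median_udf` are hypothetical names used only for illustration). The mean can be computed from per-batch partial states that are consumed, merged and finalized, while the median needs every value at once.

```python
import statistics


class MeanState:
    """Partial state for a decomposable aggregation (mean)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def consume(self, batch):
        # Fold one batch into the partial state.
        self.total += sum(batch)
        self.count += len(batch)

    def merge(self, other):
        # Combine two partial states without revisiting the data.
        self.total += other.total
        self.count += other.count

    def finalize(self):
        return self.total / self.count


def decomposed_mean(batches):
    # Each batch gets its own state, as if processed independently.
    states = []
    for batch in batches:
        state = MeanState()
        state.consume(batch)
        states.append(state)
    merged = MeanState()
    for state in states:
        merged.merge(state)
    return merged.finalize()


def median_udf(values):
    # Non-decomposable: the function must see all values at once.
    return statistics.median(values)


batches = [[1.0, 2.0], [3.0, 10.0, 4.0]]
all_values = [v for batch in batches for v in batch]

# Combining per-batch medians does not give the true median.
naive_median = median_udf([median_udf(batch) for batch in batches])

print(decomposed_mean(batches))  # mean from partial states
print(median_udf(all_values))    # true median over all values
print(naive_median)              # differs from the true median
```

Here the per-batch states are enough to recover the mean, but the median of per-batch medians differs from the median over all values, which is why the UDF has to receive the full input.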

The major issue with a non-decomposable UDF is that it needs to see all the data at once, unlike a scalar UDF, which only needs to see one batch at a time. On its own this makes a non-decomposable UDF not very useful, since it is equivalent to collecting all the data into a pd.DataFrame and applying the UDF to it. However, one very useful application of non-decomposable UDFs is with segmented aggregation. To refresh, segmented aggregation works on ordered data and is passed one logical chunk at a time (e.g., all data with the same date). With segmented aggregation and a non-decomposable aggregation UDF, the user can apply any custom aggregation logic over a large stream of ordered data, with the memory overhead of a single segment.
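As a rough illustration of the idea (a pure-Python sketch, not the Acero implementation; `segmented_aggregate` and the row layout are assumptions made for this example), a stream ordered by date can be aggregated one segment at a time, so only a single segment is held in memory:

```python
from itertools import groupby
import statistics


def segmented_aggregate(rows, key, udf):
    """Yield (segment_key, udf(values)) for each run of consecutive rows
    that share the same key. Assumes rows are already ordered by key."""
    for segment_key, segment in groupby(rows, key=lambda row: row[key]):
        # Only the current segment is materialized in memory.
        values = [row["value"] for row in segment]
        yield segment_key, udf(values)


# An iterator stands in for a large ordered stream.
rows = iter([
    {"date": "2023-05-01", "value": 1.0},
    {"date": "2023-05-01", "value": 5.0},
    {"date": "2023-05-01", "value": 3.0},
    {"date": "2023-05-02", "value": 2.0},
    {"date": "2023-05-02", "value": 8.0},
])

result = list(segmented_aggregate(rows, "date", statistics.median))
print(result)
```

The UDF (here `statistics.median`) still sees all values of a segment at once, but never more than one segment.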

What changes are included in this PR?

This PR is currently WIP and not ready for review.

So far I have implemented the minimal amount of code needed to make a basic test work, but it still needs cleanup, error handling, etc.

First round of self review
Second round of self review
Implement and test unary
Implement and test varargs
Implement and test Acero support with segmented aggregation

Are these changes tested?

Added new tests that call the UDF through compute and through Acero.

The compute tests call the aggregation on the full array. The Acero tests call the aggregation with segmented aggregation.

Are there any user-facing changes?

Closes: [C++][Python] Non decomposable aggregation UDF #35515

github-actions · 2023-05-09T15:16:40Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

github-actions · 2023-05-09T15:18:58Z

Closes: [C++][Python] Non decomposable aggregation UDF #35515

github-actions · 2023-05-09T15:19:00Z

⚠️ GitHub issue #35515 has been automatically assigned in GitHub to PR creator.

icexelloss · 2023-05-09T15:32:36Z

python/pyarrow/src/arrow/python/udf.cc

@@ -65,6 +70,26 @@ struct PythonUdfKernelInit {
  std::shared_ptr<OwnedRefNoGIL> function;
 };

+struct ScalarUdfAggregator : public compute::KernelState {


"Scalar" as opposed to the "grouped" aggregator, which has a different interface:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/hash_aggregate.cc#L66

python/pyarrow/tests/test_udf.py

python/pyarrow/src/arrow/python/udf.cc

cpp/src/arrow/engine/substrait/extension_set.cc

icexelloss · 2023-06-07T14:16:17Z

@westonpace I believe this PR is good to go. The failed CI seems unrelated. I have checked the Py refcount and it seems OK (I will add details in the comment thread above)

westonpace · 2023-06-07T16:28:13Z

@icexelloss I'll take another look through today.

icexelloss · 2023-06-07T18:58:40Z

> @icexelloss I'll take another look through today.

Thank you!

westonpace

A few more very minor suggestions but, overall, I think this is fine.

python/pyarrow/_compute.pyx

python/pyarrow/src/arrow/python/udf.cc

westonpace · 2023-06-08T06:11:48Z

python/pyarrow/conftest.py

+                                       "x": pa.int64(),
+                                       "y": pa.float64()


Ok, so the test case is verifying that the python function can take in *args if needed (even though it still lists the args when registering)?

westonpace · 2023-06-08T06:17:10Z

python/pyarrow/src/arrow/python/udf.cc

+                                std::vector<std::shared_ptr<DataType>> input_types,
+                                std::shared_ptr<DataType> output_type)
+      : agg_cb(agg_cb), agg_function(agg_function), output_type(output_type) {
+    Py_INCREF(agg_function->obj());


This increment seems redundant given you already have one here.

Admittedly there could be some redundancy here. I created a follow-up to take a closer look:

#36000

python/pyarrow/src/arrow/python/udf.cc

Co-authored-by: Weston Pace <weston.pace@gmail.com>

icexelloss · 2023-06-08T18:11:44Z

I checked failed CI jobs and those seem unrelated.

ursabot · 2023-06-10T01:04:19Z

Benchmark runs are scheduled for baseline = e920bed and contender = 8b5919d. 8b5919d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.15% ⬆️0.06%] test-mac-arm
[Finished ⬇️10.46% ⬆️6.21%] ursa-i9-9960x
[Finished ⬇️0.3% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 8b5919d8 ec2-t3-xlarge-us-east-2
[Finished] 8b5919d8 test-mac-arm
[Finished] 8b5919d8 ursa-i9-9960x
[Finished] 8b5919d8 ursa-thinkcentre-m75q
[Finished] e920bed4 ec2-t3-xlarge-us-east-2
[Finished] e920bed4 test-mac-arm
[Finished] e920bed4 ursa-i9-9960x
[Finished] e920bed4 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

icexelloss requested a review from AlenkaF as a code owner May 9, 2023 15:16

github-actions bot added Component: C++ Component: Python labels May 9, 2023

icexelloss changed the title ~~WIP: [C++][Python] Add non decomposable aggregation UDF~~ GH-35515: [WIP][C++][Python] Add non decomposable aggregation UDF May 9, 2023

github-actions bot added the awaiting committer review label May 9, 2023

icexelloss force-pushed the acero-group-agg-UDF-2 branch 2 times, most recently from 0dd8e25 to 2b38a2f Compare May 9, 2023 15:31

github-actions bot removed the Component: C++ label May 9, 2023

icexelloss commented May 9, 2023

github-actions bot added awaiting changes, awaiting change review and removed awaiting committer review, awaiting changes labels May 9, 2023

icexelloss commented May 9, 2023

python/pyarrow/tests/test_udf.py (outdated, resolved)

icexelloss commented May 9, 2023

python/pyarrow/src/arrow/python/udf.cc (outdated, resolved)

github-actions bot added awaiting changes, awaiting change review and removed awaiting change review, awaiting changes labels May 9, 2023

icexelloss commented May 15, 2023

python/pyarrow/src/arrow/python/udf.cc (outdated, resolved)

github-actions bot added awaiting changes and removed awaiting change review labels May 15, 2023

icexelloss requested a review from westonpace as a code owner May 15, 2023 22:08

github-actions bot added Component: C++, awaiting change review and removed awaiting changes labels May 15, 2023

icexelloss commented May 16, 2023

cpp/src/arrow/engine/substrait/extension_set.cc (outdated, resolved)

icexelloss force-pushed the acero-group-agg-UDF-2 branch from d5e63df to dc1d734 Compare June 5, 2023 20:25

github-actions bot added awaiting change review and removed awaiting changes labels Jun 5, 2023

Fix core-dump when running with Python dev mode

1203346

icexelloss force-pushed the acero-group-agg-UDF-2 branch from 7ba9cc8 to 1203346 Compare June 6, 2023 15:39

icexelloss added 2 commits June 6, 2023 14:03

Lint

84c1e91

Try fixing numpydoc lint

17ff274

icexelloss force-pushed the acero-group-agg-UDF-2 branch from 6044f20 to 17ff274 Compare June 6, 2023 21:53

github-actions bot added awaiting changes and removed awaiting change review labels Jun 7, 2023

westonpace approved these changes Jun 8, 2023

github-actions bot added awaiting merge, awaiting changes and removed awaiting changes, awaiting merge labels Jun 8, 2023

Apply suggestions from code review

7f65599

Co-authored-by: Weston Pace <weston.pace@gmail.com>

github-actions bot added awaiting change review and removed awaiting changes labels Jun 8, 2023

Lint fix

febf6cc

icexelloss merged commit 8b5919d into apache:main Jun 8, 2023

icexelloss mentioned this pull request Jun 8, 2023

[Python][C++] Simplify/Validate PyObject refcount in udf.cc #36000

Open

github-actions bot added awaiting changes and removed awaiting change review labels Jun 8, 2023

icexelloss mentioned this pull request Jun 22, 2023

GH-36252: [Python] Add non decomposable hash aggregate UDF #36253

Merged

2 tasks

jorisvandenbossche mentioned this pull request Jan 23, 2024

GH-39640: [Docs] Pin pydata-sphinx-theme to 0.14.* #39758

Merged
