Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35515: [C++][Python] Add non decomposable aggregation UDF #35514

Merged
merged 19 commits into from
Jun 8, 2023

Conversation

icexelloss
Copy link
Contributor

@icexelloss icexelloss commented May 9, 2023

Rationale for this change

Non decomposable aggregation is aggregation that cannot be split into consume/merge/finalize. This is often when the logic rewritten with external python libraries (numpy, pandas, statmodels, etc) and those either cannot be decomposed or not worthy the effect (these are often one-off function instead of reusable one). This PR implements the support for non decomposable aggregation UDFs.

The major issue with non decomposable UDF is that the UDF needs to see all data at once, unlike scalar UDF where UDF only needs to see a batch at a time. This makes non decomposable not so useful as it is same as collect all the data to a pd.DataFrame and apply the UDF on it. However, one very application of non decomposable UDF is with segmented aggregation. To refresh, segmented aggregation works on ordered data and passed one logic chunk at a time (e.g., all data with the same date). With segmented aggregation and non decomposable aggregation UDF, the user can apply any custom aggregation logic over large stream of ordered data, with the memory overhead of a single segment.

What changes are included in this PR?

This PR is currently WIP and not ready for review.

So far I have implemented the minimal amount of code to make a basic test working but needs clean up, error handling etc.

  • First round of self review
  • Second round of self review
  • Implement and test unary
  • Implement and test varargs
  • Implement and test Acero support with segmented aggregation

Are these changes tested?

Added new test calling with compute and acero.

The compute tests calls the aggregation on the full array. The acero test callings the aggregation with segmented aggregation.

Are there any user-facing changes?

@icexelloss icexelloss requested a review from AlenkaF as a code owner May 9, 2023 15:16
@github-actions
Copy link

github-actions bot commented May 9, 2023

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@icexelloss icexelloss changed the title WIP: [C++][Python] Add non decomposable aggregation UDF GH-35515: [WIP][C++][Python] Add non decomposable aggregation UDF May 9, 2023
@github-actions
Copy link

github-actions bot commented May 9, 2023

@github-actions
Copy link

github-actions bot commented May 9, 2023

⚠️ GitHub issue #35515 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label May 9, 2023
@icexelloss icexelloss force-pushed the acero-group-agg-UDF-2 branch 2 times, most recently from 0dd8e25 to 2b38a2f Compare May 9, 2023 15:31
@@ -65,6 +70,26 @@ struct PythonUdfKernelInit {
std::shared_ptr<OwnedRefNoGIL> function;
};

struct ScalarUdfAggregator : public compute::KernelState {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Scalar" as supposed to the "grouped" aggregator which has difference interface:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/hash_aggregate.cc#L66

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels May 9, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels May 9, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 15, 2023
@github-actions github-actions bot added Component: C++ awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 15, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 5, 2023
@icexelloss
Copy link
Contributor Author

@westonpace I believe this PR is good to go. The failed CI seems unrelated. I have checked the Py refcount and it seems OK (I will add details in the comment thread above)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 7, 2023
@westonpace
Copy link
Member

@icexelloss I'll take another look through today.

@icexelloss
Copy link
Contributor Author

@icexelloss I'll take another look through today.

Thank you!

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more very minor suggestions but, overall, I think this is fine.

python/pyarrow/_compute.pyx Show resolved Hide resolved
python/pyarrow/src/arrow/python/udf.cc Outdated Show resolved Hide resolved
python/pyarrow/src/arrow/python/udf.cc Show resolved Hide resolved
Comment on lines +331 to +332
"x": pa.int64(),
"y": pa.float64()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so the test case is verifying that the python function can take in *args if needed (even though it still lists the args when registering)?

std::vector<std::shared_ptr<DataType>> input_types,
std::shared_ptr<DataType> output_type)
: agg_cb(agg_cb), agg_function(agg_function), output_type(output_type) {
Py_INCREF(agg_function->obj());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This increment seems redundant given you already have one here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admitted there could be some redundancy here. I created an follow up to take a closer look:

#36000

python/pyarrow/src/arrow/python/udf.cc Outdated Show resolved Hide resolved
python/pyarrow/src/arrow/python/udf.cc Outdated Show resolved Hide resolved
python/pyarrow/src/arrow/python/udf.cc Outdated Show resolved Hide resolved
python/pyarrow/src/arrow/python/udf.cc Outdated Show resolved Hide resolved
python/pyarrow/src/arrow/python/udf.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting merge Awaiting merge awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting merge Awaiting merge labels Jun 8, 2023
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 8, 2023
@icexelloss
Copy link
Contributor Author

I checked failed CI jobs and those seem unrelated.

@icexelloss icexelloss merged commit 8b5919d into apache:main Jun 8, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 8, 2023
@ursabot
Copy link

ursabot commented Jun 10, 2023

Benchmark runs are scheduled for baseline = e920bed and contender = 8b5919d. 8b5919d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.15% ⬆️0.06%] test-mac-arm
[Finished ⬇️10.46% ⬆️6.21%] ursa-i9-9960x
[Finished ⬇️0.3% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 8b5919d8 ec2-t3-xlarge-us-east-2
[Finished] 8b5919d8 test-mac-arm
[Finished] 8b5919d8 ursa-i9-9960x
[Finished] 8b5919d8 ursa-thinkcentre-m75q
[Finished] e920bed4 ec2-t3-xlarge-us-east-2
[Finished] e920bed4 test-mac-arm
[Finished] e920bed4 ursa-i9-9960x
[Finished] e920bed4 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Python] Non decomposable aggregation UDF
4 participants