GH-36252: [Python] Add non decomposable hash aggregate UDF #36253
Conversation
@@ -355,18 +536,10 @@ Status RegisterTabularFunction(PyObject* user_function, UdfWrapperCallback wrapp
                                wrapper, options, registry);
}

Status AddAggKernel(std::shared_ptr<compute::KernelSignature> sig,
This is inlined now.
@westonpace I would like to request a review on this PR. The code should be relatively straightforward and similar to #35514, so hopefully no confusion/surprises here. For the implementation of the grouping, I used an approach similar to … For the registration, I decided to register both scalar/hash kernels with one API … More details are in the PR description. Let me know if those sound OK to you.
@icexelloss I should have some time to take a look tomorrow
This is a good set of tests. It's nice and convenient that the same python implementation can work for both. I have a few minor suggestions / questions. I think the only concerning thing is that we need to restrict group sizes to things that will fit in a single batch. Do you think this will be a problem for your use cases?
const ArraySpan& groups_array_data = batch[batch.num_values() - 1].array;
DCHECK_EQ(groups_array_data.offset, 0);
int64_t batch_num_values = groups_array_data.length;
const auto* batch_groups = groups_array_data.GetValues<uint32_t>(1, 0);
Why not just groups_array_data.GetValues<uint32_t>(1)?
Good catch - updated
}

num_values += other.num_values;
return Status::OK();
Does num_groups need to be updated here?
I don't think num_groups needs to be updated here. Reasoning, based on the code in https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/hash_aggregate.cc#L233 and https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/groupby_aggregate_node.cc#L248:
(1) the other hash kernel implementations update num_groups in Resize;
(2) Resize is always called before Consume and Merge.
UdfContext udf_context{ctx->memory_pool(), table->num_rows()};

if (rb->num_rows() == 0) {
  return Status::Invalid("Finalized is called with empty inputs");
Why is this a problem?
Good catch - I was being lazy here and didn't want to bother with empty aggregations. But now that I look at this, I can just return an empty result here. Will update.
Updated
ARROW_ASSIGN_OR_RAISE(auto table,
                      arrow::Table::FromRecordBatches(input_schema, values));
ARROW_ASSIGN_OR_RAISE(auto rb, table->CombineChunksToBatch(ctx->memory_pool()));
There are some cases where this won't be possible. For example, if you have an array of strings, then the array may only hold 2GB of string data (regardless of how many elements it has), so any single group can't have more than 2GB of string data. I don't know if this is fatal, but you may want to mention it in the user docs somewhere or wrap this failure with extra context.
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Thanks @westonpace. Currently we only plan to use this with segmented aggregation (grouping inside a segment), so each group is not going to be very large and I don't think it would be a problem.
@westonpace This should be clean now (all comments addressed, CI green) - another look?
Gentle ping @westonpace: anything else you want me to change here?
Minor wording suggestion. Otherwise this looks good.
std::vector<std::shared_ptr<DataType>> input_types,
std::shared_ptr<DataType> output_type)
    : function(function), cb(std::move(cb)), output_type(std::move(output_type)) {
  Py_INCREF(function->obj());
These INCREFs still seem superfluous to me, but I don't think it's critical. We could test in a follow-up using temporary function registries to see if we are preventing UDF functions from being garbage collected.
I agree with you. I plan to address this in #36000 but haven't got to it.
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Thanks @westonpace. I applied your suggestion and will merge once CI passes.
CI failure is unrelated. Merging.
Conbench analyzed the 6 benchmark runs on this commit. There was 1 benchmark result with an error. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Rationale for this change
In #35515, I implemented a Scalar version of the non decomposable aggregate UDF (Scalar as in SCALAR_AGGREGATE). I would like to also support the Hash version of it (Hash as in HASH_AGGREGATE).
With this PR, a user can register an aggregate UDF once with pc.register_aggregate_function, and it can be used as both a scalar aggregate function and a hash aggregate function. Example:
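The original example is not captured in this extract. Below is a minimal sketch of what registration and use could look like, based on the description above; the function name my_median, the column names, and the doc strings are illustrative rather than taken from the PR.

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# Illustrative non decomposable aggregate UDF: it receives all values for a
# group (or for the whole column) at once and returns a single scalar.
def median_udf(ctx, x):
    return pa.scalar(np.median(x))

func_doc = {"summary": "median UDF",
            "description": "Compute the median of the input column."}
pc.register_aggregate_function(median_udf, "my_median", func_doc,
                               {"x": pa.float64()}, pa.float64())

table = pa.table({"k": ["a", "a", "b"], "v": [1.0, 2.0, 3.0]})

# Used as a scalar aggregate function (one result for the whole column)...
print(pc.call_function("my_median", [table["v"]]))

# ...and as a hash aggregate function through group_by().aggregate().
print(table.group_by("k").aggregate([("v", "my_median")]))

The same Python callable backs both the scalar and hash kernels registered under the one function name.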
What changes are included in this PR?
The main changes are:
- UdfWrapperCallback objects are now named cb (previously agg_cb or wrapper), and the user-defined Python function is now just called function (previously agg_function).
- For table.group_by().aggregate(...), the space complexity is O(n), where n is the size of the table (and is therefore not very useful). However, this is more useful in the segmented aggregation case, where the space complexity is O(s), where s is the size of the segment.
Are these changes tested?
Added new tests in test_udf.py (with table.group_by().aggregate()) and in test_substrait.py (with segmented aggregation).
Are there any user-facing changes?
Yes. With this change, users can use a registered aggregate UDF with table.group_by().aggregate() or with Acero's segmented aggregation.
Checklist