-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-34333: [Python] Test run_query with a registered scalar UDF #34373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
python/pyarrow/tests/test_udf.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because a known bug that Acero doesn't name the result columns correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may explain the problem we are seeing in ibis-project/ibis-substrait#414
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah there is an issue from a while back: #33434
|
cc @westonpace |
8c54d1f to
2052b4f
Compare
|
|
| FromProto(expr, ext_set, conversion_options)); | ||
| auto bound_expr = des_expr.Bind(*input.output_schema); | ||
| if (auto* expr_call = bound_expr->call()) { | ||
| ARROW_ASSIGN_OR_RAISE( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this for better error handling - before it will segfault if the function cannot be found, now it raises error to user, e.g.
> raise ArrowKeyError(message)
E pyarrow.lib.ArrowKeyError: No function registered with name: my_udf
E /home/icexelloss/workspace/arrow/cpp/src/arrow/compute/exec/expression.cc:534 GetFunction(call, exec_context)
E /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:550 des_expr.Bind(*input.output_schema)
E /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/serde.cc:157 FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
E /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/serde.cc:200 DeserializePlans(buf, MakeNoSinkDeclarationFactory(), registry, ext_set_out, conversion_options)
E /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/util.cc:47 DeserializePlan(substrait_buffer, registry, nullptr, conversion_options)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add test for this.
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. A few nit-picks but good to have more testing here.
Not related to this PR but I would love to move away from these large JSON blobs in tests. I'm open to ideas if you have them :)
python/pyarrow/tests/test_udf.py
Outdated
|
|
||
| import pyarrow as pa | ||
| from pyarrow import compute as pc | ||
| from pyarrow.lib import tobytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are in pure-python context (and not a cython context) I think you can just use:
substrait_query = b"""
...
Then you don't have to rely on tobytes. Even if that doesn't work I think substrait_query.encode() is still preferable over tobytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
python/pyarrow/tests/test_udf.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may explain the problem we are seeing in ibis-project/ibis-substrait#414
python/pyarrow/tests/test_udf.py
Outdated
| { | ||
| "root": { | ||
| "input": { | ||
| "project": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why two project nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a very good reason, mostly because I was reusing my existing code to generate the json that has two projections:
dt = ...
dt = dt[['p', 't']]
dt = dt.assign(p2=...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed first the projection manually, will push a revision soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated and simplified the plan
|
@westonpace I am not sure what this error is about: from this: Any idea? (This passes for me locally) |
I think what we can do for now is too keep these json test to minimum and do more heavy testing with ibis -> Acero integration. This is basically what we are doing with our internal integration testing. |
I see what the issue is - let me try to organize the tests to bypass this |
|
@westonpace Ok build is green now let me know if any changes you would like me make |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some very minor suggestions. Feel free to take them or merge as-is.
| def test_udf_via_substrait(unary_func_fixture, use_threads): | ||
| test_table_1 = pa.Table.from_pydict({"x": [1, 2, 3]}) | ||
|
|
||
| def table_provider(names, _): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could even simplify to:
def table_provider(_names, _schema):
return test_table
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
|
Benchmark runs are scheduled for baseline = 4c1448e and contender = e8107bf. e8107bf is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Rationale for this change
Currently Acero has a way to execute a registered UDF via substrait however there are no tests for it.
What changes are included in this PR?
This PR adds a test for passing a registered UDF via a substrait plan.
Are these changes tested?
N/A
Are there any user-facing changes?
No