Conversation

@icexelloss
Contributor

@icexelloss icexelloss commented Feb 27, 2023

Rationale for this change

Currently, Acero can execute a registered UDF via Substrait; however, there are no tests for it.

What changes are included in this PR?

This PR adds a test for executing a registered UDF via a Substrait plan.

Are these changes tested?

N/A

Are there any user-facing changes?

No

@icexelloss icexelloss requested a review from AlenkaF as a code owner February 27, 2023 19:47
@github-actions

github-actions bot commented Feb 27, 2023

Contributor Author

This is because of a known bug: Acero doesn't name the result columns correctly.

Member

This may explain the problem we are seeing in ibis-project/ibis-substrait#414

Contributor Author

Yeah there is an issue from a while back: #33434

@icexelloss
Contributor Author

cc @westonpace

@icexelloss icexelloss force-pushed the pyarrow-run-query-udf branch from 8c54d1f to 2052b4f Compare February 27, 2023 19:53
@icexelloss icexelloss changed the title from "GH-34366: [Python] Test run_query with a registered scalar UDF" to "GH-34333: [Python] Test run_query with a registered scalar UDF" Feb 27, 2023
@github-actions

⚠️ GitHub issue #34333 has been automatically assigned in GitHub to PR creator.

FromProto(expr, ext_set, conversion_options));
auto bound_expr = des_expr.Bind(*input.output_schema);
if (auto* expr_call = bound_expr->call()) {
ARROW_ASSIGN_OR_RAISE(
Contributor Author

Added this for better error handling. Previously it would segfault if the function could not be found; now it raises an error to the user, e.g.:

>   raise ArrowKeyError(message)
E   pyarrow.lib.ArrowKeyError: No function registered with name: my_udf
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/compute/exec/expression.cc:534  GetFunction(call, exec_context)
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:550  des_expr.Bind(*input.output_schema)
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/serde.cc:157  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/serde.cc:200  DeserializePlans(buf, MakeNoSinkDeclarationFactory(), registry, ext_set_out, conversion_options)
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/util.cc:47  DeserializePlan(substrait_buffer, registry, nullptr, conversion_options)

Member

Good catch.

Member

Can you also add a test for this?

Contributor Author

Added a test for this.

Member

@westonpace westonpace left a comment

Thanks. A few nit-picks but good to have more testing here.

Not related to this PR but I would love to move away from these large JSON blobs in tests. I'm open to ideas if you have them :)


import pyarrow as pa
from pyarrow import compute as pc
from pyarrow.lib import tobytes
Member

Since we are in pure-python context (and not a cython context) I think you can just use:

substrait_query = b"""
...

Then you don't have to rely on tobytes. Even if that doesn't work I think substrait_query.encode() is still preferable over tobytes.
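
A minimal sketch of the two alternatives mentioned above, assuming an ASCII-only plan (a bytes literal cannot contain non-ASCII characters, which is one case where `.encode()` would be needed instead). The JSON body here is a placeholder for illustration, not the actual Substrait plan from the test:

```python
import json

# Option 1: a bytes literal. Works as long as the text is pure ASCII.
substrait_query_bytes = b"""
{
  "relations": []
}
"""

# Option 2: a regular string encoded explicitly (UTF-8 by default).
substrait_query_str = """
{
  "relations": []
}
"""
encoded = substrait_query_str.encode()

# Both spellings produce identical bytes for ASCII content.
print(encoded == substrait_query_bytes)

# json.loads accepts bytes directly, so either form can be sanity-checked.
print(json.loads(substrait_query_bytes))
```

Either form avoids the dependency on `pyarrow.lib.tobytes` in a pure-Python test.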

Contributor Author

Fixed.

{
"root": {
"input": {
"project": {
Member

Why two project nodes?

Contributor Author

No very good reason; mostly because I was reusing my existing code to generate the JSON, and that code has two projections:

dt = ...
dt = dt[['p', 't']]
dt = dt.assign(p2=...)

Contributor Author

I removed the first projection manually; will push a revision soon.

Contributor Author

Updated and simplified the plan

@github-actions github-actions bot added the "awaiting changes" label Feb 28, 2023
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Feb 28, 2023
@icexelloss
Contributor Author

icexelloss commented Feb 28, 2023

@westonpace I am not sure what this error is about:

==================================== ERRORS ====================================
_ ERROR collecting opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_udf.py _
ImportError while importing test module '/opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_udf.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
opt/conda/envs/arrow/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_udf.py:22: in <module>
    import pyarrow._substrait
E   ModuleNotFoundError: No module named 'pyarrow._substrait'

from this:
https://github.com/apache/arrow/actions/runs/4295311322/jobs/7485550126

Any idea?

(This passes for me locally)

@icexelloss
Contributor Author

> Not related to this PR but I would love to move away from these large JSON blobs in tests. I'm open to ideas if you have them :)

I think what we can do for now is to keep these JSON tests to a minimum and do heavier testing with the ibis -> Acero integration. This is basically what we are doing with our internal integration testing.

@icexelloss
Contributor Author

I see what the issue is with the `pyarrow._substrait` import error - let me try to reorganize the tests to work around it.
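
One common way to keep a test module importable on builds that lack an optional extension is to guard the import. The following is a minimal sketch under that assumption; the thread does not show the actual reorganization, and `optional_import` is a hypothetical helper, not part of pyarrow:

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None


# Resolves to None on builds without Substrait support (or without pyarrow).
substrait = optional_import("pyarrow.substrait")

if substrait is None:
    print("Substrait support not available; Substrait tests would be skipped")
else:
    print("Substrait support available")
```

In a pytest module, the resulting `substrait is None` check could feed a `pytest.mark.skipif(...)` marker so that collection succeeds and only the Substrait-dependent tests are skipped.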

@icexelloss
Contributor Author

@westonpace OK, the build is green now. Let me know if there are any changes you would like me to make.

Member

@westonpace westonpace left a comment

Some very minor suggestions. Feel free to take them or merge as-is.

def test_udf_via_substrait(unary_func_fixture, use_threads):
    test_table_1 = pa.Table.from_pydict({"x": [1, 2, 3]})

    def table_provider(names, _):
Member

You could even simplify to:

def table_provider(_names, _schema):
  return test_table
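
For contrast with the single-table simplification above, a provider that serves more than one table needs to inspect `names`. A rough sketch, assuming the callback receives the table name as a sequence of strings plus a schema, as in the test; `make_table_provider` and the `tables` mapping are hypothetical names, and plain strings stand in for `pa.Table` objects so the block runs without pyarrow:

```python
def make_table_provider(tables):
    """Build a callback with the (names, schema) shape used by the test.

    `tables` maps a dotted table name to the object to return for it.
    This helper is hypothetical; it is not part of pyarrow.
    """
    def table_provider(names, _schema):
        # `names` is expected to be a sequence of strings, e.g. ["t1"].
        key = ".".join(names)
        if key not in tables:
            raise KeyError(f"Unrecognized table name: {key}")
        return tables[key]

    return table_provider


# Stand-in values; in a real test these would be pa.Table objects.
provider = make_table_provider({"t1": "table one", "db.t2": "table two"})
print(provider(["t1"], None))
print(provider(["db", "t2"], None))
```

When a plan reads only one table, as in this test, ignoring both arguments is the simpler choice.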

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label Feb 28, 2023
icexelloss and others added 2 commits February 28, 2023 17:20
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@icexelloss icexelloss merged commit e8107bf into apache:main Mar 1, 2023
@ursabot

ursabot commented Mar 2, 2023

Benchmark runs are scheduled for baseline = 4c1448e and contender = e8107bf. e8107bf is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.31% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.13% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] e8107bfa ec2-t3-xlarge-us-east-2
[Finished] e8107bfa test-mac-arm
[Finished] e8107bfa ursa-i9-9960x
[Failed] e8107bfa ursa-thinkcentre-m75q
[Finished] 4c1448e8 ec2-t3-xlarge-us-east-2
[Finished] 4c1448e8 test-mac-arm
[Finished] 4c1448e8 ursa-i9-9960x
[Finished] 4c1448e8 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

