GH-34333: [Python] Test run_query with a registered scalar UDF #34373

icexelloss · 2023-02-27T19:47:03Z

Rationale for this change

Currently Acero can execute a registered UDF via Substrait, but there are no tests for it.

What changes are included in this PR?

This PR adds a test for passing a registered UDF via a substrait plan.

Are these changes tested?

N/A

Are there any user-facing changes?

No

Closes: [Python] Test run_query with a registered UDF #34333

github-actions · 2023-02-27T19:47:29Z

Closes: [Python] Test run_query with a registered UDF #34333

icexelloss · 2023-02-27T19:47:52Z

python/pyarrow/tests/test_udf.py

This is because of a known bug: Acero doesn't name the result columns correctly.

This may explain the problem we are seeing in ibis-project/ibis-substrait#414

Yeah there is an issue from a while back: #33434

icexelloss · 2023-02-27T19:48:08Z

cc @westonpace

github-actions · 2023-02-27T19:55:15Z

⚠️ GitHub issue #34333 has been automatically assigned in GitHub to PR creator.

icexelloss · 2023-02-27T21:12:18Z

cpp/src/arrow/engine/substrait/relation_internal.cc

                              FromProto(expr, ext_set, conversion_options));
-        auto bound_expr = des_expr.Bind(*input.output_schema);
-        if (auto* expr_call = bound_expr->call()) {
+        ARROW_ASSIGN_OR_RAISE(


Added this for better error handling. Previously it would segfault if the function could not be found; now it raises an error to the user, e.g.

>   raise ArrowKeyError(message)
E   pyarrow.lib.ArrowKeyError: No function registered with name: my_udf
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/compute/exec/expression.cc:534  GetFunction(call, exec_context)
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:550  des_expr.Bind(*input.output_schema)
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/serde.cc:157  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/serde.cc:200  DeserializePlans(buf, MakeNoSinkDeclarationFactory(), registry, ext_set_out, conversion_options)
E   /home/icexelloss/workspace/arrow/cpp/src/arrow/engine/substrait/util.cc:47  DeserializePlan(substrait_buffer, registry, nullptr, conversion_options)

Good catch.

Can you also add a test for this?

Added a test for this.

westonpace

Thanks. A few nit-picks but good to have more testing here.

Not related to this PR but I would love to move away from these large JSON blobs in tests. I'm open to ideas if you have them :)

westonpace · 2023-02-28T02:56:38Z

python/pyarrow/tests/test_udf.py


 import pyarrow as pa
 from pyarrow import compute as pc
+from pyarrow.lib import tobytes


Since we are in pure-python context (and not a cython context) I think you can just use:

substrait_query = b""" ...

Then you don't have to rely on tobytes. Even if that doesn't work I think substrait_query.encode() is still preferable over tobytes.
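To illustrate the suggestion above: in plain Python, a bytes literal and `str.encode()` both give a `bytes` object, so no Cython helper is needed. This is only a minimal sketch; the JSON text is a placeholder fragment, not the Substrait plan used in this PR.

```python
import json

# Placeholder JSON text standing in for the Substrait plan used in the test.
plan_text = """
{
  "root": {
    "input": {}
  }
}
"""

# Option 1: write the literal as bytes directly.
plan_bytes_literal = b"""
{
  "root": {
    "input": {}
  }
}
"""

# Option 2: encode an existing str (UTF-8 is the default encoding).
plan_bytes_encoded = plan_text.encode()

# Both options produce identical bytes for this ASCII-only text.
assert plan_bytes_literal == plan_bytes_encoded
# json.loads accepts bytes as well as str, so either form still parses.
parsed = json.loads(plan_bytes_literal)
```

The bytes literal only works when the text is ASCII; for anything else, `.encode()` is the form to use.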

westonpace · 2023-02-28T02:58:57Z

python/pyarrow/tests/test_udf.py

This may explain the problem we are seeing in ibis-project/ibis-substrait#414

westonpace · 2023-02-28T02:59:45Z

python/pyarrow/tests/test_udf.py

+    {
+      "root": {
+        "input": {
+          "project": {


Why two project nodes?

No very good reason; mostly I was reusing my existing code to generate the JSON, and that code has two projections:

dt = ...
dt = dt[['p', 't']]
dt = dt.assign(p2=...)

I removed the first projection manually and will push a revision soon.

Updated and simplified the plan

icexelloss · 2023-02-28T18:35:00Z

@westonpace I am not sure what this error is about:

==================================== ERRORS ====================================
_ ERROR collecting opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_udf.py _
ImportError while importing test module '/opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_udf.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
opt/conda/envs/arrow/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_udf.py:22: in <module>
    import pyarrow._substrait
E   ModuleNotFoundError: No module named 'pyarrow._substrait'

from this:
https://github.com/apache/arrow/actions/runs/4295311322/jobs/7485550126

Any idea?

(This passes for me locally)

icexelloss · 2023-02-28T18:40:35Z

Not related to this PR but I would love to move away from these large JSON blobs in tests. I'm open to ideas if you have them :)

I think what we can do for now is to keep these JSON tests to a minimum and do heavier testing with the ibis -> Acero integration. This is basically what we are doing with our internal integration testing.

icexelloss · 2023-02-28T20:28:32Z

@westonpace I am not sure what this error is about:

I see what the issue is - let me try to organize the tests to bypass this
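For context, the collection error quoted in this thread comes from an unconditional `import pyarrow._substrait` in a build without that module. One generic way to keep a test module importable in that situation is to guard the optional import. The sketch below shows that pattern only; the helper name `optional_import` and the flag `HAS_SUBSTRAIT` are hypothetical, and this is not necessarily how the tests were reorganized in this PR.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if it is not available."""
    try:
        return importlib.import_module(name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None


# None when pyarrow (or its Substrait component) is not installed.
substrait = optional_import("pyarrow.substrait")
HAS_SUBSTRAIT = substrait is not None
```

A test could then be skipped when `HAS_SUBSTRAIT` is false, for example with `pytest.mark.skipif`; `pytest.importorskip` is a ready-made alternative with a similar effect.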

icexelloss · 2023-02-28T21:48:50Z

@westonpace OK, the build is green now. Let me know if there are any changes you would like me to make.

westonpace

Some very minor suggestions. Feel free to take them or merge as-is.

python/pyarrow/tests/test_substrait.py

westonpace · 2023-02-28T21:59:21Z

python/pyarrow/tests/test_substrait.py

+def test_udf_via_substrait(unary_func_fixture, use_threads):
+    test_table_1 = pa.Table.from_pydict({"x": [1, 2, 3]})
+
+    def table_provider(names, _):


You could even simplify to:

def table_provider(_names, _schema):
    return test_table
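The simplified provider ignores both of its arguments and always returns the same table. Below is a small sketch of that callback shape. It uses a hypothetical factory, `make_table_provider`, and a plain dict in place of a `pa.Table` so that it runs without pyarrow.

```python
def make_table_provider(table):
    """Build a provider that returns `table` regardless of its arguments."""

    def table_provider(_names, _schema):
        # Leading underscores mark the parameters as intentionally unused.
        return table

    return table_provider


# Stand-in for pa.Table.from_pydict({"x": [1, 2, 3]}).
test_table = {"x": [1, 2, 3]}
provider = make_table_provider(test_table)

# The same object comes back whatever names/schema are passed in.
result = provider(["t1"], None)
```

In the actual test, a callback of this shape is what gets passed as `table_provider` when running the Substrait plan.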

ursabot · 2023-03-02T10:23:42Z

Benchmark runs are scheduled for baseline = 4c1448e and contender = e8107bf. e8107bf is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.31% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.13% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] e8107bfa ec2-t3-xlarge-us-east-2
[Finished] e8107bfa test-mac-arm
[Finished] e8107bfa ursa-i9-9960x
[Failed] e8107bfa ursa-thinkcentre-m75q
[Finished] 4c1448e8 ec2-t3-xlarge-us-east-2
[Finished] 4c1448e8 test-mac-arm
[Finished] 4c1448e8 ursa-i9-9960x
[Finished] 4c1448e8 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

icexelloss requested a review from AlenkaF as a code owner February 27, 2023 19:47

github-actions bot added the Component: Python label Feb 27, 2023

GH-34333: [Python] Test run_query with a registered scalar UDF

2052b4f

icexelloss force-pushed the pyarrow-run-query-udf branch from 8c54d1f to 2052b4f Compare February 27, 2023 19:53

icexelloss changed the title ~~GH-34366: [Python] Test run_query with a registered scalar UDF~~ GH-34333: [Python] Test run_query with a registered scalar UDF Feb 27, 2023

icexelloss requested a review from westonpace February 27, 2023 20:06

Add better error handling if udf cannot be found

ae7d06e

github-actions bot added the Component: C++ label Feb 27, 2023

westonpace requested changes Feb 28, 2023

github-actions bot added the awaiting changes label Feb 28, 2023

Address comments. Simplify substrait plan. Add test for missing udf.

231a0d2

github-actions bot added the awaiting change review label and removed the awaiting changes label Feb 28, 2023

icexelloss added 2 commits February 28, 2023 11:28

Try adding _substrait import

04f759a

Fix clang-format

15e3172

Move things around to fix CI error

cdb493e

westonpace approved these changes Feb 28, 2023

github-actions bot added the awaiting merge label and removed the awaiting change review label Feb 28, 2023

icexelloss and others added 2 commits February 28, 2023 17:20

Update python/pyarrow/tests/test_substrait.py

42bdbf5

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Update python/pyarrow/tests/test_substrait.py

d867182

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Apply suggestions from code review

122e425

Co-authored-by: Weston Pace <weston.pace@gmail.com>

icexelloss merged commit e8107bf into apache:main Mar 1, 2023
