[Enhancement](pyudf) Support parameterless calls for pythonUDF by linrrzqqq · Pull Request #62624 · apache/doris

linrrzqqq · 2026-04-20T07:06:14Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

CREATE FUNCTION py_pkg_versions()
RETURNS STRING
PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "evaluate",
    "runtime_version" = "3.12.11",
    "always_nullable" = "true"
)
AS $$
import json
import sys
def evaluate():
    # Report the interpreter version plus the version of each optional package.
    versions = {"python": sys.version}
    try:
        import numpy
        versions["numpy"] = numpy.__version__
    except ImportError:
        versions["numpy"] = "not_found"
    try:
        import pandas
        versions["pandas"] = pandas.__version__
    except ImportError:
        versions["pandas"] = "not_found"
    try:
        import jieba
        versions["jieba"] = jieba.__version__
    except ImportError:
        versions["jieba"] = "not_found"
    return json.dumps(versions)
$$;

before:

SELECT py_pkg_versions();
-- errCode = 2, detailMessage = (172.20.49.73)[INVALID_ARGUMENT]Python UDF input types is empty

now:

SELECT py_pkg_versions();
+------------------------------------------------------------------------------------------------------------------------------------------------------+
| py_pkg_versions()                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"python": "3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0]", "numpy": "2.4.3", "pandas": "3.0.1", "jieba": "0.42.1"} |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
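With no input columns, the number of rows can no longer be inferred from the arguments, so the caller has to supply it separately. A minimal pure-Python sketch of that idea follows; `evaluate_batch`, `const_seven`, and `add` are hypothetical names used only for illustration and are not the Doris Python server's actual API.

```python
from typing import Any, Callable, List, Sequence


def evaluate_batch(func: Callable[..., Any],
                   columns: Sequence[Sequence[Any]],
                   num_rows: int) -> List[Any]:
    """Call `func` once per row (hypothetical helper, not Doris code)."""
    for col in columns:
        if len(col) != num_rows:
            raise ValueError("column length does not match num_rows")
    # With zero columns the per-row argument list is empty, so func() is
    # still invoked num_rows times.
    return [func(*(col[i] for col in columns)) for i in range(num_rows)]


def const_seven() -> int:
    return 7


def add(a: int, b: int) -> int:
    return a + b


zero_arg_result = evaluate_batch(const_seven, [], 3)
two_arg_result = evaluate_batch(add, [[1, 2], [10, 20]], 2)
```

The explicit `num_rows` parameter is the design point: a zero-column batch carries no data from which a length could be derived.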

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Thearas · 2026-04-20T07:06:19Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

linrrzqqq · 2026-04-20T10:42:34Z

run buildall

hello-stephen · 2026-04-20T13:00:32Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (20/20) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.80% (27591/37387)
Line Coverage	57.49% (298156/518581)
Region Coverage	54.67% (247919/453496)
Branch Coverage	56.26% (107301/190738)

linrrzqqq · 2026-04-21T02:10:00Z

/review

github-actions

be/src/exprs/table_function/python_udtf_function.cpp: the new zero-arg UDTF path still trusts the Python server to return one list entry per input row. TableFunctionOperator calls process_row() for every child row, and process_row() indexes array_nullmap_data[row_idx] / offsets_ptr[row_idx]. Because there is no list_array->length() == input_rows check after _udtf_client->evaluate(), a short or malformed response becomes an out-of-bounds read instead of a clean failure. The scalar UDF path already rejects this with output_rows == input_rows; the UDTF path needs the same guard.
be/src/udf/python/python_udf_meta.cpp: this new validation now depends on client_type being initialized, but PythonUDFMeta is still a plain struct without default member initializers. Current production callers set it, so I am not blocking on it, but defaulting type / client_type to UNKNOWN would make future callers safer.
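The row-count contract described in the first finding can be sketched in a few lines of Python. This is only an illustration of the guard, assuming a hypothetical `check_udtf_output` helper and exception type; the real check would live in the C++ UDTF path after `_udtf_client->evaluate()`.

```python
from typing import Any, Sequence


class RowCountMismatch(Exception):
    """Raised when the UDTF response does not cover every input row."""


def check_udtf_output(list_entries: Sequence[Sequence[Any]],
                      input_rows: int) -> None:
    # Hypothetical guard: exactly one list entry is expected per input row,
    # so later per-row indexing cannot run past the end of the response.
    if len(list_entries) != input_rows:
        raise RowCountMismatch(
            f"expected {input_rows} list entries, got {len(list_entries)}")


# A well-formed response for two input rows passes silently.
check_udtf_output([[1], [2, 3]], 2)

# A short response is rejected before any per-row access happens.
try:
    check_udtf_output([[1]], 2)
    short_response_rejected = False
except RowCountMismatch:
    short_response_rejected = True
```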

Critical Checkpoints

Goal of task: Partially met. The patch enables parameterless Python UDFs and most of the UDTF plumbing, but the UDTF execution path is still missing the row-count validation needed to make the new feature robust end-to-end. The added unit/regression tests demonstrate the inline happy path only.
Change size/focus: Focused and reasonably small.
Concurrency: No new concurrency or lock-order changes.
Lifecycle: No new special lifecycle or static-initialization concerns beyond the existing Python client/server lifecycle.
Config: No new config items.
Compatibility: No storage-format or FE/BE protocol compatibility issue found.
Parallel code paths: Scalar UDF and table-function paths were updated; Python UDAF remains intentionally unchanged. Module-loaded no-input UDF/UDTF behavior is not covered by the new tests, so that branch still carries residual risk.
Special condition checks: The new empty-argument branches are straightforward, but the UDTF path also needs an explicit output-row check because later code assumes one ListArray entry per input row.
Test coverage: Added BE unit tests and inline regression tests for zero-arg UDF/UDTF happy paths. Missing negative coverage for malformed or mismatched UDTF output, and missing module-path coverage.
Observability: Existing error messages are sufficient once the missing validation is added.
Transaction/persistence/data-write concerns: Not applicable.
FE/BE variable propagation: Not applicable.
Performance: Neutral; the empty-argument fast path is reasonable.

Overall: requesting changes until the UDTF output row-count contract is enforced.

github-actions · 2026-04-21T02:44:39Z

                                              std::shared_ptr<arrow::Schema>* schema) {
-    assert(!types.empty());
    arrow::SchemaBuilder builder;
    for (size_t i = 0; i < types.size(); ++i) {


This branch now depends on client_type being initialized, but PythonUDFMeta still has no default member initializers. Today the production callers set it before check(), but a future default-constructed caller with empty input_types would read an indeterminate enum here. Defaulting type / client_type to UNKNOWN in the struct would make this safer.
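The same idea expressed in Python, as a rough analogy rather than the actual C++ struct: a default of `UNKNOWN` makes a default-constructed object fail validation deterministically instead of reading an unset value. The class name `UdfMeta`, the enum members other than `UNKNOWN`, and the error messages are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ClientType(Enum):
    UNKNOWN = 0
    UDF = 1
    UDAF = 2
    UDTF = 3


@dataclass
class UdfMeta:
    # Hypothetical stand-in for PythonUDFMeta; field names are illustrative.
    input_types: List[str] = field(default_factory=list)
    client_type: ClientType = ClientType.UNKNOWN

    def check(self) -> None:
        if self.client_type is ClientType.UNKNOWN:
            raise ValueError("client_type is not set")
        # Empty inputs are accepted for UDF and UDTF but still rejected for UDAF.
        if not self.input_types and self.client_type is ClientType.UDAF:
            raise ValueError("UDAF requires at least one input type")


def check_fails(meta: UdfMeta) -> bool:
    try:
        meta.check()
        return False
    except ValueError:
        return True


default_fails = check_fails(UdfMeta())
zero_arg_udf_fails = check_fails(UdfMeta(client_type=ClientType.UDF))
zero_arg_udaf_fails = check_fails(UdfMeta(client_type=ClientType.UDAF))
```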

hello-stephen · 2026-04-21T04:04:33Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (20/20) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.75% (27573/37387)
Line Coverage	57.47% (298020/518581)
Region Coverage	54.65% (247825/453496)
Branch Coverage	56.23% (107251/190738)

hello-stephen · 2026-04-23T10:48:58Z

BE Regression && UT Coverage Report

Increment line coverage 100.00% (20/20) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.78% (27583/37387)
Line Coverage	57.46% (297990/518581)
Region Coverage	54.65% (247834/453496)
Branch Coverage	56.22% (107241/190738)

linrrzqqq · 2026-04-28T10:00:15Z

run buildall

linrrzqqq · 2026-04-30T07:52:15Z

/review

github-actions

Review result: request changes.

Critical checkpoints:

Goal/test proof: The BE changes implement zero-argument Python UDF/UDTF Arrow batches and add unit/regression coverage, but the new regression tests violate the repository test-preservation standard by dropping test tables after execution.
Scope/focus: The production changes are small and focused on empty input schemas and row-count handling.
Concurrency/lifecycle: No new shared concurrency or special lifecycle risks found; Python client lifecycle follows existing open/evaluate/close flow.
Configuration/compatibility: No new config, storage format, or FE-BE protocol field changes found. Empty Arrow schemas remain serialized through the existing metadata field.
Parallel paths: UDF and UDTF zero-argument paths are both handled; UDAF remains rejected for empty inputs. Existing review threads already covered default initialization and UDTF output-row validation, and the current diff addresses those, so I did not duplicate them.
Test coverage/results: Unit tests and regression tests were added, but the regression tests drop tables in finally blocks. Doris regression-test standards require dropping tables before use and preserving them after the run for debugging.
Observability/performance/data correctness: No additional observability, hot-path performance, transaction, persistence, or data-visibility issue found in this PR.
User focus: No additional user-provided review focus was specified.

Please remove the post-test table drops from the new regression cases while keeping the pre-test drops before creation.

github-actions · 2026-04-30T08:00:58Z

+        assert rows.every { (it[1] as int) == 7 }
+    } finally {
+        try_sql(""" DROP FUNCTION IF EXISTS py_const_no_input(); """)
+        try_sql(""" DROP TABLE IF EXISTS ${table_name}; """)


The regression-test standard in AGENTS.md says tables should be dropped before use, not after, so the environment remains available for debugging failed runs. This test already drops test_pythonudf_no_input_tbl before creating it; please remove the DROP TABLE from the finally block (the function cleanup is fine to keep if desired).
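The drop-before-use pattern can be illustrated outside the Doris regression framework. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database; the table name `test_no_input_tbl` and the `run_case` function are hypothetical.

```python
import sqlite3


def run_case(conn: sqlite3.Connection) -> int:
    # Drop before use, so a leftover table from an earlier run cannot interfere.
    conn.execute("DROP TABLE IF EXISTS test_no_input_tbl")
    conn.execute("CREATE TABLE test_no_input_tbl (id INTEGER)")
    conn.executemany("INSERT INTO test_no_input_tbl VALUES (?)",
                     [(1,), (2,), (3,)])
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM test_no_input_tbl").fetchone()
    # No DROP TABLE here: the table is left in place for debugging.
    return count


conn = sqlite3.connect(":memory:")
first_run = run_case(conn)
second_run = run_case(conn)  # rerun is safe because of the leading drop
(remaining,) = conn.execute(
    "SELECT COUNT(*) FROM test_no_input_tbl").fetchone()
```

Because the drop happens at the start, reruns stay idempotent while the data from the most recent run remains inspectable.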

github-actions · 2026-04-30T08:00:58Z

+        ]
+    } finally {
+        try_sql(""" DROP FUNCTION IF EXISTS py_emit_no_input(); """)
+        try_sql(""" DROP TABLE IF EXISTS ${table_name}; """)


Same regression-test standard issue here: newly added cases should not drop the test table after execution. Since the table is already dropped before creation, please leave it in place after the test so failed-run state is preserved for debugging.

linrrzqqq · 2026-05-07T12:21:46Z

run buildall

linrrzqqq · 2026-05-08T02:04:04Z

/review

github-actions

I reviewed the full PR and existing review threads. The previous comments about post-test table drops, PythonUDFMeta default initialization, and UDTF output row-count validation appear addressed.

Critical checkpoint conclusions:

Goal/test proof: the PR targets zero-argument Python UDF/UDTF execution and adds BE metadata/unit coverage plus regression coverage; the functional BE paths now build zero-column Arrow batches and validate UDTF output row count.
Scope: the implementation is small and focused.
Concurrency/lifecycle/config: no new concurrency, static lifecycle, or config behavior introduced.
Compatibility/protocol: no new thrift fields or persisted formats; the existing Python metadata JSON still carries schemas and client_type.
Parallel paths: scalar UDF and table-function UDTF paths are both handled; UDAF remains rejected for empty inputs as before.
Error handling/observability: Status propagation is preserved; UDTF row mismatch now fails explicitly.
Tests: coverage was added, but the new regression tests still violate the repository regression-test standard for ordinary single test tables.

User focus: no additional user-provided review focus was specified.

github-actions · 2026-05-08T02:10:09Z

+
+suite("test_pythonudf_no_input") {
+    def runtime_version = getPythonUdfRuntimeVersion()
+    def table_name = "test_pythonudf_no_input_tbl"


The regression-test guidance in AGENTS.md says ordinary single test tables should hardcode the table name instead of using a def tableName/variable form. This new test only uses one table, so please remove table_name and use test_pythonudf_no_input_tbl directly in the SQL statements. This is distinct from the existing thread, which only covered dropping the table after the test.

github-actions · 2026-05-08T02:10:09Z

+
+suite("test_pythonudtf_no_input") {
+    def runtime_version = getPythonUdfRuntimeVersion()
+    def table_name = "test_pythonudtf_no_input_tbl"


Same regression-test standard issue here: this is an ordinary single-table case, so the table name should be hardcoded rather than stored in table_name. Please use test_pythonudtf_no_input_tbl directly in the SQL statements. This is distinct from the existing thread, which only covered dropping the table after the test.

linrrzqqq · 2026-05-08T06:42:39Z

run buildall

no need

HappenLee

LGTM


github-actions Bot requested changes Apr 21, 2026


linrrzqqq force-pushed the pyudf-empty-arg branch from d404d8e to e1b2867 Compare April 28, 2026 09:55

github-actions Bot requested changes Apr 30, 2026


github-actions Bot previously requested changes May 8, 2026


linrrzqqq added 4 commits May 8, 2026 14:29

[Enhancement](pyudf) Support empty arg pyudf && udtf

93ec2f5

fix empty-arg

ee1dd34

update

eae02d3

update p0

892322d

linrrzqqq force-pushed the pyudf-empty-arg branch from 7f15d26 to 892322d Compare May 8, 2026 06:40

HappenLee approved these changes May 8, 2026


HappenLee merged commit 75b381a into apache:master May 8, 2026
30 of 32 checks passed

linrrzqqq deleted the pyudf-empty-arg branch May 8, 2026 09:25
