[Enhancement](pyudf) Support parameterless calls for pythonUDF #62624

Merged: HappenLee merged 4 commits into apache:master from linrrzqqq:pyudf-empty-arg on May 8, 2026

@linrrzqqq
Collaborator

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

```sql
CREATE FUNCTION py_pkg_versions()
RETURNS STRING
PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "evaluate",
    "runtime_version" = "3.12.11",
    "always_nullable" = "true"
)
AS $$
import json
import sys

def evaluate():
    # Report the interpreter version plus the version of each optional package.
    versions = {"python": sys.version}
    try:
        import numpy
        versions["numpy"] = numpy.__version__
    except Exception:  # package missing or unusable
        versions["numpy"] = "not_found"
    try:
        import pandas
        versions["pandas"] = pandas.__version__
    except Exception:
        versions["pandas"] = "not_found"
    try:
        import jieba
        versions["jieba"] = jieba.__version__
    except Exception:
        versions["jieba"] = "not_found"
    return json.dumps(versions)
$$;
```

Before:

```sql
SELECT py_pkg_versions();
-- errCode = 2, detailMessage = (172.20.49.73)[INVALID_ARGUMENT]Python UDF input types is empty
```

Now:

```sql
SELECT py_pkg_versions();
+------------------------------------------------------------------------------------------------------------------------------------------------------+
| py_pkg_versions()                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"python": "3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0]", "numpy": "2.4.3", "pandas": "3.0.1", "jieba": "0.42.1"} |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
```

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Apr 20, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@linrrzqqq
Collaborator Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (20/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.80% (27591/37387)
Line Coverage 57.49% (298156/518581)
Region Coverage 54.67% (247919/453496)
Branch Coverage 56.26% (107301/190738)

@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions github-actions Bot left a comment

  1. be/src/exprs/table_function/python_udtf_function.cpp: the new zero-arg UDTF path still trusts the Python server to return one list entry per input row. TableFunctionOperator calls process_row() for every child row, and process_row() indexes array_nullmap_data[row_idx] / offsets_ptr[row_idx]. Because there is no list_array->length() == input_rows check after _udtf_client->evaluate(), a short or malformed response becomes an out-of-bounds read instead of a clean failure. The scalar UDF path already rejects this with output_rows == input_rows; the UDTF path needs the same guard.

  2. be/src/udf/python/python_udf_meta.cpp: this new validation now depends on client_type being initialized, but PythonUDFMeta is still a plain struct without default member initializers. Current production callers set it, so I am not blocking on it, but defaulting type / client_type to UNKNOWN would make future callers safer.
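
The missing guard described in point 1 can be illustrated with a minimal Python sketch. This is not the Doris C++ code: `validate_udtf_output`, `rows_for`, and `RowCountMismatch` are hypothetical names, and the response is modeled as the plain offsets of a list array (one more offset than entries), which is an assumption about its shape.

```python
class RowCountMismatch(ValueError):
    """Raised when the response does not hold one list entry per input row."""


def validate_udtf_output(offsets, input_rows):
    # A list array with n entries carries n + 1 offsets.
    entries = max(len(offsets) - 1, 0)
    if entries != input_rows:
        raise RowCountMismatch(
            f"UDTF returned {entries} list entries for {input_rows} input rows"
        )
    return entries


def rows_for(offsets, row_idx):
    # Only safe once validate_udtf_output() has passed; otherwise row_idx
    # may index past the end of the offsets.
    return range(offsets[row_idx], offsets[row_idx + 1])


offsets = [0, 2, 2, 5]           # three entries: 2 rows, 0 rows, 3 rows
validate_udtf_output(offsets, input_rows=3)
print(list(rows_for(offsets, 2)))  # prints [2, 3, 4]
```

Checking once, right after the evaluate call, turns a short or malformed response into a clean error before any per-row indexing happens.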

Critical Checkpoints

  • Goal of task: Partially met. The patch enables parameterless Python UDFs and most of the UDTF plumbing, but the UDTF execution path is still missing the row-count validation needed to make the new feature robust end-to-end. The added unit/regression tests demonstrate the inline happy path only.
  • Change size/focus: Focused and reasonably small.
  • Concurrency: No new concurrency or lock-order changes.
  • Lifecycle: No new special lifecycle or static-initialization concerns beyond the existing Python client/server lifecycle.
  • Config: No new config items.
  • Compatibility: No storage-format or FE/BE protocol compatibility issue found.
  • Parallel code paths: Scalar UDF and table-function paths were updated; Python UDAF remains intentionally unchanged. Module-loaded no-input UDF/UDTF behavior is not covered by the new tests, so that branch still carries residual risk.
  • Special condition checks: The new empty-argument branches are straightforward, but the UDTF path also needs an explicit output-row check because later code assumes one ListArray entry per input row.
  • Test coverage: Added BE unit tests and inline regression tests for zero-arg UDF/UDTF happy paths. Missing negative coverage for malformed or mismatched UDTF output, and missing module-path coverage.
  • Observability: Existing error messages are sufficient once the missing validation is added.
  • Transaction/persistence/data-write concerns: Not applicable.
  • FE/BE variable propagation: Not applicable.
  • Performance: Neutral; the empty-argument fast path is reasonable.

Overall: requesting changes until the UDTF output row-count contract is enforced.

Comment thread be/src/exprs/table_function/python_udtf_function.cpp
std::shared_ptr<arrow::Schema>* schema) {
assert(!types.empty());
arrow::SchemaBuilder builder;
for (size_t i = 0; i < types.size(); ++i) {
Contributor

This branch now depends on client_type being initialized, but PythonUDFMeta still has no default member initializers. Today the production callers set it before check(), but a future default-constructed caller with empty input_types would read an indeterminate enum here. Defaulting type / client_type to UNKNOWN in the struct would make this safer.
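
As a rough Python analogue of this suggestion (the real `PythonUDFMeta` is a C++ struct, and every name below is hypothetical), defaulting the client type to an explicit `UNKNOWN` lets the check fail predictably for a default-constructed value. The rule that only UDAF rejects empty inputs follows the later review rounds; the error strings are made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ClientType(Enum):
    UNKNOWN = 0
    UDF = 1
    UDAF = 2
    UDTF = 3


@dataclass
class UdfMeta:
    # Hypothetical stand-in for the metadata struct; every field has a default.
    name: str = ""
    input_types: list = field(default_factory=list)
    client_type: ClientType = ClientType.UNKNOWN


def check(meta):
    """Return an error message, or None when the metadata is acceptable."""
    if meta.client_type is ClientType.UNKNOWN:
        return "client type is not set"
    if not meta.input_types and meta.client_type is ClientType.UDAF:
        return "UDAF requires at least one input type"
    return None


print(check(UdfMeta()))                               # client type is not set
print(check(UdfMeta(client_type=ClientType.UDF)))     # None
print(check(UdfMeta(client_type=ClientType.UDAF)))    # UDAF requires at least one input type
```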

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (20/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.75% (27573/37387)
Line Coverage 57.47% (298020/518581)
Region Coverage 54.65% (247825/453496)
Branch Coverage 56.23% (107251/190738)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (20/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.78% (27583/37387)
Line Coverage 57.46% (297990/518581)
Region Coverage 54.65% (247834/453496)
Branch Coverage 56.22% (107241/190738)

@linrrzqqq
Collaborator Author

run buildall

@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions github-actions Bot left a comment

Review result: request changes.

Critical checkpoints:

  • Goal/test proof: The BE changes implement zero-argument Python UDF/UDTF Arrow batches and add unit/regression coverage, but the new regression tests violate the repository test-preservation standard by dropping test tables after execution.
  • Scope/focus: The production changes are small and focused on empty input schemas and row-count handling.
  • Concurrency/lifecycle: No new shared concurrency or special lifecycle risks found; Python client lifecycle follows existing open/evaluate/close flow.
  • Configuration/compatibility: No new config, storage format, or FE-BE protocol field changes found. Empty Arrow schemas remain serialized through the existing metadata field.
  • Parallel paths: UDF and UDTF zero-argument paths are both handled; UDAF remains rejected for empty inputs. Existing review threads already covered default initialization and UDTF output-row validation, and the current diff addresses those, so I did not duplicate them.
  • Test coverage/results: Unit tests and regression tests were added, but the regression tests drop tables in finally blocks. Doris regression-test standards require dropping tables before use and preserving them after the run for debugging.
  • Observability/performance/data correctness: No additional observability, hot-path performance, transaction, persistence, or data-visibility issue found in this PR.
  • User focus: No additional user-provided review focus was specified.

Please remove the post-test table drops from the new regression cases while keeping the pre-test drops before creation.

assert rows.every { (it[1] as int) == 7 }
} finally {
try_sql(""" DROP FUNCTION IF EXISTS py_const_no_input(); """)
try_sql(""" DROP TABLE IF EXISTS ${table_name}; """)
Contributor

The regression-test standard in AGENTS.md says tables should be dropped before use, not after, so the environment remains available for debugging failed runs. This test already drops test_pythonudf_no_input_tbl before creating it; please remove the DROP TABLE from the finally block (the function cleanup is fine to keep if desired).
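
The drop-before, keep-after convention can be sketched outside the Doris regression framework. The Python example below uses the standard-library `sqlite3` module purely as a stand-in for a Doris connection; `run_case` is a hypothetical helper and the table layout is invented for illustration.

```python
import sqlite3


def run_case(conn):
    # Drop BEFORE use so a leftover table from an earlier run cannot interfere.
    conn.execute("DROP TABLE IF EXISTS test_pythonudf_no_input_tbl")
    conn.execute("CREATE TABLE test_pythonudf_no_input_tbl (id INTEGER)")
    conn.executemany(
        "INSERT INTO test_pythonudf_no_input_tbl VALUES (?)", [(1,), (2,), (3,)]
    )
    rows = conn.execute(
        "SELECT id, 7 FROM test_pythonudf_no_input_tbl"
    ).fetchall()
    assert all(row[1] == 7 for row in rows)
    # Deliberately no DROP TABLE here: the table stays behind for debugging.
    return rows


conn = sqlite3.connect(":memory:")
run_case(conn)
run_case(conn)  # a rerun is safe because the case cleans up before it starts
```

Because cleanup happens at the start, a failed run leaves its data in place and the next run still begins from a known state.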

]
} finally {
try_sql(""" DROP FUNCTION IF EXISTS py_emit_no_input(); """)
try_sql(""" DROP TABLE IF EXISTS ${table_name}; """)
Contributor

Same regression-test standard issue here: newly added cases should not drop the test table after execution. Since the table is already dropped before creation, please leave it in place after the test so failed-run state is preserved for debugging.

@linrrzqqq
Collaborator Author

run buildall

@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions github-actions Bot left a comment

I reviewed the full PR and existing review threads. The previous comments about post-test table drops, PythonUDFMeta default initialization, and UDTF output row-count validation appear addressed.

Critical checkpoint conclusions:

  • Goal/test proof: the PR targets zero-argument Python UDF/UDTF execution and adds BE metadata/unit coverage plus regression coverage; the functional BE paths now build zero-column Arrow batches and validate UDTF output row count.
  • Scope: the implementation is small and focused.
  • Concurrency/lifecycle/config: no new concurrency, static lifecycle, or config behavior introduced.
  • Compatibility/protocol: no new thrift fields or persisted formats; the existing Python metadata JSON still carries schemas and client_type.
  • Parallel paths: scalar UDF and table-function UDTF paths are both handled; UDAF remains rejected for empty inputs as before.
  • Error handling/observability: Status propagation is preserved; UDTF row mismatch now fails explicitly.
  • Tests: coverage was added, but the new regression tests still violate the repository regression-test standard for ordinary single test tables.

User focus: no additional user-provided review focus was specified.
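
Why a zero-column batch needs an explicit row count can be shown with a small Python sketch. With no argument columns there is nothing to infer the batch length from, so it has to be carried separately. `evaluate_batch` and its parameters are hypothetical and do not mirror the actual Python server code; the body of `py_const_no_input` is assumed from the regression test's expected value of 7.

```python
def evaluate_batch(func, columns, num_rows):
    """Call func once per row; columns may be empty for a zero-argument UDF."""
    if not columns:
        # Zero-argument case: the row count is the only information available.
        return [func() for _ in range(num_rows)]
    if any(len(col) != num_rows for col in columns):
        raise ValueError("column length does not match num_rows")
    return [func(*args) for args in zip(*columns)]


def py_const_no_input():
    return 7


print(evaluate_batch(py_const_no_input, [], 3))                   # prints [7, 7, 7]
print(evaluate_batch(lambda a, b: a + b, [[1, 2], [10, 20]], 2))  # prints [11, 22]
```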


suite("test_pythonudf_no_input") {
def runtime_version = getPythonUdfRuntimeVersion()
def table_name = "test_pythonudf_no_input_tbl"
Contributor

The regression-test guidance in AGENTS.md says ordinary single test tables should hardcode the table name instead of using a def tableName/variable form. This new test only uses one table, so please remove table_name and use test_pythonudf_no_input_tbl directly in the SQL statements. This is distinct from the existing thread, which only covered dropping the table after the test.


suite("test_pythonudtf_no_input") {
def runtime_version = getPythonUdfRuntimeVersion()
def table_name = "test_pythonudtf_no_input_tbl"
Contributor

Same regression-test standard issue here: this is an ordinary single-table case, so the table name should be hardcoded rather than stored in table_name. Please use test_pythonudtf_no_input_tbl directly in the SQL statements. This is distinct from the existing thread, which only covered dropping the table after the test.

@linrrzqqq
Collaborator Author

run buildall

Contributor

@HappenLee HappenLee left a comment

LGTM

@HappenLee HappenLee merged commit 75b381a into apache:master May 8, 2026
30 of 32 checks passed
@linrrzqqq linrrzqqq deleted the pyudf-empty-arg branch May 8, 2026 09:25
linrrzqqq added a commit to linrrzqqq/doris that referenced this pull request May 19, 2026
…e#62624)

4 participants