[Enhancement](pyudf) Support parameterless calls for pythonUDF#62624
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
-
be/src/exprs/table_function/python_udtf_function.cpp: the new zero-arg UDTF path still trusts the Python server to return one list entry per input row.TableFunctionOperatorcallsprocess_row()for every child row, andprocess_row()indexesarray_nullmap_data[row_idx]/offsets_ptr[row_idx]. Because there is nolist_array->length() == input_rowscheck after_udtf_client->evaluate(), a short or malformed response becomes an out-of-bounds read instead of a clean failure. The scalar UDF path already rejects this withoutput_rows == input_rows; the UDTF path needs the same guard. -
be/src/udf/python/python_udf_meta.cpp: this new validation now depends onclient_typebeing initialized, butPythonUDFMetais still a plain struct without default member initializers. Current production callers set it, so I am not blocking on it, but defaultingtype/client_typetoUNKNOWNwould make future callers safer.
Critical Checkpoints
- Goal of task: Partially met. The patch enables parameterless Python UDFs and most of the UDTF plumbing, but the UDTF execution path is still missing the row-count validation needed to make the new feature robust end-to-end. The added unit/regression tests demonstrate the inline happy path only.
- Change size/focus: Focused and reasonably small.
- Concurrency: No new concurrency or lock-order changes.
- Lifecycle: No new special lifecycle or static-initialization concerns beyond the existing Python client/server lifecycle.
- Config: No new config items.
- Compatibility: No storage-format or FE/BE protocol compatibility issue found.
- Parallel code paths: Scalar UDF and table-function paths were updated; Python UDAF remains intentionally unchanged. Module-loaded no-input UDF/UDTF behavior is not covered by the new tests, so that branch still carries residual risk.
- Special condition checks: The new empty-argument branches are straightforward, but the UDTF path also needs an explicit output-row check because later code assumes one
ListArrayentry per input row. - Test coverage: Added BE unit tests and inline regression tests for zero-arg UDF/UDTF happy paths. Missing negative coverage for malformed or mismatched UDTF output, and missing module-path coverage.
- Observability: Existing error messages are sufficient once the missing validation is added.
- Transaction/persistence/data-write concerns: Not applicable.
- FE/BE variable propagation: Not applicable.
- Performance: Neutral; the empty-argument fast path is reasonable.
Overall: requesting changes until the UDTF output row-count contract is enforced.
| std::shared_ptr<arrow::Schema>* schema) { | ||
| assert(!types.empty()); | ||
| arrow::SchemaBuilder builder; | ||
| for (size_t i = 0; i < types.size(); ++i) { |
There was a problem hiding this comment.
This branch now depends on client_type being initialized, but PythonUDFMeta still has no default member initializers. Today the production callers set it before check(), but a future default-constructed caller with empty input_types would read an indeterminate enum here. Defaulting type / client_type to UNKNOWN in the struct would make this safer.
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
d404d8e to
e1b2867
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review result: request changes.
Critical checkpoints:
- Goal/test proof: The BE changes implement zero-argument Python UDF/UDTF Arrow batches and add unit/regression coverage, but the new regression tests violate the repository test-preservation standard by dropping test tables after execution.
- Scope/focus: The production changes are small and focused on empty input schemas and row-count handling.
- Concurrency/lifecycle: No new shared concurrency or special lifecycle risks found; Python client lifecycle follows existing open/evaluate/close flow.
- Configuration/compatibility: No new config, storage format, or FE-BE protocol field changes found. Empty Arrow schemas remain serialized through the existing metadata field.
- Parallel paths: UDF and UDTF zero-argument paths are both handled; UDAF remains rejected for empty inputs. Existing review threads already covered default initialization and UDTF output-row validation, and the current diff addresses those, so I did not duplicate them.
- Test coverage/results: Unit tests and regression tests were added, but the regression tests drop tables in finally blocks. Doris regression-test standards require dropping tables before use and preserving them after the run for debugging.
- Observability/performance/data correctness: No additional observability, hot-path performance, transaction, persistence, or data-visibility issue found in this PR.
- User focus: No additional user-provided review focus was specified.
Please remove the post-test table drops from the new regression cases while keeping the pre-test drops before creation.
| assert rows.every { (it[1] as int) == 7 } | ||
| } finally { | ||
| try_sql(""" DROP FUNCTION IF EXISTS py_const_no_input(); """) | ||
| try_sql(""" DROP TABLE IF EXISTS ${table_name}; """) |
There was a problem hiding this comment.
The regression-test standard in AGENTS.md says tables should be dropped before use, not after, so the environment remains available for debugging failed runs. This test already drops test_pythonudf_no_input_tbl before creating it; please remove the DROP TABLE from the finally block (the function cleanup is fine to keep if desired).
| ] | ||
| } finally { | ||
| try_sql(""" DROP FUNCTION IF EXISTS py_emit_no_input(); """) | ||
| try_sql(""" DROP TABLE IF EXISTS ${table_name}; """) |
There was a problem hiding this comment.
Same regression-test standard issue here: newly added cases should not drop the test table after execution. Since the table is already dropped before creation, please leave it in place after the test so failed-run state is preserved for debugging.
|
run buildall |
|
/review |
There was a problem hiding this comment.
I reviewed the full PR and existing review threads. The previous comments about post-test table drops, PythonUDFMeta default initialization, and UDTF output row-count validation appear addressed.
Critical checkpoint conclusions:
- Goal/test proof: the PR targets zero-argument Python UDF/UDTF execution and adds BE metadata/unit coverage plus regression coverage; the functional BE paths now build zero-column Arrow batches and validate UDTF output row count.
- Scope: the implementation is small and focused.
- Concurrency/lifecycle/config: no new concurrency, static lifecycle, or config behavior introduced.
- Compatibility/protocol: no new thrift fields or persisted formats; the existing Python metadata JSON still carries schemas and client_type.
- Parallel paths: scalar UDF and table-function UDTF paths are both handled; UDAF remains rejected for empty inputs as before.
- Error handling/observability: Status propagation is preserved; UDTF row mismatch now fails explicitly.
- Tests: coverage was added, but the new regression tests still violate the repository regression-test standard for ordinary single test tables.
User focus: no additional user-provided review focus was specified.
|
|
||
| suite("test_pythonudf_no_input") { | ||
| def runtime_version = getPythonUdfRuntimeVersion() | ||
| def table_name = "test_pythonudf_no_input_tbl" |
There was a problem hiding this comment.
The regression-test guidance in AGENTS.md says ordinary single test tables should hardcode the table name instead of using a def tableName/variable form. This new test only uses one table, so please remove table_name and use test_pythonudf_no_input_tbl directly in the SQL statements. This is distinct from the existing thread, which only covered dropping the table after the test.
|
|
||
| suite("test_pythonudtf_no_input") { | ||
| def runtime_version = getPythonUdfRuntimeVersion() | ||
| def table_name = "test_pythonudtf_no_input_tbl" |
There was a problem hiding this comment.
Same regression-test standard issue here: this is an ordinary single-table case, so the table name should be hardcoded rather than stored in table_name. Please use test_pythonudtf_no_input_tbl directly in the SQL statements. This is distinct from the existing thread, which only covered dropping the table after the test.
|
run buildall |
…e#62624) Problem Summary: ```sql CREATE FUNCTION py_pkg_versions() RETURNS STRING PROPERTIES ( "type" = "PYTHON_UDF", "symbol" = "evaluate", "runtime_version" = "3.12.11", "always_nullable" = "true" ) AS $$ import json import sys def evaluate(): versions = {"python": sys.version} try: import numpy versions["numpy"] = numpy.__version__ except: versions["numpy"] = "not_found" try: import pandas versions["pandas"] = pandas.__version__ except: versions["pandas"] = "not_found" try: import jieba versions["jieba"] = jieba.__version__ except: versions["jieba"] = "not_found" return json.dumps(versions) $$; ``` before: ```sql SELECT py_pkg_versions(); -- errCode = 2, detailMessage = (172.20.49.73)[INVALID_ARGUMENT]Python UDF input types is empty ``` now: ```sql SELECT py_pkg_versions(); +------------------------------------------------------------------------------------------------------------------------------------------------------+ | py_pkg_versions() | +------------------------------------------------------------------------------------------------------------------------------------------------------+ | {"python": "3.12.11 | packaged by conda-forge | (main, Jun 4 2025, 14:45:31) [GCC 13.3.0]", "numpy": "2.4.3", "pandas": "3.0.1", "jieba": "0.42.1"} | +------------------------------------------------------------------------------------------------------------------------------------------------------+ ```
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
before:
now:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)