[Fix](pyudf) Convert nested map value correctly#63907
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Summary opinion: Request changes. The PR fixes the intended nested MAP representation problem, but the current implementation also adds per-row Python conversion to every vectorized UDF argument, including primitive pandas/list arguments that do not need normalization, and the new regression test has deterministic-output/debuggability issues that should be fixed before merge.
Critical checkpoint conclusions:
- Goal/test: The goal is clear and the new test covers scalar, vector-list, vector-series, and mixed vector/scalar nested MAP cases, but one expected-result query is not ordered.
- Scope/focus: The feature change is small, but the conversion is applied more broadly than necessary and affects all vectorized UDF inputs.
- Concurrency/lifecycle/config/compatibility/persistence/data writes: No new concurrency, lifecycle management, configs, storage format, persistence, or transaction behavior found in this PR.
- Parallel paths: The scalar and vectorized Python UDF argument paths were both updated, but the vectorized primitive path now pays the nested conversion cost unnecessarily.
- Error handling/observability: Existing exception propagation/logging remains unchanged; no new observability requirement found.
- Test coverage/results: Coverage is relevant, but the new regression test violates Doris ordering and cleanup standards.
- User focus: No additional user-provided review focus was specified.
- Local verification: I inspected the diff and review context; I could not run a PyArrow reproduction because pyarrow is not installed in this runner environment.
|
run buildall |
TPC-H: Total hot run time: 30999 ms |
TPC-DS: Total hot run time: 170853 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
|
OpenCode automated review failed and did not complete. Error: Error: Token refresh failed: 401 Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
|
/review |
|
OpenCode automated review failed and did not complete. Error: Error: Token refresh failed: 401 Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
TPC-H: Total hot run time: 32034 ms |
|
/review |
TPC-DS: Total hot run time: 172482 ms |
There was a problem hiding this comment.
Code review completed for the latest head d701b80.
Findings: I did not find additional blocking issues beyond the already-existing inline threads, and the latest commit appears to address those known points: vector conversion is gated for non-nested Arrow types, the UNION ALL regression output is ordered, and the post-test table cleanup was removed.
Critical checkpoint conclusions:
- Goal and coverage: The PR targets Python UDF nested MAP conversion under ARRAY/STRUCT/vectorized inputs, and the implementation plus new regression coverage exercise scalar constants, vector list, vector pandas Series, nulls, empty containers, and nested MAP inside STRUCT/MAP.
- Scope and focus: The actual PR file list is limited to python_server.py and one regression suite/output pair; the change is focused on Arrow-to-Python conversion.
- Concurrency: No new shared mutable state or thread lifecycle is introduced.
- Lifecycle/static initialization: No new special lifecycle or static/global initialization concerns found.
- Configuration: No config changes.
- Compatibility/storage format: No persisted format or FE-BE protocol change.
- Parallel paths: Scalar UDF, vectorized UDF, and existing UDAF/UDTF callers through convert_arrow_field_to_python are covered by the shared helper path; vectorized explicit list/Series paths are handled separately.
- Conditional checks: The new needs_nested_python_normalization gate is appropriate to avoid applying recursive normalization on primitive/non-nested vector inputs.
- Tests/results: New regression test output is deterministic after ordering the UNION ALL query. I did not run the regression suite in this review environment.
- Observability: No additional observability appears necessary for this conversion helper.
- Transaction/persistence/data writes: Not applicable.
- FE/BE variable passing: Not applicable.
- Performance: The known vectorized primitive regression concern is addressed by keeping the direct to_pylist/to_pandas path when the Arrow type cannot contain nested MAP values.
User focus: No additional user-provided review focus was present.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
57f159b to
33cb1a2
Compare
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 28593 ms |
TPC-DS: Total hot run time: 171397 ms |
|
/review |
There was a problem hiding this comment.
Automated review summary:
No additional blocking issues found in the current PR head.
Critical checkpoint conclusions:
- Goal and proof: The PR fixes nested MAP conversion for scalar and vectorized Python UDF inputs, and adds a regression case covering scalar constants, vectorized list inputs, vectorized pandas Series inputs, and mixed vector/scalar arguments.
- Scope: The effective GitHub PR diff is focused on
be/src/udf/python/python_server.pyplus the new regression test and expected output. - Concurrency/lifecycle: No new shared state, threads, locks, static initialization, or lifecycle-sensitive ownership was introduced.
- Configuration/compatibility: No new configuration, storage format, Thrift/protocol, or persisted metadata changes.
- Parallel paths: Scalar conversion, vectorized list conversion, vectorized pandas conversion, and vectorized mixed scalar conversion were considered. The current head keeps primitive vectorized inputs on the original fast path and normalizes only map-containing nested types.
- Error handling/data correctness: The recursive conversion preserves NULL handling and converts MAP/LIST/STRUCT according to Arrow type metadata. I did not find a distinct data-correctness issue beyond the already-known prior review threads.
- Tests: The new regression output is deterministic in the current head; table cleanup now follows the Doris regression convention. I did not run the regression suite in this review environment.
- Observability/performance: No new observability requirement. The current head avoids the earlier performance regression by gating recursive normalization to types that can contain nested MAP values.
User focus: No additional user-provided review focus was specified.
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
Problem Summary:
Fix Python UDF nested complex type conversion when `MAP` appears inside
`ARRAY`, `STRUCT`, or vectorized inputs.
Previously, Python UDF argument conversion mostly relied on PyArrow's
default conversions(`Scalar.as_py()`, `Array.to_pylist()`,
`Array.to_pandas()`). Those APIs convert a top-level Arrow `MAP` into
Python-friendly values in some paths, but nested `MAP` values are
exposed as list-of-tuples. For example, `ARRAY<MAP<STRING, INT>>` could
arrive in Python as `[[('a', 1)]]` instead of `[{'a': 1}]`. This made
user UDF code see nested maps as `list` instead of `dict`.
This PR introduces a recursive Arrow-value conversion helper and applies
it consistently across Python UDF argument conversion paths. The helper
manually reconstructs Python values according to the Arrow type:
- `MAP` -> `dict`
- `LIST` / `LARGE_LIST` -> `list`
- `STRUCT` -> `dict`
before
```sql
CREATE FUNCTION py_deep_nested_debug(ARRAY<MAP<STRING, ARRAY<INT>>> )
RETURNS STRING
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "evaluate",
"runtime_version" = "3.12.11",
"always_nullable" = "true"
)
AS $$
def evaluate(arr):
if arr is None:
return 'None'
return 'outer_type={}, outer_repr={}'.format(type(arr).__name__, repr(arr))
$$;
SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]) |
+-------------------------------------------------------------------------------+
| outer_type=list, outer_repr=[[('a', [1, 2]), ('b', [3])], [('c', [4, 5, 6])]] |
+-------------------------------------------------------------------------------+
```
now:
```text
SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]) |
+-------------------------------------------------------------------------+
| outer_type=list, outer_repr=[{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}] |
+-------------------------------------------------------------------------+
```
Problem Summary:
Fix Python UDF nested complex type conversion when `MAP` appears inside
`ARRAY`, `STRUCT`, or vectorized inputs.
Previously, Python UDF argument conversion mostly relied on PyArrow's
default conversions(`Scalar.as_py()`, `Array.to_pylist()`,
`Array.to_pandas()`). Those APIs convert a top-level Arrow `MAP` into
Python-friendly values in some paths, but nested `MAP` values are
exposed as list-of-tuples. For example, `ARRAY<MAP<STRING, INT>>` could
arrive in Python as `[[('a', 1)]]` instead of `[{'a': 1}]`. This made
user UDF code see nested maps as `list` instead of `dict`.
This PR introduces a recursive Arrow-value conversion helper and applies
it consistently across Python UDF argument conversion paths. The helper
manually reconstructs Python values according to the Arrow type:
- `MAP` -> `dict`
- `LIST` / `LARGE_LIST` -> `list`
- `STRUCT` -> `dict`
before
```sql
CREATE FUNCTION py_deep_nested_debug(ARRAY<MAP<STRING, ARRAY<INT>>> )
RETURNS STRING
PROPERTIES (
"type" = "PYTHON_UDF",
"symbol" = "evaluate",
"runtime_version" = "3.12.11",
"always_nullable" = "true"
)
AS $$
def evaluate(arr):
if arr is None:
return 'None'
return 'outer_type={}, outer_repr={}'.format(type(arr).__name__, repr(arr))
$$;
SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]) |
+-------------------------------------------------------------------------------+
| outer_type=list, outer_repr=[[('a', [1, 2]), ('b', [3])], [('c', [4, 5, 6])]] |
+-------------------------------------------------------------------------------+
```
now:
```text
SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]) |
+-------------------------------------------------------------------------+
| outer_type=list, outer_repr=[{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}] |
+-------------------------------------------------------------------------+
```
Problem Summary:
Fix Python UDF nested complex type conversion when
MAPappears insideARRAY,STRUCT, or vectorized inputs.Previously, Python UDF argument conversion mostly relied on PyArrow's default conversions(
Scalar.as_py(),Array.to_pylist(),Array.to_pandas()). Those APIs convert a top-level ArrowMAPinto Python-friendly values in some paths, but nestedMAPvalues are exposed as list-of-tuples. For example,ARRAY<MAP<STRING, INT>>could arrive in Python as[[('a', 1)]]instead of[{'a': 1}]. This made user UDF code see nested maps aslistinstead ofdict.This PR introduces a recursive Arrow-value conversion helper and applies it consistently across Python UDF argument conversion paths. The helper manually reconstructs Python values according to the Arrow type:
MAP->dictLIST/LARGE_LIST->listSTRUCT->dictbefore
now: