[opt](agg) use binary serde for collect_list/collect_set on complex types#62743
[opt](agg) use binary serde for collect_list/collect_set on complex types#62743Mryange wants to merge 1 commit into
Conversation
|
/review |
There was a problem hiding this comment.
I found blocking issues in the rollout plumbing for collect_list_v2 / collect_set_v2.
collect_set_v1is registered, but it still goes through the old creator which only treats the exact namecollect_setas the distinct variant. On a new BE,collect_set_v1therefore instantiates list semantics and keeps duplicates.- New FE default resolution now emits
collect_list_v2/collect_set_v2, but no exec-version compatibility rewrite was added.ExpressionTranslatorandFunctionToThriftConverterserialize those names verbatim, while pre-change BEs only knowcollect_list/collect_set. The rolling-upgrade behavior described in the PR is not implemented. - The default-name switch was not mirrored in
AggStateType.aliasToName.agg_state<collect_list(...)>/agg_state<collect_set(...)>still canonicalize to the old names, while_state/_merge/_unioncombinators now build v2 names. Those names are compared literally in FE cast checks and BE agg-state execution, so agg-state columns/combinators can no longer interoperate.
Critical checkpoints
- Goal: The new BE serializer itself moves the generic
collect_listpath to binary serde, but the end-to-end rollout goal is not met because compatibility and agg-state plumbing are incomplete. - Scope/focus: Mostly small, but the added test file is unrelated to the changed feature.
- Concurrency: No new concurrency concerns in the touched path.
- Lifecycle/config/compatibility: Yes, this is a function-name / serialized-state compatibility change, and the required compatibility handling is incomplete.
- Parallel code paths: The agg-state/combinator path and rolling-upgrade path were not updated consistently.
- Test coverage: No new test exercises
collect_*_v2on complex types, partial-state merge/deserialize, or the agg-state path.
User focus points
- No additional user-provided focus points.
| factory.register_function_both("collect_set", create_aggregate_function_collect); | ||
| // v1 names: FE may send these after being upgraded while BEs are still upgrading | ||
| factory.register_function_both("collect_list_v1", create_aggregate_function_collect); | ||
| factory.register_function_both("collect_set_v1", create_aggregate_function_collect); |
There was a problem hiding this comment.
collect_set_v1 does not actually create a set aggregate here. create_aggregate_function_collect_impl() still uses bool distinct = name == "collect_set";, so this newly registered name goes down the collect_list branch and preserves duplicates. A concrete regression is the rollout path that is supposed to call collect_set_v1: on a new BE it will build list semantics instead of set semantics unless the helper also treats collect_set_v1 as the distinct variant.
| * constructor with 1 argument. | ||
| */ | ||
| public CollectListV2(Expression arg) { | ||
| super("collect_list_v2", arg); |
There was a problem hiding this comment.
Making the user-visible collect_list name resolve to collect_list_v2 here is not enough for the rolling-upgrade story in the PR description. ExpressionTranslator / FunctionToThriftConverter serialize function.getName() verbatim, and I can't find the companion BeExecVersionManager compatibility registration on the BE side for collect_list_v2 / collect_set_v2. That means FE with an old be_exec_version still sends *_v2 to pre-change BEs, which only register collect_list / collect_set (plus the new *_v1 aliases from this PR) and will fail function lookup.
| * constructor with 1 argument. | ||
| */ | ||
| public CollectSetV2(Expression arg) { | ||
| super("collect_set_v2", arg); |
There was a problem hiding this comment.
Changing the default nested function name to collect_*_v2 also needs the same AggStateType.aliasToName update that map_agg and window_funnel already have. Right now agg_state<collect_list(...)> / agg_state<collect_set(...)> still canonicalize to collect_list / collect_set, while collect_list_state / collect_set_state and the corresponding _merge / _union combinators now build *_v2 names. Those names are compared literally in FE agg-state cast checks and in BE AggFnEvaluator, so agg-state columns and combinators stop matching each other.
e7193b1 to
a45b4c0
Compare
|
/review |
There was a problem hiding this comment.
Summary
I found two additional blocking issues beyond the existing review threads. I rechecked the prior review context and did not duplicate those comments.
Critical Checkpoints
- Goal / correctness: Partially met. The new binary-serde path is implemented for
collect_list, butcollect_setstill has no working complex-type path in BE, so the PR does not yet deliver the advertisedcollect_sethalf of the feature. - Scope / focus: Not focused enough. The PR also adds an unrelated
array_match_*BE unit test file. - Concurrency: No new concurrency or lock-order issue found in the touched code paths.
- Lifecycle / static init: No special lifecycle or static-initialization issue found in the changed paths.
- Config changes: None.
- Compatibility / incompatible changes: This PR touches agg-state and function-name compatibility, which remains high-risk. I relied on the existing review threads for the already-raised mixed-version naming concerns and did not re-comment them inline.
- Parallel code paths: Not fully handled.
collect_listandcollect_setare parallel paths, but only thecollect_listcomplex-type path was implemented. - Special conditions: The new
INVALID_TYPEdistinct branch is inconsistent with the FEAnyDataTypesignature and the PR goal. - Test coverage: Insufficient. There is no targeted coverage for
collect_*v2 complex-type behavior, agg-state compatibility, or rolling-upgrade behavior. The only added test is unrelated and currently fails against the checked-in implementation. - Observability: No additional observability appears necessary for this change.
- Persistence / transactionality: No transaction or edit-log path is involved, but agg-state serialization compatibility is part of the changed surface.
- FE-BE protocol: Existing review context already covers additional FE to BE function-name compatibility risk. No duplicate inline comment submitted.
- Performance: The binary-serde direction is reasonable, but the incomplete
collect_setpath and missing targeted coverage block confidence in the change.
User focus points: none were provided in /tmp/review_focus.txt, and I did not find any extra issues beyond the complete PR review itself.
| const AggregateFunctionAttr& attr) { | ||
| if (distinct) { | ||
| if constexpr (T == INVALID_TYPE) { | ||
| throw Exception(ErrorCode::INTERNAL_ERROR, |
There was a problem hiding this comment.
collect_set now binds to collect_set_v2 in FE for the same AnyDataType signature as before, but this INVALID_TYPE branch still throws on every complex input. A query like SELECT collect_set(km) FROM test_array_agg_complex where km is MAP, ARRAY, or STRUCT will reach this path and fail with unexpected type for collect_set_v2, so the collect_set half of the advertised complex-type binary-serde feature is still missing. Either keep routing complex inputs to collect_set_v1 or add a V2 distinct implementation and coverage for the INVALID_TYPE path.
| // This assertion currently FAILS because the function overrides | ||
| // is_use_default_implementation_for_constants() to return false, | ||
| // breaking VExpr-level const-folding for all-const inputs. | ||
| EXPECT_TRUE(func->is_use_default_implementation_for_constants()) |
There was a problem hiding this comment.
This PR does not change array_match_*, but be/test/CMakeLists.txt picks up every *.cpp under be/test/, so this file becomes part of normal BE UT. The assertion here is currently false in checked-in code: ArrayMatchFunction::is_use_default_implementation_for_constants() explicitly returns false in be/src/exprs/function/array/varray_match_function.cpp:50, and DefaultFunction forwards that value unchanged. As written, the PR adds a deterministic BE unit-test failure unrelated to the collect_* change. If the array-match fix is intentional, the implementation change needs to be included here. Otherwise this test file should be dropped.
a45b4c0 to
85d2da2
Compare
|
/review |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
collect_listandcollect_setaggregate functions use JSON text serialization (serialize_one_cell_to_json/deserialize_one_cell_from_json) when handling complex or arbitrary-typed input columns (theINVALID_TYPEspecialization). JSON serde is significantly slower than binary serde and can introduce precision/correctness issues for certain types (e.g., large integers, nested structures).This PR introduces
collect_list_v2/collect_set_v2using the binary IColumn API (serialize_value_into_arena/deserialize_and_insert_from_arena) for all types that fall into the generic path, following the same v1/v2 upgrade-compatibility pattern used bymap_agg→map_agg_v2.Changes
BE:
aggregate_function_collect.h: AddAggregateFunctionCollectListDataV2struct that uses binary serde viaIColumn::serialize_value_into_arena/deserialize_and_insert_from_arena. AddIsV2template parameter toAggregateFunctionCollect.aggregate_function_collect.cpp: Registercollect_list_v1/collect_set_v1as aliases for backward compatibility (old serialized states remain readable).aggregate_function_collect_impl_v2.h/aggregate_function_collect_v2*.cpp: New registration entry point forcollect_list_v2andcollect_set_v2.aggregate_function_simple_factory.cpp: Callregister_aggregate_function_collect_list_v2.FE:
CollectListV2.java/CollectSetV2.java: New Nereids function classes bound tocollect_list_v2/collect_set_v2.BuiltinAggregateFunctions.java: Map user-facingcollect_list/group_arraytoCollectListV2(v2 as default); keepCollectListregistered ascollect_list_v1for upgrade compatibility.Upgrade compatibility
collect_list_v1; old BE handles it with JSON serde as beforecollect_list_v2; BE uses binary serdecollect_list_v1alias on new BE falls back to JSON serde pathRelease note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)