Skip to content

[opt](agg) use binary serde for collect_list/collect_set on complex types#62743

Closed
Mryange wants to merge 1 commit into
apache:masterfrom
Mryange:fix-collect_list
Closed

[opt](agg) use binary serde for collect_list/collect_set on complex types#62743
Mryange wants to merge 1 commit into
apache:masterfrom
Mryange:fix-collect_list

Conversation

@Mryange
Copy link
Copy Markdown
Contributor

@Mryange Mryange commented Apr 23, 2026

What problem does this PR solve?

collect_list and collect_set aggregate functions use JSON text serialization (serialize_one_cell_to_json / deserialize_one_cell_from_json) when handling complex or arbitrary-typed input columns (the INVALID_TYPE specialization). JSON serde is significantly slower than binary serde and can introduce precision/correctness issues for certain types (e.g., large integers, nested structures).

This PR introduces collect_list_v2 / collect_set_v2 using the binary IColumn API (serialize_value_into_arena / deserialize_and_insert_from_arena) for all types that fall into the generic path, following the same v1/v2 upgrade-compatibility pattern used by map_aggmap_agg_v2.

Changes

BE:

  • aggregate_function_collect.h: Add AggregateFunctionCollectListDataV2 struct that uses binary serde via IColumn::serialize_value_into_arena / deserialize_and_insert_from_arena. Add IsV2 template parameter to AggregateFunctionCollect.
  • aggregate_function_collect.cpp: Register collect_list_v1 / collect_set_v1 as aliases for backward compatibility (old serialized states remain readable).
  • aggregate_function_collect_impl_v2.h / aggregate_function_collect_v2*.cpp: New registration entry point for collect_list_v2 and collect_set_v2.
  • aggregate_function_simple_factory.cpp: Call register_aggregate_function_collect_list_v2.

FE:

  • CollectListV2.java / CollectSetV2.java: New Nereids function classes bound to collect_list_v2 / collect_set_v2.
  • BuiltinAggregateFunctions.java: Map user-facing collect_list / group_array to CollectListV2 (v2 as default); keep CollectList registered as collect_list_v1 for upgrade compatibility.

Upgrade compatibility

Scenario Behavior
Rolling upgrade (old BE + new FE) FE routes to collect_list_v1; old BE handles it with JSON serde as before
New BE + new FE FE routes to collect_list_v2; BE uses binary serde
Mixed cluster mid-upgrade collect_list_v1 alias on new BE falls back to JSON serde path

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 23, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found blocking issues in the rollout plumbing for collect_list_v2 / collect_set_v2.

  1. collect_set_v1 is registered, but it still goes through the old creator which only treats the exact name collect_set as the distinct variant. On a new BE, collect_set_v1 therefore instantiates list semantics and keeps duplicates.
  2. New FE default resolution now emits collect_list_v2 / collect_set_v2, but no exec-version compatibility rewrite was added. ExpressionTranslator and FunctionToThriftConverter serialize those names verbatim, while pre-change BEs only know collect_list / collect_set. The rolling-upgrade behavior described in the PR is not implemented.
  3. The default-name switch was not mirrored in AggStateType.aliasToName. agg_state<collect_list(...)> / agg_state<collect_set(...)> still canonicalize to the old names, while _state / _merge / _union combinators now build v2 names. Those names are compared literally in FE cast checks and BE agg-state execution, so agg-state columns/combinators can no longer interoperate.

Critical checkpoints

  • Goal: The new BE serializer itself moves the generic collect_list path to binary serde, but the end-to-end rollout goal is not met because compatibility and agg-state plumbing are incomplete.
  • Scope/focus: Mostly small, but the added test file is unrelated to the changed feature.
  • Concurrency: No new concurrency concerns in the touched path.
  • Lifecycle/config/compatibility: Yes, this is a function-name / serialized-state compatibility change, and the required compatibility handling is incomplete.
  • Parallel code paths: The agg-state/combinator path and rolling-upgrade path were not updated consistently.
  • Test coverage: No new test exercises collect_*_v2 on complex types, partial-state merge/deserialize, or the agg-state path.

User focus points

  • No additional user-provided focus points.

factory.register_function_both("collect_set", create_aggregate_function_collect);
// v1 names: FE may send these after being upgraded while BEs are still upgrading
factory.register_function_both("collect_list_v1", create_aggregate_function_collect);
factory.register_function_both("collect_set_v1", create_aggregate_function_collect);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect_set_v1 does not actually create a set aggregate here. create_aggregate_function_collect_impl() still uses bool distinct = name == "collect_set";, so this newly registered name goes down the collect_list branch and preserves duplicates. A concrete regression is the rollout path that is supposed to call collect_set_v1: on a new BE it will build list semantics instead of set semantics unless the helper also treats collect_set_v1 as the distinct variant.

* constructor with 1 argument.
*/
public CollectListV2(Expression arg) {
super("collect_list_v2", arg);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the user-visible collect_list name resolve to collect_list_v2 here is not enough for the rolling-upgrade story in the PR description. ExpressionTranslator / FunctionToThriftConverter serialize function.getName() verbatim, and I can't find the companion BeExecVersionManager compatibility registration on the BE side for collect_list_v2 / collect_set_v2. That means FE with an old be_exec_version still sends *_v2 to pre-change BEs, which only register collect_list / collect_set (plus the new *_v1 aliases from this PR) and will fail function lookup.

* constructor with 1 argument.
*/
public CollectSetV2(Expression arg) {
super("collect_set_v2", arg);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the default nested function name to collect_*_v2 also needs the same AggStateType.aliasToName update that map_agg and window_funnel already have. Right now agg_state<collect_list(...)> / agg_state<collect_set(...)> still canonicalize to collect_list / collect_set, while collect_list_state / collect_set_state and the corresponding _merge / _union combinators now build *_v2 names. Those names are compared literally in FE agg-state cast checks and in BE AggFnEvaluator, so agg-state columns and combinators stop matching each other.

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 23, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

I found two additional blocking issues beyond the existing review threads. I rechecked the prior review context and did not duplicate those comments.

Critical Checkpoints

  • Goal / correctness: Partially met. The new binary-serde path is implemented for collect_list, but collect_set still has no working complex-type path in BE, so the PR does not yet deliver the advertised collect_set half of the feature.
  • Scope / focus: Not focused enough. The PR also adds an unrelated array_match_* BE unit test file.
  • Concurrency: No new concurrency or lock-order issue found in the touched code paths.
  • Lifecycle / static init: No special lifecycle or static-initialization issue found in the changed paths.
  • Config changes: None.
  • Compatibility / incompatible changes: This PR touches agg-state and function-name compatibility, which remains high-risk. I relied on the existing review threads for the already-raised mixed-version naming concerns and did not re-comment them inline.
  • Parallel code paths: Not fully handled. collect_list and collect_set are parallel paths, but only the collect_list complex-type path was implemented.
  • Special conditions: The new INVALID_TYPE distinct branch is inconsistent with the FE AnyDataType signature and the PR goal.
  • Test coverage: Insufficient. There is no targeted coverage for collect_* v2 complex-type behavior, agg-state compatibility, or rolling-upgrade behavior. The only added test is unrelated and currently fails against the checked-in implementation.
  • Observability: No additional observability appears necessary for this change.
  • Persistence / transactionality: No transaction or edit-log path is involved, but agg-state serialization compatibility is part of the changed surface.
  • FE-BE protocol: Existing review context already covers additional FE to BE function-name compatibility risk. No duplicate inline comment submitted.
  • Performance: The binary-serde direction is reasonable, but the incomplete collect_set path and missing targeted coverage block confidence in the change.

User focus points: none were provided in /tmp/review_focus.txt, and I did not find any extra issues beyond the complete PR review itself.

const AggregateFunctionAttr& attr) {
if (distinct) {
if constexpr (T == INVALID_TYPE) {
throw Exception(ErrorCode::INTERNAL_ERROR,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect_set now binds to collect_set_v2 in FE for the same AnyDataType signature as before, but this INVALID_TYPE branch still throws on every complex input. A query like SELECT collect_set(km) FROM test_array_agg_complex where km is MAP, ARRAY, or STRUCT will reach this path and fail with unexpected type for collect_set_v2, so the collect_set half of the advertised complex-type binary-serde feature is still missing. Either keep routing complex inputs to collect_set_v1 or add a V2 distinct implementation and coverage for the INVALID_TYPE path.

// This assertion currently FAILS because the function overrides
// is_use_default_implementation_for_constants() to return false,
// breaking VExpr-level const-folding for all-const inputs.
EXPECT_TRUE(func->is_use_default_implementation_for_constants())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does not change array_match_*, but be/test/CMakeLists.txt picks up every *.cpp under be/test/, so this file becomes part of normal BE UT. The assertion here is currently false in checked-in code: ArrayMatchFunction::is_use_default_implementation_for_constants() explicitly returns false in be/src/exprs/function/array/varray_match_function.cpp:50, and DefaultFunction forwards that value unchanged. As written, the PR adds a deterministic BE unit-test failure unrelated to the collect_* change. If the array-match fix is intentional, the implementation change needs to be included here. Otherwise this test file should be dropped.

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 23, 2026

/review

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 24, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 12.00% (6/50) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 42.29% (74/175) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.39% (26740/37454)
Line Coverage 53.75% (279242/519507)
Region Coverage 47.11% (214586/455525)
Branch Coverage 50.41% (97137/192702)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 37.78% (34/90) 🎉
Increment coverage report
Complete coverage report

@Mryange Mryange closed this May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants