[feat](hll) add to_hll(bigint) scalar function#60996
[feat](hll) add to_hll(bigint) scalar function#60996HappenLee wants to merge 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
There was a problem hiding this comment.
Pull request overview
Adds a new built-in scalar function to_hll(BIGINT) -> HLL to construct an HLL value directly from a BIGINT (without hashing), with FE/BE registration and a BE unit test.
Changes:
- BE: implement and register
to_hllin vectorized HLL functions (including negative-input validation viaStatus::InvalidArgument). - FE (Nereids): add
ToHllscalar expression + visitor hook and register it as a built-in scalar function. - Tests: add BE unit test for basic positive inputs.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/ScalarFunctionVisitor.java | Adds a visitor entrypoint for the new ToHll scalar function. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/ToHll.java | Introduces the Nereids scalar function definition and signature BIGINT -> HLL. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinScalarFunctions.java | Registers to_hll as a built-in scalar function in FE. |
| be/src/vec/functions/function_hll.cpp | Implements ToHll BE vectorized function and registers it in the HLL function factory. |
| be/test/vec/function/function_hll_test.cpp | Adds a unit test covering several positive BIGINT inputs for to_hll. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (arg_is_nullable && (*nullmap)[i]) { | ||
| continue; | ||
| } else { | ||
| int64_t value = data[i]; | ||
| if (value < 0) { | ||
| return Status::InvalidArgument( | ||
| "to_hll does not support negative bigint value: {}", value); | ||
| } | ||
| res_data[i].update(static_cast<uint64_t>(value)); | ||
| } |
There was a problem hiding this comment.
In the non-nullable instantiation, nullmap is passed as nullptr, but the loop still references (*nullmap)[i] inside a normal if condition. While short-circuiting avoids evaluation when arg_is_nullable is false, rewriting this as an if constexpr (arg_is_nullable) block (or splitting nullable/non-nullable loops) avoids relying on that subtlety and keeps static analyzers/sanitizers from flagging a potential null dereference.
| if (arg_is_nullable && (*nullmap)[i]) { | |
| continue; | |
| } else { | |
| int64_t value = data[i]; | |
| if (value < 0) { | |
| return Status::InvalidArgument( | |
| "to_hll does not support negative bigint value: {}", value); | |
| } | |
| res_data[i].update(static_cast<uint64_t>(value)); | |
| } | |
| if constexpr (arg_is_nullable) { | |
| if ((*nullmap)[i]) { | |
| continue; | |
| } | |
| } | |
| int64_t value = data[i]; | |
| if (value < 0) { | |
| return Status::InvalidArgument( | |
| "to_hll does not support negative bigint value: {}", value); | |
| } | |
| res_data[i].update(static_cast<uint64_t>(value)); |
|
|
||
| DataSet data_set = {{{(int64_t)1}, &hll1}, | ||
| {{(int64_t)2}, &hll2}, | ||
| {{(int64_t)100}, &hll3}, | ||
| {{(int64_t)0}, &hll4}}; | ||
|
|
||
| static_cast<void>(check_function<DataTypeHLL>(func_name, input_types, data_set)); | ||
| } |
There was a problem hiding this comment.
The new to_hll behavior has an explicit error path for negative BIGINT values (returns Status::InvalidArgument) and a special null-handling path (skips update and returns a default/empty HLL). The unit test currently only covers a few positive values; please add coverage for at least one negative input (expect execute failure) and a NULL input row (verify it returns an empty HLL rather than NULL).
| DataSet data_set = {{{(int64_t)1}, &hll1}, | |
| {{(int64_t)2}, &hll2}, | |
| {{(int64_t)100}, &hll3}, | |
| {{(int64_t)0}, &hll4}}; | |
| static_cast<void>(check_function<DataTypeHLL>(func_name, input_types, data_set)); | |
| } | |
| HyperLogLog empty_hll; | |
| DataSet data_set = {{{(int64_t)1}, &hll1}, | |
| {{(int64_t)2}, &hll2}, | |
| {{(int64_t)100}, &hll3}, | |
| {{(int64_t)0}, &hll4}, | |
| {{Null()}, &empty_hll}}; | |
| static_cast<void>(check_function<DataTypeHLL, true>(func_name, input_types, data_set)); | |
| } | |
| TEST(function_hll_test, function_to_hll_negative_input_test) { | |
| std::string func_name = "to_hll"; | |
| InputTypeSet input_types = {PrimitiveType::TYPE_BIGINT}; | |
| HyperLogLog dummy_hll; | |
| // Negative BIGINT input should cause function execution to fail. | |
| DataSet data_set = {{{(int64_t)-1}, &dummy_hll}}; | |
| EXPECT_FALSE(check_function<DataTypeHLL>(func_name, input_types, data_set)); | |
| } |
|
run buidall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 28759 ms |
TPC-DS: Total hot run time: 183883 ms |
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
TPC-H: Total hot run time: 28842 ms |
TPC-DS: Total hot run time: 184251 ms |
Add a new built-in scalar function to_hll(bigint) that constructs an HLL value from a BIGINT directly, bypassing any hash — the bigint is fed as-is into HyperLogLog::update(uint64_t). Changes: - BE (vec/functions/function_hll.cpp): implement struct ToHll with Status-returning execute/vector/vector_nullable methods; register as FunctionAlwaysNotNullable<ToHll, true> so InvalidArgument status is propagated on negative input. - BE test (be/test/vec/function/function_hll_test.cpp): add unit test function_hll_test.function_to_hll_test covering values 0, 1, 2, 100. - FE Nereids (scalar/ToHll.java): new ScalarFunction with signature BIGINT -> HLL. - FE catalog (BuiltinScalarFunctions.java): register to_hll. - FE visitor (ScalarFunctionVisitor.java): add visitToHll visitor hook.
58dc8d4 to
697bd4d
Compare
|
run buildall |
TPC-H: Total hot run time: 28948 ms |
TPC-DS: Total hot run time: 183969 ms |
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
Add a new built-in scalar function to_hll(bigint) that constructs an HLL value from a BIGINT directly, bypassing any hash — the bigint is fed as-is into HyperLogLog::update(uint64_t).
Changes:
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)