[fix](be) Clean linear histogram decimal boundaries#63430
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found blocking correctness issues in the new floating-point rounding logic for linear_histogram. The change fixes the 0.1/0.3 presentation case covered by the new tests, but the tolerance is also applied to bucket assignment and scale inference in ways that change valid query results.
Critical checkpoint conclusions:
- Goal/test: The PR aims to avoid floating-point artifacts in
linear_histogram; the added BE and regression tests cover the basic0.1interval case, but not large near-boundary inputs or very small intervals. - Scope: The PR is small and focused, but the helper logic changes bucket assignment semantics, not only JSON boundary formatting.
- Concurrency/lifecycle/config/compatibility/persistence: No new concurrency, lifecycle, configuration, protocol, storage-format, transaction, or persistence concerns found.
- Parallel paths: FE signatures are unchanged and BE registration remains unchanged; no other function path appears to need updates for this PR.
- Error handling/data correctness: Data correctness is affected because some valid values are assigned to the wrong bucket, and some valid interval boundaries are serialized incorrectly.
- Tests: Current tests are insufficient for these edge cases. Existing prior review context was empty, and there were no additional user focus points to address.
Please fix the rounding rules so they preserve mathematical bucket assignment and do not round small nonzero intervals to integer-scale boundaries.
|
|
||
| private: | ||
| double bucket_key(double value) const { | ||
| double key = (value - offset) / interval; |
There was a problem hiding this comment.
This tolerance changes bucket assignment for valid values that are close to, but still below, a boundary. For example, with interval = 1, offset = 0, and value = 999999999.999999, the mathematical bucket is floor(999999999.999999) = 999999999, but the tolerance is about 1.8e-6, so this branch rounds it to 1000000000. Near the configured upper key limit it can also turn a valid last-bucket value such as 2147483646.999999 into 2147483647 and throw exceeds the bucket range limit. The fix should avoid using an absolute key-space tolerance that can move real input values across bucket boundaries; restrict correction to the specific floating artifact being addressed or keep exact floor semantics for bucket selection.
| int32_t decimal_scale(double value) const { | ||
| double multiplier = 1; | ||
| for (int32_t scale = 0; scale <= MAX_BOUNDARY_SCALE; ++scale, multiplier *= 10) { | ||
| double scaled_value = value * multiplier; |
There was a problem hiding this comment.
decimal_scale() treats very small nonzero values as scale 0 because the scale-0 tolerance is around 1.8e-15. For interval = 1e-16 and offset = 0, this returns boundary scale 0, so bucket_boundary() rounds boundaries like 1e-16 and 2e-16 to 0, serializing buckets with identical/wrong lower and upper values. That regresses valid positive intervals compared with the previous direct double boundary output. Please make scale inference reject values that round to zero unless the original value is actually zero, or otherwise avoid applying decimal rounding when it would erase nonzero boundaries.
|
/review |
There was a problem hiding this comment.
The PR is small and focused, but I cannot approve it because the existing inline threads identify blocking data-correctness regressions in the new floating-point rounding logic. I did not find any additional distinct issues beyond those already raised, so I am not duplicating inline comments.
Critical checkpoint conclusions:
- Goal/test: The PR aims to fix
linear_histogramdecimal bucket assignment/output artifacts. The added BE unit test and regression case cover the simple0.1boundary case, but they do not prove correctness for near-boundary large values or very small positive intervals already called out in the existing threads. - Scope: The change is small and focused, but the new helpers affect bucket assignment semantics as well as JSON boundary formatting.
- Concurrency/lifecycle/config/compatibility/persistence: No new concurrency, lifecycle, configuration, protocol, storage-format, transaction, or persistence concerns were found.
- Parallel paths: BE registration and FE signatures are unchanged; no missing parallel implementation path was found.
- Error handling/data correctness: Data correctness remains blocked by the known cases where valid values can move to the wrong bucket and very small nonzero boundaries can serialize incorrectly.
- Tests/results: The new results match the covered case and appear deterministic, but coverage is insufficient for the edge cases described in the existing inline review.
- Observability/performance: No new observability requirement or material performance issue was found for this localized aggregate-function fix.
- User focus: No additional user-provided review focus was specified.
Please address the two existing inline comments before this can be approved.
915f6cc to
3f6892d
Compare
|
/review |
There was a problem hiding this comment.
I completed the PR review against the actual GitHub PR diff. I did not add duplicate inline comments because the existing review threads already cover the blocking issues I still see in aggregate_function_linear_histogram.h: bucket assignment can still move valid near-boundary values across bucket boundaries, and decimal scale inference can still round very small nonzero boundaries to zero.
Critical checkpoint conclusions:
- Goal and tests: The PR aims to clean decimal boundary output for
linear_histogramand adds BE unit/regression coverage for normal decimal boundaries. The current implementation is not fully correct for edge decimal/floating cases already called out in existing threads. - Scope/focus: The code is focused on histogram boundary normalization, but the correction changes bucket assignment semantics in edge cases.
- Concurrency/lifecycle: No new concurrency or special lifecycle concerns found.
- Configuration/compatibility: No config change. Aggregate state serialization format is unchanged; no new compatibility issue found.
- Parallel paths: The BE unit and regression paths were checked; no additional parallel implementation was found for this aggregate.
- Tests: Added tests cover common decimal cases and some near-boundary cases, but they do not cover the existing blocking small-nonzero interval case and the remaining boundary-crossing risk.
- Observability/transactions/data writes: Not applicable.
- Performance: The added per-row scale inference is cached per parameter set; no additional performance blocker found.
User focus response: No additional user-provided review focus was supplied.
Please address the existing inline review threads before this can be approved.
|
run buildall |
3f6892d to
6b816ed
Compare
|
/review |
There was a problem hiding this comment.
I found one remaining correctness issue in the updated boundary snapping logic. The existing review focus file had no additional focus points; I reviewed the full PR and paid particular attention to the already-raised histogram boundary/scale concerns without duplicating them.
Critical checkpoint conclusions:
- Goal/test coverage: the PR addresses decimal boundary artifacts in LINEAR_HISTOGRAM and adds BE unit plus regression coverage, but the implementation still mis-buckets valid near-decimal interval values.
- Scope/clarity: the change is focused on histogram bucket assignment/output formatting.
- Concurrency/lifecycle/config/transaction/storage compatibility: no new concurrency, lifecycle, config, transaction, or storage-format concerns were identified. Aggregate state serialization remains unchanged.
- Parallel paths: FE/BE function metadata paths are not changed; the affected BE aggregation path is covered.
- Tests/results: tests cover the previous key-space tolerance and tiny interval issues, but they do not cover intervals that are close to, but not exactly, the inferred decimal scale.
- Observability/performance: no additional observability requirement or material performance issue found for this small arithmetic path.
- User focus: no additional user-provided review focus was specified.
Please fix the remaining bucket-assignment regression before merging.
| if (boundary_decimal_scale != NO_BOUNDARY_SCALE) { | ||
| double raw_boundary = rounded_key * interval + offset; | ||
| double boundary = normalize_boundary(raw_boundary, boundary_decimal_scale); | ||
| if (value == boundary) { |
There was a problem hiding this comment.
This can still move valid values across the true bucket boundary when decimal_scale() classifies an interval that is only near a decimal value. For example, with interval = 0.10000000000000002 and offset = 0, decimal_scale(interval) returns scale 1 because interval * 10 is within the epsilon tolerance of 1. Then bucket_key(0.1) normalizes the rounded raw boundary back to 0.1, this equality succeeds, and the value is assigned to key 1, even though (0.1 - 0) / 0.10000000000000002 is below 1 and should floor to bucket 0. This is distinct from the already-raised key-space tolerance issue because the current code removed that tolerance but still gets the same kind of mis-bucketing through tolerant scale inference plus normalized-boundary equality. Please only enable decimal boundary snapping when the interval/offset are exactly represented by the inferred scale, or otherwise ensure snapping cannot cross a raw boundary that is greater than the input value.
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: LINEAR_HISTOGRAM used raw floating-point arithmetic both when assigning values to bucket indexes and when emitting bucket boundaries. Decimal intervals such as 0.1 could therefore place a value exactly on a decimal boundary into the previous bucket and expose artifacts such as 0.30000000000000004 in the JSON result. This patch rounds emitted boundaries only when the interval and offset have an exactly inferable bounded decimal scale. It keeps raw boundaries for near-decimal intervals such as 0.10000000000000002, so decimal-boundary snapping cannot move a value across a raw bucket boundary that is greater than the input value.
### Release note
Fix LINEAR_HISTOGRAM JSON output for decimal bucket boundaries.
### Check List (For Author)
- Test:
- Unit Test: ./run-be-ut.sh --run --filter=AggLinearHistogramTest.*
- Regression test: doris-local-regression start/run with offset 46001 and ./run-regression-test.sh --conf output/local-regression/regression-conf-46001.groovy --run -d feature_review_repro/agg_function -s AGG-HIST-001
- Format check: build-support/clang-format.sh; build-support/check-format.sh; git diff --check
- Behavior changed: Yes. Decimal bucket boundaries no longer expose floating-point artifacts, and near-decimal intervals keep raw bucket semantics instead of snapping values across true raw boundaries.
- Does this need documentation: No
6b816ed to
f6bfc5f
Compare
|
/review |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
There was a problem hiding this comment.
Review result: no additional blocking issues found in the current diff.\n\nCritical checkpoint conclusions:\n- Goal/test coverage: The change targets LINEAR_HISTOGRAM decimal boundary assignment/output artifacts, and the added BE unit tests plus regression case cover decimal boundaries, near-boundary values, near-decimal intervals, large bucket keys, and tiny nonzero intervals.\n- Scope/focus: The implementation is localized to linear_histogram boundary/key handling with focused tests.\n- Concurrency/lifecycle: No new shared mutable state, threads, locks, or special lifecycle concerns were introduced; aggregate state serialization format remains compatible because the derived scale is recomputed from serialized interval/offset.\n- Compatibility/persistence: No storage format, edit log, or FE-BE protocol changes. Serialized aggregate state layout is unchanged.\n- Parallel paths: Serialize/deserialize and merge paths propagate or recompute the derived boundary scale consistently.\n- Error handling: Existing invalid interval/offset and bucket range checks remain intact; no ignored Status or new exception boundary issue found.\n- Performance/memory: Added work is constant-time per row and per emitted boundary; no significant new allocations in hot add path.\n- Test result review: Added regression output matches the covered scenarios and uses ordered query output.\n- Observability: Not applicable; this is deterministic scalar aggregate formatting/assignment behavior and does not add distributed operational paths.\n\nUser focus: no additional user-provided review focus was specified.
TPC-H: Total hot run time: 31152 ms |
|
run buildall |
TPC-DS: Total hot run time: 170014 ms |
TPC-H: Total hot run time: 30954 ms |
TPC-DS: Total hot run time: 169494 ms |
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: LINEAR_HISTOGRAM used raw floating-point arithmetic when assigning values to bucket indexes and when emitting bucket boundaries. Decimal intervals such as
0.1could therefore place a value exactly on a decimal boundary into the previous bucket and expose artifacts such as0.30000000000000004in the JSON result. This patch rounds emitted boundaries only when the interval and offset have an exactly inferable bounded decimal scale. For near-decimal intervals such as0.10000000000000002, it keeps raw bucket boundaries so decimal-boundary snapping cannot move a value across a true raw boundary that is greater than the input value.Release note
Fix LINEAR_HISTOGRAM JSON output for decimal bucket boundaries.
Check List (For Author)
./build.sh --be --fethroughdoris-local-regression build./run-be-ut.sh --run --filter=AggLinearHistogramTest.*./run-regression-test.sh --conf output/local-regression/regression-conf-46001.groovy --run -d feature_review_repro/agg_function -s AGG-HIST-001build-support/clang-format.sh,build-support/check-format.sh,git diff --checkbuild-support/run-clang-tidy.sh --build-dir be/ut_build_ASANattempted, but clang-tidy could not analyze the changed files in this local environment because the toolchain reported missingstddef.hand a pre-existing unmatchedNOLINTENDinbe/src/core/types.h.