Skip to content

[fix](be) Preserve Variant predefined decimal precision#63322

Open
eldenmoon wants to merge 1 commit into
apache:masterfrom
eldenmoon:codex/cir-20240-variant-precision-pr
Open

[fix](be) Preserve Variant predefined decimal precision#63322
eldenmoon wants to merge 1 commit into
apache:masterfrom
eldenmoon:codex/cir-20240-variant-precision-pr

Conversation

@eldenmoon
Copy link
Copy Markdown
Member

@eldenmoon eldenmoon commented May 17, 2026

What problem does this PR solve?

Issue Number: N/A (CIR-20240)

Related PR: N/A

Problem Summary: Variant predefined decimal typed paths could lose numeric precision because matching JSON numbers were parsed through the default numeric DOM representation before decimal cast. This PR preserves raw numeric tokens only when materializing Variant columns with predefined decimal typed paths, including array and glob decimal paths, while leaving ordinary untyped Variant parsing on the existing DOM parser path.

Release note

Fix Variant predefined decimal typed path precision loss for high precision and scientific notation JSON numbers.

Check List (For Author)

  • Test:
    • BE unit test: ./run-be-ut.sh --run --filter='JsonParserTest.*'
    • Regression test: ./run-regression-test.sh --conf tmp/regression-conf.auto.groovy --run -d variant_p0/predefine -s test_variant_high_precision_decimal
    • Build: BUILD_TYPE=ASAN USE_MEM_TRACKER=ON ./build.sh --be --fe
    • Format: build-support/clang-format.sh, build-support/check-format.sh
    • Static analysis: build-support/run-clang-tidy.sh attempted, but clang-tidy was not found in the local environment.
  • Behavior changed: Yes. Variant predefined decimal typed paths preserve matching JSON numeric tokens for decimal materialization.
  • Does this need documentation: No

Copilot AI review requested due to automatic review settings May 17, 2026 09:38
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes precision loss for Variant typed decimal extraction by preserving high-precision JSON numeric tokens as exact text during parsing, and ensuring typed-path casts keep Variant’s nullable/array nullability semantics. It also adds targeted unit/regression coverage for high-precision decimals in both scalar and array paths.

Changes:

  • Preserve raw JSON numeric tokens (via simdjson on-demand parsing) and materialize precision-sensitive numbers as strings for later exact decimal casts.
  • Preserve Variant nullability (including nested array nullability) when converting typed paths to storage types.
  • Add unit + regression tests covering high-precision decimals in Variant scalar and array typed paths.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
regression-test/suites/variant_p0/predefine/test_variant_high_precision_decimal.groovy New regression test validating high-precision decimal preservation for Variant typed paths (scalar + array).
be/test/core/jsonb/json_parser_test.cpp Unit tests asserting high-precision numeric tokens are preserved as string fields (including inside arrays).
be/test/core/column/column_variant_test.cpp Unit test validating typed-path conversion + serialization preserves decimal text/precision for scalars and arrays.
be/src/util/json/simd_json_parser.h Refactors SimdJSONParser to build an internal node tree from simdjson on-demand and exposes raw number tokens.
be/src/util/json/json_parser.h Adds heuristic to preserve precision-sensitive doubles as strings based on raw token digits.
be/src/exprs/function/cast/cast_to_decimal.h Enhances string→decimal cast to accept nullable string columns and propagate nulls.
be/src/core/column/column_variant.cpp Preserves Variant nullability when casting typed paths to declared storage types (incl. arrays).
Comments suppressed due to low confidence (1)

be/src/util/json/json_parser.h:30

  • json_parser.h introduces should_preserve_json_number_as_string(std::string_view ...) but the header does not include <string_view>. Relying on transitive includes is brittle; add #include <string_view> to ensure this header compiles standalone.
#include <parallel_hashmap/phmap.h>

#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread be/src/util/json/simd_json_parser.h
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking correctness regression. The PR addresses typed Variant decimal precision for the tested path, but the parser change currently applies to all Variant subcolumn extraction and changes some JSON numbers into strings when the path is not declared as a typed decimal path.

Critical checkpoints:

  • Goal/test: The PR fixes the covered typed decimal path case and adds BE/regression coverage for that case, but misses untyped Variant numeric semantics.
  • Scope/focus: The implementation is broader than the stated goal because generic JSON parsing now changes inferred Field type based on token precision.
  • Concurrency/lifecycle: No new shared concurrency or non-obvious lifecycle issue identified in the reviewed paths.
  • Compatibility/behavior: Behavior changes for existing Variant data: high-precision JSON numeric tokens on untyped paths can become strings. This is user-visible and not covered by compatibility/tests.
  • Parallel paths: Doc-value/jsonb writing still treats doubles as doubles; the subcolumn path is the one affected by the string conversion.
  • Tests: Added tests cover typed decimal conversion, but no negative/regression coverage proves ordinary untyped numbers remain numeric.
  • Performance/memory/observability: No additional blocking issue found beyond the correctness problem above.

User focus: No additional user-provided review focus was specified.

Comment thread be/src/util/json/json_parser.h Outdated
@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from ecba866 to 9ccf680 Compare May 17, 2026 10:19
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional correctness issue beyond the existing review threads.

Critical checkpoint conclusions:

  • Goal/test coverage: The PR aims to preserve high-precision JSON numbers for Variant decimal typed paths and adds BE/regression coverage, but scientific-notation decimal tokens remain uncovered and can still lose precision.
  • Scope/focus: The change is mostly focused on Variant JSON parsing and typed-path decimal conversion, though the simdjson wrapper rewrite is broader than the local fix.
  • Concurrency/lifecycle: No new shared mutable runtime state or lock-order concern found in the changed path. Parser-owned node lifetimes are local to each parse call.
  • Configuration/compatibility: No new config or storage-format change found. Existing typed path schema metadata is used to drive parsing.
  • Parallel paths: Typed decimal preservation is wired into normal Variant ingest, but the heuristic misses exponent notation on the same path.
  • Error handling/memory safety: No ignored Status issue found in the reviewed changes. The parser rewrite adds extra allocations, but I did not find a separate blocking memory-safety issue.
  • Data correctness: Blocking issue below: some valid decimal typed-path inputs still round through double before decimal conversion.
  • Observability: No additional observability requirement identified for this local parsing/casting fix.

User focus: No additional user-provided review focus was specified.

Comment thread be/src/util/json/json_parser.h Outdated
@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 9ccf680 to 832ca14 Compare May 17, 2026 10:37
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional correctness issue. The existing review threads already cover the raw-number heuristic/scientific-notation precision cases and the missing self-contained <cassert> include, so I did not duplicate those.

Critical checkpoint conclusions:

  • Goal/test: the PR aims to preserve high-precision decimal JSON tokens for Variant predefined decimal paths; it adds non-doc-mode unit/regression coverage, but doc-mode predefined typed paths are not covered and still lose precision.
  • Scope: the change is focused, but the fix is incomplete across the parallel doc-mode write path.
  • Concurrency/lifecycle: no new shared mutable concurrency or special lifecycle issue found in the reviewed change.
  • Config/compatibility: no new external config or storage-format compatibility change found.
  • Parallel paths: doc-mode Variant materialization is a parallel path and needs the same decimal-preservation setup.
  • Tests: add a doc-mode predefined decimal regression case, preferably under variant_p0/doc_mode/predefine, to prove the fix.
  • Observability/transactions/data writes: no additional logging or transaction/persistence concern found beyond the data-correctness issue below.
  • Performance: the parser now materializes/copies the JSON tree to retain raw tokens; this is worth watching on insert hot paths, but I did not mark it blocking without benchmark evidence.

User focus: no additional user-provided review focus was supplied.

Comment thread be/src/exec/common/variant_util.cpp
@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 832ca14 to 43b1a14 Compare May 17, 2026 10:47
@eldenmoon
Copy link
Copy Markdown
Member Author

/review

@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completed a full pass over the PR diff and the existing inline review context. I am requesting changes based on the already-open inline threads rather than adding duplicate comments.

Summary opinion: the PR is targeted at preserving high-precision JSON decimal tokens for predefined Variant decimal paths, and the current non-doc-mode path is closer to the goal. However, the implementation is still incomplete for doc-mode predefined Variant columns, where the parser is configured as OnlyDocValueColumn and the decimal raw-number preserve paths are still not installed. be/src/util/json/simd_json_parser.h also still uses assert() without including <cassert>, so the header remains non-self-contained. I did not find additional distinct issues beyond the existing threads.

Critical checkpoint conclusions:

  • Goal and tests: the non-doc-mode typed decimal case has regression/unit coverage, but doc-mode typed decimal coverage is still missing and the doc-mode path does not accomplish the precision-preservation goal.
  • Scope/focus: the functional change is mostly focused, though the simdjson wrapper refactor is larger than the minimal parser hook; no separate blocking issue found beyond the existing header/self-containment concern.
  • Concurrency: no new shared mutable cross-thread state or lock interaction was introduced in the reviewed paths.
  • Lifecycle/static initialization: no new static initialization dependency was found. Parser element lifetimes are local to parsing/traversal and copied into the new tree before the padded input is destroyed.
  • Configuration: no new configuration items were added.
  • Compatibility: no storage-format or FE-BE protocol change was found. The remaining compatibility risk is behavioral inconsistency for doc-mode predefined decimal Variant paths.
  • Parallel paths: doc-mode is the parallel insert/materialization path that still needs the same decimal preserve-path setup.
  • Conditional checks: the typed-path preservation condition is now path-gated; no additional distinct condition issue found.
  • Test coverage/results: tests cover non-doc mode, arrays, glob matching, and scientific notation, but not doc-mode predefined decimals.
  • Observability: no new observability requirement identified for this parser/cast fix.
  • Transactions/persistence/data writes: no new transaction or persistence metadata path was introduced.
  • Performance: the JSON parser wrapper now materializes a copied node tree, but I did not identify a concrete correctness blocker from this beyond the existing issues.
  • User focus: no additional user-provided review focus was specified.

@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 43b1a14 to 7885a6c Compare May 17, 2026 11:08
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional correctness issue in the nullable string-to-decimal cast path. Existing inline threads already cover the remaining blockers around doc-mode typed decimal preservation and parser/header concerns, so I did not duplicate those comments.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to preserve exact JSON numeric text for predefined decimal Variant typed paths and adds unit/regression coverage, but doc-mode typed paths remain covered by an existing unresolved thread and nullable decimal conversion needs more coverage.
  • Scope: The change is mostly focused, but the parser replacement adds broad allocation/copying behavior; no separate blocker found beyond the submitted issue and existing comments.
  • Concurrency/lifecycle: No new shared concurrent state or non-obvious lifecycle risk identified; parser result references are copied into the owned tree before traversal.
  • Config/compatibility: No new config or storage format incompatibility identified.
  • Parallel paths: Non-doc-mode path is configured; doc-mode path remains an existing unresolved review concern.
  • Error handling/data correctness: The new strict nullable cast path can misparse rows because skipped null rows do not advance string offsets in the decimal serde.
  • Tests: Added tests cover high-precision predefined decimals in non-doc mode, arrays, globs, and exponent cases; missing coverage remains for doc mode (already raised) and nullable/missing typed decimal rows.
  • Observability/performance: No new observability requirement identified; parser memory overhead is higher but not raised as a blocking issue.

User focus: No additional user-provided review focus was specified.

Comment thread be/src/exprs/function/cast/cast_to_decimal.h
@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31288 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9ccf680bcc7622c4472bb012526de8e07acd5055, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17710	3972	3884	3884
q2	q3	10805	1389	829	829
q4	4686	479	344	344
q5	7561	2298	2145	2145
q6	362	178	139	139
q7	971	774	652	652
q8	9641	1721	1729	1721
q9	6702	4974	4944	4944
q10	6470	2150	1828	1828
q11	454	273	247	247
q12	700	420	295	295
q13	18248	3545	2804	2804
q14	264	255	237	237
q15	q16	823	776	705	705
q17	934	877	947	877
q18	6949	5687	5721	5687
q19	1250	1270	1064	1064
q20	496	418	255	255
q21	5749	2580	2326	2326
q22	444	377	305	305
Total cold run time: 101219 ms
Total hot run time: 31288 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4201	4114	4127	4114
q2	q3	4514	4886	4366	4366
q4	2138	2196	1421	1421
q5	4426	4305	4617	4305
q6	249	193	146	146
q7	2061	1850	1599	1599
q8	2493	2099	2144	2099
q9	7794	7881	7766	7766
q10	4558	4497	4326	4326
q11	617	456	385	385
q12	758	732	535	535
q13	3258	3683	3044	3044
q14	292	318	294	294
q15	q16	720	768	658	658
q17	1308	1322	1317	1317
q18	7812	7413	6904	6904
q19	1127	1086	1102	1086
q20	2233	2233	1938	1938
q21	5373	4660	4564	4564
q22	529	474	406	406
Total cold run time: 56461 ms
Total hot run time: 51273 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31294 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 43b1a1432eb96745bdb80b377765e638551b4e56, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17684	3930	3838	3838
q2	q3	10778	1358	779	779
q4	4686	474	339	339
q5	7570	2248	2114	2114
q6	389	176	138	138
q7	938	783	619	619
q8	9482	1809	1632	1632
q9	7044	4914	4867	4867
q10	6442	2094	1812	1812
q11	441	276	250	250
q12	693	422	296	296
q13	18244	3462	2833	2833
q14	271	255	235	235
q15	q16	823	776	706	706
q17	977	938	927	927
q18	6849	5646	5541	5541
q19	1187	1270	1238	1238
q20	533	418	278	278
q21	6104	2833	2542	2542
q22	456	370	310	310
Total cold run time: 101591 ms
Total hot run time: 31294 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4695	4548	4689	4548
q2	q3	4887	5278	4672	4672
q4	2120	2217	1431	1431
q5	4873	4738	4655	4655
q6	235	183	129	129
q7	1888	1691	1578	1578
q8	2404	1935	1920	1920
q9	7305	7282	7213	7213
q10	4521	4430	3956	3956
q11	544	383	383	383
q12	709	721	507	507
q13	2993	3495	2787	2787
q14	270	287	247	247
q15	q16	677	709	624	624
q17	1272	1229	1251	1229
q18	7358	6840	6842	6840
q19	1152	1085	1127	1085
q20	2219	2219	1937	1937
q21	5365	4717	4535	4535
q22	529	472	411	411
Total cold run time: 56016 ms
Total hot run time: 50687 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31205 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 832ca14461bc0faf4fbf9b0d2a91c7df1111f393, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17647	3855	3817	3817
q2	q3	10758	1360	803	803
q4	4705	479	344	344
q5	8069	2303	2126	2126
q6	386	176	138	138
q7	955	770	647	647
q8	9675	1690	1594	1594
q9	7088	4964	4934	4934
q10	6460	2120	1810	1810
q11	441	280	248	248
q12	697	429	298	298
q13	18232	3396	2793	2793
q14	264	262	237	237
q15	q16	813	772	701	701
q17	920	886	1021	886
q18	6905	5624	5516	5516
q19	1194	1440	1122	1122
q20	498	409	275	275
q21	6230	2743	2613	2613
q22	462	380	303	303
Total cold run time: 102399 ms
Total hot run time: 31205 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4623	4679	4591	4591
q2	q3	4865	5289	4606	4606
q4	2148	2317	1468	1468
q5	4825	4698	4662	4662
q6	233	173	127	127
q7	1866	1744	1380	1380
q8	2208	1925	1905	1905
q9	7222	7185	7156	7156
q10	4557	4422	3997	3997
q11	536	384	352	352
q12	716	721	512	512
q13	3012	3393	2779	2779
q14	266	276	251	251
q15	q16	676	703	625	625
q17	1264	1234	1229	1229
q18	7313	6907	6764	6764
q19	1145	1134	1105	1105
q20	2205	2211	1938	1938
q21	5329	4666	4504	4504
q22	518	454	420	420
Total cold run time: 55527 ms
Total hot run time: 50371 ms

@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 7885a6c to 85fcf56 Compare May 17, 2026 11:41
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 30897 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7885a6ce76862cca4cd5fd9cc18bd2d7185fb161, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17624	3801	3808	3801
q2	q3	10822	1327	851	851
q4	4680	469	357	357
q5	7595	2284	2081	2081
q6	240	177	137	137
q7	929	754	651	651
q8	9422	1780	1582	1582
q9	6585	4875	4841	4841
q10	6446	2111	1809	1809
q11	438	264	242	242
q12	699	429	291	291
q13	18190	3354	2742	2742
q14	268	254	234	234
q15	q16	814	781	702	702
q17	950	969	987	969
q18	6741	5630	5556	5556
q19	1181	1349	1052	1052
q20	529	410	258	258
q21	5902	2593	2437	2437
q22	432	368	304	304
Total cold run time: 100487 ms
Total hot run time: 30897 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4170	4076	4055	4055
q2	q3	4480	4864	4305	4305
q4	2090	2193	1375	1375
q5	4317	4301	4236	4236
q6	232	186	142	142
q7	2099	1928	1768	1768
q8	2488	2175	2103	2103
q9	8040	7818	7854	7818
q10	4638	4567	4118	4118
q11	600	417	385	385
q12	930	785	559	559
q13	3296	3552	2995	2995
q14	291	300	271	271
q15	q16	717	763	628	628
q17	1336	1298	1299	1298
q18	7931	7265	7050	7050
q19	1160	1067	1110	1067
q20	2216	2191	1930	1930
q21	5220	4542	4384	4384
q22	508	455	405	405
Total cold run time: 56759 ms
Total hot run time: 50892 ms

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR still has a blocking correctness issue. The main goal is to preserve predefined Variant decimal precision during JSON parsing/materialization; the added regression coverage covers scalar and array-of-scalar typed decimal paths, but not typed decimal paths inside nested arrays serialized through JSONB.

Critical checkpoint conclusions:

  • Goal/test coverage: the goal is only partially achieved; decimal precision can still be lost in the nested-array JSONB serialization path, and there is no regression for that path.
  • Scope/focus: the core change is focused on JSON parsing/materialization, but the raw-number preservation decision is applied only to the scalar Field path.
  • Concurrency/lifecycle/config compatibility: no new concurrency, lifecycle, persistent format, or dynamic config issue found in the reviewed diff.
  • Parallel paths: existing review threads already cover doc-mode/raw-number, big-integer, nullable, and parser hot-path concerns; I did not repeat them.
  • Data correctness: blocking issue below can round high-precision typed decimals before materialization when the value is inside an array containing objects.
  • Performance/observability: no additional distinct blocking issue beyond already-open raw parser allocation comments.
  • User focus: no additional user-provided focus was specified.

Comment thread be/src/util/json/json_parser.cpp
@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 9aa9fa9 to 67a3b9f Compare May 18, 2026 04:44
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not add duplicate inline comments because the blocking issues I found are already covered by existing review threads. The PR still needs changes before it is safe to merge.

Critical checkpoint conclusions:

  • Goal/test proof: the goal is to preserve high-precision JSON numeric tokens for predefined Variant decimal paths. The added tests cover several scalar, array, glob, insert, and staged stream-load cases, but existing review threads identify untested failing cases such as doc mode, nullable direct Variant load, decimal paths inside array-of-object JSONB serialization, unrelated oversized integers, and Decimal256 big integers.
  • Scope/focus: the change is not yet small/focused enough for the hot parser path because it replaces the simdjson DOM wrapper with an eager copied Node tree for all parses, which is already called out in the existing performance/memory thread.
  • Concurrency: no new shared mutable concurrency path was found in the changed parser/config code. The glob regex cache uses the existing mutex-protected cache path.
  • Lifecycle/static initialization: no new cross-TU static initialization dependency was found. The parser Node tree lifetime is local to the parser instance and consumed before the next parse.
  • Configuration: no new user config item is added. Existing session/config variables in tests are relevant to the feature.
  • Compatibility: no storage format or FE-BE protocol change was found. Behavior compatibility remains a concern for Variant parsing edge cases already raised in the existing threads.
  • Parallel paths: doc-mode Variant parsing and direct stream-load/serde paths are not fully covered by the current preservation setup, as already noted in existing threads.
  • Conditional checks: the new raw-number preservation conditions are path-dependent, but existing comments show distinct paths where the condition is bypassed or causes fallback/loss.
  • Test coverage: coverage is improved, but missing negative/end-to-end coverage remains for the existing blocking cases.
  • Test results: I reviewed the added test assertions; no additional result-file issue was found.
  • Observability: no new observability requirement was identified.
  • Transaction/persistence/data writes: no EditLog or transaction protocol changes were found. Data correctness is still blocked by the existing Variant materialization and parse fallback issues.
  • FE/BE variable passing: not applicable.
  • Performance: the eager copied JSON tree and untracked allocations are already raised as a blocking performance/memory concern.
  • Other issues: no additional distinct issue beyond the existing inline review context was found.

User focus: no additional user-provided review focus was present, and I found no extra focus-specific issue.

@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 67a3b9f to 65ddea5 Compare May 18, 2026 05:02
@eldenmoon
Copy link
Copy Markdown
Member Author

/review

@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes based on the unresolved blocking issues already raised in the existing inline review threads. I did not add new inline comments because the review context already covers the distinct blocking defects I found, and duplicating those threads is explicitly disallowed.

Critical checkpoint conclusions:

  • Goal/test proof: the PR aims to preserve high-precision JSON numeric tokens for predefined Variant decimal paths and adds unit/regression coverage, but the current implementation still misses important ingestion/materialization paths covered by existing threads, including doc-mode predefined decimal paths and decimal paths serialized through nested array JSONB.
  • Scope/clarity: the change is focused on Variant JSON parsing, but it introduces a full ondemand Node tree on the hot path and extra fallback behavior for oversized integers; existing comments identify correctness and performance risks that need resolution.
  • Concurrency/lifecycle: no new concurrency or cross-thread lifecycle issue was found in this pass. Parser-owned Node lifetime appears local to parse traversal, though the added eager tree allocations are a performance/memory-accounting concern already raised.
  • Configuration: no new config item is added. Existing Variant doc-mode configuration path is still not applying decimal preservation, as already commented.
  • Compatibility/storage format: no serialized format change was identified, but behavior compatibility for untyped or unrelated oversized numeric Variant paths remains blocked by the existing comments.
  • Parallel code paths: doc-mode Variant and JSONB nested-array paths remain parallel paths that do not get equivalent raw decimal preservation, already covered by existing threads.
  • Conditional checks: the new big-integer invalid/fallback handling still has known edge cases with unrelated paths; no additional distinct condition issue was found.
  • Test coverage: tests were added, but existing comments identify missing regressions for doc mode, nested array-of-object decimal paths, nullable/missing typed decimal casting, and unrelated oversized integers.
  • Observability: no additional observability requirement was identified.
  • Transaction/persistence/data writes: no direct transaction or edit-log change was found; ingestion correctness remains the main data correctness risk.
  • FE/BE protocol: no FE/BE protocol change was found.
  • Performance: the eager recursive JSON tree allocation remains a blocking performance/memory-accounting concern already raised.

Focus points: the focus file states no additional user-provided review focus, and I found no extra focus-specific issue beyond the existing review threads.

@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 65ddea5 to 624b378 Compare May 18, 2026 05:24
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR still needs changes before approval. I did not add duplicate inline comments because the blocking concerns are already present in the existing review threads.

Critical checkpoint conclusions:

  • Goal/test: The goal is to preserve precision for predefined Variant decimal paths, and the new regression/unit coverage targets direct insert, staged stream load, stream load, arrays, glob paths, exponent notation, and Decimal256 integers. However, already-open threads show important paths remain uncovered or incorrect, including doc-mode predefined decimals, decimal paths inside array-of-object JSONB serialization, nullable/missing decimal casts, and unrelated oversized integers during raw-preservation parsing.
  • Scope/focus: The change is focused on Variant JSON parsing, but it replaces the hot parser representation with a fully materialized Node tree and copies strings/numbers for all parses, which is broader than necessary and is already called out as a performance/memory-accounting blocker.
  • Concurrency/lifecycle: No new shared concurrency or lock ordering was introduced. Parser-owned Node lifetimes are bounded by parse traversal and released after parse; no static initialization concern found.
  • Configuration/compatibility: No new config item or storage-format incompatibility was introduced. FE/BE protocol compatibility is not affected.
  • Parallel paths: A blocking existing thread notes doc-mode Variant parsing is not configured with decimal preserve paths, so the non-doc and doc-mode paths are inconsistent.
  • Special conditions/error handling: Existing comments identify lossy or failing numeric handling for specific cases, including BigInteger conversion before preservation and JSONB serialization of nested arrays. These are correctness issues, not just style issues.
  • Test coverage/results: The added tests improve coverage, but the existing review context identifies missing regressions for doc mode, nullable Variant stream-load consistency, nullable/missing typed decimal casts, array-of-object decimal paths, escaped-key behavior, and oversized unrelated integer coexistence. No test results were provided here.
  • Observability: No new observability appears necessary for this parser-level precision fix.
  • Transactions/persistence/data writes: The change affects load/write materialization of Variant columns; correctness must hold across direct insert and stream load paths. Existing blockers show some write paths can still store rounded or missing values.
  • Performance: Existing hot-path concern remains significant because ordinary Variant ingestion now builds a recursive copied tree before traversal.

User focus: No additional user-provided review focus was supplied.

Please address the existing inline threads before requesting another approval pass.

@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 624b378 to 04ef206 Compare May 18, 2026 05:37
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional blocking gap beyond the existing inline threads. The PR goal is to preserve raw JSON numeric tokens for predefined Variant decimal typed paths, and the scalar-root INSERT/string path is covered by the new config and tests, but a normal stream-load row where the Variant field is a JSON object still bypasses that config.

Critical checkpoint conclusions: goal is only partially accomplished because direct object stream-load remains lossy; the change is focused but the config is not propagated to all parallel ingestion paths; no new concurrency or lifecycle risk was found beyond existing parser allocation/lifetime threads; no config item or storage-format compatibility change is introduced; FE/BE protocol is unchanged; test coverage covers INSERT, staged string stream-load, and string-valued direct stream-load, but misses direct object stream-load into the Variant column; existing review threads already cover doc-mode, nested JSONB array, nullable, big-integer, and performance concerns; no additional user focus was provided in review_focus.txt.

Comment thread be/src/exec/common/variant_util.cpp
@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 04ef206 to 28faf36 Compare May 18, 2026 06:02
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional blocking issue distinct from the existing inline threads.

Critical checkpoint conclusions:

  • Goal/test coverage: the PR improves predefined Variant decimal preservation and adds regression coverage for inserts and stream load, but it regresses the generic Variant JSON deserialization path used by casts, which is not covered by the new test.
  • Scope/focus: the parser changes are focused on decimal preservation, but changing DataTypeVariantSerDe::deserialize_one_cell_from_json() to always store a scalar string is broader than the precision fix and changes existing query semantics.
  • Concurrency/lifecycle: no new concurrent shared state or non-trivial static lifecycle issue found beyond already-open parser allocation/lifetime discussions.
  • Configuration/compatibility: no new config item; the serde change is a user-visible behavioral compatibility regression for CAST(... AS VARIANT).
  • Parallel paths: stream-load/direct-load paths are considered in tests, but the cast path that shares the same serde was missed.
  • Data correctness: decimal typed paths are better preserved in the covered paths, but Variant subpath data produced by casts can become inaccessible because JSON is stored as a scalar string.
  • Error handling/memory/observability: no separate new issue found in these areas beyond the existing review threads.
  • User focus: no additional user-provided review focus was specified.

variant_util::parse_json_to_variant(column, json_ref, nullptr, parse_config));
auto& variant = assert_cast<ColumnVariant&>(column);
Field field = Field::create_field<TYPE_STRING>(String(slice.data, slice.size));
VariantMap object;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This serde is also used by the generic cast path, not only by stream load. cast_from_string_to_generic() calls serde->deserialize_one_cell_from_json() for CAST(string AS VARIANT), and existing suites rely on cast('{"a":1}' as variant)['a'] exposing subpaths. With this change the same input is always inserted as a root TYPE_STRING, so query-time casts no longer parse the JSON object and subpath access like c1['a'] returns missing/null instead of 1. This is distinct from the direct Variant stream-load concern because it affects expression evaluation without the later storage-layer parse_and_materialize_variant_columns() pass. Please keep the old parse behavior for casts/generic serde callers and restrict raw-root preservation to the load path that can materialize later, and add a regression for cast('{"a":1}' as variant)['a'].

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 78.95% (285/361) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.51% (20659/38606)
Line Coverage 37.18% (195356/525397)
Region Coverage 33.55% (152903/455767)
Branch Coverage 34.60% (66720/192818)

@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 28faf36 to de9b9d8 Compare May 18, 2026 06:45
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review conclusion: I did not find any additional distinct inline issue beyond the existing review threads.

This is not an approval recommendation because the existing inline threads already document blocking correctness/performance concerns that remain part of the review context, including cast/generic serde behavior, direct object stream-load materialization, doc-mode decimal preservation, nullable Variant load consistency, JSONB nested-array decimal precision, big-integer handling, and raw-preservation overhead.

Critical checkpoints:

  • Goal/test proof: The PR aims to preserve predefined Variant decimal precision. The new tests cover several insert and stream-load paths, but existing review threads identify uncovered/incorrect paths that still need resolution.
  • Scope/focus: The current GitHub PR diff is focused on Variant JSON parsing/loading and related tests.
  • Concurrency/lifecycle: No new concurrency-sensitive path was identified. Parser-owned Node lifetime is bounded by JSONDataParser::parse() traversal.
  • Configuration: No new config item was added. Existing session variables are used in regression coverage.
  • Compatibility/protocol/storage: No FE/BE protocol change was identified. Existing threads cover behavior compatibility risks for cast/generic serde and doc-mode Variant storage.
  • Parallel paths: I checked stream-load, generic serde/cast, doc-mode/non-doc-mode, arrays, and direct object vs string-staged Variant load paths; the distinct concerns I found were already present in existing threads, so I did not duplicate them.
  • Conditional checks: The new raw-staging and decimal-path matching checks are localized; existing comments already cover the problematic branches.
  • Test coverage/results: New regression and unit tests exist, but existing threads point out missing negative/edge coverage for several important paths. I found no additional unique missing test beyond those.
  • Observability: No new observability requirement beyond normal error propagation was identified.
  • Transactions/data writes: The changes affect ingest/materialization but do not introduce new transaction metadata paths. Existing data correctness concerns are covered by prior threads.
  • Performance: Existing threads already cover the major hot-path allocation/copy regression risk.

User focus: no additional user-provided review focus was specified.

### What problem does this PR solve?

Issue Number: N/A (CIR-20240)

Related PR: N/A

Problem Summary: Variant predefined decimal typed paths could lose precision because JSON numeric values were materialized through binary numeric values before the later decimal cast. This change keeps raw number tokens in the simdjson-based Variant parser and, only when TabletSchema predefined Variant decimal typed paths are configured through ParseConfig, materializes matching numeric tokens as strings for the decimal cast. Untyped Variant numeric paths keep their existing numeric behavior.

### Release note

Fix Variant predefined decimal typed path precision loss for high precision and scientific notation JSON numbers.

### Check List (For Author)

- Test:
    - BE unit test: ./run-be-ut.sh --run --filter='JsonParserTest.*'
    - Regression test: ./run-regression-test.sh --conf tmp/regression-conf.auto.groovy --run -d variant_p0/predefine -s test_variant_high_precision_decimal
    - Build: BUILD_TYPE=ASAN USE_MEM_TRACKER=ON ./build.sh --be
    - Format: PATH=/tmp/doris-codex-clang:/mnt/disk1/claude-max/bin:/mnt/disk1/claude-max/.local/bin:/mnt/disk1/claude-max/.npm-global/bin:/mnt/disk1/claude-max/.opencode/bin:/mnt/disk1/claude-max/.nvm/versions/node/v24.14.1/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin build-support/check-format.sh, git diff --check
    - Static analysis: Not run to completion; clang-tidy was attempted but blocked by local toolchain/pre-existing header issues.
- Behavior changed: Yes. Variant predefined decimal typed paths preserve matching JSON numeric tokens for decimal materialization.
- Does this need documentation: No
@eldenmoon eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from de9b9d8 to 38f5b5d Compare May 18, 2026 07:47
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants