fix(arrow/array): preserve large integer precision in JSON decoding #816
Open
zeroshade wants to merge 8 commits into
Conversation
JSON decoding silently corrupted integer values larger than 2^53 because `encoding/json` decodes numeric tokens as `float64` by default. Enable `UseNumber` unconditionally on the internal decoder in `FromJSON`, `RecordFromJSON`, and `NewJSONReader` so full `int64` precision is preserved.

Additional changes:

- `RecordBuilder` gains `UnmarshalOne` and `Unmarshal` methods so caller-configured decoder options propagate through nested field decoding. `RecordFromJSON` now calls `bldr.UnmarshalOne(dec)` instead of `dec.Decode(bldr)`, avoiding the bytes-roundtrip through `json.Unmarshaler` that drops outer decoder configuration.
- All builder `UnmarshalJSON([]byte)` methods enable `UseNumber` on their internal decoder so large integers also survive the standard `json.Unmarshal(data, &builder)` path.
- `Duration`, `Timestamp`, `Time32`, `Time64`, `Date32`, and `Date64` builders now accept stringified integers (e.g. `"9223372036854775807"`) as a fallback before their format-specific parsers, matching how `Int64` and `Decimal` already behaved.
- `WithUseNumber` and `WithUseNumberJSONReader` are deprecated; both are now no-ops since `UseNumber` is always enabled.

Fixes apache#804
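The precision loss this commit fixes can be reproduced with plain `encoding/json`. The sketch below is standalone (not arrow code) and uses hypothetical helper names; it contrasts the default `float64` decoding with `UseNumber`:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// decodeDefault shows the old behavior: numeric tokens become float64,
// so integers above 2^53 are silently rounded.
func decodeDefault(in string) int64 {
	var m map[string]any
	json.Unmarshal([]byte(in), &m)
	return int64(m["v"].(float64))
}

// decodeExact mirrors the fix: UseNumber keeps each numeric token as a
// json.Number (a string), so the exact int64 can be recovered.
func decodeExact(in string) int64 {
	dec := json.NewDecoder(strings.NewReader(in))
	dec.UseNumber()
	var m map[string]any
	dec.Decode(&m)
	n, _ := m["v"].(json.Number).Int64()
	return n
}

func main() {
	const in = `{"v": 9007199254740993}` // 2^53 + 1
	fmt.Println(decodeDefault(in)) // 9007199254740992 (corrupted)
	fmt.Println(decodeExact(in))   // 9007199254740993 (exact)
}
```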
Now that UseNumber is unconditionally enabled in FromJSON, RecordFromJSON, and NewJSONReader, the explicit WithUseNumber() arguments in compute tests and the flightsql type_info example are redundant and trigger staticcheck SA1019. Remove them. The two regression tests in arrow/array that verify the deprecated options still function as no-ops keep their calls and gain a focused //nolint:staticcheck directive.
lidavidm reviewed May 15, 2026
Roborev pointed out that the previous fix relaxed validation for `json.Number` values across integer-backed builders to keep an existing test passing. Combined with making `UseNumber` the unconditional default, that turned several invalid inputs into silent data corruption:

- fractional values (e.g. `1.5`) for any integer type were truncated
- out-of-range values (e.g. `128` for int8) wrapped via float64 cast
- negative values for unsigned types wrapped to large positives
- fractional values for `Timestamp`, `Time32/64`, `Date32/64`, and `Duration` were coerced to integer-backed temporal values

Restore the original strict semantics for `case json.Number`:

- `Int*`/`Uint*` builders try `ParseInt`/`ParseUint` first (so large integers exceeding float64 mantissa precision still succeed), then fall back to `ParseFloat` only with the integral-value and range checks that were dropped in the previous commit.
- `Date32/64`, `Time32/64`, `Duration`, and `Timestamp` revert to the prior `Int64`-only handling for `json.Number` so fractional input errors out instead of being silently truncated.

The numeric-builder generated test data (`[0, 1, null, 2.3, -11]`) was relying on undocumented float64 truncation/wrap behavior that bypassed strict validation. It is updated to clean integer data (`[0, 1, null, 2, 100]`) that exercises the same code paths without depending on silent coercion.

Adds `TestJSONNumberStrictValidation` covering 14 invalid-input cases across `Int*`/`Uint*`/`Date32/64`/`Time32/64`/`Duration`/`Timestamp` to guard against this regression returning.

Closes roborev review #3576.
Roborev #3578 caught two remaining gaps in the strict json.Number validation introduced by the previous commit:

1. **HIGH**: `Time32` and `Date32` are int32 aliases, but their `json.Number` case decoded via `v.Int64()` and cast directly to the int32-backed type, so values like `[2147483648]` or `[-2147483649]` silently wrapped. Added explicit range validation against the int32 bounds before appending. `Time64`/`Date64` are unaffected (already int64-backed).
2. **MEDIUM**: The `Uint*` `ParseFloat` fallback used `fval > float64(^uint64(0))`. For uint64, `float64(^uint64(0))` rounds up to 2^64, so the strict-greater check let the exact 2^64 boundary slip through. Additionally, `uint64(fval)` was computed before any range validation, which is undefined behavior for out-of-range floats. Reordered the check to (a) range-validate first using `fval >= max+1`, and (b) only then perform the whole-number check on the already-validated truncated value. The same fix is applied in both the `case json.Number` and `case string` paths.

Extends `TestJSONNumberStrictValidation` with 6 new sub-cases: `Date32OverflowPositive/Negative`, `Time32OverflowPositive/Negative`, `Uint64ExactBoundary` (`18446744073709551616`), and `Uint64ExponentialOverflow` (`1.8446744073709552e+19`).

Closes roborev review #3578.
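The uint64 boundary subtlety can be demonstrated in isolation. `parseUint64FromFloat` is a hypothetical helper sketching the reordered check, not the generated builder code:

```go
package main

import "fmt"

// parseUint64FromFloat sketches the fix: float64(math.MaxUint64) rounds
// UP to exactly 2^64, so the range test must compare against max+1
// (2^64, which IS exactly representable) with >=, and it must run
// before the uint64 conversion, which is implementation-defined for
// out-of-range floats.
func parseUint64FromFloat(fval float64) (uint64, error) {
	const maxUint64PlusOne = float64(1<<63) * 2 // exactly 2^64
	if fval < 0 || fval >= maxUint64PlusOne {
		return 0, fmt.Errorf("value %v out of range for uint64", fval)
	}
	truncated := uint64(fval) // safe: range already validated
	if fval != float64(truncated) {
		return 0, fmt.Errorf("value %v is not an integer", fval)
	}
	return truncated, nil
}

func main() {
	// 1.8446744073709552e+19 rounds to exactly 2^64 as a float64, so
	// the old strict-greater check against float64(^uint64(0)) let it
	// slip through.
	_, err := parseUint64FromFloat(1.8446744073709552e+19)
	fmt.Println(err)

	v, _ := parseUint64FromFloat(1e18) // exactly representable, in range
	fmt.Println(v)
}
```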
Roborev #3579 caught that the unsigned ParseFloat fallback (introduced to support exponential notation like `"1e3"`) still reached `uint64(fval)` for inputs like `["NaN"]`. Both NaN range checks return false (NaN compares unequal to everything), so NaN slipped past validation and the implementation-defined `uint64(NaN)` conversion ran before the whole-number check could reject it. The same UB exposure existed in the corresponding signed paths.

Fix in both `case string` and `case json.Number`, for both `Int*` and `Uint*` builders, by:

1. Adding `math.IsNaN(fval) || math.IsInf(fval, 0)` to the existing range disjunction so non-finite floats are rejected before any conversion.
2. Refactoring the signed paths to mirror the unsigned shape: range-validate first, compute `truncated := int64(fval)`, then check whole-number against the already-validated truncated value.

This requires a new `"math"` import in the generated builder, threaded through the template.

Extends `TestJSONNumberStrictValidation` with 5 quoted-string cases: `Int64NaNString` (`["NaN"]`), `Int64InfString` (`["+Inf"]`), `Uint64NaNString`, `Uint64InfString`, and `Uint64ExponentialOverflowString` (`["1.8446744073709552e+19"]`) per the reviewer's suggestion.

Closes roborev review #3579.
The previous ast_grep_replace pass that added dec.UseNumber() to all builder UnmarshalJSON methods accidentally produced a duplicate call in BooleanBuilder, which already had one. Caught by the roborev branch duplication analysis. The duplicate is harmless but ugly; collapse to a single call.
Roborev #3583 caught that the Int64/Uint64 ParseFloat fallback still
silently rounded large exponent-form integers. For example:
ParseFloat("9.007199254740993e15") → 9007199254740992 (rounded down)
The whole-number check `fval == float64(truncated)` then passes, because
9007199254740992.0 == float64(int64(9007199254740992)) is trivially
true after rounding has already happened. The value 9007199254740993
the caller wrote is silently replaced with 9007199254740992.
Fix per the reviewer's "conservatively reject" suggestion: in the size-8
ParseFloat fallback paths (both signed and unsigned, both case string
and case json.Number), reject any value with `math.Abs(fval) >= 2^53`.
That is the boundary above which float64 cannot represent every integer
exactly, so any post-ParseFloat magnitude at-or-above this threshold
might have been silently rounded.
Smaller-width types (size 1/2/4) don't need this check because their
own value range is well below 2^53 — the existing range check already
filters anything that could be problematic.
Tradeoff: legitimate exponent-form input at exactly 2^53 (e.g.
"9.007199254740992e15") is now rejected. The caller can always rewrite
it as a plain integer literal (`9007199254740992`), which goes through
the exact ParseInt path.
Extends TestJSONNumberStrictValidation with 4 cases:
Int64ExponentBeyondMantissa, Uint64ExponentBeyondMantissa, plus their
quoted-string equivalents — all using `9.007199254740993e15` (2^53+1).
Closes roborev review #3583.
The earlier commit added WithUseNumberJSONReader as a new public option on JSONReader, only to deprecate it in the same patch series once UseNumber became the unconditional default. Shipping a brand-new "deprecated on arrival" symbol provides no migration value: there are no existing callers to grandfather, and the option is a no-op from the moment it appears in any released version. Remove the option entirely along with the now-unused useNumber field on JSONReader. NewJSONReader keeps unconditionally calling UseNumber, so callers automatically get precision-preserving decoding without any opt-in. The companion regression test that explicitly exercised the deprecated option is also removed; TestJSONReaderLargeInt64 (renamed from TestJSONReaderLargeInt64Default) covers the only remaining behavior.
lidavidm (Member) approved these changes May 15, 2026
I suppose it was originally like this but why does the Float branch not test float values?
Summary
Fixes #804.
JSON decoding in `arrow/array` silently corrupted integer values larger than 2^53 because Go's `encoding/json` decodes numeric tokens as `float64` by default, losing precision for very large integers. This PR makes `UseNumber` the unconditional default across `FromJSON`, `RecordFromJSON`, and `NewJSONReader` so full `int64` precision is preserved without requiring callers to opt in.

Changes

- `RecordBuilder` gains `UnmarshalOne` and `Unmarshal` so caller-configured `json.Decoder` options propagate through nested field decoding. `RecordFromJSON` now calls `bldr.UnmarshalOne(dec)` instead of `dec.Decode(bldr)`, avoiding the bytes-roundtrip through `json.Unmarshaler` that previously dropped the outer decoder configuration (root cause noted in this comment).
- All builder `UnmarshalJSON([]byte)` methods now enable `UseNumber` on their internal decoder so large integers also survive the standard `json.Unmarshal(data, &builder)` path.
- String-encoded integers now work for time-based types. `Duration`, `Timestamp`, `Time32`, `Time64`, `Date32`, and `Date64` builders accept stringified integers (e.g. `"9223372036854775807"`) as a fallback before their format-specific parsers, addressing the second snippet from the issue. This matches how `Int64Builder` and the decimal builders already behaved.
- `WithUseNumber` and `WithUseNumberJSONReader` are deprecated (marked with `Deprecated:` per pkg.go.dev convention) since `UseNumber` is now always enabled. Both options remain functional no-ops for backward compatibility.

Test Plan
15 new tests added in `arrow/array/util_test.go` and `arrow/array/json_reader_test.go` covering:

- `RecordFromJSON` and `JSONReader` round-tripping `MaxInt64`/`MinInt64` with and without the deprecated options.
- `RecordBuilder.UnmarshalJSON` and `UnmarshalOne` directly preserving large integers.
- Stringified integers for `Duration`, `Timestamp`, `Time32`, `Time64`, `Date32`, `Date64`.
- Invalid strings (`"abc"`) still error; format-style duration strings (`"3h2m0.5s"`) still parse.

Verified passing locally:
- `./arrow/array/...` (full suite)
- `./arrow/compute/...` (largest blast-radius consumer of `WithUseNumber`: 49 test sites)
- `./arrow/ipc/...`, `./arrow/scalar/...`, `./arrow/extensions/...`, `./arrow/encoded/...`, `./arrow/util/...`
- `./parquet/file/...` (the `./parquet/pqarrow/...` package fails identically on `main` due to a missing `PARQUET_TEST_DATA` env var; pre-existing and unrelated)

All `lsp_diagnostics` clean across the 23 changed files.

Behavior Change Notes

The only observable behavior change for existing callers is: large integer values that previously decoded to a wrong `int64` (via a float64 round-trip) now decode to the correct value. Existing tests in `arrow/compute` continue to pass, including those that explicitly opt in via `WithUseNumber` and the ones that don't. No production callers in this repo break (verified across `flight`, `compute`, `parquet`, `ipc`).