fix(arrow/array): preserve large integer precision in JSON decoding #816
Open
zeroshade wants to merge 8 commits into
Conversation
JSON decoding silently corrupted integer values larger than 2^53 because `encoding/json` decodes numeric tokens as `float64` by default. Enable `UseNumber` unconditionally on the internal decoder in `FromJSON`, `RecordFromJSON`, and `NewJSONReader` so full `int64` precision is preserved.

Additional changes:

- `RecordBuilder` gains `UnmarshalOne` and `Unmarshal` methods so caller-configured decoder options propagate through nested field decoding. `RecordFromJSON` now calls `bldr.UnmarshalOne(dec)` instead of `dec.Decode(bldr)`, avoiding the bytes-roundtrip through `json.Unmarshaler` that drops outer decoder configuration.
- All builder `UnmarshalJSON([]byte)` methods enable `UseNumber` on their internal decoder so large integers also survive the standard `json.Unmarshal(data, &builder)` path.
- `Duration`, `Timestamp`, `Time32`, `Time64`, `Date32`, and `Date64` builders now accept stringified integers (e.g. `"9223372036854775807"`) as a fallback before their format-specific parsers, matching how `Int64` and `Decimal` already behaved.
- `WithUseNumber` and `WithUseNumberJSONReader` are deprecated; both are now no-ops since `UseNumber` is always enabled.

Fixes apache#804
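The precision loss this commit fixes can be reproduced with plain `encoding/json`. The sketch below is standalone (not arrow code) and uses hypothetical helper names; it contrasts the default `float64` decoding with `UseNumber`:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// decodeDefault shows the old behavior: numeric tokens become float64,
// so integers above 2^53 are silently rounded.
func decodeDefault(in string) int64 {
	var m map[string]any
	json.Unmarshal([]byte(in), &m)
	return int64(m["v"].(float64))
}

// decodeExact mirrors the fix: UseNumber keeps each numeric token as a
// json.Number (a string), so the exact int64 can be recovered.
func decodeExact(in string) int64 {
	dec := json.NewDecoder(strings.NewReader(in))
	dec.UseNumber()
	var m map[string]any
	dec.Decode(&m)
	n, _ := m["v"].(json.Number).Int64()
	return n
}

func main() {
	const in = `{"v": 9007199254740993}` // 2^53 + 1
	fmt.Println(decodeDefault(in)) // 9007199254740992 (corrupted)
	fmt.Println(decodeExact(in))   // 9007199254740993 (exact)
}
```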
Now that UseNumber is unconditionally enabled in FromJSON, RecordFromJSON, and NewJSONReader, the explicit WithUseNumber() arguments in compute tests and the flightsql type_info example are redundant and trigger staticcheck SA1019. Remove them. The two regression tests in arrow/array that verify the deprecated options still function as no-ops keep their calls and gain a focused //nolint:staticcheck directive.
lidavidm reviewed May 15, 2026
Roborev pointed out that the previous fix relaxed validation for `json.Number` values across integer-backed builders to keep an existing test passing. Combined with making `UseNumber` the unconditional default, that turned several invalid inputs into silent data corruption:

- fractional values (e.g. `1.5`) for any integer type were truncated
- out-of-range values (e.g. `128` for int8) wrapped via float64 cast
- negative values for unsigned types wrapped to large positives
- fractional values for `Timestamp`, `Time32/64`, `Date32/64`, and `Duration` were coerced to integer-backed temporal values

Restore the original strict semantics for `case json.Number`:

- `Int*`/`Uint*` builders try `ParseInt`/`ParseUint` first (so large integers exceeding float64 mantissa precision still succeed), then fall back to `ParseFloat` only with the integral-value and range checks that were dropped in the previous commit.
- `Date32/64`, `Time32/64`, `Duration`, and `Timestamp` revert to the prior `Int64`-only handling for `json.Number` so fractional input errors out instead of being silently truncated.

The numeric-builder generated test data (`[0, 1, null, 2.3, -11]`) was relying on undocumented float64 truncation/wrap behavior that bypassed strict validation. It is updated to clean integer data (`[0, 1, null, 2, 100]`) that exercises the same code paths without depending on silent coercion.

Adds `TestJSONNumberStrictValidation` covering 14 invalid-input cases across `Int*`/`Uint*`/`Date32/64`/`Time32/64`/`Duration`/`Timestamp` to guard against this regression returning.

Closes roborev review #3576.
Roborev #3578 caught two remaining gaps in the strict json.Number validation introduced by the previous commit:

1. **HIGH**: `Time32` and `Date32` are int32 aliases, but their `json.Number` case decoded via `v.Int64()` and cast directly to the int32-backed type, so values like `[2147483648]` or `[-2147483649]` silently wrapped. Added explicit range validation against the int32 bounds before appending. `Time64`/`Date64` are unaffected (already int64-backed).
2. **MEDIUM**: The `Uint*` `ParseFloat` fallback used `fval > float64(^uint64(0))`. For uint64, `float64(^uint64(0))` rounds up to 2^64, so the strict-greater check let the exact 2^64 boundary slip through. Additionally, `uint64(fval)` was computed before any range validation, which is undefined behavior for out-of-range floats. Reordered the check to (a) range-validate first using `fval >= max+1`, and (b) only then perform the whole-number check on the already-validated truncated value. The same fix is applied in both the `case json.Number` and `case string` paths.

Extends `TestJSONNumberStrictValidation` with 6 new sub-cases: `Date32OverflowPositive/Negative`, `Time32OverflowPositive/Negative`, `Uint64ExactBoundary` (`18446744073709551616`), and `Uint64ExponentialOverflow` (`1.8446744073709552e+19`).

Closes roborev review #3578.
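The uint64 boundary subtlety can be demonstrated in isolation. `parseUint64FromFloat` is a hypothetical helper sketching the reordered check, not the generated builder code:

```go
package main

import "fmt"

// parseUint64FromFloat sketches the fix: float64(math.MaxUint64) rounds
// UP to exactly 2^64, so the range test must compare against max+1
// (2^64, which IS exactly representable) with >=, and it must run
// before the uint64 conversion, which is implementation-defined for
// out-of-range floats.
func parseUint64FromFloat(fval float64) (uint64, error) {
	const maxUint64PlusOne = float64(1<<63) * 2 // exactly 2^64
	if fval < 0 || fval >= maxUint64PlusOne {
		return 0, fmt.Errorf("value %v out of range for uint64", fval)
	}
	truncated := uint64(fval) // safe: range already validated
	if fval != float64(truncated) {
		return 0, fmt.Errorf("value %v is not an integer", fval)
	}
	return truncated, nil
}

func main() {
	// 1.8446744073709552e+19 rounds to exactly 2^64 as a float64, so
	// the old strict-greater check against float64(^uint64(0)) let it
	// slip through.
	_, err := parseUint64FromFloat(1.8446744073709552e+19)
	fmt.Println(err)

	v, _ := parseUint64FromFloat(1e18) // exactly representable, in range
	fmt.Println(v)
}
```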
Roborev #3579 caught that the unsigned ParseFloat fallback (introduced to support exponential notation like `"1e3"`) still reached `uint64(fval)` for inputs like `["NaN"]`. Both NaN range checks return false (NaN compares unequal to everything), so NaN slipped past validation and the implementation-defined `uint64(NaN)` conversion ran before the whole-number check could reject it. The same UB exposure existed in the corresponding signed paths.

Fix in both `case string` and `case json.Number`, for both `Int*` and `Uint*` builders, by:

1. Adding `math.IsNaN(fval) || math.IsInf(fval, 0)` to the existing range disjunction so non-finite floats are rejected before any conversion.
2. Refactoring the signed paths to mirror the unsigned shape: range-validate first, compute `truncated := int64(fval)`, then check whole-number against the already-validated truncated value.

This requires a new `"math"` import in the generated builder, threaded through the template.

Extends `TestJSONNumberStrictValidation` with 5 quoted-string cases: `Int64NaNString` (`["NaN"]`), `Int64InfString` (`["+Inf"]`), `Uint64NaNString`, `Uint64InfString`, and `Uint64ExponentialOverflowString` (`["1.8446744073709552e+19"]`) per the reviewer's suggestion.

Closes roborev review #3579.
The previous ast_grep_replace pass that added dec.UseNumber() to all builder UnmarshalJSON methods accidentally produced a duplicate call in BooleanBuilder, which already had one. Caught by the roborev branch duplication analysis. The duplicate is harmless but ugly; collapse to a single call.
Roborev #3583 caught that the Int64/Uint64 ParseFloat fallback still
silently rounded large exponent-form integers. For example:
ParseFloat("9.007199254740993e15") → 9007199254740992 (rounded down)
The whole-number check `fval == float64(truncated)` then passes, because
9007199254740992.0 == float64(int64(9007199254740992)) is trivially
true after rounding has already happened. The value 9007199254740993
the caller wrote is silently replaced with 9007199254740992.
Fix per the reviewer's "conservatively reject" suggestion: in the size-8
ParseFloat fallback paths (both signed and unsigned, both case string
and case json.Number), reject any value with `math.Abs(fval) >= 2^53`.
That is the boundary above which float64 cannot represent every integer
exactly, so any post-ParseFloat magnitude at-or-above this threshold
might have been silently rounded.
Smaller-width types (size 1/2/4) don't need this check because their
own value range is well below 2^53 — the existing range check already
filters anything that could be problematic.
Tradeoff: legitimate exponent-form input at exactly 2^53 (e.g.
"9.007199254740992e15") is now rejected. The caller can always rewrite
it as a plain integer literal (`9007199254740992`), which goes through
the exact ParseInt path.
Extends TestJSONNumberStrictValidation with 4 cases:
Int64ExponentBeyondMantissa, Uint64ExponentBeyondMantissa, plus their
quoted-string equivalents — all using `9.007199254740993e15` (2^53+1).
Closes roborev review #3583.
The earlier commit added WithUseNumberJSONReader as a new public option on JSONReader, only to deprecate it in the same patch series once UseNumber became the unconditional default. Shipping a brand-new "deprecated on arrival" symbol provides no migration value: there are no existing callers to grandfather, and the option is a no-op from the moment it appears in any released version. Remove the option entirely along with the now-unused useNumber field on JSONReader. NewJSONReader keeps unconditionally calling UseNumber, so callers automatically get precision-preserving decoding without any opt-in. The companion regression test that explicitly exercised the deprecated option is also removed; TestJSONReaderLargeInt64 (renamed from TestJSONReaderLargeInt64Default) covers the only remaining behavior.
lidavidm (Member) approved these changes May 15, 2026
I suppose it was originally like this but why does the Float branch not test float values?
Summary
Fixes #804.
JSON decoding in `arrow/array` silently corrupted integer values larger than 2^53 because Go's `encoding/json` decodes numeric tokens as `float64` by default, losing precision for very large integers. This PR makes `UseNumber` the unconditional default across `FromJSON`, `RecordFromJSON`, and `NewJSONReader` so full `int64` precision is preserved without requiring callers to opt in.

Changes

- `RecordBuilder` gains `UnmarshalOne` and `Unmarshal` so caller-configured `json.Decoder` options propagate through nested field decoding. `RecordFromJSON` now calls `bldr.UnmarshalOne(dec)` instead of `dec.Decode(bldr)`, avoiding the bytes-roundtrip through `json.Unmarshaler` that previously dropped the outer decoder configuration (root cause noted in this comment).
- All builder `UnmarshalJSON([]byte)` methods now enable `UseNumber` on their internal decoder so large integers also survive the standard `json.Unmarshal(data, &builder)` path.
- String-encoded integers now work for time-based types. `Duration`, `Timestamp`, `Time32`, `Time64`, `Date32`, and `Date64` builders accept stringified integers (e.g. `"9223372036854775807"`) as a fallback before their format-specific parsers, addressing the second snippet from the issue. This matches how `Int64Builder` and the decimal builders already behaved.
- `WithUseNumber` and `WithUseNumberJSONReader` are deprecated (marked with `Deprecated:` per pkg.go.dev convention) since `UseNumber` is now always enabled. Both options remain functional no-ops for backward compatibility.

Test Plan
15 new tests added in `arrow/array/util_test.go` and `arrow/array/json_reader_test.go` covering:

- `RecordFromJSON` and `JSONReader` round-tripping `MaxInt64`/`MinInt64` with and without the deprecated options.
- `RecordBuilder.UnmarshalJSON` and `UnmarshalOne` directly preserving large integers.
- Stringified integers for `Duration`, `Timestamp`, `Time32`, `Time64`, `Date32`, `Date64`.
- Invalid strings (`"abc"`) still error; format-style duration strings (`"3h2m0.5s"`) still parse.

Verified passing locally:
- `./arrow/array/...` (full suite)
- `./arrow/compute/...` (largest blast-radius consumer of `WithUseNumber`: 49 test sites)
- `./arrow/ipc/...`, `./arrow/scalar/...`, `./arrow/extensions/...`, `./arrow/encoded/...`, `./arrow/util/...`
- `./parquet/file/...` (the `./parquet/pqarrow/...` package fails identically on `main` due to a missing `PARQUET_TEST_DATA` env var; pre-existing and unrelated)

All `lsp_diagnostics` clean across the 23 changed files.

Behavior Change Notes

The only observable behavior change for existing callers is: large integer values that previously decoded to a wrong `int64` (via a float64 round-trip) now decode to the correct value. Existing tests in `arrow/compute` continue to pass, including those that explicitly opt in via `WithUseNumber` and the ones that don't. No production callers in this repo break (verified across `flight`, `compute`, `parquet`, `ipc`).