Skip to content

fix(arrow/array): preserve large integer precision in JSON decoding#816

Open
zeroshade wants to merge 8 commits into
apache:mainfrom
zeroshade:fix/json-large-int-precision-804
Open

fix(arrow/array): preserve large integer precision in JSON decoding#816
zeroshade wants to merge 8 commits into
apache:mainfrom
zeroshade:fix/json-large-int-precision-804

Conversation

@zeroshade
Copy link
Copy Markdown
Member

@zeroshade zeroshade commented May 14, 2026

Summary

Fixes #804.

JSON decoding in arrow/array silently corrupted integer values larger than 2^53 because Go's encoding/json decodes numeric tokens as float64 by default, losing precision for very large integers. This PR makes UseNumber the unconditional default across FromJSON, RecordFromJSON, and NewJSONReader so full int64 precision is preserved without requiring callers to opt in.

Changes

  • RecordBuilder gains UnmarshalOne and Unmarshal so caller-configured json.Decoder options propagate through nested field decoding. RecordFromJSON now calls bldr.UnmarshalOne(dec) instead of dec.Decode(bldr), avoiding the bytes-roundtrip through json.Unmarshaler that previously dropped the outer decoder configuration (root cause noted in this comment).

  • All builder UnmarshalJSON([]byte) methods now enable UseNumber on their internal decoder so large integers also survive the standard json.Unmarshal(data, &builder) path.

  • String-encoded integers now work for time-based types. Duration, Timestamp, Time32, Time64, Date32, and Date64 builders accept stringified integers (e.g. "9223372036854775807") as a fallback before their format-specific parsers — addressing the second snippet from the issue. This matches how Int64Builder and the decimal builders already behaved.

  • WithUseNumber and WithUseNumberJSONReader are deprecated (marked with Deprecated: per pkg.go.dev convention) since UseNumber is now always enabled. Both options remain functional no-ops for backward compatibility.

Test Plan

15 new tests added in arrow/array/util_test.go and arrow/array/json_reader_test.go covering:

  • RecordFromJSON and JSONReader round-tripping MaxInt64/MinInt64 with and without the deprecated options.
  • RecordBuilder.UnmarshalJSON and UnmarshalOne directly preserving large integers.
  • Stringified integer round-trip for Duration, Timestamp, Time32, Time64, Date32, Date64.
  • Regression: invalid duration strings ("abc") still error; format-style duration strings ("3h2m0.5s") still parse.

Verified passing locally:

  • ./arrow/array/... (full suite)
  • ./arrow/compute/... (largest blast-radius consumer of WithUseNumber — 49 test sites)
  • ./arrow/ipc/..., ./arrow/scalar/..., ./arrow/extensions/..., ./arrow/encoded/..., ./arrow/util/...
  • ./parquet/file/... (the ./parquet/pqarrow/... package fails identically on main due to a missing PARQUET_TEST_DATA env var — pre-existing, unrelated)

All lsp_diagnostics clean across the 23 changed files.

Behavior Change Notes

The only observable behavior change for existing callers is: large integer values that previously decoded to a wrong int64 (via float64 round-trip) now decode to the correct value. Existing tests in arrow/compute continue to pass, including those that explicitly opt in via WithUseNumber and the ones that don't. No production callers in this repo break (verified across flight, compute, parquet, ipc).

JSON decoding silently corrupted integer values larger than 2^53 because
encoding/json decodes numeric tokens as float64 by default. Enable
UseNumber unconditionally on the internal decoder in FromJSON,
RecordFromJSON, and NewJSONReader so full int64 precision is preserved.

Additional changes:

- RecordBuilder gains UnmarshalOne and Unmarshal methods so caller-
  configured decoder options propagate through nested field decoding.
  RecordFromJSON now calls bldr.UnmarshalOne(dec) instead of
  dec.Decode(bldr), avoiding the bytes-roundtrip through json.Unmarshaler
  that drops outer decoder configuration.

- All builder UnmarshalJSON([]byte) methods enable UseNumber on their
  internal decoder so large integers also survive the standard
  json.Unmarshal(data, &builder) path.

- Duration, Timestamp, Time32, Time64, Date32 and Date64 builders now
  accept stringified integers (e.g. "9223372036854775807") as a fallback
  before their format-specific parsers, matching how Int64 and Decimal
  already behaved.

- WithUseNumber and WithUseNumberJSONReader are deprecated; both are
  now no-ops since UseNumber is always enabled.

Fixes apache#804
@zeroshade zeroshade requested a review from lidavidm May 14, 2026 23:37
Now that UseNumber is unconditionally enabled in FromJSON,
RecordFromJSON, and NewJSONReader, the explicit WithUseNumber()
arguments in compute tests and the flightsql type_info example are
redundant and trigger staticcheck SA1019. Remove them.

The two regression tests in arrow/array that verify the deprecated
options still function as no-ops keep their calls and gain a focused
//nolint:staticcheck directive.
Comment thread arrow/array/booleanbuilder.go Outdated
Comment thread arrow/array/json_reader.go Outdated
Comment thread arrow/array/json_reader.go Outdated
zeroshade added 6 commits May 15, 2026 12:11
Roborev pointed out that the previous fix relaxed validation for
`json.Number` values across integer-backed builders to keep an existing
test passing. Combined with making `UseNumber` the unconditional default,
that turned several invalid inputs into silent data corruption:

- fractional values (e.g. `1.5`) for any integer type were truncated
- out-of-range values (e.g. `128` for int8) wrapped via float64 cast
- negative values for unsigned types wrapped to large positives
- fractional values for `Timestamp`, `Time32/64`, `Date32/64`, and
  `Duration` were coerced to integer-backed temporal values

Restore the original strict semantics for `case json.Number`:

- `Int*`/`Uint*` builders try `ParseInt`/`ParseUint` first (so large
  integers exceeding float64 mantissa precision still succeed), then
  fall back to `ParseFloat` only with the integral-value and range
  checks that were dropped in the previous commit.
- `Date32/64`, `Time32/64`, `Duration`, and `Timestamp` revert to the
  prior `Int64`-only handling for `json.Number` so fractional input
  errors out instead of being silently truncated.

The numeric-builder generated test data (`[0, 1, null, 2.3, -11]`) was
relying on undocumented float64 truncation/wrap behavior that bypassed
strict validation. It is updated to clean integer data
(`[0, 1, null, 2, 100]`) that exercises the same code paths without
depending on silent coercion.

Adds `TestJSONNumberStrictValidation` covering 14 invalid-input cases
across `Int*`/`Uint*`/`Date32/64`/`Time32/64`/`Duration`/`Timestamp` to
guard against this regression returning.

Closes roborev review #3576.
Roborev #3578 caught two remaining gaps in the strict json.Number
validation introduced by the previous commit:

1. **HIGH** — `Time32` and `Date32` are int32 aliases but their
   `json.Number` case decoded via `v.Int64()` and cast directly to the
   int32-backed type, so values like `[2147483648]` or `[-2147483649]`
   silently wrapped. Added explicit range validation against the int32
   bounds before appending. `Time64`/`Date64` are unaffected (already
   int64-backed).

2. **MEDIUM** — The `Uint*` `ParseFloat` fallback used
   `fval > float64(^uint64(0))`. For uint64, `float64(^uint64(0))`
   rounds up to 2^64, so the strict-greater check let the exact 2^64
   boundary slip through. Additionally, `uint64(fval)` was computed
   before any range validation, which is undefined behavior for
   out-of-range floats. Reordered the check to (a) range-validate
   first using `fval >= max+1`, (b) only then perform the
   whole-number check on the already-validated truncated value. Same
   fix applied in both `case json.Number` and `case string` paths.

Extends `TestJSONNumberStrictValidation` with 6 new sub-cases:
`Date32OverflowPositive/Negative`, `Time32OverflowPositive/Negative`,
`Uint64ExactBoundary` (`18446744073709551616`), and
`Uint64ExponentialOverflow` (`1.8446744073709552e+19`).

Closes roborev review #3578.
Roborev #3579 caught that the unsigned ParseFloat fallback (introduced
to support exponential notation like "1e3") still reached uint64(fval)
for inputs like ["NaN"]. Both NaN range checks return false (NaN
compares unequal to everything), so NaN slipped past validation and
the implementation-defined uint64(NaN) conversion ran before the
whole-number check could reject it.

The same UB exposure existed in the corresponding signed paths. Fix in
both case string and case json.Number, for both Int* and Uint*
builders, by:

1. Adding math.IsNaN(fval) || math.IsInf(fval, 0) to the existing range
   disjunction so non-finite floats are rejected before any conversion.
2. Refactoring the signed paths to mirror the unsigned shape:
   range-validate first, compute truncated := int64(fval), then check
   whole-number against the already-validated truncated value.

This requires a new "math" import in the generated builder, threaded
through the template.

Extends TestJSONNumberStrictValidation with 5 quoted-string cases:
Int64NaNString (`["NaN"]`), Int64InfString (`["+Inf"]`),
Uint64NaNString, Uint64InfString, and Uint64ExponentialOverflowString
(`["1.8446744073709552e+19"]`) per the reviewer's suggestion.

Closes roborev review #3579.
The previous ast_grep_replace pass that added dec.UseNumber() to all
builder UnmarshalJSON methods accidentally produced a duplicate call in
BooleanBuilder, which already had one. Caught by the roborev branch
duplication analysis. The duplicate is harmless but ugly; collapse to
a single call.
Roborev #3583 caught that the Int64/Uint64 ParseFloat fallback still
silently rounded large exponent-form integers. For example:

  ParseFloat("9.007199254740993e15") → 9007199254740992 (rounded down)

The whole-number check `fval == float64(truncated)` then passes, because
9007199254740992.0 == float64(int64(9007199254740992)) is trivially
true after rounding has already happened. The value 9007199254740993
the caller wrote is silently replaced with 9007199254740992.

Fix per the reviewer's "conservatively reject" suggestion: in the size-8
ParseFloat fallback paths (both signed and unsigned, both case string
and case json.Number), reject any value with `math.Abs(fval) >= 2^53`.
That is the boundary above which float64 cannot represent every integer
exactly, so any post-ParseFloat magnitude at-or-above this threshold
might have been silently rounded.

Smaller-width types (size 1/2/4) don't need this check because their
own value range is well below 2^53 — the existing range check already
filters anything that could be problematic.

Tradeoff: legitimate exponent-form input at exactly 2^53 (e.g.
"9.007199254740992e15") is now rejected. The caller can always rewrite
it as a plain integer literal (`9007199254740992`), which goes through
the exact ParseInt path.

Extends TestJSONNumberStrictValidation with 4 cases:
Int64ExponentBeyondMantissa, Uint64ExponentBeyondMantissa, plus their
quoted-string equivalents — all using `9.007199254740993e15` (2^53+1).

Closes roborev review #3583.
The earlier commit added WithUseNumberJSONReader as a new public option
on JSONReader, only to deprecate it in the same patch series once
UseNumber became the unconditional default. Shipping a brand-new
"deprecated on arrival" symbol provides no migration value: there are
no existing callers to grandfather, and the option is a no-op from the
moment it appears in any released version.

Remove the option entirely along with the now-unused useNumber field
on JSONReader. NewJSONReader keeps unconditionally calling UseNumber,
so callers automatically get precision-preserving decoding without any
opt-in.

The companion regression test that explicitly exercised the deprecated
option is also removed; TestJSONReaderLargeInt64 (renamed from
TestJSONReaderLargeInt64Default) covers the only remaining behavior.
@zeroshade zeroshade requested a review from lidavidm May 15, 2026 16:43
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it was originally like this but why does the Float branch not test float values?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Go] RecordFromJSON does not handle integer values not representable by double

2 participants