arrow::Decimal128::FromString silently truncates when the input string has more than 38 significant digits #49817

@dmitry-chirkov-dremio

Description

Describe the bug, including details regarding any error messages, version, and platform.

arrow::Decimal128::FromString silently truncates (wraps mod 2¹²⁸) when the input string has more than 38 significant digits, producing a garbage unscaled value and returning Status::OK. The out-parameter precision does reflect the true parsed precision (> 38), but the Decimal128 value itself is already corrupted, and most callers do not validate precision against Decimal128::kMaxPrecision.

The same issue applies symmetrically in Decimal256::FromString when precision > 76 - both paths share DecimalFromString<T> in cpp/src/arrow/util/decimal.cc.

Why?

DecimalFromString feeds the digit string into ShiftAndAdd, which multiplies-and-adds into a fixed-size uint64_t array sized to the target decimal's bit width:

std::array<uint64_t, Decimal::kBitWidth / 64> little_endian_array{};
ShiftAndAdd(dec.whole_digits, little_endian_array.data(),
            little_endian_array.size());
ShiftAndAdd(dec.fractional_digits, little_endian_array.data(),
            little_endian_array.size());

ShiftAndAdd carries high bits only through out_size limbs and drops the remaining carry on the floor - no overflow check. DecimalFromString then returns Status::OK() without validating parsed_precision <= kMaxPrecision (and also without validating parsed_scale <= kMaxScale).

Relevant code (paths as of apache/arrow main):

  • ShiftAndAdd - cpp/src/arrow/util/decimal.cc:747
  • DecimalFromString - cpp/src/arrow/util/decimal.cc:857

Repro

// arrow_decimal_repro.cc
#include <arrow/util/decimal.h>
#include <iostream>
#include <string>

int main() {
  const std::string input =
      "1.55555555555555555555555555555555555555555555555555";

  arrow::Decimal128 dec;
  int32_t precision = 0, scale = 0;
  auto st = arrow::Decimal128::FromString(input, &dec, &precision, &scale);

  std::cout << "input:     " << input << "\n"
            << "status:    " << st.ToString() << "\n"
            << "precision: " << precision << "\n"
            << "scale:     " << scale << "\n"
            << "dec (raw): " << dec.ToString(0) << "\n";
  return 0;
}

Observed output (Arrow main, macOS arm64, clang++ -std=c++17):

input:     1.55555555555555555555555555555555555555555555555555
status:    OK
precision: 51
scale:     50
dec (raw): 151473969238762845885664163847966963939

That unscaled value is exactly the input's significand (the 51-digit integer 155555555555555555555555555555555555555555555555555) reduced mod 2¹²⁸:

>>> 155555555555555555555555555555555555555555555555555 % (1 << 128)
151473969238762845885664163847966963939

Desired (?) behavior

Return Status::Invalid when parsed precision exceeds kMaxPrecision (or parsed scale exceeds kMaxScale) for the target decimal type, matching the error-reporting contract of Decimal::Rescale and friends.

Impact (breaking change?)

Any caller that trusts FromString to fail on invalid input will silently get wrong answers for user-supplied decimal strings with more than 38 significant digits. Downstream, Gandiva's castDECIMAL_utf8 - and any compute kernel that routes VARCHAR-to-DECIMAL through FromString before rescaling via Decimal128::Rescale / gandiva::decimalops::Convert - returns numerically wrong results for perfectly legal SQL casts such as:

CAST('1.55555555555555555555555555555555555555555555555555' AS DECIMAL(38,37))
-- expected: 1.5555555555555555555555555555555555556  (HALF_UP)
-- observed:  0.0000000000015147396923876284588566416

Component(s)

C++, Gandiva
