Commit bf4fc65

GH-47596: [C++][Parquet] Fix printing of large Decimal statistics (#47619)
### Rationale for this change

Parquet CLI tools fail printing the statistics for a Decimal column with a precision larger than the max Decimal128 precision.

Example:

```console
$ /build/build-test/debug/parquet-reader --only-metadata /tmp/pqfuzz/pq-table-1
...
Column 5: col_6 (FIXED_LEN_BYTE_ARRAY(11) / Decimal(precision=24, scale=7) / DECIMAL(24,7))
Column 6: col_7 (FIXED_LEN_BYTE_ARRAY(18) / Decimal(precision=43, scale=7) / DECIMAL(43,7))
...
Column 5
  Values: 375, Null Values: 74, Distinct Values: 0
  Max (exact: true): 98505381700645007.0205463, Min (exact: true): -99708959786297168.1726196
  Compression: UNCOMPRESSED, Encodings: PLAIN(DICT_PAGE) RLE_DICTIONARY
  Uncompressed Size: 3754, Compressed Size: 3754
Column 6
  Values: 375, Null Values: 69, Distinct Values: 0
  Max (exact: true): Parquet error: Failed to parse decimal value: Length of byte array passed to Decimal128::FromBigEndian was 18, but must be between 1 and 16
...
```

### What changes are included in this PR?

Use Decimal256 instead of Decimal128 when printing a Decimal statistic.

### Are these changes tested?

Yes, by new tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #47596

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
1 parent d803afc commit bf4fc65

2 files changed, with 18 additions and 2 deletions

cpp/src/parquet/types.cc

Lines changed: 1 addition & 1 deletion
@@ -137,7 +137,7 @@ std::string FormatDecimalValue(Type::type parquet_type, ::std::string_view val,
     }
     case Type::FIXED_LEN_BYTE_ARRAY:
     case Type::BYTE_ARRAY: {
-      auto decimal_result = ::arrow::Decimal128::FromBigEndian(
+      auto decimal_result = ::arrow::Decimal256::FromBigEndian(
           reinterpret_cast<const uint8_t*>(val.data()), static_cast<int32_t>(val.size()));
       if (!decimal_result.ok()) {
         throw ParquetException("Failed to parse decimal value: ",

cpp/src/parquet/types_test.cc

Lines changed: 17 additions & 1 deletion
@@ -138,6 +138,8 @@ TEST(TypePrinter, StatisticsTypes) {
   ASSERT_EQ("0x696a6b6c6d6e6f70", FormatStatValue(Type::FIXED_LEN_BYTE_ARRAY, smax));
 
   // Decimal
+
+  // If the physical type is INT32 or INT64, the decimal storage is little-endian.
   int32_t int32_decimal = 1024;
   smin = std::string(reinterpret_cast<char*>(&int32_decimal), sizeof(int32_t));
   ASSERT_EQ("10.24", FormatStatValue(Type::INT32, smin, LogicalType::Decimal(6, 2)));
@@ -147,7 +149,8 @@ TEST(TypePrinter, StatisticsTypes) {
   ASSERT_EQ("10240000.0000",
             FormatStatValue(Type::INT64, smin, LogicalType::Decimal(18, 4)));
 
-  std::vector<char> bytes = {0x11, 0x22, 0x33, 0x44};
+  // If the physical type is BYTE_ARRAY or FLBA, the decimal storage is big-endian.
+  std::vector<uint8_t> bytes = {0x11, 0x22, 0x33, 0x44};
   smin = std::string(bytes.begin(), bytes.end());
   ASSERT_EQ("28745.4020",
             FormatStatValue(Type::BYTE_ARRAY, smin, LogicalType::Decimal(10, 4)));
@@ -156,6 +159,19 @@ TEST(TypePrinter, StatisticsTypes) {
   ASSERT_EQ("0x11223344", FormatStatValue(Type::BYTE_ARRAY, smin));
   ASSERT_EQ("0x11223344", FormatStatValue(Type::FIXED_LEN_BYTE_ARRAY, smin));
 
+  bytes = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+           0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+           0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xcf, 0xc7};
+  smin = std::string(bytes.begin(), bytes.end());
+  ASSERT_EQ("-12.345",
+            FormatStatValue(Type::BYTE_ARRAY, smin, LogicalType::Decimal(40, 3)));
+  ASSERT_EQ("-12.345", FormatStatValue(Type::FIXED_LEN_BYTE_ARRAY, smin,
+                                       LogicalType::Decimal(40, 3)));
+  ASSERT_EQ("0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffcfc7",
+            FormatStatValue(Type::BYTE_ARRAY, smin));
+  ASSERT_EQ("0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffcfc7",
+            FormatStatValue(Type::FIXED_LEN_BYTE_ARRAY, smin));
+
   // Float16
   bytes = {0x1c, 0x50};
   smin = std::string(bytes.begin(), bytes.end());
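
To make the expected strings in the new tests easier to check by hand, here is a minimal standalone sketch of how a big-endian two's-complement byte string of any length maps to a decimal string. `FormatBigEndianDecimal` is a hypothetical helper written for illustration only; it is not part of Arrow or Parquet and does not show how `Decimal256::FromBigEndian` is implemented. It assumes a non-negative scale and always prints plain notation, so its output may differ from Arrow's formatting in edge cases.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow or Parquet API): formats a big-endian
// two's-complement integer of arbitrary byte length as a decimal string
// with `scale` fractional digits. Assumes scale >= 0.
std::string FormatBigEndianDecimal(std::vector<uint8_t> bytes, int scale) {
  if (bytes.empty()) {
    throw std::invalid_argument("decimal byte string must not be empty");
  }
  // The sign is the top bit of the first (most significant) byte.
  const bool negative = (bytes[0] & 0x80) != 0;
  if (negative) {
    // Two's-complement negation: invert every byte, then add one.
    for (auto& b : bytes) b = static_cast<uint8_t>(~b);
    for (std::size_t i = bytes.size(); i-- > 0;) {
      if (++bytes[i] != 0) break;  // no carry out of this byte, so stop
    }
  }
  // Repeatedly divide the magnitude by 10 (base-256 long division),
  // collecting remainders as digits, least significant digit first.
  std::string digits;
  bool quotient_nonzero = true;
  while (quotient_nonzero) {
    unsigned remainder = 0;
    quotient_nonzero = false;
    for (auto& b : bytes) {
      const unsigned cur = remainder * 256 + b;
      b = static_cast<uint8_t>(cur / 10);
      remainder = cur % 10;
      if (b != 0) quotient_nonzero = true;
    }
    digits.push_back(static_cast<char>('0' + remainder));
  }
  // Pad with zeros so at least one digit precedes the decimal point.
  while (digits.size() <= static_cast<std::size_t>(scale)) digits.push_back('0');
  std::string out(digits.rbegin(), digits.rend());
  if (scale > 0) out.insert(out.size() - static_cast<std::size_t>(scale), ".");
  if (negative) out.insert(out.begin(), '-');
  return out;
}
```

With this sketch, the 4-byte input `{0x11, 0x22, 0x33, 0x44}` is the integer 287454020, which at scale 4 reads as `28745.4020`. The 32-byte input ending in `0xcf, 0xc7` is the sign extension of -12345, which at scale 3 reads as `-12.345`. Both agree with the strings the tests above expect from `FormatStatValue`.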
