[C++][Parquet] Parquet writer supports writing int32/int64 for decimal type #15239

wgtmac · 2023-01-07T06:59:25Z

Describe the enhancement requested

As the Parquet spec states below, decimal types with small precision can use the int32/int64 physical types.

DECIMAL can be used to annotate the following types:

- int32: for 1 <= precision <= 9
- int64: for 1 <= precision <= 18; precision < 10 will produce a warning
- fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
- binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
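
To make the fixed_len_byte_array bound concrete, here is a minimal sketch that evaluates floor(log_10(2^(8*n - 1) - 1)) for a given byte length n. The function name `MaxPrecisionForByteWidth` is hypothetical and not part of Arrow or Parquet. It relies on the fact that 2^k is never a power of ten for k >= 1, so floor(log_10(2^k - 1)) equals floor(k * log_10(2)).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not an Arrow/Parquet API): maximum number of base-10
// digits that a signed two's-complement value of `num_bytes` bytes can hold,
// i.e. floor(log10(2^(8*n - 1) - 1)).
int MaxPrecisionForByteWidth(int num_bytes) {
  assert(num_bytes >= 1);
  const int value_bits = 8 * num_bytes - 1;  // one bit is reserved for the sign
  // 2^k is never a power of ten for k >= 1, so
  // floor(log10(2^k - 1)) == floor(k * log10(2)).
  return static_cast<int>(std::floor(value_bits * std::log10(2.0)));
}
```

With this formula, 4 bytes give 9 digits and 8 bytes give 18 digits, which matches the int32 and int64 precision limits in the list above.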

The aim of this issue is to provide a writer option to write decimal types using int32 when 1 <= precision <= 9 and int64 when 10 <= precision <= 18.
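
The selection rule described above could be sketched roughly as follows. This is only an illustration: `DecimalPhysicalType`, `ChoosePhysicalType`, and the `store_as_integer` flag are hypothetical names, not the actual writer API, and the fallback to fixed_len_byte_array when the option is off or the precision exceeds 18 is an assumption.

```cpp
#include <cassert>

// Hypothetical enum for illustration; not Arrow's parquet::Type.
enum class DecimalPhysicalType { kInt32, kInt64, kFixedLenByteArray };

// Pick the physical type for a decimal column of the given precision.
// `store_as_integer` stands in for the proposed writer option.
DecimalPhysicalType ChoosePhysicalType(int precision, bool store_as_integer) {
  assert(precision >= 1);
  if (store_as_integer) {
    if (precision <= 9) return DecimalPhysicalType::kInt32;   // 1 <= p <= 9
    if (precision <= 18) return DecimalPhysicalType::kInt64;  // 10 <= p <= 18
  }
  // Assumed fallback: option disabled, or precision too large for int64.
  return DecimalPhysicalType::kFixedLenByteArray;
}
```

Note that int64 is only chosen for precision >= 10, which avoids the "precision < 10" warning case mentioned in the spec.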

Component(s)

C++, Parquet

wgtmac · 2023-01-07T07:03:03Z

I will work on it shortly. cc @emkornfield @pitrou

…15244)

As the parquet [specs](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal) states, DECIMAL can be used to annotate the following types:

- int32: for 1 <= precision <= 9
- int64: for 1 <= precision <= 18; precision < 10 will produce a warning
- fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
- binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.

The aim of this patch is to provide a writer option to use int32 to annotate decimal when 1 <= precision <= 9 and int64 when 10 <= precision <= 18.

* Closes: #15239

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>

alippai · 2023-01-11T18:39:55Z

When talking about datasets (multiple Parquet files), are mixed physical types supported? For example, some files written the old way and some with the new physical type.

wjones127 · 2023-01-11T19:14:21Z

> When talking about datasets (multiple Parquet files), are mixed physical types supported? For example, some files written the old way and some with the new physical type.

The physical type does not change the logical type in the Parquet file, just how the data is serialized. Datasets shouldn't care about the Parquet physical type; they should only care about the logical one.
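
As a rough illustration of that point, the same unscaled decimal value can be serialized either as a plain integer or as big-endian two's-complement bytes (the layout Parquet uses for fixed_len_byte_array decimals), and both decode to the same logical value. The helper names below are hypothetical and the sketch only handles widths up to 8 bytes; it is not how Arrow implements this.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode an unscaled decimal value as big-endian
// two's complement in `width` bytes (1..8 only, to keep the sketch simple).
std::vector<uint8_t> EncodeBigEndian(int64_t unscaled, int width) {
  assert(width >= 1 && width <= 8);
  std::vector<uint8_t> out(width);
  const uint64_t bits = static_cast<uint64_t>(unscaled);
  for (int i = 0; i < width; ++i) {
    out[i] = static_cast<uint8_t>(bits >> (8 * (width - 1 - i)));
  }
  return out;
}

// Hypothetical helper: decode big-endian two's-complement bytes back into
// the unscaled value, sign-extending from the most significant byte.
int64_t DecodeBigEndian(const std::vector<uint8_t>& bytes) {
  assert(!bytes.empty() && bytes.size() <= 8);
  uint64_t value = (bytes[0] & 0x80) ? ~uint64_t{0} : uint64_t{0};
  for (uint8_t b : bytes) value = (value << 8) | b;
  return static_cast<int64_t>(value);
}
```

For example, the decimal -123.45 with scale 2 has the unscaled value -12345. Stored as an int32 it is simply -12345; stored as 4 big-endian bytes it round-trips through `DecodeBigEndian(EncodeBigEndian(-12345, 4))` to the same value.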

alippai · 2023-01-11T20:30:27Z

🥳 thanks!

wgtmac added the Type: enhancement label Jan 7, 2023

liukun4515 mentioned this issue Jan 7, 2023

Support decimal int32/64 for writer apache/arrow-rs#3431

Merged

wgtmac added a commit to wgtmac/arrow that referenced this issue Jan 7, 2023

GH-15239: [C++][Parquet] Parquet writer writes decimal as int32/64

b4241ea

github-actions bot mentioned this issue Jan 7, 2023

GH-15239: [C++][Parquet] Parquet writer writes decimal as int32/64 #15244

Merged

github-actions bot assigned wgtmac Jan 7, 2023

kou added Component: Parquet Component: C++ labels Jan 7, 2023

wjones127 closed this as completed in #15244 Jan 11, 2023

wjones127 added this to the 11.0.0 milestone Jan 11, 2023
