Construct decimal from i32 #5325

Ted-Jiang · 2024-01-23T05:06:05Z

          I think it would be clearer to use the constructors instead: https://docs.rs/parquet/latest/parquet/data_type/enum.Decimal.html#method.from_i32

                            let decimal = Decimal::from_i32(* v as i32, *p as i32, *s as i32)'
                            sbbf.check(&decimal)

(the same applies to the other types here as well)

Originally posted by @alamb in apache/datafusion#8930 (comment)

The text was updated successfully, but these errors were encountered:

Ted-Jiang · 2024-01-23T05:08:48Z

Seems in

arrow-rs/parquet/src/data_type.rs

Lines 371 to 379 in 7ba36b0

    
           /// Creates new decimal value from `i32`. 
        
           pub fn from_i32(value: i32, precision: i32, scale: i32) -> Self { 
        
               let bytes = value.to_be_bytes(); 
        
               Decimal::Int32 { 
        
                   value: bytes, 
        
                   precision, 
        
                   scale, 
        
               } 
        
           }

use big-endian byte order
but from the parquet doc : https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/Encodings.md?plain=1#L35-L42
Should use little-endian

@tustvold could you plz take a look at this, thx!

tustvold · 2024-01-23T13:13:53Z

Yes, this does not appear to be correct, but I am not very familiar with this code. It is part of the record API, and not used by the various arrow APIs.

Perhaps @sunchao has some context here?

Ted-Jiang · 2024-01-24T02:52:23Z

Yes, this does not appear to be correct, but I am not very familiar with this code. It is part of the record API, and not used by the various arrow APIs.

Perhaps @sunchao has some context here?

Thanks for your reply, like you saying it seems only used on some record API testing, i will try to fix this without other concerns。

sunchao · 2024-01-24T05:53:09Z

I think this code is for converting a i32/i64 read from Parquet into an internal representation which uses big-endian for the byte order. Therefore, it is unrelated to the Parquet format spec.

I don't remember exactly why we chose big-endian instead of little-endian for the above (it was added in sunchao/parquet-rs#103). If one needs to writes out to Parquet after reading the decimals, then it might be better to use little-endian here to save some conversion cost. On the other hand, Java's BigInteger (used by Spark) uses big-endian so if for some reason if the decimal read here need to be consumed by Java, then it might be more efficient to use big-endian here.

tustvold · 2024-01-24T13:00:43Z

Right, the part that confused me is the from_bytes constructor doesn't perform the same conversion. So it isn't clear whether the bytes stored are little or big endian...

alamb · 2024-01-24T13:59:54Z

I recommend we at least put some of this knowledge into the code as comments, even if we don't change the actual implementation

sunchao · 2024-01-24T16:52:40Z

Right, the part that confused me is the from_bytes constructor doesn't perform the same conversion. So it isn't clear whether the bytes stored are little or big endian...

I think the input of from_bytes is already big-endian as per Parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal, so no conversion required.

This is also another reason for using big-endian for i32 and i64. In the original PR, when converting the decimal values to strings, it just need to call BigInt::from_signed_bytes_be without having to treat i32/i64 separately.

tustvold · 2024-01-24T17:19:03Z

Aah, parquet itself is inconsistent 😭 Ok so the next step I think would be to document that the Decimal struct stores data big-endian encoded, and to highlight that if using i32 or i64 as the physical type, parquet uses little endian encoding including in statistics and bloom filters (as they're documented to mirror the PLAIN encoding), but uses big-endian encoding for byte-array based physical types, including in statistics and bloom filters.

Ted-Jiang · 2024-01-25T10:36:59Z

So, i assume https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/Encodings.md?plain=1#L41 bytes_array' bytes not define endian-order ,

but why in decimal need force it big-endian https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L214-L216 🤔

And i still not understand which situation need construct decimal from i32 use big-endian 😭

sunchao · 2024-01-25T20:59:01Z

@Ted-Jiang I guess it is because on the Java side, decimal is typically represented using BigDecimal which uses big-endian underneath (via BigInteger).

Ted-Jiang mentioned this issue Jan 23, 2024

feat: Support parquet bloom filter pruning for decimal128 apache/datafusion#8930

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Construct decimal from i32 #5325

Construct decimal from i32 #5325

Ted-Jiang commented Jan 23, 2024

Ted-Jiang commented Jan 23, 2024 •

edited

Loading

tustvold commented Jan 23, 2024

Ted-Jiang commented Jan 24, 2024

sunchao commented Jan 24, 2024

tustvold commented Jan 24, 2024

alamb commented Jan 24, 2024

sunchao commented Jan 24, 2024

tustvold commented Jan 24, 2024 •

edited

Loading

Ted-Jiang commented Jan 25, 2024 •

edited

Loading

sunchao commented Jan 25, 2024

Construct decimal from i32 #5325

Construct decimal from i32 #5325

Comments

Ted-Jiang commented Jan 23, 2024

Ted-Jiang commented Jan 23, 2024 • edited Loading

tustvold commented Jan 23, 2024

Ted-Jiang commented Jan 24, 2024

sunchao commented Jan 24, 2024

tustvold commented Jan 24, 2024

alamb commented Jan 24, 2024

sunchao commented Jan 24, 2024

tustvold commented Jan 24, 2024 • edited Loading

Ted-Jiang commented Jan 25, 2024 • edited Loading

sunchao commented Jan 25, 2024

Ted-Jiang commented Jan 23, 2024 •

edited

Loading

tustvold commented Jan 24, 2024 •

edited

Loading

Ted-Jiang commented Jan 25, 2024 •

edited

Loading