Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Construct decimal from i32 #5325

Open
Ted-Jiang opened this issue Jan 23, 2024 · 10 comments
Open

Construct decimal from i32 #5325

Ted-Jiang opened this issue Jan 23, 2024 · 10 comments

Comments

@Ted-Jiang
Copy link
Member

          I think it would be clearer to use the constructors instead: https://docs.rs/parquet/latest/parquet/data_type/enum.Decimal.html#method.from_i32
                            let decimal = Decimal::from_i32(* v as i32, *p as i32, *s as i32)'
                            sbbf.check(&decimal)

(the same applies to the other types here as well)

Originally posted by @alamb in apache/datafusion#8930 (comment)

@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Jan 23, 2024

Seems in

/// Creates new decimal value from `i32`.
pub fn from_i32(value: i32, precision: i32, scale: i32) -> Self {
let bytes = value.to_be_bytes();
Decimal::Int32 {
value: bytes,
precision,
scale,
}
}

use big-endian byte order
but from the parquet doc : https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/Encodings.md?plain=1#L35-L42
Should use little-endian

@tustvold could you plz take a look at this, thx!

@tustvold
Copy link
Contributor

Yes, this does not appear to be correct, but I am not very familiar with this code. It is part of the record API, and not used by the various arrow APIs.

Perhaps @sunchao has some context here?

@Ted-Jiang
Copy link
Member Author

Yes, this does not appear to be correct, but I am not very familiar with this code. It is part of the record API, and not used by the various arrow APIs.

Perhaps @sunchao has some context here?

Thanks for your reply, like you saying it seems only used on some record API testing, i will try to fix this without other concerns。

@sunchao
Copy link
Member

sunchao commented Jan 24, 2024

I think this code is for converting a i32/i64 read from Parquet into an internal representation which uses big-endian for the byte order. Therefore, it is unrelated to the Parquet format spec.

I don't remember exactly why we chose big-endian instead of little-endian for the above (it was added in sunchao/parquet-rs#103). If one needs to writes out to Parquet after reading the decimals, then it might be better to use little-endian here to save some conversion cost. On the other hand, Java's BigInteger (used by Spark) uses big-endian so if for some reason if the decimal read here need to be consumed by Java, then it might be more efficient to use big-endian here.

@tustvold
Copy link
Contributor

Right, the part that confused me is the from_bytes constructor doesn't perform the same conversion. So it isn't clear whether the bytes stored are little or big endian...

@alamb
Copy link
Contributor

alamb commented Jan 24, 2024

I recommend we at least put some of this knowledge into the code as comments, even if we don't change the actual implementation

@sunchao
Copy link
Member

sunchao commented Jan 24, 2024

Right, the part that confused me is the from_bytes constructor doesn't perform the same conversion. So it isn't clear whether the bytes stored are little or big endian...

I think the input of from_bytes is already big-endian as per Parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal, so no conversion required.

This is also another reason for using big-endian for i32 and i64. In the original PR, when converting the decimal values to strings, it just need to call BigInt::from_signed_bytes_be without having to treat i32/i64 separately.

@tustvold
Copy link
Contributor

tustvold commented Jan 24, 2024

Aah, parquet itself is inconsistent 😭 Ok so the next step I think would be to document that the Decimal struct stores data big-endian encoded, and to highlight that if using i32 or i64 as the physical type, parquet uses little endian encoding including in statistics and bloom filters (as they're documented to mirror the PLAIN encoding), but uses big-endian encoding for byte-array based physical types, including in statistics and bloom filters.

@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Jan 25, 2024

So, i assume https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/Encodings.md?plain=1#L41 bytes_array' bytes not define endian-order ,

but why in decimal need force it big-endian https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L214-L216 🤔

And i still not understand which situation need construct decimal from i32 use big-endian 😭

@sunchao
Copy link
Member

sunchao commented Jan 25, 2024

@Ted-Jiang I guess it is because on the Java side, decimal is typically represented using BigDecimal which uses big-endian underneath (via BigInteger).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants