ARROW-10818: [Rust] Initial implementation for Decimal 128/256 #9232
Conversation
rust/arrow/src/datatypes/mod.rs
Outdated
}

#[derive(Debug)]
pub enum DecimalType {
@alamb @andygrove @Dandandan @nevi-me @jorgecarleitao
I created representations for Decimal as Int32DecimalType, Int64DecimalType, Int128DecimalType, or LargeDecimal, which uses BigInt.
- What is the best way to abstract representations of DataType::Decimal in Rust? An enum?
- How should I work with DecimalArray? Should I make it generic by passing the representation type to it (but then how would I detect the type -- match in every place? As far as I know, functions cannot return a type), or make it generic over DecimalType and get the byte size from it?

Thanks
Hi @ovr -- sorry for the lack of response. I am not familiar with this code, so I will need some non-trivial time to review and think about your question. I hope to find some time this weekend to do so.
Hi @ovr, I went through what is here so far. First of all, great stuff that you are taking this on. Broadly speaking, this PR currently contains the following "steps":
I suggest that 4-5 are done on a separate PR. 1-3 can be done in this PR, but I suggest that we split them into separate commits. Concerning step 2, the two main checks in the list are:
Looking at the spec, a decimal type only supports 128 and 256 bits, so I do not understand why we are trying to add support for EDIT: I searched for the wrong crate name ^_^ Could you clarify some of these points?
First of all thank you @ovr for this proposal and contribution. I also finally had a chance to go through this PR this morning and I broadly agree with @jorgecarleitao 's recommendations
Here are some additional thoughts:
- My reading of the arrow spec also suggests that Arrow only defines Int128Decimal and Int256Decimal. In my opinion, we should not add types to the Rust implementation that are not in the arrow spec.
- If we want to add arbitrary precision decimals to Arrow, perhaps we should propose that to the arrow mailing list and see what other implementers say.
- https://github.com/rust-num/num-bigint looks like a solid choice of crate to base an arrow decimal implementation on.
How should I work with DecimalArray? Should I make it as generic by passing representation type to it (but how should I detect type, match in everyplace? As I have known functions cannot return type) or make it generic on top of DecimalType and get byte size from it?
I am not quite sure what you are asking, but the common pattern for passing arrays around is as an ArrayRef, and then downcasting to the appropriate type.
Given that the implementation in this PR is backed by different bit widths, I see you can't really treat all DecimalArrays the same. The normal arrow pattern I would expect would be something like adding a DataType::Decimal128 and a DataType::Decimal256 type for the different underlying bit widths, and then using them as follows:
let decimal_array: DecimalArray128 = make_a_decimal_array();
let array: ArrayRef = Arc::new(decimal_array);
// now pass along array as most of the rest of Arrow library deals in `ArrayRef`
To go from array back to DecimalArray128, the pattern looks like:
match array.data_type() {
    ...
    DataType::Decimal128(precision, scale) => {
        let decimal_array = array.as_any().downcast_ref::<DecimalArray128>().unwrap();
        // do whatever you need to do
    }
    ...
}
Arrow is used inside DataFusion, which is used to build databases on top of it. If the user defines
@ovr, maybe there is some misunderstanding of the Arrow format and the scope of this crate. Arrow format's goal is to address a common problem in performing analytics: interoperability of data between implementations. For example, the internal memory representation of data in spark is different from Python's numpy, which means that using some numpy function against that data is expensive because it incurs a complete serialization-deserialization roundtrip to handle spark datum. The goal of this crate is to offer the necessary implementation in Rust to allow handling the arrow format in memory. If we start adding extra types to the crate, we fall to the exact same problem that we were trying to solve in the first place: other implementations will need to implement a custom serialize and deserialize and therefore they will have to both implement it and incur the serialization-deserialization roundtrip cost. This is the reason for @alamb 's comment that, if you think that arrow would benefit from other decimal types, then the right place to do it is on a broader forum that includes all other implementations (and the discussion of the spec itself). Does this make sense?
The argument is not about memory usage, but about memory layout and how it is consistent with a specification. The Arrow specification is very specific about how memory is laid out, so that it enables a stable ABI to share data within the same process via foreign interfaces, as well as across processes via the flight protocol.
Thank you for the clarification @jorgecarleitao -- that is a great way to summarize the rationale for my suggestions.
@ovr what do you think we should do with this PR? Is it worth keeping open as a Draft, or shall we close it for now?
It's using

Yet another thing: it's overflowing. I don't know how it should be done. Related to
@alamb @jorgecarleitao Can you take a look and do a review? Thanks!
Thanks @ovr ! I have put this on my review queue -- but I will likely not be able to do so until tomorrow.
This is an epic PR -- Great work @ovr. Thank you so much for all this.
I didn't cover this entire PR as closely as I would have liked due to its size, but I did spend a significant amount of time reviewing it and left a bunch of feedback.
Some of these comments could be addressed as potential follow-on PRs if you prefer, but I think some are worth paying attention to.
One thing I was thinking is that perhaps we want to mark DecimalArray and Decimal128Type as "Beta" / subject to change, so that we can keep evolving this area for a while without confusing users.
rust/arrow/src/array/iterator.rs
Outdated
@@ -240,6 +242,48 @@ impl<'a, T: BinaryOffsetSizeTrait> std::iter::Iterator for GenericBinaryIter<'a,
    }
}

// In the future, DecimalArray will become generic and this iterator will use the Decimal type as T here
It seems to me as if the code already is generic over T, so I am not quite sure what this comment is trying to say.
    p: usize,
    s: usize,
) -> Option<U> {
    Some(U::parse(n.to_string().as_str(), p, s).unwrap_or_else(|_e| {
This seems like it will likely be quite slow (as it will convert the number to an allocated string, then parse that string into a decimal, and then throw the string away).
Given that this is a new set of features, I think it would be ok to merge this in (as it isn't a performance regression!) and optimize later, but I predict this will be needed as soon as anyone tries out the decimal type.
I bet you could do something in terms of ArrowNumericType instead of T: ToString, especially as ArrowDecimalType::from_i32 etc. exists.
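A minimal sketch of the arithmetic alternative this comment hints at, assuming the unscaled-i128 storage used elsewhere in this PR; `to_decimal_i128` is an illustrative helper name, not an arrow API:

```rust
// Hedged sketch: build the unscaled i128 representation arithmetically,
// avoiding the allocate-string-then-parse round trip.
fn to_decimal_i128(n: i32, scale: u32) -> i128 {
    // e.g. 42 at scale 2 becomes the unscaled integer 4200 (i.e. 42.00)
    (n as i128) * 10_i128.pow(scale)
}

fn main() {
    assert_eq!(to_decimal_i128(42, 2), 4200);
    assert_eq!(to_decimal_i128(-7, 3), -7000);
}
```

This avoids any heap allocation per value, at the cost of needing overflow checks for large scales.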
rust/arrow/src/datatypes/decimal.rs
Outdated
pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
    const MAX_DIGITS: usize;

    // fn into_json_value(self) -> Option<Value>;
perhaps this could be cleaned up?
// Decimal (precision, scale) = Decimal(1, 2) = 1.00
pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
/// Decimal (precision, scale) = Decimal(1, 2) = 1.00
/// Note that this is a trait to support different types of Decimal implementations
/// in the future, not just those that are 128 bits.
pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
I think adding some comments about the rationale for this trait would also help.
I took an educated guess, but I am not sure that it is correct
// Rescale scale part
fn rescale(&mut self, scale: usize);

fn rescale_to_new(self, scale: usize) -> Self;
It looks like the comments could use some cleanup
rust/arrow/src/datatypes/decimal.rs
Outdated
//make_type!(Decimal64Type, i64, 18);

// i128 max - 170_141_183_460_469_231_731_687_303_715_884_105_727i128
make_type!(Decimal128Type, i128, 38);
/// Type which stores a decimal number using 128 bits
make_type!(Decimal128Type, i128, 38);
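For readers unfamiliar with the pattern, a `make_type!`-style macro typically expands to one marker type per width. This is a hypothetical sketch of such a macro, not the actual macro from this PR:

```rust
// Illustrative only: expand to a zero-sized marker type that records
// the maximum number of decimal digits its native type can hold.
macro_rules! make_type {
    ($name:ident, $native:ty, $max_digits:expr) => {
        #[derive(Debug)]
        pub struct $name;

        impl $name {
            pub const MAX_DIGITS: usize = $max_digits;
        }
    };
}

// i128 can represent up to 38 decimal digits
make_type!(Decimal128Type, i128, 38);

fn main() {
    assert_eq!(Decimal128Type::MAX_DIGITS, 38);
}
```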
return Err(ArrowError::InvalidArgumentError(format!(
    "all columns in a record batch must have the same length, expected {:?} but found {:?} at column {}",
    len,
    column.len(),
I think the order of these arguments is not correct (data type should be second, right?)
I think the order is fine, but we could use schema.field(i).name() instead of the data type. We're checking that all columns have the same number of rows.
let scalar_value = $TYPE($VALUE);
let b = GroupByScalar::try_from(&scalar_value).unwrap();
($EXPR:expr) => {{
I don't understand this change -- it seems to now be testing that try_from returns the same value for the same input...
@@ -210,6 +213,14 @@ impl ScalarValue {
/// Converts a scalar value into an array of `size` rows.
pub fn to_array_of_size(&self, size: usize) -> ArrayRef {
    match self {
        ScalarValue::Decimal128(e, p, s) => match e {
I don't understand this limitation
Perhaps @ovr hasn't gotten to it? I also can't think of what a limitation would be
"SELECT COUNT(*) as cnt, c4 FROM aggregate_simple GROUP BY c4 ORDER BY cnt DESC";
let actual = execute(&mut ctx, sql).await;

let expected = vec![
😮 👍
Hi @ovr , thank you for this PR and for this work.
Unfortunately, what I feared has happened: the crate does not have a well-documented design, which adds a lot of confusion about how to implement new types.
IMO this PR breaks two design principles of this crate, and requires some changes for me to support it.
Issue 1
Let's first walk through the overall principle: all "getters" in the crate return the physical representation of a value, without any logical content associated with it. For example, an iterator of Timestamp(Second, None) returns i64, not TimestampType(i64, Second, None). This is by design; the DataType information is carried by the Array's DataType, not by the individual values of that array, so that the physical representation of the getters equals the in-memory representation.
In this PR, Array::value, iter, etc. are not returning the value at position i; they are allocating a new struct, Decimal128Type, that contains logical information about its value (scale and precision). This creates a new level of indirection by converting, during operations, the physical memory i128 (or equivalent representation) into a struct with scale and precision, which violates arrow's assumption that the in-memory representation follows a well-defined convention. In other words, the struct Decimal128Type has a different in-memory layout than arrow's layout, which violates arrow's design.
The consequence is that we do not leverage the benefits of arrow for compute and many other things; a generic operation op(array_a, array_b) will allocate 4*N^2 new values: 2 u64s per call of value() on each side. The same happens with the values' iterator, which allocates a new struct per next. This prevents the compiler from making optimizations, and hurts many other things (e.g. memory footprint).
Issue 2
Note that the memory layout of a Decimal type is the same as that of a PrimitiveArray; the only difference is that operations between two Decimal types require different semantics, exactly like operations between Timestamps (e.g. casting a Timestamp to another Timestamp requires some unit conversion; it is not simply i64 -> i64).
The problem is that DecimalType cannot be implemented on a PrimitiveArray, because this crate has a design flaw in which ArrowNativeType has a constant DataType, but DecimalType has two non-enum variable values, thereby forbidding us from implementing ArrowNativeType for DataType::DecimalType. An equivalent problem already exists for Timestamp(_, str), where str is a large enum and thus we do not implement ArrowNativeType for each different timezone.
For this reason, we must implement another array, DecimalArray<T: NativeType>, and copy-paste all the PrimitiveArray<T> machinery except the constructors, which now require a DataType (or a scale and precision, just like PrimitiveArray<T>::from_opt_vec requires a timezone).
These are the steps that IMO we would need to address issue 2:

- Declare a struct that implements NativeType (e.g. [i64; 2] for i128 and [i64; 4] for i256, or simply implement NativeType for i128).
- Change the constraint on Buffer::typed_data_mut to accept T: NativeType instead of ArrowNativeType.
- Change impl<T: ArrowNativeType> ToByteSlice -> impl<T: NativeType> ToByteSlice, so that we flag to the Buffer that it can receive NativeType.
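The first step above could be sketched as follows; `NativeType` here is a stand-in marker trait with illustrative contents, not the crate's actual definition:

```rust
// Hedged sketch of "declare a type that implements NativeType":
// a marker trait for types whose memory layout can back an arrow Buffer.
trait NativeType: Copy + std::fmt::Debug {
    const BYTE_WIDTH: usize;
}

// 128-bit decimal backed directly by i128
impl NativeType for i128 {
    const BYTE_WIDTH: usize = 16;
}

// 256-bit decimal backed by four i64 words
impl NativeType for [i64; 4] {
    const BYTE_WIDTH: usize = 32;
}

fn main() {
    assert_eq!(<i128 as NativeType>::BYTE_WIDTH, 16);
    assert_eq!(<[i64; 4] as NativeType>::BYTE_WIDTH, 32);
}
```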
These are the steps that IMO we need to address issue 1:

- Remove Decimal128Type
- Make DecimalArray::value() -> T
- Make DecimalIter<Item=Option<Option<T>>>
- Implement the traits that Decimal128Type implements on the new struct that implements NativeType
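Taken together, the proposed shape might look like the following sketch; names and fields are illustrative, not the crate's API:

```rust
// Hedged sketch: the array stores raw i128 values plus per-array
// precision/scale; `value` returns the physical i128, and the iterator
// yields Option<i128> (None for null slots).
struct DecimalArray {
    values: Vec<i128>,
    validity: Vec<bool>,
    precision: usize,
    scale: usize,
}

impl DecimalArray {
    fn value(&self, i: usize) -> i128 {
        self.values[i] // no per-value struct allocation
    }

    fn iter(&self) -> impl Iterator<Item = Option<i128>> + '_ {
        self.values
            .iter()
            .zip(self.validity.iter())
            .map(|(v, valid)| if *valid { Some(*v) } else { None })
    }
}

fn main() {
    let a = DecimalArray {
        values: vec![100, 0, 250],
        validity: vec![true, false, true],
        precision: 5,
        scale: 2,
    };
    assert_eq!(a.value(2), 250);
    assert_eq!(
        a.iter().collect::<Vec<_>>(),
        vec![Some(100), None, Some(250)]
    );
}
```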
rust/arrow/src/buffer/mutable.rs
Outdated
@@ -247,7 +247,7 @@ impl MutableBuffer {
    /// # Safety
    /// This function must only be used when this buffer was extended with items of type `T`.
    /// Failure to do so results in undefined behavior.
    pub fn typed_data_mut<T: ArrowNativeType>(&mut self) -> &mut [T] {
If we do this, we must mark the function as unsafe.
The underlying design here is that every type representable in a buffer must be ArrowNativeType. This change opens up the opportunity to transmute the buffer to any type, which is undefined behavior.
Instead, IMO we should require NativeType.
Thank you @alamb @jorgecarleitao for your detailed reviews. I appreciate your work! I am going to rework this PR to finish it.
Thanks for this @ovr, and I apologise for not being able to review this before.
I have a question on precision vs scale in one of the tests
fn test_cast_f64_to_decimal() {
    let a = Float64Array::from(vec![1.0, 2.0]);
    let array = Arc::new(a) as ArrayRef;
    let b = cast(&array, &DataType::Decimal(5, 10)).unwrap();
My understanding is that scale should be <= precision, such that we can't have Decimal(5, 10). Is this correct, or is it me not understanding the rules?
No, it can be. I don't see any limitations.
let b = cast(&array, &DataType::Float64).unwrap();
let c = b.as_any().downcast_ref::<Float64Array>().unwrap();

// string compare to skip clippy::float_cmp
You can also ignore the clippy lint here if it helps :)
Hey @ovr, do you mind pushing further changes as separate commits, instead of rebasing all changes into 1 commit? It makes it difficult to see what exactly has changed when you update the code. We in any case squash everything into 1 commit. Thanks :)
@jorgecarleitao Can you take a look at it again? Thanks. Did I understand you correctly? Now the Decimal types have ::NATIVE types, which represent the native type. For
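The design described here could be sketched as follows; identifier names and casing are guesses, not the code in this PR:

```rust
// Hedged sketch: each decimal type exposes an associated native type for
// its physical representation, so generic code can name the storage type.
trait DecimalType {
    type Native;
    const MAX_PRECISION: usize;
}

struct Decimal128Type;

impl DecimalType for Decimal128Type {
    type Native = i128; // the physical representation
    const MAX_PRECISION: usize = 38;
}

// Generic code works with the raw native type throughout.
fn store<T: DecimalType>(unscaled: T::Native) -> T::Native {
    unscaled
}

fn main() {
    let v: <Decimal128Type as DecimalType>::Native = 12345;
    assert_eq!(store::<Decimal128Type>(v), 12345);
}
```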
The Apache Arrow Rust community is moving the Rust implementation into its own dedicated GitHub repositories, arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository. Please see the mailing-list thread for more details. We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.
#10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process. Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories. Thank you for understanding and helping to make arrow-rs and datafusion better.
Implementation for DataType::Decimal, which is represented by Decimal128Type on top of i128.
This PR implements:
Notice: