Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10818: [Rust] Initial implementation for Decimal 128/256 #9232

Closed
wants to merge 1 commit into from

Conversation

ovr
Copy link
Contributor

@ovr ovr commented Jan 17, 2021

Implementation for DataType::Decimal which is represented by Decimal128Type on top of i128.

This PR Implements:

  • Math (Add/Sub)
  • Casts (integers/floats/utf8 to Decimal and back)
  • Reading CSV
  • GROUP BY

Notice:

  • This PR doesn't change ArrayDecimal to be generic, it's planned to be done in another PR with Decimal256 will be done.
  • This PR doesn't enable decimal in DataFushion, it continues to use Float64 for Decimal

@github-actions
Copy link

}

#[derive(Debug)]
pub enum DecimalType {
Copy link
Contributor Author

@ovr ovr Jan 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @andygrove @Dandandan @nevi-me @jorgecarleitao

I created representations for Decimal as Int32DecimalType, Int64DecimalType, Int128DecimalType or LargeDecimal, which uses BigInt.

  1. What is the best way to abstract representations of DataType::Decimal in Rust? Enum?
  2. How should I work with DecimalArray? Should I make it as generic by passing representation type to it (but how should I detect type, match in everyplace? As I have known functions cannot return type) or make it generic on top of DecimalType and get byte size from it?

Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ovr -- sorry for the lack of response. I am not familiar with this code so I will need some non trivial time to review and think about your question. I hope to find some time this weekend to do so.

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Jan 24, 2021

Hi @ovr , I went through what is here so far.

First of all, great stuff that you are taking this on.

Broadly speaking, this PR currently contains the following "steps":

  1. make changes to the arrow crate to enable Decimal type without adding any new functionality (e.g. rename datatypes.rs)
  2. add the native type to the arrow crate
  3. add the Array type to the arrow crate
  4. add support to parquet
  5. add support to DataFusion

I suggest that 4-5 are done on a separate PR. 1-3 can be done on this PR, but I suggest that we split them on separate commits.

Concerning step 2, the two main checks in the list are:

  1. the type's logical representation exists in the arrows' spec
  2. the type's physical representation matches arrows' spec

Looking at the spec, a decimal type only supports 128 and 256 bits. So, I am not understanding why we are trying to add support for Int32,Int64,Int128,LargeDecimal here. Since Rust only supports i128, I would say that we should only go for i128 (the same reason we do not support f16).

EDIT: I searched for the wrong crate name ^_^

Could you clarify some of these points?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all thank you @ovr for this proposal and contribution. I also finally had a chance to go through this PR this morning and I broadly agree with @jorgecarleitao 's recommendations

Here are some additional thoughts:

  • My reading of the arrow spec, also suggests that Arrow only defines the Int128Decimal and Int256Decimal. In my opinion, we should not add types to the Rust implementation that are not in the arrow spec.
  • If we want to add arbitrary precision decimals to Arrow, perhaps we should propose that to the arrow mailing list and see what other implementers say
  • https://github.com/rust-num/num-bigint looks like a solid choice of crate to base an arrow decimal implementation off of

How should I work with DecimalArray? Should I make it as generic by passing representation type to it (but how should I detect type, match in everyplace? As I have known functions cannot return type) or make it generic on top of DecimalType and get byte size from it?

I am not quite what you are asking, but the common pattern for passing arrays around is as an ArrayRef and then downcasting to the appropriate type.

Given the implementation in this PR that is backed by different bit widths, I see you can't really treat all DecimalArrays the same. The normal arrow pattern I would expect would something like add a DataType::Decimal128 and DataType::Decimal256 type for the different underlying bit widths and then use them as follows:

let decimal_array: DecimalArray128 = make_a_decimal_array();

let array : ArrayRef = Arc::new(decimal_array);
// now pass along array as most of the rest of Arrow library deals in `ArrayRef`

To go from array back to DecimalArray128 the pattern looks like:

match array.data_type() {
...
  DataType::Decimal128(precision, scale) => {
    let decimal_array = array.as_any().downcast_ref::<DecimalArray128>().unwrap();
    // do whatever you need to do
 }

rust/arrow/src/compute/kernels/cast.rs Outdated Show resolved Hide resolved
@ovr
Copy link
Contributor Author

ovr commented Jan 24, 2021

Looking at the spec, a decimal type only supports 128 and 256 bits. So, I am not understanding why we are trying to add support for Int32,Int64,Int128,LargeDecimal here.

Arrow is used inside DF, which is used to build databases on to of it. If the user defines DECIMAL(2,2) it's a bit overhead to use i128 as representation for it.

  1. Memory usage, DecimalArray<Decimal128> will use 16 bytes for each element.
  2. Performance, Rust is using LLVM, which allows integers of any bit width from 1 to 2^23. Inside LLVM backend for x86-64, It will use rax, rcx to store it.

@ovr ovr force-pushed the issue-10818 branch 3 times, most recently from 5fb220e to b352649 Compare January 24, 2021 12:39
@jorgecarleitao
Copy link
Member

Arrow is used inside DF, which is used to build databases on to of it. If the user defines DECIMAL(2,2) it's a bit overhead to use i128 as representation for it.

@ovr, maybe there is some misunderstanding of the Arrow format and the scope of this crate.

Arrow format's goal is to address a common problem in performing analytics: interoperability of data between implementations. For example, the internal memory representation of data in spark is different from Python's numpy, which means that using some numpy function against that data is expensive because it incurs a complete serialization-deserialization roundtrip to handle spark datum.

The goal of this crate is to offer the necessary implementation in Rust to allow handling the arrow format in memory.

If we start adding extra types to the crate, we fall to the exact same problem that we were trying to solve in the first place: other implementations will need to implement a custom serialize and deserialize and therefore they will have to both implement it and incur the serialization-deserialization roundtrip cost.

This is the reason for @alamb 's comment that, if you think that arrow would benefit from other decimal types, then the right place to do it is on a broader forum that includes all other implementations (and the discussion of the spec itself).

Does this make sense?

Memory usage, DecimalArray will use 16 bytes for each element.

The argument is not about memory usage, but about memory layout and how it is consistent with a specification. Arrow specification is very specific about how memory is lay out so that it enables a stable ABI to share data within the same process via foreign interfaces as well as across processes via the flight protocol.

@alamb
Copy link
Contributor

alamb commented Jan 26, 2021

Thank you for the clarification @jorgecarleitao -- that is a great way to summarize the rationale for my suggestions.

@alamb
Copy link
Contributor

alamb commented Mar 3, 2021

@ovr what do you think we should do with this PR? Is it worth keeping open as a Draft or shall we close it for now?

@ovr ovr force-pushed the issue-10818 branch 3 times, most recently from 83c0b7f to 755dce5 Compare March 3, 2021 11:49
@ovr
Copy link
Contributor Author

ovr commented Mar 8, 2021

If we want to support Decimal(i128), can't we use Rust's native i128? E.g. does decimal128 + decimal128 correspond to Rust's i128 + i128, or is the math actually different?

@jorgecarleitao

It's using i128 under the hood as expected (which generates to u64 and i64 by LLVM). As you know, Decimal has precision/scale and It's not possible to calculate the sum of two different decimals with different scales without rescaling, for example: Decimal(1,1) and Decimal(1,5).

Yet another thing, it's overflowing.

DECIMAL(1) + DECIMAL(1) = DECIMAL(1), example 1 + 9.

I don't know how it should be done.

Related to Decimal256, probably it will use (but it will be done in another PR)

[i64, u64, u64, u64] or [i128, u128] or BigInt

@ovr ovr marked this pull request as ready for review March 8, 2021 18:29
@ovr
Copy link
Contributor Author

ovr commented Mar 8, 2021

@alamb @jorgecarleitao Can you take a look and do a review? Thanks!

@ovr ovr changed the title ARROW-10818: [Rust] Implement DecimalType ARROW-10818: [Rust] Initial implementation for Decimal (Decimal128Type only) Mar 8, 2021
@alamb
Copy link
Contributor

alamb commented Mar 8, 2021

Thanks @ovr ! I have put this on my review queue -- but I will likely not be able to do so until tomorrow

@ovr ovr force-pushed the issue-10818 branch 2 times, most recently from 327b31f to 2e5fe36 Compare March 9, 2021 19:26
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an epic PR -- Great work @ovr. Thank you so much for all this.

I didn't cover this entire PR as closely as I would have liked due to its size, but I did spend a significant amount of time reviewing it and left a bunch of feedback

Some of these comments could be addressed as potential follow on PRs if you prefer, but I think some are worth paying attention to.

One thing I was thinking was perhaps we want to mark DecimalArray and Decimal128Type as "Beta" or subject to change so as to allow further change without confusing users that this is likely to be an area subject to potential change for awhile

@@ -240,6 +242,48 @@ impl<'a, T: BinaryOffsetSizeTrait> std::iter::Iterator for GenericBinaryIter<'a,
}
}

// In the feature, DecimalArray will become Generic and this iterator will use type of Decimal as T here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me as if the code already is generic over T so I am not quite sure what this comment is trying to say

p: usize,
s: usize,
) -> Option<U> {
Some(U::parse(n.to_string().as_str(), p, s).unwrap_or_else(|_e| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it will likely be quite slow (as it will convert the number to an allocated string, and then parse that string into a decimal, and then throw the string away)

Given that this is a new set of features, I think it would be ok potentially to merge this in (as it isn't a performance regression!) and optimize later, but I predict this will be needed as soon as anyone tries out the decimal type.

I bet you could do something in terms of ArrowNumericType instead of T: ToString, especially as ArrowDecimalType::from_i32 etc exists

pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
const MAX_DIGITS: usize;

// fn into_json_value(self) -> Option<Value>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps this could be cleaned up?

Comment on lines +44 to +47
// Decimal (precision, scale) = Decimal(1, 2) = 1.00
pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Decimal (precision, scale) = Decimal(1, 2) = 1.00
pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {
/// Decimal (precision, scale) = Decimal(1, 2) = 1.00
/// Note that this is a trait to support different types of Decimal implementations
/// in the future, not just those that are 128 bits.
pub trait ArrowDecimalType: fmt::Debug + Send + Sync + FromStr + PartialEq {

I think adding some comments about the rationale for this trait would also help.

I took an educated guess, but I am not sure that it is correct

// Rescale scale part
fn rescale(&mut self, scale: usize);

fn rescale_to_new(self, scale: usize) -> Self;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the comments could use some cleanup

//make_type!(Decimal64Type, i64, 18);

// i128 max - 170_141_183_460_469_231_731_687_303_715_884_105_727i128
make_type!(Decimal128Type, i128, 38);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
make_type!(Decimal128Type, i128, 38);
/// Type which stores a decimal number using 128 bits
make_type!(Decimal128Type, i128, 38);

return Err(ArrowError::InvalidArgumentError(format!(
"all columns in a record batch must have the same length, expected {:?} but found {:?} at column {}",
len,
column.len(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the order of these arguments is not correct (data type should be second, right?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the order is fine, but we could use schema.field(i).name() instead of data type.

We're checking that all columns have the same number of rows.


let scalar_value = $TYPE($VALUE);
let b = GroupByScalar::try_from(&scalar_value).unwrap();
($EXPR:expr) => {{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change -- it seems to now be testing that try_from returns the same value for the same input...

@@ -210,6 +213,14 @@ impl ScalarValue {
/// Converts a scalar value into an array of `size` rows.
pub fn to_array_of_size(&self, size: usize) -> ArrayRef {
match self {
ScalarValue::Decimal128(e, p, s) => match e {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this limitation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @ovr hasn't gotten to it? I also can't think of what a limitation would be

"SELECT COUNT(*) as cnt, c4 FROM aggregate_simple GROUP BY c4 ORDER BY cnt DESC";
let actual = execute(&mut ctx, sql).await;

let expected = vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮 👍

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ovr , thank you for this PR and for this work.

Unfortunately, what I feared would be happening. Unfortunately the crate does not have a well documented design, which adds a lot of confusion on how to implement new types.

IMO this PR breaks 2 designs of this crate, and requires some changes for me to support it.

Issue 1

Let's first walk to the overall principle: all "getters" on the crate return the physical representation of a value without any logical content associated with it. For example, an iterator of Timestamp(Second, None) returns i64, not TimestampType(i64, Second, None). This is by design; the DataType information is carried over by the Array's DataType, not by the individual values of that array, so that the physical representation of the getters equals the in-memory representation.

In this PR, Array::value, iter, etc. are not returning the value at position i; they are allocating a new struct, Decimal128Type, that contains logical information about its value (scale and precision). This creates a new level of indirection by converting, during operations, the physical memory i128 (or equivalent representation) into a struct {} with scale and precision, which violates arrows' assumption that the in-memory representation follows a well defined convention. In other words, the struct Decimal128Type has a different in-memory layout than arrows' layout, which violates arrows' design.

The consequence is that we do not leverage the benefits of arrow for compute and many other things; a generic operation op(array_a, array_b) will allocate new 4*N^2 values: 2 u64s per call of value() on each side. The same happens to the values' iterator, that allocates a new struct per next. This forbids the compiler from making optimizations and many other things (e.g. memory footprint).

Issue 2

Note that the memory layout of a Decimal type is the same as of a PrimitiveArray; the only difference is that operations between two Decimal types require different semantics, exactly like operations between Timestamp (e.g. casting a Timestamp to another Timestamp requires some unit conversion; it is not simply i64 -> i64).

The problem is that DecimalType cannot be implemented on a PrimitiveArray because this crate has a design flaw on which ArrowNativeType has a constant DataType, but DecimalType has two non-enum variable values, thereby forbidding us from implementing ArrowNativeType for the DataType::DecimalType. An equivalent problem already exists for Timestamp(_, str), where str is a large enum and thus we do not implement ArrowNativeType for each different timezone.

For this reason, we must implement another array, DecimalArray<T: NativeType>, and copy-paste all PrimitiveArray<T> stuff except the constructors, that now require a DataType (or a scale and precision, just like PrimitiveArray<T>::from_opt_vec requires a timezone).

These are the steps that IMO we would need to address issue 2:

  1. declare a struct that implements NativeType (e.g. [i64; 2] for 128 and [i64; 4] for i256, or simply implement NativeType for i128).

  2. Change the constraint on Buffer::typed_data_mut to accept T: NativeType instead of ArrowNativeType

  3. change impl<T: ArrowNativeType> ToByteSlice -> impl<T: NativeType> ToByteSlice so that we flag to the Buffer that it can receive NativeType.

These are the steps that IMO we need to address issue 1:

  1. Remove Decimal128Type
  2. Make DecimalArray::value() -> T
  3. Make DecimalIter<Item=Option<Option<T>>>
  4. Implement the traits that Decimal128Type on the new struct that implements NativeType

@@ -247,7 +247,7 @@ impl MutableBuffer {
/// # Safety
/// This function must only be used when this buffer was extended with items of type `T`.
/// Failure to do so results in undefined behavior.
pub fn typed_data_mut<T: ArrowNativeType>(&mut self) -> &mut [T] {
Copy link
Member

@jorgecarleitao jorgecarleitao Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this, we must mark the function as unsafe.

The underlying design here is that every type representable in a buffer must be ArrowNativeType. This opens up the opportunity to transmute the buffer to any type, which is undefined behavior.

Instead, IMO we should write NativeType instead.

rust/arrow/src/datatypes/decimal.rs Outdated Show resolved Hide resolved
@ovr
Copy link
Contributor Author

ovr commented Mar 10, 2021

Thank you @alamb @jorgecarleitao for your such detailed review. Appreciate your work! I am going to rework this PR to finish it.

@ovr ovr force-pushed the issue-10818 branch 2 times, most recently from 5e28b6b to f681d8f Compare March 13, 2021 10:45
Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @ovr, and I apologise for not being able to review this before.

I have a question on precision vs scale in one of the tests

@@ -210,6 +213,14 @@ impl ScalarValue {
/// Converts a scalar value into an array of `size` rows.
pub fn to_array_of_size(&self, size: usize) -> ArrayRef {
match self {
ScalarValue::Decimal128(e, p, s) => match e {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @ovr hasn't gotten to it? I also can't think of what a limitation would be

return Err(ArrowError::InvalidArgumentError(format!(
"all columns in a record batch must have the same length, expected {:?} but found {:?} at column {}",
len,
column.len(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the order is fine, but we could use schema.field(i).name() instead of data type.

We're checking that all columns have the same number of rows.

rust/arrow/src/datatypes/i256_type.rs Outdated Show resolved Hide resolved
fn test_cast_f64_to_decimal() {
let a = Float64Array::from(vec![1.0, 2.0]);
let array = Arc::new(a) as ArrayRef;
let b = cast(&array, &DataType::Decimal(5, 10)).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that scale should be <= precision, such that we can't have Decimal(5, 10). Is this correct, or is it me not understanding the rules?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not, it can be. I don't see any limitations.

let b = cast(&array, &DataType::Float64).unwrap();
let c = b.as_any().downcast_ref::<Float64Array>().unwrap();

// string compare to skip clippy::float_cmp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also ignore the clippy lint here if it helps :)

@ovr ovr changed the title ARROW-10818: [Rust] Initial implementation for Decimal (Decimal128Type only) ARROW-10818: [Rust] Initial implementation for Decimal 128/256 Mar 15, 2021
@ovr ovr force-pushed the issue-10818 branch 2 times, most recently from 5c32fe3 to 3ddb549 Compare March 28, 2021 18:33
@nevi-me
Copy link
Contributor

nevi-me commented Mar 28, 2021

Hey @ovr, do you mind separating further commits, instead of rebasing all changes into 1 commit? It makes it difficult to see what exactly has changed when you update the code.

We in any case squash everything into 1 commit. Thanks :)

@ovr
Copy link
Contributor Author

ovr commented Apr 4, 2021

@jorgecarleitao Can you take a look at it again? Thanks. Did I understand you correctly? Now Decimal types has ::NATIVE types, which represents a native type. DecimalArray started to be generic and store values as native type (I didn't change builder for it, because PR is epic large). It doesn't allocate any new structure if it doesn't need it. BTW: I created a polyfill for i256 to test requirements for tests correctly.

For i128/i256 I created ArrowDecimalNativeType which is ArrowNativeType + two additional methods: pow_to and as_string which helps to generate string from native representation, without creation of DecimalType structure.

@alamb
Copy link
Contributor

alamb commented Apr 19, 2021

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository

Please see the mailing-list thread for more details

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

@alamb
Copy link
Contributor

alamb commented May 3, 2021

#10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process

Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories.

Thank you for understanding and helping to make arrow-rs and datafusion better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants