[WIP]Change precision of decimal without validation #2357
@@ -431,8 +431,8 @@ pub fn cast_with_options(
         return Ok(array.clone());
     }
     match (from_type, to_type) {
-        (Decimal128(_, s1), Decimal128(p2, s2)) => {
-            cast_decimal_to_decimal(array, s1, p2, s2)
+        (Decimal128(p1, s1), Decimal128(p2, s2)) => {
+            cast_decimal_to_decimal(array, p1, s1, p2, s2)
         }
         (Decimal128(_, scale), _) => {
             // cast decimal to other type
@@ -1254,6 +1254,7 @@ const fn time_unit_multiple(unit: &TimeUnit) -> i64 {
 /// Cast one type of decimal array to another type of decimal array
 fn cast_decimal_to_decimal(
     array: &ArrayRef,
+    input_precision: &usize,
     input_scale: &usize,
     output_precision: &usize,
     output_scale: &usize,
@@ -1276,8 +1277,17 @@ fn cast_decimal_to_decimal(
             .iter()
             .map(|v| v.map(|v| v.as_i128() * mul))
             .collect::<Decimal128Array>()
-    }
-    .with_precision_and_scale(*output_precision, *output_scale)?;
+    };
+    // For decimal cast to decimal, if the range of output is gt_eq than the input, don't need to
+    // do validation.
+    let output_array = match output_precision - output_scale >= input_precision - input_scale {
+        true => {
+            output_array.with_precision_and_scale(*output_precision, *output_scale, false)

What does […] If this is true, why do we check these things after doing the cast? We should check the output precision and scale at the beginning of this method.

You are missing one point of the […]. The decimal array is created from the case below […], and we need to set (or reset) the precision/scale for the resulting decimal array.

Could you please share a link to the code? I can't find it.

@HaoYang670 A new decimal array is created by the cast: arrow-rs/arrow/src/compute/kernels/cast.rs, line 1255 in 9a630a1. In the cast case, if the target decimal type has a larger range, we can skip the validation.

+        }
+        false => {
+            output_array.with_precision_and_scale(*output_precision, *output_scale, true)
+        }
+    }?;

Comment on lines +1283 to +1290

Do we need this?

Consider casting decimal(5,2) to decimal(5,3).

I am still learning the logic. Here is just a nit (suggested change): […]

This out-of-range case should already be caught now as […]. But I get the point of the change here. I'd suggest rewriting the condition as […]


     Ok(Arc::new(output_array))
 }
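
The condition in the hunk above compares the number of integer digits (precision minus scale) available in the input and output types. Below is a minimal sketch of that check, not the PR's code: `can_skip_validation` is a hypothetical helper, and it assumes that reducing the scale truncates rather than rounds. It uses signed arithmetic so that a scale larger than the precision cannot underflow.

```rust
/// Hypothetical helper (not part of arrow-rs): returns true when every value
/// of decimal(input_precision, input_scale) is guaranteed to fit into
/// decimal(output_precision, output_scale), so per-value validation can be
/// skipped. Assumes scale reduction truncates instead of rounding.
fn can_skip_validation(
    input_precision: usize,
    input_scale: usize,
    output_precision: usize,
    output_scale: usize,
) -> bool {
    // Digits left of the decimal point = precision - scale.
    // Convert to i64 first so the subtraction cannot underflow.
    let input_integer_digits = input_precision as i64 - input_scale as i64;
    let output_integer_digits = output_precision as i64 - output_scale as i64;
    output_integer_digits >= input_integer_digits
}
```

For the decimal(5,2) to decimal(5,3) case raised in review, the output has 2 integer digits against 3 for the input, so the helper returns false and validation is still required: 999.99 rescales to the unscaled integer 999990, which needs 6 digits.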

@@ -184,6 +184,8 @@ where
             let a = arrow::compute::cast(&array, &ArrowType::Date32)?;
             arrow::compute::cast(&a, &target_type)?
         }
+        // In the parquet file, if the logical/converted type is decimal and the physical type
+        // is INT32 or INT64, don't need to do validation.
         ArrowType::Decimal128(p, s) => {
             let array = match array.data_type() {
                 ArrowType::Int32 => array
@@ -208,7 +210,7 @@ where
                     ))
                 }
             }
-            .with_precision_and_scale(p, s)?;
+            .with_precision_and_scale(p, s, false)?;

Hmm, how do we know that we don't need to do validation here? The decimal type can be specified; what if the user specifies a smaller precision?

From the codebase and the parquet arrow_reader, the Arrow decimal type is converted from the decimal type in the parquet schema/metadata. The parquet physical types are limited to the following: […] The […]

I'm not sure we can make this assumption; we validate UTF-8 strings, for example...

Do you mean that the parquet data may not match its metadata or schema?

An API is unsound if you can violate invariants of a data structure, therefore potentially introducing UB, without using unsafe APIs to achieve this. As parquet IO is safe, if parquet did not validate the data on read it would be unsound. This leads to a couple of options, and possibly more besides: […]

I want to take a step back before undertaking any of these, as frankly I am deeply confused by what this precision argument is actually for: why arbitrarily truncate your value space?

Why can the user specify a new data type rather than use the type from the parquet schema? If the type can be specified, other errors may arise when the specified type is incompatible with the data in the parquet file, although that is how the current interface works.

To represent a decimal, some systems simply store the data as a signed integer and keep the scale/precision as metadata. We can recover the exact value from these two parts. For example, decimal(3,1), the value of […] @tustvold


             Arc::new(array) as ArrayRef
         }
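
The thread above describes a decimal as a signed integer plus scale/precision metadata. As a rough illustration of how the value is recovered from those two parts, here is a small sketch; `format_decimal` is a hypothetical helper written for this note, not an arrow-rs or parquet API.

```rust
/// Hypothetical helper: renders an unscaled integer as a decimal string by
/// placing the decimal point `scale` digits from the right.
/// Assumes scale <= 38, the Decimal128 maximum, so 10^scale fits in u128.
fn format_decimal(unscaled: i128, scale: u32) -> String {
    if scale == 0 {
        return unscaled.to_string();
    }
    let sign = if unscaled < 0 { "-" } else { "" };
    // unsigned_abs avoids overflow for i128::MIN.
    let abs = unscaled.unsigned_abs();
    let divisor = 10u128.pow(scale);
    let int_part = abs / divisor;
    let frac_part = abs % divisor;
    // Zero-pad the fractional part to exactly `scale` digits.
    format!("{}{}.{:0width$}", sign, int_part, frac_part, width = scale as usize)
}
```

With scale 1, the stored integer 123 reads as 12.3; with scale 2, the stored integer -5 reads as -0.05. The stored integer alone carries no information about where the decimal point sits, which is why the metadata has to be trusted or validated.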

need_validation is somewhat confusing to me. Do we need to check the precision, or check the scale?

Just the precision, because the decimal will be stored as a bigint without a fractional part, so we only need to check the max/min bounds for the precision. The max/min value is determined by the precision. For example, for decimal(4,n) the max bigint is 9999 and the min is -9999. We should validate the input decimal, or the values in the decimal array, if needed.

Then I'd prefer renaming it and adding more docs.
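
The precision bound described above can be sketched as follows. This is an illustration only: `max_for_precision` and `validate_precision` are hypothetical names, and the code assumes a precision of at most 38 so that 10^precision fits in an i128.

```rust
/// Hypothetical helper: largest unscaled integer that fits in `precision`
/// decimal digits, i.e. 10^precision - 1. Assumes precision <= 38.
fn max_for_precision(precision: u32) -> i128 {
    10i128.pow(precision) - 1
}

/// Hypothetical helper: checks that an unscaled value lies within the
/// symmetric range [-(10^precision - 1), 10^precision - 1].
/// The scale plays no part in this check.
fn validate_precision(value: i128, precision: u32) -> Result<(), String> {
    let max = max_for_precision(precision);
    let min = -max;
    if value < min || value > max {
        Err(format!(
            "{} is out of range for precision {} (allowed: {} to {})",
            value, precision, min, max
        ))
    } else {
        Ok(())
    }
}
```

For decimal(4,n) this accepts 9999 and -9999 and rejects 10000, whatever the value of n.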