[WIP] Change precision of decimal without validation #2357

Closed
15 changes: 9 additions & 6 deletions arrow/src/array/array_decimal.rs
@@ -57,7 +57,7 @@ use crate::util::decimal::{BasicDecimal, Decimal128, Decimal256};
/// // set precision and scale so values are interpreted
/// // as `8887.000000`, `Null`, and `-8887.000000`
/// let decimal_array = decimal_array
/// .with_precision_and_scale(23, 6)
/// .with_precision_and_scale(23, 6, true)
/// .unwrap();
///
/// assert_eq!(&DataType::Decimal128(23, 6), decimal_array.data_type());
@@ -253,11 +253,15 @@ pub trait BasicDecimalArray<T: BasicDecimal, U: From<ArrayData>>:
/// Returns a Decimal array with the same data as self, with the
/// specified precision.
///
/// If you can guarantee that all values in this array are within the bounds of the
/// specified precision, set `need_validation` to `false`; otherwise set it to `true`.
///
/// Returns an Error if:
/// 1. `precision` is larger than [`Self::MAX_PRECISION`]
/// 2. `scale` is larger than [`Self::MAX_SCALE`];
/// 3. `scale` is > `precision`
fn with_precision_and_scale(self, precision: usize, scale: usize) -> Result<U>
/// 4. `need_validation` is `true` and some values are out of range for the given precision
fn with_precision_and_scale(self, precision: usize, scale: usize, need_validation: bool) -> Result<U>
Contributor

`need_validation` is somewhat confusing to me.
Should we check the precision, or the scale?

Contributor Author

Just the precision, because the decimal is stored as a big integer without a fractional part, so we only need to check the bounds against the max/min for the precision.
The max/min value is determined by the precision.
For example, for decimal(4, n) the max integer is 9999 and the min is -9999; we should validate the input decimal, or the values in the decimal array, when needed.
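
A minimal sketch of that bound check, assuming values are stored as i128 (the helper name is illustrative, not necessarily the crate's exact API):

    // Illustrative only: check that a stored i128 fits the given precision.
    fn validate_decimal_precision(value: i128, precision: u32) -> Result<(), String> {
        let max = 10i128.pow(precision) - 1; // e.g. precision 4 => 9999
        let min = -max;
        if value < min || value > max {
            return Err(format!("{value} is out of range for precision {precision}"));
        }
        Ok(())
    }

    fn main() {
        assert!(validate_decimal_precision(9999, 4).is_ok());
        assert!(validate_decimal_precision(10_000, 4).is_err());
    }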

Contributor

Then I'd prefer renaming it and adding more docs.

where
Self: Sized,
{
@@ -282,10 +286,9 @@
)));
}

// Ensure that all values are within the requested
// precision. For performance, only check if the precision is
// decreased
self.validate_decimal_precision(precision)?;
if need_validation {
self.validate_decimal_precision(precision)?;
}

let data_type = if Self::VALUE_LENGTH == 16 {
DataType::Decimal128(self.precision(), self.scale())
18 changes: 14 additions & 4 deletions arrow/src/compute/kernels/cast.rs
@@ -431,8 +431,8 @@ pub fn cast_with_options(
return Ok(array.clone());
}
match (from_type, to_type) {
(Decimal128(_, s1), Decimal128(p2, s2)) => {
cast_decimal_to_decimal(array, s1, p2, s2)
(Decimal128(p1, s1), Decimal128(p2, s2)) => {
cast_decimal_to_decimal(array, p1, s1, p2, s2)
}
(Decimal128(_, scale), _) => {
// cast decimal to other type
@@ -1254,6 +1254,7 @@ const fn time_unit_multiple(unit: &TimeUnit) -> i64 {
/// Cast one type of decimal array to another type of decimal array
fn cast_decimal_to_decimal(
array: &ArrayRef,
input_precision: &usize,
input_scale: &usize,
output_precision: &usize,
output_scale: &usize,
@@ -1276,8 +1277,17 @@ fn cast_decimal_to_decimal(
.iter()
.map(|v| v.map(|v| v.as_i128() * mul))
.collect::<Decimal128Array>()
}
.with_precision_and_scale(*output_precision, *output_scale)?;
};
// For a decimal-to-decimal cast, if the output range is greater than or equal to
// the input range, no validation is needed.
let output_array = match output_precision - output_scale >= input_precision - input_scale {
true => {
output_array.with_precision_and_scale(*output_precision, *output_scale, false)
Contributor

What does with_precision_and_scale check if you set need_validation to false?
It only checks:
1. output precision <= max precision?
2. output scale <= max scale?
3. output precision >= output scale, I guess.

If that's the case, why do we check these things after doing the cast? We should check the output precision and scale at the beginning of this method.

Contributor Author

> What does with_precision_and_scale check if you set need_validation to false? It only checks: 1. output precision <= max precision? 2. output scale <= max scale? 3. output precision >= output scale, I guess.
>
> If that's the case, why do we check these things after doing the cast? We should check the output precision and scale at the beginning of this method.

You're missing one point: with_precision_and_scale also resets/sets the DataType for this array.

The decimal array is created as below:

array
    .iter()
    .map(|v| v.map(|v| v.as_i128() * mul))
    .collect::<Decimal128Array>()

and we need to reset/set the precision/scale for the resulting decimal array.
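
As a sketch of that flow (the three-argument with_precision_and_scale is the signature proposed in this PR, so this assumes this branch's trait method is in scope):

    use arrow::array::{BasicDecimalArray, Decimal128Array};

    // Collecting from an iterator of Option<i128> yields an array with the
    // default precision/scale; with_precision_and_scale then re-stamps the
    // DataType on the result, skipping value validation when the flag is false.
    fn rescaled(values: Vec<Option<i128>>) -> arrow::error::Result<Decimal128Array> {
        values
            .into_iter()
            .collect::<Decimal128Array>()
            .with_precision_and_scale(12, 3, false)
    }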

Contributor

> You're missing one point: with_precision_and_scale also resets/sets the DataType for this array.

Could you please share the code link? I can't find it.

Contributor Author
@liukun4515 Aug 8, 2022

@HaoYang670
An array is created by reading data from parquet:

ArrowType::Decimal128(p, s) => {

and a new decimal array is created when casting:

fn cast_decimal_to_decimal(

In the cast case, if the target decimal type has a larger range, we can skip the validation.
For example, casting decimal(10,3) to decimal(12,3) doesn't need validation, and even int32 to decimal(12,0) doesn't need validation.
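
A small self-contained check of that rule (the helper name is hypothetical):

    // Hypothetical helper: validation can be skipped when the output's
    // integer-digit capacity covers the input's.
    fn can_skip_validation(in_p: u32, in_s: u32, out_p: u32, out_s: u32) -> bool {
        out_p - out_s >= in_p - in_s
    }

    fn main() {
        assert!(can_skip_validation(10, 3, 12, 3)); // decimal(10,3) -> decimal(12,3)
        assert!(can_skip_validation(10, 0, 12, 0)); // int32-sized values into decimal(12,0)
        assert!(!can_skip_validation(5, 2, 5, 3));  // decimal(5,2) -> decimal(5,3) can overflow
    }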

}
false => {
output_array.with_precision_and_scale(*output_precision, *output_scale, true)
}
}?;
Comment on lines +1283 to +1290
Member

Do we need this?

cast_decimal_to_decimal already adjusts the scale, so it seems only the precision matters here. with_precision_and_scale also only validates when the new precision is smaller.

Contributor Author

If we cast decimal(5,2) to decimal(5,3), a value of 123.45 will be converted to 123.450, but the underlying integer 123450 is out of range for decimal(5,3).
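
Concretely, the rescaling arithmetic that overflows:

    fn main() {
        let v: i128 = 12345;            // 123.45 stored as decimal(5,2)
        let rescaled = v * 10;          // scale 2 -> 3: 123.450 stored as 123450
        let max_p5 = 10i128.pow(5) - 1; // 99999, the largest value precision 5 holds
        assert!(rescaled > max_p5);     // out of range for decimal(5,3)
    }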

Contributor

I am still learning the logic. Here is just a nit:

Suggested change
let output_array = match output_precision - output_scale >= input_precision - input_scale {
true => {
output_array.with_precision_and_scale(*output_precision, *output_scale, false)
}
false => {
output_array.with_precision_and_scale(*output_precision, *output_scale, true)
}
}?;
let output_array = output_array.with_precision_and_scale(
*output_precision,
*output_scale,
output_precision - output_scale < input_precision - input_scale)?;

Member

> If we cast decimal(5,2) to decimal(5,3), a value of 123.45 will be converted to 123.450, but the underlying integer 123450 is out of range for decimal(5,3).

That out-of-range case should already be caught, since with_precision_and_scale now does the validation.

But I get the point of the change: with_precision_and_scale would otherwise validate every time, while in cases where output precision >= input precision + (output scale - input scale) we can skip the validation.

I'd suggest rewriting the condition as output precision >= input precision + (output scale - input scale).
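
The two formulations are algebraically equivalent; a brute-force check (using signed ints to sidestep unsigned subtraction):

    fn main() {
        // out_p - out_s >= in_p - in_s  <=>  out_p >= in_p + (out_s - in_s)
        for in_p in 1i32..=9 {
            for in_s in 0..=in_p {
                for out_p in 1i32..=9 {
                    for out_s in 0..=out_p {
                        assert_eq!(
                            out_p - out_s >= in_p - in_s,
                            out_p >= in_p + (out_s - in_s)
                        );
                    }
                }
            }
        }
    }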


Ok(Arc::new(output_array))
}
4 changes: 3 additions & 1 deletion parquet/src/arrow/array_reader/primitive_array.rs
@@ -184,6 +184,8 @@ where
let a = arrow::compute::cast(&array, &ArrowType::Date32)?;
arrow::compute::cast(&a, &target_type)?
}
// In the parquet file, if the logical/converted type is decimal and the physical type
// is INT32 or INT64, no validation is needed.
ArrowType::Decimal128(p, s) => {
let array = match array.data_type() {
ArrowType::Int32 => array
@@ -208,7 +210,7 @@ where
))
}
}
.with_precision_and_scale(p, s)?;
.with_precision_and_scale(p, s, false)?;
Member

Hmm, how do we know that we don't need to do validation here? The decimal type can be specified; what if the user specifies a smaller precision?

Contributor Author

From the codebase and the parquet arrow_reader, the arrow decimal type is converted from the decimal type in the parquet schema/metadata.

The parquet physical types only include the following:

PhysicalType::BOOLEAN => Arc::new(BooleanArray::from(array_data)) as ArrayRef,
PhysicalType::INT32 => Arc::new(Int32Array::from(array_data)) as ArrayRef,
PhysicalType::INT64 => Arc::new(Int64Array::from(array_data)) as ArrayRef,
PhysicalType::FLOAT => Arc::new(Float32Array::from(array_data)) as ArrayRef,
PhysicalType::DOUBLE => Arc::new(Float64Array::from(array_data)) as ArrayRef,
PhysicalType::INT96
| PhysicalType::BYTE_ARRAY
| PhysicalType::FIXED_LEN_BYTE_ARRAY

The target_type arrow type comes from the logical type of the parquet file, so when reading from the parquet file the data must be correct/valid.

Contributor

> the data must be correct/valid.

I'm not sure we can make this assumption; we validate UTF-8 strings, for example...

Contributor Author

> I'm not sure we can make this assumption; we validate UTF-8 strings, for example...

Do you mean that the parquet data may not match its metadata or schema?

Contributor

An API is unsound if you can violate invariants of a data structure, therefore potentially introducing UB, without using unsafe APIs to achieve this. As parquet IO is safe, if parquet did not validate the data on read it would be unsound.

This leads to a couple of options, and possibly more besides:

  1. Don't care about decimals exceeding the precision and don't make this an invariant
  2. We make this an invariant and validate on read
  3. We add an unsafe flag to skip validation on read

I want to take a step back before undertaking any of these, as frankly I am deeply confused by what this precision argument is actually for - why arbitrarily truncate your value space?

Contributor Author

> Hmm, how do we know that we don't need to do validation here? The decimal type can be specified; what if the user specifies a smaller precision?

Why can the user specify a data type other than the one from the parquet schema? If you can specify the type, there can be other errors whenever the specified type is not compatible with the data in the parquet file, although that is what the current interface allows.

Contributor Author
@liukun4515 Aug 8, 2022

> I want to take a step back before undertaking any of these, as frankly I am deeply confused by what this precision argument is actually for - why arbitrarily truncate your value space?

To represent a decimal, some systems store the data as a signed integer, with scale/precision as metadata; the exact value is recovered from these two parts.

For example, with decimal(3,1) the value 10.1 is stored/encoded as 101.
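
A toy illustration of that encoding (not the crate's API):

    fn main() {
        let unscaled: i128 = 101; // decimal(3,1) stores 10.1 as the integer 101
        let scale = 1u32;
        let value = unscaled as f64 / 10f64.powi(scale as i32);
        assert_eq!(value, 10.1);
    }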

@tustvold
Maybe the parquet docs can help:
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L67
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal


Arc::new(array) as ArrayRef
}
6 changes: 4 additions & 2 deletions parquet/src/arrow/buffer/converter.rs
@@ -82,23 +82,25 @@ impl Converter<Vec<Option<FixedLenByteArray>>, Decimal128Array>
for DecimalArrayConverter
{
fn convert(&self, source: Vec<Option<FixedLenByteArray>>) -> Result<Decimal128Array> {
// In the parquet file, if the logical/converted type is decimal, no validation is needed.
let array = source
.into_iter()
.map(|array| array.map(|array| from_bytes_to_i128(array.data())))
.collect::<Decimal128Array>()
.with_precision_and_scale(self.precision as usize, self.scale as usize)?;
.with_precision_and_scale(self.precision as usize, self.scale as usize, false)?;

Ok(array)
}
}

impl Converter<Vec<Option<ByteArray>>, Decimal128Array> for DecimalArrayConverter {
fn convert(&self, source: Vec<Option<ByteArray>>) -> Result<Decimal128Array> {
// In the parquet file, if the logical/converted type is decimal, no validation is needed.
let array = source
.into_iter()
.map(|array| array.map(|array| from_bytes_to_i128(array.data())))
.collect::<Decimal128Array>()
.with_precision_and_scale(self.precision as usize, self.scale as usize)?;
.with_precision_and_scale(self.precision as usize, self.scale as usize, false)?;

Ok(array)
}
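
For context, the from_bytes_to_i128 call above decodes parquet's big-endian two's-complement bytes; a hedged sketch of such a helper (illustrative, not the crate's exact implementation):

    // Convert a big-endian two's-complement byte slice (as parquet stores
    // FIXED_LEN_BYTE_ARRAY decimals) into an i128. Illustrative only.
    fn from_bytes_to_i128(bytes: &[u8]) -> i128 {
        assert!(bytes.len() <= 16, "decimal must fit in i128");
        // Sign-extend: fill with 0xFF for negative values, 0x00 otherwise.
        let fill = if bytes.first().map_or(false, |b| b & 0x80 != 0) { 0xFF } else { 0x00 };
        let mut buf = [fill; 16];
        buf[16 - bytes.len()..].copy_from_slice(bytes);
        i128::from_be_bytes(buf)
    }

    fn main() {
        assert_eq!(from_bytes_to_i128(&[0x00, 0x65]), 101);  // 10.1 at scale 1
        assert_eq!(from_bytes_to_i128(&[0xFF, 0x9B]), -101);
    }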