Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally disable validate_decimal_precision check in DecimalBuilder.append_value for interop test #1767

Merged
merged 10 commits into from
Jun 3, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented May 31, 2022

Which issue does this PR close?

Closes #1766.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 31, 2022
@codecov-commenter
Copy link

codecov-commenter commented May 31, 2022

Codecov Report

Merging #1767 (5665071) into master (5cf06bf) will increase coverage by 0.17%.
The diff coverage is 97.61%.

@@            Coverage Diff             @@
##           master    #1767      +/-   ##
==========================================
+ Coverage   83.32%   83.49%   +0.17%     
==========================================
  Files         196      196              
  Lines       55961    55954       -7     
==========================================
+ Hits        46627    46719      +92     
+ Misses       9334     9235      -99     
Impacted Files Coverage Δ
arrow/src/ffi.rs 86.97% <ø> (ø)
arrow/src/ipc/writer.rs 80.37% <ø> (ø)
arrow/src/array/data.rs 83.95% <96.42%> (+0.29%) ⬆️
arrow/src/array/array_binary.rs 93.22% <100.00%> (-0.09%) ⬇️
arrow/src/array/builder.rs 86.79% <100.00%> (-0.01%) ⬇️
arrow/src/array/transform/mod.rs 87.41% <100.00%> (+0.44%) ⬆️
arrow/src/csv/reader.rs 89.89% <100.00%> (ø)
arrow/src/ipc/reader.rs 90.84% <100.00%> (+2.11%) ⬆️
parquet/src/util/cursor.rs 63.86% <0.00%> (-13.45%) ⬇️
parquet/src/util/io.rs 88.33% <0.00%> (-1.67%) ⬇️
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cf06bf...5665071. Read the comment docs.

@tustvold
Copy link
Contributor

I'm not sure how I feel about this, the promise of the builders is they won't let you construct an ArrayData that would fail validation, and can therefore elide this validation. I think this change violates that?

Tbh this feels like a bug in the C++ implementation... Perhaps we should raise a ticket to at the very least get context?

@viirya
Copy link
Member Author

viirya commented May 31, 2022

In C++, the similar check is done in validating ArrayData, not during appending values into the builder. I'm not sure which one is more correct under Arrow's context.

I'm trying to remove the check and (not committed yet) put the precision check into the validation of ArrayData (this is what C++ Arrow does).

In one way, as we don't do full validation when finishing the builder, it seems this check is necessary if it is the promise of the builder to always build a valid ArrayData. Then the check is needed and it seems C++ needs to add the check there. I will open a ticket there and try to get some feedbacks.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting Rust to follow the C++ implementation seems like a good idea to me. The ticket does a good job of explaining this.

cc @liukun4515 who originally wrote these checks, I think

@@ -1497,38 +1497,6 @@ mod tests {
assert_eq!(16, decimal_array.value_length());
}

#[test]
fn test_decimal_append_error_value() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we change this test to show that it is ok to store these values in a decimal rather than removing it completely (aka change the test to validate there is no error).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the test. As I move the precision check to ArrayData full validation, I add another test for that too.

@alamb
Copy link
Contributor

alamb commented May 31, 2022

In one way, as we don't do full validation when finishing the builder, it seems this check is necessary if it is the promise of the builder to always build a valid ArrayData. Then the check is needed and it seems C++ needs to add the check there. I will open a ticket there and try to get some feedbacks.

I think this is the right thing to do -- thank you @viirya

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after @alamb's suggestion

@viirya
Copy link
Member Author

viirya commented May 31, 2022

Thanks @alamb @nevi-me @tustvold. I've opened a ticket https://issues.apache.org/jira/browse/ARROW-16696 to hear some feedback there.

@@ -999,6 +999,27 @@ impl ArrayData {

pub fn validate_dictionary_offset(&self) -> Result<()> {
match &self.data_type {
DataType::Decimal(p, _) => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ ArrayData full validation performs the precision check for decimal type. I think this is necessary to add even we don't remove validate_decimal_precision from DecimalBuilder.append_value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this code is necessary in validate 👍

@@ -1486,7 +1486,7 @@ mod tests {
192, 219, 180, 17, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 36, 75, 238, 253,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
];
let array_data = ArrayData::builder(DataType::Decimal(23, 6))
let array_data = ArrayData::builder(DataType::Decimal(38, 6))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught this invalid decimal value by decimal check in full validation. Increasing precision to pass it.

@@ -1130,6 +1131,7 @@ mod tests {
}

#[test]
#[cfg(not(feature = "force_validate"))]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests use the same 0.14.1 decimal file too.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @viirya

After thinking about it for a while, I think:

  1. Adding validation checks to ArrayData::validate for decimal is ❤️
  2. I am a little worried about removing the validation from DecimalBuilder -- I think one of the philosophies we are shooting for is if you use safe APIs you will create valid Arrow arrays.

What would you think about making validation optional for the builder?

Something like:

struct DecimalBuilder { 
 ...
  /// Should i128 values be validated for compatibility with scale and precision?
  /// defaults to true
  value_validation: true
}
impl DecomalBuilder {
  ...
  /// Disable validation
  pub unsafe disable_value_validation(mut self) -> Self {
    self.validate_values = false;
  }

  #[inline]
  pub fn append_value(&mut self, value: i128) -> Result<()> {
    let value = if sef.validate_values {
      validate_decimal_precision(value, self.precision)?
    } else {
      value
    }
   }
  ...
}

I think this would avoid having to add #[cfg(not(feature = "force_validate"))] and it would allow the interop test to bypass value validation only when needed?

@@ -999,6 +999,27 @@ impl ArrayData {

pub fn validate_dictionary_offset(&self) -> Result<()> {
match &self.data_type {
DataType::Decimal(p, _) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this code is necessary in validate 👍

};
let as_array = raw_val.try_into();
match as_array {
Ok(v) if raw_val.len() == 16 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason this code seems overly complicated -- I wonder if you could call i128::from_le_bytes directly on a slice like

for pos in 0..values_buffer.len() {
  let v = value_buffer.data[pos..pos+16];
  let value = i128::from_le_bytes(v)
}

or something 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data is private. 😞

I simplified it as:

for pos in 0..values_buffer.len() {
   let raw_val = unsafe {
    std::slice::from_raw_parts(
      values_buffer.as_ptr().add(pos),
      16_usize,
    )
  };
  let value = i128::from_le_bytes(raw_val.try_into().unwrap());
  validate_decimal_precision(value, *p)?;
}

@viirya
Copy link
Member Author

viirya commented Jun 2, 2022

I am a little worried about removing the validation from DecimalBuilder -- I think one of the philosophies we are shooting for is if you use safe APIs you will create valid Arrow arrays.

Hmm, seems at C++ implementation, the builder is unsafe API, in Rust parlance. This is based on the reply I got in the JIRA ticket. This seems the difference between Rust and C++ implementations.

@viirya
Copy link
Member Author

viirya commented Jun 2, 2022

What would you think about making validation optional for the builder?
I think this would avoid having to add #[cfg(not(feature = "force_validate"))] and it would allow the interop test to bypass value validation only when needed?

Generally it sounds good to me. But in the test, we are not using the builder directly. The builder is called by more higher API when reading JSON/Arrow files into Arrays. So seems we need more than adding an API like disable_value_validation to DecomalBuilder.

Let me try to play it around and see if an optional validation of the builder is possible to solve the interop test issue.

@alamb
Copy link
Contributor

alamb commented Jun 2, 2022

I wonder if this could be related to #1779 just filed by @tustvold 🤔

@viirya
Copy link
Member Author

viirya commented Jun 2, 2022

Hmm, I think it is not directly related. C++ also does this check but it does in full validation, not in the builder. Actually C++ Arrow avoids the full validation too in the interop to avoid similar issue.

@viirya
Copy link
Member Author

viirya commented Jun 3, 2022

I think this would avoid having to add #[cfg(not(feature = "force_validate"))]

Oh, BTW, adding #[cfg(not(feature = "force_validate"))] to several tests is for the added check in ArrayData::validate, not due to removal of the check from DecimalBuilder.

These tests build ArrayData with incompatible decimal values. As this adds the check when full validation, these tests will fail during that.

@viirya
Copy link
Member Author

viirya commented Jun 3, 2022

Generally it sounds good to me. But in the test, we are not using the builder directly. The builder is called by more higher API when reading JSON/Arrow files into Arrays. So seems we need more than adding an API like disable_value_validation to DecomalBuilder.

I'm wrong. Fortunately, for interop test, we directly invoke DecimalBuilder to read JSON file to Array. So adding disable_value_validation call there can fix the interop issue. Verified it locally.

@viirya viirya changed the title Remove validate_decimal_precision check in DecimalBuilder.append_value Optionally disable validate_decimal_precision check in DecimalBuilder.append_value for interop test Jun 3, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya -- this looks wonderful. Thank you for sticking with it


#[test]
#[cfg(not(feature = "force_validate"))]
fn test_decimal_full_validation() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 29c49d8 into apache:master Jun 3, 2022
@viirya
Copy link
Member Author

viirya commented Jun 3, 2022

Thanks @alamb @tustvold @nevi-me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove validate_decimal_precision check in DecimalBuilder.append_value
5 participants