ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types #9047
Conversation
rust/parquet/src/arrow/schema.rs
Outdated
..
} => (*precision, *scale),
_ => {
    return Err(ArrowError(
I think you can use assertion here since this is only supposed to be called on primitive types. And it's more concise to do something like this:
match self.schema {
Type::PrimitiveType {
ref precision,
ref scale,
..
} => Ok(DataType::Decimal(*precision as usize, *scale as usize)),
_ => Err(ArrowError(
"Expected a physical type, not a group type".to_string(),
)),
}
@sunchao thank you for your review comments! I changed this part as suggested. Do you have an example of an assertion checking for a type? The only thing I could come up with (still quite new to Rust) is something along the lines of
if let Type::PrimitiveType {ref precision, ref scale, .. } = self.schema {
DataType::Decimal(*precision as usize, *scale as usize)
} else {
panic!("Expected a physical type, not a group type")
}
I'm thinking something like:
assert!(self.schema.is_primitive());
if let ... {
...
}
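The assertion-plus-`if let` pattern can be sketched in isolation. In the sketch below, the `Type` enum and the `decimal_params` method are simplified stand-ins (the real parquet `Type` carries basic type info, a physical type, children, and more, and the real method returns an Arrow `DataType`):

```rust
// Simplified stand-in for parquet's schema Type; not the real definition.
#[allow(dead_code)]
enum Type {
    PrimitiveType { precision: i32, scale: i32 },
    GroupType { fields: Vec<Type> },
}

impl Type {
    fn is_primitive(&self) -> bool {
        matches!(self, Type::PrimitiveType { .. })
    }

    // Assertion-first variant: callers are expected to have checked the
    // schema kind already, so a group type here is a programming error.
    fn decimal_params(&self) -> (usize, usize) {
        assert!(
            self.is_primitive(),
            "Expected a physical type, not a group type"
        );
        if let Type::PrimitiveType { precision, scale } = self {
            (*precision as usize, *scale as usize)
        } else {
            unreachable!()
        }
    }
}

fn main() {
    let t = Type::PrimitiveType { precision: 10, scale: 2 };
    assert_eq!(t.decimal_params(), (10, 2));
}
```

The assertion gives a clearer panic message than the `unreachable!()` arm, which only exists to satisfy the expression's type.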
Thank you very much! I changed it to use an assertion. Let me know if you prefer this variant - there are other methods in schema.rs that follow a similar pattern and I could also change them accordingly while I am already here.
Thanks. Looking good! The else branch in the if let block seems unnecessary?
@@ -532,6 +533,7 @@ pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
    (Int64, Int32) => cast_numeric_arrays::<Int64Type, Int32Type>(array),
    (Int64, Float32) => cast_numeric_arrays::<Int64Type, Float32Type>(array),
    (Int64, Float64) => cast_numeric_arrays::<Int64Type, Float64Type>(array),
    (Int64, Decimal(p, s)) => int64_to_decimal_cast(array, *p, *s),
hmm... is this related to the parquet-side change?
I think you stumbled upon the issue I tried to explain in the PR description - I am now sure that it was a bad idea to use casts in this way :)
I removed the cast operations and replaced them with some logic in PrimitiveArrayReader::next_batch.
yeah this looks a little weird - I'm thinking that instead of handling this in the compute kernel, the parquet/arrow reader should just understand the conversion from parquet's decimal type to that of arrow's and generate decimal arrow arrays directly.
rust/parquet/src/arrow/schema.rs
Outdated
@@ -591,6 +591,7 @@ impl ParquetTypeConverter<'_> {
    LogicalType::INT_32 => Ok(DataType::Int32),
    LogicalType::DATE => Ok(DataType::Date32(DateUnit::Day)),
    LogicalType::TIME_MILLIS => Ok(DataType::Time32(TimeUnit::Millisecond)),
    LogicalType::DECIMAL => self.to_decimal(),
should this be handled somehow in the parquet/arrow reader?
I am having a hard time finding the right place to construct the decimal type. My main concern with the changes I made is that Decimal is the only type that needs additional info from the schema in order to define precision and scale - all other types can be built directly from the logical type. However, if I put it in PrimitiveArrayReader::new I am only able to accomplish this by reimplementing parts of the ParquetTypeConverter specifically for Decimal. Do you have a suggestion how I can handle this in a better way?
Hmm, should this be handled in build_for_primitive_type_inner inside array_reader.rs, where we construct a ComplexObjectArrayReader with the right converter for decimal types? I see we currently handle the case for FIXED_LEN_BYTE_ARRAY but not for INT32/INT64.
I added an implementation using converters (I have not pushed it yet), but the creation of ParquetTypeConverter occurs before we even get to build_for_primitive_type_inner:
ParquetFileArrowReader::get_record_reader ->
ParquetFileArrowReader::get_record_reader_by_columns ->
ParquetFileArrowReader::get_schema ->
schema::parquet_to_arrow_schema ->
schema::parquet_to_arrow_schema_by_columns ->
ParquetTypeConverter::new
Thus, if we want to avoid the decimal case, a decimal-specific implementation in get_schema would probably be required.
The converters could avoid the decimal specific branch in PrimitiveArrayReader::next_batch:
ArrowType::Decimal(p, s) => {
let to_int64 = arrow::compute::cast(&array, &ArrowType::Int64)?;
let mut builder = DecimalBuilder::new(to_int64.len(), *p, *s);
let values = to_int64.as_any().downcast_ref::<Int64Array>().unwrap();
for maybe_value in values.iter() {
match maybe_value {
Some(value) => builder.append_value(value as i128)?,
None => builder.append_null()?
}
}
Arc::new(builder.finish()) as ArrayRef
}
We could avoid the cast to i64 by adding converters for i32/i64, but I think this is not such a big deal - what do you think?
Got it. This looks good, but why do we need to do another cast on the arrow array? It could hurt performance. Is it possible to directly build the decimal array from the input parquet data?
@sweb what does downcasting the array straight to an i64 result in? Does the below panic?
let mut builder = DecimalBuilder::new(array.len(), *p, *s);
let values = array.as_any().downcast_ref::<Int64Array>().unwrap();
The cast is necessary, because downcasting an i32 array to Int64Array results in None, so the unwrap panics. I have removed the cast and handle the building of the decimal array now separately, but I do not really like my solution:
let mut builder = DecimalBuilder::new(array.len(), *p, *s);
match array.data_type() {
ArrowType::Int32 => {
let values = array.as_any().downcast_ref::<Int32Array>().unwrap();
for maybe_value in values.iter() {
match maybe_value {
Some(value) => builder.append_value(value as i128)?,
None => builder.append_null()?,
}
}
}
ArrowType::Int64 => {
let values = array.as_any().downcast_ref::<Int64Array>().unwrap();
for maybe_value in values.iter() {
match maybe_value {
Some(value) => builder.append_value(value as i128)?,
None => builder.append_null()?,
}
}
}
_ => {
return Err(ArrowError(format!(
"Cannot convert {:?} to decimal",
array.data_type()
)))
}
}
At the moment, I am not able to convince the compiler to treat downcasts to i32 and i64 as the same primitive array and then use as on the resulting iterator to convert from types that are unknown at compile time to i128. I will probably need to spend some time figuring this out.
Alternatively, we could use converters, similar to the fixed length case.
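One way around the compile-time issue is a small generic helper: any integer width that implements `Into<i128>` can share a single loop. The sketch below uses plain `Vec<Option<T>>` in place of arrow arrays to stay self-contained (the real code would go through `PrimitiveArray<T>` and `DecimalBuilder`), so `to_unscaled_i128` is a hypothetical name, not part of the PR:

```rust
// Generic over any integer width that losslessly converts to i128;
// both i32 and i64 implement Into<i128>, so one loop covers both cases.
fn to_unscaled_i128<T: Copy + Into<i128>>(values: &[Option<T>]) -> Vec<Option<i128>> {
    values
        .iter()
        .map(|maybe_value| maybe_value.map(|v| v.into()))
        .collect()
}

fn main() {
    let ints32: Vec<Option<i32>> = vec![Some(123), None, Some(-45)];
    let ints64: Vec<Option<i64>> = vec![Some(1_000_000_000_000), None];
    // The same function handles both widths without a cast between arrays.
    println!("{:?}", to_unscaled_i128(&ints32));
    println!("{:?}", to_unscaled_i128(&ints64));
}
```

The two `match array.data_type()` arms above could then collapse into two thin downcasts feeding the same generic conversion.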
Thanks for the update. This PR looks good and we can address the remaining in a followup.
@sunchao thank you for your help with this PR. The result is much better than my initial proposal!
np @sweb - thanks for the contribution!
The full set of Rust CI tests did not run on this PR :( Can you please rebase this PR against apache/master to pick up the changes in #9056 so that they do? I apologize for the inconvenience.
Codecov Report
@@ Coverage Diff @@
## master #9047 +/- ##
==========================================
+ Coverage 82.55% 82.61% +0.05%
==========================================
Files 203 204 +1
Lines 50043 50213 +170
==========================================
+ Hits 41313 41481 +168
- Misses 8730 8732 +2
Continue to review full report at Codecov.
rust/parquet/src/arrow/schema.rs
Outdated
} = self.schema
{
    DataType::Decimal(*precision as usize, *scale as usize)
} else {
can we remove this branch?
Sorry, I think I am not familiar enough with Rust yet. I was under the impression that I have to provide an implementation for all possible branches and I get a compiler error when I remove the else branch. Is it possible to unwrap the if part somehow?
error[E0317]: `if` may be missing an `else` clause
--> parquet/src/arrow/schema.rs:650:9
|
650 | / if let Type::PrimitiveType {
651 | | precision, scale, ..
652 | | } = self.schema
653 | | {
654 | | DataType::Decimal(*precision as usize, *scale as usize)
| | ------------------------------------------------------- found here
655 | | }
| |_________^ expected `()`, found enum `arrow::datatypes::DataType`
|
= note: `if` expressions without `else` evaluate to `()`
= help: consider adding an `else` block that evaluates to the expected type
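A minimal standalone illustration of that error (the `Num` enum is hypothetical, not from the PR): when an `if let` is used as an expression, every path must produce a value of the same type, so the `else` cannot be dropped without the function falling back to `()`.

```rust
#[allow(dead_code)]
enum Num {
    Int(i32),
    Other,
}

fn as_int(n: &Num) -> i32 {
    // This compiles only because of the `else`; removing it triggers
    // E0317, since the missing branch would evaluate to `()`.
    if let Num::Int(v) = n {
        *v
    } else {
        panic!("expected Num::Int")
    }
}

fn main() {
    assert_eq!(as_int(&Num::Int(7)), 7);
}
```

This is why the compiler suggests adding an `else` block: the `if let` here is an expression, not a statement.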
oops my bad - this looks good then.
I think it may make sense to add get_scale and get_precision to Type as well, similar to other getters there, but it's just a good-to-have :)
Great point! I added both methods, which cleans up to_decimal quite a bit.
rust/parquet/src/schema/types.rs
Outdated
pub fn get_scale(&self) -> i32 {
    match *self {
        Type::PrimitiveType { scale, .. } => scale,
        _ => panic!("Cannot call get_scale() on non-primitive type"),
this is much nicer (and it is now clearer what is expected) 👍
I added one more commit, using the new getters in array_reader to simplify the decimal converter construction.
I'm happy with the PR, thanks @sunchao for reviewing. I only have one question around the cast; we can merge after addressing it.
…nt types

This PR adds capabilities to read decimal columns in parquet files that store them as i32 or i64. I tried to follow the approach in apache#8926 by using casts.

Closes apache#9047 from sweb/ARROW-11072/support-int-types

Authored-by: Florian Müller <florian@tomueller.de>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
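For context on what these readers produce: parquet stores a decimal as an unscaled integer (here in an INT32 or INT64 column) plus a fixed precision and scale taken from the schema, so the logical value is unscaled × 10^(-scale). A minimal sketch of that interpretation, independent of the arrow/parquet APIs (`format_decimal` is a hypothetical helper for illustration):

```rust
// Render an unscaled decimal value using the scale from the schema,
// e.g. unscaled 12345 at scale 2 represents 123.45.
// Sketch only: ignores the i128::MIN edge case and precision validation.
fn format_decimal(unscaled: i128, scale: u32) -> String {
    let sign = if unscaled < 0 { "-" } else { "" };
    let mut digits = unscaled.abs().to_string();
    if scale == 0 {
        return format!("{}{}", sign, digits);
    }
    let scale = scale as usize;
    if digits.len() <= scale {
        // Pad with leading zeros so a point can be inserted: 5 at scale 3 -> 0.005
        digits = format!("{}{}", "0".repeat(scale - digits.len() + 1), digits);
    }
    let point = digits.len() - scale;
    format!("{}{}.{}", sign, &digits[..point], &digits[point..])
}

fn main() {
    assert_eq!(format_decimal(12345, 2), "123.45");
    assert_eq!(format_decimal(-5, 3), "-0.005");
    assert_eq!(format_decimal(100, 0), "100");
}
```

This is why widening the stored i32/i64 to i128 in the reader is lossless: the unscaled integer is the entire stored payload, and precision/scale only affect interpretation.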