
ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types #9047

Closed
wants to merge 10 commits

Conversation

Contributor

@sweb sweb commented Dec 30, 2020

This PR adds capabilities to read decimal columns in parquet files that store them as i32 or i64.

I tried to follow the approach in #8926 by using casts.
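For context, parquet's DECIMAL logical type on the INT32/INT64 physical types stores the unscaled value: an integer v with scale s represents v × 10⁻ˢ. A minimal std-only sketch of that interpretation (the helper format_decimal is illustrative only, not part of arrow or parquet):

```rust
// Render an unscaled decimal value (as parquet stores it in an int
// column) as a human-readable string: 12345 with scale 2 is 123.45.
fn format_decimal(unscaled: i64, scale: u32) -> String {
    let sign = if unscaled < 0 { "-" } else { "" };
    let abs = unscaled.unsigned_abs();
    let factor = 10u64.pow(scale);
    if scale == 0 {
        format!("{sign}{abs}")
    } else {
        // Pad the fractional part with leading zeros up to `scale` digits.
        format!("{sign}{}.{:0width$}", abs / factor, abs % factor, width = scale as usize)
    }
}

fn main() {
    assert_eq!(format_decimal(12345, 2), "123.45");
    assert_eq!(format_decimal(-5, 3), "-0.005");
    println!("{}", format_decimal(12345, 2));
}
```

The PR's job is the reverse direction: taking those raw i32/i64 unscaled values out of the parquet column and producing an arrow Decimal(precision, scale) array from them.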

Contributor

mqy commented Dec 30, 2020

ARROW-${JIRA_ID} :)

@sweb changed the title from "Arrow 11072: [Rust] [Parquet] Support reading decimal from physical int types" to "ARROW-11072: [Rust] [Parquet] Support reading decimal from physical int types" Dec 30, 2020
@apache apache deleted a comment from github-actions bot Dec 30, 2020
    ..
} => (*precision, *scale),
_ => {
    return Err(ArrowError(
Member

I think you can use an assertion here, since this is only supposed to be called on primitive types. And it's more concise to do something like this:

        match self.schema {
            Type::PrimitiveType {
                ref precision,
                ref scale,
                ..
            } => Ok(DataType::Decimal(*precision as usize, *scale as usize)),
            _ => Err(ArrowError(
                "Expected a physical type, not a group type".to_string(),
            )),
        }

Contributor Author

@sunchao thank you for your review comments! I changed this part as suggested. Do you have an example of an assertion checking for a type? The only thing I could come up with (still quite new to Rust) is something along the lines of

if let Type::PrimitiveType {ref precision, ref scale, .. } = self.schema {
  DataType::Decimal(*precision as usize, *scale as usize)
} else {
  panic!("Expected a physical type, not a group type")
}

Member

I'm thinking something like:

assert!(self.schema.is_primitive());
if let ... {
  ...
}

Contributor Author

Thank you very much! I changed it to use an assertion. Let me know if you prefer this variant - there are other methods in schema.rs that follow a similar pattern and I could also change them accordingly while I am already here.

Member

Thanks, looking good! The else branch in the if let block seems unnecessary?

@@ -532,6 +533,7 @@ pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
(Int64, Int32) => cast_numeric_arrays::<Int64Type, Int32Type>(array),
(Int64, Float32) => cast_numeric_arrays::<Int64Type, Float32Type>(array),
(Int64, Float64) => cast_numeric_arrays::<Int64Type, Float64Type>(array),
(Int64, Decimal(p, s)) => int64_to_decimal_cast(array, *p, *s),
Member

hmm... is this related to the parquet-side change?

Contributor Author

I think you stumbled upon the issue I tried to explain in the PR description - I am now sure that it was a bad idea to use casts in this way :)

I removed the cast operations and replaced them with some logic in PrimitiveArrayReader::next_batch.

Member
@sunchao Jan 1, 2021

yeah this looks a little weird - I'm thinking that instead of handling this in the compute kernel, the parquet/arrow reader should just understand the conversion from parquet's decimal type to that of arrow's and generate decimal arrow arrays directly.

@@ -591,6 +591,7 @@ impl ParquetTypeConverter<'_> {
LogicalType::INT_32 => Ok(DataType::Int32),
LogicalType::DATE => Ok(DataType::Date32(DateUnit::Day)),
LogicalType::TIME_MILLIS => Ok(DataType::Time32(TimeUnit::Millisecond)),
LogicalType::DECIMAL => self.to_decimal(),
Member

should this be handled somehow in the parquet/arrow reader?

Contributor Author

I am having a hard time finding the right place to construct the decimal type. My main concern with the changes I made is that Decimal is the only type that needs additional info from the schema in order to define precision and scale - all other types can be built directly from the logical type. However, if I put it in PrimitiveArrayReader::new I am only able to accomplish this by reimplementing parts of the ParquetTypeConverter specifically for Decimal. Do you have a suggestion how I can handle this in a better way?

Member

Hmm, should this be handled in build_for_primitive_type_inner inside array_reader.rs, where we construct a ComplexObjectArrayReader with the right converter for decimal types? I see we currently handle the case for FIXED_LEN_BYTE_ARRAY but not for INT32/INT64.

Contributor Author
@sweb Jan 2, 2021

I added an implementation using converters (I have not pushed it yet), but the creation of ParquetTypeConverter occurs before we even get to build_for_primitive_type_inner:

ParquetFileArrowReader::get_record_reader ->
ParquetFileArrowReader::get_record_reader_by_columns -> 
ParquetFileArrowReader::get_schema ->
schema::parquet_to_arrow_schema ->
schema::parquet_to_arrow_schema_by_columns ->
ParquetTypeConverter::new

Thus, if we want to avoid the decimal case, a decimal specific implementation in get_schema would probably be required.

The converters could avoid the decimal specific branch in PrimitiveArrayReader::next_batch:

ArrowType::Decimal(p, s) => {
    let to_int64 = arrow::compute::cast(&array, &ArrowType::Int64)?;
    let mut builder = DecimalBuilder::new(to_int64.len(), *p, *s);
    let values = to_int64.as_any().downcast_ref::<Int64Array>().unwrap();
    for maybe_value in values.iter() {
        match maybe_value {
            Some(value) => builder.append_value(value as i128)?,
            None => builder.append_null()?,
        }
    }
    Arc::new(builder.finish()) as ArrayRef
}

We could avoid the cast to i64 by adding converters for i32/i64, but I think this is not such a big deal - what do you think?
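As a rough, std-only sketch of the converter alternative mentioned above (the names DecimalConverter, Int32Decimal, Int64Decimal, and to_unscaled are hypothetical and are not the crate's actual converter API), one trait with an impl per physical type lets both integer widths feed the same decimal-building loop without an intermediate Int64 cast:

```rust
// Hypothetical converter trait: one impl per parquet physical type,
// each widening its value to the i128 that DecimalBuilder would take.
trait DecimalConverter {
    type Physical;
    fn to_unscaled(v: Self::Physical) -> i128;
}

struct Int32Decimal;
struct Int64Decimal;

impl DecimalConverter for Int32Decimal {
    type Physical = i32;
    fn to_unscaled(v: i32) -> i128 {
        v as i128
    }
}

impl DecimalConverter for Int64Decimal {
    type Physical = i64;
    fn to_unscaled(v: i64) -> i128 {
        v as i128
    }
}

// A single generic loop replaces the per-type match arms.
fn convert<C: DecimalConverter>(values: &[C::Physical]) -> Vec<i128>
where
    C::Physical: Copy,
{
    values.iter().map(|&v| C::to_unscaled(v)).collect()
}

fn main() {
    assert_eq!(convert::<Int32Decimal>(&[1, -2]), vec![1i128, -2]);
    assert_eq!(convert::<Int64Decimal>(&[3]), vec![3i128]);
    println!("converter sketch ok");
}
```

The real FIXED_LEN_BYTE_ARRAY path in the crate follows a similar converter shape, which is why the thread keeps returning to this option.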

Member

Got it. This looks good, but why do we need to do another cast on the arrow array? It could hurt performance. Is it possible to directly build the decimal array from the input parquet data?

Contributor

@sweb what does downcasting the array straight to an i64 result in? Does the below panic?

let mut builder = DecimalBuilder::new(array.len(), *p, *s);
let values = array.as_any().downcast_ref::<Int64Array>().unwrap();

Contributor Author

The cast is necessary, because downcasting an i32 array to an Int64Array results in None, and the unwrap then panics. I have removed the cast and now handle the building of the decimal array separately, but I do not really like my solution:

let mut builder = DecimalBuilder::new(array.len(), *p, *s);
match array.data_type() {
    ArrowType::Int32 => {
        let values = array.as_any().downcast_ref::<Int32Array>().unwrap();
        for maybe_value in values.iter() {
            match maybe_value {
                Some(value) => builder.append_value(value as i128)?,
                None => builder.append_null()?,
            }
        }
    }
    ArrowType::Int64 => {
        let values = array.as_any().downcast_ref::<Int64Array>().unwrap();
        for maybe_value in values.iter() {
            match maybe_value {
                Some(value) => builder.append_value(value as i128)?,
                None => builder.append_null()?,
            }
        }
    }
    _ => {
        return Err(ArrowError(format!(
            "Cannot convert {:?} to decimal",
            array.data_type()
        )))
    }
}

At the moment, I am not able to convince the compiler to treat downcasts to i32 and i64 as the same primitive array and then use as on the resulting iterator to convert from types that are unknown at compile time to i128. I will probably need to spend some time figuring this out.
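For what it's worth, plain Rust can express "any integer that widens losslessly to i128" with an Into<i128> bound; a std-only sketch (the helper widen_to_i128 is hypothetical, and arrow's typed arrays would still require the per-type downcast before reaching a slice like this):

```rust
// One generic function covers both i32 and i64 inputs; `None` models
// null slots the way the array iterators above yield Option values.
fn widen_to_i128<T>(values: &[Option<T>]) -> Vec<Option<i128>>
where
    T: Into<i128> + Copy,
{
    values
        .iter()
        .copied()
        .map(|opt| opt.map(|v| v.into()))
        .collect()
}

fn main() {
    let ints32: Vec<Option<i32>> = vec![Some(7), None, Some(-2)];
    let ints64: Vec<Option<i64>> = vec![Some(1 << 40), None];
    assert_eq!(widen_to_i128(&ints32), vec![Some(7), None, Some(-2)]);
    assert_eq!(widen_to_i128(&ints64), vec![Some(1i128 << 40), None]);
    println!("widening sketch ok");
}
```

This removes the duplicated loop bodies, at the cost of still needing one downcast per concrete array type to obtain the typed values in the first place.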

Alternatively, we could use converters, similar to the fixed length case.

Member

Thanks for the update. This PR looks good and we can address the remaining items in a follow-up.

Contributor Author

@sunchao thank you for your help with this PR. The result is much better than my initial proposal!

Member

np @sweb - thanks for the contribution!

Contributor

alamb commented Dec 31, 2020

The full set of Rust CI tests did not run on this PR :(

Can you please rebase this PR against apache/master to pick up the changes in #9056 so that they do?

I apologize for the inconvenience.


codecov-io commented Jan 1, 2021

Codecov Report

Merging #9047 (ba1c6aa) into master (cc0ee5e) will increase coverage by 0.05%.
The diff coverage is 73.21%.


@@            Coverage Diff             @@
##           master    #9047      +/-   ##
==========================================
+ Coverage   82.55%   82.61%   +0.05%     
==========================================
  Files         203      204       +1     
  Lines       50043    50213     +170     
==========================================
+ Hits        41313    41481     +168     
- Misses       8730     8732       +2     
Impacted Files Coverage Δ
rust/parquet/src/arrow/schema.rs 91.52% <50.00%> (+0.89%) ⬆️
rust/parquet/src/schema/types.rs 89.52% <50.00%> (-0.42%) ⬇️
rust/parquet/src/arrow/array_reader.rs 75.44% <69.56%> (+0.09%) ⬆️
rust/arrow/src/array/array_binary.rs 90.63% <100.00%> (+0.02%) ⬆️
rust/parquet/src/arrow/arrow_reader.rs 91.57% <100.00%> (+0.18%) ⬆️
...datafusion/src/optimizer/hash_build_probe_order.rs 54.73% <0.00%> (-3.70%) ⬇️
rust/datafusion/src/physical_plan/hash_join.rs 86.32% <0.00%> (-3.62%) ⬇️
rust/datafusion/src/scalar.rs 56.17% <0.00%> (-2.90%) ⬇️
rust/datafusion/src/physical_plan/group_scalar.rs 67.85% <0.00%> (-2.52%) ⬇️
...tafusion/src/physical_plan/datetime_expressions.rs 92.60% <0.00%> (-1.66%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc0ee5e...ba1c6aa.

} = self.schema
{
DataType::Decimal(*precision as usize, *scale as usize)
} else {
Member

can we remove this branch?

Contributor Author

Sorry, I think I am not familiar enough with Rust yet. I was under the impression that I have to provide an implementation for all possible branches and I get a compiler error when I remove the else branch. Is it possible to unwrap the if part somehow?

error[E0317]: `if` may be missing an `else` clause
   --> parquet/src/arrow/schema.rs:650:9
    |
650 | /         if let Type::PrimitiveType {
651 | |             precision, scale, ..
652 | |         } = self.schema
653 | |         {
654 | |             DataType::Decimal(*precision as usize, *scale as usize)
    | |             ------------------------------------------------------- found here
655 | |         } 
    | |_________^ expected `()`, found enum `arrow::datatypes::DataType`
    |
    = note: `if` expressions without `else` evaluate to `()`
    = help: consider adding an `else` block that evaluates to the expected type
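The compiler's point is inherent to using if let as an expression: without an else, the non-matching case evaluates to (). A toy reproduction with a hypothetical Shape enum (illustrative only, nothing from the PR) shows the shape of the code that does compile, mirroring the panic-in-else pattern above:

```rust
// Toy stand-in for parquet's `Type`; variant names are illustrative.
enum Shape {
    Circle { radius: f64 },
    Group,
}

// Used as an expression, `if let` must produce a value on every path,
// so the `else` (here a panic, as in the PR) cannot simply be dropped;
// removing it triggers E0317 exactly as in the error output above.
fn area(shape: &Shape) -> f64 {
    if let Shape::Circle { radius } = shape {
        std::f64::consts::PI * radius * radius
    } else {
        panic!("expected a primitive-like variant, not a group")
    }
}

fn main() {
    let c = Shape::Circle { radius: 2.0 };
    assert!((area(&c) - 4.0 * std::f64::consts::PI).abs() < 1e-9);
    println!("if-let-as-expression sketch ok");
}
```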

Member

oops my bad - this looks good then.

Member

I think it may make sense to add get_scale and get_precision to Type as well, similar to the other getters there, but it's just a nice-to-have :)

Contributor Author

Great point! I added both methods which cleans up to_decimal quite a bit.

Contributor
@alamb left a comment

Thanks @sweb and @sunchao ! This PR looks good to me. Once the PR is green I think it is good to merge

pub fn get_scale(&self) -> i32 {
    match *self {
        Type::PrimitiveType { scale, .. } => scale,
        _ => panic!("Cannot call get_scale() on non-primitive type"),
    }
}
Contributor

this is much nicer (and it is now clearer what is expected) 👍

Contributor Author

I added one more commit, using the new getters in array_reader to simplify the decimal converter construction.

Contributor
@nevi-me left a comment

I'm happy with the PR, thanks @sunchao for reviewing.

I only have one question around the cast, we can merge after addressing it.

@alamb alamb closed this in 5db1d2a Jan 4, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…nt types

This PR adds capabilities to read decimal columns in parquet files that store them as i32 or i64.

I tried to follow the approach in apache#8926 by using casts.

Closes apache#9047 from sweb/ARROW-11072/support-int-types

Authored-by: Florian Müller <florian@tomueller.de>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>