Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11149: [Rust] DF Support List/LargeList/FixedSizeList in create_batch_empty #9114

Closed
wants to merge 1 commit into from

Conversation

ovr
Copy link
Contributor

@ovr ovr commented Jan 6, 2021

Previously build_empty_list_array was declared inside Parquet (array_reader), but I will use this function inside DataFushion's create_batch_empty (it's used inside hash_aggregate to make an empty batch from the provided schema that contains type for columns). I moved it to Arrow (because it's common and useful) and made build_empty_large_list_array (for large lists) on top of macros with different implementation than build_empty_list_array.

$list_builder
)
}
_ => Err(ArrowError::Unsupported(format!(
Copy link
Contributor Author

@ovr ovr Jan 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What ArrowError should I throw on unimplemented? todo!? Or let's declare this one?

Thanks

@github-actions
Copy link

github-actions bot commented Jan 6, 2021

@jorgecarleitao
Copy link
Member

I believe that we can make this with generics, now that we have GenericListBuilder::<Offset>(). It would make the code simpler to read, and a consistent function signature.

@ovr
Copy link
Contributor Author

ovr commented Jan 7, 2021

@jorgecarleitao You are right, but I was thinking about modifying make_empty_list_fn macros to support FixedSizeListBuilder, which is not GenericListBuilder.

Something like:

make_empty_list_fn!(build_empty_fixed_size_list_array, FixedSizeListBuilder);

But I am having a problem with constructing it, FixedSizeListBuilder requires 2 arguments in new method instead of one like GenericListBuilder.

pub fn new(values_builder: T, length: i32) -> Self {

Do you have any solutions how is it possible to do by Marcos? And is it a good idea?

Thanks!

@jorgecarleitao
Copy link
Member

Last time I tried something similar, I had to implement the FixedSizeList differently. A fixed sized will always need to receive an extra parameter.

@nevi-me
Copy link
Contributor

nevi-me commented Jan 7, 2021

Yes, better to implement it differently

@ovr ovr changed the title ARROW-11149: [Rust] DF Support List/LargeList in create_batch_empty ARROW-11149: [Rust] DF Support List/LargeList/FixedSizeList in create_batch_empty Jan 7, 2021
@ovr
Copy link
Contributor Author

ovr commented Jan 7, 2021

@jorgecarleitao and @nevi-me 👍

I've moved build_empty_list_array to function build_empty_list_array<OffsetSize: OffsetSizeTrait>, and introduce build_empty_fixed_size_list_array by different implementation as suggested above.

Thanks

ovr added a commit to ovr/arrow that referenced this pull request Jan 10, 2021
In apache#9114, I've prepared support for List/LargeList/FixedSizeList, but will be great to support more types
ovr added a commit to ovr/arrow that referenced this pull request Jan 10, 2021
In apache#9114, I've prepared support for List/LargeList/FixedSizeList, but will be great to support more types
@ovr
Copy link
Contributor Author

ovr commented Jan 11, 2021

@andygrove Can you take a look? Thanks

@ovr ovr force-pushed the issue-11149 branch 2 times, most recently from 3da157f to f0251d7 Compare January 13, 2021 21:52
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this, and it looks good and really useful.

I would just prefer that we have at least 1 test for this, for verification. I left some other minor comments, all optional.

Great work, @ovr

($type_builder:ident, $offset_type:ident) => {{
let values_builder = $type_builder::new(0);
let mut builder =
GenericListBuilder::<$offset_type, $type_builder>::new(values_builder);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that for an empty list, we know that the offset buffer will be a single entry, 0, and the values buffer will be an empty buffer (len = 0). Therefore, this code could be simplified by just passing the buffers directly instead of using builders.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow API is a Complex, I still dont know how to do it pretty simple as you suggested with Buffer. I think it's not a big performance impact to use builder in this place.

))),
}
}

#[cfg(test)]
mod tests {
use crate::{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test just to verify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added.

@@ -90,6 +91,9 @@ impl From<serde_json::Error> for ArrowError {
impl Display for ArrowError {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
match self {
ArrowError::Unsupported(source) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of Unimplemented instead of Unsupported? Just to be consistent with the unimplemented! macro that rust already offers.

Copy link
Contributor Author

@ovr ovr Jan 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a better name from reading parquet sources, It's called NYI (Not yet implemented), I think it's better to use similar names accross packages.

Renamed.

Thanks

@ovr ovr force-pushed the issue-11149 branch 2 times, most recently from 849d8d3 to c898107 Compare January 16, 2021 14:03
@codecov-io
Copy link

Codecov Report

Merging #9114 (849d8d3) into master (1393188) will decrease coverage by 0.01%.
The diff coverage is 39.79%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9114      +/-   ##
==========================================
- Coverage   81.61%   81.59%   -0.02%     
==========================================
  Files         215      215              
  Lines       51867    51928      +61     
==========================================
+ Hits        42329    42370      +41     
- Misses       9538     9558      +20     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/common.rs 74.28% <0.00%> (-4.51%) ⬇️
rust/parquet/src/arrow/array_reader.rs 73.65% <0.00%> (+2.25%) ⬆️
rust/arrow/src/array/array_list.rs 83.33% <43.67%> (-9.78%) ⬇️
rust/arrow/src/error.rs 9.52% <50.00%> (-0.48%) ⬇️
rust/parquet/src/encodings/encoding.rs 94.86% <0.00%> (-0.20%) ⬇️
rust/arrow/src/array/builder.rs 86.30% <0.00%> (+0.40%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eaa7b7a...c898107. Read the comment docs.

@@ -24,6 +24,9 @@ use std::error::Error;
/// Many different operations in the `arrow` crate return this error type.
#[derive(Debug)]
pub enum ArrowError {
/// "Not yet implemented" Arrow error.
/// Returned when functionality is not yet available.
NYI(String),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the error name should be explicit, like all others. I can't understand what NYI means without having to go to the docs, and in an error message, the person often has no access to the docs (at least not in a 1 click thing).

NotYetImplemented?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that NotYetImplemented is a better name

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ovr -- I think this PR is looking good to go.

Ideally we would could get the rename of NYI to NotYetImplemented but I also think we could merge this PR and rename the enum in a follow on PR

However, since this this PR needs a rebase, sadly, perhaps we can do the rename as part of the rebase.

@@ -24,6 +24,9 @@ use std::error::Error;
/// Many different operations in the `arrow` crate return this error type.
#[derive(Debug)]
pub enum ArrowError {
/// "Not yet implemented" Arrow error.
/// Returned when functionality is not yet available.
NYI(String),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that NotYetImplemented is a better name

@ovr
Copy link
Contributor Author

ovr commented Jan 20, 2021

@alamb Rebased + renamed NYI -> NotYetImplemented in Arrow. Thanks

@alamb
Copy link
Contributor

alamb commented Jan 20, 2021

Awesome -- thanks @ovr !

@alamb
Copy link
Contributor

alamb commented Jan 20, 2021

The travis CI run is backed up -- https://github.com/apache/arrow/pull/9114/checks?check_run_id=1735587739 hasn't finished -- and this PR has no non-rust changes. I think it is good to go

@alamb alamb closed this in 23550c2 Jan 20, 2021
@ovr ovr deleted the issue-11149 branch January 20, 2021 18:26
kszucs pushed a commit that referenced this pull request Jan 25, 2021
…_batch_empty

Previously `build_empty_list_array` was declared inside Parquet (`array_reader`), but I will use this function inside DataFushion's  `create_batch_empty` (it's used inside hash_aggregate to make an empty batch from the provided schema that contains type for columns).  I moved it to Arrow (because it's common and useful) and made `build_empty_large_list_array` (for large lists) on top of macros with different implementation than build_empty_list_array.

Closes #9114 from ovr/issue-11149

Authored-by: Dmitry Patsura <zaets28rus@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…_batch_empty

Previously `build_empty_list_array` was declared inside Parquet (`array_reader`), but I will use this function inside DataFushion's  `create_batch_empty` (it's used inside hash_aggregate to make an empty batch from the provided schema that contains type for columns).  I moved it to Arrow (because it's common and useful) and made `build_empty_large_list_array` (for large lists) on top of macros with different implementation than build_empty_list_array.

Closes apache#9114 from ovr/issue-11149

Authored-by: Dmitry Patsura <zaets28rus@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…_batch_empty

Previously `build_empty_list_array` was declared inside Parquet (`array_reader`), but I will use this function inside DataFushion's  `create_batch_empty` (it's used inside hash_aggregate to make an empty batch from the provided schema that contains type for columns).  I moved it to Arrow (because it's common and useful) and made `build_empty_large_list_array` (for large lists) on top of macros with different implementation than build_empty_list_array.

Closes apache#9114 from ovr/issue-11149

Authored-by: Dmitry Patsura <zaets28rus@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants