Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"out of order projection is not supported" after Fix Parquet Arrow Schema Inference #1701

Closed
alamb opened this issue May 13, 2022 · 9 comments · Fixed by #1716
Closed

"out of order projection is not supported" after Fix Parquet Arrow Schema Inference #1701

alamb opened this issue May 13, 2022 · 9 comments · Fixed by #1716
Labels
bug parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented May 13, 2022

Describe the bug
After #1682 from @tustvold some tests in datafusion begin to fail with

"out of order projection is not supported" after Fix Parquet Arrow Schema Inference

Note this code is not included in 14.0.0

To Reproduce
See the reproduction instructions on apache/datafusion#2530

Expected behavior
tests should pass

Additional context
Add any other context about the problem here.

@alamb alamb added parquet Changes to the parquet crate bug labels May 13, 2022
@tustvold
Copy link
Contributor

This is expected behaviour, DataFusion currently passed out of order projection to the parquet reader despite the ArrayReader having never supported it. We now complain explicitly that the reader does not support this, but imo this is better than silently ignoring it?

@alamb
Copy link
Contributor Author

alamb commented May 13, 2022

despite the ArrayReader having never supported it.

What happened if the reader was passed out of order projections? Was data ignored?

@tustvold
Copy link
Contributor

tustvold commented May 13, 2022

It silently ignored the out-of-order projection, i.e.

#[test]
    fn test_out_of_order_projection() {
        let testdata = arrow::util::test_util::parquet_test_data();
        let path = format!("{}/alltypes_plain.parquet", testdata);
        let file = File::open(&path).unwrap();
        let reader = SerializedFileReader::try_from(file).unwrap();
        let expected_rows = reader.metadata().file_metadata().num_rows() as usize;

        let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));
        let b1 = arrow_reader.get_record_reader_by_columns([0, 1], 2).unwrap();

        let b2 = arrow_reader.get_record_reader_by_columns([1, 0], 2).unwrap();

        assert_eq!(b1.schema, b2.schema);
    }

Would pass.

To add to the mess

#[test]
    fn test_out_of_order_projection() {
        let testdata = arrow::util::test_util::parquet_test_data();
        let path = format!("{}/alltypes_plain.parquet", testdata);
        let file = File::open(&path).unwrap();
        let reader = SerializedFileReader::try_from(file).unwrap();
        let expected_rows = reader.metadata().file_metadata().num_rows() as usize;

        let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));
        let b1 = arrow_reader.get_schema_by_columns([0, 1], true).unwrap();
        let b2 = arrow_reader.get_schema_by_columns([1, 0], true).unwrap();

        assert_eq!(b1, b2);
    }

Would not pass, as it sort of understood column projection - although it was completely broken for nested types

tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2022
tustvold added a commit that referenced this issue May 24, 2022
…1701) (#1716)

* Add explicit column mask construction (#1701)

* Fix ParquetRecordBatchStream

* Fix docs

* Fix async_reader test

* Review feedback
@alamb
Copy link
Contributor Author

alamb commented May 24, 2022

Do we need to make any changes to DataFusion?

@tustvold
Copy link
Contributor

Not beyond updating it to use the new ProjectionMask within ParquetExec, so changing from column_reader_with_columns(iter) to column_reader_with_columns(ProjectionMask::roots(iter))

@alamb
Copy link
Contributor Author

alamb commented May 24, 2022

Ok, Maybe I can find time to do so as part of preparing the arrow 15 release

@tustvold
Copy link
Contributor

I'd be happy to help if you run into any roadblocks, it should be straightforward

@alamb
Copy link
Contributor Author

alamb commented May 24, 2022

Update: @tustvold has kindly offered to take a stab at preparing a datafusion PR to upgrade to latest arrow, prior to #1727

@tustvold
Copy link
Contributor

PR here - apache/datafusion#2631

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants