Trying to write parquet file in parallel results in corrupt file #1717

Closed
alamb opened this issue May 21, 2022 · 2 comments · Fixed by #1719
Labels: bug, parquet (Changes to the parquet crate)

Comments

@alamb
Contributor

alamb commented May 21, 2022

Describe the bug
(from the mailing list)

Apparently, you can make a program that appears to write a parquet file in parallel, but it will currently produce corrupt parquet data.

To Reproduce
Description in the email says:

I was attempting to build a single Parquet from the batches in what I thought was a parallel manner using the ArrowWriter. I tried to "parallelise" the following serial code.

            let cursor = InMemoryWriteableCursor::default();
            let mut writer = ArrowWriter::try_new(cursor.clone(), schema, None)?;
            for batch in batches {
                writer.write(batch)?;
            }
            writer.close()?;

I realised that although the compiler accepted my incorrect parallel version of this code, it was in fact not sound, which caused the corruption.
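The email does not include the parallel version, so the following is only a guess at the kind of hazard involved. It is a std-only sketch that does not use the parquet crate: `SharedCursor` and `write_toy_file` are hypothetical names, and the assumption (suggested by the `cursor.clone()` call above, but not confirmed here) is that clones of the cursor all append to one shared buffer. Under that assumption, two writers running on separate threads compile without complaint yet each emit their own header and footer into the same bytes.

```rust
use std::io::{self, Write};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a cloneable in-memory cursor.
/// Assumption: every clone appends to the same underlying buffer.
#[derive(Clone, Default)]
struct SharedCursor {
    buffer: Arc<Mutex<Vec<u8>>>,
}

impl Write for SharedCursor {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buffer.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Hypothetical toy format: a header, a payload, then a footer,
/// each emitted by a separate write call.
fn write_toy_file(mut sink: SharedCursor, payload: u8) -> io::Result<()> {
    sink.write_all(b"HEAD")?;
    sink.write_all(&[payload; 4])?;
    sink.write_all(b"FOOT")?;
    Ok(())
}

/// Runs two independent writers against clones of one cursor.
fn write_in_parallel() -> Vec<u8> {
    let cursor = SharedCursor::default();
    let handles: Vec<_> = (0..2u8)
        .map(|i| {
            let sink = cursor.clone();
            thread::spawn(move || write_toy_file(sink, b'a' + i).unwrap())
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let bytes = cursor.buffer.lock().unwrap().clone();
    bytes
}

fn main() {
    let bytes = write_in_parallel();
    // The buffer holds two headers and two footers in an unspecified
    // order, so it is not one well-formed toy file.
    println!("{}", String::from_utf8_lossy(&bytes));
}
```

Nothing here is unsafe in the Rust sense, since the mutex protects each individual write. The problem is at the file-format level: each writer believes it owns the whole output.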

Expected behavior
The API should not allow corrupt data to be written. Incorrect parallel use should instead produce a compiler error.

Actually writing a parquet file in parallel is tracked in #1718
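One way an API can turn this kind of misuse into a compiler error is to have the writer own its sink and take `&mut self` on every write. The sketch below illustrates that idea with std only. `OwningWriter` and `write_serially` are hypothetical names, not the parquet crate's API, and this is not necessarily how #1719 resolves the issue.

```rust
use std::io::{self, Write};

/// Hypothetical writer that takes ownership of its sink, so no other
/// handle to the output exists while the writer is alive.
struct OwningWriter<W: Write> {
    sink: W,
}

impl<W: Write> OwningWriter<W> {
    fn new(sink: W) -> Self {
        Self { sink }
    }

    /// `&mut self` means the borrow checker allows only one writer
    /// call at a time.
    fn write(&mut self, batch: &[u8]) -> io::Result<()> {
        self.sink.write_all(batch)
    }

    /// Consumes the writer, appends a toy footer, and hands the sink back.
    fn close(mut self) -> io::Result<W> {
        self.sink.write_all(b"END")?;
        Ok(self.sink)
    }
}

/// Serial use mirrors the loop from the issue.
fn write_serially(batches: &[&str]) -> io::Result<Vec<u8>> {
    let mut writer = OwningWriter::new(Vec::new());
    for batch in batches {
        writer.write(batch.as_bytes())?;
    }
    writer.close()
}

fn main() -> io::Result<()> {
    let bytes = write_serially(&["ab", "cd"])?;
    assert_eq!(bytes, b"abcdEND".to_vec());

    // A naive parallel version is rejected at compile time, because two
    // closures cannot both borrow `writer` mutably:
    //
    // let mut writer = OwningWriter::new(Vec::new());
    // std::thread::scope(|s| {
    //     s.spawn(|| writer.write(b"ab"));
    //     s.spawn(|| writer.write(b"cd"));
    // });
    Ok(())
}
```

With this shape, sharing the writer across threads requires an explicit `Mutex` or similar, which makes the serialisation of writes visible in the caller's code.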

Additional context
Mailing list https://lists.apache.org/thread/rbhfwcpd6qfk52rtzm2t6mo3fhvdpc91

@alamb
Contributor Author

alamb commented May 23, 2022

Note there is a proposed PR for this: #1719

tustvold added a commit that referenced this issue May 25, 2022
…`ParquetWriter` trait (#1717) (#1163) (#1719)

* Rustify parquet writer (#1717) (#1163)

* Fix parquet_derive

* Fix benches

* Fix parquet_derive tests

* Use raw vec instead of Cursor

* Review feedback

* Fix unnecessary unwrap
@alamb
Contributor Author

alamb commented May 25, 2022

🎉
