
Support encoding a single parquet file using multiple threads #1718

Closed
alamb opened this issue May 21, 2022 · 22 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

alamb (Contributor) commented May 21, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The encoding / compression step is most often the bottleneck when writing parquet files. Even though the actual writing of bytes must be done serially, the encoding could be done in parallel (into memory buffers) before the actual write.

Describe the solution you'd like
I would like a way (either an explicit API or an example) that allows using multiple cores to write Arrow RecordBatches to a parquet file.

Note that trying to parallelize writes today results in corrupted parquet files, see #1717

Describe alternatives you've considered
There is a high-level description of parallel decoding in @jorgecarleitao's parquet2: https://github.com/jorgecarleitao/parquet2#higher-parallelism (focused on reading)

Additional context
Mailing list https://lists.apache.org/thread/rbhfwcpd6qfk52rtzm2t6mo3fhvdpc91

Also, #1711 is possibly related
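For illustration, a minimal sketch of the requested shape, assuming a hypothetical encode_row_group function (not a parquet crate API) that turns one row group's worth of batches into ready-to-write bytes. The metadata stitching needed to make the result a single valid parquet file is exactly what the rest of this thread discusses and is glossed over here:

```rust
use arrow_array::RecordBatch;

// Hypothetical stand-in for the parallel encoding step: produce the encoded
// bytes for one row group, ready to be appended by a single serial writer.
fn encode_row_group(_batches: &[RecordBatch]) -> Vec<u8> {
    unimplemented!("illustration only")
}

// Encode row groups on worker threads, then perform the byte writes serially.
fn write_parallel<W: std::io::Write>(
    out: &mut W,
    row_groups: Vec<Vec<RecordBatch>>,
) -> std::io::Result<()> {
    let handles: Vec<_> = row_groups
        .into_iter()
        .map(|batches| std::thread::spawn(move || encode_row_group(&batches)))
        .collect();
    for handle in handles {
        // Joining in spawn order keeps the serial write deterministic.
        out.write_all(&handle.join().unwrap())?;
    }
    Ok(())
}
```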

devinjdangelo (Contributor) commented Sep 5, 2023

I am interested in working on this. Does anyone know if there are existing parallelized parquet write implementations in other languages we could reference? I am particularly interested in which of the following would be the best approach:

  1. Serialize multiple columns in a single row group in parallel
  2. Serialize multiple row groups in parallel
  3. A combination of 1 and 2
  4. Serialize multiple pages of a single column in a single row group in parallel
  5. A combination of all of the above

Number 2 could be a challenge if we don't know up front how many total row groups we want in the file.

tustvold (Contributor) commented Sep 5, 2023

Option 1 is likely the most tractable; ArrowWriter already encodes columns to separate memory regions and then stitches the encoded column chunks together. I could conceive of doing something similar for a parallel writer.

I think the biggest question in my mind is the mechanics of parallelism. A naive solution might be to just spawn tokio tasks for each column of each batch, but this would have very poor thread locality, high per-batch overheads, and in general feels a little off... I don't have a good solution here; typically we have avoided adding notions of threading to this crate...

tustvold (Contributor) commented Sep 5, 2023

One option might be for systems like DataFusion with a notion of partitioning to simply write each partition to separate memory regions, and then later stitch these together using the append_column APIs 🤔

This would allow DataFusion to remain in control of the threading, and should be possible with the existing APIs.
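A rough sketch of the first half of that idea, using only existing APIs (ArrowWriter writing into an in-memory Vec<u8>): each worker produces a complete in-memory parquet file for its partition, and the later stitching via the append_column APIs is omitted. The partitions input is an assumption standing in for whatever unit DataFusion would hand each worker:

```rust
use std::sync::Arc;

use arrow_array::RecordBatch;
use arrow_schema::Schema;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

// Encode one partition into an in-memory parquet file on the calling thread.
fn encode_partition(schema: Arc<Schema>, batches: Vec<RecordBatch>) -> Result<Vec<u8>> {
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, schema, None)?;
    for batch in &batches {
        writer.write(batch)?;
    }
    writer.close()?;
    Ok(buffer)
}

// Run one encoder per partition on its own thread; the caller would then
// stitch the resulting in-memory files together (e.g. parquet-concat style).
fn encode_partitions(
    schema: Arc<Schema>,
    partitions: Vec<Vec<RecordBatch>>,
) -> Result<Vec<Vec<u8>>> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|batches| {
            let schema = schema.clone();
            std::thread::spawn(move || encode_partition(schema, batches))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```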

alamb (Contributor, Author) commented Sep 5, 2023

Maybe we could add a new API for encoding that takes ArrayRefs rather than RecordBatches.

The existing API is like: https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#method.write

Maybe a new one could look something like:

```rust
let writer: ArrowWriter = ...;

// Create some new structure for writing each row group
let row_group_writer = writer.new_row_group();

// Could also call row_group_writer.write_column() for multiple columns concurrently
for col in record_batch.columns() {
    row_group_writer.write(col)?;
}
row_group_writer.finish()?;
```

alamb (Contributor, Author) commented Sep 5, 2023

Though encoding different RowGroups and using append_column is an interesting idea 🤔

devinjdangelo (Contributor) commented

> One option might be for systems like DataFusion with a notion of partitioning to simply write each partition to separate memory regions, and then later stitch these together using the append_column APIs 🤔

I like the idea of DataFusion and other arrow-rs users having control over how this is parallelized in terms of threads vs. tokio. My original thought was creating a Send + Sync parquet::ArrowWriter, but I think serializing to independent memory regions and concatenating sounds slick.

I found https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-concat.rs. The only thing it doesn't handle that we would want in DataFusion is bloom filter support. I could see how merging bloom filters could be a bit complicated, though.

devinjdangelo (Contributor) commented

The other thing that might be a challenge if we go the concatenation route is that DataFusion writers preserve input ordering (in the sense that if you run "COPY (select * from my_table order by my_col) to my_file.parquet", then my_file.parquet should be sorted according to the input query).

If we construct multiple parquet files and concatenate them, the data would need to be deserialized and sorted again, which would defeat any possibility of a speedup.

tustvold (Contributor) commented Sep 7, 2023

> preserve input ordering

The DF partitioning logic is smart enough not to destroy sort orders. The flip side is that writing an ordered parquet file will not be parallelized, but given that such a query has a sort, which will likely dominate execution time, perhaps that doesn't matter.

alamb (Contributor, Author) commented Sep 7, 2023

> The DF partitioning logic is smart enough not to destroy sort orders. The flip side is that writing an ordered parquet file will not be parallelized, but given that such a query has a sort, which will likely dominate execution time, perhaps that doesn't matter.

I don't understand why we couldn't write row groups in parallel (we would have to buffer enough data to encode each row group prior to starting the next one). The buffering might not be the best idea, but I think it would be possible.

tustvold (Contributor) commented Sep 9, 2023

> The buffering might not be the best idea, but I think it would be possible.

I had presumed it was a non-starter given #3871

alamb (Contributor, Author) commented Sep 9, 2023

> I had presumed it was a non-starter given #3871

Indeed -- we would have to limit the row group size (in rows) to control the buffer needed, thus resulting in a tradeoff between compression and buffering.

I think we should parallelize writing sorted data at a higher level (aka write multiple files) at first, and treat parallelizing the write of a single file as a follow-on project.
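For a rough sense of the buffering cost (illustrative numbers, not from this thread): encoding N row groups concurrently means holding roughly N × rows_per_group × bytes_per_row of unencoded data in memory, e.g. 4 concurrent row groups × 1,048,576 rows × 100 bytes per row ≈ 420 MB, before any encoder state or output buffers are counted. Shrinking rows_per_group reduces that buffer linearly, but also gives the encoder less data to compress per row group.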

devinjdangelo (Contributor) commented

> One option might be for systems like DataFusion with a notion of partitioning to simply write each partition to separate memory regions, and then later stitch these together using the append_column APIs 🤔

@alamb @tustvold I took a stab at this approach in apache/datafusion#7562. Any feedback is appreciated (especially ideas to reduce the memory requirements).

alamb (Contributor, Author) commented Sep 15, 2023

> @alamb @tustvold I took a stab at this approach in apache/datafusion#7562. Any feedback is appreciated (especially ideas to reduce the memory requirements).

This sounds amazing @devinjdangelo -- I will take a look today

devinjdangelo (Contributor) commented

@alamb @tustvold I took another pass at improving on the first implementation. This time I needed to make a few arrow-rs level changes.

apache/datafusion#7632
#4850

devinjdangelo (Contributor) commented Sep 23, 2023

Also, I think we should be able to get further improvement for parquet files with many columns (i.e. most of them) by providing a way to parallelize the loop in this method:

```rust
fn write(&mut self, batch: &RecordBatch) -> Result<()> {
    self.buffered_rows += batch.num_rows();
    let mut writers = self.writers.iter_mut().map(|(_, x)| x);
    for (array, field) in batch.columns().iter().zip(&self.schema.fields) {
        let mut levels = calculate_array_levels(array, field)?.into_iter();
        write_leaves(&mut writers, &mut levels, array.as_ref())?;
    }
    Ok(())
}
```

From looking through the code, it appears that serializing each column is already independent, so it should not be too difficult to parallelize. The main question is how much of that should be done in arrow-rs vs. marking lower-level objects public and allowing DataFusion/other systems to parallelize as they desire (via async, threadpools, etc.).
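For a sense of what that could look like outside arrow-rs, a minimal sketch of fanning the per-column work out to scoped threads; encode_column is a hypothetical stand-in for the per-column work (calculate_array_levels + write_leaves above), which is not public API at this point:

```rust
use arrow_array::{ArrayRef, RecordBatch};

// Hypothetical stand-in for the per-column encoder; returns the encoded chunk.
fn encode_column(_array: &ArrayRef) -> Vec<u8> {
    unimplemented!("illustration only")
}

// Encode every column of a batch on its own scoped thread and collect the
// per-column results in schema order.
fn encode_columns_in_parallel(batch: &RecordBatch) -> Vec<Vec<u8>> {
    std::thread::scope(|scope| {
        let handles: Vec<_> = batch
            .columns()
            .iter()
            .map(|array| scope.spawn(move || encode_column(array)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```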

tustvold (Contributor) commented

I definitely would prefer to keep arrow-rs runtime-agnostic as much as possible, i.e. not making concurrency decisions in arrow-rs.

devinjdangelo (Contributor) commented

> Option 1 is likely the most tractable; ArrowWriter already encodes columns to separate memory regions and then stitches the encoded column chunks together. I could conceive of doing something similar for a parallel writer.

@tustvold your initial intuition was spot on! I reworked the DataFusion parallel parquet writer to primarily use column-wise parallelization. It is around 20% faster and has around 90% lower memory overhead vs. the previous attempt.

PRs are open with more details on this new approach:

devinjdangelo (Contributor) commented

> A naive solution might be to just spawn tokio tasks for each column of each batch, but this would have very poor thread locality, high per-batch overheads, and in general feels a little off.

Regarding this point, apache/datafusion#7655 only spawns one task per column per row group (not per record batch). Each record batch is sent via a channel to the parallel tasks. Once max_row_group_size is reached, the parallel tasks are joined and new ones are spawned in their place.
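A rough sketch of that shape (not the actual datafusion#7655 code): one tokio task per column for the current row group, fed record-batch columns over a channel; encode_array is a hypothetical stand-in for the real per-column encoder:

```rust
use arrow_array::{ArrayRef, RecordBatch};
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

// Hypothetical stand-in for the per-column encoder.
fn encode_array(_array: &ArrayRef) -> Vec<u8> {
    unimplemented!("illustration only")
}

// One worker per column: receives arrays over a channel and encodes them
// until all senders are dropped (i.e. the row group is complete).
fn spawn_column_worker() -> (mpsc::Sender<ArrayRef>, JoinHandle<Vec<u8>>) {
    let (tx, mut rx) = mpsc::channel::<ArrayRef>(8);
    let handle = tokio::spawn(async move {
        let mut encoded = Vec::new();
        while let Some(array) = rx.recv().await {
            encoded.extend(encode_array(&array));
        }
        encoded
    });
    (tx, handle)
}

// Fan each batch's columns out to the per-column workers; dropping the senders
// (e.g. once max_row_group_size rows have been sent) lets the workers finish.
async fn encode_row_group_columnwise(
    batches: Vec<RecordBatch>,
    num_columns: usize,
) -> Vec<Vec<u8>> {
    let (senders, handles): (Vec<_>, Vec<_>) =
        (0..num_columns).map(|_| spawn_column_worker()).unzip();
    for batch in batches {
        for (tx, column) in senders.iter().zip(batch.columns()) {
            tx.send(column.clone()).await.expect("worker alive");
        }
    }
    drop(senders);
    let mut chunks = Vec::with_capacity(num_columns);
    for handle in handles {
        chunks.push(handle.await.expect("worker panicked"));
    }
    chunks
}
```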

tustvold (Contributor) commented

Closed by #4871

hbpeng0115 commented

Is it possible to implement this in Java? We are using Flink (Java/Scala) to write parquet files.

tustvold (Contributor) commented Mar 7, 2024

I think that would be a question for the maintainers of whichever parquet writer those systems use.

alamb (Contributor, Author) commented Mar 8, 2024

For the curious, we plan to make a higher-level API for encoding parquet files with multiple threads available in DataFusion -- see apache/datafusion#9493
