
Move parallel parquet serialization to blocking threads #9605

Conversation

devinjdangelo
Contributor

Which issue does this PR close?

Related to #9493

Rationale for this change

Serialization is CPU intensive and could starve the object store writer futures, resulting in failed or timed-out writes.

I believe that all parts of parallel parquet writing apart from concatenate_parallel_row_groups (which does the object store put) could be moved to a sync/blocking thread. However, I think serialization is the only step CPU-intensive enough between calls to .await to potentially cause issues.

What changes are included in this PR?

Moves parquet column serialization to blocking threads via the spawn_blocking method.
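
For illustration, here is a minimal sketch of the spawn_blocking pattern this change applies. The encode_column function is a hypothetical stand-in for the real parquet column encoder, not DataFusion's actual API:

```rust
use tokio::task::spawn_blocking;

// Hypothetical stand-in for the CPU-heavy parquet column encoding step.
fn encode_column(values: Vec<i64>) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

async fn serialize_column(values: Vec<i64>) -> Vec<u8> {
    // spawn_blocking moves the CPU-bound encoding onto tokio's blocking pool,
    // so IO futures on the async worker threads keep getting polled promptly.
    spawn_blocking(move || encode_column(values))
        .await
        .expect("blocking serialization task panicked")
}

#[tokio::main]
async fn main() {
    let bytes = serialize_column((0..1024i64).collect()).await;
    println!("encoded {} bytes", bytes.len());
}
```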

Are these changes tested?

Yes, by existing tests

Are there any user-facing changes?

No

@github-actions bot added the core (Core DataFusion crate) label on Mar 14, 2024
@devinjdangelo changed the title from "Move parallel parquet serialization to blocking threadpool" to "Move parallel parquet serialization to blocking threads" on Mar 14, 2024
@tustvold
Contributor

So yes, in an ideal world all CPU computation would be spawned to rayon or a similar blocking threadpool, as in this PR. Unfortunately, however, this isn't the way DF has been implemented.

Instead, the compromise we use in IOx is to spawn the IO off to a separate runtime and accept that the runtime DF is running in will not have good poll latencies.

Therefore, rather than spawning the CPU-bound work, we instead want to spawn the IO of the multipart upload.
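
For context, a rough sketch of that alternative (illustrative names only, not the actual IOx or DataFusion code): keep the CPU-bound work on the runtime DataFusion executes on, and ship the upload IO to a dedicated tokio runtime whose workers are never blocked by encoding.

```rust
use std::sync::OnceLock;
use tokio::runtime::{Builder, Runtime};

// Lazily-built runtime reserved for IO; CPU-heavy plan execution stays on the
// runtime DataFusion is already running in.
fn io_runtime() -> &'static Runtime {
    static IO_RUNTIME: OnceLock<Runtime> = OnceLock::new();
    IO_RUNTIME.get_or_init(|| {
        Builder::new_multi_thread()
            .worker_threads(2)
            .thread_name("io-runtime")
            .enable_all()
            .build()
            .expect("failed to build IO runtime")
    })
}

// Hypothetical wrapper: push one part of a multipart upload onto the IO
// runtime and await only its completion from the CPU runtime.
async fn upload_part(part: Vec<u8>) {
    io_runtime()
        .spawn(async move {
            // A real implementation would call the object_store multipart
            // upload here; this sketch just simulates the IO wait.
            tokio::time::sleep(std::time::Duration::from_millis(5)).await;
            drop(part);
        })
        .await
        .expect("upload task panicked");
}

#[tokio::main]
async fn main() {
    upload_part(vec![0u8; 1024]).await;
    println!("part uploaded");
}
```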

@alamb
Contributor

alamb commented Mar 14, 2024

> So yes, in an ideal world all CPU computation would be spawned to rayon or a similar blocking threadpool, as in this PR. Unfortunately, however, this isn't the way DF has been implemented.

I will continue to agree to disagree on this point. But I think @tustvold may be trolling me 🤔 :)

Anyhow, for the record, I think using 2 separate tokio runtimes is perfectly acceptable and good, for reasons I have soapboxed about at length. However, using tokio for CPU-bound threads means it is very easy to schedule CPU and IO on the same thread pool, which will become a bottleneck in some scenarios.

So in the sense that using a different threadpool API would make it impossible to mix IO and CPU work, it would be an improvement. However, I think it would be a heavy price to pay.
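
As a toy illustration of that bottleneck (not DataFusion code): on a single-threaded runtime, CPU work run inline on the async thread delays a pending IO-style future well past its expected wake-up time.

```rust
use std::time::{Duration, Instant};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let start = Instant::now();
    let io_task = tokio::spawn(async move {
        // Stand-in for an IO future, e.g. one poll of a multipart upload.
        tokio::time::sleep(Duration::from_millis(10)).await;
        println!("IO-style task finished after {:?}", start.elapsed());
    });

    // CPU-bound work run inline on the async thread (the anti-pattern):
    // the spawned task above cannot make progress until this loop yields,
    // so the printed latency is far greater than 10ms.
    let mut acc: u64 = 0;
    for i in 0..500_000_000u64 {
        acc = acc.wrapping_add(i);
    }
    println!("CPU work done: {acc}");

    io_task.await.unwrap();
}
```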

@alamb
Contributor

alamb commented Mar 14, 2024

> Therefore, rather than spawning the CPU-bound work, we instead want to spawn the IO of the multipart upload.

I agree with this assessment -- basically, the threadpool running the DataFusion plan should be doing CPU work and ideally not also IO work.

@devinjdangelo
Contributor Author

devinjdangelo commented Mar 14, 2024

> I agree with this assessment -- basically, the threadpool running the DataFusion plan should be doing CPU work and ideally not also IO work.

I understand the context here for influxdb, but it would also be interesting to have a deeper discussion on this in the context of DataFusion as a standalone execution engine. That is, should we be doing anything differently to make sure users of datafusion-cli running a query like

COPY (select * from 's3://bucket/table') to 's3://bucket/parquet.file'

won't run into poll latency issues reading/writing from remote object stores. Perhaps one of two things is true:

1. Poll latency actually isn't that big of an issue for multipart object store writes. The batch sizes are small enough that the time between .awaits will not negatively impact a streaming multipart write workload.
2. Poll latency does cause unpredictable job failure, and DataFusion should perhaps manage two tokio runtimes itself for IO/CPU or make more consistent use of spawn_blocking.

I have tested queries like the above myself (albeit unscientifically) and have not run into any issues. It may be a good idea to test this more thoroughly.

@tustvold
Contributor

tustvold commented Mar 14, 2024

> Poll latency does cause unpredictable job failure, and DataFusion should perhaps manage two tokio runtimes itself for IO/CPU or make more consistent use of spawn_blocking.

Certainly the lancedb users have reported issues related to this (apache/arrow-rs#5366), and we have seen similar issues on the read side in IOx in the past, before we split out to use a separate runtime.

I suspect in most cases it will just limit throughput; it would take a very contended system for it to result in an actual failure.
