[Ballista] Streaming style push-based shuffle and All-at-once stage scheduling in Ballista #1805
Comments
Add a design doc for further discussion. https://docs.google.com/document/d/17J9H6gGBVktmRAFYNQu-v52QUUPlghRnVLIZC3mFYFY/edit?usp=sharing
Hi @mingmwang, nice work! You can open an RFC ticket if you are willing to; then we can discuss it on GitHub. Once we reach the final version, we can merge it and keep the RFC in our codebase. FYI, https://github.com/apache/arrow-datafusion/tree/master/docs/source/specification/rfcs
Can we discuss in the Google doc or in this thread directly? Everyone can comment on the Google doc.
Both are ok, just a suggestion.
👍
This sounds great!
The design looks good to me. Thanks for writing it up, @mingmwang; I left a minor question in the doc.
It's great!
Any progress on this new feature?
@heroWang
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A new feature enhancement
Describe the solution you'd like
Ballista's current shuffle implementation is very similar to Spark's early version: a hash-based shuffle in which shuffle data is materialized to disk and each map task produces a separate file for each reduce task. A shuffle operation involving M map tasks and N reduce tasks therefore generates M * N files, and too many tiny files cause performance, memory, and scalability problems.

Later Spark versions introduced a sort-based shuffle, which became the default implementation. Sort-based shuffle does not generate M * N files; instead, each map task sorts its records by partition id plus key and produces a single pair of files: all records are consolidated into one data file, and an index file holds the data-range metadata for the different partitions. In Spark 2.0, the hash-based shuffle code was removed. Spark also introduced external shuffle services to serve materialized intermediate shuffle data, achieving better fault tolerance and performance isolation. The recent Spark 3.2 release introduced a push-based shuffle solution (SPARK-30602) to further improve shuffle stability and IO performance: shuffle is performed at the end of the map tasks, and shuffle blocks are pre-merged and pushed to selected reducer nodes or uploaded to Spark external shuffle servers.
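To make the sort-based file layout concrete, here is a minimal, hypothetical sketch (not Spark or Ballista code; the function and types are invented for illustration) of how a map task can sort records by partition id into one consolidated data buffer plus an index of per-partition start offsets, so reducer i reads the byte range index[i]..index[i+1]:

```rust
// Hypothetical sketch of a sort-based shuffle map output: one data "file"
// (a byte buffer here) plus an index of partition start offsets.
fn write_sorted_shuffle(
    records: &mut Vec<(u32, Vec<u8>)>, // (partition id, serialized record)
    num_partitions: u32,
) -> (Vec<u8>, Vec<usize>) {
    // Sort by partition id so each reducer's bytes are contiguous.
    records.sort_by_key(|(pid, _)| *pid);

    let mut data = Vec::new(); // the single consolidated data file
    let mut index = Vec::with_capacity(num_partitions as usize + 1);
    let mut next_pid = 0u32;
    for (pid, payload) in records.iter() {
        // Record the start offset of every partition up to `pid`
        // (empty partitions share the same offset).
        while next_pid <= *pid {
            index.push(data.len());
            next_pid += 1;
        }
        data.extend_from_slice(payload);
    }
    // Trailing offsets, including the end-of-file sentinel.
    while next_pid <= num_partitions {
        index.push(data.len());
        next_pid += 1;
    }
    (data, index)
}

fn main() {
    let mut records = vec![(1, b"bb".to_vec()), (0, b"a".to_vec()), (1, b"c".to_vec())];
    let (data, index) = write_sorted_shuffle(&mut records, 2);
    // Reducer 1 reads the byte range index[1]..index[2] of the data file.
    assert_eq!(&data[index[1]..index[2]], b"bbc");
    assert_eq!(&data[index[0]..index[1]], b"a");
    println!("data = {:?}, index = {:?}", data, index);
}
```

The key property is that M map tasks produce 2M files regardless of N, instead of M * N.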
Other distributed compute engines such as Flink and Presto also support the shuffle operation, but they do not materialize the shuffle data to disk. Instead, shuffle data is materialized in a streaming fashion into an in-memory buffer, and the reduce tasks poll the shuffle data from the map tasks' in-memory buffers to minimize end-to-end latency.
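This streaming style can be sketched with bounded channels standing in for the in-memory buffers. The sketch below is purely illustrative (it is not Flink, Presto, or Ballista code): one map task hash-partitions its output into per-reducer buffers while the reduce tasks consume concurrently, and the bounded capacity provides natural backpressure:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical sketch of streaming shuffle: a bounded channel per reduce
// task stands in for that task's in-memory buffer; no data touches disk.
fn streaming_shuffle(num_reducers: usize, num_records: u64) -> Vec<u64> {
    let (senders, receivers): (Vec<_>, Vec<_>) =
        (0..num_reducers).map(|_| mpsc::sync_channel::<u64>(16)).unzip();

    // The map task partitions its output and pushes into the buffers,
    // blocking (backpressure) whenever a buffer is full.
    let mapper = thread::spawn(move || {
        for v in 0..num_records {
            let pid = (v % num_reducers as u64) as usize;
            senders[pid].send(v).unwrap();
        }
        // Dropping `senders` closes the channels, signalling end-of-stream.
    });

    // Each reduce task polls its buffer as data arrives (here: a running sum).
    let reducers: Vec<_> = receivers
        .into_iter()
        .map(|rx| thread::spawn(move || rx.iter().sum::<u64>()))
        .collect();

    mapper.join().unwrap();
    reducers.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // 10 records partitioned across 2 reducers: evens vs. odds.
    let sums = streaming_shuffle(2, 10);
    assert_eq!(sums, vec![20, 25]); // 0+2+4+6+8 and 1+3+5+7+9
    println!("per-reducer sums: {:?}", sums);
}
```

The trade-off versus disk materialization is that map and reduce tasks must be alive at the same time, which is exactly why the scheduling change below is needed.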
Here, we propose a new streaming-style push-based shuffle solution for Ballista, in which shuffle is performed at the end of the map tasks. Instead of materializing the intermediate shuffle data to disk and generating M * N files, shuffle data is pushed directly to the reduce tasks via Arrow Flight gRPC calls to achieve very low latency, which is important for low-latency queries. The corresponding stage scheduling will be enhanced to support all-at-once scheduling: all the stages of a SQL query/job are scheduled at almost the same time, and the distributed DAG of the query is fixed at the beginning, so the map tasks can push shuffle data to downstream reduce tasks in a streaming fashion.
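The scheduling difference can be illustrated with a small sketch (hypothetical `Stage` type and functions, invented for this example, not the proposed Ballista API): stage-at-a-time scheduling launches a stage only after all of its input stages finish, producing a sequence of waves, while all-at-once scheduling launches every stage of the fixed DAG up front so producers always have live consumers to push to:

```rust
// Hypothetical stage: an id plus the ids of the stages it reads from.
struct Stage {
    id: usize,
    inputs: Vec<usize>,
}

// Stage-at-a-time: launch stages wave by wave in topological order.
fn stage_at_a_time_waves(stages: &[Stage]) -> Vec<Vec<usize>> {
    let mut done = vec![false; stages.len()];
    let mut waves = Vec::new();
    while done.iter().any(|d| !*d) {
        let wave: Vec<usize> = stages
            .iter()
            .filter(|s| !done[s.id] && s.inputs.iter().all(|i| done[*i]))
            .map(|s| s.id)
            .collect();
        for &id in &wave {
            done[id] = true;
        }
        waves.push(wave);
    }
    waves
}

// All-at-once: every stage is scheduled in a single initial wave.
fn all_at_once_waves(stages: &[Stage]) -> Vec<Vec<usize>> {
    vec![stages.iter().map(|s| s.id).collect()]
}

fn main() {
    // A 3-stage plan: two map stages (0, 1) feeding one reduce stage (2).
    let stages = vec![
        Stage { id: 0, inputs: vec![] },
        Stage { id: 1, inputs: vec![] },
        Stage { id: 2, inputs: vec![0, 1] },
    ];
    assert_eq!(stage_at_a_time_waves(&stages), vec![vec![0, 1], vec![2]]);
    assert_eq!(all_at_once_waves(&stages), vec![vec![0, 1, 2]]);
}
```

With all-at-once scheduling the reduce stage is already running when the map stages start producing, so shuffle data can be pushed rather than fetched after the fact; the cost is that the whole DAG's resources must be available at once.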
I will draft a detailed design doc to cover the proposed API changes later.