Task assignment between Scheduler and Executors #1221

Closed
mingmwang opened this issue Nov 2, 2021 · 19 comments · Fixed by #1560
Labels
ballista · enhancement (New feature or request)

Comments

@mingmwang
Contributor

mingmwang commented Nov 2, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When I read the code, I see that task assignment between executors and the Scheduler works by the executors continuously polling the Scheduler for work. If there is no task to run, the poll loop sleeps for 100 ms. I think a better way would be to let the Scheduler assign available tasks to selected executors, to make better use of CPU cores. The existing loop can remain for heartbeat purposes. A new RPC method between the executor and the Scheduler is needed for task assignment.
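
To make this concrete, here is a minimal sketch of what the executor side of such a push model could look like; ExecutorService, launch_task, and the slot handling below are hypothetical names for illustration, not existing Ballista APIs:

// Illustrative only: a stand-in for the new scheduler -> executor RPC.
trait ExecutorService {
  // The scheduler pushes a ready task; the executor either accepts it
  // into a free task slot or rejects it so the scheduler can retry
  // on another executor.
  fn launch_task(&mut self, task_id: u64, plan: Vec<u8>) -> Result<(), String>;
}

struct Executor {
  free_slots: usize,
}

impl ExecutorService for Executor {
  fn launch_task(&mut self, task_id: u64, _plan: Vec<u8>) -> Result<(), String> {
    if self.free_slots == 0 {
      return Err("no free task slots".to_string());
    }
    self.free_slots -= 1;
    println!("executor started task {}", task_id);
    Ok(())
  }
}

fn main() {
  // The existing poll loop would remain only as a heartbeat; task
  // delivery happens through this call instead of a 100 ms poll.
  let mut executor = Executor { free_slots: 2 };
  executor.launch_task(1, vec![]).unwrap();
}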


@mingmwang added the enhancement label Nov 2, 2021
@alamb
Contributor

alamb commented Nov 2, 2021

Maybe related to #700

cc @Dandandan

@alamb added the ballista label Nov 2, 2021
@Dandandan
Contributor

Maybe related to #700

cc @Dandandan

Yes, it's the same observation.
To me the proposed solution sounds like a good idea. FYI @andygrove

@jon-chuang

Hi, this seems interesting; I would love to try implementing it.

@jon-chuang

Here is the proposed design:

  • When an executor initially connects to the scheduler, it also tells the scheduler how many task slots it has. The amount of memory per task, as per [Epic] Optionally Limit memory used by DataFusion plan #587, could also be negotiated here.
  • As long as the executor is alive, the scheduler tries to send tasks to it, prioritizing the executors with the most slots available (see the sketch below).

Just wondering if the cardinality estimates/execution cost model could be used for more intelligent scheduling. Also wondering whether each task runs single-threaded or can exploit more cores on the system, and if so, whether tasks share a common threadpool or the cores are partitioned so that each of the executor's n slots gets 1/n of the cores.
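
For illustration, a minimal sketch of the scheduler-side slot bookkeeping the two points above imply; ExecutorMeta, register, pick_executor, and all field names are assumptions, not the actual Ballista structures:

use std::collections::HashMap;

// Hypothetical registration record; field names are illustrative only.
struct ExecutorMeta {
  total_slots: usize,
  available_slots: usize,
  mem_per_task_bytes: u64, // could be negotiated as per #587
}

#[derive(Default)]
struct Scheduler {
  executors: HashMap<String, ExecutorMeta>,
}

impl Scheduler {
  // Called when an executor first connects and reports its capacity.
  fn register(&mut self, id: String, slots: usize, mem_per_task_bytes: u64) {
    self.executors.insert(
      id,
      ExecutorMeta {
        total_slots: slots,
        available_slots: slots,
        mem_per_task_bytes,
      },
    );
  }

  // Pick the executor with the most free slots, as proposed above.
  fn pick_executor(&self) -> Option<&str> {
    self.executors
      .iter()
      .filter(|(_, m)| m.available_slots > 0)
      .max_by_key(|(_, m)| m.available_slots)
      .map(|(id, _)| id.as_str())
  }
}

fn main() {
  let mut scheduler = Scheduler::default();
  scheduler.register("executor-1".into(), 4, 1 << 30);
  scheduler.register("executor-2".into(), 8, 1 << 30);
  // The executor with the most free slots wins.
  assert_eq!(scheduler.pick_executor(), Some("executor-2"));
}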

@alamb
Contributor

alamb commented Nov 10, 2021

@jon-chuang sounds like a good start.

I think something else that the scheduler should be able to take advantage of in the future might be "data locality" -- that is if a plan looks like

(plan section 1) -- writes intermediate results --> (plan section 2)

It is likely advantageous in many cases to run section 1 and section 2 on the same executor, if possible, to avoid having to send ("reshuffle") the intermediate results around.
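
For illustration only, such a locality preference could be a small tie-breaker when choosing an executor for a stage; every name below is hypothetical rather than an existing Ballista API:

use std::collections::HashMap;

// Prefer an executor that already holds one of the stage's inputs,
// falling back to any executor with a free slot.
fn pick_executor_for_stage(
  input_partitions: &[u64],
  partition_location: &HashMap<u64, String>, // partition id -> executor id
  executors_with_slots: &[String],
) -> Option<String> {
  input_partitions
    .iter()
    .filter_map(|p| partition_location.get(p))
    .find(|e| executors_with_slots.contains(*e))
    .cloned()
    .or_else(|| executors_with_slots.first().cloned())
}

fn main() {
  let mut loc = HashMap::new();
  loc.insert(7, "executor-1".to_string());
  let free = vec!["executor-1".to_string(), "executor-2".to_string()];
  // A stage reading partition 7 lands on executor-1, avoiding a reshuffle.
  assert_eq!(
    pick_executor_for_stage(&[7], &loc, &free),
    Some("executor-1".to_string())
  );
}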

@jon-chuang

jon-chuang commented Nov 14, 2021

Regarding shuffling, I saw in some benchmarks for TiDB's distributed query engine (incidentally, also relying on columnar storage) that an MPP-style shuffle seemed to produce better results than the MapReduce style of Apache Spark. There are some open questions, such as whether Java could be the cause of this discrepancy, but maybe it's also worth thinking about how to optimize the shuffles.

I don't know enough about DataFusion to know if it takes into account data movement when generating query plans.

@mingmwang
Contributor Author

@jon-chuang sounds like a good start.

I think something else that the scheduler should be able to take advantage of in the future might be "data locality" -- that is if a plan looks like

(plan section 1) -- writes intermediate results --> (plan section 2)

It is likely advantageous in many cases to run section 1 and section 2 on the same executor, if possible, to avoid having to send ("reshuffle") the intermediate results around.

Can you please explain the "data locality" requirements a little more? I think for normal source tasks, which read data from remote storage (cloud storage or HDFS), there is no data locality. And for shuffle readers, which have to read data from all map tasks, there is no data locality either.

@mingmwang
Contributor Author

Regarding shuffling, I saw in some benchmarks for TiDB's distributed query engine (incidentally, also relying on columnar storage) that an MPP-style shuffle seemed to produce better results than the MapReduce style of Apache Spark. There are some open questions, such as whether Java could be the cause of this discrepancy, but maybe it's also worth thinking about how to optimize the shuffles.

I don't know enough about DataFusion to know if it takes into account data movement when generating query plans.

Actually, I'm working on an MPP-style shuffle implementation; most of the coding is done and I'm now testing it.
I'm not sure whether the community needs this feature or not.

@houqp
Member

houqp commented Nov 15, 2021

Actually, I'm working on an MPP-style shuffle implementation; most of the coding is done and I'm now testing it.
I'm not sure whether the community needs this feature or not.

I am very interested in this. Do you mind sharing it with us when it's ready?

@alamb
Contributor

alamb commented Nov 15, 2021

Can you please explain the "data locality" requirements a little more? I think for normal source tasks, which read data from remote storage (cloud storage or HDFS), there is no data locality. And for shuffle readers, which have to read data from all map tasks, there is no data locality either.

I was thinking of a plan such as the following. There may be cases when reshuffling between scan/filter and aggregate is worthwhile (e.g., to distribute the load better), but I think the cost of reshuffling will mostly end up dominating any savings.

                                                                      
                                                                      
        rest of plan                                                  
                                                                      
                                                                      
              │                                                       
              │                                                       
              │                                                       
┌ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ┐                                       
              ▼                                                       
│   ┌───────────────────┐     │                                       
    │   HashAggregate   │                                             
│   └───────────────────┘     │                                       
              │                               Data is not reshuffled  
│             │               │              between scan, filter and 
              ▼                  ◀ ─ ─ ─ ─ ─        aggregate         
│   ┌───────────────────┐     │                                       
    │      Filter       │                                             
│   └───────────────────┘     │                                       
              │                                                       
│             │               │                                       
              ▼                                                       
│   ┌───────────────────┐     │                                       
    │     TableScan     │                                             
│   └───────────────────┘     │                                       
                                                                      
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘                                       

@liukun4515
Contributor

liukun4515 commented Nov 18, 2021

Actually, I'm working on an MPP-style shuffle implementation; most of the coding is done and I'm now testing it.
I'm not sure whether the community needs this feature or not.

I am very interested in this. Do you mind sharing it with us when it's ready?

@mingmwang you can file a draft PR first.

@yahoNanJing
Contributor

Hi @alamb and @houqp, we have recently implemented an initial version of push-based task scheduling. Here's the design document; could you help review it?
https://docs.google.com/document/d/1Z1GO2A3bo7M_N26w_5t-9h3AhIPC2Huoh0j8jgwFETk/edit?usp=sharing

A PR is in progress.

@houqp
Member

houqp commented Jan 12, 2022

Thank you @yahoNanJing for writing up the design doc! I will take a close look at it this weekend 👍

@mingmwang
Contributor Author

👍

@alamb
Contributor

alamb commented Jan 13, 2022

I will also try to find some time to read this document, but it may not be for a few days.

@yahoNanJing
Contributor

An initial version of the PR is up: #1560.

@jon-chuang

jon-chuang commented Jan 14, 2022

Hi all, I've been working on a Rust API for the Ray distributed computing framework, which powers many popular Python ML libraries like RLLib, Ray Train, and Ray Tune.

The Rust API is currently nearing the end of the prototype phase and we are looking for real-world usage for the project. You can view the tracking issue: ray-project/ray#20609 and prototype progress: ray-project/ray#21572

I'm quite interested in exploring the use of Ray for highly performant and efficient scheduling of tasks for Ballista. Note that one can do locality-aware scheduling with Ray, which can perform well even without randomized data partitioning, etc., thus opening new possibilities for Ballista's performance.

A second advantage of Ray is that the API is simple, so we don't need to deal with networking code and communication protocols, which are difficult to maintain.

// This proc macro generates data marshalling, function registration 
// and internal ray::core API calls for the remote function
#[ray::remote]
fn my_task(..) {
  ..
}

fn main() {
  let obj = T::new();
  
  let id = ray::put::<T>(obj); // put the object into shared memory / object store
  
  // This can run on a remote node, 
  // as scheduled by the distributed scheduler
  let id2 = ray::task(my_task).remote(id); 
  
  let result = ray::get::<T2>(id2); // get object from shared memory
  
  println!("{:?}", result);
}

In the future, we are also interested in supporting GPU tasks via rustc's PTX backend, which can run on any NVIDIA GPU. So we could maybe accelerate Ballista the way RAPIDS accelerates Spark, etc., by converting physical operators into GPU kernels.

#[ray::remote(enable_for_gpu)]
fn my_compute_intensive_task(..) {
  arrow_data[ptx::idy() * N + ptx::idx()] = ..;
}

fn main() {
  let arrow_data = ray::task(load_distributed_data).remote(partition_id);
  ray::task(my_compute_intensive_task).as_gpu_task().remote(arrow_data);
}

Our plan is also to support zero-copy reading of (immutable) Arrow data directly from the object store (on the same node) across multiple tasks.

Do let me know if anyone is interested in this. I will be happy to chat.

You can also shoot me an email chuang {d0t} jon [AT] gmail - dott - com

@yjshen
Member

yjshen commented Jan 14, 2022

@jon-chuang Thanks for bringing this up. I may be mistaken about some aspects of Ray; please point them out.

IMHO, Ray is designed to ease the development of general-purpose distributed programs. It's more like "parallelize your machine learning code and run it on a cluster without pain", as in the code sample you provided above.

On the other hand, Ballista is meant to be a distributed SQL query engine; the code to distribute and run is quite limited, since it's all DataFusion's fixed set of physical operators. So what should I expect from a Ray integration? Does Ray provide core abilities like task scheduling, keepalive monitoring, straggler detection, and speculative task execution? Could I therefore easily build a distributed SQL engine on top of DataFusion with little effort?

@jon-chuang

jon-chuang commented Jan 14, 2022

@yjshen thanks for your questions

task scheduling, keepalive monitoring, straggler detection, and speculative task execution

  • Yes.
  • Yes, plus failure recovery at the task level. We also have a worker monitoring dashboard with basic resource-utilization info.
  • We do not have robust distributed tracing tools yet, but they are planned. As for scheduling, it does not currently take into account global information such as straggling in an execution DAG, nor try to prioritize bottlenecked tasks. However, we are looking into a priority mechanism for tasks, through which a user (or an external monitoring tool) could prioritize bottlenecked tasks.
  • Note that Ray will always try to schedule tasks if there are resources available. So if the dataframe/SQL operation does not have an all-to-all dependency, it will automatically proceed to the next stage. We also have plans to preempt workers in anticipation of OOM.

Could I therefore easily build a distributed SQL engine on top of DataFusion with little effort?

This is unclear to me, and requires more investigation. However, note that the distributed dataframe project Modin was built on top of Ray.

the code to distribute and run is quite limited, since it's all DataFusion's fixed set of physical operators.

Yes. I think the use case is perhaps incremental and interactive SQL queries that can take advantage of low-latency scheduling, for instance, backend serving for many queries (> 10-100 MOps) over a distributed dataset.

I think these workloads might currently be out of scope for Ballista, which is aimed at analytics just like Spark is, but it is interesting to consider.

For instance, time-series DBs and Materialize DB offer this sort of streaming SQL computation. Also consider something like NoriaDB, which is optimized for read-heavy serving workloads and offers incremental SQL computation.
