[Design] High level API proposal for ballista #463
Comments
This proposal would fix #457, by the way.
Here is a post from AWS that explains why it is better for the control plane (scheduler) to contact the data plane (executor). As a safety mechanism, the executor would track progress (e.g. how many bytes scanned) and run some watchdog checks (e.g. CPU/memory too high). Then, when the scheduler contacts it, it has more information and can decide whether to let another executor take over, fail the job, or interrupt the current task and re-partition it across other executors. By the way, I'm not very familiar with the distributed execution path yet. Is that merged into mainline? What would be a good reference point? Does the scheduler already know the addresses of all workers that will participate in a query when it starts scheduling?
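To make the watchdog idea concrete, here is a minimal sketch of what the decision logic could look like. All names and thresholds below are hypothetical illustrations, not part of Ballista's actual API:

```rust
/// Progress snapshot the executor would keep per running task
/// (hypothetical shape, for illustration only).
struct TaskProgress {
    bytes_scanned: u64,
    cpu_percent: f32,
    mem_percent: f32,
}

/// What the scheduler could decide once it sees two snapshots.
#[derive(Debug, PartialEq)]
enum Decision {
    /// Task looks healthy: leave it running.
    Continue,
    /// Resource pressure: interrupt and re-partition to other executors.
    Repartition,
    /// No forward progress between check-ins: let another executor take over.
    Reassign,
}

fn decide(previous: &TaskProgress, current: &TaskProgress) -> Decision {
    // Thresholds are arbitrary placeholders.
    if current.cpu_percent > 95.0 || current.mem_percent > 90.0 {
        Decision::Repartition
    } else if current.bytes_scanned == previous.bytes_scanned {
        Decision::Reassign
    } else {
        Decision::Continue
    }
}

fn main() {
    let before = TaskProgress { bytes_scanned: 100, cpu_percent: 40.0, mem_percent: 30.0 };

    let stalled = TaskProgress { bytes_scanned: 100, cpu_percent: 40.0, mem_percent: 30.0 };
    assert_eq!(decide(&before, &stalled), Decision::Reassign);

    let hot = TaskProgress { bytes_scanned: 200, cpu_percent: 99.0, mem_percent: 30.0 };
    assert_eq!(decide(&before, &hot), Decision::Repartition);

    let healthy = TaskProgress { bytes_scanned: 200, cpu_percent: 40.0, mem_percent: 30.0 };
    assert_eq!(decide(&before, &healthy), Decision::Continue);
    println!("ok");
}
```

The point is that the executor only gathers cheap local signals; the (harder) decision stays with the scheduler when it next makes contact.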
That was an interesting read, thanks for the link! Distributed execution in Ballista 0.4 has just been merged at #491. There's still progress to be made there, so right now there's no fault tolerance for executor failures, for example. One difference between the AWS article you linked and Ballista is that control messages in Ballista need to flow bidirectionally: the scheduler will send work to the executors, but it is equally important for the executors to be able to tell the scheduler that they have finished their job and whether they can accept more work. That means the API call rate is never going to be 100% controlled by the scheduler. I do like the strategy of opening a long-lived connection from the executor and using it as a proxy for liveness. There's some increased complexity, since the scheduler will have to make a conscious effort to decide how many connections it wants. It also makes auto-scaling the scheduler cluster more difficult, since adding a new instance won't immediately remove any load from the existing schedulers. The design I was proposing mimics what Kafka does, which is reportedly able to handle 10k connections per broker. If I find some time, I might run some benchmarks on the current code to see how well a single scheduler scales right now.
@edrevo Have you thought about how the scheduler cluster might maintain shared state? I am thinking about things like partition statistics and possibly the provenance (and location) of partitions, i.e. the chain of operations that produced a partition, which could be useful in saving work for future queries. I am thinking the schedulers could form a raft (or your favourite consensus algorithm) cluster.
Right now the scheduler cluster maintains state either through "standalone mode", which doesn't support cluster mode (so no HA), or through etcd. Once #574 is merged, the shared state will contain information about each job, stage, and task/partition that has been submitted to the cluster, including which executor is handling each partition and where the results can be fetched from. Etcd uses the raft protocol behind the scenes, but using raft directly in the scheduler and avoiding the external dependency on etcd sounds great too. The way to implement this would be to create a new ConfigBackendClient which does the necessary service discovery to find the other scheduler nodes and uses the raft protocol directly. Maybe that could be done by extending the Standalone config backend instead of implementing a new one.
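To illustrate the shape of that abstraction: the backend is essentially a key/value store behind which etcd, standalone mode, or a raft-based implementation could sit. The trait and names below are hypothetical simplifications, not Ballista's actual ConfigBackendClient signature:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical, simplified key/value interface. A raft-backed backend
// would be just another implementor of the same trait.
trait ConfigBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&self, key: &str, value: Vec<u8>);
}

/// In-memory backend, analogous to the Standalone mode (no HA).
struct StandaloneBackend {
    state: Mutex<HashMap<String, Vec<u8>>>,
}

impl ConfigBackend for StandaloneBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.state.lock().unwrap().get(key).cloned()
    }
    fn put(&self, key: &str, value: Vec<u8>) {
        self.state.lock().unwrap().insert(key.to_string(), value);
    }
}

fn main() {
    let backend = StandaloneBackend { state: Mutex::new(HashMap::new()) };
    // The shared state would track jobs/stages/partitions, e.g. which
    // executor owns a partition (key layout is a made-up example):
    backend.put("jobs/1/stages/0/partitions/3", b"executor-2".to_vec());
    assert_eq!(
        backend.get("jobs/1/stages/0/partitions/3"),
        Some(b"executor-2".to_vec())
    );
    println!("ok");
}
```

Keeping the consensus details behind a trait like this is what would let raft replace etcd without touching the rest of the scheduler.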
Thank you, that helps clarify my mental model of the design. I am curious about the performance implications of running our own raft cluster vs using etcd. Using raft in the standalone mode would certainly add a lot more complexity. |
I've been thinking lately about what the best API is for the different parts of ballista. For now, I'm still tabling the discussion around resource managers (K8s, Mesos, YARN, etc.) and I'll focus on scheduler, executors and clients.
I see that right now the executors implement the Flight protocol, which makes perfect sense for Arrow data transmission. I think this is a great fit for the "data plane" in Ballista: executor <-> executor communication and executor <-> client communication (for `.collect`).

When it comes to the "control plane", though, I think stream-based mechanisms like the Flight protocol (or even the bidirectional streams in gRPC) aren't great: they pin the communication of the client to a specific server (the scheduler, in this case), which makes it harder to dynamically increase the number of instances in a scheduler cluster (you could add instances, but all existing clients would still be talking to the same initial scheduler).
I also think that unary RPCs are easier to implement, and have a higher degree of self-documentation through the protobuf definition.
So, here's my proposal for the control plane:
The arrows go from client to server in the image above.
The scheduler would have an API that would look something like this:
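Roughly, with placeholder service, method, and message names (the exact definitions are still up for discussion), something like:

```protobuf
syntax = "proto3";

// Placeholder names, for illustration of the unary-RPC shape only.
service SchedulerGrpc {
  // Executors register themselves and poll for tasks; the same unary
  // call lets them report completed work and remaining capacity.
  rpc PollWork (PollWorkParams) returns (PollWorkResult);

  // Clients submit a serialized logical plan and get back a job id.
  rpc ExecuteQuery (ExecuteQueryParams) returns (ExecuteQueryResult);

  // Clients poll job status and, once finished, the partition locations.
  rpc GetJobStatus (GetJobStatusParams) returns (GetJobStatusResult);
}
```

Since every call is unary, any scheduler instance behind a load balancer can serve any request, which is what makes scaling the scheduler cluster straightforward.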
I can drill down into the message definitions if you want, but I'm still not 100% sure of all the information that needs to go in them (I know part of it).
The proposal for data plane is much simpler:
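In rough terms, the executor would only expose Flight's standard data-movement call; the contents of the ticket below are a placeholder sketch, not a final format:

```protobuf
// Executors implement the standard Flight service; fetching a
// partition's results maps onto DoGet with an opaque ticket.
service FlightService {
  rpc DoGet (Ticket) returns (stream FlightData);
}

// Hypothetical ticket payload identifying a partition to fetch.
message PartitionId {
  string job_id = 1;
  uint32 stage_id = 2;
  uint32 partition_id = 3;
}
```
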
All of these would be based on the Flight Protocol, but we wouldn't use any of the control messages that flight offers (i.e. no DoAction).
Thoughts?