These are the most common terms used across the documentation to describe the platform.
- JobSpec: A JobSpec is the meta definition of a runnable job. It contains information about the implementation of the job, the artifact in that contains that implementation and the parameters it accepts (among other things).
- ExecutionPlan: An ExecutionPlan is the main controller of runnable jobs. It is in charge of creating new execution instances and to trigger them at the appropriate moment in time.
- Execution: An Execution is an instance of a runnable job. These are managed internally inside Chronos although client applications can access the data they hold in order to get information of the different state changes of a given execution or what was the execution outcome (if finished).
- Registry: It's a sort of repository in which all the available JobSpecs are kept.
- Scheduler: The main core component of the platform, in charge of scheduling executions, triggering them and send them to the worker instances.
- Worker: Ad-hoc node, not member of the scheduling cluster, that performs the actual execution of the job that was scheduled in the first place.
How does it work?
Quckoo doesn't know anything about jobs that haven't been registered in within the system yet. So before being able to schedule any kind of task, we first need to tell Quckoo what are the specific details of that kind of task implementation.
Following diagram shows how this process works:
It is itself quite self-explanatory and compound of the following steps:
- A client sends a request to the Registry saying that it wants to register job A.
- The Registry communicates with the Resolver to validate the job details it has received.
- The Resolver replies back saying that the specification looks right.
- The Registry stores the job specification in within itself and replies back to the client saying that the job has been accepted.
The scheduling (and execution) process it's much more involved than the previous one and goes through very different phases until it reaches the point in which a specific task has been completed.
Following diagram tries to depict this process:
The set of steps of this process are as follows:
- A client requests to schedule a previously registered job specification (as per a trigger definition).
- The Scheduler asks the Registry about the job details of that specific job.
- The Registry replies back with the previously requested job details.
- The Scheduler then creates a new instance of an Execution Plan. Which is responsible to manage the the triggering of the job itself.
- The Execution Plan creates a new Execution in
- At the triggered specific time, the Execution Plan notifies the Execution to wake up
- The Execution attempts to enqueue a task in the pending tasks queue.
- The Task Queue notifies the Worker nodes that there are tasks pending to be executed.
- Non busy Worker nodes reply back asking for new tasks to work on
- The Task Queue delivers one of the pending tasks to a Worker node.
- The Worker sends the task itself to an executor, which knows how to execute the task.
- The Executor asks the Resolver for all the dependencies needed to run the task.
- The Resolver replies back with the previously requested dependencies.
- The Executor replies to the Worker with result of that specific task.
- The Worker sends the execution result to the Task Queue. At this point, the Task Queue will get back to point number 8 if there are more pending tasks.
- The Task Queue notifies the Execution instance that the execution job has been finished. At this point, the Execution Plan can create and trigger more Executions according to its trigger rules.
Of course in previous diagram flows there are many components that can go wrong at any specific moment in time. To be resilient to failures the following decisions have been taking so far:
- The job registry and live executions are written to disk whilst keeping a copy of the data in memory.
- The job registry is as well sharded across the cluster.
- Task queues are only in memory. The idea behind it is that the execution instances themselves are the ones that know they have been scheduled, what it's their timeout, etc. Executions themselves supervise the task queue which was associated with them and in case the queue is lost they must be able to re-enqueue themselves in another instance of the queue somewhere else in the cluster.
- Execution plans are also persisted to disk, and sharded across the cluster too. This gives linear scalability of the master nodes.
- Workers live outside the cluster as ad hoc nodes. They register themselves to all the scheduler instances available in the cluster and when they are idle they request new work to any of those instances.
- Still not available in current implementation is an execution retry process (this kind of processes are usually quite complex) but some research is being done. The purpose of this retry process is to be able to identify worker node failures and distinguish them from task execution failures to help making proper judgement of whether a specific task must be retried or not.