flynn-host cluster refactor #1373
Comments
@titanous this sounds sensible. One thing I am wondering is whether the monitor and the scheduler are the same thing? The scheduler needs to know what is running across the cluster to intelligently schedule jobs, and it watches all the hosts to track job states, which sounds like what the monitor service would also do.
Yeah, I'm going back and forth on that. The main thing is figuring out how one-off jobs are scheduled in that scenario. Does the scheduler expose an API? That seems a bit weird, as the scheduler is specific to the controller and there is the possibility of running many schedulers (what about them? do they implement the same monitoring code?).
I think it makes sense for the scheduler to run the one-off jobs too; perhaps the controller creates job records in the database. WRT multiple schedulers running, I was wondering whether we should relax the constraint that only the leader schedules jobs, and use Postgres advisory locks (like the deployer) to "lock" a formation / job and make the necessary cluster changes. Thinking more about the schedulers being the monitors: they could maintain the full cluster state in service metadata, and then when scheduling a job they would first place the job in memory and then try to update the service metadata; only if that succeeds would they actually run the requested jobs (which would make this similar to the shared-state scheduler described in the Omega paper). Of course, my suggestions require functioning etcd, discoverd & Postgres, so we may need to handle scheduling those separately.
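The optimistic shared-state approach described above can be sketched roughly as follows. This is a minimal illustration, not Flynn code: `metaStore` is a hypothetical stand-in for discoverd service metadata with a version counter, and the retry loop shows the "place in memory, then compare-and-swap, only then run the job" flow. All names here are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// metaStore stands in for versioned service metadata: a snapshot of
// per-host load plus a version, updated only via compare-and-swap.
type metaStore struct {
	mu      sync.Mutex
	version int
	hosts   map[string]int // hostID -> number of jobs placed
}

var errConflict = errors.New("metadata version conflict")

// get returns a copy of the current state and its version.
func (s *metaStore) get() (int, map[string]int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	snap := make(map[string]int, len(s.hosts))
	for k, v := range s.hosts {
		snap[k] = v
	}
	return s.version, snap
}

// cas applies the new state only if version still matches.
func (s *metaStore) cas(version int, hosts map[string]int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if version != s.version {
		return errConflict
	}
	s.version++
	s.hosts = hosts
	return nil
}

// schedule picks the least-loaded host from the scheduler's view of
// the cluster, then commits the placement optimistically; on a
// conflict (another scheduler won the race) it re-reads and retries.
func schedule(s *metaStore, jobID string) (string, error) {
	for attempt := 0; attempt < 5; attempt++ {
		version, hosts := s.get()
		best, min := "", -1
		for id, n := range hosts {
			if min == -1 || n < min {
				best, min = id, n
			}
		}
		if best == "" {
			return "", errors.New("no hosts available")
		}
		hosts[best]++ // place the job in our in-memory copy first
		switch err := s.cas(version, hosts); err {
		case nil:
			// Only after the CAS succeeds would we actually ask the
			// target host to run the job.
			return best, nil
		case errConflict:
			continue // state moved underneath us; refresh and retry
		default:
			return "", err
		}
	}
	return "", errors.New("too many scheduling conflicts")
}

func main() {
	s := &metaStore{hosts: map[string]int{"host1": 2, "host2": 0}}
	host, err := schedule(s, "job-a")
	fmt.Println(host, err)
}
```

The key property is that two schedulers racing for the same resources cannot both commit: one CAS fails, and the loser replans against the fresh state instead of blindly double-booking a host.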
After attempting to implement #1170 a few different ways, and observing some production failures of sampi, I have a new plan.
In the new model, each `flynn-host` instance exposes an API that can be used to run jobs on that instance. The concept of a cluster state leader (`sampi`) is removed and replaced with a set of "monitors" that keep track of the cluster state. The sampi model had a serious failing: it made it appear that there is a consistent view of the cluster, when that is not actually true. At any point in time there may be hosts that are not connected to sampi due to a network partition or other issues. Hosts could also disconnect during scheduling, causing requested jobs to never be run. Since pretending that there is a consistent cluster state only complicates things, we can switch to an eventually consistent model. The first iteration will have "monitor" instances that discover hosts via service discovery and unify the state from all of them in memory. To run jobs, the scheduler must connect to the target host and request that the jobs be run. Since the scheduler needs to connect to the host anyway to monitor the job, this has basically no impact.

Layer 0 bootstrap is removed entirely, and the concept of layer 0 mostly disappears. All services, including discoverd, etcd, and flannel, are started by bootstrap. A discovery URL or a list of `flynn-host` IPs is passed to bootstrap in order to find all of the nodes. This simplifies bootstrap and allows us to have tighter control over the process without splitting configuration across multiple steps (no more implicit layer 0 bootstrap + etcd config). The user experience is also better because we can health check etcd/flannel while starting them and expose any error logs and diagnostics right away (#1148). It also becomes much easier to register everything in the controller for management and updates, and to move to idempotent bootstrap (#225) in the future.

In the future, we'll want to apply policy across the cluster to schedulers (for example, limiting the total amount of resources allowed). This will be accomplished by having a set of "arbiter" services that watch trusted monitors. In order to schedule a new job or resize an existing one, a scheduler will provide details to and request a ticket from the arbiter. The ticket will be time-limited and will be accepted by the specified host, providing permission to access the resources.
TODO