flynn-host cluster refactor #1373
Comments
@titanous this sounds sensible. One thing I am wondering is whether the monitor and the scheduler are the same thing? The scheduler needs to know what is running across the cluster to intelligently schedule jobs, and it watches all the hosts to track job states, which sounds like what the monitor service would also do.
Yeah, I'm going back and forth on that. The main thing is figuring out how one-off jobs are scheduled in that scenario. Does the scheduler expose an API? That seems a bit weird, as the scheduler is specific to the controller and there is the possibility of running many schedulers (what about them? do they implement the same monitoring code?).
I think it makes sense for the scheduler to run the one-off jobs too; perhaps the controller creates job records in the database. WRT multiple schedulers running, I was wondering whether we should relax the constraint that only the leader schedules jobs, and use Postgres advisory locks (like the deployer) to "lock" a formation / job and make the necessary cluster changes. Thinking more about the schedulers being the monitors: they could maintain the full cluster state in service metadata, and then when scheduling a job they would first place the job in memory and then try to update the service metadata; only if that succeeds would they actually run the requested jobs (which would make this similar to the shared-state scheduler described in the Omega paper). Of course, my suggestions require functioning etcd, discoverd & Postgres, so we may need to handle scheduling those separately.
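The optimistic shared-state approach described above can be sketched roughly as follows. This is a minimal illustration, not Flynn code: `metaStore` is a hypothetical stand-in for discoverd service metadata with a version counter, and the retry loop shows the "place in memory, then compare-and-swap, only then run the job" flow. All names here are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// metaStore stands in for versioned service metadata: a snapshot of
// per-host load plus a version, updated only via compare-and-swap.
type metaStore struct {
	mu      sync.Mutex
	version int
	hosts   map[string]int // hostID -> number of jobs placed
}

var errConflict = errors.New("metadata version conflict")

// get returns a copy of the current state and its version.
func (s *metaStore) get() (int, map[string]int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	snap := make(map[string]int, len(s.hosts))
	for k, v := range s.hosts {
		snap[k] = v
	}
	return s.version, snap
}

// cas applies the new state only if version still matches.
func (s *metaStore) cas(version int, hosts map[string]int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if version != s.version {
		return errConflict
	}
	s.version++
	s.hosts = hosts
	return nil
}

// schedule picks the least-loaded host from the scheduler's view of
// the cluster, then commits the placement optimistically; on a
// conflict (another scheduler won the race) it re-reads and retries.
func schedule(s *metaStore, jobID string) (string, error) {
	for attempt := 0; attempt < 5; attempt++ {
		version, hosts := s.get()
		best, min := "", -1
		for id, n := range hosts {
			if min == -1 || n < min {
				best, min = id, n
			}
		}
		if best == "" {
			return "", errors.New("no hosts available")
		}
		hosts[best]++ // place the job in our in-memory copy first
		switch err := s.cas(version, hosts); err {
		case nil:
			// Only after the CAS succeeds would we actually ask the
			// target host to run the job.
			return best, nil
		case errConflict:
			continue // state moved underneath us; refresh and retry
		default:
			return "", err
		}
	}
	return "", errors.New("too many scheduling conflicts")
}

func main() {
	s := &metaStore{hosts: map[string]int{"host1": 2, "host2": 0}}
	host, err := schedule(s, "job-a")
	fmt.Println(host, err)
}
```

The key property is that two schedulers racing for the same resources cannot both commit: one CAS fails, and the loser replans against the fresh state instead of blindly double-booking a host.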
After attempting to implement #1170 a few different ways, and observing some production failures of sampi, I have a new plan.
In the new model, each `flynn-host` instance exposes an API that can be used to run jobs on that instance. The concept of a cluster state leader (`sampi`) is removed and replaced with a set of "monitors" that keep track of the cluster state. The sampi model had a serious failing: it made it appear that there is a consistent view of the cluster, when that is not actually true. At any point in time there may be hosts that are not connected to sampi due to a network partition or other issues. Hosts could also disconnect during scheduling, causing requested jobs to never be run. Since pretending that there is a consistent cluster state only complicates things, we can switch to an eventually consistent model. The first iteration will have "monitor" instances that discover hosts via service discovery and unify the state from all of them in memory. To run jobs, the scheduler must connect to the target host and request that the jobs be run. Since the scheduler needs to connect to the host anyway to monitor the job, this has basically no impact.

Layer 0 bootstrap is removed entirely, and the concept of layer 0 mostly disappears. All services, including discoverd, etcd, and flannel, are started by bootstrap. A discovery URL or a list of `flynn-host` IPs is passed to bootstrap in order to find all of the nodes. This simplifies bootstrap and allows us to have tighter control over the process without splitting configuration across multiple steps (no more implicit layer 0 bootstrap + etcd config). The user experience is also better because we can health check etcd/flannel while starting them and expose any error logs and diagnostics right away (#1148). It also becomes much easier to register everything in the controller for management and updates, and to move to idempotent bootstrap (#225) in the future.

In the future, we'll want to apply policy across the cluster to schedulers (for example, limiting the total amount of resources allowed). This will be accomplished by having a set of "arbiter" services that watch trusted monitors. In order to schedule a new job or resize an existing one, a scheduler will provide details to and request a ticket from the arbiter. The ticket will be time-limited and will be accepted by the specified host, providing permission to access the resources.
TODO