
flynn-host cluster refactor #1373

Closed
8 tasks done
titanous opened this issue Mar 28, 2015 · 3 comments

@titanous

After attempting to implement #1170 a few different ways, and observing some production failures of sampi, I have a new plan.

In the new model, each flynn-host instance exposes an API that can be used to run jobs on that instance.

The concept of a cluster state leader (sampi) is removed and replaced with a set of "monitors" that keep track of the cluster state. The sampi model had a serious failing in that it made it appear that there is a consistent view of the cluster, when that is not actually true. At any point in time there may be hosts that are not connected to sampi due to a network partition or other issues. Hosts could also disconnect during scheduling, causing requested jobs to never be run. Since pretending that there is a consistent cluster state only complicates things, we can switch to an eventually consistent model. The first iteration will have "monitor" instances that discover hosts via service discovery and unify the state from all of them in memory. To run jobs, the scheduler must connect to the target host and request that the jobs be run. Since the scheduler needs to connect to the host anyway to monitor the job, this has basically no impact.
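To make the monitor idea concrete, here is a minimal sketch of the in-memory state unification it implies. The Job type, the /jobs endpoint, and the polling loop are illustrative assumptions, not an actual flynn-host API:

```go
// monitor.go: a rough sketch of the "monitor" idea, not Flynn's actual API.
// The Job shape and the /jobs endpoint are assumptions for illustration.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
)

// Job is a hypothetical view of a job as reported by a host.
type Job struct {
	ID     string `json:"id"`
	Status string `json:"status"`
}

// Monitor unifies per-host job state in memory; the view is eventually
// consistent, since each host is polled (or streamed) independently.
type Monitor struct {
	mtx   sync.RWMutex
	state map[string][]Job // host address -> jobs last seen on that host
}

func NewMonitor() *Monitor {
	return &Monitor{state: make(map[string][]Job)}
}

// Refresh asks a single host for its current jobs and records the result.
// A host that is partitioned away simply keeps its last-known state.
func (m *Monitor) Refresh(hostAddr string) error {
	res, err := http.Get("http://" + hostAddr + "/jobs") // hypothetical endpoint
	if err != nil {
		return err
	}
	defer res.Body.Close()
	var jobs []Job
	if err := json.NewDecoder(res.Body).Decode(&jobs); err != nil {
		return err
	}
	m.mtx.Lock()
	m.state[hostAddr] = jobs
	m.mtx.Unlock()
	return nil
}

// Snapshot returns the current (eventually consistent) cluster view.
func (m *Monitor) Snapshot() map[string][]Job {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	out := make(map[string][]Job, len(m.state))
	for h, js := range m.state {
		out[h] = append([]Job(nil), js...)
	}
	return out
}

func main() {
	m := NewMonitor()
	// Host addresses would come from service discovery in practice.
	for _, h := range []string{"10.0.0.1:1113", "10.0.0.2:1113"} {
		if err := m.Refresh(h); err != nil {
			fmt.Println("host unreachable, keeping stale state:", h, err)
		}
	}
	fmt.Println(m.Snapshot())
}
```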

Layer 0 bootstrap is removed entirely, and the concept of layer 0 mostly disappears. All services, including discoverd, etcd, and flannel, are started by bootstrap. A discovery URL or a list of flynn-host IPs is passed to bootstrap in order to find all of the nodes. This simplifies bootstrap and allows us to have tighter control over the process without splitting configuration across multiple steps (no more implicit layer 0 bootstrap + etcd config). The user experience is also better because we can health check etcd/flannel while starting them, and expose any error logs and diagnostics right away (#1148). It also becomes much easier to register everything in the controller for management and updates, and to move to idempotent bootstrap (#225) in the future.
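As an illustration of the health-check point, bootstrap could poll something like etcd's /health endpoint and fail fast with a useful error. The timeout, port, and error handling below are assumptions, not actual bootstrap code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitHealthy polls an HTTP health endpoint until it returns 200 or the
// deadline passes, so bootstrap could fail fast with a useful error rather
// than hanging silently. Illustrative sketch only.
func waitHealthy(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		res, err := http.Get(url)
		if err == nil {
			res.Body.Close()
			if res.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("service at %s did not become healthy within %s", url, timeout)
}

func main() {
	// etcd serves a /health endpoint on its client port.
	if err := waitHealthy("http://127.0.0.1:2379/health", 30*time.Second); err != nil {
		fmt.Println("bootstrap error:", err)
	}
}
```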

In the future, we'll want to apply policy across the cluster to schedulers (for example, limiting the total amount of resources allowed). This will be accomplished by having a set of "arbiter" services that watch trusted monitors. In order to schedule a new job or resize an existing job, a scheduler will provide details to and request a ticket from the arbiter. The ticket will be time-limited and accepted by the specified host, granting permission to use the requested resources.
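A loose sketch of what such a ticket might carry and how a host could validate it. The fields, the HMAC signing, and the validation rules are all invented for illustration; the only constraints from the plan above are that tickets are time-limited and tied to a specific host:

```go
// Illustrative only; not an actual Flynn design.
package arbiter

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/json"
	"errors"
	"time"
)

// Ticket is a hypothetical grant issued by an arbiter: it names the host
// that may honour it, the resources it covers, and when it expires.
type Ticket struct {
	HostID    string    `json:"host_id"`
	JobID     string    `json:"job_id"`
	MemoryMB  int       `json:"memory_mb"`
	ExpiresAt time.Time `json:"expires_at"`
	Signature []byte    `json:"signature"`
}

// payload is the signed portion of the ticket (everything but the signature).
func (t *Ticket) payload() []byte {
	p, _ := json.Marshal(struct {
		HostID    string    `json:"host_id"`
		JobID     string    `json:"job_id"`
		MemoryMB  int       `json:"memory_mb"`
		ExpiresAt time.Time `json:"expires_at"`
	}{t.HostID, t.JobID, t.MemoryMB, t.ExpiresAt})
	return p
}

// Sign is what the arbiter would do with a key shared with the hosts.
func (t *Ticket) Sign(key []byte) {
	mac := hmac.New(sha256.New, key)
	mac.Write(t.payload())
	t.Signature = mac.Sum(nil)
}

// Validate is what a host would run before agreeing to start the job.
func (t *Ticket) Validate(key []byte, localHostID string) error {
	if t.HostID != localHostID {
		return errors.New("ticket issued for a different host")
	}
	if time.Now().After(t.ExpiresAt) {
		return errors.New("ticket expired")
	}
	mac := hmac.New(sha256.New, key)
	mac.Write(t.payload())
	if !hmac.Equal(mac.Sum(nil), t.Signature) {
		return errors.New("invalid ticket signature")
	}
	return nil
}
```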

TODO

  • Remove sampi, and modify each host to be standalone, capable of running jobs via its API.
  • Remove layer 0 bootstrap.
  • Implement discovery URL system for hosts.
  • Add API to configure networking, remove knowledge of flannel.
  • Add API to configure discoverd registration.
  • Move etcd, discoverd, flannel to bootstrap, register in controller.
  • Update scheduler.
  • Update all consumers of the cluster API to use the new system.
@lmars commented Mar 29, 2015

@titanous this sounds sensible. One thing I am wondering is whether the monitor and the scheduler are the same thing. The scheduler needs to know what is running across the cluster to schedule jobs intelligently, and it watches all the hosts to track job states, which sounds like exactly what the monitor service would do?

@titanous

Yeah, I'm going back and forth on that. The main thing is figuring out how one-off jobs are scheduled in that scenario. Does the scheduler expose an API? That seems a bit weird, as the scheduler is specific to the controller and there is the possibility of running many schedulers (what about them? Do they implement the same monitoring code?)

@lmars commented Mar 29, 2015

I think it makes sense for the scheduler to run the one-off jobs too: perhaps the controller creates job records in the database, and there is an AFTER INSERT trigger which the scheduler uses to actually run the job? Even if the monitor and scheduler are separate, we still have that issue, right?
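A rough sketch of that trigger idea using Postgres LISTEN/NOTIFY via github.com/lib/pq; the table, channel, and trigger names are made up, not the controller's actual schema:

```go
// Illustrative only: hypothetical table/channel names.
package main

import (
	"fmt"
	"time"

	"github.com/lib/pq"
)

// setupSQL would be applied as a migration: the controller inserts job
// records, and the trigger notifies any listening scheduler.
const setupSQL = `
CREATE OR REPLACE FUNCTION notify_job_insert() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('job_inserts', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER job_insert_notify
AFTER INSERT ON jobs
FOR EACH ROW EXECUTE PROCEDURE notify_job_insert();
`

func main() {
	connStr := "postgres://localhost/controller?sslmode=disable"
	listener := pq.NewListener(connStr, time.Second, time.Minute, nil)
	if err := listener.Listen("job_inserts"); err != nil {
		panic(err)
	}
	for n := range listener.Notify {
		if n == nil {
			continue // nil is sent after a reconnect; re-sync state here
		}
		// The scheduler would look up the job record by ID and run it
		// against the appropriate host's API.
		fmt.Println("new job to schedule:", n.Extra)
	}
}
```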

WRT multiple schedulers running, I was wondering whether we should relax the constraint that only the leader schedules jobs, and use postgres advisory locks (like the deployer) to "lock" a formation / job and make the necessary cluster changes?
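Roughly, the advisory-lock approach could look like this; pg_try_advisory_xact_lock is a standard Postgres function, but the key derivation and the surrounding code are illustrative assumptions:

```go
// Illustrative sketch of "locking" a formation with a Postgres advisory lock
// so that several schedulers can run without stepping on each other.
package scheduler

import (
	"database/sql"
	"hash/fnv"
)

// lockKey maps a formation identifier onto the int64 key space that advisory
// locks use; the FNV hash here is an arbitrary choice for illustration.
func lockKey(formationID string) int64 {
	h := fnv.New64a()
	h.Write([]byte(formationID))
	return int64(h.Sum64())
}

// withFormationLock runs fn only if this process wins the advisory lock.
// The transaction-scoped lock is released automatically at COMMIT/ROLLBACK
// and stays on the same connection as the rest of the work. If another
// scheduler holds the lock, the call returns (false, nil) so the caller can
// skip or retry later.
func withFormationLock(db *sql.DB, formationID string, fn func(*sql.Tx) error) (bool, error) {
	tx, err := db.Begin()
	if err != nil {
		return false, err
	}
	defer tx.Rollback()

	var locked bool
	if err := tx.QueryRow("SELECT pg_try_advisory_xact_lock($1)", lockKey(formationID)).Scan(&locked); err != nil {
		return false, err
	}
	if !locked {
		return false, nil
	}
	if err := fn(tx); err != nil {
		return false, err
	}
	return true, tx.Commit()
}
```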

Thinking more about the schedulers being the monitors: they could maintain the full cluster state in service metadata. When scheduling a job, they would first place the job in memory, then try to update the service metadata, and only if that update succeeds would they actually run the requested jobs (which would make this similar to the shared-state scheduler described in the Omega paper).
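A bare-bones sketch of that optimistic flow, with a hypothetical compare-and-swap metadata store standing in for discoverd service metadata (the interface and field names are invented, not discoverd's API):

```go
// Illustrative optimistic scheduling loop in the spirit of the Omega
// shared-state model; MetaStore is a stand-in for versioned service metadata.
package scheduler

import "errors"

// ClusterState is the scheduler's in-memory copy of the shared state.
type ClusterState struct {
	Index int64             // version of the metadata this copy was read at
	Jobs  map[string]string // job ID -> host ID placements
}

// MetaStore is a hypothetical versioned metadata store.
type MetaStore interface {
	Get() (*ClusterState, error)
	// CompareAndSwap succeeds only if the stored index still equals
	// state.Index, then bumps the index; otherwise it returns ErrConflict.
	CompareAndSwap(state *ClusterState) error
}

var ErrConflict = errors.New("metadata changed since last read")

// ScheduleJob places a job in the local copy, tries to publish that copy,
// and only starts the job once the publish succeeds. On a conflict it
// re-reads the shared state and tries again.
func ScheduleJob(store MetaStore, jobID string, pickHost func(*ClusterState) string, runOnHost func(host, jobID string) error) error {
	for {
		state, err := store.Get()
		if err != nil {
			return err
		}
		host := pickHost(state)
		state.Jobs[jobID] = host // tentative in-memory placement

		switch err := store.CompareAndSwap(state); err {
		case nil:
			// Placement is now visible to the other schedulers; actually run it.
			return runOnHost(host, jobID)
		case ErrConflict:
			continue // someone else changed the state; recompute from fresh state
		default:
			return err
		}
	}
}
```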

Of course my suggestions require a functioning etcd, discoverd & postgres, so we may need to handle the case of scheduling those separately.
