This repository has been archived by the owner on Nov 30, 2021. It is now read-only.

deis-controller generated fleet units should have [X-Fleet] Conflicts with the other units of the same app #2920

Closed
ianblenke opened this issue Jan 19, 2015 · 10 comments

@ianblenke

When there is an instance failure, units migrate to the remaining fleet machines. Over time, as fleet machines fail and are terminated by the AWS auto-scaling group and replaced with new instances, this tends to leave the longest-running fleet machine with a good chance of having multiple if not all units for an app on it.

This just happened to me this evening. I had a 10 minute production outage where the deis-registry, deis-store-gateway, and one of my critical web apps had all 3 units somehow allocated on the same fleet machine, which suddenly stopped responding (thanks AWS!). The 3 app web units were trying to be relocated by fleet, but they were trying to docker pull from the deis-registry, which was waiting on the deis-store-gateway to finish starting up, which in turn was waiting on ceph, which was running in that normal degraded slow way ceph does when one of its monitors/OSDs isn't up. It did eventually repair itself, but that web app wasn't being serviced in the interim.

While it might be fair to say "well fleet shouldn't schedule all 3 on the same machine!" - how is Fleet to know?

At the moment, deis-controller generates fleet units without any kind of anti-affinity hint, so the fleet scheduler has no way to know better than to allocate all of an app's units on the same fleet machine.

While using deis tags would afford us the ability to tag generated units with fleet machine affinity, that still doesn't prevent a tagged fleet machine from acquiring multiple or even all of an app's units. Meaning: fleet will merrily schedule multiple units to any available fleet machine with the app's tags.

What we need is a way of tagging deis-controller generated fleet units with an [X-Fleet] Conflicts directive.

For example, a myapp v3 web proctype unit would ideally have this appended to it:

[X-Fleet]
Conflicts=myapp_v3.web.*
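
For illustration, a minimal sketch of what such a generated unit might look like with the section appended. Everything above [X-Fleet] here is made up for the example (the start command and registry address are placeholders), not what deis-controller actually emits:

[Unit]
# Illustrative only; the real generated units differ.
Description=myapp_v3.web.1

[Service]
# Placeholder start command; the image/registry reference is hypothetical.
ExecStart=/usr/bin/docker run --rm --name myapp_v3.web.1 example-registry:5000/myapp:v3

[X-Fleet]
# The anti-affinity glob proposed above: never co-locate two web units of this app/release.
Conflicts=myapp_v3.web.*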

Granted, on an N instance deis fleet cluster, this limits it to deis ps:scale web=N. Any more than N, and the units simply won't get scheduled by fleet.

As Conflicts is a simple glob, in order to allow overcommitting units, we would need to introduce an "oversubscription" ordinal component to the generated unit service name.

For example, if myapp were on a 3 machine fleet cluster, and we set deis ps:scale web=6, in order to allow for oversubscription of 2 (and only 2) units on any given fleet machine, we would need to generate service names like:

myapp_v3.web.1.1
myapp_v3.web.1.2
myapp_v3.web.1.3
myapp_v3.web.2.1
myapp_v3.web.2.2
myapp_v3.web.2.3

And have the units carry [X-Fleet] Conflicts entries something like:

[X-Fleet]
Conflicts=myapp_v3.web.1.*
Conflicts=myapp_v3.web.2.*

If the second ordinal is the unit's index modulo N (given an N machine fleet cluster), the first ordinal would be the quotient, reflecting this oversubscription.
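
A quick sketch of that arithmetic, purely to illustrate the conjectured naming scheme (nothing like this exists in deis-controller today):

# Generate names for scale=6 on an N=3 machine cluster:
# first ordinal  = (i - 1) / N + 1  (integer quotient: the oversubscription "slot")
# second ordinal = (i - 1) % N + 1  (the position within that slot)
N=3
for i in $(seq 1 6); do
  echo "myapp_v3.web.$(( (i - 1) / N + 1 )).$(( (i - 1) % N + 1 ))"
done
# Prints myapp_v3.web.1.1 through myapp_v3.web.2.3, matching the list above.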

Depending on the memory size limit of each unit, this will likely require some deis client support for setting the maximum oversubscription of an app unit per machine as well. Something like:

deis ps:maxoversubscription web=2

This would imply that, on a fleet cluster of N machines, if I were to set the scale greater than N*2, the additional units beyond the maxoversubscription limit would end up spreading across the cluster per the usual fleet rules (as there would be no applicable [X-Fleet] Conflicts glob to prevent it).

Anyway, this is all conjecture, I'm sure OpDemand will contrive something that solves these kinds of problems.

Thanks!

@gabrtv gabrtv self-assigned this Jan 22, 2015
@gabrtv gabrtv added this to the v1.3 milestone Jan 22, 2015
@bacongobbler
Member

With an [X-Fleet] Conflicts, we cannot have more than one instance of an application on each host, which some may consider a good thing if they have the resources to spare.

I'd prefer if there's a way we can evenly spread out the containers through a fix upstream (which is the preferred way of doing this) instead of through scheduler-specific logic in Deis. The latter means we are essentially doing the scheduling for the scheduler, which is not the intention of the scheduler module. It's supposed to be an endpoint that communicates with and ships jobs to a scheduler, with the host balancing/job scheduling logic performed on the scheduler's end.

@bacongobbler
Member

Also, as I mentioned in #2959, we should get to the bottom of the issue here with AWS. I've had a cluster running for up to two weeks on Rackspace and Digital Ocean with no downtime, so this has to do with something on that specific provider (though let's keep that in the other ticket).

@ianblenke
Author

I'm not saying that we shouldn't also fix the AWS issue, but we must have a way of guaranteeing that units get distributed on more than one host. Principle of least surprise.

Like right now I have a 5 node cluster. I do a deis config:set on an app scaled to web=5, and the new version of the app's units all get scheduled to a single CoreOS machine. So I do config:set again; same thing: the new version gets scheduled to the exact same CoreOS machine. I do it 4 more times, and fleet consistently schedules all of the new units on the same fleet machine. Trying a deis ps:scale up and down, all of the units still get scheduled to the same fleet machine. This shouldn't be possible, yet I consistently hit the worst possible deployment case: all of my critical production app units deployed to the same fleet machine.

I'm not sure how anyone can run like this.

@johanneswuerbach
Contributor

Yep, I had the same issue, and until it is fixed upstream or a different solution is provided (coreos/fleet#943), a workaround should be implemented. Having all containers of an app on the same instance doesn't allow HA.

@ianblenke fleet schedules new units to the instance that runs the fewest units (only the count matters). So you would first have to "even out" all instances manually (by scaling a dummy app or something like that) and then scale your production app. Not great, but it works.
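
As a rough way to see how uneven things are before doing that (this assumes fleetctl's default output where the MACHINE column is the second field; adjust the awk field if your fleet version prints different columns):

# Count units per machine:
fleetctl list-units --no-legend | awk '{print $2}' | sort | uniq -c | sort -rn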

@ianblenke
Author

Thinking about this, it might be nicer to be able to say "mark this app so that all of its units are deployed as Global", wherein setting scale= would mean you want that many units per fleet machine instead of that many total cluster-wide.

That would allow me to ensure that I have a unit running on every fleet machine, while simultaneously allowing me to scale it beyond one per machine.

If app_v1.web and app_v2.web were both global on a 3 node cluster (scale=2), that would imply that there would be 6 units deployed as a result. That seems both obvious and adequate to the problem at hand.
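
For the one-unit-per-machine half of that, fleet already has global units; a sketch of the relevant section (Deis doesn't generate this today, and on its own it doesn't allow more than one unit per machine):

[X-Fleet]
Global=true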

As @aledbf mentioned in the IRC chat room, fleet schedules based on the number of units on a given fleet machine.

@bacongobbler mentioned that this is the real place to fix this problem:
https://github.com/coreos/fleet/blob/b298bbefa0f6334344e5ef3ba08f789faf8c02ad/engine/scheduler_test.go#L27-L66

Deploying a bunch of dummy fleet units with an X-Fleet MachineID= affinity would be one approach.
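
Such a dummy unit might look roughly like this (the machine ID and the placeholder command are illustrative; older fleet releases spell the option X-ConditionMachineID):

[Unit]
Description=Placeholder to even out per-machine unit counts

[Service]
# Does nothing; just occupies a scheduling slot on the pinned machine.
ExecStart=/usr/bin/sleep infinity

[X-Fleet]
# Pin to a specific machine (use the ID from `fleetctl list-machines -l`).
MachineID=0123456789abcdef0123456789abcdef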

Barring that, I now use this script to re-balance with X-Fleet conflicts:
https://gist.github.com/ianblenke/038c6540cd37d776664b

@carmstrong
Contributor

@ianblenke What's the best way to recreate this? In general, we haven't seen the extreme behavior you're seeing, but you seem to hit it consistently.

Should I provision a 5-node cluster and deploy an app? Or do you typically see it after a config:set? I'll try to recreate this.

@johanneswuerbach
Contributor

@carmstrong you can reproduce this by deploying a 3 node cluster, deploying app A (scale=3), rebooting one instance, then deploying another app B and scaling it to 3. Since the rebooted instance didn't run any containers, fleet usually schedules all of B's containers onto it.

Rebooting is the easiest way, but fleet crashes or instance failures trigger the same problem.
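
For anyone else trying, an approximate version of those steps (app names and flags here are just for illustration):

# Fresh 3 node cluster:
deis ps:scale web=3 --app=app-a   # app A ends up spread across the 3 machines
# Reboot one CoreOS instance (e.g. `sudo reboot` on that host), then:
deis ps:scale web=3 --app=app-b   # app B's units tend to all land on the rebooted, empty machine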

@carmstrong
Contributor

Rebooting is the easiest way, but fleet crashes or instance failures trigger the same problem.

I'll give it a go. I'm really hoping we can get to the bottom of the instance failures, as preventing those would alleviate a lot of potential issues.

@mboersma mboersma modified the milestones: v1.3, v1.4 Jan 29, 2015
@bacongobbler bacongobbler modified the milestone: v1.4 Feb 13, 2015
@carmstrong carmstrong self-assigned this Feb 13, 2015
@carmstrong
Contributor

I reached out to the CoreOS folks on this, and they pointed me to coreos/fleet#1023, where they're tracking rebalancing of units across hosts.

They'd love some feedback on the proposal, and will help implement the solution into fleet itself. I don't think hacking the Deis scheduler to load a bunch of dummy units is the best way to approach this, and since we have a path forward with CoreOS/fleet directly, I'm going to close this issue in Deis.

Let's get the Deis community behind an implementation on that issue, and work with the CoreOS team to implement it. I'll volunteer some time to help coordinate folks from our community to work on this.

/cc @bcwaldon

@ianblenke
Author

Agreed. The Docker Swarm scheduler work and the ECS Mesos integration for Docker scheduling are also quite compelling:

https://github.com/docker/swarm/tree/master/scheduler
https://github.com/awslabs/ecs-mesos-scheduler-driver

Much innovation happening right now. Keeping a close eye on it.
