Set up autoscaling for nodes #74

Closed · SamLau95 opened this issue Dec 3, 2016 · 8 comments

@SamLau95 commented Dec 3, 2016

We should automatically scale up and down nodes as needed.

See https://github.com/data-8/jupyterhub-k8s/blob/master/docs/autoscaling.md for an explanation of rationale and a starting point for implementation.

@yuvipanda commented Dec 28, 2016

'Autoscaling' can refer to three different things in Kubernetes:

  1. Horizontal Pod Autoscaling. This is useful when you have 5 pods serving the same stateless application, and need to automatically scale it up and down based on traffic / CPU / Memory usage. This does not apply to us at all.
  2. Vertical Pod Autoscaling / Autosizing. This grants individual pods more CPU / RAM / Disk based on certain criteria. This also does not apply to us at all.
  3. Cluster Autoscaling. This checks if there are enough nodes with enough CPU / RAM resources for scheduling the pods we want to schedule, and adds more nodes if needed. This is not built into kubernetes but a function of the cloud provider it runs on (such as Google / Azure), since this will need to provision new VMs. It might also destroy VMs when we are over provisioned. This is what we want.

Due to the way Kubernetes is architected, Cluster Autoscalers are easy to build. You have a program that runs in a loop, doing the following things (a rough sketch follows the list):

  1. Establish our goal for the cluster autoscaling. This could be something like 'be able to launch 50 pods with a 95th percentile waiting time of 5s' or 'cluster must be no more than 80% full at all times'.
  2. Make API request to Kubernetes, find distribution of pods among nodes.
  3. Check if new nodes need to be added for the cluster to fit the constraints we decided on in step 1.
  4. If so, make an API request to the cloud provider (Google / Azure / AWS) to spawn a new VM and add it to the Kubernetes cluster.
  5. If we can safely delete a VM (for some definition of 'safely'), then make an API request to do so.
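A minimal sketch of such a loop, assuming the official `kubernetes` Python client; `add_node()`/`remove_node()` are hypothetical stand-ins for the cloud-provider calls discussed below, and the utilization check is deliberately simplistic:

```python
# Toy cluster-autoscaler control loop (sketch, not an implementation).
import time

from kubernetes import client, config

TARGET_UTILIZATION = 0.8  # goal: cluster no more than 80% full

# Handles only the common binary suffixes; real quantities can also
# use decimal suffixes like "1G".
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity):
    """Parse a Kubernetes memory quantity such as '512Mi' into bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def memory_utilization(v1):
    """Fraction of allocatable memory already requested by pods."""
    allocatable = sum(
        to_bytes(node.status.allocatable["memory"])
        for node in v1.list_node().items
    )
    requested = sum(
        to_bytes(c.resources.requests["memory"])
        for pod in v1.list_pod_for_all_namespaces().items
        for c in pod.spec.containers
        if c.resources.requests and "memory" in c.resources.requests
    )
    return requested / allocatable


def add_node():
    print("would ask the cloud provider for a new VM")  # placeholder


def remove_node():
    print("would drain and delete an idle VM")  # placeholder


def main():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    while True:
        usage = memory_utilization(v1)
        if usage > TARGET_UTILIZATION:
            add_node()
        elif usage < TARGET_UTILIZATION / 2:
            remove_node()  # only if some node can be emptied safely
        time.sleep(60)


if __name__ == "__main__":
    main()
```

The hard part, as the rest of this comment explains, is not the loop itself but deciding when a node can be removed safely.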

GKE provides an alpha autoscaler that can be used to accomplish certain autoscaling goals. Unfortunately, it does not really work for us yet. Its documentation says:

Cluster Autoscaler assumes that all replicated pods can be restarted on some other node, possibly causing a brief disruption. If this is not the case then Cluster Autoscaler should not be used (yet).

This is not true for us at all: destroying pods may cause a student to lose their work, and those pods will not be restarted on another node without manual action. This means their definition of safety doesn't work for us yet. That will hopefully change in the future.

Also, it only spawns new nodes when there are pods that have already been scheduled but have no place to run. This means we cannot set a goal like 'cluster is at most 80% full at all times': the autoscaler is hard-coded to a particular goal, which is to kick in only after the cluster is 100% full. This will probably be made configurable in the future, but it isn't yet.

Azure Container Service doesn't have a default cluster autoscaler at all yet.

So for now, we'll have to write our own. It should not be too complicated, however.

@yuvipanda commented:

For Azure, we'll need to make this REST API call to change the agentPoolProfiles.count property.

For GKE I think we can use initialNodeCount in https://cloud.google.com/container-engine/reference/rest/v1/projects.zones.clusters.nodePools
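For concreteness, a sketch of what the GKE side might look like, assuming the `projects.zones.clusters.nodePools.setSize` REST method and a valid OAuth token; both assumptions should be verified against the reference above:

```python
# Hedged sketch: resize a GKE node pool via REST. Assumes the
# nodePools setSize method and an OAuth token with container scope;
# check the reference docs linked above before relying on this.
import requests


def resize_gke_node_pool(token, project, zone, cluster, pool, node_count):
    url = (
        "https://container.googleapis.com/v1"
        f"/projects/{project}/zones/{zone}"
        f"/clusters/{cluster}/nodePools/{pool}/setSize"
    )
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}"},
        json={"nodeCount": node_count},
    )
    resp.raise_for_status()
    return resp.json()
```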

@SaladRaider SaladRaider assigned jiefugong and linbrian and unassigned linbrian Feb 14, 2017
@tonyyanga tonyyanga self-assigned this Feb 17, 2017
@jiefugong commented:

https://cloud.google.com/sdk/gcloud/reference/container/clusters/resize

It looks like resizing a cluster down picks random nodes to remove rather than letting users specify them (this may be problematic without a replication controller) -- has anyone seen any other solutions?

@tonyyanga commented:

Is it possible to use kubectl delete to remove the node from the cluster (https://kubernetes.io/docs/user-guide/kubectl/kubectl_delete/) and then use gcloud to shutdown the Google Compute Engine VM?
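For concreteness, those two steps scripted as a sketch (the node name is just an example from this cluster):

```python
# Hedged sketch of the proposed two-step removal: deregister the node
# from Kubernetes, then delete the backing VM. The node name is an
# example only.
import subprocess

node = "gke-dev-default-pool-dbd2a02e-2hrq"
subprocess.check_call(["kubectl", "delete", "node", node])
subprocess.check_call(
    ["gcloud", "compute", "instances", "delete", node, "--quiet"]
)
```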

@yuvipanda commented Feb 17, 2017 via email

@yuvipanda commented:

I'm testing this out now. Will report back with findings.

@yuvipanda commented:

I tried several approaches, and found one that worked!

Failed approach 1 - gcloud container clusters resize.

This doesn't work for us since it could delete instances with running pods in them. This will disrupt users.

From https://cloud.google.com/container-engine/docs/resize-cluster:

Note that the managed instance group does not discern between instances running pods and instances without pods. Resizing down will pick instances to remove at random.

Failed approach 2 - Just delete the node with gcloud

I just deleted a node with gcloud:

gcloud compute instances delete gke-dev-default-pool-dbd2a02e-2hrq

This worked for about 1min - but then a new instance was automatically spawned to replace it.

Working approach - gcloud compute instance-groups managed delete-instances

Failed approach 2 made me look around to see what mechanism was creating these nodes, and I discovered managed instance groups. They are used to maintain a given number of instances at all times, so when they notice an instance has died, they create another one!

Some more digging led me to the instance-groups managed delete-instances command, which does exactly what we want. It deletes the instances we specifically ask it to, and reduces the number of instances the group maintains so they are not recreated.
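A sketch of how a scale-down step could be scripted around this command (group and instance names are examples; cordoning first is an extra precaution, not part of the command itself):

```python
# Hedged sketch: remove a specific node through its managed instance
# group so the group's target size shrinks and the VM isn't recreated.
# Group and instance names are examples; `--zone` may also be required.
import subprocess

group = "gke-dev-default-pool-dbd2a02e-grp"
instance = "gke-dev-default-pool-dbd2a02e-l39n"

# Extra precaution: stop new pods landing on the node while it goes away.
subprocess.check_call(["kubectl", "cordon", instance])

subprocess.check_call([
    "gcloud", "compute", "instance-groups", "managed", "delete-instances",
    group, f"--instances={instance}",
])
```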

@jiefugong commented:

Thanks for the help Yuvi!

Tested it myself and confirmed that this is what we're looking for. For future reference, the command looks like this for removing an instance in the dev cluster:

sudo gcloud compute instance-groups managed delete-instances gke-dev-default-pool-dbd2a02e-grp --instances=gke-dev-default-pool-dbd2a02e-l39n

where the instance group and instance name should be replaced as appropriate.
