Set up autoscaling for nodes #74

Closed · SamLau95 opened this issue Dec 3, 2016 · 8 comments

@SamLau95 commented Dec 3, 2016

We should automatically scale up and down nodes as needed.

See https://github.com/data-8/jupyterhub-k8s/blob/master/docs/autoscaling.md for an explanation of rationale and a starting point for implementation.

@yuvipanda commented Dec 28, 2016

'Autoscaling' can refer to three different things in Kubernetes:

  1. Horizontal Pod Autoscaling. This is useful when you have 5 pods serving the same stateless application, and need to automatically scale it up and down based on traffic / CPU / Memory usage. This does not apply to us at all.
  2. Vertical Pod Autoscaling / Autosizing. This grants individual pods more CPU / RAM / Disk based on certain criteria. This also does not apply to us at all.
  3. Cluster Autoscaling. This checks if there are enough nodes with enough CPU / RAM resources for scheduling the pods we want to schedule, and adds more nodes if needed. This is not built into kubernetes but a function of the cloud provider it runs on (such as Google / Azure), since this will need to provision new VMs. It might also destroy VMs when we are over provisioned. This is what we want.

Due to the way Kubernetes is architected, Cluster Autoscalers are easy to build. You have a program that runs in a loop, doing the following things (a rough sketch follows the list):

  1. Establish our goal for the cluster autoscaling. This could be something like 'be able to launch 50 pods with a 95th percentile waiting time of 5s' or 'cluster must be no more than 80% full at all times'.
  2. Make API request to Kubernetes, find distribution of pods among nodes.
  3. Check if new nodes need to be added for the cluster to fit the constraints we decided on in step 1.
  4. If so, make an API request to the cloud provider (Google / Azure / AWS) to spawn a new VM and add it to the Kubernetes cluster.
  5. If we can safely delete a VM (for some definition of 'safely'), then make an API request to do so.
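A minimal sketch of such a loop, assuming the official `kubernetes` Python client; `add_node()`/`remove_node()` are hypothetical stand-ins for the cloud-provider calls discussed below, and the utilization check is deliberately simplistic:

```python
# Toy cluster-autoscaler control loop (sketch, not an implementation).
import time

from kubernetes import client, config

TARGET_UTILIZATION = 0.8  # goal: cluster no more than 80% full

# Handles only the common binary suffixes; real quantities can also
# use decimal suffixes like "1G".
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity):
    """Parse a Kubernetes memory quantity such as '512Mi' into bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def memory_utilization(v1):
    """Fraction of allocatable memory already requested by pods."""
    allocatable = sum(
        to_bytes(node.status.allocatable["memory"])
        for node in v1.list_node().items
    )
    requested = sum(
        to_bytes(c.resources.requests["memory"])
        for pod in v1.list_pod_for_all_namespaces().items
        for c in pod.spec.containers
        if c.resources.requests and "memory" in c.resources.requests
    )
    return requested / allocatable


def add_node():
    print("would ask the cloud provider for a new VM")  # placeholder


def remove_node():
    print("would drain and delete an idle VM")  # placeholder


def main():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    while True:
        usage = memory_utilization(v1)
        if usage > TARGET_UTILIZATION:
            add_node()
        elif usage < TARGET_UTILIZATION / 2:
            remove_node()  # only if some node can be emptied safely
        time.sleep(60)


if __name__ == "__main__":
    main()
```

The hard part, as the rest of this comment explains, is not the loop itself but deciding when a node can be removed safely.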

GKE provides an alpha autoscaler that can be used to accomplish certain autoscaling goals. Unfortunately, it does not really work for us yet. Its documentation says:

Cluster Autoscaler assumes that all replicated pods can be restarted on some other node, possibly causing a brief disruption. If this is not the case then Cluster Autoscaler should not be used (yet).

This is not true for us at all: destroying pods may cause a student to lose their work, and those pods will not be restarted on another node without manual action. This means their definition of safety doesn't work for us yet. That will hopefully change in the future.

Also, it only spawns new nodes when there are pods that have already been scheduled but have no place to run. This means we cannot set a goal like 'cluster is at most 80% full at all times': the autoscaler is hard-coded to a particular goal, which is to kick in only after the cluster is 100% full. This will probably be made configurable in the future, but it isn't yet.

Azure Container Service doesn't have a default cluster autoscaler at all yet.

So for now, we'll have to write our own. It should not be too complicated, however.

@yuvipanda commented:

For Azure, we'll need to make this REST API call to change the agentPoolProfiles.count property.

For GKE I think we can use initialNodeCount in https://cloud.google.com/container-engine/reference/rest/v1/projects.zones.clusters.nodePools
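For concreteness, a sketch of what the GKE side might look like, assuming the `projects.zones.clusters.nodePools.setSize` REST method and a valid OAuth token; both assumptions should be verified against the reference above:

```python
# Hedged sketch: resize a GKE node pool via REST. Assumes the
# nodePools setSize method and an OAuth token with container scope;
# check the reference docs linked above before relying on this.
import requests


def resize_gke_node_pool(token, project, zone, cluster, pool, node_count):
    url = (
        "https://container.googleapis.com/v1"
        f"/projects/{project}/zones/{zone}"
        f"/clusters/{cluster}/nodePools/{pool}/setSize"
    )
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}"},
        json={"nodeCount": node_count},
    )
    resp.raise_for_status()
    return resp.json()
```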

@SaladRaider SaladRaider assigned jiefugong and linbrian and unassigned linbrian Feb 14, 2017
@tonyyanga tonyyanga self-assigned this Feb 17, 2017
@jiefugong commented:

https://cloud.google.com/sdk/gcloud/reference/container/clusters/resize

It looks like resizing a cluster down picks random nodes to remove rather than letting users specify them (this may be problematic without a replication controller) -- has anyone seen any other solutions?

@tonyyanga commented:

Is it possible to use kubectl delete to remove the node from the cluster (https://kubernetes.io/docs/user-guide/kubectl/kubectl_delete/) and then use gcloud to shutdown the Google Compute Engine VM?
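For concreteness, those two steps scripted as a sketch (the node name is just an example from this cluster):

```python
# Hedged sketch of the proposed two-step removal: deregister the node
# from Kubernetes, then delete the backing VM. The node name is an
# example only.
import subprocess

node = "gke-dev-default-pool-dbd2a02e-2hrq"
subprocess.check_call(["kubectl", "delete", "node", node])
subprocess.check_call(
    ["gcloud", "compute", "instances", "delete", node, "--quiet"]
)
```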

@yuvipanda commented Feb 17, 2017 via email

@yuvipanda commented:

I'm testing this out now. Will report back with findings.

@yuvipanda commented:

I tried several approaches, and found one that worked!

Failed approach 1 - gcloud container clusters resize.

This doesn't work for us since it could delete instances with running pods in them. This will disrupt users.

From https://cloud.google.com/container-engine/docs/resize-cluster:

Note that the managed instance group does not discern between instances running pods and instances without pods. Resizing down will pick instances to remove at random.

Failed approach 2 - Just delete the node with gcloud

I just deleted a node with gcloud:

gcloud compute instances delete gke-dev-default-pool-dbd2a02e-2hrq

This worked for about 1min - but then a new instance was automatically spawned to replace it.

Working approach - gcloud compute instance-groups managed delete-instances

Failed approach 2 made me look around to see what mechanism was creating these nodes, and I discovered managed instance groups. They are used to maintain a given number of instances at all times, so when they notice an instance has died, they create another one!

Some more digging led me to the instance-groups managed delete-instances command, which does exactly what we want. It deletes the instances we specifically ask it to, and reduces the number of instances the group maintains so they are not recreated.
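A sketch of how a scale-down step could be scripted around this command (group and instance names are examples; cordoning first is an extra precaution, not part of the command itself):

```python
# Hedged sketch: remove a specific node through its managed instance
# group so the group's target size shrinks and the VM isn't recreated.
# Group and instance names are examples; `--zone` may also be required.
import subprocess

group = "gke-dev-default-pool-dbd2a02e-grp"
instance = "gke-dev-default-pool-dbd2a02e-l39n"

# Extra precaution: stop new pods landing on the node while it goes away.
subprocess.check_call(["kubectl", "cordon", instance])

subprocess.check_call([
    "gcloud", "compute", "instance-groups", "managed", "delete-instances",
    group, f"--instances={instance}",
])
```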

@jiefugong commented:

Thanks for the help Yuvi!

Tested it myself and confirmed that this is what we're looking for. For future reference, the command looks like this for removing an instance in the dev cluster:

sudo gcloud compute instance-groups managed delete-instances gke-dev-default-pool-dbd2a02e-grp --instances=gke-dev-default-pool-dbd2a02e-l39n

where the instance group and instance name should be replaced as appropriate.
