Add initial NodePool structure for cluster autoscaling #173

tuommaki · 2018-11-29T12:05:57Z

When implementing cluster autoscaling, new fields for min and max size
of worker ASG must be added to CR. In the near future this is
information specific for a single node pool so it makes sense to
introduce initial NodePool structure here and then improve it later when
implementing actual node pools.

Towards https://github.com/giantswarm/giantswarm/issues/4543

When implementing cluster autoscaling, new fields for min and max size of worker ASG must be added to CR. In the near future this is information specific for a single node pool so it makes sense to introduce initial NodePool structure here and then improve it later when implementing actual node pools.

tuommaki · 2018-11-29T12:06:57Z

pkg/apis/provider/v1alpha1/cluster_types.go

@@ -105,3 +106,13 @@ type ClusterKubernetesSSHUser struct {
 type ClusterNode struct {
 	ID string `json:"id" yaml:"id"`
 }
+
+type ClusterNodePool struct {
+	ID   string              `json:"id" yaml:"id"`


I think ID makes more sense here than Name although I can imagine that customer wants to name his/her node pools with some descriptive name.

I think we had Name on the cluster object and it was actually a Description. We also had IDs on the nodes we track in the CRs so far and they have never been really useful I guess. Maybe in KVM. Uncertain.

The thing here is that there should be some way to claim 1:1 match between node pool in Spec and one in Status. I wouldn't accept blind trust on keeping ordering the same. There is also necessity to identify one in AWS API when we have more than one of them. There's also this "ASG name" field required for cluster-autoscaler configuration and I'm not sure if that is AWS resource ID or an actual identifier set by "human".

To summarize: We need something to identify specific node pool. The question is if it is machine generated (perhaps better option) or "set by human" (that could be just "workers" before we have actual node pools in the UI).

Having an ID for internal purposes and a Description for customer use makes sense to me then.

My request would be to just make sure if this stays ID it is not set by human in any case. Otherwise Name + validation makes more sense IMO.

tuommaki · 2018-11-29T12:07:47Z

pkg/apis/provider/v1alpha1/cluster_types.go

+type ClusterNodePool struct {
+	ID   string              `json:"id" yaml:"id"`
+	Size ClusterNodePoolSize `json:"size" yaml:"size"`
+}


Tags would probably be also useful in the future when working on actual node pool roadmap story, but I don't add them here as it's out of scope.

tuommaki · 2018-11-29T12:08:43Z

pkg/apis/provider/v1alpha1/aws_types.go

+}
+
+type AWSConfigStatusAWSNodePoolSize struct {
+	Desired int `json:"desired" yaml:"desired"`


I was thinking between Current and Desired but I think this makes more sense as CR is not 100% in sync with actual cluster state so this way it's semantically more correct.

I don't follow what this desired value should be here. We enter min and max, who tells us the desired number?

This is Status section of AWSConfig. aws-operator should periodically check what are current values of {desired,max,min} for given worker ASG in AWS (because cluster-autoscaler adjusts them dynamically). Via this status field we can then provide this information for interested clients such as Happa.

The status of the CR is the actual, the current state. The desired state is defined in the spec. I would not go with Desired here as this says there is desired state in the current state.

No, Status field is not current state most of the time because there is considerable delay between reality of ASG state and interval of CR status updates. When cluster-autoscaler is active and present, I would also not go with Desired in Spec as it should be Min by default and properly adjusted by autoscaler.

Unless Desired is ASG property I also find it confusing.

It is indeed an ASG property (that cluster-autoscaler is adjusting to get ASG scale number of nodes). My worry here is that when ResyncPeriod in aws-operator is 5min (https://github.com/giantswarm/aws-operator/blob/master/service/controller/cluster.go#L1) a lot can happen during that period of time and when our target is to expose these CRs to end users, there is whole bunch of room for misunderstanding and failure.

But this will be shared also by Azure. What is the naming there? Does it make sense to try to make it provider agnostic?

This specific field is AWS specific status field, but you are correct in the sense that this concept is shared with Azure. The thing there just seems to be that VMSS only has Capacity and no other config. It doesn't seem to have Min nor Max and therefore also the Desired value is missing.

If you think that possible discrepancy of +/- 1-3 of what is actual number of nodes and what is in status field, does not matter, then I think I can settle with Size.Current here, but it should be somehow clearly and explicitly documented to end users that this is not the actual state but last observed state that may have changed since then.

...to add a bit: The reason why I think this matters is that we are in general direction towards exposing control plane API and these CRs to end users who would be then able to write their own controllers that monitor these resources and react based on values presented here. If anyone confuses this field to be present state then the logic executed could be incorrect and potentially harmful in many ways. Once it's in use, it's also difficult to change so that's why the naming in these kind of cases matter, I think.

The thing there just seems to be that VMSS only has Capacity and no other config

So unless we make it provider specific I don't see a point of creating more properties. We can't set them for Azure. Isn't VMSS Capacity a max? For the current value we can take number of instances in VMSS. If not Size.Current can just become Size IMO.

it should be somehow clearly and explicitly documented to end users that this is not the actual state but last observed state that may have changed since then.

It should be clear that we don't update CRs in NRT. What we should educate users is that status is last observed state in general. And it can be like 5min old.

xh3b4sd · 2018-11-29T12:56:14Z

pkg/apis/provider/v1alpha1/aws_types.go

+}
+
+type AWSConfigStatusAWSNodePoolSize struct {
+	Desired int `json:"desired" yaml:"desired"`


The status of the CR is the actual, the current state. The desired state is defined in the spec. I would not go with Desired here as this says there is desired state in the current state.

xh3b4sd · 2018-11-29T12:56:31Z

pkg/apis/provider/v1alpha1/aws_types.go

+type AWSConfigStatusAWSNodePoolSize struct {
+	Desired int `json:"desired" yaml:"desired"`
+	Max     int `json:"max" yaml:"max"`
+	Min     int `json:"min" yaml:"min"`


Min and max are given in the spec. No need to mirror them here.

Agree. Will remove.

teemow · 2018-11-29T17:33:32Z

This is imo a step in the wrong direction. Node pools will be defined in MachineSet resources in the cluster api.

tuommaki · 2018-12-03T14:21:33Z

Ok. The title and description is now misleading, but based on discussion we had in Team Spirit, I dropped the notion of node pools here and just added a structure to configure worker's ASG.

There's no Status field for this any more after we paired with Tim and realized that when StatusCluster already has reconciled number of nodes as part of it, there's no additional information provided via ASG's Desired field.

xh3b4sd

That works for me. The naming here is pretty irrelevant for me personally. Does not matter if it is ASG or Scaling or whatever, because the structures will be outdated when we go with node pools and the cluster API magics.

teemow · 2018-12-03T16:10:24Z

pkg/apis/provider/v1alpha1/aws_types.go

-	Etcd              AWSConfigSpecAWSEtcd `json:"etcd" yaml:"etcd"`
+	AvailabilityZones int `json:"availabilityZones" yaml:"availabilityZones"`
+	// ASG contains configuration for current worker's ASG.
+	ASG              AWSConfigSpecAWSASG  `json:"asg" yaml:"asg"`


This shouldn't be AWS specific. Actually I don't know why the availability zones are in here as well. I tried to make sure that this is general configuration independent of provider. Don't we have a shared struct for this?

teemow · 2018-12-03T17:04:11Z

@xh3b4sd and I had a chat about this. Since the autoscaler (and the autoscaling group) can have a different value for desired than the actual number of nodes it would be good to add the desired number of nodes to the status as well. Eg the autoscaler could define that 5 nodes are desired but still only 4 nodes are running. This could be only short term until the 5th node is started but it could also be for longer if eg the instance type is unavailable in the AZs.

See my comment here: https://github.com/giantswarm/giantswarm/pull/2206/files#r238167965

teemow · 2018-12-04T12:15:55Z

@tuommaki this looks good now.

Questions:

What about the availability zones? Shall we keep them as is? It is likely that we will not add azure AZs before node pools.
What about the desired count of the autoscaler? See Add initial NodePool structure for cluster autoscaling #173 (comment)

tuommaki · 2018-12-04T12:29:33Z

What about the availability zones? Shall we keep them as is? It is likely that we will not add azure AZs before node pools.

I didn't consider availability zones here since they aren't part of the story. When we don't work on Azure implementation for them, I wouldn't mix related work here. Otherwise we need to take into account required migration for that as well.

What about the desired count of the autoscaler? See Add initial NodePool structure for cluster autoscaling #173 (comment)

Do you mean having the status field for present value of Desired property in ASG? That's the StatusClusterScaling.Size value: https://github.com/giantswarm/apiextensions/pull/173/files#diff-9a7661b5789fe3a5ed215d862c2c3b1eR110

In AWS AutoScalingGroup has Min, Max, and Desired properties.
In Azure VirtualMachineScaleSet has Capacity.

When we create structure that should be provider agnostic, I feel like the terminology should be as well. Hence Size instead of Desired.

teemow · 2018-12-04T13:00:39Z

That's the StatusClusterScaling.Size value

Sorry I overlooked that. DesiredCapacity would be a good name for it imo (it is also named like this in the aws api: https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_SetDesiredCapacity.html). Size doesn't tell if it is actual or desired.

tuommaki self-assigned this Nov 29, 2018

tuommaki commented Nov 29, 2018

View reviewed changes

tuommaki requested review from a team November 29, 2018 12:09

xh3b4sd reviewed Nov 29, 2018

View reviewed changes

Remove duplicated information.

0a83164

teemow mentioned this pull request Dec 3, 2018

Support autoscaling of tenant clusters giantswarm/api-spec#121

Merged

tuommaki and others added 3 commits December 3, 2018 16:16

Change from NodePools towards ASG configuration

0ef3e02

Remove last bits of status fields

680d179

Merge branch 'master' into add-nodepools-for-autoscaling

e1fdaef

xh3b4sd approved these changes Dec 3, 2018

View reviewed changes

teemow reviewed Dec 3, 2018

View reviewed changes

Change structure and naming according to review comments

4f65b4a

Change field name based on review comments

8bf76d3

teemow approved these changes Dec 4, 2018

View reviewed changes

tuommaki merged commit fd67263 into master Dec 4, 2018

tuommaki deleted the add-nodepools-for-autoscaling branch December 4, 2018 13:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add initial NodePool structure for cluster autoscaling #173

Add initial NodePool structure for cluster autoscaling #173

tuommaki commented Nov 29, 2018 •

edited

Loading

tuommaki Nov 29, 2018

xh3b4sd Nov 29, 2018

tuommaki Nov 29, 2018

xh3b4sd Nov 29, 2018

kopiczko Nov 29, 2018

tuommaki Nov 29, 2018

tuommaki Nov 29, 2018

MarcelMue Nov 29, 2018

tuommaki Nov 29, 2018

xh3b4sd Nov 29, 2018

tuommaki Nov 29, 2018

tuommaki Nov 30, 2018

kopiczko Nov 30, 2018

tuommaki Nov 30, 2018

tuommaki Nov 30, 2018

kopiczko Nov 30, 2018

xh3b4sd Nov 29, 2018

xh3b4sd Nov 29, 2018

tuommaki Nov 29, 2018

teemow commented Nov 29, 2018

tuommaki commented Dec 3, 2018

xh3b4sd left a comment

teemow Dec 3, 2018

teemow commented Dec 3, 2018

teemow commented Dec 4, 2018

tuommaki commented Dec 4, 2018

teemow commented Dec 4, 2018

Add initial NodePool structure for cluster autoscaling #173

Add initial NodePool structure for cluster autoscaling #173

Conversation

tuommaki commented Nov 29, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

teemow commented Nov 29, 2018

tuommaki commented Dec 3, 2018

xh3b4sd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

teemow commented Dec 3, 2018

teemow commented Dec 4, 2018

tuommaki commented Dec 4, 2018

teemow commented Dec 4, 2018

tuommaki commented Nov 29, 2018 •

edited

Loading