Add initial NodePool structure for cluster autoscaling #173

Merged
tuommaki merged 7 commits into master from add-nodepools-for-autoscaling on Dec 4, 2018

@tuommaki tuommaki commented Nov 29, 2018

When implementing cluster autoscaling, new fields for the min and max size
of the worker ASG must be added to the CR. In the near future this
information will be specific to a single node pool, so it makes sense to
introduce an initial NodePool structure here and improve it later when
implementing actual node pools.

Towards https://github.com/giantswarm/giantswarm/issues/4543

@tuommaki tuommaki self-assigned this Nov 29, 2018
@@ -105,3 +106,13 @@ type ClusterKubernetesSSHUser struct {
type ClusterNode struct {
	ID string `json:"id" yaml:"id"`
}

type ClusterNodePool struct {
	ID string `json:"id" yaml:"id"`
Contributor Author

I think ID makes more sense here than Name, although I can imagine that customers will want to give their node pools descriptive names.

Contributor

I think we had Name on the cluster object and it was actually a Description. We also had IDs on the nodes we track in the CRs so far and they have never been really useful I guess. Maybe in KVM. Uncertain.

Contributor Author

The thing here is that there should be some way to claim a 1:1 match between a node pool in Spec and the one in Status. I wouldn't blindly trust the ordering to stay the same. We also need to identify a specific pool in the AWS API once we have more than one of them. And there's this "ASG name" field required for the cluster-autoscaler configuration; I'm not sure whether that is an AWS resource ID or an identifier set by a "human".

To summarize: we need something to identify a specific node pool. The question is whether it is machine-generated (perhaps the better option) or "set by a human" (which could be just "workers" until we have actual node pools in the UI).
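
The ID-based matching argued for above can be sketched roughly as follows. This is only an illustration: `NodePoolSpec`, `NodePoolStatus` and `MatchByID` are hypothetical names and simplified shapes, not the types in this PR. The point is that pairing Spec and Status entries through a map keyed by ID does not depend on slice ordering.

```go
package main

import "fmt"

// NodePoolSpec and NodePoolStatus are hypothetical, simplified stand-ins for
// the Spec and Status entries discussed above; they are not the real CR types.
type NodePoolSpec struct {
	ID  string
	Min int
	Max int
}

type NodePoolStatus struct {
	ID   string
	Size int
}

// MatchByID pairs every spec entry with the status entry that carries the same
// ID. It also returns the spec IDs for which no status entry was found.
func MatchByID(specs []NodePoolSpec, statuses []NodePoolStatus) (map[string]NodePoolStatus, []string) {
	byID := make(map[string]NodePoolStatus, len(statuses))
	for _, s := range statuses {
		byID[s.ID] = s
	}

	matched := make(map[string]NodePoolStatus, len(specs))
	var missing []string
	for _, spec := range specs {
		status, ok := byID[spec.ID]
		if !ok {
			missing = append(missing, spec.ID)
			continue
		}
		matched[spec.ID] = status
	}
	return matched, missing
}

func main() {
	specs := []NodePoolSpec{{ID: "a1b2c", Min: 3, Max: 10}, {ID: "d3e4f", Min: 1, Max: 5}}
	// Only one pool has been reported so far, and ordering plays no role.
	statuses := []NodePoolStatus{{ID: "d3e4f", Size: 2}}

	matched, missing := MatchByID(specs, statuses)
	fmt.Println(matched["d3e4f"].Size, missing)
}
```

Whether the ID is generated by a machine or set by a human does not change this lookup; it only changes who guarantees uniqueness.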

Contributor

Having an ID for internal purposes and a Description for customer use makes sense to me then.

Member

My only request: if this stays ID, make sure it is never set by a human. Otherwise Name + validation makes more sense IMO.

type ClusterNodePool struct {
	ID   string              `json:"id" yaml:"id"`
	Size ClusterNodePoolSize `json:"size" yaml:"size"`
}
Contributor Author

Tags would probably also be useful in the future when working on the actual node pool roadmap story, but I'm not adding them here as that's out of scope.

}

type AWSConfigStatusAWSNodePoolSize struct {
	Desired int `json:"desired" yaml:"desired"`
Contributor Author

I was torn between Current and Desired, but I think Desired makes more sense: the CR is not 100% in sync with the actual cluster state, so this name is semantically more correct.

Contributor

I don't follow what this desired value should be here. We enter min and max, who tells us the desired number?

Contributor Author

This is the Status section of AWSConfig. aws-operator should periodically check the current values of {desired,max,min} for the given worker ASG in AWS (because cluster-autoscaler adjusts them dynamically). Via this status field we can then provide this information to interested clients such as Happa.
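
A rough sketch of the periodic check described above, under stated assumptions: `ObservedASG`, `ASGReader`, `SyncStatus` and `fakeReader` are hypothetical names introduced only for illustration, and no real AWS SDK call is shown. `NodePoolSizeStatus` mirrors the status struct as proposed at this point of the review.

```go
package main

import (
	"errors"
	"fmt"
)

// ObservedASG is a hypothetical snapshot of the values read from AWS for one
// worker ASG. It is not an AWS SDK type.
type ObservedASG struct {
	Desired int
	Max     int
	Min     int
}

// NodePoolSizeStatus mirrors the status struct proposed in this diff.
type NodePoolSizeStatus struct {
	Desired int `json:"desired" yaml:"desired"`
	Max     int `json:"max" yaml:"max"`
	Min     int `json:"min" yaml:"min"`
}

// ASGReader is a hypothetical abstraction over whatever AWS API call the
// operator would make on every resync.
type ASGReader interface {
	ReadASG(name string) (ObservedASG, error)
}

// SyncStatus copies the last observed ASG values into the status and reports
// whether anything changed, so the caller can skip needless CR updates.
func SyncStatus(r ASGReader, asgName string, status *NodePoolSizeStatus) (bool, error) {
	observed, err := r.ReadASG(asgName)
	if err != nil {
		return false, fmt.Errorf("reading ASG %q: %w", asgName, err)
	}
	next := NodePoolSizeStatus{Desired: observed.Desired, Max: observed.Max, Min: observed.Min}
	if *status == next {
		return false, nil
	}
	*status = next
	return true, nil
}

// fakeReader stands in for AWS in this example.
type fakeReader map[string]ObservedASG

func (f fakeReader) ReadASG(name string) (ObservedASG, error) {
	asg, ok := f[name]
	if !ok {
		return ObservedASG{}, errors.New("ASG not found")
	}
	return asg, nil
}

func main() {
	reader := fakeReader{"workers": {Desired: 4, Max: 10, Min: 3}}
	var status NodePoolSizeStatus

	changed, err := SyncStatus(reader, "workers", &status)
	fmt.Println(changed, err, status.Desired)
}
```

Between two calls to `SyncStatus` the autoscaler may change the ASG again, so the status only ever holds the last observed values.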

Contributor

The status of the CR is the actual, the current state. The desired state is defined in the spec. I would not go with Desired here as this says there is desired state in the current state.

Contributor Author

No, the Status field is not the current state most of the time, because there is a considerable delay between the real ASG state and the CR status updates. When cluster-autoscaler is active, I would also not put Desired in the Spec, as it should default to Min and then be adjusted by the autoscaler.

Contributor Author

Unless Desired is ASG property I also find it confusing.

It is indeed an ASG property (the one cluster-autoscaler adjusts to make the ASG scale the number of nodes). My worry here is that with a ResyncPeriod of 5min in aws-operator (https://github.com/giantswarm/aws-operator/blob/master/service/controller/cluster.go#L1) a lot can happen during that time, and since our goal is to expose these CRs to end users, there is plenty of room for misunderstanding and failure.

Member

But this will be shared also by Azure. What is the naming there? Does it make sense to try to make it provider agnostic?

Contributor Author

This specific field is AWS specific status field, but you are correct in the sense that this concept is shared with Azure. The thing there just seems to be that VMSS only has Capacity and no other config. It doesn't seem to have Min nor Max and therefore also the Desired value is missing.

If you think that a possible discrepancy of +/- 1-3 between the actual number of nodes and the status field does not matter, then I can settle for Size.Current here, but it should be somehow clearly and explicitly documented to end users that this is not the actual state but last observed state that may have changed since then.

Contributor Author

...to add a bit: the reason I think this matters is that we are generally moving towards exposing the control plane API and these CRs to end users, who would then be able to write their own controllers that monitor these resources and react to the values presented here. If anyone mistakes this field for the present state, the logic they execute could be incorrect and potentially harmful in many ways. Once it's in use it's also difficult to change, which is why naming matters in these kinds of cases, I think.

Member

The thing there just seems to be that VMSS only has Capacity and no other config

So unless we make it provider specific I don't see the point of creating more properties. We can't set them for Azure. Isn't VMSS Capacity a max? For the current value we can take the number of instances in the VMSS. If not, Size.Current can just become Size IMO.

it should be somehow clearly and explicitly documented to end users that this is not the actual state but last observed state that may have changed since then.

It should be clear that we don't update CRs in near real time (NRT). What we should teach users is that status is the last observed state in general, and that it can be around 5min old.

@tuommaki tuommaki requested review from a team November 29, 2018 12:09
type AWSConfigStatusAWSNodePoolSize struct {
	Desired int `json:"desired" yaml:"desired"`
	Max     int `json:"max" yaml:"max"`
	Min     int `json:"min" yaml:"min"`
Contributor

Min and max are given in the spec. No need to mirror them here.

Contributor Author

Agree. Will remove.

Member

teemow commented Nov 29, 2018

This is imo a step in the wrong direction. Node pools will be defined in MachineSet resources in the cluster api.

Contributor Author

tuommaki commented Dec 3, 2018

Ok. The title and description are now misleading, but based on the discussion we had in Team Spirit, I dropped the notion of node pools here and just added a structure to configure the workers' ASG.

There's no Status field for this any more: after pairing with Tim we realized that, since StatusCluster already contains the reconciled number of nodes, the ASG's Desired field provides no additional information.

Contributor

@xh3b4sd xh3b4sd left a comment

That works for me. The naming here is pretty irrelevant for me personally. Does not matter if it is ASG or Scaling or whatever, because the structures will be outdated when we go with node pools and the cluster API magics.

	Etcd              AWSConfigSpecAWSEtcd `json:"etcd" yaml:"etcd"`
	AvailabilityZones int                  `json:"availabilityZones" yaml:"availabilityZones"`
	// ASG contains configuration for current worker's ASG.
	ASG AWSConfigSpecAWSASG `json:"asg" yaml:"asg"`
Member

This shouldn't be AWS specific. Actually I don't know why the availability zones are in here as well. I tried to make sure that this is general configuration independent of provider. Don't we have a shared struct for this?

Member

teemow commented Dec 3, 2018

@xh3b4sd and I had a chat about this. Since the autoscaler (and the autoscaling group) can have a desired value that differs from the actual number of nodes, it would be good to add the desired number of nodes to the status as well. E.g. the autoscaler could define that 5 nodes are desired while only 4 nodes are running. This could be short term, until the 5th node has started, but it could also last longer, e.g. if the instance type is unavailable in the AZs.
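
The 5-desired versus 4-running case can be made concrete with a small sketch. `ScalingStatus`, its two fields and the `Drift` method are hypothetical names chosen for illustration; they only show why keeping both numbers in the status carries information that a single number does not.

```go
package main

import "fmt"

// ScalingStatus is a hypothetical status shape that keeps the autoscaler's
// target next to the number of nodes that are actually running.
type ScalingStatus struct {
	DesiredCapacity int
	RunningNodes    int
}

// Drift returns how many nodes are still missing (positive) or still to be
// removed (negative) before the cluster matches the desired capacity.
func (s ScalingStatus) Drift() int {
	return s.DesiredCapacity - s.RunningNodes
}

func main() {
	// The autoscaler wants 5 nodes, but only 4 are running so far.
	s := ScalingStatus{DesiredCapacity: 5, RunningNodes: 4}
	fmt.Println(s.Drift()) // prints 1
}
```

A drift that stays non-zero for a long time would be a hint at the kind of problem mentioned above, such as an instance type being unavailable in the AZs.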

See my comment here: https://github.com/giantswarm/giantswarm/pull/2206/files#r238167965

Member

teemow commented Dec 4, 2018

@tuommaki this looks good now.

Questions:

  1. What about the availability zones? Shall we keep them as is? It is likely that we will not add azure AZs before node pools.
  2. What about the desired count of the autoscaler? See Add initial NodePool structure for cluster autoscaling #173 (comment)

Contributor Author

tuommaki commented Dec 4, 2018

  1. What about the availability zones? Shall we keep them as is? It is likely that we will not add azure AZs before node pools.

I didn't consider availability zones here since they aren't part of the story. As long as we aren't working on the Azure implementation for them, I wouldn't mix related work in here. Otherwise we would need to take the required migration into account as well.

  2. What about the desired count of the autoscaler? See Add initial NodePool structure for cluster autoscaling #173 (comment)

Do you mean having the status field for present value of Desired property in ASG? That's the StatusClusterScaling.Size value: https://github.com/giantswarm/apiextensions/pull/173/files#diff-9a7661b5789fe3a5ed215d862c2c3b1eR110

  • In AWS AutoScalingGroup has Min, Max, and Desired properties.
  • In Azure VirtualMachineScaleSet has Capacity.

When we create a structure that should be provider-agnostic, I feel the terminology should be as well. Hence Size instead of Desired.
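
The provider-agnostic naming can be sketched as follows. The provider structs here are simplified, hypothetical stand-ins (not SDK types), and `StatusClusterScaling` is reduced to the single `Size` field named in the discussion. The sketch assumes that the VMSS Capacity plays the same role as the ASG Desired value, which is exactly the point debated in this thread.

```go
package main

import "fmt"

// AWSAutoScalingGroup and AzureScaleSet are simplified, hypothetical stand-ins
// for the provider resources; they are not SDK types.
type AWSAutoScalingGroup struct {
	Min     int
	Max     int
	Desired int
}

type AzureScaleSet struct {
	Capacity int
}

// StatusClusterScaling is reduced to the Size field named in the discussion.
type StatusClusterScaling struct {
	Size int
}

// ScalingTarget is a hypothetical interface that hides the provider-specific
// name of the target size.
type ScalingTarget interface {
	TargetSize() int
}

// On AWS the target size is the Desired property of the ASG.
func (g AWSAutoScalingGroup) TargetSize() int { return g.Desired }

// On Azure this sketch assumes Capacity is the target size of the VMSS.
func (s AzureScaleSet) TargetSize() int { return s.Capacity }

// StatusFrom builds the provider-agnostic status from either provider.
func StatusFrom(t ScalingTarget) StatusClusterScaling {
	return StatusClusterScaling{Size: t.TargetSize()}
}

func main() {
	fmt.Println(StatusFrom(AWSAutoScalingGroup{Min: 3, Max: 10, Desired: 5}).Size)
	fmt.Println(StatusFrom(AzureScaleSet{Capacity: 4}).Size)
}
```

With this shape, consumers read one field regardless of provider, at the cost of the name no longer saying whether the value is desired or actual.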

Member

teemow commented Dec 4, 2018

That's the StatusClusterScaling.Size value

Sorry I overlooked that. DesiredCapacity would be a good name for it imo (it is also named like this in the aws api: https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_SetDesiredCapacity.html). Size doesn't tell if it is actual or desired.

@tuommaki tuommaki merged commit fd67263 into master Dec 4, 2018
@tuommaki tuommaki deleted the add-nodepools-for-autoscaling branch December 4, 2018 13:14