-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial NodePool structure for cluster autoscaling #173
Conversation
When implementing cluster autoscaling, new fields for min and max size of worker ASG must be added to CR. In the near future this is information specific for a single node pool so it makes sense to introduce initial NodePool structure here and then improve it later when implementing actual node pools.
@@ -105,3 +106,13 @@ type ClusterKubernetesSSHUser struct { | |||
type ClusterNode struct { | |||
ID string `json:"id" yaml:"id"` | |||
} | |||
|
|||
type ClusterNodePool struct { | |||
ID string `json:"id" yaml:"id"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ID
makes more sense here than Name
although I can imagine that customer wants to name his/her node pools with some descriptive name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we had Name
on the cluster object and it was actually a Description
. We also had IDs on the nodes we track in the CRs so far and they have never been really useful I guess. Maybe in KVM. Uncertain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing here is that there should be some way to claim 1:1 match between node pool in Spec
and one in Status
. I wouldn't accept blind trust on keeping ordering the same. There is also necessity to identify one in AWS API when we have more than one of them. There's also this "ASG name" field required for cluster-autoscaler
configuration and I'm not sure if that is AWS resource ID or an actual identifier set by "human".
To summarize: We need something to identify specific node pool. The question is if it is machine generated (perhaps better option) or "set by human" (that could be just "workers" before we have actual node pools in the UI).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having an ID
for internal purposes and a Description
for customer use makes sense to me then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My request would be to just make sure if this stays ID
it is not set by human in any case. Otherwise Name
+ validation makes more sense IMO.
type ClusterNodePool struct { | ||
ID string `json:"id" yaml:"id"` | ||
Size ClusterNodePoolSize `json:"size" yaml:"size"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tags
would probably be also useful in the future when working on actual node pool roadmap story, but I don't add them here as it's out of scope.
} | ||
|
||
type AWSConfigStatusAWSNodePoolSize struct { | ||
Desired int `json:"desired" yaml:"desired"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking between Current
and Desired
but I think this makes more sense as CR is not 100% in sync with actual cluster state so this way it's semantically more correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow what this desired value should be here. We enter min and max, who tells us the desired number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is Status
section of AWSConfig
. aws-operator
should periodically check what are current values of {desired,max,min}
for given worker ASG in AWS (because cluster-autoscaler
adjusts them dynamically). Via this status field we can then provide this information for interested clients such as Happa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The status of the CR is the actual, the current state. The desired state is defined in the spec. I would not go with Desired
here as this says there is desired state in the current state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, Status
field is not current state most of the time because there is considerable delay between reality of ASG state and interval of CR status updates. When cluster-autoscaler
is active and present, I would also not go with Desired
in Spec as it should be Min
by default and properly adjusted by autoscaler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless Desired is ASG property I also find it confusing.
It is indeed an ASG property (that cluster-autoscaler
is adjusting to get ASG scale number of nodes). My worry here is that when ResyncPeriod
in aws-operator
is 5min (https://github.com/giantswarm/aws-operator/blob/master/service/controller/cluster.go#L1) a lot can happen during that period of time and when our target is to expose these CRs to end users, there is whole bunch of room for misunderstanding and failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this will be shared also by Azure. What is the naming there? Does it make sense to try to make it provider agnostic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This specific field is AWS specific status field, but you are correct in the sense that this concept is shared with Azure. The thing there just seems to be that VMSS only has Capacity
and no other config. It doesn't seem to have Min
nor Max
and therefore also the Desired
value is missing.
If you think that possible discrepancy of +/- 1-3 of what is actual number of nodes and what is in status field, does not matter, then I think I can settle with Size.Current
here, but it should be somehow clearly and explicitly documented to end users that this is not the actual state but last observed state that may have changed since then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...to add a bit: The reason why I think this matters is that we are in general direction towards exposing control plane API and these CRs to end users who would be then able to write their own controllers that monitor these resources and react based on values presented here. If anyone confuses this field to be present state then the logic executed could be incorrect and potentially harmful in many ways. Once it's in use, it's also difficult to change so that's why the naming in these kind of cases matter, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing there just seems to be that VMSS only has Capacity and no other config
So unless we make it provider specific I don't see a point of creating more properties. We can't set them for Azure. Isn't VMSS Capacity
a max? For the current value we can take number of instances in VMSS. If not Size.Current
can just become Size
IMO.
it should be somehow clearly and explicitly documented to end users that this is not the actual state but last observed state that may have changed since then.
It should be clear that we don't update CRs in NRT. What we should educate users is that status
is last observed state in general. And it can be like 5min old.
} | ||
|
||
type AWSConfigStatusAWSNodePoolSize struct { | ||
Desired int `json:"desired" yaml:"desired"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The status of the CR is the actual, the current state. The desired state is defined in the spec. I would not go with Desired
here as this says there is desired state in the current state.
type AWSConfigStatusAWSNodePoolSize struct { | ||
Desired int `json:"desired" yaml:"desired"` | ||
Max int `json:"max" yaml:"max"` | ||
Min int `json:"min" yaml:"min"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Min and max are given in the spec. No need to mirror them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Will remove.
This is imo a step in the wrong direction. Node pools will be defined in |
Ok. The title and description is now misleading, but based on discussion we had in Team Spirit, I dropped the notion of node pools here and just added a structure to configure worker's ASG. There's no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for me. The naming here is pretty irrelevant for me personally. Does not matter if it is ASG
or Scaling
or whatever, because the structures will be outdated when we go with node pools and the cluster API magics.
Etcd AWSConfigSpecAWSEtcd `json:"etcd" yaml:"etcd"` | ||
AvailabilityZones int `json:"availabilityZones" yaml:"availabilityZones"` | ||
// ASG contains configuration for current worker's ASG. | ||
ASG AWSConfigSpecAWSASG `json:"asg" yaml:"asg"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be AWS specific. Actually I don't know why the availability zones are in here as well. I tried to make sure that this is general configuration independent of provider. Don't we have a shared struct for this?
@xh3b4sd and I had a chat about this. Since the autoscaler (and the autoscaling group) can have a different value for desired than the actual number of nodes it would be good to add the desired number of nodes to the status as well. Eg the autoscaler could define that 5 nodes are desired but still only 4 nodes are running. This could be only short term until the 5th node is started but it could also be for longer if eg the instance type is unavailable in the AZs. See my comment here: https://github.com/giantswarm/giantswarm/pull/2206/files#r238167965 |
@tuommaki this looks good now. Questions:
|
I didn't consider availability zones here since they aren't part of the story. When we don't work on Azure implementation for them, I wouldn't mix related work here. Otherwise we need to take into account required migration for that as well.
Do you mean having the status field for present value of
When we create structure that should be provider agnostic, I feel like the terminology should be as well. Hence |
Sorry I overlooked that. |
When implementing cluster autoscaling, new fields for min and max size
of worker ASG must be added to CR. In the near future this is
information specific for a single node pool so it makes sense to
introduce initial NodePool structure here and then improve it later when
implementing actual node pools.
Towards https://github.com/giantswarm/giantswarm/issues/4543