
add servergroup for workerpools #170

Merged: 1 commit merged into gardener:master from openstack-servergroup3 on Dec 10, 2020

Conversation

@kon-angelo (Contributor) commented Oct 28, 2020

How to categorize this PR?

/area control-plane
/kind enhancement
/priority normal
/platform openstack

What this PR does / why we need it:
Allows creating workers of a worker pool in server groups. Server groups can have a defined affinity policy that protects the workers from being scheduled on the same hypervisor, thus improving HA setups.

Which issue(s) this PR fixes:
Fixes #163

Special notes for your reviewer:
As part of the reconciliation loop we add the capability to create server groups in the DeployMachineDependencies function. Background and discussion topics for some of the design choices follow:

  • We are mostly interested in introducing soft-anti-affinity and anti-affinity policies.
  • OpenStack APIs such as Nova have, aside from the major version, a microversion that changes the API behavior. The client uses the lowest supported microversion by default, but the soft-* affinity variants may only be introduced in a later microversion, as in ConvergedCloud's case. Our implementation parses and uses the highest supported microversion, but because this may not be supported in all OpenStack environments, we opted to let admins set the allowed policies in the CloudProfile.
  • Server groups by default have no update operations. Any requested change, e.g. to the policy, would mean destroying and recreating the server group (and performing a rolling update of the nodes).
  • Based on the previous point, we opted for the following implementation for the introduction of the feature: the validation of the server groups checks that the WorkerPoolProviderConfig is immutable, disabling any update to the server group itself.
  • Internally, though, the reconciliation loop supports such operations. If the spec changes, a new server group is created and new machine classes are created with a new hash. After the rolling node update, the old server group is eventually disposed of. The update operation is only disabled by validation at the moment.
  • Server groups are project-wide objects, but OpenStack doesn't support labels on objects, which makes it difficult to distinguish our resources from user-created ones. To avoid naming conflicts between multiple shoots in a project, we encode a labeling scheme into the resource's name: server groups are named clusterName-workerPoolName-<random-10char>. Names may be up to 255 characters, so there is plenty of room to increase the random component. As part of this naming scheme, any server group in the OpenStack project whose name has the clusterName-workerPoolName prefix is considered managed by gardener and subject to deletion (see the sketch after this list).
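To make the naming and ownership scheme from the last bullet concrete, here is a minimal sketch in Go; the helper names (generateName, isManaged) are illustrative and not the PR's actual code:

```go
package servergroups

import (
	"crypto/rand"
	"fmt"
	"math/big"
	"strings"
)

const (
	suffixLength = 10
	charset      = "abcdefghijklmnopqrstuvwxyz0123456789"
)

// generateName builds a server group name of the form
// <clusterName>-<workerPoolName>-<random-10char>.
func generateName(clusterName, workerPoolName string) (string, error) {
	suffix := make([]byte, suffixLength)
	for i := range suffix {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(len(charset))))
		if err != nil {
			return "", err
		}
		suffix[i] = charset[n.Int64()]
	}
	return fmt.Sprintf("%s-%s-%s", clusterName, workerPoolName, string(suffix)), nil
}

// isManaged reports whether an existing server group name carries the
// <clusterName>-<workerPoolName>- prefix and is therefore treated as
// gardener-managed and subject to deletion.
func isManaged(name, clusterName, workerPoolName string) bool {
	return strings.HasPrefix(name, fmt.Sprintf("%s-%s-", clusterName, workerPoolName))
}
```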

Merging is pending an update in MCM to include the server group scheduler changes.

Release note:

Adds an additional option for the worker pools to specify a server group policy. If this option is set, a new server group with the defined policy will be created and nodes managed by the worker pool will become members. Allowed policy values can be defined in the provider's `CloudProfile`.
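For illustration, a minimal sketch of what creating such a server group against Nova could look like with gophercloud; the helper createServerGroup and the hard-coded microversion are simplified assumptions, not the extension's actual implementation:

```go
package main

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/servergroups"
)

// createServerGroup creates a server group with the given policy. The soft-*
// policies only exist from Nova microversion 2.15 onwards, so the client's
// microversion is raised explicitly before the call.
func createServerGroup(compute *gophercloud.ServiceClient, name, policy string) (*servergroups.ServerGroup, error) {
	compute.Microversion = "2.15"
	return servergroups.Create(compute, servergroups.CreateOpts{
		Name:     name,
		Policies: []string{policy},
	}).Extract()
}

func main() {
	// Credentials are read from the usual OS_* environment variables.
	authOpts, err := openstack.AuthOptionsFromEnv()
	if err != nil {
		panic(err)
	}
	provider, err := openstack.AuthenticatedClient(authOpts)
	if err != nil {
		panic(err)
	}
	compute, err := openstack.NewComputeV2(provider, gophercloud.EndpointOpts{})
	if err != nil {
		panic(err)
	}
	sg, err := createServerGroup(compute, "mycluster-worker-a1b2c3d4e5", "soft-anti-affinity")
	if err != nil {
		panic(err)
	}
	fmt.Println("created server group:", sg.ID)
}
```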

@kon-angelo kon-angelo requested review from a team as code owners October 28, 2020 13:39
@gardener-robot gardener-robot added kind/api-change API change with impact on API users needs/second-opinion Needs second review by someone else labels Oct 28, 2020
@gardener-robot: @kon-angelo Label area/control-lane does not exist.

@gardener-robot gardener-robot added kind/enhancement Enhancement, improvement, extension platform/openstack OpenStack platform/infrastructure priority/normal needs/review Needs review size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) labels Oct 28, 2020
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 28, 2020
@kon-angelo kon-angelo changed the title from "fadd servergroup for workerpools" to "add servergroup for workerpools" Oct 28, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Oct 28, 2020
@gardener-robot gardener-robot added the area/control-plane Control plane related label Oct 28, 2020
@kon-angelo kon-angelo added the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Oct 28, 2020
@dkistner (Member): Awesome. Thanks. Will review it soon :)

Resolved review threads:
  • docs/usage-as-end-user.md (outdated)
  • pkg/apis/openstack/helper/scheme.go (outdated)
  • pkg/apis/openstack/types_worker.go (outdated)
  • pkg/apis/openstack/v1alpha1/types_worker.go (outdated, 2 threads)
  • pkg/openstack/client/client.go (3 threads)
  • pkg/openstack/client/compute.go
  • pkg/openstack/client/types.go
@gardener-robot gardener-robot removed the needs/review Needs review label Oct 29, 2020
@dkistner (Member) commented Nov 2, 2020: Can we move the PR to draft until the MCM change it depends on is integrated? gardener/machine-controller-manager@0094e63

@kon-angelo kon-angelo marked this pull request as draft November 2, 2020 09:26
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Nov 3, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 3, 2020
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Nov 3, 2020
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 3, 2020
@kon-angelo (Contributor, Author):

Main changes in the 2nd revision:

  • Rewrote update validation following @rfranzke's suggestions.
  • One additional validation rule has been added for shoot creation: using "hard" affinity on workers with multiple availability zones is now forbidden, since it results in impossible scheduling (thanks to @dkistner for helping discover this).
  • CloudProfile validation.
  • Docstring and export-scoping improvements.

Currently, the only configuration available is the `serverGroupConfig`. If this option is set, then a new server group will be created with the configured policy and all machines managed by this worker pool will be assigned as members of the group.

Please note that the available options for `serverGroupConfig` are defined in the provider specific section of `CloudProfile`.
When you specify `serverGroupConfig` in your worker configuration, a new server group will be created with the configured policy and all machines managed by this worker will be assigned as members of the created server group.
Review comment (Member): As an end-user, I would be interested whether that happens in place or whether my existing machines will be terminated/replaced by new machines (i.e., a rolling update)?

+ The `serverGroupConfig` section is optional, but if it is included in the shoot spec, it must contain a valid policy value.
+ The available `policy` values that can be used are defined in the provider-specific section of the `CloudProfile` by your operator.
+ Certain policy values may induce further constraints. Using the `affinity` value is only allowed when the workers utilize a single zone.
+ The `serverGroupConfig` and `serverGroupConfig.policy` fields are immutable upon creation. Users are not allowed to change the policy value, nor to add or remove a `serverGroupConfig` section from a worker after it has been created.
Review comment (Member): "from a worker" should be "from a worker pool".

+ The available `policy` values that can be used are defined in the provider-specific section of the `CloudProfile` by your operator.
+ Certain policy values may induce further constraints. Using the `affinity` value is only allowed when the workers utilize a single zone.
+ The `serverGroupConfig` and `serverGroupConfig.policy` fields are immutable upon creation. Users are not allowed to change the policy value, nor to add or remove a `serverGroupConfig` section from a worker after it has been created.
Users can, on the other hand, add new workers managed by server groups to an existing shoot and migrate their workloads to the new workers.
Reply (Member): OK, now reading this, the first sentence (with my comment about in-place) is clarified. Maybe this can be made clearer from the beginning, i.e., that all of this only applies to newly created worker pools, while existing pools are immutable.

```yaml
apiVersion: openstack.provider.extensions.gardener.cloud/v1alpha1
kind: WorkerPoolProviderConfig
serverGroupConfig:
```

Review comment (Member): Can we call this just `serverGroup`?

+ The `serverGroupConfig` section is optional, but if it is included in the shoot spec, it must contain a valid policy value.
+ The available `policy` values that can be used are defined in the provider-specific section of the `CloudProfile` by your operator.
+ Certain policy values may induce further constraints. Using the `affinity` value is only allowed when the workers utilize a single zone.
+ The `serverGroupConfig` and `serverGroupConfig.policy` fields are immutable upon creation. Users are not allowed to change the policy value, nor to add or remove a `serverGroupConfig` section from a worker after it has been created.
Review comment (Member): That would mean that if a user wants to change a worker pool to use a different server group policy, it would be required to create a new worker pool and tear down the old one?
Technically that is valid, as the server group policies are only evaluated at creation time of a machine, and the placement of the machine is based on this. So maybe we should mention this?

Comment on lines 109 to 110:

```go
allErrs = append(allErrs, field.Invalid(parent.Child("serverGroupConfig", "policy"), workerConfig.ServerGroupConfig.Policy, "no matching server group policy in cloudprofile"))
return
```

Review comment (Member): Only when we do not use named return values.

Suggested change:

```diff
-allErrs = append(allErrs, field.Invalid(parent.Child("serverGroupConfig", "policy"), workerConfig.ServerGroupConfig.Policy, "no matching server group policy in cloudprofile"))
-return
+return append(allErrs, field.Invalid(parent.Child("serverGroupConfig", "policy"), workerConfig.ServerGroupConfig.Policy, "no matching server group policy in cloudprofile"))
```

```go
			}
		}
	}
	return allErrs
}

func validateWorkerConfig(parent *field.Path, worker *core.Worker, workerConfig *api.WorkerConfig, cloudProfileConfig *api.CloudProfileConfig) (allErrs field.ErrorList) {
```

Review comment (Member): Just a minor thing: we do not really use named return values much. Would you consider declaring and initialising allErrs explicitly and returning it?

Comment on lines 91 to 92:

```go
allErrs = append(allErrs, field.Invalid(parent.Child("serverGroupConfig", "policy"), workerConfig.ServerGroupConfig.Policy, "policy field cannot be empty"))
return
```

Review comment (Member): Only when we do not use named return values.

Suggested change:

```diff
-allErrs = append(allErrs, field.Invalid(parent.Child("serverGroupConfig", "policy"), workerConfig.ServerGroupConfig.Policy, "policy field cannot be empty"))
-return
+return append(allErrs, field.Invalid(parent.Child("serverGroupConfig", "policy"), workerConfig.ServerGroupConfig.Policy, "policy field cannot be empty"))
```
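A minimal sketch of the explicit style the reviewer suggests, combining both validations quoted above; validateServerGroupPolicy and its signature are illustrative, not the PR's actual function:

```go
package validation

import (
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateServerGroupPolicy checks that a policy is set and matches one of
// the allowed values from the CloudProfile, returning allErrs explicitly
// instead of relying on a named return value.
func validateServerGroupPolicy(parent *field.Path, policy string, allowedPolicies []string) field.ErrorList {
	allErrs := field.ErrorList{}
	if policy == "" {
		return append(allErrs, field.Required(parent.Child("serverGroupConfig", "policy"), "policy field cannot be empty"))
	}
	for _, allowed := range allowedPolicies {
		if allowed == policy {
			return allErrs
		}
	}
	return append(allErrs, field.Invalid(parent.Child("serverGroupConfig", "policy"), policy, "no matching server group policy in cloudprofile"))
}
```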

Comment on lines 125 to 128:

```go
var (
	newWorkerCfg, oldWorkerCfg *api.WorkerConfig
	err                        error
)
```

Review comment (Member): Do we need to declare those variables explicitly? I mean, we can also declare and initialise them via := later.
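For context, a minimal sketch of what an immutability check over the decoded old and new worker configs could look like; the type and function names here are illustrative, not the PR's actual code:

```go
package validation

import (
	"k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// ServerGroupConfig is a stand-in for the worker config's server group section.
type ServerGroupConfig struct {
	Policy string
}

// validateServerGroupUpdate forbids any change to the server group section,
// including adding or removing it after the worker pool has been created,
// matching the documented immutability rule.
func validateServerGroupUpdate(path *field.Path, oldCfg, newCfg *ServerGroupConfig) field.ErrorList {
	allErrs := field.ErrorList{}
	if !equality.Semantic.DeepEqual(oldCfg, newCfg) {
		allErrs = append(allErrs, field.Forbidden(path.Child("serverGroupConfig"), "field is immutable"))
	}
	return allErrs
}
```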

```go
// a) worker is terminating and all server groups have to be deleted
// b) worker pool is deleted
// c) worker pool's server group configuration (e.g. policy) changed
// d) worker pool no longer requires use of server groups
```

Review comment (Member): Is case d) actually possible? I thought the server group + policy are immutable and can't be changed. I mean, anyway, the handling is the same.

Reply (Contributor, Author): The controller fully supports cases c) and d), although these are prohibited by validation for now. To support them, the internal implementation does need to do a rolling node update by generating new MachineClasses. But generally we can disable the immutability checks at any point in the future, and the deletion will work.


```go
// Include the given worker pool dependencies into the hash.
if serverGroupDependency != nil {
	additionalHashData = append(additionalHashData, serverGroupDependency.Name, serverGroupDependency.ID)
```

Review comment (Member): Out of curiosity: wouldn't it be sufficient to just add the server group dependency ID to the hash? I mean, this one should be unique.
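As an illustration of the mechanism under discussion, a minimal sketch of folding dependency data into a machine class hash; computeMachineClassHash is a hypothetical helper, not the PR's actual implementation:

```go
package worker

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// computeMachineClassHash folds the base class data and any additional
// dependency data (e.g. a server group's name and ID) into a short hash,
// so a changed dependency yields a new machine class name and therefore
// triggers a rolling update of the nodes.
func computeMachineClassHash(baseData string, additionalHashData ...string) string {
	sum := sha256.Sum256([]byte(baseData + "-" + strings.Join(additionalHashData, "-")))
	return hex.EncodeToString(sum[:])[:8]
}
```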

```go
// Source: github.com/gardener/gardener-extension-provider-openstack/pkg/openstack/client (interfaces: Factory,Compute)

// Package mocks is a generated GoMock package.
package mocks
```

Review comment (Member): I'm wondering if we should move all mock packages to pkg/mock, with subdirectories for the client or any other mock. g/g also does it this way. https://github.com/gardener/gardener/tree/master/pkg/mock

Reply (Contributor, Author): What I don't like about this:
a) if we need changes to the provider and need to use a new OpenStack service not mentioned here, we would need a g/g release.
b) I do not consider these interfaces stable enough to move them into the central repo. In comparison, the mcm-provider-openstack uses a slightly different structure. If we consolidate them, we could think about it, but it's premature now.

Reply (Member): My intention was not to move the mocks of the openstack extension to the g/g repo. I wanted to suggest moving all the mocks of the openstack extension to a central place like pkg/mocks in the openstack extension repo. Then the structure would be similar to the g/g repo. :)

Reply (Contributor, Author): I misunderstood. This is possible, yes. We can open an issue and give it a try.

@@ -12,22 +12,51 @@

```go
// See the License for the specific language governing permissions and
// limitations under the License.

//go:generate mockgen -destination=mocks/client_mocks.go -package=mocks . Factory,Compute
```

Review comment (Member): Same question as in pkg/openstack/client/mocks/client_mocks.go.

@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 3, 2020
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 3, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 3, 2020
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 3, 2020
@kon-angelo (Contributor, Author):

  • made the documentation more explicit about migration to worker pools with server groups
  • added explicit validation for when the serverGroup is missing or added, plus one more unit test for this
  • comments addressed

@rfranzke, can you maybe check the documentation one more time? I used the term "worker groups" instead of "pools" because that's how it's referred to in the UI. I hope it's better now.

@dkistner (Member) previously approved these changes Dec 8, 2020, leaving a comment:

After a pair review with @kon-angelo
/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging needs/changes Needs (more) changes and removed needs/changes Needs (more) changes needs/second-opinion Needs second review by someone else reviewed/lgtm Has approval for merging labels Dec 8, 2020
@gardener-robot gardener-robot removed the needs/changes Needs (more) changes label Dec 8, 2020
@gardener-robot gardener-robot added needs/review Needs review needs/second-opinion Needs second review by someone else labels Dec 10, 2020
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 10, 2020
@dkistner dkistner merged commit d22c1be into gardener:master Dec 10, 2020
@kon-angelo kon-angelo deleted the openstack-servergroup3 branch April 28, 2021 09:25
Labels
  • area/control-plane: Control plane related
  • kind/api-change: API change with impact on API users
  • kind/enhancement: Enhancement, improvement, extension
  • needs/ok-to-test: Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD)
  • needs/review: Needs review
  • needs/second-opinion: Needs second review by someone else
  • platform/openstack: OpenStack platform/infrastructure
  • reviewed/do-not-merge: Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies
  • size/xl: Size of pull request is huge (see gardener-robot robot/bots/size.py)

Development
Successfully merging this pull request may close these issues:
  • ServerGroup per Workerpool to define Node Affinity/Anti-Affinity (#163)

7 participants