Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Anexia Provider: Utilize Creating state instead of blocking Create call #10

Closed
wants to merge 35 commits into from

Conversation

marioreggiori
Copy link

What this PR does / why we need it:

The machine-controller calls the providers Create method sequentially and only starts with the provisioning of another instance after Create returned without any errors. This PR changes the

  • Create method to not block until the VM provisioning has completed
  • Get method to check via the instance status whether InstanceID and ProvisioningID are set

and utilizes the Creating state until the VM is ready.

This reduces the provisioning duration of node pools with more than one replica significantly as there are now only a couple of seconds (~10s) between the provisioning starts.

Which issue(s) this PR fixes (optional, in fixes #<issue number> format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Optional Release Note:


embik and others added 30 commits September 2, 2022 14:05
* Check in partially migrated code base

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Finish migration to AWS SDK v2

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Fix linter issues

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Bump memory to 7Gi to prevent random OOM kills

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Move AWS SDK requires into shared block

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>
* Update to Go 1.19

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

* Update license validation CI job

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

* refactored code

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
…bermatic#1424)

Signed-off-by: Sachin Tiptur <sachin@kubermatic.com>

Signed-off-by: Sachin Tiptur <sachin@kubermatic.com>
….24.5 respectively (kubermatic#1431)

* Bump kubernetes versions used for E2E tests to 1.22.14, 1.23.11 and 1.24.5 respectively

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

* Skip TLS verification for vSphere

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
)

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>
Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
* Remove references to proxy in Nutanix e2e testing

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Change run-e2e-test.sh script

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>
…rmatic#1430)

Signed-off-by: Sachin Tiptur <sachin@kubermatic.com>

Signed-off-by: Sachin Tiptur <sachin@kubermatic.com>
Signed-off-by: Artiom Diomin <kron82@gmail.com>

Signed-off-by: Artiom Diomin <kron82@gmail.com>
* Add support for Ubuntu 22.04

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

* Use containerd v1.5 as default when docker is selected

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
* Add enableBootDiagnostics field to Azure cloud spec

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Update complexity in .golangci.yml

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>
Signed-off-by: Artiom Diomin <kron82@gmail.com>

Signed-off-by: Artiom Diomin <kron82@gmail.com>
* Update OSM image for e2e tests

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

* Add TODO

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
kubermatic#1451)

* Add an option to enable nested virtualization and CPU platform for GCP

Signed-off-by: Marcin Franczyk <marcin0franczyk@gmail.com>

* adjust minCPUPlatform name to linter rules

Signed-off-by: Marcin Franczyk <marcin0franczyk@gmail.com>

Signed-off-by: Marcin Franczyk <marcin0franczyk@gmail.com>
Signed-off-by: Sankalp Rangare <sankalprangare786@gmail.com>

Signed-off-by: Sankalp Rangare <sankalprangare786@gmail.com>
Signed-off-by: Marcin Franczyk <marcin0franczyk@gmail.com>

Signed-off-by: Marcin Franczyk <marcin0franczyk@gmail.com>
…ubermatic#1428)

* Deprecate use-osm flag, remove osm dependency

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Add a note to pkg/bootstrap/types.go

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Remove provisioning secret as implementation detail of OSM

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Add legacy label again but also set it if external bootstrapping is disabled

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Fix linting issues

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Annotate testdata with default OSPs

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>
* Kubevirt switch to instancetype

Signed-off-by: Helene Durand <helene@kubermatic.com>

* Fix example file

Signed-off-by: Helene Durand <helene@kubermatic.com>

* Code review comment

Signed-off-by: Helene Durand <helene@kubermatic.com>

Signed-off-by: Helene Durand <helene@kubermatic.com>
…ic#1459)

* Update containerized check

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Test if kindnet CNI works

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Revert "Test if kindnet CNI works"

This reverts commit 35ad25c.

* Replace Flannel with Cilium

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Set operator.replicas=1 for cilium

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Remove CentOS 7 tests

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Add Calico as CNI

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Do not check node readiness, just verify node exists

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Revert "Remove CentOS 7 tests"

This reverts commit 72b0d18.

* Add a test for node conditions except NodeReady

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Fix yamllint issues

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Fix linting issue

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

* Disable upgrade Prow job

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>
Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>
Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Signed-off-by: Marvin Beckers <marvin@kubermatic.com>
Signed-off-by: Sachin Tiptur <sachin@kubermatic.com>

Signed-off-by: Sachin Tiptur <sachin@kubermatic.com>
containerd can be configured for multiple registry mirrors per registry,
this extends machine-controller to configure mirrors for registries
other than docker.io, reusing the existing command line flag.

Signed-off-by: Mara Sophie Grosch <mgrosch@anexia-it.com>

Signed-off-by: Mara Sophie Grosch <mgrosch@anexia-it.com>
Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
lucakuendig and others added 4 commits November 3, 2022 16:16
* add optional storageSubnet to Nutanix provider

Signed-off-by: Luca Kuendig <luca.kuendig@bedag.ch>

* add additionalSubnetNames field to ntnx-provider

Signed-off-by: Luca Kuendig <luca.kuendig@bedag.ch>

Signed-off-by: Luca Kuendig <luca.kuendig@bedag.ch>
…kubermatic#1476)

* restart containerd for flatcar so the drop in conf will be considered

Signed-off-by: Luca Kuendig <luca.kuendig@bedag.ch>

* update fixture data

Signed-off-by: Luca Kuendig <luca.kuendig@bedag.ch>

Signed-off-by: Luca Kuendig <luca.kuendig@bedag.ch>
Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>

Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
* refactor overwriteCloudConfigSpec for MDs

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

* generate fixtures

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>
@LittleFox94
Copy link
Collaborator

Oh wow, that looks really nice!

I seem to remember we had problems with parallel IP reservations in the past and this being the reason to serialize machine creation - did you test this or know anything more about it?

@marioreggiori
Copy link
Author

Thanks 🙂

I seem to remember we had problems with parallel IP reservations in the past and this being the reason to serialize machine creation - did you test this or know anything more about it?

I only know about this problem from the ticket related to this PR. But this also shouldn't be an issue as long as the Create method (which indirectly reserves the address) is called sequentially, right? I created two replicas=3 node pools with the problem in mind and couldn't find anything out of the ordinary.

@LittleFox94
Copy link
Collaborator

Thanks slightly_smiling_face

I seem to remember we had problems with parallel IP reservations in the past and this being the reason to serialize machine creation - did you test this or know anything more about it?

I only know about this problem from the ticket related to this PR. But this also shouldn't be an issue as long as the Create method (which indirectly reserves the address) is called sequentially, right? I created two replicas=3 node pools with the problem in mind and couldn't find anything out of the ordinary.

Could we, just to be safe, have a mutex around the "Reserve IP" and "Create VM" operation? Locking before reserving the IP, unlocking once the VM creating request is sent and progress is at least at 10% or something? How would this affect the timing?

@marioreggiori
Copy link
Author

Thanks slightly_smiling_face

I seem to remember we had problems with parallel IP reservations in the past and this being the reason to serialize machine creation - did you test this or know anything more about it?

I only know about this problem from the ticket related to this PR. But this also shouldn't be an issue as long as the Create method (which indirectly reserves the address) is called sequentially, right? I created two replicas=3 node pools with the problem in mind and couldn't find anything out of the ordinary.

Could we, just to be safe, have a mutex around the "Reserve IP" and "Create VM" operation? Locking before reserving the IP, unlocking once the VM creating request is sent and progress is at least at 10% or something? How would this affect the timing?

Re "affect on timing": I'd have to test this, as I'm not sure how linear the progress returned by the engine is.

I think a mutex guarding provider.Create (which also includes the reservation) wouldn't hurt. But I fear that we or the machine-controller could easily lock provisioning indefinitely by accident if we only release the mutex at >=10% progress.

@LittleFox94
Copy link
Collaborator

Let's add a mutex around the IP reservation call, reading through old tickets this should be enough :)

…` call

Signed-off-by: Mario Schäfer <mschaefer@anexia-it.com>
@@ -248,6 +203,8 @@ func provisionVM(ctx context.Context, client anxclient.Client) error {
return updateMachineStatus(reconcileContext.Machine, *status, reconcileContext.ProviderData.Update)
}

var _engsup3404mutex sync.Mutex
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably give this another name, given how people outside Anexia cannot see this ticket and this is going upstream

@LittleFox94
Copy link
Collaborator

merged upstream as kubermatic#1483

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet