Seed cluster cannot recover after failed reconciliation when trying to delete network resources #461

Closed
namsral opened this issue Jun 15, 2022 · 5 comments

namsral commented Jun 15, 2022

How to categorize this issue?

/area control-plane
/kind bug
/platform openstack

What happened:
The managed seed cluster fails reconciliation because network resources cannot be deleted, which prevents me from updating the cluster and its extensions.

Reconciliation is attempting to delete the following resources:

  • the seed's security group rule allowing any ingress UDP traffic
  • the seed's single subnet

Status:

% kubectl get managedseed -A
NAMESPACE   NAME          STATUS       SHOOT   GARDENLET   AGE
garden      seed-ams2-1   Registered   dave    True        18d

% kubectl get seed -A
NAME          STATUS   PROVIDER    REGION   AGE   VERSION   K8S VERSION
first-seed    Ready    openstack   ams2     19d   v1.48.3   v1.23.6
seed-ams2-1   Ready    openstack   ams2     18d   v1.47.1   v1.23.3

% kubectl -n shoot--garden--dave get infrastructures
NAME   TYPE        REGION   STATUS   AGE
dave   openstack   ams2     Error    18d

% kubectl get shoots -A
NAMESPACE       NAME         CLOUDPROFILE   PROVIDER    REGION   K8S VERSION   HIBERNATION   LAST OPERATION           STATUS      AGE
...
garden          dave         private        openstack   ams2     1.23.3        Awake         Reconcile Failed (28%)   unhealthy   18d

Error:

% kubectl -n shoot--garden--dave describe infrastructures dave
...
  lastError:
    codes:
    - ERR_INFRA_DEPENDENCIES
    description: |-
      Error reconciling infrastructure: failed to apply the terraform config: Terraform execution for command 'apply' could not be completed:

      * Error waiting for openstack_networking_subnet_v2 <omitted> to become deleted: timeout while waiting for state to become 'DELETED' (last state: 'ACTIVE', timeout: 10m0s)
    lastUpdateTime: "2022-06-15T10:02:15Z"
  lastOperation:
    description: |-
      Error reconciling infrastructure: failed to apply the terraform config: Terraform execution for command 'apply' could not be completed:

      * Error waiting for openstack_networking_subnet_v2 <omitted> to become deleted: timeout while waiting for state to become 'DELETED' (last state: 'ACTIVE', timeout: 10m0s)
    lastUpdateTime: "2022-06-15T10:02:15Z"
    progress: 50
    state: Error
    type: Reconcile
  observedGeneration: 3

Error in gardenlet:

% kubectl -n garden logs gardenlet-9d49544cd-nrzn6
...
{"error":"error during reconciliation: Error reconciling infrastructure: failed to apply the terraform config: Terraform execution for command 'apply' could not be completed:\n\n* Error creating openstack_networking_secgroup_rule_v2: Expected HTTP response code [] when accessing [POST https://core.fuga.cloud:9696/v2.0/security-group-rules], but got 409 instead\n{\"NeutronError\": {\"type\": \"SecurityGroupRuleExists\", \"message\": \"Security group rule already exists. Rule id is <omitted>.\", \"detail\": \"\"}}\n  with openstack_networking_secgroup_rule_v2.cluster_udp_all,\n  on main.tf line 88, in resource \"openstack_networking_secgroup_rule_v2\" \"cluster_udp_all\":\n  88: resource \"openstack_networking_secgroup_rule_v2\" \"cluster_udp_all\" {\n* Error waiting for openstack_networking_subnet_v2 <omitted> to become deleted: timeout while waiting for state to become 'DELETED' (last state: 'ACTIVE', timeout: 10m0s)","level":"error","msg":"Infrastructure shoot--garden--dave/dave did not get ready yet","operation":"reconcile","shoot":"garden/dave","ts":"2022-06-15T07:51:32.083Z"}

What you expected to happen:
I expect the subnet not to be deleted during a reconcile, as a dozen managed ports are attached to it.

How to reproduce it (as minimally and precisely as possible):
Deploy a shoot cluster, convert it to a managed seed, then force a reconcile.
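
As a minimal sketch of the last step, assuming the shoot is named dave in the garden namespace, a reconcile can be forced with the standard Gardener operation annotation:

% kubectl -n garden annotate shoot dave gardener.cloud/operation=reconcile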

Anything else we need to know?:

Environment:

  • Gardener version (if relevant): v1.48.3
  • Extension version: 1.0.58
  • Kubernetes version (use kubectl version): v1.23.6
  • Cloud provider or hardware configuration: OpenStack Ussuri
  • Others:
gardener-robot added the area/control-plane, kind/bug, and platform/openstack labels on Jun 15, 2022
kon-angelo (Contributor) commented

Hello @namsral. This indicates that there are resources on the infrastructure that fail to be deleted. These keep the subnet "busy", so OpenStack refuses to delete it.

In my experience, the usual suspects in such cases are either load balancers or ports. It would be helpful if you could check which resources have not been deleted, so that we can find the root cause more easily.
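
For anyone in the same state, a rough sketch of how one could check, assuming the openstack CLI is configured for the shoot's tenant and the subnet ID is taken from the Terraform error (IDs below are placeholders):

% openstack subnet show <subnet-id> -f value -c network_id    # find the parent network
% openstack port list --network <network-id> --long           # ports still attached to the network
% openstack loadbalancer list                                  # load balancers that may still hold ports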

namsral commented Jun 17, 2022

Thanks @kon-angelo, removing the port connecting the shoot's subnet and router resolved the issue. Since the port, subnet, and router are all managed by Gardener, I consider this a bug, though I'm not sure in which component.

Although I have not tested it, it might have been sufficient to clear the port's device_owner field, which contained network:router_interface.
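
A hedged sketch of both variants, with placeholder IDs; openstack router remove subnet detaches the router interface port, while openstack port delete removes the port outright:

% openstack port list --network <network-id> --device-owner network:router_interface
% openstack router remove subnet <router-id> <subnet-id>    # detach the router interface port
% openstack port delete <port-id>                           # or delete the port directly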

namsral closed this as completed Jun 17, 2022
gardener-robot added the status/closed label on Jun 17, 2022
kon-angelo (Contributor) commented

@namsral It's good that you managed to resolve it on your own. If you see this happening consistently, please let us know which orphaned resources you find and we can discuss which component is responsible.

As a point of reference, if the issue is with load balancers, then it is most likely a problem with OpenStack's cloud-controller-manager. If, however, the ports are used by the nodes, then it is a problem with our MCM (machine-controller-manager).

namsral commented Jun 24, 2022

New information reveals that ports of newly spawned machines prevent removal of the subnet.

The seed failed with a similar error:

Error waiting for openstack_networking_subnet_v2 <omitted> to become deleted: timeout while waiting for state to become 'DELETED' (last state: 'ACTIVE', timeout: 10m0s)

Steps to recover the failed seed (a command sketch follows the list):

  1. Delete the two remaining ports on the subnet (ports attached to the spawned machines)
  2. Delete subnet
  3. Force reconcile of the seed cluster
  4. Delete the spawned machines attached to the ports deleted in step 1
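
A hedged command sketch of the steps above, with placeholder IDs and names; the Machine objects are assumed to live in the shoot's control-plane namespace on the seed that hosts it:

% openstack port delete <port-1-id> <port-2-id>                              # step 1
% openstack subnet delete <subnet-id>                                        # step 2
% kubectl -n garden annotate shoot dave gardener.cloud/operation=reconcile   # step 3
% kubectl -n shoot--garden--dave delete machine <machine-name>               # step 4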

This looks like a race condition between the removal of the subnet and the spawning of machines in the subnet.

namsral commented Jul 4, 2022

For future reference, the issue was caused by a syntax error in the shoot manifest and was resolved by correcting the syntax in the shoot manifest and infrastructure config.

Although the shoot's subnet was correctly created and functional, the difference in CIDR notation caused Terraform to recreate the subnet during reconciliation (a sketch of how to check the configured CIDR follows the plan output):

Terraform will perform the following actions:
...
  # openstack_networking_subnet_v2.cluster must be replaced
-/+ resource "openstack_networking_subnet_v2" "cluster" {
      ~ all_tags          = [] -> (known after apply)
      ~ cidr              = "10.240.0.0/16" -> "10.240.0/16" # forces replacement
      ~ gateway_ip        = "10.240.0.1" -> (known after apply)
...
Plan: 2 to add, 0 to change, 2 to destroy.
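
A hedged way to spot and fix the malformed CIDR, assuming the worker CIDR sits at spec.provider.infrastructureConfig.networks.workers in the shoot manifest (the field path used by the OpenStack provider extension):

% kubectl -n garden get shoot dave -o jsonpath='{.spec.provider.infrastructureConfig.networks.workers}'
% kubectl -n garden edit shoot dave   # change e.g. 10.240.0/16 to the full four-octet form 10.240.0.0/16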
