Seed cluster cannot recover after failed reconciliation when trying to delete network resources #461

Closed
namsral opened this issue Jun 15, 2022 · 5 comments

namsral commented Jun 15, 2022

How to categorize this issue?

/area control-plane
/kind bug
/platform openstack

What happened:
The managed seed cluster fails reconciliation because network resources cannot be deleted, which prevents me from updating the cluster and its extensions.

Reconciliation is attempting to delete the following resources:

  • the seed's security group rule allowing any ingress UDP traffic
  • the seed's single subnet

Status:

% kubectl get managedseed -A
NAMESPACE   NAME          STATUS       SHOOT   GARDENLET   AGE
garden      seed-ams2-1   Registered   dave    True        18d

% kubectl get seed -A
NAME          STATUS   PROVIDER    REGION   AGE   VERSION   K8S VERSION
first-seed    Ready    openstack   ams2     19d   v1.48.3   v1.23.6
seed-ams2-1   Ready    openstack   ams2     18d   v1.47.1   v1.23.3

% kubectl -n shoot--garden--dave get infrastructures
NAME   TYPE        REGION   STATUS   AGE
dave   openstack   ams2     Error    18d

% kubectl get shoots -A
NAMESPACE       NAME         CLOUDPROFILE   PROVIDER    REGION   K8S VERSION   HIBERNATION   LAST OPERATION           STATUS      AGE
...
garden          dave         private        openstack   ams2     1.23.3        Awake         Reconcile Failed (28%)   unhealthy   18d

Error:

% kubectl -n shoot--garden--dave describe infrastructures dave
...
  lastError:
    codes:
    - ERR_INFRA_DEPENDENCIES
    description: |-
      Error reconciling infrastructure: failed to apply the terraform config: Terraform execution for command 'apply' could not be completed:

      * Error waiting for openstack_networking_subnet_v2 <omitted> to become deleted: timeout while waiting for state to become 'DELETED' (last state: 'ACTIVE', timeout: 10m0s)
    lastUpdateTime: "2022-06-15T10:02:15Z"
  lastOperation:
    description: |-
      Error reconciling infrastructure: failed to apply the terraform config: Terraform execution for command 'apply' could not be completed:

      * Error waiting for openstack_networking_subnet_v2 <omitted> to become deleted: timeout while waiting for state to become 'DELETED' (last state: 'ACTIVE', timeout: 10m0s)
    lastUpdateTime: "2022-06-15T10:02:15Z"
    progress: 50
    state: Error
    type: Reconcile
  observedGeneration: 3

Error in gardenlet:

% kubectl -n garden logs gardenlet-9d49544cd-nrzn6
...
{"error":"error during reconciliation: Error reconciling infrastructure: failed to apply the terraform config: Terraform execution for command 'apply' could not be completed:\n\n* Error creating openstack_networking_secgroup_rule_v2: Expected HTTP response code [] when accessing [POST https://core.fuga.cloud:9696/v2.0/security-group-rules], but got 409 instead\n{\"NeutronError\": {\"type\": \"SecurityGroupRuleExists\", \"message\": \"Security group rule already exists. Rule id is <omitted>.\", \"detail\": \"\"}}\n  with openstack_networking_secgroup_rule_v2.cluster_udp_all,\n  on main.tf line 88, in resource \"openstack_networking_secgroup_rule_v2\" \"cluster_udp_all\":\n  88: resource \"openstack_networking_secgroup_rule_v2\" \"cluster_udp_all\" {\n* Error waiting for openstack_networking_subnet_v2 <omitted> to become deleted: timeout while waiting for state to become 'DELETED' (last state: 'ACTIVE', timeout: 10m0s)","level":"error","msg":"Infrastructure shoot--garden--dave/dave did not get ready yet","operation":"reconcile","shoot":"garden/dave","ts":"2022-06-15T07:51:32.083Z"}

What you expected to happen:
I expect the subnet not to be deleted during a reconcile, as a dozen managed ports are attached to it.

How to reproduce it (as minimally and precisely as possible):
Deploy a shoot cluster, convert it to a managed seed, then force a reconcile.
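
As a minimal sketch of the last step, assuming the shoot is named dave in the garden namespace, a reconcile can be forced with the standard Gardener operation annotation:

% kubectl -n garden annotate shoot dave gardener.cloud/operation=reconcile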

Anything else we need to know?:

Environment:

  • Gardener version (if relevant): v1.48.3
  • Extension version: 1.0.58
  • Kubernetes version (use kubectl version): v1.23.6
  • Cloud provider or hardware configuration: OpenStack Ussuri
  • Others:
gardener-robot added the area/control-plane, kind/bug, and platform/openstack labels on Jun 15, 2022
kon-angelo (Contributor) commented

Hello @namsral. This indicates that there are resources on the infrastructure that fail to be deleted. These keep the subnet "busy", so OpenStack refuses to delete it.

In my experience, the usual suspects in such cases are either load balancers or ports. It would be helpful if you could check which resources have not been deleted, so that we can find the root cause more easily.
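
For anyone in the same state, a rough sketch of how one could check, assuming the openstack CLI is configured for the shoot's tenant and the subnet ID is taken from the Terraform error (IDs below are placeholders):

% openstack subnet show <subnet-id> -f value -c network_id    # find the parent network
% openstack port list --network <network-id> --long           # ports still attached to the network
% openstack loadbalancer list                                  # load balancers that may still hold ports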

namsral commented Jun 17, 2022

Thanks @kon-angelo, removing the port connecting the shoot's subnet and router resolved the issue. Since the port, subnet, and router are all managed by Gardener, I consider this a bug, though I'm not sure in which component.

Although I have not tested it, it might have been sufficient to clear the port's device_owner field, which contained network:router_interface.
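
A hedged sketch of both variants, with placeholder IDs; openstack router remove subnet detaches the router interface port, while openstack port delete removes the port outright:

% openstack port list --network <network-id> --device-owner network:router_interface
% openstack router remove subnet <router-id> <subnet-id>    # detach the router interface port
% openstack port delete <port-id>                           # or delete the port directly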

namsral closed this as completed Jun 17, 2022
gardener-robot added the status/closed label on Jun 17, 2022
kon-angelo (Contributor) commented

@namsral It's good that you managed to resolve it on your own. If you see this happening consistently, please let us know which orphaned resources you find and we can discuss which component is responsible.

As a point of reference, if the issue is with load balancers, then it is most likely a problem with OpenStack's cloud-controller-manager. If, however, the ports are used by the nodes, then it is a problem with our MCM (machine-controller-manager).

namsral commented Jun 24, 2022

New information reveals that ports of newly spawned machines prevent removal of the subnet.

The seed failed with a similar error:

Error waiting for openstack_networking_subnet_v2 <omitted> to become deleted: timeout while waiting for state to become 'DELETED' (last state: 'ACTIVE', timeout: 10m0s)

Steps to recover the failed seed (a command sketch follows the list):

  1. Delete the two remaining ports on the subnet (ports attached to the spawned machines)
  2. Delete subnet
  3. Force reconcile of the seed cluster
  4. Delete the spawned machines attached to the ports deleted in step 1
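
A hedged command sketch of the steps above, with placeholder IDs and names; the Machine objects are assumed to live in the shoot's control-plane namespace on the seed that hosts it:

% openstack port delete <port-1-id> <port-2-id>                              # step 1
% openstack subnet delete <subnet-id>                                        # step 2
% kubectl -n garden annotate shoot dave gardener.cloud/operation=reconcile   # step 3
% kubectl -n shoot--garden--dave delete machine <machine-name>               # step 4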

This looks like a race condition between the removal of the subnet and the spawning of machines in the subnet.

namsral commented Jul 4, 2022

For future reference, the issue was caused by a syntax error in the shoot manifest and was resolved by correcting the syntax in the shoot manifest and infrastructure config.

Although the shoot's subnet was correctly created and functional, the difference in CIDR notation caused Terraform to recreate the subnet during reconciliation (a sketch of how to check the configured CIDR follows the plan output):

Terraform will perform the following actions:
...
  # openstack_networking_subnet_v2.cluster must be replaced
-/+ resource "openstack_networking_subnet_v2" "cluster" {
      ~ all_tags          = [] -> (known after apply)
      ~ cidr              = "10.240.0.0/16" -> "10.240.0/16" # forces replacement
      ~ gateway_ip        = "10.240.0.1" -> (known after apply)
...
Plan: 2 to add, 0 to change, 2 to destroy.
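
A hedged way to spot and fix the malformed CIDR, assuming the worker CIDR sits at spec.provider.infrastructureConfig.networks.workers in the shoot manifest (the field path used by the OpenStack provider extension):

% kubectl -n garden get shoot dave -o jsonpath='{.spec.provider.infrastructureConfig.networks.workers}'
% kubectl -n garden edit shoot dave   # change e.g. 10.240.0/16 to the full four-octet form 10.240.0.0/16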
