CKS: firewall and cluster scaling problem if the default firewall rules are deleted #11779

@baltazorbest

Description

Problem

After creating a k8s cluster and removing the default firewall rules, I can no longer scale the cluster; scaling fails with a network error:

2025-10-02 13:02:14,919 WARN [o.a.c.m.w.WebhookServiceImpl] (API-Job-Executor-36:[ctx-11ddf09f, job-9391, ctx-04c8b6be, ctx-35bd1fab, ctx-42198853]) (logid:2fbf4611) Skipping delivering event Event {"description":"{"event":"VM.START","status":"Completed"}","eventId":null,"eventType":"VM.START","eventUuid":null,"resourceType":"VirtualMachine","resourceUUID":null} to any webhook as account ID is missing
2025-10-02 13:02:14,919 WARN [o.a.c.f.e.EventDistributorImpl] (API-Job-Executor-36:[ctx-11ddf09f, job-9391, ctx-04c8b6be, ctx-35bd1fab, ctx-42198853]) (logid:2fbf4611) Failed to publish event [category: ActionEvent, type: VM.START] on bus webhookEventBus
2025-10-02 13:02:14,936 ERROR [c.c.k.c.a.KubernetesClusterScaleWorker] (API-Job-Executor-36:[ctx-11ddf09f, job-9391, ctx-04c8b6be]) (logid:2fbf4611) Scaling failed for Kubernetes cluster : my-k8s, unable to update network rules com.cloud.exception.ManagementServerException: Firewall rule for node SSH access can't be provisioned
at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleKubernetesClusterIsolatedNetworkRules(KubernetesClusterScaleWorker.java:128)
at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleKubernetesClusterNetworkRules(KubernetesClusterScaleWorker.java:176)
at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleUpKubernetesClusterSize(KubernetesClusterScaleWorker.java:388)
at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleKubernetesClusterSize(KubernetesClusterScaleWorker.java:424)
at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleCluster(KubernetesClusterScaleWorker.java:477)
at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.scaleKubernetesCluster(KubernetesClusterManagerImpl.java:1767)
at jdk.internal.reflect.GeneratedMethodAccessor1219.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:569)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.apache.cloudstack.network.contrail.management.EventUtils$EventInterceptor.invoke(EventUtils.java:105)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
at com.cloud.event.ActionEventInterceptor.invoke(ActionEventInterceptor.java:52)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
at jdk.proxy3/jdk.proxy3.$Proxy517.scaleKubernetesCluster(Unknown Source)
at org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd.execute(ScaleKubernetesClusterCmd.java:160)
at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:173)
at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:110)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:652)
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:600)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)

Versions

OS is Ubuntu 22.04
CloudStack version is 4.20.1
K8s version is v1.33.1-calico-x86_64
Primary storage is Ceph RBD 19.2.3
Libvirt version is 8.0.0-1ubuntu7.12

The steps to reproduce the bug

  1. Create a network with any subnet (e.g., 10.10.10.1/24).
  2. Create a k8s cluster in HA mode with one worker node, using the network created in step 1.
  3. Remove the default firewall rules:
  • 0.0.0.0/0 TCP 6443 6443
  • 0.0.0.0/0 TCP 2222 2225
  4. Add new firewall rules:
  • 10.10.10.1/24 TCP 1 65534
  • 1.1.1.1/32 TCP 1 65534
  5. Try to scale the cluster to two worker nodes (see the cmk sketch after this list).
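
For reference, a minimal sketch of the same sequence driven through the CloudStack API with CloudMonkey (cmk); the UUIDs are placeholders of my own, not values from this environment:

  # List the default rules CKS created on the cluster's public IP
  cmk list firewallrules ipaddressid=<public-ip-uuid>
  # Delete the two default rules (TCP 6443 and TCP 2222-2225)
  cmk delete firewallrule id=<rule-tcp-6443-uuid>
  cmk delete firewallrule id=<rule-tcp-2222-2225-uuid>
  # Add the replacement rules
  cmk create firewallrule ipaddressid=<public-ip-uuid> protocol=tcp startport=1 endport=65534 cidrlist=10.10.10.1/24
  cmk create firewallrule ipaddressid=<public-ip-uuid> protocol=tcp startport=1 endport=65534 cidrlist=1.1.1.1/32
  # Scale from one to two worker nodes
  cmk scale kubernetescluster id=<cluster-uuid> size=2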

Result:
The scale operation fails with the error above, although the new worker instance is created.

Workaround
When using the following firewall rules instead:

  • 10.10.10.1/24 TCP 6443 6443
  • 10.10.10.1/24 TCP 2222 2225

→ Scaling the cluster works correctly.
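
In cmk terms (placeholder UUID again), the workaround amounts to recreating the two default rules with a restricted source CIDR rather than deleting them outright:

  cmk create firewallrule ipaddressid=<public-ip-uuid> protocol=tcp startport=6443 endport=6443 cidrlist=10.10.10.1/24
  cmk create firewallrule ipaddressid=<public-ip-uuid> protocol=tcp startport=2222 endport=2225 cidrlist=10.10.10.1/24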

Additional Issues Observed

  1. Opening SSH (2222–2225) and k8s management (6443) to 0.0.0.0/0 is a security risk.
  2. When the cluster enters the Alert state, it is impossible to repair it; the only actions available are stop and delete.
  • After stopping and starting the cluster, it changes state to Running. The new worker instance is created, but it does not join the Kubernetes cluster (it is not present in the node list; see the commands below).
  • However, scaling the cluster is still not possible, and deleting an individual instance also fails.
  • The only option left is to remove the entire cluster and create it again.
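
To verify the second issue, one can compare CloudStack's view of the cluster with the node list Kubernetes itself reports (placeholder UUID; kubeconfig as downloaded from the CloudStack UI, the my-k8s.conf filename being my assumption):

  # CloudStack's view of the cluster and its node VMs
  cmk list kubernetesclusters id=<cluster-uuid>
  # Kubernetes' own view: after stop/start the new worker VM is running but absent here
  kubectl --kubeconfig my-k8s.conf get nodes -o wide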

What to do about it?

No response
