Self-managed k8s cluster: Controller not updating the targets after pod mutation #476

@pmankad96

Description

I am running a self-managed (kOps-based) Kubernetes cluster.

  • Restarted deployment 1 (the read endpoint). The controller is taking a very long time to detect the new pod IPs and update the target group with them, so the targets stay unhealthy. It has now been 20+ minutes and the targets still point to stale IPs (a quick way to diff the live pod IPs against the registered targets is sketched after this list):
Admin:~/environment $ k get pods -n kops-sample-webapp-1 sample-read-endpoint-1-79db8686dd-fls6q -owide
NAME                                      READY   STATUS    RESTARTS   AGE   IP             NODE                  NOMINATED NODE   READINESS GATES
sample-read-endpoint-1-79db8686dd-fls6q   1/1     Running   0          62m   10.100.12.44   i-09c46490ebd9d7a7a   <none>           <none>
Admin:~/environment $ k get pods -n kops-sample-webapp-1 sample-read-endpoint-1-79db8686dd-7gr4t  -owide
NAME                                      READY   STATUS    RESTARTS   AGE   IP             NODE                  NOMINATED NODE   READINESS GATES
sample-read-endpoint-1-79db8686dd-7gr4t   1/1     Running   0          63m   10.100.11.72   i-06976711a4a8d5bda   <none>           <none>
Admin:~/environment $ 

Admin:~/environment $ aws vpc-lattice list-targets --target-group-identifier tg-0f64ed0c67e307b45
{
    "items": [
        {
            "id": "10.100.11.6",
            "port": 80,
            "reasonCode": "ConnectionTimeout",
            "status": "UNHEALTHY"
        },
        {
            "id": "10.100.12.108",
            "port": 80,
            "reasonCode": "ConnectionTimeout",
            "status": "UNHEALTHY"
        }
    ]
}
Admin:~/environment $
  • Restarted yet another deployment, deployment 2 (the write endpoint). Meanwhile, the new pods from the deployment 1 restart claimed the stale IPs that were still registered in the write endpoint's target group, so those targets are now marked healthy (they were never deregistered, and the health checks now succeed against the read pods). Traffic intended for the write endpoint is therefore served by the read endpoint, while traffic to the read endpoint hangs (a manual cleanup sketch also follows the list):
Admin:~/environment $ k get pods -n kops-sample-webapp-1 sample-write-endpoint-1-84cfb6f8bd-b29pn -owide                                                                                                                     
NAME                                       READY   STATUS    RESTARTS   AGE   IP              NODE                  NOMINATED NODE   READINESS GATES
sample-write-endpoint-1-84cfb6f8bd-b29pn   1/1     Running   0          88m   10.100.12.252   i-09c46490ebd9d7a7a   <none>           <none>
Admin:~/environment $ k get pods -n kops-sample-webapp-1 sample-write-endpoint-1-84cfb6f8bd-vxfhn -owide
NAME                                       READY   STATUS    RESTARTS   AGE   IP             NODE                  NOMINATED NODE   READINESS GATES
sample-write-endpoint-1-84cfb6f8bd-vxfhn   1/1     Running   0          88m   10.100.11.57   i-06976711a4a8d5bda   <none>           <none>
Admin:~/environment $ 

Admin:~/environment $ aws vpc-lattice list-targets --target-group-identifier tg-095f618fb72e3c199
{
    "items": [
        {
            "id": "10.100.12.44",
            "port": 80,
            "status": "HEALTHY"
        },
        {
            "id": "10.100.11.72",
            "port": 80,
            "status": "HEALTHY"
        }
    ]
}
Admin:~/environment $ 
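
For anyone reproducing this, a quick way to compare the live pod IPs against what Lattice has registered. The label selector app=sample-read-endpoint-1 is an assumption; substitute whatever labels the deployment actually carries:

# Current pod IPs for the read deployment (label selector assumed)
kubectl get pods -n kops-sample-webapp-1 -l app=sample-read-endpoint-1 -o jsonpath='{.items[*].status.podIP}'

# IPs currently registered in the corresponding target group
aws vpc-lattice list-targets --target-group-identifier tg-0f64ed0c67e307b45 --query 'items[].id' --output text

If the two sets differ, the controller has not reconciled the target group.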
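As a temporary workaround until the controller reconciles, the stale entries can be swapped out by hand. A sketch using the stale and new IPs from the first target group above; this only papers over the controller bug, it does not fix it:

# Deregister the stale targets shown as UNHEALTHY above
aws vpc-lattice deregister-targets --target-group-identifier tg-0f64ed0c67e307b45 --targets id=10.100.11.6,port=80 id=10.100.12.108,port=80

# Register the current read pod IPs in their place
aws vpc-lattice register-targets --target-group-identifier tg-0f64ed0c67e307b45 --targets id=10.100.12.44,port=80 id=10.100.11.72,port=80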
