
wait container | Failed to establish pod watch ... dial tcp i/o timeout #4993

Closed · RysBen opened this issue Feb 2, 2021 · 5 comments

RysBen commented Feb 2, 2021

Summary

Hi all,

We submit hundreds of workflows at specific times of the day. Some steps in these workflows end up with status Error/Failed, and the message is as follows:

argo -n sxlj get new-k007098t-k007098t-panel063-18-ct-sub

Name:                new-k007098t-k007098t-panel063-18-ct-sub
Namespace:           sxlj
ServiceAccount:      default
Status:              Failed
Message:             child 'new-k007098t-k007098t-panel063-18-ct-sub-2059211949' failed
Created:             Sat Jan 30 15:15:27 +0800 (36 minutes ago)
Started:             Sat Jan 30 15:15:27 +0800 (36 minutes ago)
Finished:            Sat Jan 30 15:21:34 +0800 (30 minutes ago)
Duration:            6 minutes 7 seconds
Parameters:
  argo-yaml:         /path/to/new_K007098T_K007098T_panel063-18-CT.sub.yaml

STEP                                                 PODNAME                                              DURATION  MESSAGE
 ✔ new-k007098t-k007098t-panel063-18-ct-sub
 └-✔ new-k007098t-k007098t-panel063-18-ct-sub-sh(0)  new-k007098t-k007098t-panel063-18-ct-sub-710601264   2m

 ✖ new-k007098t-k007098t-panel063-18-ct-sub.onExit                                                                  child 'new-k007098t-k007098t-panel063-18-ct-sub-2059211949' failed
 └---⚠ succeed                                       new-k007098t-k007098t-panel063-18-ct-sub-2059211949  2m        failed to save outputs: Failed to establish pod watch: Get https://10.100.0.1:443/api/v1/namespaces/sxlj/pods?fieldSelector=metadata.name%3Dnew-k007098t-k007098t-panel063-18-ct-sub-2059211949&watch=true: dial tcp 10.100.0.1:443: i/o timeout

At first I thought this was caused by cluster overload, but after observing it for a while I found that the failures all occur in the wait container, while the main containers complete normally. Why does this happen, and is there a way to solve it?

Any suggestions would be appreciated.

More Info

kubectl describe po new-k007098t-k007098t-panel063-18-ct-sub-2059211949

Name:               new-k007098t-k007098t-panel063-18-ct-sub-2059211949
Namespace:          sxlj
Priority:           0
PriorityClassName:  <none>
Node:               ${node_name}/${node_ip}
Start Time:         Sat, 30 Jan 2021 15:18:55 +0800
Labels:             workflows.argoproj.io/completed=true
                    workflows.argoproj.io/on-exit=true
                    workflows.argoproj.io/workflow=new-k007098t-k007098t-panel063-18-ct-sub
Annotations:        cni.projectcalico.org/podIP=10.100.188.241/32
                    cni.projectcalico.org/podIPs=10.100.188.241/32
                    workflows.argoproj.io/node-message=Failed to establish pod watch: Get https://10.100.0.1:443/api/v1/namespaces/sxlj/pods?fieldSelector=metadata.name%3Dnew-k007098t-k007098t-panel063-18-ct-sub-20592119...
                    workflows.argoproj.io/node-name=new-k007098t-k007098t-panel063-18-ct-sub.onExit[0].succeed
                    workflows.argoproj.io/outputs={"exitCode":"0"}
                    workflows.argoproj.io/template={"name":"exit-handler-succeed","arguments":{},"inputs":{},"outputs":{},"metadata":{},"container":{"name":"","image":"...
Status:             Failed
IP:                 10.100.188.241
Controlled By:      Workflow/new-k007098t-k007098t-panel063-18-ct-sub
Containers:
  wait:
    Container ID:  docker://b4675e87273e5db9d400d27ff8a05ff913cfa2f0c86c4afc434217e09e6fcb49
    Image:         argoproj/argoexec:v2.8.0
    Image ID:      docker-pullable://argoproj/argoexec@sha256:9eb71f2dae8405d8c053d5beccce15ecf179448ad0809d9f55f1333bf1d71a01
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      wait
    State:          Terminated
      Reason:       Error
      Message:      Failed to establish pod watch: Get https://10.100.0.1:443/api/v1/namespaces/sxlj/pods?fieldSelector=metadata.name%3Dnew-k007098t-k007098t-panel063-18-ct-sub-2059211949&watch=true: dial tcp 10.100.0.1:443: i/o timeout
      Exit Code:    1
      Started:      Sat, 30 Jan 2021 15:19:14 +0800
      Finished:     Sat, 30 Jan 2021 15:20:29 +0800
    Ready:          False
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:  new-k007098t-k007098t-panel063-18-ct-sub-2059211949 (v1:metadata.name)
    Mounts:
      /argo/podmetadata from podmetadata (rw)
      /mainctrfs/RawData_classed_EMC from emc (rw)
      /mainctrfs/clinical from clinical (rw)
      /mainctrfs/etc/resolv.conf from dns (rw)
      /mainctrfs/hongshan from hongshan (rw)
      /var/run/docker.sock from docker-sock (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-46jzs (ro)
  main:
    Container ID:  docker://9b5afc63336c44554530bc9499c782c691de92a705950a9158a79b67f50c953f
    Image:         ...
    Image ID:      docker-pullable://...
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      set -x && echo "`date` succeeded!"
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 30 Jan 2021 15:19:14 +0800
      Finished:     Sat, 30 Jan 2021 15:19:14 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     3600m
      memory:  30000Mi
    Requests:
      cpu:        3600m
      memory:     30000Mi
    Environment:  <none>
    Mounts:
      /pan1 from em (rw)
      /pan2 from cl (rw)
      /etc/resolv.conf from dns (rw)
      /p3 from hs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-46jzs (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  podmetadata:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  docker-sock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  Socket
  clinical:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    ${nas_addr}
    Path:      ...
    ReadOnly:  false
  emc:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    ${nas_addr}
    Path:      ...
    ReadOnly:  false
  dns:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/resolv.conf
    HostPathType:  File
  hongshan:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  myclaim
    ReadOnly:   false
  default-token-46jzs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-46jzs
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node-role.genetronhealth.com/deliver=aliyun:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age        From                  Message
  ----    ------     ----       ----                  -------
  Normal  Scheduled  <unknown>  default-scheduler     Successfully assigned sxlj/new-k007098t-k007098t-panel063-18-ct-sub-2059211949 to node157
  Normal  Pulled     59m        kubelet, node157  Container image "argoproj/argoexec:v2.8.0" already present on machine
  Normal  Created    59m        kubelet, node157  Created container wait
  Normal  Started    59m        kubelet, node157  Started container wait
  Normal  Pulled     59m        kubelet, node157  Container image "..." already present on machine
  Normal  Created    59m        kubelet, node157  Created container main
  Normal  Started    59m        kubelet, node157  Started container main
kubectl logs new-k007098t-k007098t-panel063-18-ct-sub-2059211949 wait

time="2021-01-30T07:19:14Z" level=info msg="Starting Workflow Executor" version=v2.8.0-rc4+8f69617.dirty
time="2021-01-30T07:19:14Z" level=info msg="Creating a docker executor"
time="2021-01-30T07:19:14Z" level=info msg="Executor (version: v2.8.0-rc4+8f69617.dirty, build_date: 2020-05-12T15:17:15Z) initialized (pod: sxlj/new-k007098t-k007098t-panel063-18-ct-sub-2059211949) with template:\n{\"name\":\"exit-handler-succeed\",\"arguments\":{},\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"...\",\"command\":[\"bash\",\"-c\"],\"args\":[\"set -x \\u0026\\u0026 echo \\\"`date` succeeded!\\\" \"],\"resources\":{\"limits\":{\"cpu\":\"3600m\",\"memory\":\"30000Mi\"},\"requests\":{\"cpu\":\"3600m\",\"memory\":\"30000Mi\"}},\"volumeMounts\":[{\"name\":\"clinical\",\"mountPath\":\"/clinical\"},{\"name\":\"emc\",\"mountPath\":\"/RawData_classed_EMC\"},{\"name\":\"dns\",\"mountPath\":\"/etc/resolv.conf\"},{\"name\":\"hongshan\",\"mountPath\":\"/hongshan\"}]}}"
time="2021-01-30T07:19:14Z" level=info msg="Waiting on main container"
time="2021-01-30T07:19:44Z" level=error msg="executor error: Failed to establish pod watch: Get https://10.100.0.1:443/api/v1/namespaces/sxlj/pods?fieldSelector=metadata.name%3Dnew-k007098t-k007098t-panel063-18-ct-sub-2059211949&watch=true: dial tcp 10.100.0.1:443: i/o timeout\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:78\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).waitMainContainerStart\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:897\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:861\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
time="2021-01-30T07:19:44Z" level=info msg="No output parameters"
time="2021-01-30T07:19:44Z" level=info msg="No output artifacts"
time="2021-01-30T07:19:44Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-01-30T07:19:44Z" level=info msg="Capturing script exit code"
time="2021-01-30T07:20:14Z" level=warning msg="Failed to get pod 'new-k007098t-k007098t-panel063-18-ct-sub-2059211949': Get https://10.100.0.1:443/api/v1/namespaces/sxlj/pods/new-k007098t-k007098t-panel063-18-ct-sub-2059211949: dial tcp 10.100.0.1:443: i/o timeout"
time="2021-01-30T07:20:29Z" level=info msg="[docker inspect 9b5afc63336c44554530bc9499c782c691de92a705950a9158a79b67f50c953f --format='{{.State.ExitCode}}']"
time="2021-01-30T07:20:29Z" level=info msg="Annotating pod with output"
time="2021-01-30T07:20:29Z" level=info msg="Killing sidecars"
time="2021-01-30T07:20:29Z" level=info msg="Alloc=5611 TotalAlloc=11627 Sys=70592 NumGC=4 Goroutines=6"
time="2021-01-30T07:20:29Z" level=fatal msg="Failed to establish pod watch: Get https://10.100.0.1:443/api/v1/namespaces/sxlj/pods?fieldSelector=metadata.name%3Dnew-k007098t-k007098t-panel063-18-ct-sub-2059211949&watch=true: dial tcp 10.100.0.1:443: i/o timeout\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:78\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).waitMainContainerStart\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:897\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:861\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:40\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.4-0.20181021141114-fe5e611709b0/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@RysBen RysBen added the type/bug label Feb 2, 2021
@franza73

We are experiencing the same issue in our Argo Workflows deployment.

@terrytangyuan
Member

It's likely that your cluster/API server is quite unstable under this load. There's an environment variable, TRANSIENT_ERROR_PATTERN, that you can use to specify a regular expression for additional errors you'd like to retry. There are also variables like RETRY_BACKOFF_* that you can set to further customize the retry behavior. More details in this doc: https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md
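
As a rough sketch of the suggestion above (not part of the original comment): recent docs show executor environment variables being set through the executor block of the workflow-controller ConfigMap, so something like the following could mark this particular i/o timeout as transient. The namespace, regex, and exact field placement are assumptions to verify against the environment-variables doc for your Argo version.

# Hedged sketch -- verify field names against docs/environment-variables.md
# for your Argo Workflows version before applying.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo                      # assumed install namespace
data:
  config: |
    executor:
      env:
        # Treat API-server i/o timeouts as transient so the wait
        # container retries the pod watch instead of failing the node.
        - name: TRANSIENT_ERROR_PATTERN
          value: "dial tcp .*: i/o timeout"

The RETRY_BACKOFF_* variables from the same doc control how long and how often those retries run; depending on the version, they are set on the workflow-controller Deployment rather than in this ConfigMap.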

@RysBen
Author

RysBen commented Feb 18, 2021

@terrytangyuan Thanks for your answer!
@franza73 A few days ago I found that argo list took ~20s in my cluster, while argo get took only ~1s. After deleting completed workflows, argo list took only ~5s, and the issue has not happened again. If there are too many historical workflows in your cluster, you can delete them; if not, you can try the method suggested by @terrytangyuan.

@stale

stale bot commented Apr 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 30, 2021
@fr-ser

fr-ser commented May 1, 2021

We are experiencing the same problem in our cluster, but I cannot confirm the observations of @RysBen: the argo commands take a reasonable amount of time.

Our Kubernetes is an AWS EKS cluster, which I assume follows best practices, so I cannot say much about customizations of the Kubernetes API server...

@stale stale bot closed this as completed May 9, 2021