wait container | Failed to establish pod watch ... dial tcp i/o timeout #4980
-
Hi all, we submit hundreds of workflows at specific times of the day. Some of them end up in the "Error/Failed" status, and the MESSAGE column shows errors like the one in the title (`Failed to establish pod watch ... dial tcp i/o timeout`).
At first I thought it was caused by overloading the cluster, but after observing it for a while I found that the problem always happens in the wait container. Is there a bug in the wait container, and is there any way to solve this problem? Any suggestion would be appreciated.
Replies: 2 comments 2 replies
-
I think this should be raised as an issue rather than a discussion.
-
Copying my response from #4993 to this discussion, in the hope that others will find it useful as well:

It's likely that your cluster/apiserver is super unstable. There's an environment variable `TRANSIENT_ERROR_PATTERN` you can use to specify a regular expression for additional errors that you'd like to retry. There are also variables like `RETRY_BACKOFF_*` that you can set for further customization of the retry behavior. More details in this doc: https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md
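
For anyone hitting the same `dial tcp ... i/o timeout` message, here is a minimal sketch of how those variables might be set, assuming the default `workflow-controller` Deployment in the `argo` namespace. The regular expression and backoff values below are illustrative assumptions only; check the environment-variables doc linked above for the variables your version supports and which component (controller or executor) reads each one.

```yaml
# Sketch: env-var overrides on the workflow-controller Deployment.
# The pattern and backoff values are illustrative assumptions, not recommendations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            # Treat "dial tcp ... i/o timeout" errors as transient so they are retried
            - name: TRANSIENT_ERROR_PATTERN
              value: "dial tcp .*: i/o timeout"
            # RETRY_BACKOFF_* tune how retries of Kubernetes API calls back off
            - name: RETRY_BACKOFF_DURATION
              value: "1s"
            - name: RETRY_BACKOFF_FACTOR
              value: "2"
            - name: RETRY_BACKOFF_STEPS
              value: "10"
```

If the error actually originates in the wait container (the executor) rather than the controller, `TRANSIENT_ERROR_PATTERN` would need to be set on the executor side instead, for example via the `executor` section of the workflow-controller ConfigMap; treat that placement as an assumption and verify it against the docs for your version.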