Che deployment should not fail because pulling images took too long #17942

sleshchenko · 2020-09-23T12:40:58Z

Is your task related to a problem? Please describe.

I'm not sure, but I think it makes sense to increase the pod wait timeout in chectl.
I'm raising this because when I deployed Che on CRC, I got:

~/projects > chectl server:start --platform=crc
› Current Kubernetes context: 'default/api-crc-testing:6443/kube:admin'
....
  ❯ ✅  Post installation checklist
    ✔ PostgreSQL pod bootstrap
      ✔ scheduling...done.
      ✔ downloading images...done.
      ✔ starting...done.
    ❯ Keycloak pod bootstrap
      ✔ scheduling...done.
      ✔ downloading images...done.
      ✖ starting
        → ERR_TIMEOUT: Timeout set to pod ready timeout 130000
      Devfile registry pod bootstrap
      Plugin registry pod bootstrap
      Eclipse Che pod bootstrap
      Retrieving Eclipse Che server URL
      Eclipse Che status check
    Retrieving Keycloak admin credentials
    Retrieving Che self-signed CA certificate
 ›   Error: Error: ERR_TIMEOUT: Timeout set to pod ready timeout 130000
 ›   Installation failed, check logs in '/tmp/chectl-logs/1600863851534'

This is somewhat expected: Keycloak started in 120107 ms, which is less than the default 130000 ms timeout, but once image pulling is added on top, the timeout is exceeded:

12:25:24,089 INFO  [org.jboss.modules] (CLI command executor) JBoss Modules version 1.9.0.Final
12:25:24,927 INFO  [org.jboss.msc] (CLI command executor) JBoss MSC version 1.4.5.Final
12:25:25,031 INFO  [org.jboss.threads] (CLI command executor) JBoss Threads version 2.3.3.Final
...
12:28:30,407 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server
12:28:30,422 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://127.0.0.1:9990/management
12:28:30,425 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990
12:28:30,426 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: Keycloak 6.0.1 (WildFly Core 8.0.0.Final) started in 120107ms - Started 617 of 880 services (563 services are lazy, passive or on-demand)

Describe the solution you'd like

The simplest fix is to increase the default timeout value. The alternatives below may be even better.

Describe alternatives you've considered

We could reset the timeout each time the pod wait moves to the next step, so that pod scheduling, image pulling, and container startup each get their own timeout.
The waiting process could also be interactive by default: when the timeout is exceeded, we ask the user whether they want to keep waiting or abort.

In my particular case I would keep waiting, because I need the Keycloak admin password, which is printed only once Che has fully started.
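The per-step timeout idea can be sketched roughly as follows. This is a minimal illustration, not chectl's actual code: the `Phase` names are copied from the output above, while `Observation` and `findTimedOutPhase` are hypothetical names, and the start of a phase is approximated by the first time the pod is observed in it.

```typescript
// Hypothetical phase names, mirroring the steps chectl prints above.
type Phase = 'scheduling' | 'downloading images' | 'starting' | 'ready';

// One poll result: when we looked, and which phase the pod was in.
interface Observation {
  atMs: number; // milliseconds since the wait began
  phase: Phase;
}

// Returns the phase that exceeded its budget, or undefined if no single
// phase did. The clock restarts whenever the pod enters a new phase.
function findTimedOutPhase(
  observations: Observation[],
  perPhaseTimeoutMs: number,
): Phase | undefined {
  let current: Phase | undefined;
  let phaseStartedAt = 0;
  for (const obs of observations) {
    if (obs.phase !== current) {
      // Phase changed: reset the timeout clock.
      current = obs.phase;
      phaseStartedAt = obs.atMs;
    }
    if (current === 'ready') {
      return undefined;
    }
    if (obs.atMs - phaseStartedAt > perPhaseTimeoutMs) {
      return current;
    }
  }
  return undefined;
}
```

With a 130000 ms budget per phase, a pod that spends 65 s pulling images and then 120107 ms starting would pass, even though the total exceeds 130000 ms.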

Additional context


l0rd · 2020-09-23T21:19:47Z

Another option is not to use a timeout at all and check for errors instead (in events, in the operator, in the pod status). A timeout is frustrating when something fails (we have to wait 2 minutes when we could have been notified immediately) and frustrating when everything went well but pulling images took too long (we tell the user that the deployment failed when it actually worked fine).
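A rough sketch of the "check for errors instead of timing out" approach, looking only at pod status. The `ContainerStatus` interface is a hand-written subset of the Kubernetes container status fields (`ready`, `state.waiting.reason`), and `evaluatePod`, `Verdict`, and the choice of which waiting reasons count as fatal are assumptions made for illustration. In particular, `ErrImagePull` can be transient, so a real implementation might tolerate it for a while before failing.

```typescript
// Minimal subset of the Kubernetes container status, declared by hand.
interface ContainerStatus {
  name: string;
  ready: boolean;
  state?: { waiting?: { reason?: string; message?: string } };
}

type Verdict =
  | { kind: 'ready' }
  | { kind: 'waiting' }
  | { kind: 'failed'; container: string; reason: string };

// Assumption: these waiting reasons are treated as errors worth reporting
// right away rather than waiting for a timeout to expire.
const FATAL_WAITING_REASONS = new Set<string>([
  'ErrImagePull',
  'ImagePullBackOff',
  'InvalidImageName',
  'CrashLoopBackOff',
  'CreateContainerConfigError',
]);

// Classifies a pod from its container statuses: fail fast on a known error,
// report ready when every container is ready, otherwise keep waiting.
function evaluatePod(statuses: ContainerStatus[]): Verdict {
  for (const status of statuses) {
    const reason = status.state?.waiting?.reason;
    if (reason !== undefined && FATAL_WAITING_REASONS.has(reason)) {
      return { kind: 'failed', container: status.name, reason };
    }
  }
  const allReady = statuses.length > 0 && statuses.every((s) => s.ready);
  return allReady ? { kind: 'ready' } : { kind: 'waiting' };
}
```

A caller would poll the pod, pass its container statuses to `evaluatePod`, and stop only on `ready` or `failed`. A slow image pull (reason `ContainerCreating`) simply yields `waiting`.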

sleshchenko added the kind/task and area/chectl labels Sep 23, 2020

che-bot added the status/need-triage label Sep 23, 2020

l0rd added severity/P2 and removed status/need-triage labels Sep 23, 2020

tolusha mentioned this issue Sep 28, 2020

chectl must detect issues with Che Operator image #17937

Closed

23 tasks

tolusha added this to the 7.21 milestone Oct 1, 2020

l0rd added severity/P1 and removed severity/P2 labels Oct 1, 2020

This was referenced Oct 1, 2020

Che Deploy Sprint #191 #18011

Closed

More forgiving image pull behavior #18017

Closed

l0rd changed the title ~~[chectl] the default timeout for podwait is not enough big~~ Che deployment should not fail because pulling images took too long Oct 5, 2020

tolusha mentioned this issue Oct 7, 2020

feat: Detects issues with downloading images and starting containers che-incubator/chectl#908

Merged

9 tasks

tolusha closed this as completed Oct 21, 2020
