Che deployment should not fail because pulling images took too long #17942

Closed
sleshchenko opened this issue Sep 23, 2020 · 1 comment
Labels
area/chectl: Issues related to chectl, the CLI of Che
kind/task: Internal things, technical debt, and to-do tasks to be performed.
severity/P1: Has a major impact to usage or development of the system.
sleshchenko commented Sep 23, 2020

Is your task related to a problem? Please describe.

I'm not sure, but I think it makes sense to increase the pod wait timeout in chectl.
I'm raising this because when I deployed Che on CRC I got:

~/projects > chectl server:start --platform=crc
› Current Kubernetes context: 'default/api-crc-testing:6443/kube:admin'
....
  ❯ ✅  Post installation checklist
    ✔ PostgreSQL pod bootstrap
      ✔ scheduling...done.
      ✔ downloading images...done.
      ✔ starting...done.
    ❯ Keycloak pod bootstrap
      ✔ scheduling...done.
      ✔ downloading images...done.
      ✖ starting
        → ERR_TIMEOUT: Timeout set to pod ready timeout 130000
      Devfile registry pod bootstrap
      Plugin registry pod bootstrap
      Eclipse Che pod bootstrap
      Retrieving Eclipse Che server URL
      Eclipse Che status check
    Retrieving Keycloak admin credentials
    Retrieving Che self-signed CA certificate
 ›   Error: Error: ERR_TIMEOUT: Timeout set to pod ready timeout 130000
 ›   Installation failed, check logs in '/tmp/chectl-logs/1600863851534'

This is somewhat expected: Keycloak started in 120107 ms, which is less than the default 130000 ms timeout, but once image pulling time is added the total exceeds the timeout.

12:25:24,089 INFO  [org.jboss.modules] (CLI command executor) JBoss Modules version 1.9.0.Final
12:25:24,927 INFO  [org.jboss.msc] (CLI command executor) JBoss MSC version 1.4.5.Final
12:25:25,031 INFO  [org.jboss.threads] (CLI command executor) JBoss Threads version 2.3.3.Final
...
12:28:30,407 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server
12:28:30,422 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://127.0.0.1:9990/management
12:28:30,425 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990
12:28:30,426 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: Keycloak 6.0.1 (WildFly Core 8.0.0.Final) started in 120107ms - Started 617 of 880 services (563 services are lazy, passive or on-demand)

Describe the solution you'd like

The simplest fix would be to increase the default timeout value, but the alternatives below may be even better.

Describe alternatives you've considered

  1. Reset the timeout whenever the pod wait moves to the next step, so that pod scheduling, image pulling, and container startup each get their own timeout.
  2. Make the waiting process interactive by default: when the timeout is exceeded, ask the user whether they want to keep waiting or stop.
  • In my particular case I would keep waiting, because I need the Keycloak admin password, which is printed only once Che has fully started.
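
A rough sketch of how the two alternatives could fit together, for discussion only. Everything here is hypothetical: `PhaseTimer`, the phase names (borrowed from the chectl output above), and the `ask` callback are illustrative and are not taken from the chectl code base. The clock is injectable so the behaviour can be checked without real waiting.

```typescript
// Hypothetical sketch; none of these names come from chectl.
type Phase = 'scheduling' | 'downloading images' | 'starting';
type Decision = 'keep-waiting' | 'abort';

class PhaseTimer {
  private readonly timeoutMs: Record<Phase, number>;
  private readonly now: () => number;
  private phase: Phase;
  private deadline: number;

  constructor(timeoutMs: Record<Phase, number>, now: () => number = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.phase = 'scheduling';
    this.deadline = now() + timeoutMs[this.phase];
  }

  // Alternative 1: entering a new phase resets the deadline,
  // so each phase is measured against its own timeout.
  enter(phase: Phase): void {
    if (phase !== this.phase) {
      this.phase = phase;
      this.deadline = this.now() + this.timeoutMs[phase];
    }
  }

  expired(): boolean {
    return this.now() >= this.deadline;
  }

  // Alternative 2: on expiry, let the caller decide (for example by
  // prompting the user). Returns true if waiting should continue.
  onExpired(ask: (phase: Phase) => Decision): boolean {
    if (ask(this.phase) === 'abort') {
      return false;
    }
    // The user chose to keep waiting: grant the current phase a fresh timeout.
    this.deadline = this.now() + this.timeoutMs[this.phase];
    return true;
  }
}
```

With this shape, a slow image pull no longer eats into the time budget for container startup, and a user who needs the deployment to finish can extend the wait instead of restarting it.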

Additional context

@sleshchenko sleshchenko added kind/task Internal things, technical debt, and to-do tasks to be performed. area/chectl Issues related to chectl, the CLI of Che labels Sep 23, 2020
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Sep 23, 2020
@l0rd l0rd added severity/P2 Has a minor but important impact to the usage or development of the system. and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Sep 23, 2020
l0rd commented Sep 23, 2020

Another option is not to use a timeout at all and instead check for errors (in events, in the operator, in pod status). A timeout is frustrating when something fails (we wait for 2 minutes when we could have been notified immediately), and it is frustrating when everything went well but pulling images took too long (we tell the user that the deployment failed when it actually worked fine).
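
To make the idea concrete, here is a minimal sketch of a status check that reports failures as soon as they are visible instead of waiting for a timer. It is an illustration under assumptions, not chectl's implementation: the `PodStatus` and `ContainerStatus` interfaces are small local types modeled on a subset of the Kubernetes pod status fields, `checkPod` is a hypothetical helper, and treating the listed waiting reasons as fatal is a simplification (image pull errors can be transient, so a real check would probably tolerate a few retries).

```typescript
// Minimal local types modeled on a subset of the Kubernetes PodStatus fields.
interface ContainerStatus {
  name: string;
  ready: boolean;
  state?: { waiting?: { reason?: string; message?: string } };
}

interface PodStatus {
  phase?: string;
  containerStatuses?: ContainerStatus[];
}

type Verdict =
  | { kind: 'ready' }
  | { kind: 'progressing' }
  | { kind: 'failed'; reason: string };

// Assumption: these container waiting reasons are treated as failures here.
const FATAL_WAITING_REASONS = new Set([
  'ErrImagePull',
  'ImagePullBackOff',
  'CrashLoopBackOff',
  'CreateContainerConfigError',
  'InvalidImageName',
]);

// Hypothetical helper: classify a pod from its status alone, with no timer.
function checkPod(status: PodStatus): Verdict {
  if (status.phase === 'Failed') {
    return { kind: 'failed', reason: 'pod phase is Failed' };
  }
  const containers = status.containerStatuses ?? [];
  for (const c of containers) {
    const reason = c.state?.waiting?.reason;
    if (reason !== undefined && FATAL_WAITING_REASONS.has(reason)) {
      return { kind: 'failed', reason: `${c.name}: ${reason}` };
    }
  }
  // Ready only when at least one container is reported and all are ready.
  const allReady = containers.length > 0 && containers.every(c => c.ready);
  return allReady ? { kind: 'ready' } : { kind: 'progressing' };
}
```

A caller would poll or watch the pod and keep going while the verdict is `progressing`, so a slow but healthy image pull never turns into a reported failure, while a broken image reference is reported right away.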

@tolusha tolusha added this to the 7.21 milestone Oct 1, 2020
@l0rd l0rd added severity/P1 Has a major impact to usage or development of the system. and removed severity/P2 Has a minor but important impact to the usage or development of the system. labels Oct 1, 2020
@l0rd l0rd changed the title from "[chectl] the default timeout for podwait is not enough big" to "Che deployment should not fail because pulling images took too long" Oct 5, 2020
@tolusha tolusha closed this as completed Oct 21, 2020