postStartTimeout causes workspace to fail even when postStart command succeeds for some container images #1529

@rohanKanojia

Description

Note: This only happens with certain container images. If I change the container image to quay.io/wto/web-terminal-tooling:next, the workspace reaches the Running state.

When the DevWorkspaceOperatorConfig is configured with a config.workspace.postStartTimeout (e.g., 5m), a DevWorkspace with a postStart event referencing a command fails to start and enters the Failing phase. The pod for the workspace enters a CrashLoopBackOff state due to a FailedPostStartHook.

This issue does not occur if the postStartTimeout is removed from the configuration.

Example DevWorkspaceOperatorConfig snippet:

config:
  workspace:
    postStartTimeout: 5m

Here is the DevWorkspace I was trying to create. It has a very simple postStart hook that should succeed:

apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: working-post-start-ws
  annotations:
    controller.devfile.io/debug-start: "true"
spec:
  started: true
  template:
    components:
      - name: tools
        container:
          image: quay.io/wto/web-terminal-tooling:latest
          sourceMapping: /projects
          command: [ "tail" ]
          args: [ "-f", "/dev/null" ]
    commands:
      - id: failing-command
        exec:
          commandLine: |
            echo "Execuet poststart ls"
            ls -lt
          component: tools
    events:
      postStart:
        - failing-command

However, after it is created, the DevWorkspace ends up in this state:

NAMESPACE             NAME                    DEVWORKSPACE ID             PHASE     INFO
openshift-operators   working-post-start-ws   workspacef36328c1632b4957   Failing   Error creating DevWorkspace deployment: Detected unrecoverable event FailedPostStartHook: [postStart hook] failed with an unknown error (see pod events or container logs for more details)

NAME                                               READY   STATUS             RESTARTS      AGE
workspacef36328c1632b4957-8588d4d77-rpgkh         0/1     CrashLoopBackOff   5 (64s ago)   5m26s

# Rendered lifecycle.postStart in pod spec

image: quay.io/wto/web-terminal-tooling:latest
imagePullPolicy: Always
lifecycle:
  postStart:
    exec:
      command:
      - /bin/sh
      - -c
      - |
        {
          # This script block ensures its exit code is preserved
          # while its stdout and stderr are tee'd.
          _script_to_run() {

          export POSTSTART_TIMEOUT_DURATION="300"
          export POSTSTART_KILL_AFTER_DURATION="5"

          _TIMEOUT_COMMAND_PART=""
          _WAS_TIMEOUT_USED="false" # Use strings "true" or "false" for shell boolean

          if command -v timeout >/dev/null 2>&1; then
            echo "[postStart hook] Executing commands with timeout: ${POSTSTART_TIMEOUT_DURATION} seconds, kill after: ${POSTSTART_KILL_AFTER_DURATION} seconds" >&2
            _TIMEOUT_COMMAND_PART="timeout --preserve-status --kill-after=${POSTSTART_KILL_AFTER_DURATION} ${POSTSTART_TIMEOUT_DURATION}"
            _WAS_TIMEOUT_USED="true"
          else
            echo "[postStart hook] WARNING: 'timeout' utility not found. Executing commands without timeout." >&2
          fi

          # Execute the user's script
          ${_TIMEOUT_COMMAND_PART} /bin/sh -c 'set -e
          echo "Execuet poststart ls"
          ls -lt 
          '
          exit_code=$?

          # Check the exit code based on whether timeout was attempted
          if [ "$_WAS_TIMEOUT_USED" = "true" ]; then
            if [ $exit_code -eq 143 ]; then # 128 + 15 (SIGTERM)
              echo "[postStart hook] Commands terminated by SIGTERM (likely timed out after ${POSTSTART_TIMEOUT_DURATION}s). Exit code 143." >&2
            elif [ $exit_code -eq 137 ]; then # 128 + 9 (SIGKILL)
              echo "[postStart hook] Commands forcefully killed by SIGKILL (likely after --kill-after ${POSTSTART_KILL_AFTER_DURATION}s expired). Exit code 137." >&2
            elif [ $exit_code -ne 0 ]; then # Catches any other non-zero exit code
              echo "[postStart hook] Commands failed with exit code $exit_code." >&2
            else
              echo "[postStart hook] Commands completed successfully within the time limit." >&2
            fi
          else
            if [ $exit_code -ne 0 ]; then
              echo "[postStart hook] Commands failed with exit code $exit_code (no timeout)." >&2
            else
              echo "[postStart hook] Commands completed successfully (no timeout)." >&2
            fi
          fi

          exit $exit_code
          }
          _script_to_run
        } 1> >(tee -a "/tmp/poststart-stdout.txt") 2> >(tee -a "/tmp/poststart-stderr.txt" >&2)
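
The outer wrapper in the rendered hook redirects through `>(tee ...)`, which is process substitution, a feature that POSIX `sh` does not define. One thing that may be worth ruling out is whether `/bin/sh` in the failing images accepts that syntax. The following is a minimal diagnostic sketch, not a confirmed root cause. The function name `supports_process_substitution` is hypothetical, and the check would need to be run inside each container image.

```shell
#!/bin/sh
# Hypothetical helper: report whether a given shell parses the
# `> >(cmd)` redirection used by the rendered postStart wrapper.
# Assumption: a shell that cannot parse it exits with a non-zero status.
supports_process_substitution() {
  shell_path="${1:-/bin/sh}"
  # The snippet is single-quoted, so only the shell under test parses it.
  if "$shell_path" -c ': > >(cat)' >/dev/null 2>&1; then
    echo "$shell_path: process substitution accepted"
    return 0
  fi
  echo "$shell_path: process substitution rejected"
  return 1
}

# Example: check the same interpreter the hook is launched with.
supports_process_substitution /bin/sh || true
```

If the check reports "rejected" only in the images that fail, the wrapper syntax rather than the user command would be the more likely culprit.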

The issue appears to depend on the image used in the DevWorkspace spec. Here are a few observations:

  • quay.io/wto/web-terminal-tooling:next works
  • quay.io/wto/web-terminal-tooling:latest doesn't work
  • quay.io/devfile/universal-developer-image:latest works
  • quay.io/devfile/universal-developer-image:ubi8-latest doesn't work

I checked that the timeout utility is present in all of these images. I'm not 100% sure whether this is due to a configuration mistake on my side or an actual issue.
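
Presence of `timeout` does not by itself show that it accepts the options the generated hook passes (`--preserve-status` and `--kill-after`), since some implementations take a different option set. A small check along these lines, run inside each image, could confirm or rule that out. This is only a sketch: `timeout_accepts_hook_flags` is a hypothetical name, and the 1-second and 5-second values are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: verify that a `timeout` binary accepts the same
# options the rendered postStart hook uses.
timeout_accepts_hook_flags() {
  timeout_bin="${1:-timeout}"
  if ! command -v "$timeout_bin" >/dev/null 2>&1; then
    echo "timeout not found"
    return 2
  fi
  # Run a trivial command with the hook's flags. `true` exits 0 at once,
  # so a non-zero status points at option parsing, not at a real timeout.
  if "$timeout_bin" --preserve-status --kill-after=1 5 true >/dev/null 2>&1; then
    echo "flags accepted"
    return 0
  fi
  echo "flags rejected"
  return 1
}

# Example: check whichever `timeout` is first on PATH.
timeout_accepts_hook_flags || true
```

A "flags accepted" result in both working and failing images would suggest the difference lies elsewhere in the wrapper.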
