- 
                Notifications
    You must be signed in to change notification settings 
- Fork 68
feat (postStart) : Allow debugging poststart failures with sleep by trapping errors #1522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat (postStart) : Allow debugging poststart failures with sleep by trapping errors #1522
Conversation
| Skipping CI for Draft Pull Request. | 
| [APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rohanKanojia The full list of commands accepted by this bot can be found here. 
Needs approval from an approver in each of these files:
 Approvers can indicate their approval by writing  | 
5e1a317    to
    bfa0ab7      
    Compare
  
    | /ok-to-test | 
bfa0ab7    to
    ea21eb5      
    Compare
  
    | /ok-to-test | 
ea21eb5    to
    9559169      
    Compare
  
    | /ok-to-test | 
fd1df4a    to
    542c5ff      
    Compare
  
    | /ok-to-test | 
542c5ff    to
    9853a13      
    Compare
  
    | /ok-to-test | 
9853a13    to
    605efe4      
    Compare
  
    | /ok-to-test | 
605efe4    to
    ff5a0d9      
    Compare
  
    | /ok-to-test | 
| For some reasons my workspace is running (tested on OpenShift) oc get dw    -A
NAMESPACE   NAME                         DEVWORKSPACE ID             PHASE     INFO
test        failing-poststart-debug-dw   workspaced0882b8ed1fc4c69   Running   Workspace is running | 
|  | ||
| postStartDebugTrapSleepDuration := "" | ||
| if workspace.Annotations[constants.DevWorkspaceDebugStartAnnotation] == "true" { | ||
| postStartDebugTrapSleepDuration = workspace.Config.Workspace.ProgressTimeout | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I don't like using ProgressTimeout for this purpose.
But on the other hand I don't have another solution but some constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I use ProgressTimeout to be consistent with the behavior of the Debug annotation when it fails for the main component.
We do not scale down the failing workspace until the failing timeout is satisfied:
devworkspace-operator/controllers/workspace/devworkspace_controller.go
Lines 170 to 173 in b61eaed
| // If debug annotation is present, leave the deployment in place to let users | |
| // view logs. | |
| if workspace.Annotations[constants.DevWorkspaceDebugStartAnnotation] == "true" { | |
| if isTimeout, err := checkForFailingTimeout(workspace); err != nil { | 
Inside the checkForFailingTimeout, we're parsing ProgressTimeout:
| timeout, err := time.ParseDuration(workspace.Config.Workspace.ProgressTimeout) | 
        
          
                pkg/library/lifecycle/poststart.go
              
                Outdated
          
        
      | #!/bin/sh | ||
| %s | ||
| EOF | ||
| chmod +x /tmp/poststart.sh | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohanKanojia
Were you able to test this snippet?
I am not sure if chmod +x will work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really apologize for my mistake 🙏 . This seems to be a leftover from previous attempts.
I'll remove it.
| 
 @tolusha : Could you please share which OCP version you were using? I have tested it on CRC 2.53.0 with OpenShift 4.19.3. | 
ff5a0d9    to
    67661ea      
    Compare
  
    | @tolusha : I've created these videos based on OpenShift 4.20 via clusterbot Scenario 1 : No Poststart Timeout Configureddwo-debug-poststart-normal-scenario.mp4Scenario 2: PostStart Timeout Configureddwo-debug-poststart-poststart-timeout-configured.mp4 | 
67661ea    to
    99ad59e      
    Compare
  
    | There is a corner case. | 
…rapping errors Add an optional debug mechanism for postStart lifecycle hooks. When enabled via the `controller.devfile.io/debug-start: "true"` annotation, any failure in a postStart command results in the container sleeping for some seconds as per configured progressTimeout, allowing developers time to inspect the container state. - Added `enableDebugStart` parameter to poststart methods. - Injects `trap ... sleep` into postStart scripts when debug mode is enabled. - Includes support for both timeout-wrapped (`postStartTimeout`) and non-timeout lifecycle scripts. This feature improves debuggability of DevWorkspaces where postStart hooks fail and would otherwise cause container crash/restarts. Signed-off-by: Rohan Kumar <rohaan@redhat.com>
99ad59e    to
    58c9221      
    Compare
  
    
What does this PR do?
In #425
controller.devfile.io/debug-startannotation was added to aid in debugging failed devworkspaces: Debugging a failing workspaceWe extend the use case of this annotation so that any failure in a postStart command results in the container sleeping for a specified number of seconds, as per the configured
progressTimeout, allowing developers time to inspect the container state.enableDebugStartparameter to poststart methods.trap ... sleepinto postStart scripts when debug mode is enabled.postStartTimeout) and non-timeout lifecycle scripts.This feature improves the debuggability of DevWorkspaces where postStart hooks fail and would otherwise cause container crashes/restarts.
What issues does this PR fix or reference?
eclipse-che/che#23404
Is it tested? How?
With Changes
make docker && make installContainerCreatingphaseoc get dw NAME DEVWORKSPACE ID PHASE INFO failing-poststart-debug-dw workspace55bf350cfb754260 Starting Waiting for workspace deployment oc get pods NAME READY STATUS RESTARTS AGE workspace55bf350cfb754260-54749bf7c5-288vt 0/1 ContainerCreating 0 10s/tmp/poststart-stderr.txtto see root cause of failure:With PostStartTimeout Enabled
PostStart commands are processed slightly differently when
postStartTimeoutfield is enabled in DevWorkspaceOperatorConfig. You can verify the above flow after enabling it:Without Changes
With the current changes in the main, when we create a DevWorkspace with a failing post-start event. The pod immediately goes into
PostStartHookFailederror and thenCrashLoopbackOfferror. It doesn't allow execution into it to view failure:PR Checklist
/test v8-devworkspace-operator-e2e, v8-che-happy-pathto trigger)v8-devworkspace-operator-e2e: DevWorkspace e2e testv8-che-happy-path: Happy path for verification integration with Che