Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle failures in routing reconciler in recoverable way #1166

Merged
merged 1 commit into from
Aug 22, 2023

Conversation

amisevsk
Copy link
Collaborator

What does this PR do?

When a DevWorkspaceRouting fails to provision objects, it enters the 'Failed' phase. In order to prevent repeated unnecessary reconciles, the controller ignores DWRs that are already Failed, which leads to the 'Failed' phase being permanent.

To avoid this, we instead set a Stopped phase when the parent workspace is stopped. This will clear the Failed phase and any message that was set, so the DevWorkspace controller will need to propagate it into the DevWorkspace object to allow it to be shown.

What issues does this PR fix or reference?

Closes #923

Is it tested? How?

To test, follow the reproducer in #923 and verify this PR fixes the issue (allows failed workspaces to be restarted):

  1. Install DWO and the Eclipse Che operator in a cluster
  2. Create two CheCluster CRs in the cluster and wait for them to be set up. This will cause all routingClass: che DWRs to fail during reconcile.
  3. Create a DevWorkspace with routingClass: che. Workspace will fail to start
  4. Delete one of the two CheClusters. While new routingClass: che workspaces will start successfully, it's not possible to restart the failed workspace.

alternatively, this PR can be tested by patching the DevWorkspace controller pod to mark routing objects as failed instead of ready by replacing this block with

r.markRoutingFailed(instance, "testing failure")

in order to get a failed DevWorkspaceRouting. Then undo the patch and retest to ensure workspace can be restarted after failure.

In testing, when the DevWorkspaceRouting fails, the DevWorkspace should also be marked as failed, and the failure message on the DevWorkspace should include all applicable context from the DevWorkspaceRouting.

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

When a DevWorkspaceRouting fails to provision objects, it enters the
'Failed' phase. In order to prevent repeated unnecessary reconciles, the
controller ignores DWRs that are already Failed, which leads to the
'Failed' phase being permanent.

To avoid this, we instead set a Stopped phase when the parent workspace
is stopped. This will clear the Failed phase and any message that was
set, so the DevWorkspace controller will need to propagate it into the
DevWorkspace object to allow it to be shown.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
@amisevsk
Copy link
Collaborator Author

I've also checked that this PR doesn't break debug-start functionality; if the controller.devfile.io/debug-start: 'true' annotation is applied to the DevWorkspace, the DevWorkspaceRouting stays in the Failed state with the old status until the workspace is stopped or the annotation is removed.

@codecov
Copy link

codecov bot commented Aug 22, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.02% ⚠️

Comparison is base (65bdb9e) 52.53% compared to head (341c600) 52.51%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1166      +/-   ##
==========================================
- Coverage   52.53%   52.51%   -0.02%     
==========================================
  Files          82       82              
  Lines        7460     7468       +8     
==========================================
+ Hits         3919     3922       +3     
- Misses       3260     3266       +6     
+ Partials      281      280       -1     
Files Changed Coverage Δ
...workspacerouting/devworkspacerouting_controller.go 52.53% <0.00%> (-2.02%) ⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@AObuchow AObuchow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
I was unable to reproduce the original bug following the reproduer issue: Che's validating webhook prevents the creation of 2 Che Cluster CR's so I had to delete the webhook and quickly create another Che Cluster CR. Even then, the second Che Cluster CR wouldn't get fully set up (The dashboard deployment would endlessly be created and deleted), and creating a workspace with the che routing class wouldn't cause the workspace to fail.

However, I was able to patch the DWR as suggested in the second testing approach and got the workspace to fail with the appropriate message from the DWR controller:

NAME                 DEVWORKSPACE ID             PHASE    INFO
plain-devworkspace   workspace56a1660c61624a42   Failed   Failed to set up networking for workspace: testing failure

I undid the patch, set the workspace to started and it deployed successfully as expected.

@openshift-ci
Copy link

openshift-ci bot commented Aug 22, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, AObuchow

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@amisevsk amisevsk merged commit 05bdc38 into devfile:main Aug 22, 2023
6 of 8 checks passed
@amisevsk amisevsk deleted the routing-failure-handling branch August 22, 2023 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DevWorkspaceRouting can get stuck in failed state
2 participants