@AObuchow AObuchow commented Jun 2, 2022

What does this PR do?

Currently, in the reconcile loop, there is a bug with marking a workspace as having an error after the common PVC cleanup job fails.

From my understanding, after detecting that the cleanup job has failed and returning a ProvisionError from storageProvisioner.CleanupWorkspaceStorage(), the workspace's status phase is set to Error. However, the reconcile function checks for deleted workspaces before checking for workspaces with errors/failures. The existing logic in the finalize function (which is called when reconciling deleted workspaces) then overwrites the workspace's status phase (setting it to Terminating) and eventually runs storageProvisioner.CleanupWorkspaceStorage() again, leading to an endless loop.

My current fix simply modifies the finalize function to check if the workspace's status phase is set to Error, and if so, it does not overwrite the workspace's status phase and returns early.
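
The loop and the early-return guard described above can be sketched with a small, self-contained simulation. This is only an illustration under stated assumptions: `Workspace`, `Phase`, `cleaner`, `finalize` and `reconcileUntilSettled` are hypothetical stand-ins, not the operator's real types or signatures, and the cleanup job is assumed to fail on every attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is a hypothetical stand-in for the DevWorkspace status phase.
type Phase string

const (
	PhaseRunning     Phase = "Running"
	PhaseTerminating Phase = "Terminating"
	PhaseError       Phase = "Error"
)

// Workspace is a simplified, hypothetical model of a DevWorkspace.
type Workspace struct {
	Deleted bool
	Phase   Phase
	Message string
}

var errCleanupFailed = errors.New("DevWorkspace PVC cleanup job failed")

// cleaner simulates a storage provisioner whose cleanup job always fails.
type cleaner struct{ attempts int }

func (c *cleaner) cleanupStorage() error {
	c.attempts++
	return errCleanupFailed
}

// finalize models the finalizer logic. Without the guard it resets the phase
// to Terminating and retries cleanup on every pass. With the guard it returns
// early when the workspace is already in the Error phase.
// The return value reports whether another reconcile pass is triggered.
func finalize(ws *Workspace, c *cleaner, guard bool) bool {
	if guard && ws.Phase == PhaseError {
		return false // keep the Error phase, do not retry cleanup
	}
	ws.Phase = PhaseTerminating
	if err := c.cleanupStorage(); err != nil {
		ws.Phase = PhaseError
		ws.Message = err.Error()
		return true // the status change leads to another reconcile
	}
	return false
}

// reconcileUntilSettled runs reconcile passes until nothing requeues or
// maxPasses is reached, and returns the number of passes executed.
func reconcileUntilSettled(ws *Workspace, c *cleaner, guard bool, maxPasses int) int {
	passes := 0
	for passes < maxPasses {
		passes++
		// Deleted workspaces are handled first, before any error check.
		if !ws.Deleted || !finalize(ws, c, guard) {
			break
		}
	}
	return passes
}

func main() {
	for _, guard := range []bool{false, true} {
		ws := &Workspace{Deleted: true, Phase: PhaseRunning}
		c := &cleaner{}
		passes := reconcileUntilSettled(ws, c, guard, 10)
		fmt.Printf("guard=%v passes=%d cleanupAttempts=%d phase=%s\n",
			guard, passes, c.attempts, ws.Phase)
	}
}
```

In this model the unguarded run retries cleanup on every pass until the pass limit is hit, while the guarded run attempts cleanup once and then leaves the workspace in the Error phase.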

What issues does this PR fix or reference?

Fix #845

Is it tested? How?

  1. Start up DWO
  2. Create 2 workspaces that use the common PVC storage-class strategy
  3. Delete one of the workspaces so that the common PVC cleanup job is run, e.g. kubectl delete dw theia-next -n $NAMESPACE
  4. Wait for all the PVC cleanup job-related pods to fail
  5. DWO will now log an error similar to the following, only once:
{"level":"error","ts":1654116778.779755,"logger":"controllers.DevWorkspace","msg":"Failed to clean up DevWorkspace storage","Request.Namespace":"devworkspace-controller","Request.Name":"theia-next","devworkspace_id":"workspace10db2004baac400d","error":"DevWorkspace PVC cleanup job failed: see logs for job \"cleanup-workspace10db2004baac400d\" for details","stacktrace":"github.com/devfile/devworkspace-operator/controllers/workspace.(*DevWorkspaceReconciler).finalize\n\t/home/aobuchow/git/devworkspace-operator/controllers/workspace/finalize.go:64\ngithub.com/devfile/devworkspace-operator/controllers/workspace.(*DevWorkspaceReconciler).Reconcile\n\t/home/aobuchow/git/devworkspace-operator/controllers/workspace/devworkspace_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:214"}
  6. Confirm that the deleted workspace is in the Error phase by running kubectl get devworkspace -n $NAMESPACE:
NAME          DEVWORKSPACE ID             PHASE     INFO
theia-next    workspace10db2004baac400d   Error     DevWorkspace PVC cleanup job failed: see logs for job "cleanup-workspace10db2004baac400d" for details
theia-next2   workspace9e7aa15f7aa44c92   Running   https://workspace9e7aa15f7aa44c92-theia-3100.192.168.39.100.nip.io/

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

@AObuchow AObuchow requested review from amisevsk and ibuziuk as code owners June 2, 2022 05:42
if err != nil && !k8sErrors.IsConflict(err) {
	return reconcile.Result{}, err
}
if workspace.Status.Phase != dw.DevWorkspaceStatusError {

AObuchow commented

There are potentially other places where the workspace status phase may be set to Error that do not concern the common PVC cleanup job.

For example, finalizeServiceAccount() may set the workspace status phase to Error. Furthermore, it is possible that in the future, other unrelated conditions may set the workspace's status phase to Error.

Thus, it might be best to also check the workspace's condition message for "Failed to clean up DevWorkspace storage" in the finalize function (unless this bug also occurs with finalizeServiceAccount(), which may be likely? 🤔 )
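
A narrower guard along those lines could look roughly like the sketch below. It is hypothetical: `Status`, `Condition` and `storageCleanupFailed` are simplified illustrations rather than the operator's API, and it assumes the failure text actually appears in a condition message, which would need to be verified against the real status conditions.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed marker text, taken from the message quoted in this comment.
const storageCleanupFailedMsg = "Failed to clean up DevWorkspace storage"

// Condition and Status are simplified, hypothetical stand-ins for the
// DevWorkspace status types.
type Condition struct {
	Type    string
	Message string
}

type Status struct {
	Phase      string
	Conditions []Condition
}

// storageCleanupFailed reports whether the workspace is in the Error phase
// specifically because storage cleanup failed, rather than for another reason.
func storageCleanupFailed(s Status) bool {
	if s.Phase != "Error" {
		return false
	}
	for _, c := range s.Conditions {
		if strings.Contains(c.Message, storageCleanupFailedMsg) {
			return true
		}
	}
	return false
}

func main() {
	cleanupErr := Status{
		Phase: "Error",
		Conditions: []Condition{{
			Type:    "Example", // hypothetical condition type
			Message: storageCleanupFailedMsg + ": DevWorkspace PVC cleanup job failed",
		}},
	}
	otherErr := Status{
		Phase:      "Error",
		Conditions: []Condition{{Type: "Example", Message: "some unrelated failure"}},
	}
	fmt.Println(storageCleanupFailed(cleanupErr)) // storage cleanup error
	fmt.Println(storageCleanupFailed(otherErr))   // unrelated error
}
```

Matching on message text is brittle, so a dedicated condition type or reason would likely be a sturdier signal if one is available.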

Args: []string{
	"-c",
	fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
	"exit 1",

AObuchow commented

This commit/change needs to be removed before the PR can be merged. It's only here to facilitate testing.

@amisevsk amisevsk left a comment

LGTM.

This PR is pretty far behind main right now, and does not include the PVC cleanup changes -- I rebased on main while testing to verify how this works with "cleanup PVC on deletion".

Also, don't forget to remove the temporary patch before merging :)

dkwon17 commented Jun 2, 2022

I've tested the PR with the provided instructions, and it is working 👍

I see that:

{"level":"error","ts":1654116778.779755,"logger":"controllers.DevWorkspace","msg":"Failed to clean up DevWorkspace storage","Request.Namespace":"devworkspace-controller","Request.Name":"theia-next","devworkspace_id":"workspace10db2004baac400d","error":"DevWorkspace PVC cleanup job failed: see logs for job \"cleanup-workspace10db2004baac400d\" for details","stacktrace":"github.com/devfile/devworkspace-operator/controllers/workspace.(*DevWorkspaceReconciler).finalize\n\t/home/aobuchow/git/devworkspace-operator/controllers/workspace/finalize.go:64\ngithub.com/devfile/devworkspace-operator/controllers/workspace.(*DevWorkspaceReconciler).Reconcile\n\t/home/aobuchow/git/devworkspace-operator/controllers/workspace/devworkspace_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:214"}

is being printed quite often, but that is expected since new cleanup pods will start and will fail continuously for our test case, is that right?

AObuchow commented Jun 2, 2022

Thanks for testing @dkwon17 :)
That error is expected to be logged, but it should stop being logged after a while. Are you seeing it logged continuously, without end? If so, the original bug is still occurring.

@ibuziuk ibuziuk left a comment

LGTM, please do not forget to replace exit 1 before merging

openshift-ci bot commented Jun 3, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, AObuchow, ibuziuk

dkwon17 commented Jun 3, 2022

@AObuchow, sorry, I meant to post my comment (#851 (comment)) on your other PR, #846.

I just tested this PR and it is working for me: I see the

Failed to clean up DevWorkspace storage

log only once 👍

Fix devfile#845

Signed-off-by: Andrew Obuchowicz <aobuchow@redhat.com>
@AObuchow AObuchow force-pushed the mark_workspace_failed_when_pvc_cleanup_fails branch from 5203833 to 28a3599 Compare June 3, 2022 18:05
@openshift-ci openshift-ci bot removed the lgtm label Jun 3, 2022
openshift-ci bot commented Jun 3, 2022

New changes are detected. LGTM label has been removed.

AObuchow commented Jun 3, 2022

Awesome, thanks for confirmation @dkwon17 :)

@AObuchow AObuchow merged commit e4013f7 into devfile:main Jun 3, 2022
@AObuchow AObuchow deleted the mark_workspace_failed_when_pvc_cleanup_fails branch June 3, 2022 19:31

Successfully merging this pull request may close these issues.

DWO continues to reconcile workspace when common PVC cleanup job fails
