Description
Recently, DWO began watching PVC cleanup jobs for errors and reporting them as failures in workspace cleanup. However, a side effect of this detection is that DevWorkspaces can get stuck in a terminating state unnecessarily when a cleanup job encounters a transient error that later resolves:
- DevWorkspace is deleted, cleanup job is created
- Cleanup job encounters an error, workspace is set to Errored state
- Error in cleanup job is resolved, job runs successfully
- Finalizer is not cleared as we don't check errored workspaces
This is a significant issue because, unlike the DevWorkspace startup case (where a DevWorkspace can just be restarted), there's no way to clear the errored status from a DevWorkspace that is being deleted. As a result, users must check the cleanup job's status, notice that it completed successfully, and then remove the finalizer from the DevWorkspace manually.
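To make the sequence above concrete, here is a minimal Go sketch of the decision logic. It is not DWO's actual code: the `Workspace` and `JobState` types and the `reconcileCurrent`, `reconcileRecheck`, and `applyJobState` functions are all hypothetical names invented for illustration. `reconcileCurrent` models the behavior described in this issue (an errored workspace is never looked at again), while `reconcileRecheck` shows one possible alternative that keeps consulting the cleanup job's state.

```go
package main

import "fmt"

// JobState is a hypothetical, simplified view of a cleanup job's status.
type JobState int

const (
	JobRunning JobState = iota
	JobFailed
	JobSucceeded
)

// Workspace is a hypothetical stand-in for a DevWorkspace that is being deleted.
type Workspace struct {
	Errored      bool
	HasFinalizer bool
}

// applyJobState updates the workspace from the observed cleanup job state.
func applyJobState(ws *Workspace, job JobState) {
	switch job {
	case JobSucceeded:
		// Cleanup finished: clear any earlier error and release the finalizer.
		ws.Errored = false
		ws.HasFinalizer = false
	case JobFailed:
		ws.Errored = true
	}
}

// reconcileCurrent models the reported behavior: errored workspaces are skipped,
// so a later successful job run is never noticed.
func reconcileCurrent(ws *Workspace, job JobState) {
	if ws.Errored {
		return
	}
	applyJobState(ws, job)
}

// reconcileRecheck is one possible alternative (an assumption, not a confirmed fix):
// the job state is re-examined even when the workspace is already errored.
func reconcileRecheck(ws *Workspace, job JobState) {
	applyJobState(ws, job)
}

func main() {
	stuck := &Workspace{HasFinalizer: true}
	reconcileCurrent(stuck, JobFailed)    // transient error in the cleanup job
	reconcileCurrent(stuck, JobSucceeded) // job later succeeds, but is ignored
	fmt.Println("skip errored: finalizer present =", stuck.HasFinalizer)

	recovered := &Workspace{HasFinalizer: true}
	reconcileRecheck(recovered, JobFailed)
	reconcileRecheck(recovered, JobSucceeded)
	fmt.Println("recheck errored: finalizer present =", recovered.HasFinalizer)
}
```

In the first case the workspace keeps its finalizer and stays in a terminating state, which matches the manual workaround described above; in the second the finalizer is released once the job succeeds.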
How To Reproduce
Not easy to reproduce as it requires a transient error in the cluster, but the recent encounter was a few workspaces that were stuck terminating due to CreateContainerError errors in the cleanup job. This seems to have been due to some temporary issue on the cluster as all the jobs had been completed and event history had been cleared by the time it was noticed.