Handling of cleanup job errors should be improved #877

@amisevsk

Description

Recently, DWO began watching PVC cleanup jobs for errors and reporting them as workspace cleanup failures. A side effect of this detection is that DevWorkspaces can get stuck in a terminating state unnecessarily when a cleanup job encounters a transient error that later resolves:

  1. DevWorkspace is deleted, cleanup job is created
  2. Cleanup job encounters an error, workspace is set to Errored state
  3. Error in cleanup job is resolved, job runs successfully
  4. Finalizer is not cleared as we don't check errored workspaces
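
One way to avoid the sequence above is to let the cleanup job's current state decide the outcome, even when the workspace has already been marked as errored. The following is a minimal sketch of that decision, not DWO's actual reconcile code: `JobState`, `Action`, and `nextCleanupAction` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// JobState is a simplified, hypothetical summary of a cleanup job's status.
type JobState string

const (
	JobRunning   JobState = "Running"
	JobFailed    JobState = "Failed"
	JobSucceeded JobState = "Succeeded"
)

// Action is the hypothetical next step for the reconciler.
type Action string

const (
	Requeue          Action = "requeue"
	ReportAndRequeue Action = "report error and requeue"
	ClearFinalizer   Action = "clear finalizer"
)

// nextCleanupAction decides what to do for a DevWorkspace that is being
// deleted. The workspaceErrored flag is accepted but intentionally never
// short-circuits the decision: the job's current state always wins, so a
// job that recovers still leads to the finalizer being cleared.
func nextCleanupAction(workspaceErrored bool, job JobState) Action {
	switch job {
	case JobSucceeded:
		return ClearFinalizer
	case JobFailed:
		// Report the failure, but keep checking in case it is transient.
		return ReportAndRequeue
	default:
		return Requeue
	}
}

func main() {
	// Step 2: the job hits an error and the workspace gets marked errored.
	fmt.Println(nextCleanupAction(false, JobFailed))
	// Steps 3 and 4: the job later succeeds while the workspace is still errored.
	fmt.Println(nextCleanupAction(true, JobSucceeded))
}
```

The point of the sketch is only that an errored workspace keeps being re-evaluated. How the real controller tracks job status and requeues is not covered here.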

This is a significant issue because, unlike the DevWorkspace startup case (where a DevWorkspace can simply be restarted), there is no way to clear the errored status from a DevWorkspace. As a result, users must check the cleanup job's status, notice that it completed successfully, and then manually remove the finalizer from the DevWorkspace.
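
For reference, the manual workaround amounts to patching the DevWorkspace's `metadata.finalizers` list. The sketch below only builds a JSON merge patch body with one finalizer removed. It does not talk to a cluster, and the finalizer names in it are placeholders, not the names DWO actually uses; `finalizerRemovalPatch` is likewise a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metadataPatch and objectPatch model only the part of the object
// that the merge patch needs to touch.
type metadataPatch struct {
	Finalizers []string `json:"finalizers"`
}

type objectPatch struct {
	Metadata metadataPatch `json:"metadata"`
}

// finalizerRemovalPatch returns a JSON merge patch body that sets
// metadata.finalizers to the current list minus the given finalizer.
func finalizerRemovalPatch(current []string, remove string) ([]byte, error) {
	// Start non-nil so an empty result marshals as [] rather than null.
	kept := []string{}
	for _, f := range current {
		if f != remove {
			kept = append(kept, f)
		}
	}
	return json.Marshal(objectPatch{Metadata: metadataPatch{Finalizers: kept}})
}

func main() {
	// Placeholder finalizer names, assumed for illustration only.
	current := []string{"example.com/cleanup", "example.com/other"}
	patch, err := finalizerRemovalPatch(current, "example.com/cleanup")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

A body like this could be sent as a merge patch against the stuck DevWorkspace once the cleanup job is confirmed to have succeeded.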

How To Reproduce

This is not easy to reproduce, as it requires a transient error in the cluster. The most recent occurrence involved a few workspaces that were stuck terminating due to CreateContainerError errors in the cleanup job. This appears to have been caused by a temporary issue on the cluster: by the time it was noticed, all the jobs had completed and the event history had been cleared.

Additional context
