PVC clean up job is not stable #301

Closed
sleshchenko opened this issue Mar 5, 2021 · 1 comment · Fixed by #304
sleshchenko commented Mar 5, 2021

There are two issues with the PVC clean up job:

  1. Sometimes the created pod fails with:
Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-03-05T08:20:44Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"

and then continues to fail with:

Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-03-05T08:17:20Z" level=warning msg="Timed out while waiting for StopUnit(crio-c7b61b00b6956b61c4dd78c2a311df3bc0d52ae4f81725cfb3c571cb32fbd48b.scope) completion signal from dbus. Continuing..." time="2021-03-05T08:18:06Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: "

The same happens with quay.io/libpod/busybox:1.30.1 and registry.access.redhat.com/ubi8-minimal:8.3-291.
On my RHDPS OpenShift 4.6 or 4.7 it happens pretty often; 7 workspaces are stuck in the finalizing phase:

kc get pod
NAME                                      READY   STATUS                 RESTARTS   AGE
cleanup-workspace0626cde384894b68-c6gkm   0/1     CreateContainerError   0          6m19s
cleanup-workspace434f901860ac4833-hgzm5   0/1     CreateContainerError   0          5m47s
cleanup-workspace4b001b2ad6a54bfc-snzt8   0/1     CreateContainerError   0          6m
cleanup-workspace524f63636daa4156-7cc56   0/1     CreateContainerError   0          5m52s
cleanup-workspace52e229018fcd40dc-tzz65   0/1     CreateContainerError   0          6m9s
cleanup-workspace82a2427ff8174376-jrd5b   0/1     Error                  0          5m37s
cleanup-workspace82a2427ff8174376-xzfcb   0/1     CreateContainerError   0          49s
  2. The second issue: after I updated busybox to ubi, the Jobs were recreated, but the old busybox pods were left on the cluster as zombies without an ownerRef. Maybe we need to explicitly set the propagation policy when we remove the Job (see the sketch after this comment).
    ^ I'm not sure, but I may have had both controllers running (local and on-cluster), and that may have caused this failure.
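
For illustration, a minimal sketch of deleting the cleanup Job with an explicit propagation policy so the Job's pods are garbage collected rather than orphaned. This is not the operator's actual code; it assumes a controller-runtime client, and `deleteCleanupJob` and its arguments are made-up names.

```go
// Illustrative sketch only: assumes the operator deletes the cleanup Job via a
// controller-runtime client; deleteCleanupJob is a hypothetical helper name.
package cleanup

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func deleteCleanupJob(ctx context.Context, cl client.Client, namespace, name string) error {
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
	}
	// Without an explicit policy, some clients effectively orphan the Job's
	// pods (cf. kubernetes/kubernetes#71801). Background (or Foreground)
	// propagation asks the API server to delete the pods as dependents.
	return client.IgnoreNotFound(
		cl.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground)),
	)
}
```

Foreground propagation would instead keep the Job object around (via a finalizer) until its pods are gone, which could be useful if the controller needs to observe that the cleanup pods were actually removed.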

amisevsk commented Mar 5, 2021

Regarding issue 2, I think it may be the propagation policy that's the issue -- I think k8s had a similar issue in the past, judging by kubernetes/kubernetes#71801. I've added this to the linked PR.
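
If it helps when verifying the linked PR, here is a rough check (illustrative names, assuming the same controller-runtime client) that lists pods still carrying the Job controller's job-name label and reports any left without an ownerRef:

```go
// Illustrative helper, not operator code: findOrphanedCleanupPods is a made-up name.
package cleanup

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func findOrphanedCleanupPods(ctx context.Context, cl client.Client, namespace, jobName string) ([]corev1.Pod, error) {
	pods := &corev1.PodList{}
	// The Job controller labels its pods with "job-name", so leftovers from a
	// deleted cleanup Job can still be found by that label.
	if err := cl.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"job-name": jobName},
	); err != nil {
		return nil, fmt.Errorf("listing cleanup pods: %w", err)
	}
	var orphaned []corev1.Pod
	for _, p := range pods.Items {
		if len(p.OwnerReferences) == 0 {
			orphaned = append(orphaned, p)
		}
	}
	return orphaned, nil
}
```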

@sleshchenko sleshchenko added the sprint/current Is assigned to issues which are planned to work on in the current team sprint label Mar 10, 2021