
volume may leak when PVC is deleted before return from CreateVolume #4280

Closed
llamerada-jp opened this issue Nov 24, 2023 · 1 comment

@llamerada-jp

Describe the bug

If PVCs are created and deleted immediately, the backing volume may leak on rare occasions. This happens especially when a large number of PVCs are created and deleted repeatedly at the same time. From our investigation, it appears that if a PVC is deleted after the CSI driver's CreateVolume has been called but before the PV has been created, the external provisioner finishes its work without ever calling DeleteVolume. In that case ceph-csi completes CreateVolume and allocates the volume, DeleteVolume is never issued for it, and the volume leaks. This behavior cannot be fixed in ceph-csi alone, but I am reporting it because it is what we actually observed.
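For illustration only, here is a minimal sketch of how such a race can be triggered; it is not our actual reproducer. The PVC names are hypothetical, and the StorageClass name ceph-ssd-block is taken from the provisioner logs below.

#!/usr/bin/env bash
# Sketch only: create many PVCs at once, then delete them immediately so that
# some deletions race with the CSI driver's in-flight CreateVolume calls.
for i in $(seq 1 100); do
  kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: leak-test-$i   # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-ssd-block
  resources:
    requests:
      storage: 1Gi
EOF
done

# Delete the PVCs before the external provisioner has created the PVs.
for i in $(seq 1 100); do
  kubectl delete pvc "leak-test-$i" --wait=false
done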

Environment details

  • Image/version of Ceph CSI driver :
    cephcsi : 3.9.0
    csi-node-driver-registrar : 2.8.0
    csi-provisioner : 3.5.0
    csi-resizer : 1.8.0
    csi-attacher : 4.3.0
    csi-snapshotter : 6.2.2
  • Helm chart version : not used
  • Kernel version : 5.15.133-flatcar
  • Mounter used for mounting PVC : krbd
  • Kubernetes cluster version : 1.26.6
  • Ceph cluster version : 17.2.6

Steps to reproduce

  • Create and delete a large number of PVCs at the same time.
  • In our environment we use the CSI monitor pie, which creates and deletes PVCs at regular intervals.
    In particular, pie v0.4.1 and earlier behaves naively and creates the PVCs all at the same time, which seems to trigger this problem frequently.
  • To find leaked volumes that cannot be tied to any PV, I used the attached script (a rough sketch of this kind of check follows the attachment).

check_volume.sh.txt
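For reference, below is a rough sketch of the kind of check the attached script performs; it is not the attached script itself. The pool name and the rook-ceph-tools deployment come from the logs later in this report, and the csi-vol-<UUID> image prefix is ceph-csi's default volumeNamePrefix, which is an assumption if a custom prefix is configured.

#!/usr/bin/env bash
# Sketch only: find RBD images in the pool that no PV's volumeHandle refers to.
POOL=ceph-ssd-block-pool
TOOLS="kubectl exec -n ceph-ssd deploy/rook-ceph-tools --"

# volumeHandles of every PV known to Kubernetes; they embed the image UUID.
handles=$(kubectl get pv -o jsonpath='{range .items[*]}{.spec.csi.volumeHandle}{"\n"}{end}')

# ceph-csi names backing images csi-vol-<UUID> by default; an image whose UUID
# appears in no volumeHandle is a candidate leak.
$TOOLS rbd ls -p "$POOL" | sed -n 's/^csi-vol-//p' | while read -r uuid; do
  if grep -q "$uuid" <<<"$handles"; then
    echo "OBJ_SUB:$uuid exists?:true"
  else
    echo "OBJ_SUB:$uuid exists?:false"
  fi
done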

Actual results

The attached log contains the result. In our environment we found more than 900 leaked volumes; however, this is the accumulated result of creating and deleting more than 100 PVCs per minute for over a year.

volumes.log

Expected behavior

Either the volume should be deleted, or the PV should remain until the volume is deleted. A way to clean up already leaked volumes is also needed.
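Until such a cleanup mechanism exists, leaked volumes can in principle be removed by hand. The following is only a sketch under several assumptions: the UUID is the example from the logs below, the csi-vol-<UUID> and csi.volume.<UUID> names follow ceph-csi's default layout, and nothing should be deleted before confirming that no PV still references the UUID.

#!/usr/bin/env bash
# Sketch only: manual cleanup of one leaked volume. Double-check everything
# before running; the object names assume ceph-csi's default naming.
POOL=ceph-ssd-block-pool
UUID=b7c4886f-ac02-11ed-9570-4a1fea0f3475   # example UUID from the logs below
TOOLS="kubectl exec -n ceph-ssd deploy/rook-ceph-tools --"

# Abort if any PV still references this UUID in its volumeHandle.
kubectl get pv -o jsonpath='{range .items[*]}{.spec.csi.volumeHandle}{"\n"}{end}' \
  | grep -q "$UUID" && { echo "still referenced, aborting"; exit 1; }

$TOOLS rbd rm "$POOL/csi-vol-$UUID"            # the backing RBD image, if it was created
$TOOLS rados -p "$POOL" rm "csi.volume.$UUID"  # the per-volume journal object
# A stale key may also remain in the csi.volumes.default directory object; its
# key is derived from the original PV name, which is unknown for a leaked volume.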

Logs

Below are the logs and timestamps checked for the following record:

PV:pvc-0009ee96-ae41-4bc5-bfdd-c22ca4bed3bc OBJ_SUB:b7c4886f-ac02-11ed-9570-4a1fea0f3475 exists?:false

The stat of the volume's rados object:

$ kubectl exec -n ceph-ssd deploy/rook-ceph-tools -- rados stat -p ceph-ssd-block-pool csi.volume.b7c4886f-ac02-11ed-9570-4a1fea0f3475
ceph-ssd-block-pool/csi.volume.b7c4886f-ac02-11ed-9570-4a1fea0f3475 mtime 2023-02-14T00:58:39.000000+0000, size 0

The log of the provisioner.

# exp for the loki: {namespace="ceph-ssd", pod=~"csi-rbdplugin-provisioner-.*"} |= "pvc-0009ee96-ae41-4bc5-bfdd-c22ca4bed3bc"
2023-02-14 00:58:39.173	I0214 00:58:39.173591       1 controller.go:1442] provision "pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" class "ceph-ssd-block": volume "pvc-0009ee96-ae41-4bc5-bfdd-c22ca4bed3bc" provisioned

# We expect the following log, but it is not found.
delete "<pv name>": volume deleted

I looked for logs that would give the exact timestamps of the PVC creation and deletion, but I could not find them. The following are the events obtained from the API server.

2023-02-14 00:58:59.414	I0214 00:58:59.414584       1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:58:46.858	I0214 00:58:46.858311       1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:58:32.029	I0214 00:58:32.029091       1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:58:17.967	I0214 00:58:17.966951       1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:58:01.918	I0214 00:58:01.918627       1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:57:59.539	I0214 00:57:59.539162       1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:57:59.530	I0214 00:57:59.530222       1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"

Additional context

Related issue in the external provisioner: kubernetes-csi/external-provisioner#486
Similar issue in CephFS: #4045

@Rakshith-R
Contributor

Related issue in the external provisioner: kubernetes-csi/external-provisioner#486

Yes, this is a known issue.
Nothing can be done from the CephCSI side.
It needs to be handled in the external-provisioner sidecar.

@ceph ceph locked and limited conversation to collaborators Nov 24, 2023
@Rakshith-R Rakshith-R converted this issue into discussion #4281 Nov 24, 2023
