Describe the bug
If PVCs are created and then deleted immediately, the volume may leak on rare occasions. This happens especially when a large number of PVCs are created and deleted repeatedly at the same time. From our investigation, the external provisioner appears to finish processing without calling DeleteVolume when the PVC is deleted after the CSI driver's CreateVolume has been called but before the PV object is created. In that case ceph-csi may have carried the CreateVolume call through and allocated the volume, yet DeleteVolume is never issued, so the volume leaks. This behavior cannot be fixed in ceph-csi alone, but I am reporting what we were actually able to observe.
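To make the suspected race concrete, the sequence we infer from the observations is roughly:
1. The PVC is created and the external provisioner calls the CSI driver's CreateVolume.
2. The PVC is deleted while CreateVolume is still in flight, i.e. before the PV object has been created.
3. ceph-csi completes CreateVolume and allocates the volume (the RBD image).
4. The external provisioner notices the PVC is gone and finishes without creating the PV and without calling DeleteVolume.
5. The image now exists with nothing referencing it, so the normal deletion path can never reclaim it.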
Environment details
cephcsi : 3.9.0
csi-node-driver-registrar : 2.8.0
csi-provisioner : 3.5.0
csi-resizer : 1.8.0
csi-attacher : 4.3.0
csi-snapshotter : 6.2.2
Steps to reproduce
1. Create and delete a large number of PVCs at the same time. In our environment we use the CSI monitor pie, which repeatedly creates and deletes PVCs at regular intervals. pie v0.4.1 and earlier behaves naively and creates all of its PVCs at the same time, which seems to trigger this problem frequently. A minimal churn loop is sketched after this list.
2. Find the leaked volumes that are not tied to any PV, using the attached script.
check_volume.sh.txt
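For illustration, here is a hypothetical minimal churn loop in the spirit of what pie does; the StorageClass name, namespace, PVC size, and count are placeholders, not values from our environment:

```bash
#!/bin/bash
# Hypothetical churn loop: creates N PVCs concurrently, then deletes them
# immediately, so some deletions land while CreateVolume is still in flight.
# STORAGE_CLASS and NAMESPACE are placeholders; adjust for your cluster.
set -eu

STORAGE_CLASS="ceph-ssd-block"
NAMESPACE="default"
N=100

while true; do
  for i in $(seq 1 "$N"); do
    kubectl apply -n "$NAMESPACE" -f - <<EOF &
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: churn-$i
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: $STORAGE_CLASS
  resources:
    requests:
      storage: 1Gi
EOF
  done
  wait
  # Delete right away; --wait=false returns before any cleanup finishes.
  kubectl delete pvc -n "$NAMESPACE" $(seq -f "churn-%g" 1 "$N") --wait=false
done
```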
Actual results
Running the attached script produces the attached log. In our environment we found more than 900 leaked volumes; note, however, that this is the result of creating and deleting more than 100 PVCs per minute for over a year. The script follows roughly the logic sketched after the attached log.
volumes.log
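The attached check_volume.sh is authoritative; the core idea can be sketched as follows. The pool name is a placeholder, and the parsing assumes ceph-csi's usual csi-vol-&lt;uuid&gt; image naming with the same UUID at the end of the PV's spec.csi.volumeHandle:

```bash
#!/bin/bash
# Sketch: list RBD images that no PersistentVolume references.
# POOL is a placeholder; adjust for your cluster.
set -eu
POOL="ceph-ssd-pool"

# Image names derived from existing PVs (the UUID is the last five
# hyphen-separated fields of the volumeHandle).
kubectl get pv -o jsonpath='{range .items[*]}{.spec.csi.volumeHandle}{"\n"}{end}' \
  | awk -F- 'NF>=5 { print "csi-vol-" $(NF-4) "-" $(NF-3) "-" $(NF-2) "-" $(NF-1) "-" $NF }' \
  | sort -u > /tmp/pv_images

# All csi-vol images actually present in the pool.
rbd ls "$POOL" | grep '^csi-vol-' | sort -u > /tmp/pool_images

# Images present in the pool but referenced by no PV are leak candidates.
comm -23 /tmp/pool_images /tmp/pv_images
```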
Expected behavior
Either the volume should be deleted, or the PV should remain until the volume is deleted. In addition, a way to clean up already-leaked volumes is needed; a sketch of a manual cleanup follows.
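As a stopgap, a leaked image can be removed by hand. The following is only a sketch, under the assumption that ceph-csi's default journal layout (a csi.volumes.default omap object plus per-volume csi.volume.&lt;uuid&gt; objects) is in use; verify every object before deleting anything:

```bash
#!/bin/bash
# CAUTION: sketch only. POOL and UUID are placeholders; double-check each
# object before removal, since the journal layout may differ between versions.
set -eu
POOL="ceph-ssd-pool"
UUID="<uuid-of-the-leaked-image>"   # from the csi-vol-<uuid> image name

rbd rm "$POOL/csi-vol-$UUID"              # the leaked image itself
rados -p "$POOL" rm "csi.volume.$UUID"    # per-volume journal object, if present

# The csi.volumes.default omap may still contain a stale key pointing at
# $UUID; inspect it and remove that key, e.g.:
#   rados -p "$POOL" listomapvals csi.volumes.default
#   rados -p "$POOL" rmomapkey csi.volumes.default "csi.volume.<pv-name>"
```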
Logs
The timestamps below were checked against the following records:
- The status of the volume.
- The log of the provisioner.

# Loki query expression: {namespace="ceph-ssd", pod=~"csi-rbdplugin-provisioner-.*"} |= "pvc-0009ee96-ae41-4bc5-bfdd-c22ca4bed3bc"
2023-02-14 00:58:39.173 I0214 00:58:39.173591 1 controller.go:1442] provision "pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" class "ceph-ssd-block": volume "pvc-0009ee96-ae41-4bc5-bfdd-c22ca4bed3bc" provisioned

# We expect the following log line, but it never appears:
delete "<pv name>": volume deleted

I looked for logs showing the exact creation and deletion timestamps of the PVC, but could not find them. The following are the PVC's event logs obtained from the API server.

2023-02-14 00:58:59.414 I0214 00:58:59.414584 1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:58:46.858 I0214 00:58:46.858311 1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:58:32.029 I0214 00:58:32.029091 1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:58:17.967 I0214 00:58:17.966951 1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:58:01.918 I0214 00:58:01.918627 1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:57:59.539 I0214 00:57:59.539162 1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
2023-02-14 00:57:59.530 I0214 00:57:59.530222 1 event.go:294] "Event occurred" object="pie-system/pie-probe-10.69.1.141-ceph-ssd-block-9e378e-27937545-vkgld-genericvol" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ExternalProvisioning" message="waiting for a volume to be created, either by external provisioner \"ceph-ssd.rbd.csi.ceph.com\" or manually created by system administrator"
Additional context
Related issue in the external provisioner: kubernetes-csi/external-provisioner#486
Similar issue in CephFS: #4045