[RBD] parallel PVC creation on a newly created block pool will hang #2521
Comments
Hey @idryomov, a lot of users seem to be hitting this issue. Would you recommend adding the workaround? Will there be any side effects or reasons not to do this? Thanks!
How many reports of this exact issue did you get so far? I see you linked #2520 but it doesn't look related at first sight. I could be wrong but it feels like the combination of a fresh uninitialized pool and concurrent creation requests should make this rather unlikely to hit for a general user. I'm inclined to just visibly document this as a known issue and wait for the 16.2.7 release.
Quite a few are reported here.
@idryomov, I suspect we will get more such reports and, if I am not wrong, this is the first CephCSI release based on Pacific. Thanks!
I see two reports there.
This is not the first rook release based on pacific, right? The bug is present in all pacific releases (16.2.0 - 16.2.6). If we only got two reports in this time frame, I'm not sure it is worth rushing in a workaround instead of just waiting for 16.2.7. We don't have a set date for 16.2.7, so yes, it could take two months, but we could also expedite it.
This is the first CephCSI release based on Pacific.
Okay 👍. @ceph/ceph-csi-contributors, I think we can go ahead with the cephcsi workaround.
@idryomov, a couple of questions:
@Rakshith-R below is the workaround you mentioned in the issue.
Can't we just scale down the rbd provisioner pod and call `rbd pool init`?
only new pools as per my tests.
Yes, that works too, but it needs direct access to the ceph cluster to execute the commands, and it will still leave stale omap entries.
Executing a command on the ceph cluster is not an issue, and it's better than leaving the stale resources. If the pending PVC is not deleted, how will this leave a stale omap entry?
Okay, tested it. It will not leave any stale resources, and PVCs will go to the Bound state.
Only on the new (empty) and uninitialized pools. It can't occur once even a single image is created or the pool is initialized.
I don't think so. One possibility is to check if the pool's application is "rbd" but that is a bit wonky -- as part of initialization we set it to "rbd" but it is possible to set the application separately. OTOH initialization requests are more or less idempotent, so issuing them repeatedly should be safe.
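The application-tag check and the idempotent re-init mentioned above can be sketched like this. The pool name `replicapool` is a hypothetical example, and the sketch guards on the CLIs being present so it does nothing harmful outside a ceph cluster:

```shell
#!/bin/sh
# Hypothetical pool name -- substitute your own.
POOL=replicapool

# Only attempt cluster commands when the ceph/rbd CLIs are available.
if command -v ceph >/dev/null 2>&1 && command -v rbd >/dev/null 2>&1; then
    # An initialized pool reports the "rbd" application tag, e.g. { "rbd": {} }.
    # Note the tag can also be set separately, so this check is wonky on its own.
    ceph osd pool application get "$POOL"
    # "rbd pool init" is more or less idempotent, so re-issuing it is safe.
    rbd pool init "$POOL"
else
    echo "ceph/rbd CLI not found; skipping checks for pool: $POOL"
fi
```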
This is a librbd issue.
@idryomov Thanks for your reply. Until pacific 16.2.7 is released, is it good to switch back to the octopus release, since the issue is in librbd and doesn't exist in octopus? Or, as suggested earlier, would documenting it be good enough?
Technically the issue exists in octopus librbd too, but it is much harder to hit there, and rolling back to octopus should fix it.
Thanks, @idryomov, for your opinion. Much appreciated! It looks like whoever tries cephcsi with new pools might hit this issue, and most of the steps are automated: pool creation, PVC creation, etc. As it's much harder to hit this in octopus, instead of documenting it I'm more inclined towards using octopus as the base image for 3.4.1 and switching to the new 16.2.7 in cephcsi 3.4.2.
We cannot use ceph octopus as base image since |
:D Looks like we have another problem. Do we have any option left, other than documentation, to make life easier?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation. |
@Rakshith-R this can be closed? |
Yes, ceph 16.2.7 was released in December (https://github.com/ceph/ceph/releases/tag/v16.2.7); it contains the fix, and in turn cephcsi v3.5.0 will have the fix for this issue. Closing the issue for the above reason.
Dockerhub did not have the latest version of ceph; base images are now being pulled from quay.io.
Thanks for notifying on this one |
closed by #2798 |
Seems like we're hitting the parallel PVC creation bug in kubevirt/kubevirt: kubevirt/kubevirt#7783 ceph/ceph-csi#2521

Reproduced locally:

```bash
[root@hera04 kubevirt]# cat ~/manifests/2pvcs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-block-pvc-1
spec:
  volumeMode: Block
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 3Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-block-pvc-2
spec:
  volumeMode: Block
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 3Gi
[root@hera04 kubevirt]# ./cluster-up/kubectl.sh create -f ~/manifests/2pvcs.yaml
selecting docker as container runtime
persistentvolumeclaim/ceph-block-pvc-1 created
persistentvolumeclaim/ceph-block-pvc-2 created
[root@hera04 kubevirt]# ./cluster-up/kubectl.sh get pvc
selecting docker as container runtime
NAME               STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS      AGE
ceph-block-pvc-1   Pending                                      rook-ceph-block   2s
ceph-block-pvc-2   Pending                                      rook-ceph-block   2s
[root@hera04 kubevirt]# ./cluster-up/kubectl.sh get pvc
selecting docker as container runtime
NAME               STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS      AGE
ceph-block-pvc-1   Pending                                      rook-ceph-block   3s
ceph-block-pvc-2   Pending                                      rook-ceph-block   3s
[root@hera04 kubevirt]# ./cluster-up/kubectl.sh create -f ~/manifests/pvc.yaml
selecting docker as container runtime
persistentvolumeclaim/ceph-block-pvc created
[root@hera04 kubevirt]# ./cluster-up/kubectl.sh get pvc
selecting docker as container runtime
NAME               STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS      AGE
ceph-block-pvc     Pending                                      rook-ceph-block   2s
ceph-block-pvc-1   Pending                                      rook-ceph-block   66s
ceph-block-pvc-2   Pending                                      rook-ceph-block   66s
```

Signed-off-by: Alex Kalenyuk <akalenyu@redhat.com>
Describe the bug
The following bug in librbd causes parallel PVC creation requests on a newly created block pool to hang.
Ceph issue tracker : https://tracker.ceph.com/issues/52537
Ceph pacific backport pr with fix : ceph/ceph#43113
Environment details
Steps to reproduce
Actual results
Expected behavior
Updated Workaround (does not leave stale omap entries, thanks @Madhu-1)
rbd pool init <pool_name>
Run it directly on the cluster or from the CSI pods. After the above steps, parallel PVC creation requests should work fine.
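Assuming a Rook deployment with the usual toolbox pod, the updated workaround amounts to the single idempotent command above; a minimal sketch (the namespace, toolbox deployment, and pool names here are hypothetical examples, and the cluster command is skipped when `kubectl` is unavailable):

```shell
#!/bin/sh
NS=rook-ceph          # hypothetical namespace
POOL=replicapool      # hypothetical pool name

if command -v kubectl >/dev/null 2>&1; then
    # Run "rbd pool init" from the toolbox pod; it is safe to repeat.
    kubectl -n "$NS" exec deploy/rook-ceph-tools -- rbd pool init "$POOL"
else
    echo "kubectl not found; would run: rbd pool init $POOL"
fi
```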
Workaround (not recommended, will leave stale omap entries)
Delete ongoing PVC creation requests.
Restart csi rbdplugin provisioner pod
Run
rbd pool init <pool_name>
on the ceph cluster. After the above steps, parallel PVC creation requests should work fine.
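The full (not recommended) sequence above can be sketched as follows. All resource names (namespace, provisioner deployment, toolbox deployment, PVC names, pool) are hypothetical examples that depend on how ceph-csi was deployed, and the cluster commands are skipped when `kubectl` is unavailable:

```shell
#!/bin/sh
NS=rook-ceph                          # hypothetical namespace
PROVISIONER=csi-rbdplugin-provisioner # hypothetical deployment name
POOL=replicapool                      # hypothetical pool name

if command -v kubectl >/dev/null 2>&1; then
    # 1. Delete the PVCs whose creation is stuck (example PVC names).
    kubectl delete pvc ceph-block-pvc-1 ceph-block-pvc-2
    # 2. Restart the csi rbdplugin provisioner pods.
    kubectl -n "$NS" rollout restart deployment "$PROVISIONER"
    # 3. Initialize the pool on the ceph cluster (idempotent).
    kubectl -n "$NS" exec deploy/rook-ceph-tools -- rbd pool init "$POOL"
else
    echo "kubectl not found; steps: delete stuck PVCs, restart $PROVISIONER, rbd pool init $POOL"
fi
```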