PV deletion of cephfsplugin on PVC deletion fails #48

Closed
mkimuram opened this issue Jul 19, 2018 · 11 comments

Comments

@mkimuram
Contributor

If a PV is dynamically provisioned and has the Delete reclaim policy, it should be deleted when the PVC for it is deleted. However, when tested with kubernetes/kubernetes#66388, the deletion fails with the following error messages.
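
For context, the expectation being exercised is: deleting the PVC should make the external-provisioner call DeleteVolume on the plugin, because the bound PV carries the Delete reclaim policy. A minimal client-go sketch of that flow (the namespace, PVC name and kubeconfig path below are made up for illustration, not taken from the test):

```go
// Sketch only: check the reclaim policy of the bound PV, then delete the PVC
// and expect the PV (and the CephFS volume behind it) to be reclaimed.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // example path
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	pvc, err := cs.CoreV1().PersistentVolumeClaims("default").Get(context.TODO(), "csi-cephfs-pvc", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	pv, err := cs.CoreV1().PersistentVolumes().Get(context.TODO(), pvc.Spec.VolumeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("reclaim policy:", pv.Spec.PersistentVolumeReclaimPolicy) // expected: Delete

	// Deleting the PVC should trigger the external-provisioner to call
	// DeleteVolume on the CSI plugin and remove the PV.
	if err := cs.CoreV1().PersistentVolumeClaims("default").Delete(context.TODO(), "csi-cephfs-pvc", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
}
```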

[Error messages in cephfsplugin]

I0718 23:43:29.512515       1 identityserver-default.go:32] Using default GetPluginInnfo
I0718 23:43:29.512539       1 utils.go:102] GRPC response: name:"csi-cephfsplugin" vendor_version:"0.2.0"
I0718 23:43:29.521353       1 utils.go:96] GRPC call: /csi.v0.Controller/DeleteVolume
I0718 23:43:29.521424       1 utils.go:97] GRPC request: volume_id:"csi-cephfs-dyn-5ab6b2f6-8ae4-11e8-bb12-0242ac110004" controller_delete_secrets:<key:"adminID" value:"admin" > controller_delete_secrets:<key:"adminKey" value:"AQDRrKNVbEevChAAEmRC+pW/KBVHxa0w/POILA==" >
E0718 23:43:29.984564       1 controllerserver.go:171] failed to delete ceph user csi-user-5ab6b2f6-8ae4-11e8-bb12-0242ac110004: cephfs: ceph failed with following error: exit status 1
cephfs: ceph output: 2018-07-18 23:43:29.957718 7f4403592700  0 librados: client.admin authentication error (95) Operation not supported
[errno 95] error connecting to the cluster
I0718 23:43:38.727506       1 utils.go:96] GRPC call: /csi.v0.Identity/GetPluginCapabilities
I0718 23:43:38.727607       1 utils.go:97] GRPC request: 
I0718 23:43:38.727725       1 utils.go:102] GRPC response: capabilities:<service:<type:CONTROLLER_SERVICE > > 
I0718 23:43:38.728846       1 utils.go:96] GRPC call: /csi.v0.Controller/ControllerGetCapabilities
I0718 23:43:38.728904       1 utils.go:97] GRPC request: 
I0718 23:43:38.728991       1 controllerserver-default.go:82] Using default ControllerGetCapabilities
I0718 23:43:38.729007       1 utils.go:102] GRPC response: capabilities:<rpc:<type:CREATE_DELETE_VOLUME > > 
I0718 23:43:38.729940       1 utils.go:96] GRPC call: /csi.v0.Identity/GetPluginInfo
I0718 23:43:38.729974       1 utils.go:97] GRPC request: 
I0718 23:43:38.730034       1 identityserver-default.go:32] Using default GetPluginInnfo
I0718 23:43:38.730046       1 utils.go:102] GRPC response: name:"csi-cephfsplugin" vendor_version:"0.2.0" 
I0718 23:43:38.737483       1 utils.go:96] GRPC call: /csi.v0.Controller/DeleteVolume
I0718 23:43:38.737520       1 utils.go:97] GRPC request: volume_id:"csi-cephfs-dyn-5ab6b2f6-8ae4-11e8-bb12-0242ac110004" controller_delete_secrets:<key:"adminID" value:"admin" > controller_delete_secrets:<key:"adminKey" value:"AQDRrKNVbEevChAAEmRC+pW/KBVHxa0w/POILA==" > 
E0718 23:43:38.749523       1 controllerserver.go:166] failed to delete volume csi-cephfs-dyn-5ab6b2f6-8ae4-11e8-bb12-0242ac110004: cephfs: mount failed with following error: exit status 16
cephfs: mount output: mount error 16 = Device or resource busy
E0718 23:43:38.749679       1 utils.go:100] GRPC error: rpc error: code = Internal desc = cephfs: mount failed with following error: exit status 16
cephfs: mount output: mount error 16 = Device or resource busy
@rootfs
Member

rootfs commented Jul 19, 2018

this is the problem

librados: client.admin authentication error (95) Operation not supported
[errno 95] error connecting to the cluster

At the time the user is to be deleted, the volume should have already been deleted, confirming that the credentials and the cluster are working.

@mkimuram what ceph cluster are you using? Could you please get the ceph mon logs and see if there is any message about deleting the user?

@gman0 have you seen this error before?

@gman0
Contributor

gman0 commented Jul 19, 2018

@rootfs no. Indeed, ceph logs could help.

@mkimuram
Contributor Author

@gman0 @rootfs

what ceph cluster are you using?

I'm using the ceph docker image used in e2e testing
(created from https://github.com/kubernetes/kubernetes/tree/master/test/images/volumes-tester/rbd ).

could you please get the ceph mon logs and see if there is any message about deleting the user?

The logs were extracted as shown below. Please check them.
ceph_logs.txt

# kubectl exec -it ceph-server -n e2e-tests-csi-mock-plugin-6hphk -- find /var/log/ceph/ -name "*log"
/var/log/ceph/ceph-osd.1.log
/var/log/ceph/ceph-mon.a.log
/var/log/ceph/ceph-client.admin.log
/var/log/ceph/ceph.log
/var/log/ceph/ceph-osd.0.log
/var/log/ceph/ceph.audit.log
/var/log/ceph/ceph-mds.cephfs.log
# kubectl exec -it ceph-server -n e2e-tests-csi-mock-plugin-6hphk -- find /var/log/ceph/ -name "*log" -print -exec cat {} \; > /tmp/ceph_logs.txt

@rootfs
Member

rootfs commented Jul 19, 2018

there is no auth rm in ceph.audit.log; the last messages are:

2018-07-19 18:20:48.820491 mon.0 172.17.0.3:6789/0 78 : audit [INF] from='client.? 172.17.0.3:0/386396941' entity='client.admin' cmd=[{"prefix": "fs new", "data": "cephfs_data", "fs_name": "cephfs", "metadata": "cephfs_metadata"}]: dispatch
2018-07-19 18:20:51.959515 mon.0 172.17.0.3:6789/0 81 : audit [INF] from='client.? 172.17.0.3:0/386396941' entity='client.admin' cmd='[{"prefix": "fs new", "data": "cephfs_data", "fs_name": "cephfs", "metadata": "cephfs_metadata"}]': finished

@rootfs
Member

rootfs commented Jul 19, 2018

@mkimuram can you post the relevant e2e test console messages too? I wonder if the ceph cluster was (in the process of) being torn down before the PV was deleted in the background.

@rootfs
Member

rootfs commented Jul 19, 2018

@mkimuram this is the message I would look for to ensure the PV is deleted before the cluster is torn down.
https://github.com/mkimuram/kubernetes/blob/b461cb7acad0a0912fda198b8df46cd5ae452f0a/test/e2e/storage/volume_provisioning.go#L166
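
That wait is roughly equivalent to polling until the PV object is gone. A hedged sketch with client-go (the helper name, poll interval, PV name and kubeconfig path are made up; the real e2e test uses the framework's own helpers):

```go
// Rough approximation of "Waiting up to 20m0s for PersistentVolume pvc-xxx to
// get deleted": poll the API server until the PV can no longer be found.
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func waitForPVDeleted(cs kubernetes.Interface, pvName string, poll, timeout time.Duration) error {
	return wait.PollImmediate(poll, timeout, func() (bool, error) {
		_, err := cs.CoreV1().PersistentVolumes().Get(context.TODO(), pvName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // PV is gone, done waiting
		}
		if err != nil {
			return false, err // unexpected error, stop waiting
		}
		return false, nil // PV still exists, keep polling
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // example path
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	if err := waitForPVDeleted(cs, "pvc-5ab6b2f6-8ae4-11e8-bb12-0242ac110004", 5*time.Second, 20*time.Minute); err != nil {
		fmt.Println("PV was not deleted:", err)
	}
}
```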

@mkimuram
Contributor Author

@rootfs

I ran the e2e test again, saving the console log (e2e-console.txt), and waited until the "Waiting up to 20m0s for PersistentVolume pvc-xxx to get deleted" message showed up.
After that, I collected the logs (all_logs.txt) with the script below.

However, I still can't find auth rm.
This time, I also included the logs of csi-external-provisioner and cephfs-driver for the relevant run.
I hope this helps.

# cat getlog.sh 
#! /bin/bash -x
# Collects all logs relevant to the failed PV deletion.
# Requires $namespace (the e2e test namespace) and $cephfsnode (the cephfs node
# plugin pod name) to be exported beforehand, as in the usage below.
date
# Pod logs: ceph server, external-provisioner sidecar, cephfs driver.
kubectl logs ceph-server -n $namespace
kubectl logs csi-cephfs-controller-0 -n $namespace -c csi-external-provisioner
kubectl logs $cephfsnode -n $namespace -c cephfs-driver
# List and dump every ceph log file inside the ceph server pod.
kubectl exec -it ceph-server -n $namespace -- find /var/log/ceph/ -name "*log"
kubectl exec -it ceph-server -n $namespace -- find /var/log/ceph/ -name "*log" -print -exec cat {} \;

# export namespace=e2e-tests-csi-mock-plugin-9sbdt
# export cephfsnode=csi-cephfs-node-jkxxx
# ./getlog.sh > all_logs.txt 2>&1

@rootfs
Member

rootfs commented Jul 19, 2018

right, the timeline matches: in the CSI logs, the volume deletion happened before the cluster was torn down.

@mkimuram
Contributor Author

Sharing logs with debug messages:
all_logs5.txt
e2e-console5.txt

@rootfs
Member

rootfs commented Jul 23, 2018

This is what happened.

The e2e test just creates and deletes the volume; it doesn't publish the volume, so the volume is never mounted during the test.

However, the cephfs controller service doesn't create the ceph user; the node service creates it here. So by the time the volume is to be deleted by the controller service, the user doesn't exist.
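
Put differently, the flow looks roughly like this. This is only a self-contained sketch with placeholder helper names (createCephUser, deleteCephUser, etc.), not the actual ceph-csi v0.2.0 code:

```go
// Hypothetical illustration of the asymmetry: the user is created on the node
// side during publish, but deletion is attempted unconditionally on the
// controller side.
package main

import "fmt"

// Placeholder helpers standing in for the real driver functions.
func createVolume(volID string) error    { fmt.Println("create volume", volID); return nil }
func purgeVolume(volID string) error     { fmt.Println("purge volume", volID); return nil }
func mountVolume(volID string) error     { fmt.Println("mount volume", volID); return nil }
func createCephUser(volID string) error  { fmt.Println("create ceph user for", volID); return nil }
func deleteCephUser(volID string) error  { return fmt.Errorf("ceph user for %s was never created", volID) }

// Controller service: provisions the volume but does NOT create the ceph user.
func controllerCreateVolume(volID string) error {
	return createVolume(volID)
}

// Node service: the ceph user is created only when the volume is published.
// The create/delete-only e2e test never calls this.
func nodePublishVolume(volID string) error {
	if err := createCephUser(volID); err != nil {
		return err
	}
	return mountVolume(volID)
}

// Controller service: deletion assumes the user exists, which is not true for
// a volume that was created but never published.
func controllerDeleteVolume(volID string) error {
	if err := deleteCephUser(volID); err != nil {
		return err // this is where DeleteVolume errors out in the test
	}
	return purgeVolume(volID)
}

func main() {
	volID := "csi-cephfs-dyn-example"
	_ = controllerCreateVolume(volID)
	// nodePublishVolume(volID) is never called by the e2e test.
	if err := controllerDeleteVolume(volID); err != nil {
		fmt.Println("DeleteVolume failed:", err)
	}
}
```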

@gman0

@gman0
Contributor

gman0 commented Jul 25, 2018

@rootfs very nice find, thanks! I'll move this to NodeStageVolume/NodeUnstageVolume.
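
Sketched out, that change would look something like the following. Placeholder helpers again; this is only an illustration of the idea, not the patch that later landed in #62:

```go
// Hypothetical sketch: tie the ceph user's lifetime to staging/unstaging on
// the node, so the controller's DeleteVolume no longer has to remove a user
// that may never have existed.
package main

import "fmt"

// Placeholder helpers; the names are assumptions for illustration only.
func createCephUser(volID string) error { fmt.Println("create ceph user for", volID); return nil }
func deleteCephUser(volID string) error { fmt.Println("delete ceph user for", volID); return nil }
func stageMount(volID string) error     { fmt.Println("stage/mount", volID); return nil }
func unstageUnmount(volID string) error { fmt.Println("unstage/unmount", volID); return nil }

// NodeStageVolume: the user is created right before the volume is staged,
// so it only exists while the volume is actually in use on a node.
func nodeStageVolume(volID string) error {
	if err := createCephUser(volID); err != nil {
		return err
	}
	return stageMount(volID)
}

// NodeUnstageVolume: the same node removes the user again.
func nodeUnstageVolume(volID string) error {
	if err := unstageUnmount(volID); err != nil {
		return err
	}
	return deleteCephUser(volID)
}

func main() {
	volID := "csi-cephfs-dyn-example"
	_ = nodeStageVolume(volID)
	_ = nodeUnstageVolume(volID)
}
```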

mkimuram added a commit to mkimuram/kubernetes that referenced this issue Aug 7, 2018
rootfs closed this as completed in #62 on Aug 8, 2018
agarwal-mudit pushed a commit to agarwal-mudit/ceph-csi that referenced this issue Jul 1, 2021
Sync openshift/master to ceph-csi/devel
agarwal-mudit pushed a commit to agarwal-mudit/ceph-csi that referenced this issue Jan 27, 2022
Sync rhs:release-4.10 with rhs:devel
nixpanic added a commit that referenced this issue Mar 23, 2023
In case creating the CSIAddonsNode for a Pod fails, a retry is done.
There are several possible causes for such failures; these seem to be the
most common ones:

 - User cannot create resource "csiaddonsnodes" in API group "csiaddons.openshift.io"
 - no kind "CSIAddonsNode" is registered for the internal version of group "csiaddons.openshift.io"

In the above cases, either the RBAC is limiting access to the API, or
the CRD is not available in the Kubernetes cluster.

Updates: #48
Signed-off-by: Niels de Vos <ndevos@redhat.com>
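
As an aside, such a retry could be wrapped with client-go's retry helper. A hedged sketch, where createCSIAddonsNode is a placeholder and not the real kubernetes-csi-addons API:

```go
// Sketch: retry creating the CSIAddonsNode object with a default backoff,
// giving up immediately on RBAC denials since retrying will not help until
// permissions change.
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
)

// Placeholder for whatever call creates the CSIAddonsNode for the Pod; it can
// fail when RBAC denies access or when the CSIAddonsNode CRD is not installed.
func createCSIAddonsNode() error {
	return fmt.Errorf("placeholder: CRD not registered or RBAC forbids creation")
}

func main() {
	err := retry.OnError(retry.DefaultBackoff,
		func(err error) bool { return !apierrors.IsForbidden(err) }, // do not retry RBAC denials
		createCSIAddonsNode)
	if err != nil {
		fmt.Println("creating CSIAddonsNode failed:", err)
	}
}
```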