Transport endpoint is not connected when csi-s3 pod is restarted #153

Open
bbenlazreg opened this issue Jan 25, 2022 · 12 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments


bbenlazreg commented Jan 25, 2022

If for any reason the csi-s3 pod is restarted, the Pod that uses S3 volumes loses connectivity to the mount target and we get a "Transport endpoint is not connected" error.
The error is resolved if we restart the pod that uses the volume; this forces the csi-s3 pod to remount the volume.

I think that when csi-s3 restarts, it should check for existing volumes and remount them.

To reproduce this behaviour, just rollout restart the daemonset.
Could you please take a look?
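A rough sketch of what such a remount-on-restart pass could look like in the node plugin, in Go. This is not existing csi-s3 code: isStaleFuseMount, recoverMounts and the remount callback are hypothetical names used only to illustrate the idea.

```go
// Hypothetical sketch, not existing csi-s3 code: on node-plugin startup, scan
// the target paths the plugin knows about and remount any FUSE mounts that
// were orphaned when the previous csi-s3 process died.
package mounter

import (
	"errors"
	"log"
	"os"
	"syscall"
)

// isStaleFuseMount reports whether path is a mountpoint whose FUSE daemon is
// gone: stat() then fails with ENOTCONN ("transport endpoint is not connected").
func isStaleFuseMount(path string) bool {
	_, err := os.Stat(path)
	return errors.Is(err, syscall.ENOTCONN)
}

// recoverMounts walks the given target paths (e.g. persisted in the plugin's
// state directory) and remounts the stale ones. remount is a hypothetical
// callback that re-invokes the configured mounter (goofys/s3fs) for the volume.
func recoverMounts(targetPaths []string, remount func(path string) error) {
	for _, p := range targetPaths {
		if !isStaleFuseMount(p) {
			continue
		}
		// Lazily detach the dead mount first, then mount again.
		if err := syscall.Unmount(p, syscall.MNT_DETACH); err != nil {
			log.Printf("lazy unmount of %s failed: %v", p, err)
			continue
		}
		if err := remount(p); err != nil {
			log.Printf("remount of %s failed: %v", p, err)
		}
	}
}
```

The plugin would also need to persist, or rediscover from /proc/mounts, which target paths it owns, since that state is lost when the container restarts.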

@raj-katonic

Facing the same issue

@raj-katonic

Rolling out the daemonsets and the dataset-operator in the dlf namespace together fixed this issue for me.


bbenlazreg commented Feb 8, 2022

Actually, restarting the operator did not fix the issue for me; the only thing that fixes it is restarting the pod that uses the PVC created by the dataset operator. It would be better if the mount were reconciled when the daemonset or operator restarts, otherwise connectivity will be lost on all pods every time we update the CSI provider to a new version.

PS: the issue happens with both the goofys and s3fs mounters

Scenario to reproduce:
1- Create an S3 dataset
2- Create a Pod mounting the PVC created by the dataset
3- Restart the csi-s3 daemonset
==> Transport endpoint is not connected

attacher logs:

I0208 14:55:53.304459 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolume total 0 items received
I0208 14:59:03.311025 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.VolumeAttachment total 0 items received
I0208 15:02:23.307740 1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolume total 0 items received
I0208 15:04:48.294177 1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
I0208 15:04:48.294421 1 controller.go:208] Started VA processing "csi-0cd1a70398bbe7c6ed68a5ed04b9fa487d8ace466600da1be96e21d78b656b6d"
I0208 15:04:48.294433 1 controller.go:223] Skipping VolumeAttachment csi-0cd1a70398bbe7c6ed68a5ed04b9fa487d8ace466600da1be96e21d78b656b6d for attacher blockvolume.csi.oraclecloud.com
I0208 15:04:48.294192 1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
I0208 15:04:48.294462 1 controller.go:208] Started VA processing "csi-463b3945e8dd840a016d75511db98296afbaba07fbdc54f71d60f3c448afcbde"
I0208 15:04:48.294467 1 controller.go:223] Skipping VolumeAttachment csi-463b3945e8dd840a016d75511db98296afbaba07fbdc54f71d60f3c448afcbde for attacher blockvolume.csi.oraclecloud.com
I0208 15:04:48.294454 1 controller.go:208] Started VA processing "csi-20b7e9e5d47a9eb0c1a350d27f8c7e27c04de6d83e95128189c8eafd0a923fe5"
I0208 15:04:48.294490 1 controller.go:208] Started VA processing "csi-27f260c4f75284c142c5f33aaa4d8ea8a985e82301bb80cfd76b31e8d9433db9"

Can someone please take a look at this?


srikumar003 (Collaborator) commented Feb 14, 2022

Verified that this problem exists. To solve this, the CSI-S3 driver would need to be extended to support LIST_VOLUMES and LIST_VOLUMES_PUBLISHED_NODES so that the external attacher can periodically re-sync the volumes. A better option would be to support the external health monitor, but this may involve changing dependencies to K8s 1.22+ (see #156) as well as extending the driver.

This will be a sizeable development, so I'm not sure about the timelines yet.
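For reference, a minimal sketch (not Datashim code) of what advertising those two capabilities might look like in the driver's controller server, using the standard CSI Go bindings. The ListVolumes body, i.e. enumerating the volumes and the nodes they are published on, is the real work and is only stubbed here.

```go
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// controllerServer stands in for csi-s3's existing controller server type.
type controllerServer struct{}

// ControllerGetCapabilities would additionally advertise LIST_VOLUMES and
// LIST_VOLUMES_PUBLISHED_NODES so that the external-attacher can re-sync.
func (cs *controllerServer) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error) {
	newCap := func(c csi.ControllerServiceCapability_RPC_Type) *csi.ControllerServiceCapability {
		return &csi.ControllerServiceCapability{
			Type: &csi.ControllerServiceCapability_Rpc{
				Rpc: &csi.ControllerServiceCapability_RPC{Type: c},
			},
		}
	}
	return &csi.ControllerGetCapabilitiesResponse{
		Capabilities: []*csi.ControllerServiceCapability{
			newCap(csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME),
			newCap(csi.ControllerServiceCapability_RPC_LIST_VOLUMES),
			newCap(csi.ControllerServiceCapability_RPC_LIST_VOLUMES_PUBLISHED_NODES),
		},
	}, nil
}

// ListVolumes would report, per volume, the node IDs it is published on, so
// the attacher can spot attachments that no longer exist on a node.
func (cs *controllerServer) ListVolumes(ctx context.Context, req *csi.ListVolumesRequest) (*csi.ListVolumesResponse, error) {
	entries := []*csi.ListVolumesResponse_Entry{
		// One entry per tracked volume, e.g. (hypothetical):
		// {
		// 	Volume: &csi.Volume{VolumeId: "pvc-example"},
		// 	Status: &csi.ListVolumesResponse_VolumeStatus{
		// 		PublishedNodeIds: []string{"node-1"},
		// 	},
		// },
	}
	return &csi.ListVolumesResponse{Entries: entries}, nil
}
```

As the later comments note, the external-attacher maintainers consider this re-sync path insufficient on its own, so treat this as background rather than a fix.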

srikumar003 added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on May 2, 2022

nikhil-das-katonic commented Aug 11, 2022

Tried adding an extra argument --reconcile-sync=10s to the csi-attacher-s3 StatefulSet. This resolved the issue to some degree, though it still comes up when writing a large number of files consecutively to the same bucket (PVC).


vitalif commented Sep 19, 2022

> To solve this, the CSI-S3 driver would need to be extended to support LIST_VOLUMES and LIST_VOLUMES_PUBLISHED_NODES so that the external attacher can periodically re-sync the volumes

RPC_LIST_VOLUMES_PUBLISHED_NODES is officially not a solution :-) kubernetes-csi/external-attacher#374 (comment)

@srikumar003

@vitalif Thanks for researching this issue, though the answer is disappointing :-)

CSI-S3 (at least Datashim's fork) uses Bidirectional mount propagation, which has caused some issues, such as the need for privileged containers (#139), and is preventing full support for ephemeral volumes (#164). Unfortunately, we haven't been able to find a way around it yet.

If you do have a workaround, I'll be happy to look into it.
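For context on why the two are coupled: kubelet only allows Bidirectional mount propagation on privileged containers. Below is an illustrative (not verbatim) fragment of a node-plugin container spec, expressed with the k8s.io/api/core/v1 Go types and assuming the usual kubelet pods directory.

```go
package example

import corev1 "k8s.io/api/core/v1"

// Illustration only: the part of a csi-s3 node-plugin container spec that ties
// Bidirectional mount propagation to a privileged container. Kubernetes
// rejects Bidirectional propagation on non-privileged containers, which is
// why #139 could not simply drop the privilege.
func nodePluginContainerFragment() corev1.Container {
	privileged := true
	propagation := corev1.MountPropagationBidirectional
	return corev1.Container{
		Name: "csi-s3",
		SecurityContext: &corev1.SecurityContext{
			Privileged: &privileged,
		},
		VolumeMounts: []corev1.VolumeMount{
			{
				// FUSE mounts created here must propagate back to the host so
				// that application pods (via kubelet) can see them.
				Name:             "mountpoint-dir",
				MountPath:        "/var/lib/kubelet/pods",
				MountPropagation: &propagation,
			},
		},
	}
}
```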

@rrehman-hbk

Any update on when this will be resolved? We are also facing this issue, very frequently: we are mounting S3 into 5-6 pods, and whenever we do some reads or writes we get this error and have to restart.

@srikumar003

@rrehman-hbk Could I ask under what conditions you are getting errors for reads/writes from S3 buckets? This is a different problem from the one above. If you can create an issue and post the logs from your csi-s3 pods there, I could take a look at them.

@rrehman-hbk

@srikumar003 I raised a separate issue: #324

@paullryan

Also of note in this case: for me, if a livenessProbe kills the container, it cannot just pick up where it left off; the whole pod must be destroyed. The CSI-S3 daemon reports the following:

I0211 18:58:01.984006       1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
I0211 18:58:03.164542       1 utils.go:97] GRPC call: /csi.v1.Controller/DeleteVolume
I0211 18:58:03.164743       1 utils.go:98] GRPC request: {"volume_id":"pvc-4c8d7779-a5cc-4d29-897b-f94e9ab6ca9b"}
I0211 18:58:03.164950       1 controllerserver.go:131] Deleting volume pvc-4c8d7779-a5cc-4d29-897b-f94e9ab6ca9b
E0211 18:58:03.165086       1 utils.go:101] GRPC error: failed to initialize S3 client: Endpoint:  does not follow ip address or domain name standards.

If the pod is subsequently restarted, the mount succeeds and all is fine again.


4F2E4A2E commented Mar 4, 2024

+1
