
Unresolvable Multi-Attach error, after resizing node pools. #111

Closed
feluxe opened this issue Nov 22, 2018 · 12 comments
feluxe (Contributor) commented Nov 22, 2018

What did you do?

I created a digitalocean (preview) cluster via the DO web interface. In the cluster I installed an app that uses an existing block storage volume with a PV/PVC config similar to the one in the example pod-single-existing-volume.

So far it worked fine.

Then I resized the node pools (deleted the current pool entirely, added a new pool) using the DO web interface. In doing so, the node on which the app was running, and to which the volume was attached, was deleted. K8s automatically tried to start the pod on another node, but this isn't working. Pod creation fails with this error:

Multi-Attach error for volume "nexus-mw-sonatype-nexus-data " Volume is already exclusively attached to one node and can't be attached to another

If I delete the app, everything (including PV/PVC) is removed from the cluster, like it should. Even the volume is listed as not being attached to any node in the DO web interface. But when I reinstall the app I keep getting the error, even though the PV is shown as being bound:

NAME                           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                             STORAGECLASS                   REASON   AGE

nexus-mw-sonatype-nexus-data   10Gi       RWO            Retain           Bound    mw/nexus-mw-sonatype-nexus-data   nexus-mw-sonatype-nexus-data            13m

At this point I'm stuck. I can't find a way to attach the volume to the pod.

What did you expect to happen?

After the app (including PV/PVC) was removed from the cluster and the volume was detached, csi-digitalocean should be able to attach the volume to a node again.

Steps to reproduce:

  • Create a k8s cluster via the DO web-interface with one node pool.
  • Deploy an app that uses an existing DO block storage volume that is configured like in the example: pod-single-existing-volume.
  • Wait until the app runs stably with the volume attached to it properly.
  • Create a new node pool for the cluster and delete the old node pool entirely (via the DO web-interface).
  • Now you are left with a block storage volume that cannot be used with your cluster, because csi-digitalocean seems to think it's still attached to the deleted node.
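For reference, the PV/PVC pair for an existing volume in the repro above looks roughly like this (a sketch modeled on the pod-single-existing-volume example; the names, storage class, and volume ID are placeholders, and the driver name dobs.csi.digitalocean.com is an assumption based on what recent releases of this plugin register). Note the ReadWriteOnce access mode, which is why the volume may only ever be attached to a single node at a time:

```yaml
# Sketch only -- <do-volume-id> is the ID of the pre-existing DO block
# storage volume; names and storage class are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nexus-mw-sonatype-nexus-data
spec:
  storageClassName: do-block-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce   # RWO: attachable to exactly one node
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: dobs.csi.digitalocean.com
    fsType: ext4
    volumeHandle: <do-volume-id>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nexus-mw-sonatype-nexus-data
spec:
  storageClassName: do-block-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```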

Configuration:

  • CSI Version: v0.3.1
  • Kubernetes Version: 12.1
  • Cloud provider/framework version: Digitalocean Kubernetes (Preview)
naarani commented Nov 24, 2018

I'm having a similar problem after resizing (I had CSI 0.2.0 and updated to 0.3.1 manually with kubectl edit), as in #32:
Multi-Attach error for volume "pvc-3c87f7c1d47c11e8" Volume is already exclusively attached to one node and can't be attached to another

Running these commands gives me no logs :-/
kubectl logs -l app=csi-provisioner-doplugin -c digitalocean-csi-plugin -n kube-system
kubectl logs -l app=csi-attacher-doplugin -c digitalocean-csi-plugin -n kube-system
kubectl logs -l app=csi-doplugin -c csi-doplugin -n kube-system

PS: I could attach the block storage via the console to another pod, copy the data, and unmount it.

fatih (Contributor) commented Nov 27, 2018

Thanks folks for the reports. There were a couple of issues with attachment to dead nodes that were fixed in the latest CSI releases (I released v0.4.0 yesterday). I'll try to reproduce this today to see how things stand in this area and whether there is any bug on our side. I'll keep you informed.

radek-baczynski commented Dec 11, 2018

I have the same problem with this Helm chart: https://github.com/helm/charts/tree/master/stable/traefik
After installing, Traefik works, but when I add a new domain to acme.domains.domainsList and run helm upgrade with the new config I get:

Multi-Attach error for volume "pvc-755581bc-fbde-11e8-b7a2-be5e9cc268b5" Volume is already used by pod(s) traefik-thingie-6db8884ccc-qjsn4

Azuka commented Dec 12, 2018

@radek-baczynski, do you want to try deleting the old running pod, or setting your .spec.strategy (i.e. the Helm value deploymentStrategy) to Recreate?

That's what I'm using with traefik.
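For anyone finding this later, the suggestion above corresponds to the following Deployment fragment (a sketch; with the Traefik Helm chart the deploymentStrategy value renders into this field):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traefik   # placeholder name
spec:
  # Recreate terminates the old pod (which releases the RWO volume)
  # before starting the new one, so the new pod never races the old
  # one for the attachment.
  strategy:
    type: Recreate
  # ...rest of the Deployment spec unchanged...
```

With the default RollingUpdate strategy, the new pod is scheduled while the old one still holds the volume, which opens exactly the Multi-Attach window described above.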

murdav commented Dec 19, 2018

Did you try to patch the finalizers?

e.g. (where <pvc-name> and <pv-name> are the names of the stuck objects):
kubectl patch pvc <pvc-name> -p '{"metadata":{"finalizers":null}}'
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'

and then delete them. That's how I solved it.
Also rook/rook#1488

fatih (Contributor) commented Jan 3, 2019

I just tried to reproduce it and was able to see the issue. Unfortunately, this is not a csi-digitalocean problem. The driver never gets an attach/detach command, and if you check the logs you'll see nothing. It seems like the CSI sub-system is not properly handling these kinds of issues. I'm checking what we can do here, or at least whether there is anything I can do.

These are the errors I see from the POD:

Events:
  Type     Reason              Age    From                           Message
  ----     ------              ----   ----                           -------
  Normal   Scheduled           3m22s  default-scheduler              Successfully assigned default/my-csi-app-555bfcb94d-nmchm to cranky-brattain-83pu
  Warning  FailedAttachVolume  3m22s  attachdetach-controller        Multi-Attach error for volume "volume-nyc1-01" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount         78s    kubelet, cranky-brattain-83pu  Unable to mount volumes for pod "my-csi-app-555bfcb94d-nmchm_default(c396a076-0f68-11e9-acf1-82ed94d9aecc)": timeout expired waiting for volumes to attach or mount for pod "default"/"my-csi-app-555bfcb94d-nmchm". list of unmounted volumes=[my-do-volume]. list of unattached volumes=[my-do-volume default-token-l9nfd]

I'll comment more on this.

radek-baczynski commented Jan 3, 2019

Changing deploymentStrategy to Recreate helped. Thanks!

fatih (Contributor) commented Jan 4, 2019

@radek-baczynski I just tried with "Recreate" and it seems to be working well. But I had to wait a couple of minutes so the scheduler had time to resolve the issue:

Events:
  Type     Reason                  Age                  From                           Message
  ----     ------                  ----                 ----                           -------
  Normal   Scheduled               6m13s                default-scheduler              Successfully assigned default/my-csi-app-555bfcb94d-ws584 to cranky-brattain-83pu
  Warning  FailedAttachVolume      6m12s                attachdetach-controller        Multi-Attach error for volume "volume-nyc1-01" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount             112s (x2 over 4m9s)  kubelet, cranky-brattain-83pu  Unable to mount volumes for pod "my-csi-app-555bfcb94d-ws584_default(f656145f-1016-11e9-acf1-82ed94d9aecc)": timeout expired waiting for volumes to attach or mount for pod "default"/"my-csi-app-555bfcb94d-ws584". list of unmounted volumes=[my-do-volume]. list of unattached volumes=[my-do-volume default-token-l9nfd]
  Normal   SuccessfulAttachVolume  8s                   attachdetach-controller        AttachVolume.Attach succeeded for volume "volume-nyc1-01"

I wonder if this is the same for rolling-update strategy as well. I'll post my findings here.

fatih (Contributor) commented Jan 4, 2019

Alright, I tested it with the RollingUpdate strategy as well and can confirm that it works! Here are the events for the pod:

Events:
  Type     Reason                  Age                   From                           Message
  ----     ------                  ----                  ----                           -------
  Normal   Scheduled               6m23s                 default-scheduler              Successfully assigned default/my-csi-app-555bfcb94d-7l9x5 to cranky-brattain-83pu
  Warning  FailedAttachVolume      6m23s                 attachdetach-controller        Multi-Attach error for volume "volume-nyc1-01" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount             2m4s (x2 over 4m20s)  kubelet, cranky-brattain-83pu  Unable to mount volumes for pod "my-csi-app-555bfcb94d-7l9x5_default(341f5f91-101d-11e9-acf1-82ed94d9aecc)": timeout expired waiting for volumes to attach or mount for pod "default"/"my-csi-app-555bfcb94d-7l9x5". list of unmounted volumes=[my-do-volume]. list of unattached volumes=[my-do-volume default-token-l9nfd]
  Normal   SuccessfulAttachVolume  19s                   attachdetach-controller        AttachVolume.Attach succeeded for volume "volume-nyc1-01"
  Normal   Pulling                 10s                   kubelet, cranky-brattain-83pu  pulling image "busybox"
  Normal   Pulled                  10s                   kubelet, cranky-brattain-83pu  Successfully pulled image "busybox"
  Normal   Created                 10s                   kubelet, cranky-brattain-83pu  Created container
  Normal   Started                 10s                   kubelet, cranky-brattain-83pu  Started container

The key point is that it's not instantaneous. We need to wait until the reconciler in Kubernetes becomes aware of the inconsistency and reconciles to a correct state. In my tests it reconciled to a good state in around 6 minutes.

travisgroth commented Feb 13, 2019

I'm on v0.4.0 and have the same scenario as feluxe. Here's how I got there:

  • Create stateful set app with PVC
  • Application is stable with data written to PV
  • Unknown scenario began crashlooping app
  • Delete crashlooping pod
  • Get PV warnings:
Warning  FailedMount  1m (x11 over 7m)  kubelet, charming-austin-8yz0  MountVolume.WaitForAttach failed for volume "pvc-1c06f5ec-1800-11e9-a645-5af0d80f33cc" : volume attachment is being deleted

Warning  FailedMount  1m (x3 over 5m)   kubelet, charming-austin-8yz0  Unable to mount volumes for pod "prometheus-prometheus-prometheus-oper-prometheus-0_prometheus(0cc769e6-2f2a-11e9-a3f3-1e02303657d9)": timeout expired waiting for volumes to attach or mount for pod "prometheus"/"prometheus-prometheus-prometheus-oper-prometheus-0". list of unmounted volumes=[prometheus-prometheus-prometheus-oper-prometheus-db]. list of unattached volumes=[prometheus-prometheus-prometheus-oper-prometheus-db config config-out prometheus-prometheus-prometheus-oper-prometheus-rulefiles-0 prometheus-prometheus-oper-prometheus-token-rd6s5]
  • Wait for 5-10 minutes
  • Recycle node via DO control panel (creates new node, deletes old)
  • New, slightly different warning:
  Warning  FailedAttachVolume  7m                attachdetach-controller        Multi-Attach error for volume "pvc-1c06f5ec-1800-11e9-a645-5af0d80f33cc" Volume is already exclusively attached to one node and can't be attached to another

Warning  FailedMount         39s (x3 over 5m)  kubelet, charming-austin-unyu  Unable to mount volumes for pod "prometheus-prometheus-prometheus-oper-prometheus-0_prometheus(8d61cc70-2f7f-11e9-a3f3-1e02303657d9)": timeout expired waiting for volumes to attach or mount for pod "prometheus"/"prometheus-prometheus-prometheus-oper-prometheus-0". list of unmounted volumes=[prometheus-prometheus-prometheus-oper-prometheus-db]. list of unattached volumes=[prometheus-prometheus-prometheus-oper-prometheus-db config config-out prometheus-prometheus-prometheus-oper-prometheus-rulefiles-0 prometheus-prometheus-oper-prometheus-token-rd6s5]

This continues whether the volume is attached to the new node or not. How do I resolve this? I left it this way for about 10 hours and have tried scaling the statefulset down/up as well as deleting it. No change to the situation.

Cluster info:

NAME                   STATUS    ROLES     AGE       VERSION
charming-austin-unyu   Ready     <none>    10h       v1.12.3

twogood commented Feb 24, 2019

Thank you @Azuka and @radek-baczynski, it seems I was able to solve this for Traefik (installed with Helm) by changing the strategy to Recreate.

timoreimann (Collaborator) commented Nov 2, 2019

Newer releases of the CSI plugin address a few bugs, most notably one about failures to detach caused by a bug in the external-attacher sidecar. Please update to the latest possible version and file a new issue if the problem continues to happen.

Thank you!
