[BUG]: Resiliency: Occasional failure unmounting Unity volume for raw block devices via iSCSI #237

Closed
hoppea2 opened this issue Mar 22, 2022 · 1 comment
Assignees
Labels
area/csm-resiliency Issue pertains to the CSM Resiliency module type/bug Something isn't working. This is the default label associated with a bug issue.
Milestone

Comments

@hoppea2
Collaborator

hoppea2 commented Mar 22, 2022

Bug Description

This issue was discovered while running longevity testing on Unity for CSM Resiliency; it occurs an estimated 20%-30% of the time. Longevity testing is a looped execution of the integration tests. During development in Q1, the integration tests were run regularly as a single, non-looped iteration with no failures.

Expected behavior:

After inducing an error condition on a node, the managed pods and resources are supposed to be removed and recreated on a healthy node.

Observed behavior:

Occasionally, the unmount operation for a Unity raw block volume attached via iSCSI fails. This was detected during longevity testing after running this scenario 10 times.

Workaround:

See item 2 in the known design limitations: https://dell.github.io/csm-docs/docs/resiliency/design/
The recommended workaround is for the administrator to reboot the affected node.

Logs

NodeUnstageVolume: REP 0621: rpc error: code = Internal desc = runid=621 Volume podmonvol-xxxxxx-iSCSI-apmnnnnnnnnnnn-sv_61383 has been mounted outside the provided target path /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/staging/podmonvol-nnnnnnnncb"

level=error msg="NodeUnstageVolume failed:
level=error msg="Could not Unmount private block device.

level=info msg="Couldn't completely cleanup node- taint not removed- cleanup will be retried, or a manual reboot is advised"
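
The first log line reports that the volume is mounted somewhere other than the expected staging path. A minimal sketch for checking this on an affected node follows. It assumes the mount table is read from `/proc/mounts` and that the volume name appears in the mount point path. `mounts_outside_target` is a hypothetical helper for illustration, not part of the driver or podmon.

```python
def mounts_outside_target(mounts_text, volume_name, target_path):
    """Return mount points that mention volume_name but are not under target_path.

    mounts_text is the content of /proc/mounts: one mount per line, with
    whitespace-separated fields (source, mount point, fs type, options, ...).
    """
    target = target_path.rstrip("/")
    outside = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        if volume_name not in mount_point:
            continue  # unrelated mount
        if mount_point == target or mount_point.startswith(target + "/"):
            continue  # mounted at (or below) the expected staging path
        outside.append(mount_point)
    return outside
```

On a node, this could be called with the contents of `/proc/mounts`, the volume name from the error message, and the staging path the driver expected. Any path it returns is a candidate for the extra mount that blocks `NodeUnstageVolume`.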

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

  1. Mount a Unity raw block volume via iSCSI on a k8s node.
  2. Induce a node failure in a supported scenario, so that the node is cordoned and its resources are migrated.
    --> PV unmount fails with "device is busy" after the pod has been deleted.

Expected Behavior

The PV should be unmounted successfully after the pod has been deleted.

CSM Driver(s)

CSI Driver for Unity 2.2

Installation Type

Helm

Container Storage Modules Enabled

Resiliency Podmon 1.1

Container Orchestrator

Kubernetes 1.23

Operating System

CentOS 7

@hoppea2 hoppea2 added needs-triage Issue requires triage. type/bug Something isn't working. This is the default label associated with a bug issue. labels Mar 22, 2022
@hoppea2 hoppea2 assigned hoppea2 and alikdell and unassigned hoppea2 Mar 22, 2022
@hoppea2 hoppea2 removed the needs-triage Issue requires triage. label Mar 22, 2022
@hoppea2 hoppea2 added this to the v1.3.0 milestone Mar 22, 2022
@hoppea2 hoppea2 added the area/csm-resiliency Issue pertains to the CSM Resiliency module label Mar 22, 2022
@alikdell
Contributor

alikdell commented Apr 7, 2022

Root cause:
After running failovers for several hours, we observed a couple of stale loopback devices left on the mount point of an iSCSI block volume on Unity arrays. This is possibly caused by a timing issue in which kubelet does not clean up some pods properly, even though those pods are no longer available or visible in the user namespace.

Solution:
When Resiliency cleans up a node and is unable to unmount some mount points, and those mount points belong to iSCSI block volumes on Unity, it looks for stale loopback devices on those mount points. If any are found, it deletes them and then unmounts the mount points. This allows Resiliency to complete the cleanup and remove the taint from the node.
For CSI-Unity, we also needed to add a couple of bidirectional mounts to the podmon container so that podmon has the correct mount paths and device permissions.
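
The cleanup described above can be sketched roughly as follows. This is not the podmon implementation. It assumes the stale devices can be found by matching the volume name against the output of `losetup -a` and detached with `losetup -d`. The function names are hypothetical.

```python
import subprocess


def find_loop_devices(losetup_output, volume_name):
    """Return loop device paths from `losetup -a` output whose line mentions volume_name.

    Each line of `losetup -a` starts with the device path followed by a colon,
    e.g. "/dev/loop0: ...".
    """
    devices = []
    for line in losetup_output.splitlines():
        if volume_name not in line:
            continue  # loop device backs something else
        device, sep, _rest = line.strip().partition(":")
        if sep and device.startswith("/dev/loop"):
            devices.append(device)
    return devices


def detach_stale_loop_devices(volume_name):
    """Detach every loop device associated with volume_name (needs root on the node)."""
    listing = subprocess.run(
        ["losetup", "-a"], check=True, capture_output=True, text=True
    ).stdout
    detached = []
    for device in find_loop_devices(listing, volume_name):
        subprocess.run(["losetup", "-d", device], check=True)
        detached.append(device)
    return detached
```

Once the stale loop devices are detached, the unmount of the affected mount point would be retried. Matching on the volume name alone is a simplification; a real implementation would likely also confirm that the device is no longer used by any running pod.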

@gallacher gallacher changed the title [BUG]: Resiliency: Occasional failure unmounting Unity volume for raw block devices via iSCSI [BUG]: Resiliency: Occasional failure unmounting Unity volume for raw block devices via iSCSI Jun 16, 2022