Concurrency issue when pvc is mounted to multiple pods #4654

z2000l · 2024-06-02T21:38:18Z

Describe the bug

This is an extension/reopen of https://github.com/ceph/ceph-csi/issues/4592. Our test showed that when creating a deployment of replica 2, with high chance that one pod will fail to mount the pvc.

Environment details

Same as https://github.com/ceph/ceph-csi/issues/4592

Steps to reproduce

Steps to reproduce the behavior:

Change the pod yaml in issue 4592 to a deployment
Make replicas to 2(or high enough so pods will be running on different nodes)

Actual results

Only one pod will be running

Expected behavior

All pods will be running

Madhu-1 · 2024-06-03T07:50:46Z

@z2000l Instead of reopening the issue, can you be more specific about the problem with details?

The Fix is not part of any release yet.
Are you also testing cephfs encryption
Have you tried with canary cephcsi image

z2000l · 2024-06-04T00:28:59Z

Yes, we're trying to test cephfs encryption. We used rook to orchestrate our ceph cluster. We applied the patch to ceph csi 3.10.2. The patch improved fscrypt a lot. Still we experienced problems when replica is high, especially when pods were running on multiple nodes: a subset of the replicas were running without a problem, some of them were stuck in ContainerCreating with error messges:
MountVolume.MountDevice failed for volume "pvc-ad63a27d-4651-42ae-a96c-ca21de963a38" : rpc error: code = Internal desc = fscrypt: unsupported state metadata=true kernel_policy=false

We suspect when multiple nodes try to setup fscrypt policy at the same time, the failed nodes don't refresh or sync to get the policy and get stuck.

And which canary cephcsi image you'd like us to try? Thanks.

Madhu-1 · 2024-06-04T06:51:35Z

@z2000l Thanks for the response , quay.io/cephcsi/cephcsi:canary can you please use this message and also please provide the logs and steps to reproduce so that others can try to reproduce it?
it could be problem with the fscrypt library we are using we need to open issues with fscrypt if we have any.

@NymanRobin is it something still happening or its fixed?

NymanRobin · 2024-06-04T07:53:02Z

Thanks for notifying me @Madhu-1 I thought this was fixed, but in my testing I did not use multiple nodes so this might be an issue still as pointed out by @z2000l. I will try with multiple nodes and see if I can reproduce

Madhu-1 · 2024-06-04T07:57:58Z

Thanks for notifying me @Madhu-1 I thought this was fixed, but in my testing I did not use multiple nodes so this might be an issue still as pointed out by @z2000l. I will try with multiple nodes and see if I can reproduce

Thank you 👍🏻

NymanRobin · 2024-06-04T11:12:17Z

I can confirm that this is consistently reproducible when using a deployment with multiple replicas spread over nodes and RWX accessMode for the pvc. I will start looking into a solution, not sure yet if the problem is in fscrypt or ceph-csi implementation

NymanRobin · 2024-06-05T10:33:35Z

I now understand the error clearly. It seems that the fscrypt client handles metadata and policy differently, leading to this error. When checking for metadata, if it doesn't exist, we create it. However, this isn't the case for policy checks. This results in scenarios where, for instance, if three pods are created simultaneously, the first creates metadata, and the subsequent two find it. However, none of these pods find the policy, leading to a mismatch because the subsequent two have metadata but no policy.

The solution is to properly set up the policy and metadata initially. Once this is done, the race condition won't occur. As a temporary workaround, one can start the deployment with one replica, wait for it to be in the running state, and then scale up without issue.

I believe a permanent solution shouldn't be too difficult. Best idea I currently have is to check if the initial setup is done, this needs to be behind a lock and the lock can be released if the check is okay or when the setup is done. I will still do some more thinking about the best course of action, but how does this initially sound to you @Madhu-1

EDIT: Okay actually the pods are spread over nodes so a mutex won't solve this it will need a lease or some other mechanism to ensure it is ran only once

/assign

github-actions · 2024-06-05T10:33:46Z

Thanks for taking this issue! Let us know if you have any questions!

github-actions · 2024-06-05T11:41:48Z

The issue you are trying to assign to yourself is already assigned.

Madhu-1 · 2024-06-06T06:51:54Z

I believe a permanent solution shouldn't be too difficult. Best idea I currently have is to check if the initial setup is done, this needs to be behind a lock and the lock can be released if the check is okay or when the setup is done. I will still do some more thinking about the best course of action, but how does this initially sound to you @Madhu-1

EDIT: Okay actually the pods are spread over nodes so a mutex won't solve this it will need a lease or some other mechanism to ensure it is ran only once

Thanks for exploring it. Yes, as you mentioned internal locking might not be good, we need to see if we can have something else for this one.

nixpanic · 2024-06-12T08:19:34Z

Hi @NymanRobin,

You might be able to use VolumeLocks from internal/util/idlocker.go for locking too once you add a new fscrypt operation.

Thanks for looking into this!

NymanRobin · 2024-06-12T10:16:50Z

Hey @nixpanic!

Thanks for the suggestion but based on our last discussion I still think the Rados omap is the way to go.

Let me try to explain, so how I understand it is that the lock seems to be already acquired in the NodStageVolume on line 170, meaning that we already are in a locked state when entering function maybeUnlockFileEncryption. Why this does not prevent the race condition is because the VolumeLocks map is local to each node and multiple nodes can stage & encrypt a volume simultaneously in the RWX case

Do you agree with this understanding?

I think my best idea currently would be to implement a similar interface as the reftracker that would track the encryption of the cephfs filesystem to ensure the encryption is only set up once

NymanRobin · 2024-06-12T13:04:47Z

I tested the Rados omap locking with a simple proof of concept, and it appears to be working well!

During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly.

However, I believe this can be further improved by creating a dedicated interface for encryption tracking.

nixpanic · 2024-06-13T07:14:30Z

Let me try to explain, so how I understand it is that the lock seems to be already acquired in the NodStageVolume on line 170, meaning that we already are in a locked state when entering function maybeUnlockFileEncryption. Why this does not prevent the race condition is because the VolumeLocks map is local to each node and multiple nodes can stage & encrypt a volume simultaneously in the RWX case

Do you agree with this understanding?

Yes, if VolumeLocks do not store their state in Rados, they are definitely not usable.

I tested the Rados omap locking with a simple proof of concept, and it appears to be working well!

During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly.

Great work! Sounds really good 👏

However, I believe this can be further improved by creating a dedicated interface for encryption tracking.

If you have a generic "Rados lock/mutex" function, we might be able to use it for other things too. I can't immediately think of something else than encryption, though. Making it encryption specific is fine, but ideally it is easily adaptable later on to be more generic (if we find a usecase for it).

Thanks again!

Madhu-1 · 2024-06-13T11:15:52Z

Let me try to explain, so how I understand it is that the lock seems to be already acquired in the NodStageVolume on line 170, meaning that we already are in a locked state when entering function maybeUnlockFileEncryption. Why this does not prevent the race condition is because the VolumeLocks map is local to each node and multiple nodes can stage & encrypt a volume simultaneously in the RWX case
Do you agree with this understanding?

Yes, if VolumeLocks do not store their state in Rados, they are definitely not usable.

I tested the Rados omap locking with a simple proof of concept, and it appears to be working well!
During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly.

Great work! Sounds really good 👏

@NymanRobin what about the case where we took the lock and csi restarted and the app pod scale downed to 1 where it needs to run on a specific node instead of two nodes, will this also helps?

However, I believe this can be further improved by creating a dedicated interface for encryption tracking.

If you have a generic "Rados lock/mutex" function, we might be able to use it for other things too. I can't immediately think of something else than encryption, though. Making it encryption specific is fine, but ideally it is easily adaptable later on to be more generic (if we find a usecase for it).

Thanks again!

NymanRobin · 2024-06-13T12:11:03Z

@Madhu-1, I currently do not see any problems with the scale-down scenario, but maybe I am missing something?

I also do not foresee any issues with scaling down, as ideally, once a node has set up the encryption, there is no need to acquire the lock again. This is why I considered a specific encryption interface to to first do a lookup if the volume has already been encrypted, thus avoiding the need to lock it again in case it is already encrypted. This approach could be an extension of the lock mechanism. Although a normal mutex could also work, it will be slightly slower during large-scale operations. However, the lock will only be active for a very short period of time is the key point here. On another note if the update/delete call for the lock fails, we need to maintain a list of orphaned locks and attempt to clean them periodically is my thought

That said, I understand the concern about:
Acquire lock -> pod restarts -> Orphaned lock

Is a valid point I hadn't considered earlier. Handling a pod restart in a locked state is more challenging, and I do not have a perfect solution on top of my mind for this scenario. Maybe implementing an expiry time for the lock could solve this? What do you think @Madhu-1 and @nixpanic?

nixpanic · 2024-06-14T09:16:33Z

A lock that can time-out would be ideal, I guess. Depending on how complex you want to make it, this could be an approach:

obtain the lock
1. set owner id of the lock (useful for logging)
2. set timeout timestamp
3. do operation
4. (optional) update timeout timestamp to inform others the lock is still active
5. do more operation(s)
6. release the lock
blocked while obtaining the lock
1. report the owner of the lock
2. retry getting the lock, take timeout in account
3. if timed-out, warn about it, including the owner
4. if the lock was released, just report an informational message

NymanRobin · 2024-06-20T10:14:19Z

Hey everyone,

I've just opened a draft PR (#4688) that addresses the concurrency issue we've been facing. While it's still a bit rough around the edges and misses some functionality, it implements the necessary functionality for resolving the concurrency issue. I made a check list in the PR of what could still be done or looked into

I opened the PR in this state now because I'm going on vacation for the next four weeks and don't want to lose momentum on this issue. I'll be working to finalize as much as possible today, but any remaining tasks are open for anyone who feels inspired to take over.

From our side, @Sunnatillo will be available to assist with this.

Thanks!

Madhu-1 · 2024-06-20T10:43:42Z

Hey everyone,

I've just opened a draft PR (#4688) that addresses the concurrency issue we've been facing. While it's still a bit rough around the edges and misses some functionality, it implements the necessary functionality for resolving the concurrency issue. I made a check list in the PR of what could still be done or looked into

I opened the PR in this state now because I'm going on vacation for the next four weeks and don't want to lose momentum on this issue. I'll be working to finalize as much as possible today, but any remaining tasks are open for anyone who feels inspired to take over.

From our side, @Sunnatillo will be available to assist with this.

Thanks!

@NymanRobin Thank you for taking care of it, Enjoy your vacation

NymanRobin · 2024-06-20T17:31:16Z

Thanks a lot @Madhu-1!
I however now noticed that the ceph-go library already supports LockExclusive, so probably not the best idea to reinvent the wheel as I have done here and instead use that. The locking supported the functionality that we need on an initial look. Oh well, if this remains I will fix my mess after the vacation 😄

Sunnatillo · 2024-06-25T11:33:08Z

Hi @Madhu-1 @nixpanic. Should we use LockExclusive from ceph-go or the @NymanRobin's current implementation is fine?

Madhu-1 · 2024-06-26T07:18:08Z

@Sunnatillo yes i think using https://github.com/ceph/go-ceph/blob/ee25db94c6595ad3dca6cd66d7f290df3baa7c9a/rados/ioctx.go#L463-L508 LockExclusive from rados does what described above by @nixpanic am fine with using it instead of reinvent the same. @nixpanic can you please check this as well.

nixpanic · 2024-06-26T08:31:17Z

Hi @Madhu-1 @nixpanic. Should we use LockExclusive from ceph-go or the @NymanRobin's current implementation is fine?

The simpler it can be done, the better!

github-actions bot assigned NymanRobin Jun 5, 2024

Rakshith-R added bug Something isn't working component/cephfs Issues related to CephFS labels Jun 5, 2024

NymanRobin mentioned this issue Jun 20, 2024

wip: implement a global lock for cephfs encryption #4688

Closed

6 tasks

nixpanic mentioned this issue Jul 11, 2024

Set object lock for volumes for cephfs encryption #4697

Merged

6 tasks

mergify bot closed this as completed in #4697 Jul 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Concurrency issue when pvc is mounted to multiple pods #4654

Concurrency issue when pvc is mounted to multiple pods #4654

z2000l commented Jun 2, 2024

Madhu-1 commented Jun 3, 2024

z2000l commented Jun 4, 2024

Madhu-1 commented Jun 4, 2024

NymanRobin commented Jun 4, 2024

Madhu-1 commented Jun 4, 2024

NymanRobin commented Jun 4, 2024

NymanRobin commented Jun 5, 2024 •

edited

Loading

github-actions bot commented Jun 5, 2024

github-actions bot commented Jun 5, 2024

Madhu-1 commented Jun 6, 2024

nixpanic commented Jun 12, 2024

NymanRobin commented Jun 12, 2024

NymanRobin commented Jun 12, 2024

nixpanic commented Jun 13, 2024

Madhu-1 commented Jun 13, 2024

NymanRobin commented Jun 13, 2024 •

edited

Loading

nixpanic commented Jun 14, 2024

NymanRobin commented Jun 20, 2024

Madhu-1 commented Jun 20, 2024

NymanRobin commented Jun 20, 2024

Sunnatillo commented Jun 25, 2024

Madhu-1 commented Jun 26, 2024

nixpanic commented Jun 26, 2024

Concurrency issue when pvc is mounted to multiple pods #4654

Concurrency issue when pvc is mounted to multiple pods #4654

Comments

z2000l commented Jun 2, 2024

Describe the bug

Environment details

Steps to reproduce

Actual results

Expected behavior

Madhu-1 commented Jun 3, 2024

z2000l commented Jun 4, 2024

Madhu-1 commented Jun 4, 2024

NymanRobin commented Jun 4, 2024

Madhu-1 commented Jun 4, 2024

NymanRobin commented Jun 4, 2024

NymanRobin commented Jun 5, 2024 • edited Loading

github-actions bot commented Jun 5, 2024

github-actions bot commented Jun 5, 2024

Madhu-1 commented Jun 6, 2024

nixpanic commented Jun 12, 2024

NymanRobin commented Jun 12, 2024

NymanRobin commented Jun 12, 2024

nixpanic commented Jun 13, 2024

Madhu-1 commented Jun 13, 2024

NymanRobin commented Jun 13, 2024 • edited Loading

nixpanic commented Jun 14, 2024

NymanRobin commented Jun 20, 2024

Madhu-1 commented Jun 20, 2024

NymanRobin commented Jun 20, 2024

Sunnatillo commented Jun 25, 2024

Madhu-1 commented Jun 26, 2024

nixpanic commented Jun 26, 2024

NymanRobin commented Jun 5, 2024 •

edited

Loading

NymanRobin commented Jun 13, 2024 •

edited

Loading