
Add force detach #477

Open: wants to merge 5 commits into master

Conversation

Contributor

@bswartz bswartz commented Apr 26, 2021

Add new controller capability:

  • UNPUBLISH_FENCE

Add new node capability:

  • FORCE_UNPUBLISH

@xing-yang
Contributor

@bswartz did you modify csi.proto manually? I see that you added the new capabilities in csi.proto, but I didn't find them in spec.md. Changes should only be made in spec.md. After "make", csi.proto will be updated automatically.

@bswartz
Contributor Author

bswartz commented Apr 28, 2021

@bswartz did you modify csi.proto manually? I see that you added the new capabilities in csi.proto, but I didn't find them in spec.md. Changes should only be made in spec.md. After "make", csi.proto will be updated automatically.

Seriously?!? I didn't know that! I'll update.

@nixpanic nixpanic left a comment

If CSI puts a node in a quarantine state, which component (and how) should mark the node as recovered and get it out of the quarantine state?

I guess that the quarantine state is per node+volume, making it possible to have other volumes (potentially from the same csi-driver) functioning correctly. Is that understanding correct?

spec.md Outdated
The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unpublishing volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
It is essential that after a successful call to `NodeUnpublishVolume` that there be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it were to be subsequently re-published to that node.


typo: unintetional -> unintentional

csi.proto Outdated
@@ -763,6 +763,17 @@ message ControllerUnpublishVolumeRequest {
// This field is OPTIONAL. Refer to the `Secrets Requirements`
// section on how to use this field.
map<string, string> secrets = 3 [(csi_secret) = true];

// Indicates SP MUST make the volume inacessible to the node or nodes
Contributor

@humblec humblec May 4, 2021

s/inacessible/inaccessible

Contributor

@bswartz maybe you missed this? :)

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
It is essential that after a successful call to `NodeUnstageVolume` that there be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it were to be subsequently re-staged to that node.

Contributor

s/unintetional/unintentional

Contributor

@bswartz maybe correct this too?

// to a volume from a node that has been fenced MUST NOT succeed,
// even if the volume remains staged and/or published on the node.
// CO MUST NOT set this field to true unless SP has the
// UNPUBLISH_FENCE controller capability.
Contributor

One of the main doubts I have here is: what if the SP implements ONLY the node capability? Is that SP driver still intended to be able to use the fence capability successfully?

Contributor Author

If the plugin has no controller, then there's nothing for the CO to talk to in the event that a node becomes unreachable. I would expect a CO to use both of these new capabilities together. However, if a plugin can only support UNPUBLISH_FENCE and not the FORCE_UNPUBLISH capability, then it's still possible to use the controller capability to safely move workloads and their attached volumes to a new node in the event of a node failure, and to manage the node cleanup process through a full node reboot. If a plugin supports FORCE_UNPUBLISH only and not UNPUBLISH_FENCE, then it's not particularly useful for anything except more aggressive node cleanup.

Contributor

@humblec humblec May 5, 2021

@bswartz yeah, that clarifies my doubt! IMO it would be good to document the above scenario (absence of either the controller or the node capability) in this spec, which can avoid some confusion. On a side note, I was thinking that even though it is just the FORCE_UNPUBLISH capability advertised by the SP driver, it is still useful for cutting/fencing access to volumes, because with this enhancement the NODE capability is granular to the volume level. Removing access to the volume, e.g. by blacklisting certain node IPs, is possible from the SP driver's point of view if the CO can flag it (not sure how exactly, though :) ) in the 'next' NODE call for this volume.


Thinking a bit more from the pov of a csi driver which does NOT support controller publish/unpublish, but supports node stage/unstage (e.g. the CSI driver for GCP Filestore), I have a couple of questions:

  1. In the current proposal, the driver does not have a way to opt in to leveraging the fence capability (even if there is a way to blacklist node IPs for a given NFS filer) without supporting controller publish/unpublish, correct? E.g. today there is a controller service PUBLISH_UNPUBLISH capability, but we don't have a controller capability to only support UNPUBLISH with fence (where a regular controller unpublish is a no-op by the driver, but the unpublish with fence=true can quarantine a node-volume pair). A no-op controller publish would be costly since it would create extra resources like VolumeAttachment (VA) objects in k8s.
  2. The second question is around FORCE_UNPUBLISH. I am trying to understand some example scenarios for the conditions under which the CO would trigger a NODE UNPUBLISH/UNSTAGE (with force=true). Does the CO trigger node unstage/unpublish (force=true) only for quarantined volumes? Because if so, then a driver which does NOT support controller PUBLISH_UNPUBLISH (and thereby has no way to quarantine/fence a volume), but supports node stage/publish, cannot leverage the force options (e.g. force unmount for NFS in case of NFS server unavailability).

Contributor Author

1. In the current proposal, the driver does not have a way to opt in to leveraging the fence capability (even if there is a way to blacklist node IPs for a given NFS filer) without supporting controller publish/unpublish, correct? E.g. today there is a controller service PUBLISH_UNPUBLISH capability, but we don't have a controller capability to only support UNPUBLISH with fence (where a regular controller unpublish is a no-op by the driver, but the unpublish with fence=true can quarantine a node-volume pair). A no-op controller publish would be costly since it would create extra resources like VolumeAttachment (VA) objects in k8s.

Yes, in order for this feature to be useful to COs, there has to be a control path for the CO to tell the SP which nodes may access the volume and which can't. This is the whole purpose of ControllerPublish/Unpublish. In any scheme where the access granting/denying is out of band, you can't automate it from the CO side, and this feature is about automating the recovery from node failures.

2. The second question is around FORCE_UNPUBLISH. I am trying to understand some example scenarios for the conditions under which the CO would trigger a NODE UNPUBLISH/UNSTAGE (with force=true). Does the CO trigger node unstage/unpublish (force=true) only for quarantined volumes? Because if so, then a driver which does NOT support controller PUBLISH_UNPUBLISH (and thereby has no way to quarantine/fence a volume), but supports node stage/publish, cannot leverage the force options (e.g. force unmount for NFS in case of NFS server unavailability).

The CO should use the force flag on unstage/unpublish if it's cleaning up a volume on a node where data has been lost, or will inevitably be lost. Normal behavior on unpublish/unstage is to fail if data loss would occur, and for the CO to keep retrying, so that data loss does not occur by accident. The CO has to make the judgement that data loss is acceptable and inform the SP of that by using the force flag, so that the SP can successfully clean up, discarding any pending data. Situations like this will occur when the CO has chosen to fence the node and move a workload to a different node (in the interest of minimizing workload downtime) or if the storage itself has experienced a data-losing failure and the CO wishes to write off the lost volume by cleaning up the attachment on the node.
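
As a rough sketch of that judgement (the helper and flag names below are purely illustrative, not anything in the spec), the CO-side decision boils down to:

// Illustrative only: force is justified only once data loss is already
// unavoidable, either because the node was fenced or because the storage
// backend itself has failed.
func shouldForceCleanup(nodeFenced, storageFailed bool) bool {
	// While neither is true, normal NodeUnpublish/NodeUnstage should keep
	// failing (and being retried) so buffered data is not silently dropped.
	return nodeFenced || storageFailed
}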

// Indicates SP MUST make the volume inacessible to the node or nodes
// it is being unpublished from. Any attempt to read or write data
// to a volume from a node that has been fenced MUST NOT succeed,
// even if the volume remains staged and/or published on the node.
Contributor

If there are a couple of pods accessing the volume on the same node, we would have a state with more than one publish and just one stage. Considering that the NodeUnpublish request arrives for a particular volume handle, if it is fenced at the stage level, that is not the desired outcome we intended with this enhancement/spec, isn't it?

Contributor Author

Yes, the normal set of NodeUnpublish and NodeUnstage calls still have to be made for each volume on each node to return to the original state. The trigger for the quarantined state is the ControllerUnpublish call with the fence flag. The above statement just means that, without communicating with the node, the SP is expected to render the volume inaccessible to the node regardless of the node's state, and the node will be forced to clean up under conditions where it can't access the volume. The only way the node will be able to access the volume again is for the node-side cleanup to complete and then for a subsequent ControllerPublish to happen for that node again.
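
For illustration only, here is a hedged sketch of that sequence against the existing Go bindings; the fence/force fields this PR proposes are left as comments because they do not exist in the generated code:

package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// Sketch of the recovery flow: fence first, republish elsewhere, then force
// node-side cleanup once the old node is reachable again.
func recoverFromLostNode(ctx context.Context, ctrl csi.ControllerClient, node csi.NodeClient,
	volumeID, lostNodeID, newNodeID, stagingPath, targetPath string) error {

	// 1. Revoke the lost node's access without talking to it. Normally this
	//    RPC is only legal after node cleanup; UNPUBLISH_FENCE relaxes that.
	if _, err := ctrl.ControllerUnpublishVolume(ctx, &csi.ControllerUnpublishVolumeRequest{
		VolumeId: volumeID,
		NodeId:   lostNodeID,
		// Fence: true, // proposed alpha field from this PR
	}); err != nil {
		return err
	}

	// 2. The volume may now be safely published to a healthy node.
	if _, err := ctrl.ControllerPublishVolume(ctx, &csi.ControllerPublishVolumeRequest{
		VolumeId: volumeID,
		NodeId:   newNodeID,
		// Required VolumeCapability and readonly omitted for brevity.
	}); err != nil {
		return err
	}

	// 3. When the lost node returns, clean it up with force=true before any
	//    new ControllerPublishVolume targets that node again.
	if _, err := node.NodeUnpublishVolume(ctx, &csi.NodeUnpublishVolumeRequest{
		VolumeId:   volumeID,
		TargetPath: targetPath,
		// Force: true, // proposed alpha field from this PR
	}); err != nil {
		return err
	}
	_, err := node.NodeUnstageVolume(ctx, &csi.NodeUnstageVolumeRequest{
		VolumeId:          volumeID,
		StagingTargetPath: stagingPath,
		// Force: true, // proposed alpha field from this PR
	})
	return err
}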

@bswartz
Contributor Author

bswartz commented May 4, 2021

If CSI puts a node in a quarantine state, which component (and how) should mark the node as recovered and get it out of the quarantine state?

All of the state management in CSI is implicit. Nowhere in the RPCs are the states represented explicitly. The CO needs to represent the new quarantine states for the scheme to work correctly, and the SP needs to relax its assumptions about what order some RPCs may be called in if they assert the new capabilities.

I guess that the quarantine state is per node+volume, making it possible to have other volumes (potentially from the same csi-driver) functioning correctly. Is that understanding correct?

In theory the states are per volume, although the expectation is that this feature would be used on all of the volumes on a given node at once, both for the ControllerUnpublish(fenced) and the various node cleanup functions. COs still have to track state per volume, but I would expect an implementation to move all of the volumes to a quarantine state upon detection of a node problem, and then to move all of the volumes out of a quarantine state before attaching any new volumes upon resumption of a node's participation in a cluster. That's not strictly necessary, but it's cleaner and simpler.

@nixpanic nixpanic left a comment

Many thanks for the reply and clarifications! From a Ceph-CSI point of view, this looks workable, we can add fenced nodes to a blocklist, and once recovered allow access from the nodes again.

If the plugin has the `UNPUBLISH_FENCE` capability, the CO MAY specify `fence` as `true`, in which case the SP MUST ensure that the node may no longer access the volume before returning a successful response.
This results in a transition into one of the `QUARANTINE` states where the node must be cleaned up without being able to access the volume like usual.
This is intended cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
Once in one of the `QUARANTINE` states the volume MAY NOT be published to that node again until appropriate cleanup has happened using `NodeUnpublishVolume` and `NodeUnstageVolume` (if applicable).

This wording makes the calling of NodeUnpublishVolume and NodeUnstageVolume optional, is that intentional?

In other words, it seems acceptable that the CO calls ControllerPublish for a node, only once the node is recovered. So ControllerPublish should also do unfencing in case the node was fenced earlier. The CO could wipe its known state of a node (and attached volumes), without trying to call NodeUnpublishVolume and NodeUnstageVolume after the fencing.

Contributor Author

This wording makes the calling of NodeUnpublishVolume and NodeUnstageVolume optional, is that intentional?

Well, NodeUnstage is always optional because it's capability-controlled. If the capability is present, then you have to use it after NodeUnpublish to get correct results. However, the other alternative is to reboot the whole node and reset its state. All of the concerns we have about the potential for data corruption result from having unflushed write buffers in the kernel. If you reboot, those concerns go away. The point of the forced unpublish/unstage is to provide a path to cleaning up without requiring a reboot.

In other words, it seems acceptable that the CO calls ControllerPublish for a node, only once the node is recovered. So ControllerPublish should also do unfencing in case the node was fenced earlier. The CO could wipe its known state of a node (and attached volumes), without trying to call NodeUnpublishVolume and NodeUnstageVolume after the fencing.

Yes. In fact it's acceptable today for ControllerUnpublish to fence the node in question, it's just not required. Due to that ambiguity, we need this new flag so the CO can specify that it is required in certain cases. And because we're allowing a new sequence of operations that was previously illegal (calling ControllerUnpublish before NodeUnpublish) we have to spell out the implications of that.


Good, that all makes sense to me.

I'm a little wary that NodeUnpublishVolume and NodeUnstageVolume are being retried until they succeed (and cause the node to be moved out of quarantine). There seems to be the possibility that these calls never make any progress and a reboot is required. Probably not an issue that needs to be handled here, as it will become the responsibility of the CO to alert the cluster operator that a node is in quarantine and other nodes are taking over parts of the workload.

Contributor Author

I'm a little wary that NodeUnpublishVolume and NodeUnstageVolume are being retried until they succeed (and cause the node to be moved out of quarantine). There seems to be the possibility that these calls never make any progress and a reboot is required. Probably not an issue that needs to be handled here, as it will become the responsibility of the CO to alert the cluster operator that a node is in quarantine and other nodes are taking over parts of the workload.

I'm not too worried about this. Usually there are more aggressive ways of cleaning up a mount (like the umount -f command). If there aren't, then the driver should not assert the capability, because it could lead to the problem you mention.

My feeling is that, in situations where there aren't ways to force-unmount, it's usually an indication of a missing feature in the underlying protocol or UI. You always want to be able to clean up client state when the server is permanently dead. Any time rebooting is your only option to solve a problem, it's a sign of bad design, and the proper response is to fix the design so that you can force unmount reliably.

@nixpanic

After a little more discussion with colleagues, there seems to be some further (practical) clarification needed.

With Ceph-CSI we can use this proposal when the CSI Pods use host-networking. However, if there is any other form of networking configured, the IP-addresses or hostnames of the Pods may change, and block-listing is not trivial anymore.

This spec modification may not be the right place for this discussion, but we wonder if there has been any thought about this already.

@bswartz
Contributor Author

bswartz commented May 20, 2021

With Ceph-CSI we can use this proposal when the CSI Pods use host-networking. However, if there is any other form of networking configured, the IP-addresses or hostnames of the Pods may change, and block-listing is not trivial anymore.

I assume you're referring here to the ControllerUnpublish portion of the solution? Nobody said this is an easy thing to implement. That's why it's an optional capability.

I feel fairly confident that the higher level problem of how to safely recover from a node failure when there are pods with RWO volumes is unsolvable without either this capability, or some cloud provider plugin that can reliably kill/reboot problem nodes.

@@ -257,6 +261,12 @@ Plugins SHOULD expose all RPCs for an interface: Controller plugins SHOULD imple
Unsupported RPCs SHOULD return an appropriate error code that indicates such (e.g. `CALL_NOT_IMPLEMENTED`).
The full list of plugin capabilities is documented in the `ControllerGetCapabilities` and `NodeGetCapabilities` RPCs.

### A Word on Quarantine States

The purpose of the `QUARANTINE_S`, `QUARANTINE_P`, and `QUARANTINE_SP` states are to enable recovery from node problems.
Contributor

I do not see any other mention/usage of these states?

Contributor Author

Look at the ASCII art state transition diagrams.

Contributor

ah - I searched for the word and you are using a different word in the ASCII diagrams. But that explanation is not enough IMO. Do you think that quarantine states should be part of ControllerUnpublish responses?

Contributor Author

Well all of the states in that ASCII chart are implicit states that the CO is supposed to follow but aren't expressed anywhere in the RPC interface. These new states are meant to be similarly implicit.


QUARANTINE_S, QUARANTINE_P, and QUARANTINE_SP are in the diagrams, but I still have difficulty understanding the meaning of those states.

Contributor Author

All 3 states refer to situations where a particular node/volume pair is unsafe to use. The difference is which steps need to be performed to return back to the safe state. The _S state requires only unstage to return to normal, the _P state requires only unpublish to return to normal, and the _SP state requires both unpublish and unstage to return to normal. These are inversions of the existing states in the state diagram, with uncreatively-chosen names (I'm happy to accept better suggestions).

I wanted to emphasize that we're not merely introducing some new flags that yield different behavior, this change relies on a fundamental reordering of operations to arrive at the desired result. When a node is unreachable you have to sort out the problem by talking ONLY to the controller, and the existing CSI spec expressly forbids ControllerUnpublishing until the node side operations are complete.
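
A rough illustration (using my uncreative names) of how a CO might track these implicit states and the node-side cleanup each one still requires; none of this appears in the RPC interface:

// Illustrative only: the quarantine states are implicit, per node+volume pair.
type quarantineState int

const (
	quarantineS  quarantineState = iota // fenced while staged only
	quarantineP                         // fenced while published only
	quarantineSP                        // fenced while staged and published
)

// remainingCleanup lists the forced node RPCs still needed before the volume
// may be ControllerPublished to that node again.
func remainingCleanup(s quarantineState) []string {
	switch s {
	case quarantineS:
		return []string{"NodeUnstageVolume(force=true)"}
	case quarantineP:
		return []string{"NodeUnpublishVolume(force=true)"}
	default: // quarantineSP
		return []string{"NodeUnpublishVolume(force=true)", "NodeUnstageVolume(force=true)"}
	}
}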


I agree with @gnufied and @bswartz. An explanation of QUARANTINE_S, QUARANTINE_P, and QUARANTINE_SP would be good to add; I suggest adding the explanations to the ControllerUnpublishVolume section.

@jdef
Member

jdef commented May 27, 2021

"fencing" nodes and "forcing" detach sound like complementary features/capabilities. so, from that perspective, it's nice to see the capability bits split up.

(a) what if only fencing capability is specified? is that allowed, w/o also supporting "forcing"? if so, what's the use case?
(b) what if only forcing capability is specified? is that allowed, w/o also supporting "fencing"? if so, what's the use case?

if these features cannot be disentangled from each other, then maybe they are actually part of the same capability? there's a lot to think about in this proposal, and i certainly need to spend more time w/ it.

@bswartz
Contributor Author

bswartz commented May 27, 2021

"fencing" nodes and "forcing" detach sound like complementary features/capabilities. so, from that perspective, it's nice to see the capability bits split up.

(a) what if only fencing capability is specified? is that allowed, w/o also supporting "forcing"? if so, what's the use case?
(b) what if only forcing capability is specified? is that allowed, w/o also supporting "fencing"? if so, what's the use case?

if these features cannot be disentangled from each other, then maybe they are actually part of the same capability? there's a lot to think about in this proposal, and i certainly need to spend more time w/ it.

In situation (a) if the driver supports fencing but not force unpublish, it still gives the CO everything it needs to cope with node failures by moving workloads to other nodes and fencing off the bad node. What you would be missing in this situation is a safe way for the bad node to rejoin the cluster when it becomes responsive again. In general, the normal cleanup procedure of NodeUnpublish+NodeUnstage can't be expected to succeed under conditions where the node can't access the relevant volumes (while it's fenced) because doing so might result in data loss, and a well-behaved node plugin would be expected to fail repeatedly until it can safely detach the volumes. This could be worked around by rebooting the node, though, which is what I'd recommend.

In short, situation (a) still gives the CO all the benefits of rapid recovery from node failure, but requires the use of a reboot to clean up afflicted nodes, because ordinary cleanup procedures can't be expected to work reliably.

In situation (b) the CO doesn't have any additional ways to cope with a node failure, but the force unpublish capability could still come in handy when dealing with storage controller failures. In the status quo, failure of a storage controller can result in nodes getting "stuck" because node unpublish/node unstage can never succeed, due to inability to cleanly unmount/detach the volumes while the storage controller itself is down. In a world where the CO knows about the health of the storage controller, it might choose to more aggressively clean up a node if it believes that data loss is inevitable, even if it lacks the capability to fence the node.

In short, situation (b) gives the CO the capability to reliably clean up nodes AFTER data loss has become inevitable.
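
As a hedged sketch, the CO-side branching on the two capabilities from scenarios (a) and (b) might look like this (capability detection reduced to plain booleans for illustration):

// Illustrative only: how a CO might act on the two capabilities independently.
func recoveryStrategy(hasUnpublishFence, hasForceUnpublish bool) string {
	switch {
	case hasUnpublishFence && hasForceUnpublish:
		// Fence the lost node, move the workload, then force-clean the node
		// without a reboot once it is reachable again.
		return "fence + forced node cleanup"
	case hasUnpublishFence:
		// Scenario (a): fast failover is possible, but the afflicted node
		// must be rebooted before it can safely rejoin.
		return "fence + reboot-based cleanup"
	case hasForceUnpublish:
		// Scenario (b): no fencing, but the node can still be cleaned up
		// once data loss has already become inevitable.
		return "forced cleanup after storage failure only"
	default:
		return "no automated recovery"
	}
}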

Member

@saad-ali saad-ali left a comment

/lgtm
/approve

@jdef @humblec and anyone else interested can you PTAL ASAP.
Trying to get this merged and a new release cut on Monday.

spec.md Outdated
@@ -1322,6 +1337,17 @@ message ControllerUnpublishVolumeRequest {
// This field is OPTIONAL. Refer to the `Secrets Requirements`
// section on how to use this field.
map<string, string> secrets = 3 [(csi_secret) = true];

// Indicates SP MUST make the volume inacessible to the node or nodes
Contributor

@bswartz s/inacessible/inaccessible

spec.md Outdated
@@ -1322,6 +1337,17 @@ message ControllerUnpublishVolumeRequest {
// This field is OPTIONAL. Refer to the `Secrets Requirements`
// section on how to use this field.
map<string, string> secrets = 3 [(csi_secret) = true];

// Indicates SP MUST make the volume inacessible to the node or nodes
Contributor

inacessible -> inaccessible

spec.md Outdated
The CO MUST guarantee that this RPC is called after all `NodeUnpublishVolume` have been called and returned success for the given volume on the given node.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
Contributor

it if means -> if it means

spec.md Outdated
The CO MUST guarantee that this RPC is called after all `NodeUnpublishVolume` have been called and returned success for the given volume on the given node.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
It is essential that after a successful call to `NodeUnstageVolume` that there be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it were to be subsequently re-staged to that node.
Contributor

there be -> there will be
if it were -> if it was

spec.md Outdated
The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unpublishing volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
Contributor

it if means -> if it means

spec.md Outdated
The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unpublishing volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
It is essential that after a successful call to `NodeUnpublishVolume` that there be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it were to be subsequently re-published to that node.
Contributor

there be -> there will be
if it were -> if it was

spec.md Outdated
@@ -2501,6 +2553,11 @@ message NodeServiceCapability {
// Note that, for alpha, `VolumeCondition` is intended to be
// informative for humans only, not for automation.
VOLUME_CONDITION = 4 [(alpha_enum_value) = true];
// Indicates that the node supports the NodeUnpublishVolume.force
Contributor

the node supports -> the Node Plugin supports

spec.md Outdated
@@ -2501,6 +2553,11 @@ message NodeServiceCapability {
// Note that, for alpha, `VolumeCondition` is intended to be
// informative for humans only, not for automation.
VOLUME_CONDITION = 4 [(alpha_enum_value) = true];
// Indicates that the node supports the NodeUnpublishVolume.force
// field. Also indicates that the node supports the
Contributor

the node supports -> the Node Plugin supports

@bswartz
Contributor Author

bswartz commented Jun 5, 2021

Thank you for all the review feedback. I think I've addressed all the comments. Let me know when it's time to squash and merge.

@bswartz
Contributor Author

bswartz commented Jun 5, 2021

@saad-ali I expect there will be merge conflicts with the other spec changes, due to conflicting capability numbers, so we need to merge the important ones first and the followers will need to resolve the conflicts by choosing higher numbers.

Member

@jdef jdef left a comment

Mostly nits; this seems well thought out. I feel that it's worth reiterating that unless CO's offer an API for a user to indicate "I don't care about data loss", manual administrator effort is still involved - and then it's unclear what this API actually buys anyone. If the goal is complete automation (w/ respect to cleanup and unpinning volumes) in the case of problematic nodes, then CO's need to address this at the API level as well. What's the plan for that, and if that hasn't landed yet - why?

spec.md Outdated

The purpose of the `QUARANTINE_S`, `QUARANTINE_P`, and `QUARANTINE_SP` states are to enable recovery from node problems.
Because CSI is designed to be used in distributed systems, it is inevitable that sometimes volumes will become attached to nodes that get stuck or lost, temporarily or permanently.
Rather than require an administrator to manually clean up such situation, CSI offers a way disconnect a volume from a node "out of order" such that a volume can be disconnected from a problematic node, and safely connected to a different node, and the node can be reliably and safely cleaned up before accessing that volume again, as opposed to the normal path where the node must confirm a volume is disconnected before the controller can unpublish it.
Member

Suggested change
Rather than require an administrator to manually clean up such situation, CSI offers a way disconnect a volume from a node "out of order" such that a volume can be disconnected from a problematic node, and safely connected to a different node, and the node can be reliably and safely cleaned up before accessing that volume again, as opposed to the normal path where the node must confirm a volume is disconnected before the controller can unpublish it.
Rather than require an administrator to manually clean up in such a situation, CSI offers a way to disconnect a volume from a node "out of order" such that a volume can be disconnected from problematic *node A*, and safely connected to a different *node B*, and then *node A* can be reliably and safely cleaned up before accessing that volume again; as opposed to the normal path whereby *node A* must confirm a volume is disconnected before the controller can unpublish it.

@@ -1322,6 +1337,17 @@ message ControllerUnpublishVolumeRequest {
// This field is OPTIONAL. Refer to the `Secrets Requirements`
// section on how to use this field.
map<string, string> secrets = 3 [(csi_secret) = true];

// Indicates SP MUST make the volume inaccessible to the node or nodes
// it is being unpublished from. Any attempt to read or write data
Member

I agree w/ the spirit of this statement. I do wonder how realistic this is, given that CSI doesn't actually govern the data path and it kind of depends on one or both of (a) the underlying OS / driver implementation; (b) firmware running on a node's hardware controller.

Contributor Author

Yes, the capability is optional for this reason. The hardware needs to be able to distinguish between nodes and deny access to some while allowing access to others, and furthermore needs to be able to revoke access from nodes which currently have access and are connected. The security models of many storage systems are too primitive to achieve both of these things, or even one of them.

spec.md Outdated
@@ -2161,9 +2191,13 @@ This RPC is a reverse operation of `NodeStageVolume`.
This RPC MUST undo the work by the corresponding `NodeStageVolume`.
This RPC SHALL be called by the CO once for each `staging_target_path` that was successfully setup via `NodeStageVolume`.

If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability and the Node Plugin has `STAGE_UNSTAGE_VOLUME` capability, the CO MUST guarantee that this RPC is called and returns success before calling `ControllerUnpublishVolume` for the given node and the given volume.
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability and the Node Plugin has `STAGE_UNSTAGE_VOLUME` capability, the CO MUST guarantee that this RPC is called and returns success before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.
Member

consistent grammar

Suggested change
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability and the Node Plugin has `STAGE_UNSTAGE_VOLUME` capability, the CO MUST guarantee that this RPC is called and returns success before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.
If the corresponding Controller Plugin has the `PUBLISH_UNPUBLISH_VOLUME` controller capability and the Node Plugin has the `STAGE_UNSTAGE_VOLUME` capability, the CO MUST guarantee that this RPC is called and returns success before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has the `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.

The CO MUST guarantee that this RPC is called after all `NodeUnpublishVolume` have been called and returned success for the given volume on the given node.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even if it means losing data.
It is essential that after a successful call to `NodeUnstageVolume` that there will be no buffered data on the node related to the volume which might result in unintentional modification of the volume if it was to be subsequently re-staged to that node.
Member

this talk of "buffered data": again, I agree w/ the spirit of this, but I worry about real life implementation. have we considered what it might look like to write a CSI sanity test for this?

Contributor Author

Parts of this are easy to test. We can simulate a situation where the CO (mis)identifies a node as down, and go through the steps of revoking access to a volume from a workload, cleaning up the volume on the node, and reestablishing access to the volume on the same node, and ensuring that the process completes without errors. It would be even better to do this across 2 nodes, so we could simulate the workload moving to another node -- but csi-sanity does not lend itself to multi-node testing.

The parts that would be harder to test would be:

  • verifying that the workload really lost access to the storage after the fence operations
  • verifying that any writes to the volume only came from the new node after a simulated failover
  • verifying that the old node didn't corrupt the volume after re-establishing access following a complete and proper cleanup

The basic problem is that CSI itself doesn't give access to low-level-enough information about the volume to reliably determine the absence of these bad cases. We could do some best-effort checking of these things but it's impossible to prove the absence of side effects in a cross-platform way.

The CO MUST guarantee that this RPC is called after all `NodeUnpublishVolume` have been called and returned success for the given volume on the given node.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even if it means losing data.
Member

how does a user communicate to a CO that data loss is acceptable? I get the point here, but I'm wondering if we're putting the cart ahead of the horse w/ this feature: if no CO's allow a user to say "yeah, I don't care about data loss here" then ... who's going to be able to use this feature? maybe this was already discussed in k8s SIG storage, and there's a grand plan for a k8s API to deal with this nuance?

spec.md Outdated
@@ -2323,9 +2364,13 @@ A Node Plugin MUST implement this RPC call.
This RPC is a reverse operation of `NodePublishVolume`.
This RPC MUST undo the work by the corresponding `NodePublishVolume`.
This RPC SHALL be called by the CO at least once for each `target_path` that was successfully setup via `NodePublishVolume`.
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO SHOULD issue all `NodeUnpublishVolume` (as specified above) before calling `ControllerUnpublishVolume` for the given node and the given volume.
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO SHOULD issue all `NodeUnpublishVolume` (as specified above) before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.
Member

Suggested change
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO SHOULD issue all `NodeUnpublishVolume` (as specified above) before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.
If the corresponding Controller Plugin has the `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO SHOULD issue `NodeUnpublishVolume` (as specified above) before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has the `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.

spec.md Outdated
The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unpublishing volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even if it means losing data.
It is essential that after a successful call to `NodeUnpublishVolume` that there will be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it was to be subsequently re-published to that node.
Member

Suggested change
It is essential that after a successful call to `NodeUnpublishVolume` that there will be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it was to be subsequently re-published to that node.
It is essential that after a successful call to `NodeUnpublishVolume` that there will be no buffered data on the node related to the volume that might result in unintentional modification of the volume if it was to be subsequently re-published to that node.

spec.md Outdated
@@ -1704,6 +1730,10 @@ message ControllerServiceCapability {
// This enables COs to, for example, fetch per volume
// condition after a volume is provisioned.
GET_VOLUME = 12 [(alpha_enum_value) = true];

// Indicates the SP supports ControllerUnpublishVolume.fence
Member

Suggested change
// Indicates the SP supports ControllerUnpublishVolume.fence
// Indicates the SP supports the ControllerUnpublishVolume.fence

spec.md Outdated
The Plugin SHOULD perform the work that is necessary for making the volume ready to be consumed by a different node.
The Plugin MUST NOT assume that this RPC will be executed on the node where the volume was previously used.

If the plugin has the `UNPUBLISH_FENCE` capability, the CO MAY specify `fence` as `true`, in which case the SP MUST ensure that the node may no longer access the volume before returning a successful response.
This results in a transition into one of the `QUARANTINE` states where the node must be cleaned up without being able to access the volume like usual.
This is intended cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
Member

Suggested change
This is intended cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
This is intended to cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.

spec.md Outdated
@@ -1293,10 +1303,15 @@ The CO MUST implement the specified error recovery behavior when it encounters t

Controller Plugin MUST implement this RPC call if it has `PUBLISH_UNPUBLISH_VOLUME` controller capability.
This RPC is a reverse operation of `ControllerPublishVolume`.
It MUST be called after all `NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed.
It MUST be called after all `NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed unless the plugin has the `UNPUBLISH_FENCE` capability.
Member

Suggested change
It MUST be called after all `NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed unless the plugin has the `UNPUBLISH_FENCE` capability.
It MUST be called after both `NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed, unless the plugin has the `UNPUBLISH_FENCE` capability.

Contributor

@xing-yang xing-yang left a comment

A couple more nits. Otherwise LGTM.

spec.md Outdated

The purpose of the `QUARANTINE_S`, `QUARANTINE_P`, and `QUARANTINE_SP` states are to enable recovery from node problems.
Because CSI is designed to be used in distributed systems, it is inevitable that sometimes volumes will become attached to nodes that get stuck or lost, temporarily or permanently.
Rather than require an administrator to manually clean up such situation, CSI offers a way disconnect a volume from a node "out of order" such that a volume can be disconnected from a problematic node, and safely connected to a different node, and the node can be reliably and safely cleaned up before accessing that volume again, as opposed to the normal path where the node must confirm a volume is disconnected before the controller can unpublish it.
Contributor

way disconnect -> way to disconnect

spec.md Outdated
The Plugin SHOULD perform the work that is necessary for making the volume ready to be consumed by a different node.
The Plugin MUST NOT assume that this RPC will be executed on the node where the volume was previously used.

If the plugin has the `UNPUBLISH_FENCE` capability, the CO MAY specify `fence` as `true`, in which case the SP MUST ensure that the node may no longer access the volume before returning a successful response.
This results in a transition into one of the `QUARANTINE` states where the node must be cleaned up without being able to access the volume like usual.
This is intended cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
Contributor

intended cut -> intended to cut

@bswartz
Contributor Author

bswartz commented Jun 6, 2021

Mostly nits; this seems well thought out. I feel that it's worth reiterating that unless CO's offer an API for a user to indicate "I don't care about data loss", manual administrator effort is still involved - and then it's unclear what this API actually buys anyone. If the goal is complete automation (w/ respect to cleanup and unpinning volumes) in the case of problematic nodes, then CO's need to address this at the API level as well. What's the plan for that, and if that hasn't landed yet - why?

@jdef Thanks for the review, I will take another pass and clean up the language further by including your suggestions.

I don't agree that manual effort is still needed. The idea is that, by invoking ControllerUnpublishVolume with the fence flag set, you've already accepted the data loss; everything after that is just cleanup. This is expected to happen in situations where the node is not responsive, which means that either (1) the node actually failed and the workload effectively crashed, leaving the volume in whatever state it was in at the moment the node crashed, or (2) the node is still running but the workload WILL crash when it loses access to its storage (the equivalent of yanking a USB drive out while an application is writing to it).

The real question is under what circumstances a CO might choose to make this call, and that IS a more challenging question (answered below). But I think we should at least be able to agree that once the call is made, there's no recovering the lost data; at that point we're cutting our losses and trying to restore the system to a usable state in an automated way, which is why the force node unpublish/unstage are critical.

To answer the question of why anyone would make the ControllerUnpublishVolume call with the fence flag set, we have to think about workloads that highly value uptime. Historically, these applications have been a poor fit for container orchestrator systems like Kubernetes and people have run them in extremely expensive compute clusters. Now that people are trying to run these kinds of applications in container orchestrators, they're looking for ways to achieve uptime guarantees similar to those they were able to get in the old world. The reality is that nodes do fail, and while workloads can migrate, shared state makes moving the workload unsafe unless you're SURE the old node is dead and not merely uncommunicative.

Clustered systems have historically solved this with a technique called STONITH (Shoot The Other Node In The Head) which kills the node whether it's dead or alive, making it safe for shared state to be taken over by a new node. There are various ways of achieving the shooting of the other node, including cutting its power (if you have programmatic access to the power supply), or killing the VM running the node (if you have access to the hypervisor) which are both perfectly valid. This CSI spec proposal adds a 3rd way which relies only on access to the storage device and the storage device having the capability to fence a node (not all do, but it's an optional capability). COs are then able to choose the most appropriate technique to make nodes definitely dead when they're trying to rapidly migrate a workload from a node of questionable state to a healthy node.

Presumably COs will also allow policy settings that allow end users to specify their preference for either aggressively high uptime or a more restrained response that allows workloads on dead nodes to stay there in hopes that they might still be alive. In no case does this need to be a manual response. The policy can be set up front and the system can execute it, fencing workloads for which the policy says to prefer uptime very quickly after the node becomes unreachable.

@jdef
Member

jdef commented Jun 6, 2021 via email

@xing-yang
Contributor

@jdef There's a KEP in Kubernetes that is planning to use this CSI spec change. See here: kubernetes/enhancements#1116

@bswartz
Contributor Author

bswartz commented Jun 6, 2021

Thanks for the detailed response. Can you provide a policy example of how a CO allows a user to state that they are OK with data loss? Because otherwise this feels like "build it and they will come" .. but maybe they won't because users aren't asking for this? Or maybe they are.. can you share? It's a cool feature, and I don't see how CO's are ready for it. Do you?

The way I think sophisticated users think about this is: node failures aren't some unlikely event to be avoided, they're an inevitable certainty at scale and you can calculate the expected frequency of them with knowledge of the underlying infrastructure and its failure modes. Given that node failures can be challenging to detect reliably, most systems (like Kubernetes) rely on a simpler heuristic, like absence of X heartbeats, to determine a node failure. Heuristics like this will yield false positives with some frequency, and treating a node like it's failed when it hasn't actually failed can lead to even bigger problems when the workload on the pod has access to external storage.

I want to stress that the choice the end user has to make is not between accepting data loss and not accepting data loss -- it's between handling the false-positive case aggressively or conservatively. When nodes actually fail, data gets lost and the user's only choice is how patiently to wait to find out that's what really happened. When the system falsely detects a failed node, the choice is between causing a failure by evicting the workload forcefully and waiting to discover that the failure detection was in fact false.

The knobs that Kubernetes gives end users are pod-level tolerations to the node taints that Kubernetes applies to nodes that it deems unreachable or not-ready. By default pods tolerate these taints for 5 minutes, after which the pod eviction manager kills them and tries to reschedule them. Users that value higher workload uptime would tune these tolerations down from 5 minutes to something like 10 seconds. Users that prefer to not disturb pods might tune this value even higher.

Administrators have additional knobs that let them tune the node-failure detection thresholds to be more aggressive (lower timeouts) or more relaxed (higher timeouts).

I don't think you need more control than allowing users to set these timeout values. The proposed change in this PR and the KEP that @xing-yang linked is to adjust what we do after the pod eviction manager decides to kill the pod and reschedule it elsewhere. Today the workload can get stuck because, while the system wants to move the workload to a new node, it is unable to do so safely because of the presence of an external volume that could get corrupted if 2 nodes mount it at the same time. The PR simply allows the eviction to occur safely by ensuring that no more than 1 node has access to the volume at a time.
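
For concreteness, a hedged example of those knobs using the upstream Kubernetes Go types: a pod toleration that shortens the default 5-minute eviction delay to 10 seconds for unreachable/not-ready nodes.

import corev1 "k8s.io/api/core/v1"

// aggressiveFailoverTolerations evicts the pod 10 seconds after the node is
// tainted unreachable or not-ready, instead of the default 5 minutes.
func aggressiveFailoverTolerations() []corev1.Toleration {
	ten := int64(10)
	return []corev1.Toleration{
		{
			Key:               "node.kubernetes.io/unreachable",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &ten,
		},
		{
			Key:               "node.kubernetes.io/not-ready",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &ten,
		},
	}
}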

@bswartz
Contributor Author

bswartz commented Jun 6, 2021

@jdef Thank you for all the grammar suggestions, they have been implemented.

I plan to squash this PR when everyone is satisfied with it.

Contributor

@xing-yang xing-yang left a comment

lgtm

If the plugin has the `UNPUBLISH_FENCE` capability, the CO MAY specify `fence` as `true`, in which case the SP MUST ensure that the node may no longer access the volume before returning a successful response.
This results in a transition into one of the `QUARANTINE` states where the node must be cleaned up without being able to access the volume like usual.
This is intended to cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
Once in one of the `QUARANTINE` states the volume MAY NOT be published to that node again until appropriate cleanup has happened using `NodeUnpublishVolume` and `NodeUnstageVolume` (if applicable).
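
To make the cleanup step concrete, here is a rough sketch of draining a quarantined volume once the node comes back, assuming the `force` fields proposed in this PR are generated into the Go bindings as `Force` (hypothetical; not part of any released CSI version):

```go
package fencing

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// cleanupQuarantinedVolume walks a quarantined volume back toward CREATED on
// a node that has already been fenced: forced NodeUnpublishVolume first, then
// forced NodeUnstageVolume, discarding any buffered data rather than trying
// to flush it to a device the node can no longer reach.
func cleanupQuarantinedVolume(ctx context.Context, node csi.NodeClient,
	volumeID, targetPath, stagingPath string) error {
	if _, err := node.NodeUnpublishVolume(ctx, &csi.NodeUnpublishVolumeRequest{
		VolumeId:   volumeID,
		TargetPath: targetPath,
		Force:      true, // hypothetical field proposed in this PR
	}); err != nil {
		return err
	}
	_, err := node.NodeUnstageVolume(ctx, &csi.NodeUnstageVolumeRequest{
		VolumeId:          volumeID,
		StagingTargetPath: stagingPath,
		Force:             true, // hypothetical field proposed in this PR
	})
	return err
}
```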
Contributor

Thinking from the k8s point of view - how will we actually perform the cleanup on the node? Say a node was shut down with volumes attached to it, and hence the CO calls ControllerUnpublishVolume with fence: true. The node comes back:

  1. Should the node have some taint to prevent scheduling of pods with these volumes?
  2. We could implement something during volume reconstruction in kubelet, but I am still unsure how kubelet will mark those volumes as "clean".
  3. Are we proposing additional state checks in kube-controller-manager that could detect whether a volume needs to be detached with fence: true? Currently KCM won't detach volumes from unresponsive/shutdown nodes at all.

Contributor

@gnufied Can you help review this KEP? kubernetes/enhancements#1116

Contributor

Thanks. I did not know that the k8s KEP had been updated with this CSI spec change. I will take a look.

@saad-ali
Member

saad-ali commented Jun 7, 2021

#476 and #468 have merged.

I will hold off on cutting CSI.next RC until tomorrow (Tuesday, June 8) to see if we can get all the approvals in for this PR as well. If not, we will proceed with CSI.next RC without this PR.

Add new controller capability:
* UNPUBLISH_FENCE

Add new node capability:
* FORCE_UNPUBLISH
Ensure all changes are part of spec.md, and csi.proto is generated by "make".
Mark new fields as alpha.
Include a top-level note about quarantine states.
@bswartz
Contributor Author

bswartz commented Jun 7, 2021

Rebased, will squash after approvals from @jdef @gnufied

// CO MUST NOT set this field to true unless SP has the
// FORCE_UNPUBLISH node capability.
// This is an OPTIONAL field.
bool force = 3 [(alpha_field) = true];

I am not sure what the driver will do differently when force is set to true or false.

Contributor Author

Force would suggest that a "umount -f" would be acceptable rather than a "umount". Generally umount will not succeed if buffered data would be lost because the underlying block device is unwritable. The -f flag tells umount to not worry about this and just kill the mount.
There can be similar checks at lower levels that might ordinarily fail if they can't be done "safely" (i.e. without losing data) that should be skipped in the presence of the force flag.
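
For illustration, a minimal sketch of what that difference could look like in a Linux node plugin, assuming `golang.org/x/sys/unix`; this is one possible implementation, not something the spec prescribes:

```go
package nodeplugin

import "golang.org/x/sys/unix"

// unmountTarget unmounts the NodeUnpublishVolume target path. With force set
// it behaves like "umount -f": it does not insist on flushing buffered writes
// to a backing device that may already be fenced off, accepting that any such
// data is discarded.
func unmountTarget(targetPath string, force bool) error {
	flags := 0
	if force {
		flags = unix.MNT_FORCE
	}
	if err := unix.Unmount(targetPath, flags); err != nil {
		// Falling back to a lazy detach is a common trick for a busy mount,
		// but whether that is acceptable here is an SP-specific decision.
		return unix.Unmount(targetPath, flags|unix.MNT_DETACH)
	}
	return nil
}
```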

Contributor Author

In case it's not clear, the REASON for this is that if you've already cut off the node from the storage and moved the workload elsewhere (because the CO falsely detected the node was down and used the fence operation), then flushing the buffered data would result in corruption, so it's correct to discard it. But the SP cannot know this fact, so the flag gives the CO a way to communicate it. Ordinarily SPs are expected to fail unpublish/unstage if they can't be done safely, because not failing in those cases could cause data corruption for a different reason.

@jingxu97

jingxu97 commented Jun 8, 2021

/hold

@jdef
Member

jdef commented Jun 8, 2021 via email

@jingxu97

jingxu97 commented Jun 8, 2021

I see @jdef raised a number of questions in the KEP. Since this change depends on the logic in the KEP, it may be better to hold off on merging this one.

@bswartz
Contributor Author

bswartz commented Jun 8, 2021

> I've thought more about this and I'm still not sold. This smells like a bigger infra problem, not a volume one. E.g. we could intro a controller rpc called "NodeGone" that the CO uses to tell the SP that a node was shot, and so bookkeeping is needed. Let someone else handle shooting the node, and once bookkeeping is done the workloads are migrated to another node with access to required volumes.

I like this proposal a lot from the SP implementor's side, because fencing ALL of the volumes from a given node is technically easier than fencing volumes one by one. However I worry this makes the CO implementor's job harder, because I'm presuming that volume state is tracked individually, and if there are a large number of volumes published to a particular node, with perhaps a mix of more than one SP, you'd end up doing some kind of batch processing of the volumes on a per-SP basis. I'm open to this idea though.

@jdef
Member

jdef commented Jun 9, 2021

> I've thought more about this and I'm still not sold. This smells like a bigger infra problem, not a volume one. E.g. we could intro a controller rpc called "NodeGone" that the CO uses to tell the SP that a node was shot, and so bookkeeping is needed. Let someone else handle shooting the node, and once bookkeeping is done the workloads are migrated to another node with access to required volumes.
>
> I like this proposal a lot from the SP implementor's side, because fencing ALL of the volumes from a given node is technically easier than fencing volumes one by one. However I worry this makes the CO implementor's job harder, because I'm presuming that volume state is tracked individually, and if there are a large number of volumes published to a particular node, with perhaps a mix of more than one SP, you'd end up doing some kind of batch processing of the volumes on a per-SP basis. I'm open to this idea though.

I'd much rather add the majority of the complexity to the CO side of the house, instead of plugins.

Of course, with an API like NodeGone the question becomes "is the node really GONE, or will it come back - and if it comes back, will it have been rebooted (to start w/ a clean slate) or was it just disconnected from the network for a while?". So we'd probably need to carefully define the expectations here. I know that Mesos wrestled with this for a bit. I'm not sure if there's already an equivalent of "gone" in k8s. Either way, the CO should really be certain that a node is "gone" before invoking the CSI RPC to signal that state.
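
Purely to illustrate the alternative under discussion (no such RPC exists in CSI, and all names below are made up), a node-scoped call might look something like:

```go
package alternative

import "context"

// NodeGoneRequest and NodeGoneController are illustrative only. They sketch
// the idea of telling the SP once that a node is gone, instead of fencing its
// volumes one by one, which pushes the per-volume bookkeeping onto the SP.
type NodeGoneRequest struct {
	// NodeId identifies the node the CO has given up on.
	NodeId string
	// Permanent distinguishes "shot and never coming back" from "unreachable,
	// might return after a reboot", which is the ambiguity raised above.
	Permanent bool
}

type NodeGoneResponse struct{}

type NodeGoneController interface {
	// NodeGone tells the SP to revoke the node's access to every volume it
	// has published there and to do whatever bookkeeping that requires.
	NodeGone(ctx context.Context, req *NodeGoneRequest) (*NodeGoneResponse, error)
}
```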

+------------+     ControllerUnpublishVolume(fenced)      +---------------+
| PUBLISHED  +--------------------------------------------->| QUARANTINED_P |
+------------+                                              +---------------+


What if the node is shut down while in the NODE_READY state? Can the state be moved to QUARANTINED_P only from PUBLISHED?

Contributor Author

Right, the QUARANTINE_* states are only designed to cover the case where ControllerUnpublishVolume() is used while in a state where it would currently be forbidden, namely PUBLISHED or VOL_READY. If the node is shut down/rebooted, that's a separate case to consider, but without calling ControllerUnpublishVolume(), it should not affect anything.


Thank you for your reply. You mean that ControllerUnpublishVolume() can be used not only in the PUBLISHED state but also in VOL_READY? If so, in the figure below, should an arrow to QUARANTINED_S also come from VOL_READY? I mean, will the operation below also be allowed?

               ControllerUnpublishVolume(fenced)
VOL_READY -----------------------------------------> QUARANTINED_SP 

Contributor Author

Yeah there should be another arrow in the diagram but it will be very hard to squeeze it into the ASCII art. I will try.

@mkimuram mkimuram Jul 2, 2021

@bswartz

How about using a tool like this?
Just a few human-readable lines like the ones below will generate output equivalent to the ASCII art.

stateDiagram-v2
    [*] -->  CREATED : CreateVolume
    CREATED --> [*] : DeleteVolume
    CREATED --> NODE_READY : Controller Publish Volume
    NODE_READY --> CREATED : Controller Unpublish Volume
    NODE_READY --> VOL_READY : Node Stage Volume
    VOL_READY --> NODE_READY : Node Unstage Volume
    VOL_READY --> PUBLISHED : Node Publish Volume
    VOL_READY --> QUARANTINED_S : Controller Unpublish Volume (fenced)
    PUBLISHED --> VOL_READY : Node Unpublish Volume
    PUBLISHED --> QUARANTINED_SP : Controller Unpublish Volume (fenced)
    QUARANTINED_SP --> QUARANTINED_S : Node Unpublish Volume(forced)
    QUARANTINED_S --> CREATED : Node Unstage Volume (forced)

Contributor Author

Completely agree that that's a better way, but changing the format of these diagrams is beyond the scope of my proposal. Maybe submit a separate PR that replaces the ASCII art with more readable figures? Assuming that other PR merges before this, I'd be happy to rebase on top of it and greatly simplify my changes to the state diagram.


Agreed. I don't mean to block this PR from being merged over whether you use the tool. You may just want to consider recreating the diagram with the tool if you ever need another big change.


@bswartz thank you for your reply. And also, thank you for updating the diagram.

@YuikoTakada

This comment was posted:
https://github.com/kubernetes/enhancements/pull/1116/files#diff-0225593bb1191b37cc24ad60c172668c3df10b62f2fd748ceb1bbe85ddf078ceR107

> This would trigger the deletion of the volumeAttachment objects.
> This would allow ControllerUnpublishVolume to happen before NodeUnpublishVolume and/or NodeUnstageVolume are called.
> Note that there is no additional code changes required for this step.

Is this true? In short, once the volumeAttachment has been deleted, can ControllerUnpublishVolume happen before NodeUnpublishVolume and/or NodeUnstageVolume are called, without the change suggested in this PR?

@jdef
Member

jdef commented Dec 28, 2021 via email

@bswartz
Contributor Author

bswartz commented Dec 28, 2021

> Out of order CSI calls are not spec compliant. I've said as much in the KEP

I agree with this. And because reliably and quickly recovering from a node failure requires making the CSI calls out of order, this PR proposes allowing the out-of-order calls under special and strict conditions. Today what we see happening in Kubernetes is just willful violation of the spec because there's no "correct" way to do this, and I'd prefer to update the spec with a correct way.
