
Add force detach #477

Open: wants to merge 5 commits into master

Conversation

Contributor

@bswartz bswartz commented Apr 26, 2021

Add new controller capability:

  • UNPUBLISH_FENCE

Add new node capability:

  • FORCE_UNPUBLISH

@xing-yang
Contributor

@bswartz did you modify csi.proto manually? I see that you added the new capabilities in csi.proto, but I didn't find them in spec.md. Changes should only be made in spec.md. After "make", csi.proto will be updated automatically.

@bswartz
Contributor Author

bswartz commented Apr 28, 2021

@bswartz did you modify csi.proto manually? I see that you added the new capabilities in csi.proto, but I didn't find them in spec.md. Changes should only be made in spec.md. After "make", csi.proto will be updated automatically.

Seriously?!? I didn't know that! I'll update.

@nixpanic nixpanic left a comment

If CSI puts a node in a quarantine state, which component (and how) should mark the node as recovered and get it out of the quarantine state?

I guess that the quarantine state is per node+volume, making it possible to have other volumes (potentially from the same csi-driver) functioning correctly. Is that understanding correct?

spec.md Outdated
The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unpublishing volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
It is essential that after a successful call to `NodeUnpublishVolume` that there be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it were to be subsequently re-published to that node.


typo: unintetional -> unintentional

csi.proto Outdated
@@ -763,6 +763,17 @@ message ControllerUnpublishVolumeRequest {
// This field is OPTIONAL. Refer to the `Secrets Requirements`
// section on how to use this field.
map<string, string> secrets = 3 [(csi_secret) = true];

// Indicates SP MUST make the volume inacessible to the node or nodes
Contributor

@humblec humblec May 4, 2021

s/inacessible/inaccessible

Contributor

@bswartz maybe you missed this? :)

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
It is essential that after a successful call to `NodeUnstageVolume` that there be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it were to be subsequently re-staged to that node.

Contributor

s/unintetional/unintentional

Contributor

@bswartz maybe correct this too?

// to a volume from a node that has been fenced MUST NOT succeed,
// even if the volume remains staged and/or published on the node.
// CO MUST NOT set this field to true unless SP has the
// UNPUBLISH_FENCE controller capability.
Contributor

One of the main doubts I have here is: what if the SP implements ONLY the node capability? Is that SP driver still intended to be able to use the fence capability successfully?

Contributor Author

If the plugin has no controller, then there's nothing for the CO to talk to in the event that a node becomes unreachable. I would expect a CO to use both of these new capabilities together. However, if a plugin can only support UNPUBLISH_FENCE and not the FORCE_UNPUBLISH capability, then it's still possible to use the controller capability to safely move workloads and their attached volumes to a new node in the event of a node failure, and to manage the node cleanup process through a full node reboot. If a plugin supports FORCE_UNPUBLISH only and not UNPUBLISH_FENCE, then it's not particularly useful for anything except more aggressive node cleanup.

Contributor

@humblec humblec May 5, 2021

@bswartz yeah, that clarifies my doubt! IMO it would be good to document the above scenario (absence of either the controller or the node capability) in this spec, which can avoid some confusion. On a side note, I was thinking that even though it is just the FORCE_UNPUBLISH capability advertised by the SP driver, it is still useful for cutting/fencing access to volumes, because with this enhancement the NODE capability is granular to the volume level. Removing access to the volume, e.g. by blacklisting certain node IPs, is possible from the SP driver's point of view if the CO can flag it (not sure how exactly, though :) ) in the 'next' NODE call for this volume.


Thinking a bit more from the pov of a csi driver which does NOT support controller publish/unpublish, but supports node stage/unstage (e.g. the CSI driver for GCP Filestore), I have a couple of questions:

  1. In the current proposal, the driver does not have a way to opt in to leveraging the fence capability (even if there is a way to blacklist node IPs for a given NFS filer) without supporting controller publish/unpublish, correct? E.g. today there is a controller service PUBLISH_UNPUBLISH capability, but we don't have a controller capability to only support UNPUBLISH with fence (where a regular controller unpublish is a no-op by the driver, but the unpublish with fence=true can quarantine a node-volume pair). A no-op controller publish would be costly since it would create extra resources like VolumeAttachment (VA) objects in k8s.
  2. The second question is around FORCE_UNPUBLISH. I am trying to understand some example scenarios for the conditions under which the CO would trigger a NODE UNPUBLISH/UNSTAGE (with force=true). Does the CO trigger node unstage/unpublish (force=true) only for quarantined volumes? Because if so, then a driver which does NOT support controller PUBLISH_UNPUBLISH (and thereby has no way to quarantine/fence a volume), but supports node stage/publish, cannot leverage the force options (e.g. force unmount for NFS in case of NFS server unavailability).

Contributor Author

1. In the current proposal, the driver does not have a way to opt in to leveraging the fence capability (even if there is a way to blacklist node IPs for a given NFS filer) without supporting controller publish/unpublish, correct? E.g. today there is a controller service PUBLISH_UNPUBLISH capability, but we don't have a controller capability to only support UNPUBLISH with fence (where a regular controller unpublish is a no-op by the driver, but the unpublish with fence=true can quarantine a node-volume pair). A no-op controller publish would be costly since it would create extra resources like VolumeAttachment (VA) objects in k8s.

Yes, in order for this feature to be useful to COs, there has to be a control path for the CO to tell the SP which nodes may access the volume and which can't. This is the whole purpose of ControllerPublish/Unpublish. In any scheme where the access granting/denying is out of band, you can't automate it from the CO side, and this feature is about automating the recovery from node failures.

2. The second question is around FORCE_UNPUBLISH. I am trying to understand some example scenarios for the conditions under which the CO would trigger a NODE UNPUBLISH/UNSTAGE (with force=true). Does the CO trigger node unstage/unpublish (force=true) only for quarantined volumes? Because if so, then a driver which does NOT support controller PUBLISH_UNPUBLISH (and thereby has no way to quarantine/fence a volume), but supports node stage/publish, cannot leverage the force options (e.g. force unmount for NFS in case of NFS server unavailability).

The CO should use the force flag on unstage/unpublish if it's cleaning up a volume on a node where data has been lost, or will inevitably be lost. Normal behavior on unpublish/unstage is to fail if data loss would occur, and for the CO to keep retrying, so that data loss does not occur by accident. The CO has to make the judgement that data loss is acceptable and inform the SP of that by using the force flag, so that the SP can successfully clean up, discarding any pending data. Situations like this will occur when the CO has chosen to fence the node and move a workload to a different node (in the interest of minimizing workload downtime) or if the storage itself has experienced a data-losing failure and the CO wishes to write off the lost volume by cleaning up the attachment on the node.
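
As a rough sketch of that judgement (the helper and flag names below are purely illustrative, not anything in the spec), the CO-side decision boils down to:

// Illustrative only: force is justified only once data loss is already
// unavoidable, either because the node was fenced or because the storage
// backend itself has failed.
func shouldForceCleanup(nodeFenced, storageFailed bool) bool {
	// While neither is true, normal NodeUnpublish/NodeUnstage should keep
	// failing (and being retried) so buffered data is not silently dropped.
	return nodeFenced || storageFailed
}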

// Indicates SP MUST make the volume inacessible to the node or nodes
// it is being unpublished from. Any attempt to read or write data
// to a volume from a node that has been fenced MUST NOT succeed,
// even if the volume remains staged and/or published on the node.
Contributor

If there are a couple of pods accessing the volume on the same node, we would have a state with more than one publish and just one stage. Considering that the NodeUnpublish request arrives for a particular volume handle, if it is fenced at the stage level, that is not the desired outcome we intended with this enhancement/spec, isn't it?

Contributor Author

Yes, the normal set of NodeUnpublish and NodeUnstage calls still have to be made for each volume on each node to return to the original state. The trigger for the quarantined state is the ControllerUnpublish call with the fence flag. The above statement just means that, without communicating with the node, the SP is expected to render the volume inaccessible to the node regardless of the node's state, and the node will be forced to clean up under conditions where it can't access the volume. The only way the node will be able to access the volume again is for the node-side cleanup to complete and then for a subsequent ControllerPublish to happen for that node again.
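
For illustration only, here is a hedged sketch of that sequence against the existing Go bindings; the fence/force fields this PR proposes are left as comments because they do not exist in the generated code:

package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// Sketch of the recovery flow: fence first, republish elsewhere, then force
// node-side cleanup once the old node is reachable again.
func recoverFromLostNode(ctx context.Context, ctrl csi.ControllerClient, node csi.NodeClient,
	volumeID, lostNodeID, newNodeID, stagingPath, targetPath string) error {

	// 1. Revoke the lost node's access without talking to it. Normally this
	//    RPC is only legal after node cleanup; UNPUBLISH_FENCE relaxes that.
	if _, err := ctrl.ControllerUnpublishVolume(ctx, &csi.ControllerUnpublishVolumeRequest{
		VolumeId: volumeID,
		NodeId:   lostNodeID,
		// Fence: true, // proposed alpha field from this PR
	}); err != nil {
		return err
	}

	// 2. The volume may now be safely published to a healthy node.
	if _, err := ctrl.ControllerPublishVolume(ctx, &csi.ControllerPublishVolumeRequest{
		VolumeId: volumeID,
		NodeId:   newNodeID,
		// Required VolumeCapability and readonly omitted for brevity.
	}); err != nil {
		return err
	}

	// 3. When the lost node returns, clean it up with force=true before any
	//    new ControllerPublishVolume targets that node again.
	if _, err := node.NodeUnpublishVolume(ctx, &csi.NodeUnpublishVolumeRequest{
		VolumeId:   volumeID,
		TargetPath: targetPath,
		// Force: true, // proposed alpha field from this PR
	}); err != nil {
		return err
	}
	_, err := node.NodeUnstageVolume(ctx, &csi.NodeUnstageVolumeRequest{
		VolumeId:          volumeID,
		StagingTargetPath: stagingPath,
		// Force: true, // proposed alpha field from this PR
	})
	return err
}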

@bswartz
Contributor Author

bswartz commented May 4, 2021

If CSI puts a node in a quarantine state, which component (and how) should mark the node as recovered and get it out of the quarantine state?

All of the state management in CSI is implicit. Nowhere in the RPCs are the states represented explicitly. The CO needs to represent the new quarantine states for the scheme to work correctly, and the SP needs to relax its assumptions about what order some RPCs may be called in if they assert the new capabilities.

I guess that the quarantine state is per node+volume, making it possible to have other volumes (potentially from the same csi-driver) functioning correctly. Is that understanding correct?

In theory the states are per volume, although the expectation is that this feature would be used on all of the volumes on a given node at once, both for the ControllerUnpublish(fenced) and the various node cleanup functions. COs still have to track state per volume, but I would expect an implementation to move all of the volumes to a quarantine state upon detection of a node problem, and then to move all of the volumes out of a quarantine state before attaching any new volumes upon resumption of a node's participation in a cluster. That's not strictly necessary, but it's cleaner and simpler.

@nixpanic nixpanic left a comment

Many thanks for the reply and clarifications! From a Ceph-CSI point of view, this looks workable, we can add fenced nodes to a blocklist, and once recovered allow access from the nodes again.

If the plugin has the `UNPUBLISH_FENCE` capability, the CO MAY specify `fence` as `true`, in which case the SP MUST ensure that the node may no longer access the volume before returning a successful response.
This results in a transition into one of the `QUARANTINE` states where the node must be cleaned up without being able to access the volume like usual.
This is intended cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
Once in one of the `QUARANTINE` states the volume MAY NOT be published to that node again until appropriate cleanup has happened using `NodeUnpublishVolume` and `NodeUnstageVolume` (if applicable).

This wording makes the calling of NodeUnpublishVolume and NodeUnstageVolume optional, is that intentional?

In other words, it seems acceptable that the CO calls ControllerPublish for a node, only once the node is recovered. So ControllerPublish should also do unfencing in case the node was fenced earlier. The CO could wipe its known state of a node (and attached volumes), without trying to call NodeUnpublishVolume and NodeUnstageVolume after the fencing.

Contributor Author

This wording makes the calling of NodeUnpublishVolume and NodeUnstageVolume optional, is that intentional?

Well, NodeUnstage is always optional because it's capability-controlled. If the capability is present, then you have to use it after NodeUnpublish to get correct results. However, the other alternative is to reboot the whole node and reset its state. All of the concerns we have about the potential for data corruption result from having unflushed write buffers in the kernel. If you reboot, those concerns go away. The point of the forced unpublish/unstage is to provide a path to cleaning up without requiring a reboot.

In other words, it seems acceptable that the CO calls ControllerPublish for a node, only once the node is recovered. So ControllerPublish should also do unfencing in case the node was fenced earlier. The CO could wipe its known state of a node (and attached volumes), without trying to call NodeUnpublishVolume and NodeUnstageVolume after the fencing.

Yes. In fact it's acceptable today for ControllerUnpublish to fence the node in question, it's just not required. Due to that ambiguity, we need this new flag so the CO can specify that it is required in certain cases. And because we're allowing a new sequence of operations that was previously illegal (calling ControllerUnpublish before NodeUnpublish) we have to spell out the implications of that.


Good, that all makes sense to me.

I'm a little wary that NodeUnpublishVolume and NodeUnstageVolume are being retried until they succeed (and cause the node to be moved out of quarantine). There seems to be the possibility that these calls never make any progress and a reboot is required. Probably not an issue that needs to be handled here, as it will become the responsibility of the CO to alert the cluster operator that a node is in quarantine and other nodes are taking over parts of the workload.

Contributor Author

I'm a little wary that NodeUnpublishVolume and NodeUnstageVolume are being retried until they succeed (and cause the node to be moved out of quarantine). There seems to be the possibility that these calls never make any progress and a reboot is required. Probably not an issue that needs to be handled here, as it will become the responsibility of the CO to alert the cluster operator that a node is in quarantine and other nodes are taking over parts of the workload.

I'm not too worried about this. Usually there are more aggressive ways of cleaning up a mount (like the umount -f command). If there aren't, then the driver should not assert the capability, because it could lead to the problem you mention.

My feeling is that, in situations where there aren't ways to force-unmount, it's usually an indication of a missing feature in the underlying protocol or UI. You always want to be able to clean up client state when the server is permanently dead. Any time rebooting is your only option to solve a problem, it's a sign of bad design, and the proper response is to fix the design so that you can force unmount reliably.

@nixpanic

After a little more discussion with colleagues, there seems to be some further (practical) clarification needed.

With Ceph-CSI we can use this proposal when the CSI Pods use host-networking. However, if there is any other form of networking configured, the IP-addresses or hostnames of the Pods may change, and block-listing is not trivial anymore.

This spec modification may not be the right place for this discussion, but we wonder if there has been any thought about this already.

@bswartz
Contributor Author

bswartz commented May 20, 2021

With Ceph-CSI we can use this proposal when the CSI Pods use host-networking. However, if there is any other form of networking configured, the IP-addresses or hostnames of the Pods may change, and block-listing is not trivial anymore.

I assume you're referring here to the ControllerUnpublish portion of the solution? Nobody said this is an easy thing to implement. That's why it's an optional capability.

I feel fairly confident that the higher level problem of how to safely recover from a node failure when there are pods with RWO volumes is unsolvable without either this capability, or some cloud provider plugin that can reliably kill/reboot problem nodes.

@@ -257,6 +261,12 @@ Plugins SHOULD expose all RPCs for an interface: Controller plugins SHOULD imple
Unsupported RPCs SHOULD return an appropriate error code that indicates such (e.g. `CALL_NOT_IMPLEMENTED`).
The full list of plugin capabilities is documented in the `ControllerGetCapabilities` and `NodeGetCapabilities` RPCs.

### A Word on Quarantine States

The purpose of the `QUARANTINE_S`, `QUARANTINE_P`, and `QUARANTINE_SP` states are to enable recovery from node problems.
Contributor

I do not see any other mention/usage of these states?

Contributor Author

Look at the ASCII art state transition diagrams.

Contributor

ah - I searched for the word and you are using a different word in the ASCII diagrams. But that explanation is not enough IMO. Do you think that quarantine states should be part of ControllerUnpublish responses?

Contributor Author

Well all of the states in that ASCII chart are implicit states that the CO is supposed to follow but aren't expressed anywhere in the RPC interface. These new states are meant to be similarly implicit.


QUARANTINE_S, QUARANTINE_P, and QUARANTINE_SP are in the diagrams, but I still have difficulty understanding the meaning of those states.

Contributor Author

All 3 states refer to situations where a particular node/volume pair is unsafe to use. The difference is which steps need to be performed to return back to the safe state. The _S state requires only unstage to return to normal, the _P state requires only unpublish to return to normal, and the _SP state requires both unpublish and unstage to return to normal. These are inversions of the existing states in the state diagram, with uncreatively-chosen names (I'm happy to accept better suggestions).

I wanted to emphasize that we're not merely introducing some new flags that yield different behavior, this change relies on a fundamental reordering of operations to arrive at the desired result. When a node is unreachable you have to sort out the problem by talking ONLY to the controller, and the existing CSI spec expressly forbids ControllerUnpublishing until the node side operations are complete.
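
A rough illustration (using my uncreative names) of how a CO might track these implicit states and the node-side cleanup each one still requires; none of this appears in the RPC interface:

// Illustrative only: the quarantine states are implicit, per node+volume pair.
type quarantineState int

const (
	quarantineS  quarantineState = iota // fenced while staged only
	quarantineP                         // fenced while published only
	quarantineSP                        // fenced while staged and published
)

// remainingCleanup lists the forced node RPCs still needed before the volume
// may be ControllerPublished to that node again.
func remainingCleanup(s quarantineState) []string {
	switch s {
	case quarantineS:
		return []string{"NodeUnstageVolume(force=true)"}
	case quarantineP:
		return []string{"NodeUnpublishVolume(force=true)"}
	default: // quarantineSP
		return []string{"NodeUnpublishVolume(force=true)", "NodeUnstageVolume(force=true)"}
	}
}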


I agree with @gnufied and @bswartz. An explanation of QUARANTINE_S, QUARANTINE_P, and QUARANTINE_SP would be good to add; I suggest adding the explanations to the ControllerUnpublishVolume section.

@jdef
Member

jdef commented May 27, 2021

"fencing" nodes and "forcing" detach sound like complementary features/capabilities. so, from that perspective, it's nice to see the capability bits split up.

(a) what if only fencing capability is specified? is that allowed, w/o also supporting "forcing"? if so, what's the use case?
(b) what if only forcing capability is specified? is that allowed, w/o also supporting "fencing"? if so, what's the use case?

if these features cannot be disentangled from each other, then maybe they are actually part of the same capability? there's a lot to think about in this proposal, and i certainly need to spend more time w/ it.

@bswartz
Contributor Author

bswartz commented May 27, 2021

"fencing" nodes and "forcing" detach sound like complementary features/capabilities. so, from that perspective, it's nice to see the capability bits split up.

(a) what if only fencing capability is specified? is that allowed, w/o also supporting "forcing"? if so, what's the use case?
(b) what if only forcing capability is specified? is that allowed, w/o also supporting "fencing"? if so, what's the use case?

if these features cannot be disentangled from each other, then maybe they are actually part of the same capability? there's a lot to think about in this proposal, and i certainly need to spend more time w/ it.

In situation (a) if the driver supports fencing but not force unpublish, it still gives the CO everything it needs to cope with node failures by moving workloads to other nodes and fencing off the bad node. What you would be missing in this situation is a safe way for the bad node to rejoin the cluster when it becomes responsive again. In general, the normal cleanup procedure of NodeUnpublish+NodeUnstage can't be expected to succeed under conditions where the node can't access the relevant volumes (while it's fenced) because doing so might result in data loss, and a well-behaved node plugin would be expected to fail repeatedly until it can safely detach the volumes. This could be worked around by rebooting the node, though, which is what I'd recommend.

In short, situation (a) still gives the CO all the benefits of rapid recovery from node failure, but requires the use of a reboot to clean up afflicted nodes, because ordinary cleanup procedures can't be expected to work reliably.

In situation (b) the CO doesn't have any additional ways to cope with a node failure, but the force unpublish capability could still come in handy when dealing with storage controller failures. In the status quo, failure of a storage controller can result in nodes getting "stuck" because node unpublish/node unstage can never succeed, due to inability to cleanly unmount/detach the volumes while the storage controller itself is down. In a world where the CO knows about the health of the storage controller, it might choose to more aggressively clean up a node if it believes that data loss is inevitable, even if it lacks the capability to fence the node.

In short, situation (b) gives the CO the capability to reliably clean up nodes AFTER data loss has become inevitable.
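
As a hedged sketch, the CO-side branching on the two capabilities from scenarios (a) and (b) might look like this (capability detection reduced to plain booleans for illustration):

// Illustrative only: how a CO might act on the two capabilities independently.
func recoveryStrategy(hasUnpublishFence, hasForceUnpublish bool) string {
	switch {
	case hasUnpublishFence && hasForceUnpublish:
		// Fence the lost node, move the workload, then force-clean the node
		// without a reboot once it is reachable again.
		return "fence + forced node cleanup"
	case hasUnpublishFence:
		// Scenario (a): fast failover is possible, but the afflicted node
		// must be rebooted before it can safely rejoin.
		return "fence + reboot-based cleanup"
	case hasForceUnpublish:
		// Scenario (b): no fencing, but the node can still be cleaned up
		// once data loss has already become inevitable.
		return "forced cleanup after storage failure only"
	default:
		return "no automated recovery"
	}
}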

Member

@saad-ali saad-ali left a comment

/lgtm
/approve

@jdef @humblec and anyone else interested can you PTAL ASAP.
Trying to get this merged and a new release cut on Monday.

spec.md Outdated
@@ -1322,6 +1337,17 @@ message ControllerUnpublishVolumeRequest {
// This field is OPTIONAL. Refer to the `Secrets Requirements`
// section on how to use this field.
map<string, string> secrets = 3 [(csi_secret) = true];

// Indicates SP MUST make the volume inacessible to the node or nodes
Contributor

@bswartz s/inacessible/inaccessible

spec.md Outdated
@@ -1322,6 +1337,17 @@ message ControllerUnpublishVolumeRequest {
// This field is OPTIONAL. Refer to the `Secrets Requirements`
// section on how to use this field.
map<string, string> secrets = 3 [(csi_secret) = true];

// Indicates SP MUST make the volume inacessible to the node or nodes
Contributor

inacessible -> inaccessible

spec.md Outdated
The CO MUST guarantee that this RPC is called after all `NodeUnpublishVolume` have been called and returned success for the given volume on the given node.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
Contributor

it if means -> if it means

spec.md Outdated
The CO MUST guarantee that this RPC is called after all `NodeUnpublishVolume` have been called and returned success for the given volume on the given node.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
It is essential that after a successful call to `NodeUnstageVolume` that there be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it were to be subsequently re-staged to that node.
Contributor

there be -> there will be
if it were -> if it was

spec.md Outdated
The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unpublishing volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
Contributor

it if means -> if it means

spec.md Outdated
The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unpublishing volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even it if means losing data.
It is essential that after a successful call to `NodeUnpublishVolume` that there be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it were to be subsequently re-published to that node.
Contributor

there be -> there will be
if it were -> if it was

spec.md Outdated
@@ -2501,6 +2553,11 @@ message NodeServiceCapability {
// Note that, for alpha, `VolumeCondition` is intended to be
// informative for humans only, not for automation.
VOLUME_CONDITION = 4 [(alpha_enum_value) = true];
// Indicates that the node supports the NodeUnpublishVolume.force
Contributor

the node supports -> the Node Plugin supports

spec.md Outdated
@@ -2501,6 +2553,11 @@ message NodeServiceCapability {
// Note that, for alpha, `VolumeCondition` is intended to be
// informative for humans only, not for automation.
VOLUME_CONDITION = 4 [(alpha_enum_value) = true];
// Indicates that the node supports the NodeUnpublishVolume.force
// field. Also indicates that the node supports the
Contributor

the node supports -> the Node Plugin supports

@bswartz
Contributor Author

bswartz commented Jun 5, 2021

Thank you for all the review feedback. I think I've addressed all the comments. Let me know when it's time to squash and merge.

@bswartz
Contributor Author

bswartz commented Jun 5, 2021

@saad-ali I expect there will be merge conflicts with the other spec changes, due to conflicting capability numbers, so we need to merge the important ones first and the followers will need to resolve the conflicts by choosing higher numbers.

Member

@jdef jdef left a comment

Mostly nits; this seems well thought out. I feel that it's worth reiterating that unless CO's offer an API for a user to indicate "I don't care about data loss", manual administrator effort is still involved - and then it's unclear what this API actually buys anyone. If the goal is complete automation (w/ respect to cleanup and unpinning volumes) in the case of problematic nodes, then CO's need to address this at the API level as well. What's the plan for that, and if that hasn't landed yet - why?

spec.md Outdated

The purpose of the `QUARANTINE_S`, `QUARANTINE_P`, and `QUARANTINE_SP` states are to enable recovery from node problems.
Because CSI is designed to be used in distributed systems, it is inevitable that sometimes volumes will become attached to nodes that get stuck or lost, temporarily or permanently.
Rather than require an administrator to manually clean up such situation, CSI offers a way disconnect a volume from a node "out of order" such that a volume can be disconnected from a problematic node, and safely connected to a different node, and the node can be reliably and safely cleaned up before accessing that volume again, as opposed to the normal path where the node must confirm a volume is disconnected before the controller can unpublish it.
Member

Suggested change
Rather than require an administrator to manually clean up such situation, CSI offers a way disconnect a volume from a node "out of order" such that a volume can be disconnected from a problematic node, and safely connected to a different node, and the node can be reliably and safely cleaned up before accessing that volume again, as opposed to the normal path where the node must confirm a volume is disconnected before the controller can unpublish it.
Rather than require an administrator to manually clean up in such a situation, CSI offers a way to disconnect a volume from a node "out of order" such that a volume can be disconnected from problematic *node A*, and safely connected to a different *node B*, and then *node A* can be reliably and safely cleaned up before accessing that volume again; as opposed to the normal path whereby *node A* must confirm a volume is disconnected before the controller can unpublish it.

@@ -1322,6 +1337,17 @@ message ControllerUnpublishVolumeRequest {
// This field is OPTIONAL. Refer to the `Secrets Requirements`
// section on how to use this field.
map<string, string> secrets = 3 [(csi_secret) = true];

// Indicates SP MUST make the volume inaccessible to the node or nodes
// it is being unpublished from. Any attempt to read or write data
Member

I agree w/ the spirit of this statement. I do wonder how realistic this is, given that CSI doesn't actually govern the data path and it kind of depends on one or both of (a) the underlying OS / driver implementation; (b) firmware running on a node's hardware controller.

Contributor Author

Yes, the capability is optional for this reason. The hardware needs to be able to distinguish between nodes and deny access to some while allowing access to others, and furthermore needs to be able to revoke access from nodes which currently have access and are connected. The security models of many storage systems are too primitive to achieve both of these things, or even one of them.

spec.md Outdated
@@ -2161,9 +2191,13 @@ This RPC is a reverse operation of `NodeStageVolume`.
This RPC MUST undo the work by the corresponding `NodeStageVolume`.
This RPC SHALL be called by the CO once for each `staging_target_path` that was successfully setup via `NodeStageVolume`.

If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability and the Node Plugin has `STAGE_UNSTAGE_VOLUME` capability, the CO MUST guarantee that this RPC is called and returns success before calling `ControllerUnpublishVolume` for the given node and the given volume.
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability and the Node Plugin has `STAGE_UNSTAGE_VOLUME` capability, the CO MUST guarantee that this RPC is called and returns success before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.
Member

consistent grammar

Suggested change
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability and the Node Plugin has `STAGE_UNSTAGE_VOLUME` capability, the CO MUST guarantee that this RPC is called and returns success before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.
If the corresponding Controller Plugin has the `PUBLISH_UNPUBLISH_VOLUME` controller capability and the Node Plugin has the `STAGE_UNSTAGE_VOLUME` capability, the CO MUST guarantee that this RPC is called and returns success before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has the `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.

The CO MUST guarantee that this RPC is called after all `NodeUnpublishVolume` have been called and returned success for the given volume on the given node.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even if it means losing data.
It is essential that after a successful call to `NodeUnstageVolume` that there will be no buffered data on the node related to the volume which might result in unintentional modification of the volume if it was to be subsequently re-staged to that node.
Member

this talk of "buffered data": again, I agree w/ the spirit of this, but I worry about real life implementation. have we considered what it might look like to write a CSI sanity test for this?

Contributor Author

Parts of this are easy to test. We can simulate a situation where the CO (mis)identifies a node as down, and go through the steps of revoking access to a volume from a workload, cleaning up the volume on the node, and reestablishing access to the volume on the same node, and ensuring that the process completes without errors. It would be even better to do this across 2 nodes, so we could simulate the workload moving to another node -- but csi-sanity does not lend itself to multi-node testing.

The parts that would be harder to test would be:

  • verifying that the workload really lost access to the storage after the fence operations
  • verifying that any writes to the volume only came from the new node after a simulated failover
  • verifying that the old node didn't corrupt the volume after re-establishing access following a complete and proper cleanup

The basic problem is that CSI itself doesn't give access to low-level-enough information about the volume to reliably determine the absence of these bad cases. We could do some best-effort checking of these things but it's impossible to prove the absence of side effects in a cross-platform way.

The CO MUST guarantee that this RPC is called after all `NodeUnpublishVolume` have been called and returned success for the given volume on the given node.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unstaging volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even if it means losing data.
Member

how does a user communicate to a CO that data loss is acceptable? I get the point here, but I'm wondering if we're putting the cart ahead of the horse w/ this feature: if no CO's allow a user to say "yeah, I don't care about data loss here" then ... who's going to be able to use this feature? maybe this was already discussed in k8s SIG storage, and there's a grand plan for a k8s API to deal with this nuance?

spec.md Outdated
@@ -2323,9 +2364,13 @@ A Node Plugin MUST implement this RPC call.
This RPC is a reverse operation of `NodePublishVolume`.
This RPC MUST undo the work by the corresponding `NodePublishVolume`.
This RPC SHALL be called by the CO at least once for each `target_path` that was successfully setup via `NodePublishVolume`.
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO SHOULD issue all `NodeUnpublishVolume` (as specified above) before calling `ControllerUnpublishVolume` for the given node and the given volume.
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO SHOULD issue all `NodeUnpublishVolume` (as specified above) before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.
Member

Suggested change
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO SHOULD issue all `NodeUnpublishVolume` (as specified above) before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.
If the corresponding Controller Plugin has the `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO SHOULD issue `NodeUnpublishVolume` (as specified above) before calling `ControllerUnpublishVolume` for the given node and the given volume, unless the Controller Plugin has the `UNPUBLISH_FENCE` capability and the Node Plugin has the `FORCE_UNPUBLISH` capability and the `force` flag is `true`.

spec.md Outdated
The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

If the Node Plugin has the `FORCE_UNPUBLISH` capability, the CO MAY specify `force` as `true` in which case the Node Plugin MUST support unpublishing volumes even when access has been revoked with `ControllerUnpublishVolume`.
Because data loss is inevitable in such circumstances, the `force` flag is an indication that success is desired even if it means losing data.
It is essential that after a successful call to `NodeUnpublishVolume` that there will be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it was to be subsequently re-published to that node.
Member

Suggested change
It is essential that after a successful call to `NodeUnpublishVolume` that there will be no buffered data on the node related to the volume which might result in unintetional modification of the volume if it was to be subsequently re-published to that node.
It is essential that after a successful call to `NodeUnpublishVolume` that there will be no buffered data on the node related to the volume that might result in unintentional modification of the volume if it was to be subsequently re-published to that node.

spec.md Outdated
@@ -1704,6 +1730,10 @@ message ControllerServiceCapability {
// This enables COs to, for example, fetch per volume
// condition after a volume is provisioned.
GET_VOLUME = 12 [(alpha_enum_value) = true];

// Indicates the SP supports ControllerUnpublishVolume.fence
Member

Suggested change
// Indicates the SP supports ControllerUnpublishVolume.fence
// Indicates the SP supports the ControllerUnpublishVolume.fence

spec.md Outdated
The Plugin SHOULD perform the work that is necessary for making the volume ready to be consumed by a different node.
The Plugin MUST NOT assume that this RPC will be executed on the node where the volume was previously used.

If the plugin has the `UNPUBLISH_FENCE` capability, the CO MAY specify `fence` as `true`, in which case the SP MUST ensure that the node may no longer access the volume before returning a successful response.
This results in a transition into one of the `QUARANTINE` states where the node must be cleaned up without being able to access the volume like usual.
This is intended cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
Member

Suggested change
This is intended cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
This is intended to cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.

spec.md Outdated
@@ -1293,10 +1303,15 @@ The CO MUST implement the specified error recovery behavior when it encounters t

Controller Plugin MUST implement this RPC call if it has `PUBLISH_UNPUBLISH_VOLUME` controller capability.
This RPC is a reverse operation of `ControllerPublishVolume`.
It MUST be called after all `NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed.
It MUST be called after all `NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed unless the plugin has the `UNPUBLISH_FENCE` capability.
Member

Suggested change
It MUST be called after all `NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed unless the plugin has the `UNPUBLISH_FENCE` capability.
It MUST be called after both `NodeUnstageVolume` and `NodeUnpublishVolume` on the volume are called and succeed, unless the plugin has the `UNPUBLISH_FENCE` capability.

Contributor

@xing-yang xing-yang left a comment

A couple more nits. Otherwise LGTM.

spec.md Outdated

The purpose of the `QUARANTINE_S`, `QUARANTINE_P`, and `QUARANTINE_SP` states are to enable recovery from node problems.
Because CSI is designed to be used in distributed systems, it is inevitable that sometimes volumes will become attached to nodes that get stuck or lost, temporarily or permanently.
Rather than require an administrator to manually clean up such situation, CSI offers a way disconnect a volume from a node "out of order" such that a volume can be disconnected from a problematic node, and safely connected to a different node, and the node can be reliably and safely cleaned up before accessing that volume again, as opposed to the normal path where the node must confirm a volume is disconnected before the controller can unpublish it.
Contributor

way disconnect -> way to disconnect

spec.md Outdated
The Plugin SHOULD perform the work that is necessary for making the volume ready to be consumed by a different node.
The Plugin MUST NOT assume that this RPC will be executed on the node where the volume was previously used.

If the plugin has the `UNPUBLISH_FENCE` capability, the CO MAY specify `fence` as `true`, in which case the SP MUST ensure that the node may no longer access the volume before returning a successful response.
This results in a transition into one of the `QUARANTINE` states where the node must be cleaned up without being able to access the volume like usual.
This is intended cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
Contributor

intended cut -> intended to cut

@bswartz
Contributor Author

bswartz commented Jun 6, 2021

Mostly nits; this seems well thought out. I feel that it's worth reiterating that unless CO's offer an API for a user to indicate "I don't care about data loss", manual administrator effort is still involved - and then it's unclear what this API actually buys anyone. If the goal is complete automation (w/ respect to cleanup and unpinning volumes) in the case of problematic nodes, then CO's need to address this at the API level as well. What's the plan for that, and if that hasn't landed yet - why?

@jdef Thanks for the review, I will take another pass and clean up the language further by including your suggestions.

I don't agree that manual effort is still needed. The idea is that, by invoking ControllerUnpublishVolume with the fence flag set, you've already accepted the data loss; everything after that is just cleanup. This is expected to happen in situations where the node is not responsive, which means that either (1) the node actually failed and the workload effectively crashed, leaving the volume in whatever state it was in at the moment the node crashed, or (2) the node is still running but the workload WILL crash when it loses access to its storage (the equivalent of yanking a USB drive out while an application is writing to it).

The real question is under what circumstances a CO might choose to make this call, and that IS a more challenging question (answered below). But I think we should at least be able to agree that once the call is made, there's no recovering the lost data; at that point we're cutting our losses and trying to restore the system to a usable state in an automated way, which is why the force node unpublish/unstage are critical.

To answer the question of why anyone would make the ControllerUnpublishVolume call with the fence flag set, we have to think about workloads that highly value uptime. Historically, these applications have been a poor fit for container orchestrator systems like Kubernetes and people have run them in extremely expensive compute clusters. Now that people are trying to run these kinds of applications in container orchestrators, they're looking for ways to achieve uptime guarantees similar to those they were able to get in the old world. The reality is that nodes do fail, and while workloads can migrate, shared state makes moving the workload unsafe unless you're SURE the old node is dead and not merely uncommunicative.

Clustered systems have historically solved this with a technique called STONITH (Shoot The Other Node In The Head) which kills the node whether it's dead or alive, making it safe for shared state to be taken over by a new node. There are various ways of achieving the shooting of the other node, including cutting its power (if you have programmatic access to the power supply), or killing the VM running the node (if you have access to the hypervisor) which are both perfectly valid. This CSI spec proposal adds a 3rd way which relies only on access to the storage device and the storage device having the capability to fence a node (not all do, but it's an optional capability). COs are then able to choose the most appropriate technique to make nodes definitely dead when they're trying to rapidly migrate a workload from a node of questionable state to a healthy node.

Presumably COs will also allow policy settings that allow end users to specify their preference for either aggressively high uptime or a more restrained response that allows workloads on dead nodes to stay there in hopes that they might still be alive. In no case does this need to be a manual response. The policy can be set up front and the system can execute it, fencing workloads for which the policy says to prefer uptime very quickly after the node becomes unreachable.

@jdef
Member

jdef commented Jun 6, 2021 via email

@xing-yang
Contributor

@jdef There's a KEP in Kubernetes that is planning to use this CSI spec change. See here: kubernetes/enhancements#1116

@bswartz
Contributor Author

bswartz commented Jun 6, 2021

Thanks for the detailed response. Can you provide a policy example of how a CO allows a user to state that they are OK with data loss? Because otherwise this feels like "build it and they will come" .. but maybe they won't because users aren't asking for this? Or maybe they are.. can you share? It's a cool feature, and I don't see how CO's are ready for it. Do you?

The way I think sophisticated users think about this is: node failures aren't some unlikely event to be avoided, they're an inevitable certainty at scale and you can calculate the expected frequency of them with knowledge of the underlying infrastructure and its failure modes. Given that node failures can be challenging to detect reliably, most systems (like Kubernetes) rely on a simpler heuristic, like absence of X heartbeats, to determine a node failure. Heuristics like this will yield false positives with some frequency, and treating a node like it's failed when it hasn't actually failed can lead to even bigger problems when the workload on the pod has access to external storage.

I want to stress that the choice the end user has to make is not between accepting data loss and not accepting data loss -- it's between handling the false-positive case aggressively or conservatively. When nodes actually fail, data gets lost and the user's only choice is how patiently to wait to find out that's what really happened. When the system falsely detects a failed node, the choice is between causing a failure by evicting the workload forcefully and waiting to discover that the failure detection was in fact false.

The knobs that Kubernetes gives end users are pod-level tolerations to the node taints that Kubernetes applies to nodes that it deems unreachable or not-ready. By default pods tolerate these taints for 5 minutes, after which the pod eviction manager kills them and tries to reschedule them. Users that value higher workload uptime would tune these tolerations down from 5 minutes to something like 10 seconds. Users that prefer to not disturb pods might tune this value even higher.

Administrators have additional knobs that let them tune the node-failure detection thresholds to be more aggressive (lower timeouts) or more relaxed (higher timeouts).

I don't think you need more control than allowing users to set these timeout values. The proposed change in this PR and the KEP that @xing-yang linked is to adjust what we do after the pod eviction manager decides to kill the pod and reschedule it elsewhere. Today the workload can get stuck because, while the system wants to move the workload to a new node, it is unable to do so safely because of the presence of an external volume that could get corrupted if 2 nodes mount it at the same time. The PR simply allows the eviction to occur safely by ensuring that no more than 1 node has access to the volume at a time.
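
For concreteness, a hedged example of those knobs using the upstream Kubernetes Go types: a pod toleration that shortens the default 5-minute eviction delay to 10 seconds for unreachable/not-ready nodes.

import corev1 "k8s.io/api/core/v1"

// aggressiveFailoverTolerations evicts the pod 10 seconds after the node is
// tainted unreachable or not-ready, instead of the default 5 minutes.
func aggressiveFailoverTolerations() []corev1.Toleration {
	ten := int64(10)
	return []corev1.Toleration{
		{
			Key:               "node.kubernetes.io/unreachable",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &ten,
		},
		{
			Key:               "node.kubernetes.io/not-ready",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &ten,
		},
	}
}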

@bswartz
Contributor Author

bswartz commented Jun 6, 2021

@jdef Thank you for all the grammar suggestions, they have been implemented.

I plan to squash this PR when everyone is satisfied with it.

Contributor

@xing-yang xing-yang left a comment

lgtm

If the plugin has the `UNPUBLISH_FENCE` capability, the CO MAY specify `fence` as `true`, in which case the SP MUST ensure that the node may no longer access the volume before returning a successful response.
This results in a transition into one of the `QUARANTINE` states where the node must be cleaned up without being able to access the volume like usual.
This is intended to cut off an unreachable node from accessing volumes so those volumes may be safely published to another node.
Once in one of the `QUARANTINE` states the volume MAY NOT be published to that node again until appropriate cleanup has happened using `NodeUnpublishVolume` and `NodeUnstageVolume` (if applicable).
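
To make the cleanup step concrete, here is a rough sketch of draining a quarantined volume once the node comes back, assuming the `force` fields proposed in this PR are generated into the Go bindings as `Force` (hypothetical; not part of any released CSI version):

```go
package fencing

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// cleanupQuarantinedVolume walks a quarantined volume back toward CREATED on
// a node that has already been fenced: forced NodeUnpublishVolume first, then
// forced NodeUnstageVolume, discarding any buffered data rather than trying
// to flush it to a device the node can no longer reach.
func cleanupQuarantinedVolume(ctx context.Context, node csi.NodeClient,
	volumeID, targetPath, stagingPath string) error {
	if _, err := node.NodeUnpublishVolume(ctx, &csi.NodeUnpublishVolumeRequest{
		VolumeId:   volumeID,
		TargetPath: targetPath,
		Force:      true, // hypothetical field proposed in this PR
	}); err != nil {
		return err
	}
	_, err := node.NodeUnstageVolume(ctx, &csi.NodeUnstageVolumeRequest{
		VolumeId:          volumeID,
		StagingTargetPath: stagingPath,
		Force:             true, // hypothetical field proposed in this PR
	})
	return err
}
```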
Contributor

Thinking from the k8s point of view - how will we actually perform the cleanup on the node? Say a node was shut down with volumes attached to it, and hence the CO calls ControllerUnpublishVolume with fence: true. The node comes back:

  1. Should the node have some taint to prevent scheduling of pods with these volumes?
  2. We could implement something during volume reconstruction in kubelet, but I am still unsure how kubelet will mark those volumes as "clean".
  3. Are we proposing additional state checks in kube-controller-manager that could detect whether a volume needs to be detached with fence: true? Currently KCM won't detach volumes from unresponsive/shutdown nodes at all.

Contributor

@gnufied Can you help review this KEP? kubernetes/enhancements#1116

Contributor

Thanks. I did not know that the k8s KEP had been updated with this CSI spec change. I will take a look.

@saad-ali
Member

saad-ali commented Jun 7, 2021

#476 and #468 have merged.

I will hold off on cutting CSI.next RC until tomorrow (Tuesday, June 8) to see if we can get all the approvals in for this PR as well. If not, we will proceed with CSI.next RC without this PR.

Add new controller capability:
* UNPUBLISH_FENCE

Add new node capability:
* FORCE_UNPUBLISH
Ensure all changes are part of spec.md, and csi.proto is generated by "make".
Mark new fields as alpha.
Include a top-level note about quarantine states.
@bswartz
Contributor Author

bswartz commented Jun 7, 2021

Rebased, will squash after approvals from @jdef @gnufied

// CO MUST NOT set this field to true unless SP has the
// FORCE_UNPUBLISH node capability.
// This is an OPTIONAL field.
bool force = 3 [(alpha_field) = true];

I am not sure what the driver will do differently when force is set to true or false.

Contributor Author

Force would suggest that a "umount -f" would be acceptable rather than a "umount". Generally umount will not succeed if buffered data would be lost because the underlying block device is unwritable. The -f flag tells umount to not worry about this and just kill the mount.
There can be similar checks at lower levels that might ordinarily fail if they can't be done "safely" (i.e. without losing data) that should be skipped in the presence of the force flag.
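
For illustration, a minimal sketch of what that difference could look like in a Linux node plugin, assuming `golang.org/x/sys/unix`; this is one possible implementation, not something the spec prescribes:

```go
package nodeplugin

import "golang.org/x/sys/unix"

// unmountTarget unmounts the NodeUnpublishVolume target path. With force set
// it behaves like "umount -f": it does not insist on flushing buffered writes
// to a backing device that may already be fenced off, accepting that any such
// data is discarded.
func unmountTarget(targetPath string, force bool) error {
	flags := 0
	if force {
		flags = unix.MNT_FORCE
	}
	if err := unix.Unmount(targetPath, flags); err != nil {
		// Falling back to a lazy detach is a common trick for a busy mount,
		// but whether that is acceptable here is an SP-specific decision.
		return unix.Unmount(targetPath, flags|unix.MNT_DETACH)
	}
	return nil
}
```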

Contributor Author

In case it's not clear, the REASON for this is that if you've already cut off the node from the storage and moved the workload elsewhere (because the CO falsely detected the node was down and used the fence operation), then flushing the buffered data would result in corruption, so it's correct to discard it. But the SP cannot know this fact, so the flag gives the CO a way to communicate it. Ordinarily SPs are expected to fail unpublish/unstage if they can't be done safely, because not failing in those cases could cause data corruption for a different reason.

@jingxu97

jingxu97 commented Jun 8, 2021

/hold

@jdef
Member

jdef commented Jun 8, 2021 via email

@jingxu97

jingxu97 commented Jun 8, 2021

I see @jdef raised a number of questions in the KEP. Since this change depends on the logic in the KEP, it may be better to hold off on merging this one.

@bswartz
Contributor Author

bswartz commented Jun 8, 2021

> I've thought more about this and I'm still not sold. This smells like a bigger infra problem, not a volume one. E.g. we could intro a controller rpc called "NodeGone" that the CO uses to tell the SP that a node was shot, and so bookkeeping is needed. Let someone else handle shooting the node, and once bookkeeping is done the workloads are migrated to another node with access to required volumes.

I like this proposal a lot from the SP implementor's side, because fencing ALL of the volumes from a given node is technically easier than fencing volumes one by one. However I worry this makes the CO implementor's job harder, because I'm presuming that volume state is tracked individually, and if there are a large number of volumes published to a particular node, with perhaps a mix of more than one SP, you'd end up doing some kind of batch processing of the volumes on a per-SP basis. I'm open to this idea though.

@jdef
Member

jdef commented Jun 9, 2021

> I've thought more about this and I'm still not sold. This smells like a bigger infra problem, not a volume one. E.g. we could intro a controller rpc called "NodeGone" that the CO uses to tell the SP that a node was shot, and so bookkeeping is needed. Let someone else handle shooting the node, and once bookkeeping is done the workloads are migrated to another node with access to required volumes.
>
> I like this proposal a lot from the SP implementor's side, because fencing ALL of the volumes from a given node is technically easier than fencing volumes one by one. However I worry this makes the CO implementor's job harder, because I'm presuming that volume state is tracked individually, and if there are a large number of volumes published to a particular node, with perhaps a mix of more than one SP, you'd end up doing some kind of batch processing of the volumes on a per-SP basis. I'm open to this idea though.

I'd much rather add the majority of the complexity to the CO side of the house, instead of plugins.

Of course, with an API like NodeGone the question becomes "is the node really GONE, or will it come back - and if it comes back, will it have been rebooted (to start w/ a clean slate) or was it just disconnected from the network for a while?". So we'd probably need to carefully define the expectations here. I know that Mesos wrestled with this for a bit. I'm not sure if there's already an equivalent of "gone" in k8s. Either way, the CO should really be certain that a node is "gone" before invoking the CSI RPC to signal that state.
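
Purely to illustrate the alternative under discussion (no such RPC exists in CSI, and all names below are made up), a node-scoped call might look something like:

```go
package alternative

import "context"

// NodeGoneRequest and NodeGoneController are illustrative only. They sketch
// the idea of telling the SP once that a node is gone, instead of fencing its
// volumes one by one, which pushes the per-volume bookkeeping onto the SP.
type NodeGoneRequest struct {
	// NodeId identifies the node the CO has given up on.
	NodeId string
	// Permanent distinguishes "shot and never coming back" from "unreachable,
	// might return after a reboot", which is the ambiguity raised above.
	Permanent bool
}

type NodeGoneResponse struct{}

type NodeGoneController interface {
	// NodeGone tells the SP to revoke the node's access to every volume it
	// has published there and to do whatever bookkeeping that requires.
	NodeGone(ctx context.Context, req *NodeGoneRequest) (*NodeGoneResponse, error)
}
```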

+------------+     ControllerUnpublishVolume(fenced)      +---------------+
| PUBLISHED  +--------------------------------------------->| QUARANTINED_P |
+------------+                                              +---------------+


What if the node is shut down while in the NODE_READY state? Can the state be moved to QUARANTINED_P only from PUBLISHED?

Contributor Author

Right, the QUARANTINE_* states are only designed to cover the case where ControllerUnpublishVolume() is used while in a state where it would currently be forbidden, namely PUBLISHED or VOL_READY. If the node is shut down/rebooted, that's a separate case to consider, but without calling ControllerUnpublishVolume(), it should not affect anything.


Thank you for your reply. You mean that ControllerUnpublishVolume() can be used not only in the PUBLISHED state but also in VOL_READY? If so, in the figure below, should an arrow to QUARANTINED_S also come from VOL_READY? I mean, will the operation below also be allowed?

               ControllerUnpublishVolume(fenced)
VOL_READY -----------------------------------------> QUARANTINED_SP 

Contributor Author

Yeah there should be another arrow in the diagram but it will be very hard to squeeze it into the ASCII art. I will try.

@mkimuram mkimuram Jul 2, 2021

@bswartz

How about using a tool like this?
Just a few human-readable lines like the ones below will generate output equivalent to the ASCII art.

stateDiagram-v2
    [*] -->  CREATED : CreateVolume
    CREATED --> [*] : DeleteVolume
    CREATED --> NODE_READY : Controller Publish Volume
    NODE_READY --> CREATED : Controller Unpublish Volume
    NODE_READY --> VOL_READY : Node Stage Volume
    VOL_READY --> NODE_READY : Node Unstage Volume
    VOL_READY --> PUBLISHED : Node Publish Volume
    VOL_READY --> QUARANTINED_S : Controller Unpublish Volume (fenced)
    PUBLISHED --> VOL_READY : Node Unpublish Volume
    PUBLISHED --> QUARANTINED_SP : Controller Unpublish Volume (fenced)
    QUARANTINED_SP --> QUARANTINED_S : Node Unpublish Volume(forced)
    QUARANTINED_S --> CREATED : Node Unstage Volume (forced)

Contributor Author

Completely agree that that's a better way, but changing the format of these diagrams is beyond the scope of my proposal. Maybe submit a separate PR that replaces the ASCII art with more readable figures? Assuming that other PR merges before this, I'd be happy to rebase on top of it and greatly simplify my changes to the state diagram.


Agreed. I don't mean to block this PR from being merged over whether you use the tool. You may just want to consider recreating the diagram with the tool if you ever need another big change.


@bswartz thank you for your reply. And also, thank you for updating the diagram.

@YuikoTakada

This comment was posted:
https://github.com/kubernetes/enhancements/pull/1116/files#diff-0225593bb1191b37cc24ad60c172668c3df10b62f2fd748ceb1bbe85ddf078ceR107

> This would trigger the deletion of the volumeAttachment objects.
> This would allow ControllerUnpublishVolume to happen before NodeUnpublishVolume and/or NodeUnstageVolume are called.
> Note that there is no additional code changes required for this step.

Is this true? In short, once the volumeAttachment has been deleted, can ControllerUnpublishVolume happen before NodeUnpublishVolume and/or NodeUnstageVolume are called, without the change suggested in this PR?

@jdef
Member

jdef commented Dec 28, 2021 via email

@bswartz
Contributor Author

bswartz commented Dec 28, 2021

> Out of order CSI calls are not spec compliant. I've said as much in the KEP

I agree with this. And because reliably and quickly recovering from a node failure requires making the CSI calls out of order, this PR proposes allowing the out-of-order calls under special and strict conditions. Today what we see happening in Kubernetes is just willful violation of the spec because there's no "correct" way to do this, and I'd prefer to update the spec with a correct way.
