[CephFS] Create a CSI volume from a snapshot #411

Closed
humblec opened this issue Jun 7, 2019 · 24 comments
Labels: component/cephfs (Issues related to CephFS), Priority-0 (highest priority issue)

humblec (Collaborator) commented Jun 7, 2019

Describe the feature you'd like to have

We will be using ceph manager based volume provisioning from v1.1.0. AFAICT, we only have a way to create snapshots from a volume; support for cloning a volume from an existing snapshot appears to be unavailable. This is important functionality that has to be supported in the CSI driver.

This issue tracks this feature support.

humblec (Collaborator, Author) commented Jun 7, 2019

@ajarr, can you please share your view on this?

batrick (Member) commented Jun 7, 2019

The plan is to have the volumes plugin add a command to recursive copy another subvolume@snapshot to a new subvolume. The interface is still TBD. What do you think?

humblec (Collaborator, Author) commented Jun 8, 2019

The plan is to have the volumes plugin add a command to recursive copy another subvolume@snapshot to a new subvolume.

Could you please explain a bit more about the volumes plugin referred to here? Basically, an interface like ceph fs snapshot clone <clonename> <snapname> is what I was proposing. If you were referring to the same thing, and the plan is that, when invoked, it internally copies the data recursively to another subvolume, then we are on the same page :)

Most storage systems have such an interface; one example is Gluster, as described here: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/ch08s03

@humblec humblec changed the title CephFS Clone support from a snapshot volume CephFS Clone support from a {snapshot} volume or existing subvolume Jun 10, 2019
@humblec humblec changed the title CephFS Clone support from a {snapshot} volume or existing subvolume CephFS Clone support from a snapshot volume or existing subvolume Jun 10, 2019
humblec (Collaborator, Author) commented Jun 10, 2019

@batrick @ajarr Do we have the functionality available today to clone or create a new volume from an existing subvolume? AFAICT, this is not available at the moment, and we need this functionality too.

ajarr (Contributor) commented Jun 10, 2019

@humblec, no, it is not available at the moment.

humblec (Collaborator, Author) commented Jun 10, 2019

@humblec, no, it is not available at the moment.

Thanks for confirming, we need this functionality too.

batrick (Member) commented Jun 10, 2019

The plan is to have the volumes plugin add a command to recursive copy another subvolume@snapshot to a new subvolume.

Could you please explain a bit more about the volumes plugin referred to here?

The ceph-mgr "volumes" plugin that @ajarr is working on.

Basically, an interface like ceph fs snapshot clone <clonename> <snapname> is what I was proposing. If you were referring to the same thing, and the plan is that, when invoked, it internally copies the data recursively to another subvolume, then we are on the same page :)

It would probably look like ceph fs subvolume clone <vol> <subvol> <snapshot> <new_subvol>. @ajarr?

humblec (Collaborator, Author) commented Jun 10, 2019

The ceph-mgr "volumes" plugin that @ajarr is working on.

Oh. OK, thanks for clarifying.

ceph fs subvolume clone <vol> <subvol> <snapshot> <new_subvol>

Can the above command also be reused if we want to clone a subvolume from an existing one, or is the plan to do that with a separate command? If ceph fs subvolume clone is the common prefix, maybe it is good to accommodate that functionality in the same command as well, isn't it? @batrick @ajarr

tombarron commented:

We'd like to be able to use this directly from the OpenStack Manila driver as well. This would enable both (1) Manila CephFS (and CephFS-with-NFS) create-share-from-snapshot support, and (2) manila-CSI (recently merged by gman0 into k8s/cloud-provider-openstack) support for creating a new volume from a snapshot.

@humblec humblec self-assigned this Jun 10, 2019
ShyamsundarR (Contributor) commented:

The plan is to have the volumes plugin add a command to recursive copy another subvolume@snapshot to a new subvolume. The interface is still TBD. What do you think?

A clone in CSI is created using the CreateVolume call, with a VolumeContentSource as the snapshot to clone from. If we perform a recursive copy of data when such a CreateVolume is invoked, it would very likely take time, causing the caller to time out on the CreateVolume call, and try again.

The ceph-csi behaviour in this case would be, for the subsequent retries, to block on a mutex that the first call still holds, thus delaying the response to the retry as well (based on the time taken to copy).

I expect the container orchestrator would retry this call a few times, depending on the time taken to copy the data. Also, the time it takes is not deterministic. This sounds like a poor way to implement this feature. I would therefore ask what the urgency is to implement clone for CephFS subvolumes, and whether we should really be waiting for native clone support from CephFS instead.
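
For reference, a clone request in CSI arrives as an ordinary CreateVolume call carrying a VolumeContentSource that points at the snapshot. A minimal sketch of such a request using the CSI spec's Go bindings (the request name, size, and snapshot handle are illustrative, not values from this issue):

```go
package main

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

func main() {
	// A CreateVolume request asking for a new volume pre-populated from an
	// existing snapshot. If the underlying copy takes longer than the
	// sidecar's gRPC timeout, the orchestrator re-issues this same
	// idempotent request.
	req := &csi.CreateVolumeRequest{
		Name: "pvc-clone-example", // illustrative request name
		CapacityRange: &csi.CapacityRange{
			RequiredBytes: 1 << 30, // 1 GiB
		},
		VolumeContentSource: &csi.VolumeContentSource{
			Type: &csi.VolumeContentSource_Snapshot{
				Snapshot: &csi.VolumeContentSource_SnapshotSource{
					SnapshotId: "csi-snap-example", // illustrative snapshot handle
				},
			},
		},
	}

	fmt.Println("clone source snapshot:", req.GetVolumeContentSource().GetSnapshot().GetSnapshotId())
}
```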

batrick (Member) commented Jun 18, 2019

The plan is to have the volumes plugin add a command to recursive copy another subvolume@snapshot to a new subvolume. The interface is still TBD. What do you think?

A clone in CSI is created using the CreateVolume call, with a VolumeContentSource as the snapshot to clone from. If we perform a recursive copy of data when such a CreateVolume is invoked, it would very likely take time, causing the caller to time out on the CreateVolume call, and try again.

Well, this is troublesome.

The ceph-csi behaviour in this case would be, for the subsequent retries, to block on a mutex that the first call still holds. Thus delaying a response to the retry as well (based on time taken to copy).

I see that the container orchestrator, would possibly retry this call a few times, depending on the time taken to copy the data. Also, the time it would take is not deterministic. This sounds like a poor choice to implement this feature. I would hence ask what the urgency is to implement clone for CephFS subvolumes and if we should really be waiting for native clone support from CephFS instead?

Native clone is complex and simply will not be done anytime soon. The urgency surrounding this feature is to eliminate the feature gap between CephFS and RBD volumes.

Is there no way we can block volume creation in CSI somewhere in the chain of operations of provisioning/using a volume?

ShyamsundarR (Contributor) commented:

The plan is to have the volumes plugin add a command to recursive copy another subvolume@snapshot to a new subvolume. The interface is still TBD. What do you think?

A clone in CSI is created using the CreateVolume call, with a VolumeContentSource as the snapshot to clone from. If we perform a recursive copy of data when such a CreateVolume is invoked, it would very likely take time, causing the caller to time out on the CreateVolume call, and try again.

Is there no way we can block volume creation in CSI somewhere in the chain of operations of provisioning/using a volume?

Just to be clear, the first call to clone will time out on the caller end (the client end), while the Ceph-CSI plugin (the server end) will still make progress and (say) wait on a response to the clone from Ceph, so eventually the first call will complete and hence the clone will be created.

On one of the subsequent retries the server would finally detect that the clone is complete and respond in time to the caller. This detection just looks at the CSI RADOS OMaps to check for the existence of the clone and responds in roughly constant time, but it can only happen after the first call has been completed by the server.

IOW, there is no mechanism at present to start a clone job and check its status later in CSI CreateVolume. That would have provided much better semantics than what happens currently. Maybe we can build such semantics between the plugin and the call to the ceph-mgr volumes plugin?
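
To make that retry flow concrete, here is a rough, self-contained sketch; it is not the actual ceph-csi code, and the in-memory journal map and the copy function are stand-ins for the RADOS-OMap-backed journal and the recursive copy:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	mu      sync.Mutex            // stands in for the per-request-name lock
	journal = map[string]string{} // stands in for the RADOS-OMap-backed journal
)

// copySnapshotToNewSubvolume stands in for the slow recursive copy of
// subvolume@snapshot into a new subvolume; this is what can outlive the
// caller's gRPC timeout.
func copySnapshotToNewSubvolume(name, snapID string) (string, error) {
	time.Sleep(2 * time.Second) // pretend the copy takes a while
	return "clone-of-" + snapID + "-for-" + name, nil
}

// createClonedVolume sketches the idempotent handling described above: the
// first call holds the lock for the duration of the copy; retries block on
// the lock and then hit the fast path once the journal records the clone.
func createClonedVolume(name, snapID string) (string, error) {
	mu.Lock()
	defer mu.Unlock()

	if volID, ok := journal[name]; ok {
		return volID, nil // fast path: an earlier (timed-out) attempt finished the copy
	}

	volID, err := copySnapshotToNewSubvolume(name, snapID)
	if err != nil {
		return "", err
	}
	journal[name] = volID
	return volID, nil
}

func main() {
	id, _ := createClonedVolume("pvc-clone-example", "csi-snap-example")
	fmt.Println("created:", id)
	id, _ = createClonedVolume("pvc-clone-example", "csi-snap-example") // retry returns immediately
	fmt.Println("retry sees:", id)
}
```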

batrick (Member) commented Jun 19, 2019

The plan is to have the volumes plugin add a command to recursive copy another subvolume@snapshot to a new subvolume. The interface is still TBD. What do you think?

A clone in CSI is created using the CreateVolume call, with a VolumeContentSource as the snapshot to clone from. If we perform a recursive copy of data when such a CreateVolume is invoked, it would very likely take time, causing the caller to time out on the CreateVolume call, and try again.

Is there no way we can block volume creation in CSI somewhere in the chain of operations of provisioning/using a volume?

Just to be clear, the first call to clone will time out on the caller end (the client end), the Ceph-CSI plugin (the server end) would still progress and (say) wait on a response to the clone from ceph, so eventually the first call will be complete, and hence the clone would be created.

The server on one of the subsequent retries would finally detect that the clone was complete and respond in time to the caller. This detection would just look at the CSI RADOS OMaps to detect the existence of the clone and respond in relatively constant time, but this will happen only after the first call is completed by the server.

IOW, there is no mechanism at present to start a clone job and check its status later in CSI CreateVolume. This would have meant much better semantics than what is happening currently. Maybe we can build such semantics between the plugin and the call to the ceph-mgr volumes plugin?

We can build those semantics in a new call like:

ceph subvolume clone ...

It would just work in the background and return immediately. Then just build another call to check the status:

ceph subvolume info ...

or similar. Would that work?
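
To illustrate the proposed "start in the background, poll for status" flow, a minimal sketch follows; the exact subcommands and the completion check are placeholders, since the interface was still TBD at this point:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// runCeph shells out to the ceph CLI; the subcommands used below are
// placeholders for whatever the mgr volumes plugin ends up exposing.
func runCeph(args ...string) (string, error) {
	out, err := exec.Command("ceph", args...).CombinedOutput()
	return strings.TrimSpace(string(out)), err
}

// cloneAndWait kicks off a background clone and then polls a status/info
// command until the clone looks complete, instead of blocking inside a
// single long-running call.
func cloneAndWait(vol, subvol, snap, newSubvol string) error {
	// Placeholder clone command; the mgr call returns immediately while the
	// copy proceeds in the background.
	if _, err := runCeph("fs", "subvolume", "clone", vol, subvol, snap, newSubvol); err != nil {
		return err
	}
	for i := 0; i < 60; i++ {
		out, err := runCeph("fs", "subvolume", "info", vol, newSubvol)
		if err != nil {
			return err
		}
		if strings.Contains(out, "complete") { // placeholder completion check
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("clone %s/%s did not complete in time", vol, newSubvol)
}

func main() {
	if err := cloneAndWait("cephfs", "subvol0", "snap0", "subvol0-clone"); err != nil {
		fmt.Println("clone failed:", err)
	}
}
```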

humblec (Collaborator, Author) commented Jun 22, 2019

Native clone is complex and simply will not be done anytime soon. The urgency surrounding this feature is to eliminate the feature gap between CephFS and RBD volumes.

I completely agree with @batrick here, and the urgency is really high! We have to sort this out ASAP. Even if CSI times out, the subsequent calls are going to pick it up, and that is the design for most CSI calls. Secondly, this is not specific to Ceph; other storage systems behave the same way, and most storage operations are heavy.

We can build those semantics in a new call like:

ceph subvolume clone ...

It would just work in the background and return immediately. Then just build another call to check the status:

ceph subvolume info ...

or similar. Would that work?

That works, @batrick!

@batrick, considering this is an important feature needed to satisfy the requirements of many products, how can we expedite or track its progress?

ajarr (Contributor) commented Jun 26, 2019

@nixpanic, this discussion might be interesting for you.

ShyamsundarR (Contributor) commented:

The plan is to have the volumes plugin add a command to recursive copy another subvolume@snapshot to a new subvolume. The interface is still TBD. What do you think?

A clone in CSI is created using the CreateVolume call, with a VolumeContentSource as the snapshot to clone from. If we perform a recursive copy of data when such a CreateVolume is invoked, it would very likely take time, causing the caller to time out on the CreateVolume call, and try again.

Is there no way we can block volume creation in CSI somewhere in the chain of operations of provisioning/using a volume?

Just to be clear, the first call to clone will time out on the caller end (the client end), the Ceph-CSI plugin (the server end) would still progress and (say) wait on a response to the clone from ceph, so eventually the first call will be complete, and hence the clone would be created.
The server on one of the subsequent retries would finally detect that the clone was complete and respond in time to the caller. This detection would just look at the CSI RADOS OMaps to detect the existence of the clone and respond in relatively constant time, but this will happen only after the first call is completed by the server.
IOW, there is no mechanism at present to start a clone job and check its status later in CSI CreateVolume. This would have meant much better semantics than what is happening currently. Maybe we can build such semantics between the plugin and the call to the ceph-mgr volumes plugin?

We can build those semantics in a new call like:

ceph subvolume clone ...

It would just work in the background and return immediately. Then just build another call to check the status:

ceph subvolume info ...

or similar. Would that work?

There is one corner case: when the clone-by-copy takes time and the PVC that started the clone is deleted, the clone may be leaked on the Ceph side.

On experimenting with an induced sleep in the CreateVolume call, such that it would never report success in time, I observed that the moment the PVC is deleted (it is still in Pending state till then, as it never received a success) the provisioner stops calling CreateVolume and hence leaks the cloned/created volume.

This does seem like a bug in the kubernetes provisioner, as it stops attempting to get a successful return from volume creation if the source PVC is deleted before the PV is created. IOW, it does not retry the create indefinitely.

Other cases should be fine, as the kubernetes provisioner documentation on timeouts and the timeout handling in the CSI spec state that the call will be retried indefinitely on timeouts.

Note on a possible alternative: it seems the requirement in question that needs the clone feature could instead use RBD clones, as long as RBD RWX volumes are supported (with the required caveats as in PR #261).

batrick (Member) commented Jul 3, 2019

@ajarr please summarize this discussion (with the links to the specs provided by @ShyamsundarR) and propose the new APIs in a tracker ticket.

It does seem like a bug in kubernetes provisioner, as it stops attempting to get a success return from a volume creation if the source PVC is deleted before the PV is created. IOW, it does not attempt the create indefinitely.

There are ways to approach this so that cleanup is possible. The cloned(-ing) subvolume could be put in an isolated location until the final ceph fs subvolume clone... call comes in to commit the operation. The volumes plugin can then garbage collect any completed subvolume clones that were never committed (anything a day old should be sufficient).
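
A rough sketch of that garbage-collection idea; the in-memory bookkeeping here is purely illustrative, as the real state would live with the volumes plugin:

```go
package main

import (
	"fmt"
	"time"
)

// pendingClone represents a clone staged in an isolated ("uncommitted")
// location, as suggested above; the fields are illustrative only.
type pendingClone struct {
	name      string
	createdAt time.Time
	committed bool
}

// gcUncommittedClones drops staged clones that were never committed and are
// older than maxAge (e.g. a day), covering the "PVC deleted while the copy
// was still running" leak discussed earlier.
func gcUncommittedClones(clones []pendingClone, maxAge time.Duration, now time.Time) []pendingClone {
	kept := clones[:0]
	for _, c := range clones {
		if !c.committed && now.Sub(c.createdAt) > maxAge {
			fmt.Println("garbage collecting stale uncommitted clone:", c.name)
			continue
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	now := time.Now()
	clones := []pendingClone{
		{name: "clone-a", createdAt: now.Add(-48 * time.Hour)},                  // stale, never committed
		{name: "clone-b", createdAt: now.Add(-1 * time.Hour)},                   // recent, still in progress
		{name: "clone-c", createdAt: now.Add(-72 * time.Hour), committed: true}, // committed, keep
	}
	clones = gcUncommittedClones(clones, 24*time.Hour, now)
	fmt.Println("remaining:", len(clones))
}
```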

@ajarr ajarr added the component/cephfs Issues related to CephFS label Aug 5, 2019
@humblec humblec added Priority-0 highest priority issue release-2.0.0 v2.0.0 release labels Sep 30, 2019
humblec (Collaborator, Author) commented Oct 4, 2019

@batrick @ajarr, as cloning is now supported upstream, we have to provide support on the CephFS CSI side, and this is a critical feature for v2.0.0 (#557). Do we have a volunteer to work on this feature? Please suggest.

@ajarr ajarr changed the title CephFS Clone support from a snapshot volume or existing subvolume [CephFS] Create a CSI volume from a snapshot Oct 30, 2019
joscollin (Member) commented:

@batrick @ajarr as cloning is supported in upstream , we have to provide the support from CephFS CSI side and this is a critical feature for v2.0.0 ( #557 ) , Do we have a volunteer to work on this feature ? Please suggest .

#701 (comment)

Madhu-1 (Collaborator) commented Jan 6, 2020

This is about cloning a PVC from a snapshot, not cloning from a PVC.

@Madhu-1 Madhu-1 added Release-2.1.0 and removed release-2.0.0 v2.0.0 release labels Jan 17, 2020
humblec (Collaborator, Author) commented Mar 4, 2020

@joscollin, can you confirm that it is now possible to clone a volume from a snapshot and that this is available in an upstream CephFS release? If yes, which CephFS release has it?

joscollin (Member) commented Mar 4, 2020

Nautilus release status: https://ceph.io/releases/v14-2-8-nautilus-released, which includes cloning a volume from a snapshot.

Madhu-1 (Collaborator) commented Jul 24, 2020

Moving it to release-v3.1.0

humblec (Collaborator, Author) commented Aug 10, 2020

Closing this as the functionality is already available with #394

@humblec humblec closed this as completed Aug 10, 2020