Limits for CephFS snapshots #1133

Closed
ShyamsundarR opened this issue Jun 4, 2020 · 18 comments

Labels
component/cephfs Issues related to CephFS

Comments

@ShyamsundarR
Contributor

The limit is in krbd (and kernel CephFS) since they only allocate a single 4 KiB page to handle all the snapshot IDs for an image / file.

The snapshot limit only counts for the image where the snapshot actually exists -- it does not apply to the total number of snapshots in the entire grandparent-parent-child hierarchy.

Originally posted by @dillaman in #1098 (comment)

Teeing off from the above comment, we would need to handle this for CephFS as well.

As we do not have any flatten or related operations to reduce the snapshots for a given subvolume, the determined maximum number of snapshots for CephFS would be a hard limit; once it is reached, CreateSnapshot calls would need to return RESOURCE_EXHAUSTED errors until some older snapshots are deleted.

Cloning volumes from snapshots should not be a concern, as a clone is a full copy of the source volume, so the snapshot limits do not apply to it.

There may be a limit on how many clone operations can be in flight at a time, though, for resource-consumption reasons.
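
For reference, a rough sketch of how such a clone is driven through the mgr volumes CLI; the names in angle brackets are placeholders, and the exact flags may differ slightly between Ceph releases:

# clone a new subvolume from an existing snapshot -- a full copy of the data
ceph fs subvolume snapshot clone <fsname> <subvolume> <snapshot> <clone-name> --group_name <group>
# the copy runs asynchronously; in-flight clones report an "in-progress" state here
ceph fs clone status <fsname> <clone-name> --group_name <group>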

@ShyamsundarR
Contributor Author

@joscollin @batrick Could you clarify what the current CephFS snapshot limits are? Also, should CSI limit outstanding clone operations in any manner?

@joscollin
Member

@ShyamsundarR
The kernel client only supports up to 400 snapshots per filesystem; please see http://tracker.ceph.com/issues/21420. There is no limit on the number of clones that can be created; the only limitation I see would be based on the available system resources.

As we do not have any flatten or related operations to reduce the snapshots for a given subvolume, the determined maximum number of snapshots for CephFS would be a hard limit; once it is reached, CreateSnapshot calls would need to return RESOURCE_EXHAUSTED errors until some older snapshots are deleted.

The snapshot create command would fail if you try to create snapshots beyond the limit. I think you need to check for failures at the CSI level and handle them.

@humblec
Collaborator

humblec commented Jun 5, 2020

@joscollin @batrick Could you clarify what the current CephFS snapshot limits are? Also, should CSI limit outstanding clone operations in any manner?

@ShyamsundarR there is no flattening available as a command for CephFS. Also, the maximum number of snapshots can be configured with the parameter below; the default value is 100.

mds_max_snaps_per_dir
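
As a hedged illustration (assuming a cluster where the ceph config interface is available and the option exists in the running release), the default can be checked like this:

# query the MDS per-directory snapshot cap (documented default: 100)
ceph config get mds mds_max_snaps_per_dir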

Madhu-1 added the component/cephfs label Jun 5, 2020
@ShyamsundarR
Contributor Author

@ShyamsundarR
The kernel client only supports up to 400 snapshots per filesystem; please see http://tracker.ceph.com/issues/21420. There is no limit on the number of clones that can be created; the only limitation I see would be based on the available system resources.

So the total cap across all subvolumes (at the filesystem layer) is 400; this seems to be clarified in the tracker as well.

As we do not have any flatten or related operations to reduce the snapshots for a given subvolume, the determined maximum number of snapshots for CephFS would be a hard limit; once it is reached, CreateSnapshot calls would need to return RESOURCE_EXHAUSTED errors until some older snapshots are deleted.

The snapshot create command would fail if you try to create snapshots beyond the limit. I think you need to check for failures at the CSI level and handle them.

If the limits are per filesystem, the best approach for CSI is to catch the error and handle it as above. Thanks.

@ShyamsundarR
Contributor Author

@joscollin @batrick Could you clarify what the current CephFS snapshot limits are? Also, should CSI limit outstanding clone operations in any manner?

@ShyamsundarR there is no flattening available as a command for CephFS.

Yes, this is noted in the issue; we should look at returning errors like RESOURCE_EXHAUSTED at the CSI layer.

Also, the maximum number of snapshots can be configured with the parameter below; the default value is 100.

mds_max_snaps_per_dir

I believe this is a further cap at the per-directory level, with the overall limit of 400 per filesystem.

@batrick
Member

batrick commented Jun 5, 2020

I'm not sure if that 400 snapshot limit applies only to the mount and what it can see. If we have a mount like:

mount -t ceph <mon-ip>:/volumes/_nogroup/foo/<uuid>/

then it will only see snapshots of the _nogroup subvolumegroup and snapshots of _nogroup/foo.

@ukernel, can you tell me if that helps with avoiding the 400 snapshot per-file-system limit?

@ShyamsundarR
Contributor Author

I tested the following and found no errors from the ceph fs subvolume snapshot create <args> CLI. The subvolumes were not mounted, so client-side errors were not exercised.

1. Create > 100 snapshots per subvolume - Passed
   • I was able to go up to ~550 snapshots for a single subvolume
2. Create > 400 snapshots across various subvolumes - Passed
   • I was able to go up to ~800 snapshots with no CLI errors

I did not find the setting mds_max_snaps_per_dir (I tried ceph fs get <fsname> and tried setting it there; I am possibly looking in the wrong place for an MDS setting?).

None of the settings were modified otherwise, but it was a Rook-deployed CephFS instance, so I am unsure what other settings are changed by default.

Ceph version used was 14.2.9
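
For reference, a minimal sketch of this kind of test (fs and subvolume names are placeholders). Note that mds_max_snaps_per_dir is an MDS configuration option rather than a per-filesystem setting, so if the running release has it, it would show up through the config interface rather than through ceph fs get:

# create many snapshots on a single subvolume using only the mgr CLI (no mounts)
for i in $(seq 1 550); do
  ceph fs subvolume snapshot create <fsname> <subvolume> snap-$i
done
# count what was actually created
ceph fs subvolume snapshot ls <fsname> <subvolume> | grep -c name
# the per-directory cap is an MDS option, so query it via the config interface
ceph config get mds mds_max_snaps_per_dir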

@ukernel

ukernel commented Jun 11, 2020

I'm not sure if that 400 snapshot limit applies only to the mount and what it can see. If we have a mount like:

mount -t ceph <mon-ip>:/volumes/_nogroup/foo/<uuid>/

then it will only see snapshots of the _nogroup subvolumegroup and snapshots of _nogroup/foo.

@ukernel, can you tell me if that helps with avoiding the 400 snapshot per-file-system limit?

The problem is inodes that have multiple links; these inodes are placed in a dummy snaprealm, which contains all snapshots in the filesystem. For CephFS volumes, we only create snapshots at the volume root. We can disable the special handling for inodes with multiple links. If the special handling is disabled, that can help avoid the 400-snapshot per-filesystem limit.

@batrick
Member

batrick commented Jun 11, 2020

I'm not sure if that 400 snapshot limit applies only to the mount and what it can see. If we have a mount like:
mount -t ceph <mon-ip>:/volumes/_nogroup/foo/<uuid>/
then it will only see snapshots of the _nogroup subvolumegroup and snapshots of _nogroup/foo.
@ukernel, can you tell me if that helps with avoiding the 400 snapshot per-file-system limit?

The problem is inodes that have multiple links; these inodes are placed in a dummy snaprealm, which contains all snapshots in the filesystem. For CephFS volumes, we only create snapshots at the volume root.

You mean the "volumes" mgr plugin? We're creating snapshots on each subvolume directory, e.g. /volumes/_nogroup/foo or on the group directory /volumes/_nogroup.

We can disable the special handling for inodes with multiple links. If the special handling is disabled, that can help avoid the 400-snapshot per-filesystem limit.

How do we disable this special handling and what are the side-effects for hardlinks?

@ukernel

ukernel commented Jun 17, 2020

I'm not sure if that 400 snapshot limit applies only to the mount and what it can see. If we have a mount like:
mount -t ceph <mon-ip>:/volumes/_nogroup/foo/<uuid>/
then it will only see snapshots of the _nogroup subvolumegroup and snapshots of _nogroup/foo.
@ukernel, can you tell me if that helps with avoiding the 400 snapshot per-file-system limit?

The problem is inodes that have multiple links; these inodes are placed in a dummy snaprealm, which contains all snapshots in the filesystem. For CephFS volumes, we only create snapshots at the volume root.

You mean the "volumes" mgr plugin? We're creating snapshots on each subvolume directory, e.g. /volumes/_nogroup/foo or on the group directory /volumes/_nogroup.

yes

We can disable the special handling for inodes with multiple links. If the special handling is disabled, that can help avoid the 400-snapshot per-filesystem limit.

How do we disable this special handling and what are the side-effects for hardlinks?

It needs a small patch to disable it. The side effect is: if there are hardlinks (to the same inode) across multiple subvolumes, snapshots have no effect on the remote links.

@ShyamsundarR
Contributor Author

How do we disable this special handling and what are the side-effects for hardlinks?

It needs a small patch to disable it. The side effect is: if there are hardlinks (to the same inode) across multiple subvolumes, snapshots have no effect on the remote links.

Added a tracker for the above patch: https://tracker.ceph.com/issues/46074

Madhu-1 added a commit to Madhu-1/ceph-csi that referenced this issue Aug 10, 2020
as we cannot have more than 400 active snapshots
on a single subvolume due to the kernel limitation
we need to restrict the users creating more snapshots
on a single subvolume during CreateSnapshot

fixes ceph#1133

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
@Madhu-1
Collaborator

Madhu-1 commented Aug 10, 2020

@ShyamsundarR when I tried to create more than 100 snapshots on a subvolume it started failing

 for i in {0..110};do ceph fs subvolume snapshot create myfs csi-vol-969a0947-dad9-11ea-913c-0242ac110007 test$i --group_name csi;done
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 975, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 260, in handle_command
    return handler(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 354, in _cmd_fs_subvolume_snapshot_create
    group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 272, in create_subvolume_snapshot
    subvolume.create_snapshot(snapname)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_v1.py", line 197, in create_snapshot
    mksnap(self.fs, snappath)
  File "/usr/share/ceph/mgr/volumes/fs/operations/snapshot_util.py", line 18, in mksnap
    raise VolumeException(-e.args[0], e.args[1])
TypeError: bad operand type for unary -: 'str'

Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 975, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 260, in handle_command
    return handler(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 354, in _cmd_fs_subvolume_snapshot_create
    group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 272, in create_subvolume_snapshot
    subvolume.create_snapshot(snapname)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_v1.py", line 197, in create_snapshot
    mksnap(self.fs, snappath)
  File "/usr/share/ceph/mgr/volumes/fs/operations/snapshot_util.py", line 18, in mksnap
    raise VolumeException(-e.args[0], e.args[1])
TypeError: bad operand type for unary -: 'str'
$ ceph fs subvolume snapshot ls myfs csi-vol-969a0947-dad9-11ea-913c-0242ac110007 --group_name csi |grep name| wc -l
100
sh-4.2# ceph version
ceph version 14.2.10

Even with Octopus, the maximum number of snapshots I can create is 100.

100
sh-4.4# 
sh-4.4# ceph version
ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)

@ShyamsundarR @kotreshhr anything I am missing here?

@humblec
Collaborator

humblec commented Aug 10, 2020

@Madhu-1 the 100-snapshot limit comes from mds_max_snaps_per_dir, as mentioned here: #1133 (comment)
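
One hedged way to confirm that (the value 150 is arbitrary, and the subvolume name is the one from the loop above) is to raise the option and re-run the create loop, expecting the failures to start only at the new cap:

# raise the per-directory snapshot cap on the MDS daemons (example value)
ceph config set mds mds_max_snaps_per_dir 150
# re-run the earlier loop; the "Too many links" failures should now begin
# only once ~150 snapshots exist on the subvolume
for i in {100..160}; do ceph fs subvolume snapshot create myfs csi-vol-969a0947-dad9-11ea-913c-0242ac110007 test$i --group_name csi; done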

@kotreshhr

@ShyamsundarR when I tried to create more than 100 snapshots on a subvolume it started failing

 for i in {0..110};do ceph fs subvolume snapshot create myfs csi-vol-969a0947-dad9-11ea-913c-0242ac110007 test$i --group_name csi;done
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 975, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 260, in handle_command
    return handler(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 354, in _cmd_fs_subvolume_snapshot_create
    group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 272, in create_subvolume_snapshot
    subvolume.create_snapshot(snapname)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_v1.py", line 197, in create_snapshot
    mksnap(self.fs, snappath)
  File "/usr/share/ceph/mgr/volumes/fs/operations/snapshot_util.py", line 18, in mksnap
    raise VolumeException(-e.args[0], e.args[1])
TypeError: bad operand type for unary -: 'str'

Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 975, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 260, in handle_command
    return handler(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 354, in _cmd_fs_subvolume_snapshot_create
    group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 272, in create_subvolume_snapshot
    subvolume.create_snapshot(snapname)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_v1.py", line 197, in create_snapshot
    mksnap(self.fs, snappath)
  File "/usr/share/ceph/mgr/volumes/fs/operations/snapshot_util.py", line 18, in mksnap
    raise VolumeException(-e.args[0], e.args[1])
TypeError: bad operand type for unary -: 'str'
$ ceph fs subvolume snapshot ls myfs csi-vol-969a0947-dad9-11ea-913c-0242ac110007 --group_name csi |grep name| wc -l
100
sh-4.2# ceph version
ceph version 14.2.10

Even with Octopus, the maximum number of snapshots I can create is 100.

100
sh-4.4# 
sh-4.4# ceph version
ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)

@ShyamsundarR @kotreshhr anything I am missing here?

The traceback occurs during the handling of the EMLINK error thrown by CephFS for exceeding the per-directory snapshot limit. The following is the actual error.

Error EINVAL: Traceback (most recent call last):
  File "/root//ceph/src/pybind/mgr/volumes/fs/operations/snapshot_util.py", line 14, in mksnap
    fs.mkdir(snappath, 0o755)
  File "cephfs.pyx", line 1149, in cephfs.LibCephFS.mkdir
cephfs.Error: error in mkdir /volumes/_nogroup/sub_0/6de60cea-b703-47ff-8182-dad0f04b1430/.snap/snap_103: Too many links [Errno 31]

The traceback should not have occurred; it should just have returned the error message. This issue [1] is fixed by @ajarr in master and backported to Nautilus; the Octopus backport is still pending.

[1] https://tracker.ceph.com/issues/46360

@humblec
Collaborator

humblec commented Aug 10, 2020

The traceback occurs during the handling of the EMLINK error thrown by CephFS for exceeding the per-directory snapshot limit. The following is the actual error.

Error EINVAL: Traceback (most recent call last):
  File "/root//ceph/src/pybind/mgr/volumes/fs/operations/snapshot_util.py", line 14, in mksnap
    fs.mkdir(snappath, 0o755)
  File "cephfs.pyx", line 1149, in cephfs.LibCephFS.mkdir
cephfs.Error: error in mkdir /volumes/_nogroup/sub_0/6de60cea-b703-47ff-8182-dad0f04b1430/.snap/snap_103: Too many links [Errno 31]

The traceback should not have occurred; it should just have returned the error message. This issue [1] is fixed by @ajarr in master and backported to Nautilus; the Octopus backport is still pending.

[1] https://tracker.ceph.com/issues/46360

That's great. So, to summarize, the limit is imposed by the above-mentioned configuration (mds_max_snaps_per_dir), and when it is exhausted we get EMLINK. Isn't that right?
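
(For reference, EMLINK is errno 31, which matches the "Too many links [Errno 31]" text in the traceback above:)

python3 -c 'import errno, os; print(errno.EMLINK, os.strerror(errno.EMLINK))'
# prints: 31 Too many links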

@Madhu-1
Collaborator

Madhu-1 commented Aug 10, 2020

The traceback occurs during the handling of the EMLINK error thrown by CephFS for exceeding the per-directory snapshot limit. The following is the actual error.

Error EINVAL: Traceback (most recent call last):
  File "/root//ceph/src/pybind/mgr/volumes/fs/operations/snapshot_util.py", line 14, in mksnap
    fs.mkdir(snappath, 0o755)
  File "cephfs.pyx", line 1149, in cephfs.LibCephFS.mkdir
cephfs.Error: error in mkdir /volumes/_nogroup/sub_0/6de60cea-b703-47ff-8182-dad0f04b1430/.snap/snap_103: Too many links [Errno 31]

The traceback should not have occurred; it should just have returned the error message. This issue [1] is fixed by @ajarr in master and backported to Nautilus; the Octopus backport is still pending.
[1] https://tracker.ceph.com/issues/46360

That's great. So, to summarize, the limit is imposed by the above-mentioned configuration (mds_max_snaps_per_dir), and when it is exhausted we get EMLINK. Isn't that right?

The above is the per-directory snapshot limit, which CSI doesn't care about. Ceph-CSI needs to worry about the kernel limit (which can cause issues when mounting a subvolume that has 400+ snapshots).
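
As an illustration only (not the actual ceph-csi change, which the commit referenced below implements), the guard amounts to counting a subvolume's existing snapshots before creating a new one and refusing once the kernel-related cap is reached; this is the point where a RESOURCE_EXHAUSTED error would be returned to the caller. The subvolume and snapshot names are placeholders:

# count existing snapshots on the subvolume (same pattern as used earlier in this thread)
count=$(ceph fs subvolume snapshot ls myfs <subvolume> --group_name csi | grep -c name)
if [ "$count" -ge 400 ]; then
  # at or over the cap: in CSI terms, return RESOURCE_EXHAUSTED instead of creating
  echo "snapshot limit reached for subvolume" >&2
  exit 1
fi
ceph fs subvolume snapshot create myfs <subvolume> <snapshot> --group_name csi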

Madhu-1 added a commit to Madhu-1/ceph-csi that referenced this issue Aug 11, 2020
as we cannot have more than 400 active snapshots
on a single subvolume due to the kernel limitation
we need to restrict the users creating more snapshots
on a single subvolume during CreateSnapshot

fixes ceph#1133

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
@ShyamsundarR
Contributor Author

@ShyamsundarR when I tried to create more than 100 snapshots on a subvolume it started failing

Retested this today: at 100 it errors out (or throws a traceback, depending on the version in use). This limit handling should hence not need any CSI changes, as the call to CreateSnapshot would error out at these limits.

@humblec
Collaborator

humblec commented Oct 1, 2020

Considering this has already been addressed and it looks like we don't need any other adjustments in the CSI code, I am closing this for now; please feel free to reopen if required. Thanks @ShyamsundarR

humblec closed this as completed Oct 1, 2020