mds: add ceph.dir.bal.mask vxattr for MDS Balancer #52373
Conversation
Few things:
- Please add the PR discussion to your commit message.
- Because the bitset is 256 bits, it's not generally easy for the caller to compute the xattr value. I think it would be helpful to have the MDS compute the value by allowing something like
  setfattr -n ceph.dir.bal.mask -v 0,1,3,15 dir/
  such that the MDS will do the bitwise OR of those bits.
- I'd like to see some tests in qa/tasks/cephfs/test_exports.py. You can use vstart_runner.py to test.
- There should be some docs added to explain this and the MDSMap bal rank mask. Which should users prefer, and when? Is it valid to set the rank mask on the root directory? Are values inherited or override-able?
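The comma-separated-ranks suggestion above can be sketched as follows (a minimal Python illustration, not the actual MDS code; the helper name is made up), assuming the MDS simply ORs the bit for each listed rank into a 256-bit mask:

```python
def ranks_to_mask(value: str, max_mds: int = 256) -> int:
    """OR together the bits for a comma-separated rank list,
    e.g. "0,1,3,15" -> a mask with bits 0, 1, 3 and 15 set."""
    mask = 0
    for tok in value.split(","):
        rank = int(tok)
        if not 0 <= rank < max_mds:
            raise ValueError(f"rank {rank} out of range")
        mask |= 1 << rank
    return mask

print(hex(ranks_to_mask("0,1,3,15")))  # 0x800b
```

This avoids asking the caller to hand-compute a 256-bit hex string.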
src/mon/FSCommands.cc
Outdated
@@ -432,6 +432,12 @@ class SetHandler : public FileSystemCommandHandler
  if (r != 0) {
    return r;
  }
  std::bitset<MAX_MDS> rank_mask = std::bitset<MAX_MDS>(bin_string);
  if (!rank_mask.test(0)) {
    ss << "the first bit must be set for rank0";
explain (to the user)?
There have been some slight changes in the implementation. My current implementation involves creating a subtree root for the target directory and exporting that node to a different rank. Then, rebalancing is performed within the masked ranks. For example, if you run "setfattr -n ceph.dir.bal.mask -v 1,2 /home/backup", it converts /home/backup into a subtree and moves it to rank 1. Subsequently, the balancer queries the bal_rank_mask vxattr values of subtrees with authority on that rank and redistributes them within the range of ranks 1 to 2. The '/' root inode is fixed at rank 0, making it impossible to move it to a different rank and apply this logic. Therefore, I've added the constraint to ensure that rank 0 is always included for the root '/'.
The existing bal_rank_mask implementation in #43284 sends unpinned subtrees (excluding the root '/') to masked ranks for redistribution. The ceph.dir.bal.mask functionality has been implemented, and during the process of integrating the bal_rank_mask feature, constraints were added out of necessity.
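The rank-0 constraint described above can be illustrated with a small sketch (a hypothetical helper, not the MDS implementation): the root '/' is fixed at rank 0, so a mask applied to '/' must include rank 0, while other directories may exclude it.

```python
def validate_bal_mask(path: str, ranks: set) -> None:
    """Reject a mask on the root '/' that excludes rank 0, mirroring
    the constraint discussed above (illustrative only)."""
    if path == "/" and 0 not in ranks:
        raise ValueError("the first bit must be set for rank 0 on '/'")

validate_bal_mask("/home/backup", {1, 2})  # allowed: not the root
validate_bal_mask("/", {0, 1})             # allowed: rank 0 included
# validate_bal_mask("/", {1, 2})           # would raise ValueError
```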
src/mon/FSCommands.cc
Outdated
@@ -432,6 +432,12 @@ class SetHandler : public FileSystemCommandHandler
  if (r != 0) {
    return r;
  }
  std::bitset<MAX_MDS> rank_mask = std::bitset<MAX_MDS>(bin_string);
Please carve this out into a separate commit, since this goes through the fs set interface (for the mdsmap) rather than the vxattr interface proposal.
I will split the PR once the discussion about the restriction on rank 0 is resolved.
#52373 (comment)
Force-pushed from ab04b9d to 5499c23.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Force-pushed from 5499c23 to bfef169.
Force-pushed from bfef169 to 108e4ad.
@batrick @vshankar Please review my changes.
- Done.
- I agree. It is not easy to calculate numerous bits and configure the bitfield.
- Test cases for ceph.dir.bal.mask were implemented in test_exports.py.
- How to use ceph.dir.bal.mask is explained in the documentation.
jenkins retest this please
Force-pushed from 4aca9f4 to 53a786c.
jenkins test make check
jenkins test make check arm64
@vshankar ping?
Was on vacation. Will recheck tomorrow or early next week. |
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
Otherwise the change looks reasonable.
src/include/cephfs/types.h
Outdated
@@ -520,6 +521,7 @@ struct inode_t {
  double export_ephemeral_random_pin = 0;
  bool export_ephemeral_distributed_pin = false;
+ std::string bal_rank_mask;
So, probably the main concern that I have with this change is the increase in the size of the inode, which would happen irrespective of whether the xattr is set.
What are your thoughts on persisting this in the xattr map @yongseokoh ?
@vshankar There are downsides that may affect inode size. On the other hand, CephFS users gain the advantage of being able to flexibly map different subtrees to multiple MDS ranks depending on their use case. I think the benefits are significant. Also, CephFS users are familiar with using xattrs, so I think this method is convenient.
That's fine - the xattr interface can remain as it is - just that we could benefit from not increasing the inode size if we can store the mask in xattr_map rather than in the inode.
Do you mean xattr_map used in #36603? I will try to see if it can be employed.
I have a rough understanding of xattrs through ceph.mirror.dirty_snap_id xattr.
"xattrs": [
  {
    "key": "ceph.mirror.dirty_snap_id",
    "val": "7"
  }
]
Using xattrs seems like a good idea. However, it would be better for this proposed PR to be managed as part of the ceph.dir.* family. It seems easier to manage in the long term if it is handled similarly to the existing ceph.dir.pin and ceph.dir.pin.random. Please share your opinion.
@yongseokoh Sure thing - thx!
@vshankar @batrick I changed this PR to use the xattrs map. I confirmed that the ceph.dir.bal.mask xattr can be changed through setfattr/getfattr even without changing Client.cc.
afa6c73#diff-277dc6e796ccecb6aa14c9357f7a86898d6ddcf4113c110a4e00e0a10f6fefa7R6427
Do I have to change Client.cc as mentioned by Patrick?
> Do I have to change Client.cc as mentioned by Patrick?
I don't think that's required since the mirror_info xattr is a bit differently handled as it's a compound xattr.
Make sure relevant tests are added for validating get/set of the xattr.
> Do you mean xattr_map used in #36603? I will try to see if it can be employed.

Yes.
Force-pushed from 64c50fa to afa6c73.
This introduces the ceph.dir.bal.mask vxattr, which is an option to rebalance a subtree within specific active MDSs. Similar to the CPU mask, this feature enables load balancing of specific directories across multiple MDS ranks. It is especially useful for fine-tuning and improving performance in various scenarios. Previously, the bal_rank_mask in #43284 supported isolating unpinned subtrees under the root directory ('/') to specific MDS ranks. However, with this new vxattr, it becomes possible to isolate specific subdirectories to designated MDS ranks. By introducing the ceph.dir.bal.mask vxattr, this PR gives Ceph administrators enhanced control and flexibility for optimizing performance and fine-tuning their deployments. tracker: https://tracker.ceph.com/issues/61777 Signed-off-by: Yongseok Oh <yongseok.oh@linecorp.com>
jenkins test make check
https://pulpito.ceph.com/?branch=wip-vshankar-testing-20240307.022905 Will review the run next week.
@yongseokoh PTAL at the failed tests.
As far as this run is concerned, we ran into a bug with the testing kernel causing some test failures unrelated to this change. See - https://tracker.ceph.com/issues/64679#note-4
Once those two issues are fixed, the failed tests will need a rerun. The other tests look good 👍
This is getting really close to getting merged.
qa/tasks/cephfs/test_exports.py
Outdated
self.mount_a.setfattr("1", "ceph.dir.bal.mask", "2")
value = self.mount_a.getfattr("1", "ceph.dir.bal.mask")
self.assertEqual(value, "2")
The assert check fails - https://pulpito.ceph.com/vshankar-2024-03-07_10:41:12-fs-wip-vshankar-testing-20240307.022905-testing-default-smithi/7584618/
2024-03-07T22:51:06.010 INFO:tasks.cephfs_test_runner:======================================================================
2024-03-07T22:51:06.010 INFO:tasks.cephfs_test_runner:FAIL: test_mds_failover (tasks.cephfs.test_exports.TestBalMask)
2024-03-07T22:51:06.010 INFO:tasks.cephfs_test_runner:That MDS failover does not affect the ceph.dir.bal.mask.
2024-03-07T22:51:06.010 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-03-07T22:51:06.011 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-03-07T22:51:06.011 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_b0ea271bb96f7f4cec2f2a40ff88a5def1e0530f/qa/tasks/cephfs/test_exports.py", line 934, in test_mds_failover
2024-03-07T22:51:06.011 INFO:tasks.cephfs_test_runner: self.assertEqual(value, "2")
2024-03-07T22:51:06.011 INFO:tasks.cephfs_test_runner:AssertionError: None != '2'
@vshankar I will check and get back to you.
In the vstart-based local test, all 20 test cases passed. I will look at the log you provided and see if I can reproduce it.
2024-03-13 11:38:19,031.031 INFO:__main__:test_unset_rank_mask_under_nested_parent (tasks.cephfs.test_exports.TestBalMask) ... ok
2024-03-13 11:38:19,031.031 INFO:__main__:test_unset_rank_mask_under_nested_root (tasks.cephfs.test_exports.TestBalMask) ... ok
2024-03-13 11:38:19,031.031 INFO:__main__:
2024-03-13 11:38:19,031.031 INFO:__main__:----------------------------------------------------------------------
2024-03-13 11:38:19,031.031 INFO:__main__:Ran 20 tests in 1378.896s
2024-03-13 11:38:19,031.031 INFO:__main__:
2024-03-13 11:38:19,032.032 INFO:__main__:
2024-03-13 11:38:19,032.032 INFO:__main__:> ip netns list
2024-03-13 11:38:19,039.039 INFO:__main__:> sudo ip link delete ceph-brx
Cannot find device "ceph-brx"
2024-03-13 11:38:19,066.066 INFO:__main__:OK
2024-03-13 11:38:19,067.067 INFO:__main__:
@vshankar
I think it's a timing issue. This is a case where getfattr fails after setfattr regardless of the MDS failover. Can the MDS logs be shared?
The major difference is the use of the kernel client. I only tested ceph-fuse.
In the vstart local test, both the kernel client and ceph-fuse tests passed without any problems.
> @vshankar
> I think it's a timing issue. This is a case where getfattr fails after setfattr regardless of the MDS failover. Can the MDS logs be shared?
I can have a look at the logs to see if I can spot something. The logs would be huge, but here they are: http://qa-proxy.ceph.com/teuthology/vshankar-2024-03-07_10:41:12-fs-wip-vshankar-testing-20240307.022905-testing-default-smithi/7584618/remote/
Check for "ceph-mds*.gz" in both the remote directories.
The setxattr client request was processed, and a getxattr request was issued between early_reply and reply_client_request. I think the client inode xattrs were not updated at that time. Is it possible to guarantee that the client inode's xattrs will be updated after early_reply? Otherwise, should I just add some logic in the test case to wait for a few seconds?
2024-03-07T22:50:52.520+0000 7f8607400640 4 mds.0.server handle_client_request client_request(client.5210:27 setxattr #0x10000000000 ceph.dir.bal.mask 2024-03-07T22:50:52.519761+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6
2024-03-07T22:50:52.521+0000 7f8607400640 10 mds.0.server setxattr 'ceph.dir.bal.mask' len 1 on [inode 0x10000000000 [...2,head] /1/ auth v4 ap=2 DIRTYPARENT f(v0 m2024-03-07T22:50:51.975767+0000 1=0+1) n(v0 rc2024-03-07T22:50:51.975767+0000 2=0+2) (isnap sync r=1) (inest lock) (ifile excl) (ixattr xlock x=1 by 0x55c05e997a80) (iversion lock w=1 last_client=5210) caps={5210=pAsLsFsx/pAsLsXsFsxcral@2},l=5210 | request=1 lock=3 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x55c05e90a100]
2024-03-07T22:50:52.521+0000 7f8607400640 10 mds.0.server early_reply 0 ((0) Success) client_request(client.5210:27 setxattr #0x10000000000 ceph.dir.bal.mask 2024-03-07T22:50:52.519761+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6
2024-03-07T22:50:52.580+0000 7f8607400640 4 mds.0.server handle_client_request client_request(client.5210:28 ??? #0x10000000000 ceph.dir.bal.mask 2024-03-07T22:50:52.578761+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6
2024-03-07T22:50:52.580+0000 7f8607400640 7 mds.0.server reply_client_request -61 ((61) No data available) client_request(client.5210:28 ??? #0x10000000000 ceph.dir.bal.mask 2024-03-07T22:50:52.578761+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6
inode->xattrs may not be updated.
https://github.com/ceph/ceph/blob/main/src/client/Client.cc#L13757
2024-03-07T22:50:55.379+0000 7f85ff200640 7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.5210:27 setxattr #0x10000000000 ceph.dir.bal.mask 2024-03-07T22:50:52.519761+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6
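If waiting in the test turns out to be the chosen workaround, the "wait a few seconds" idea could look like this polling helper (a hypothetical sketch for the qa test, not part of any actual fix):

```python
import time

def wait_for_xattr(getfattr, path, name, expected, timeout=30, interval=1):
    """Poll getfattr until the xattr reaches the expected value,
    tolerating the window between early_reply and the final reply
    where the cached xattr may be stale (illustrative only)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if getfattr(path, name) == expected:
            return True
        time.sleep(interval)
    return False
```

A test would then call wait_for_xattr(self.mount_a.getfattr, "1", "ceph.dir.bal.mask", "2") instead of asserting immediately after setfattr.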
> The setxattr client request was processed and a getxattr request was requested between early_reply and reply_client_request. I think the client inode xattrs was not updated at that time. Is it possible to guarantee that the client inode's xattrs will be updated after early_reply? Otherwise, should I just add some logic in the test case to wait for a few seconds?
If the client has the relevant caps, it should have the latest xattr version. I think you might have run into a race where the client thinks it has the caps and can trust the cached xattr, however, the xattr is stale.
I have seen something similar before and fixed it in a code path. See commit 526e4ee
Just FYI - apart from the one failure reported here, there is an unrelated test case failure due to a bug in the testing kernel.
qa/tasks/cephfs/test_exports.py
Outdated
self.assertEqual(value, None)

self.mount_a.setfattr(".", "ceph.dir.bal.mask", "0")
self._wait_subtrees([('', 0)], status=self.status, path='', exclude_path='~mds')
I'm not sure what this test is doing that test_unset_rank_mask isn't already testing?
This unset test may be redundant, but the '.' directory is located at rank 0 and the '1' directory is exported to rank 2. The purpose is to test whether the parent directory's export affects the '1' directory.
It seems the same line. #52373 (comment)
Signed-off-by: Yongseok Oh <yongseok.oh@linecorp.com>
Signed-off-by: Yongseok Oh <yongseok.oh@linecorp.com>
This introduces the ceph.dir.bal.mask vxattr, which is an option to rebalance a subtree within specific active MDSs. Similar to the CPU mask, this feature enables load balancing of specific directories across multiple MDS ranks. It is especially useful for fine-tuning and improving performance in various scenarios. Previously, the bal_rank_mask in ceph#43284 supported isolating unpinned subtrees under the root directory ('/') to specific MDS ranks. However, with this new vxattr, it becomes possible to isolate specific subdirectories to designated MDS ranks. By introducing the ceph.dir.bal.mask vxattr, this PR gives Ceph administrators enhanced control and flexibility for optimizing performance and fine-tuning their deployments. tracker: https://tracker.ceph.com/issues/61777 Signed-off-by: Yongseok Oh <yongseok.oh@linecorp.com>
int r = mds->balancer->get_rank_mask_bitset(dir, rank_mask_bitset, false);
bool has_mask = false;
if (r == 0 && rank_mask_bitset.count()) {
  has_mask = false;
false to true
@@ -186,6 +240,7 @@ void MDBalancer::handle_export_pins(void)
 */
 if (dir->get_num_head_items() > 0) {
   mds->mdcache->migrator->export_dir(dir, target);
 } else {
unnecessary
@batrick ping? Could you please review this PR when you get a chance?
tracker: https://tracker.ceph.com/issues/61777
Introduction
This PR introduces the ceph.dir.bal.mask vxattr, which is an option to rebalance a subtree within specific active MDSs. Similar to the CPU mask, this feature enables load balancing of specific directories across multiple MDS ranks. It is especially useful for fine-tuning and improving performance in various scenarios. Previously, the bal_rank_mask in #43284 supported isolating unpinned subtrees under the root directory ('/') to specific MDS ranks. However, with this new vxattr, it becomes possible to isolate specific subdirectories to designated MDS ranks. By introducing the ceph.dir.bal.mask vxattr, this PR gives Ceph administrators enhanced control and flexibility for optimizing performance and fine-tuning their deployments.
Use Cases
The first is when it is difficult to pin a subdirectory to one MDS rank. Suppose the /home/images directory exists, with directories /0 to /99 under it and 10 million image files stored in each directory. In this case, it is difficult to pin the entire images directory to one MDS rank, and pinning the 100 huge directories manually or using ephemeral pinning is not an easy task. Efficient resource management becomes possible by using ceph.dir.bal.mask.
Second, when there are several large directories such as /home/images, performance can be optimized by distributing them to different MDS rank groups using ceph.dir.bal.mask. Since the existing mdsmap bal_rank_mask isolates the entire '/' directory to specific ranks, the directories can affect each other's performance through migration overhead. For example, suppose mdsmap's bal_rank_mask is set to 0xf and the large directories /home/images and /home/backups exist. If the load on /home/images instantaneously increases, metadata distribution occurs across ranks 0 to 3, so users of /home/backups may be affected by a noisy neighbor unnecessarily. If the two directories are instead set to MDS ranks 0-1 (ceph.dir.bal.mask 0x3) and 2-3 (ceph.dir.bal.mask 0xC) respectively, the effect on each other can be minimized. In this way, it can be used efficiently for various directories.
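The mask-to-rank-group correspondence used above can be sketched as follows (a small Python illustration; the helper name is made up):

```python
def mask_to_ranks(mask: int) -> list:
    """List the MDS ranks covered by a hex rank mask,
    e.g. 0x3 -> ranks 0-1 and 0xC -> ranks 2-3."""
    return [r for r in range(mask.bit_length()) if (mask >> r) & 1]

print(mask_to_ranks(0x3))  # [0, 1]
print(mask_to_ranks(0xC))  # [2, 3]
print(mask_to_ranks(0xF))  # [0, 1, 2, 3]
```

So with masks 0x3 and 0xC, the two directories occupy disjoint rank groups and do not compete for the same MDS daemons.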
How to use