diff --git a/doc/cephfs/multimds.rst b/doc/cephfs/multimds.rst
index 9855fa5e725bf..c55f0d7519453 100644
--- a/doc/cephfs/multimds.rst
+++ b/doc/cephfs/multimds.rst
@@ -268,6 +268,8 @@ such as with the ``bal_rank_mask`` setting (described below). Careful
 monitoring of the file system performance and MDS is advised.
 
+.. _intro-bal-rank-mask:
+
 Dynamic subtree partitioning with Balancer on specific ranks
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -311,3 +313,151 @@ the ``bal_rank_mask`` should be set to ``0x0``. For example:
 .. prompt:: bash #
 
     ceph fs set bal_rank_mask 0x0
+
+Dynamically partitioning directory trees onto specific ranks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``bal_rank_mask`` file system setting (see also :ref:`intro-bal-rank-mask`)
+can be overridden by configuring the ``ceph.dir.bal.mask`` vxattr. Notably,
+``ceph.dir.bal.mask`` can isolate particular subdirectories onto dedicated
+groups of MDS ranks, whereas ``bal_rank_mask`` only allows isolation at the
+file system level, that is, for the root directory ``/``. This makes it a
+useful tool for fine-tuning MDS performance in a number of scenarios.
+
+One scenario is a large directory that is too demanding for a single MDS rank
+to handle efficiently. For example, the ``/usr`` directory contains many
+subdirectories used by multiple users. In most cases, static pinning is
+typically used to distribute the subdirectories of ``/usr`` across multiple
+ranks. However, the ``/usr/share`` directory presents a special situation: it
+has many subdirectories, each containing a large number of files, often
+exceeding a million. For example, consider ``/usr/share/images`` with several
+million images. Handling this directory on a single MDS may be impractical
+due to insufficient MDS resources, such as cache memory. Therefore,
+``ceph.dir.bal.mask`` can be used to dynamically balance the workload of this
+large directory within a specific subset of ranks.
+
+The ``ceph.dir.bal.mask`` vxattr is also a useful option for fine-tuning
+performance in scenarios involving several large directories, such as
+``/usr/share/images`` and ``/usr/share/backups``, within the same file system.
+
+.. note:: The examples here use the ``/usr/share`` directory. Depending on the
+   user's preference, the same approach can be applied to other directories,
+   for example under ``/mnt/cephfs``.
+
+While the file system setting ``bal_rank_mask`` confines the entire ``/``
+directory to specific ranks, workloads under it can still degrade each other's
+performance because of migration overhead. For example, if the file system
+setting ``bal_rank_mask`` is set to ``0xf`` and large directories such as
+``/usr/share/images`` and ``/usr/share/backups`` exist, a sudden increase in
+load on ``/usr/share/images`` causes its metadata to be redistributed across
+ranks 0 to 3. Users of ``/usr/share/backups`` may then be affected by this
+noisy neighbor unnecessarily. By distributing the two directories to an MDS
+rank 1,2 group and an MDS rank 3,4 group respectively through
+``ceph.dir.bal.mask``, metadata service can be provided for each directory
+without affecting the performance of the other concurrent workload.
+
+This option can be set via:
+
+::
+
+    setfattr -n ceph.dir.bal.mask -v 1,2 /usr/share/images
+    setfattr -n ceph.dir.bal.mask -v 3,4 /usr/share/backups
+
+.. note:: The ``balance_automate`` file system setting must be ``true`` for
+   ``ceph.dir.bal.mask`` to have any effect.
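+
+As a minimal sketch, assuming the file system is named ``cephfs``, the
+automatic balancer can be enabled with:
+
+::
+
+    ceph fs set cephfs balance_automate true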
+
+Once the balancer is enabled, ``/usr/share/images`` and ``/usr/share/backups``
+are distributed within MDS ranks 1-2 and 3-4, respectively.
+
+Like ``ceph.dir.pin``, ``ceph.dir.bal.mask`` is inherited from the closest
+parent: configuring ``ceph.dir.bal.mask`` on a directory affects all of its
+descendants. However, the parent's ``ceph.dir.bal.mask`` can be overridden by
+setting a value on a child directory. For example:
+
+::
+
+    mkdir -p /usr/share/images/jpg
+    setfattr -n ceph.dir.bal.mask -v 1,2 /usr/share/images
+    # /usr/share/images and /usr/share/images/jpg are now freely moved among mds ranks 1 and 2
+    setfattr -n ceph.dir.bal.mask -v 3,4 /usr/share/images/jpg
+    # /usr/share/images/jpg is now moved among mds ranks 3 and 4
+
+The option can be unset via:
+
+::
+
+    setfattr -x ceph.dir.bal.mask /usr/share/images/jpg
+
+The ``ceph.dir.bal.mask`` value for ``/usr/share/images/jpg`` is now unset and
+replaced with the value inherited from its nearest parent. If the
+``ceph.dir.bal.mask`` for ``/usr/share/images`` is configured with a valid
+value, ``/usr/share/images/jpg`` will be distributed according to its parent's
+settings.
+
+The vxattr can also override the file system setting ``bal_rank_mask``. For
+example:
+
+::
+
+    ceph fs set cephfs bal_rank_mask 0xf
+
+Initially, the balancer dynamically partitions the file system within MDS
+ranks 0 to 3 because ``bal_rank_mask`` is set to ``0xf``.
+
+If the ``ceph.dir.bal.mask`` for the root ``/`` directory is then set to
+``0,1,2,3,4,5``, the ``bal_rank_mask`` will be overridden. In this way, the
+balancer dynamically distributes the unpinned subtrees of the root ``/``
+directory across MDS ranks 0 to 5.
+
+.. note:: Users may prefer to use the file system setting ``bal_rank_mask``
+   for simplicity.
+
+::
+
+    setfattr -n ceph.dir.bal.mask -v 0,1,2,3,4,5 /
+
+``ceph.dir.bal.mask`` overrides a parent ``ceph.dir.pin`` and vice versa. For
+example:
+
+::
+
+    mkdir -p /usr/share/images/jpg
+    setfattr -n ceph.dir.pin -v 1 /usr/share/images
+    setfattr -n ceph.dir.bal.mask -v 2,3 /usr/share/images/jpg
+
+The ``/usr/share/images`` directory will be pinned to rank 1, while the
+``/usr/share/images/jpg`` directory may be dynamically split across ranks 2
+and 3.
+
+While ``ceph.dir.bal.mask`` behaves like ``ceph.dir.pin`` when it lists only
+one MDS rank, its distinct advantage is flexibility: unlike ``ceph.dir.pin``,
+``ceph.dir.bal.mask`` can later be expanded to more MDS ranks or reduced to
+fewer MDS ranks. Therefore, in scenarios that may require expansion to
+multiple MDS ranks, it is recommended to use ``ceph.dir.bal.mask``.
+
+::
+
+    # Same effect as setfattr -n ceph.dir.pin -v 1 /usr/share/images
+    setfattr -n ceph.dir.bal.mask -v 1 /usr/share/images
+
+    # If expansion is necessary, multiple mds ranks can be specified.
+    setfattr -n ceph.dir.bal.mask -v 1,2,3 /usr/share/images
+
+If ``max_mds`` shrinks so that none of the masked ranks remain active, the
+subtree moves to MDS rank 0. Therefore, ``max_mds`` should be adjusted
+carefully because MDS performance may decrease.
+
+::
+
+    # The file system is operating with mds ranks 0 to 3.
+    ceph fs set cephfs max_mds 4
+
+    # The subtree is moved among mds ranks 2 and 3.
+    setfattr -n ceph.dir.bal.mask -v 2,3 /usr/share/images
+
+    # The file system is operating with mds ranks 0 and 1.
+    ceph fs set cephfs max_mds 2
+
+    # The subtree is now moved to mds rank 0.
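+
+To check where a subtree actually landed, the subtree map reported by an MDS
+can be inspected. The sketch below assumes a file system named ``cephfs`` and
+uses ``jq`` (the exact command form and output field names may vary by
+release) to print each subtree path together with its authoritative rank:
+
+::
+
+    ceph tell mds.cephfs:0 get subtrees | jq '.[] | [.dir.path, .auth_first]'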
+
+**Restrictions**: Since the inode of the root directory is always implicitly
+pinned to MDS rank 0, MDS rank 0 must be included when setting the
+``ceph.dir.bal.mask`` for the ``/`` directory.
+
+::
+
+    setfattr -n ceph.dir.bal.mask -v 1,2 /
+    setfattr: mnt: Invalid argument
+    # failed with invalid argument error
+
+    setfattr -n ceph.dir.bal.mask -v 0,1,2 /
+    # success
diff --git a/qa/tasks/cephfs/cephfs_test_case.py b/qa/tasks/cephfs/cephfs_test_case.py
index c1312ec5efcd1..1d9d6642ea748 100644
--- a/qa/tasks/cephfs/cephfs_test_case.py
+++ b/qa/tasks/cephfs/cephfs_test_case.py
@@ -337,7 +337,7 @@ def delete_mds_coredump(self, daemon_id):
 
     def _get_subtrees(self, status=None, rank=None, path=None):
         if path is None:
-            path = "/"
+            path = "/" # everything below root but not root
         try:
             with contextutil.safe_while(sleep=1, tries=3) as proceed:
                 while proceed():
@@ -351,7 +351,16 @@ def _get_subtrees(self, status=None, rank=None, path=None):
                             subtrees += s
                     else:
                         subtrees = self.fs.rank_asok(["get", "subtrees"], status=status, rank=rank)
-                    subtrees = filter(lambda s: s['dir']['path'].startswith(path), subtrees)
+                    def match(s):
+                        if path == "" and s['dir']['path'] == "":
+                            return True
+                        elif path == "" and s['dir']['path'].startswith("/"):
+                            return True
+                        elif path != "" and s['dir']['path'].startswith(path):
+                            return True
+                        else:
+                            return False
+                    subtrees = filter(match, subtrees)
                     return list(subtrees)
         except CommandFailedError as e:
             # Sometimes we get transient errors
@@ -363,6 +372,8 @@ def _get_subtrees(self, status=None, rank=None, path=None):
             raise RuntimeError(f"could not get subtree state from rank {rank}") from e
 
     def _wait_subtrees(self, test, status=None, rank=None, timeout=30, sleep=2, action=None, path=None):
+        log.info(f'_wait_subtrees test={test} status={status} rank={rank} timeout={timeout} '
+                 f'sleep={sleep} action={action} path={path}')
         test = sorted(test)
         try:
             with contextutil.safe_while(sleep=sleep, tries=timeout//sleep) as proceed:
@@ -419,6 +430,16 @@ def _wait_random_subtrees(self, count, status=None, rank=None, path=None):
         except contextutil.MaxWhileTries as e:
             raise RuntimeError("rank {0} failed to reach desired subtree state".format(rank)) from e
 
+    def _wait_xattrs(self, mount, path, key, exp):
+        try:
+            with contextutil.safe_while(sleep=5, tries=20) as proceed:
+                while proceed():
+                    value = mount.getfattr(path, key)
+                    if value == exp:
+                        return value
+        except contextutil.MaxWhileTries as e:
+            raise RuntimeError("client failed to get desired value with key {0}".format(key)) from e
+
     def create_client(self, client_id, moncap=None, osdcap=None, mdscap=None):
         if not (moncap or osdcap or mdscap):
             if self.fs:
diff --git a/qa/tasks/cephfs/filesystem.py b/qa/tasks/cephfs/filesystem.py
index 1c00a49077dff..9a0f93ac135bd 100644
--- a/qa/tasks/cephfs/filesystem.py
+++ b/qa/tasks/cephfs/filesystem.py
@@ -655,6 +655,9 @@ def set_allow_new_snaps(self, yes):
     def set_bal_rank_mask(self, bal_rank_mask):
         self.set_var("bal_rank_mask", bal_rank_mask)
 
+    def set_balance_automate(self, yes):
+        self.set_var("balance_automate", yes)
+
     def set_refuse_client_session(self, yes):
         self.set_var("refuse_client_session", yes)
 
diff --git a/qa/tasks/cephfs/test_exports.py b/qa/tasks/cephfs/test_exports.py
index 16de379f54fe5..1319e784a8b1d 100644
--- a/qa/tasks/cephfs/test_exports.py
+++ b/qa/tasks/cephfs/test_exports.py
@@ -628,3 +628,367 @@ def test_ephemeral_pin_shrink_mds(self):
         log.info("{0} migrations have occured due to the cluster resizing".format(count))
         # rebalancing from 3 -> 2 may cause half of rank
0/1 to move and all of rank 2 self.assertLessEqual((count/len(subtrees_old)), (1.0/3.0/2.0 + 1.0/3.0/2.0 + 1.0/3.0)*1.25) # aka .66 with 25% overbudget + +class TestBalMask(CephFSTestCase): + MDSS_REQUIRED = 3 + CLIENTS_REQUIRED = 1 + + def setUp(self): + CephFSTestCase.setUp(self) + + self.fs.set_max_mds(3) + self.status = self.fs.wait_for_daemons() + + if self.fs.get_var("max_mds") < 3: + self.skipTest("Require three mdss") + + self.fs.set_balance_automate(True) + + self.mount_a.run_shell_payload("mkdir -p 1/a") + + def test_set_to_only_invalid_value_for_subdir(self): + """ + That passing invalid values leads to a command failure. + """ + MAX_MDS = 256 + invalid_values = list(map(str, [-2,MAX_MDS, MAX_MDS+1])) + for value in invalid_values: + try: + self.mount_a.setfattr("1", "ceph.dir.bal.mask", value) + except CommandFailedError as e: + self.assertEqual(e.exitstatus, 1) + + def test_set_to_mix_invalid_value_for_subdir(self): + """ + That combining valid and invalid values results in a command failure. + """ + MAX_MDS = 256 + valid_values = list(map(str, [0, 1])) + invalid_values = list(map(str, [-2, MAX_MDS, MAX_MDS+1])) + invalid_values = [f'{pair[1]},{pair[1]}' for pair in zip(invalid_values, valid_values)] + for value in invalid_values: + try: + self.mount_a.setfattr("1", "ceph.dir.bal.mask", value) + except CommandFailedError as e: + self.assertEqual(e.exitstatus, 1) + + def test_distribute_directory_to_multiple_ranks(self): + """ + That bal.mask distributes a directory to multiple ranks. + """ + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "1,2") + for i in range(10): + self.mount_a.run_shell_payload(f"mkdir -p 1/{i}") + self.mount_a.create_n_files(f"1/{i}/file", 100, sync=True) + + time.sleep(15) + + subtrees = self._get_subtrees(status=self.status, rank='all', path="/1") + hit_ranks = set([s['auth_first'] for s in subtrees]) + self.assertEqual(set([1,2]), hit_ranks) + + def test_set_to_single_valid_value_for_subdir(self): + """ + That vaild value is passed. + """ + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "1") + self._wait_subtrees([('/1', 1)], status=self.status) + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "2") + self._wait_subtrees([('/1', 2)], status=self.status) + + def test_set_to_multiple_valid_values_for_subdir(self): + """ + That multiple vaild values are passed. + The root of the subtree is moved to the lowest rank + among multiple rank values. + """ + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "0,1") + self._wait_subtrees([('/1', 0)], status=self.status) + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "1,2") + self._wait_subtrees([('/1', 1)], status=self.status) + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "0,2") + self._wait_subtrees([('/1', 0)], status=self.status) + + def test_set_to_invalid_value_for_root(self): + """ + That the root directory must always include rank 0. + Otherwise, it will result in a command failure. + """ + + try: + self.mount_a.setfattr("", "ceph.dir.bal.mask", "1") + except CommandFailedError as e: + self.assertEqual(e.exitstatus, 1) + + try: + self.mount_a.setfattr("", "ceph.dir.bal.mask", "2") + except CommandFailedError as e: + self.assertEqual(e.exitstatus, 1) + + try: + self.mount_a.setfattr("", "ceph.dir.bal.mask", "1,2") + except CommandFailedError as e: + self.assertEqual(e.exitstatus, 1) + + def test_set_to_valid_value_for_root(self): + """ + That the root directory must always include rank 0. 
+ """ + + self.mount_a.setfattr(".", "ceph.dir.bal.mask", "0") + self._wait_subtrees([('', 0)], status=self.status, path='') + + self.mount_a.setfattr(".", "ceph.dir.bal.mask", "0,1") + self._wait_subtrees([('', 0)], status=self.status, path='') + + self.mount_a.setfattr(".", "ceph.dir.bal.mask", "0,2") + self._wait_subtrees([('', 0)], status=self.status, path='') + + self.mount_a.setfattr(".", "ceph.dir.bal.mask", "0,1,2") + self._wait_subtrees([('', 0)], status=self.status, path='') + + + def test_override_ceph_fs_set_bal_rank_mask(self): + """ + That ceph.dir.bal.mask overrides mdsmap's bal_rank_mask. + The bal_rank_mask is originally set to 0x1, which confines all workloads to rank 0. + By overriding it with ceph.dir.bal.mask at 0,1, you can now examine sub-trees like /1 on rank 1. + """ + bal_rank_mask = '0x1' + self.fs.set_bal_rank_mask(bal_rank_mask) + self.assertEqual(bal_rank_mask, self.fs.get_var('bal_rank_mask')) + self._wait_subtrees([('', 0)], status=self.status, path='') + + self.mount_a.setfattr(".", "ceph.dir.bal.mask", "0,1") + self._wait_subtrees([('', 0)], status=self.status, path='') + + found = False + for i in range(10): + self.mount_a.create_n_files("1/file", 100, sync=True) + self.mount_a.create_n_files("file", 100, sync=True) + + try: + self._wait_subtrees([('/1', 1)], status=self.status, timeout=10) + found = True + break + except: + pass + + self.assertEqual(found, True) + + def test_rank_mask_override_pin(self): + """ + That ceph.dir.bal.mask overrides ceph.dir.pin. + """ + self.mount_a.setfattr("1", "ceph.dir.pin", "1") + self._wait_subtrees([('/1', 1)], status=self.status, rank=1) + + self.mount_a.setfattr("1/a", "ceph.dir.bal.mask", "2") + self._wait_subtrees([('/1', 1), ('/1/a', 2)], status=self.status, rank=2) + + def test_pin_override_rank_mask_simple(self): + """ + That ceph.dir.pin overrides ceph.dir.bal.mask. + """ + self.mount_a.run_shell(["touch", "1/a/file"]) + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "1") + self._wait_subtrees([('/1', 1)], status=self.status, rank=1) + + self.mount_a.setfattr("1/a", "ceph.dir.pin", "2") + self._wait_subtrees([('/1', 1),('/1/a', 2)], status=self.status, rank=2) + + def test_pin_override_rank_mask_mix(self): + """ + That ceph.dir.pin overrides ceph.dir.bal.mask in the same directory. + ceph.dir.pin has higher priority. 
+ """ + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "1") + self._wait_subtrees([('/1', 1)], status=self.status, rank=1) + + self.mount_a.setfattr("1", "ceph.dir.pin", "2") + self._wait_subtrees([('/1', 2)], status=self.status, rank=2) + + self.mount_a.setfattr("1", "ceph.dir.pin", "-1") + self._wait_subtrees([('/1', 1)], status=self.status, rank=1) + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "2") + self._wait_subtrees([('/1', 2)], status=self.status, rank=2) + + self.mount_a.setfattr("1", "ceph.dir.pin", "1") + self._wait_subtrees([('/1', 1)], status=self.status, rank=1) + + self.mount_a.setfattr("1", "ceph.dir.pin", "-1") + self._wait_subtrees([('/1', 2)], status=self.status, rank=2) + + def test_ephemeral_pin_overried_rank_mask(self): + """ + That ephemeral pin overrides ceph.dir.bal.mask + """ + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "2") + self._wait_subtrees([('/1', 2)], status=self.status, rank=2) + self.mount_a.create_n_files("1/file", 100, sync=True) + time.sleep(10) + + self.config_set('mds', 'mds_export_ephemeral_distributed', True) + self.mount_a.setfattr("1", "ceph.dir.pin.distributed", "1") + self.mount_a.create_n_files("1/file", 100, sync=True) + subtrees = self._wait_distributed_subtrees(3 * 2, status=self.status, rank="all") + for s in subtrees: + path = s['dir']['path'] + if path == '/1': + self.assertTrue(s['distributed_ephemeral_pin']) + self.assertEqual(s['bal_rank_mask'], "2") + + def test_ephemeral_random_overried_rank_mask(self): + """ + that ephemeral random overrides ceph.dir.bal.mask. + """ + self.config_set('mds', 'mds_export_ephemeral_random', True) + self.config_set('mds', 'mds_export_ephemeral_random_max', 1.0) + count = 10 + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "2") + for i in range(count): + self.mount_a.create_n_files("1/file", 10, sync=True) + self._wait_subtrees([('/1', 2)], status=self.status, rank=2) + + self.mount_a.setfattr("1", "ceph.dir.pin.random", "1.0") + # timing issue + # Sometimes, there might be a delay in the synchronization of export_ephemeral_random_pin. + time.sleep(10) + for i in range(count): + self.mount_a.run_shell_payload(f"mkdir -p 1/{i}") + self.mount_a.create_n_files(f"1/{i}/file", 10, sync=True) + + self._wait_random_subtrees(count, status=self.status, rank="all") + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "1") + for i in range(count): + self.mount_a.create_n_files("1/file", 10, sync=True) + + self._wait_random_subtrees(count, status=self.status, rank="all") + + def test_unset_rank_mask(self): + """ + That ceph.dir.bal.mask is unset with -1. + """ + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "2") + value = self.mount_a.getfattr("1", "ceph.dir.bal.mask") + self.assertEqual(value, "2") + self._wait_subtrees([('/1', 2)], status=self.status) + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "-1") + value = self.mount_a.getfattr("1", "ceph.dir.bal.mask") + self.assertEqual(value, None) + + def test_unset_rank_mask_under_nested_root(self): + """ + That ceph.dir.bal.mask is unset under root directory. 
+ """ + + self.mount_a.setfattr(".", "ceph.dir.bal.mask", "0") + self._wait_subtrees([('', 0)], status=self.status, path='') + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "2") + value = self.mount_a.getfattr("1", "ceph.dir.bal.mask") + self.assertEqual(value, "2") + self._wait_subtrees([('/1', 2)], status=self.status) + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "-1") + value = self.mount_a.getfattr("1", "ceph.dir.bal.mask") + self.assertEqual(value, None) + + """ + After the ceph.dir.bal.mask value of /1 is unset, + the /1 subtree is merged with the / root . + + """ + self.mount_a.setfattr(".", "ceph.dir.bal.mask", "0") + self._wait_subtrees([('', 0)], status=self.status, path='') + + def test_unset_rank_mask_under_nested_parent(self): + """ + That ceph.dir.bal.mask is unset under nested directory. + """ + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "1") + self._wait_subtrees([('/1', 1)], status=self.status, rank=1) + self.mount_a.setfattr("1/a", "ceph.dir.bal.mask", "2") + self._wait_subtrees([('/1', 1), ('/1/a', 2)], status=self.status, rank=2) + + self.mount_a.setfattr("1/a", "ceph.dir.bal.mask", "-1") + self._wait_subtrees([('/1', 1)], status=self.status, rank=1) + + def test_mds_failover(self): + """ + That MDS failover does not affect the ceph.dir.bal.mask. + """ + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "2") + self._wait_xattrs(self.mount_a, "1", "ceph.dir.bal.mask", "2") + self._wait_subtrees([('/1', 2)], status=self.status) + + for i in range(5): + self.fs.rank_fail(rank=2) + self.status = self.fs.wait_for_daemons() + + value = self.mount_a.getfattr("1", "ceph.dir.bal.mask") + self.assertEqual(value, "2") + self._wait_subtrees([('/1', 2)], status=self.status) + + def test_mds_shrink(self): + """ + That ceph.dir.bal.mask is sustained during reducing MDS. + """ + + self.fs.set_max_mds(3) + self.status = self.fs.wait_for_daemons() + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "1") + self._wait_xattrs(self.mount_a, "1", "ceph.dir.bal.mask", "1") + self._wait_subtrees([('/1', 1)], status=self.status) + + self.fs.set_max_mds(2) + self.status = self.fs.wait_for_daemons() + self._wait_subtrees([('/1', 1)], status=self.status) + + def test_mds_shrink_and_grow(self): + """ + That ceph.dir.bal.mask is sustained during reducing and growing MDS. + """ + + self.fs.set_max_mds(3) + self.status = self.fs.wait_for_daemons() + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "2") + self._wait_xattrs(self.mount_a, "1", "ceph.dir.bal.mask", "2") + self._wait_subtrees([('/1', 2)], status=self.status) + + self.fs.set_max_mds(2) + self.status = self.fs.wait_for_daemons() + self._wait_subtrees([('/1', 0)], status=self.status) + + self.fs.set_max_mds(3) + self.status = self.fs.wait_for_daemons() + self._wait_subtrees([('/1', 2)], status=self.status) + + def test_mds_grow(self): + """ + That ceph.dir.bal.mask is sustained during growing MDS. 
+ """ + + self.fs.set_max_mds(2) + self.status = self.fs.wait_for_daemons() + + self.mount_a.setfattr("1", "ceph.dir.bal.mask", "1") + self._wait_xattrs(self.mount_a, "1", "ceph.dir.bal.mask", "1") + self._wait_subtrees([('/1', 1)], status=self.status, timeout=600) + + self.fs.set_max_mds(3) + self.status = self.fs.wait_for_daemons() + self._wait_subtrees([('/1', 1)], status=self.status, timeout=600) diff --git a/src/include/cephfs/types.h b/src/include/cephfs/types.h index 108878794f755..a3bfe132ca800 100644 --- a/src/include/cephfs/types.h +++ b/src/include/cephfs/types.h @@ -77,6 +77,7 @@ typedef int32_t mds_rank_t; constexpr mds_rank_t MDS_RANK_NONE = -1; constexpr mds_rank_t MDS_RANK_EPHEMERAL_DIST = -2; constexpr mds_rank_t MDS_RANK_EPHEMERAL_RAND = -3; +constexpr mds_rank_t MDS_RANK_MASK = -4; struct scatter_info_t { version_t version = 0; diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc index acddeb4f1d157..474a654ef6ecb 100644 --- a/src/mds/CDir.cc +++ b/src/mds/CDir.cc @@ -2867,6 +2867,12 @@ mds_rank_t CDir::get_export_pin(bool inherit) const return export_pin; } +std::string CDir::get_rank_mask(bool inherit) const +{ + CInode *in = inode->get_rank_mask_inode(inherit); + return in->get_bal_rank_mask_from_xattrs(); +} + bool CDir::is_exportable(mds_rank_t dest) const { mds_rank_t export_pin = get_export_pin(); diff --git a/src/mds/CDir.h b/src/mds/CDir.h index 215375ca297cd..43ca10f97b367 100644 --- a/src/mds/CDir.h +++ b/src/mds/CDir.h @@ -513,6 +513,7 @@ class CDir : public MDSCacheObject, public Counter { // -- import/export -- mds_rank_t get_export_pin(bool inherit=true) const; + std::string get_rank_mask(bool inherit=true) const; bool is_exportable(mds_rank_t dest) const; void encode_export(ceph::buffer::list& bl); diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc index c2ea2facbd0cb..e2ca4387fc549 100644 --- a/src/mds/CInode.cc +++ b/src/mds/CInode.cc @@ -489,6 +489,9 @@ CInode::projected_inode CInode::project_inode(const MutationRef& mut, void CInode::pop_and_dirty_projected_inode(LogSegment *ls, const MutationRef& mut) { ceph_assert(!projected_nodes.empty()); + + bool bal_rank_mask_updated = get_bal_rank_mask_from_xattrs(true) != get_bal_rank_mask_from_xattrs(false); + auto front = std::move(projected_nodes.front()); dout(15) << __func__ << " v" << front.inode->version << dendl; @@ -514,7 +517,7 @@ void CInode::pop_and_dirty_projected_inode(LogSegment *ls, const MutationRef& mu if (get_inode()->is_backtrace_updated()) mark_dirty_parent(ls, pool_updated); - if (pin_updated) + if (pin_updated || bal_rank_mask_updated) maybe_export_pin(true); } @@ -2098,6 +2101,8 @@ void CInode::encode_lock_ixattr(bufferlist& bl) void CInode::decode_lock_ixattr(bufferlist::const_iterator& p) { + std::string prev_bal_rank_mask = get_bal_rank_mask_from_xattrs(false); + ceph_assert(!is_auth()); auto _inode = allocate_inode(*get_inode()); DECODE_START(2, p); @@ -2112,6 +2117,9 @@ void CInode::decode_lock_ixattr(bufferlist::const_iterator& p) } DECODE_FINISH(p); reset_inode(std::move(_inode)); + + std::string bal_rank_mask = get_bal_rank_mask_from_xattrs(false); + maybe_export_pin(prev_bal_rank_mask != bal_rank_mask); } void CInode::encode_lock_isnap(bufferlist& bl) @@ -5340,6 +5348,8 @@ void CInode::queue_export_pin(mds_rank_t export_pin) target = export_pin; else if (export_pin == MDS_RANK_EPHEMERAL_RAND) target = mdcache->hash_into_rank_bucket(ino()); + else if (export_pin == MDS_RANK_MASK) + target = MDS_RANK_MASK; else target = MDS_RANK_NONE; @@ -5394,8 +5404,18 @@ void 
CInode::maybe_export_pin(bool update) dout(15) << __func__ << " update=" << update << " " << *this << dendl; mds_rank_t export_pin = get_export_pin(false); - if (export_pin == MDS_RANK_NONE && !update) + if (export_pin == MDS_RANK_NONE && !update) { return; + } + + if (export_pin == MDS_RANK_NONE) { + CInode *in = get_rank_mask_inode(false); + std::string bal_rank_mask = in->get_bal_rank_mask_from_xattrs(); + if (bal_rank_mask.size() == 0 && !update) { + return; + } + export_pin = MDS_RANK_MASK; + } check_pin_policy(export_pin); queue_export_pin(export_pin); @@ -5492,6 +5512,59 @@ void CInode::setxattr_ephemeral_dist(bool val) _get_projected_inode()->set_ephemeral_distributed_pin(val); } +std::string CInode::get_bal_rank_mask_from_xattrs(bool projected_node) +{ + const auto& pxattrs = (projected_node && !projected_nodes.empty()) + ? projected_nodes.front().xattrs : get_xattrs(); + if (pxattrs) { + auto it = pxattrs->find("ceph.dir.bal.mask"); + if (it != pxattrs->end()) { + std::string val(it->second.c_str(), it->second.length()); + return val; + } + } + return ""; +} + +CInode *CInode::get_rank_mask_inode(bool inherit) +{ + if (!mdcache->get_bal_export_pin()) + return this; + + CInode *in = this; + const CDir *dir = nullptr; + std::string bal_rank_mask; + + while (true) { + if (in->is_system()) { + break; + } + + const CDentry *pdn = in->get_parent_dn(); + if (!pdn) { + break; + } + + if (in->get_inode()->nlink == 0) { + // ignore export pin for unlinked directory + break; + } + + bal_rank_mask = in->get_bal_rank_mask_from_xattrs(); + if (bal_rank_mask.size()) { + break; + } + + if (!inherit) { + break; + } + dir = pdn->get_dir(); + in = dir->inode; + } + + return in; +} + void CInode::set_export_pin(mds_rank_t rank) { ceph_assert(is_dir()); diff --git a/src/mds/CInode.h b/src/mds/CInode.h index cf2322998e3e3..84fa1f982bba5 100644 --- a/src/mds/CInode.h +++ b/src/mds/CInode.h @@ -1008,6 +1008,7 @@ class CInode : public MDSCacheObject, public InodeStoreBase, public Counter #include +#include + using namespace std; #include "common/config.h" @@ -137,7 +139,50 @@ void MDBalancer::handle_conf_change(const std::set& changed, const bool MDBalancer::test_rank_mask(mds_rank_t rank) { - return mds->mdsmap->get_bal_rank_mask_bitset().test(rank); + return bal_rank_mask_bitset.test(rank); +} + +void MDBalancer::handle_rank_mask(void) +{ + dout(20) << " start " << dendl; + mds->mdsmap->update_num_mdss_in_rank_mask_bitset(); + std::vector authsubs = mds->mdcache->get_auth_subtrees(); + for (auto &cd : authsubs) { + if (cd->inode->is_system() || cd->inode->is_root()) { + continue; + } + + mds_rank_t export_pin = cd->inode->get_export_pin(false); + dout(7) << "export_pin " << export_pin << " " << cd->get_path() << dendl; + + if (export_pin != MDS_RANK_NONE && export_pin != MDS_RANK_MASK) + continue; + + max_mds_bitset_t rank_mask_bitset; + int r = get_rank_mask_bitset(cd, rank_mask_bitset); + if (r != 0) { + continue; + } + + dout(7) << "rank_mask_bitset " << bitmask_to_str(rank_mask_bitset) << " " << cd->get_path() << dendl; + ceph_assert(rank_mask_bitset.count()); + + mds_rank_t target = -1; + for (mds_rank_t mds_rank = 0; mds_rank < mds->mdsmap->get_max_mds(); mds_rank++) { + if (rank_mask_bitset.test(mds_rank)) { + target = mds_rank; + break; + } + } + + if (rank_mask_bitset.test(cd->authority().first) == false && + target != mds->get_nodeid() && + target >= 0 && + target < mds->mdsmap->get_max_mds()) { + dout(7) << "try to export nicely " << cd->get_path() << " auth " << cd->authority().first << " 
to " << target << " mask " << bitmask_to_str(rank_mask_bitset) << dendl; + mds->mdcache->migrator->export_dir_nicely(cd, target); + } + } } void MDBalancer::handle_export_pins(void) @@ -154,6 +199,12 @@ void MDBalancer::handle_export_pins(void) ceph_assert(in->is_dir()); mds_rank_t export_pin = in->get_export_pin(false); + if (export_pin == MDS_RANK_NONE) { + CInode *pin = in->get_rank_mask_inode(false); + std::string bal_rank_mask = pin->get_bal_rank_mask_from_xattrs(); + if (bal_rank_mask.size()) + export_pin = MDS_RANK_MASK; + } in->check_pin_policy(export_pin); if (export_pin >= max_mds) { @@ -166,7 +217,6 @@ void MDBalancer::handle_export_pins(void) continue; } - dout(20) << " executing export_pin=" << export_pin << " on " << *in << dendl; unsigned min_frag_bits = 0; mds_rank_t target = MDS_RANK_NONE; if (export_pin >= 0) @@ -175,6 +225,10 @@ void MDBalancer::handle_export_pins(void) target = mdcache->hash_into_rank_bucket(in->ino()); else if (export_pin == MDS_RANK_EPHEMERAL_DIST) min_frag_bits = mdcache->get_ephemeral_dist_frag_bits(); + else if (export_pin == MDS_RANK_MASK) + target = MDS_RANK_MASK; + + dout(20) << " executing export_pin=" << export_pin << " on " << *in << " target " << target << dendl; bool remove = true; for (auto&& dir : in->get_dirfrags()) { @@ -204,7 +258,7 @@ void MDBalancer::handle_export_pins(void) dir->state_clear(CDir::STATE_AUXSUBTREE); mds->mdcache->try_subtree_merge(dir); } - } else if (target == mds->get_nodeid()) { + } else if (target == mds->get_nodeid() || target == MDS_RANK_MASK) { if (dir->state_test(CDir::STATE_AUXSUBTREE)) { ceph_assert(dir->is_subtree_root()); } else if (dir->state_test(CDir::STATE_CREATING) || @@ -255,6 +309,8 @@ void MDBalancer::handle_export_pins(void) export_pin = mdcache->hash_into_rank_bucket(cd->ino(), cd->get_frag()); } else if (export_pin == MDS_RANK_EPHEMERAL_RAND) { export_pin = mdcache->hash_into_rank_bucket(cd->ino()); + } else if (export_pin == MDS_RANK_MASK) { + export_pin = MDS_RANK_MASK; } if (print_auth_subtrees) @@ -274,6 +330,7 @@ void MDBalancer::tick() if (bal_export_pin) { handle_export_pins(); + handle_rank_mask(); } // sample? 
@@ -506,6 +563,85 @@ void MDBalancer::send_heartbeat() } } +int MDBalancer::rank_mask_list_str_to_bitset(CInode *cur, std::string& rank_mask_list_str, max_mds_bitset_t& rank_mask_bitset, std::ostream& ss) +{ + typedef boost::tokenizer> tokenizer; + boost::char_separator sep{","}; + tokenizer tokens{rank_mask_list_str, sep}; + bool rank0_involved = false; + max_mds_bitset_t temp_mask_bitset; + + if (std::distance(tokens.begin(), tokens.end()) == 0) { + ss << "bad vxattr value, unable to parse string"; + return -EINVAL; + } + + for (const auto &token : tokens) { + try { + mds_rank_t rank = std::stoi(token); + if (cur->is_root() && rank == 0) + rank0_involved = true; + + if (rank < 0 || rank >= MAX_MDS) { + ss << "bad vxattr value, unable to parse string"; + return -EINVAL; + } + temp_mask_bitset.set(rank); + } catch (const std::invalid_argument& e) { + ss << "bad vxattr value, unable to parse string"; + return -EINVAL; + } + } + + if (cur->is_root() and rank0_involved == false) { + ss << "bad vxattr value, rank0 must be set for root dir"; + return -EINVAL; + } + + if (temp_mask_bitset.count() == 0) { + ss << "at least one rank must be set"; + return -EINVAL; + } + + rank_mask_bitset = temp_mask_bitset; + return 0; +} + +int MDBalancer::get_rank_mask_bitset(CDir *dir, max_mds_bitset_t& rank_mask_bitset, bool inherit) +{ + CInode *in = dir->inode->get_rank_mask_inode(inherit); + std::string bal_rank_mask = in->get_bal_rank_mask_from_xattrs(); + int r; + + if (in->is_root() && bal_rank_mask.size() == 0) { + rank_mask_bitset = mds->mdsmap->get_bal_rank_mask_bitset(); + dout(10) << " bal_rank_mask is obtained from mdsmap " << mds->mdsmap->get_num_mdss_in_rank_mask_bitset() << + " " << + mds->mdsmap->get_bal_rank_mask() << dendl; + if (rank_mask_bitset.count() == 0) { + dout(10) << "at least one rank must be set" << dendl; + r = -EINVAL; + } else { + r = 0; + } + } else { + dout(7) << dir->get_path() << " " << bal_rank_mask << dendl; + CachedStackStringStream css; + r = rank_mask_list_str_to_bitset(dir->get_inode(), bal_rank_mask, rank_mask_bitset, *css); + dout(10) << css->str() << dendl; + } + return r; +} + +std::string MDBalancer::bitmask_to_str(max_mds_bitset_t& bitmask) +{ + std::string mask_str = bitmask.to_string(); + std::reverse(mask_str.begin(), mask_str.end()); + mask_str.resize(mds->mdsmap->get_max_mds()); + std::reverse(mask_str.begin(), mask_str.end()); + return mask_str; +} + void MDBalancer::handle_heartbeat(const cref_t &m) { mds_rank_t who = mds_rank_t(m->get_source().num()); @@ -550,20 +686,49 @@ void MDBalancer::handle_heartbeat(const cref_t &m) } } - { - auto em = mds_load.emplace(std::piecewise_construct, std::forward_as_tuple(who), std::forward_as_tuple(m->get_load())); - if (!em.second) { - em.first->second = m->get_load(); + mds->mdsmap->update_num_mdss_in_rank_mask_bitset(); + + max_mds_bitset_t my_bal_rank_mask_bitset; + std::vector authsubs = mds->mdcache->get_auth_subtrees(); + int overlap_count = 0; + for (auto &cd : authsubs) { + max_mds_bitset_t temp_bitset; + int r = get_rank_mask_bitset(cd, temp_bitset); + if (r != 0) + continue; + + dout(10) << " authsubtree bitset mask "<< bitmask_to_str(temp_bitset) << " bit count " << temp_bitset.count() << " auth " << cd->authority().first << " " << cd->get_path() << dendl; + + if (temp_bitset.test(mds->get_nodeid())) { + my_bal_rank_mask_bitset |= temp_bitset; + overlap_count++; + if (overlap_count > 1) + dout(10) << " rank_mask overlapped subdir " << *cd << dendl; } } - mds_import_map[who] = m->get_import_map(); - 
mds->mdsmap->update_num_mdss_in_rank_mask_bitset(); + if (overlap_count > 1) { + dout(10) << " bal_rank_mask values of multiple subdirs have been overlapped, value: " << overlap_count << dendl; + } + + bal_rank_mask_bitset = my_bal_rank_mask_bitset; + num_mdss_in_rank_mask_bitset = bal_rank_mask_bitset.count(); - if (mds->mdsmap->get_num_mdss_in_rank_mask_bitset() > 0) + if (bal_rank_mask_bitset.test(who)) { + { + auto em = mds_load.emplace(std::piecewise_construct, std::forward_as_tuple(who), std::forward_as_tuple(m->get_load())); + if (!em.second) { + em.first->second = m->get_load(); + } + } + mds_import_map[who] = m->get_import_map(); + } + + if (num_mdss_in_rank_mask_bitset) { unsigned cluster_size = mds->get_mds_map()->get_num_in_mds(); - if (mds_load.size() == cluster_size) { + cluster_size = std::min(cluster_size, num_mdss_in_rank_mask_bitset); + if (mds_load.size() == cluster_size && cluster_size > 1) { // let's go! //export_empties(); // no! @@ -577,6 +742,8 @@ void MDBalancer::handle_heartbeat(const cref_t &m) } prep_rebalance(m->get_beat()); } + } else { + dout(10) << "Nothing to do because rank mast bitset are cleared "<< bitmask_to_str(bal_rank_mask_bitset) << dendl; } } @@ -792,6 +959,9 @@ void MDBalancer::prep_rebalance(int beat) double total_load = 0.0; multimap load_map; for (mds_rank_t i=mds_rank_t(0); i < mds_rank_t(cluster_size); i++) { + if (test_rank_mask(i) == false) { + continue; + } mds_load_t& load = mds_load.at(i); double l = load.mds_load(bal_mode) * load_fac; @@ -810,7 +980,7 @@ void MDBalancer::prep_rebalance(int beat) } // target load - target_load = total_load / (double)mds->mdsmap->get_num_mdss_in_rank_mask_bitset(); + target_load = total_load / (double)num_mdss_in_rank_mask_bitset; dout(7) << "my load " << my_load << " target " << target_load << " total " << total_load @@ -848,7 +1018,7 @@ void MDBalancer::prep_rebalance(int beat) for (multimap::iterator it = load_map.begin(); it != load_map.end(); ++it) { - if (it->first < target_load && test_rank_mask(it->second)) { + if (it->first < target_load) { dout(15) << " mds." 
<< it->second << " is importer" << dendl; importers.insert(pair(it->first,it->second)); importer_set.insert(it->second); @@ -882,6 +1052,9 @@ void MDBalancer::prep_rebalance(int beat) for (map::iterator im = mds_import_map[ex->second].begin(); im != mds_import_map[ex->second].end(); ++im) { + if (!test_rank_mask(im->first)) { + continue; + } double maxim = get_maxim(state, im->first, target_load); if (maxim <= .001) continue; try_match(state, ex->second, maxex, im->first, maxim); @@ -973,8 +1146,6 @@ int MDBalancer::mantle_prep_rebalance() return 0; } - - void MDBalancer::try_rebalance(balance_state_t& state) { if (g_conf()->mds_thrash_exports) { @@ -991,11 +1162,21 @@ void MDBalancer::try_rebalance(balance_state_t& state) CInode *diri = dir->get_inode(); if (diri->is_mdsdir()) continue; - if (diri->get_export_pin(false) != MDS_RANK_NONE) + mds_rank_t export_pin = diri->get_export_pin(false); + if (export_pin != MDS_RANK_NONE && export_pin != MDS_RANK_MASK) continue; if (dir->is_freezing() || dir->is_frozen()) continue; // export pbly already in progress + max_mds_bitset_t rank_mask_bitset; + int r = get_rank_mask_bitset(dir, rank_mask_bitset); + if (r != 0) + continue; + + if (!rank_mask_bitset.test(mds->get_nodeid())) { + continue; + } + mds_rank_t from = diri->authority().first; double pop = dir->pop_auth_subtree.meta_load(); const auto bal_idle_threshold = g_conf().get_val("mds_bal_idle_threshold"); diff --git a/src/mds/MDBalancer.h b/src/mds/MDBalancer.h index e10d671d9f06e..879031dc106d6 100644 --- a/src/mds/MDBalancer.h +++ b/src/mds/MDBalancer.h @@ -50,6 +50,7 @@ class MDBalancer { void tick(); void handle_export_pins(void); + void handle_rank_mask(void); void subtract_export(CDir *ex); void add_import(CDir *im); @@ -75,6 +76,10 @@ class MDBalancer { void handle_mds_failure(mds_rank_t who); int dump_loads(Formatter *f, int64_t depth = -1) const; + int get_rank_mask_bitset(CDir *dir, max_mds_bitset_t& rank_mask_bitset, bool inherit=true); + std::string bitmask_to_str(max_mds_bitset_t& bitmask); + void print_auth_subtrees(std::vector authsubs); + int rank_mask_list_str_to_bitset(CInode *cur, std::string& rank_mask_list_str, max_mds_bitset_t& rank_mask_bitset, std::ostream& ss); bool get_bal_export_pin() const { return bal_export_pin; @@ -184,5 +189,9 @@ class MDBalancer { // per-epoch state double my_load = 0; double target_load = 0; + + // rank mask + max_mds_bitset_t bal_rank_mask_bitset; + uint32_t num_mdss_in_rank_mask_bitset; }; #endif diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc index 83ad5756360eb..283b7feda89ba 100644 --- a/src/mds/MDCache.cc +++ b/src/mds/MDCache.cc @@ -154,6 +154,7 @@ MDCache::MDCache(MDSRank *m, PurgeQueue &purge_queue_) : export_ephemeral_distributed_config = g_conf().get_val("mds_export_ephemeral_distributed"); export_ephemeral_random_config = g_conf().get_val("mds_export_ephemeral_random"); export_ephemeral_random_max = g_conf().get_val("mds_export_ephemeral_random_max"); + bal_export_pin = g_conf().get_val("mds_bal_export_pin"); symlink_recovery = g_conf().get_val("mds_symlink_recovery"); kill_dirfrag_at = static_cast(g_conf().get_val("mds_kill_dirfrag_at")); @@ -209,6 +210,9 @@ void MDCache::handle_conf_change(const std::set& changed, const MDS if (changed.count("mds_export_ephemeral_random_max")) { export_ephemeral_random_max = g_conf().get_val("mds_export_ephemeral_random_max"); } + if (changed.count("mds_bal_export_pin")) { + bal_export_pin = g_conf().get_val("mds_bal_export_pin"); + } if (changed.count("mds_kill_dirfrag_at")) { 
kill_dirfrag_at = static_cast(g_conf().get_val("mds_kill_dirfrag_at")); @@ -1002,12 +1006,23 @@ void MDCache::try_subtree_merge(CDir *dir) auto oldbounds = subtrees.at(dir); set to_eval; - // try merge at my root - try_subtree_merge_at(dir, &to_eval); + + max_mds_bitset_t rank_mask_bitset; + int r = mds->balancer->get_rank_mask_bitset(dir, rank_mask_bitset, false); + if (r != 0) { + // try merge at my root + try_subtree_merge_at(dir, &to_eval); + } // try merge at my old bounds - for (auto bound : oldbounds) + for (auto bound : oldbounds) { + rank_mask_bitset.reset(); + int r = mds->balancer->get_rank_mask_bitset(bound, rank_mask_bitset, false); + if (r == 0 && rank_mask_bitset.count()) { + continue; + } try_subtree_merge_at(bound, &to_eval); + } if (!(mds->is_any_replay() || mds->is_resolve())) { for(auto in : to_eval) @@ -1032,8 +1047,15 @@ void MDCache::try_subtree_merge_at(CDir *dir, set *to_eval, bool adjust if (!dir->inode->is_base()) parent = get_subtree_root(dir->get_parent_dir()); + max_mds_bitset_t rank_mask_bitset; + int r = mds->balancer->get_rank_mask_bitset(dir, rank_mask_bitset, false); + bool has_mask = false; + if (r == 0 && rank_mask_bitset.count()) { + has_mask = true; + } + if (parent != dir && // we have a parent, - parent->dir_auth == dir->dir_auth) { // auth matches, + parent->dir_auth == dir->dir_auth && has_mask == false) { // auth matches, // merge with parent. dout(10) << " subtree merge at " << *dir << dendl; dir->set_dir_auth(CDIR_AUTH_DEFAULT); diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h index 18c848d941c76..f04b0e0fe9fba 100644 --- a/src/mds/MDCache.h +++ b/src/mds/MDCache.h @@ -255,6 +255,10 @@ class MDCache { return symlink_recovery; } + bool get_bal_export_pin(void) const { + return bal_export_pin; + } + /** * Call this when you know that a CDentry is ready to be passed * on to StrayManager (i.e. 
this is a stray you've just created) @@ -1509,6 +1513,7 @@ class MDCache { bool export_ephemeral_distributed_config; bool export_ephemeral_random_config; unsigned export_ephemeral_dist_frag_bits; + bool bal_export_pin; // Stores the symlink target on the file object's head bool symlink_recovery; diff --git a/src/mds/MDSMap.cc b/src/mds/MDSMap.cc index cd5cb3a98a7b5..33d4a71d60885 100644 --- a/src/mds/MDSMap.cc +++ b/src/mds/MDSMap.cc @@ -1207,7 +1207,7 @@ void MDSMap::set_min_compat_client(ceph_release_t version) required_client_features = feature_bitset_t(bits); } -const std::bitset& MDSMap::get_bal_rank_mask_bitset() const { +const max_mds_bitset_t& MDSMap::get_bal_rank_mask_bitset() const { return bal_rank_mask_bitset; } @@ -1238,7 +1238,7 @@ void MDSMap::update_num_mdss_in_rank_mask_bitset() r = hex2bin(bal_rank_mask, bin_string, MAX_MDS, *css); if (r == 0) { - auto _mds_bal_mask_bitset = std::bitset(bin_string); + auto _mds_bal_mask_bitset = max_mds_bitset_t(bin_string); bal_rank_mask_bitset = _mds_bal_mask_bitset; num_mdss_in_rank_mask_bitset = _mds_bal_mask_bitset.count(); } else { @@ -1247,14 +1247,16 @@ void MDSMap::update_num_mdss_in_rank_mask_bitset() } if (r == -EINVAL) { + bal_rank_mask_bitset.reset(); if (check_special_bal_rank_mask(bal_rank_mask, BAL_RANK_MASK_TYPE_NONE)) { dout(10) << "Balancer is disabled with bal_rank_mask " << bal_rank_mask << dendl; - bal_rank_mask_bitset.reset(); num_mdss_in_rank_mask_bitset = 0; } else { dout(10) << "Balancer distributes mds workloads to all ranks as bal_rank_mask is empty or invalid" << dendl; - bal_rank_mask_bitset.set(); - num_mdss_in_rank_mask_bitset = get_max_mds(); + for (mds_rank_t mds_rank = 0; mds_rank < get_max_mds(); mds_rank++) { + bal_rank_mask_bitset.set(mds_rank); + } + num_mdss_in_rank_mask_bitset = bal_rank_mask_bitset.count(); } } diff --git a/src/mds/MDSMap.h b/src/mds/MDSMap.h index 9ba6377da3f43..131875a638b3a 100644 --- a/src/mds/MDSMap.h +++ b/src/mds/MDSMap.h @@ -296,11 +296,12 @@ class MDSMap { const std::string get_balancer() const { return balancer; } void set_balancer(std::string val) { balancer.assign(val); } - const std::bitset& get_bal_rank_mask_bitset() const; + const max_mds_bitset_t& get_bal_rank_mask_bitset() const; void set_bal_rank_mask(std::string val); unsigned get_num_mdss_in_rank_mask_bitset() const { return num_mdss_in_rank_mask_bitset; } void update_num_mdss_in_rank_mask_bitset(); int hex2bin(std::string hex_string, std::string &bin_string, unsigned int max_bits, std::ostream& ss) const; + std::string get_bal_rank_mask() const { return bal_rank_mask; } typedef enum { @@ -695,7 +696,7 @@ class MDSMap { std::string balancer; /* The name/version of the mantle balancer (i.e. the rados obj name) */ std::string bal_rank_mask = "-1"; - std::bitset bal_rank_mask_bitset; + max_mds_bitset_t bal_rank_mask_bitset; uint32_t num_mdss_in_rank_mask_bitset; std::set in; // currently defined cluster diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc index 91e7d4a7d556b..c11f5299d1543 100644 --- a/src/mds/MDSRank.cc +++ b/src/mds/MDSRank.cc @@ -3243,6 +3243,7 @@ void MDSRank::command_get_subtrees(Formatter *f) f->dump_int("export_pin", export_pin >= 0 ? 
export_pin : -1); f->dump_bool("distributed_ephemeral_pin", export_pin == MDS_RANK_EPHEMERAL_DIST); f->dump_bool("random_ephemeral_pin", export_pin == MDS_RANK_EPHEMERAL_RAND); + f->dump_string("bal_rank_mask", dir->inode->get_bal_rank_mask_from_xattrs()); } f->dump_int("export_pin_target", dir->get_export_pin(false)); f->open_object_section("dir"); diff --git a/src/mds/Server.cc b/src/mds/Server.cc index 011718aa8c96b..6a2654e9c534c 100644 --- a/src/mds/Server.cc +++ b/src/mds/Server.cc @@ -18,6 +18,7 @@ #include #include #include +#include #include "MDSRank.h" #include "Server.h" @@ -6502,6 +6503,13 @@ const Server::XattrHandler Server::xattr_handlers[] = { setxattr: &Server::mirror_info_setxattr_handler, removexattr: &Server::mirror_info_removexattr_handler }, + { + xattr_name: "ceph.dir.bal.mask", + description: "balancer mask xattr handler", + validate: &Server::bal_rank_mask_xattr_validate, + setxattr: &Server::bal_rank_mask_setxattr_handler, + removexattr: &Server::default_removexattr_handler + }, }; const Server::XattrHandler* Server::get_xattr_or_default_handler(std::string_view xattr_name) { @@ -6670,6 +6678,39 @@ void Server::mirror_info_removexattr_handler(CInode *cur, InodeStoreBase::xattr_ xattr_rm(xattrs, Server::MirrorXattrInfo::FS_ID); } +int Server::bal_rank_mask_xattr_validate(CInode *cur, const InodeStoreBase::xattr_map_const_ptr xattrs, + XattrOp *xattr_op) { + int r = xattr_validate(cur, xattrs, xattr_op->xattr_name, xattr_op->op, xattr_op->flags); + if (r < 0) + return r; + + if (xattr_op->op == CEPH_MDS_OP_RMXATTR) { + return 0; + } + + std::string value = xattr_op->xattr_value.to_str(); + if (value == "-1") + return 0; + + CachedStackStringStream css; + max_mds_bitset_t rank_mask_bitset; + r = mds->balancer->rank_mask_list_str_to_bitset(cur, value, rank_mask_bitset, *css); + if (r != 0) { + dout(10) << __func__ << " invalid value " << css->str() << dendl; + return -CEPHFS_EINVAL; + } + + return 0; +} + +void Server::bal_rank_mask_setxattr_handler(CInode *cur, InodeStoreBase::xattr_map_ptr xattrs, + const XattrOp &xattr_op) { + if (xattr_op.xattr_value.to_str() != "-1") + xattr_set(xattrs, xattr_op.xattr_name, xattr_op.xattr_value); + else + xattr_rm(xattrs, xattr_op.xattr_name); +} + void Server::handle_client_setxattr(const MDRequestRef& mdr) { const cref_t &req = mdr->client_request; diff --git a/src/mds/Server.h b/src/mds/Server.h index 68842ea01cbeb..4a54e41baa2a2 100644 --- a/src/mds/Server.h +++ b/src/mds/Server.h @@ -438,6 +438,10 @@ class Server { const XattrOp &xattr_op); void mirror_info_removexattr_handler(CInode *cur, InodeStoreBase::xattr_map_ptr xattrs, const XattrOp &xattr_op); + int bal_rank_mask_xattr_validate(CInode *cur, const InodeStoreBase::xattr_map_const_ptr xattrs, + XattrOp *xattr_op); + void bal_rank_mask_setxattr_handler(CInode *cur, InodeStoreBase::xattr_map_ptr xattrs, + const XattrOp &xattr_op); static bool is_ceph_vxattr(std::string_view xattr_name) { return xattr_name.rfind("ceph.dir.layout", 0) == 0 || @@ -484,7 +488,8 @@ class Server { } return xattr_name == "ceph.mirror.info" || - xattr_name == "ceph.mirror.dirty_snap_id"; + xattr_name == "ceph.mirror.dirty_snap_id" || + xattr_name == "ceph.dir.bal.mask"; } void reply_client_request(const MDRequestRef& mdr, const ref_t &reply); diff --git a/src/mds/journal.cc b/src/mds/journal.cc index e080b11761050..ade66d22f9786 100644 --- a/src/mds/journal.cc +++ b/src/mds/journal.cc @@ -370,7 +370,8 @@ void EMetaBlob::add_dir_context(CDir *dir, int mode) !dir->is_ambiguous_dir_auth() && 
!dir->state_test(CDir::STATE_EXPORTBOUND) && !dir->state_test(CDir::STATE_AUXSUBTREE) && - !diri->state_test(CInode::STATE_AMBIGUOUSAUTH)) { + !diri->state_test(CInode::STATE_AMBIGUOUSAUTH) && + dir->get_rank_mask(false).size() == 0) { dout(0) << "EMetaBlob::add_dir_context unexpected subtree " << *dir << dendl; ceph_abort(); } diff --git a/src/mds/mdstypes.h b/src/mds/mdstypes.h index 17a5bf7acae78..9abd53d7b66aa 100644 --- a/src/mds/mdstypes.h +++ b/src/mds/mdstypes.h @@ -151,6 +151,8 @@ inline std::ostream& operator<<(std::ostream &out, const vinodeno_t &vino) { typedef uint32_t damage_flags_t; +typedef std::bitset max_mds_bitset_t; + template class Allocator> using alloc_string = std::basic_string,Allocator>;