mds: add ceph.dir.bal.mask vxattr for MDS Balancer #52373

Open: wants to merge 3 commits into base: main

150 changes: 150 additions & 0 deletions doc/cephfs/multimds.rst
@@ -268,6 +268,8 @@ such as with the ``bal_rank_mask`` setting (described below). Careful
monitoring of the file system performance and MDS is advised.


.. _intro-bal-rank-mask:

Dynamic subtree partitioning with Balancer on specific ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -311,3 +313,151 @@ the ``bal_rank_mask`` should be set to ``0x0``. For example:
.. prompt:: bash #

ceph fs set <fs_name> bal_rank_mask 0x0

Dynamically partitioning directory trees with vxattr on specific ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``bal_rank_mask`` file system setting (see also :ref:`intro-bal-rank-mask`)
can be overridden by configuring the ``ceph.dir.bal.mask`` vxattr. Notably,
``ceph.dir.bal.mask`` can isolate particular subdirectories to dedicated groups
of MDS ranks, whereas ``bal_rank_mask`` only allows isolation at the file
system level, that is, for the root directory ``/``. It is therefore a valuable
tool for fine-tuning MDS performance in various scenarios.


One scenario is a large directory that is challenging for a single MDS rank to
handle efficiently. For example, the ``/usr`` directory contains many
directories for multiple users. In most cases, static pinning is simply used to
distribute the subdirectories of ``/usr`` across multiple ranks. However, the
``/usr/share`` directory presents a unique situation: it has many
subdirectories, each containing a large number of files, often exceeding a
million. In an example such as ``/usr/share/images``, processing the directory
on a single MDS is inappropriate because of insufficient MDS resources such as
MDS cache memory. The ``ceph.dir.bal.mask`` vxattr can instead dynamically
balance the workload of this large directory within a specific subset of ranks.


The ``ceph.dir.bal.mask`` vxattr can also be a useful option for fine-tuning
performance, particularly in scenarios involving large directories such as
``/usr/share/images`` and ``/usr/share/backups`` within the file system.

.. note:: The examples here use the ``/usr/share`` directory.
   Depending on the user's preference, the same approach can be applied to
   other directories, such as ``/mnt/cephfs``.

While the file system setting ``bal_rank_mask`` isolates the entire ``/``
directory to specific ranks, the workloads within it can affect each other
through migration overhead. For example, if the file system setting
``bal_rank_mask`` is set to ``0xf`` and large directories such as
``/usr/share/images`` and ``/usr/share/backups`` exist, a sudden load increase
on ``/usr/share/images`` causes its metadata to be redistributed across ranks
0 to 3. Users of ``/usr/share/backups`` may then be unnecessarily affected by
this noisy neighbor. By distributing the two directories to the MDS rank group
1,2 and the rank group 3,4 respectively through ``ceph.dir.bal.mask``, metadata
service can be provided without affecting the performance of other concurrent
workloads.


This option can be set via:

::

setfattr -n ceph.dir.bal.mask -v 1,2 /usr/share/images
setfattr -n ceph.dir.bal.mask -v 3,4 /usr/share/backups

.. note:: The file system setting ``balance_automate`` must be set to ``true``
   for this option to take effect.
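
If it is not already enabled, ``balance_automate`` can be turned on at the
file system level (a minimal sketch; replace ``<fs_name>`` with the file
system name):

::

ceph fs set <fs_name> balance_automate true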

``/usr/share/images`` and ``/usr/share/backups`` are then distributed within
MDS ranks 1 and 2, and ranks 3 and 4, respectively.
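
The value currently applied to a directory can be read back with ``getfattr``
(a brief sketch; the vxattr is read the same way as ``ceph.dir.pin``):

::

getfattr -n ceph.dir.bal.mask /usr/share/images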

Similar to ``ceph.dir.pin``, the ``ceph.dir.bal.mask`` vxattr is inherited from
the closest parent. Configuring ``ceph.dir.bal.mask`` on a directory
consequently affects all of its descendants. However, the parent's
``ceph.dir.bal.mask`` can be overridden by setting a value on a child
directory. For example:

::

mkdir -p /usr/share/images/jpg
setfattr -n ceph.dir.bal.mask -v 1,2 /usr/share/images
# /usr/share/images and /usr/share/images/jpg are now freely moved among mds ranks 1 and 2
setfattr -n ceph.dir.bal.mask -v 3,4 /usr/share/images/jpg
# /usr/share/images/jpg is now moved among mds ranks 3 and 4

The option can be unset via:

::

setfattr -x ceph.dir.bal.mask /usr/share/images/jpg

The ``ceph.dir.bal.mask`` value for ``/usr/share/images/jpg`` is then unset and
replaced with the value inherited from its nearest parent. If
``ceph.dir.bal.mask`` for ``/usr/share/images`` is configured with a valid
value, ``/usr/share/images/jpg`` will be distributed according to its parent's
setting.

This can override the file system setting ``bal_rank_mask``. For example:

::

ceph fs set cephfs bal_rank_mask 0xf


Initially, the balancer dynamically partitions the file system among MDS ranks
0 to 3 because ``bal_rank_mask`` is set to ``0xf``.

If ``ceph.dir.bal.mask`` for the root ``/`` directory is set to ``0,1,2,3,4,5``,
the ``bal_rank_mask`` is overridden. In this way, the balancer dynamically
distributes the unpinned subtrees of the root ``/`` directory among MDS ranks
0 to 5.
[Review comment thread]

Member: Should bal_rank_mask be deprecated?

Contributor Author: bal_rank_mask does not need to be deprecated. If detailed
tuning is required according to the user's preference, the ceph.dir.bal.mask
xattr can be used. Otherwise, bal_rank_mask can be used for simplicity.

Member: Perhaps add a note about that, literally:

    .. note:: Users may prefer to use the file system setting `bal_rank_mask` ...

Contributor Author:

    .. note:: Users may prefer to use the file system setting `bal_rank_mask` for simplicity.

::

setfattr -n ceph.dir.bal.mask -v 0,1,2,3,4,5 /

A child's ``ceph.dir.bal.mask`` overrides an inherited parent ``ceph.dir.pin``,
and vice versa. For example:

::

mkdir -p /usr/share/images/jpg
setfattr -n ceph.dir.pin -v 1 /usr/share/images
setfattr -n ceph.dir.bal.mask -v 2,3 /usr/share/images/jpg

The ``/usr/share/images`` directory will be pinned to rank 1, while the
``/usr/share/images/jpg`` directory may be dynamically split across ranks 2 and 3.
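
Where each subtree actually ends up can be inspected through the MDS admin
socket, for example with the ``get subtrees`` command (a sketch; the exact
field names in the subtree dump may differ between releases):

::

ceph tell mds.<fs_name>:0 get subtrees | jq '.[] | [.dir.path, .auth_first]'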

While ``ceph.dir.bal.mask`` behaves similarly to ``ceph.dir.pin`` when it
specifies only one MDS rank, its distinct advantage lies in its flexibility.
Unlike ``ceph.dir.pin``, ``ceph.dir.bal.mask`` can later be expanded to more
MDS ranks or reduced to fewer. Therefore, in scenarios that may require
expansion to multiple MDS ranks, it is recommended to employ
``ceph.dir.bal.mask``.

::

# Works the same way as setfattr -n ceph.dir.pin -v 1 /usr/share/images
setfattr -n ceph.dir.bal.mask -v 1 /usr/share/images

# If expansion is necessary, multiple mds ranks can be specified.
setfattr -n ceph.dir.bal.mask -v 1,2,3 /usr/share/images

If ``max_mds`` shrinks so that the ranks specified in ``ceph.dir.bal.mask`` no
longer exist, the affected subtree is moved to MDS rank 0. Therefore,
``max_mds`` should be adjusted carefully because MDS performance may decrease.

::

# The file system is operating with mds ranks 0 to 3.
ceph fs set cephfs max_mds 4

# The subtree is moved among mds ranks 2 and 3.
setfattr -n ceph.dir.bal.mask -v 2,3 /usr/share/images

# The file system is operating with mds ranks 0 and 1.
ceph fs set cephfs max_mds 2

# The subtree is now moved to mds rank 0.

**Restrictions**: Since the inode of the root directory is anchored to MDS rank
0, MDS rank 0 must be included when adjusting ``ceph.dir.bal.mask`` for the
``/`` directory.

::

setfattr -n ceph.dir.bal.mask -v 1,2 /
setfattr: mnt: Invalid argument
# failed with invalid argument error

setfattr -n ceph.dir.bal.mask -v 0,1,2 /
# success
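
To return the root directory to the file system-level ``bal_rank_mask``
behavior, the vxattr can simply be removed again, as in the unset example
above (a sketch):

::

setfattr -x ceph.dir.bal.mask /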
25 changes: 23 additions & 2 deletions qa/tasks/cephfs/cephfs_test_case.py
@@ -330,7 +330,7 @@ def delete_mds_coredump(self, daemon_id):

def _get_subtrees(self, status=None, rank=None, path=None):
if path is None:
path = "/"
path = "/" # everything below root but not root
try:
with contextutil.safe_while(sleep=1, tries=3) as proceed:
while proceed():
@@ -343,7 +343,16 @@ def _get_subtrees(self, status=None, rank=None, path=None):
subtrees += s
else:
subtrees = self.fs.rank_asok(["get", "subtrees"], status=status, rank=rank)
subtrees = filter(lambda s: s['dir']['path'].startswith(path), subtrees)
def match(s):
if path == "" and s['dir']['path'] == "":
return True
elif path == "" and s['dir']['path'].startswith("/"):
return True
elif path != "" and s['dir']['path'].startswith(path):
return True
else:
return False
subtrees = filter(match, subtrees)
return list(subtrees)
except CommandFailedError as e:
# Sometimes we get transient errors
@@ -355,6 +364,8 @@ raise RuntimeError(f"could not get subtree state from rank {rank}") from e
raise RuntimeError(f"could not get subtree state from rank {rank}") from e

def _wait_subtrees(self, test, status=None, rank=None, timeout=30, sleep=2, action=None, path=None):
log.info(f'_wait_subtrees test={test} status={status} rank={rank} timeout={timeout} '
f'sleep={sleep} action={action} path={path}')
test = sorted(test)
try:
with contextutil.safe_while(sleep=sleep, tries=timeout//sleep) as proceed:
@@ -407,6 +418,16 @@ def _wait_random_subtrees(self, count, status=None, rank=None, path=None):
except contextutil.MaxWhileTries as e:
raise RuntimeError("rank {0} failed to reach desired subtree state".format(rank)) from e

def _wait_xattrs(self, mount, path, key, exp):
try:
with contextutil.safe_while(sleep=5, tries=20) as proceed:
while proceed():
value = mount.getfattr(path, key)
if value == exp:
return value
except contextutil.MaxWhileTries as e:
raise RuntimeError("client failed to get desired value with key {0}".format(key)) from e

def create_client(self, client_id, moncap=None, osdcap=None, mdscap=None):
if not (moncap or osdcap or mdscap):
if self.fs:
3 changes: 3 additions & 0 deletions qa/tasks/cephfs/filesystem.py
@@ -633,6 +633,9 @@ def set_allow_new_snaps(self, yes):
def set_bal_rank_mask(self, bal_rank_mask):
self.set_var("bal_rank_mask", bal_rank_mask)

def set_balance_automate(self, yes):
self.set_var("balance_automate", yes)

def set_refuse_client_session(self, yes):
self.set_var("refuse_client_session", yes)
