cephfs: improve usability of growing/shrinking a multimds cluster #16608

Merged
merged 17 commits on Apr 18, 2018
32 changes: 26 additions & 6 deletions PendingReleaseNotes
@@ -13,10 +13,10 @@
(even as standby). Operators may ignore the error messages and continue
upgrading/restarting or follow this upgrade sequence:

Reduce the number of ranks to 1 (`ceph fs set <fs_name> max_mds 1`),
deactivate all other ranks (`ceph mds deactivate <fs_name>:<n>`), shutdown
standbys leaving the one active MDS, upgrade the single active MDS, then
upgrade/start standbys. Finally, restore the previous max_mds.
Reduce the number of ranks to 1 (`ceph fs set <fs_name> max_mds 1`), wait
for all other MDS daemons to deactivate, leaving the one active MDS, upgrade
the single active MDS, then upgrade/start the standbys. Finally, restore the
previous max_mds.
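
A minimal sketch of that sequence (the systemd commands and the restored
max_mds value are illustrative; see doc/cephfs/upgrading.rst for details)::

    ceph fs set <fs_name> max_mds 1
    ceph status                      # wait until only rank 0 remains active
    systemctl stop ceph-mds.target   # on standby hosts
    # upgrade and restart the single active MDS, then upgrade/start standbys
    ceph fs set <fs_name> max_mds <previous_max_mds>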

See also: https://tracker.ceph.com/issues/23172

@@ -27,12 +27,32 @@
- mds stop -> mds deactivate
- mds set_max_mds -> fs set max_mds
- mds set -> fs set
- mds cluster_down -> fs set cluster_down true
- mds cluster_up -> fs set cluster_down false
- mds cluster_down -> fs set joinable false
- mds cluster_up -> fs set joinable true
- mds add_data_pool -> fs add_data_pool
- mds remove_data_pool -> fs rm_data_pool
- mds rm_data_pool -> fs rm_data_pool

* As the multiple MDS feature is now standard, it is now enabled by
default. ceph fs set allow_multimds is now deprecated and will be
removed in a future release.

* As the directory fragmentation feature is now standard, it is now
enabled by default. ceph fs set allow_dirfrags is now deprecated and
will be removed in a future release.

* MDS daemons now activate and deactivate based on the value of
max_mds. Accordingly, ceph mds deactivate has been deprecated as it
is now redundant.
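
  For example, instead of deactivating ranks explicitly, an operator now just
  lowers max_mds and the cluster deactivates the extra ranks (a sketch)::

    ceph fs set <fs_name> max_mds 1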

* Taking a CephFS cluster down is now done by setting the down flag which
deactivates all MDS. For example: `ceph fs set cephfs down true`.

* Preventing standbys from joining as new actives (formerly the now
deprecated cluster_down flag) on a file system is now accomplished by
setting the joinable flag. This is useful mostly for testing so that a
file system may be quickly brought down and deleted.
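
  For example (the file system name is illustrative)::

    ceph fs set cephfs joinable false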

* New CephFS file system attributes session_timeout and session_autoclose
are configurable via `ceph fs set`. The MDS config options
mds_session_timeout, mds_session_autoclose, and mds_max_file_size are now
80 changes: 50 additions & 30 deletions doc/cephfs/administration.rst
@@ -77,23 +77,59 @@ to enumerate the objects during operations like stats or deletes.
Taking the cluster down
-----------------------

Taking a CephFS cluster down is done by reducing the number of ranks to 1,
setting the cluster_down flag, and then failing the last rank. For example:
Taking a CephFS cluster down is done by setting the down flag:

::

fs set <fs_name> down true

To bring the cluster back online:

::

fs set <fs_name> down false

This will also restore the previous value of max_mds. MDS daemons are brought
down in a way such that journals are flushed to the metadata pool and all
client I/O is stopped.
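
For example, assuming a file system named ``cephfs`` (the name is
illustrative)::

    ceph fs set cephfs down true
    ceph status                    # ranks flush their journals and stop
    ceph fs set cephfs down false  # bring the cluster back online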


Taking the cluster down rapidly for deletion or disaster recovery
-----------------------------------------------------------------

To allow rapidly deleting a file system (for testing) or to quickly bring MDS
daemons down, the operator may also set a flag to prevent standbys from
activating on the file system. This is done using the ``joinable`` flag:

::

fs set <fs_name> joinable false

Then the operator can fail all of the ranks which causes the MDS daemons to
respawn as standbys. The file system will be left in a degraded state.

::
ceph fs set <fs_name> max_mds 1
ceph mds deactivate <fs_name>:1 # rank 2 of 2
ceph status # wait for rank 1 to finish stopping
ceph fs set <fs_name> cluster_down true
ceph mds fail <fs_name>:0

Setting the ``cluster_down`` flag prevents standbys from taking over the failed
rank.
# For all ranks, 0-N:
mds fail <fs_name>:<n>

Once all ranks are inactive, the file system may also be deleted or left in
this state for other purposes (perhaps disaster recovery).
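
For example, a rapid teardown of a two-rank file system might look like this
(the file system name, rank count, and final deletion are illustrative)::

    ceph fs set cephfs joinable false
    ceph mds fail cephfs:0
    ceph mds fail cephfs:1
    ceph fs rm cephfs --yes-i-really-mean-it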


Daemons
-------

These commands act on specific mds daemons or ranks.
Most commands manipulating MDSs take a ``<role>`` argument which can take one
of three forms:

::

<fs_name>:<rank>
<fs_id>:<rank>
<rank>

Commands to manipulate MDS daemons:


::

@@ -108,29 +144,13 @@ If the MDS daemon was in reality still running, then using ``mds fail``
will cause the daemon to restart. If it was active and a standby was
available, then the "failed" daemon will return as a standby.
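
For example, each of the following fails rank 0 of the same file system using
one of the three ``<role>`` forms (the file system name and id are
illustrative)::

    ceph mds fail cephfs:0    # <fs_name>:<rank>
    ceph mds fail 1:0         # <fs_id>:<rank>
    ceph mds fail 0           # <rank>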

::

mds deactivate <role>

Deactivate an MDS, causing it to flush its entire journal to
backing RADOS objects and close all open client sessions. Deactivating an MDS
is primarily intended for bringing down a rank after reducing the number of
active MDS (max_mds). Once the rank is deactivated, the MDS daemon will rejoin the
cluster as a standby.
``<role>`` can take one of three forms:

::

<fs_name>:<rank>
<fs_id>:<rank>
<rank>

Use ``mds deactivate`` in conjunction with adjustments to ``max_mds`` to
shrink an MDS cluster. See :doc:`/cephfs/multimds`

::
tell mds.<daemon name> command ...

tell mds.<daemon name>
Send a command to the MDS daemon(s). Use ``mds.*`` to send a command to all
daemons. Use ``ceph tell mds.* help`` to learn available commands.

::

@@ -208,5 +228,5 @@ These legacy commands are obsolete and no longer usable post-Luminous.
mds remove_data_pool # replaced by "fs rm_data_pool"
mds set # replaced by "fs set"
mds set_max_mds # replaced by "fs set max_mds"
mds stop # replaced by "mds deactivate"
mds stop # obsolete

4 changes: 0 additions & 4 deletions doc/cephfs/dirfrags.rst
@@ -25,10 +25,6 @@ fragments may be *merged* to reduce the number of fragments in the directory.
Splitting and merging
=====================

An MDS will only consider doing splits if the allow_dirfrags setting is true in
the file system map (set on the mons). This setting is true by default since
the *Luminous* release (12.2.X).

When an MDS identifies a directory fragment to be split, it does not
do the split immediately. Because splitting interrupts metadata IO,
a short delay is used to allow short bursts of client IO to complete
68 changes: 25 additions & 43 deletions doc/cephfs/multimds.rst
@@ -27,17 +27,13 @@ are those with many clients, perhaps working on many separate directories.
Increasing the MDS active cluster size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each CephFS filesystem has a *max_mds* setting, which controls
how many ranks will be created. The actual number of ranks
in the filesystem will only be increased if a spare daemon is
available to take on the new rank. For example, if there is only one MDS daemon running, and max_mds is set to two, no second rank will be created.

Before ``max_mds`` can be increased, the ``allow_multimds`` flag must be set.
The following command sets this flag for a filesystem instance.

::

# ceph fs set <fs_name> allow_multimds true --yes-i-really-mean-it
Each CephFS filesystem has a *max_mds* setting, which controls how many ranks
will be created. The actual number of ranks in the filesystem will only be
increased if a spare daemon is available to take on the new rank. For example,
if there is only one MDS daemon running, and max_mds is set to two, no second
rank will be created. (Note that such a configuration is not Highly Available
(HA) because no standby is available to take over for a failed rank. The
cluster will complain via health warnings when configured this way.)
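
For example, growing a file system to two active ranks is a single setting
(a minimal sketch; the file system name is a placeholder)::

    ceph fs set <fs_name> max_mds 2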

Set ``max_mds`` to the desired number of ranks. In the following examples
the "fsmap" line of "ceph status" is shown to illustrate the expected
@@ -63,7 +59,7 @@ requires standby daemons** to take over if any of the servers running
an active daemon fail.

Consequently, the practical maximum of ``max_mds`` for highly available systems
is one less than the total number of MDS servers in your system.
is at most one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the
number of standby daemons in the system to match the number of server failures
@@ -72,49 +68,35 @@ you wish to withstand.
Decreasing the number of ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All ranks, including the rank(s) to be removed must first be active. This
means that you must have at least max_mds MDS daemons available.

First, set max_mds to a lower number, for example we might go back to
having just a single active MDS:
Reducing the number of ranks is as simple as reducing ``max_mds``:

::

# fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
ceph fs set <fs_name> max_mds 1
# fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:active}, 1 up:standby

Note that we still have two active MDSs: the ranks still exist even though
we have decreased max_mds, because max_mds only restricts creation
of new ranks.

Next, use the ``ceph mds deactivate <role>`` command to remove the
unneeded rank:

::

ceph mds deactivate cephfs_a:1
telling mds.1:1 172.21.9.34:6806/837679928 to deactivate
# fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
# fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
...
# fsmap e10: 1/1/1 up {0=a=up:active}, 2 up:standby

# fsmap e11: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
# fsmap e12: 1/1/1 up {0=a=up:active}, 1 up:standby
# fsmap e13: 1/1/1 up {0=a=up:active}, 2 up:standby
The cluster will automatically deactivate extra ranks incrementally until
``max_mds`` is reached.

See :doc:`/cephfs/administration` for more details on which forms ``<role>``
can take.

The deactivated rank will first enter the stopping state for a period
of time while it hands off its share of the metadata to the remaining
active daemons. This phase can take from seconds to minutes. If the
MDS appears to be stuck in the stopping state then that should be investigated
as a possible bug.
Note: deactivated ranks will first enter the stopping state for a period of
time while they hand off their share of the metadata to the remaining active
daemons. This phase can take from seconds to minutes. If an MDS appears to
be stuck in the stopping state, that should be investigated as a possible
bug.

If an MDS daemon crashes or is killed while in the 'stopping' state, a
standby will take over and the rank will go back to 'active'. You can
try to deactivate it again once it has come back up.
If an MDS daemon crashes or is killed while in the ``up:stopping`` state, a
standby will take over and the cluster monitors will again try to deactivate
the daemon.

When a daemon finishes stopping, it will respawn itself and go
back to being a standby.
When a daemon finishes stopping, it will respawn itself and go back to being a
standby.


Manually pinning directory trees to a particular rank
22 changes: 16 additions & 6 deletions doc/cephfs/upgrading.rst
@@ -18,29 +18,39 @@ The proper sequence for upgrading the MDS cluster is:

ceph fs set <fs_name> max_mds 1

2. Deactivate all non-zero ranks, from the highest rank to the lowest, while waiting for each MDS to finish stopping:
2. Wait for the cluster to deactivate the non-zero ranks, so that only rank 0 is active and the rest are standbys.

::

ceph mds deactivate <fs_name>:<n>
ceph status # wait for MDS to finish stopping

3. Take all standbys offline, e.g. using systemctl:

::

systemctl stop ceph-mds.target
ceph status # confirm only one MDS is online and is active

4. Upgrade the single active MDS, e.g. using systemctl:
4. Confirm only one MDS is online and is rank 0 for your FS:

::

ceph status

5. Upgrade the single active MDS, e.g. using systemctl:

::

# use package manager to update cluster
systemctl restart ceph-mds.target

5. Upgrade/start the standby daemons.
6. Upgrade/start the standby daemons.

::

# use package manager to update cluster
systemctl restart ceph-mds.target

6. Restore the previous max_mds for your cluster:
7. Restore the previous max_mds for your cluster:

::

2 changes: 2 additions & 0 deletions qa/cephfs/overrides/whitelist_health.yaml
@@ -7,3 +7,5 @@ overrides:
- \(MDS_DEGRADED\)
- \(FS_WITH_FAILED_MDS\)
- \(MDS_DAMAGE\)
- \(MDS_ALL_DOWN\)
- \(MDS_UP_LESS_THAN_MAX\)
2 changes: 2 additions & 0 deletions qa/suites/fs/multifs/tasks/failover.yaml
@@ -3,6 +3,8 @@ overrides:
log-whitelist:
- not responding, replacing
- \(MDS_INSUFFICIENT_STANDBY\)
- \(MDS_ALL_DOWN\)
- \(MDS_UP_LESS_THAN_MAX\)
ceph-fuse:
disabled: true
tasks:
2 changes: 2 additions & 0 deletions qa/suites/kcephfs/recovery/tasks/failover.yaml
@@ -3,6 +3,8 @@ overrides:
log-whitelist:
- not responding, replacing
- \(MDS_INSUFFICIENT_STANDBY\)
- \(MDS_ALL_DOWN\)
- \(MDS_UP_LESS_THAN_MAX\)
tasks:
- cephfs_test_runner:
fail_on_skip: false
1 change: 0 additions & 1 deletion qa/tasks/ceph.py
@@ -377,7 +377,6 @@ def cephfs_setup(ctx, config):
num_active = len([r for r in all_roles if is_active_mds(r)])

fs.set_max_mds(num_active)
fs.set_allow_dirfrags(True)

yield
