
mds: misc multimds fixes #13227

Merged 22 commits into ceph:master from ukernel:wip-multimds-misc on Feb 26, 2017


2 participants

ukernel commented Feb 2, 2017

No description provided.

@ukernel ukernel changed the title from [DNM] mds: misc multimds fixes to mds: misc multimds fixes Feb 7, 2017

ukernel added some commits Feb 2, 2017

mds: drop superfluous MMDSOpenInoReply
A superfluous MMDSOpenInoReply can cause MDCache::handle_open_ino_reply()
to call MDCache::do_open_ino() in an improper state.

Signed-off-by: "Yan, Zheng" <>
mds: avoid journaling unnecessary dirfrags in ESubtreeMap
EMetaBlob::add_dir_context() skips adding inodes that have already
been journaled in the last ESubtreeMap. The log replay code only
replays the first ESubtreeMap; for the remaining ESubtreeMap events,
it just verifies that the subtree map in the cache matches the
ESubtreeMap. If unnecessary inodes were included in a non-first
ESubtreeMap, those inodes do not get added to the cache, and the
log replay code can find them missing when replaying the rest of
the events in the log segment.

The previous attempt (commit a9b959d) to fix this issue was not
complete. This patch makes MDCache::create_subtree_map() journal
dirfrags according to the simplified subtree map, which should fix
the issue completely.

Signed-off-by: "Yan, Zheng" <>
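The replay rule this commit relies on can be sketched as a small toy model (illustrative only, not Ceph code; all names here are invented): only the first subtree-map event is replayed into the cache, and later ones are merely verified, so a later subtree map must never reference inodes the cache has not seen.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model of the replay rule (not Ceph code): only the first
// subtree-map event in the log populates the cache; later subtree
// maps are only verified against it.
struct Event {
  bool is_subtree_map;           // models an ESubtreeMap-like event
  std::set<std::string> inodes;  // inodes the event references
};

// Returns false when a non-first subtree map references an inode that
// was never added to the cache -- the "missing inode" failure the
// commit avoids by journaling only the simplified subtree map.
bool replay(const std::vector<Event>& log) {
  std::set<std::string> cache;
  bool first_map_seen = false;
  for (const Event& ev : log) {
    if (ev.is_subtree_map) {
      if (!first_map_seen) {
        cache.insert(ev.inodes.begin(), ev.inodes.end());
        first_map_seen = true;
      } else {
        for (const std::string& ino : ev.inodes)
          if (!cache.count(ino))
            return false;  // verification fails during replay
      }
    } else {
      cache.insert(ev.inodes.begin(), ev.inodes.end());
    }
  }
  return true;
}
```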
mds: set STATE_AUTH in MDSCacheObject::decode_import
Signed-off-by: "Yan, Zheng" <>
mds: avoid zero replica_nonce
Signed-off-by: "Yan, Zheng" <>
mds: track committing and rolling-back slave requests
When handling mds failure, we need to distinguish committing and
'rolling back' slave requests from unprepared slave requests.

Signed-off-by: "Yan, Zheng" <>
mds: log master commit after all slave commits get journaled
When a survivor mds sends a resolve message to a recovering mds, it
also records committing slave requests in the message. The recovering
mds then knows the slave commit is still being journaled, and it
journals the master commit after receiving the corresponding
OP_COMMITTED message.

Signed-off-by: "Yan, Zheng" <>
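The ordering rule above can be sketched with a toy model (illustrative only, not Ceph code; names are invented): the master journals its commit only after every slave has reported, via an OP_COMMITTED-style message, that its own commit is journaled.

```cpp
#include <cassert>
#include <set>

// Toy model of the commit ordering (not Ceph code): the master's
// commit is journaled only once no slave commit is still in flight.
class MasterCommit {
  std::set<int> pending_slaves;  // slaves whose commit is still being journaled
  bool journaled = false;
public:
  explicit MasterCommit(const std::set<int>& slaves)
    : pending_slaves(slaves) {}

  // Called when an OP_COMMITTED-style message arrives from a slave.
  void on_slave_committed(int slave) {
    pending_slaves.erase(slave);
    if (pending_slaves.empty())
      journaled = true;  // now safe to journal the master commit
  }

  bool master_journaled() const { return journaled; }
};
```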
mds: cleanup ambiguous slave update when master mds fails
When the auth mds of a rename source dentry fails, slave updates on
witness mds become ambiguous. Witnesses need to ask the master whether
they should roll back the updates. This type of rollback is special:
the corresponding MDRequest struct needs to be preserved after the
rollback. If the master mds also fails, slave updates on witness mds
are no longer special, and the corresponding MDRequest struct needs to
be cleaned up after the rollback.

See commit e62e48b for more information.

Signed-off-by: "Yan, Zheng" <>
mds: disambiguate other mds' imports when cluster enters rejoin state
When the mds cluster is in the rejoin state, we know all mds have
finished their exports and all export abort notifications have been
processed by bystander mds, so it's safe to disambiguate other mds'
imports.

Signed-off-by: "Yan, Zheng" <>
mds: kill export finish waiters
This code is unused.

Signed-off-by: "Yan, Zheng" <>
mds: wait acknowledgment for export abort notification
To disambiguate another mds's failed import, a surviving bystander mds
needs to receive either the exporter mds' export abort notification or
the exporter mds' resolve message. For the bystander mds, it's hard to
distinguish "export succeeded" from "hasn't received the export abort
notification yet".

To handle this problem, we rely on the fact that a survivor mds does
not send its resolve message to the recovering mds until it finishes
all its exports. Without the resolve message, the recovering mds can't
go to the rejoin state, so when the mds cluster is in the rejoin state,
we know all mds have finished their exports. If export abort
notifications also require acknowledgments, then once the cluster is
in the rejoin state we additionally know that all export abort
notifications have been processed by bystander mds, so bystander mds
can disambiguate other mds' imports.

Signed-off-by: "Yan, Zheng" <>
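The gating argument above can be sketched with a toy model (illustrative only, not Ceph code; names are invented): a survivor withholds its resolve message until both its exports are finished and its export-abort notifications are acknowledged, so the cluster reaching rejoin implies both conditions hold on every survivor.

```cpp
#include <cassert>

// Toy model of the resolve gating (not Ceph code): resolve is sent
// only when no export and no unacknowledged abort notification remain.
struct Survivor {
  int exports_in_flight;
  int unacked_abort_notifies;

  bool can_send_resolve() const {
    return exports_in_flight == 0 && unacked_abort_notifies == 0;
  }
};
```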
mds: properly set ambiguous auth on auth mds of rename source inode
When doing a trans-authority rename, the master mds may send two slave
requests to the auth mds of the rename source inode. The first slave
request sets ambiguous auth on the rename source inode. The second
slave request is sent after receiving all bystanders' slave request
replies.

The current code uses the mdr->more()->is_ambiguous_auth bit to
indicate whether the first slave request was sent. is_ambiguous_auth
is set when calling Server::_rename_prepare_witness(). This causes a
problem if Server::_rename_prepare_witness() can't send the slave
request immediately and wants to retry the MDRequest later. The fix is
to set is_ambiguous_auth when receiving the reply for the first slave
request.

Signed-off-by: "Yan, Zheng" <>
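The fix above can be sketched with a toy model (illustrative only, not Ceph code; names are invented): the ambiguous-auth flag is set when the first slave request's reply arrives rather than when the request is sent, so a send that must be retried later does not leave the flag set prematurely.

```cpp
#include <cassert>

// Toy model of the flag-setting fix (not Ceph code).
struct RenameRequest {
  bool slave_request_sent = false;
  bool is_ambiguous_auth = false;
};

// Returns false when the request cannot be sent yet and must be
// retried later; in that case the flag must stay clear.
bool send_first_slave_request(RenameRequest& mdr, bool can_send_now) {
  if (!can_send_now)
    return false;                 // retry later; the flag stays clear
  mdr.slave_request_sent = true;  // note: flag is NOT set on send
  return true;
}

void on_first_slave_reply(RenameRequest& mdr) {
  mdr.is_ambiguous_auth = true;   // set only once the slave has replied
}
```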
mds: handle race of freezing auth pin
The linkage of the rename source dentry may change while freezing the
auth pin for the rename source inode, so we may freeze the auth pin
for the wrong inode.

Signed-off-by: "Yan, Zheng" <>
mds: note subtree bounds when rolling back rename
An mds can do a slave rename that moves a directory inode (whose
dirfrags are all non-auth) to a new auth, and then roll back the slave
rename. If there is an ESubtreeMap event between the log event of the
slave rename and the log event of the rollback, the ESubtreeMap does
not have information about the inode's non-auth dirfrags.

Later, when the mds replays the log, the log event of the slave rename
can be missing, so the mds needs to re-create the subtree bounds when
replaying the log event of the rename rollback.

Signed-off-by: "Yan, Zheng" <>
mds: cleanup CInode::encode_inodestat()
The variable no_caps should be true when valid is false.

Signed-off-by: "Yan, Zheng" <>
mds: issue new caps to client even when session is stale
If early reply is not allowed for an open request, the MDS does not
send a reply to the client immediately after adding new caps. Later,
when the MDS sends the reply, the client session can be in the stale
state. If the MDS does not issue the new caps to the client along with
the reply, the new caps get lost. This issue can cause the MDS to hang
while revoking caps.

Signed-off-by: "Yan, Zheng" <>
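The fix above can be sketched with a toy model (illustrative only, not Ceph code; names are invented): new caps are issued along with the deferred reply regardless of whether the session has gone stale in the meantime, so the grant cannot be lost.

```cpp
#include <cassert>

// Toy model of the cap-issuing fix (not Ceph code).
enum class SessionState { OPEN, STALE };

struct Reply { bool caps_included; };

Reply build_open_reply(SessionState state, bool new_caps_added) {
  (void)state;  // after the fix, session staleness no longer gates issuance
  Reply r{false};
  if (new_caps_added)
    r.caps_included = true;  // always ship the new caps with the reply
  return r;
}
```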
mds: fix deadlock when wrlock and remote_wrlock the same lock
When handling a trans-authority rename, the master mds may ask a slave
mds to wrlock a lock, then try to wrlock the same lock locally. If the
master can't wrlock the lock locally, it needs to drop the remote
wrlock and wait; otherwise deadlock happens. The code does not handle
a corner case: Locker::wrlock_start() can sleep even when
SimpleLock::can_wrlock() returns true.

Signed-off-by: "Yan, Zheng" <>
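The drop-before-wait rule above can be sketched with a toy model (illustrative only, not Ceph code; names are invented). It models only the basic rule; the corner case the commit actually fixes is that the lock-acquisition path can sleep even when the lock looks acquirable.

```cpp
#include <cassert>

// Toy model of the deadlock-avoidance rule (not Ceph code): if the
// master holds a remote wrlock and the local wrlock would have to
// wait, it drops the remote wrlock before sleeping.
struct LockState {
  bool remote_held;       // remote wrlock currently held by the master
  bool local_acquirable;  // models whether the local wrlock can be taken now
};

// Returns true when both locks are held and the operation may proceed.
// When it returns false, the remote wrlock has been dropped, so the
// caller never sleeps while still holding it.
bool local_wrlock_or_backoff(LockState& s) {
  if (!s.local_acquirable) {
    s.remote_held = false;  // drop the remote wrlock, then wait and retry
    return false;
  }
  return true;
}
```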
mds: properly update replica inode's ctime
Signed-off-by: "Yan, Zheng" <>
mds: avoid race between cache expire and MDentryLink
Commit 2253534 tried to fix a race between cache expire and
MDentryLink by avoiding trimming a null dentry whose lock is not
readable. The fix does not handle the case where the MDS first
receives an MDentryUnlink message, then receives an MDentryLink
message.

Signed-off-by: "Yan, Zheng" <>
mds: don't call kick_discovers() for recovering mds twice
MDSRankDispatcher::handle_mds_map() calls kick_discovers() when the
recovering mds enters the rejoin state. There is no need to call it
again when the recovering mds enters the clientreplay/active state.

Signed-off-by: "Yan, Zheng" <>
mds: drop MDiscover/MMDSOpenIno messages if mds state < REJOIN
Signed-off-by: "Yan, Zheng" <>
mds: properly set default dir_hash for directory inodes
MDCache::handle_cache_rejoin_strong() may add new inodes (racing with
cache expire). Updating these inodes happens at the very end of the
function. Before these inodes get updated,
MDCache::handle_cache_rejoin_strong() may add dentries to them, so the
dir_hash type of these inodes should be set to the default value.

Signed-off-by: "Yan, Zheng" <>

jcsp approved these changes Feb 26, 2017

@jcsp jcsp merged commit 477ddea into ceph:master Feb 26, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
default: Build finished.

@ukernel ukernel deleted the ukernel:wip-multimds-misc branch Feb 27, 2017
