mimic: mds: stopping MDS with a large cache (40+GB) causes it to miss heartbeats #28452

Merged
yuriw merged 12 commits into ceph:mimic from thmour:mimic_test on Oct 29, 2019

@batrick

Member

batrick commented Jun 7, 2019

f47493f

seems to be referring to the wrong commit in the commit message:

(cherry picked from commit ce153b8)

It should be referring to: ef46216

?

@thmour

Author

thmour commented Jun 8, 2019

these are cherry-picks of commits from the luminous merged pr

@smithfarm smithfarm requested a review from batrick Jun 11, 2019
@smithfarm

Contributor

smithfarm commented Jun 11, 2019

@batrick I looked at f47493f and it seems to be OK, i.e.: it says "cherry picked from ce153b8" and that commit is in master.

@yuriw

Contributor

yuriw commented Jul 9, 2019

@thmour pls rebase

--- pr 28452 --- pulling https://github.com/thmour/ceph.git branch mimic_test
remote: Enumerating objects: 73, done.
remote: Counting objects: 100% (73/73), done.
remote: Total 85 (delta 73), reused 73 (delta 73), pack-reused 12
Unpacking objects: 100% (85/85), done.
From https://github.com/thmour/ceph
 * branch            mimic_test -> FETCH_HEAD
Auto-merging src/mds/SessionMap.h
CONFLICT (content): Merge conflict in src/mds/SessionMap.h
Auto-merging src/mds/SessionMap.cc
Auto-merging src/mds/Server.cc
Auto-merging src/mds/MDCache.h
Auto-merging src/mds/MDCache.cc
Auto-merging src/common/options.cc
Auto-merging src/common/legacy_config_opts.h
Automatic merge failed; fix conflicts and then commit the result.
Traceback (most recent call last):
  File "/home/yuriw/wip_master/src/script/build-integration-branch", line 62, in
    assert not r
AssertionError

@thmour thmour force-pushed the thmour:mimic_test branch from 2b67b94 to c0754f2 Jul 11, 2019
@thmour

Author

thmour commented Jul 11, 2019

@yuriw done

@batrick

Member

batrick commented Jul 17, 2019

needs rebased

@thmour thmour force-pushed the thmour:mimic_test branch from c0754f2 to 8fe568a Jul 18, 2019
@thmour

Author

thmour commented Jul 22, 2019

@batrick bump

@yuriw

Contributor

yuriw commented Jul 23, 2019

@thmour pls rebase
--- pr 28452 --- pulling https://github.com/thmour/ceph.git branch mimic_test
remote: Enumerating objects: 73, done.
remote: Counting objects: 100% (73/73), done.
remote: Total 85 (delta 73), reused 73 (delta 73), pack-reused 12
Unpacking objects: 100% (85/85), done.
From https://github.com/thmour/ceph
 * branch            mimic_test -> FETCH_HEAD
Auto-merging src/mds/MDSRank.cc
Auto-merging src/mds/MDSDaemon.cc
Auto-merging src/mds/MDCache.cc
Auto-merging src/common/options.cc
Auto-merging qa/tasks/cephfs/test_client_limits.py
Auto-merging PendingReleaseNotes
CONFLICT (content): Merge conflict in PendingReleaseNotes
Automatic merge failed; fix conflicts and then commit the result.
Traceback (most recent call last):
  File "/home/yuriw/wip_master/src/script/build-integration-branch", line 62, in
    assert not r
AssertionError

@thmour

Author

thmour commented Jul 24, 2019

I'm sorry, I don't get any conflicts. What am I doing differently?

# git checkout mimic
Switched to branch 'mimic'
Your branch is up-to-date with 'origin/mimic'.
# git checkout -b mimic_thmour
Switched to a new branch 'mimic_thmour'
# git pull thmour mimic_test
From https://github.com/thmour/ceph
 * branch            mimic_test -> FETCH_HEAD
Auto-merging src/common/options.cc
Merge made by the 'recursive' strategy.
 PendingReleaseNotes                                       |  22 ++++++++++++++-
 qa/suites/fs/bugs/client_trim_caps/tasks/trim-i22073.yaml |  19 +++++++++++++
 qa/tasks/cephfs/test_client_limits.py                     |  38 ++++++++++++++++++++-----
 src/common/legacy_config_opts.h                           |   1 -
 src/common/options.cc                                     |  40 ++++++++++++++++++++++----
 src/mds/Beacon.cc                                         |  44 +++++++++++------------------
 src/mds/MDCache.cc                                        |  62 +++++++++++++++++++++++++++-------------
 src/mds/MDCache.h                                         |  11 +++++---
 src/mds/MDSDaemon.cc                                      |   5 +++-
 src/mds/MDSRank.cc                                        |  87 ++++++++++++++++++++++++++++++++++++++++----------------
 src/mds/Server.cc                                         | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
 src/mds/Server.h                                          |  16 +++++++++--
 src/mds/SessionMap.cc                                     | 101 +++++++++++++++++++++++++++++++++++++++++++----------------------
 src/mds/SessionMap.h                                      | 101 ++++++++++++++++++++++++++++++++++++++---------------------------
 src/test/mds/TestSessionFilter.cc                         |  18 ++++++------
 15 files changed, 493 insertions(+), 215 deletions(-)
 create mode 100644 qa/suites/fs/bugs/client_trim_caps/tasks/trim-i22073.yaml

Or

# git checkout mimic_test
Branch mimic_test set up to track remote branch mimic_test from thmour.
Switched to a new branch 'mimic_test'
# git pull thmour mimic_test
From https://github.com/thmour/ceph
 * branch            mimic_test -> FETCH_HEAD
Already up-to-date.
# git pull --rebase origin mimic
From https://github.com/ceph/ceph
 * branch            mimic      -> FETCH_HEAD
First, rewinding head to replay your work on top of it...
Applying: mds: cleanup SessionMap init
Applying: mds: add throttle for trimming MDCache
Applying: mds: adapt drop cache for incremental trim
Applying: mds: cleanup Session init
Applying: mds: adapt drop cache for incremental recall
Applying: qa: test mds_max_caps_per_client conf
Applying: mds: add extra details for cache drop output
Applying: test/mds: fix Session cons call
Applying: mds: handle negative decay counter
@smithfarm

Contributor

smithfarm commented Jul 24, 2019

@thmour What @yuriw means is this PR conflicts with some other mimic-targeting PR that is currently open. You'll only see the conflicts if you make an integration branch based on mimic with the open PRs merged on top.
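
The situation described above can be illustrated with a toy model. This is only a sketch, not how `build-integration-branch` or git actually work: a branch is modeled as a dict of `(file, line)` slots, and `changes`, `merge`, `this_pr`, and `other_pr` are hypothetical names. It shows how a PR can merge cleanly onto plain mimic and still conflict once another open PR has been merged first.

```python
# Toy three-way merge. A "branch" is a dict mapping (file, line) -> content.
# All names here are hypothetical; none come from Ceph's tooling.

def changes(base, branch):
    """Return the slots where `branch` differs from `base`."""
    keys = set(base) | set(branch)
    return {k: branch.get(k) for k in keys if branch.get(k) != base.get(k)}


def merge(base, ours, theirs):
    """Merge `theirs` into `ours` relative to `base`; return (merged, conflicts)."""
    ours_delta = changes(base, ours)
    theirs_delta = changes(base, theirs)
    # A conflict is a slot both sides changed, but to different content.
    conflicts = sorted(
        k for k in ours_delta.keys() & theirs_delta.keys()
        if ours_delta[k] != theirs_delta[k]
    )
    merged = dict(ours)
    for k, v in theirs_delta.items():
        if k in conflicts:
            continue
        if v is None:
            merged.pop(k, None)
        else:
            merged[k] = v
    return merged, conflicts


mimic = {("PendingReleaseNotes", 1): "existing notes"}
this_pr = {**mimic, ("PendingReleaseNotes", 2): "note from this PR"}
other_pr = {**mimic, ("PendingReleaseNotes", 2): "note from another open PR"}

# Merging this PR onto plain mimic is clean.
_, clean = merge(mimic, mimic, this_pr)

# An integration branch merges the other open PR first, then this one.
integration, _ = merge(mimic, mimic, other_pr)
_, clash = merge(mimic, integration, this_pr)

print(clean)  # []
print(clash)  # [('PendingReleaseNotes', 2)]
```

In this model the conflict only exists relative to the other open PR, which is why a local `git pull --rebase origin mimic` reports nothing.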

@smithfarm

Contributor

smithfarm commented Oct 22, 2019

@yuriw Can we get this one into v13.2.7 ?

@yuriw

Contributor

yuriw commented Oct 22, 2019

@tchaikov

Contributor

tchaikov commented Oct 24, 2019

@smithfarm I don't see the connection between the failure and shallow clone. Could you help me understand it?

@smithfarm

Contributor

smithfarm commented Oct 24, 2019

Sigh, this needs rebasing again.

@smithfarm

Contributor

smithfarm commented Oct 24, 2019

@yuriw Is it possible you are merging newer PRs before older ones? It should be the other way around. @thmour had to rebase this three times already, and now he has to rebase a fourth time. That's not fair.

@smithfarm

Contributor

smithfarm commented Oct 24, 2019

@tchaikov Presumably, the docs build test failure was an instance of https://tracker.ceph.com/issues/42403 fixed for mimic by #31090

batrick added 12 commits Jan 18, 2019
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 69efdaf)

Conflicts:
	src/mds/SessionMap.h

This is necessary when the MDS cache size decreases by a significant amount.
For example, when stopping a large MDS or when the operator makes a large cache
size reduction.

Fixes: http://tracker.ceph.com/issues/37723

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 7bf2f31)

Conflicts:
	PendingReleaseNotes
	src/mds/MDCache.cc
	src/mds/MDCache.h

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit b750b3b)

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit ce153b8)

Conflicts:
	src/mds/SessionMap.cc
	src/mds/SessionMap.h

As with trimming, use DecayCounters to throttle the number of caps we recall,
both globally and per-session.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit ef46216)

Conflicts:
	PendingReleaseNotes
	qa/suites/fs/bugs/client_trim_caps/tasks/trim-i22073.yaml
	src/mds/Beacon.cc
	src/mds/MDSDaemon.cc
	src/mds/Server.cc
	src/mds/Server.h
	src/mds/SessionMap.cc
	src/mds/SessionMap.h
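
The idea of throttling recall with decay counters, both globally and per session, can be sketched roughly as follows. This is an illustrative model only and not the MDS code: the `DecayCounter` here is a simplified stand-in for Ceph's C++ class, and `RecallThrottle`, its limits, and the half-life value are assumptions made up for the example. Time is passed in explicitly so the behaviour is deterministic.

```python
import math


class DecayCounter:
    """Simplified stand-in: a value that decays exponentially over time."""

    def __init__(self, half_life):
        self.k = math.log(2) / half_life  # decay rate derived from the half-life
        self.value = 0.0
        self.last = 0.0

    def get(self, now):
        # Apply the decay accumulated since the last access.
        self.value *= math.exp(-self.k * (now - self.last))
        self.last = now
        return self.value

    def hit(self, now, amount):
        self.get(now)
        self.value += amount


class RecallThrottle:
    """Hypothetical throttle: one global counter plus one counter per session."""

    def __init__(self, global_limit, session_limit, half_life):
        self.global_limit = global_limit
        self.session_limit = session_limit
        self.half_life = half_life
        self.global_counter = DecayCounter(half_life)
        self.sessions = {}

    def recall(self, session, wanted, now):
        """Return how many caps may be recalled from `session` at time `now`."""
        per = self.sessions.setdefault(session, DecayCounter(self.half_life))
        room = min(self.global_limit - self.global_counter.get(now),
                   self.session_limit - per.get(now))
        granted = max(0, min(wanted, int(room)))
        self.global_counter.hit(now, granted)
        per.hit(now, granted)
        return granted


throttle = RecallThrottle(global_limit=1000, session_limit=600, half_life=2.0)
a = throttle.recall("client.a", 5000, now=0.0)      # limited by the session cap
b = throttle.recall("client.b", 5000, now=0.0)      # limited by what is left globally
c = throttle.recall("client.a", 5000, now=0.0)      # nothing left at this instant
later = throttle.recall("client.a", 5000, now=2.0)  # counters have decayed, some room again
print(a, b, c)  # 600 400 0
```

Because the counters decay rather than reset, a burst of recall is spread over several ticks instead of being issued all at once.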

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 7244cae)

This is to prevent unsustainable situations where a client has so many
outstanding caps that a linear traversal/operation on the session's caps takes
unacceptable amounts of time.

Fixes: http://tracker.ceph.com/issues/38022
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 48ca097)

Conflicts:
	PendingReleaseNotes
	src/mds/Server.cc

That the MDS will not let a client sit above mds_max_caps_per_client caps.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 30aaa88)

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 3bc093f)

Conflicts:
	src/mds/Server.cc

Instead of a timeout and complicated decisions about whether the client is
releasing caps in an expeditious fashion, just use a DecayCounter that tracks
the number of caps we've recalled. This counter is decremented whenever the
client releases caps. If the counter passes a threshold, then we raise the
warning.

Similar reworking is done for the steady-state recall of client caps. Another
release DecayCounter is added so we can tell when the client is not releasing
any more caps.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit c0b3a11)

Conflicts:
	PendingReleaseNotes
	src/mds/Beacon.cc
	src/mds/Server.cc
	src/mds/SessionMap.cc
	src/mds/SessionMap.h
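
A rough sketch of the warning logic described in that commit message, under stated assumptions: `SessionRecallState`, the threshold, the half-life, and the `floor` parameter are hypothetical, and the decaying counter is a minimal stand-in rather than Ceph's `DecayCounter`. One counter tracks recalled caps and is decremented on release; a second counter tracks releases so that a client that has stopped releasing can be detected.

```python
import math


class Decaying:
    """Minimal exponentially decaying counter (stand-in, clamped at zero)."""

    def __init__(self, half_life):
        self.k = math.log(2) / half_life
        self.value = 0.0
        self.last = 0.0

    def get(self, now):
        self.value *= math.exp(-self.k * (now - self.last))
        self.last = now
        return self.value

    def hit(self, now, amount):
        # Negative amounts are allowed; the value never drops below zero.
        self.value = max(0.0, self.get(now) + amount)


class SessionRecallState:
    """Hypothetical per-session state for the recall warning."""

    def __init__(self, warn_threshold, half_life):
        self.warn_threshold = warn_threshold
        self.recalled = Decaying(half_life)  # caps we asked the client to give back
        self.released = Decaying(half_life)  # caps the client actually gave back

    def on_recall(self, now, caps):
        self.recalled.hit(now, caps)

    def on_release(self, now, caps):
        self.recalled.hit(now, -caps)  # releases pay down the recalled counter
        self.released.hit(now, caps)

    def should_warn(self, now):
        return self.recalled.get(now) > self.warn_threshold

    def is_releasing(self, now, floor=1.0):
        return self.released.get(now) > floor


s = SessionRecallState(warn_threshold=1000, half_life=60.0)
s.on_recall(0.0, 1500)
warn_before = s.should_warn(0.0)    # 1500 outstanding, above the threshold
s.on_release(0.0, 800)
warn_after = s.should_warn(0.0)     # 700 outstanding, below the threshold
active_now = s.is_releasing(0.0)
active_later = s.is_releasing(600.0)  # ten half-lives later with no new releases
print(warn_before, warn_after, active_now, active_later)  # True False True False
```

The design point is that no explicit timeout is needed: an unresponsive client simply leaves the recalled counter high, while a responsive one pays it down.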

Problem did not exist in master.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry-picked from commit 5ed5c51)

Conflicts:
	src/test/mds/TestSessionFilter.cc

Problem only exists in Luminous/Mimic.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry-picked from commit 5f23246)

@smithfarm smithfarm force-pushed the thmour:mimic_test branch from d0ff90a to bbbe96e Oct 24, 2019
@smithfarm

Contributor

smithfarm commented Oct 24, 2019

I went ahead and rebased it on behalf of @thmour

@smithfarm

Contributor

smithfarm commented Oct 24, 2019

IOError: CRC check failed 0xa97912f2 != 0x650305fbL
src/pybind/mgr/dashboard/CMakeFiles/mgr-dashboard-nodeenv.dir/build.make:60: recipe for target 'src/pybind/mgr/dashboard/node-env/bin/npm' failed

(not related to this PR)

@smithfarm

Contributor

smithfarm commented Oct 24, 2019

jenkins test make check

@yuriw yuriw merged commit 377035a into ceph:mimic Oct 29, 2019
4 checks passed
Docs: build check (OK - docs built)
Signed-off-by (all commits in this PR are signed)
Unmodified Submodules (submodules for project are unmodified)
make check (make check succeeded)
6 participants