
mgr/cephadm: stable nfs-ganesha ranks; nfs + ingress support #41007

Merged (46 commits) on May 25, 2021

Conversation

liewegas (Member) commented Apr 23, 2021:

  • Assign each ganesha daemon a stable rank id (0, 1, ...).
  • Ensure that no more than one daemon at a time is running on a single rank. For now, this relies on revoking the old daemon's auth key. In the future, we can additionally fence it, but that infrastructure isn't in tree yet.
  • Add NFS support for ingress: haproxy with consistent hashing.
  • Update mgr/nfs to optionally deploy ingress (when a VIP is provided).
  • Update the mgr/nfs cluster ls output to include the stable endpoint separately from the backend server instances.
  • Make everything default to the nfs-ganesha pool and a namespace matching the service_id.
  • Update mgr/nfs commands to use rm instead of delete, for consistency with the rest of the CLI.
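To make the rank bookkeeping concrete, here is a minimal sketch of the kind of structure the first two bullets imply (names and layout are assumptions for illustration, not the PR's actual code):

    from typing import Dict, Optional

    # Hypothetical layout: rank -> generation -> daemon_id (None until the
    # daemon for that generation exists, or after it has been removed).
    RankMap = Dict[int, Dict[int, Optional[str]]]

    def next_free_rank(rank_map: RankMap, num_ranks: int) -> Optional[int]:
        """Return the lowest rank (0, 1, ...) with no daemon recorded yet."""
        for rank in range(num_ranks):
            if not rank_map.get(rank):
                return rank
        return None

    def bump_generation(rank_map: RankMap, rank: int) -> int:
        """Start a new generation for a rank; the previous generation's daemon
        has its auth key revoked, so at most one daemon per rank stays live."""
        gens = rank_map.setdefault(rank, {})
        gen = max(gens) + 1 if gens else 0
        gens[gen] = None  # filled in once the replacement daemon is created
        return gen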

jtlayton (Contributor) replied:

> @jtlayton Is this the right thing? 49e15c7#diff-b13a6b9b90f10d54533fcaa8b7d9db7f8d24ae39bb52cd611c28c9c39cb429ecR86
>
> This goes into the template https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/templates/services/nfs/ganesha.conf.j2

Yeah, looks good. As long as you have unique nodeids for the daemons that are running, and their successors get the same name, you should be fine.
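For illustration only (the naming scheme below is an assumption, not necessarily what the template renders): deriving the nodeid from the stable rank rather than from the per-instance daemon name means that a replacement daemon for a given rank presents the same identity to the grace database as its predecessor.

    def nodeid_for(service_id: str, rank: int) -> str:
        # Hypothetical scheme: tie the nodeid to the rank, not to the unique
        # daemon instance, so successors reuse the same recovery state.
        return f'{service_id}.{rank}'

    # Every generation of rank 0 in cluster 'mynfs' gets the same nodeid:
    assert nodeid_for('mynfs', 0) == 'mynfs.0'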

liewegas force-pushed the cephadm-ha-nfs branch 2 times, most recently from 6ca36c6 to 5dc6ca2 on April 26, 2021 19:53
mgfritch (Contributor) left a comment:

A couple of suggestions, but looks good overall!

Resolved review threads on src/pybind/mgr/cephadm/services/nfs.py (outdated), src/pybind/mgr/cephadm/schedule.py, and src/pybind/mgr/cephadm/inventory.py.
@@ -943,6 +943,7 @@ def to_json(self) -> dict:
        out: Dict[str, Any] = OrderedDict()
        out['daemon_type'] = self.daemon_type
        out['daemon_id'] = self.daemon_id
+       out['service_name'] = self._service_name
A contributor commented:
I'd remove osdspec_affinity a few lines below by now.

Resolved review threads on src/pybind/mgr/cephadm/inventory.py.
Comment on lines +52 to +62
    def assign_rank(self, rank: int, gen: int) -> 'DaemonPlacement':
        return DaemonPlacement(
            self.daemon_type,
            self.hostname,
            self.network,
            self.name,
            self.ip,
            self.ports,
            rank,
            gen,
        )
A contributor commented:
Is this the best place to put that code? It would be great to have the rank assignment in the NFSService class.
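For context, a hedged usage sketch of the method quoted above (the constructor arguments are placeholders): assign_rank() returns a new, ranked copy rather than mutating the placement, so the generic scheduler's output can be decorated with rank information afterwards.

    # Assumed call pattern; the argument values are made up. DaemonPlacement
    # is defined in src/pybind/mgr/cephadm/schedule.py.
    unranked = DaemonPlacement('nfs', 'host1', '', '', '', [2049])
    ranked = unranked.assign_rank(0, 3)   # rank 0, generation 3
    # 'unranked' is left untouched; only 'ranked' carries the rank/gen pair.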

Comment on lines +275 to +280
d.rank is not None, # ranked first, then non-ranked
d.rank, # low ranks
0 - (d.rank_generation or 0), # newer generations first
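As a standalone illustration of the generation term in that key (not the PR's code): negating the generation makes an ordinary ascending sort put newer generations first.

    # Made-up (daemon, generation) pairs, sorted the same way as above.
    daemons = [('nfs.foo.0.0', 0), ('nfs.foo.0.2', 2), ('nfs.foo.0.1', 1)]
    ordered = sorted(daemons, key=lambda d: 0 - d[1])
    # -> [('nfs.foo.0.2', 2), ('nfs.foo.0.1', 1), ('nfs.foo.0.0', 0)]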
A contributor commented:
Feels wrong to me. We're putting a lot of per-service logic into the scheduler here. Can we keep it generic in the scheduler?

expected: List[str]
expected_add: List[str]
expected_remove: List[DaemonDescription]


@pytest.mark.parametrize("service_type,placement,hosts,daemons,expected,expected_add,expected_remove",
@pytest.mark.parametrize("service_type,placement,hosts,daemons,rank_map,post_rank_map,expected,expected_add,expected_remove",
A contributor commented:
I have a bad feeling about that. This is going to make the test matrix explode.

liewegas (Member, author) replied:

For the non-ranked test cases, these are just None, None. I thought about making a separate test, but that would just duplicate the test code... or we could put it in a helper that just passes None, None for the older test cases.
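A hedged sketch of what that looks like (parameter values are placeholders, not the real test fixtures): pre-existing, non-ranked cases keep their old tuples and simply pass None for the two new slots.

    import pytest

    @pytest.mark.parametrize(
        "service_type,placement,hosts,daemons,rank_map,post_rank_map,expected",
        [
            # old, non-ranked case: the two new slots are just None, None
            ('mgr', 'count:1', ['host1'], [], None, None, ['host1']),
            # ranked case: rank_map in, expected post-scheduling rank_map out
            ('nfs', 'count:2', ['host1', 'host2'], [],
             {}, {0: {0: None}, 1: {0: None}}, ['host1', 'host2']),
        ])
    def test_placement(service_type, placement, hosts, daemons,
                       rank_map, post_rank_map, expected):
        ...  # placeholder body; the real test drives the scheduler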

The contributor replied:

I'm concerned that we essentially have to test

(different number of hosts) * (different existing placement on those hosts) * (different networks) * (different filter_new_host functions) * (allow_colo = yes and allow_colo = no) * (primary_daemon_type) * (per_host_daemon_type) * (different values for the rank_map).

And that scares me.

The github-actions bot added the core label on Apr 30, 2021.
liewegas changed the title from "mgr/cephadm: stable nfs-ganesha ranks" to "mgr/cephadm: stable nfs-ganesha ranks; nfs + ingress support" on Apr 30, 2021.
The github-actions bot added the tests label on May 2, 2021.
varshar16 (Contributor) left a comment:

Please update the pool name in the nfs tests here: https://github.com/ceph/ceph/blob/master/qa/tasks/cephfs/test_nfs.py

liewegas (Member, author) commented May 4, 2021:

@varshar16 I went back to nfs-ganesha for the default pool name... still undecided whether we should change this to .nfs-ganesha or .nfs.

varshar16 (Contributor) left a comment:

If we are going ahead with option 1, then please update the doc and create a tracker ticket for the new update interface.

liewegas force-pushed the cephadm-ha-nfs branch 2 times, most recently from d32996b to c651dd6 on May 25, 2021 13:31
This command is very awkward to implement unless all service spec fields
are always required.  That will soon mean both the placement *and*
virtual_ip (if any), making it much less convenient for a human to use.

Instead, let them update yaml, or adjust the nfs and/or ingress specs
directly.  I don't think this command is needed.

Signed-off-by: Sage Weil <sage@newdream.net>

For 'nfs cluster create', optionally take a virtual_ip to deploy ingress.

Signed-off-by: Sage Weil <sage@newdream.net>

- include the virtual_ip and port at top level
- move backend server list into a sub-item
- include (haproxy) monitoring port

Signed-off-by: Sage Weil <sage@newdream.net>
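A hedged sketch of the resulting shape (field names and values are illustrative, not copied from the PR):

    cluster_info = {
        'virtual_ip': '10.0.0.100',   # stable ingress endpoint, at top level
        'port': 2049,
        'monitor_port': 9049,         # haproxy monitoring port (made-up value)
        'backend': [                  # per-daemon servers moved into a sub-item
            {'hostname': 'host1', 'ip': '10.0.0.11', 'port': 12049},
            {'hostname': 'host2', 'ip': '10.0.0.12', 'port': 12049},
        ],
    }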
- leave off pool/ns, since they should almost never be necessary.
- add port

Signed-off-by: Sage Weil <sage@newdream.net>

Apply nfs default pool (currently 'nfs-ganesha'), and default the
namespace to the service_id.

There is no practical reason for users to ever need to change this, and
requiring them to provide this information at config/apply time just
complicates life.

Signed-off-by: Sage Weil <sage@newdream.net>
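A minimal sketch of that defaulting (the helper and constant names are assumptions for illustration):

    NFS_DEFAULT_POOL = 'nfs-ganesha'   # default pool at the time of this PR

    def rados_config_location(pool: str, namespace: str, service_id: str):
        # Fall back to the default pool, and use the service_id as the RADOS
        # namespace, when the spec leaves these fields empty.
        return (pool or NFS_DEFAULT_POOL, namespace or service_id)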
haproxy's container image tells docker|podman to send SIGUSR1 for a "clean"
shutdown.  For NFS, the connections never close, so we will always hit the
podman|docker 10s timeout and get a SIGKILL.  That, in turn, causes haproxy
to exit with 143, and puts the systemd unit in a failed state.

This highlights a general problem(?) with stopping containers: if they don't
stop quickly, we'll end up in this error state.  We don't directly
address that here.

Avoid this problem by always stopping containers with SIGTERM.  In the
haproxy case, that means an immediate shutdown (no graceful drain of
open connections).  In theory we could do this only for haproxy with
NFS, but we can easily imagine RGW connections that don't close in 10s
either, and we don't want containers exiting in error state--we just
want the proxy to stop quickly.

Signed-off-by: Sage Weil <sage@newdream.net>
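One way to express this (a hedged sketch; whether the PR does it exactly this way is an assumption) is to override the image's stop signal when the container is created, since both docker run and podman run accept a --stop-signal flag:

    def container_run_args(engine: str, image: str, name: str) -> list:
        # Build a container run command that forces SIGTERM on stop instead of
        # the SIGUSR1 the haproxy image declares.
        return [
            engine, 'run', '--rm', '--name', name,
            '--stop-signal=SIGTERM',
            image,
        ]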
It is likely that the rook/k8s variation of ingress will not take a
virtual_ip argument.  We want to make sure that ingress yes/no can be
specified independently of the virtual_ip.

Signed-off-by: Sage Weil <sage@newdream.net>

Still missing a full client mount test, though!

Signed-off-by: Sage Weil <sage@newdream.net>
liewegas (Member, author) replied:

> If we are going ahead with option 1, then please update the doc and create a tracker ticket for the new update interface.

https://tracker.ceph.com/issues/50972
#41534

varshar16 (Contributor) left a comment:

Thanks @liewegas for addressing my comments.
