Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cephadm: add a new SMB service to the mgr module #55068

Merged
merged 29 commits into from Mar 25, 2024

Conversation

phlogistonjohn
Copy link
Contributor

@phlogistonjohn phlogistonjohn commented Jan 5, 2024

Depends on #54817
Depends on #55719

Add initial support for deploying SMB gateways using Samba Container based daemons from a new smb service spec. An example service spec that I used for testing follows:

service_type: smb
service_id: tango
placement:
  hosts: 
    - ceph0
spec:
  cluster_id: tango
  features:
    - domain
  config_uri: rados://.smb/tango/scc.toml
  custom_dns:
    - "192.168.76.204"
  join_sources:
    - "rados:mon-config-key:smb/config/tango/join1.json"
  include_ceph_users: 
    - client.smb1

The only feature flag supported is 'domain'. A deployment w/o that flag is a "users/groups" based deployment w/o active directory support. In the future a 'clustered' feature flag will added to support said feature.

The content of the config_uri and join_sources fields are pseudo-URIs that point to objects stored in RADOS pool (rados:// similar to how nfs is used on ceph) or through the rados api call to fetch data from the mon configuration key store (rados:mon-config-key:). In both cases the content of the objects is largely opaque to ceph orchestration and is defined by the samba-container / sambacc projects.

In the future I plan on submitting a new smb manager module that is functionally similar to the nfs module that will automate the construction of these configurations. You'll be able to define smb clusters, smb shares, etc through the ceph command line tool. Thus the intended layering works like so:

    Admin User
    (ceph commands)
     |
     |
     V
  +-----------------------+       +----------+
  | MGR: smb module  <----------->| RADOS    |
  |                   o   |       +----------+
  +-------------------|---+
  | MGR: orch module  V   |
  +-------------------|---+
  | MGR: cephadm mod  V   |
  +-------------------|---+
                      |
                      |
  +-------------------V---+
  |  NODE: cephadm        |
  |                 o     |
  +-----------------|-----+
  | NODE            V     |
  |  podman/docker        |
  + - - - - - - - - - - - +
  | NODE:                 |       +-------------+
  |           data <------------->| MDS/CephFS  |
  |                       |       +-------------+
  |  samba-container      |
  |                       |       +----------+
  |         config o------------->| RADOS    |
  +-----------------------+       +----------+

Each layer should largely stand on it's own so (for now) I'm planning on filing separate PRs for each item. Because there is a logical dependency chain between each item we can review and discuss them independently but choose to merge only the final PR if convenient.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
    • Docs will be updated in a follow-up PR
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@rkachach
Copy link
Contributor

jenkins test rook e2e

Copy link
Contributor

@rkachach rkachach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phlogistonjohn please, can you have a look to the following error (rook e2e testing is failing):

Error ENOENT: module 'rook' reports that it cannot run on the active manager daemon: cannot import name 'SMBSpec' (pass --force to force enablement)

@phlogistonjohn
Copy link
Contributor Author

@phlogistonjohn please, can you have a look o the following error (rook e2e testing is failing):

Error ENOENT: module 'rook' reports that it cannot run on the active manager daemon: cannot import name 'SMBSpec' (pass --force to force enablement)

I looked at the code and I looked at the log from the jenkins job and I looked to see if there was an obvious place where I missed something (by looking for other spec types, like Nvemof). I do not see where the rook module would be trying to pull in the SMBSpec class, so it must be indirect. I will need a traceback/mgr logs to debug this further. How can I get that?

@phlogistonjohn
Copy link
Contributor Author

^ @rkachach any way I can get the mgr logs for these runs?

@rkachach
Copy link
Contributor

^ @rkachach any way I can get the mgr logs for these runs?

Sorry @phlogistonjohn , I saw your message but I was doing some more investigation related to the error above. I think it's related to the rook e2e testing. I opened this PR to fix it #55393. It would be good to see if the patch from the PR fixes the error on your side.

@phlogistonjohn
Copy link
Contributor Author

phlogistonjohn commented Jan 31, 2024

^ @rkachach any way I can get the mgr logs for these runs?

Sorry @phlogistonjohn , I saw your message but I was doing some more investigation related to the error above. I think it's related to the rook e2e testing. I opened this PR to fix it #55393. It would be good to see if the patch from the PR fixes the error on your side.

Would it be best if I cherry-pick your patch into this PR branch (for now)?

@github-actions github-actions bot added the rook label Jan 31, 2024
@phlogistonjohn
Copy link
Contributor Author

Would it be best if I cherry-pick your patch into this PR branch (for now)?

Since it's so easy to undo if it's a bad approach, that is what I'm going with for now.

@rkachach
Copy link
Contributor

^ @rkachach any way I can get the mgr logs for these runs?

Sorry @phlogistonjohn , I saw your message but I was doing some more investigation related to the error above. I think it's related to the rook e2e testing. I opened this PR to fix it #55393. It would be good to see if the patch from the PR fixes the error on your side.

Would it be best if I cherry-pick your patch into this PR branch (for now)?

Sure 👍, Thanks @phlogistonjohn.

Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@phlogistonjohn
Copy link
Contributor Author

@rkachach ceph rook orchestrator e2e tests passed this time.

@rkachach
Copy link
Contributor

@rkachach ceph rook orchestrator e2e tests passed this time.

awesome, thanks @phlogistonjohn

@synarete synarete added wip-ssharon-test synarete testing and removed wip-ssharon-test synarete testing labels Feb 6, 2024
@phlogistonjohn phlogistonjohn force-pushed the jjm-cephadm-smb-svc-spec branch 2 times, most recently from 51d7691 to d789c33 Compare February 12, 2024 19:14
phlogistonjohn and others added 24 commits March 21, 2024 18:30
The not-a-real-fqdn hostname that the containers got were causing
performance issues joining AD (and running testjoin and winbind).
Define a virtual hostname that can be passed in from the service or
automatically derived from the system's hostname.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Tests that touch this enum fail for me locally but pass in the CI. This
seems to be due to new enum related behavior in Python 3.11.
See: https://blog.pecar.me/python-enum
Instead of fixing it as suggested in the above blog, adding a __str__
method works on all python versions I care to know about.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
In the seemingly never-ending fight against line continuations and just
blatting tons of stuff onto single lines another small victory is won.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
While ceph doesn't enforce sorted imports I prefer them when possible. I
had once sorted these imports but then nvmeof came along an ruined
things. Put nvmeof back in it's place.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Reformat the _service_classes variable so that it uses a multi-line list
with a single item on each line in a more black-ish style that is more
readable (especially if you use code-folding wisely).
Sort the list while we're at it.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
…caps

Refactor get_keyring_with_caps such that the keyring simplification code
is moved into a new function that can be used in other locations.
get_keyring_with_caps will now call the new function to return the
simplified & consistent keyring output.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Reformat the ServiceSpec classes properties KNOWN_SERVICE_TYPES and
REQUIRES_SERVICE_ID. These were previously strings that were converted
to lists via a call to split. With a string there's very little a human
or a tool can do to validate the content. Changing these into proper
lists in the source code brings clarity of intent and the ability to
analyze the code. Because there's no semantic difference what services
are listed where (this means the type could probably be a set - a quest
for another day) I also took the opportunity to sort the contents of the
lists and add some basic comments for what these lists are for.

It also removes the use of (ugly, IMO) line continuations. The downside
is that it makes more total lines, but if that bugs you - use code
folding :-).

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Will be used in a later commit to implement deploying smb instances.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Add the smb service by name or by type to one of the many, many touch
points in the orchestrator and cephadm packages needed to get the
orchestrator aware of smb.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Because the "if-ladder" was only ever assigning a single variable with
a value it can be directly replaced by a dict & dict-lookup which is
much more succinct.
Also take the opportunity to sort the (non-comment) lines as there's
no meaning to the previous order and this makes it easier for a reader
to scan through.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Add a new task function to cephadm.py that sets up a container running
the Samba based domain controller on a node using podman or docker.
Much of the function actually deals with disabling systemd-resolved
because that service conflicts with the DNS server component of the DC.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
There are cases where I want to pass some large-ish strings to ceph
commands executed via cephadm shell. Allow items within the commands
list to be dicts containing a command (as before) and an optional
stdin variable. This change also supports possible future extensions as
well.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Add a cephadm.exclude role that excludes a test node from cluster setup
and related commands. I need this as I have  test node that will be set
up as an AD Domain Controller for testing Samba and do not want that
node to be have *any* other services running on it.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Add a new cephadm.exec task that works similarly to the existing
vip.exec but instead of only considering VIP related string replacements
it uses that templating feature that was recently added to the
cephadm module for generalized string templating.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Add a `role_to_remote` template filter function that has the ability to
map a role name to a remote. Attributes of the remote can then be
used to get the actual node ip or name.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
All `exec`-style function in teuthology appear to have a transformation
block that expands names like `all-roles` and `all-hosts`. With the new
cephadm.exec task that block appeared twice in cephadm.py. This change
removes the duplication by creating an _expand_roles function that
can be called from the command executing functions.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Looking at the code that expands `all-roles` and `all-hosts` there's no
proper error checking for when these values appear but there are >1
top-level roles in the task config. If a user does this it'll fail
but in a somewhat unclear manner. Add a new condition that raises a
clear exception in this case hopefully saving someone future debugging
time.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Start a new subdir under cephadm suite for the new smb service
that cephadm can deploy. Add one new test that checks that a
smb service with domain membership can be deployed and connect
to it with smbclient from the samba client container image.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
To have the standalone (non-AD) server test function similarly to the AD
member server test we need to set a variable for samba client container
command similar to how the AD setup command does it.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
Test minimal SMB deployment over CephFS, using local users (non-AD).
Upon successful deployment run minima smbclient command ('ls') to probe
Samba's share liveness.

Co-authored-by: John Mulligan <jmulligan@redhat.com>
Signed-off-by: Shachar Sharon <ssharon@redhat.com>
Signed-off-by: John Mulligan <jmulligan@redhat.com>
@adk3798
Copy link
Contributor

adk3798 commented Mar 25, 2024

https://pulpito.ceph.com/phlogistonjohn-2024-03-22_03:02:08-orch:cephadm-wip-phlogistonjohn-testing-2024-03-21-1835-distro-default-smithi/

Vast majority is XXX in cluster log stuff, tests that weren't are all failures I've seen before (mds_upgrade_sequence, staggered upgrade with agent, etc.)

No reason to block merging this PR.

@adk3798 adk3798 merged commit c743075 into ceph:main Mar 25, 2024
10 of 13 checks passed
@phlogistonjohn phlogistonjohn deleted the jjm-cephadm-smb-svc-spec branch March 25, 2024 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants