New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
doc: add pre-flight checklist to admin guide #5899
Conversation
9278db3
to
de81421
Compare
I finished up that section and fixed up conflicts. Dropping WIP as this is ready for a review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Just noticed a couple minor things.
doc/guide/admin.rst
Outdated
* - flux-sched | ||
- All nodes | ||
|
||
* - flux-pam (optional) | ||
- All nodes | ||
|
||
* - flux-accounting (optional) | ||
- Leader (management) node only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the scheduler only required on the leader node (modules are only loaded on the leader), and flux-accounting should be installed on all nodes so that users/admins have access to the query tools like flux account
(which can be run on any node and query the rank 0 accounting service) The accounting systemd service should only be enabled on the management node though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the scheduler only required on the leader node
For the system instance, but presumably the system instance launches batch jobs with schedulers on other nodes
flux-accounting should be installed on all nodes so that users/admins have access to the query tools like flux account (which can be run on any node and query the rank 0 accounting service) The accounting systemd service should only be enabled on the management node though.
Oh good point. I hadn't thought about that. Well @cmoussa1 might want to weigh in here. The accounting guide says management node only but that may predate the work to let the clients work remotely.
Isn't the systemd unit enabled by default though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the system instance, but presumably the system instance launches batch jobs with schedulers on other nodes
Oh heh, thinking brain not thinking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the systemd unit enabled by default though?
It seems to be disabled by default (at least in the RPM?) I kind of forget how this works right now, but here's the output from a TOSS login node:
● flux-accounting.service - Flux accounting service
Loaded: loaded (/usr/lib/systemd/system/flux-accounting.service; disabled; vendor preset: disabled)
Active: inactive (dead)
I think vendor preset: disabled
means the service needs to be explicitly enabled with systemctl enable flux-accounting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the accounting guide needs to be updated with some notes regarding the new accounting systemd service that was added a little while back (there is currently an open PR in flux-framework/flux-docs#216 that looks to clarify this). It used to be the case where you could only run flux-accounting commands on the management node, but that is no longer the case, you should be able to run it from anywhere on the cluster.
My understanding was that flux-accounting is typically only installed on the management node, however, since that is where the DB is installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding was that flux-accounting is typically only installed on the management node, however, since that is where the DB is installed.
This is not the case on our TOSS clusters running flux. flux-accounting is installed at least on the login nodes as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake then! Sorry for the confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we just simplify and have all packages installed on all nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Thanks @cmoussa1 for the follow up.
doc/guide/admin.rst
Outdated
=========================================================== | ||
|
||
Flux stores its databases and KVS dumps in the ``statedir`` on the | ||
leader (management) node. Storage consumption depend on usage, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/depend/depends/
Is the Flux network certificate synced? | ||
======================================= | ||
|
||
The network certificate should be identical on all nodes and should | ||
only be readable by the ``flux`` user: | ||
|
||
.. code-block:: console | ||
|
||
$ sudo pdsh -a md5sum /etc/flux/system/curve.cert | dshbak -c | ||
---------------- | ||
test[0-7] | ||
---------------- | ||
1b3c226159b9041d357a924841845cec /etc/flux/system/curve.cert | ||
|
||
$ pdsh -a stat -c '"%U %A"' /etc/flux/system/curve.cert | dshbak -c | ||
---------------- | ||
test[0-7] | ||
---------------- | ||
flux -r-------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add link to the section on cert creation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't link back to the relevant sections on all the other checklist items - do you think that would be a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessary for all of them. This one stood out to me as a "did I completely forget a step?" vs. other ones. Like starting a stopped queue, your simple example is more than enough information to move forward.
Edit: This isn't a big deal either way, just a minor comment.
doc/guide/admin.rst
Outdated
.. code-block:: console | ||
|
||
$ flux run hostname | ||
test1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps at this point it would be useful to validate that the job runs as the correct user, with supplementary groups, etc, e.g.:
$ id
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),114(netdev)
$ flux run -N4 id
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),117(netdev)
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),114(netdev)
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),117(netdev)
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),114(netdev)
Pushed all the suggested changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already approved, but updates LGTM
Thanks! Setting MWP. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5899 +/- ##
==========================================
- Coverage 83.31% 83.31% -0.01%
==========================================
Files 514 514
Lines 82913 82913
==========================================
- Hits 69083 69079 -4
- Misses 13830 13834 +4 |
Problem: a checklist may be helpful to sys admins preparing a new Flux system installation. Add a "Pre-flight Checklist" section to the admin guide.
Problem: the admin guide warns that scales beyond 128 nodes are not regularly tested. I think we're way past that now. Drop the warning.
#5878 suggests that some common sys admin problems be added to the troubleshooting section of the admin guide. Thinking about it a bit, maybe a checklist for a new system would be more useful?
I started this but got stuck thinking about how to check configuration and opened #5898.
Maybe others will have some good ideas for things to add here.