Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: add pre-flight checklist to admin guide #5899

Merged
merged 2 commits into from Apr 25, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented Apr 18, 2024

#5878 suggests that some common sys admin problems be added to the troubleshooting section of the admin guide. Thinking about it a bit, maybe a checklist for a new system would be more useful?

I started this but got stuck thinking about how to check configuration and opened #5898.

Maybe others will have some good ideas for things to add here.

@garlick garlick force-pushed the docs_update branch 2 times, most recently from 9278db3 to de81421 Compare April 23, 2024 18:34
@garlick
Copy link
Member Author

garlick commented Apr 23, 2024

I finished up that section and fixed up conflicts. Dropping WIP as this is ready for a review.

@garlick garlick changed the title WIP: add pre-flight checklist to admin guide doc: add pre-flight checklist to admin guide Apr 23, 2024
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Just noticed a couple minor things.

Comment on lines 877 to 884
* - flux-sched
- All nodes

* - flux-pam (optional)
- All nodes

* - flux-accounting (optional)
- Leader (management) node only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the scheduler only required on the leader node (modules are only loaded on the leader), and flux-accounting should be installed on all nodes so that users/admins have access to the query tools like flux account (which can be run on any node and query the rank 0 accounting service) The accounting systemd service should only be enabled on the management node though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the scheduler only required on the leader node

For the system instance, but presumably the system instance launches batch jobs with schedulers on other nodes

flux-accounting should be installed on all nodes so that users/admins have access to the query tools like flux account (which can be run on any node and query the rank 0 accounting service) The accounting systemd service should only be enabled on the management node though.

Oh good point. I hadn't thought about that. Well @cmoussa1 might want to weigh in here. The accounting guide says management node only but that may predate the work to let the clients work remotely.

https://flux-framework.readthedocs.io/en/latest/guides/accounting-guide.html#installing-software-packages

Isn't the systemd unit enabled by default though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the system instance, but presumably the system instance launches batch jobs with schedulers on other nodes

Oh heh, thinking brain not thinking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the systemd unit enabled by default though?

It seems to be disabled by default (at least in the RPM?) I kind of forget how this works right now, but here's the output from a TOSS login node:

● flux-accounting.service - Flux accounting service
   Loaded: loaded (/usr/lib/systemd/system/flux-accounting.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

I think vendor preset: disabled means the service needs to be explicitly enabled with systemctl enable flux-accounting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the accounting guide needs to be updated with some notes regarding the new accounting systemd service that was added a little while back (there is currently an open PR in flux-framework/flux-docs#216 that looks to clarify this). It used to be the case where you could only run flux-accounting commands on the management node, but that is no longer the case, you should be able to run it from anywhere on the cluster.

My understanding was that flux-accounting is typically only installed on the management node, however, since that is where the DB is installed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding was that flux-accounting is typically only installed on the management node, however, since that is where the DB is installed.

This is not the case on our TOSS clusters running flux. flux-accounting is installed at least on the login nodes as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake then! Sorry for the confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we just simplify and have all packages installed on all nodes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Thanks @cmoussa1 for the follow up.

===========================================================

Flux stores its databases and KVS dumps in the ``statedir`` on the
leader (management) node. Storage consumption depend on usage, the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/depend/depends/

Comment on lines +946 to +953
Is the Flux network certificate synced?
=======================================

The network certificate should be identical on all nodes and should
only be readable by the ``flux`` user:

.. code-block:: console

$ sudo pdsh -a md5sum /etc/flux/system/curve.cert | dshbak -c
----------------
test[0-7]
----------------
1b3c226159b9041d357a924841845cec /etc/flux/system/curve.cert

$ pdsh -a stat -c '"%U %A"' /etc/flux/system/curve.cert | dshbak -c
----------------
test[0-7]
----------------
flux -r--------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add link to the section on cert creation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't link back to the relevant sections on all the other checklist items - do you think that would be a good idea?

Copy link
Member

@chu11 chu11 Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary for all of them. This one stood out to me as a "did I completely forget a step?" vs. other ones. Like starting a stopped queue, your simple example is more than enough information to move forward.

Edit: This isn't a big deal either way, just a minor comment.

.. code-block:: console

$ flux run hostname
test1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps at this point it would be useful to validate that the job runs as the correct user, with supplementary groups, etc, e.g.:

$ id
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),114(netdev)
$ flux run -N4 id
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),117(netdev)
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),114(netdev)
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),117(netdev)
uid=1000(pi) gid=1000(pi) groups=1000(pi),27(sudo),114(netdev)

@garlick
Copy link
Member Author

garlick commented Apr 24, 2024

Pushed all the suggested changes.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already approved, but updates LGTM

@garlick
Copy link
Member Author

garlick commented Apr 24, 2024

Thanks! Setting MWP.

Copy link

codecov bot commented Apr 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.31%. Comparing base (d8b86fd) to head (00e7a51).
Report is 14 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5899      +/-   ##
==========================================
- Coverage   83.31%   83.31%   -0.01%     
==========================================
  Files         514      514              
  Lines       82913    82913              
==========================================
- Hits        69083    69079       -4     
- Misses      13830    13834       +4     

see 12 files with indirect coverage changes

Problem: a checklist may be helpful to sys admins preparing a new
Flux system installation.

Add a "Pre-flight Checklist" section to the admin guide.
Problem: the admin guide warns that scales beyond 128 nodes
are not regularly tested.

I think we're way past that now.
Drop the warning.
@mergify mergify bot merged commit fa7a09a into flux-framework:master Apr 25, 2024
33 checks passed
@garlick garlick deleted the docs_update branch April 28, 2024 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants