# Administrator Guide
This administrator guide introduces some essential admin commands and configuration files. It demonstrates frequently used operations with examples that you can run directly in this notebook. The examples require more compute nodes, so scale the compute containers up to 4 first.
```
podman compose -f compose.dev.yml up -d --scale compute=4 --no-recreate
```
If not all 4 compute nodes show up in `sinfo`, try restarting the individual compute containers.

### Useful admin commands
| Command | Usage |
|---------|-------|
| scontrol | Query the status and properties of Slurm objects and change configuration at runtime. |
| sacctmgr | Management interface of the Slurm accounting service. Clusters, accounts, users, federations, Quality of Service (QOS), etc. are managed here. |
| sacct | Query job accounting records. It makes no configuration changes, but it is a good tool for observing usage. |

### Config Files
As a general rule, a configuration change that must persist across reboots, service restarts, and Slurm reconfiguration has to be written to these files. You can sync the files across the cluster, put them in a location shared by all nodes, or use the configless option.  
| Conf file | Usage |
|-----------|-------|
| slurm.conf | Main Slurm configuration, read at start-up. |
| slurmdbd.conf | slurmdbd configuration, read at start-up. |
| cgroup.conf | cgroup configuration. |
| gres.conf | "Generic resource" (GRES) configuration. GPUs are defined here. |
| topology.conf | Describes the network topology, which helps Slurm select a better combination of nodes when allocating resources. |

After updating these files across the cluster, either run the `reconfigure` sub-command or restart the daemons to make the changes effective. Note that some changes require a daemon restart. 

In [None]:
sudo sacctmgr reconfigure
sudo scontrol reconfigure

In [None]:
set -x
# restart services on different nodes using ansible
ansible-inventory --graph 

# restart slurmdbd-host
ansible -m systemd -a "name=slurmdbd  state=restarted enabled=yes" slurmdbd_host

# restart slurmctld-host
ansible -m systemd -a "name=slurmctld state=restarted enabled=yes" slurmctld_host

# restart slurmd-host
ansible -m systemd -a "name=slurmd    state=restarted enabled=yes" slurmd_host

# restart slurmrestd-host
ansible -m systemd -a "name=slurmrestd state=restarted enabled=yes" slurmrestd_host

set +x

## Create Slurm Account and User automatically at first login
It is good practice to set `AccountingStorageEnforce=associations,limits,qos`, because many limits and restrictions can then be configured through the Slurm accounting database. With these flags set, every user must have an account and a user record in the Slurm accounting database before they can submit jobs.  
Creating them is just one extra step when creating the Linux account if your cluster uses local accounts or a directory service dedicated to the cluster. But if your cluster is connected to the organization's AD, LDAP, FreeIPA, etc., where there are many accounts that change from time to time, creating Slurm accounts and users by hand becomes troublesome.  
One way to automate this is the pam_exec.so module, which can be configured to execute a command or script every time a user logs in. The difference from `/etc/profile.d/` scripts is that those are sourced by the user's shell session, so they are bound to what the user is allowed to do, and a normal user certainly cannot create a Slurm account and Slurm user for themselves. pam_exec.so, on the other hand, executes the script with system/root privileges before the user's session starts, and hence it can create the Slurm account and user.
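
The script shipped with this lab is printed by the cell further down. Just to give a rough idea of what such a hook can look like, here is a minimal sketch. It assumes that pam_exec.so exports the login name as `PAM_USER` and that one Slurm account per primary Unix group is wanted (the same `account=$(id -gn)` convention used later in this guide). The function names `run` and `ensure_slurm_user` and the `DRY_RUN` switch are made up for this example.

```bash
#!/bin/bash
# Hypothetical sketch of a pam_exec hook; NOT the script shipped in /etc/slurm.

# Run a command, or only print it when DRY_RUN=1 (handy for testing).
run() {
    if [[ "${DRY_RUN:-0}" == 1 ]]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create the Slurm account (named after the group) and the user association.
# Usage: ensure_slurm_user <user> <group>
ensure_slurm_user() {
    local user=$1 group=$2
    [[ -n "${user}" && -n "${group}" ]] || return 1
    # Adding an object that already exists is expected to change nothing,
    # so running this at every login should be harmless.
    run sacctmgr -i add account "${group}"
    run sacctmgr -i add user "${user}" account="${group}"
}

# Only act when invoked by PAM (PAM_USER is set). Failures are ignored so that
# a Slurm outage cannot block logins.
if [[ -n "${PAM_USER:-}" ]]; then
    ensure_slurm_user "${PAM_USER}" "$(id -gn "${PAM_USER}")" || true
fi
```

Running `DRY_RUN=1 ensure_slurm_user alice lyoko` prints the two `sacctmgr` commands without touching the accounting database, which is a cheap way to check the logic before wiring the script into PAM.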

In [None]:
# pam config
grep pam_exec.so /etc/pam.d/*

In [None]:
# account and user creation script
ls -l /etc/slurm/create-account-user.sh
cat /etc/slurm/create-account-user.sh

## Admin level
If you want to perform admin actions via scontrol and sacctmgr without escalating privileges, or you trust someone with Slurm administration but not with the rest of your system, you can promote a Slurm user to one of the following administrator levels.  
| Admin level | Description |
|-------------|-------------|
| [Admin](/doc/user_permissions.html#admin) | Can run scontrol and sacctmgr commands as if they were root. |
| [Operator](/doc/user_permissions.html#operator) | Can modify any Slurm database object. |
| [Coordinator](/doc/user_permissions.html#coord) | Special user role tied to a Slurm account; able to manage data and objects for all users of that account. |

The `PrivateData` parameter in slurm.conf can restrict the data readable by users. However, users with an admin level have at least the permission to read/write the following objects, regardless of the PrivateData restriction.  

|   | Admin | Operator | Coordinator |
|---|-------|----------|-------------|
| Jobs | Read-Write | Read-Write | Read-Write (specific account) |
| Reservation | Read-Write | Read-Write |  |
| Partition | Read-Write | Read-Only |  |
| Node | Read-Write | Read-Only |  |

Reference:
* [Scontrol - Authorization](/doc/scontrol.html#SECTION_AUTHORIZATION)
* [User Permission](/doc/user_permissions.html)

### Promoting user to be coordinator of an account

In [None]:
# Promote current user to coordinator
sudo sacctmgr -i create coordinator account=$(id -gn) name=$(whoami)
sacctmgr show account WithCoord

In [None]:
# delete coordinator
sacctmgr -i delete coordinator account=$(id -gn) name=$(whoami)
sudo sacctmgr show account WithCoord

### Promoting user to be Operator

In [None]:
sudo sacctmgr -i modify user $(whoami) set AdminLevel=operator
sacctmgr show user user=$(whoami)

In [None]:
# remove admin level, become normal slurm user
sudo sacctmgr -i modify user $(whoami) set AdminLevel=None
sacctmgr show user user=$(whoami)

### Promoting user to be Administrator

In [None]:
sudo sacctmgr -i modify user $(whoami) set AdminLevel=admin
sacctmgr show user user=$(whoami)

In [None]:
# eg. draining nodes
scontrol update nodename=ALL state=drain reason="just practice"
sinfo --N --long 

In [None]:
# eg. putting them back
scontrol update nodename=ALL state=resume
sinfo --N --long

## Modify/extend time limit of a running job
If a time-limited job is running slower than expected and is approaching its time limit, you as admin have the power to extend it.  
(If you are a user, now you know this is possible, but please don't bother your admin with it unless the job is absolutely critical.)

In [None]:
# submit a 10 min job
jobid=$(sbatch --ntasks=1 --parsable --time 00:10:00 endless-checksum-mpi.sh)
scontrol show job ${jobid}

In [None]:
# extend the job to 20 min
scontrol update JobId=${jobid} TimeLimit=00:20:00
scontrol show job ${jobid}

In [None]:
# extend the job for 10 more min
scontrol update JobId=${jobid} TimeLimit+=00:10:00
scontrol show job ${jobid}

In [None]:
# removing time limit
scontrol update JobId=${jobid} TimeLimit=INFINITE
scontrol show job ${jobid}

In [None]:
# remove the job
scancel ${jobid}

## Drain, Resume nodes
Changing node state is a very common operation, e.g. when you have identified a faulty node or want to perform maintenance on a node.  
If you don't want a node to accept new jobs, set it to drain; you must provide a reason. The node then stops accepting new jobs. Its state is "draining" while jobs are still running on it and "drained" once no job is left. To put the node back into service, set its state to "resume".
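
Before starting maintenance you usually want to wait until nodes have moved from "draining" to "drained". A minimal sketch of that check follows; it assumes the node states arrive on stdin as `<node> <state>` pairs, for example from `sinfo -N --noheader -o "%N %T"`. The helper names `nodes_in_state` and `all_drained` are made up for this example.

```bash
# Print the nodes (one per line, deduplicated) whose state matches $1.
# Reads "<node> <state>" pairs on stdin.
nodes_in_state() {
    awk -v want="$1" '{
        state = tolower($2)
        sub(/[^a-z]+$/, "", state)   # drop trailing flag characters such as "*"
        if (state == want) print $1
    }' | sort -u
}

# Succeeds (exit 0) when no node is still "draining", i.e. running jobs have finished.
# Reads the same "<node> <state>" input on stdin.
all_drained() {
    [[ -z "$(nodes_in_state draining)" ]]
}
```

On the lab cluster this could be used as `sinfo -N --noheader -o "%N %T" | all_drained && echo "safe to start maintenance"`.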

In [None]:
# submit a dummy job
jobid=$( sbatch --ntasks=1 --parsable endless-checksum-mpi.sh )

In [None]:
# wait for the job to start
squeue -l -j ${jobid}

In [None]:
# Drain all nodes (NodeName=ALL) and see the difference in states
scontrol update NodeName=ALL State=drain Reason="drain demo"
sinfo --N --long

In [None]:
# Cancel the job
scancel ${jobid}

In [None]:
# Observe that the draining node becomes drained
sinfo --N --long

In [None]:
# put the first drained node on the list, if any, back into production
node=$(sinfo --noheader --N --long --state=drain | awk '{print $1}' | head -1)
[[ -n ${node} ]] && scontrol update nodename=${node} state=resume
sinfo --N --long

## Create, Remove, Change state of Partition 
Using the scontrol command, you can create, remove, and change the state of partitions at runtime. However, these operations are ephemeral; you need to put the equivalent settings into slurm.conf to make them persistent. 

### Create Partition
To create a partition, you need to specify at least its name. You can also include some partition properties, and you can always modify them afterwards.  

In [None]:
# eg. creating a new partition named DEV, with the maximum time limit set to 1 hr
# just pick a node
node=$(sinfo --noheader --N --long | awk '{print $1}' | head -1)
scontrol create PartitionName=DEV Nodes=${node} MaxTime=01:00:00
scontrol show partition DEV
sinfo

### Change state and modify property of partition
A partition has 4 possible states:  

|   | Accepting job | Rejecting job |
|---|---------------|---------------|
| Dispatching job | UP   | DRAIN    |
| Holding job     | DOWN | INACTIVE |

Changing the state works just like changing other properties, using the "scontrol update" subcommand.
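
If you script around partition states, the table above can be written down as a small lookup. This is only a sketch that restates the table; the function name `partition_state_info` is made up for this example.

```bash
# Describe what a partition state means for submitting and for starting jobs.
# Usage: partition_state_info <UP|DOWN|DRAIN|INACTIVE>   (case-insensitive)
partition_state_info() {
    case "${1^^}" in
        UP)       echo "accepts new jobs, dispatches queued jobs" ;;
        DOWN)     echo "accepts new jobs, holds queued jobs" ;;
        DRAIN)    echo "rejects new jobs, dispatches queued jobs" ;;
        INACTIVE) echo "rejects new jobs, holds queued jobs" ;;
        *)        echo "unknown partition state: $1" >&2; return 1 ;;
    esac
}
```

For example, `partition_state_info DOWN` prints `accepts new jobs, holds queued jobs`, which is the behaviour the next cells demonstrate.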

In [None]:
# Set DEV state to DOWN and submit a job
scontrol update PartitionName=DEV state=DOWN
sbatch --ntasks=1 --parsable --partition=DEV endless-checksum-mpi.sh
# Observe that the job has been submitted, but won't execute
squeue --partition DEV

In [None]:
# Set DEV state to INACTIVE, then to DRAIN; job submission fails in both states
# INACTIVE
scontrol update PartitionName=DEV state=INACTIVE
sbatch --ntasks=1 --parsable --partition=DEV endless-checksum-mpi.sh

# DRAIN
scontrol update PartitionName=DEV state=DRAIN
sbatch --ntasks=1 --parsable --partition=DEV endless-checksum-mpi.sh

In [None]:
# both submissions failed, but the job submitted earlier is now running
squeue --partition DEV

In [None]:
# Allow oversubscription in the partition, and try to overwhelm it
scontrol update PartitionName=DEV OverSubscribe=FORCE State=UP DefMemPerNode=1024
# clear all jobs first
squeue --partition DEV --noheader | awk '{print $1}' | xargs scancel
for i in $(seq 5) ; do
sbatch --ntasks=1 --parsable --partition=DEV endless-checksum-mpi.sh
sleep 1
done

In [None]:
squeue --partition DEV

### Delete partition
Before deleting a partition, you must clear it (no running or pending jobs).

In [None]:
# Set partition to Drain
scontrol update PartitionName=DEV State=DRAIN
# cancel all jobs in the partition
squeue --partition DEV --noheader | awk '{print $1}' | xargs scancel
scontrol delete PartitionName=DEV

## Move node around Partitions
In a production cluster, you may have one partition serving critical jobs and other partitions serving less important jobs. When a node goes down, you may need to move nodes around to make sure the mission-critical partition has enough resources.  
Unfortunately, "scontrol" doesn't support "+=" and "-=" operations here, so we need to deal with the whole node list every time. We can define some shell functions to help, though.

In [None]:
# Just one way of doing it

# bash func to add nodes to a partition
# Usage: partAddNodes <partition name> <list of nodes>
partAddNodes () {
    # Check if partition exist
    [[ -n $(scontrol show partition ${1} --oneliner | grep "PartitionName=${1}" ) ]] || return 1
    # Get current node list
    current_nodelist=$(sinfo --N --long --partition ${1} --noheader | awk '{print $1}')
    new_nodelist=$( echo ${current_nodelist} $(scontrol show hostname ${2}) | tr ' ' '\n' | sort | uniq | paste -s -d",")
    # Update partition node list
    scontrol update PartitionName=${1} Nodes=${new_nodelist}
}

# bash func to remove nodes from a partition
# Usage: partDelNodes <partition name> <list of nodes>
partDelNodes () {
    # Check if partition exist
    [[ -n $(scontrol show partition ${1} --oneliner | grep "PartitionName=${1}" ) ]] || return 1
    # Get current node list
    current_nodelist=$(sinfo --N --long --partition ${1} --noheader | awk '{print $1}')
    remove_nodes=$(scontrol show hostname ${2})
    new_nodelist=$( echo ${current_nodelist} ${remove_nodes} ${remove_nodes} | tr ' ' '\n' | sort | uniq -u | paste -s -d",")
    # Update partition node list
    scontrol update PartitionName=${1} Nodes=${new_nodelist}
}

# bash function to move nodes from A to B
# Usage: moveNodesToPart <src partition> <dest partition> <list of nodes>
moveNodesToPart (){
    partAddNodes ${2} ${3} && partDelNodes ${1} ${3}
}
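
The node-list arithmetic in `partAddNodes` and `partDelNodes` relies on two small `sort`/`uniq` idioms. They are isolated here on plain comma-separated lists so they can be tried without a cluster. The helper names `list_union` and `list_subtract` are made up for this sketch.

```bash
# Union of two comma-separated lists: concatenate, then drop duplicates.
list_union() {
    printf '%s\n' "$1" "$2" | tr ',' '\n' | sed '/^$/d' | sort -u | paste -s -d, -
}

# Subtract list $2 from list $1. Every entry of $2 is fed in twice, so it
# appears at least twice and "uniq -u" (keep lines seen exactly once) drops it.
list_subtract() {
    printf '%s\n' "$1" "$2" "$2" | tr ',' '\n' | sed '/^$/d' | sort | uniq -u | paste -s -d, -
}
```

For example, `list_subtract c1,c2,c3 c2` prints `c1,c3`. Note that the subtraction trick assumes the first list contains no duplicates of its own, because `uniq -u` would silently drop those entries as well.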

In [None]:
# Setup Part_A and Part_B for example
scontrol create PartitionName=Part_A Nodes=All
scontrol create PartitionName=Part_B
sinfo --long

In [None]:
#Move first node in Part_A to Part_B
moveNodesToPart Part_A Part_B $(sinfo --noheader --N --long --partition Part_A | awk '{print $1}' | head -1)
sinfo --long

In [None]:
# delete demo partitions
scontrol delete PartitionName=Part_A
scontrol delete PartitionName=Part_B

## Floating Partition
Instead of having your operators actively swap nodes, there is another way to maintain a certain number of healthy nodes in the mission-critical partition: a [floating partition](/doc/qos.html#partition). You can assign ALL suitable nodes to the mission-critical partition, possibly sharing some nodes with other partitions, and then define a Quality of Service (QOS) to limit the number of nodes it can use.  
For example, suppose we include all nodes in the mission-critical partition PROD, share 3 of them with the development partition DEV, and allow PROD to use at most 5 nodes at a time. When all nodes are healthy, PROD will not use more than the first 5 nodes. But when some of the first 5 nodes fail, PROD can automatically "steal" nodes from the DEV partition and keep running at 5-node capacity. If the priority factors are set up properly, the DEV partition suffers from the node failure instead of the PROD partition.  
![floating-partition](floating-partition.drawio.svg)

In [None]:
# Make sure "AccountingStorageEnforce" includes qos
scontrol show config | grep -iE ^AccountingStorageEnforce

### Create Floating Partition

In [None]:
# Create partition QoS
sacctmgr -i add qos qos_prod set GrpTres=node=2

In [None]:
# Create partition PROD and DEV.
# PROD get all nodes
scontrol create PartitionName=PROD MaxTime=00:05:00 QOS=qos_prod Nodes=ALL
# DEV share the last 2 node with PROD
scontrol create PartitionName=DEV  MaxTime=00:05:00 Nodes=$(sinfo --noheader --N --long --partition PROD | awk '{print $1}' | tail -2 | paste -s -d",")
# show partitions
sinfo --long

### Simulate Normal Case
Let's submit three 1-node jobs to PROD and one 1-node job to DEV.

In [None]:
# 3 jobs to PROD
seq 3 | xargs -i sbatch --nodes=1 --ntasks-per-node=2 --parsable --partition=PROD endless-checksum-mpi.sh
# 1 job to DEV
seq 1 | xargs -i sbatch --nodes=1 --ntasks-per-node=2 --parsable --partition=DEV endless-checksum-mpi.sh

In [None]:
# check job queue
squeue -la --sort=i
sinfo --long --partition=PROD,DEV

Note that the 3rd job in PROD is pending with reason `QOSGrpNodeLimit` despite there being 1 idle node in the partition.  
(By the way, this is one downside of a floating partition if normal users can see partition node states: they might wonder why their job is not starting despite idle nodes.)

In [None]:
# clear all jobs
squeue --noheader --partition PROD,DEV | awk '{print $1}' | xargs scancel

### Simulate Node Failure
Now we simulate a node failure by draining a node in PROD, and then submit some jobs.

In [None]:
# Draining first node in PROD
scontrol update NodeName=$(sinfo --N --long --noheader --partition PROD | awk '{print $1}' | head -1) State=Drain Reason="Node Failure"
sinfo --long --partition=PROD,DEV

In [None]:
# 3 jobs to PROD
seq 3 | xargs -i sbatch --nodes=1 --ntasks-per-node=2 --parsable --partition=PROD endless-checksum-mpi.sh
# 2 jobs to DEV
seq 2 | xargs -i sbatch --nodes=1 --ntasks-per-node=2 --parsable --partition=DEV endless-checksum-mpi.sh

In [None]:
# check job queue
squeue -la --sort=i
sinfo --long --partition=PROD,DEV

PROD is able to "steal" a node from DEV to maintain its 2-node capacity. As a result, it is partition DEV that suffers from the node failure, instead of the mission-critical partition PROD. 

In [None]:
# Clean up
squeue --noheader --partition PROD,DEV | awk '{print $1}' | xargs scancel
# resume node
scontrol update NodeName=$(sinfo --N --long --noheader --partition PROD --state drain | awk '{print $1}' | paste -s -d",") State=resume
# delete partition 
scontrol delete PartitionName=PROD
scontrol delete PartitionName=DEV
# delete QoS
sacctmgr -i delete qos name=qos_prod

## Create and Manage Reservation
There are many options for creating a reservation. For details, please refer to these 2 documents:
1. [scontrol: reservation](/doc/scontrol.html#SECTION_RESERVATIONS---SPECIFICATIONS-FOR-CREATE,-UPDATE,-AND-DELETE-COMMANDS)
2. [Advanced Resource Reservation Guide](/doc/reservations.html)

### Basic reservation for running job

In [None]:
# reserve 1 node in debug for 1 hr, starting 5 min from now, for yourself
resv_name=$(whoami)_resv_1
scontrol create reservationname=${resv_name} user=$(whoami) partition=debug nodecnt=1 duration=60 starttime=$(date --date "now + 5 min" +"%FT%T" )
scontrol show reservation ${resv_name}
sinfo --reservation

In [None]:
# submit job using the reservation
jobid=$(sbatch --nodes=1 --ntasks-per-node=2 --parsable --time 00:10:00 --partition debug --reservation ${resv_name} endless-checksum-mpi.sh)
scontrol show job ${jobid}

In [None]:
# the job will start once the reservation become active
sinfo --reservation
squeue -la -j ${jobid}

In [None]:
# without special flags, a reservation can only be made when the resources are available, so this should fail
scontrol create ReservationName=fail_resv user=$(whoami) partition=debug \
    nodes=ALL duration=60 starttime=$(date --date "now + 5 min" +"%FT%T" )

# remove the reservation just in case
scontrol delete ReservationName=fail_resv

In [None]:
scancel ${jobid}
scontrol delete ReservationName=${resv_name}

#### Periodic Reservation
If you need the reservation to repeat, you can use these flags:
* [Daily](/doc/scontrol.html#OPT_DAILY)
* [Hourly](/doc/scontrol.html#OPT_HOURLY)
* [Weekday](/doc/scontrol.html#OPT_WEEKDAY)
* [Weekend](/doc/scontrol.html#OPT_WEEKEND)
* [weekly](/doc/scontrol.html#OPT_WEEKLY)

In [None]:
# reserve 5 cores for 5 min, repeating hourly, accessible by account lyoko
resv_name=lyoko_hourly_5core
scontrol create ReservationName=${resv_name} \
    flag=hourly account=lyoko partition=debug CoreCnt=5 \
    duration=5 starttime=$(date --date "now + 1 min" +"%FT%T") 
scontrol show reservation ${resv_name}

In [None]:
# keep watching, and observe that the reservation repeats after it ends
sinfo --reservation

In [None]:
# delete reservation
scontrol delete ReservationName=${resv_name}

#### Magnetic and Flexible Reservation
- A [magnetic](/doc/reservations.html#magnetic) reservation is attached to a job automatically when suitable; users don't need to specify the reservation name.
- A [flexible](/doc/reservations.html#flex) reservation allows a job to use more resources than reserved when available, e.g. use more cores or run beyond the reserved time.

In [None]:
resv_name=jeremie_mag_flex
start_time=$(date --date "now + 1 min" +"%FT%T")
scontrol create ReservationName=${resv_name} \
    flag=magnetic,flex account=lyoko partition=debug Nodes=ALL \
    duration=10 starttime=${start_time}
scontrol show reservation ${resv_name}

# submit a job that should use the reservation automatically
jobid=$(sbatch --nodes=1 --ntasks-per-node=2 --parsable --time 00:20:00 --partition debug --begin ${start_time} endless-checksum-mpi.sh)
scontrol show job ${jobid}

In [None]:
# monitor until the reservation becomes active; the job should then start. Keep watching
sinfo --reservation
squeue -la -j ${jobid}
scontrol show job ${jobid} | grep -i reservation

Observations:
1. When the reservation becomes active, it is attached to the job automatically.
2. The job time limit is longer than the reservation duration. The job is allowed to use the reservation because of the FLEX flag.
3. Once the reservation ends, it is detached from the job.

In [None]:
# once finish, clean up job and reservation
scancel ${jobid}
scontrol delete ReservationName=${resv_name}

#### Maintenance Reservation
Instead of draining the nodes, setting up a maintenance reservation is a more graceful way of clearing the cluster for maintenance. Say you have scheduled a maintenance window. If you start draining the cluster at the start of the window, you could waste a large part of the window waiting for jobs to finish, or you have to cancel them. If you start draining before the window, you waste computing power that could have run small jobs able to complete before the window starts.  
If you create a maintenance reservation instead, the reservation blocks any job that won't finish before the reservation starts, and you can schedule it long before the maintenance begins. One thing to be careful about: if your cluster runs lots of unlimited-time jobs, those jobs will not be able to start once the reservation is placed (because they overlap with it), so you should consider another method of clearing the cluster for maintenance.  
The flags used for creating a maintenance reservation are [MAINT](/doc/scontrol.html#OPT_MAINT) and [IGNORE_JOBS](/doc/scontrol.html#OPT_IGNORE_JOBS). MAINT allows the reservation to overlap with other reservations. IGNORE_JOBS allows the reservation to overlap with currently running jobs. Together they basically allow the reservation to be created regardless.
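
The rule "a job may start only if it would finish before the reservation starts" boils down to simple arithmetic. The sketch below is a simplification for illustration, not how the Slurm scheduler is implemented. It assumes GNU `date` (already used throughout this notebook); the function name `fits_before_maintenance` and its optional third argument are made up for this example.

```bash
# Succeeds (exit 0) if a job starting "now" with the given time limit would end
# no later than the reservation start.
# Usage: fits_before_maintenance <limit in minutes> <reservation start> [now as epoch seconds]
#   <reservation start> is anything GNU date --date accepts, e.g. "2030-01-01T08:00:00"
fits_before_maintenance() {
    local limit_min=$1 resv_start=$2 now=${3:-$(date +%s)}
    local start_epoch
    start_epoch=$(date --date "${resv_start}" +%s) || return 2
    (( now + limit_min * 60 <= start_epoch ))
}
```

For example, `fits_before_maintenance 30 "now + 1 hour"` succeeds, while `fits_before_maintenance 90 "now + 1 hour"` fails. An unlimited-time job has no finite limit to plug in, which is why such jobs can never start once the reservation exists.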

In [None]:
# start an unlimited time job
jobid=$(sbatch --nodes=1 --ntasks-per-node=2 --parsable endless-checksum-mpi.sh)
echo wait for job ${jobid} to start
while [[ -z $(squeue -j ${jobid} --noheader --state=running) ]] ; do 
    sleep 5 
done
squeue -j ${jobid} -la

# create a dummy reservation
scontrol create ReservationName=dummy_resv Account=lyoko NodeCnt=2 starttime=$(date --date "now + 20 sec " +"%FT%T") duration=60 

# create a maintenance reservation that overlaps with both the job and the dummy reservation
scontrol create ReservationName=maint_resv flags=MAINT,IGNORE_JOBS User=root Nodes=All starttime=$(date --date "now + 10 sec " +"%FT%T") duration=60

# create an unlimited job after the reservation has been created
jobid2=$(sbatch --nodes=1 --ntasks-per-node=2 --parsable endless-checksum-mpi.sh)

In [None]:
squeue -la
sinfo -T

The first unlimited-time job will keep running until it finishes; if there are not many jobs like this, just handle them case by case. Note that the second unlimited-time job is not able to start because of the maintenance reservation. 

In [None]:
# clean up jobs and reservations
scancel ${jobid} ${jobid2}
scontrol delete ReservationName=dummy_resv
scontrol delete ReservationName=maint_resv

## Accounting
The `sacct` command provides many job metrics that are useful for analysis and report generation.  
ref: [sacct - manpage](/doc/sacct.html)

In [None]:
set -x
# job history since midnight (default)
sacct

# job history of a given range, e.g. from 3 hr ago to 1 hr ago
sacct --starttime $(date --date "now - 3 hour" +"%FT%T") --endtime $(date --date "now - 1 hour" +"%FT%T")

# don't show job steps
sacct --allocation

# show job step average resource usage
sacct --format JobID,JobName,State,Partition,Account,AllocTRES,AveCPU,AveCPUFreq,AvePages,AveRSS

# show job step peak resource usage
sacct --format JobID,JobName,State,Partition,Account,AllocTRES,MaxPages,MaxPagesNode,MaxRSS,MaxRSSNode,MaxVMSize,MaxVMSizeNode

sacct --format JobID,JobName,State,Partition,Account,AllocTRES,TRESUsageInAve%40,TRESUsageInMax%40
set +x

In [None]:
# if a job's runtime extends beyond the specified range, use the --truncate flag to clip it to the range and avoid double counting
sacct --truncate --starttime $(date --date "now - 10 min" +"%FT%T") --format JobID,JobName,State,Partition,User,TRESUsageInMax%60

In [None]:
# By default you may see only your own jobs; use --allusers to check the history of other users.
# What you can see is restricted by the PrivateData attribute
sacct --allusers --allocation --format JobID,JobName,State,Partition,User,TRESUsageInMax%60

## Parsable Command Output
squeue, sinfo, scontrol, and sacct provide a `--json` flag for machine-readable output, and sacct additionally offers `--parsable`. This is useful when developing scripts around the Slurm cluster.
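
As a small illustration of scripting on top of parsable output, the sketch below sums elapsed seconds per user from pipe-delimited `User|ElapsedRaw` lines, such as those produced by `sacct -X --parsable2 --noheader --format User,ElapsedRaw`. The function name `usage_per_user` is made up for this example.

```bash
# Sum elapsed seconds per user.
# Reads pipe-delimited "User|ElapsedRaw" lines on stdin and prints
# "<user> <total seconds>" sorted by user name. Lines without a user are skipped.
usage_per_user() {
    awk -F'|' '$1 != "" { total[$1] += $2 }
               END { for (u in total) printf "%s %d\n", u, total[u] }' | sort
}
```

On the lab cluster this could be run as `sacct --allusers -X --parsable2 --noheader --format User,ElapsedRaw | usage_per_user`.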

In [None]:
sinfo --json

In [None]:
squeue --json

In [None]:
scontrol show partition --json
scontrol show node --json
scontrol show job --json

In [None]:
sacct --json --starttime $(date --date "now - 20 min" +"%FT%T") --endtime $(date --date "now - 10 min" +"%FT%T")

## User, Access Control, Authentication by Slurm
This section introduces 3 useful tools provided by Slurm: pam_slurm_adopt.so, nss_slurm, and auth/slurm, a new AuthType. 3 new users are created under group `matrix` for this demo.
  * `smith`: created everywhere
  * `trinity`: created on this node (client) and 2 master nodes only
  * `neo`: created on this node (client) only

In [None]:
# Create users
cat playbooks/matrix-user-create.yaml
ansible-playbook --fork=1 playbooks/matrix-user-create.yaml
id smith trinity neo
getent passwd smith trinity neo
sudo loginctl enable-linger smith trinity neo

### `pam_slurm_adopt.so` - restricting user access to compute node
To make sure users do not ssh directly to a compute node and start work without submitting it to Slurm, it is common practice to deny users access to a compute node on which they have no running job. This is achieved by using pam_access.so and pam_slurm_adopt.so together.  
First, make sure the pam_slurm_adopt.so PAM module is installed. For RHEL/Rocky, the package slurm-pam_slurm is needed; for Debian, the package slurm-smd-libpam-slurm-adopt is needed. Then you need to modify 2 files: `/etc/security/access.conf` and `/etc/pam.d/sshd`.  
ref: [pam_slurm_adopt - Administrative Access Configuration](/doc/pam_slurm_adopt.html#admin_access)  
`/etc/pam.d/sshd`:
```
...
account sufficient pam_access.so
account ...
account ...
-account required pam_slurm_adopt.so
...
```
`-account required pam_slurm_adopt.so` denies ssh access to any user without a running job on the host; this line should be added after all other "account" lines. However, adding this line alone means that not even an administrator can log in without a running job. To resolve this, pam_access.so is used: with `account sufficient pam_access.so` placed in front of the other "account" lines, users and groups allowed by this module bypass pam_slurm_adopt.so. Let's treat all users of group lyoko as administrators and let them bypass `pam_slurm_adopt.so`.

`/etc/security/access.conf`:
```
+:(lyoko):ALL
-:ALL:ALL
```
The first line allows members of group lyoko to log in from anywhere. The second line denies everyone else in pam_access.so, so they continue down the PAM chain to pam_slurm_adopt.so. Now let's test it using your own user (group lyoko) and user smith (group matrix).

In [None]:
# Content of /etc/pam.d/sshd
grep -vE "^[#]" /etc/pam.d/sshd

# ensure this file is sent to compute node
ansible -m copy -a "src=/etc/pam.d/sshd dest=/etc/pam.d/sshd" slurmd_host

In [None]:
# Content of /etc/security/access.conf
grep -vE "^[#]" /etc/security/access.conf

# ensure this file is sent to compute node
ansible -m copy -a "src=/etc/security/access.conf dest=/etc/security/access.conf" slurmd_host

In [None]:
# try ssh to a compute node as yourself and as smith, without a job
set -x
one_compute_node=$(scontrol show node --oneliner | awk '{print $1}' | cut -c10- | head -n1)
echo ${one_compute_node}
ssh ${one_compute_node} whoami
sudo -i -u smith ssh ${one_compute_node} whoami
set +x

In [None]:
# submit a job as smith
set -x
one_compute_node=$(scontrol show node --oneliner | awk '{print $1}' | cut -c10- | head -n1)
# submit a job as smith
smith_job_id=$(sudo -i -u smith sbatch -w ${one_compute_node} --parsable --output=/dev/null -D ~smith --time 00:10:00 ~smith/tutorials/helloworld.sh)

# wait 10 sec for the job to start and then try ssh as smith
sleep 10
sudo squeue -j ${smith_job_id}
sudo -i -u smith ssh ${one_compute_node} whoami

# kill that job
sudo scancel ${smith_job_id}

# try login again after 30 sec
sleep 30
sudo -i -u smith ssh ${one_compute_node} whoami
set +x

### `nss_slurm` - Providing user information by slurm  
reference: [Name Service Caching Through NSS Slurm](/doc/nss_slurm.html)  
`nss_slurm` is a name service cache that provides user information, including passwd and group information, within a running job. This is very useful at scale: it avoids spikes of queries overloading your LDAP/AD/NIS server when a large parallel job starts, which could crash either your authentication service or your job. However, it is not meant to replace your proper authentication service, only to reduce reliance on it and improve the stability of the cluster as a whole.  
To enable this feature, the library libnss_slurm.so needs to be installed. For RHEL/Rocky it is already included in the core slurm package; for Debian, you need to install the package `slurm-smd-libnss-slurm`.

**Master Node Configuration**  
On the master node side, we need to make sure this feature is enabled. Check whether LaunchParameters includes the `enable_nss_slurm` flag. If not, modify slurm.conf and restart slurmctld.

In [None]:
scontrol show config | grep LaunchParameters

**Compute Node Configuration**  
On the compute nodes, you need to add `slurm` as a source for passwd and group in `/etc/nsswitch.conf`. Put it in front of network sources such as sss and ldap; otherwise it cannot alleviate their load. 
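
The ordering requirement can be checked mechanically. The following is a minimal sketch that inspects the `passwd:` and `group:` lines of an nsswitch.conf-style file; the function name `slurm_listed_first` and the default source list (`sss ldap`) are assumptions made for this example, and bracketed NSS actions such as `[NOTFOUND=return]` are not interpreted.

```bash
# Succeeds if "slurm" is listed before any of the given network sources on both
# the passwd and group lines of an nsswitch.conf-style file.
# Usage: slurm_listed_first <file> [network source ...]   (default: sss ldap)
slurm_listed_first() {
    local file=$1; shift
    local sources="${*:-sss ldap}"
    awk -v sources="${sources}" '
        BEGIN { n = split(sources, net, " "); bad = 0; seen = 0 }
        $1 == "passwd:" || $1 == "group:" {
            seen++
            slurm_pos = 0
            for (i = 2; i <= NF; i++) if ($i == "slurm") { slurm_pos = i; break }
            if (!slurm_pos) { bad = 1; next }      # slurm missing on this line
            for (i = 2; i < slurm_pos; i++)        # anything listed before slurm
                for (j = 1; j <= n; j++) if ($i == net[j]) bad = 1
        }
        END { exit (bad || seen < 2) ? 1 : 0 }
    ' "${file}"
}
```

On a compute node, `slurm_listed_first /etc/nsswitch.conf && echo OK` would confirm the order; a line such as `passwd: files sss slurm` makes the check fail.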

In [None]:
# check /etc/nsswitch.conf on all compute node
for compute in $(scontrol show node --oneliner | awk '{print $1}' | cut -c10-); do
    echo ${compute}:
    ssh ${compute} grep -E "'^(passwd|group)'" /etc/nsswitch.conf
done

To test this setup, we query the information of user `trinity` inside and outside of a running job. Note that user `trinity` is created only on this node (client) and the master nodes. On the compute node, user trinity exists only within a running job. 

In [None]:
set -x
one_compute_node=$(scontrol show node --oneliner | awk '{print $1}' | cut -c10- | head -n1)
ssh ${one_compute_node} id trinity
sudo -i -u trinity srun -w ${one_compute_node} -D ~trinity id trinity
set +x

### `AuthType=auth/slurm` - un-authenticated master and compute, without munge
reference: [Authentication Plugins - slurm](/doc/authentication.html#slurm)  
Since Slurm 23.11, a new authentication plugin, auth/slurm, is available. Unlike munge, this plugin allows the control plane (slurmctld and slurmdbd) to function normally without the host being authenticated. Combined with nss_slurm, it is possible to run a Slurm cluster with only the submission/client node being authenticated. 

In [None]:
# example: user neo is not able to submit job if using munge
if [[ $(scontrol show config | grep -i AuthType | awk '{print $3}' ) == auth/munge ]]; then
    sudo -i -u neo srun whoami # this is supposed to fail
else
    echo Not using munge
fi

**Modify `slurm.conf` and `slurmdbd.conf`**  
To select this auth plugin instead of munge, set the following parameters in slurm.conf and slurmdbd.conf:  
`slurm.conf`:
```
AuthType=auth/slurm
CredType=cred/slurm
AuthInfo=use_client_ids
```  
`slurmdbd.conf`:
```
AuthType=auth/slurm
AuthInfo=use_client_ids
```  
A randomly generated pre-shared key /etc/slurm/slurm.key is also required:
```bash
dd if=/dev/random of=/etc/slurm/slurm.key bs=1024 count=1
chown slurm:slurm /etc/slurm/slurm.key
chmod 600 /etc/slurm/slurm.key
```
(This key has been generated during container image build)

In [None]:
# Switch to auth/slurm
cat playbooks/use-slurm-auth.yaml
ansible-playbook playbooks/use-slurm-auth.yaml
scontrol show config | grep -i AuthType

Now we can try to submit a job as user neo again. Note that this user exists only on this submission/client node. 

In [None]:
sudo -i -u neo srun bash -c "whoami ; hostname ;"

For this slurm-lab container cluster, you can recreate the whole cluster to use auth/slurm by setting `AUTHTYPE=auth/slurm` in .env or passing it as an environment variable.

In [None]:
# cleanup users
sudo loginctl disable-linger smith trinity neo
ansible-playbook --fork=1 playbooks/matrix-user-delete.yaml

# (optional) switch back to munge
#ansible-playbook playbooks/use-munge-auth.yaml

## Saving & Retrieving Job Script and Environment variables
Slurm can be configured to save the job script and the job's environment variables in the accounting database. This can be useful for debugging and investigation.  
To enable this feature, ensure that `AccountingStoreFlags` includes `job_env` and `job_script` in `slurm.conf`. 

In [None]:
scontrol show config | grep AccountingStoreFlags

In [None]:
# submit a job via stdin, so the script cannot be found anywhere in the file system
# jobid=$(sbatch --nodes=1 --ntasks-per-node=2 --parsable --time 00:20:00 --partition debug --begin ${start_time} endless-checksum-mpi.sh)
jobid=$(
sbatch --parsable --time 00:10:00 <<EOF
#!/bin/bash
echo "I am a dummy job"
md5sum /dev/zero
EOF
)

# show job status 
squeue -j ${jobid} -la

To show the job script, use the `sacct` command with the option `--batch-script`; the job ID must be specified as well. 

In [None]:
# show job script 
sacct -j ${jobid} --batch-script

To list the job's environment variables, use the `sacct` command with the option `--env-vars`. This option is mutually exclusive with `--batch-script`, and the job ID must be specified. 

In [None]:
# show job environment 
sacct -j ${jobid} --env-vars