# Quick Review of cluster architecture

(diagram goes here)


# Quick Review of Slurm allocations

$ srun [options] batch.sh

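option | example | comment
-------|---------|--------
-c _cores_ | -c 20 | cores per task, all on a single node
-n _tasks_ | -n 4 | number of tasks; MPI programs only
-N _nodes_ | -N 10 | number of nodes; rarely useful
-t _time_  | -t 7-, -t 3:00 | time limit (7 days, 3 minutes); job killed if exceeded
--mem=_mem_ | --mem=16g | memory per node; job killed if exceeded
--mem-per-cpu=_mem_ | --mem-per-cpu=8g | memory per core; job killed if exceeded
-J _name_ | -J myjob | job name; for your reference only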
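
The same options can be written into the batch script as `#SBATCH` directives. This is a minimal sketch rather than a recommended configuration: the job name, core count, and limits are illustrative values, and the `echo` line stands in for real work.

```shell
#!/bin/bash
#SBATCH -J myjob            # job name, for your reference only
#SBATCH -c 4                # 4 cores for one task on a single node
#SBATCH -t 3:00             # time limit: 3 minutes
#SBATCH --mem-per-cpu=8g    # 8 GB of memory per core

# Slurm sets SLURM_CPUS_PER_TASK when -c is given; default to 1 elsewhere.
ncores="${SLURM_CPUS_PER_TASK:-1}"
echo "job ${SLURM_JOB_ID:-unknown}: using ${ncores} core(s)"
```

Options given on the command line override the directives in the script.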


# Thinking about memory requirements
- default 5GB/core on our clusters
- strictly enforced; jobs exceeding limit killed
- you can request custom memory per node or core
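
As a quick arithmetic check, the sketch below compares the default memory for a 4-core job with a custom per-core request. The core count and the 8 GB figure are example values, not recommendations.

```shell
cores=4
default_gb_per_core=5     # cluster default stated above
requested_gb_per_core=8   # e.g. --mem-per-cpu=8g (example value)

# Total memory for the job is the per-core figure times the core count.
default_total_gb=$(( cores * default_gb_per_core ))
requested_total_gb=$(( cores * requested_gb_per_core ))
echo "default: ${default_total_gb} GB, requested: ${requested_total_gb} GB"
```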


# Determining memory requirements
During run:
- ssh to node, run 
    "top -o RES -u _netid_"
    look at RES column
- /usr/bin/time -v _prog args_ (reports "Maximum resident set size" when the program exits)
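
The peak memory figure can be pulled out of the verbose report from GNU time (`/usr/bin/time -v`). The helper below is a sketch: `peak_rss_mb` is a hypothetical name, and it assumes the report line reads `Maximum resident set size (kbytes): N`.

```shell
# Extract peak memory (in MB) from a GNU time -v report read on stdin.
peak_rss_mb() {
    awk -F': ' '/Maximum resident set size/ { printf "%.0f\n", $2 / 1024 }'
}

# Usage on a compute node (prog and args are placeholders):
#   /usr/bin/time -v -o time.log prog args
#   peak_rss_mb < time.log
```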

After successful run, determine actual usage:

- sacct
- remora

# Slurm sacct

sacct -o 'JobID,MaxRSS,MaxVMSize' -j _jobid_

Or set a default output format once, then run sacct without -o:

 export SACCT_FORMAT=JobID%-20,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,AllocTRES%32
 
 sacct -j _jobid_
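
A job usually has several steps, each with its own MaxRSS. The sketch below picks the largest value from parsable sacct output. `max_rss_kb` is a hypothetical helper, and it assumes MaxRSS values carry a K, M, or G suffix.

```shell
# Read "JobID|MaxRSS" lines on stdin and print the largest MaxRSS in KB.
max_rss_kb() {
    awk -F'|' '
        BEGIN { max = 0 }
        {
            v = $2
            unit = substr(v, length(v))   # last character: K, M or G
            n = v + 0                     # numeric part; empty fields give 0
            if (unit == "M") n *= 1024
            else if (unit == "G") n *= 1024 * 1024
            if (n > max) max = n
        }
        END { printf "%d\n", max }'
}

# Usage on the cluster (jobid is a placeholder):
#   sacct -n -P -o JobID,MaxRSS -j jobid | max_rss_kb
```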




# Remora
https://github.com/TACC/remora

module load REMORA
remora prog args ...

This will create a directory: remora_jobid

Copy (rsync) the directory to your local computer and open remora_summary.html in a browser.



# Finding Compute Resources

To get overall sense:
```
sinfo -p general
```
To see completely idle nodes, by core count:

```
$ sinfo -p general -e -t IDLE -o "%P %.5a %c %.10l %.6D %.6t %N"
PARTITION AVAIL CPUS  TIMELIMIT  NODES  STATE NODELIST
general*    up 8 30-00:00:0     35   idle c06n[10-16],c07n[01-14,16],c08n[01-06,08-14]
general*    up 16 30-00:00:0     23   idle c10n[13-16],c11n[01-16],c12n[09-11]
```

Hint: define an alias:
```
alias findidle='sinfo -p general -e -t IDLE -o "%P %.5a %c %.10l %.6D %.6t %N"'
```
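
To turn that listing into a single number, the CPUS and NODES columns can be multiplied and summed. This is a sketch that assumes the column layout of the `-o` format string used above; `idle_cores` is a hypothetical helper name.

```shell
# Sum CPUS x NODES over sinfo lines (columns 3 and 5), skipping the header.
idle_cores() {
    awk 'NR > 1 { total += $3 * $5 } END { printf "%d\n", total }'
}

# Usage on the cluster:
#   sinfo -p general -e -t IDLE -o "%P %.5a %c %.10l %.6D %.6t %N" | idle_cores
```

For the sample output shown earlier this gives 8 x 35 + 16 x 23 = 648 idle cores.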

# Fairshare scheduling
- Groups and users with heavy recent usage (last 30-45 days) have lower priority


# Using Scavenge Partition
- Compute nodes in other partitions are available via the scavenge partition
- sbatch -p scavenge ...
- separate per-user limits apply
- works best for short jobs, dSQ/array jobs, or jobs that checkpoint
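
A job that checkpoints can pick up where it left off if it is killed and run again. The loop below is a minimal sketch of that idea: the checkpoint file name, the step count, and the placeholder for the work are all assumptions for illustration.

```shell
checkpoint="${CHECKPOINT_FILE:-checkpoint.txt}"   # hypothetical file name
total_steps=5

# Resume after the last completed step if a checkpoint exists.
step=1
if [ -f "$checkpoint" ]; then
    step=$(( $(cat "$checkpoint") + 1 ))
fi

while [ "$step" -le "$total_steps" ]; do
    # ... one unit of real work goes here ...
    echo "$step" > "$checkpoint"    # record progress after each step
    step=$(( step + 1 ))
done
```

Running the script a second time after it has finished does no further work, because the checkpoint already records the last step.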

# Special Nodes

# Large Memory Nodes
- We have some compute nodes with 512GB-1.5TB of RAM
- Reserved for applications with large memory needs. Please be considerate.
- Separate slurm partition: bigmem

Typical allocation: 
```
srun/sbatch -p bigmem --mem=1500g ...
```

# GPU Nodes
- Some applications have been ported to GPUs with impressive performance improvements
- GPU nodes have conventional CPUs with multiple cores, and 1-4 GPUs.
- To use GPUs, you must:
  - request node(s) with GPUs
  - request the type and number of GPUs

Typical allocation:
```
srun/sbatch -p gpu -c 20 --gres=gpu:1080ti:4 ...
```
Note that partition names, types and number of GPUs vary by cluster.
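
Written as a batch script, a GPU request might look like the sketch below. The partition name and counts are illustrative, and the check assumes Slurm exports `CUDA_VISIBLE_DEVICES` for the GPUs it assigns, which depends on how the cluster is configured.

```shell
#!/bin/bash
#SBATCH -p gpu
#SBATCH -c 2
#SBATCH --gres=gpu:1        # one GPU of any type; add a type as in gpu:1080ti:1

# Report which GPUs this job can see; "none" when run outside a GPU allocation.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpus}"
```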


# Parallelism

- sbatch can allocate multiple cores and nodes, but the batch script itself runs sequentially on one core of one node.

- Simply allocating more nodes or cores DOES NOT make jobs faster.

- How do we use multiple cores to increase speed?


- Two classes of parallelism:
  - Lots of independent sequential jobs
  - Single job parallelized (somehow)


- Some options:
  - Submit many batch jobs simultaneously (not good)
  - Use job arrays, or dSQ (much better)
  - Submit a parallel version of your program (great if you have one)
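
To illustrate why extra cores do nothing on their own, the sketch below shows a script that has to start its work concurrently before more than one core is used. It is a plain-shell illustration, not one of the options listed above; the worker body is a placeholder for an independent command.

```shell
# Number of cores allocated with -c; defaults to 2 outside Slurm.
ncores="${SLURM_CPUS_PER_TASK:-2}"
outdir=$(mktemp -d)

i=1
while [ "$i" -le "$ncores" ]; do
    # The subshell stands in for one independent command, run in the background.
    ( echo "worker $i done" > "$outdir/worker.$i" ) &
    i=$(( i + 1 ))
done
wait    # block until every background worker has finished

finished=$(ls "$outdir" | wc -l | tr -d ' ')
echo "${finished} of ${ncores} workers finished"
```

Without the `&`, the same loop would run the workers one after another on a single core.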



# Job Arrays

- Useful when you have many nearly identical, independent jobs to run
- Starts many identical copies of your script, distinguished by a task id.

Submit jobs like this:
```
sbatch --array=1-100 ...
```
Inside your batch script, use the environment variable `SLURM_ARRAY_TASK_ID` to do something different in each task:
```
./mycommand -i input.${SLURM_ARRAY_TASK_ID} \
    -o output.${SLURM_ARRAY_TASK_ID}
```
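
When inputs do not follow a numbered naming scheme, the task id can select a line from a list of inputs instead. This is a sketch: `inputs.txt` and the sample file names are hypothetical, and the list is created inside the script only to keep the example self-contained.

```shell
#!/bin/bash
#SBATCH --array=1-3
#SBATCH -t 10:00

# One input file name per line (normally prepared before submitting).
printf '%s\n' sampleA.dat sampleB.dat sampleC.dat > inputs.txt

# Pick the line whose number matches this task's id; default to 1 outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input=$(sed -n "${task_id}p" inputs.txt)
echo "task ${task_id} would process ${input}"
```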

A few nice features of job arrays:
- only one job to keep track of
- easy to start or cancel entire set
- time limits apply to each task, not overall job
- your allocation can grow and shrink as conditions change
- when using scavenge partition, tasks are killed, but job persists


# dSQ (aka Dead Simple Queue)
- built on job arrays.  Same nice features, but easier to use
- more flexible; tasks can be different from one another
- reporting and error recovery built in


# Using dSQ


- Create file containing list of commands to run (jobs.txt)
```
prog arg1 arg2 -o job1.out
prog arg1 arg2 -o job2.out
...
```
- Create launch script
```
module load dSQ
dSQ --taskfile jobs.txt [slurm args] > run.sh
```

Slurm args can specify partition, time limit, memory, etc. in the usual way.

- Submit launch script
```
sbatch run.sh
```
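
For a long list of tasks, the task file from the first step can be generated with a loop instead of typed by hand. A minimal sketch, using the placeholder command from the example above:

```shell
# Write one command line per task; prog and its arguments are placeholders.
i=1
while [ "$i" -le 10 ]; do
    echo "prog arg1 arg2 -o job${i}.out"
    i=$(( i + 1 ))
done > jobs.txt

wc -l < jobs.txt    # number of tasks dSQ will run
```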

For more info, see <http://research.computing.yale.edu/support/hpc/user-guide/dead-simple-queue>



# dSQ Reporting
- When dSQ job is finished, you'll see a file `job_<jobid>_status.tsv`
- Generate report:
```
$ dSQAutopsy jobs.txt job_<jobid>_status.tsv > failedjobs.txt
Autopsy Task Report:
9 succeeded
1 failed
0 didn't run.
```

- If any jobs failed, failedjobs.txt will contain those jobs
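
Because failedjobs.txt has the same format as the original task file, it can be fed back to dSQ to rerun only the failed tasks. The sketch below assumes the dSQ invocation shown earlier; `rerun.sh` is a hypothetical file name, and any slurm args from the original run would need to be repeated.

```shell
rerun_needed=no
# -s is true only if the file exists and is not empty.
if [ -s failedjobs.txt ]; then
    rerun_needed=yes
    module load dSQ
    dSQ --taskfile failedjobs.txt > rerun.sh
    sbatch rerun.sh
fi
echo "rerun needed: ${rerun_needed}"
```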





# Some ways to run in parallel
- R: multicore
- Python: multiprocessing
- C: threads

# R Multicore

Many R packages have parallelism built in: e.g. bootstrapping (boot)

```
library(boot)
# Sys.getenv() returns a string; boot() expects an integer core count
cores <- as.integer(Sys.getenv("SLURM_CPUS_ON_NODE"))
boot(data=trees, statistic=volume_estimate, R=50000, parallel="multicore", ncpus=cores)
```