Merge pull request #6 from flindersuni/feature/FAQ
feat(FAQ): Added first FAQ Entry & Documentation Captions
The-Scott-Flinders committed Aug 21, 2020
2 parents f2c8681 + 5e6f23d commit 90ef94a
Showing 12 changed files with 350 additions and 169 deletions.
111 changes: 111 additions & 0 deletions docs/source/FAQ/faq.rst
@@ -0,0 +1,111 @@
*****
FAQ
*****

Below are some of the common issues that the team has been asked to resolve more than once, so we have put them here to (hopefully) answer your questions before you have to wait in the ticket queue!

What are the SLURM Partitions?
===============================
There are just two (for now):

* hpc_general
* hpc_melfeu

You can omit the

* #SBATCH --partition=<name> directive

as the sane default for you is the hpc_general partition.
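
If you do want to target a partition explicitly, a minimal job-script header might look like the following sketch - the job name and resource numbers are placeholders, not recommendations:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=example_job     # placeholder name
    #SBATCH --partition=hpc_general    # or hpc_melfeu
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=4G
    #SBATCH --time=0-01:00             # 1 hour

    # your commands go here
    hostname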

SLURM - Tasks & OpenMPI/MPI
===========================
When running jobs enabled with OpenMPI/MPI, there is often some confusion around how it all works and what the correct SLURM settings are. The biggest point of confusion is what a 'Task' is, and when to use more than one.

Tasks
-----
Think of a task as a 'Bucket of Resources' you ask for. It cannot talk to another bucket without some way to communicate - this is what OpenMPI/MPI does for you. It lets any number of buckets talk to each other.

When you ask SLURM for N tasks, you will get N tasks of the size you specified, and all of them are counted against your usage. For example:

* --ntasks=12 --cpus-per-task=10 --mem-per-cpu=2G

will get you a combined *total* of 120 CPUs and 240GB of RAM, *spread across 12 individual little instances*.
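
In a job script, the same request could be written as directives. This is a sketch only; adjust the numbers to what your program actually needs, and note that ``my_mpi_program`` is a placeholder:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --ntasks=12          # 12 'buckets' (tasks)
    #SBATCH --cpus-per-task=10   # 10 CPUs inside each bucket
    #SBATCH --mem-per-cpu=2G     # 2GB per CPU -> 20GB per task, 240GB in total

    # Launch one copy of the program per task; MPI handles the communication
    srun ./my_mpi_program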

Running the Job
----------------
There are several ways to correctly start OpenMPI/MPI based programs. SLURM does an excellent job of integrating with OpenMPI/MPI, so usually it will 'Just Work'. It's highly dependent upon how the program is structured and written. Here are some options that can help you boot things when they do not go to plan.

* mpirun - bootstraps a program under MPI. Best tested under a manual allocation via salloc, as shown in the sketch below.
* srun - acts nearly the same as 'sbatch', but runs immediately via SLURM instead of submitting the job for later execution.
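
As a rough sketch of both approaches (the program name and resource numbers below are placeholders):

.. code-block:: bash

    # Option 1: grab an interactive allocation, then bootstrap with mpirun
    salloc --ntasks=4 --cpus-per-task=1 --mem-per-cpu=2G
    mpirun ./my_mpi_program      # mpirun picks up the allocation from SLURM

    # Option 2: let srun launch the tasks directly (one copy per task)
    srun --ntasks=4 --mem-per-cpu=2G ./my_mpi_program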

OOM Killer
-----------
Remember that each 'task' is its own little bucket - which means that SLURM tracks it individually! If a single task goes over its resource allocation, SLURM will kill it, and that usually causes a cascading failure in the rest of your program, as you suddenly have a process missing.
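
If you suspect a task has been killed for going over its memory, SLURM's accounting can confirm it. A quick sketch (replace 12345 with your actual job ID):

.. code-block:: bash

    # Show peak memory (MaxRSS) and final state for each step of job 12345
    sacct -j 12345 --format=JobID,JobName,MaxRSS,ReqMem,State,ExitCode

An OOM-killed step will typically show a State of OUT_OF_MEMORY, or a non-zero exit code.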


IsoSeq3: Installation
=====================

IsoSeq3, from Pacific Biosciences, has install instructions that won't get you all the way on DeepThought. There are some missing packages, and some commands must be altered to get you up and running.
This guide will assume that you are starting from scratch, so feel free to skip any steps you have already performed.

The steps below will:

* Create a new Virtual Environment
* Install the dependencies for IsoSeq
* Install IsoSeq3
* Alter a SLURM script to play nice with Conda

Conda/Python Environment
--------------------------
The only thing you will need to decide is where you want to store your environment. You can store it in your /home directory if you like, or in /scratch - just put it someplace that is easy to remember.
To get you up and running (anywhere it says FAN, please substitute yours):

* module load miniconda/3.0
* conda create -p /home/FAN/isoseq3 python=3.7
* source activate /home/FAN/isoseq3
* You may get a warning saying 'your shell is not set up to use conda/anaconda correctly' - let it do its auto-configuration, then issue

* source ~/.bashrc
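
Put together, the whole sequence looks like the following. This is a sketch - substitute your FAN, and adjust the path if you chose /scratch instead of /home:

.. code-block:: bash

    module load miniconda/3.0
    conda create -p /home/FAN/isoseq3 python=3.7
    source activate /home/FAN/isoseq3
    # If conda warns that your shell is not set up, let it auto-configure, then:
    source ~/.bashrc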

When all goes well, your prompt should read something similar to

.. image:: ../_static/conda_active_env.png

Notice the (/home/ande0548/isoseq3)? That's a marker to tell you which Python/Conda environment you have active at this point.

BX-Python
----------
The bx-python version given in the wiki doesn't install correctly, and even if it *does* install, it will fail at runtime. To get a working version, run the following.

* conda install -c conda-forge -c bioconda bx-python

Which will get you a working version.

IsoSeq3
---------

Finally, we can install IsoSeq3 and its dependencies.

* conda install -c bioconda isoseq3 pbccs pbcoretools bamtools pysam lima


Will get you all the tools installed into your virtual environment. To test this, you should be able to call the individual commands, like so.

.. image:: ../_static/conda_isoseq3_installed.png
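
If the image above does not render in your chosen format, the check is simply that each tool runs and reports a version, for example (the exact versions you see will differ):

.. code-block:: bash

    isoseq3 --version
    ccs --version
    lima --version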


SLURM Modifications
--------------------

When you ask SLURM to run your job, you may get an error about conda not being initialised correctly. The fix below is a very brute-force hammer approach, but it will cover everything for you.

Right at the start of your script, add the following lines:

* module load miniconda/3.0
* conda init --all
* source /home/FAN/.bashrc
* conda activate /path/to/conda/environment

This loads conda, initialises (all of) your conda environment(s), forces a shell refresh to pick up that new configuration, and finally activates your environment. Your job can now run without strange conda-based initialisation errors.
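
In context, the top of a conda-based job script might look like the sketch below. The job name, resource numbers and environment path are placeholders - substitute your own FAN and environment:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=isoseq3_run
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G
    #SBATCH --time=1-0

    module load miniconda/3.0
    conda init --all
    source /home/FAN/.bashrc
    conda activate /home/FAN/isoseq3

    # ... your isoseq3 / ccs / lima commands go here ...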
29 changes: 0 additions & 29 deletions docs/source/FAQ/faqandissues.rst

This file was deleted.

23 changes: 23 additions & 0 deletions docs/source/FAQ/knownissues.rst
@@ -0,0 +1,23 @@

******************
Known Issues
******************

Below are the 'known issues' with the documentation at this point. This is not an issue tracker for the HPC, so please don't try to lodge an issue with DeepThought here!

PDF Version
==============

* Images are missing in some sections on the PDF Version
* Alt-Text appears when it should not in the PDF Version

EPUB Version
================

* EPUB Version is missing some pages of content


Web / ReadTheDocs / Online Version
====================================

* Some builds seem to have magic-ed away the images and they no longer display correctly.
4 changes: 2 additions & 2 deletions docs/source/FileTransfers/FileTransfersIntro.md
@@ -93,7 +93,7 @@ Windows doesn't support the SFTP protocol in a native way. Thankfully, there are

#### Sub-System for Linux

You can use the WSL for this - head on over to the [Linux](../Linux/LinuxFileTransfer.md) Guide.
You can use the WSL for this - head on over to the [Linux](#TransferringFiles) Guide.

#### Potential Client List

@@ -108,7 +108,7 @@ This guide will focus on WinSCP.

Open WinSCP, enter deepthought.flinders.edu.au as the host to connect to, and click Login. You should have a screen that looks like this.

![](../../_static/WinSCPImage.png)
![](../_static/WinSCPImage.png)

The first time you connect up you will get a warning - this is fine, just click YES to continue on.

4 changes: 2 additions & 2 deletions docs/source/ModuleSystem/LMod.md
@@ -85,12 +85,12 @@ The software must in all cases be appropriately licensed.

## Currently Installed Modules

Be warned that this is a _long_ list. It's updated on a best-effort basis to help you see what is already present, but the latest list is always available by running the `module list` command on the HPC. It's broken into several segments, as there are several tools used to facilitate the installation and management of software used on the HPC.
Be warned that this is a _long_ list. It's updated on a best-effort basis to help you see what is already present, but the latest list is always available by running the `module avail` command on the HPC. It's broken into several segments, as there are several tools used to facilitate the installation and management of software used on the HPC.

### Manually Installed Software / Default Available

This is the list of software that has been 'hand rolled', as it contains either software that, at the time, had no automated way of installation, or rather esoteric software that required extensive modification to work correctly on the HPC. It is available [here](ManuallyInstalled.md)

### Additional Software

There are additional software collections that will be made available in the near future.
There are additional software collections that will be made available in the near future.
133 changes: 126 additions & 7 deletions docs/source/SLURM/SLURMIntro.md
@@ -8,9 +8,9 @@ As a cluster workload manager, Slurm has three key functions. First, it allocate

## System Specifications

If you want to know the system specifications for Deepthought, head on over to [here](../system/deepthoughspecifications.md)
If you want to know the system specifications for DeepThought, head on over to [here](../system/deepthoughspecifications.md)

## SLURM on Deepthought
## SLURM on DeepThought

SLURM on DeepThought uses the 'Fairshare' work allocation algorithm. This works by tracking how many resources your job takes and adjusting your position in the queue depending upon your usage. The following sections will break down a quick overview of how we calculate things, what some of the cool-off periods are and how it all slots together.

@@ -44,7 +44,7 @@ Which will allow for greater details of how your score was calculated.

### Calculating Priority

SLURM tracks 'Resources'. This can be nearly anything on the HPC - CPU's, Power, GPU's, Memory, Storage, Licenses, anything that people share adn could use really.
SLURM tracks 'Resources'. This can be nearly anything on the HPC - CPU's, Power, GPU's, Memory, Storage, Licenses, anything that people share and could use really.

The basic premise is - you have:

@@ -77,7 +77,7 @@ To give you an idea of the _initial_ score you would get for consuming an entire

**Total**: `‭65,600,000`

So, it stacks up very quickly, and you really want to write your job to ask for what it needs, and not much more!
So, it stacks up very quickly, and you really want to write your job to ask for what it needs, and not much more! This is not the number you see and should only be taken as an example. If you want to read up on exactly how Fairshare works, then head on over to [here](https://slurm.schedmd.com/priority_multifactor.html).

## SLURM: The Basics

@@ -199,9 +199,6 @@ The following varaibles are set per job, and can be access from your SLURM Scrip
|$SLURM_SUBMIT_HOST | Host on which job was submitted.|
|$SLURM_PROC_ID | The process (task) ID within the job. This will start from zero and go up to $SLURM_NTASKS-1.|

### Example SLURM Job Script

[Here](SLURMScript.md)

## SLURM: Extras

@@ -214,3 +211,125 @@ Here is an assortment of resources that have been passed on to the Support Team
Besides useful commands and ideas, this [FAQ](http://www.ceci-hpc.be/slurm_faq.html#Q01) has been the best explanation of 'going parallel' and the different types of parallel jobs, as well as a clear definition for what is considered a task.

An excellent guide to [submitting jobs](https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html).


## SLURM: Script Template

#!/bin/bash
# Please note that you need to adapt this script to your job
# Submitting it as-is will cause the job to fail
# The keyword command for SLURM is #SBATCH --option
# Anything starting with a # is a comment and will be ignored
# ##SBATCH is a commented-out #SBATCH command
##################################################################
# Change FAN to your fan account name
# Change JOBNAME to what you want to call the job
# This is what shows when attempting to monitor / interrogate the job,
# So make sure it is something pertinent!
#
#SBATCH --job-name=FAN_JOBNAME
#
##################################################################
# If you want email updates from SLURM for your job.
# Change MYEMAIL to your email address
#SBATCH --mail-user=MYEMAIL@flinders.edu.au
#SBATCH --mail-type=ALL
#
# Valid 'points of notification' are:
# BEGIN, END, FAIL, REQUEUE.
# ALL means all of these
##################################################################
# Tell SLURM where to put the Job 'Output Log' text file.
# This will aid you in debugging crashed or stalled jobs.
# You can capture both Standard Error and Standard Out
# %j will append the 'Job ID' from SLURM.
# %x will append the 'Job Name' from SLURM
#SBATCH --output=/home/FAN/%x-%j.out.txt
#SBATCH --error=/home/FAN/%x-%j.err.txt
##################################################################
# You can leave this commented out, or submit to hpc_general
# Valid partitions are hpc_general and hpc_melfeu
##SBATCH --partition=PARTITIONNAME
#
##################################################################
# Tell SLURM how long your job should run for, at most.
# SLURM will kill/stop the job if it goes over this amount of time.
# Currently, this is unlimited - however, the longer your job
# runs, the worse your Fairshare score becomes!
#
# In the future this will have a limit, so best to get used to
# setting it now.
#
# The command format is as follows: #SBATCH --time=DAYS-HOURS
#SBATCH --time=14-0
#
##################################################################
# How many tasks is your job going to run?
# Unless you are running something that is Parallel / Modular or
# pipelined, leave this as 1. Think of each task as a 'bucket of
# resources' that stands alone. Without MPI / IPC you can't talk to
# another bucket!
#
#SBATCH --ntasks=1
# If each task will need more than a single CPU, then alter this
# value. Remember, this is multiplicative, so if you ask for
# 4 Tasks and 4 CPUs per Task, you will be allocated 16 CPUs
#SBATCH --cpus-per-task=1
##################################################################
# Set the memory requirements for the job in MB. Your job will be
# allocated exclusive access to that amount of RAM. In the case it
# overuses that amount, Slurm will kill it. The default value is
# around 2GB per CPU you ask for.
# Note that the lower the requested memory, the higher the
# chances to get scheduled to 'fill in the gaps' between other
# jobs. Pick ONE of the below options. They are Mutually Exclusive.
# You can ask for X Amount of RAM per CPU (MB by default)
#SBATCH --mem-per-cpu=4000
# Or, you can ask for a 'total amount of RAM'
##SBATCH --mem=12G
##################################################################
# Change the number of GPUs required. The most GPUs that can be
# requested is 2 per node. As there are limited GPU slots, they count
# heavily against your Fairshare Score.
# This line requests 0 GPUs by default.
#
#SBATCH --gres="gpu:0"
##################################################################
# Load any modules that are required. This is exactly the same as
# loading them manually, with a space-separated list, or you can
# write multiple lines.
# You will need to uncomment these.
#module add miniconda/3.0 cuda10.0/toolkit/10.0.130
#module add miniconda/3.0
#module add cuda10.0/toolkit/10.0.130

##################################################################
# If you have not already transferred your data-sets to your /scratch
# directory, then you can do so as a part of your job.
# Change the FAN and JOBNAME as needed.
# REMOVE the data from /home when you do not need it, as /home space is
# limited.

# Copy Data to /scratch
cp -r /home/FAN/??DataDirectory?? /scratch/user/FAN/JOBNAME

##################################################################
# Enter the command-line arguments that your job needs to run.
cd /scratch/user/FAN/
python36 Generator.py

##################################################################
# Once your job has finished its processing, copy back your results
# and ONLY the results from /scratch.

cp /scratch/user/FAN/JOBNAME/??ResultsOutput.txt?? ~/JOBNAME/

# Now, clean up your /scratch directory of the extra data you have.
# If you need to keep the data-set for later usage, then utilise
# your preferred method to get it OFF the HPC, and into your
# local storage.
rm -rf /scratch/user/FAN/JOBNAME
##################################################################
# Print out the 'Job Efficiency' - this is how well your job used the
# resources you asked for. The higher the better!
seff $SLURM_JOBID
