Merge pull request #41 from flindersuni/feature/jupyter-gpu
ODC & Jupyter GPU Update
The-Scott-Flinders committed Jun 14, 2022
2 parents d9bf754 + 604fd78 commit eed6106
Showing 16 changed files with 263 additions and 71 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -7,4 +7,7 @@ docs/source/_build
# Files used by Visual Code
.vscode/settings.json
*.code-workspace
.venv
.venv
.vscode/targets.log
.vscode/dryrun.log
.vscode/configurationCache.log
64 changes: 62 additions & 2 deletions docs/source/FAQ/faq.rst
@@ -20,7 +20,7 @@ There are three at this point:

* general
* gpu
* melfeu
* melfu

You can omit the

@@ -55,7 +55,7 @@ OOM Killer
Remember that each 'task' is its own little bucket - which means that SLURM tracks it individually! If a single task goes over its resource allocation, SLURM will kill it, and that usually causes a cascade failure in the rest of your program, as you suddenly have a process missing.
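
As a concrete illustration (the directive values below are arbitrary examples, not recommended settings), a job that asks for four tasks with a per-CPU memory limit is policed per task - if one task exceeds its share, that task alone is targeted by the OOM killer::

    #!/bin/bash
    #SBATCH --job-name=oom-example   # example name only
    #SBATCH --ntasks=4               # four independent 'buckets'
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=2G         # each task is capped at its own 2G share

    srun ./my_program                # 'my_program' is a placeholder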


IsoSeq3: Installation
Issues Installing IsoSeq3
=========================

IsoSeq3, from Pacific Biosciences, has install instructions that won't get you all the way on DeepThought. There are some missing packages, and some commands must be altered to get you up and running.
@@ -128,3 +128,63 @@ The given bx-python is a problematic module that appears in many of the BioScien
These steps are the same as the installation for IsoSeq3, but given how often this particular python package gives the support team issues, it gets its own section!

* conda install -c conda-forge -c bioconda bx-python
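
A minimal end-to-end sketch of the above, assuming a fresh conda environment (the environment name and the ``module load`` name are illustrative only - check ``module avail`` on DeepThought)::

    # Load conda on the HPC (module name is an assumption)
    module load Miniconda3

    # Create and activate a clean environment for the install
    conda create -n bxpython-env python=3.9 -y
    conda activate bxpython-env

    # Install bx-python from conda-forge and bioconda, as listed above
    conda install -c conda-forge -c bioconda bx-python -y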



My Jupyter Kernel Times Out
===============================
This is usually caused by one of two things:

* HPC has allocated all its Resources
* Incorrect Conda Setup

HPC Is Busy
------------

Your job will time out when the HPC is busy, as your job cannot get an allocation within 30 seconds (or so).
If you do not see a file like 'slurm-<NUMBER>.out' in your /home directory, then the HPC cannot fit your kernel's requested allocation, as all resources are busy.

To solve the above, you can either:

* Recreate a Kernel with lower resource requirements
* Wait for the HPC to be less busy

A sneaky command from the HPC Admin Team: ``sinfo -No "%17n %13C %10O %10e %30G"``. This gets you a layout like so::

HOSTNAMES CPUS(A/I/O/T) CPU_LOAD FREE_MEM GRES
hpc-node001 0/64/0/64 0.46 241647 gpu:tesla_v100:2(S:2,6)
hpc-node002 0/64/0/64 1.86 250777 gpu:tesla_v100:2(S:2,6)
hpc-node003 64/0/0/64 20.44 240520 (null)
hpc-node004 64/0/0/64 19.46 244907 (null)
hpc-node005 64/0/0/64 18.59 241284 (null)
hpc-node006 64/0/0/64 17.37 244390 (null)
hpc-node007 64/0/0/64 14.50 221633 (null)
hpc-node008 64/0/0/64 18.06 211002 (null)
hpc-node009 64/0/0/64 19.27 206833 (null)
hpc-node010 64/0/0/64 19.39 233411 (null)
hpc-node011 64/0/0/64 20.51 221966 (null)
hpc-node012 64/0/0/64 19.06 181808 (null)
hpc-node013 64/0/0/64 20.35 221835 (null)
hpc-node014 60/0/4/64 4.00 151584 (null)
hpc-node015 64/0/0/64 18.01 191874 (null)
hpc-node016 64/0/0/64 11.04 214227 (null)
hpc-node017 0/64/0/64 0.00 512825 (null)
hpc-node018 0/64/0/64 0.03 61170 (null)
hpc-node019 128/0/0/128 515.85 1929048 (null)
hpc-node020 128/0/0/128 30.31 1062956 (null)
hpc-node021 128/0/0/128 38.10 975893 (null)
hpc-node022 0/64/0/64 0.06 119681 gpu:tesla_v100:1(S:2)

What you want to look at is the first and second numbers in the CPUS(A/I/O/T) column. The first is 'Allocated' and the second is 'Idle' (available for use).
The above example shows that the GPU nodes are idle (0/64) while the general-queue nodes are fully allocated (64/0).
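
If you only want a per-partition summary rather than the per-node view, one possible shortcut (standard ``sinfo`` formatting, not part of the official instructions above) is::

    # Allocated/Idle/Other/Total CPUs, summarised per partition
    sinfo -o "%P %C"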

Incorrect Conda Environment Setup
-----------------------------------
The timeout error can also be caused by a missing package that the custom WLM integration requires to work correctly.

This means that the job started, but could not connect your Jupyter Notebook correctly. If you look in your home directory, you will see the previously mentioned 'slurm-<NUMBER>.out' file.
Right at the very bottom of the file (it's quite long, with lots of debugging information in it) you will see a message similar to:

* ``command not found ipykernel-wlm``

To fix this type of 'command not found' error for ipykernel or similar, go back to the Jupyter Hub Conda Setup instructions and double-check that you have installed *all* of the needed packages.
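
A quick way to confirm this failure mode and start fixing it is sketched below; the environment name is a placeholder and the package list is only an example - the authoritative list is in the Jupyter Hub Conda Setup instructions::

    # Show the tail of the most recent SLURM output file in your home directory
    ls -t ~/slurm-*.out | head -n 1 | xargs tail -n 20

    # Re-activate the conda environment your kernel uses and reinstall the
    # Jupyter pieces (example only - follow the setup guide for the full set)
    conda activate my-jupyter-env
    conda install -c conda-forge ipykernel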
2 changes: 1 addition & 1 deletion docs/source/FileTransfers/FileTransfersIntro.md
@@ -28,7 +28,7 @@ When using a *NIX based system, using the terminal is the fastest way to upload
Substitute your filename, FAN and password: type `scp FILENAME FAN@deepthought.flinders.edu.au:/home/FAN` and then hit enter.
Enter your password when prompted. This will put the file in your home directory on DeepThought. It looks (when substituted accordingly) similar to:

![](../_static/SCPExampleImage.png)
`scp /path/to/local/file fan@deepthought.flinders.edu.au:/path/on/deepthought/hpc/`
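
For example, with a made-up FAN of `abcd0001` and a local archive called `results.tar.gz` (both purely illustrative):

```
scp ./results.tar.gz abcd0001@deepthought.flinders.edu.au:/home/abcd0001/
```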

### The Longer Version

14 changes: 7 additions & 7 deletions docs/source/SLURM/SLURMIntro.md
@@ -44,7 +44,7 @@ Which will allow for greater details of how your score was calculated.

### Calculating Priority

SLURM tracks 'Resources'. This can be nearly anything on the HPC - CPU's, Power, GPU's, Memory, Storage, Licenses, anything that people share and could use really.
SLURM tracks 'Resources'. This can be nearly anything on the HPC - CPUs, power, GPUs, memory, storage, licenses - anything for which the HPC needs to track usage and allocation.

The basic premise is - you have:

@@ -54,7 +54,7 @@ The basic premise is - you have:

Then you multiply all three together to get your end priority. So, let's say you ask for 2 GPUs (the current max you can ask for)

A GPU on DeepThought (When this was written) is set to have these parameters:
A GPU on DeepThought (when this was written) is set to have these parameters:

- Weight: 5
- Factor: 1000
@@ -73,9 +73,9 @@ To give you an idea of the _initial_ score you would get for consuming an entire

**CPU**: `64 * 1 * 1000 = 64,000` (Measured Per CPU Core)

**RAM**: `256 * 0.25 * 1000 = 65,536,000` (Measured Per MB)
**RAM**: `256 * 0.25 * 1000 = 64,000` (Measured Per GB)

**Total**: `65,600,000`
**Total**: `128,000`

So, it stacks up very quickly, and you really want to write your job to ask for what it needs, and not much more! This is not the number you see and should only be taken as an example. If you want to read up on exactly how Fairshare works, then head on over to [here](https://slurm.schedmd.com/priority_multifactor.html).
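
If you are curious what weights are actually configured, one possible check (the partition name `general` is an assumption - substitute the partition you submit to) is:

```
# Show the partition definition; TRESBillingWeights lists per-resource weights, if set
scontrol show partition general | grep -i tres
```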

@@ -207,9 +207,9 @@ The following variables are set per job, and can be access from your SLURM Scrip

The DeepThought HPC will set some additional environment variables to manipulate some of the operating system functions. These directories are set at job creation time and then are removed when a job completes, crashes or otherwise exits.

This means that if you leave anything in $TMP or $SHM directories it will be *removed when your job finishes*.
This means that if you leave anything in $TMP, $BGFS or $SHM directories it will be *removed when your job finishes*.

To make that abundantly clear. If the Job creates `/cluster/jobs/$SLURM_JOB_USER/$SLURM_JOB_ID` it will also **delete that entire directory when the job completes**. Ensure that your last step in any job creation is to _move any data you want to keep to /scratch or /home_.
To make that abundantly clear: if the job creates `/cluster/jobs/$SLURM_JOB_USER/$SLURM_JOB_ID` (the $BGFS location), it will also **delete that entire directory when the job completes**. Ensure that the last step in any job script is to _move any data you want to keep to /scratch or /home_.
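
A minimal sketch of that final step (the destination under /scratch is an example path only - use your own area):

```
# Last lines of a job script: copy results out of the per-job $BGFS area
# before SLURM deletes it.
mkdir -p /scratch/users/"$USER"/my-job-results
cp -r "$BGFS"/. /scratch/users/"$USER"/my-job-results/
```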


|Variable Name | Description | Value |
@@ -261,7 +261,7 @@ To reiterate the warning above - if you leave anything in the $TMP or $SHM Direc

### Filename Patterns

Some commands will take a filename. THe following modifiers will allow you to generate files that are substituted with different variables controlled by SLURM.
Some commands will take a filename. The following modifiers will allow you to generate files that are substituted with different variables controlled by SLURM.

| Symbol | Substitution |
|-|-|
32 changes: 32 additions & 0 deletions docs/source/_static/_overrides.css
@@ -4,3 +4,35 @@
height: 100% !important;
max-width: 100% !important;
}

.wy-side-nav-search {
background-color: #FFD300;
}
.wy-side-nav-search>a {
color: #002F60;
}

.wy-menu-vertical {
background-color: #002F60;
}

.wy-nav-side {
background: #002F60;
color: #002F60;
}

.wy-menu-vertical header, .wy-menu-vertical p.caption {
color: #F6EEE1;
}

.wy-menu-vertical a:hover {
background-color: #21509F;
}

a {
color: #002F60;
}

.wy-menu-vertical a {
color: #b7b7b7;
}
9 changes: 5 additions & 4 deletions docs/source/index.rst
@@ -3,11 +3,11 @@ Welcome to the DeepThought HPC

The new Flinders University HPC is called DeepThought. This new HPC comprises AMD EPYC-based hardware and next-generation management software, allowing for a dynamic and agile HPC service.

.. _BeeGFS Section of Storage & Usage Guidelines: storage/storageusage.html
.. _/cluster: storage/storageusage.html

.. attention::
The new BeeGFS Parallel Filesystem mounted at /cluster has just been deployed, but is *not yet ready for usage*. It will appear in any disk
usage listings on the HPC. For further information and to prepare for when this filesystem is fully released, please read the `BeeGFS Section of Storage & Usage Guidelines`_.
The new BeeGFS Parallel Filesystem mounted at /cluster has just been deployed, and is **now ready for use**. It will appear in any disk
usage listings on the HPC. For further information, please read the `/cluster`_ section of the Storage Usage & Guidelines.

.. attention::
This documentation is under active development, meaning that it can
@@ -22,7 +22,7 @@ Attribution
If you use the HPC to form a part of your research, you should attribute your usage.
Flinders has minted a DOI that points to this documentation, specific to the HPC Service. It will also allow for tracking the research outputs that the HPC has contributed to.

Text Citation
Text Citation
++++++++++++++

.. _ARDC Data Citation: https://ardc.edu.au/resources/working-with-data/citation-identifiers/data-citation/
@@ -78,6 +78,7 @@ Table of Contents
software/matlab.rst
software/singularity.rst
software/vasp.rst
software/opendatacube.rst


.. toctree::
4 changes: 3 additions & 1 deletion docs/source/software/ansys.rst
@@ -1,17 +1,19 @@
-------------------------
ANSYS Engineering Suite
-------------------------

=============
ANSYS Status
=============

ANSYS 2021R2 is the current version of the ANSYS Suite installed on the HPC. Both Single-Node (-smp) and Multi-Node (-dis) execution are supported, as well as GPU acceleration.
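
As a rough illustration of the two modes (the ``ansys212`` executable name, core counts and file names are assumptions drawn from general ANSYS MAPDL usage, not DeepThought-specific instructions)::

    # Single-node, shared-memory run on 16 cores
    ansys212 -smp -np 16 -b -i model.dat -o model.out

    # Distributed (multi-node capable) run on 32 cores
    ansys212 -dis -np 32 -b -i model.dat -o model.out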


.. _ANSYS: https://www.ansys.com/

===============
ANSYS Overview
===============
===============
The ANSYS Engineering Suite is a comprehensive software suite for engineering simulation. More information can be found on the `ANSYS`_ website.


21 changes: 11 additions & 10 deletions docs/source/software/delft3d.rst
@@ -1,18 +1,19 @@
-------------------------
Delft3D
-------------------------
=======
Status
=======
=====================
Delft3D Status
=====================

Delft3D 4, Revision 65936 is installed and available for use on the HPC.

.. Delft3D:
.. _Delft3D Home: https://oss.deltares.nl/web/delft3d

==================
====================
Delft3D Overview
==================
====================

From `Delft3D`_:
From `Delft3D Home`_:

Delft3D is Open Source Software and facilitates the hydrodynamic (Delft3D-FLOW module), morphodynamic (Delft3D-MOR module), waves (Delft3D-WAVE module), water quality (Delft3D-WAQ module including the DELWAQ kernel) and particle (Delft3D-PART module) modelling

@@ -21,12 +22,12 @@ Delft3D is Open Source Software and facilitates the hydrodynamic (Delft3D-FLOW m
Delft3D Known Issues
================================

Delft3D does **not** currently support Multi-Node Execution. The binary swan_mpi.exe will *not work and immediately crash with errors*.
Delft3D does **not** currently support Multi-Node Execution. The binary swan_mpi.exe will *not* work and will immediately crash with errors.


+++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++
Delft3D Program Quick List
+++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++

Below are the two main binaries that are used as part of the Delft3D Suite.

19 changes: 9 additions & 10 deletions docs/source/software/gromacs.rst
@@ -1,18 +1,17 @@
--------
GROMACS
--------
===============
=======================================
GROMACS Status
===============

=======================================
GROMACS version 2021.5 is installed and available for use on the HPC.

.. _GROMACS: https://www.gromacs.org/

=================
GROMACS Overview
=================

==========================================
GROMACS Overview
==========================================
From `GROMACS`_:

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles.
@@ -22,19 +21,19 @@ It is primarily designed for biochemical molecules like proteins, lipids and nuc
GROMACS supports all the usual algorithms you expect from a modern molecular dynamics implementation.


======================================
================================================================
GROMACS Quickstart Command Line Guide
=======================================
================================================================

GROMACS uses UCX and will require a custom mpirun invocation. The module system will warn you of this when you load the module. The following is a known good starting point:


``mpirun -mca pml ucx --mca btl ^vader,tcp,uct -x UCX_NET_DEVICES=bond0 <program> <options>``
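
For example, substituting GROMACS's MPI-enabled binary and a hypothetical input prefix (``gmx_mpi`` and ``-deffnm benchmark`` are assumptions - adjust to match the module you load and your own files)::

    mpirun -mca pml ucx --mca btl ^vader,tcp,uct -x UCX_NET_DEVICES=bond0 gmx_mpi mdrun -deffnm benchmark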


+++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++
GROMACS Program Quick List
+++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++

Below is a quick reference list of the different programs that make up the GROMACS suite.

