Merge pull request #41 from flindersuni/feature/jupyter-gpu
ODC & Jupyter GPU Update
The-Scott-Flinders committed Jun 14, 2022
2 parents d9bf754 + 604fd78 commit eed6106
Showing 16 changed files with 263 additions and 71 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -7,4 +7,7 @@ docs/source/_build
# Files used by Visual Code
.vscode/settings.json
*.code-workspace
.venv
.venv
.vscode/targets.log
.vscode/dryrun.log
.vscode/configurationCache.log
64 changes: 62 additions & 2 deletions docs/source/FAQ/faq.rst
@@ -20,7 +20,7 @@ There are three at this point:

* general
* gpu
* melfeu
* melfu

You can omit the

@@ -55,7 +55,7 @@ OOM Killer
Remember that each 'task' is its own little bucket - which means that SLURM tracks it individually! If a single task goes over its resource allocation, SLURM will kill it, and that usually causes a cascade failure in the rest of your program, as you suddenly have a process missing.
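
As a concrete illustration (the directive values below are arbitrary examples, not recommended settings), a job that asks for four tasks with a per-CPU memory limit is policed per task - if one task exceeds its share, that task alone is targeted by the OOM killer::

    #!/bin/bash
    #SBATCH --job-name=oom-example   # example name only
    #SBATCH --ntasks=4               # four independent 'buckets'
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=2G         # each task is capped at its own 2G share

    srun ./my_program                # 'my_program' is a placeholder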


IsoSeq3: Installation
Issues Installing IsoSeq3
=========================

IsoSeq3, from Pacific Biosciences, has install instructions that won't get you all the way on DeepThought. There are some missing packages, and some commands must be altered to get you up and running.
@@ -128,3 +128,63 @@ The given bx-python is a problematic module that appears in many of the BioScien
These steps are the same as the installation for IsoSeq3, but given how often this particular python package gives the support team issues, it gets its own section!

* conda install -c conda-forge -c bioconda bx-python
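
A minimal end-to-end sketch of the above, assuming a fresh conda environment (the environment name and the ``module load`` name are illustrative only - check ``module avail`` on DeepThought)::

    # Load conda on the HPC (module name is an assumption)
    module load Miniconda3

    # Create and activate a clean environment for the install
    conda create -n bxpython-env python=3.9 -y
    conda activate bxpython-env

    # Install bx-python from conda-forge and bioconda, as listed above
    conda install -c conda-forge -c bioconda bx-python -y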



My Jupyter Kernel Times Out
===============================
This is usually caused by one of two things:

* HPC has allocated all its Resources
* Incorrect Conda Setup

HPC Is Busy
------------

Your job will time out when the HPC is busy, as your job cannot get an allocation within 30 seconds (or so).
If you do not see a file like 'slurm-<NUMBER>.out' in your /home directory, then the HPC cannot fit your kernel's requested allocation, as all resources are busy.

To solve the above, you can either:

* Recreate a Kernel with lower resource requirements
* Wait for the HPC to be less busy

A sneaky command from the HPC Admin Team: ``sinfo -No "%17n %13C %10O %10e %30G"``. This gets you a layout like so::

HOSTNAMES CPUS(A/I/O/T) CPU_LOAD FREE_MEM GRES
hpc-node001 0/64/0/64 0.46 241647 gpu:tesla_v100:2(S:2,6)
hpc-node002 0/64/0/64 1.86 250777 gpu:tesla_v100:2(S:2,6)
hpc-node003 64/0/0/64 20.44 240520 (null)
hpc-node004 64/0/0/64 19.46 244907 (null)
hpc-node005 64/0/0/64 18.59 241284 (null)
hpc-node006 64/0/0/64 17.37 244390 (null)
hpc-node007 64/0/0/64 14.50 221633 (null)
hpc-node008 64/0/0/64 18.06 211002 (null)
hpc-node009 64/0/0/64 19.27 206833 (null)
hpc-node010 64/0/0/64 19.39 233411 (null)
hpc-node011 64/0/0/64 20.51 221966 (null)
hpc-node012 64/0/0/64 19.06 181808 (null)
hpc-node013 64/0/0/64 20.35 221835 (null)
hpc-node014 60/0/4/64 4.00 151584 (null)
hpc-node015 64/0/0/64 18.01 191874 (null)
hpc-node016 64/0/0/64 11.04 214227 (null)
hpc-node017 0/64/0/64 0.00 512825 (null)
hpc-node018 0/64/0/64 0.03 61170 (null)
hpc-node019 128/0/0/128 515.85 1929048 (null)
hpc-node020 128/0/0/128 30.31 1062956 (null)
hpc-node021 128/0/0/128 38.10 975893 (null)
hpc-node022 0/64/0/64 0.06 119681 gpu:tesla_v100:1(S:2)

What you want to look at is the first and second numbers in the CPUS(A/I/O/T) column. The first is 'Allocated' and the second is 'Idle' (available for use).
The above example shows that the GPU nodes are idle (0/64) while the general-queue nodes are fully allocated (64/0).
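
If you only want a per-partition summary rather than the per-node view, one possible shortcut (standard ``sinfo`` formatting, not part of the official instructions above) is::

    # Allocated/Idle/Other/Total CPUs, summarised per partition
    sinfo -o "%P %C"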

Incorrect Conda Environment Setup
-----------------------------------
The timeout error can also be caused by a missing package that the custom WLM integration requires to work correctly.

This means that the job started, but could not connect your Jupyter Notebook correctly. If you look in your home directory, you will see the previously mentioned 'slurm-<NUMBER>.out' file.
Right at the very bottom of the file (it's quite long, with lots of debugging information in it) you will see a message similar to:

* ``command not found ipykernel-wlm``

To fix this type of 'command not found' error for ipykernel or similar, go back to the Jupyter Hub Conda Setup instructions and double-check that you have installed *all* of the needed packages.
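
A quick way to confirm this failure mode and start fixing it is sketched below; the environment name is a placeholder and the package list is only an example - the authoritative list is in the Jupyter Hub Conda Setup instructions::

    # Show the tail of the most recent SLURM output file in your home directory
    ls -t ~/slurm-*.out | head -n 1 | xargs tail -n 20

    # Re-activate the conda environment your kernel uses and reinstall the
    # Jupyter pieces (example only - follow the setup guide for the full set)
    conda activate my-jupyter-env
    conda install -c conda-forge ipykernel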
2 changes: 1 addition & 1 deletion docs/source/FileTransfers/FileTransfersIntro.md
@@ -28,7 +28,7 @@ When using a *NIX based system, using the terminal is the fastest way to upload
Substitute your filename, FAN and password: type `scp FILENAME FAN@deepthought.flinders.edu.au:/home/FAN` and then hit enter.
Enter your password when prompted. This will put the file in your home directory on DeepThought. It looks (when substituted accordingly) similar to:

![](../_static/SCPExampleImage.png)
`scp /path/to/local/file fan@deepthought.flinders.edu.au:/path/on/deepthought/hpc/`
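
For example, with a made-up FAN of `abcd0001` and a local archive called `results.tar.gz` (both purely illustrative):

```
scp ./results.tar.gz abcd0001@deepthought.flinders.edu.au:/home/abcd0001/
```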

### The Longer Version

14 changes: 7 additions & 7 deletions docs/source/SLURM/SLURMIntro.md
@@ -44,7 +44,7 @@ Which will allow for greater details of how your score was calculated.

### Calculating Priority

SLURM tracks 'Resources'. This can be nearly anything on the HPC - CPU's, Power, GPU's, Memory, Storage, Licenses, anything that people share and could use really.
SLURM tracks 'Resources'. This can be nearly anything on the HPC - CPUs, power, GPUs, memory, storage, licenses - anything for which the HPC needs to track usage and allocation.

The basic premise is - you have:

@@ -54,7 +54,7 @@ The basic premise is - you have:

Then you multiply all three together to get your end priority. So, let's say you ask for 2 GPUs (the current max you can ask for)

A GPU on DeepThought (When this was written) is set to have these parameters:
A GPU on DeepThought (when this was written) is set to have these parameters:

- Weight: 5
- Factor: 1000
@@ -73,9 +73,9 @@ To give you an idea of the _initial_ score you would get for consuming an entire

**CPU**: `64 * 1 * 1000 = 64,000` (Measured Per CPU Core)

**RAM**: `256 * 0.25 * 1000 = 65,536,000` (Measured Per MB)
**RAM**: `256 * 0.25 * 1000 = 64,000` (Measured Per GB)

**Total**: `65,600,000`
**Total**: `128,000`

So, it stacks up very quickly, and you really want to write your job to ask for what it needs, and not much more! This is not the number you see and should only be taken as an example. If you want to read up on exactly how Fairshare works, then head on over to [here](https://slurm.schedmd.com/priority_multifactor.html).
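
If you are curious what weights are actually configured, one possible check (the partition name `general` is an assumption - substitute the partition you submit to) is:

```
# Show the partition definition; TRESBillingWeights lists per-resource weights, if set
scontrol show partition general | grep -i tres
```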

@@ -207,9 +207,9 @@ The following variables are set per job, and can be access from your SLURM Scrip

The DeepThought HPC will set some additional environment variables to manipulate some of the operating system functions. These directories are set at job creation time and then are removed when a job completes, crashes or otherwise exits.

This means that if you leave anything in $TMP or $SHM directories it will be *removed when your job finishes*.
This means that if you leave anything in $TMP, $BGFS or $SHM directories it will be *removed when your job finishes*.

To make that abundantly clear. If the Job creates `/cluster/jobs/$SLURM_JOB_USER/$SLURM_JOB_ID` it will also **delete that entire directory when the job completes**. Ensure that your last step in any job creation is to _move any data you want to keep to /scratch or /home_.
To make that abundantly clear: if the job creates `/cluster/jobs/$SLURM_JOB_USER/$SLURM_JOB_ID` (the $BGFS location), it will also **delete that entire directory when the job completes**. Ensure that the last step in any job script is to _move any data you want to keep to /scratch or /home_.
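
A minimal sketch of that final step (the destination under /scratch is an example path only - use your own area):

```
# Last lines of a job script: copy results out of the per-job $BGFS area
# before SLURM deletes it.
mkdir -p /scratch/users/"$USER"/my-job-results
cp -r "$BGFS"/. /scratch/users/"$USER"/my-job-results/
```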


|Variable Name | Description | Value |
@@ -261,7 +261,7 @@ To reiterate the warning above - if you leave anything in the $TMP or $SHM Direc

### Filename Patterns

Some commands will take a filename. THe following modifiers will allow you to generate files that are substituted with different variables controlled by SLURM.
Some commands will take a filename. The following modifiers will allow you to generate files that are substituted with different variables controlled by SLURM.

| Symbol | Substitution |
|-|-|
32 changes: 32 additions & 0 deletions docs/source/_static/_overrides.css
@@ -4,3 +4,35 @@
height: 100% !important;
max-width: 100% !important;
}

.wy-side-nav-search {
background-color: #FFD300;
}
.wy-side-nav-search>a {
color: #002F60;
}

.wy-menu-vertical {
background-color: #002F60;
}

.wy-nav-side {
background: #002F60;
color: #002F60;
}

.wy-menu-vertical header, .wy-menu-vertical p.caption {
color: #F6EEE1;
}

.wy-menu-vertical a:hover {
background-color: #21509F;
}

a {
color: #002F60;
}

.wy-menu-vertical a {
color: #b7b7b7;
}
9 changes: 5 additions & 4 deletions docs/source/index.rst
@@ -3,11 +3,11 @@ Welcome to the DeepThought HPC

The new Flinders University HPC is called DeepThought. This new HPC comprises AMD EPYC-based hardware and next-generation management software, allowing for a dynamic and agile HPC service.

.. _BeeGFS Section of Storage & Usage Guidelines: storage/storageusage.html
.. _/cluster: storage/storageusage.html

.. attention::
The new BeeGFS Parallel Filesystem mounted at /cluster has just been deployed, but is *not yet ready for usage*. It will appear in any disk
usage listings on the HPC. For further information and to prepare for when this filesystem is fully released, please read the `BeeGFS Section of Storage & Usage Guidelines`_.
The new BeeGFS Parallel Filesystem mounted at /cluster has just been deployed, and is **now ready for use**. It will appear in any disk
usage listings on the HPC. For further information, please read the `/cluster`_ section of the Storage Usage & Guidelines.

.. attention::
This documentation is under active development, meaning that it can
@@ -22,7 +22,7 @@ Attribution
If you use the HPC to form a part of your research, you should attribute your usage.
Flinders has minted a DOI that points to this documentation, specific to the HPC Service. It will also allow for tracking the research outputs that the HPC has contributed to.

Text Citation
Text Citation
++++++++++++++

.. _ARDC Data Citation: https://ardc.edu.au/resources/working-with-data/citation-identifiers/data-citation/
@@ -78,6 +78,7 @@ Table of Contents
software/matlab.rst
software/singularity.rst
software/vasp.rst
software/opendatacube.rst


.. toctree::
4 changes: 3 additions & 1 deletion docs/source/software/ansys.rst
@@ -1,17 +1,19 @@
-------------------------
ANSYS Engineering Suite
-------------------------

=============
ANSYS Status
=============

ANSYS 2021R2 is the current version of the ANSYS Suite installed on the HPC. Both Single-Node (-smp) and Multi-Node (-dis) execution are supported, as well as GPU acceleration.
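
As a rough illustration of the two modes (the ``ansys212`` executable name, core counts and file names are assumptions drawn from general ANSYS MAPDL usage, not DeepThought-specific instructions)::

    # Single-node, shared-memory run on 16 cores
    ansys212 -smp -np 16 -b -i model.dat -o model.out

    # Distributed (multi-node capable) run on 32 cores
    ansys212 -dis -np 32 -b -i model.dat -o model.out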


.. _ANSYS: https://www.ansys.com/

===============
ANSYS Overview
===============
===============
The ANSYS Engineering Suite is a comprehensive software suite for engineering simulation. More information can be found on the `ANSYS`_ website.


21 changes: 11 additions & 10 deletions docs/source/software/delft3d.rst
@@ -1,18 +1,19 @@
-------------------------
Delft3D
-------------------------
=======
Status
=======
=====================
Delft3D Status
=====================

Delft3D 4, Revision 65936 is installed and available for use on the HPC.

.. Delft3D:
.. _Delft3D Home: https://oss.deltares.nl/web/delft3d

==================
====================
Delft3D Overview
==================
====================

From `Delft3D`_:
From `Delft3D Home`_:

Delft3D is Open Source Software and facilitates the hydrodynamic (Delft3D-FLOW module), morphodynamic (Delft3D-MOR module), waves (Delft3D-WAVE module), water quality (Delft3D-WAQ module including the DELWAQ kernel) and particle (Delft3D-PART module) modelling

@@ -21,12 +22,12 @@ Delft3D is Open Source Software and facilitates the hydrodynamic (Delft3D-FLOW m
Delft3D Known Issues
================================

Delft3D does **not** currently support Multi-Node Execution. The binary swan_mpi.exe will *not work and immediately crash with errors*.
Delft3D does **not** currently support Multi-Node Execution. The binary swan_mpi.exe will *not* work and will immediately crash with errors.


+++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++
Delft3D Program Quick List
+++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++

Below are the two main binaries that are used as part of the Delft3D Suite.

19 changes: 9 additions & 10 deletions docs/source/software/gromacs.rst
@@ -1,18 +1,17 @@
--------
GROMACS
--------
===============
=======================================
GROMACS Status
===============

=======================================
GROMACS version 2021.5 is installed and available for use on the HPC.

.. _GROMACS: https://www.gromacs.org/

=================
GROMACS Overview
=================

==========================================
GROMACS Overview
==========================================
From `GROMACS`_:

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles.
@@ -22,19 +21,19 @@ It is primarily designed for biochemical molecules like proteins, lipids and nuc
GROMACS supports all the usual algorithms you expect from a modern molecular dynamics implementation.


======================================
================================================================
GROMACS Quickstart Command Line Guide
=======================================
================================================================

GROMACS uses UCX and will require a custom mpirun invocation. The module system will warn you of this when you load the module. The following is a known good starting point:


``mpirun -mca pml ucx --mca btl ^vader,tcp,uct -x UCX_NET_DEVICES=bond0 <program> <options>``
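
For example, substituting GROMACS's MPI-enabled binary and a hypothetical input prefix (``gmx_mpi`` and ``-deffnm benchmark`` are assumptions - adjust to match the module you load and your own files)::

    mpirun -mca pml ucx --mca btl ^vader,tcp,uct -x UCX_NET_DEVICES=bond0 gmx_mpi mdrun -deffnm benchmark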


+++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++
GROMACS Program Quick List
+++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++

Below is a quick reference list of the different programs that make up the GROMACS suite.

