
Merge pull request #36 from flindersuni/feature/minconda-jupyter
Jupyter/Conda & BeeGFS
The-Scott-Flinders committed Mar 21, 2022
2 parents 3cf24e5 + 0a5f3eb commit 8d4c985
Showing 6 changed files with 90 additions and 68 deletions.
12 changes: 11 additions & 1 deletion docs/source/FAQ/faq.rst
@@ -4,6 +4,16 @@ FAQ

Below are some of the common issues that the team has been asked to resolve more than once, so we have put them here to (hopefully) answer your questions before you have to wait in the Ticket Queue!

Host Not Found
===============

When attempting to connect to the HPC, you receive a message that says 'Could not find deepthought.flinders.edu.au'.

1. If you are on campus, contact ServiceDesk via ServiceOne or Phone.
2. If you are off campus or working remotely, connect to the VPN and retry.



What are the SLURM Partitions?
===============================
There are three partitions at this point:
@@ -17,7 +27,7 @@ You can omit the
* #SBATCH --partition=<name> directive


as the sane-default for you is the hpc_general partition. If you need access to the GPU's you **must** user the hpc_gpu queue.
as the sane default for you is the general partition. If you need access to the GPUs, you **must** use the gpu queue.
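
For reference, a job script that explicitly targets the GPU partition might start with a header like the sketch below. The job name and resource values are placeholders only, and the ``--gres`` line assumes GPUs are requested through SLURM's standard GRES mechanism on this cluster:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=example-gpu-job   # placeholder name
   #SBATCH --partition=gpu              # omit to fall back to the default general partition
   #SBATCH --gres=gpu:1                 # assumption: GPUs are allocated via GRES
   #SBATCH --ntasks=1
   #SBATCH --cpus-per-task=4
   #SBATCH --mem=16G
   #SBATCH --time=01:00:00

   srun ./my_program                    # placeholder for your actual workload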

SLURM - Tasks & OpenMPI/MPI
===========================
37 changes: 0 additions & 37 deletions docs/source/FileTransfers/FileTransfersIntro.md
@@ -8,46 +8,9 @@ Before we start, ensure that you have read the [Storage Overview & Usage Guideli

All file transfers are done via the Secure File Transfer Protocol (SFTP) or the Secure Copy Protocol (SCP). Other options, like the tool RSync, are also usable. This guide will focus upon GUI-based tools using SFTP.

## Before we get started

The HPC is a little different from your desktop at home when it comes to storage, not just computing power. It's a shared resource, so we can't store everybody's data for all time - there just isn't enough space.

On DeepThought, there are two main storage tiers, with a smaller pool for your documents and scripts. Firstly, our bulk storage (approx. 250TB) is the 'Scratch' area (located at /scratch/user/$FAN), which consists of slower, spinning Hard-Disk Drives (HDDs). The smaller, hyper-fast NVMe Solid-State Drives (located at /local) are roughly 400GB on the 'standard' nodes (1-16) and 1.5TB on the 'high-capacity' nodes (19-21).

There is a critical difference between these two locations. The /scratch area is a common storage area. You can access it from all of the login, management and compute nodes on the HPC. This is not the same as /local, which is only available on each compute node. That is - if your job is running on Node001, /local only exists on that particular node - you cannot access it anywhere else on the HPC.

- /home/$FAN
- /scratch/$FAN

### Where is the old r_drive?

The old /r_drive/ mount points were a legacy implementation left over from the eRSA Project. All the data from these drives has been migrated to a /RDrive/ share with the same name, and will appear automatically.

### /Home

Your 'home' directory. This is a small amount of storage (~11TB total) to store your small bits and pieces. It is analogous to the Windows 'Documents' folder.

At a command prompt, your home directory usually gets shortened to ~/.

#### What to store in /home

Here is a rough guide as to what should live in your /home/$FAN directory. In general, you want to keep small things here.

- SLURM Scripts
- Results from Jobs.
- 'Small' Data-Sets (<5GB)

### /Scratch

Scratch is your working space. Depending upon your dataset, you may need to run your job here - this is not optimal and will be much slower than running it from /local. Scratch is still not an area to store your data permanently - there are no backups in place for the HPC, so ensure you follow the [HPC Research Data Flow]() and the [HPC Job Data Flow]().

#### What to store in /scratch

Here is a rough guide as to what should live in your /scratch/$FAN directory. In general, anything large, bulky and only needed for a little while should go here.

- Job Working Data-sets
- Intermediate files

## Linux/Unix File Transfers

Linux/Unix based systems have native support for the SFTP protocol. The Secure Copy Protocol (SCP) is also widely accepted, and can sometimes offer an edge in transfer speed. Tools such as RSYNC are also usable.
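
As a minimal sketch, a command-line transfer from a Linux/Unix machine could look like the following. Replace `FAN` with your own FAN; the `/scratch/user/FAN` destination follows the scratch layout described in the Storage Overview and may differ for your account:

```bash
# Copy a single file into your scratch area over SCP
scp ./results.tar.gz FAN@deepthought.flinders.edu.au:/scratch/user/FAN/

# Synchronise a whole directory with rsync over SSH (only changed files are re-sent)
rsync -avP ./my_dataset/ FAN@deepthought.flinders.edu.au:/scratch/user/FAN/my_dataset/
```
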
6 changes: 4 additions & 2 deletions docs/source/SLURM/SLURMIntro.md
@@ -209,7 +209,7 @@ The DeepThought HPC will set some additional environment variables to manipulate

This means that if you leave anything in $TMP or $SHM directories it will be *removed when your job finishes*.

To make that abundantly clear. If the Job creates `/local/jobs/$SLURM_USER/$SLURM_JOB_ID/` it will also **delete that entire directory when the job completes**. Ensure that your last step in any job creation is to _move any data you want to keep to /scratch or /home_.
To make that abundantly clear: if the job creates `/cluster/jobs/$SLURM_JOB_USER/$SLURM_JOB_ID`, it will also **delete that entire directory when the job completes**. Ensure that the last step in any job script is to _move any data you want to keep to /scratch or /home_.


|Variable Name | Description | Value |
@@ -220,7 +220,9 @@ To make that abundantly clear. If the Job creates `/local/$SLURM_USER/$SLUR
| $TEMP | An alias for $TMP| /local/$SLURM_USER/$SLURM_JOB_ID/ |
| $TEMPDIR | An alias for $TMP| /local/$SLURM_USER/$SLURM_JOB_ID/ |
| $TEMP_DIR | An alias for $TMP| /local/$SLURM_USER/$SLURM_JOB_ID/ |
| $SCRATCH_DIR | A Per-Job Folder on the HPC /scratch mount | /scratch/users/$SLURM_USER/$SLURM_JOB_ID/ |
| $BGFS | The writable folder in /cluster for your job | /cluster/jobs/$SLURM_JOB_USER/$SLURM_JOB_ID |
| $BGFS_DIR | An alias for $BGFS | /cluster/jobs/$SLURM_JOB_USER/$SLURM_JOB_ID |
| $BGFSDIR | An alias for $BGFS | /cluster/jobs/$SLURM_JOB_USER/$SLURM_JOB_ID |
| $SHM_DIR | A Per-Job Folder on the Compute Node Shared-Memory / Tempfs Mount | /dev/shm/jobs/$USER/ |
| $OMP_NUM_THREADS | The OpenMP CPU Count Environment Variable | $SLURM_CPUS_PER_TASK |
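
As a minimal sketch using only the variables from the table above, the end of a job script that works in the per-job BeeGFS directory might look like this. The `results/` directory and the workload command are placeholders only:

```bash
# Work inside the per-job BeeGFS directory - SLURM deletes it when the job ends
cd "$BGFS"
mkdir -p results
srun ./my_program --output results/      # placeholder for your actual workload

# Final step: move anything you want to keep off the transient storage
cp -r "$BGFS/results" "$SCRATCH_DIR/"    # $SCRATCH_DIR sits on /scratch, which is not cleaned up with the job
```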

6 changes: 6 additions & 0 deletions docs/source/index.rst
@@ -3,6 +3,12 @@ Welcome to the DeepThought HPC

The new Flinders University HPC is called DeepThought. This new HPC comprises AMD EPYC-based hardware and next-generation management software, allowing for a dynamic and agile HPC service.

.. _BeeGFS Section of Storage & Usage Guidelines: storage/storageusage.html

.. attention::
The new BeeGFS Parallel Filesystem mounted at /cluster has just been deployed. For instructions on the restrictions and how to
take advantage of the performance increase this filesystem brings, please read the `BeeGFS Section of Storage & Usage Guidelines`_.

.. attention::
This documentation is under active development, meaning that it can
change over time as we improve it. Please email deepthought@flinders.edu.au if
14 changes: 14 additions & 0 deletions docs/source/software/jupyter.rst
@@ -20,3 +20,17 @@ via the following `Jupyter URL`_ or manually via https://deepweb.flinders.edu.au
the same as the HPC, your FAN and password.

If you are a student with access to the HPC, the above URLs may work - the URL http://deepteachweb.flinders.edu.au/jupyter is guaranteed to work correctly.


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Using Conda Environments in Jupyter Hub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can use your own custom conda environment as a Jupyter kernel. Doing so on the HPC via SLURM requires a few simple extra steps.


1. Ensure you have a named conda environment that is loadable by ``conda activate <ENVIRONMENT_NAME>``
2. Execute ``pip install cm-jupyter-eg-kernel-wlm``. You must use pip, as the package is not available via conda.
3. Log into the HPC Jupyter instance at the `Jupyter URL`_.
4. Create a new Kernel, using the 'Conda via SLURM' Template. Ensure you change the environment from 'base' to the environment you wish to use.
5. Use this Kernel with your Jupyter Notebook.
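
As a rough sketch, steps 1 and 2 might look like the following from an HPC shell session. The environment name, Python version and packages are placeholders only, and this assumes ``conda`` is already available in your session:

.. code-block:: bash

   # 1. Create and activate a named conda environment (name and packages are examples)
   conda create --name my-jupyter-env python=3.9 numpy
   conda activate my-jupyter-env

   # 2. Install the SLURM-aware Jupyter kernel package (pip only - not available via conda)
   pip install cm-jupyter-eg-kernel-wlm
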
83 changes: 55 additions & 28 deletions docs/source/storage/storageusage.rst
@@ -5,10 +5,13 @@ Storage Overview & Usage Guidelines
The HPC is a little different from your desktop at home when it comes to storage, not just computing power. It's a shared resource, so we can't store everybody's data for all time - there just isn't enough space.
So, before we start putting files onto the HPC, it's best you know where to put them in the first place.

On DeepThought, there are two main storage tiers. Firstly, our bulk storage is the 'Scratch' area, which consists of slower, spinning Hard-Disk Drives (HDDs).
The smaller, hyper-fast NVMe Solid-State Drives (SSDs) are located at /local. For the exact specifications and capacities, see the `System Specifications`_.
On DeepThought, there are three main storage tiers. Firstly, our bulk storage is the 'Scratch' area, which consists of slower, spinning Hard-Disk Drives (HDDs). The next storage tier is the 'Cluster' parallel filesystem.
For distributed jobs, or jobs that require lots of staged files, this filesystem is the standard high-speed, low-latency HPC filesystem. Be aware that it is smaller than /scratch
and, as such, is a `transient` filesystem. Finally, there are hyper-fast NVMe Solid-State Drives (SSDs) located at /local. For the exact specifications and capacities, see the `System Specifications`_.

There is a critical difference between these two locations. The /scratch area is a common storage area. You can access it from all of the login, management and compute nodes on the HPC. This is not the same as /local, which is only available on each compute node. That is - if your job is running on Node001, /local only exists on that particular node - you cannot access it anywhere else on the HPC.
We also integrate with R-Drive to give you access to your Research Data Storage allocations. This happens automatically for any shares you have access to.

There are critical differences between these three locations, so please read this page carefully.

.. attention:: The HPC Job & Data Workflow, along with links to the new Data-Workflow Management Portal are under construction and will be linked here when completed.

@@ -17,17 +20,19 @@ Storage Accessibility Overview
################################
As a general guide, the following table presents the overall storage for the HPC.

+-----------------------+--------------------------+-----------------------------+
| Filesystem Location | Accessible From | Capacity |
+=======================+==========================+=============================+
| /scratch | All Nodes | ~250TB |
+-----------------------+--------------------------+-----------------------------+
| /home | All Nodes | ~12TB |
+-----------------------+--------------------------+-----------------------------+
| /local | Individual Compute Nodes | ~400GB or ~1.5TB |
+-----------------------+--------------------------+-----------------------------+
| /RDrive/\<Share Name> | Head Nodes | Share Dependant |
+-----------------------+--------------------------+-----------------------------+
+-----------------------+--------------------------+------------------+
| Filesystem Location | Accessible From | Capacity |
+=======================+==========================+==================+
| /scratch | All Nodes | ~250TB |
+-----------------------+--------------------------+------------------+
| /cluster | All Nodes | ~45TB |
+-----------------------+--------------------------+------------------+
| /home | All Nodes | ~12TB |
+-----------------------+--------------------------+------------------+
| /local | Individual Compute Nodes | ~400GB or ~1.5TB |
+-----------------------+--------------------------+------------------+
| /RDrive/\<Share Name> | Head Nodes | Share Dependant |
+-----------------------+--------------------------+------------------+

.. warning:: The HPC is classed as **volatile** storage. Any research data or datasets that you want backed up MUST be moved to /RDrive.

@@ -37,20 +42,6 @@ Usage Guidelines

The following sections will go over the individual storage locations/mounts along with some general guidelines of what should be stored where.

=======
/Home
=======
Your 'home' directory. This is a small amount of storage to store your small bits and pieces. It is analogous to the Windows 'Documents' folder. At a command prompt, your home directory will be shortened to ~/.

^^^^^^^^^^^^^^^^^^^^^^^^
What to store in /home
^^^^^^^^^^^^^^^^^^^^^^^^
Here is a rough guide as to what should live in your /home/$FAN directory. In general, you want to keep small things here.

* SLURM Scripts
* Results from Jobs.
* 'Small' Data-Sets (<5GB)

==========
/Scratch
==========
@@ -66,6 +57,42 @@ Here is a rough guide as to what should live in your /scratch/$FAN directory. In
* Job Working Data-sets
* Large Intermediate Files

===========
/cluster
===========

Cluster is the new high-speed parallel filesystem for DeepThought, deployed with BeeGFS. **Please read this section carefully**.

The directories you can write to in /cluster are controlled by SLURM. When your job starts, SLURM sets multiple environment variables and
creates directories for you to use on this filesystem. See the environment variables section of the SLURM guide for more information.

Once your job completes, is cancelled, or errors out, SLURM removes the entire directory for your job. That means *if you do not move your data from the /cluster
filesystem, you will lose all of it*. This is by design, and the HPC Team cannot recover any data lost this way.
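
A minimal sketch of that workflow inside a job script is shown below. The dataset and result names are placeholders, the /scratch path should be adjusted to your own, and it assumes the per-job /cluster path is exposed through the ``$BGFS`` variable described in the SLURM guide:

.. code-block:: bash

   # Stage input data from /scratch into the per-job /cluster directory
   cp -r /scratch/user/$USER/my_dataset "$BGFS/"

   # Run the workload against the fast parallel filesystem
   cd "$BGFS"
   mkdir -p results
   srun ./my_program my_dataset/ results/   # placeholder command

   # Stage results back out *before* the job ends - SLURM deletes this directory afterwards
   cp -r "$BGFS/results" /scratch/user/$USER/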


^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What to store in /cluster?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Your working data sets
* Temporary job files
* Results, before you copy them back to /scratch

=======
/Home
=======
Your 'home' directory. This is a small amount of storage to store your small bits and pieces. It is analogous to the Windows 'Documents' folder. At a command prompt, your home directory will be shortened to ~/.

^^^^^^^^^^^^^^^^^^^^^^^^
What to store in /home
^^^^^^^^^^^^^^^^^^^^^^^^
Here is a rough guide as to what should live in your /home/$FAN directory. In general, you want to keep small things here.

* SLURM Scripts
* Results from Jobs.
* 'Small' Data-Sets (<5GB)


=========
/Local
=========
