04-hpc.Rmd

---
editor_options: 
  markdown: 
    wrap: 72
---

# High Performance Computing

Certain analysis use cases require high performance computing resources:

-   big data
-   parallel computing
-   lengthy computation times
-   restricted-use data

For analyses involving big data or models that take a long time to
estimate, a single laptop or desktop computer is often not powerful
enough or becomes inconvenient to use. Additionally, for analyses
involving restricted-use data, such as datasets containing personally
identifiable information, data use agreements typically stipulate that
the data should be stored and analyzed in a secure manner.

In these cases, you should use the high performance computing resources
available to emLab. emLab currently has two high performance computing
servers that are managed by UCSB’s General Research IT (GRIT). These
servers are named sequoia and quebracho. This section of the manual
describes how to use these two servers for high performance computing.

For now, please use sequoia for general emLab computing for most
projects. Quebracho is currently restricted to land use projects (e.g.,
`land-based-solutions` and projects starting with `cel`), so please only
use quebracho if you have already been doing so and have already
discussed this with Kathy or Robert. If you have any doubts about which
server to use, please use sequoia. Note that sequoia does not have a
GPU, but quebracho does. If you need access to a GPU and are not already
using quebracho, please contact Robert and Kathy to talk about using
quebracho.

## Available resources

|                                                     | Cores   | RAM             | GPU | USE                                   |
|------------------|--------------|--------------|--------------|--------------|
| quebracho                                           | 64      | 1TB             | Yes | Only land-use projects (for now)      |
| sequoia                                             | 192     | 1.5TB           | No  | All other emLab research              |
| [Knot](https://csc.cnsi.ucsb.edu/clusters/knot)     | 1,500   | 48 GB - 1TB     | Yes | UCSB shared resource                  |
| [Pod](https://csc.cnsi.ucsb.edu/clusters/pod)       | \~2,600 | 190 GB - 1.5 TB | Yes | UCSB shared resource                  |
| [Braid2](https://csc.cnsi.ucsb.edu/clusters/braid2) | \~2,200 | 192 - 368 GB    | Yes | UCSB condo cluster (PI must buy node) |

: HPC resources available to emLab researchers

The emLab SOP will focus on using quebracho and sequoia. For further
information on using other UCSB campus resources, you can refer to our
[specific guide](https://emlab-ucsb.github.io/cluster-guide/index.html)
on that. However please note that this guide is several years
out-of-date, and you may find better and more current information
directly on a UCSB website. Additionally, now that we have our own HPC
servers, we no longer recommend using Google Compute Engine, which is a
pay-as-you-go cloud computing server. It can be quite expensive, and has
setup challenges as compared to our own servers. However, if you need to
use GCE for whatever reason, emLab alumni Grant McDermott wrote a very
helpful tutorial on using [R Studio Server on
GCE](https://grantmcdermott.com/rstudio-server-compute-engine/).

## Available software

Quebracho currently has R Studio Server and JupyterLab installed. We
currently have R Studio installed on sequoia. GRIT manages these
installations for us. They will also manage updates for these.

If we wish to install additional software, we will need to decide on
these as a group and have GRIT install them for us. When considering new
software to install, we should consider whether or not it is already
available on other campus servers; what it will cost; and how many
people in emLab would use it. Generally speaking, if a specific piece of
software is expensive (e.g., Stata or Matlab), will not be used by too
many emLab folks, and is already available on other campus servers, we
should rely on these other campus servers and not install it on our own
servers. Users interested in MatLab should first try Pod which has the
necessary licenses and is available for free.

Sequoia will not have a python interface by default. If there is enough
interest, JupyterLab may be installed by making a request to GRIT. If
users wish to use python it is recommended that they install Visual
Studio Code (VS Code) available for free from Microsoft. With VS Code
installed, users can add the Remote SSH extension and access sequoia via
SSH tunnel. Further instructions can be found in the [VS Code
Documentation](https://csc.cnsi.ucsb.edu/clusters/pod). After accessing
sequoia via SSH tunnel, users may install their preferred python
distribution.
[Miniconda](https://docs.anaconda.com/free/miniconda/miniconda-install/)
is a good starting point, though other options are available. This
[Medium
article](https://vinurad13.medium.com/setting-up-miniconda-and-run-jupyter-notebooks-remotely-on-ubuntu-20-04-server-8d98a6cf4642)
is a good place for further installation guidance. Finally, it is
recommended to create custom python
[environments](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
for each project. 

For users that need Stata, it is already available on both UCSB’s Knot
cluster. More details for using Stata on Knot can be found
[here](https://static1.squarespace.com/static/573f69a2cf80a1adb090ba64/t/5b2c1f1288251b7405a4cb7a/1529618194751/UCSB_server_tutorial.pdf).
We will not be installing Stata on quebracho or sequoia.

For users of Matlab, it is already available on all campus clusters.
More details can be found
he<https://csc.cnsi.ucsb.edu/docs/using-matlab>re. We will not be
installing Matlab on quebracho or sequoia.

## Installing packages

You can install regular user-level R packages just like you would
normally using R Studio on your local machine. We recommend using the
renv R package to manage package dependencies for each project (i.e.,
GitHub repo) you work in. Please refer to the [emLab SOP section on
reproducibility](https://emlab-ucsb.github.io/SOP/3.3-reproducibility.html#package-dependencies)
for more information on renv.

Additionally, GRIT installs and updates many commonly used R packages on
the servers, which are accessible in a “site-library” for each server.
They update these once or twice a year. To add the GRIT R package
library to your library paths, you can run this line of code:
`.libPaths(c("/usr/local/lib/R/site-library/", .libPaths()))`

\
For system-level packages that you would normally need to install
through the terminal on your local machine (e.g., packages like `gdal`
or `libproj`), we will need to have GRIT install and manage these for
us. We have already had GRIT install many commonly-needed system-level
packages, which they will update once or twice a year. If you need a
particular package that is not yet installed, please start a help ticket
directly with GRIT: [help\@grit.ucsb.edu](mailto:help@grit.ucsb.edu) .

## Setting up a GRIT account

To use emLab’s HPC servers, you must have a GRIT account. Please refer
to the emLab Manual section on setting up and managing an account with
GRIT.

## Logging in

1.  On your local machine, connect to the UCSB Campus VPN. You can do so
    by downloading a VPN client for your operating system, such as
    Ivanti Pulse Secure. More details for connecting to the UCSB VPN and
    installation instructions are provided
    [here](https://www.it.ucsb.edu/network-infrastructure-services/ivanti-connect-secure-campus-vpn).
    Note: Even if you are on the UCSB campus you will still need to
    connect to the campus VPN.

2.  Once you’ve connected to the VPN, you are ready to access the
    server. To access R Studio Server, you can simply navigate to one of
    these links. Once there, you will be prompted to enter your GRIT
    user ID and password. Once you’ve done this, you are ready to use R
    Studio Server!

    -   Sequoia: <https://sequoia.grit.ucsb.edu/rstudio/>

    -   Quebracho: <https://quebracho.geog.ucsb.edu/rstudio/>

## Accessing data

Please refer to [this
section](https://emlab-ucsb.github.io/SOP/1.3-grit-data-storage-space.html#grit-data-storage-space)
of the emLab SOP for a description of the data directory structure for
our emLab GRIT data storage space.

All data in the emLab GRIT data storage space can be directly accessed
on each of the servers (sequoia and quebracho) without any changes to
the directory paths. All data in the `emlab/data` and
`emlab/projects/current-projects` directory physically lives on
high-speed hard drives attached to sequoia, so if you need to work on
data in these directories, you will have the best computing performance
when using sequoia. Please refer to [this
section](https://emlab-ucsb.github.io/SOP/2.3-accessing-emlab-data-in-r.html)
of the emLab SOP for a code snippet that can be used to directly access
data on the server in R.

\
In addition to having access to our emLab GRIT data storage space, which
is shared across all members of our team, all individual users also have
a private user-specific storage space. All GRIT users get a free 50GB
personal storage space by default. As a general best practice, we
recommend storing all data on the emLab data storage space, and only
storing cloned GitHub repos and user-specific R packages and settings in
your personal user space. For example, you should store all
project-specific data in the appropriate directory under the emLab data
storage space, but you should store all of your cloned GitHub repos s in
your personal storage space. By default, when you clone repos from
GitHub they are stored in your personal storage space, along with any of
your user-specific R packages and configurations. If for whatever reason
your personal storage space exceeds 50GB, it will stop working, so you
should ensure you always have a safe buffer. However, we envision that
if users only keep cloned GitHub repos and R packages in their personal
user space, they should not need to worry about hitting the 50GB limit.
You can check your current personal storage by typing df -h in the
terminal and then looking for your username.

## Accessing code

Here we can talk about how to use the servers with GitHub code
management. Essentially, you can work with projects and GitHub
repositories on RStudio Server exactly like you can on your personal
machine. One major difference is how to set up GitHub authentication,
which is a little different on the servers than it would be on your
personal machine. So we can provide explicit instructions on doing that.

Please refer to [this
section](https://emlab-ucsb.github.io/emlab-manual/emlab-workflow-and-platforms.html#git-and-github)
of the emLab SOP for directions on how to set up and manage git and
GitHub for your new server workspace. One important difference between
your personal laptop and using a server is that file permissions may be
such that other users can see and sometimes read or write files in your
directories. Ideally, any confidential information such as your git
credentials should be secured differently from your personal computer.
Step 6 of the Git and Github section of the emLab manual is therefore
not recommended in a multi-user server environment because your token
may end up viewable to other users as plain text. \

Instead of storing your Personal Authentication Token (PAT) as plain
text, it is recommended to use one of the following options. Using
either of these two approaches will also mean that credentials are
stored between sessions, which should make the user experience a bit
easier. The first approach, using an SSH key, is recommended

1.  Use an SSH key instead of a PAT

    1.  Set up your SSH key on the GRIT server.

        1.  If you are not using R Studio Server, or prefer to use the
            terminal, follow these instructions: 

            1.  You can generate a new SSH key with the terminal command

                1.  ssh-keygen -t ed25519 -C "email\@example.com"

            2.  You are prompted to select a location (hit enter for the
                default location)

            3.  You are prompted to set a password (hit enter to not
                require one)

            4.  Start your SSH agent in the background with 

                1.  eval "\$(ssh-agent -s)"

            5.  Add your private key to the SSH agent with

                1.  ssh-add \~/.ssh/id_ed25519

            6.  Copy your public key with 

                1.  cat \~/.ssh/id_rsa.pub \| xclip -selection clipboard

        2.  If you are using R Studio Server and you prefer to not use
            the terminal approach, the instructions are a bit more
            streamlined. Follow the instructions in this
            [link](https://happygitwithr.com/ssh-keys#option-1-set-up-from-rstudio).  

            1.  If after going through these instructions you prefer to
                not use a password, you can remove it using the
                instructions provided in this
                [link](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/working-with-ssh-key-passphrases#adding-or-changing-a-passphrase).

    2.  Add your public key to your GitHub account

        1.  GitHub \> Settings \> SSH and GPG keys \> New SSH key

        2.  Paste in your key (either from Step 6 in the terminal option
            above, or copied from R Studio Server in the R Studio option
            above)

    3.  Now you can clone your repository onto the GRIT server. This
        means that you need to either: 1) when cloning the repository
        for the first time you need to use the SSH url rather than the
        HTTPS url; or 2) if you’ve already clone the repository, set
        your repository URL to the SSH version

        1.  If cloning a repo for the first time using R Studio Server,
            you can simply click “File \> New Project \> Version Control
            \> Git”, and then enter your repo’s SSH link

            1.  [git\@github.com](mailto:git@github.com):username/example_repo.git
                (for example, this might look like
                git\@github.com:emlab-ucsb/ocean-ghg.git)

        2.  Alternatively, in the terminal You can manually set specific
            repo URLs to SSH with: 

            1.  git remote set-url origin
                [git\@github.com](mailto:git@github.com):username/example_repo.git
                (for example, this might look like
                git\@github.com:emlab-ucsb/ocean-ghg.git)

2.  Caching your PAT temporarily 

    1.  Create a PAT on GitHub

    2.  Add credential cache timeout instructions to your git config
        file

        1.  git config --global credential.helper 'cache --timeout=3600'

        2.  Adjust the timeout length (in seconds) as needed 

    3.  Push changes to GitHub

        1.  When prompted enter your username 

        2.  For the password enter your PAT

            1.  Future pushes will not not require you to enter
                credentials within the timeout window

    4.  This is not a good long term solution because you will need to
        re-enter your credentials anytime the server restarts or when
        your cache timeout ends

## Using htop to monitor shared resources

Always run [htop](https://htop.dev/) in the terminal to see how the many
cores and how much RAM are currently being used by others. You can
customize the htop display to make things easier to see. For example:

-   After entering htop, press F2 to enter setup. You can also click
    directly on setup to enter it.

-   Once you enter setup, if you have trouble seeing the setup options,
    you can try reducing your browser’s text size temporarily in order
    to see the setup options.

-   Sequoia has 128 cores so the default view with 4 columns means a
    pretty large display. In the Meters setup, you can change the left
    column to be “CPUs (1-4/8) [Bar]” and the right column to be “CPUs
    (5-8/8) [Bar]”. This will condense the output and force 8 columns. 

-   I also like to add disc IO to the left column below memory. 

-   In the “Display options” setup you can select some that will clean
    up the process information below the resources monitor. I like to
    make sure to select 

    -   Tree view

    -   Tree view sorted by PID

    -   Shadow other users’ process (makes it easier to see your own)

    -   Count CPUs from 1

    -   Enable the mouse

-   Press F10 when done

## Best practices for sharing our computational resources

Here we outline our best practices for using shared computational
resources. These are meant to be living guidelines that will be adapted
by our team as needed:

-   In order to keep some of our computational resources easy to use
    interactively without queues or SLURM, we will need to coordinate
    and share. Sharing is caring! Common courtesy can go a long way.
    Hopefully we can largely self-manage this. If not, we will need to
    move computational resources onto queues and SLURM, which could
    introduce barriers to analysis and rapid prototyping.

-   Always run [htop](https://htop.dev/) in the terminal to see how the
    many cores and how much RAM are currently being used by others

-   In general on sequoia, feel free to run analyses that use up to 20
    cores and 150GB of RAM. We will likely adaptively manage these
    specific numbers once we start using sequoia and getting a better
    understanding of how many resources we are using.

-   For larger analyses that require more cores or RAM, coordinate with
    others over the server slack channel (`#hpc-core-dination`) to
    ensure that workflows are not disrupted and that everyone has
    reasonable access to computational resources

-   Generally, we recommend piloting your code using a small subset of
    your data and/or just a single core, either on your local computer
    or on one of our HPC servers. Then once you know it works and have a
    sense of how much memory it will use and how long it will take to
    execute, you can go ahead and run the full analysis on the server.
    And if it looks like the full analysis will require resources beyond
    the standard recommend 20 cores and 150GB, coordinate with the team
    on `#hpc-core-dination`.