---
editor_options:
  markdown:
    wrap: 72
---

# High Performance Computing
Certain analysis use cases require high performance computing resources:
- big data
- parallel computing
- lengthy computation times
- restricted-use data
For analyses involving big data or models that take a long time to
estimate, a single laptop or desktop computer is often not powerful
enough or becomes inconvenient to use. Additionally, for analyses
involving restricted-use data, such as datasets containing personally
identifiable information, data use agreements typically stipulate that
the data should be stored and analyzed in a secure manner.
In these cases, you should use the high performance computing resources
available to emLab. emLab currently has two high performance computing
servers that are managed by UCSB’s General Research IT (GRIT). These
servers are named sequoia and quebracho. This section of the manual
describes how to use these two servers for high performance computing.
For now, please use sequoia for general emLab computing for most
projects. Quebracho is currently restricted to land use projects (e.g.,
`land-based-solutions` and projects starting with `cel`), so please only
use quebracho if you have already been doing so and have already
discussed this with Kathy or Robert. If you have any doubts about which
server to use, please use sequoia. Note that sequoia does not have a
GPU, but quebracho does. If you need access to a GPU and are not already
using quebracho, please contact Robert and Kathy to talk about using
quebracho.
## Available resources
| Server | Cores | RAM | GPU | Use |
|------------------|--------------|--------------|--------------|--------------|
| quebracho | 64 | 1 TB | Yes | Only land-use projects (for now) |
| sequoia | 192 | 1.5 TB | No | All other emLab research |
| [Knot](https://csc.cnsi.ucsb.edu/clusters/knot) | 1,500 | 48 GB - 1 TB | Yes | UCSB shared resource |
| [Pod](https://csc.cnsi.ucsb.edu/clusters/pod) | \~2,600 | 190 GB - 1.5 TB | Yes | UCSB shared resource |
| [Braid2](https://csc.cnsi.ucsb.edu/clusters/braid2) | \~2,200 | 192 - 368 GB | Yes | UCSB condo cluster (PI must buy node) |

: HPC resources available to emLab researchers

The emLab SOP focuses on using quebracho and sequoia. For information
on using other UCSB campus resources, refer to our
[cluster guide](https://emlab-ucsb.github.io/cluster-guide/index.html).
Please note, however, that this guide is several years out of date, and
you may find better and more current information directly on a UCSB
website. Additionally, now that we have our own HPC servers, we no
longer recommend using Google Compute Engine (GCE), a pay-as-you-go
cloud computing service. It can be quite expensive and is harder to set
up than our own servers. However, if you need to use GCE for whatever
reason, emLab alumnus Grant McDermott wrote a very helpful tutorial on
using [RStudio Server on
GCE](https://grantmcdermott.com/rstudio-server-compute-engine/).
## Available software
Quebracho currently has RStudio Server and JupyterLab installed, and
sequoia currently has RStudio Server. GRIT manages these installations
and their updates for us.

If we wish to install additional software, we will need to decide on it
as a group and have GRIT install it for us. When considering new
software, we should consider whether it is already available on other
campus servers, what it will cost, and how many people in emLab would
use it. Generally speaking, if a piece of software is expensive (e.g.,
Stata or Matlab), would be used by only a few emLab members, and is
already available on other campus servers, we should rely on those
servers and not install it on our own. Users interested in Matlab should
first try Pod, which has the necessary licenses and is available for
free.

Sequoia does not have a Python interface by default. If there is enough
interest, we can ask GRIT to install JupyterLab. In the meantime, users
who wish to use Python can install Visual Studio Code (VS Code),
available for free from Microsoft, add the Remote - SSH extension, and
connect to sequoia over SSH. Further instructions can be found in the
[VS Code documentation](https://code.visualstudio.com/docs/remote/ssh).
Once connected, users may install their preferred Python distribution.
[Miniconda](https://docs.anaconda.com/free/miniconda/miniconda-install/)
is a good starting point, though other options are available. This
[Medium
article](https://vinurad13.medium.com/setting-up-miniconda-and-run-jupyter-notebooks-remotely-on-ubuntu-20-04-server-8d98a6cf4642)
offers further installation guidance. Finally, we recommend creating a
separate Python
[environment](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
for each project.

For users who need Stata, it is already available on UCSB’s Knot
cluster. More details for using Stata on Knot can be found
[here](https://static1.squarespace.com/static/573f69a2cf80a1adb090ba64/t/5b2c1f1288251b7405a4cb7a/1529618194751/UCSB_server_tutorial.pdf).
We will not be installing Stata on quebracho or sequoia.

For users of Matlab, it is already available on all campus clusters.
More details can be found
[here](https://csc.cnsi.ucsb.edu/docs/using-matlab). We will not be
installing Matlab on quebracho or sequoia.
## Installing packages
You can install regular user-level R packages just like you would
normally using R Studio on your local machine. We recommend using the
renv R package to manage package dependencies for each project (i.e.,
GitHub repo) you work in. Please refer to the [emLab SOP section on
reproducibility](https://emlab-ucsb.github.io/SOP/3.3-reproducibility.html#package-dependencies)
for more information on renv.
Additionally, GRIT installs and updates many commonly used R packages on
the servers, which are accessible in a “site-library” for each server.
They update these once or twice a year. To add the GRIT R package
library to your library paths, you can run this line of code:
`.libPaths(c("/usr/local/lib/R/site-library/", .libPaths()))`
\
For system-level packages that you would normally need to install
through the terminal on your local machine (e.g., packages like `gdal`
or `libproj`), we will need to have GRIT install and manage these for
us. We have already had GRIT install many commonly-needed system-level
packages, which they will update once or twice a year. If you need a
particular package that is not yet installed, please open a help ticket
directly with GRIT: [help\@grit.ucsb.edu](mailto:help@grit.ucsb.edu).
## Setting up a GRIT account
To use emLab’s HPC servers, you must have a GRIT account. Please refer
to the emLab Manual section on setting up and managing an account with
GRIT.
## Logging in
1. On your local machine, connect to the UCSB Campus VPN. You can do so
   by downloading a VPN client for your operating system, such as
   Ivanti Pulse Secure. More details for connecting to the UCSB VPN and
   installation instructions are provided
   [here](https://www.it.ucsb.edu/network-infrastructure-services/ivanti-connect-secure-campus-vpn).
   Note: Even if you are on the UCSB campus, you will still need to
   connect to the campus VPN.
2. Once you are connected to the VPN, open one of the links below to
   access RStudio Server, and enter your GRIT user ID and password when
   prompted. You are then ready to use RStudio Server!
    - Sequoia: <https://sequoia.grit.ucsb.edu/rstudio/>
    - Quebracho: <https://quebracho.geog.ucsb.edu/rstudio/>
## Accessing data
Please refer to [this
section](https://emlab-ucsb.github.io/SOP/1.3-grit-data-storage-space.html#grit-data-storage-space)
of the emLab SOP for a description of the data directory structure for
our emLab GRIT data storage space.
All data in the emLab GRIT data storage space can be directly accessed
on each of the servers (sequoia and quebracho) without any changes to
the directory paths. All data in the `emlab/data` and
`emlab/projects/current-projects` directories physically lives on
high-speed hard drives attached to sequoia, so if you need to work on
data in these directories, you will have the best computing performance
when using sequoia. Please refer to [this
section](https://emlab-ucsb.github.io/SOP/2.3-accessing-emlab-data-in-r.html)
of the emLab SOP for a code snippet that can be used to directly access
data on the server in R.
\
In addition to having access to our emLab GRIT data storage space, which
is shared across all members of our team, all individual users also have
a private user-specific storage space. All GRIT users get a free 50GB
personal storage space by default. As a general best practice, we
recommend storing all data on the emLab data storage space, and only
storing cloned GitHub repos and user-specific R packages and settings in
your personal user space. By default, when you clone repos from GitHub
they are stored in your personal storage space, along with any of your
user-specific R packages and configurations. If your personal storage
space exceeds 50GB, it will stop working, so you should always keep a
safe buffer. However, if you only keep cloned GitHub repos and R
packages in your personal user space, you should not need to worry about
hitting the 50GB limit. You can check your current personal storage by
typing `df -h` in the terminal and then looking for your username.
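
As a rough sketch of how you might keep an eye on the 50GB limit, the
shell snippet below measures a directory with `du` and compares it
against the limit. The helper name `quota_status`, the 80% warning
threshold, and the assumption that your personal storage space is your
home directory (`$HOME`) are illustrative choices and are not part of
GRIT's tooling.

```shell
# Hypothetical helper: classify usage (in KB) against a limit (in GB).
# The 80% warning threshold is an assumption chosen for illustration.
quota_status() (
  used_kb=$1
  limit_gb=${2:-50}
  limit_kb=$(( limit_gb * 1024 * 1024 ))
  pct=$(( used_kb * 100 / limit_kb ))
  if [ "$pct" -ge 100 ]; then
    echo "OVER ${pct}%"
  elif [ "$pct" -ge 80 ]; then
    echo "WARN ${pct}%"
  else
    echo "OK ${pct}%"
  fi
)

# Measure the home directory (assumed here to be the personal storage space).
used_kb=$(du -sk "$HOME" 2>/dev/null | cut -f1)
quota_status "${used_kb:-0}" 50
```

If the result starts with `WARN` or `OVER`, it is probably time to move
data into the shared emLab data storage space or clean up unused
packages.
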
## Accessing code
You can work with projects and GitHub repositories on RStudio Server
exactly as you can on your personal machine. The one major difference is
GitHub authentication, which is set up a little differently on the
servers than on your personal machine. This section provides explicit
instructions for doing that.
Please refer to [this
section](https://emlab-ucsb.github.io/emlab-manual/emlab-workflow-and-platforms.html#git-and-github)
of the emLab SOP for directions on how to set up and manage git and
GitHub for your new server workspace. One important difference between
your personal laptop and a server is that file permissions may be such
that other users can see, and sometimes read or write, files in your
directories. Confidential information such as your git credentials
should therefore be secured more carefully than on your personal
computer. Step 6 of the Git and GitHub section of the emLab manual is
not recommended in a multi-user server environment because your token
may end up viewable to other users as plain text.

Instead of storing your personal access token (PAT) as plain text, we
recommend using one of the following two options. Using either of these
approaches will also mean that credentials are stored between sessions,
which should make the user experience a bit easier. The first approach,
using an SSH key, is recommended.

1. Use an SSH key instead of a PAT
    1. Set up your SSH key on the GRIT server.
        1. If you are not using RStudio Server, or prefer to use the
           terminal, follow these instructions:
            1. Generate a new SSH key with the terminal command
               `ssh-keygen -t ed25519 -C "email@example.com"`
            2. When prompted to select a location, hit enter for the
               default location
            3. When prompted to set a passphrase, hit enter to not
               require one
            4. Start your SSH agent in the background with
               `eval "$(ssh-agent -s)"`
            5. Add your private key to the SSH agent with
               `ssh-add ~/.ssh/id_ed25519`
            6. Copy your public key with
               `cat ~/.ssh/id_ed25519.pub | xclip -selection clipboard`
               (if `xclip` is not available, run only the `cat` part
               and copy the printed key by hand)
        2. If you are using RStudio Server and prefer not to use the
           terminal approach, the instructions are a bit more
           streamlined. Follow the instructions in this
           [link](https://happygitwithr.com/ssh-keys#option-1-set-up-from-rstudio).
            1. If after going through these instructions you prefer
               not to use a passphrase, you can remove it using the
               instructions provided in this
               [link](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/working-with-ssh-key-passphrases#adding-or-changing-a-passphrase).
    2. Add your public key to your GitHub account
        1. GitHub \> Settings \> SSH and GPG keys \> New SSH key
        2. Paste in your key (either from step 6 in the terminal option
           above, or copied from RStudio Server in the RStudio option
           above)
    3. Now you can clone your repository onto the GRIT server. Either
       clone it for the first time using the SSH URL rather than the
       HTTPS URL, or, if you have already cloned the repository, set
       its remote URL to the SSH version
        1. If cloning a repo for the first time using RStudio Server,
           you can simply click “File \> New Project \> Version Control
           \> Git”, and then enter your repo’s SSH link
            1. `git@github.com:username/example_repo.git` (for example,
               this might look like
               `git@github.com:emlab-ucsb/ocean-ghg.git`)
        2. Alternatively, in the terminal you can manually set a
           specific repo URL to SSH with:
            1. `git remote set-url origin git@github.com:username/example_repo.git`
               (for example, the URL might look like
               `git@github.com:emlab-ucsb/ocean-ghg.git`)
2. Cache your PAT temporarily
    1. Create a PAT on GitHub
    2. Add credential cache timeout instructions to your git config
       file
        1. `git config --global credential.helper 'cache --timeout=3600'`
        2. Adjust the timeout length (in seconds) as needed
    3. Push changes to GitHub
        1. When prompted, enter your username
        2. For the password, enter your PAT
            1. Future pushes will not require you to enter credentials
               within the timeout window
    4. This is not a good long-term solution because you will need to
       re-enter your credentials any time the server restarts or your
       cache timeout ends
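
For a repository that was already cloned over HTTPS, switching to the
SSH option means rewriting its remote URL. The sketch below shows one
way to derive the SSH form of a GitHub URL in the shell. The helper name
`https_to_ssh` is hypothetical, and it only handles URLs of the form
`https://github.com/<owner>/<repo>`.

```shell
# Hypothetical helper: convert an HTTPS GitHub URL to its SSH equivalent.
# URLs that do not start with https://github.com/ are returned unchanged.
https_to_ssh() (
  case "$1" in
    https://github.com/*)
      repo_path=${1#https://github.com/}   # strip the HTTPS prefix
      repo_path=${repo_path%.git}          # drop a trailing .git if present
      echo "git@github.com:${repo_path}.git"
      ;;
    *)
      echo "$1"
      ;;
  esac
)

https_to_ssh "https://github.com/emlab-ucsb/ocean-ghg.git"
# prints git@github.com:emlab-ucsb/ocean-ghg.git

# To apply it inside a cloned repository (not run here):
#   git remote set-url origin "$(https_to_ssh "$(git remote get-url origin)")"
#   git remote -v          # confirm that origin now uses the SSH URL
#   ssh -T git@github.com  # check that GitHub accepts your SSH key
```
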
## Using htop to monitor shared resources
Always run [htop](https://htop.dev/) in the terminal to see how many
cores and how much RAM are currently being used by others. You can
customize the htop display to make things easier to see. For example:

- After entering htop, press F2 to enter setup. You can also click
  directly on setup to enter it.
- If you have trouble seeing the setup options, try temporarily
  reducing your browser’s text size.
- Sequoia has so many cores that the default view with 4 columns makes
  for a pretty large display. In the Meters setup, you can change the
  left column to be “CPUs (1-4/8) [Bar]” and the right column to be
  “CPUs (5-8/8) [Bar]”. This will condense the output and force 8
  columns.
- I also like to add disk IO to the left column below memory.
- In the “Display options” setup you can select some options that will
  clean up the process information below the resources monitor. I like
  to make sure to select
    - Tree view
    - Tree view sorted by PID
    - Shadow other users’ processes (makes it easier to see your own)
    - Count CPUs from 1
    - Enable the mouse
- Press F10 when done
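
htop is interactive. If you would like a quick text snapshot of who is
using what, something like the sketch below may help as a complement. It
is only an approximation: `summarise_usage` is a hypothetical helper,
the `pcpu` value reported by `ps` on Linux is averaged over each
process's lifetime rather than measured live, and summing resident
memory counts shared memory more than once.

```shell
# Summarise CPU and memory use per user from "user pcpu rss_kb" lines.
# Reads standard input so it can be checked against fixed data.
# Output columns: user, approximate cores busy, approximate RAM in GB.
summarise_usage() {
  awk '{ cpu[$1] += $2; rss[$1] += $3 }
       END { for (u in cpu)
               printf "%s %.1f %.1f\n", u, cpu[u] / 100, rss[u] / 1048576 }' |
    sort
}

# Live snapshot, if `ps` is available: one line per process, no header row.
if command -v ps >/dev/null 2>&1; then
  ps -e -o user= -o pcpu= -o rss= | summarise_usage
fi
```

For anything beyond a rough overview, htop remains the better tool.
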
## Best practices for sharing our computational resources
Here we outline our best practices for using shared computational
resources. These are meant to be living guidelines that will be adapted
by our team as needed:
- In order to keep some of our computational resources easy to use
interactively without queues or SLURM, we will need to coordinate
and share. Sharing is caring! Common courtesy can go a long way.
Hopefully we can largely self-manage this. If not, we will need to
move computational resources onto queues and SLURM, which could
introduce barriers to analysis and rapid prototyping.
- Always run [htop](https://htop.dev/) in the terminal to see how many
cores and how much RAM are currently being used by others.
- In general on sequoia, feel free to run analyses that use up to 20
cores and 150GB of RAM. We will likely adaptively manage these
specific numbers once we start using sequoia and getting a better
understanding of how many resources we are using.
- For larger analyses that require more cores or RAM, coordinate with
others over the server Slack channel (`#hpc-core-dination`) to
ensure that workflows are not disrupted and that everyone has
reasonable access to computational resources.
- Generally, we recommend piloting your code using a small subset of
your data and/or just a single core, either on your local computer
or on one of our HPC servers. Then once you know it works and have a
sense of how much memory it will use and how long it will take to
execute, you can go ahead and run the full analysis on the server.
If it looks like the full analysis will require resources beyond
the standard recommended 20 cores and 150GB, coordinate with the team
on `#hpc-core-dination`.
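
The piloting advice above can be turned into a quick back-of-the-envelope
check. The sketch below extrapolates memory use from a pilot run and
compares a planned job against the 20-core and 150GB guidelines. Both
helper names are hypothetical, and the linear scaling of memory with
data size is an assumption that will not hold for every analysis.

```shell
# Guideline values taken from the text above; adjust if the team revises them.
MAX_CORES=20
MAX_RAM_GB=150

# Linear extrapolation (an assumption): RAM in GB for the full data set,
# given the RAM a pilot used (GB) and the share of the data it covered (%).
estimate_full_ram_gb() {
  echo $(( ($1 * 100 + $2 - 1) / $2 ))   # integer division, rounded up
}

# Hypothetical helper: report whether a planned job fits the guidelines.
check_job() (
  cores=$1
  ram_gb=$2
  if [ "$cores" -le "$MAX_CORES" ] && [ "$ram_gb" -le "$MAX_RAM_GB" ]; then
    echo "ok: within the shared-use guidelines"
  else
    echo "coordinate: post in #hpc-core-dination before running"
  fi
)

ram=$(estimate_full_ram_gb 4 5)   # pilot used 4 GB on 5% of the data
check_job 8 "$ram"                # 8 cores, estimated 80 GB
check_job 48 "$ram"               # 48 cores exceeds the core guideline
```
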