Commit
content updates
mjpritchard committed Feb 26, 2024
1 parent 622465a commit 76c9460
Showing 5 changed files with 86 additions and 108 deletions.
63 changes: 29 additions & 34 deletions content/docs/batch-computing/example-job-2-calc-md5s.md
@@ -1,8 +1,5 @@
---
aliases: /article/3836-example-job-2-calc-md5s
categories:
- LOTUS Batch Computing
collection: jasmin-documentation
description: 'Example Job 2: Calculating MD5 Checksums on many files'
slug: example-job-2-calc-md5s
title: 'Example Job 2: Calculating MD5 Checksums on many files'
@@ -14,36 +11,37 @@ point for developing their own workflows on LOTUS.

## Case 1: Calculating MD5 Checksums on many files

This is a simple case because:

1. the archive only needs to be read by the code, and
2. the code that we need to run involves only basic Linux commands, so there are no issues with picking up dependencies from elsewhere.

### Case Description

- we want to calculate the MD5 checksums of about 220,000 files. It would take a day or two to run them all in series.
- we have a text file that contains 220,000 lines - one file path per line.

### Solution under LOTUS

- Split the 220,000 lines into 22 files of 10,000 lines.
- Write a template script to:
  - Read a text file full of file paths
  - Run the `md5sum` command on each file and log the result.
- Write a script to create 22 new scripts (based on the template script), each of which takes one of the input files and works through it.

### And this is how it looks

Log in to the `sci` server (from a `login` server):

{{<command user="user" host="login1">}}
ssh -A <username>@sci1.jasmin.ac.uk
{{</command>}}

Split the big file:

{{<command user="user" host="sci1">}}
split -l 10000 -d file_list.txt # Produces 22 files called "x00"..."x21"
{{</command>}}
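
As a quick check (not part of the original workflow), the number of list files produced can be confirmed; with 220,000 lines split into chunks of 10,000 this should be 22:

{{<command user="user" host="sci1">}}
ls x* | wc -l
(out)22
{{</command>}}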

Create the template file: `scan_files_template.sh`
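
The template itself is collapsed in this view. Purely as an illustration, a minimal template of this kind might look like the sketch below, where `__FILE_LIST__` is a hypothetical placeholder substituted for each of the 22 list files; the actual script in the repository may differ.

```bash
#!/bin/bash
# Illustrative sketch only: read a list of file paths and log an MD5 checksum
# for each one. __FILE_LIST__ is a placeholder that the generating script
# would replace with x00 ... x21.
while read -r filepath; do
    md5sum "$filepath"
done < /home/users/astephen/sst_cci/to_scan/__FILE_LIST__
```
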

@@ -71,7 +69,7 @@ done

Submit all 22 jobs to LOTUS:

```bash
for i in `ls /home/users/astephen/sst_cci/to_scan/` ; do
echo $i
sbatch -p short-serial -o /home/users/astephen/sst_cci/output/$i /home/users/astephen/sst_cci/bin/scan_files_${i}.sh
@@ -80,20 +78,21 @@ done

Watch the jobs running:

{{<command user="user" host="sci1">}}
squeue -u <username>
{{</command>}}

### And the result

All jobs ran within about an hour.
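
If a single combined listing is wanted afterwards, the 22 per-job logs written to the output directory could simply be concatenated (a hypothetical follow-up step, not part of the original workflow):

{{<command user="user" host="sci1">}}
cat /home/users/astephen/sst_cci/output/x* > /home/users/astephen/sst_cci/all_md5sums.txt
{{</command>}}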

## Case 2: Checksumming CMIP5 Data

A variation on Case 1 has been used for checksumming datasets in the CMIP5
archive. The Python code below will find all NetCDF files in a DRS dataset and
generate a checksums file and error log. Each dataset is submitted as a
separate batch job with `sbatch`.


```python
"""
Checksum a CMIP5 dataset
@@ -140,12 +139,8 @@ if __name__ == '__main__':
If you have a file containing a list of dataset IDs, you can submit each as a
separate job by invoking the above script as follows:

{{<command user="user" host="sci1">}}
./checksum_dataset.py $(cat datasets_to_checksum.dat)
sbatch -p short-serial -J cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128 -o cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.checksums -e cmip5.output1.MOHC.HadGEM2-ES.rcp85.day.seaIce.day.r1i1p1.v20111128.err /usr/bin/md5sum '/badc/cmip5/data/cmip5/output1/MOHC/HadGEM2-ES/rcp85/day/seaIce/day/r1i1p1/v20111128/*/*.nc'
(out)Job <745307> is submitted to queue <lotus>. ...
{{</command>}}
1 change: 0 additions & 1 deletion content/docs/data-transfer/data-transfer-overview.md
@@ -1,6 +1,5 @@
---
aliases: /article/219-data-transfer-overview
date: 2021-06-16 17:39:34
description: Overview of data transfer
slug: data-transfer-overview
title: Data transfer overview
101 changes: 46 additions & 55 deletions content/docs/data-transfer/gridftp-ssh-auth.md
@@ -1,6 +1,5 @@
---
aliases: /article/3806-data-transfer-tools-gridftp-ssh-auth
date: 2021-06-16 17:41:08
description: 'Data Transfer Tools: GridFTP (SSH authentication)'
slug: gridftp-ssh-auth
title: 'GridFTP (SSH authentication)'
@@ -9,7 +8,6 @@ title: 'GridFTP (SSH authentication)'
This article describes how to transfer data using GridFTP with SSH
authentication.


{{<alert type="info">}}The `globus-url-copy` command used here should not be confused with the Globus online data transfer service. They used to be associated, but no longer. If you are starting out and looking for a reliable, high-performance transfer method, the recommendation now is to learn about [Globus Transfers with JASMIN](../globus-transfers-with-jasmin) (using the Globus online data transfer service) instead of command-line gridftp as described in this document.{{</alert>}}

## Introduction
@@ -39,10 +37,9 @@ Since you will be using SSH as the authentication mechanism, you should ensure
that your initial connection to the JASMIN transfer server is made with the `-A`
option, which enables agent forwarding:

{{<command user="user" host="localhost">}}
ssh -A username1@hpxfer1.jasmin.ac.uk
{{</command>}}
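
Once logged in, one optional check (not part of the original instructions) is to confirm that agent forwarding is working by listing the keys visible to the forwarded agent:

{{<command user="username1" host="hpxfer1">}}
ssh-add -l
{{</command>}}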

Note that in order to use `hpxfer[12].jasmin.ac.uk` you will need to have
[high-performance data transfer access]({{< ref "hpxfer-access-role" >}}) on
@@ -54,10 +51,9 @@ on the remote server (This will only work if you already know that that server
supports GridFTP over SSH). In this case, we are making the connection to a
fictitious server `gridftp.remotesite.ac.uk`:

{{<command user="username" host="hpxfer1">}}
globus-url-copy -vb -list sshftp://username2@gridftp.remotesite.ac.uk/
{{</command>}}

If `username` and `username2` are the same (on the different systems), the
`username@` part of the sshftp URI can be omitted.
@@ -79,10 +75,9 @@ the `username@` part of the sshftp URI can be omitted.
Please consult the documentation for the `globus-url-copy` command for the
full range of options and arguments.



$ globus-url-copy -help

{{<command user="username" host="hpxfer1">}}
globus-url-copy -help
{{</command>}}

See also <http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/user/#gridftp-user-basic>
@@ -91,21 +86,19 @@ destination on the local (client) machine, for example a group workspace on
destination on the local (client) machine, for example a group workspace on
JASMIN:

{{<command user="username" host="hpxfer1">}}
globus-url-copy -vb sshftp://username@gridftp.remotesite.ac.uk/home/users/USERNAME/myfile /group_workspaces/jasmin/myworkspace/myfile
{{</command>}}

The `-p N` and `-fast` options can additionally be used in combination to
enable `N` parallel streams at once, as shown below. You can experiment with N
in the range 4 to 32 to obtain the best performance, but please be aware that
many parallel transfers can draw heavily on shared resources and degrade
performance for other users:

{{<command user="username" host="hpxfer1">}}
globus-url-copy -vb -p 16 -fast sshftp://username@gridftp.remotesite.ac.uk/home/users/USERNAME/myfile /group_workspaces/jasmin/myworkspace/myfile
{{</command>}}

3\. Test performance with large files by downloading from /dev/zero on the
remote server to /dev/null locally. This excludes any interaction with either
@@ -115,26 +108,24 @@ that the performance takes a while to "ramp up", so you will not see the best
rates if transferring small files individually as the process never gets up to
full speed:

{{<command user="username" host="hpxfer1">}}
globus-url-copy -vb -p 16 -fast sshftp://username@gridftp.remotesite.ac.uk/dev/zero /dev/null
{{</command>}}

Press CTRL-C to interrupt the transfer. Alternatively you can specify that the
transfer should continue for a fixed duration in seconds using the `-t`
option. In this example, data is transferred from the remote host
gridftp.remotesite.ac.uk to hpxfer1.jasmin.ac.uk, where the command is run.

{{<command user="username" host="hpxfer1">}}
globus-url-copy -p 16 -fast -t 10 -vb sshftp://username2@gridftp.remotesite.ac.uk/dev/zero /dev/null
(out) Source: sshftp://username2@gridftp.remotesite.ac.uk/dev/
(out) Dest: file:///dev/
(out) zero -> null
(out)
(out) 7797473280 bytes 929.52 MB/sec avg 1024.49 MB/sec inst
(out) Cancelling copy...
{{</command>}}

Note the transfer rate achieved in Megabytes/second (MB/sec), although for
various reasons this is not to be relied upon as an accurate expectation of
@@ -146,28 +137,26 @@ this is considered by some as easier to use.
4\. Recursively download the contents of a directory on a remote location to a
local destination.

{{<command user="username" host="hpxfer1">}}
globus-url-copy -vb -p 4 -fast -cc 4 -cd -r sshftp://username2@gridftp.remotesite.ac.uk/home/users/USERNAME/mydir/ /group_workspaces/jasmin/myworkspace/mydir/
{{</command>}}

Where:

- `-cc N` requests `N` concurrent transfers (in this case, each with `p=4` parallel streams)
- `-cd` requests creation of the destination directory if this does not already exist
- `-r` denotes recursive transfer of directories
- `-sync` and `-sync-level` options can be used to synchronise data between the two locations, where destination files do not exist or differ (by criteria that can be selected) from corresponding source files. See the `-help` option for details, and the example below this list.
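
For example, a hypothetical re-run of the recursive download above using synchronisation (only transferring files that are missing at the destination or whose checksums differ) might look like this; the paths and hostname are the same fictitious ones used throughout:

{{<command user="username" host="hpxfer1">}}
globus-url-copy -vb -cd -r -sync -sync-level 3 sshftp://username2@gridftp.remotesite.ac.uk/home/users/USERNAME/mydir/ /group_workspaces/jasmin/myworkspace/mydir/
{{</command>}}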

## Upload data (push data from JASMIN to remote server)

The above commands can also be adapted to invoke transfers from a local source
to a remote destination, i.e. uploading data, since the commands all take the
following general form:

{{<command user="username" host="hpxfer1">}}
globus-url-copy [OPTIONS] source-uri destination-uri
{{</command>}}
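
For example, a hypothetical upload of a single file from a group workspace on JASMIN to the fictitious remote server used above could take this form (paths are illustrative):

{{<command user="username" host="hpxfer1">}}
globus-url-copy -vb -p 8 -fast /group_workspaces/jasmin/myworkspace/mydir/myfile sshftp://username2@gridftp.remotesite.ac.uk/home/users/USERNAME/mydir/
{{</command>}}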

Be sure to check your connection with the remote machine via a simple SSH
login and then a directory listing as shown above.
@@ -180,15 +169,17 @@ and the JASMIN host is the server specified in the destination URI. The
following command should work when connecting to one of the following transfer
servers: (see also [Transfer Servers]({{< ref "transfer-servers" >}}))

- `xfer[12].jasmin.ac.uk`
- `xfer3.jasmin.ac.uk` ([additional access role](https://accounts.jasmin.ac.uk/services/additional_services/xfer-sp) required)
- `hpxfer[12].jasmin.ac.uk` ([high-performance data transfer access]({{< ref "hpxfer-access-role" >}}) required)

Push data to JASMIN from a remote server:

{{<command user="username2" host="remotehost">}}
globus-url-copy -vb -p 8 -fast mydir/myfile sshftp://username@hpxfer1.jasmin.ac.uk/group_workspaces/jasmin/myworkspace/mydir/
{{</command>}}

Note that for this to work, you need to be able to authenticate over SSH to the JASMIN host. This should be possible if you can log in interactively, but will NOT work if you are using the command in a cron job or other situation where your ssh-agent (on the host remote to JASMIN) is not running and/or does not have access to your private key. For those situations, consider using either

- {{<link "globus-transfers-with-jasmin" >}}Globus (recommended){{</link>}}, or
- {{<link "gridftp-cert-based-auth">}}Gridftp using certificate-based authentication{{</link>}})
1 change: 0 additions & 1 deletion content/docs/interactive-computing/transfer-servers.md
@@ -60,4 +60,3 @@ particularly if you have multiple terminal windows open on your own computer,
that you do not accidentally attempt `sudo`on a JASMIN machine: expect some
follow-up from the JASMIN team if you do!
{{</alert>}}

28 changes: 11 additions & 17 deletions content/docs/mass/how-to-apply-for-mass-access.md
@@ -1,9 +1,5 @@
---
aliases: /article/228-how-to-apply-for-mass-access
categories:
- MASS
collection: jasmin-documentation
date: 2021-09-07 12:33:03
description: How to apply for MASS access
slug: how-to-apply-for-mass-access
title: How to apply for MASS access
@@ -13,9 +9,9 @@ title: How to apply for MASS access

To access data held in the Met Office MASS archive, you will need:

- a sponsor
- access to the mass-cli1 client machine
- a MASS account

Your sponsor will need to be a **Senior Met Office Scientist** with whom you
are working on a collaborative research project. If you are a Met Office
@@ -36,14 +32,14 @@ Please note that the link above is only visible to those in the Met Office.
The following information will be asked for, so please provide your sponsor
with any details they may not have:

- Your full name
- Your official email address
- Your organization's name
- Your department name
- The host country of your organization
- A list of MASS projects and/or data sets that you need access to. A full MOOSE dataset path is required, and your sponsor should help you determine this.
- Your JASMIN username
- Your JASMIN user ID number (UID). You can get this by typing `echo $UID` at the terminal on any JASMIN machine, as shown in the example below.
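
For example (the number shown is illustrative only):

{{<command user="user" host="sci1">}}
echo $UID
(out)1234567
{{</command>}}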

The information you provide to the Met Office will be treated in accordance
with the [Met Office Privacy Policy](https://www.metoffice.gov.uk/about-
@@ -73,5 +69,3 @@ Note: If you have access to MASS on other systems you cannot copy those MOOSE
credentials file/s onto JASMIN – they will not work! Please also see the
[External Users’ MOOSE Guide]({{< ref "moose-the-mass-client-user-guide" >}})
for what MOOSE commands are available on JASMIN.

