content updates
mjpritchard committed Feb 29, 2024
1 parent 39df846 commit e062d51
Showing 2 changed files with 35 additions and 13 deletions.
27 changes: 25 additions & 2 deletions content/docs/data-transfer/data-transfer-overview.md
@@ -58,5 +58,28 @@ tools]({{< ref "data-transfer-tools" >}}) available.

To achieve better transfer rates for large transfers, or where speed and reliability are important, we recommend that you:

- use the {{<link "globus-transfers-with-jasmin">}}Globus data transfer service{{</link>}} (recommended), or
- use the high-performance data transfer servers ({{<link "hpxfer-access-role">}}hpxfer access role required{{</link>}})
- use other parallel-capable transfer tools such as bbcp, lftp (parallel-capable ftp client), or gridftp: see {{<link "data-transfer-tools">}}Data transfer tools{{</link>}}
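
For example, a parallel-capable tool can be driven from the command line as below. This is a hedged sketch: the endpoint variables, hostname and paths are purely illustrative, not real JASMIN endpoints.

```bash
# Globus CLI: submit a recursive transfer between two endpoints/collections,
# re-copying only files whose checksums differ at the destination
globus transfer "$SRC_ENDPOINT:/gws/nopw/j04/myproject/data/" \
    "$DST_ENDPOINT:/incoming/data/" \
    --recursive --sync-level checksum --label "example transfer"

# lftp: mirror a remote directory using several parallel connections
lftp -e "mirror --parallel=8 /pub/dataset ./dataset; quit" ftp.example.org
```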

Transfer rates depend on many factors, so try to consider all of these:

- **do you really need to transfer some/all of the data?**
  - is the data already in the CEDA Archive? (if so, don't copy it: process it in place!)
  - can your workflow deal with processing just smaller "chunks" at a time (streaming)?
  - do you really need to have/keep all the source data if it's stored somewhere else?
- **the network path all the way from where the source data resides, to the destination file system**
  - high-performance data transfer tools are great, but is the "last mile" over WiFi to your laptop?
  - what is the length of the network path? If it's international or intercontinental, SSH-based methods won't work well. Consider Globus.
- **the host at each end**
  - what sort of host is it (laptop, departmental server, virtual machine, physical machine) and what is its network connectivity?
- **the file systems at each end**
  - not all file systems perform the same, for given types of data or transfer methods
- **the size and number of files involved**
  - large numbers of small files can take a long time to transfer
  - are the data in deep directory trees? These can take a long time to recreate on the destination file system
  - consider creating a tar/zip archive to transfer fewer but larger files, or at least use a method that copes well with many files in parallel or "in flight" at once (see the sketch after this list)
- **checking data integrity**
  - some methods verify checksums at source and destination to ensure integrity; this can be resource-intensive and slow
- **time of day**
  - would scheduling your transfer at quieter times mean that it completes more efficiently and/or without impacting others? Consider source and destination time zones!
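
As an illustration of the archiving and integrity points above, one simple pattern (file names and paths below are hypothetical) is to bundle small files into a single archive and verify it with checksums at both ends:

```bash
# At the source: bundle a deep directory tree into one compressed archive
tar czf mydata.tar.gz /path/to/many/small/files

# Record a checksum alongside the archive
sha256sum mydata.tar.gz > mydata.tar.gz.sha256

# ...transfer both files to the destination, then verify there:
sha256sum -c mydata.tar.gz.sha256
```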
21 changes: 10 additions & 11 deletions content/docs/data-transfer/scheduling-automating-transfers.md
@@ -7,8 +7,8 @@ title: Scheduling/Automating Transfers

This article explains how to schedule or automate data transfers. It covers:

- Using Globus for transfer automation
- Scheduling download tasks using cron and LOTUS

## Overview

@@ -30,8 +30,7 @@ Some introductory information about how to do this is available in this article
(with more to follow)
but please also refer to the comprehensive Globus documentation and their
[automation examples](https://github.com/globus/automation-examples). You can choose whether
to schedule/automate tasks via the {{<link "https://www.globus.org/blog/scheduled-and-recurring-transfers-now-available-globus-web-app">}}Globus web interface{{</link>}}, {{<link "https://docs.globus.org/cli/reference/">}}command-line interface{{</link>}}, or use their {{<link "https://globus-sdk-python.readthedocs.io/en/stable/examples/index.html" >}}Globus Python SDK{{</link>}} to build Python code that uses this functionality.
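
As a hedged illustration of command-line automation (the endpoint UUIDs and paths below are placeholders, and the Globus CLI must already be logged in), a recurring sync can be a one-line `globus transfer` call wrapped in a script for a scheduler to invoke:

```bash
#!/bin/bash
# Sketch: a sync task that cron (or a Globus timer) could run nightly.
# SRC/DST are placeholder endpoint/collection UUIDs, not real ones.
SRC="ffffffff-aaaa-bbbb-cccc-000000000001"
DST="ffffffff-aaaa-bbbb-cccc-000000000002"

# --sync-level checksum means unchanged files are not re-copied
globus transfer "$SRC:/projects/mydata/" "$DST:/backup/mydata/" \
    --recursive --sync-level checksum --label "nightly mydata sync"
```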

## Scheduling download tasks using cron and LOTUS

@@ -40,15 +39,15 @@ general tasks, **it should not be used for the work of executing those tasks itself**.

### xfer3 - transfer machine with cron

The transfer server `xfer3.jasmin.ac.uk` is also provided with `cron`, and should be used where
a task is primarily a transfer rather than a processing task and needs the functionality
of a transfer server. For access to `xfer3` you will need the
{{<link "https://accounts.jasmin.ac.uk/services/additional_services/xfer-sp/">}}xfer-sp access role{{</link>}}.
Please refer to the above `cron` guidance for best practice advice.
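
For example, a nightly transfer task on `xfer3` might be scheduled with a crontab entry like the following (the script path and timing are illustrative, and the script itself would use a transfer tool such as `rsync`):

```bash
# m h dom mon dow  command
0 2 * * * /home/users/username/sync_project_data.sh >> /home/users/username/sync.log 2>&1
```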

### Invoking LOTUS from cron to carry out multiple download tasks

Sometimes we need a task to be invoked from `cron` but executed where there
are lots of nodes to parallelise the tasks (i.e. the LOTUS cluster). In this case, we DO need to use the `cron`
server rather than `xfer3`, since we need to be able to talk to LOTUS (`xfer3` can't do that, as a transfer server).

@@ -102,9 +101,9 @@ Due to networking limitations, LOTUS nodes cannot perform downloads using SSH-based methods.

Download tools installed on LOTUS nodes include:

- `wget`
- `curl`
- `ftp` (but not `lftp`)
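
A minimal sketch of such a download script is shown below; the Slurm directives, URL and output path are assumptions for illustration, not site-specific values:

```bash
#!/bin/bash
#SBATCH --job-name=test_download
#SBATCH --time=00:30:00
#SBATCH -o %j.out
#SBATCH -e %j.err

# Fetch one file into a working directory using a tool available on LOTUS
wget -q -P /work/scratch/username/downloads \
    https://data.example.org/dataset/file001.nc
```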

In our simple example above, we can submit this script to LOTUS from the
command line with
@@ -127,7 +126,7 @@ ensure one instance of the task has finished before the next starts (see
```
30 * * * * crontamer -t 2h 'sbatch /home/users/username/test_download.sh'
```

### 2\. Multi-node downloads

We could expand this example to download multiple items, perhaps 1 directory
of data for each day of a month, and have 1 element of a job array handle the
download for each day.
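
A hedged sketch of that job-array pattern (the dataset URL, output path and resource requests are assumptions):

```bash
#!/bin/bash
#SBATCH --job-name=daily_download
#SBATCH --array=1-31
#SBATCH --time=01:00:00
#SBATCH -o %A_%a.out
#SBATCH -e %A_%a.err

# Each array element handles one day's directory of data
DAY=$(printf "%02d" "$SLURM_ARRAY_TASK_ID")
wget -q -r -np -nH -P /work/scratch/username/2024-01 \
    "https://data.example.org/dataset/2024/01/${DAY}/"
```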
