content updates
mjpritchard committed Feb 29, 2024
1 parent 39df846 commit e062d51
Showing 2 changed files with 35 additions and 13 deletions.
27 changes: 25 additions & 2 deletions content/docs/data-transfer/data-transfer-overview.md
@@ -58,5 +58,28 @@ tools]({{< ref "data-transfer-tools" >}}) available.

To achieve better transfer rates for large transfers, or where speed and reliability are important, we recommend that you:

- use the {{<link "globus-transfers-with-jasmin">}}Globus data transfer service{{</link>}} (recommended), or
- use the high-performance data transfer servers ({{<link "hpxfer-access-role">}}hpxfer access role required{{</link>}})
- use other parallel-capable transfer tools such as bbcp, lftp (parallel-capable ftp client), or gridftp: see {{<link "data-transfer-tools">}}Data transfer tools{{</link>}}
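
For example, a parallel-capable tool can be driven from the command line as below. This is a hedged sketch: the endpoint variables, hostname and paths are purely illustrative, not real JASMIN endpoints.

```bash
# Globus CLI: submit a recursive transfer between two endpoints/collections,
# re-copying only files whose checksums differ at the destination
globus transfer "$SRC_ENDPOINT:/gws/nopw/j04/myproject/data/" \
    "$DST_ENDPOINT:/incoming/data/" \
    --recursive --sync-level checksum --label "example transfer"

# lftp: mirror a remote directory using several parallel connections
lftp -e "mirror --parallel=8 /pub/dataset ./dataset; quit" ftp.example.org
```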

Transfer rates depend on many factors, so try to consider all of these:

- **do you really need to transfer some/all of the data?**
  - is the data already in the CEDA Archive? (if so, don't copy it: process it in place!)
  - can your workflow deal with processing just smaller "chunks" at a time (streaming)?
  - do you really need to have/keep all the source data if it's stored somewhere else?
- **the network path all the way from where the source data resides, to the destination file system**
  - high-performance data transfer tools are great, but is the "last mile" over WiFi to your laptop?
  - what is the length of the network path? If it's international or intercontinental, SSH-based methods won't work well. Consider Globus.
- **the host at each end**
  - what sort of host is it (laptop, departmental server, virtual machine, physical machine) and what is its network connectivity?
- **the file systems at each end**
  - not all file systems perform the same, for given types of data or transfer methods
- **the size and number of files involved**
  - large numbers of small files can take a long time to transfer
  - are the data in deep directory trees? These can take a long time to recreate on the destination file system
  - consider creating a tar/zip archive to transfer fewer but larger files, or at least use a method that copes well with many files in parallel or "in flight" at once (see the sketch after this list)
- **checking data integrity**
  - some methods verify checksums at source and destination to ensure integrity; this can be resource-intensive and slow
- **time of day**
  - would scheduling your transfer at quieter times mean that it completes more efficiently and/or without impacting others? Consider source and destination time zones!
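
As an illustration of the archiving and integrity points above, one simple pattern (file names and paths below are hypothetical) is to bundle small files into a single archive and verify it with checksums at both ends:

```bash
# At the source: bundle a deep directory tree into one compressed archive
tar czf mydata.tar.gz /path/to/many/small/files

# Record a checksum alongside the archive
sha256sum mydata.tar.gz > mydata.tar.gz.sha256

# ...transfer both files to the destination, then verify there:
sha256sum -c mydata.tar.gz.sha256
```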
21 changes: 10 additions & 11 deletions content/docs/data-transfer/scheduling-automating-transfers.md
@@ -7,8 +7,8 @@ title: Scheduling/Automating Transfers

This article explains how to schedule or automate data transfers. It covers:

- Using Globus for transfer automation
- Scheduling download tasks using cron and LOTUS

## Overview

@@ -30,8 +30,7 @@ Some introductory information about how to do this is available in this article
(with more to follow)
but please also refer to the comprehensive Globus documentation and their
[automation examples](https://github.com/globus/automation-examples). You can choose whether
to schedule/automate tasks via the {{<link "https://www.globus.org/blog/scheduled-and-recurring-transfers-now-available-globus-web-app">}}Globus web interface{{</link>}}, {{<link "https://docs.globus.org/cli/reference/">}}command-line interface{{</link>}}, or use their {{<link "https://globus-sdk-python.readthedocs.io/en/stable/examples/index.html" >}}Globus Python SDK{{</link>}} to build Python code that uses this functionality.
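
As a hedged illustration of command-line automation (the endpoint UUIDs and paths below are placeholders, and the Globus CLI must already be logged in), a recurring sync can be a one-line `globus transfer` call wrapped in a script for a scheduler to invoke:

```bash
#!/bin/bash
# Sketch: a sync task that cron (or a Globus timer) could run nightly.
# SRC/DST are placeholder endpoint/collection UUIDs, not real ones.
SRC="ffffffff-aaaa-bbbb-cccc-000000000001"
DST="ffffffff-aaaa-bbbb-cccc-000000000002"

# --sync-level checksum means unchanged files are not re-copied
globus transfer "$SRC:/projects/mydata/" "$DST:/backup/mydata/" \
    --recursive --sync-level checksum --label "nightly mydata sync"
```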

## Scheduling download tasks using cron and LOTUS

@@ -40,15 +39,15 @@ general tasks, **it should not be used for the work of executing those tasks itself**.

### xfer3 - transfer machine with cron

The transfer server `xfer3.jasmin.ac.uk` is also provided with `cron`, and should be used where
a task is primarily a transfer rather than a processing task and needs the functionality
of a transfer server. For access to `xfer3` you will need the
{{<link "https://accounts.jasmin.ac.uk/services/additional_services/xfer-sp/">}}xfer-sp access role{{</link>}}.
Please refer to the above `cron` guidance for best practice advice.
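
For example, a nightly transfer task on `xfer3` might be scheduled with a crontab entry like the following (the script path and timing are illustrative, and the script itself would use a transfer tool such as `rsync`):

```bash
# m h dom mon dow  command
0 2 * * * /home/users/username/sync_project_data.sh >> /home/users/username/sync.log 2>&1
```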

### Invoking LOTUS from cron to carry out multiple download tasks

Sometimes we need a task to be invoked from `cron` but executed where there
are lots of nodes to parallelise the tasks (i.e. the LOTUS cluster). In this case, we DO need to use the `cron`
server rather than `xfer3`, since we need to be able to talk to LOTUS (`xfer3` can't do that, as a transfer server).

@@ -102,9 +101,9 @@ Due to networking limitations, LOTUS nodes cannot perform downloads using SSH-based methods.

Download tools installed on LOTUS nodes include:

- `wget`
- `curl`
- `ftp` (but not `lftp`)
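
A minimal sketch of such a download script is shown below; the Slurm directives, URL and output path are assumptions for illustration, not site-specific values:

```bash
#!/bin/bash
#SBATCH --job-name=test_download
#SBATCH --time=00:30:00
#SBATCH -o %j.out
#SBATCH -e %j.err

# Fetch one file into a working directory using a tool available on LOTUS
wget -q -P /work/scratch/username/downloads \
    https://data.example.org/dataset/file001.nc
```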

In our simple example above, we can submit this script to LOTUS from the
command line with
@@ -127,7 +126,7 @@ ensure one instance of the task has finished before the next starts (see
```
30 * * * * crontamer -t 2h 'sbatch /home/users/username/test_download.sh'
```

### 2\. Multi-node downloads

We could expand this example to download multiple items, perhaps 1 directory
of data for each day of a month, and have 1 element of a job array handle the
download for each day.
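
A hedged sketch of that job-array pattern (the dataset URL, output path and resource requests are assumptions):

```bash
#!/bin/bash
#SBATCH --job-name=daily_download
#SBATCH --array=1-31
#SBATCH --time=01:00:00
#SBATCH -o %A_%a.out
#SBATCH -e %A_%a.err

# Each array element handles one day's directory of data
DAY=$(printf "%02d" "$SLURM_ARRAY_TASK_ID")
wget -q -r -np -nH -P /work/scratch/username/2024-01 \
    "https://data.example.org/dataset/2024/01/${DAY}/"
```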
