Scripts and utilities for providing fastqs to workflows.
This package can be installed via Bioconda:
conda install -c ebi-gene-expression-group atlas-fastq-provider
Installation will create a file 'atlas-fastq-provider-config.sh' in the same install directory as the main script fetchFastq.sh. Config variables can be modified in this file.
ENA_SSH_HOST='sra-login-1'
ENA_SSH_ROOT_DIR='/nfs/era-pub/vol1/'
ENA_PRIVATE_SSH_ROOT_DIR='/private/path'
ENA_FTP_ROOT_PATH='ftp://ftp.sra.ebi.ac.uk/vol1'
ENA_HTTP_ROOT_PATH='https://hl.fire.sdo.ebi.ac.uk/era-public'
FASTQ_PROVIDER_TEMPDIR='/tmp/atlas-fastq-provider'
ENA_TEST_FILE='ERR1888172_1.fastq.gz'
FETCH_FREQ_MILLIS=500
PROBE_UPDATE_FREQ_MINS=15
ENA_RETRIES=3
Overrides to these variables can also be supplied at runtime (see below).
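As a minimal sketch of a runtime override, assuming the file passed with -c uses the same shell variable syntax as atlas-fastq-provider-config.sh and that variables it does not set keep their installed values. The file name my-fastq-config.sh and the values shown are illustrative only.

```shell
# Write a small override file containing only the variables that differ.
override_config="${TMPDIR:-/tmp}/my-fastq-config.sh"   # hypothetical path
cat > "$override_config" <<'EOF'
ENA_RETRIES=5
FASTQ_PROVIDER_TEMPDIR='/scratch/atlas-fastq-provider'
EOF

# Check that the file parses as shell before using it.
sh -n "$override_config" && echo "override file OK: $override_config"

# It could then be supplied at runtime, for example:
#   fetchFastq.sh -f ERR1888646_1.fastq.gz -t ERR1888646_1.fastq.gz -c "$override_config"
```

Keeping overrides in a separate file leaves the installed config untouched, so a package update should not discard local settings.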
fetchFastq.sh -f <file or uri> -t <target file> [-c <config file to override defaults>] [-s <source resource or directory>] [-m <retrieval method, default 'auto'>] [-p <public or private, default public>] [-l <library, by default inferred from file name>]
This is a generic utility that provides FASTQ files for use in pipelines and similar workflows. At the most basic level, files can be downloaded from links or linked from directories on the file system, with some extra handling to report consistently when a file is not present at the source.
There are also 'special cases' where things are handled differently, for example when fetching files from ENA via SSH or the new HTTP endpoint. The special cases are triggered by the source, which is guessed when set to 'auto' (the default), e.g. 'ena' for SRR/DRR/ERR identifiers.
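The guessing step could look roughly like the sketch below. This is an illustration only, not the script's own code: the function name guess_source and the labels 'uri' and 'unknown' are assumptions, while 'ena' and the hca:// prefix come from this README.

```shell
# Hypothetical helper: guess a source label from the value given to -f.
guess_source() {
    case "$1" in
        hca://*)        echo "hca" ;;       # HCA pseudo-URI
        *://*)          echo "uri" ;;       # any other URI (illustrative label)
        [SDE]RR[0-9]*)  echo "ena" ;;       # SRR/DRR/ERR run identifiers
        *)              echo "unknown" ;;   # nothing recognised
    esac
}

guess_source ERR1888646_1.fastq.gz
guess_source ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR188/006/ERR1888646/ERR1888646_1.fastq.gz
```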
For the ENA, there are three methods: FTP, SSH and HTTP. FTP is the default method and will work for everyone. EBI personnel with the right privileges can also copy files over SSH directly from the ENA servers. There is also a new internal HTTP endpoint (currently unreliable) that can be used by EBI personnel. Specifying 'auto' will test each of these methods and select the fastest, storing results in a 'probe' file. This file will be updated according to the interval specified in the config variable PROBE_UPDATE_FREQ_MINS.
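To illustrate the refresh interval, the sketch below shows one way a script might decide that a probe file is due for an update. The function name probe_is_stale and the probe file location are hypothetical, and the real script may apply the interval differently. It relies on the -mmin test of find, which GNU and BSD find provide but POSIX does not require.

```shell
# Hypothetical helper: succeed (exit 0) if the probe file is missing or was
# last modified more than the given number of minutes ago.
probe_is_stale() {
    if [ ! -f "$1" ]; then
        return 0
    fi
    # find prints the path only when the file is older than the limit
    [ -n "$(find "$1" -mmin +"${2:-15}")" ]
}

probe="${TMPDIR:-/tmp}/example-probe.txt"   # hypothetical location
if probe_is_stale "$probe" 15; then         # 15 mirrors PROBE_UPDATE_FREQ_MINS
    echo "probe results are stale: re-test methods and rewrite $probe"
fi
```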
fetchEnaLibraryFastqs.sh -l <library> -d <output directory> [-m <retrieval method, default 'auto'>] [-s <source directory for method 'dir'>] [-p <public or private, default public>] [-c <config file to override defaults>]
This is mostly a wrapper around fetchFastq.sh: it lists the files available for the library at the source, then fetches them.
Sometimes it's useful to check that a file exists at source, without actually downloading it. This can be done by supplying '-v' to fetchFastq.sh, which will cause the script to return an exit code of 0 after the file existence is checked, but before download.
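A pipeline step could use that exit code as in the sketch below. It assumes that a missing file produces a non-zero exit code and that -t is still expected alongside -v; the wrapper name fastq_available is hypothetical.

```shell
# Hypothetical wrapper around the '-v' check.
# $1: file name or URI, $2: target file name (not written when -v is given,
# assuming the script stops before download as described).
fastq_available() {
    fetchFastq.sh -f "$1" -t "$2" -v >/dev/null 2>&1
}

# Only run the example when the tool is installed.
if command -v fetchFastq.sh >/dev/null 2>&1; then
    if fastq_available ERR1888646_1.fastq.gz ERR1888646_1.fastq.gz; then
        echo "present at source"
    else
        echo "missing or unreachable"
    fi
fi
```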
fetchFastq.sh -f ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR188/006/ERR1888646/ERR1888646_1.fastq.gz -t ERR1888646_1.fastq.gz
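For reference, the URL above follows a regular layout under the FTP root. The sketch below rebuilds it from a file name. The helper name ena_ftp_url is hypothetical, and the subdirectory rule (trailing digits beyond the ninth character of the accession, zero-padded to three) is an assumption generalised from that one URL.

```shell
# Hypothetical helper: build the public ENA FTP URL for a FASTQ file name.
# The body runs in a subshell so its variables do not leak to the caller.
ena_ftp_url() (
    file="$1"
    root="${ENA_FTP_ROOT_PATH:-ftp://ftp.sra.ebi.ac.uk/vol1}"
    library="${file%%[_.]*}"                       # ERR1888646_1.fastq.gz -> ERR1888646
    prefix=$(printf '%s' "$library" | cut -c1-6)   # first six characters: ERR188
    rest=$(printf '%s' "$library" | cut -c10-)     # characters after the ninth: 6
    case "$rest" in
        "")  subdir="" ;;            # nine-character accession: no subdirectory
        ?)   subdir="00$rest/" ;;
        ??)  subdir="0$rest/" ;;
        *)   subdir="$rest/" ;;
    esac
    echo "$root/fastq/$prefix/$subdir$library/$file"
)

ena_ftp_url ERR1888646_1.fastq.gz
# -> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR188/006/ERR1888646/ERR1888646_1.fastq.gz
```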
fetchFastq.sh -f ERR1888646_1.fastq.gz -t ERR1888646_1.fastq.gz -m http
This fetches files using the new HTTP endpoint, as specified in ENA_HTTP_ROOT_PATH in the config file.
fetchFastq.sh -f ERR1888646_1.fastq.gz -t ERR1888646_1.fastq.gz -m ssh
This will attempt to pull files directly from the ENA server, using the host and path in the config file. To do this you must set the environment variable 'ENA_SSH_USER'. This should be a user you either are, or can sudo to, with permissions to SSH to the SRA host. This is only likely to be possible if you're a privileged member of staff at the EBI.
EBI personnel can also retrieve files from private locations on the ENA server by specifying ENA_PRIVATE_SSH_ROOT_DIR and running commands like:
fetchFastq.sh -f my_private_file1.fastq.gz -t my_private_file1.fastq.gz -p private -l ERR123456
Note that the library name must also be specified.
fetchFastq.sh -f ERR1888646_1.fastq.gz -t ERR1888646_1.fastq.gz -s /path/to/dir
fetchEnaLibraryFastqs.sh -l ERR1888646 -d ERR1888646
fetchEnaLibraryFastqs.sh -l ERR1888646 -d ERR1888646 -t srr
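When several libraries are needed, the call can be wrapped in a loop. The sketch below assumes that fetchEnaLibraryFastqs.sh returns a non-zero exit code on failure; the function name fetch_libraries and the one-directory-per-library layout are illustrative choices.

```shell
# Hypothetical helper: fetch several libraries, one output directory each.
# $1: parent output directory; remaining arguments: library identifiers.
fetch_libraries() {
    outdir="$1"
    shift
    failed=""
    for lib in "$@"; do
        if ! fetchEnaLibraryFastqs.sh -l "$lib" -d "$outdir/$lib"; then
            failed="$failed $lib"   # remember the failure and carry on
        fi
    done
    if [ -n "$failed" ]; then
        echo "failed libraries:$failed" >&2
        return 1
    fi
}

# Example call (needs the package installed):
#   fetch_libraries fastqs ERR1888646 ERR1888647
```

Collecting failures rather than stopping at the first one lets a batch finish and report everything that needs a retry in one go.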
Files from the HCA can be downloaded given a pseudo-URI of the form:
fetchFastq.sh -f hca://<bundle>/<file> -t <dest file>
... or by specifying the method manually:
fetchFastq.sh -m hca -f <bundle>/<file> -t <dest file>
This passes the bundle UUID and the file filter to Azul, which returns links to the FASTQ files; these can then be downloaded.
A real example is:
fetchFastq.sh -f hca://0359ab85-bb92-4e6e-a819-12aa734ed12b/10X127_1_S28_L001_I1_001.fastq.gz -t foo.fastq.gz
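The two parts of the pseudo-URI can be separated with plain shell parameter expansion, as in the sketch below. The function name parse_hca_uri is hypothetical, and this is not necessarily how fetchFastq.sh does it internally.

```shell
# Hypothetical helper: split hca://<bundle>/<file> into "<bundle> <file>".
parse_hca_uri() {
    case "$1" in
        hca://*/*) ;;   # expected shape, continue
        *) echo "not an hca:// pseudo-URI: $1" >&2; return 1 ;;
    esac
    rest="${1#hca://}"                  # drop the scheme
    echo "${rest%%/*} ${rest#*/}"       # bundle UUID, then file filter
}

parse_hca_uri hca://0359ab85-bb92-4e6e-a819-12aa734ed12b/10X127_1_S28_L001_I1_001.fastq.gz
# -> 0359ab85-bb92-4e6e-a819-12aa734ed12b 10X127_1_S28_L001_I1_001.fastq.gz
```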