# Download sequence read data

This notebook describes the procedures taken to download all the sequence read data.

First, we will start by setting some environment variables.

In [1]:
accessions_file_extension='.txt'
output_dir_fragment='data'

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR
ls

1-download-data.ipynb            bin           fastp.json  metagenomics
1-download-data.nbconvert.ipynb  download.log  fast.tmp    microbial
2-count-kmers.ipynb              fastp.html    human       README.md


Now, let's download the data. 

The download code below assumes you have a [conda](https://docs.conda.io/en/latest/) environment set up with [sra-tools](https://github.com/ncbi/sra-tools) and [fastp](https://github.com/OpenGene/fastp) installed. You can create it with:

```bash
conda create --name sra-tools sra-tools fastp
```

We also assume that you have the [pigz](https://zlib.net/pigz/) command installed (for faster gzipping).
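
Before starting a long download, it can help to confirm that the commands used later in this notebook (`conda`, `parallel`, `pigz`, `bc`) are actually on the `PATH`. The following is a minimal sketch; the function name `check_tools` is hypothetical and not part of the original notebook, and it does not check the tools that live inside the `sra-tools` conda environment.

```bash
# Report any required command that is not on the PATH.
# Returns 0 if every command is found, 1 otherwise.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"
    do
        if ! command -v "${tool}" > /dev/null 2>&1
        then
            echo "Missing command: ${tool}" >&2
            missing=1
        fi
    done
    return ${missing}
}
```

For example, `check_tools conda parallel pigz bc` prints one line per missing command and returns a non-zero status if any are missing.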

Once that's set up, let's define a bash function for downloading the data.

## Bash download function

In [2]:
# Purpose: Downloads sequence reads
# Args:
#      accessions_file: A file containing a list (one per line) of sequence read accessions to download.
#      output_dir: A directory to download the files into.
#      interleaved_fastp: Set to '--interleaved_in' if data is paired-end, otherwise set to ''.
download_sequence_reads() {
    accessions_file=$1
    output_dir=$2
    interleaved_fastp=$3
    
    jobs=1
    threads=24
    
    echo "Downloading files from ${accessions_file}"
    echo "Downloading files to ${output_dir}"
    
    rm -rf "${output_dir}"
    mkdir -p "${output_dir}"
    
    before=`date +%s`
    
    command="parallel -j ${jobs} -I% -a ${accessions_file} \
        fasterq-dump --threads ${threads} --split-spot --stdout % \2\> ${output_dir}/log.fasterq.%.err \| \
        fastp ${interleaved_fastp} -l 15 --stdin --stdout \2\> ${output_dir}/log.fastp.%.err \| \
        pigz -cp ${threads} - \> ${output_dir}/%.fastq.gz"
    echo ${command}
    conda run --name sra-tools ${command}
        
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes." ${minutes}
}
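
The function above deletes whatever directory is passed as its second argument before recreating it, so a mistyped or empty argument is worth guarding against. Here is a minimal sketch of such a guard; the function name `reset_output_dir` and the list of refused values are assumptions for illustration, not part of the original notebook.

```bash
# Recreate an output directory, refusing obviously dangerous targets.
reset_output_dir() {
    local dir=$1
    # Refuse an empty argument, the filesystem root, and the current/parent directory.
    case "${dir}" in
        ''|'/'|'.'|'..')
            echo "Refusing to remove '${dir}'" >&2
            return 1
            ;;
    esac
    rm -rf -- "${dir}"
    mkdir -p -- "${dir}"
}
```

Calling `reset_output_dir "microbial/data"` would behave like the `rm -rf`/`mkdir -p` pair in `download_sequence_reads`, while `reset_output_dir ""` returns an error instead of running `rm`.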

# Download microbial data

Now, let's download the microbial data.

In [3]:
download_sequence_reads "microbial/microbial-genomes${accessions_file_extension}" "microbial/${output_dir_fragment}" '--interleaved_in'

Downloading files from microbial/microbial-genomes.txt
Downloading files to microbial/data
parallel -j 1 -I% -a microbial/microbial-genomes.txt fasterq-dump --threads 24 --split-spot --stdout % \2\> microbial/data/log.fasterq.%.err \| fastp --interleaved_in -l 15 --stdin --stdout \2\> microbial/data/log.fastp.%.err \| pigz -cp 24 - \> microbial/data/%.fastq.gz
Done. Took 63.05 minutes.

Awesome. We've downloaded all the microbial data. Let's check out the files.

In [4]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

-rw-r--r-- 1 apetkau grp_apetkau 1.4G Dec  8 20:14 microbial/data/ERR1144974.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 871M Dec  8 20:17 microbial/data/ERR1144975.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 720M Dec  8 20:20 microbial/data/ERR1144976.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 901M Dec  8 20:23 microbial/data/ERR1144977.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 831M Dec  8 20:26 microbial/data/ERR1144978.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 411M Dec  8 20:42 microbial/data/ERR3655992.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 426M Dec  8 20:41 microbial/data/ERR3655994.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 465M Dec  8 20:40 microbial/data/ERR3655996.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 474M Dec  8 20:38 microbial/data/ERR3655998.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau   20 Dec  8 20:37 microbial/data/ERR3656002.fastq.gz
50


Looks like 50 files are present.
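
A file count only tells us how many files exist, not whether every accession in the list produced one. As a rough sketch, we could compare the accessions file against the output directory; the function name `list_missing_outputs` is hypothetical, and it assumes the `<accession>.fastq.gz` naming used by the download command.

```bash
# Print every accession in accessions_file that has no <accession>.fastq.gz in output_dir.
list_missing_outputs() {
    local accessions_file=$1
    local output_dir=$2
    local accession
    # The extra test handles a final line without a trailing newline.
    while IFS= read -r accession || [ -n "${accession}" ]
    do
        # Skip blank lines.
        [ -z "${accession}" ] && continue
        if [ ! -e "${output_dir}/${accession}.fastq.gz" ]
        then
            echo "${accession}"
        fi
    done < "${accessions_file}"
}
```

For example, `list_missing_outputs microbial/microbial-genomes.txt microbial/data` prints nothing when every accession has an output file.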

Let's look for any files that did not download completely (their size should be very small).

In [7]:
for file in microbial/data/*.fastq.gz
do
    size=`ls -l ${file} | awk '{ print $5 }'`
    if [ "$size" -lt 1000 ]
    then
        echo -e "${file}\t${size}"
    fi
done

microbial/data/ERR3656002.fastq.gz	20
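
Since the same size check is repeated for each dataset, it could be wrapped in a function. The sketch below is one way to do that; the name `find_small_files` and the default threshold of 1000 bytes (taken from the loop above) are assumptions. A size of 20 bytes is consistent with a gzip stream that holds no data, which is why a size check is used here rather than only testing that the file is valid gzip.

```bash
# Print "<file><TAB><size>" for each *.fastq.gz in a directory smaller than a byte threshold.
find_small_files() {
    local dir=$1
    local threshold=${2:-1000}
    local file size
    for file in "${dir}"/*.fastq.gz
    do
        # Skip the unexpanded pattern when the directory has no matching files.
        [ -e "${file}" ] || continue
        # wc -c on redirected input prints only the byte count (tr removes any padding).
        size=$(wc -c < "${file}" | tr -d ' ')
        if [ "${size}" -lt "${threshold}" ]
        then
            printf '%s\t%s\n' "${file}" "${size}"
        fi
    done
}
```

For example, `find_small_files microbial/data` would list the same files as the loop above.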


Looks like this one file failed to download (only 20 bytes). Let's try to redownload it.

In [4]:
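threads=24
output_dir=microbial/data
# Note: interleaved_fastp is not set in this cell, so it keeps whatever value the current session holds.
# For paired-end data it should be '--interleaved_in', as in the original download.
command="parallel -j 1 -I% \
    fasterq-dump --threads ${threads} --split-spot --stdout % \2\> ${output_dir}/log.fasterq.%.err \| \
    fastp ${interleaved_fastp} -l 15 --stdin --stdout \2\> ${output_dir}/log.fastp.%.err \| \
    pigz -cp ${threads} - \> ${output_dir}/%.fastq.gz ::: ERR3656002"
echo ${command}
conda run --name sra-tools ${command}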

parallel -j 1 -I% fasterq-dump --threads 24 --split-spot --stdout % \2\> microbial/data/log.fasterq.%.err \| fastp -l 15 --stdin --stdout \2\> microbial/data/log.fastp.%.err \| pigz -cp 24 - \> microbial/data/%.fastq.gz ::: ERR3656002


In [5]:
ls -lh microbial/data/ERR3656002.fastq.gz

-rw-r--r-- 1 apetkau grp_apetkau 528M Dec  9 11:09 microbial/data/ERR3656002.fastq.gz


Looks like we've got our last file properly downloaded now.
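
File size alone does not say how many reads made it through. As a further check, one could count the records in a compressed file. This is a minimal sketch that assumes the output is standard four-line FASTQ; the function name `count_fastq_reads` is hypothetical.

```bash
# Count FASTQ records in a gzip-compressed file, assuming four lines per record.
count_fastq_reads() {
    local file=$1
    local lines
    # Decompress to stdout and count lines (tr removes any padding from wc).
    lines=$(gzip -cd "${file}" | wc -l | tr -d ' ')
    echo $((lines / 4))
}
```

For example, `count_fastq_reads microbial/data/ERR3656002.fastq.gz` prints the number of records in that file. Note that this decompresses the whole file, so it can take a while on multi-gigabyte inputs.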

# Download metagenomics data

Let's now download the metagenomics data.

In [5]:
download_sequence_reads "metagenomics/metagenomes${accessions_file_extension}" "metagenomics/${output_dir_fragment}" '--interleaved_in'

Downloading files from metagenomics/metagenomes.txt
Downloading files to metagenomics/data
parallel -j 1 -I% -a metagenomics/metagenomes.txt fasterq-dump --threads 24 --split-spot --stdout % \2\> metagenomics/data/log.fasterq.%.err \| fastp --interleaved_in -l 15 --stdin --stdout \2\> metagenomics/data/log.fastp.%.err \| pigz -cp 24 - \> metagenomics/data/%.fastq.gz
Done. Took 610.05 minutes.

In [6]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

-rw-r--r-- 1 apetkau grp_apetkau  6.5G Dec  9 02:14 metagenomics/data/ERR1713331.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  2.4G Dec  9 06:06 metagenomics/data/ERR1713332.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 1011M Dec  9 02:20 metagenomics/data/ERR1713333.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  3.6G Dec  9 03:42 metagenomics/data/ERR1713334.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  4.3G Dec  8 20:51 metagenomics/data/ERR1713335.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  3.7G Dec  8 22:57 metagenomics/data/ERR1713336.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  6.5G Dec  9 03:03 metagenomics/data/ERR1713337.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  473M Dec  9 02:17 metagenomics/data/ERR1713339.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  345M Dec  9 01:18 metagenomics/data/ERR1713340.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  3.0G Dec  9 06:26 metagenomics/data/ERR1713341.fastq.gz
50


Let's look for files that did not download properly.

In [13]:
for file in metagenomics/data/*.fastq.gz
do
    size=`ls -l ${file} | awk '{ print $5 }'`
    if [ "$size" -lt 1000 ]
    then
        echo -e "${file}\t${size}"
    fi
done

metagenomics/data/ERR1713342.fastq.gz	20
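
To retry a download we need the accession rather than the file path. A small sketch of how to derive it from the paths reported by the size check is shown here; the function name `accessions_from_paths` is hypothetical, and it assumes files are named `<accession>.fastq.gz`.

```bash
# Strip the directory and the .fastq.gz suffix from each path given as an argument.
accessions_from_paths() {
    local path
    for path in "$@"
    do
        basename "${path}" .fastq.gz
    done
}
```

For example, `accessions_from_paths metagenomics/data/ERR1713342.fastq.gz` prints `ERR1713342`, which is the value passed after `:::` when rerunning `parallel`.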


Let's try to redownload this file.

In [14]:
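threads=24
output_dir=metagenomics/data
# Note: interleaved_fastp is not set in this cell, so it keeps whatever value the current session holds.
# For paired-end data it should be '--interleaved_in', as in the original download.
command="parallel -j 1 -I% \
    fasterq-dump --threads ${threads} --split-spot --stdout % \2\> ${output_dir}/log.fasterq.%.err \| \
    fastp ${interleaved_fastp} -l 15 --stdin --stdout \2\> ${output_dir}/log.fastp.%.err \| \
    pigz -cp ${threads} - \> ${output_dir}/%.fastq.gz ::: ERR1713342"
echo ${command}
conda run --name sra-tools ${command}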

parallel -j 1 -I% fasterq-dump --threads 24 --split-spot --stdout % \2\> metagenomics/data/log.fasterq.%.err \| fastp -l 15 --stdin --stdout \2\> metagenomics/data/log.fastp.%.err \| pigz -cp 24 - \> metagenomics/data/%.fastq.gz ::: ERR1713342


In [15]:
ls -lh metagenomics/data/ERR1713342.fastq.gz

-rw-r--r-- 1 apetkau grp_apetkau 7.6G Dec  9 12:07 metagenomics/data/ERR1713342.fastq.gz


Looks like the file is properly downloaded.

# Download human genomics data

Now let's download the human genomics data.

In [7]:
download_sequence_reads "human/human-genomes${accessions_file_extension}" "human/${output_dir_fragment}" ''

Downloading files from human/human-genomes.txt
Downloading files to human/data
parallel -j 1 -I% -a human/human-genomes.txt fasterq-dump --threads 24 --split-spot --stdout % \2\> human/data/log.fasterq.%.err \| fastp -l 15 --stdin --stdout \2\> human/data/log.fastp.%.err \| pigz -cp 24 - \> human/data/%.fastq.gz
Done. Took 142.77 minutes.

In [8]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

-rw-r--r-- 1 apetkau grp_apetkau  670M Dec  9 06:56 human/data/SRR038300.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  344M Dec  9 06:56 human/data/SRR039632.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau    20 Dec  9 06:58 human/data/SRR1012332.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  147M Dec  9 06:58 human/data/SRR1024141.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  535M Dec  9 06:59 human/data/SRR1033463.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  3.0G Dec  9 07:04 human/data/SRR1035695.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  2.6G Dec  9 07:07 human/data/SRR1047817.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  152M Dec  9 07:07 human/data/SRR1060774.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  1.9G Dec  9 07:09 human/data/SRR1174334.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau  2.1G Dec  9 07:11 human/data/SRR1193574.fastq.gz
50


Let's look for files that did not download properly.

In [10]:
for file in human/data/*.fastq.gz
do
    size=`ls -l ${file} | awk '{ print $5 }'`
    if [ "$size" -lt 1000 ]
    then
        echo -e "${file}\t${size}"
    fi
done

human/data/SRR1012332.fastq.gz	20
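
Before retrying, it may be worth looking at the error logs for the failed accession. The sketch below assumes the logs were written under the names used in the download command (`log.fasterq.<accession>.err` and `log.fastp.<accession>.err`); the function name `show_download_logs` and the choice of showing the last five lines are hypothetical.

```bash
# Print the last lines of the fasterq-dump and fastp logs for one accession.
show_download_logs() {
    local output_dir=$1
    local accession=$2
    local tool log
    for tool in fasterq fastp
    do
        log="${output_dir}/log.${tool}.${accession}.err"
        echo "== ${log}"
        # -s is true only for files that exist and are not empty.
        if [ -s "${log}" ]
        then
            tail -n 5 "${log}"
        else
            echo "(missing or empty)"
        fi
    done
}
```

For example, `show_download_logs human/data SRR1012332` prints the tail of both logs for that accession.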


Let's try to redownload this file.

In [11]:
threads=24
output_dir=human/data
command="parallel -j 1 -I% \
    fasterq-dump --threads ${threads} --split-spot --stdout % \2\> ${output_dir}/log.fasterq.%.err \| \
    fastp ${interleaved_fastp} -l 15 --stdin --stdout \2\> ${output_dir}/log.fastp.%.err \| \
    pigz -cp ${threads} - \> ${output_dir}/%.fastq.gz ::: SRR1012332"
echo ${command}
conda run --name sra-tools ${command}

parallel -j 1 -I% fasterq-dump --threads 24 --split-spot --stdout % \2\> human/data/log.fasterq.%.err \| fastp -l 15 --stdin --stdout \2\> human/data/log.fastp.%.err \| pigz -cp 24 - \> human/data/%.fastq.gz ::: SRR1012332


In [12]:
ls -lh human/data/SRR1012332.fastq.gz

-rw-r--r-- 1 apetkau grp_apetkau 248M Dec  9 11:29 human/data/SRR1012332.fastq.gz


Great. We've got that downloaded.

Now, let's look at the total sizes of all data we've got.

# Data sizes

In [17]:
for dir in microbial metagenomics human
do
    echo ${dir}
    du -ch ${dir}/data/*.fastq.gz | grep total
done

microbial
19G	total
metagenomics
249G	total
human
122G	total
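
`du -ch` reports rounded, human-readable disk usage. If exact byte totals are wanted (for example, to compare against another copy of the data), something like the following sketch could be used; the function name `total_fastq_bytes` is hypothetical.

```bash
# Sum the sizes (in bytes) of all *.fastq.gz files in a directory and print the total.
total_fastq_bytes() {
    local dir=$1
    local file size total=0
    for file in "${dir}"/*.fastq.gz
    do
        # Skip the unexpanded pattern when the directory has no matching files.
        [ -e "${file}" ] || continue
        size=$(wc -c < "${file}" | tr -d ' ')
        total=$((total + size))
    done
    echo "${total}"
}
```

For example, `total_fastq_bytes microbial/data` prints a single number of bytes for the microbial dataset.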


Awesome. We've now downloaded all the data we need :).