# Download sequence read data

This notebook describes the procedures taken to download all the sequence read data.

First, we set some environment variables and move to the project's root directory.

In [None]:
accessions_file_extension='small.txt'
output_dir_fragment='data-small'

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR
ls
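
Before downloading anything, it can help to confirm that the accession lists used later in this notebook exist for the chosen `accessions_file_extension`. The following is a minimal sketch, not part of the original workflow: the helper name `check_accessions_files` is hypothetical, and the three paths are the ones passed to the download function in the sections below.

```bash
# Hypothetical helper (not part of the original notebook): report how many
# accessions each input list contains, or flag a list that is missing.
check_accessions_files() {
    local extension=$1
    local accessions_file
    for accessions_file in \
        "microbial/microbial-genomes-${extension}" \
        "metagenomics/metagenomes-${extension}" \
        "human/human-genomes-${extension}"; do
        if [ -f "${accessions_file}" ]; then
            # grep -c . counts the non-empty lines, i.e. one per accession.
            echo "${accessions_file}: $(grep -c . "${accessions_file}") accessions"
        else
            echo "${accessions_file}: not found"
        fi
    done
}

# Falls back to 'small.txt' if the variable has not been set yet.
check_accessions_files "${accessions_file_extension:-small.txt}"
```

The paths are relative, so this should be run from the project root (the directory we changed into above).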

Now, let's download the data. 

The download code below assumes you have a [conda](https://docs.conda.io/en/latest/) environment named `sra-tools` with [sra-tools](https://github.com/ncbi/sra-tools) and [fastp](https://github.com/OpenGene/fastp) installed. You can create it with:

```bash
conda create --name sra-tools sra-tools fastp
```

We also assume that you have the [pigz](https://zlib.net/pigz/) command installed (for faster gzipping).
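
A quick way to check these prerequisites is to ask the shell whether each command can be found. The sketch below is only an illustration: the helper name `check_tools` is hypothetical. Besides `conda` and `pigz`, it also checks for `parallel` and `bc`, which the download function in this notebook calls.

```bash
# Hypothetical helper (not part of the original notebook): print every
# required command that is not on the PATH and return non-zero if any is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "${tool}" > /dev/null 2>&1; then
            echo "Missing required command: ${tool}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

check_tools conda parallel pigz bc || echo "Install the missing commands before continuing."
```

Note that `fasterq-dump` and `fastp` live inside the `sra-tools` conda environment, so they will only be found this way while that environment is active.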

Once that's set up, let's define a bash function for downloading the data.

## Bash download function

In [None]:
# Purpose: Downloads sequence reads
# Args:
#      accessions_file: A file containing a list (one per line) of sequence read accessions to download.
#      output_dir: A directory to download the files into.
#      interleaved_fastp: Set to '--interleaved_in' if data is paired-end, otherwise set to ''.
download_sequence_reads() {
    accessions_file=$1
    output_dir=$2
    interleaved_fastp=$3
    
    jobs=1
    threads=24
    
    echo "Downloading files from ${accessions_file}"
    echo "Downloading files to ${output_dir}"
    
    # Start from an empty output directory (this removes any previous downloads).
    rm -rf "${output_dir}"
    mkdir -p "${output_dir}"
    
    before=`date +%s`
    
    command="parallel -j ${jobs} -I% -a ${accessions_file} \
        fasterq-dump --threads ${threads} --split-spot --stdout % \2\> ${output_dir}/log.fasterq.%.err \| \
        fastp ${interleaved_fastp} -l 15 --stdin --stdout \2\> ${output_dir}/log.fastp.%.err \| \
        pigz -cp ${threads} - \> ${output_dir}/%.fastq.gz"
    echo ${command}
    conda run --name sra-tools ${command}
        
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes.\n" "${minutes}"
}
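
The function writes one `<accession>.fastq.gz` file per line of the accessions file. As a rough sketch of that mapping, the hypothetical helper below (`expected_outputs` is not part of the original notebook) lists the files a given run should produce, which can later be compared against what is actually on disk.

```bash
# Hypothetical helper (not part of the original notebook): list the compressed
# FASTQ files that download_sequence_reads is expected to write.
expected_outputs() {
    local accessions_file=$1
    local output_dir=$2
    local accession
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r accession || [ -n "${accession}" ]; do
        # Skip blank lines.
        [ -z "${accession}" ] && continue
        echo "${output_dir}/${accession}.fastq.gz"
    done < "${accessions_file}"
}
```

For example, `expected_outputs "microbial/microbial-genomes-${accessions_file_extension}" "microbial/${output_dir_fragment}"` prints the paths expected for the microbial data.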

# Download microbial data

Now, let's download the microbial data.

In [None]:
download_sequence_reads "microbial/microbial-genomes-${accessions_file_extension}" "microbial/${output_dir_fragment}" '--interleaved_in'

Awesome. We've downloaded all the microbial data. Let's check out the files.

In [None]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l
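
Listing the files shows that they exist, but not that they are intact. One possible extra check, sketched below under the assumption that each FASTQ record spans the usual four lines, is to test every archive with `gzip -t` and count its reads. The helper name `count_reads` is hypothetical.

```bash
# Hypothetical helper (not part of the original notebook): count the reads in
# a gzipped FASTQ file, assuming the usual four lines per record.
count_reads() {
    local fastq_gz=$1
    local lines
    # gzip -t exits non-zero if the archive is truncated or corrupt.
    gzip -t "${fastq_gz}" || return 1
    lines=$(gzip -cd "${fastq_gz}" | wc -l)
    echo $(( lines / 4 ))
}

# Falls back to the current directory if output_dir is not set.
for file in "${output_dir:-.}"/*.gz; do
    [ -e "${file}" ] || continue   # the pattern matched no files
    echo "${file}: $(count_reads "${file}") reads"
done
```

For interleaved paired-end files, both mates are stored as separate records, so the count is twice the number of read pairs.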

# Download metagenomics data

Let's now download the metagenomics data.

In [None]:
download_sequence_reads "metagenomics/metagenomes-${accessions_file_extension}" "metagenomics/${output_dir_fragment}" '--interleaved_in'

In [None]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

# Download human genomics data

Now let's download the human genomics data.

In [None]:
download_sequence_reads "human/human-genomes-${accessions_file_extension}" "human/${output_dir_fragment}" ''

In [None]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

Awesome. We've now downloaded all the data we need :).