# Download sequence read data

This notebook describes how all the sequence read data was downloaded.

First, we set the `PROJECT_DIR` variable to the root of the git repository and change into that directory.

In [18]:
PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR
ls

1-download-data.ipynb            bin         fastp.json    microbial
1-download-data.nbconvert.ipynb  db          human         README.md
2-clean-reads.ipynb              fastp.html  metagenomics  Untitled.ipynb


Now, let's download the data. 

The download code below assumes you have the following [conda](https://docs.conda.io/en/latest/) environment set up. It can be created with:

```bash
conda create --name sra-tools sra-tools
```

We also assume that [pigz](https://zlib.net/pigz/) is installed (for faster gzipping) and that `fastp`, which the download function calls, is available on your `PATH`.
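
Before running a long download it can help to confirm that the required commands are actually on the `PATH`. The following is a minimal sketch, not part of the original workflow; the helper name `missing_tools` is hypothetical and it only checks that each command can be found, not that it works.

```bash
# Hypothetical helper: print the names of any given commands that are not on PATH.
missing_tools() {
    local missing=""
    local tool
    for tool in "$@"; do
        if ! command -v "${tool}" > /dev/null 2>&1; then
            # Append to the space-separated list of missing tools.
            missing="${missing:+${missing} }${tool}"
        fi
    done
    echo "${missing}"
}

# Check the tools this notebook relies on and warn if any are missing.
missing=$(missing_tools conda pigz)
if [ -n "${missing}" ]; then
    echo "Missing required tools: ${missing}" >&2
fi
```

An empty result means every listed command was found.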

Once that's set up, let's define a bash function for downloading the data.

## Bash download function

In [19]:
# Purpose: Downloads sequence reads
# Args:
#      accessions_file: A file containing a list (one per line) of sequence read accessions to download.
#      output_dir: A directory to download the files into. It is deleted and recreated on each run.
#      pe: Set to 1 if the dataset is paired-end (pe), otherwise set to 0.
download_sequence_reads() {
    accessions_file=$1
    output_dir=$2
    pe=$3
    
    threads=48
    
    echo "Downloading files from ${accessions_file}"
    echo "Downloading files to ${output_dir}"
    
    rm -rf ${output_dir}
    mkdir -p ${output_dir}
    
    before=`date +%s`
    for name in `cat ${accessions_file}`;
    do
        output=${output_dir}/${name}.fastq.gz
        log_fasterq=${output_dir}/log.${name}.fasterq
        log_fastp=${output_dir}/log.${name}.fastp
    
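        # Only tell fastp the input is interleaved for paired-end data;
        # single-end reads are read from stdin as they are.
        if [ "${pe}" = "1" ]; then
            fastp_input="--interleaved_in --stdin"
        else
            fastp_input="--stdin"
        fi

        command="fasterq-dump --split-spot --stdout ${name} 2> ${log_fasterq} | fastp ${fastp_input} --stdout 2> ${log_fastp} | pigz -cp ${threads} - > ${output}"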
        echo ${command}
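        # Pass the pipeline to bash -c so that the pipes and redirections are
        # interpreted by a shell rather than handed to fasterq-dump as arguments
        # (recent conda versions quote each argument given to conda run).
        conda run --name sra-tools bash -c "${command}"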
    done
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes." ${minutes}
}
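
Because the function loops over every line of the accessions file, a malformed line only shows up as a failed download partway through. As a minimal sketch (not part of the original workflow), the hypothetical helper `invalid_accessions` below prints any line that does not look like an SRA run accession. The pattern `SRR`/`ERR`/`DRR` followed by digits is an assumption about the files used here; adjust it if your lists contain other identifier types.

```bash
# Hypothetical helper: print lines of an accessions file that do not look like
# run accessions (SRR/ERR/DRR followed by digits). Blank lines are ignored.
invalid_accessions() {
    local accessions_file=$1
    # The first grep drops well-formed accessions, the second drops blank lines.
    # grep exits non-zero when nothing is printed, so "|| true" keeps the
    # helper's exit status at zero in that case.
    grep -v -E '^[SED]RR[0-9]+$' "${accessions_file}" \
        | grep -v -E '^[[:space:]]*$' || true
}
```

For example, `invalid_accessions microbial/microbial-genomes-small.txt` should print nothing if every line of that file is a well-formed run accession.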

# Download microbial data

Now, let's download the microbial data.

In [20]:
accessions_file="microbial/microbial-genomes-small.txt"
output_dir="microbial/data-small"
download_sequence_reads ${accessions_file} ${output_dir} 1

Downloading files from microbial/microbial-genomes-small.txt
Downloading files to microbial/data-small
fasterq-dump --split-spot --stdout SRR10298904 2> microbial/data-small/log.SRR10298904.fasterq | fastp --interleaved_in --stdin --stdout 2> microbial/data-small/log.SRR10298904.fastp | pigz -cp 48 - > microbial/data-small/SRR10298904.fastq.gz
fasterq-dump --split-spot --stdout SRR10298906 2> microbial/data-small/log.SRR10298906.fasterq | fastp --interleaved_in --stdin --stdout 2> microbial/data-small/log.SRR10298906.fastp | pigz -cp 48 - > microbial/data-small/SRR10298906.fastq.gz
Done. Took 0.40 minutes.

Awesome. We've downloaded all the microbial data. Let's check out the files.

In [21]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

-rw-r--r-- 1 apetkau grp_apetkau 100M Dec  5 18:39 microbial/data-small/SRR10298904.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 191M Dec  5 18:40 microbial/data-small/SRR10298906.fastq.gz
2
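
File sizes alone do not show whether an archive is complete. One quick sanity check is to decompress each file and count its reads. The sketch below is a hypothetical helper, not part of the original notebook, and it assumes each FASTQ record occupies exactly four lines.

```bash
# Hypothetical helper: count the records in a gzipped FASTQ file,
# assuming four lines per record.
count_fastq_records() {
    local fastq_gz=$1
    local lines
    # gzip -cd fails on a truncated or corrupt archive, which also makes this
    # a basic integrity check.
    lines=$(gzip -cd "${fastq_gz}" | wc -l)
    echo $(( lines / 4 ))
}
```

It could be run as `count_fastq_records ${output_dir}/SRR10298904.fastq.gz`. For paired-end data written in interleaved form, the result counts both mates of each pair.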


# Download metagenomics data

Let's now download the metagenomics data.

In [None]:
accessions_file="metagenomics/metagenomes-small.txt"
output_dir="metagenomics/data-small"
download_sequence_reads ${accessions_file} ${output_dir} 1

In [None]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l
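
Counting files with `wc -l` shows how many archives exist, but not which accessions failed. As a minimal sketch, the hypothetical helper `missing_outputs` below lists every accession that has no non-empty `<accession>.fastq.gz` in the output directory. It follows the file naming used by `download_sequence_reads` above.

```bash
# Hypothetical helper: print each accession whose output file is missing or empty.
missing_outputs() {
    local accessions_file=$1
    local output_dir=$2
    local name
    for name in $(cat "${accessions_file}"); do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "${output_dir}/${name}.fastq.gz" ]; then
            echo "${name}"
        fi
    done
}
```

Running `missing_outputs ${accessions_file} ${output_dir}` after a download should print nothing when every accession produced a file.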

# Download human genomics data

Now let's download the human genomics data.

In [None]:
accessions_file="human/human-genomes-small.txt"
output_dir="human/data-small"
download_sequence_reads ${accessions_file} ${output_dir} 0

In [None]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

Awesome. We've now downloaded all the data we need :).