# Download sequence read data

This notebook describes the steps used to download all the sequence read data.

First, we set `PROJECT_DIR` to the root of the git repository and change into it, so that the relative paths used below resolve correctly.

In [1]:
PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR
ls

1-download-data.ipynb  fasterq.tmp.phac5021054.714  metagenomics
bin                    fast.tmp                     microbial
db                     human                        README.md


Now, let's download the data. First, we define a bash function that downloads and compresses the reads for each accession in a list.

## Bash download function

In [2]:
# Purpose: Downloads sequence reads
# Args:
#      accessions_file: A file containing a list (one per line) of sequence read accessions to download.
#      output_dir: A directory to download the files into.
download_sequence_reads() {
    accessions_file=$1
    output_dir=$2
    
    threads=48
    
    echo "Downloading files from ${accessions_file}"
    echo "Downloading files to ${output_dir}"
    
    mkdir -p "${output_dir}"
    # Remove files left over from a previous run; ':?' aborts if output_dir is empty
    rm -f "${output_dir:?}"/*
    
    before=`date +%s`
    for name in `cat ${accessions_file}`;
    do
        output=${output_dir}/${name}.fastq
        log=${output_dir}/log.${name}
    
        command="fasterq-dump --threads ${threads} --split-spot -o ${output} ${name}"
        echo ${command} "1> ${log}.out 2> ${log}.err"
        ${command} 1> ${log}.out 2> ${log}.err
        
        command="pigz --processes ${threads} ${output}"
        echo ${command}
        ${command}
    done
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes.\n" ${minutes}
}
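
The function above assumes that `fasterq-dump` (from the SRA Toolkit), `pigz`, and `bc` are all available on the `PATH`. As a minimal sketch, a pre-flight check along the following lines could catch a missing tool before any download starts. The helper name `check_tools` is hypothetical and not part of the original notebook.

```shell
# Hypothetical helper (not part of the original notebook).
# Purpose: Checks that each named tool can be found on the PATH.
# Args:
#      "$@": The names of the tools to look for.
# Returns non-zero if at least one tool is missing.
check_tools() {
    local missing=0
    local tool

    for tool in "$@"
    do
        # 'command -v' succeeds only if the shell can resolve the name
        if ! command -v "${tool}" > /dev/null 2>&1
        then
            echo "Missing required tool: ${tool}" >&2
            missing=1
        fi
    done

    return ${missing}
}
```

For example, `check_tools fasterq-dump pigz bc` returns a non-zero status and names each missing tool on standard error.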

# Download microbial data

Now, let's download the microbial data.

In [3]:
accessions_file="microbial/microbial-genomes-small.txt"
output_dir="microbial/data-small"
download_sequence_reads ${accessions_file} ${output_dir}

Downloading files from microbial/microbial-genomes-small.txt
Downloading files to microbial/data-small
fasterq-dump --threads 48 --split-spot -o microbial/data-small/SRR10298904.fastq SRR10298904 1> microbial/data-small/log.SRR10298904.out 2> microbial/data-small/log.SRR10298904.err
pigz --processes 48 microbial/data-small/SRR10298904.fastq
fasterq-dump --threads 48 --split-spot -o microbial/data-small/SRR10298906.fastq SRR10298906 1> microbial/data-small/log.SRR10298906.out 2> microbial/data-small/log.SRR10298906.err
pigz --processes 48 microbial/data-small/SRR10298906.fastq
Done. Took 0.18 minutes.

Awesome. We've downloaded all the microbial data. Let's check out the files.

In [4]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

-rw-r--r-- 1 apetkau grp_apetkau 100M Dec  5 17:33 microbial/data-small/SRR10298904.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 192M Dec  5 17:33 microbial/data-small/SRR10298906.fastq.gz
2
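
Listing the files shows that they exist, but `download_sequence_reads` sends the output of `fasterq-dump` to log files and does not check exit codes, so a failed download could go unnoticed. The sketch below is one possible way to check each accession: it looks for a non-empty `.fastq.gz` file and tests it with `gzip -t`, which also reads files written by `pigz`. The helper name `verify_downloads` and its status labels are assumptions for illustration, not part of the original notebook.

```shell
# Hypothetical helper (not part of the original notebook).
# Purpose: Checks that every accession has an intact compressed FASTQ file.
# Args:
#      accessions_file: The same accession list passed to download_sequence_reads.
#      output_dir: The directory the reads were downloaded into.
# Returns non-zero if any file is missing, empty, or fails the gzip test.
verify_downloads() {
    local accessions_file="$1"
    local output_dir="$2"
    local failed=0
    local name file

    for name in `cat "${accessions_file}"`
    do
        file="${output_dir}/${name}.fastq.gz"
        if [ ! -s "${file}" ]
        then
            # File does not exist or has zero size
            echo "MISSING ${file}"
            failed=1
        elif ! gzip -t "${file}" 2> /dev/null
        then
            # File exists but is not a complete gzip stream
            echo "CORRUPT ${file}"
            failed=1
        else
            echo "OK      ${file}"
        fi
    done

    return ${failed}
}
```

It can be called with the same arguments as the download function, for example `verify_downloads ${accessions_file} ${output_dir}`.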


# Download metagenomics data

Let's now download the metagenomics data.

In [5]:
accessions_file="metagenomics/metagenomes-small.txt"
output_dir="metagenomics/data-small"
download_sequence_reads ${accessions_file} ${output_dir}

Downloading files from metagenomics/metagenomes-small.txt
Downloading files to metagenomics/data-small
fasterq-dump --threads 48 --split-spot -o metagenomics/data-small/ERR1713335.fastq ERR1713335 1> metagenomics/data-small/log.ERR1713335.out 2> metagenomics/data-small/log.ERR1713335.err
pigz --processes 48 metagenomics/data-small/ERR1713335.fastq
Done. Took 2.90 minutes.

In [6]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

-rw-r--r-- 1 apetkau grp_apetkau 4.4G Dec  5 17:35 metagenomics/data-small/ERR1713335.fastq.gz
1
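
The compressed file size gives only a rough idea of how much sequence was retrieved. Since a FASTQ record normally spans four lines, the number of reads can be estimated as the line count divided by four. The following is a small sketch under that assumption; the helper name `count_reads` is hypothetical and not part of the original notebook.

```shell
# Hypothetical helper (not part of the original notebook).
# Purpose: Prints the number of reads in a gzip-compressed FASTQ file.
# Args:
#      file: A .fastq.gz file.
# Assumes each FASTQ record spans exactly four lines.
count_reads() {
    local file="$1"
    local lines

    # Decompress to standard output and count the lines
    lines=$(gzip -dc "${file}" | wc -l)
    echo $(( lines / 4 ))
}
```

For example, `count_reads ${output_dir}/ERR1713335.fastq.gz` would report the number of reads in the metagenomics file. Note that this decompresses the whole file, so it can take a while for multi-gigabyte inputs.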


# Download human genomics data

Now let's download the human genomics data.

In [7]:
accessions_file="human/human-genomes-small.txt"
output_dir="human/data-small"
download_sequence_reads ${accessions_file} ${output_dir}

Downloading files from human/human-genomes-small.txt
Downloading files to human/data-small
fasterq-dump --threads 48 --split-spot -o human/data-small/SRR038300.fastq SRR038300 1> human/data-small/log.SRR038300.out 2> human/data-small/log.SRR038300.err
pigz --processes 48 human/data-small/SRR038300.fastq
fasterq-dump --threads 48 --split-spot -o human/data-small/SRR039632.fastq SRR039632 1> human/data-small/log.SRR039632.out 2> human/data-small/log.SRR039632.err
pigz --processes 48 human/data-small/SRR039632.fastq
fasterq-dump --threads 48 --split-spot -o human/data-small/SRR537114.fastq SRR537114 1> human/data-small/log.SRR537114.out 2> human/data-small/log.SRR537114.err
pigz --processes 48 human/data-small/SRR537114.fastq
Done. Took 0.53 minutes.

In [8]:
ls -lh ${output_dir}/*.gz | head
ls ${output_dir}/*.gz | wc -l

-rw-r--r-- 1 apetkau grp_apetkau 747M Dec  5 17:36 human/data-small/SRR038300.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 345M Dec  5 17:36 human/data-small/SRR039632.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 528M Dec  5 17:37 human/data-small/SRR537114.fastq.gz
3


Awesome. We've now downloaded all the data we need :).
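
The `wc -l` checks in the earlier cells compare file counts against the accession lists by eye. As a rough sketch, the same comparison could be automated as follows. The helper name `check_counts` is hypothetical and not part of the original notebook, and it assumes one accession per non-empty line.

```shell
# Hypothetical helper (not part of the original notebook).
# Purpose: Compares the number of accessions listed with the number of
#          .fastq.gz files present.
# Args:
#      accessions_file: A file containing one accession per line.
#      output_dir: The directory the reads were downloaded into.
# Returns non-zero if the two counts differ.
check_counts() {
    local accessions_file="$1"
    local output_dir="$2"
    local expected actual

    # Count the non-empty lines in the accession list
    expected=$(grep -c . "${accessions_file}" || true)
    # Count compressed FASTQ files directly inside output_dir
    actual=$(find "${output_dir}" -maxdepth 1 -name '*.fastq.gz' | wc -l | tr -d ' ')

    printf "%s: expected %d, found %d\n" "${output_dir}" "${expected}" "${actual}"
    [ "${expected}" -eq "${actual}" ]
}

# Example calls, using the paths from this notebook:
#   check_counts microbial/microbial-genomes-small.txt microbial/data-small
#   check_counts metagenomics/metagenomes-small.txt metagenomics/data-small
#   check_counts human/human-genomes-small.txt human/data-small
```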