# Download sequence read data

This notebook describes the procedures taken to download all the sequence read data.

First, we set `PROJECT_DIR` to the root of the Git repository and change into it, so that all paths below are relative to the project root.

In [1]:
PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR
ls

1-download-data.ipynb  db        human         microbial
bin                    fast.tmp  metagenomics  README.md


Before downloading anything, let's define a bash function that downloads the reads for a list of accessions.
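The commands in this notebook depend on a few command-line tools (`git`, `fasterq-dump`, `pigz`, and `bc`). As a minimal sketch, a check like the following could confirm they are on the `PATH` before a long download starts. The `require_tools` helper is hypothetical and not part of the original notebook.

```shell
# Hypothetical helper: returns non-zero and names the first tool missing from PATH
require_tools() {
    for tool in "$@"; do
        if ! command -v "${tool}" > /dev/null 2>&1; then
            echo "Missing required tool: ${tool}" >&2
            return 1
        fi
    done
}
```

It could be called as `require_tools git fasterq-dump pigz bc` before running any download.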

## Bash download function

In [19]:
# Purpose: Downloads sequence reads
# Args:
#      accessions_file: A file containing a list (one per line) of sequence read accessions to download.
#      output_dir: A directory to download the files into.
download_sequence_reads() {
    accessions_file=$1
    output_dir=$2
    
    threads=48
    
    echo "Downloading files from ${accessions_file}"
    echo "Downloading files to ${output_dir}"
    
    mkdir -p ${output_dir}
    
    before=`date +%s`
    for name in `cat ${accessions_file}`;
    do
        output=${output_dir}/${name}
        log=${output_dir}/log.${name}
    
        command="fasterq-dump --threads ${threads} -o ${output} --split-files ${name}"
        echo ${command} "1> ${log}.out 2> ${log}.err"
        ${command} 1> ${log}.out 2> ${log}.err
        
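        # Compress only uncompressed FASTQ files; a bare ${name}* glob would also match .fastq.gz files from earlier runs
        command="pigz --processes ${threads} ${output_dir}/${name}*.fastq"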
        echo ${command}
        ${command}
    done
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes.\n" ${minutes}
}
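The function expects the accessions file to hold exactly one accession per line. If the file might contain blank lines, `#` comments, or Windows line endings, a small filter could clean the list first. This is only a sketch, and `read_accessions` is a hypothetical helper that is not part of the original notebook.

```shell
# Hypothetical helper: prints one accession per line from the given file
read_accessions() {
    accessions_file=$1
    # Drop carriage returns, then drop blank lines and lines starting with '#'
    tr -d '\r' < "${accessions_file}" | grep -v -e '^[[:space:]]*$' -e '^#'
}
```

Under that assumption, the loop in `download_sequence_reads` could iterate over `$(read_accessions "${accessions_file}")` instead of reading the file with `cat`.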

# Download microbial data

Now, let's download the microbial data.

In [20]:
accessions_file="microbial/microbial-genomes-small.txt"
output_dir="microbial/data-small"
download_sequence_reads ${accessions_file} ${output_dir}

Downloading files from microbial/microbial-genomes-small.txt
Downloading files to microbial/data-small
fasterq-dump --threads 48 -o microbial/data-small/SRR10298904 --split-files SRR10298904 1> microbial/data-small/log.SRR10298904.out 2> microbial/data-small/log.SRR10298904.err
pigz --processes 48 microbial/data-small/SRR10298904_1.fastq microbial/data-small/SRR10298904_1.fastq.gz microbial/data-small/SRR10298904_2.fastq microbial/data-small/SRR10298904_2.fastq.gz
microbial/data-small/SRR10298904_1.fastq.gz exists -- overwrite (y/n)? Done. Took 2.08 minutes.

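The microbial data is now downloaded. The overwrite prompt in the output above comes from `pigz` finding `.fastq.gz` files from an earlier run in the output directory. Let's check the files.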

In [4]:
ls -lh ${output_dir} | head
ls ${output_dir} | wc -l

total 101M
-rw-r--r-- 1 apetkau grp_apetkau 48M Dec  5 16:18 SRR10298904_1.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 54M Dec  5 16:18 SRR10298904_2.fastq.gz
2
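Listing the files shows that both mates were written, but not that they are complete. As a rough sanity check, the read counts of the two files in a pair can be compared. The sketch below assumes standard four-line FASTQ records; `count_fastq_reads` and `check_pair` are hypothetical helpers that are not part of the original notebook.

```shell
# Hypothetical helper: prints the number of reads in a gzip-compressed FASTQ file,
# assuming four lines per record
count_fastq_reads() {
    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

# Hypothetical helper: succeeds only if both files hold the same number of reads
check_pair() {
    [ "$(count_fastq_reads "$1")" -eq "$(count_fastq_reads "$2")" ]
}
```

For the files listed above, this could be run as `check_pair ${output_dir}/SRR10298904_1.fastq.gz ${output_dir}/SRR10298904_2.fastq.gz && echo "pair OK"`.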
