# Download sequence read data

This notebook describes the steps used to download all the sequence read data.

First, we set `PROJECT_DIR` to the root of the git repository and change into it, so that the relative paths used below resolve correctly.

In [None]:
PROJECT_DIR=$(git rev-parse --show-toplevel)
# Quote the path so that it still works if it contains spaces.
cd "${PROJECT_DIR}"
ls
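Before going further, it can help to confirm that the command-line tools used later in this notebook (`fasterq-dump`, `pigz`, and `bc`) are on the `PATH`. The cell below is a minimal sketch of such a check; `check_tools` is a hypothetical helper introduced here for illustration and is not part of the original workflow.

```shell
# Hypothetical helper (not part of the original notebook): report any
# command-line tools that cannot be found on the PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "${tool}" > /dev/null 2>&1; then
            echo "Missing required tool: ${tool}" >&2
            missing=1
        fi
    done
    # Non-zero exit status if at least one tool was not found.
    return ${missing}
}

# Tools that the download function in this notebook calls.
check_tools fasterq-dump pigz bc || echo "Install the missing tools before continuing."
```

If anything is reported as missing, the download function defined below would fail partway through, so it is cheaper to find out now.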

Now, let's download the data. First, let's define a bash function that handles the download.

## Bash download function

In [None]:
# Purpose: Downloads sequence reads
# Args:
#      accessions_file: A file containing a list (one per line) of sequence read accessions to download.
#      output_dir: A directory to download the files into.
download_sequence_reads() {
    accessions_file=$1
    output_dir=$2
    
    threads=48
    
    echo "Downloading files from ${accessions_file}"
    echo "Downloading files to ${output_dir}"
    
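    mkdir -p "${output_dir}"
    # Clear out any previous downloads. ':?' aborts if output_dir is empty,
    # so this can never expand to 'rm -f /*'; '-f' tolerates an empty directory.
    rm -f "${output_dir:?}"/*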
    
    before=$(date +%s)
    for name in $(cat "${accessions_file}");
    do
        output=${output_dir}/${name}
        log=${output_dir}/log.${name}
    
        command="fasterq-dump --threads ${threads} -o ${output} --split-files ${name}"
        echo ${command} "1> ${log}.out 2> ${log}.err"
        ${command} 1> ${log}.out 2> ${log}.err
        
        command="pigz --processes ${threads} ${output_dir}/${name}*"
        echo ${command}
        ${command}
    done
    after=$(date +%s)
    minutes=$(echo "(${after}-${before})/60" | bc -l)
    printf "Done. Took %0.2f minutes.\n" "${minutes}"
}
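Since the function clears the output directory before it starts, it is worth checking an accessions file before running a long download. The cell below is a rough sketch of such a check. It assumes that every line should be a run accession of the form `SRR`, `ERR`, or `DRR` followed by digits; if your lists hold other identifiers, the pattern needs adjusting. `validate_accessions` is a hypothetical helper, not something the original notebook defines.

```shell
# Hypothetical helper (not part of the original notebook): flag lines that
# do not look like run accessions (assumed form: SRR/ERR/DRR plus digits).
validate_accessions() {
    file_to_check=$1

    if [ ! -r "${file_to_check}" ]; then
        echo "Cannot read ${file_to_check}" >&2
        return 2
    fi

    # -v selects non-matching lines, -n prefixes each with its line number,
    # -E enables extended regular expressions. grep exits 0 if it printed
    # at least one such line, which here means a problem was found.
    if grep -vnE '^[SED]RR[0-9]+$' "${file_to_check}"; then
        echo "Lines above do not look like run accessions: ${file_to_check}" >&2
        return 1
    fi
    return 0
}

# Example call (path taken from the cells below):
# validate_accessions "microbial/microbial-genomes-small.txt"
```

Blank lines are also reported, which is intentional here: the download loop would skip them silently, but they often point to a copy-and-paste mistake.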

# Download microbial data

Now, let's download the microbial data.

In [None]:
accessions_file="microbial/microbial-genomes-small.txt"
output_dir="microbial/data-small"
download_sequence_reads "${accessions_file}" "${output_dir}"

Awesome. We've downloaded all the microbial data. Let's check out the files.

In [None]:
ls -lh ${output_dir}/*.fastq.gz | head
ls ${output_dir}/*.fastq.gz | wc -l
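A file count alone does not tell us whether every accession was actually downloaded. The cell below sketches one way to cross-check the accessions file against the output directory. `report_missing_downloads` is a hypothetical helper added for illustration, and it assumes the downloaded files are named with the accession as a prefix and end in `.fastq.gz`, as the listing above suggests.

```shell
# Hypothetical helper (not part of the original notebook): list accessions
# that have no matching *.fastq.gz file in the download directory.
report_missing_downloads() {
    list_file=$1
    data_dir=$2
    missing=0

    # The '|| [ -n ... ]' part handles a final line without a trailing newline.
    while IFS= read -r name || [ -n "${name}" ]; do
        [ -z "${name}" ] && continue
        # This is a prefix match, so it is only a coarse check: files for
        # SRR10 would also satisfy a lookup for SRR1.
        # ls fails when the glob matches nothing.
        if ! ls "${data_dir}/${name}"*.fastq.gz > /dev/null 2>&1; then
            echo "No reads found for ${name}"
            missing=$((missing + 1))
        fi
    done < "${list_file}"

    echo "${missing} accession(s) without downloaded reads"
    [ "${missing}" -eq 0 ]
}

# Example call, using the variables set in the cells above:
# report_missing_downloads "${accessions_file}" "${output_dir}"
```

For any accession reported here, the matching `log.<accession>.err` file in the output directory is the first place to look.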

# Download metagenomics data

Let's now download the metagenomics data.

In [None]:
accessions_file="metagenomics/metagenomes-small.txt"
output_dir="metagenomics/data-small"
download_sequence_reads "${accessions_file}" "${output_dir}"

In [None]:
ls -lh ${output_dir}/*.fastq.gz | head
ls ${output_dir}/*.fastq.gz | wc -l

# Download human genomics data

Now let's download the human genomics data.

In [None]:
accessions_file="human/human-genomes-small.txt"
output_dir="human/data-small"
download_sequence_reads "${accessions_file}" "${output_dir}"

In [None]:
ls -lh ${output_dir}/*.fastq.gz | head
ls ${output_dir}/*.fastq.gz | wc -l

Awesome. We've now downloaded all the data we need :).
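As a final check, here is a small sketch that summarizes the three download directories in one go. `summarize_downloads` is a hypothetical helper, not part of the original workflow. The reported size covers the whole directory, so it includes the `log.*` files as well as the reads.

```shell
# Hypothetical helper (not part of the original notebook): print the number
# of compressed FASTQ files and the total size of each directory given.
summarize_downloads() {
    for dir in "$@"; do
        if [ -d "${dir}" ]; then
            # tr strips the padding that some wc implementations add.
            count=$(find "${dir}" -name '*.fastq.gz' | wc -l | tr -d ' ')
            # du -sh prints "<size><TAB><path>"; keep only the size.
            size=$(du -sh "${dir}" | cut -f1)
            echo "${dir}: ${count} fastq.gz files, ${size} total"
        else
            echo "${dir}: directory not found"
        fi
    done
}

# Output directories used in this notebook.
summarize_downloads microbial/data-small metagenomics/data-small human/data-small
```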