# Downsample data

To keep runtime and resource usage reasonable, we are going to downsample (reduce the size of) the fastq files we've downloaded.
 
Let's set up some environment variables.

In [1]:
PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR

Now, let's reduce the size of the datasets.

The downsampling code below assumes you have the following [conda](https://docs.conda.io/en/latest/) environment set up: [seqtk](https://github.com/lh3/seqtk).

```bash
conda create --name seqtk seqtk
```

## Get total bases per dataset

Once that's set up, we'll also need a way to get the total number of bases in each dataset. Luckily, these were logged by `fastp` in the download step. Let's take a look at one of these files:

In [2]:
cat microbial/data/log.fastp.SRR10527353.err

Streaming uncompressed interleaved reads to STDOUT...
Enable interleaved output mode for paired-end input.

Read1 before filtering:
total reads: 1832803
total bases: 183295006
Q20 bases: 180082065(98.2471%)
Q30 bases: 177176497(96.6619%)

Read2 before filtering:
total reads: 1832803
total bases: 183299394
Q20 bases: 176719382(96.4102%)
Q30 bases: 172125734(93.9041%)

Read1 after filtering:
total reads: 1788198
total bases: 178805653
Q20 bases: 176306123(98.6021%)
Q30 bases: 173939308(97.2784%)

Read2 aftering filtering:
total reads: 1788198
total bases: 178803689
Q20 bases: 174675827(97.6914%)
Q30 bases: 171177338(95.7348%)

Filtering result:
reads passed filter: 3576396
reads failed due to low quality: 88972
reads failed due to too many N: 238
reads failed due to too short: 0
reads with adapter trimmed: 1320
bases trimmed due to adapters: 22093

Duplication rate: 0.919835%

Insert size peak (evaluated by paired-end reads): 167

JSON report: fastp.json
HTML report: fastp.html

fastp --

What we need is the total number of base pairs remaining after filtering (Read1 plus Read2 for paired-end data). Let's build a bash function to extract this from the log.

### Extract base pairs

In [3]:
# Purpose: Extract the total base pairs remaining after filtering
# Args:
#      fastp_file: The fastp log file
# Output: Prints the total base pairs from this log file.
extract_total_base_pairs() {
    fastp_file=$1
    
    # Test if paired-end dataset. The fastp log above spells this header
    # 'Read2 aftering filtering', so the pattern has to match that spelling.
    if grep -q 'Read2 aftering filtering' "${fastp_file}" 2> /dev/null;
    then
        read1_bases=`grep 'Read1 after filtering' -A 2 "${fastp_file}" | grep 'total bases' | sed -e 's/total bases: //'`
        read2_bases=`grep 'Read2 aftering filtering' -A 2 "${fastp_file}" | grep 'total bases' | sed -e 's/total bases: //'`
        
        # Sum both mates of the pair
        echo "(${read1_bases}+${read2_bases})" | bc
    else
        read_bases=`grep 'Read1 after filtering' -A 2 "${fastp_file}" | grep 'total bases' | sed -e 's/total bases: //'`
        
        # Prints the same number; bc reports a syntax error if '$read_bases' is empty.
        echo "(${read_bases})" | bc
    fi
}

Let's test it out.

In [4]:
extract_total_base_pairs "microbial/data/log.fastp.SRR10527353.err"

357609342


Looks good.
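
As a cross-check on the grep/sed pipeline, the same extraction can be sketched as a single `awk` pass. This is only an illustrative alternative, not the function used in the rest of this notebook: the name `extract_total_base_pairs_awk` and the temporary sample log are made up for the example, and the sample log reuses just the "after filtering" lines from the file shown above.

```bash
# Hypothetical alternative to extract_total_base_pairs: sums the 'total bases'
# value of every "ReadN after filtering" block in a fastp log.
extract_total_base_pairs_awk() {
    awk '
        # Header of an after-filtering block; "(ing)?" also accepts the
        # "aftering" spelling that fastp prints for Read2 in the log above.
        /^Read[12] after(ing)? filtering:/ { in_block = 1; next }
        # First "total bases" line inside such a block.
        in_block && /^total bases:/ { sum += $3; in_block = 0 }
        # A blank line ends the block.
        /^$/ { in_block = 0 }
        END { printf "%.0f\n", sum }
    ' "$1"
}

# Demo on a throwaway log containing only the relevant lines from above.
sample_log=$(mktemp)
cat > "${sample_log}" <<'EOF'
Read1 after filtering:
total reads: 1788198
total bases: 178805653

Read2 aftering filtering:
total reads: 1788198
total bases: 178803689
EOF

extract_total_base_pairs_awk "${sample_log}"   # prints 357609342
rm -f "${sample_log}"
```

Because it sums whichever after-filtering blocks are present, this variant needs no separate branch for single-end logs.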

Now, let's define a function to downsample the reads (using number of base pairs to help us determine the size of the subsample).
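
The subsample size comes from a simple ratio: the target number of base pairs divided by the dataset's total. A minimal sketch of that arithmetic follows, using `awk` in place of `bc` plus `printf` purely for brevity. The helper name `sampling_fraction` is hypothetical, the 100 million default mirrors the `max_bp` value in the function defined in the next section, and the cap at 1 is an assumption of this sketch (the notebook's function copies small files instead of sampling them).

```bash
# Hypothetical helper: fraction of reads to keep so that roughly max_bp
# base pairs remain (assumes read lengths are similar across the file).
sampling_fraction() {
    total_bp=$1
    max_bp=${2:-100000000}   # default target: 100 million base pairs
    awk -v total="${total_bp}" -v max="${max_bp}" 'BEGIN {
        ratio = max / total
        if (ratio > 1) ratio = 1   # dataset already small enough: keep everything
        printf "%.6f\n", ratio
    }'
}

# Total from the fastp log above: roughly 28% of the reads would be kept.
sampling_fraction 357609342
```

The six-decimal format matches what `seqtk sample` is given later, including the leading zero.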

## Downsample reads

In [5]:
# Purpose: Downsample reads
# Args:
#      input_dir: The data input directory.
#      output_dir: The downsampled data output directory.
# Output: Saves downsampled files in $output_dir
downsample_reads() {
    input_dir=$1
    output_dir=$2
    
    mkdir ${output_dir}
    
    jobs=50
    max_bp=100000000 # 100 million
    
    commands_file=`mktemp`
    
    for file in ${input_dir}/*.fastq.gz
    do
        accession=`basename ${file} .fastq.gz`
        
        total_bp=`extract_total_base_pairs "${input_dir}/log.fastp.${accession}.err"`
        reduction_ratio=`echo "${max_bp}/${total_bp}" | bc -l`
        reduction_ratio=`printf "%0.6f" ${reduction_ratio}` # rounding/including leading zero
        
        if [ "${max_bp}" -lt "${total_bp}" ]
        then
            echo "seqtk sample -s 111 ${file} ${reduction_ratio} | gzip --to-stdout > ${output_dir}/${accession}.fastq.gz" >> ${commands_file}
        else
            echo "cp ${file} ${output_dir}/${accession}.fastq.gz" >> ${commands_file}
        fi
    done
    
    echo "Will execute commands from file ${commands_file} like:"
    cat ${commands_file}
    
    command="parallel -j ${jobs} -a ${commands_file}"
    echo ${command}
    conda run --name seqtk ${command}
}

Great, the function is defined. Let's run it on the microbial dataset.

## Downsample microbial reads

In [6]:
downsample_reads "microbial/data" "microbial/data-downsampled"

mkdir: cannot create directory ‘microbial/data-downsampled’: File exists
Will execute commands from file /tmp/tmp.v3bmCJ3MeJ like:
seqtk sample -s 111 microbial/data/ERR1144974.fastq.gz 0.027492 | gzip --to-stdout > microbial/data-downsampled/ERR1144974.fastq.gz
seqtk sample -s 111 microbial/data/ERR1144975.fastq.gz 0.044712 | gzip --to-stdout > microbial/data-downsampled/ERR1144975.fastq.gz
seqtk sample -s 111 microbial/data/ERR1144976.fastq.gz 0.054031 | gzip --to-stdout > microbial/data-downsampled/ERR1144976.fastq.gz
seqtk sample -s 111 microbial/data/ERR1144977.fastq.gz 0.043622 | gzip --to-stdout > microbial/data-downsampled/ERR1144977.fastq.gz
seqtk sample -s 111 microbial/data/ERR1144978.fastq.gz 0.046892 | gzip --to-stdout > microbial/data-downsampled/ERR1144978.fastq.gz
seqtk sample -s 111 microbial/data/ERR3655992.fastq.gz 0.139094 | gzip --to-stdout > microbial/data-downsampled/ERR3655992.fastq.gz
seqtk sample -s 111 microbial/data/ERR3655994.fastq.gz 0.133909 | gzip --to-s

Okay. Let's check out the data.

In [7]:
ls -lh microbial/data-downsampled | head -n 5
ls microbial/data-downsampled/ | wc -l

total 3.0G
-rw-r--r-- 1 apetkau grp_apetkau 41M Dec  9 18:42 ERR1144974.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 42M Dec  9 18:42 ERR1144975.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 42M Dec  9 18:42 ERR1144976.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 42M Dec  9 18:42 ERR1144977.fastq.gz
50
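
File sizes are only a rough proxy, so a more direct check is to count the bases in a downsampled file and compare the result against the 100 million target. The sketch below assumes standard four-line FASTQ records; `count_fastq_bases` is a hypothetical helper, and the tiny gzipped file is generated only for the demonstration.

```bash
# Hypothetical helper: counts bases in a gzipped FASTQ file, assuming each
# record is exactly four lines with the sequence on the second line.
count_fastq_bases() {
    gzip -dc < "$1" | awk 'NR % 4 == 2 { bases += length($0) } END { printf "%.0f\n", bases }'
}

# Demo on a throwaway two-read file (8 + 4 bases).
demo_fastq=$(mktemp)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' | gzip -c > "${demo_fastq}"
count_fastq_bases "${demo_fastq}"   # prints 12
rm -f "${demo_fastq}"
```

On the real data this could be called as `count_fastq_bases microbial/data-downsampled/ERR1144974.fastq.gz`. Since `seqtk sample` picks reads at random according to the given fraction, the count should land near the target rather than exactly on it.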


Looks good. Let's try on the metagenomics data.

## Downsample metagenomics reads 

In [8]:
downsample_reads "metagenomics/data" "metagenomics/data-downsampled"

mkdir: cannot create directory ‘metagenomics/data-downsampled’: File exists
Will execute commands from file /tmp/tmp.XQSLZgmvtQ like:
seqtk sample -s 111 metagenomics/data/ERR1713331.fastq.gz 0.008529 | gzip --to-stdout > metagenomics/data-downsampled/ERR1713331.fastq.gz
seqtk sample -s 111 metagenomics/data/ERR1713332.fastq.gz 0.023233 | gzip --to-stdout > metagenomics/data-downsampled/ERR1713332.fastq.gz
seqtk sample -s 111 metagenomics/data/ERR1713333.fastq.gz 0.056409 | gzip --to-stdout > metagenomics/data-downsampled/ERR1713333.fastq.gz
seqtk sample -s 111 metagenomics/data/ERR1713334.fastq.gz 0.015918 | gzip --to-stdout > metagenomics/data-downsampled/ERR1713334.fastq.gz
seqtk sample -s 111 metagenomics/data/ERR1713335.fastq.gz 0.013456 | gzip --to-stdout > metagenomics/data-downsampled/ERR1713335.fastq.gz
seqtk sample -s 111 metagenomics/data/ERR1713336.fastq.gz 0.015331 | gzip --to-stdout > metagenomics/data-downsampled/ERR1713336.fastq.gz
seqtk sample -s 111 metagenomics/data/

In [9]:
ls -lh metagenomics/data-downsampled | head -n 5
ls metagenomics/data-downsampled | wc -l

total 3.0G
-rw-r--r-- 1 apetkau grp_apetkau 59M Dec  9 18:49 ERR1713331.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 60M Dec  9 18:46 ERR1713332.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 60M Dec  9 18:44 ERR1713333.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau 61M Dec  9 18:47 ERR1713334.fastq.gz
50


Great. Now let's do this on the human data.

## Downsample human reads

In [10]:
downsample_reads "human/data" "human/data-downsampled"

mkdir: cannot create directory ‘human/data-downsampled’: File exists
Will execute commands from file /tmp/tmp.eIhfnz4TFo like:
seqtk sample -s 111 human/data/SRR038300.fastq.gz 0.185818 | gzip --to-stdout > human/data-downsampled/SRR038300.fastq.gz
seqtk sample -s 111 human/data/SRR039632.fastq.gz 0.269121 | gzip --to-stdout > human/data-downsampled/SRR039632.fastq.gz
seqtk sample -s 111 human/data/SRR1012332.fastq.gz 0.326105 | gzip --to-stdout > human/data-downsampled/SRR1012332.fastq.gz
seqtk sample -s 111 human/data/SRR1024141.fastq.gz 0.429048 | gzip --to-stdout > human/data-downsampled/SRR1024141.fastq.gz
seqtk sample -s 111 human/data/SRR1033463.fastq.gz 0.159711 | gzip --to-stdout > human/data-downsampled/SRR1033463.fastq.gz
seqtk sample -s 111 human/data/SRR1035695.fastq.gz 0.031318 | gzip --to-stdout > human/data-downsampled/SRR1035695.fastq.gz
seqtk sample -s 111 human/data/SRR1047817.fastq.gz 0.030416 | gzip --to-stdout > human/data-downsampled/SRR1047817.fastq.gz
seqtk sam

In [11]:
ls -lh human/data-downsampled | head -n 5
ls human/data-downsampled | wc -l

total 3.4G
-rw-r--r-- 1 apetkau grp_apetkau  116M Dec  9 18:52 SRR038300.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau   83M Dec  9 18:52 SRR039632.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau   74M Dec  9 18:52 SRR1012332.fastq.gz
-rw-r--r-- 1 apetkau grp_apetkau   59M Dec  9 18:52 SRR1024141.fastq.gz
50
