 # Downsample data
 
To keep run time and resource usage reasonable, we are going to downsample (reduce the size of) the FASTQ files we've downloaded.
 
Let's set up some environment variables.

In [None]:
# Resolve the repository root and work from there (quoted in case the path contains spaces).
PROJECT_DIR=$(git rev-parse --show-toplevel)
cd "${PROJECT_DIR}"

Now, let's reduce the size of the datasets.

The downsampling code given below assumes you have the following [conda](https://docs.conda.io/en/latest/) environment set up: [seqtk](https://github.com/lh3/seqtk). The downsampling function also calls GNU `parallel`, so that needs to be available as well.

```bash
conda create --name seqtk seqtk
```

## Get total bases per dataset

Once that's set up, we'll also need a way to get the total number of bases in each dataset. Luckily, these were logged by `fastp` in the download step. Let's take a look at one of these log files:

In [None]:
cat microbial/data/log.fastp.SRR10527353.err

We need to extract the total number of base pairs from this log. Let's build a bash function to do this.

### Extract base pairs

In [None]:
# Purpose: Extract base pairs
# Args:
#      fastp_file: The fastp log file
# Output: Prints the total base pairs from this log file.
extract_total_base_pairs() {
    fastp_file=$1
    
    # Test if paired-end dataset (the Read2 header may be spelled 'after filtering' or 'aftering filtering').
    if grep -q -E 'Read2 after(ing)? filtering' "${fastp_file}" 2> /dev/null;
    then
        read1_bases=`grep 'Read1 after filtering' -A 2 "${fastp_file}" | grep 'total bases' | sed -e 's/total bases: //'`
        read2_bases=`grep -E 'Read2 after(ing)? filtering' -A 2 "${fastp_file}" | grep 'total bases' | sed -e 's/total bases: //'`
        
        echo "(${read1_bases}+${read2_bases})" | bc
    else
        read_bases=`grep 'Read1 after filtering' -A 2 ${fastp_file} | grep 'total bases' | sed -e 's/total bases: //'`
        
        # Prints the same number, but bc reports an error if '$read_bases' is not a number.
        echo "(${read_bases})" | bc
    fi
}

Let's test it out.

In [None]:
extract_total_base_pairs "microbial/data/log.fastp.SRR10527353.err"

Looks good.
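To see what the `grep`/`sed` pipeline inside the function is doing, here is a minimal sketch that runs it on a made-up log excerpt. The layout of the excerpt is an assumption modelled on the `fastp` summary, and the numbers are invented for illustration (they are not taken from `SRR10527353`).

```bash
# Hypothetical fastp log excerpt; the numbers are invented for illustration.
sample_log=$(mktemp)
cat > "${sample_log}" <<'EOF'
Read1 before filtering:
total reads: 1000
total bases: 150000

Read1 after filtering:
total reads: 900
total bases: 130000
Q20 bases: 120000
EOF

# Keep the header line plus the two lines after it, select 'total bases',
# then strip the label so only the number remains.
read_bases=$(grep 'Read1 after filtering' -A 2 "${sample_log}" | grep 'total bases' | sed -e 's/total bases: //')
echo "${read_bases}"   # prints 130000

rm -f "${sample_log}"
```

Because `-A 2` only keeps two lines after the header, the `total bases` line from the "before filtering" section is never picked up.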

Now, let's define a function to downsample the reads, using the number of base pairs to determine the size of the subsample.

## Downsample reads

In [None]:
# Purpose: Downsample reads
# Args:
#      input_dir: The data input directory.
#      output_dir: The downsampled data output directory.
# Output: Saves downsampled files in $output_dir
downsample_reads() {
    input_dir=$1
    output_dir=$2
    
    mkdir -p "${output_dir}" # -p: do not fail if the directory already exists
    
    jobs=50
    max_bp=100000000 # 100 million
    
    commands_file=`mktemp`
    
    for file in ${input_dir}/*.fastq.gz
    do
        accession=`basename ${file} .fastq.gz`
        
        total_bp=`extract_total_base_pairs "${input_dir}/log.fastp.${accession}.err"`
        reduction_ratio=`echo "${max_bp}/${total_bp}" | bc -l`
        reduction_ratio=`printf "%0.6f" ${reduction_ratio}` # rounding/including leading zero
        
        if [ "${max_bp}" -lt "${total_bp}" ]
        then
            echo "seqtk sample -s 111 ${file} ${reduction_ratio} | gzip --to-stdout > ${output_dir}/${accession}.fastq.gz" >> ${commands_file}
        else
            echo "cp ${file} ${output_dir}/${accession}.fastq.gz" >> ${commands_file}
        fi
    done
    
    echo "Will execute commands from file ${commands_file} like:"
    cat ${commands_file}
    
    command="parallel -j ${jobs} -a ${commands_file}"
    echo ${command}
    conda run --name seqtk ${command}
}
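The sizing rule in the function can be checked on its own. The sketch below is a hypothetical helper (`sampling_fraction` is not part of the notebook) that reproduces the decision: if a dataset has more than `max_bp` bases, sample the fraction `max_bp / total_bp`; otherwise copy the file unchanged. It uses `awk` instead of `bc` for the division purely so the example has no extra dependency, and the base-pair totals are made up.

```bash
# Hypothetical helper: prints the seqtk sampling fraction (6 decimal places),
# or "copy" when the dataset is already at or below the cap.
sampling_fraction() {
    local total_bp=$1
    local max_bp=$2

    if [ "${max_bp}" -lt "${total_bp}" ]; then
        awk -v max="${max_bp}" -v total="${total_bp}" 'BEGIN { printf "%.6f\n", max / total }'
    else
        echo "copy"
    fi
}

sampling_fraction 250000000 100000000   # prints 0.400000
sampling_fraction 80000000 100000000    # prints copy
```

A dataset with 250 million bases would therefore be sampled at a fraction of 0.4, which gives roughly 100 million bases if read lengths are similar across the file.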

Great, the function is defined. Let's run it on the microbial dataset.

## Downsample microbial reads

In [None]:
downsample_reads "microbial/data" "microbial/data-downsampled"

Okay. Let's check out the data.

In [None]:
ls -lh microbial/data-downsampled | head -n 5
ls microbial/data-downsampled/ | wc -l
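The listing only shows file sizes on disk. As a rough cross-check that a downsampled file sits near the 100 million base-pair cap, the bases can be counted directly. This is a minimal sketch, assuming standard four-line FASTQ records; `count_bases` and the tiny demo file are hypothetical and not part of the original workflow.

```bash
# Hypothetical helper: count the bases in a gzipped FASTQ file.
# In a four-line FASTQ record the sequence is line 2 of every group of 4.
count_bases() {
    gzip -dc < "$1" | awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }'
}

# Tiny made-up FASTQ file (two reads of 4 and 6 bases), just to demonstrate the helper.
demo_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' | gzip -c > "${demo_fastq}"

demo_bases=$(count_bases "${demo_fastq}")
echo "${demo_bases}"   # prints 10

rm -f "${demo_fastq}"
```

The same helper could be pointed at any file in `microbial/data-downsampled`. Since `seqtk sample` picks reads at random, expect the count to be close to the cap, not exactly equal to it.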

Looks good. Let's try it on the metagenomics data.

## Downsample metagenomics reads 

In [None]:
downsample_reads "metagenomics/data" "metagenomics/data-downsampled"

In [None]:
ls -lh metagenomics/data-downsampled | head -n 5
ls metagenomics/data-downsampled | wc -l

Great. Now let's do this on the human data.

## Downsample human reads

In [None]:
downsample_reads "human/data" "human/data-downsampled"

In [None]:
ls -lh human/data-downsampled | head -n 5
ls human/data-downsampled | wc -l