# Break into kmers with Jellyfish

Both BIGSI and HowDeSBT operate on k-mers (of some size k), which are inserted into Bloom filters. While each program can take a variety of input files (and HowDeSBT can count k-mers itself), in order to measure performance we wish to start with a common set of inputs. So, we will break our data into kmers with the program [jellyfish](https://github.com/gmarcais/Jellyfish) ahead of time.

First, let's set up some directories and parameters.

In [1]:
data_dir=data-downsampled
kmer_dir=kmer-downsampled
kmer_counts_dir=kmer-counts-jellyfish
kmers_input_sizes="9 11 13 15 17"

jobs=50
threads=4

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR

The code below assumes you have a [conda](https://docs.conda.io/en/latest/) environment named `jellyfish` with [jellyfish](https://github.com/gmarcais/Jellyfish) installed. You can create it with:

```bash
conda create --name jellyfish jellyfish
```

Let's verify the command exists (and check the version).

In [2]:
conda run --name jellyfish jellyfish --version

jellyfish 2.2.8


Once this step is complete, we need to figure out the maximum number of kmers for our 3 datasets (for passing to `jellyfish` to set the hash size). Let's define a bash function for this.

## Find max data kmer cardinality

In [3]:
# Purpose: Gets max kmer cardinality estimates on sequence reads.
# Args:
#      input_dir: The input directory containing all the kmer counts.
#      kmer_size: The kmer_size to find the max.
# Output: Prints the maximum kmer count for the kmer size
#         in this directory (files named like `kmer-9.tsv`).
get_max_kmer_cardinality() {
    input_dir=$1
    kmer_size=$2
    
    cut -f 2 "${input_dir}/kmer-${kmer_size}.tsv" | sort -n | tail -n 1 | awk '{print int($1+0.5)}'
}

Let's test this code out.

In [4]:
get_max_kmer_cardinality "microbial/${kmer_dir}" "13"
get_max_kmer_cardinality "human/${kmer_dir}" "13"
get_max_kmer_cardinality "metagenomics/${kmer_dir}" "13"

4912734
5580383
7067981
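
To see what the pipeline inside `get_max_kmer_cardinality` does, here is a small sketch on made-up data. The file layout (an accession in column 1, a possibly fractional cardinality estimate in column 2) and the sample names are assumptions for illustration; only the `cut | sort | tail | awk` pipeline is taken from the function above.

```bash
# Toy input (hypothetical values): column 1 is a sample name, column 2 is an
# estimated kmer cardinality, which may be fractional.
demo_dir=$(mktemp -d)
printf 'sampleA\t1200.4\nsampleB\t980.0\nsampleC\t1500.6\n' > "${demo_dir}/kmer-13.tsv"

# Same pipeline as get_max_kmer_cardinality: take column 2, sort numerically,
# keep the largest value, and round it to the nearest integer.
demo_max=$(cut -f 2 "${demo_dir}/kmer-13.tsv" | sort -n | tail -n 1 | awk '{print int($1+0.5)}')
echo "${demo_max}"

rm -r "${demo_dir}"
```

The `int($1+0.5)` step rounds to the nearest whole number, so `jellyfish` receives an integer for its `--size` option.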


Great. Now that we have this set up, we can move on to defining a function that uses `jellyfish` to count kmers and produce a list of all kmers in each sample of a dataset.

## Bash jellyfish kmer count function

In [5]:
# Purpose: Runs jellyfish on sequence reads to count kmers.
# Args:
#      type_dir: The input type directory (e.g., microbial).
#      output_dir: A directory to save the jellyfish output into.
#      kmer_sizes: The kmer_sizes to run (separated by spaces).
# Output: Jellyfish kmer counts in the passed output directory.
run_jellyfish() {
    type_dir=$1
    output_dir=$2
    kmer_sizes=$3
    
    input_dir=${type_dir}/${data_dir}
    
    rm -rf ${output_dir}
    mkdir ${output_dir}
    
    before=`date +%s`
    
    for kmer_size in ${kmer_sizes}
    do
        # Find max kmers
        max_kmer=`get_max_kmer_cardinality "${type_dir}/${kmer_dir}" "${kmer_size}"`
        
        # The hash size passed to jellyfish (--size) is currently max_kmer itself.
        # The 10x variant below is disabled.
        #hash_size=`echo "10*${max_kmer}" | bc`
        
        output_dir_kmer=${output_dir}/${kmer_size}
        mkdir ${output_dir_kmer}
        
        commands_file=`mktemp`
    
        # Let's generate a list of commands to a temporary ${commands_file}
        for file in ${input_dir}/*.fastq.gz
        do
            accession=`basename ${file} .fastq.gz`

            jellyfish_out=${output_dir_kmer}/${accession}.jf
            jellyfish_log=${output_dir_kmer}/jellyfish.count.${accession}.log
                        
            kmer_counts_out=${output_dir_kmer}/${accession}.kmer.gz
            kmer_counts_log=${output_dir_kmer}/jellyfish.dump.${accession}.log
        
            # Command to generate a list of kmers with jellyfish and dump to a text file (gzipped)
            command="/usr/bin/time -v jellyfish count --size ${max_kmer} --threads ${threads} --mer-len ${kmer_size} --output ${jellyfish_out} \
                --canonical <(gzip -d --stdout ${file}) 2> ${jellyfish_log}.err 1> ${jellyfish_log} && \
                /usr/bin/time -v jellyfish dump --column --tab ${jellyfish_out} 2> ${kmer_counts_log}.err | cut -f 1 | gzip --stdout > ${kmer_counts_out} && \
                rm ${jellyfish_out}"
            echo ${command} >> ${commands_file}
        done
        
        # Now, let's execute those commands in parallel
        printf "Will execute commands from [%s] like:\n" ${commands_file}
        head -n 1 ${commands_file}
        
        command="parallel -j ${jobs} -a ${commands_file}"
        echo $command
        ${command}
    done
        
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes." ${minutes}
}
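
The core pattern in `run_jellyfish` is to write one command line per input file into a temporary file and then hand that file to `parallel`. Here is a minimal dry-run sketch of that pattern. The `SAMPLE1`/`SAMPLE2` names and the `demo_*` variables are hypothetical, an `echo` stands in for the real `jellyfish count`/`jellyfish dump` pipeline, and the commands are run sequentially with `bash` rather than with `parallel`.

```bash
# Stand-ins for ${type_dir}/${data_dir}, ${output_dir}/${kmer_size} and ${commands_file}.
demo_in=$(mktemp -d)
demo_out=$(mktemp -d)
demo_commands=$(mktemp)

# Empty placeholder inputs with hypothetical accessions.
touch "${demo_in}/SAMPLE1.fastq.gz" "${demo_in}/SAMPLE2.fastq.gz"

for file in "${demo_in}"/*.fastq.gz
do
    # Strip the directory and the .fastq.gz suffix to get the accession.
    accession=$(basename "${file}" .fastq.gz)
    # One command per line; 'echo' replaces the real jellyfish pipeline here.
    echo "echo would write ${demo_out}/${accession}.kmer.gz" >> "${demo_commands}"
done

demo_command_count=$(( $(wc -l < "${demo_commands}") ))
echo "Generated ${demo_command_count} commands"

# Run the commands one after another (the notebook passes this file to parallel instead).
demo_run_output=$(bash "${demo_commands}")
echo "${demo_run_output}"

rm -rf "${demo_in}" "${demo_out}" "${demo_commands}"
```

Writing the commands to a file first also makes it easy to inspect exactly what will run, which is what the `head -n 1 ${commands_file}` line in the function is for.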

Let's generate some basic stats on these kmer lists.

## Bash kmer stats

In [6]:
# Purpose: Generates stats on kmers.
# Args:
#      input_dir: Input dir for kmer directories (e.g., 9, 15, etc)
# Output: Jellyfish kmer stats for each kmer size (subdirectory) for each sample.
jellyfish_kmer_stats() {
    input_dir=$1
    
    echo -e "kmer\ttotal sizes" > ${input_dir}/kmer-file-sizes.tsv
    for kmer_dir_stats in ${input_dir}/*
    do
        kmer=`basename ${kmer_dir_stats}`
    
        # Skip the one non-directory file we create above
        if [ -d ${kmer_dir_stats} ]
        then
            echo -e "accession\tkmer_count\tfile_size_kb" > ${kmer_dir_stats}/kmer-counts.tsv
            for kmer_file in ${kmer_dir_stats}/*.kmer.gz
            do
                accession=`basename ${kmer_file} .kmer.gz`
                kmer_count=`zcat ${kmer_file} | wc -l`
                file_size=`du -sk ${kmer_file} | cut -f 1`
                echo -e "$accession\t${kmer_count}\t${file_size}" >> ${kmer_dir_stats}/kmer-counts.tsv
            done

            total=`du -ch ${kmer_dir_stats}/*.kmer.gz | grep total | sed -e 's/total//'`
            echo -e "${kmer}\t${total}" >> ${input_dir}/kmer-file-sizes.tsv
        fi
    done
}
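
As a quick illustration of the per-file stats, here is a sketch that builds a tiny gzipped kmer list and computes the same three columns. The file name `SAMPLE1.kmer.gz` and its contents are made up, and `gzip -dc` is used in place of `zcat` (an assumption made for portability; the two are equivalent for `.gz` files on GNU systems).

```bash
demo_dir=$(mktemp -d)
demo_file="${demo_dir}/SAMPLE1.kmer.gz"

# A hypothetical kmer list with three kmers, one per line.
printf 'AAAAAAAAA\nAAAAAAAAC\nAAAAAAAAG\n' | gzip --stdout > "${demo_file}"

# Accession: the file name without the directory and the .kmer.gz suffix.
accession=$(basename "${demo_file}" .kmer.gz)
# Kmer count: the number of lines in the decompressed file.
kmer_count=$(( $(gzip -dc "${demo_file}" | wc -l) ))
# File size: disk usage of the compressed file in KB.
file_size=$(du -sk "${demo_file}" | cut -f 1)

printf 'accession\tkmer_count\tfile_size_kb\n'
printf '%s\t%s\t%s\n' "${accession}" "${kmer_count}" "${file_size}"

rm -r "${demo_dir}"
```

Because each line holds one distinct kmer, the line count is an exact kmer count rather than an estimate.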

Now that we've got our jellyfish functions defined, let's run them on a dataset.

## Microbial kmer generation

Let's first generate a list of all kmers (for different sizes) for the microbial dataset.

In [7]:
input_dir_type="microbial"
run_jellyfish "${input_dir_type}" "${input_dir_type}/${kmer_counts_dir}" "${kmers_input_sizes}"

Will execute commands from [/tmp/tmp.0T5oX4Tml2] like:
/usr/bin/time -v jellyfish count --size 131021 --threads 4 --mer-len 9 --output microbial/kmer-counts-jellyfish/9/ERR1144974.jf --canonical <(gzip -d --stdout microbial/data-downsampled/ERR1144974.fastq.gz) 2> microbial/kmer-counts-jellyfish/9/jellyfish.count.ERR1144974.log.err 1> microbial/kmer-counts-jellyfish/9/jellyfish.count.ERR1144974.log && /usr/bin/time -v jellyfish dump --column --tab microbial/kmer-counts-jellyfish/9/ERR1144974.jf 2> microbial/kmer-counts-jellyfish/9/jellyfish.dump.ERR1144974.log.err | cut -f 1 | gzip --stdout > microbial/kmer-counts-jellyfish/9/ERR1144974.kmer.gz && rm microbial/kmer-counts-jellyfish/9/ERR1144974.jf
parallel -j 50 -a /tmp/tmp.0T5oX4Tml2
Will execute commands from [/tmp/tmp.nDZmpHprSV] like:
/usr/bin/time -v jellyfish count --size 1621748 --threads 4 --mer-len 11 --output microbial/kmer-counts-jellyfish/11/ERR1144974.jf --canonical <(gzip -d --stdout microbial/data-downsampled/ERR1144974.

Awesome. We've generated our kmer list. Let's look at the files.

In [8]:
ls -lh ${input_dir_type}/${kmer_counts_dir}/9 | head -n 5

total 14M
-rw-r--r-- 1 apetkau grp_apetkau 258K Dec 11 13:52 ERR1144974.kmer.gz
-rw-r--r-- 1 apetkau grp_apetkau 258K Dec 11 13:52 ERR1144975.kmer.gz
-rw-r--r-- 1 apetkau grp_apetkau 258K Dec 11 13:52 ERR1144976.kmer.gz
-rw-r--r-- 1 apetkau grp_apetkau 258K Dec 11 13:52 ERR1144977.kmer.gz
ls: write error: Broken pipe


Let's look at what data we've generated.

In [9]:
zcat ${input_dir_type}/${kmer_counts_dir}/9/ERR1144976.kmer.gz | head -n 5

AAAAAAAAA
AAAAAAAAC
AAAAAAAAG
AAAAAAAAT
AAAAAAACA

gzip: stdout: Broken pipe


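This file contains a list of all distinct kmers found in the sample, one per line. The counts reported by `jellyfish dump` are dropped by the `cut -f 1` step, so only the kmers themselves remain.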
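
Since we ran `jellyfish count` with `--canonical`, each kmer and its reverse complement are treated as the same kmer, represented by whichever of the two comes first lexicographically. The helper below, `canonical_kmer`, is a hypothetical function written for illustration only (it is not part of `jellyfish`) and sketches that rule.

```bash
# Print the canonical form of a kmer: the lexicographically smaller of the
# kmer and its reverse complement. Assumes uppercase A/C/G/T input.
canonical_kmer() {
    local kmer=$1
    local revcomp
    # Reverse the string, then complement each base.
    revcomp=$(echo "${kmer}" | rev | tr 'ACGT' 'TGCA')
    if [[ "${kmer}" < "${revcomp}" ]]
    then
        echo "${kmer}"
    else
        echo "${revcomp}"
    fi
}

canonical_kmer "TTTTTTTTT"
canonical_kmer "AAAAAAACA"
```

The first call prints `AAAAAAAAA` (the reverse complement of `TTTTTTTTT`), and the second prints `AAAAAAACA` unchanged, since its reverse complement `TGTTTTTTT` sorts later.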

Let's now generate some basic stats.

In [10]:
jellyfish_kmer_stats "${input_dir_type}/${kmer_counts_dir}"

Now let's look at what we have.

In [11]:
column -s$'\t' -t ${input_dir_type}/${kmer_counts_dir}/9/kmer-counts.tsv | head -n 5
wc -l ${input_dir_type}/${kmer_counts_dir}/9/kmer-counts.tsv

accession    kmer_count  file_size_kb
ERR1144974   126204      260
ERR1144975   126306      260
ERR1144976   126216      260
ERR1144977   126135      260
51 microbial/kmer-counts-jellyfish/9/kmer-counts.tsv


This file contains the true (not estimated) number of distinct kmers in each file, along with the compressed file sizes in KB. Let's look at the total size of the files `jellyfish` made (compressed).

In [12]:
cat ${input_dir_type}/${kmer_counts_dir}/kmer-file-sizes.tsv

kmer	total sizes
11	217M	
13	622M	
15	854M	
17	987M	
9	13M	


This lists the total compressed size of the kmer files we've generated, for each kmer size.
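
The rows come out in the order 11, 13, 15, 17, 9 because the shell expands `${input_dir}/*` in lexicographic order. If a numeric order is easier to read, one option is to keep the header and sort the remaining rows numerically. The sketch below works on a temporary copy of the table shown above; the `demo_*` variable names are made up for this example.

```bash
demo_sizes=$(mktemp)
printf 'kmer\ttotal sizes\n11\t217M\n13\t622M\n15\t854M\n17\t987M\n9\t13M\n' > "${demo_sizes}"

# Print the header line first, then the data rows sorted numerically on column 1.
demo_sorted=$( { head -n 1 "${demo_sizes}"; tail -n +2 "${demo_sizes}" | sort -t $'\t' -k1,1n; } )
echo "${demo_sorted}"

rm "${demo_sizes}"
```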

## Metagenomics kmer generation

Let's continue with the metagenomics data.

In [13]:
input_dir_type="metagenomics"
run_jellyfish "${input_dir_type}" "${input_dir_type}/${kmer_counts_dir}" "${kmers_input_sizes}"

Will execute commands from [/tmp/tmp.RW6EYF0iPq] like:
/usr/bin/time -v jellyfish count --size 131085 --threads 4 --mer-len 9 --output metagenomics/kmer-counts-jellyfish/9/ERR1713331.jf --canonical <(gzip -d --stdout metagenomics/data-downsampled/ERR1713331.fastq.gz) 2> metagenomics/kmer-counts-jellyfish/9/jellyfish.count.ERR1713331.log.err 1> metagenomics/kmer-counts-jellyfish/9/jellyfish.count.ERR1713331.log && /usr/bin/time -v jellyfish dump --column --tab metagenomics/kmer-counts-jellyfish/9/ERR1713331.jf 2> metagenomics/kmer-counts-jellyfish/9/jellyfish.dump.ERR1713331.log.err | cut -f 1 | gzip --stdout > metagenomics/kmer-counts-jellyfish/9/ERR1713331.kmer.gz && rm metagenomics/kmer-counts-jellyfish/9/ERR1713331.jf
parallel -j 50 -a /tmp/tmp.RW6EYF0iPq
Will execute commands from [/tmp/tmp.TFNMYsug4M] like:
/usr/bin/time -v jellyfish count --size 1892737 --threads 4 --mer-len 11 --output metagenomics/kmer-counts-jellyfish/11/ERR1713331.jf --canonical <(gzip -d --stdout metagenomic

Let's look at the output.

In [14]:
jellyfish_kmer_stats "${input_dir_type}/${kmer_counts_dir}"
column -s$'\t' -t ${input_dir_type}/${kmer_counts_dir}/9/kmer-counts.tsv | head -n 5
wc -l ${input_dir_type}/${kmer_counts_dir}/9/kmer-counts.tsv
cat ${input_dir_type}/${kmer_counts_dir}/kmer-file-sizes.tsv

accession   kmer_count  file_size_kb
ERR1713331  131068      408
ERR1713332  131068      408
ERR1713333  131070      408
ERR1713334  131070      408
51 metagenomics/kmer-counts-jellyfish/9/kmer-counts.tsv
kmer	total sizes
11	332M	
13	1.4G	
15	2.0G	
17	2.3G	
9	20M	
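
The 9-mer counts above are all close to 131,070, which is worth a quick sanity check. For odd k, no kmer can equal its own reverse complement (the middle base would have to be its own complement), so the 4^k possible kmers pair up into 4^k / 2 canonical kmers. The helper `max_canonical_kmers` below is a hypothetical function for illustration, and it assumes k is odd, as it is for all kmer sizes used here.

```bash
# Upper bound on the number of distinct canonical kmers for an odd kmer size.
max_canonical_kmers() {
    local k=$1
    echo $(( 4**k / 2 ))
}

for k in 9 11 13 15 17
do
    printf '%s\t%s\n' "${k}" "$(max_canonical_kmers "${k}")"
done
```

For k=9 this gives 131072, so the metagenomics samples contain nearly every possible canonical 9-mer.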


On to the human data.

## Human kmer generation

In [15]:
input_dir_type="human"
run_jellyfish "${input_dir_type}" "${input_dir_type}/${kmer_counts_dir}" "${kmers_input_sizes}"

Will execute commands from [/tmp/tmp.DUN83VdHix] like:
/usr/bin/time -v jellyfish count --size 131013 --threads 4 --mer-len 9 --output human/kmer-counts-jellyfish/9/SRR038300.jf --canonical <(gzip -d --stdout human/data-downsampled/SRR038300.fastq.gz) 2> human/kmer-counts-jellyfish/9/jellyfish.count.SRR038300.log.err 1> human/kmer-counts-jellyfish/9/jellyfish.count.SRR038300.log && /usr/bin/time -v jellyfish dump --column --tab human/kmer-counts-jellyfish/9/SRR038300.jf 2> human/kmer-counts-jellyfish/9/jellyfish.dump.SRR038300.log.err | cut -f 1 | gzip --stdout > human/kmer-counts-jellyfish/9/SRR038300.kmer.gz && rm human/kmer-counts-jellyfish/9/SRR038300.jf
parallel -j 50 -a /tmp/tmp.DUN83VdHix
Will execute commands from [/tmp/tmp.EQNAR0GGH4] like:
/usr/bin/time -v jellyfish count --size 1618141 --threads 4 --mer-len 11 --output human/kmer-counts-jellyfish/11/SRR038300.jf --canonical <(gzip -d --stdout human/data-downsampled/SRR038300.fastq.gz) 2> human/kmer-counts-jellyfish/11/jellyf

Let's look at the output.

In [16]:
jellyfish_kmer_stats "${input_dir_type}/${kmer_counts_dir}"
column -s$'\t' -t ${input_dir_type}/${kmer_counts_dir}/9/kmer-counts.tsv | head -n 5
wc -l ${input_dir_type}/${kmer_counts_dir}/9/kmer-counts.tsv
cat ${input_dir_type}/${kmer_counts_dir}/kmer-file-sizes.tsv

accession   kmer_count  file_size_kb
SRR038300   130917      264
SRR039632   130864      264
SRR1012332  83083       260
SRR1024141  124755      256
51 human/kmer-counts-jellyfish/9/kmer-counts.tsv
kmer	total sizes
11	179M	
13	552M	
15	720M	
17	798M	
9	13M	


Awesome. We're all done converting our data into a common format for both BIGSI and HowDeSBT.