# Break into kmers with Jellyfish

Both BIGSI and HowDeSBT operate on k-mers (of some size k), which are inserted into Bloom filters. While each program can take a variety of input files (and HowDeSBT can count k-mers itself), in order to measure performance we wish to start with a common set of inputs. So, we will break our data into kmers with the program [jellyfish](https://github.com/gmarcais/Jellyfish) ahead of time.
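
To make the term concrete, here is a minimal sketch of what "breaking a sequence into kmers" means: every substring of length k is one kmer. The helper name `list_kmers` and the example sequence are hypothetical and only for illustration; jellyfish does this far more efficiently and also counts the kmers.

```bash
# Print every kmer (substring of length k) of a sequence, one per line.
# list_kmers is a hypothetical helper, not part of jellyfish.
list_kmers() {
    local seq=$1
    local k=$2
    local i
    # A sequence of length n contains n - k + 1 kmers.
    for (( i = 0; i + k <= ${#seq}; i++ ))
    do
        echo "${seq:i:k}"
    done
}

# The 6-base sequence ACGTAC contains four 3-mers: ACG, CGT, GTA, TAC.
list_kmers "ACGTAC" 3
```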

First, let's set up some directories.

In [1]:
data_dir=data-downsampled
kmer_dir=kmer-downsampled
kmer_counts_dir=kmer-counts-jellyfish
kmers_input_sizes="9"

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR

The code given below assumes you have a [conda](https://docs.conda.io/en/latest/) environment set up with [jellyfish](https://github.com/gmarcais/Jellyfish) installed. This can be done with:

```bash
conda create --name jellyfish jellyfish
```

Let's verify that the command exists (and check its version).

In [2]:
conda run --name jellyfish jellyfish --version

jellyfish 2.2.8


Once this step is complete, we need to find the maximum number of kmers in each of our 3 datasets (this value is passed to `jellyfish` to set the hash size). Let's define a bash function for this.

## Find max data kmer cardinality

In [3]:
# Purpose: Gets max kmer cardinality estimates on sequence reads.
# Args:
#      input_dir: The input directory containing all the kmer counts.
#      kmer_size: The kmer size to find the max for.
# Output: Prints the maximum kmer count for the kmer size
#         in this directory (files named like `kmer-9.tsv`).
get_max_kmer_cardinality() {
    input_dir=$1
    kmer_size=$2
    
    cut -f 2 "${input_dir}/kmer-${kmer_size}.tsv" | sort -n | tail -n 1 | awk '{print int($1+0.5)}'
}

Let's test this code out.

In [4]:
get_max_kmer_cardinality "microbial/${kmer_dir}" "31"
get_max_kmer_cardinality "human/${kmer_dir}" "31"
get_max_kmer_cardinality "metagenomics/${kmer_dir}" "31"

38109197
50876500
78652018
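
To see what the pipeline inside the function does, here is a small sketch on a toy file. It assumes, based only on the `cut -f 2` call, that each `kmer-<size>.tsv` file is tab-separated with the cardinality estimate in the second column; the sample names and values below are made up.

```bash
# Same pipeline as get_max_kmer_cardinality, repeated so this sketch is self-contained.
get_max_kmer_cardinality() {
    input_dir=$1
    kmer_size=$2

    cut -f 2 "${input_dir}/kmer-${kmer_size}.tsv" | sort -n | tail -n 1 | awk '{print int($1+0.5)}'
}

# Toy input: hypothetical sample names and made-up estimates.
toy_dir=$(mktemp -d)
printf 'sampleA\t1200.4\nsampleB\t3500.6\nsampleC\t980.0\n' > "${toy_dir}/kmer-9.tsv"

# The largest value in column 2 is 3500.6, which is rounded to 3501.
get_max_kmer_cardinality "${toy_dir}" 9

rm -r "${toy_dir}"
```

The `sort -n | tail -n 1` pair picks the largest estimate, and the `awk` step rounds it to the nearest integer.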


Great. Now that we have this setup, we can move to defining a function to count and produce a list of all kmers in the dataset using `jellyfish`.

## Bash jellyfish kmer count function

In [5]:
# Purpose: Runs jellyfish on sequence reads to count kmers.
# Args:
#      type_dir: The input type directory (e.g., microbial, human, metagenomics).
#      output_dir: A directory to save the jellyfish output into.
#      kmer_sizes: The kmer_sizes to run (separated by spaces).
# Output: Jellyfish kmer counts in the passed output directory.
run_jellyfish() {
    type_dir=$1
    output_dir=$2
    kmer_sizes=$3
    
    input_dir=${type_dir}/${data_dir}
    
    threads=24
    
    rm -rf ${output_dir}
    mkdir ${output_dir}
    
    before=`date +%s`
    
    for kmer_size in ${kmer_sizes}
    do
        # Find max kmers
        max_kmer=`get_max_kmer_cardinality "${type_dir}/${kmer_dir}" "${kmer_size}"`
        
        # Disabled: hash size as 10x the max kmers in dataset.
        # The count command below passes ${max_kmer} to --size directly.
        #hash_size=`echo "10*${max_kmer}" | bc`
        
        output_dir_kmer=${output_dir}/${kmer_size}
        mkdir ${output_dir_kmer}
    
        #for file in ${input_dir}/*.fastq.gz
        for file in ${input_dir}/ERR1144976.fastq.gz
        do
            accession=`basename ${file} .fastq.gz`

            jellyfish_out=${output_dir_kmer}/${accession}.jf
            jellyfish_log=${output_dir_kmer}/jellyfish.count.${accession}.log
        
            command="/usr/bin/time -v jellyfish count --size ${max_kmer} --threads ${threads} --mer-len ${kmer_size} --output ${jellyfish_out} \
                --canonical <(gzip -d --stdout ${file}) 2> ${jellyfish_log}.err 1> ${jellyfish_log}"
            echo ${command}
            conda run --name jellyfish ${command}
            
            kmer_counts_out=${output_dir_kmer}/${accession}.kmer.gz
            kmer_counts_log=${output_dir_kmer}/jellyfish.dump.${accession}.log
            
            command="/usr/bin/time -v jellyfish dump --column --tab ${jellyfish_out} 2> ${kmer_counts_log}.err | gzip --stdout > ${kmer_counts_out}"
            echo ${command}
            conda run --name jellyfish ${command}
            
            rm ${jellyfish_out}
        done
    done
    
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes." ${minutes}
}
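
One caveat about the function above: each pipeline is built as a string and handed to `conda run`. The log files listed later in this notebook show that the redirections did take effect in this run, which suggests that this conda version passed the string through a shell. That behaviour is an assumption worth checking on other conda versions, because a plain shell does not interpret `>`, `|` or `<(...)` that appear inside an expanded variable. A minimal sketch of the difference, using a harmless `echo` instead of jellyfish:

```bash
# A command line stored in a string, as run_jellyfish does.
cmd="echo hello > /dev/null"

# Plain expansion: '>' and '/dev/null' are passed to echo as ordinary arguments,
# so this prints the text "hello > /dev/null".
${cmd}

# Handing the string to a shell: the redirection is interpreted, so nothing is printed.
bash -c "${cmd}"
```

If the redirections are ever written out literally instead of creating log files, wrapping the string explicitly (for example `conda run --name jellyfish bash -c "${command}"`) is one possible workaround.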

Now that we've got our jellyfish code defined, let's run it on a dataset.

## Microbial kmer generation

Let's first generate a list of all kmers (for different sizes) for the microbial dataset.

In [8]:
input_dir_type="microbial"
run_jellyfish "${input_dir_type}" "${input_dir_type}/${kmer_counts_dir}" "${kmers_input_sizes}"

/usr/bin/time -v jellyfish count --size 131085 --threads 24 --mer-len 9 --output microbial/kmer-counts-jellyfish/9/ERR1144976.jf --canonical <(gzip -d --stdout microbial/data-downsampled/ERR1144976.fastq.gz) 2> microbial/kmer-counts-jellyfish/9/jellyfish.count.ERR1144976.log.err 1> microbial/kmer-counts-jellyfish/9/jellyfish.count.ERR1144976.log
/usr/bin/time -v jellyfish dump --column --tab microbial/kmer-counts-jellyfish/9/ERR1144976.jf 2> microbial/kmer-counts-jellyfish/9/jellyfish.dump.ERR1144976.log.err | gzip --stdout > microbial/kmer-counts-jellyfish/9/ERR1144976.kmer.gz
Done. Took 0.05 minutes.
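
As a sanity check on the `--size 131085` value in the output above: with `--canonical`, a kmer and its reverse complement are counted as one. For odd k no kmer equals its own reverse complement, so there are at most 4^k / 2 distinct canonical kmers. The helper name `max_canonical_kmers` below is hypothetical, and the sketch only holds for odd k up to 31 (so the result fits in bash's 64-bit integers).

```bash
# Upper bound on the number of distinct canonical kmers for odd k: 4^k / 2.
# max_canonical_kmers is a hypothetical helper for illustration.
max_canonical_kmers() {
    local k=$1
    echo $(( 4**k / 2 ))
}

# For k=9 this prints 131072.
max_canonical_kmers 9
```

The estimate of 131085 used above is very close to this bound, which suggests that nearly every possible 9-mer occurs in the data.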

Awesome. We've generated our kmer list. Let's take a look at the files.

In [10]:
ls -lh ${input_dir_type}/${kmer_counts_dir}/9

total 676K
-rw-r--r-- 1 apetkau grp_apetkau 667K Dec 10 16:48 ERR1144976.kmer.gz
-rw-r--r-- 1 apetkau grp_apetkau    0 Dec 10 16:48 jellyfish.count.ERR1144976.log
-rw-r--r-- 1 apetkau grp_apetkau  860 Dec 10 16:48 jellyfish.count.ERR1144976.log.err
-rw-r--r-- 1 apetkau grp_apetkau  791 Dec 10 16:48 jellyfish.dump.ERR1144976.log.err


Let's look at what data we've generated.

In [12]:
zcat ${input_dir_type}/${kmer_counts_dir}/9/ERR1144976.kmer.gz | head -n 5

AAAAAAAAA	2871
CCCGGTGGC	5395
ATGCTCAAG	720
TAACTGACA	39
GGTACAACC	421

gzip: stdout: Broken pipe


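This file contains a list of all kmers along with counts of the kmers in the dataset. The `gzip: stdout: Broken pipe` message is harmless: `head` exits after printing five lines and closes the pipe while `zcat` is still writing.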
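
A file in this two-column format (kmer, tab, count) is easy to summarize with standard tools. The sketch below uses a hypothetical helper, `summarize_kmer_counts`, and a made-up toy file in place of the real `ERR1144976.kmer.gz`; it reports the number of distinct kmers and the total number of kmer occurrences.

```bash
# Summarize a gzipped kmer<TAB>count file: distinct kmers and total occurrences.
# summarize_kmer_counts is a hypothetical helper for illustration.
summarize_kmer_counts() {
    gzip -d --stdout < "$1" \
        | awk -F '\t' '{ total += $2 } END { printf "distinct=%d total=%d\n", NR, total }'
}

# Toy input with made-up counts, in the same format as the jellyfish dump output.
toy_file=$(mktemp)
printf 'AAAAAAAAA\t3\nCCCGGTGGC\t5\nATGCTCAAG\t2\n' | gzip --stdout > "${toy_file}"

# Three lines with counts 3, 5 and 2: prints "distinct=3 total=10".
summarize_kmer_counts "${toy_file}"

rm "${toy_file}"
```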