# Kmer cardinality in dataset

Let's estimate the number of unique k-mers (the set cardinality) in each of our datasets to get an idea of their k-mer diversity.

First, we'll set up some variables.

In [None]:
kmer_sizes_list="9 15 21 25 29 31"
data_dir_name="data"

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR
ls

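Here, `kmer_sizes_list` is a space-separated list of the k-mer sizes we will count. `data_dir_name` is the name of the subdirectory that holds the input files within each data type's directory, and `PROJECT_DIR` is used to move to the root of the git repository so that the relative paths below resolve correctly.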
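Because the k-mer sizes are stored as a plain space-separated string, an unquoted expansion in a `for` loop splits the string into one word per size. A minimal sketch of that behaviour, assuming the default `IFS` (the loop variable `k` and the counter `n_sizes` are illustrative names, not part of the notebook):

```shell
kmer_sizes_list="9 15 21 25 29 31"

# Unquoted on purpose: word splitting turns the string into six separate values.
n_sizes=0
for k in ${kmer_sizes_list}
do
    echo "k-mer size: ${k}"
    n_sizes=$((n_sizes + 1))
done
echo "Total sizes: ${n_sizes}"
```

Quoting the expansion (`"${kmer_sizes_list}"`) would instead pass the whole string as a single value, which is what we want when handing the list to a function as one argument.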

## K-mer size estimation code

Now, let's set up the code for doing the k-mer counting. We'll be using the program [dashing](https://github.com/dnbaker/dashing), and specifically the `dashing hll` command, to estimate the number of unique k-mers in our dataset using the HyperLogLog algorithm.

For this to work, you'll need a [conda](https://docs.conda.io/en/latest/) environment called `dashing` that contains the binary `dashing_s512`. The binary was not available as part of conda, so I created the environment and copied the binary into its `bin/` directory.
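Before running anything, it can help to confirm that the binary is actually visible. The sketch below uses a hypothetical helper, `have_program`, built on the shell builtin `command -v`. It only checks the current `PATH`, so it assumes the `dashing` environment has already been activated in the shell where you run it.

```shell
# Hypothetical helper (not part of the notebook): succeeds if the named
# program can be found on the current PATH.
have_program() {
    command -v "$1" > /dev/null 2>&1
}

if have_program dashing_s512
then
    echo "dashing_s512 found"
else
    echo "dashing_s512 not found; check the 'dashing' conda environment"
fi
```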

In [None]:
# Purpose: Counts k-mers using `dashing`
# Args:
#      data_type_dir: The directory for the specific data type (human, microbial, metagenomics).
#      kmer_sizes: A string listing the k-mer sizes, separated by spaces (e.g., "9 15 21").
count_kmers() {
    data_type_dir=$1
    kmer_sizes=$2
    
    threads=4
    jobs=50
    
    data_dir=${data_type_dir}/${data_dir_name}
    kmer_output_dir=${data_type_dir}/kmer
    
    # Quote the path so names containing spaces cannot split into extra arguments.
    rm -rf "${kmer_output_dir}"
    mkdir "${kmer_output_dir}"
    
    # Make string of all file names, minus directory and '.fastq.gz' part.
    # E.g., "dir/file1.fastq.gz dir/file2.fastq.gz" becomes "file1 file2"
    files=''
    for f in ${data_dir}/*.fastq.gz
    do
        name=`basename ${f} .fastq.gz`
        files="${files} ${name}"
    done
    
    before=`date +%s`
    
    for kmer_size in ${kmer_sizes}
    do
        output=${kmer_output_dir}/kmer-${kmer_size}.tsv
        log=${kmer_output_dir}/kmer-${kmer_size}.log
        
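        # Build the pipeline as a string. The escaped redirect and pipes (\2\>\>, \|)
        # are meant to run inside each parallel job, where % is replaced by the
        # sample name; the final '>' is meant to collect the output of all jobs.
        command="parallel --jobs ${jobs} -I% \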
            dashing_s512 hll -k ${kmer_size} -p ${threads} ${data_dir}/%.fastq.gz \2\>\> ${log}.err \| \
            grep 'Estimated number of unique exact matches' \| \
            sed -e 's/Estimated number of unique exact matches: /%\t/' \
            ::: ${files} > ${output}"
            
        echo ${command}
        conda run --name dashing ${command}
    done
    
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes.\n" "${minutes}"
}
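To make the `grep`/`sed` step concrete, here is a small sketch of what it does to a single line of output. The sample name and the numeric value are made up for illustration; only the `Estimated number of unique exact matches: ` prefix comes from the command above. In `count_kmers`, `parallel` substitutes the sample name for `%`; here a shell variable plays that role. A literal tab from `printf` stands in for `\t`, since not every `sed` implementation expands `\t` in the replacement.

```shell
# Hypothetical inputs, for illustration only.
sample="file1"
line="Estimated number of unique exact matches: 12345"

# Keep only the estimate line and replace its prefix with "<sample><TAB>".
tab=$(printf '\t')
row=$(printf '%s\n' "${line}" \
    | grep 'Estimated number of unique exact matches' \
    | sed -e "s/Estimated number of unique exact matches: /${sample}${tab}/")

# Prints the sample name and the estimate separated by a tab.
printf '%s\n' "${row}"
```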

Great. Now let's run it on the different data types.

## Microbial unique kmers

In [None]:
count_kmers "microbial" "${kmer_sizes_list}"

Okay, let's take a look at the output.

In [None]:
ls microbial/kmer

We have one `.tsv` file per k-mer size, named by the k-mer size.

In [None]:
head -n 5 microbial/kmer/kmer-9.tsv
wc -l microbial/kmer/kmer-*.tsv

Each of these files is a `tsv` file with one row per sample in our dataset: the sample name, a tab, and the estimated number of unique k-mers.
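Since each row is a sample name and an estimate separated by a tab, the files are easy to summarise with standard tools. A minimal sketch using `awk` on a made-up three-row file (the sample names, the values, and the `summarise_kmers` helper are all hypothetical):

```shell
# Hypothetical helper: report the number of samples, the mean estimate, and
# the sample with the largest estimate in a two-column TSV.
summarise_kmers() {
    awk -F '\t' '
        { sum += $2 }
        NR == 1 || $2 + 0 > max { max = $2 + 0; top = $1 }
        END { if (NR > 0) printf "samples=%d mean=%.1f top=%s\n", NR, sum / NR, top }
    ' "$1"
}

# Made-up data standing in for a file such as microbial/kmer/kmer-9.tsv.
example_tsv=$(mktemp)
printf 'sampleA\t1500\nsampleB\t900\nsampleC\t2100\n' > "${example_tsv}"
summary=$(summarise_kmers "${example_tsv}")
echo "${summary}"
rm -f "${example_tsv}"
```

On a real run you would pass one of the generated files instead, for example `summarise_kmers microbial/kmer/kmer-9.tsv`.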

## Metagenomics unique kmers

Okay, now let's do this for the metagenomics data.

In [None]:
count_kmers "metagenomics" "${kmer_sizes_list}"

In [None]:
ls metagenomics/kmer
head -n 5 metagenomics/kmer/kmer-9.tsv
wc -l metagenomics/kmer/kmer-*.tsv
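The `wc -l` output is a quick sanity check: if every input file produced an estimate, each TSV should have as many rows as there are `.fastq.gz` files. A sketch of automating that comparison follows. `check_kmer_rows` is a hypothetical helper, and it assumes the `data/` and `kmer/` subdirectory layout used by `count_kmers`.

```shell
# Hypothetical helper: compare the row count of every kmer-*.tsv with the
# number of *.fastq.gz inputs for one data type. Returns non-zero on mismatch.
check_kmer_rows() {
    data_type_dir=$1
    status=0

    # Count input files with a glob rather than parsing `ls`.
    expected=0
    for f in "${data_type_dir}"/data/*.fastq.gz
    do
        [ -e "${f}" ] && expected=$((expected + 1))
    done

    for tsv in "${data_type_dir}"/kmer/kmer-*.tsv
    do
        # Skip the unexpanded pattern when no TSV files exist.
        [ -e "${tsv}" ] || continue
        rows=$(($(wc -l < "${tsv}")))
        if [ "${rows}" -ne "${expected}" ]
        then
            echo "MISMATCH: ${tsv} has ${rows} rows, expected ${expected}"
            status=1
        fi
    done
    return "${status}"
}
```

Running `check_kmer_rows metagenomics` from the project root would then print a line for any file whose row count differs from the number of inputs, which usually points to a sample that failed and is worth looking up in the matching `.log.err` file.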

## Human unique kmers

Finally, let's do this for the human genomics data.

In [None]:
count_kmers "human" "${kmer_sizes_list}"

In [None]:
ls human/kmer
head -n 5 human/kmer/kmer-9.tsv
wc -l human/kmer/kmer-*.tsv

Great. We've finished estimating the number of unique k-mers for every combination of data type and k-mer size. We're set to generate the figures :).