# Kmer cardinality in dataset (downsampled data)

Let's estimate the number of unique kmers (the set cardinality) in each of our datasets to get an idea of their kmer diversity.
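
To make "unique kmers" concrete, here is a minimal sketch that counts distinct kmers exactly in a single short sequence. The function name `count_distinct_kmers` and the toy sequence are made up for illustration. It does not merge reverse complements, and it is not how the rest of this notebook counts (that uses an estimator, since exact counting gets expensive on real read sets).

```shell
# Hypothetical helper: exact number of distinct k-mers in one sequence.
# Every window of length k is stored as a key in an awk array, so
# repeated k-mers are only counted once.
count_distinct_kmers() {
    sequence=$1
    k=$2
    printf '%s\n' "${sequence}" | awk -v k="${k}" '
        {
            for (i = 1; i + k - 1 <= length($0); i++) {
                seen[substr($0, i, k)] = 1
            }
        }
        END {
            n = 0
            for (kmer in seen) n++
            print n
        }'
}

# The 3-mers of ACGTACGT are ACG CGT GTA TAC ACG CGT, so 4 are distinct.
count_distinct_kmers "ACGTACGT" 3
```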

First, we'll set up some variables.

In [1]:
kmer_sizes_list="9 11 13 15 17"
data_dir_name="data-downsampled"

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR

Here, `kmer_sizes_list` is a space-separated list of the kmer sizes we will count, and `data_dir_name` is the name of the input directory inside each data type directory. The remaining commands make sure we are in the project's root directory.

## K-mer cardinality estimation code

Now, let's set up the code for the k-mer counting. We'll use the program [dashing](https://github.com/dnbaker/dashing), specifically the `dashing hll` command, which estimates the number of unique kmers in our dataset using the HyperLogLog algorithm.

For this to work, you'll need a [conda](https://docs.conda.io/en/latest/) environment called `dashing` that contains the binary `dashing_s512`. The binary was not available as part of conda, so I created the environment and copied the binary into its `bin/` directory.
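
The copy step could look roughly like the sketch below. The function name `install_dashing_binary` and both of its arguments are hypothetical; the exact commands I ran at the time are not recorded here, so treat this as one way to do it rather than the way it was done.

```shell
# Hypothetical helper: copy a downloaded dashing binary into the bin/
# directory of a conda environment and make it executable.
#   $1: path to the downloaded dashing_s512 binary
#   $2: prefix (root directory) of the conda environment
install_dashing_binary() {
    binary_path=$1
    env_prefix=$2

    mkdir -p "${env_prefix}/bin"
    cp "${binary_path}" "${env_prefix}/bin/dashing_s512"
    chmod +x "${env_prefix}/bin/dashing_s512"
}
```

An empty environment can be created with `conda create --name dashing`, and its prefix is the path listed next to `dashing` in the output of `conda env list`. That prefix is what you would pass as the second argument.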

In [2]:
# Purpose: Counts k-mers using `dashing`
# Args:
#      data_type_dir: The directory for the specific data type (human, microbial, metagenomics).
#      kmer_sizes: A string listing the k-mer sizes, separated by spaces (e.g., "9 15 21").
count_kmers() {
    data_type_dir=$1
    kmer_sizes=$2
    
    threads=4
    jobs=50
    
    data_dir=${data_type_dir}/${data_dir_name}
    kmer_output_dir=${data_type_dir}/kmer-downsampled
    
    rm -rf ${kmer_output_dir}
    mkdir ${kmer_output_dir}
    
    # Make string of all files, minus directory and '.fastq.gz' part.
    # E.g., "dir/file1.fastq.gz dir/file2.fastq.gz" becomes "file1 file2"
    files=''
    for f in ${data_dir}/*.fastq.gz
    do
        name=`basename ${f} .fastq.gz`
        files="${files} ${name}"
    done
    
    before=`date +%s`
    
    for kmer_size in ${kmer_sizes}
    do
        output=${kmer_output_dir}/kmer-${kmer_size}.tsv
        log=${kmer_output_dir}/kmer-${kmer_size}.log
        
        command="parallel --jobs ${jobs} -I% \
            dashing_s512 hll -k ${kmer_size} -p ${threads} ${data_dir}/%.fastq.gz \2\>\> ${log}.err \| \
            grep 'Estimated number of unique exact matches' \| \
            sed -e 's/Estimated number of unique exact matches: /%\t/' \
            ::: ${files} > ${output}"
            
        echo ${command}
        conda run --name dashing ${command}
    done
    
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes." ${minutes}    
}
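
To see what the `grep` and `sed` steps inside `count_kmers` do, here is a small standalone sketch of the same filter applied to two sample lines of dashing output (copied from the output shown later in this notebook). The function name `format_estimate` is made up for this example, and it builds the tab with `printf` instead of relying on `\t` inside `sed`.

```shell
# Hypothetical helper mirroring the grep/sed step of count_kmers:
# keep only the estimate line and replace its label with "<sample><TAB>".
format_estimate() {
    sample=$1
    tab=$(printf '\t')
    grep 'Estimated number of unique exact matches' |
        sed -e "s/Estimated number of unique exact matches: /${sample}${tab}/"
}

# Two lines as printed by dashing; only the first one survives the filter.
printf '%s\n' \
    'Estimated number of unique exact matches: 126224.000000' \
    'Dashing version: v0.4.0-5-g9c13' |
    format_estimate "ERR1144974"
```

In `count_kmers` the `%` placeholder plays the role of `sample`: `parallel` substitutes each file name for it, which is how every row of the `tsv` ends up labelled with its sample.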

Great. Now let's run it on the different data types.

## Microbial unique kmers

In [3]:
count_kmers "microbial" "${kmer_sizes_list}"

parallel --jobs 50 -I% dashing_s512 hll -k 9 -p 4 microbial/data-downsampled/%.fastq.gz \2\>\> microbial/kmer-downsampled/kmer-9.log.err \| grep 'Estimated number of unique exact matches' \| sed -e 's/Estimated number of unique exact matches: /%\t/' ::: ERR1144974 ERR1144975 ERR1144976 ERR1144977 ERR1144978 ERR3655992 ERR3655994 ERR3655996 ERR3655998 ERR3656002 ERR3656004 ERR3656010 ERR3656012 ERR3656013 ERR3656015 ERR3656018 ERR3656019 SRR10298903 SRR10298904 SRR10298905 SRR10298906 SRR10298907 SRR10512964 SRR10512965 SRR10512968 SRR10513325 SRR10513326 SRR10513328 SRR10513363 SRR10513672 SRR10519468 SRR10519469 SRR10519616 SRR10519617 SRR10519619 SRR10519620 SRR10519637 SRR10521982 SRR10521983 SRR10521984 SRR10527348 SRR10527349 SRR10527351 SRR10527352 SRR10527353 SRR8088181 SRR8088182 SRR8088183 SRR8088184 SRR8088185 > microbial/kmer-downsampled/kmer-9.tsv
parallel --jobs 50 -I% dashing_s512 hll -k 11 -p 4 microbial/data-downsampled/%.fastq.gz \2\>\> microbial/kmer-downsampled/kmer-

Okay, let's take a look at the output.

In [4]:
ls microbial/kmer-downsampled

kmer-11.log.err  kmer-13.tsv      kmer-17.log.err  kmer-9.tsv
kmer-11.tsv      kmer-15.log.err  kmer-17.tsv
kmer-13.log.err  kmer-15.tsv      kmer-9.log.err


We have two files for each k-mer size: a `.tsv` file with the estimates and a `.log.err` file with whatever dashing wrote to standard error.

In [5]:
head -n 5 microbial/kmer-downsampled/kmer-9.tsv
wc -l microbial/kmer-downsampled/kmer-*.tsv

ERR1144974	126224.000000
ERR1144975	126327.000000
ERR1144976	126245.000000
ERR3655992	128193.000000
ERR3655994	129773.000000
  50 microbial/kmer-downsampled/kmer-11.tsv
  50 microbial/kmer-downsampled/kmer-13.tsv
  50 microbial/kmer-downsampled/kmer-15.tsv
  50 microbial/kmer-downsampled/kmer-17.tsv
  50 microbial/kmer-downsampled/kmer-9.tsv
 250 total


Each of these files is a `tsv` file containing the estimated number of unique kmers for each sample in our dataset, one sample per line (50 samples per file).
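
As a quick sanity check on one of these files, a sketch like the following prints the number of samples along with the minimum, maximum, and mean estimate. The function name `summarize_estimates` is hypothetical, and the demo file contains only the five rows printed above, not the full 50.

```shell
# Hypothetical helper: summarize the second column of a
# "sample<TAB>estimate" file.
summarize_estimates() {
    awk -F '\t' '
        { value = $2 + 0 }              # force numeric interpretation
        NR == 1 { min = value; max = value }
        {
            if (value < min) min = value
            if (value > max) max = value
            sum += value
        }
        END {
            if (NR > 0) {
                printf "n=%d min=%.0f max=%.0f mean=%.1f\n", NR, min, max, sum / NR
            }
        }
    ' "$1"
}

# Demo on the five rows shown in the output above.
demo_file=$(mktemp)
printf '%s\t%s\n' \
    ERR1144974 126224.000000 \
    ERR1144975 126327.000000 \
    ERR1144976 126245.000000 \
    ERR3655992 128193.000000 \
    ERR3655994 129773.000000 > "${demo_file}"
summarize_estimates "${demo_file}"
rm -f "${demo_file}"
```

On the real data the argument would be a path such as `microbial/kmer-downsampled/kmer-9.tsv`.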

## Metagenomics unique kmers

Okay, now let's do this for the metagenomics data.

In [6]:
count_kmers "metagenomics" "${kmer_sizes_list}"

parallel --jobs 50 -I% dashing_s512 hll -k 9 -p 4 metagenomics/data-downsampled/%.fastq.gz \2\>\> metagenomics/kmer-downsampled/kmer-9.log.err \| grep 'Estimated number of unique exact matches' \| sed -e 's/Estimated number of unique exact matches: /%\t/' ::: ERR1713331 ERR1713332 ERR1713333 ERR1713334 ERR1713335 ERR1713336 ERR1713337 ERR1713339 ERR1713340 ERR1713341 ERR1713342 ERR1713343 ERR1713344 ERR1713345 ERR1713351 ERR1713352 ERR1713353 ERR1713355 ERR1713356 ERR1713358 ERR1713359 ERR1713361 ERR1713362 ERR1713363 ERR1713366 ERR1713371 ERR1713372 ERR1713373 ERR1713374 ERR1713375 ERR1713376 ERR1713378 ERR1713379 ERR1713381 ERR1713382 ERR1713388 ERR1713389 ERR1713391 ERR1713393 ERR1713395 ERR1713396 ERR1713397 ERR1713399 ERR1713400 ERR1713401 ERR1713402 ERR1713403 ERR1713405 ERR1713406 ERR1713409 > metagenomics/kmer-downsampled/kmer-9.tsv
parallel --jobs 50 -I% dashing_s512 hll -k 11 -p 4 metagenomics/data-downsampled/%.fastq.gz \2\>\> metagenomics/kmer-downsampled/kmer-11.log.err \|

In [7]:
ls metagenomics/kmer-downsampled
head -n 5 metagenomics/kmer-downsampled/kmer-9.tsv
wc -l metagenomics/kmer-downsampled/kmer-*.tsv

kmer-11.log.err  kmer-13.tsv      kmer-17.log.err  kmer-9.tsv
kmer-11.tsv      kmer-15.log.err  kmer-17.tsv
kmer-13.log.err  kmer-15.tsv      kmer-9.log.err
ERR1713333	131083.000000
ERR1713352	131077.000000
ERR1713379	131080.000000
ERR1713399	131085.000000
ERR1713366	131083.000000
  50 metagenomics/kmer-downsampled/kmer-11.tsv
  50 metagenomics/kmer-downsampled/kmer-13.tsv
  50 metagenomics/kmer-downsampled/kmer-15.tsv
  50 metagenomics/kmer-downsampled/kmer-17.tsv
  50 metagenomics/kmer-downsampled/kmer-9.tsv
 250 total


Awesome. Let's finally do this for the human genomics data.

## Human unique kmers

In [8]:
count_kmers "human" "${kmer_sizes_list}"

parallel --jobs 50 -I% dashing_s512 hll -k 9 -p 4 human/data-downsampled/%.fastq.gz \2\>\> human/kmer-downsampled/kmer-9.log.err \| grep 'Estimated number of unique exact matches' \| sed -e 's/Estimated number of unique exact matches: /%\t/' ::: SRR038300 SRR039632 SRR1012332 SRR1024141 SRR1033463 SRR1035695 SRR1047817 SRR1060774 SRR1174334 SRR1193574 SRR1292581 SRR1294106 SRR1303626 SRR1313077 SRR1313078 SRR1313092 SRR1313097 SRR1313105 SRR1313120 SRR1313154 SRR1313156 SRR1313198 SRR1313216 SRR1313228 SRR1519066 SRR191403 SRR191429 SRR191455 SRR191463 SRR191480 SRR191487 SRR191494 SRR191527 SRR191555 SRR191563 SRR191621 SRR191646 SRR191675 SRR191693 SRR191696 SRR299111 SRR306849 SRR353653 SRR387778 SRR393767 SRR403006 SRR496397 SRR518951 SRR518958 SRR537114 > human/kmer-downsampled/kmer-9.tsv
parallel --jobs 50 -I% dashing_s512 hll -k 11 -p 4 human/data-downsampled/%.fastq.gz \2\>\> human/kmer-downsampled/kmer-11.log.err \| grep 'Estimated number of unique exact matches' \| sed -e 's/

In [9]:
ls human/kmer-downsampled
head -n 5 human/kmer-downsampled/kmer-9.tsv
wc -l human/kmer-downsampled/kmer-*.tsv

kmer-11.log.err  kmer-13.tsv      kmer-17.log.err  kmer-9.tsv
kmer-11.tsv      kmer-15.log.err  kmer-17.tsv
kmer-13.log.err  kmer-15.tsv      kmer-9.log.err
SRR191527	37398.000000
SRR191463	61953.000000
SRR191494	44799.000000
SRR191487	38059.000000
SRR191480	57169.000000
  50 human/kmer-downsampled/kmer-11.tsv
  50 human/kmer-downsampled/kmer-13.tsv
  50 human/kmer-downsampled/kmer-15.tsv
  50 human/kmer-downsampled/kmer-17.tsv
  50 human/kmer-downsampled/kmer-9.tsv
 250 total


## Unique kmers in union of all datasets

Let's estimate the number of unique kmers in the union of all samples within each data type.

In [10]:
for data_type in microbial metagenomics human
do
    echo "Working on ${data_type}"
    for kmer_size in ${kmer_sizes_list}
    do
        input_dir=${data_type}/data-downsampled
        output=${data_type}/data-downsampled/total-unique-kmers-${kmer_size}.txt
        command="dashing_s512 hll -k ${kmer_size} -p 56 ${input_dir}/*.fastq.gz | tee ${output}"
        #echo ${command}
        echo "For kmer size ${kmer_size}"
        conda run --name dashing ${command}
    done
done

Working on microbial
For kmer size 9
Estimated number of unique exact matches: 131085.000000
Dashing version: v0.4.0-5-g9c13
[int bns::hll_main(int, char**)] Processing 50 paths with 56 threads
For kmer size 11
Estimated number of unique exact matches: 2095943.000000
Dashing version: v0.4.0-5-g9c13
[int bns::hll_main(int, char**)] Processing 50 paths with 56 threads
For kmer size 13
Estimated number of unique exact matches: 23984486.000000
Dashing version: v0.4.0-5-g9c13
[int bns::hll_main(int, char**)] Processing 50 paths with 56 threads
For kmer size 15
Estimated number of unique exact matches: 60960845.000000
Dashing version: v0.4.0-5-g9c13
[int bns::hll_main(int, char**)] Processing 50 paths with 56 threads
For kmer size 17
Estimated number of unique exact matches: 73106902.000000
Dashing version: v0.4.0-5-g9c13
[int bns::hll_main(int, char**)] Processing 50 paths with 56 threads
Working on metagenomics
For kmer size 9
Estimated number of unique exact matches: 131085.000000
Dashing
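
One thing worth noticing in the output above is that the k=9 estimate is 131085 for both microbial and metagenomics. A likely explanation is saturation: for odd k there are 4^k / 2 distinct kmers once a kmer and its reverse complement are counted as one. The sketch below computes that upper bound for our kmer sizes. It assumes dashing counts kmers in this canonical form, which I believe is its default but have not verified here; the function name `max_canonical_kmers` is made up for this example.

```shell
# Hypothetical helper: upper bound on distinct canonical k-mers for odd k.
# For odd k no k-mer equals its own reverse complement, so the 4^k
# possible k-mers pair up exactly and the bound is 4^k / 2.
max_canonical_kmers() {
    k=$1
    total=1
    i=0
    while [ "${i}" -lt "${k}" ]; do
        total=$((total * 4))
        i=$((i + 1))
    done
    echo $((total / 2))
}

# Print "k<TAB>bound" for the k-mer sizes used in this notebook.
for k in 9 11 13 15 17
do
    printf '%s\t%s\n' "${k}" "$(max_canonical_kmers "${k}")"
done
```

The bound is 131072 for k=9 and 2097152 for k=11, so under that assumption the microbial union estimates of 131085 and 2095943 sit right at the ceiling (HyperLogLog is approximate, so landing slightly above it is not surprising). At these sizes the counts mostly reflect the size of the kmer space rather than the diversity of the data, while the larger sizes are far from saturated.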

Great. We've finished estimating the unique kmer counts for every data type and kmer size. We're set to generate the figures :).