---
layout: post  
---

What are the memory/speed trade-offs of using an ordered dictionary to store kmers and their respective counts as opposed to using sorted vectors?

My initial expectation is that the ordered dictionary pays an additional memory overhead for storing its hash table (cost) in exchange for faster lookups (benefit).

The flip side is that the sorted vectors should have little to no memory overhead beyond the kmers and counts themselves (benefit), but lookups should be slower (cost): a binary search has an expected runtime proportional to $$\log_2(K)$$, where $$K$$ is the number of kmers.

Another potential benefit of the sorted vectors is that they can be memory-mapped from disk, which would allow us to work with kmer datasets that are larger than the available RAM of the machine.
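
As a minimal sketch of the memory-mapping idea, the cell below writes a sorted vector to a flat binary file and binary-searches it through Julia's standard-library `Mmap` module. It assumes the kmers have been packed into `UInt64` values; the stand-in values and the temporary file path are hypothetical and not part of the benchmark that follows.

```julia
import Mmap

# Hypothetical stand-in data: kmers packed into UInt64 values, already sorted
sorted_kmers = UInt64[3, 8, 21, 34, 55, 89]

# Write the raw bytes of the vector to a scratch file (hypothetical path)
path = tempname()
write(path, sorted_kmers)

# Map the file back in as a Vector{UInt64}; pages are loaded from disk on demand
n = filesize(path) ÷ sizeof(UInt64)
io = open(path, "r")
mapped_kmers = Mmap.mmap(io, Vector{UInt64}, n)

# Binary search works on the mapped vector exactly as on an in-memory one
hits = searchsorted(mapped_kmers, UInt64(21))          # range of matching indices
missing_hits = searchsorted(mapped_kmers, UInt64(22))  # empty range when absent

println("21 found: ", !isempty(hits))
println("22 found: ", !isempty(missing_hits))

close(io)
```

Because `searchsorted` only needs indexed access, the same lookup code should work whether the vector lives in RAM or is backed by a file.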

So... let's benchmark them!

In [7]:
import Pkg
pkgs = [
    "BenchmarkTools",
    "DataStructures",
    "Random",
    "BioSequences",
    "Primes",
    "StatsBase",
    "Statistics"
]
Pkg.add(pkgs)
for pkg in pkgs
    eval(Meta.parse("import $pkg"))
end

[32m[1m   Updating[22m[39m registry at `~/.julia/registries/General`


[?25l    

[32m[1m   Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`




[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`


In [8]:
k = 3

3

In [9]:
sequence = BioSequences.randdnaseq(Random.seed!(1), 10)

10nt DNA Sequence:
TCGTCCCAGG

In [10]:
KMER_TYPE = BioSequences.DNAMer{3}

BioSequences.Mer{BioSequences.DNAAlphabet{2},3}

In [11]:
kmer_counts = StatsBase.countmap(
    BioSequences.canonical(kmer.fw)
        for kmer in BioSequences.each(KMER_TYPE, sequence))

Dict{Any,Int64} with 8 entries:
  ACG => 1
  GAC => 1
  CCC => 1
  GGA => 1
  CGA => 1
  AGG => 1
  CAG => 1
  CCA => 1

In [12]:
sorted_kmer_counts = collect(sort(kmer_counts))

8-element Array{Pair{Any,Int64},1}:
 ACG => 1
 AGG => 1
 CAG => 1
 CCA => 1
 CCC => 1
 CGA => 1
 GAC => 1
 GGA => 1

In [13]:
kmer_counts_dict = 
    DataStructures.OrderedDict(
        kmer => (index = i, count = c) for (i, (kmer, c)) in enumerate(sorted_kmer_counts)
)

OrderedCollections.OrderedDict{BioSequences.Mer{BioSequences.DNAAlphabet{2},3},NamedTuple{(:index, :count),Tuple{Int64,Int64}}} with 8 entries:
  ACG => (index = 1, count = 1)
  AGG => (index = 2, count = 1)
  CAG => (index = 3, count = 1)
  CCA => (index = 4, count = 1)
  CCC => (index = 5, count = 1)
  CGA => (index = 6, count = 1)
  GAC => (index = 7, count = 1)
  GGA => (index = 8, count = 1)

In [14]:
kmers = first.(sorted_kmer_counts)

8-element Array{BioSequences.Mer{BioSequences.DNAAlphabet{2},3},1}:
 ACG
 AGG
 CAG
 CCA
 CCC
 CGA
 GAC
 GGA

In [15]:
counts = last.(sorted_kmer_counts)

8-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1

In [16]:
Base.summarysize(kmers)

104

In [17]:
Base.summarysize(counts)

104

In [18]:
Base.summarysize(kmer_counts_dict)

416

In [19]:
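# time to see if something is in the list
# note: `kmer_counts_dict` is a non-constant global that is not interpolated with `$`,
# so the reported time and allocations may include global-variable access overhead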
generate_kmer(k) = BioSequences.canonical(BioSequences.DNAMer(BioSequences.randdnaseq(k)))
BenchmarkTools.@benchmark get(kmer_counts_dict, $(generate_kmer(k)), (index = 0, count = 0))

BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     38.151 ns (0.00% GC)
  median time:      40.450 ns (0.00% GC)
  mean time:        43.954 ns (7.19% GC)
  maximum time:     4.177 μs (98.89% GC)
  --------------
  samples:          10000
  evals/sample:     992

In [20]:
x = BenchmarkTools.@benchmark searchsorted(kmers, $(generate_kmer(k)))

BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     33.992 ns (0.00% GC)
  median time:      34.569 ns (0.00% GC)
  mean time:        38.648 ns (8.02% GC)
  maximum time:     3.569 μs (98.67% GC)
  --------------
  samples:          10000
  evals/sample:     993
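
Note that `searchsorted` only locates the kmer, while `get` on the dictionary also returns the stored value. A minimal sketch of an equivalent lookup for the two parallel vectors might look like the following. The helper name `lookup_count` is hypothetical, and plain strings stand in for kmers so that the cell runs without BioSequences.

```julia
# Hypothetical helper: return the count stored for `key`, or `default` if absent.
# Assumes `sorted_keys` is sorted and `counts[i]` belongs to `sorted_keys[i]`.
function lookup_count(sorted_keys::AbstractVector, counts::AbstractVector, key, default = 0)
    # Index of the first element >= key (one past the end if there is none)
    i = searchsortedfirst(sorted_keys, key)
    if i <= lastindex(sorted_keys) && sorted_keys[i] == key
        return counts[i]
    end
    return default
end

# Stand-in data: strings in place of kmers, with made-up counts
keys_sorted = ["ACG", "AGG", "CAG", "CCA"]
key_counts = [1, 2, 1, 3]

println(lookup_count(keys_sorted, key_counts, "CCA"))  # present key
println(lookup_count(keys_sorted, key_counts, "TTT"))  # absent key, falls back to default
```

Benchmarking a helper like this against `get(kmer_counts_dict, kmer, 0)` would compare the full "find the kmer and fetch its count" operation on both sides.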

In [26]:
R = 3:9

for sequence_length in [10^i for i in R]
    println("sequence_length = $sequence_length")
    sequence = BioSequences.randdnaseq(Random.seed!(sequence_length), sequence_length)
    max_k = Int(ceil(log(4, sequence_length)))
    for k in Primes.primes(3, max_k)
        println("\tk = $k")
        KMER_TYPE = BioSequences.DNAMer{k}
        kmer_counts = StatsBase.countmap(
            BioSequences.canonical(kmer.fw)
                for kmer in BioSequences.each(KMER_TYPE, sequence))
        sorted_kmer_counts = collect(sort(kmer_counts))
        kmer_counts_dict = 
            DataStructures.OrderedDict(
                kmer => count for (kmer, count) in sorted_kmer_counts
            )
        kmers = first.(sorted_kmer_counts)
        counts = last.(sorted_kmer_counts)
        
        println("\t\ttotal kmers = $(length(kmers))\n")

        vector_size = Base.summarysize(kmers) + Base.summarysize(counts)

        dict_size = Base.summarysize(kmer_counts_dict)
        
        relative_size = dict_size / vector_size
        println("\t\tDict size relative to vectors\t\t: ", round(dict_size / vector_size, digits=1))
        
        vector_results = BenchmarkTools.@benchmark searchsorted(kmers, $(generate_kmer(k)))
        dict_results = BenchmarkTools.@benchmark get(kmer_counts_dict, $(generate_kmer(k)), 0)
        
        relative_performance = Statistics.median(vector_results).time / Statistics.median(dict_results).time
        println("\t\tDict performance relative to vectors\t: ", round(relative_performance, digits=1))
        println("\t\tnormalized performance\t\t\t: ", round(relative_performance / relative_size, digits=1))

    end
end

sequence_length = 1000
	k = 3
		total kmers = 32

		Dict size relative to vectors		: 1.6
		Dict performance relative to vectors	: 1.2
		normalized performance			: 0.8
	k = 5
		total kmers = 430

		Dict size relative to vectors		: 1.6
		Dict performance relative to vectors	: 1.4
		normalized performance			: 0.9
sequence_length = 10000
	k = 3
		total kmers = 32

		Dict size relative to vectors		: 1.6
		Dict performance relative to vectors	: 1.2
		normalized performance			: 0.7
	k = 5
		total kmers = 512

		Dict size relative to vectors		: 1.5
		Dict performance relative to vectors	: 1.3
		normalized performance			: 0.9
	k = 7
		total kmers = 5750

		Dict size relative to vectors		: 1.7
		Dict performance relative to vectors	: 1.5
		normalized performance			: 0.9
sequence_length = 100000
	k = 3
		total kmers = 32

		Dict size relative to vectors		: 1.6
		Dict performance relative to vectors	: 1.1
		normalized performance			: 0.7
	k = 5
		total kmers = 512

		Dict size relative to vectors	

Looks like the dictionary has roughly a 50-70% storage overhead over just using vectors (1.5x to 1.7x the size in the runs above), which isn't bad, but isn't great either.
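
For reference, the "normalized performance" number printed above is the speed ratio divided by the size ratio, as computed in the loop. The symbols $$t$$ (median lookup time) and $$s$$ (`Base.summarysize` in bytes) are introduced here only for illustration:

$$
\text{normalized performance} = \frac{t_{\text{vectors}} / t_{\text{dict}}}{s_{\text{dict}} / s_{\text{vectors}}}
$$

Taking the rounded values printed for `sequence_length = 10000`, `k = 7` as a worked example:

$$
\frac{1.5}{1.7} \approx 0.9
$$

A value below 1 means the dictionary's speed-up is smaller than its relative increase in memory.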

Let's see how much of a performance improvement we can get by continuing to increase k, and with it the number of distinct kmers, which appears to be where the hashing finally starts to pay off in terms of speed.

In [27]:
function assess_dict_vs_vectors(sequence, k)
    KMER_TYPE = BioSequences.DNAMer{k}
    kmer_counts = StatsBase.countmap(BioSequences.canonical(kmer.fw) for kmer in BioSequences.each(KMER_TYPE, sequence))
    sorted_kmer_counts = collect(sort(kmer_counts))
    
    kmer_counts_dict = DataStructures.OrderedDict(
            kmer => count for (kmer, count) in sorted_kmer_counts
        )
    
    kmers = first.(sorted_kmer_counts)
    counts = last.(sorted_kmer_counts)

    println("\t\ttotal kmers = $(length(kmers))\n")

    vector_size = Base.summarysize(kmers) + Base.summarysize(counts)

    dict_size = Base.summarysize(kmer_counts_dict)

    relative_size = dict_size / vector_size
    println("\t\tDict size relative to vectors\t\t: ", round(dict_size / vector_size, digits=1))

    vector_results = BenchmarkTools.@benchmark searchsorted(kmers, $(generate_kmer(k)))
    dict_results = BenchmarkTools.@benchmark get(kmer_counts_dict, $(generate_kmer(k)), 0)

    relative_performance = Statistics.median(vector_results).time / Statistics.median(dict_results).time
    println("\t\tDict performance relative to vectors\t: ", round(relative_performance, digits=1))
    println("\t\tnormalized performance\t\t\t: ", round(relative_performance / relative_size, digits=1))
end

assess_dict_vs_vectors (generic function with 1 method)

In [None]:
sequence_length = 10^9
println("sequence_length = $sequence_length")
sequence = BioSequences.randdnaseq(Random.seed!(sequence_length), sequence_length)
for k in Primes.primes(3, 31)
    println("\tk = $k")
    assess_dict_vs_vectors(sequence, k)
end

sequence_length = 1000000000
