---
layout: post  
title: Indexing Kmers  
date: 2020-12-05
author: Cameron Prybol  

---

- Assume that you have a known kmer length, k, and a known alphabet (either DNA nucleotides ACGT or Amino Acids)

There are several ways that we can store kmers.
The simplest way would be to just store the actual kmer.
One of the most convenient ways to store kmers is in a hash, which enables quick, constant-time lookups to see if a given kmer exists.
The hash method doesn't scale well, though, since it requires every kmer to remain in memory.
Keeping all kmers in memory is often doable, but I'd like to write only one method that works everywhere, regardless of available memory (we are working under the assumption that there is sufficient disk storage to store and analyze the data). We'll therefore focus on methods that allow memory-mapping to disk, so that we can work with datasets that are larger than available RAM while staying entirely in RAM when the data fits.
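
As a minimal sketch of the memory-mapping idea, the snippet below backs a vector of counts with a file using Julia's `Mmap` standard library. The temporary file path, the `UInt32` element type, and the vector length are assumptions chosen for illustration, not values taken from the analysis in this post.

```julia
import Mmap

# Assumption: a throwaway temporary file stands in for a real on-disk index.
path = tempname()
n = 4^5  # number of possible DNA 5-mers

# Opening with "w+" creates the file; mmap grows it to fit the requested vector.
counts = open(path, "w+") do io
    Mmap.mmap(io, Vector{UInt32}, n)
end

# The mapped vector behaves like an ordinary array.
counts[3] += 1
counts[n] += 2

# Flush pending changes to disk.
Mmap.sync!(counts)

# Re-open the same file read-only to confirm that the counts were persisted.
reloaded = open(path, "r") do io
    Mmap.mmap(io, Vector{UInt32}, n)
end
println(reloaded[3], " ", reloaded[n])
```

The operating system decides which pages of the file are resident in RAM, which is what lets the same code path serve both small and larger-than-memory datasets.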

When memory-mapping to disk, we must use fixed-size datatypes.
The two options that seem most practical are:
- immutable, fixed-length containers of nucleotides or amino acids
    - Tuples
    - [Static Arrays](https://github.com/JuliaArrays/StaticArrays.jl)
- integers
    - for a given k-length, there is a finite number of possible kmers
    - for a given biological alphabet of N characters, that number is N^k
    - given an ordering of these characters, we can deterministically solve for the index that a given kmer would occupy in a sorted list of all possible kmers of k-length
    - thus, we can unambiguously map between an integer index and a given kmer with a known alphabet
    - this only becomes an issue when the number of possible kmers is larger than what can be stored in native integer datatypes
        - e.g. UInt32 or UInt64
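
To make the integer-size limit concrete, here is a small sketch that finds the largest k for which all N^k indices fit in a given integer type. The helper name `max_k` is hypothetical (it is not part of this post's code or of any package), and it uses `BigInt` arithmetic so that the check itself cannot overflow.

```julia
# Hypothetical helper: largest k such that N^k <= typemax(T).
function max_k(N::Integer, ::Type{T}) where T <: Integer
    limit = big(typemax(T))
    k = 0
    # BigInt powers never wrap, so the comparison is exact
    while big(N)^(k + 1) <= limit
        k += 1
    end
    return k
end

for T in (UInt32, UInt64)
    println("DNA (N = 4) with ", T, ": k <= ", max_k(4, T))
    println("amino acids (N = 20) with ", T, ": k <= ", max_k(20, T))
end
```

Beyond that limit, fixed-width integer arithmetic in Julia wraps around silently rather than raising an error, so an index computed for a larger k would be wrong without any warning.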
        
Since I also want to keep track of the number of occurrences of each kmer in a given dataset, I think that the most concise way to keep track of kmers is a sparse count vector of the number of times each kmer was observed, where:
- size of the sparse count vector = N^k where N = size of alphabet and k = length of kmer
- kmers are mapped to an integer, such that the i-th index of the count vector represents the i-th kmer in a hypothetical dense sorted list of all possible kmers
- any kmer with counts > 0 exists, and thus the counts vector is sufficient to store the kmers and their frequencies
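
The sparse count vector described above can be sketched with the `SparseArrays` standard library. To keep the example free of package dependencies it uses plain `Char`s instead of BioSymbols types; the names `DNA_CHARS`, `char_kmer_index`, and `count_kmers` are hypothetical helpers for this illustration only.

```julia
import SparseArrays

# Assumption: plain Chars stand in for BioSymbols so the sketch needs no packages.
DNA_CHARS = ('A', 'C', 'G', 'T')

# Hypothetical helper: 1-based rank of a kmer among all sorted kmers of its length.
function char_kmer_index(kmer::AbstractString, alphabet = DNA_CHARS)
    N = length(alphabet)
    index = 0
    for c in kmer
        # shift the running value one base-N place and add this symbol's 0-based rank
        index = index * N + (findfirst(==(c), alphabet) - 1)
    end
    return index + 1
end

# Count every kmer of length k in an ASCII sequence into a sparse vector of size N^k.
function count_kmers(sequence::AbstractString, k::Integer, alphabet = DNA_CHARS)
    counts = SparseArrays.spzeros(Int, length(alphabet)^k)
    for start in 1:(length(sequence) - k + 1)
        kmer = sequence[start:start + k - 1]
        counts[char_kmer_index(kmer, alphabet)] += 1
    end
    return counts
end

counts = count_kmers("ACGTACGT", 3)
println(SparseArrays.nnz(counts), " distinct kmers out of ", length(counts), " possible")
```

Only the observed kmers take up storage, while the vector's nominal length stays at N^k, so the position of each nonzero entry is enough to recover which kmer it belongs to.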

In [5]:
Pkg.build("FFTW")

[32m[1m   Building[22m[39m FFTW → `~/.julia/packages/FFTW/DMUbN/deps/build.log`


In [6]:
import Pkg
pkgs = [
    "BenchmarkTools",
    "BioSequences",
    "BioSymbols",
    "PlotlyJS",
    "Primes",
    "ProgressMeter",
    "Statistics",
    "StatsPlots",
    "Test",
]
Pkg.add(pkgs)
for pkg in pkgs
    eval(Meta.parse("import $pkg"))
end

StatsPlots.plotlyjs()

[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`
┌ Info: Precompiling StatsPlots [f3b207a7-027a-5e70-b257-86293d7955fd]
└ @ Base loading.jl:1278


Plots.PlotlyJSBackend()

In [2]:
function index_to_kmer(index::Integer, k::Integer, alphabet::NTuple{N,T}) where N where T <: Union{BioSymbols.DNA, BioSymbols.AminoAcid}
    @assert k > 0 "invalid k: $k"
    max_index = N^k
    @assert 0 < index <= max_index "invalid index: $index not within 1:$(max_index)"
    kmer = Vector{T}(undef, k)
    # walk from the most significant position (place value N^(k-1)) to the least (N^0)
    for i in k:-1:1
        divisor = N^(i-1)
        # integer ceiling division avoids floating-point rounding for large indices
        alphabet_index = cld(index, divisor)
        index = index % divisor
        # a remainder of zero carried into this position maps to the last symbol
        if alphabet_index == 0
            alphabet_index = N
        end
        kmer[length(kmer)-i+1] = alphabet[alphabet_index]
    end
    return Tuple(kmer)
end

index_to_kmer (generic function with 1 method)

In [3]:
function kmer_to_index(kmer::NTuple{N_Kmer, T}, alphabet::NTuple{N_Alphabet,T}) where N_Kmer where N_Alphabet where T <: Union{BioSymbols.DNA, BioSymbols.AminoAcid}
    index = 0
    # every position except the last contributes (0-based symbol rank) * place value
    for i in N_Kmer:-1:2
        alphabet_index = findfirst(x -> x == kmer[N_Kmer-i+1], alphabet)
        index += (alphabet_index - 1) * N_Alphabet^(i-1)
    end
    # the last position contributes its 1-based rank, which makes the final index 1-based
    index += findfirst(x -> x == kmer[end], alphabet)
    return index
end

kmer_to_index (generic function with 1 method)
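
Another way to read the two functions above: a kmer's index is one plus the value of the kmer read as a base-N number, where each symbol's digit is its 0-based position in the alphabet. The sketch below illustrates this with `Base.digits` on plain `Char`s. The helper names `index_to_chars` and `chars_to_index` are hypothetical, and this is an illustration of the mapping rather than a replacement for the functions above.

```julia
alphabet = ('A', 'C', 'G', 'T')

# Hypothetical helper: decode a 1-based index into a tuple of symbols.
function index_to_chars(index::Integer, k::Integer, alphabet)
    N = length(alphabet)
    # digits returns the least significant digit first, so reverse it
    ds = reverse(digits(index - 1, base = N, pad = k))
    return Tuple(alphabet[d + 1] for d in ds)
end

# Hypothetical helper: encode a tuple of symbols as a 1-based index.
function chars_to_index(kmer, alphabet)
    N = length(alphabet)
    # accumulate the base-N value one symbol at a time
    value = foldl((acc, c) -> acc * N + (findfirst(==(c), alphabet) - 1), kmer; init = 0)
    return value + 1
end

kmer = index_to_chars(7, 3, alphabet)
println(kmer, " <=> ", chars_to_index(kmer, alphabet))
```

For example, with the alphabet ordered A, C, G, T, the 3-mer ACG has digits 0, 1, 2, which gives 0·16 + 1·4 + 2 = 6 and therefore index 7.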

In [7]:
ALPHABET = filter(symbol -> BioSymbols.iscertain(symbol), BioSymbols.alphabet(BioSymbols.DNA))

Test.@testset "Test kmer <=> index transformations" begin
    for k in 1:3
        for (index, kmer) in enumerate(sort!(vec(collect(Iterators.product([ALPHABET for i in 1:k]...)))))
            Test.@test index == kmer_to_index(kmer, ALPHABET)
            Test.@test kmer == index_to_kmer(index, k, ALPHABET)
        end
    end
end

[37m[1mTest Summary:                       | [22m[39m[32m[1mPass  [22m[39m[36m[1mTotal[22m[39m
Test kmer <=> index transformations | [32m 168  [39m[36m  168[39m


Test.DefaultTestSet("Test kmer <=> index transformations", Any[], 168, false)

In [8]:
ALPHABET = filter(symbol -> BioSymbols.iscertain(symbol) && !BioSymbols.isterm(symbol), BioSymbols.alphabet(BioSymbols.AminoAcid))

Test.@testset "Test kmer <=> index transformations" begin
    for k in 1:3
        for (index, kmer) in enumerate(sort!(vec(collect(Iterators.product([ALPHABET for i in 1:k]...)))))
            Test.@test index == kmer_to_index(kmer, ALPHABET)
            Test.@test kmer == index_to_kmer(index, k, ALPHABET)
        end
    end
end

[37m[1mTest Summary:                       | [22m[39m[32m[1m Pass  [22m[39m[36m[1mTotal[22m[39m
Test kmer <=> index transformations | [32m22308  [39m[36m22308[39m


Test.DefaultTestSet("Test kmer <=> index transformations", Any[], 22308, false)

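Observe that memory allocations and GC time are zero, while the indexing algorithm scales linearly with k.
Because lookup time grows linearly with k while the number of possible kmers grows exponentially, this approach should continue to scale well as k increases.
Note that `kmer_to_index` accumulates into a native `Int`, so with a 64-bit `Int` and the DNA alphabet the returned index is only valid for k ≤ 31; for larger k the benchmarks below still measure lookup time, but the index itself wraps around.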

In [13]:
ks = []
results = []
ALPHABET = filter(symbol -> BioSymbols.iscertain(symbol), BioSymbols.alphabet(BioSymbols.DNA))
ProgressMeter.@showprogress for k in Primes.primes(3, 100)
    kmer = Tuple(BioSequences.randdnaseq(k))
    result = BenchmarkTools.@benchmark kmer_to_index($kmer, $ALPHABET)
    push!(ks, k)
    push!(results, result)
end

Progress: 100%|█████████████████████████████████████████| Time: 0:01:03


In [50]:
rand(results)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     37.621 ns (0.00% GC)
  median time:      37.641 ns (0.00% GC)
  mean time:        38.581 ns (0.00% GC)
  maximum time:     334.274 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

In [48]:
xs = []
ys = []
subsampling_size = 100
for (k, result) in zip(ks, results)
    for time in rand(result.times, subsampling_size)
        push!(xs, k)
        push!(ys, time)
    end
end

StatsPlots.scatter(
    xs,
    ys,
    title = "k-length vs index-lookup time for DNA alphabet",
    ylabel="nanoseconds per lookup (subsampled)",
    xlabel="k length",
    legend=false
)

In [53]:
ks = []
results = []
ALPHABET = filter(symbol -> BioSymbols.iscertain(symbol), BioSymbols.alphabet(BioSymbols.AminoAcid))
ProgressMeter.@showprogress for k in Primes.primes(3, 7)
    kmer = Tuple(BioSequences.randaaseq(k))
    result = BenchmarkTools.@benchmark kmer_to_index($kmer, $ALPHABET)
    push!(ks, k)
    push!(results, result)
end

Progress: 100%|█████████████████████████████████████████| Time: 0:00:07


In [54]:
rand(results)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     83.997 ns (0.00% GC)
  median time:      84.309 ns (0.00% GC)
  mean time:        85.397 ns (0.00% GC)
  maximum time:     308.899 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     963

In [55]:
xs = []
ys = []
subsampling_size = 100
for (k, result) in zip(ks, results)
    for time in rand(result.times, subsampling_size)
        push!(xs, k)
        push!(ys, time)
    end
end

StatsPlots.scatter(
    xs,
    ys,
    title = "k-length vs index-lookup time for amino acid alphabet",
    ylabel="nanoseconds per lookup (subsampled)",
    xlabel="k length",
    legend=false
)