---
layout: post  
title: Indexing Kmers  
date: 2020-12-05
author: Cameron Prybol  

---

- Assume a known kmer length, `k`, and a known alphabet (either the DNA nucleotides `ACGT` or the amino acids)
- Kmers can be stored as:
    - the kmer itself, which will be a
        - tuple of characters
        - static vector of characters
    - their integer index among all possible kmers of length `k`
    - a sparse bit-vector
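
To make the options above concrete, here is a minimal sketch of each representation for the 3-mer `ACG`. The variable names are illustrative, and the 1-based lexicographic index convention is an assumption chosen to match the `index_to_kmer` function later in this post. A static vector would need an extra package (for example StaticArrays), so it is left out.

```julia
using SparseArrays  # standard library, used for the sparse bit-vector

alphabet = ['A', 'C', 'G', 'T']
k = 3

# 1. The kmer itself, as an immutable tuple of characters.
kmer_tuple = ('A', 'C', 'G')

# 2. Its 1-based integer index in lexicographic order (assumed convention).
#    Each position contributes (letter rank - 1) * 4^(positions remaining).
kmer_index = 1 + sum(
    (findfirst(isequal(char), alphabet) - 1) * length(alphabet)^(k - position)
    for (position, char) in enumerate(kmer_tuple)
)

# 3. A set of kmers as a sparse bit-vector with one slot per possible kmer;
#    only the observed kmers are actually stored.
observed = sparsevec([kmer_index], [true], length(alphabet)^k)
```

The tuple keeps the letters readable, the index is a single `Int`, and the bit-vector describes a whole collection of kmers rather than a single one.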

Objective:
- given a random kmer, check whether:
    - the kmer is in a list of kmers
        - which representation has the fastest search time?
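
As a rough sketch of what "is this kmer in the list" can look like, the block below runs the same membership query four ways on kmers stored as integer indices. The three indices are ACG, GAT, and TTA in the 1-based lexicographic ordering of 3-mers; the variable names are hypothetical, and the complexity notes in the comments are the usual textbook figures, not measurements.

```julia
# Indices of GAT, ACG, and TTA among the 4^3 possible 3-mers.
observed_indices = [36, 7, 61]
query_index = 36  # GAT
n_possible_kmers = 4^3

# Option 1: linear scan over an unsorted vector, O(n) comparisons.
found_by_scan = query_index in observed_indices

# Option 2: binary search over a sorted vector, O(log n) comparisons.
sorted_indices = sort(observed_indices)
found_by_binary_search = !isempty(searchsorted(sorted_indices, query_index))

# Option 3: hash lookup in a Set, O(1) on average.
index_set = Set(observed_indices)
found_by_hash = query_index in index_set

# Option 4: direct addressing into a bit-vector, O(1), at the cost of one
# bit per possible kmer regardless of how many were observed.
index_bits = falses(n_possible_kmers)
for index in observed_indices
    index_bits[index] = true
end
found_by_bit = index_bits[query_index]
```

Wrapping each lookup in `BenchmarkTools.@btime` is one way to compare them on realistic data sizes.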

In [1]:
import Pkg
pkgs = [
    "BenchmarkTools",
    "BioSequences"
]
Pkg.add(pkgs)
for pkg in pkgs
    eval(Meta.parse("import $pkg"))
end

In [None]:
# Iterators.product varies the first position fastest (AAA, CAA, GAA, ...),
# so these aren't in the order that I'd like
# kmers = Iterators.product([ALPHABET for i in 1:k]...)
# vec(collect(kmers))
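
The ordering problem noted in the comment above comes from `Iterators.product` varying its first argument fastest. One possible workaround, sketched here as an assumption and not as the approach this post ends up using, is to reverse each tuple so that the fastest-varying letter lands in the last position.

```julia
ALPHABET = ['A', 'C', 'G', 'T']
k = 3

# Iterators.product varies its first argument fastest: AAA, CAA, GAA, ...
product_order = vec(collect(Iterators.product(fill(ALPHABET, k)...)))

# Reversing each tuple moves the fastest-varying letter to the end,
# which gives lexicographic order: AAA, AAC, AAG, ...
lexicographic_order = vec([reverse(kmer) for kmer in Iterators.product(fill(ALPHABET, k)...)])
```

This materializes all `4^k` kmers in memory, so it only suits small `k`.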

In [176]:
ALPHABET = ['A', 'C', 'G', 'T']

4-element Array{Char,1}:
 'A'
 'C'
 'G'
 'T'

In [172]:
k = 3

3

In [246]:
function index_to_kmer(index::Int, k::Int, alphabet::Vector{Char})
    @assert k > 0 "invalid k: $k"
    alphabet_size = length(alphabet)
    max_index = alphabet_size^k
    @assert 0 < index <= max_index "invalid index: $index not within 1:$(max_index)"
    kmer = Vector{eltype(alphabet)}(undef, k)
    # fill positions left to right; i counts the positions still to fill
    for i in k:-1:1
        # number of kmers that share each possible letter at this position
        divisor = alphabet_size^(i-1)
        # integer ceiling division avoids floating-point rounding
        alphabet_index = cld(index, divisor)
        index = index % divisor
        # an index of 0 (zero remainder from the previous step) selects the last letter
        if alphabet_index == 0
            alphabet_index = alphabet_size
        end
        kmer[k-i+1] = alphabet[alphabet_index]
    end
    return kmer
end

index_to_kmer (generic function with 2 methods)
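
Searching by index also needs the reverse mapping, from a kmer back to its index. The function below is a hypothetical companion (`kmer_to_index` does not appear elsewhere in this post) that assumes the same 1-based lexicographic convention as `index_to_kmer`. It treats each letter as a base-`n` digit, so for letter ranks `a_1 ... a_k` the index is `1 + sum((a_j - 1) * n^(k - j))`.

```julia
# Hypothetical inverse of index_to_kmer: map a kmer to its 1-based index.
function kmer_to_index(kmer, alphabet::Vector{Char})
    alphabet_size = length(alphabet)
    index = 0
    for char in kmer
        alphabet_index = findfirst(isequal(char), alphabet)
        alphabet_index === nothing && error("invalid character: $char")
        # Horner's rule: shift the running value one place, then add this digit
        index = index * alphabet_size + (alphabet_index - 1)
    end
    return index + 1
end

kmer_to_index(['A', 'A', 'A'], ['A', 'C', 'G', 'T'])  # 1
kmer_to_index(['C', 'A', 'A'], ['A', 'C', 'G', 'T'])  # 17
kmer_to_index("TTT", ['A', 'C', 'G', 'T'])            # 64
```

Because the loop only iterates over characters, the same function accepts a vector, a tuple, or a string.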

In [248]:
i = 0
for c1 in ALPHABET
    for c2 in ALPHABET
        for c3 in ALPHABET
            i += 1
            is_match = [c1, c2, c3] == index_to_kmer(i, k, ALPHABET)
            println(i, "\t", c1, c2, c3, "\t", is_match)
        end
    end
end

1	AAA	true
2	AAC	true
3	AAG	true
4	AAT	true
5	ACA	true
6	ACC	true
7	ACG	true
8	ACT	true
9	AGA	true
10	AGC	true
11	AGG	true
12	AGT	true
13	ATA	true
14	ATC	true
15	ATG	true
16	ATT	true
17	CAA	true
18	CAC	true
19	CAG	true
20	CAT	true
21	CCA	true
22	CCC	true
23	CCG	true
24	CCT	true
25	CGA	true
26	CGC	true
27	CGG	true
28	CGT	true
29	CTA	true
30	CTC	true
31	CTG	true
32	CTT	true
33	GAA	true
34	GAC	true
35	GAG	true
36	GAT	true
37	GCA	true
38	GCC	true
39	GCG	true
40	GCT	true
41	GGA	true
42	GGC	true
43	GGG	true
44	GGT	true
45	GTA	true
46	GTC	true
47	GTG	true
48	GTT	true
49	TAA	true
50	TAC	true
51	TAG	true
52	TAT	true
53	TCA	true
54	TCC	true
55	TCG	true
56	TCT	true
57	TGA	true
58	TGC	true
59	TGG	true
60	TGT	true
61	TTA	true
62	TTC	true
63	TTG	true
64	TTT	true
