=============================
ktable, simple k-mer counting
=============================

khmer ktables are a simple C++ library for counting k-mers in DNA
sequences. There's a complete Python wrapper and it should be pretty
darned fast; it's intended for genome-scale k-mer counting.

A ``ktable`` is a table of 4**k counters, one per possible k-mer. khmer
maps each k-mer into this table with a simple (and reversible) hash function.
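
To make the idea of a reversible hash concrete, here is a minimal pure-Python
sketch. It assumes a 2-bit encoding with A=0, C=1, G=2, T=3 and reads the
k-mer as a base-4 number; the encoding and bit order actually used inside
khmer may differ, and the names ``forward_hash_sketch`` and
``reverse_hash_sketch`` are made up for illustration.

```python
# Illustrative only: the letter-to-digit mapping below is an assumption,
# not necessarily the one khmer uses internally.
ALPHABET = "ACGT"
CODE = {base: digit for digit, base in enumerate(ALPHABET)}


def forward_hash_sketch(kmer):
    """Read a k-mer as a base-4 number, most significant base first."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]  # KeyError on anything but A/C/G/T
    return value


def reverse_hash_sketch(value, k):
    """Invert forward_hash_sketch for a k-mer of length k."""
    bases = []
    for _ in range(k):
        value, digit = divmod(value, 4)
        bases.append(ALPHABET[digit])
    return "".join(reversed(bases))
```

Under these assumptions every value from 0 to 4**k - 1 corresponds to exactly
one k-mer, which is why the table needs exactly 4**k counters.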

Right now, only the Python interface is documented here. The C++
interface is essentially identical; if you need to use it and want
it documented, drop me a line.

Counting Speed and Memory Usage
===============================

On the 5 Mb *Shewanella oneidensis* genome, khmer takes less than a second
to count all k-mers, for any k between 6 and 12. At k=13 it craps out
because the table goes over my default stack size limit.

Approximate memory usage can be calculated by finding the size of a
``long long`` on your machine and then multiplying that by 4**k.
For a 12 bp wordsize, 4**12 works out to 16384*1024 entries; on an
Intel-based processor running Linux, ``long long`` is 8 bytes, so memory
usage is approximately 128 MB.
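
The same arithmetic is easy to script. The sketch below uses the standard
``ctypes`` module to look up the size of a C ``long long`` on the current
machine; ``ktable_bytes`` is a hypothetical helper, not part of khmer, and the
estimate ignores any per-object overhead.

```python
import ctypes


def ktable_bytes(k, counter_size=ctypes.sizeof(ctypes.c_longlong)):
    """Estimate table size: one counter for each of the 4**k possible k-mers."""
    return 4 ** k * counter_size


for k in (6, 12, 13):
    print("k=%d: about %.2f MB" % (k, ktable_bytes(k) / (1024.0 * 1024.0)))
```

With 8-byte counters each step in k multiplies the table size by four, so
k=12 needs 128 MB and k=13 already needs 512 MB.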

Python interface
================

Essentially everything requires a ``ktable``.

::

    import khmer
    ktable = khmer.new_ktable(L)

These commands will create a new ``ktable`` of size 4**L, suitable
for counting L-mers.

Each ``ktable`` object has a few accessor functions:

* ``ktable.ksize()`` will return L.

* ``ktable.max_hash()`` will return the max hash value in the table, 4**L - 1.

* ``ktable.n_entries()`` will return the number of table entries, 4**L.

The forward and reverse hashing functions are directly accessible:

* ``hashval = ktable.forward_hash(kmer)`` will return the hash value
  of the given kmer.

* ``kmer = ktable.reverse_hash(hashval)`` will return the kmer that hashes
  to the given hashval.

There are also some counting functions:

* ``ktable.count(kmer)`` will increment the count associated with the given kmer
  by one.

* ``ktable.consume(sequence)`` will run through the sequence and count
  each kmer present.

* ``n = ktable.get(kmer|hashval)`` will return the count associated with the
  given kmer string or the given hashval, whichever is passed in.

* ``ktable.set(kmer|hashval, count)`` will set the count for the given kmer
  string or hashval.

In all of the cases above, 'kmer' is an L-length string, 'hashval' is
a non-negative integer, and 'sequence' is a DNA sequence containing ONLY
A/C/G/T.

**Note:** 'N' is not a legal DNA character as far as khmer is concerned!
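
As a rough mental model of the counting functions, the sketch below mimics
``count``, ``consume`` and ``get`` in pure Python on top of a plain list of
4**L counters. ``KTableSketch`` is a hypothetical stand-in, not khmer's
implementation; the A=0, C=1, G=2, T=3 encoding is an assumption, and raising
``ValueError`` on characters such as 'N' is a choice made here, since this
document does not say how khmer itself reacts to them.

```python
class KTableSketch(object):
    """Toy model of a ktable: a flat list with one counter per possible L-mer."""

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding

    def __init__(self, ksize):
        self.k = ksize
        self.counts = [0] * (4 ** ksize)

    def _hash(self, kmer):
        if len(kmer) != self.k:
            raise ValueError("expected a %d-mer, got %r" % (self.k, kmer))
        value = 0
        for base in kmer:
            if base not in self.CODE:
                raise ValueError("illegal DNA character %r" % base)
            value = value * 4 + self.CODE[base]
        return value

    def count(self, kmer):
        """Increment the counter for a single k-mer by one."""
        self.counts[self._hash(kmer)] += 1

    def consume(self, sequence):
        """Slide a window of width k along the sequence, counting each k-mer."""
        for start in range(len(sequence) - self.k + 1):
            self.count(sequence[start:start + self.k])

    def get(self, key):
        """Accept either a k-mer string or an integer hash value."""
        index = self._hash(key) if isinstance(key, str) else key
        return self.counts[index]


table = KTableSketch(3)
table.consume("ACGACGT")  # 3-mers: ACG, CGA, GAC, ACG, CGT
```

A sequence of length n contributes n - L + 1 overlapping L-mers, so the
7-base string above adds five counts to the table in total.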

And, finally, there are some set operations:

* ``ktable.clear()`` empties the ktable.

* ``ktable.update(other)`` adds all of the entries in ``other`` into
  ``ktable``. The wordsize must be the same for both ktables.

* ``intersection = ktable.intersect(other)`` returns a ktable where
  only nonzero entries in both ktables are kept. The count for each
  entry is the sum of the counts in ``ktable`` and ``other``.
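
The semantics of ``update`` and ``intersect`` described above can be spelled
out on plain lists of counts. This is only a sketch of the documented
behaviour, with hypothetical function names; it says nothing about how khmer
implements these operations.

```python
def update_sketch(counts, other):
    """Add every count in `other` into `counts`, in place."""
    if len(counts) != len(other):
        raise ValueError("word sizes differ")
    for i, n in enumerate(other):
        counts[i] += n


def intersect_sketch(counts, other):
    """Keep entries that are nonzero in both tables; sum their counts."""
    if len(counts) != len(other):
        raise ValueError("word sizes differ")
    return [a + b if a and b else 0 for a, b in zip(counts, other)]


a = [2, 0, 1, 0]
b = [3, 4, 0, 0]
both = intersect_sketch(a, b)  # computed before `a` is modified
update_sketch(a, b)
```

Only the first entry is nonzero in both inputs, so it is the only one that
survives the intersection, while the update keeps every entry from either side.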

An Example
==========

This short code example will count all 6-mers present in the given
DNA sequence, and then print them all out along with their prevalence.

::

    import khmer

    # make a new ktable, L=6
    ktable = khmer.new_ktable(6)

    # count all k-mers in the given string
    ktable.consume("ATGAGAGACACAGGGAGAGACCCAATTAGAGAATTGGACC")

    # run through all entries. if they have nonzero presence, print.
    for i in range(ktable.n_entries()):
        n = ktable.get(i)
        if n:
            print("%s is present %d times." % (ktable.reverse_hash(i), n))

::

    CTB, 3/2005