[MRG] CQF based storage #1675

betatim · 2017-04-18T12:39:17Z

This is a PR on top of #1667

With this the mechanics of calling:

import khmer
cqf = khmer.QFCounttable(3, ...)
cqf.add("ACG")
cqf.get("ACG")

works. It doesn't yet do anything particularly smart other than create a small CQF and put stuff into it.

load and save support

Is it mergeable?
make test: Did it pass the tests?
make clean diff-cover: If it introduces new functionality in scripts/ is it tested?
make format diff_pylint_report cppcheck doc pydocstyle: Is it well formatted?
For substantial changes or changes to the command-line interface, is it documented in CHANGELOG.md? See keepachangelog for more details.
Was a spellchecker run on the source code and documentation after changes were made?
Do the changes respect streaming IO? (Are they tested for streaming IO?)

betatim · 2017-04-21T08:55:05Z

I think this is basically working and looks pretty promising!

Used https://gist.github.com/betatim/65b970943bee792e0d83c82fc8688935 to do some benchmarking and try to understand how the CQF works/behaves. (The way to measure extra mem used is pretty unreliable...watch the process' virtual mem in top instead)

False positive rate after storing "entries" unique kmers, evaluated using 20k kmers that are definitely not in the filter:

entries	BF	CQF
5000000	0.00145	0.00015
8000000	0.00315	0.0003
9000000	0.00395	0.0003

For roughly the same amount of memory used per data structure we get an order of magnitude fewer false positives?! Quite positive ...
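To put a number on "an order of magnitude", the ratios can be read straight off the table above. A minimal Python sketch using only the figures quoted in this comment (the variable and function names are made up for illustration):

```python
# False positive rates copied from the table above: entries -> (BF, CQF).
fp_rates = {
    5_000_000: (0.00145, 0.00015),
    8_000_000: (0.00315, 0.0003),
    9_000_000: (0.00395, 0.0003),
}

N_PROBES = 20_000  # number of absent kmers queried, as stated above


def fp_ratio(bf_rate, cqf_rate):
    """How many times more false positives the Bloom filter produced."""
    return bf_rate / cqf_rate


ratios = {n: fp_ratio(bf, cqf) for n, (bf, cqf) in fp_rates.items()}
# Convert the CQF rates back into raw false-positive counts.
cqf_hits = {n: round(cqf * N_PROBES) for n, (bf, cqf) in fp_rates.items()}

for n in fp_rates:
    print(f"{n:>9} entries: BF/CQF = {ratios[n]:.1f}x, CQF false hits = {cqf_hits[n]}")
```

The ratios come out at roughly 10x to 13x. Note that with 20k probes the CQF rates correspond to only 3 and 6 false hits, so those figures carry a fair amount of counting noise.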

betatim · 2017-04-21T16:01:45Z

Updated gist now contains script for timing measurements: python fp.py [bf|cqf]

kind	load	query	t_load (s)	t_queryNP (s)	t_queryP (s)
bf	9000000	1000000	6.6050	0.77769	0.805
cqf	9000000	1000000	5.5954	0.6589	0.72047

load and query are the number of kmers loaded into the filter and the number of kmers queried. t_X is time taken. queryNP = query kmers not present, queryP = query kmers present.
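The gist itself is not reproduced in this thread. As a rough illustration of what the three timing columns measure, here is a sketch that times loading, querying absent kmers, and querying present kmers. ToyTable is a hypothetical dict-backed stand-in (not khmer's API), the kmers are random strings, and the sizes are scaled down by a factor of 1000 so it runs instantly.

```python
import random
import time
from collections import Counter


class ToyTable:
    """Hypothetical stand-in for a count table (exact, dict-backed)."""

    def __init__(self):
        self._counts = Counter()

    def add(self, kmer):
        self._counts[kmer] += 1

    def get(self, kmer):
        return self._counts.get(kmer, 0)


def random_kmers(n, k, alphabet, rng):
    """Generate n random strings of length k over the given alphabet."""
    return ["".join(rng.choice(alphabet) for _ in range(k)) for _ in range(n)]


def timed(func, items):
    """Call func on every item and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    for item in items:
        func(item)
    return time.perf_counter() - start


def benchmark(table, n_load=9000, n_query=1000, k=21, seed=1):
    rng = random.Random(seed)
    present = random_kmers(n_load, k, "ACGT", rng)
    # Absent kmers start with 'N', so they can never equal a loaded kmer.
    absent = ["N" + kmer[1:] for kmer in random_kmers(n_query, k, "ACGT", rng)]
    return {
        "t_load": timed(table.add, present),
        "t_queryNP": timed(table.get, absent),
        "t_queryP": timed(table.get, present[:n_query]),
    }


results = benchmark(ToyTable())
print(results)
```

Swapping ToyTable for a real table object with add and get methods would give numbers comparable in spirit to the table above, though not necessarily in method.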

betatim · 2017-05-11T20:33:33Z

Using -Ofast and -msse4.2 -D__SSE4_2_ as compile options for all of khmer seems to slow things down, whereas for the CQF they make a difference. Check that the options actually take effect and how they interact with other arguments.

betatim · 2017-06-08T20:16:11Z

khmer/_oxli/graphs.pxd

@@ -42,16 +42,16 @@ cdef extern from "oxli/hashtable.hh" namespace "oxli":
        void count(const char *)
        void count(HashIntoType)
        bool add(const char *)
-        bool add(HashIntoType)
+        #bool add(HashIntoType)


@camillescott or @luizirber any insight why I need to comment out the second overload? Without this I get an error telling me that it can't convert a const char* into HashIntoType.

A "hack" found on SO: change the second line to bool add2 "add"(HashIntoType). This somehow tricks cython, and because the "real name" you provide is the same as the one above, the compiler figures it all out. This seems to make the compiler happy but segfaults the second time I try to add a kmer :-/

Segfault also happens with the original code on OSX.

betatim · 2017-06-09T15:38:05Z

khmer/_oxli/graphs.pxd

@@ -42,16 +42,16 @@ cdef extern from "oxli/hashtable.hh" namespace "oxli":
        void count(const char *)
        void count(HashIntoType)
        bool add(const char *)
-        bool add(HashIntoType)
+        bool add2 "add"(HashIntoType)


betatim · 2017-06-12T13:25:59Z

khmer/_oxli/graphs.pyx

+        """
+        if isinstance(kmer, (unicode, str)):
+            data = _bstring(kmer)
+            return deref(self.c_table).add(deref(self.c_table).hash_dna(data))


Does one of the level 7 cython wizards know why you have to convert the string to a hash yourself instead of getting the overloaded method to do our bidding for us?

camillescott · 2017-06-14T16:14:38Z

khmer/__init__.py

@@ -359,6 +363,17 @@ def __new__(cls, k, starting_size, n_tables):
        return counttable


+class QFCounttable(_QFCounttable):


Note that now (in the dawn of the Glorious Revolution), we'll be removing this pattern -- this __new__ method should just be implemented in the __cinit__.

betatim · 2017-06-15T15:16:03Z

khmer/_oxli/graphs.pyx

+        """Calculate the k-mer abundance distribution of the given file_name."""
+        read_parser = FastxParser(file_name)
+        cdef uint64_t * x = deref(self.c_table).abundance_distribution[CpFastxReader](
+                read_parser._this, (<CPyHashtable_Object>tracking).hashtable)


Need help again @camillescott :-/ tracking is a Nodegraph which I'd like to somehow unpack so I can take the hashtable attribute and pass it along.

A bit like:

khmer_KHashtable_Object * tracking_obj = NULL;
PyArg_ParseTuple(args, "O!", &khmer_KHashtable_Type, &tracking_obj);
// now use tracking_obj->hashtable

but how? It seems silly that we can't reach into these extension types 😕

codecov-io · 2017-06-16T14:04:48Z

Codecov Report

Merging #1675 into master will decrease coverage by <.01%.
The diff coverage is 0%.

@@            Coverage Diff            @@
##           master   #1675      +/-   ##
=========================================
- Coverage    0.05%   0.05%   -0.01%     
=========================================
  Files          90      90              
  Lines       11336   11439     +103     
  Branches     2992    3062      +70     
=========================================
  Hits            6       6              
- Misses      11330   11433     +103

Impacted Files	Coverage Δ
include/oxli/hashtable.hh	`0% <0%> (ø)`	⬆️
khmer/_oxli/oxli_exception_convert.cc	`0% <0%> (ø)`	⬆️
khmer/__init__.py	`2.66% <0%> (-0.03%)`	⬇️
src/oxli/storage.cc	`0% <0%> (ø)`	⬆️
include/oxli/storage.hh	`0% <0%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1a9a2bb...27a4d8c.

betatim · 2017-06-16T14:38:21Z

options, kind, load, query, t_load (s), t_queryNP (s), t_queryP (s)
sse4, cqf, 9000000, 1000000, 5.88, 0.66, 0.70
avx, cqf, 9000000, 1000000, 6.02, 0.66, 0.72
w/o, cqf, 9000000, 1000000, 5.90, 0.66, 0.72

Whether or not we use the SSE4.2 extension seems to make only a very small difference. For now disabling it as we would need a good way to detect whether or not the CPU supports it (if someone has a good idea on that let me know).
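One possible approach, offered as a sketch rather than something decided in this PR: on Linux the kernel lists CPU features in /proc/cpuinfo, where SSE4.2 appears as the flag sse4_2. The helper names below are hypothetical, and this only covers a runtime check on Linux; it does not address compile-time selection or other platforms.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_sse42(path="/proc/cpuinfo"):
    """True if the (Linux) CPU reports SSE4.2; False if it cannot be determined."""
    try:
        with open(path) as handle:
            return "sse4_2" in cpu_flags(handle.read())
    except OSError:
        return False


# Small hand-written sample in the /proc/cpuinfo layout.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 sse4_1 sse4_2 popcnt\n"
print("sse4_2" in cpu_flags(sample))
```

Returning False when the file is missing keeps the fallback conservative: the build or runtime would simply take the non-SSE4.2 path.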

betatim · 2017-06-22T12:59:20Z

Ready for review! @luizirber, @camillescott (I seek approval from the cython deity 🙇), @standage , or @ctb

There is a "detail" to sort out with n_occupied() and how to handle the data structure becoming full. However there is enough code to start reviewing it and then we sort things after/during. This doesn't make the CQF available from any of the scripts so I think we are OK to merge this and then doctor around a bit more after.

ctb · 2017-06-22T13:00:15Z

+1 for review

betatim · 2017-06-23T12:54:19Z

My conclusion from splatlab/cqf#4 is that there is no way to get the CQF to behave like a BF which allows the user to continue inserting items beyond the point where the FP rate "explodes". The good thing is we have a way to gracefully exit when we reach that point.

ctb · 2017-06-23T12:55:54Z

good to know & good to have!

betatim · 2017-06-23T14:00:09Z

Can we decide on how to handle that situation when we actually expose this via one of the scripts?

ctb · 2017-06-23T14:04:03Z

How about raising an exception and then having scripts output some useful information, e.g. total # of k-mers? But I'm leery of putting too much effort into this until we have a try at it ourselves. No idea what to expect / how big a problem it really will be.
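A minimal sketch of that idea in Python. TableFullError, BoundedTable and consume are hypothetical names used for illustration; the real CQF signals fullness through its own mechanism, which is not shown in this thread.

```python
class TableFullError(Exception):
    """Hypothetical error raised when no more distinct k-mers fit."""


class BoundedTable:
    """Toy exact counter that refuses new keys beyond max_items."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._counts = {}

    def add(self, kmer):
        if kmer not in self._counts and len(self._counts) >= self.max_items:
            raise TableFullError(
                "table is full after {} distinct k-mers".format(len(self._counts)))
        self._counts[kmer] = self._counts.get(kmer, 0) + 1


def consume(table, sequence, k):
    """Add every k-mer of sequence; return (n_added, error or None)."""
    n_added = 0
    for i in range(len(sequence) - k + 1):
        try:
            table.add(sequence[i:i + k])
        except TableFullError as err:
            # Stop early but report how far we got.
            return n_added, err
        n_added += 1
    return n_added, None


n_added, error = consume(BoundedTable(max_items=3), "ACGTACGGT", k=3)
if error is not None:
    print("stopped after {} k-mers: {}".format(n_added, error))
```

A script could catch the exception at this level and print the k-mer total before exiting, which keeps the table itself free of any reporting logic.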

camillescott

Wow, this is immense. Great work!

General comments:

There are several methods that throw C++ exceptions. We should probably handle these properly in the extern wrapper (ie the +MemoryError, etc syntax)
CQFilter should be derived from Hashtable. Seeing as you've essentially (very helpfully) done the job of cythonizing Hashtable anyhow, I'd suggest moving most of that code into an extension class called Hashtable that throws a NotImplemented in its __cinit__.

camillescott · 2017-06-26T17:37:30Z

include/oxli/storage.hh

+
+public:
+  QFStorage(int size) {
+    qf_init(&cf, (1ULL << size), size+8, 0);


Maybe some brief code comments on these init params?
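For readers following along, here is one reading of the call, assuming the qf_init arguments are, in order, the number of slots, the number of key bits and the number of value bits. That ordering is an assumption taken from the call site, not something confirmed in this thread. Under that reading the filter has 2**size slots and keeps 8 bits of each hash beyond the slot index. The function names are made up for illustration.

```python
def qf_init_args(size):
    """Mirror the arguments of qf_init(&cf, (1ULL << size), size+8, 0)."""
    nslots = 1 << size        # number of slots: 2**size
    key_bits = size + 8       # assumed meaning: bits of each hash that are kept
    value_bits = 0            # third numeric argument is passed as 0
    return nslots, key_bits, value_bits


def log2_floor(n):
    """Integer floor(log2(n)) for n >= 1, without floating point."""
    return n.bit_length() - 1


size = log2_floor(1000000)
nslots, key_bits, value_bits = qf_init_args(size)
print(size, nslots, key_bits, value_bits)
```

log2_floor is included because the constructor takes an exponent rather than a slot count; using bit_length avoids any rounding concerns that a floating-point logarithm might raise.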

camillescott · 2017-06-26T17:39:45Z

khmer/_oxli/graphs.pxd

@@ -85,6 +85,43 @@ cdef extern from "oxli/hashtable.hh" namespace "oxli":
    cdef cppclass CpNodetable "oxli::Nodetable" (CpHashtable):
        CpNodetable(WordLength, vector[uint64_t])

+    cdef cppclass CpQFCounttable "oxli::QFCounttable":


this should be derived from CpHashtable (see above with Nodetable and Counttable), which will allow you to omit all these redefinitions

Tried that and you end up with a bunch of these:

khmer/_oxli/graphs.pyx:58:43: Cannot assign type 'char *' to 'HashIntoType'

Error compiling Cython file:
------------------------------------------------------------
...
        For Nodetables and Counttables, this function will fail if the supplied k-mer contains non-ACGT characters.
        """
        if isinstance(kmer, str):
            temp = kmer.encode('utf-8')
            return deref(self.c_table).get_count(<char*>temp)

not sure why defining them again in the same order (instead of inheriting) fixes this :-/

Basically, agree that this is stupid but don't know how to improve.

This referring to the class hierarchy or to the temp variable thing?

The class hierarchy

camillescott · 2017-06-26T17:41:25Z

khmer/_oxli/graphs.pyx

+        self.c_table.reset(new CpQFCounttable(k, int(log(starting_size, 2))))
+
+    def add(self, kmer):
+        """Increment the count of this k-mer.


Note that in liboxli, add returns whether the k-mer was new, while count does not. I don't think this is explicitly documented beyond just being that way in the code, however.
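The distinction can be illustrated with a toy, dict-backed table in Python. This is only a sketch of the described behaviour, not liboxli's implementation; with a probabilistic table, "new" would mean new as far as the table can tell.

```python
class ToyTable:
    """Dict-backed illustration of the add/count contract described above."""

    def __init__(self):
        self._counts = {}

    def count(self, kmer):
        """Increment the count; returns nothing."""
        self._counts[kmer] = self._counts.get(kmer, 0) + 1

    def add(self, kmer):
        """Increment the count; return True if kmer was not seen before."""
        is_new = kmer not in self._counts
        self.count(kmer)
        return is_new

    def get(self, kmer):
        return self._counts.get(kmer, 0)


table = ToyTable()
first = table.add("ACG")    # k-mer is new
second = table.add("ACG")   # k-mer already present
table.count("ACG")          # increments without reporting novelty
print(first, second, table.get("ACG"))
```

Implementing add in terms of count, as above, is one way to keep a single increment path while still exposing both entry points.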

camillescott · 2017-06-26T17:42:22Z

khmer/_oxli/graphs.pyx

+        value of the kmer.
+        """
+        if isinstance(kmer, str):
+            temp = kmer.encode('utf-8')


i believe you should be able to write this as cdef char * temp = ...?

You can't define a cdef inside the if statement, so you could declare it outside and then assign here but I'd claim that is even more clunky than the cast.

luizirber · 2017-06-26T18:17:24Z

Quick comment: I started the wrapper for C++ -> Python exceptions in the HLL Cython PR, this is how you declare it on the Cython class.

betatim · 2017-07-06T12:33:40Z

I like the exception handling. For the moment I will add except + to those methods that need it; once #1730 is merged, I'll switch to that.

This way we can have one implementation and stick to the API

betatim · 2017-07-13T15:20:12Z

Ready for review! @luizirber, @camillescott, @standage , or @ctb

betatim · 2017-07-13T15:55:39Z

An alternative to merging this would be to pull out the Hashtable stuff and get that merged first, then think about CQF stuff more (if needed). With the Hashtable stuff merged we can bring about more of the glorious cython future for other parts.

camillescott

Looks good to me other than the couple typing comments I left. I'm also open to just merging this as-is and changing those over in a subsequent PR which removes the CPython interface as well. re: @betatim's earlier comment, and suggestion from @ctb , I'm for just merging all this at once. In light of my comments being suggestions and perhaps better to be done in a new PR, I'm approving now.

Thoughts?

camillescott · 2017-07-20T21:27:34Z

khmer/_oxli/graphs.pyx

+                                                           total_reads,
+                                                           n_consumed)
+
+    def abundance_distribution(self, file_name, tracking):


Why not require this to be Hashtable tracking? The CPython version gets deprecated via this PR anyhow.

camillescott · 2017-07-20T21:28:16Z

khmer/_oxli/graphs.pyx

+            raise ValueError('Expected file_name to be string, '
+                             'got {} instead.'.format(type(file_name)))
+
+        cdef CPyHashtable_Object* hashtable


ie here we can just pull the _this pointer from the Cython extension class.

camillescott · 2017-07-20T21:29:37Z

khmer/_oxli/graphs.pyx

+            abunds.append(x[i])
+        return abunds
+
+    def abundance_distribution_with_reads_parser(self, read_parser, tracking):


Same thoughts as above. Additionally, read_parser can be required to be a FastxParser from oxli.parsing, with the underlying CpFastxReader pulled from the this pointer.

Trying to remember why exactly I went with this. Part of this is there are a bunch of tests that start failing when I add:

from khmer._oxli.parsing import FastxParser as ReadParser

to khmer/__init__.py, which would seem to be a useful thing to do to keep all the people using ReadParsers happy. Not sure this is super tricky, but can't work it out atm :-/ Merge now, improve later?

ctb · 2017-07-21T13:56:56Z

Looks like test failures in the assembly code cc @camillescott

betatim · 2017-07-21T14:04:34Z

Related to the discussion between @standage and @camillescott on slack yesterday regarding bytes vs strings in the latest version of cython, I think.

(nb. assembly code -> the other ASM)

see cython/cython#1790

betatim · 2017-07-27T14:53:37Z

Nearly forgot the most important part: ✋ (high five) and thanks for the patient commenting :) Long may the cython revolution continue ✊ ;)

betatim mentioned this pull request Apr 20, 2017

nslots and maximum value for key splatlab/cqf#3

Closed

betatim force-pushed the try/cqf_storage branch from 777cbcd to b679728 Compare June 8, 2017 14:12

betatim changed the base branch from try/cqf to master June 8, 2017 19:22

betatim force-pushed the try/cqf_storage branch 3 times, most recently from ef21660 to 1e15ca1 Compare June 9, 2017 15:36

betatim force-pushed the try/cqf_storage branch 2 times, most recently from 1c562b7 to 07ffa8f Compare June 16, 2017 09:01

betatim force-pushed the try/cqf_storage branch from a3ff82c to ed0b99c Compare June 22, 2017 11:12

betatim changed the title ~~[WIP] CQF based storage~~ [MRG] CQF based storage Jun 23, 2017

betatim force-pushed the try/cqf_storage branch from cdb88b1 to b2a57ba Compare July 6, 2017 14:19

betatim added 15 commits July 13, 2017 17:11

Add abundance_distribution method

3ac01e6

Add abundance_distribution_with_reads_parser

7cd08ee

Update c++ examples for CQF

a690c40

Move __new__ code to __cinit__

bcdb1a8

Switch to using ReadParser everywhere

401e990

Remove log2() for python2 compatibility

e8fe765

Disable SSE4.2 extensions

fbbca07

Move QF struct definition to header file

6157486

Add load/save to QFCounttable

a2200ed

Add CQF to liboxli (fixes make cpp-demo)

d942533

Temporary commit while working out noccupied details

7b4423d

Add methods to query n_occupied slots in CQF storage

7b771ef

Add CHANGELOG message

f5781ac

Switch definition of count and add in QFCounttable

755a7ab

This way we can have one implementation and stick to the API

Refactor inheritence structure of cython classes

4d68e48

betatim force-pushed the try/cqf_storage branch from 74a3ab8 to 0662938 Compare July 13, 2017 15:11

Add proper support for passing on custom exceptions

bb051c1

betatim force-pushed the try/cqf_storage branch from 0662938 to bb051c1 Compare July 13, 2017 15:18

camillescott approved these changes Jul 20, 2017

Merge branch 'master' into try/cqf_storage

8ab902b

Merge remote-tracking branch 'origin' into try/cqf_storage

27a4d8c

camillescott merged commit c19d0f1 into master Jul 26, 2017

camillescott deleted the try/cqf_storage branch July 26, 2017 21:57

standage mentioned this pull request Jul 27, 2017

Try out the Counting Quotient Filter as a replacement? for Bloom filter/CMS. #1667

Closed

8 tasks

betatim mentioned this pull request Aug 1, 2017

Bulk loading fails with QFCounttable #1751

Closed
