-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] CQF based storage #1675
[MRG] CQF based storage #1675
Conversation
I think this is basically working and looks pretty promising! Used https://gist.github.com/betatim/65b970943bee792e0d83c82fc8688935 to do some benchmarking and try to understand how the CQF works/behaves. (The way to measure extra mem used is pretty unreliable...watch the process' virtual mem in False positive rate after storing
For roughly same amount of memory used per data structure we get an order of magnitude less false positives?! Quite positive ... |
Updated gist now contains script for timing measurements:
load and query are the number of kmers loaded into the filter and the number of kmers queried. t_X is time taken. queryNP = query kmers not present, queryP = query kmers present. |
Using |
khmer/_oxli/graphs.pxd
Outdated
@@ -42,16 +42,16 @@ cdef extern from "oxli/hashtable.hh" namespace "oxli": | |||
void count(const char *) | |||
void count(HashIntoType) | |||
bool add(const char *) | |||
bool add(HashIntoType) | |||
#bool add(HashIntoType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@camillescott or @luizirber any insight why I need to comment out the second overload? Without this I get an error telling me that it can't convert a const char*
into HashIntoType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A "hack" found on SO: change the second line to bool add2 "add"(HashIntoType)
. Somehow tricks cython but because you provide a "real name" that is the same as above the compiler figures it all out. This seems to make the compiler happy but segfaults the second time I try to add a kmer :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Segfault also happens with the original code on OSX.
ef21660
to
1e15ca1
Compare
khmer/_oxli/graphs.pxd
Outdated
@@ -42,16 +42,16 @@ cdef extern from "oxli/hashtable.hh" namespace "oxli": | |||
void count(const char *) | |||
void count(HashIntoType) | |||
bool add(const char *) | |||
bool add(HashIntoType) | |||
bool add2 "add"(HashIntoType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dat hack?!
khmer/_oxli/graphs.pyx
Outdated
""" | ||
if isinstance(kmer, (unicode, str)): | ||
data = _bstring(kmer) | ||
return deref(self.c_table).add(deref(self.c_table).hash_dna(data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does one of the level 7 cython wizards know why you have to convert the string to a hash yourself instead of getting the overloaded method to do our bidding for us?
khmer/__init__.py
Outdated
@@ -359,6 +363,17 @@ def __new__(cls, k, starting_size, n_tables): | |||
return counttable | |||
|
|||
|
|||
class QFCounttable(_QFCounttable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that now (in the dawn of the Glorious Revolution), we'll be removing this pattern -- this __new__
method should just be implemented in the __cinit__
.
khmer/_oxli/graphs.pyx
Outdated
"""Calculate the k-mer abundance distribution of the given file_name.""" | ||
read_parser = FastxParser(file_name) | ||
cdef uint64_t * x = deref(self.c_table).abundance_distribution[CpFastxReader]( | ||
read_parser._this, (<CPyHashtable_Object>tracking).hashtable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need help again @camillescott :-/ tracking
is a Nodegraph
which I'd like to somehow unpack so I can take the hashtable
attribute and pass it along.
A bit like:
khmer_KHashtable_Object * tracking_obj = NULL;
PyArg_ParseTuple(args, "O!", &khmer_KHashtable_Type, &tracking_obj)
// now use tracking_obj->hashtable
but how? It seems silly that we can't reach into these extension types 😕
1c562b7
to
07ffa8f
Compare
Codecov Report
@@ Coverage Diff @@
## master #1675 +/- ##
=========================================
- Coverage 0.05% 0.05% -0.01%
=========================================
Files 90 90
Lines 11336 11439 +103
Branches 2992 3062 +70
=========================================
Hits 6 6
- Misses 11330 11433 +103
Continue to review full report at Codecov.
|
sse4, cqf, 9000000, 1000000, 5.88, 0.66, 0.70 Whether or not we use the SSE4.2 extension seems to make only a very small difference. For now disabling it as we would need a good way to detect whether or not the CPU supports it (if someone has a good idea on that let me know). |
Ready for review! @luizirber, @camillescott (I seek approval from the cython deity 🙇), @standage , or @ctb There is a "detail" to sort out with |
+1 for review
|
My conclusion from splatlab/cqf#4 is that there is no way to get the CQF to behave like a BF which allows the user to continue inserting items beyond the point where the FP rate "explodes". The good thing is we have a way to gracefully exit when we reach that point. |
good to know & good to have!
|
Can we decide on how to handle that situation when we actually expose this via one of the scripts? |
How about raising an exception and then having scripts output some useful
information e.g. total # of k-mers?
But I'm leary of putting too much effort into this until we have a try at
it ourselves. No idea what to expect / how big a problem it really will
be.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is immense. Great work!
General comments:
- There are several methods that throw C++ exceptions. We should probably handle these properly in the extern wrapper (ie the
+MemoryError
, etc syntax) - CQFilter should be derived from Hashtable. Seeing as you've essentially (very helpfully) done the job of cythonizing Hashtable anyhow, I'd suggest moving most of that code into an extension class called
Hashtable
that throws aNotImplemented
in its__cinit__
.
include/oxli/storage.hh
Outdated
|
||
public: | ||
QFStorage(int size) { | ||
qf_init(&cf, (1ULL << size), size+8, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some brief code comments on these init params?
khmer/_oxli/graphs.pxd
Outdated
@@ -85,6 +85,43 @@ cdef extern from "oxli/hashtable.hh" namespace "oxli": | |||
cdef cppclass CpNodetable "oxli::Nodetable" (CpHashtable): | |||
CpNodetable(WordLength, vector[uint64_t]) | |||
|
|||
cdef cppclass CpQFCounttable "oxli::QFCounttable": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be derived from CpHashtable
(see above with Nodetable and Counttable), which will allow you to omit all these redefinitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried that and you end up with a bunch of these:
khmer/_oxli/graphs.pyx:58:43: Cannot assign type 'char *' to 'HashIntoType'
Error compiling Cython file:
------------------------------------------------------------
...
For Nodetables and Counttables, this function will fail if the
supplied k-mer contains non-ACGT characters.
"""
if isinstance(kmer, str):
temp = kmer.encode('utf-8')
return deref(self.c_table).get_count(<char*>temp)
not sure why defining them again in the same order (instead of inheriting) fixes this :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, agree that this is stupid but don't know how to improve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This referring to the class hierarchy or to the temp variable thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class hierarachy
khmer/_oxli/graphs.pyx
Outdated
self.c_table.reset(new CpQFCounttable(k, int(log(starting_size, 2)))) | ||
|
||
def add(self, kmer): | ||
"""Increment the count of this k-mer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that in liboxli, add
returns whether the k-mer was new, while count
does not. I don't think this is explicitly documented beyond just being that way in the code, however.
value of the kmer. | ||
""" | ||
if isinstance(kmer, str): | ||
temp = kmer.encode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe you should be able to write this as cdef char * temp = ...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't define a cdef
inside the if statement, so you could declare it outside and then assign here but I'd claim that is even more clunky than the cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair 'nuff
Quick comment: I started the wrapper for C++ -> Python exceptions in the HLL Cython PR, this is how you declare it on the Cython class. |
Like the exception handling, for the moment I will add |
This way we can have one implementation and stick to the API
Ready for review! @luizirber, @camillescott, @standage , or @ctb |
An alternative to merging this would be to pull out the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me other than the couple typing comments I left. I'm also open to just merging this as-is and changing those over in a subsequent PR which removes the CPython interface as well. re: @betatim's earlier comment, and suggestion from @ctb , I'm for just merging all this at once. In light of my comments being suggestions and perhaps better to be done in a new PR, I'm approving now.
Thoughts?
total_reads, | ||
n_consumed) | ||
|
||
def abundance_distribution(self, file_name, tracking): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not require this to be Hashtable tracking
? The CPython version gets deprecated via this PR anyhow.
raise ValueError('Expected file_name to be string, ' | ||
'got {} instead.'.format(type(file_name))) | ||
|
||
cdef CPyHashtable_Object* hashtable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ie here we can just pull the _this
pointer from the Cython extension class.
abunds.append(x[i]) | ||
return abunds | ||
|
||
def abundance_distribution_with_reads_parser(self, read_parser, tracking): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thoughts as above. Additionally, read_parser
can be required to be a FastxParser
from oxli.parsing
, with the underlying CpFastxReader
pulled from the this pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to remember why exactly I went with this. Part of this is there are a bunch of tests that start failing when I add:
from khmer._oxli.parsing import FastxParser as ReadParser
to khmer/init.py, which would seem to be a useful thing to do to keep all the people using ReadParser
s happy. Not sure this is super tricky, but can't work it out atm :-/ Merge now, improve later?
Looks like test failures in the assembly code cc @camillescott |
Related to the discussion between @standage and @camillescott on slack yesterday regarding bytes vs strings in the latest version of cython, I think. (nb. assembly code -> the other ASM) |
Nearly forgot the most important part: ✋ (high five) and thanks for the patient commenting :) Long may the cython revolution continue ✊ ;) |
This is a PR on top of #1667
With this the mechanics of calling:
works. It doesn't yet do anything particularly smart other than create a small CQF and put stuff into it.
make test
Did it pass the tests?make clean diff-cover
If it introduces new functionality inscripts/
is it tested?make format diff_pylint_report cppcheck doc pydocstyle
Is it wellformatted?
documented in
CHANGELOG.md
? See keepachangelogfor more details.
changes were made?
tested for streaming IO?)