Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] CQF based storage #1675

Merged
merged 38 commits into from
Jul 26, 2017
Merged

[MRG] CQF based storage #1675

merged 38 commits into from
Jul 26, 2017

Conversation

betatim
Copy link
Member

@betatim betatim commented Apr 18, 2017

This is a PR on top of #1667

With this the mechanics of calling:

import khmer
cqf = khmer.QFCounttable(3, ...)
cqf.add("ACG")
cqf.get("ACG")

works. It doesn't yet do anything particularly smart other than create a small CQF and put stuff into it.

  • load and save support

  • Is it mergeable?
  • make test Did it pass the tests?
  • make clean diff-cover If it introduces new functionality in
    scripts/ is it tested?
  • make format diff_pylint_report cppcheck doc pydocstyle Is it well
    formatted?
  • For substantial changes or changes to the command-line interface, is it
    documented in CHANGELOG.md? See keepachangelog
    for more details.
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Do the changes respect streaming IO? (Are they
    tested for streaming IO?)

@betatim
Copy link
Member Author

betatim commented Apr 21, 2017

I think this is basically working and looks pretty promising!

Used https://gist.github.com/betatim/65b970943bee792e0d83c82fc8688935 to do some benchmarking and try to understand how the CQF works/behaves. (The way to measure extra mem used is pretty unreliable...watch the process' virtual mem in top instead)

False positive rate after storing entries unique kmers evaluated using 20k kmers that are definitely not in the filter:

entries BF CQF
5000000 0.00145 0.00015
8000000 0.00315 0.0003
9000000 0.00395 0.0003

For roughly same amount of memory used per data structure we get an order of magnitude less false positives?! Quite positive ...

@betatim
Copy link
Member Author

betatim commented Apr 21, 2017

Updated gist now contains script for timing measurements: python fp.py [bf|cqf]

kind load query t_load (s) t_queryNP (s) t_queryP (s)
bf 9000000 1000000 6.6050 0.77769 0.805
cqf 9000000 1000000 5.5954 0.6589 0.72047

load and query are the number of kmers loaded into the filter and the number of kmers queried. t_X is time taken. queryNP = query kmers not present, queryP = query kmers present.

@betatim
Copy link
Member Author

betatim commented May 11, 2017

Using -Ofast and -msse4.2 -D__SSE4_2_ as compile options for all of khmer seems to slow things down. Where as for the CQF they make a difference. Check that the commands actually take effect/how they interact with other arguments.

@betatim betatim changed the base branch from try/cqf to master June 8, 2017 19:22
@@ -42,16 +42,16 @@ cdef extern from "oxli/hashtable.hh" namespace "oxli":
void count(const char *)
void count(HashIntoType)
bool add(const char *)
bool add(HashIntoType)
#bool add(HashIntoType)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@camillescott or @luizirber any insight why I need to comment out the second overload? Without this I get an error telling me that it can't convert a const char* into HashIntoType.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A "hack" found on SO: change the second line to bool add2 "add"(HashIntoType). Somehow tricks cython but because you provide a "real name" that is the same as above the compiler figures it all out. This seems to make the compiler happy but segfaults the second time I try to add a kmer :-/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Segfault also happens with the original code on OSX.

@betatim betatim force-pushed the try/cqf_storage branch 3 times, most recently from ef21660 to 1e15ca1 Compare June 9, 2017 15:36
@@ -42,16 +42,16 @@ cdef extern from "oxli/hashtable.hh" namespace "oxli":
void count(const char *)
void count(HashIntoType)
bool add(const char *)
bool add(HashIntoType)
bool add2 "add"(HashIntoType)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dat hack?!

"""
if isinstance(kmer, (unicode, str)):
data = _bstring(kmer)
return deref(self.c_table).add(deref(self.c_table).hash_dna(data))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does one of the level 7 cython wizards know why you have to convert the string to a hash yourself instead of getting the overloaded method to do our bidding for us?

@@ -359,6 +363,17 @@ def __new__(cls, k, starting_size, n_tables):
return counttable


class QFCounttable(_QFCounttable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that now (in the dawn of the Glorious Revolution), we'll be removing this pattern -- this __new__ method should just be implemented in the __cinit__.

"""Calculate the k-mer abundance distribution of the given file_name."""
read_parser = FastxParser(file_name)
cdef uint64_t * x = deref(self.c_table).abundance_distribution[CpFastxReader](
read_parser._this, (<CPyHashtable_Object>tracking).hashtable)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need help again @camillescott :-/ tracking is a Nodegraph which I'd like to somehow unpack so I can take the hashtable attribute and pass it along.

A bit like:

khmer_KHashtable_Object * tracking_obj = NULL;
PyArg_ParseTuple(args, "O!", &khmer_KHashtable_Type, &tracking_obj)
// now use tracking_obj->hashtable

but how? It seems silly that we can't reach into these extension types 😕

@betatim betatim force-pushed the try/cqf_storage branch 2 times, most recently from 1c562b7 to 07ffa8f Compare June 16, 2017 09:01
@codecov-io
Copy link

codecov-io commented Jun 16, 2017

Codecov Report

Merging #1675 into master will decrease coverage by <.01%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #1675      +/-   ##
=========================================
- Coverage    0.05%   0.05%   -0.01%     
=========================================
  Files          90      90              
  Lines       11336   11439     +103     
  Branches     2992    3062      +70     
=========================================
  Hits            6       6              
- Misses      11330   11433     +103
Impacted Files Coverage Δ
include/oxli/hashtable.hh 0% <0%> (ø) ⬆️
khmer/_oxli/oxli_exception_convert.cc 0% <0%> (ø) ⬆️
khmer/__init__.py 2.66% <0%> (-0.03%) ⬇️
src/oxli/storage.cc 0% <0%> (ø) ⬆️
include/oxli/storage.hh 0% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1a9a2bb...27a4d8c. Read the comment docs.

@betatim
Copy link
Member Author

betatim commented Jun 16, 2017

sse4, cqf, 9000000, 1000000, 5.88, 0.66, 0.70
avx, cqf, 9000000, 1000000, 6.02, 0.66, 0.72
w/o, cqf, 9000000, 1000000, 5.90, 0.66, 0.72

Whether or not we use the SSE4.2 extension seems to make only a very small difference. For now disabling it as we would need a good way to detect whether or not the CPU supports it (if someone has a good idea on that let me know).

@betatim
Copy link
Member Author

betatim commented Jun 22, 2017

Ready for review! @luizirber, @camillescott (I seek approval from the cython deity 🙇), @standage , or @ctb

There is a "detail" to sort out with n_occupied() and how to handle the data structure becoming full. However there is enough code to start reviewing it and then we sort things after/during. This doesn't make the CQF available from any of the scripts so I think we are OK to merge this and then doctor around a bit more after.

@ctb
Copy link
Member

ctb commented Jun 22, 2017 via email

@betatim
Copy link
Member Author

betatim commented Jun 23, 2017

My conclusion from splatlab/cqf#4 is that there is no way to get the CQF to behave like a BF which allows the user to continue inserting items beyond the point where the FP rate "explodes". The good thing is we have a way to gracefully exit when we reach that point.

@ctb
Copy link
Member

ctb commented Jun 23, 2017 via email

@betatim betatim changed the title [WIP] CQF based storage [MRG] CQF based storage Jun 23, 2017
@betatim
Copy link
Member Author

betatim commented Jun 23, 2017

Can we decide on how to handle that situation when we actually expose this via one of the scripts?

@ctb
Copy link
Member

ctb commented Jun 23, 2017 via email

Copy link
Member

@camillescott camillescott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is immense. Great work!

General comments:

  • There are several methods that throw C++ exceptions. We should probably handle these properly in the extern wrapper (ie the +MemoryError, etc syntax)
  • CQFilter should be derived from Hashtable. Seeing as you've essentially (very helpfully) done the job of cythonizing Hashtable anyhow, I'd suggest moving most of that code into an extension class called Hashtable that throws a NotImplemented in its __cinit__.


public:
QFStorage(int size) {
qf_init(&cf, (1ULL << size), size+8, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some brief code comments on these init params?

@@ -85,6 +85,43 @@ cdef extern from "oxli/hashtable.hh" namespace "oxli":
cdef cppclass CpNodetable "oxli::Nodetable" (CpHashtable):
CpNodetable(WordLength, vector[uint64_t])

cdef cppclass CpQFCounttable "oxli::QFCounttable":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be derived from CpHashtable (see above with Nodetable and Counttable), which will allow you to omit all these redefinitions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried that and you end up with a bunch of these:

khmer/_oxli/graphs.pyx:58:43: Cannot assign type 'char *' to 'HashIntoType'

Error compiling Cython file:
------------------------------------------------------------
...
        For Nodetables and Counttables, this function will fail if the
        supplied k-mer contains non-ACGT characters.
        """
        if isinstance(kmer, str):
            temp = kmer.encode('utf-8')
            return deref(self.c_table).get_count(<char*>temp)

not sure why defining them again in the same order (instead of inheriting) fixes this :-/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, agree that this is stupid but don't know how to improve.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This referring to the class hierarchy or to the temp variable thing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class hierarachy

self.c_table.reset(new CpQFCounttable(k, int(log(starting_size, 2))))

def add(self, kmer):
"""Increment the count of this k-mer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in liboxli, add returns whether the k-mer was new, while count does not. I don't think this is explicitly documented beyond just being that way in the code, however.

value of the kmer.
"""
if isinstance(kmer, str):
temp = kmer.encode('utf-8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe you should be able to write this as cdef char * temp = ...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't define a cdef inside the if statement, so you could declare it outside and then assign here but I'd claim that is even more clunky than the cast.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair 'nuff

@luizirber
Copy link
Member

Quick comment: I started the wrapper for C++ -> Python exceptions in the HLL Cython PR, this is how you declare it on the Cython class.

@betatim
Copy link
Member Author

betatim commented Jul 6, 2017

Like the exception handling, for the moment I will add except + to those methods that need it, once #1730 is merged, I'll switch to that.

@betatim
Copy link
Member Author

betatim commented Jul 13, 2017

Ready for review! @luizirber, @camillescott, @standage , or @ctb

@betatim
Copy link
Member Author

betatim commented Jul 13, 2017

An alternative to merging this would be to pull out the Hashtable stuff and get that merged first, then think about CQF stuff more (if needed). With the Hashtable stuff merged we can bring about more of the glorious cython future for other parts.

Copy link
Member

@camillescott camillescott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me other than the couple typing comments I left. I'm also open to just merging this as-is and changing those over in a subsequent PR which removes the CPython interface as well. re: @betatim's earlier comment, and suggestion from @ctb , I'm for just merging all this at once. In light of my comments being suggestions and perhaps better to be done in a new PR, I'm approving now.

Thoughts?

total_reads,
n_consumed)

def abundance_distribution(self, file_name, tracking):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not require this to be Hashtable tracking? The CPython version gets deprecated via this PR anyhow.

raise ValueError('Expected file_name to be string, '
'got {} instead.'.format(type(file_name)))

cdef CPyHashtable_Object* hashtable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ie here we can just pull the _this pointer from the Cython extension class.

abunds.append(x[i])
return abunds

def abundance_distribution_with_reads_parser(self, read_parser, tracking):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thoughts as above. Additionally, read_parser can be required to be a FastxParser from oxli.parsing, with the underlying CpFastxReader pulled from the this pointer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to remember why exactly I went with this. Part of this is there are a bunch of tests that start failing when I add:

from khmer._oxli.parsing import FastxParser as ReadParser

to khmer/init.py, which would seem to be a useful thing to do to keep all the people using ReadParsers happy. Not sure this is super tricky, but can't work it out atm :-/ Merge now, improve later?

@ctb
Copy link
Member

ctb commented Jul 21, 2017

Looks like test failures in the assembly code cc @camillescott

@betatim
Copy link
Member Author

betatim commented Jul 21, 2017

Related to the discussion between @standage and @camillescott on slack yesterday regarding bytes vs strings in the latest version of cython, I think.

(nb. assembly code -> the other ASM)

see cython/cython#1790

@betatim
Copy link
Member Author

betatim commented Jul 27, 2017

Nearly forgot the most important part: ✋ (high five) and thanks for the patient commenting :) Long may the cython revolution continue ✊ ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants