New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HyperLogLog Counter #257

Merged
merged 52 commits into from Jan 23, 2015

Conversation

Projects
None yet
3 participants
@luizirber
Member

luizirber commented Jan 14, 2014

HyperLogLog Counter

This implements #233. A HyperLogLog counter is a probabilistic counter capable of cardinality estimation (how many unique elements appears in a dataset) with very low memory consumption. In khmer it can be useful to give better guesses for hash size parameters (-x and -N).

References

TODO

  • add tests

  • implement Google paper suggestion (64 bits hash function, bias correction)

    Kinda done. Using a 64-bits SHA-1 or MurmurHash3, but should rerun bias correction analysis from paper for new constants. For now they work reasonably well.

  • ask @mr-c and @camillescott about the right way to implement extensions

  • think more about get_rho function

  • parallel reading and HLL merges for better performance

  • put constants in a separate header

@mr-c

This comment has been minimized.

Show comment
Hide comment
@mr-c

mr-c Jan 20, 2014

Contributor

Looks like Hudson had an IO burp communicating with the iMac builder. I'm
rerunning the tests now.

On Mon, Jan 20, 2014 at 5:30 PM, ged-jenkins notifications@github.comwrote:

Test FAILed.
Refer to this link for build results:
http://ci.ged.msu.edu/job/khmer-multi-pullrequest/155/


Reply to this email directly or view it on GitHubhttps://github.com//pull/257#issuecomment-32803431
.

Contributor

mr-c commented Jan 20, 2014

Looks like Hudson had an IO burp communicating with the iMac builder. I'm
rerunning the tests now.

On Mon, Jan 20, 2014 at 5:30 PM, ged-jenkins notifications@github.comwrote:

Test FAILed.
Refer to this link for build results:
http://ci.ged.msu.edu/job/khmer-multi-pullrequest/155/


Reply to this email directly or view it on GitHubhttps://github.com//pull/257#issuecomment-32803431
.

@mr-c mr-c closed this Apr 2, 2014

@luizirber luizirber reopened this Apr 2, 2014

Show outdated Hide outdated lib/hllcounter.cc
Show outdated Hide outdated lib/hllcounter.cc
Show outdated Hide outdated lib/hllcounter.cc
Show outdated Hide outdated lib/hllcounter.cc
Show outdated Hide outdated lib/hllcounter.cc
Show outdated Hide outdated lib/hllcounter.cc
}
this->add(kmer);
kmer.erase(0, 1);

This comment has been minimized.

@luizirber

luizirber Dec 15, 2014

Member

This is a weird way to iterate over k-mers, any suggestions?

@luizirber

luizirber Dec 15, 2014

Member

This is a weird way to iterate over k-mers, any suggestions?

This comment has been minimized.

@mr-c

mr-c Dec 15, 2014

Contributor

How about the KMerIterator from lib/hashtable.hh?

@mr-c

mr-c Dec 15, 2014

Contributor

How about the KMerIterator from lib/hashtable.hh?

This comment has been minimized.

@luizirber

luizirber Dec 15, 2014

Member

I'm not using the two-bit representation (HLL supports k > 32 because of this), so KMerIterator is not applicable in this case.

@luizirber

luizirber Dec 15, 2014

Member

I'm not using the two-bit representation (HLL supports k > 32 because of this), so KMerIterator is not applicable in this case.

This comment has been minimized.

@ctb

ctb Dec 15, 2014

Member

On Mon, Dec 15, 2014 at 11:18:09AM -0800, Luiz Irber wrote:

  • HashIntoType j = x & (this->m - 1);
  • this->M[j] = max(this->M[j], get_rho(x >> this->p, 64 - this->p));
    +}

+unsigned int HLLCounter::consume_string(const std::string &s) {

  • unsigned int n_consumed = 0;
  • std::string kmer = "";
  • for(std::string::const_iterator it = s.begin(); it != s.end(); ++it) {
  •    kmer.push_back(*it);
    
  •    if (kmer.size() < _ksize) {
    
  •        continue;
    
  •    }
    
  •    this->add(kmer);
    
  •    kmer.erase(0, 1);
    

I'm not using the two-bit representation (HLL supports k > 32 because of this), so KMerIterator is not applicable in this case.

What about using the same interface?

@ctb

ctb Dec 15, 2014

Member

On Mon, Dec 15, 2014 at 11:18:09AM -0800, Luiz Irber wrote:

  • HashIntoType j = x & (this->m - 1);
  • this->M[j] = max(this->M[j], get_rho(x >> this->p, 64 - this->p));
    +}

+unsigned int HLLCounter::consume_string(const std::string &s) {

  • unsigned int n_consumed = 0;
  • std::string kmer = "";
  • for(std::string::const_iterator it = s.begin(); it != s.end(); ++it) {
  •    kmer.push_back(*it);
    
  •    if (kmer.size() < _ksize) {
    
  •        continue;
    
  •    }
    
  •    this->add(kmer);
    
  •    kmer.erase(0, 1);
    

I'm not using the two-bit representation (HLL supports k > 32 because of this), so KMerIterator is not applicable in this case.

What about using the same interface?

This comment has been minimized.

@mr-c

mr-c Dec 19, 2014

Contributor

@ctb The KMerIterator interface is very oriented to 64bit 2bit representations so it wouldn't be a good fit for Luiz's need: http://ci.ged.msu.edu/job/khmer-master/doxygen/classkhmer_1_1_k_mer_iterator.html

@mr-c

mr-c Dec 19, 2014

Contributor

@ctb The KMerIterator interface is very oriented to 64bit 2bit representations so it wouldn't be a good fit for Luiz's need: http://ci.ged.msu.edu/job/khmer-master/doxygen/classkhmer_1_1_k_mer_iterator.html

Show outdated Hide outdated lib/hllcounter.cc
Show outdated Hide outdated lib/kmer_hash.cc
case 6:
return 0.709;
default:
return 0.7213 / (1.0 + 1.079 / (1 << p));

This comment has been minimized.

@mr-c

mr-c Dec 19, 2014

Contributor

It would be great if all the constants in this code were documented + explained

@mr-c

mr-c Dec 19, 2014

Contributor

It would be great if all the constants in this code were documented + explained

This comment has been minimized.

@ctb

ctb Jan 2, 2015

Member

I believe these are from the HLL paper; a citation is probably sufficient.

@ctb

ctb Jan 2, 2015

Member

I believe these are from the HLL paper; a citation is probably sufficient.

This comment has been minimized.

@mr-c

mr-c Jan 23, 2015

Contributor

Speaking of citations, shall we add the HLL paper to our CITATION file?

@mr-c

mr-c Jan 23, 2015

Contributor

Speaking of citations, shall we add the HLL paper to our CITATION file?

This comment has been minimized.

@luizirber

luizirber Jan 23, 2015

Member

Not clear, do we add papers for Bloom filters or Count-min sketch?

@luizirber

luizirber Jan 23, 2015

Member

Not clear, do we add papers for Bloom filters or Count-min sketch?

This comment has been minimized.

@mr-c

mr-c Jan 23, 2015

Contributor

Nope; you are off the hook.

@ctb do we want to output cite requests for other algos?

@mr-c

mr-c Jan 23, 2015

Contributor

Nope; you are off the hook.

@ctb do we want to output cite requests for other algos?

This comment has been minimized.

@ctb

ctb Jan 23, 2015

Member

File issue. Let's merge this sucker.

Titus Brown, ctbrown@ucdavis.edu

On Jan 22, 2015, at 7:19 PM, Michael R. Crusoe notifications@github.com wrote:

In lib/hllcounter.cc:

  •        message << "Max error is " << valid_upper_bound;
    
  •    } else {
    
  •        message << "Min error is " << valid_lower_bound;
    
  •    }
    
  •    throw khmer_exception(message.str().c_str());
    
  • }
  • switch (p) {
  • case 4:
  •    return 0.673;
    
  • case 5:
  •    return 0.697;
    
  • case 6:
  •    return 0.709;
    
  • default:
  •    return 0.7213 / (1.0 + 1.079 / (1 << p));
    
    Nope; you are off the hook.

@ctb do we want to output cite requests for other algos?


Reply to this email directly or view it on GitHub.

@ctb

ctb Jan 23, 2015

Member

File issue. Let's merge this sucker.

Titus Brown, ctbrown@ucdavis.edu

On Jan 22, 2015, at 7:19 PM, Michael R. Crusoe notifications@github.com wrote:

In lib/hllcounter.cc:

  •        message << "Max error is " << valid_upper_bound;
    
  •    } else {
    
  •        message << "Min error is " << valid_lower_bound;
    
  •    }
    
  •    throw khmer_exception(message.str().c_str());
    
  • }
  • switch (p) {
  • case 4:
  •    return 0.673;
    
  • case 5:
  •    return 0.697;
    
  • case 6:
  •    return 0.709;
    
  • default:
  •    return 0.7213 / (1.0 + 1.079 / (1 << p));
    
    Nope; you are off the hook.

@ctb do we want to output cite requests for other algos?


Reply to this email directly or view it on GitHub.

@mr-c

This comment has been minimized.

Show comment
Hide comment
@mr-c

mr-c Dec 19, 2014

Contributor
  • coverity scan
Contributor

mr-c commented Dec 19, 2014

  • coverity scan
Show outdated Hide outdated lib/kmer_hash.hh
Show outdated Hide outdated khmer/__init__.py
Show outdated Hide outdated khmer/_khmermodule.cc
Show outdated Hide outdated khmer/_khmermodule.cc
Show outdated Hide outdated lib/hllcounter.cc
Show outdated Hide outdated lib/hllcounter.cc
Show outdated Hide outdated lib/hllcounter.cc
Show outdated Hide outdated sandbox/unique-kmers.py
@luizirber

This comment has been minimized.

Show comment
Hide comment
@luizirber

luizirber Jan 1, 2015

Member

Mostly done, it has:

  • tests
  • reasonable coverage
  • is a proper Python object (doesn't crash IPython anymore)
  • unique-kmers.py is an importable script and uses argparse
  • passes pep8/astyle/cppcheck/coverity

There are some improvements to be done in the C++ guts, but it is not a merge blocker.

What still needs to be done (probably another PR?):

  • Describe better what are the constants (and better variable names?)
  • Error-check the get_rho function
  • Better constant arrays initialization

Anything else?

Member

luizirber commented Jan 1, 2015

Mostly done, it has:

  • tests
  • reasonable coverage
  • is a proper Python object (doesn't crash IPython anymore)
  • unique-kmers.py is an importable script and uses argparse
  • passes pep8/astyle/cppcheck/coverity

There are some improvements to be done in the C++ guts, but it is not a merge blocker.

What still needs to be done (probably another PR?):

  • Describe better what are the constants (and better variable names?)
  • Error-check the get_rho function
  • Better constant arrays initialization

Anything else?

@luizirber

This comment has been minimized.

Show comment
Hide comment
@luizirber

luizirber Jan 1, 2015

Member
  • Is it mergable?
  • Did it pass the tests?
  • If it introduces new functionality in scripts/ is it tested?
    Check for code coverage.
  • Is it well formatted? Look at make pep8, make diff_pylint_report,
    make cppcheck, and make doc output. Use make format and manual
    fixing as needed.
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
  • Was a spellchecker run on the source code and documentation after
    changes were made?
Member

luizirber commented Jan 1, 2015

  • Is it mergable?
  • Did it pass the tests?
  • If it introduces new functionality in scripts/ is it tested?
    Check for code coverage.
  • Is it well formatted? Look at make pep8, make diff_pylint_report,
    make cppcheck, and make doc output. Use make format and manual
    fixing as needed.
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
  • Was a spellchecker run on the source code and documentation after
    changes were made?
@luizirber

This comment has been minimized.

Show comment
Hide comment
@luizirber

luizirber Jan 2, 2015

Member

ready for review and merge @mr-c @camillescott @ctb

Member

luizirber commented Jan 2, 2015

ready for review and merge @mr-c @camillescott @ctb

Show outdated Hide outdated ChangeLog
@ctb

This comment has been minimized.

Show comment
Hide comment
@ctb

ctb Jan 2, 2015

Member

Overall, looks good. A few issues to address before merge, I think, but nothing big.

One additional point: could you put a docstring in sandbox/unique-kmers, please?

Member

ctb commented Jan 2, 2015

Overall, looks good. A few issues to address before merge, I think, but nothing big.

One additional point: could you put a docstring in sandbox/unique-kmers, please?

@ctb

This comment has been minimized.

Show comment
Hide comment
@ctb

ctb Jan 2, 2015

Member

Oh, and fix the get_parser description for unique-kmers? copy/paste FTL :)

Member

ctb commented Jan 2, 2015

Oh, and fix the get_parser description for unique-kmers? copy/paste FTL :)

@luizirber

This comment has been minimized.

Show comment
Hide comment
@luizirber

luizirber Jan 2, 2015

Member

Fixed @ctb comments, removed TODO and added improvements to issues #714, #715, #716 and #717.

Member

luizirber commented Jan 2, 2015

Fixed @ctb comments, removed TODO and added improvements to issues #714, #715, #716 and #717.

@ctb

This comment has been minimized.

Show comment
Hide comment
@ctb

ctb Jan 3, 2015

Member

I couldn't just do a pull w/o generating conflicts - did you do a rebase?

Member

ctb commented Jan 3, 2015

I couldn't just do a pull w/o generating conflicts - did you do a rebase?

@luizirber

This comment has been minimized.

Show comment
Hide comment
@luizirber

luizirber Jan 21, 2015

Member

Checklist for new CPython types

  • the CPython object name is of the form khmer_${OBJECTNAME}_Object
  • Named struct with PyObject_HEAD macro
  • static PyTypeObject khmer_${OBJECTNAME}_Type with the following
    entries
    • PyObject_HEAD_INIT(NULL)
    • all fields should have their name in a comment for readability
    • The tp_name filed is a dotted name with both the module name and
      the name of the type within the module. Example: khmer.ReadAligner
    • Deallocator defined and cast to (destructor) in tp_dealloc
      • The object's deallocator must be self->ob_type->tp_free(self);
    • Do not define a tp_getattr
    • BONUS: write methods to present the state of the object via
      tp_str & tp_repr
    • Do pass in the array of methods in tp_methods
    • Do define a new method in tp_new
  • PyMethodDef arrays contain doc strings
    • Methods are cast to PyCFunctionss
  • Type methods use their type Object in the method signature.
  • Type creation method decrements the reference to self
    (Py_DECREF(self);) before each error-path exit (return NULL;)
  • No factory methods. Example: khmer_new_readaligner
  • Type object is passed to PyType_Ready and its return code is checked
    in init_khmer()
  • The reference count for the type object is incremented before adding
    it to the module: Py_INCREF(&khmer_${OBJECTNAME}_Type);.
Member

luizirber commented Jan 21, 2015

Checklist for new CPython types

  • the CPython object name is of the form khmer_${OBJECTNAME}_Object
  • Named struct with PyObject_HEAD macro
  • static PyTypeObject khmer_${OBJECTNAME}_Type with the following
    entries
    • PyObject_HEAD_INIT(NULL)
    • all fields should have their name in a comment for readability
    • The tp_name filed is a dotted name with both the module name and
      the name of the type within the module. Example: khmer.ReadAligner
    • Deallocator defined and cast to (destructor) in tp_dealloc
      • The object's deallocator must be self->ob_type->tp_free(self);
    • Do not define a tp_getattr
    • BONUS: write methods to present the state of the object via
      tp_str & tp_repr
    • Do pass in the array of methods in tp_methods
    • Do define a new method in tp_new
  • PyMethodDef arrays contain doc strings
    • Methods are cast to PyCFunctionss
  • Type methods use their type Object in the method signature.
  • Type creation method decrements the reference to self
    (Py_DECREF(self);) before each error-path exit (return NULL;)
  • No factory methods. Example: khmer_new_readaligner
  • Type object is passed to PyType_Ready and its return code is checked
    in init_khmer()
  • The reference count for the type object is incremented before adding
    it to the module: Py_INCREF(&khmer_${OBJECTNAME}_Type);.

@luizirber luizirber self-assigned this Jan 21, 2015

@luizirber

This comment has been minimized.

Show comment
Hide comment
@luizirber

luizirber Jan 21, 2015

Member

Ready for re-review @mr-c

Member

luizirber commented Jan 21, 2015

Ready for re-review @mr-c

return h ^ r;
}
HashIntoType _hash_murmur_forward(const std::string& kmer)

This comment has been minimized.

@mr-c

mr-c Jan 21, 2015

Contributor

This function is never used?

@mr-c

mr-c Jan 21, 2015

Contributor

This function is never used?

This comment has been minimized.

@luizirber

luizirber Jan 22, 2015

Member

No, I kept for consistency with the original _hash function (which has this variant). Should I expose the MurmurHash3 functions to Python land?

@luizirber

luizirber Jan 22, 2015

Member

No, I kept for consistency with the original _hash function (which has this variant). Should I expose the MurmurHash3 functions to Python land?

This comment has been minimized.

@mr-c

mr-c Jan 22, 2015

Contributor

Sure

@mr-c

mr-c Jan 22, 2015

Contributor

Sure

This comment has been minimized.

@luizirber
@luizirber

This comment has been minimized.

@mr-c

mr-c Jan 22, 2015

Contributor

+1 !

On Thu Jan 22 2015 at 6:15:52 PM Luiz Irber notifications@github.com
wrote:

In lib/kmer_hash.cc
#257 (comment):

+HashIntoType _hash_murmur(const std::string& kmer,

  •                      HashIntoType& h, HashIntoType& r)
    
    +{
  • HashIntoType out[2];
  • uint32_t seed = 0;
  • MurmurHash3_x64_128((void *)kmer.c_str(), kmer.size(), seed, &out);
  • h = out[0];
  • std::string rev = khmer::_revcomp(kmer);
  • MurmurHash3_x64_128((void *)rev.c_str(), rev.size(), seed, &out);
  • r = out[0];
  • return h ^ r;
    +}

+HashIntoType _hash_murmur_forward(const std::string& kmer)

Check 703c28d
703c28d


Reply to this email directly or view it on GitHub
https://github.com/ged-lab/khmer/pull/257/files#r23419494.

@mr-c

mr-c Jan 22, 2015

Contributor

+1 !

On Thu Jan 22 2015 at 6:15:52 PM Luiz Irber notifications@github.com
wrote:

In lib/kmer_hash.cc
#257 (comment):

+HashIntoType _hash_murmur(const std::string& kmer,

  •                      HashIntoType& h, HashIntoType& r)
    
    +{
  • HashIntoType out[2];
  • uint32_t seed = 0;
  • MurmurHash3_x64_128((void *)kmer.c_str(), kmer.size(), seed, &out);
  • h = out[0];
  • std::string rev = khmer::_revcomp(kmer);
  • MurmurHash3_x64_128((void *)rev.c_str(), rev.size(), seed, &out);
  • r = out[0];
  • return h ^ r;
    +}

+HashIntoType _hash_murmur_forward(const std::string& kmer)

Check 703c28d
703c28d


Reply to this email directly or view it on GitHub
https://github.com/ged-lab/khmer/pull/257/files#r23419494.

{
HashIntoType out[2];
uint32_t seed = 0;
MurmurHash3_x64_128((void *)kmer.c_str(), kmer.size(), seed, &out);

This comment has been minimized.

@mr-c

mr-c Jan 21, 2015

Contributor

implicit conversion loses integer precision: 'size_type' (aka 'unsigned long') to 'int' [-Wshorten-64-to-32]

@mr-c

mr-c Jan 21, 2015

Contributor

implicit conversion loses integer precision: 'size_type' (aka 'unsigned long') to 'int' [-Wshorten-64-to-32]

This comment has been minimized.

@luizirber

luizirber Jan 22, 2015

Member

I can cast to int, but it will only silence the warning (the original MurmurHash code uses int).

Update: I went ahead and modified the original code to use size_t too.

@luizirber

luizirber Jan 22, 2015

Member

I can cast to int, but it will only silence the warning (the original MurmurHash code uses int).

Update: I went ahead and modified the original code to use size_t too.

h = out[0];
std::string rev = khmer::_revcomp(kmer);
MurmurHash3_x64_128((void *)rev.c_str(), rev.size(), seed, &out);

This comment has been minimized.

@mr-c

mr-c Jan 21, 2015

Contributor

implicit conversion loses integer precision: 'size_type' (aka 'unsigned long') to 'int' [-Wshorten-64-to-32]

@mr-c

mr-c Jan 21, 2015

Contributor

implicit conversion loses integer precision: 'size_type' (aka 'unsigned long') to 'int' [-Wshorten-64-to-32]

@mr-c

This comment has been minimized.

Show comment
Hide comment
@mr-c

mr-c Jan 21, 2015

Contributor

Other than those 4 things this is ready to merge!

Contributor

mr-c commented Jan 21, 2015

Other than those 4 things this is ready to merge!

Show outdated Hide outdated ChangeLog
@mr-c

This comment has been minimized.

Show comment
Hide comment
@mr-c

mr-c Jan 23, 2015

Contributor

Jenkins, retest this please

Contributor

mr-c commented Jan 23, 2015

Jenkins, retest this please

mr-c added a commit that referenced this pull request Jan 23, 2015

@mr-c mr-c merged commit 5f76972 into master Jan 23, 2015

1 check passed

default Merged build finished.
Details

@mr-c mr-c deleted the feature/hll-counter branch Jan 23, 2015

@mr-c

This comment has been minimized.

Show comment
Hide comment
@mr-c

mr-c Jan 23, 2015

Contributor

Great job, @luizirber !

Contributor

mr-c commented Jan 23, 2015

Great job, @luizirber !

@luizirber

This comment has been minimized.

Show comment
Hide comment
@luizirber

luizirber Jan 23, 2015

Member

🎉 🎊 🎈 🍻 🍰

Member

luizirber commented Jan 23, 2015

🎉 🎊 🎈 🍻 🍰

@ctb

This comment has been minimized.

Show comment
Hide comment
@ctb

ctb Jan 23, 2015

Member

:)

Titus Brown, ctbrown@ucdavis.edu

On Jan 22, 2015, at 7:28 PM, Luiz Irber notifications@github.com wrote:


Reply to this email directly or view it on GitHub.

Member

ctb commented Jan 23, 2015

:)

Titus Brown, ctbrown@ucdavis.edu

On Jan 22, 2015, at 7:28 PM, Luiz Irber notifications@github.com wrote:


Reply to this email directly or view it on GitHub.

@mr-c mr-c referenced this pull request Feb 20, 2015

Closed

hyperloglog counter #233

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment