HyperLogLog Counter #257

Merged
merged 52 commits into master from feature/hll-counter
Jan 23, 2015

luizirber
Member

HyperLogLog Counter

This implements #233. A HyperLogLog counter is a probabilistic counter that estimates cardinality (how many unique elements appear in a dataset) with very low memory consumption. In khmer it can be used to make better guesses for the hash size parameters (-x and -N).
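
For readers unfamiliar with the technique, a minimal sketch of a HyperLogLog counter follows. It is not the code from this pull request: the class name `TinyHLL`, the `mix64` stand-in hash (a splitmix64-style finalizer over integers rather than SHA-1 or MurmurHash3 over k-mers), the restriction of `p` to 7..16, and the simple linear-counting fallback for small cardinalities are all assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in 64-bit hash (splitmix64 finalizer). An assumption for this sketch;
// the pull request hashes k-mers with SHA-1 or MurmurHash3 instead.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Hypothetical illustration class, not khmer's HLLCounter.
class TinyHLL {
public:
    // p is the number of index bits; the counter keeps m = 2^p registers.
    explicit TinyHLL(int p)
        : p_(checked(p)), registers_(std::size_t(1) << p, 0) {}

    void add(std::uint64_t item) {
        const std::uint64_t h = mix64(item);
        // The top p bits select a register.
        const std::size_t idx = static_cast<std::size_t>(h >> (64 - p_));
        // The remaining 64 - p bits, left-aligned, feed rho.
        std::uint64_t rest = h << p_;
        // rho = 1-based position of the leftmost 1 bit in the remaining
        // bits, or (64 - p + 1) if they are all zero.
        const int max_rho = 64 - p_ + 1;
        int rho = 1;
        while (rho < max_rho && (rest >> 63) == 0) {
            rest <<= 1;
            ++rho;
        }
        if (rho > registers_[idx]) {
            registers_[idx] = static_cast<std::uint8_t>(rho);
        }
    }

    double estimate() const {
        const double m = static_cast<double>(registers_.size());
        double sum = 0.0;
        std::size_t zeros = 0;
        for (std::uint8_t r : registers_) {
            sum += std::ldexp(1.0, -static_cast<int>(r));  // 2^(-r)
            if (r == 0) ++zeros;
        }
        // Bias-correction constant for m >= 128 registers.
        const double alpha = 0.7213 / (1.0 + 1.079 / m);
        const double raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        if (raw <= 2.5 * m && zeros > 0) {
            return m * std::log(m / static_cast<double>(zeros));
        }
        return raw;
    }

private:
    static int checked(int p) {
        if (p < 7 || p > 16) {
            throw std::invalid_argument("p must be in [7, 16]");
        }
        return p;
    }

    int p_;
    std::vector<std::uint8_t> registers_;
};
```

With 2^p registers the expected relative error is roughly 1.04 / sqrt(2^p), so p = 12 gives about 1.6% while using 4096 one-byte registers. Adding an element that was already seen never changes a register, which is why duplicates do not affect the estimate.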

TODO

  • add tests

  • implement Google paper suggestion (64-bit hash function, bias correction)

    Kinda done. Using a 64-bit hash (SHA-1 or MurmurHash3), but the bias correction analysis from the paper should be rerun to obtain new constants. For now the current ones work reasonably well.

  • ask @mr-c and @camillescott about the right way to implement extensions

  • think more about get_rho function

  • parallel reading and HLL merges for better performance

  • put constants in a separate header
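
On the "HLL merges" item: merging two HyperLogLog counters that use the same number of registers and the same hash function amounts to taking the element-wise maximum of their registers. A minimal sketch, assuming registers are stored in a `std::vector<std::uint8_t>`; `merge_registers` is a hypothetical name and not a function from this pull request.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Merge `src` into `dst` in place. Both vectors must have the same length,
// i.e. both counters must have been built with the same precision.
inline void merge_registers(std::vector<std::uint8_t>& dst,
                            const std::vector<std::uint8_t>& src) {
    if (dst.size() != src.size()) {
        throw std::invalid_argument("register counts differ");
    }
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i] = std::max(dst[i], src[i]);
    }
}
```

Because the merged registers are identical to those of a single counter that saw both inputs, each reader thread can fill its own counter and the results can be combined once at the end.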

@mr-c
Contributor

mr-c commented Jan 20, 2014

Looks like Hudson had an IO burp communicating with the iMac builder. I'm rerunning the tests now.

On Mon, Jan 20, 2014 at 5:30 PM, ged-jenkins notifications@github.com wrote:

Test FAILed.
Refer to this link for build results:
http://ci.ged.msu.edu/job/khmer-multi-pullrequest/155/



@mr-c mr-c closed this Apr 2, 2014
@luizirber luizirber reopened this Apr 2, 2014
@luizirber luizirber force-pushed the feature/hll-counter branch 3 times, most recently from db7a5a9 to 3a6dfed Compare November 17, 2014 00:48
@luizirber luizirber force-pushed the feature/hll-counter branch 2 times, most recently from 511242e to cc6c1bf Compare December 15, 2014 18:39
using namespace std;
using namespace khmer;

std::map<int, std::vector<double> > rawEstimateData;
Contributor

Why the ref to 'std' here given the using line above?

Member Author

Indeed. I prefer to drop the "using namespace std"; is there a guideline?

Contributor

I was going to say that we don't use it but there are at least 6 counterexamples. You are welcome to not include it.

@luizirber luizirber force-pushed the feature/hll-counter branch 2 times, most recently from c30e205 to 6c55cfc Compare December 19, 2014 16:40
case 6:
    return 0.709;
default:
    return 0.7213 / (1.0 + 1.079 / (1 << p));
Contributor

It would be great if all the constants in this code were documented + explained

Member

I believe these are from the HLL paper; a citation is probably sufficient.
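
For reference, here are the same constants with the kind of documentation the reviewer asks for. The values are the bias-correction constants alpha_m from the original HyperLogLog paper (Flajolet et al., 2007), where m = 2^p is the number of registers. The function name `hll_alpha` is an assumption for this sketch and may not match the name used in the pull request.

```cpp
// Bias-correction constant alpha_m for a HyperLogLog counter with
// m = 2^p registers. The raw estimate is alpha_m * m^2 / sum(2^-register).
inline double hll_alpha(int p) {
    switch (p) {
    case 4:
        return 0.673;  // alpha_16, m = 16 registers
    case 5:
        return 0.697;  // alpha_32, m = 32 registers
    case 6:
        return 0.709;  // alpha_64, m = 64 registers
    default:
        // Approximation from the paper, valid for m >= 128 (p >= 7).
        return 0.7213 / (1.0 + 1.079 / static_cast<double>(1 << p));
    }
}
```

The `default` branch approaches 0.7213 as p grows, so the constant barely changes for the register counts typically used in practice.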

Contributor

Speaking of citations, shall we add the HLL paper to our CITATION file?

Member Author

Not clear, do we add papers for Bloom filters or Count-min sketch?

Contributor

Nope; you are off the hook.

@ctb do we want to output cite requests for other algos?

Member

File issue. Let's merge this sucker.

Titus Brown, ctbrown@ucdavis.edu

On Jan 22, 2015, at 7:19 PM, Michael R. Crusoe notifications@github.com wrote:

In lib/hllcounter.cc:

+        message << "Max error is " << valid_upper_bound;
+    } else {
+        message << "Min error is " << valid_lower_bound;
+    }
+    throw khmer_exception(message.str().c_str());
+}
+switch (p) {
+case 4:
+    return 0.673;
+case 5:
+    return 0.697;
+case 6:
+    return 0.709;
+default:
+    return 0.7213 / (1.0 + 1.079 / (1 << p));


@mr-c
Contributor

mr-c commented Dec 19, 2014

  • coverity scan

HashIntoType _hash_murmur(const std::string kmer);
HashIntoType _hash_murmur(const std::string kmer,
HashIntoType& h, HashIntoType& r);
HashIntoType _hash_murmur_forward(const std::string kmer);
Contributor

Please review the cppcheck notes for these lines http://ci.ged.msu.edu/job/khmer-pullrequest/976/label=linux/cppcheckResult/

@luizirber
Member Author

Checklist for new CPython types

  • the CPython object name is of the form khmer_${OBJECTNAME}_Object
  • Named struct with PyObject_HEAD macro
  • static PyTypeObject khmer_${OBJECTNAME}_Type with the following
    entries
    • PyObject_HEAD_INIT(NULL)
    • all fields should have their name in a comment for readability
    • The tp_name field is a dotted name with both the module name and
      the name of the type within the module. Example: khmer.ReadAligner
    • Deallocator defined and cast to (destructor) in tp_dealloc
      • The object's deallocator must be self->ob_type->tp_free(self);
    • Do not define a tp_getattr
    • BONUS: write methods to present the state of the object via
      tp_str & tp_repr
    • Do pass in the array of methods in tp_methods
    • Do define a new method in tp_new
  • PyMethodDef arrays contain doc strings
    • Methods are cast to PyCFunctions
  • Type methods use their type Object in the method signature.
  • Type creation method decrements the reference to self
    (Py_DECREF(self);) before each error-path exit (return NULL;)
  • No factory methods. Example: khmer_new_readaligner
  • Type object is passed to PyType_Ready and its return code is checked
    in init_khmer()
  • The reference count for the type object is incremented before adding
    it to the module: Py_INCREF(&khmer_${OBJECTNAME}_Type);.

@luizirber
Member Author

Ready for re-review @mr-c

return h ^ r;
}

HashIntoType _hash_murmur_forward(const std::string& kmer)
Contributor

This function is never used?

Member Author

No, I kept it for consistency with the original _hash function (which has this variant). Should I expose the MurmurHash3 functions to Python land?

Contributor

Sure

Member Author

Check 703c28d

Contributor

+1 !

On Thu Jan 22 2015 at 6:15:52 PM Luiz Irber notifications@github.com
wrote:

In lib/kmer_hash.cc
#257 (comment):

+HashIntoType _hash_murmur(const std::string& kmer,
+                          HashIntoType& h, HashIntoType& r)
+{
+    HashIntoType out[2];
+    uint32_t seed = 0;
+    MurmurHash3_x64_128((void *)kmer.c_str(), kmer.size(), seed, &out);
+    h = out[0];
+    std::string rev = khmer::_revcomp(kmer);
+    MurmurHash3_x64_128((void *)rev.c_str(), rev.size(), seed, &out);
+    r = out[0];
+    return h ^ r;
+}

+HashIntoType _hash_murmur_forward(const std::string& kmer)
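
The idea behind the quoted `_hash_murmur` is to hash a k-mer and its reverse complement separately and XOR the two values, so that both strands of the same sequence map to the same hash. A self-contained sketch of that idea follows. It substitutes a 64-bit FNV-1a hash for `MurmurHash3_x64_128` so that it compiles without external sources; `fnv1a64`, `revcomp`, and `hash_both_strands` are hypothetical names for illustration, not khmer functions.

```cpp
#include <cstdint>
#include <string>

// 64-bit FNV-1a, used here only as a stand-in for MurmurHash3.
inline std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

// Reverse complement of an upper-case DNA string; other characters are
// reversed but left unchanged.
inline std::string revcomp(const std::string& kmer) {
    std::string out(kmer.rbegin(), kmer.rend());
    for (char& c : out) {
        switch (c) {
        case 'A': c = 'T'; break;
        case 'C': c = 'G'; break;
        case 'G': c = 'C'; break;
        case 'T': c = 'A'; break;
        default: break;
        }
    }
    return out;
}

// Strand-independent hash: XOR of the forward and reverse-complement hashes.
inline std::uint64_t hash_both_strands(const std::string& kmer) {
    return fnv1a64(kmer) ^ fnv1a64(revcomp(kmer));
}
```

XOR is symmetric, so a k-mer and its reverse complement always produce the same value. One property of this combination worth keeping in mind is that a k-mer equal to its own reverse complement (for example `ACGT`) hashes to 0, because the two halves cancel.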


@mr-c
Contributor

mr-c commented Jan 21, 2015

Other than those 4 things this is ready to merge!

@luizirber luizirber force-pushed the feature/hll-counter branch 2 times, most recently from 491d390 to 91c0f17 Compare January 22, 2015 21:52
and configuration.
* setup.py: added a function to check if compiler supports OpenMP.


Contributor

Only one newline here

@mr-c
Contributor

mr-c commented Jan 23, 2015

Jenkins, retest this please

mr-c added a commit that referenced this pull request Jan 23, 2015
@mr-c mr-c merged commit 5f76972 into master Jan 23, 2015
@mr-c mr-c deleted the feature/hll-counter branch January 23, 2015 00:26
@mr-c
Contributor

mr-c commented Jan 23, 2015

Great job, @luizirber !

@luizirber
Member Author

🎉 🎊 🎈 🍻 🍰

@ctb
Member

ctb commented Jan 23, 2015

:)


@mr-c mr-c mentioned this pull request Feb 20, 2015
3 participants