HyperLogLog Counter #257
Looks like Hudson had an IO burp communicating with the iMac builder. I'm On Mon, Jan 20, 2014 at 5:30 PM, ged-jenkins notifications@github.com wrote:
db7a5a9
to
3a6dfed
Compare
511242e
to
cc6c1bf
Compare
using namespace std;
using namespace khmer;

std::map<int, std::vector<double> > rawEstimateData;
Why the ref to 'std' here given the using line above?
Indeed. I prefer to drop the "using namespace std"; is there a guideline?
I was going to say that we don't use it but there are at least 6 counterexamples. You are welcome to not include it.
c30e205
to
6c55cfc
Compare
    case 6:
        return 0.709;
    default:
        return 0.7213 / (1.0 + 1.079 / (1 << p));
It would be great if all the constants in this code were documented + explained
I believe these are from the HLL paper; a citation is probably sufficient.
Speaking of citations, shall we add the HLL paper to our CITATION file?
Not clear, do we add papers for Bloom filters or Count-min sketch?
Nope; you are off the hook.
@ctb do we want to output cite requests for other algos?
File issue. Let's merge this sucker.
Titus Brown, ctbrown@ucdavis.edu
On Jan 22, 2015, at 7:19 PM, Michael R. Crusoe notifications@github.com wrote:
In lib/hllcounter.cc:
            message << "Max error is " << valid_upper_bound;
        } else {
            message << "Min error is " << valid_lower_bound;
        }
        throw khmer_exception(message.str().c_str());
    }
    switch (p) {
    case 4:
        return 0.673;
    case 5:
        return 0.697;
    case 6:
        return 0.709;
    default:
        return 0.7213 / (1.0 + 1.079 / (1 << p));
Nope; you are off the hook.
@ctb do we want to output cite requests for other algos?
22f562b
to
178a324
Compare
HashIntoType _hash_murmur(const std::string kmer);
HashIntoType _hash_murmur(const std::string kmer,
                          HashIntoType& h, HashIntoType& r);
HashIntoType _hash_murmur_forward(const std::string kmer);
Please review the cppcheck notes for these lines http://ci.ged.msu.edu/job/khmer-pullrequest/976/label=linux/cppcheckResult/
Checklist for new CPython types
6289b32
to
57f2aa0
Compare
Ready for re-review @mr-c
    return h ^ r;
}

HashIntoType _hash_murmur_forward(const std::string& kmer)
This function is never used?
No, I kept it for consistency with the original _hash function (which has this variant). Should I expose the MurmurHash3 functions to Python land?
Sure
Check 703c28d
+1 !
On Thu Jan 22 2015 at 6:15:52 PM Luiz Irber notifications@github.com wrote:
In lib/kmer_hash.cc #257 (comment):
HashIntoType _hash_murmur(const std::string& kmer,
                          HashIntoType& h, HashIntoType& r)
{
    HashIntoType out[2];
    uint32_t seed = 0;
    MurmurHash3_x64_128((void *)kmer.c_str(), kmer.size(), seed, &out);
    h = out[0];
    std::string rev = khmer::_revcomp(kmer);
    MurmurHash3_x64_128((void *)rev.c_str(), rev.size(), seed, &out);
    r = out[0];
    return h ^ r;
}

HashIntoType _hash_murmur_forward(const std::string& kmer)
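The quoted `_hash_murmur` hashes a k-mer and its reverse complement separately and XORs the two values, so both strands of the same sequence map to one hash. A minimal sketch of that idea follows. It swaps in a 64-bit FNV-1a hash for MurmurHash3 to stay self-contained, and `revcomp`, `fnv1a64` and `hash_both_strands` are hypothetical names, not khmer functions.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Reverse complement of a DNA k-mer (uppercase ACGT only).
std::string revcomp(const std::string& kmer)
{
    std::string out(kmer.rbegin(), kmer.rend());
    for (char& c : out) {
        switch (c) {
        case 'A': c = 'T'; break;
        case 'C': c = 'G'; break;
        case 'G': c = 'C'; break;
        case 'T': c = 'A'; break;
        default:
            throw std::invalid_argument("non-ACGT character in k-mer");
        }
    }
    return out;
}

// Stand-in hash (64-bit FNV-1a). The pull request uses MurmurHash3
// instead; this substitute only keeps the example dependency-free.
uint64_t fnv1a64(const std::string& s)
{
    uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;             // FNV prime
    }
    return h;
}

// XOR of the forward and reverse-complement hashes. XOR is
// commutative, so a k-mer and its reverse complement get equal values.
uint64_t hash_both_strands(const std::string& kmer)
{
    const uint64_t h = fnv1a64(kmer);
    const uint64_t r = fnv1a64(revcomp(kmer));
    return h ^ r;
}
```

One consequence of combining with XOR: a k-mer that equals its own reverse complement (for example "ACGT") hashes to 0, because both halves are identical.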
Other than those 4 things this is ready to merge!
491d390
to
91c0f17
Compare
and configuration.
* setup.py: added a function to check if compiler supports OpenMP.
Only one newline here
703c28d
to
883a3ef
Compare
Jenkins, retest this please
Great job, @luizirber !
🎉 🎊 🎈 🍻 🍰
:) Titus Brown, ctbrown@ucdavis.edu
HyperLogLog Counter
This implements #233. A HyperLogLog counter is a probabilistic counter capable of cardinality estimation (how many unique elements appear in a dataset) with very low memory consumption. In khmer it can be useful to give better guesses for hash size parameters (-x and -N).
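To make the idea concrete, here is a minimal sketch of a HyperLogLog counter. It is not the implementation in this pull request: `ToyHLL` and `toy_hash64` are hypothetical names, the hash is a stand-in (FNV-1a plus a MurmurHash3-style finalizer), and the sketch assumes 7 <= p <= 16 so that the closed-form bias constant applies. It uses the raw estimate with the linear-counting correction for small cardinalities and omits the empirical bias tables discussed in the TODO list.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in 64-bit hash, NOT the one used in the PR.
inline uint64_t toy_hash64(const std::string& s)
{
    uint64_t h = 14695981039346656037ULL;   // FNV-1a
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    h ^= h >> 33;                            // finalizer to mix high bits
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}

class ToyHLL
{
public:
    // p = number of index bits; m = 2^p registers (assumes 7 <= p <= 16).
    explicit ToyHLL(int p) : p_(p), registers_(std::size_t(1) << p, 0) {}

    void add(const std::string& item)
    {
        const uint64_t x = toy_hash64(item);
        const std::size_t idx = x >> (64 - p_);  // first p bits pick a register
        uint64_t rest = x << p_;                 // remaining 64 - p bits, left-aligned
        // rank = 1-based position of the leftmost 1-bit in the remaining
        // bits, or (64 - p + 1) if they are all zero.
        uint8_t rank = 1;
        while (rank <= 64 - p_ && !(rest & (1ULL << 63))) {
            ++rank;
            rest <<= 1;
        }
        registers_[idx] = std::max(registers_[idx], rank);
    }

    double estimate() const
    {
        const double m = static_cast<double>(registers_.size());
        double sum = 0.0;
        int zeros = 0;
        for (uint8_t r : registers_) {
            sum += std::ldexp(1.0, -static_cast<int>(r));  // 2^(-r)
            if (r == 0) {
                ++zeros;
            }
        }
        const double alpha = 0.7213 / (1.0 + 1.079 / m);   // valid for m >= 128
        double e = alpha * m * m / sum;                    // raw estimate
        if (e <= 2.5 * m && zeros > 0) {
            e = m * std::log(m / zeros);                   // linear counting
        }
        return e;
    }

private:
    int p_;
    std::vector<uint8_t> registers_;
};
```

With p = 12 the counter keeps 4096 one-byte registers, and the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%. Adding the same item twice does not change the estimate, since each register only keeps a maximum.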
References
TODO
add tests
implement Google paper suggestion (64-bit hash function, bias correction)
Kinda done. Using 64-bit SHA-1 or MurmurHash3, but the bias correction analysis from the paper should be rerun to get new constants. For now the current ones work reasonably well.
ask @mr-c and @camillescott about the right way to implement extensions
think more about get_rho function
parallel reading and HLL merges for better performance
put constants in a separate header
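Two of the items above, the rho function and HLL merges, can be sketched as follows. This uses the standard definitions from the HyperLogLog paper: rho is the 1-based position of the leftmost 1-bit, and two sketches with the same number of registers are merged by taking the register-wise maximum. The names `rho` and `merge_registers` and the register type are assumptions for illustration, and the PR's `get_rho` may differ in its details.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// 1-based position of the leftmost 1-bit within the lowest `width`
// bits of w; returns width + 1 when all of those bits are zero.
// Assumes 1 <= width <= 64.
int rho(uint64_t w, int width)
{
    for (int pos = 1; pos <= width; ++pos) {
        if (w & (1ULL << (width - pos))) {
            return pos;
        }
    }
    return width + 1;
}

// Merge `src` into `dst` by keeping the larger value of each register.
// The result equals the sketch that would have been built from the
// union of both input streams.
void merge_registers(std::vector<uint8_t>& dst,
                     const std::vector<uint8_t>& src)
{
    if (dst.size() != src.size()) {
        throw std::invalid_argument("register counts differ");
    }
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i] = std::max(dst[i], src[i]);
    }
}
```

Because the merge is an element-wise maximum, it is associative and commutative, so per-thread counters filled during parallel reading can be combined in any order.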