Skip to content
This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

HyperLogLogPlus merge introduces error when counters overlap #31

Closed
deGravity opened this issue May 7, 2013 · 8 comments
Closed

HyperLogLogPlus merge introduces error when counters overlap #31

deGravity opened this issue May 7, 2013 · 8 comments
Milestone

Comments

@deGravity
Copy link

When merging two HyperLogLogPlus counters that have a large intersection of their underlying sets it is possible to introduce a large amount of error in cardinality estimation. This is caused by checking if the size of the two sparse lists sums to a number greater than the sparseThreshold. If the two lists share many elements, then their actual merge should be a list much smaller than the sparse threshold - this can cause a sparse counter to be promoted to a normal counter long before it should. If this happens it both wastes space and, if it happens too early can cause large errors in cardinality estimation ( I have seen up to 30% in tests ) due to the bias estimation curves not having samples for cardinalities that low.

Forcing merges to always merge sparse lists before checking size helps, but does not completely eliminate the problem. I believe this is due to problems in the merge implementation that I will open another issue for.

The easiest way to reproduce this error is to produce two counters with identical elements at just over 1/2 the sparseThreshold, then merge them.

@seancarr
Copy link

seancarr commented May 7, 2013

I recommend not using the sparse set as it's currently implemented because it actually uses a lot more memory than the normal set. I'm surprised you're getting 30% error with the normal set at any cardinality though. Are you using the latest version of the code? What is the cardinality range you're testing with?

@deGravity
Copy link
Author

I am more concerned with serialization size than memory ( I have modified the sparse list serialization format to be much smaller ). The 30% error only occurs at very low p values, but the same problem can occur for higher p with lower error. I found the problem while testing merging a counter with a copy of itself (it was actually a test of serialization - I rematerialized a counter after serializing it, and figured that an easy way to test if the counters were identical was to merge the counter with its re-materialized self and see if anything changed). My ranges for p, sp, and cardinalities are as follows:

public static final long[] cardinalities = { 0, 1, 10, 100, 1000, 10000, 100000 };
public static final int[] ps = { 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 };
public static final int[] sps = { 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32 };

@abramsm
Copy link
Contributor

abramsm commented May 9, 2013

I ran a test of every combination of cardinalities, ps, and sps and I was not able to reproduce the error. I built the HLLP, serialized, rehydrated into a new object, then merged that object with the original. The cardinalities always identical. Can you share a test that shows the issue.

Regarding sparse representation. Can you please share your implementation that is more memory efficient?

@deGravity deGravity reopened this May 9, 2013
@deGravity
Copy link
Author

Yes. I need to do some refactoring and jump through some hoops to comply with my company's open source policies (the pull request will come from a different account) but I will get the changes to you. It isn't more memory efficient; it is a change in getBytes() and the Builder() class to make serialization more efficient.

@abramsm
Copy link
Contributor

abramsm commented May 13, 2013

Great, thanks we appreciate it.

Matt

On Mon, May 13, 2013 at 4:08 PM, deGravity notifications@github.com wrote:

Yes. I need to do some refactoring and jump through some hoops to comply
with my company's open source policies (the pull request will come from a
different account) but I will get the changes to you. It isn't more memory
efficient; it is a change in getBytes() and the Builder() class to make
serialization more efficient.


Reply to this email directly or view it on GitHubhttps://github.com//issues/31#issuecomment-17837052
.

@abramsm
Copy link
Contributor

abramsm commented Jul 23, 2013

does our work on #38 meet your needs? We've made significant improvements to space required for serialization and memory footprint.

@cburroughs
Copy link
Contributor

@deGravity Have you had a chance to look at this again?

@tea-dragon
Copy link
Contributor

seems likely to be resolved to me. please re-open if not

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants