HyperLogLogPlus merge introduces error when counters overlap #31

deGravity · 2013-05-07T22:36:57Z

When merging two HyperLogLogPlus counters whose underlying sets have a large intersection, it is possible to introduce a large amount of error in cardinality estimation. This is caused by checking whether the sizes of the two sparse lists sum to a number greater than the sparseThreshold. If the two lists share many elements, their actual merge can be a list much smaller than the sparse threshold, so this check can promote a sparse counter to a normal counter long before it should be. When that happens it wastes space and, if it happens too early, can cause large errors in cardinality estimation (I have seen up to 30% in tests) because the bias estimation curves do not have samples for cardinalities that low.

Forcing merges to always merge sparse lists before checking size helps, but does not completely eliminate the problem. I believe this is due to problems in the merge implementation that I will open another issue for.

The easiest way to reproduce this error is to build two counters with identical elements, each with a sparse list just over half the sparseThreshold in size, then merge them.
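
To make the report concrete, here is a minimal sketch of the two promotion checks, written against a toy model rather than the library's code. The sparse list is modelled as a sorted set of integers, and every name in it (`SparseMergeSketch`, `promotesBySizeSum`, `promotesAfterMerge`, `listOfSize`) and the threshold of 100 are hypothetical. A real sparse list also has to reconcile entries that share a register index, which this model ignores.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of the sparse-list size check; not the library's actual code. */
final class SparseMergeSketch {

    /** The check described in the report: compares the summed sizes to the threshold. */
    static boolean promotesBySizeSum(SortedSet<Integer> a, SortedSet<Integer> b, int sparseThreshold) {
        return a.size() + b.size() > sparseThreshold;
    }

    /** Union of two sparse lists; entries present in both collapse to one. */
    static SortedSet<Integer> mergeSparse(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> merged = new TreeSet<>(a);
        merged.addAll(b);
        return merged;
    }

    /** The workaround: merge first, then compare the merged size to the threshold. */
    static boolean promotesAfterMerge(SortedSet<Integer> a, SortedSet<Integer> b, int sparseThreshold) {
        return mergeSparse(a, b).size() > sparseThreshold;
    }

    /** Builds a list holding 0..n-1 as stand-ins for encoded hash entries. */
    static SortedSet<Integer> listOfSize(int n) {
        SortedSet<Integer> list = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    public static void main(String[] args) {
        int sparseThreshold = 100; // hypothetical value, chosen for illustration
        SortedSet<Integer> a = listOfSize(sparseThreshold / 2 + 1); // 51 entries
        SortedSet<Integer> b = listOfSize(sparseThreshold / 2 + 1); // identical 51 entries
        // 51 + 51 = 102 exceeds the threshold, yet the union still has only 51 entries.
        System.out.println("size-sum check promotes:   " + promotesBySizeSum(a, b, sparseThreshold));
        System.out.println("merge-then-check promotes: " + promotesAfterMerge(a, b, sparseThreshold));
    }
}
```

With two identical lists of 51 entries and a threshold of 100, the size-sum check reports that promotion is needed while the merge-then-check version does not, which is the situation described in the reproduction steps.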

seancarr · 2013-05-07T22:51:29Z

I recommend not using the sparse set as it's currently implemented because it actually uses a lot more memory than the normal set. I'm surprised you're getting 30% error with the normal set at any cardinality though. Are you using the latest version of the code? What is the cardinality range you're testing with?

deGravity · 2013-05-07T23:33:38Z

I am more concerned with serialization size than memory (I have modified the sparse list serialization format to be much smaller). The 30% error only occurs at very low p values, but the same problem can occur for higher p with lower error. I found the problem while testing merging a counter with a copy of itself. It was actually a test of serialization: I rematerialized a counter after serializing it, and figured that an easy way to test whether the counters were identical was to merge the counter with its rematerialized self and see if anything changed. My ranges for p, sp, and cardinalities are as follows:

public static final long[] cardinalities = { 0, 1, 10, 100, 1000, 10000, 100000 };
public static final int[] ps = { 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 };
public static final int[] sps = { 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32 };

abramsm · 2013-05-09T19:17:56Z

I ran a test of every combination of cardinalities, ps, and sps and was not able to reproduce the error. I built the HLLP, serialized it, rehydrated it into a new object, then merged that object with the original. The cardinalities were always identical. Can you share a test that shows the issue?
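
For reference, the shape of that round-trip test can be sketched as below. This is only an illustration under stated assumptions: `RoundTripSketch` is a hypothetical exact counter standing in for HyperLogLogPlus, and its `getBytes()` and `build()` are toy methods named after the calls mentioned in this thread, not the library's implementations. Because the toy counter is exact, the check always passes here; against the real class the same loop would also iterate over the p and sp values listed above.

```java
import java.nio.ByteBuffer;
import java.util.TreeSet;

/** Toy exact counter used only to illustrate the round-trip test; not HyperLogLogPlus. */
final class RoundTripSketch {
    private final TreeSet<Long> entries = new TreeSet<>();

    void offer(long value) {
        entries.add(value);
    }

    long cardinality() {
        return entries.size();
    }

    /** Folds another counter's entries into this one. */
    void merge(RoundTripSketch other) {
        entries.addAll(other.entries);
    }

    /** Writes the entry count followed by each entry (toy format). */
    byte[] getBytes() {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * entries.size());
        buf.putInt(entries.size());
        for (long entry : entries) {
            buf.putLong(entry);
        }
        return buf.array();
    }

    /** Rebuilds a counter from the bytes produced by getBytes(). */
    static RoundTripSketch build(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        RoundTripSketch counter = new RoundTripSketch();
        int count = buf.getInt();
        for (int i = 0; i < count; i++) {
            counter.offer(buf.getLong());
        }
        return counter;
    }

    /** True if merging a counter with its rehydrated copy leaves the cardinality unchanged. */
    static boolean selfMergeIsStable(long cardinality) {
        RoundTripSketch original = new RoundTripSketch();
        for (long i = 0; i < cardinality; i++) {
            original.offer(i);
        }
        long before = original.cardinality();
        RoundTripSketch copy = build(original.getBytes());
        original.merge(copy);
        return original.cardinality() == before;
    }

    public static void main(String[] args) {
        long[] cardinalities = { 0, 1, 10, 100, 1000, 10000, 100000 };
        for (long c : cardinalities) {
            System.out.println(c + " -> stable after self-merge: " + selfMergeIsStable(c));
        }
    }
}
```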

Regarding the sparse representation: can you please share your implementation that is more memory efficient?

deGravity · 2013-05-13T20:08:58Z

Yes. I need to do some refactoring and jump through some hoops to comply with my company's open source policies (the pull request will come from a different account) but I will get the changes to you. It isn't more memory efficient; it is a change in getBytes() and the Builder() class to make serialization more efficient.

abramsm · 2013-05-13T23:26:37Z

Great, thanks, we appreciate it.

Matt


abramsm · 2013-07-23T21:41:45Z

Does our work on #38 meet your needs? We've made significant improvements to the space required for serialization and to the memory footprint.

cburroughs · 2014-01-28T13:26:00Z

@deGravity Have you had a chance to look at this again?

tea-dragon · 2015-03-16T17:04:49Z

Seems likely to be resolved to me. Please re-open if not.

deGravity closed this as completed May 9, 2013

deGravity reopened this May 9, 2013

tea-dragon closed this as completed Mar 16, 2015
