Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hyperloglog pfmerge inflates key size #3819

Closed
refaelos opened this issue Feb 20, 2017 · 37 comments
Closed

hyperloglog pfmerge inflates key size #3819

refaelos opened this issue Feb 20, 2017 · 37 comments

Comments

@refaelos
Copy link

Hey,

I'll go straight to the point:
I have 2 hyperloglog keys that I want to pfmerge into one. Each key has pfcount of 2 and a size of 160 bytes. When I merge them, the count is 4 (which is good) but size becomes 14400 bytes!

Why does it happen?

@itamarhaber
Copy link
Member

Redis uses two different internal encodings for HLL: dense that is always 12288 bytes or sparse (see the the hll_sparse_max_bytes configuration directive for controlling the threshold). An HLL will begin sparse and, if it grows, will eventually converted to dense.

That said, it appears that keys created by PFMERGE are always encoded with the dense representation, regardless the configuration and/or the resulting HLL's cardinality. IMHO this is a minor issue, if at all, given that merging HLLs is rarely (read, if ever) done on sparse data.

@refaelos
Copy link
Author

@itamarhaber thanks for the detailed explanation. It helps.

I think it's an issue with redis b/c one of our databases just grew from 4GB to to 12GB just because we merged some HLLs as a part of some maintenance we did. Do you think this issue can be solved?

@antirez
Copy link
Contributor

antirez commented Feb 20, 2017

Hello. It is actually possible to fix this issue... in theory after merging Redis should check again if the HyperLogLog is sparse, and convert it back into a sparse HLL. Tagging the issue to solve it at some point (not sure why, quite some backlog...).

@refaelos
Copy link
Author

@antirez thanks. Looking forward to it.

@refaelos
Copy link
Author

@antirez can you tell when this will be done?

@antirez antirez added this to the Urgent milestone Jul 23, 2017
@antirez
Copy link
Contributor

antirez commented Jul 23, 2017

Moved it to the Urgent milestone, but not sure about the ETA. However being under urgent will make it faster. There are other priorities to be ready in mid-September and the vacations in between, so let's see what happens. Cheers.

@refaelos
Copy link
Author

@antirez 👍

@mbarkhau
Copy link

Have there been any updates on this, maybe something in a development branch that I can look at?

@antirez
Copy link
Contributor

antirez commented Dec 20, 2017

Sorry we are late about improving this one. Maybe @artix75 wants to take over it and improve the code faster than I could?

@mbarkhau
Copy link

I'm looking at pfmergeCommand.

By that point i think the merge has already been performed, but it may now be the case that
sdslen(o->ptr) > server.hll_sparse_max_bytes, which I'm guessing is why the conversion using hllSparseToDense(o) is done (or maybe there are other reasons?).

If not, I guess the steps would be something like:

  1. Check if hdr->encoding == HLL_DENSE and proceed as before
  2. Only if hdr->encoding == HLL_SPARSE && sdslen(o->ptr) > server.hll_sparse_max_bytes perform hllSparseToDense(o)

As for the what the subsequent code is doing and what's appropriate for the HLL_SPARSE case, I'm still looking at.

@mbarkhau
Copy link

Unfortunately I won't have time to dig into this any further at least until january.

@refaelos
Copy link
Author

@antirez @artix75 - this is a major issue with redis as it prevents the merge operation from happening on large scale redis servers.

Please see if you can fit this into your schedule. It would be very useful.

Thanks!

@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

Yep not ideal, checking the problem today.

@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

Hello, I pushed a fix into unstable, testing it right now.

antirez added a commit that referenced this issue Dec 22, 2017
The commit splits the add functions into a set() and add() set of
functions, so that it's possible to set registers in an independent way
just having the index and count.

Related to #3819, otherwise a fix is not possible.
antirez added a commit that referenced this issue Dec 22, 2017
@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

P.S. as a side effect PFMERGE could be much faster in certain small cardinality use cases.

@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

Warning: I found a few bugs, fixing.

@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

Just pushed a fix, we should be fine now. I'm writing a stress tester for this code.

antirez added a commit that referenced this issue Dec 22, 2017
This is a fix for the #3819 improvements. The o->ptr may change because
of hllSparseSet() calls, so 'hdr' must be correctly re-fetched.
@refaelos
Copy link
Author

@antirez thank you so much for this!
Waiting for the release 👍

@mbarkhau
Copy link

@antirez thank you for taking this up.

Looking at the diffs i wonder:

  1. Isn't possible that after PFMERGE of two sparse HLL values, the new sparse HLL will be larger than hll_sparse_max_bytes.
  2. For the keys that I've already done PFMERGE on, I'm now stuck with the dense representation? Is it possible to somehow convert them back to sparse?

@mbarkhau
Copy link

Ok 1. seems to be answered by this

As a side effect the function may promote the HLL representation from sparse to dense
https://github.com/antirez/redis/blob/cbe9d725a727d438983c88f67b32bb98115e38dc/src/hyperloglog.c#L646

@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

Hello @mbarkhau, there is none of the problems you mentioned. What you say in "2" is actually the bug that I fixed, this was the old behavior. About "1", if the new representation after PFMERGE cannot be stored into a sparse representation it will automatically be promoted to dense.

@mbarkhau
Copy link

@antirez just to make sure there's no misunderstanding. I have a redis instance running 4.0.2 with a few 100k hll values that were generated using pfmerge and are now stored in the dense format. You're saying that if I now update to unstable, and go through these keys doing pfmerge again, the resulting hll values may end up in the sparse representation?

If so, that would be amazing!

@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

No sorry, I misunderstood, but your problem is one worth fixing indeed! I'll try to find a solution today, if I run out of time, after returning back from holidays. Thanks for reporting this problem!

@mbarkhau
Copy link

Just to mention another use case for this: Somebody might change the hll_sparse_max_bytes after the fact and want to convert their existing entries.

@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

Yes, I was more focused actually on this last use case, because the current problem, we can fix with a DEBUG command to reconvert back, but on the long run to fix this on loading is much better.

@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

To do this efficiently could require a function that can than be used in order to implement even faster PFMERGE btw. That is, a function able to insert into the sparse representation without re-parsing everything, if it can assume the element we are going to insert is bigger than any other one already in. I'll study the problem with care.

@antirez
Copy link
Contributor

antirez commented Dec 22, 2017

However note that you have already a simple way to convert Dense -> Sparse right now (after applying this patch I published here)

DEL temp_var
PFMERGE temp_var my_HLL
RENAME my_HLL temp

Try it with care before deploying.

0xtonyxia pushed a commit to 0xtonyxia/redis that referenced this issue Jan 10, 2018
The commit splits the add functions into a set() and add() set of
functions, so that it's possible to set registers in an independent way
just having the index and count.

Related to redis#3819, otherwise a fix is not possible.
0xtonyxia pushed a commit to 0xtonyxia/redis that referenced this issue Jan 10, 2018
0xtonyxia pushed a commit to 0xtonyxia/redis that referenced this issue Jan 10, 2018
This is a fix for the redis#3819 improvements. The o->ptr may change because
of hllSparseSet() calls, so 'hdr' must be correctly re-fetched.
antirez added a commit that referenced this issue Jan 11, 2018
The commit splits the add functions into a set() and add() set of
functions, so that it's possible to set registers in an independent way
just having the index and count.

Related to #3819, otherwise a fix is not possible.
antirez added a commit that referenced this issue Jan 11, 2018
antirez added a commit that referenced this issue Jan 11, 2018
This is a fix for the #3819 improvements. The o->ptr may change because
of hllSparseSet() calls, so 'hdr' must be correctly re-fetched.
JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Jan 13, 2018
The commit splits the add functions into a set() and add() set of
functions, so that it's possible to set registers in an independent way
just having the index and count.

Related to redis#3819, otherwise a fix is not possible.
JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Jan 13, 2018
JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Jan 13, 2018
This is a fix for the redis#3819 improvements. The o->ptr may change because
of hllSparseSet() calls, so 'hdr' must be correctly re-fetched.
JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Jan 13, 2018
The commit splits the add functions into a set() and add() set of
functions, so that it's possible to set registers in an independent way
just having the index and count.

Related to redis#3819, otherwise a fix is not possible.
JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Jan 13, 2018
JackieXie168 pushed a commit to JackieXie168/redis that referenced this issue Jan 13, 2018
This is a fix for the redis#3819 improvements. The o->ptr may change because
of hllSparseSet() calls, so 'hdr' must be correctly re-fetched.
@mbarkhau
Copy link

mbarkhau commented Feb 1, 2018

Pardon me if I misunderstood how this is supposed to work. I've tried the following and thought that key c would end up in the sparse representation.

127.0.0.1:6379> INFO server
# Server
redis_version:4.0.7
...
127.0.0.1:6379> DEL a
(integer) 1
127.0.0.1:6379> DEL b
(integer) 1
127.0.0.1:6379> CONFIG SET hll-sparse-max-bytes 1
OK
127.0.0.1:6379> PFADD a test
(integer) 1
127.0.0.1:6379> STRLEN a
(integer) 12304
127.0.0.1:6379> CONFIG SET hll-sparse-max-bytes 10000
OK
127.0.0.1:6379> PFADD b test
(integer) 1
127.0.0.1:6379> STRLEN b
(integer) 21
127.0.0.1:6379> PFMERGE c a b
OK
127.0.0.1:6379> STRLEN c
(integer) 12304

@refaelos
Copy link
Author

refaelos commented Feb 1, 2018

@mbarkhau I don't think this fix is pushed to any live version yet.

@mbarkhau
Copy link

mbarkhau commented Feb 2, 2018

@refaelos I read this in RELEASENOTES

================================================================================
Redis 4.0.7     Released Wed Jan 24 11:01:40 CET 2018
================================================================================
...
* HyperLogLogs are no longer converted from sparse to dense in order
  to be merged. (Salvatore Sanfilippo)

But I guess this is maybe just one part of this issue?

@antirez
Copy link
Contributor

antirez commented Feb 2, 2018

@mbarkhau why merging a sparse and dense HLL should result into a sparse one?

@antirez
Copy link
Contributor

antirez commented Feb 2, 2018

The previous bug was that merging two sparse representations, always produced a dense one... There is however no "auto conversion" to the new settings. That would be quite costly probably. However what we want to do is the ability to do that on RDB reloadings. Now you can get this effect with AOF reloads btw.

@refaelos
Copy link
Author

refaelos commented Feb 2, 2018

@antirez the problem was that merging two HLLs made the merged one huge compared to what it should be. Is it now fixed in 4.0.7 ?

@mbarkhau
Copy link

mbarkhau commented Feb 2, 2018

@mbarkhau why merging a sparse and dense HLL should result into a sparse one?

From your earlier comment I thought this is how it would behave.

@antirez wrote:
However note that you have already a simple way to convert Dense -> Sparse right now (after applying this patch I published here)

DEL temp_var
PFMERGE temp_var my_HLL
RENAME my_HLL temp

So you're saying Dense -> Sparse is not possible with 4.0.7?

@antirez
Copy link
Contributor

antirez commented Feb 2, 2018

@mbarkhau no it's not possible to go from dense to sparse right now. What we are avoiding with the patch, is to have a sparse+sparse merge producing always a dense representation. However this development is possible for the future, but in order to avoid consuming CPU we have to do it when it's, let's say... opportunistic to do the conversion, that is in theory during PFCOUNT, because once we scan and get the number of zeroes, we can figure if it's worth to attempt a conversion. Surely doable but needs some care, and there are open problems, like, what's happening to the slaves given that PFCOUNT is a read only command?

@antirez
Copy link
Contributor

antirez commented Feb 5, 2018

Closing this issue since the auto-promotion to dense representation is now fixed. We have to consider, later, if we should also try to convert from dense to sparse at loading time in case the new configuration makes it applicable, however I've the feeling that in order to do that, we need to introduce a proper HLL type in Redis instead of relying on the string format. That could be addressed in Redis 5.0, but because 5.0 has as target just streams I'm not sure it there will be time.

@antirez antirez closed this as completed Feb 5, 2018
@refaelos
Copy link
Author

refaelos commented Feb 5, 2018

@antirez so the issue with inflating memory is resolved? Can we safely merge HLLs now using the latest version?

pulllock pushed a commit to pulllock/redis that referenced this issue Jun 28, 2023
The commit splits the add functions into a set() and add() set of
functions, so that it's possible to set registers in an independent way
just having the index and count.

Related to redis#3819, otherwise a fix is not possible.
pulllock pushed a commit to pulllock/redis that referenced this issue Jun 28, 2023
pulllock pushed a commit to pulllock/redis that referenced this issue Jun 28, 2023
This is a fix for the redis#3819 improvements. The o->ptr may change because
of hllSparseSet() calls, so 'hdr' must be correctly re-fetched.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants