Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PFADD hangs on empty values #1657

Closed
pepijndevos opened this issue Apr 3, 2014 · 6 comments
Closed

PFADD hangs on empty values #1657

pepijndevos opened this issue Apr 3, 2014 · 6 comments

Comments

@pepijndevos
Copy link

The following Python snippet makes a Redis build of cf34507b870124f99a68f13128274af3d708b50b on Arch Linux hang at 100% CPU.

import redis
rcon = redis.StrictRedis()
rcon.execute_command('PFADD', "testest2", "a", "", "b")
@badboy
Copy link
Contributor

badboy commented Apr 3, 2014

The bug is in hyperloglog.c: https://github.com/antirez/redis/blob/unstable/src/hyperloglog.c#L319-322

hash = MurmurHash64A(ele,elesize,0);
...
while((hash & bit) == 0) {

Because the string is empty it hashes to 0. The following while loop thus never ends.

@mattsta
Copy link
Contributor

mattsta commented Apr 3, 2014

Exactly right!

Redis's MurmurHash64A is hashing the empty string to 0, which infinite loops us right here: https://github.com/antirez/redis/blob/unstable/src/hyperloglog.c#L322 because (0 & X) can't not be 0.

Another implementation of murmur had the same problem (hash of the empty string being 0) and did a very tiny fix to include the string terminator \0 as well: johnhaddon/cortex@6a2cad3

If we give hllAdd the length of the entire string including the null terminator (len = sdslen(arg)+1), we stop infinite looping and generate meaningful hash values. Of course, that does change all existing hash values. Better to catch this now before anybody has production data running against it.

Before:

Returning H of: 510903276987443985 (hash of "a")
Returning H of: 0 (hash of "")
<infinite loop>

After:

Returning H of: 7517249135209847859  (hash of "a")
Returning H of: 6351753276682545529  (hash of "")
Returning H of: 3063979243939886323  (hash of "b")

antirez added a commit that referenced this issue Apr 3, 2014
We need to guarantee that the last bit is 1, otherwise an element may
hash to just zeroes with probability 1/(2^64) and trigger an infinite
loop.

See issue #1657.
@antirez
Copy link
Contributor

antirez commented Apr 3, 2014

Fixed, thank you. The bug was triggered by the behavior of this specific hash function implementation, but actually the bug is unrelated to that since a legitimate element may hash to 00000...0000.

@antirez antirez closed this as completed Apr 3, 2014
@antirez
Copy link
Contributor

antirez commented Apr 3, 2014

Well actually I think we should follow Matt advice and modify the hash function since it is uncool that a common value maps to a such rare pattern...

@antirez
Copy link
Contributor

antirez commented Apr 4, 2014

Ok guys I've considered this a bit more. My opinion:

  • The principle of changing the hashing function so that the empty string does not map to "" is good.
  • The fix for cortex is wrong. The problem is not fixed, just the empty string is never hashed :-) Moreover semantically everything gets more complex, you can't pass a binary string which is not null terminated, you do always 1 byte more of work. No good reasons IMHO.

The real source of issues is that simply I selected "0" as seed. The seed argument is how the hash value is initialized, and is as a side effect the hash value of the empty string. The only issue here is that when I wrote the original code, I failed to realize this and set it to zero. So I'm changing this value.

However I'm tankful to my lameness since otherwise the bug about counting bits could be still there, even if the probability of it actually being triggered is very small for us to ever notice, probably! But could be used as a DOS triggered by outside.

antirez added a commit that referenced this issue Apr 4, 2014
Using a seed of zero has the side effect of having the empty string
hashing to what is a very special case in the context of HyperLogLog: a
very long run of zeroes.

This did not influenced the correctness of the result with 16k registers
because of the harmonic mean, but still it is inconvenient that a so
obvious value maps to a so special hash.

The seed 0xadc83b19 is used instead, which is the first 64 bits of the
SHA1 of the empty string.

Reference: issue #1657.
@mattsta
Copy link
Contributor

mattsta commented Apr 4, 2014

👍 Two fixes for the price of one bug report! Happy to see the "empty string avoids hashing" scenario got re-fixed too. :octocat:

antirez added a commit that referenced this issue Apr 16, 2014
We need to guarantee that the last bit is 1, otherwise an element may
hash to just zeroes with probability 1/(2^64) and trigger an infinite
loop.

See issue #1657.
antirez added a commit that referenced this issue Apr 16, 2014
Using a seed of zero has the side effect of having the empty string
hashing to what is a very special case in the context of HyperLogLog: a
very long run of zeroes.

This did not influenced the correctness of the result with 16k registers
because of the harmonic mean, but still it is inconvenient that a so
obvious value maps to a so special hash.

The seed 0xadc83b19 is used instead, which is the first 64 bits of the
SHA1 of the empty string.

Reference: issue #1657.
antirez added a commit that referenced this issue Apr 16, 2014
We need to guarantee that the last bit is 1, otherwise an element may
hash to just zeroes with probability 1/(2^64) and trigger an infinite
loop.

See issue #1657.
antirez added a commit that referenced this issue Apr 16, 2014
Using a seed of zero has the side effect of having the empty string
hashing to what is a very special case in the context of HyperLogLog: a
very long run of zeroes.

This did not influenced the correctness of the result with 16k registers
because of the harmonic mean, but still it is inconvenient that a so
obvious value maps to a so special hash.

The seed 0xadc83b19 is used instead, which is the first 64 bits of the
SHA1 of the empty string.

Reference: issue #1657.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants