Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MurmurHash3 collision property #58

Open
JuliaX92 opened this issue Apr 10, 2018 · 4 comments
Open

MurmurHash3 collision property #58

JuliaX92 opened this issue Apr 10, 2018 · 4 comments

Comments

@JuliaX92
Copy link

Hi,

I have to choose a hash function for a Bloom Filter in my Bachelor's thesis.
As recommended in some tutorials I used a version of the MurmurHash3.
My supervisor wants me to find the value of the collision property but I cannot find the place of the documentation.
I have read that it is low but not a real value.
If someone could help to find the collision property, I'd be very greatful.

Thanks,
Julia

@sebres
Copy link

sebres commented Apr 10, 2018

I hope you don't mean it in cryptographic sense, because MMH-3 function belongs not to the class of strong cryptography.
It's a hash-function which primary role to provide a good distribution within short time, so it trades off "correctness" for speed.

So again - it generates fast and good enough distributed hash value. Not more.

As regards the calculating of the odds resp. the chance of a collision of some hash algorithms, it is similar to generalization of the birthday problem.

One may assume that for the ideal hash-function with size N, the count of generated hashes without collisions seeks to 2N.

But, as the BP says us, the expected number of N-bit hashes that can be generated before getting a collision is not 2N, but rather only 2​N⁄2.
So in case of 64-bit MMH you'll have to generate up-to 4294967296 items to catch a collision.
For 128-bit MMH it seeks to 18446744073709551616 items.

In terms of ideal function:

  • it needs to pass Chi-Square distribution tests;
  • be strong by Avalanche Effect (no or extremely hardly forecast);
  • be collision resistant: e. g. 2 different keys should have only a random chance to collision.

It is not so easy to do mathematical correct calculations of real collision chance outside of your range parameters, without of knowledge of the keysets and bucket sizes (so for which sets do you need it), how the data to be hashed was generated, etc.

But you could also estimate it (or calculate the limits), using simply simulation with your basic key data.
This good article may help you by estimation of collision property in your case.

So for example (from the article), there were more than 400 collisions by 2M different file paths (by the way for your data-sets it may look totally different).
This meant that we've odds of 5000:1 (so in about the probability 0.0002 to catch a collision on such a dataset).
To calculate how it will really look by another count of picked hashes, you should apply GBP-algorithm.
See here for example, how you can do it.

See also Probability of secure hash function collisions with proof, it's another nice article of John D. Cook, PhD to theme hash collisions.

@lemire
Copy link

lemire commented Apr 10, 2018

When MurmurHash is used as a deterministic function (without randomization), then the answer is that you can find two keys that always collide. With 100% probability.

How do I know this? Simply because there are more strings that you can hash than there are hash values. So you must have collisions.

So maybe you randomize MurmurHash... MurmurHash is at best pairwise universal because it is an iterated hash function...

Even if you could randomize MurmurHash, it is not clear that there are known bounds on the universality of randomized MurmurHash.

If you just have such proofs, there are related functions with formal proofs, see...

... but MurmurHash was not designed with universality in mind.

Or else, maybe you mean something else... maybe you mean regularity.

I don't know that it is regular, however.

So I don't know what you mean.

@sebres
Copy link

sebres commented Apr 11, 2018

the answer is that you can find two keys that always collide. With 100% probability.

@lemire I think her question was not whether a collision is possible (it's pretty clear), it was rather how large may be the chance to catch a collision on some set of keys, e. g. by 2 different text-messages (and this probability is definitively not 100% 😄), also not too large, if this keys are pseudo- resp. even totally not random. This is highly depended also on the dataset used as well as on the set-size (with other words on count of the hashes picked).
As already said above, by absolutely random-sets the count of items to get a collision by 64-bit hash would be 232 (and not 264) so 4294967296 items.
If they are not really random, it is not so easy to estimate, but still possible.

@lemire
Copy link

lemire commented Apr 11, 2018

I don't know what the question was... so I am speculating... hoping that it will help clear up the question...

  • Given two text messages, they either always collide, or they do not (assuming that the hash function is not randomized). This might be a trivial observation but I am not sure everyone realizes that. If your text messages are fixed, then you need to have a randomized hash function to be able to talk about probabilities. Then you want to talk about universality. There is no established result regarding a randomized MurmurHash's universality (to my knowledge)
  • If we are considering the scenario where we pick text strings at random... and that's usually not how applications are built (e.g., very often the strings are provided to you, possibly by an adversary, we rarely work with random strings)... then if you pick the text messages at random, all that matters is that they randomly end up on any given hash value. That's regularity. We can expect good regularity out of MurmurHash, but I am not sure that there are established results.
  • Finally, there is the scenario described by @sebres : you have a set of strings and you want to know how many of them collide. Again, what matters is regularity (how evenly are the hash values spread out) but now we care also about the characteristics of the set involved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants