Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Crystal::Hasher for computing a hash value #4946

Merged
merged 1 commit into from Sep 11, 2017
Merged

Introduce Crystal::Hasher for computing a hash value #4946

merged 1 commit into from Sep 11, 2017

Conversation

asterite
Copy link
Member

@asterite asterite commented Sep 10, 2017

This PR is a preparation to better implement the hash method for Crystal objects.

Instead of asking types to implement hash, we now provide a default implementation and ask subclasses to implement hash(hasher). A hasher provides a safe way to compute a hash, and can also compute a different value for the same program at different runs to prevent DoS attacks. So this fixes #2817

The current implementation lives in Crystal::Hasher and it's not public. The idea is not to leak this information. For example in Ruby you can't pass a hasher to methods, and you can't choose a hasher. This is all decided by Ruby's runtime, and hidden, and this is good because it removes the burden from the user.

The implementation here is safe(r) because it randomizes the initial hash value. However, I'm sure we could use SipHash or FNV or some other algorithm. But this is just a start. Next PRs should just touch Crystal::Hasher (and probably BigInt and others because the hash is not correct there).

Fixes #2817
Fixes #4578

@asterite
Copy link
Member Author

@asterite asterite commented Sep 10, 2017

This implements the code organization part of #4675 but not the actual "good" implementation. But this PR is simpler to review.

@asterite
Copy link
Member Author

@asterite asterite commented Sep 10, 2017

Also fixes #4578

@RX14
Copy link
Contributor

@RX14 RX14 commented Sep 10, 2017

@asterite I don't think "fixes #foo" outside the main PR body actually gets github to close the issues (you can actually see on the issue page if a PR will close it now, which is cool)

@asterite
Copy link
Member Author

@asterite asterite commented Sep 10, 2017

@RX14 Thanks, updated!

end

def float(value)
@value = @value * 31 + value.to_f64.unsafe_as(UInt64)
Copy link
Contributor

@RX14 RX14 Sep 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use this on the floats.

Copy link
Member Author

@asterite asterite Sep 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can leave it for a next PR

Copy link
Contributor

@RX14 RX14 Sep 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO then?

Copy link
Member Author

@asterite asterite Sep 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a big todo at the top of the struct. On fact in this PR I could have used zero as a seed, but I wanted to see that it worked with it.

Copy link
Member Author

@asterite asterite Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what you are asking is not exactly covered by the above TODO. Still, I'd like to tackle these issues one at a time. After we merge this, I'll probably wait for someone (you? :-)) to use float normalization, though I'll require a test to see why it's needed. Then another PR can tackle another (better) hash algorithm, etc.

Copy link
Contributor

@akzhan akzhan Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

canonicalize is OK, but implementation proposed by @funny-falcon (PyHash inspired) is more robust and generic normalization that takes in care canonicalization too. Will propose it as separate PR.

RX14
RX14 approved these changes Sep 11, 2017
Copy link
Member

@bcardiff bcardiff left a comment

I like how the pattern take advantage of the struct and chain the results. 💙

It's a pity that all the methods in the hasher have a value argument that hide the getter of the sole ivar @value. 🤷‍♂️

Copy link
Contributor

@ysbaddaden ysbaddaden left a comment

Looks great, and it avoids pointers. Now I'd love to see halfsiphash and siphash hashing!

Copy link
Member

@straight-shoota straight-shoota left a comment

I'd suggest to add add a simple explanatory doc string to each hash(hasher) implementation, similar to the one in the example for def_hash: Define a `hash(hasher)` method based on @name and @age.
At least all the obsolete documentation should be updated or removed.

src/enum.cr Outdated
# Returns a hash value. This is the hash of the underlying value.
def hash
value.hash
def hash(hasher)
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc

Copy link
Member Author

@asterite asterite Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a "See Object#hash(hasher) to all of these

src/float.cr Outdated
@@ -86,6 +86,10 @@ struct Float
end
end

def hash(hasher)
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc

src/hash.cr Outdated
# ```
def hash
hash = size
def hash(hasher)
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc

src/bool.cr Outdated
@@ -42,8 +42,8 @@ struct Bool
end

# Returns a hash value for this boolean: 0 for `false`, 1 for `true`.
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These fixed hash value should be removed.

src/char.cr Outdated
@@ -415,8 +415,8 @@ struct Char
end

# Returns this char's codepoint.
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc string is no longer correct.

def hash
object_id
end
def_hash object_id
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc

src/xml/node.cr Outdated
def hash
object_id
end
def_hash object_id
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc

def hash
object_id
end
def_hash object_id
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc

src/yaml/any.cr Outdated
def hash
raw.hash
end
def_hash raw
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc

src/json/any.cr Outdated
def hash
raw.hash
end
def_hash raw
Copy link
Member

@straight-shoota straight-shoota Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc

@@ -206,11 +206,13 @@ describe "Float" do

describe "hash" do
it "does for Float32" do
1.2_f32.hash.should_not eq(0)
1.2_f32.hash.should eq(1.2_f32.hash)
1.2_f32.hash.should_not eq(1.3_f32.hash)
Copy link
Contributor

@akzhan akzhan Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this expectation is wrong in common case.

Copy link
Member Author

@asterite asterite Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this then. Thanks

end

it "does for Float64" do
1.2.hash.should_not eq(0)
1.2.hash.should eq(1.2.hash)
1.2.hash.should_not eq(1.3.hash)
Copy link
Contributor

@akzhan akzhan Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

end

def float(value)
@value = @value * 31 + value.to_f64.unsafe_as(UInt64)
Copy link
Contributor

@akzhan akzhan Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

canonicalize is OK, but implementation proposed by @funny-falcon (PyHash inspired) is more robust and generic normalization that takes in care canonicalization too. Will propose it as separate PR.

@asterite
Copy link
Member Author

@asterite asterite commented Sep 11, 2017

I forced pushed a few fixes (maybe next time I'll add more commits so it's simpler to review)

@bcardiff I changed the instance variable name of Hasher to result to avoid this confusion
@straight-shoota I added some basic docs, just "See Object#hash(hasher). Maybe these docs should actually be hidden, but it's not a big deal
@akzhan I removed those hash.should_not eq specs

I also changed one spec that would randomly fail. This is when a hash has the values, for example, 1 and 1.0 as keys. Their value are equal and their hash value, depending on the initial hasher's seed, could result equal, and so could result equal for Hash. This is wrong and should probably be improved. We will probably have to introduce an eql? or equal_for_hash method, similar to Ruby, where 1 == 1.0 is true but 1.eql?(1.0) is false. Another way would be to hardcode this logic in Hash, though it gets tricky when these values are nested inside other values. Maybe def_equals can also automatically define eql?. In any case, this discussion belongs to a different issue.

@akzhan
Copy link
Contributor

@akzhan akzhan commented Sep 11, 2017

when values are equal, their hashes should be equal too (as documented in Crystal documentation, and it's implemented in Python).

This is number normalization that was introduced by #4675 pull request, and will be proposed again later (just grep for hash_normalize in that PR or see for https://github.com/crystal-lang/crystal/pull/4675/files#diff-7eb0dd478a501936778ce47c2e9cde49 and https://github.com/crystal-lang/crystal/pull/4675/files#diff-0de55713d46889d8ddd209fcf1f6f120R161).

@RX14
Copy link
Contributor

@RX14 RX14 commented Sep 11, 2017

@asterite === and eql? are probably the wrong names for what I consider to be internal functions. equal_for_hash and equal_for_case are better names I think.

Although i think that @akzhan's recommendation is probably better.

@asterite
Copy link
Member Author

@asterite asterite commented Sep 11, 2017

@RX14 @akzhan I don't see how float normalization will solve that particular issue. Care to explain?

@akzhan
Copy link
Contributor

@akzhan akzhan commented Sep 11, 2017

@RX14
Copy link
Contributor

@RX14 RX14 commented Sep 11, 2017

The distinction is between float normalization which transforms a float into an unambiguous bitpattern, and number hash normalization which generates a unique hash from a unique numerical value. I guess thats why LLVM uses canonicalization to refer to the former. We should do the same.

@asterite
Copy link
Member Author

@asterite asterite commented Sep 11, 2017

But hash normalization for numbers will make this problem worse, because 1 and 1.0 will have the same hash value and because 1 == 1.0 a Hash can't distinguish them. So this:

{1 => 'a', 1.0 => 'b'}

will end up being:

{1.0 => 'b'}

because the second key replaces the first one, they are indistinguishable.

One way for that to work is not to use == to compare them, but use .eql? or equals_for_hash.

@RX14
Copy link
Contributor

@RX14 RX14 commented Sep 11, 2017

@asterite What we were suggesting is exactly that behaviour: you cannot insert two "ones" into a hash - regardless of their representation.

@asterite
Copy link
Member Author

@asterite asterite commented Sep 11, 2017

@RX14 Sounds good for now. I don't think it's consistent with union types, but we can delay this "issue" (if it is one) for later.

@asterite asterite merged commit f9237fd into crystal-lang:master Sep 11, 2017
2 checks passed
@asterite
Copy link
Member Author

@asterite asterite commented Sep 11, 2017

Merged!

Feel free to open issues regarding number normalization and using other hashing algorithms, or correct hash values for big numbers.

@asterite asterite deleted the refactor/hash branch Sep 11, 2017
@oprypin
Copy link
Member

@oprypin oprypin commented Sep 11, 2017

@asterite, see #2818 for why normalization is so important

@oprypin
Copy link
Member

@oprypin oprypin commented Sep 11, 2017

In fact, as I understand it, now numbers of two different types like {1 => 'a', 1.0 => 'b'} can undeterministically with a very low probability have the same hash causing {1.0 => 'b'}

# TODO: the implementation is naive and should probably use
# another algorithm like SipHash or FNV.

@@seed = uninitialized UInt64
Copy link
Member

@oprypin oprypin Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@seed = Random::System.rand(UInt64::MIN..UInt64::MAX) with #4894

Copy link
Contributor

@akzhan akzhan Oct 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It breaks the compiler at all with compiler bug inside. I don't know why :)

But I'll pull request with Random::Secure.

end

def float(value)
@result *= @result * 31 + value.to_f64.unsafe_as(UInt64)
Copy link
Member

@oprypin oprypin Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this *= intended?

@oprypin
Copy link
Member

@oprypin oprypin commented Sep 11, 2017

I really dislike that this got merged in only 19 hours. I didn't have the opportunity to see it.

@asterite
Copy link
Member Author

@asterite asterite commented Sep 11, 2017

@oprypin Don't worry, this is not the final code. This just introduces Hasher as a unified way to compute a hash. I think providing a hash(hasher) method that objects must implement to append a hash value is the way to go. But do you think there's a different way to do it?

@oprypin
Copy link
Member

@oprypin oprypin commented Sep 11, 2017

The general idea is fine, if not the most elegant. In this case it's more about the small things (please re-check that *=!)

@akzhan
Copy link
Contributor

@akzhan akzhan commented Sep 11, 2017

@asterite

@oprypin Don't worry, this is not the final code. This just introduces Hasher as a unified way to compute a hash. I think providing a hash(hasher) method that objects must implement to append a hash value is the way to go. But do you think there's a different way to do it?

I'm sorry, but @funny-falcon code was the same in terms of 'final', but a lot better. Your things are great, but now we should repeat all of its great proposals. Ok.

@asterite
Copy link
Member Author

@asterite asterite commented Sep 11, 2017

@akzhan Maybe, but I don't think so. That PR adds a hash_normalize methods to numbers, but I don't think that's correct. That will only be used to implement hash, so it should be in Hasher.

It also introduces Number::HashNormalize type. That belongs in Hasher too. unseeded is not useful either. It also changes StringPool, I don't know why.

Anyway, it had a lot more code than what was needed. And it was so big it received many comments until the author got tired. I think it's better to do this by little steps.

@ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Sep 12, 2017

BTW: I ported SipHash and made a streaming version:
https://gist.github.com/ysbaddaden/d51c66bc8e07d193ecbb9f7620be8ef5

In terms of performance, computing the hash in one bulk was on par with the official C implementation. Streaming the input adds a ~20% overhead, because the state must be regularly updated.

key = uninitialized UInt8[16]
SecureRandom.random_bytes(key.to_slice)

hasher = SipHash64(1, 3).new(key)
hasher.update("some ")
hasher.update("incoming ")
hasher.update("data ")
hasher.final # => UInt64

@ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Oct 10, 2017

BTW again: I published a shard for SipHash and HalfSipHash: https://github.com/ysbaddaden/siphash.cr

@RX14
Copy link
Contributor

@RX14 RX14 commented Oct 10, 2017

@ysbaddaden any reason why this isn't a PR to replace the default implementation? I think SipHash13 should be the default (perhaps only) hasher.

@akzhan
Copy link
Contributor

@akzhan akzhan commented Oct 10, 2017

Yes, it may be good proposal (but @funny-falcon one is better for LLVM i suppose).

Anyway I don't try to propose any implementation until pushed number normalization.

@akzhan
Copy link
Contributor

@akzhan akzhan commented Oct 10, 2017

And please leave result instead of final, there are no reasons to change the naming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment