Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change computation of hash value. #4621

Closed
wants to merge 5 commits into from

Conversation

funny-falcon
Copy link
Contributor

To protect against Hash DoS, change the way hash value is computed.
Class|Struct should define method def hashme(hasher) and call
hasher << @ivar inside.

As an option, for speed, and for backward compatibility, def hash
still could be implemented. It will be used for Hash of matched type.
Thread#hash and Enum#hash is implemented as unseeded cause they are
used before StdHasher @@seed is initialized.

But it is better to implement def hashme(hasher).

StdHasher is default hasher that uses hashme and it is used as default
seeded hasher. It also implements fasthash used for fast hash of
primitive types, and unseeded for Enums.

Fixes #4578
Prerequisite for #4557

Copy link
Contributor

@RX14 RX14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a decent idea in general but the implementation is very weak. You should also modify def_hash

src/object.cr Outdated
end

# Protocol method for safe hashing
abstract def hashme(hasher)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be renamed to just hash(hasher) and let method overloading do the work. The precedent is in how to_s and to_json work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RX14 , how could I distinguish def hash(hasher) from def hash with responds_to?(:hash) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@funny-falcon just implement def hash(hasher) in base classes to not use responds_to?.

src/object.cr Outdated
@@ -58,13 +58,19 @@ class Object
nil
end

# Generates an `Int` hash value for this object.
# Generates an `Int` hash value for this object (usually `UInt32`)
#
# This method must have the property that `a == b` implies `a.hash == b.hash`.
#
# The hash value is used along with `==` by the `Hash` class to determine if two objects
# reference the same hash key.
abstract def hash
Copy link
Contributor

@RX14 RX14 Jun 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line, it's redundant because of the method definition below.

elsif self == to_u64
to_u64
else
(self % 0x7fffffffffffffe7_i64).to_i64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this constant deserves a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is prime number.

@@ -267,8 +267,22 @@ struct BigInt < Int
to_s io
end

private def hash_normalize
if self == to_i64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These first two conditions are weird to me, can you explain what's going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment to Object#hash:

This method must have the property that a == b implies a.hash == b.hash

$ bin/crystal eval 'require "big"; b=BigInt.new(-1000); p b.to_i64, b.to_u64'
Using compiled compiler at `.build/crystal'
-1000
1000

So I can't use just to_u64. Nor I can just use to_i64 cause of numbers between INT64MAX and UINT64_MAX. Although, probably I can use self & 0xffffffffffffffff_u64, but it will be tweaky.

src/enum.cr Outdated
@@ -275,8 +275,16 @@ struct Enum
end

# Returns a hash value. This is the hash of the underlying value.
# Enum hash is exception from random seeding, cause several Hashes are
# initialized on very early stages of bootstrap, and StdHasher seed is not
# filled yet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to work around this by adjusting the require order in the prelude.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm not too convenient in Crystal.
Issues are cause of:

  • enum Signal used in src/event/signal_handler.cr
  • enum LLVM::Attribute defined in src/llvm/events/cr (don't know, where it is used)
  • class Thread is also used very early in src/thread.cr : @@threads = Set(Thread).new

If you point me how to reorder prelude to fill StdHasher @@seed before this hashes constructed, it will be great.

It could be fixed also with defining special-cased Hash, if def hashkey(key); key.hash; end will be defined, and then overloaded in subclass specially for this use cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is src/prelude.cr, but I doesn't check for.

c = normalize_byte(c)
h = 31 * h + c
v = (v<<8) | normalize_byte(c).to_u32
if v >= 0x1000000_u32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also has no comment or explanation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it just "take four chars in a row". I could use counter here. Just was lasy to introduce another one variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write a comment, please :)

src/stdhasher.cr Outdated
struct StdHasher
@@seed = StaticArray(UInt32, 7).new{|i| 0_u32}
def self.init
raise "#{@@seed[0]} #{@@seed[1]}" unless @@seed[0] == 0_u32 && @@seed[1] == 0_u32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just sanity check remained from debugging. I'll remove it.

src/stdhasher.cr Outdated
# Also it has specialized methods for primitive keys with different seeds.
struct StdHasher
@@seed = StaticArray(UInt32, 7).new{|i| 0_u32}
def self.init
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just make this part of the @@seed initializer and avoid having an init method? Do you see any other init methods called from the prelude?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works with just filling in StdHasher body. init could be removed.

end

def initialize
@h = Pointer(Impl).malloc(1, Impl.new(@@seed[0], @@seed[1]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly what's the point of this it just seems to me you've replicated a class exactly by using 2 structs...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not!!!! See StdHasher.build method: it can allocate Impl on a stack. There is no way to allocate class on a stack. Compiler is NOT smart enough!!!

If you want, I may remove StdHasher#initalize() at all, so there will be no malloc in this file.
I don't use it. I put it just for completeness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1517 : @asterite thought compiler could be smart enough. But compiler is not smart enough yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need pointers (nor allocate HEAP memory). As long as you access @h (not through an accessor) you can mutate @h properties directly:

struct Hasher
  struct Impl
    property v : Int32
    
    def initialize(@v)
    end
  end
  
  def initialize(seed)
    @h = Impl.new(seed)
  end
  
  def <<(v)
    @h.v += v
  end
  
  def digest
    @h.v
  end
end

hasher = Hasher.new(0)
hasher << 1
hasher << 4
p hasher.digest  # => 5

See https://play.crystal-lang.org/#/r/299v

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initializers could then be reduced to:

def self.new
  new(@@seed[0], @@seed[1])
end

protected def initialize(a, b)
  @h = Impl.new(a, b)
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method chaining is nice sugar but not really necessary. Your suggestions for alternate names would all be without chaining as well. I don't see that << would imply chainability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway class/struct creators will have knowledge to return Hasher in hash, for example.

Moreover it's just hasher << whaaaaat pattern.

And compiler will help with type stricture.

Copy link
Contributor

@akzhan akzhan Jun 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, @RX14 already said about def_hash, looks like 80+% authors just will use this macro.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akzhan, cause it is easy to break with chaining:

# good
def hash(hasher)
  hasher << @a << @b
end
# wrong, but compiles without error
def hash(hasher)
  hasher << @a << @b
  hasher << @c
end
# corrected
def hash(hasher)
  hasher = hasher << @a << @b
  hasher << @c
end

So, resolution:

  • default hasher is a struct without pointers (for safety),
  • it causes that def hasher have to return hasher,
  • << is used, but without chaining (to reduce usage errors),
    Is it correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

src/stdhasher.cr Outdated
self
end

def <<(v : Int8|Int16|Int32|UInt8|UInt16|UInt32)
Copy link
Contributor

@RX14 RX14 Jun 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use crystal tool format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, use bin/crystal tool format, it's changed in master branch.

@kostya
Copy link
Contributor

kostya commented Jun 26, 2017

why hashme? but not def hash(hasher)

@funny-falcon
Copy link
Contributor Author

@kostya , how could I distinguish def hash(hasher) from def hash with responds_to?(:hash) ?

@kostya
Copy link
Contributor

kostya commented Jun 26, 2017

is responds_to? needed?, may be implement default hash it on Object?

@funny-falcon
Copy link
Contributor Author

@RX14, @kostya, @akzhan you are all right: it is possible to go with hash(hasher) instead of hasme(hasher), and responds_to? isn't need.

I'll put corrected version later today or tommorow.

@@ -143,6 +143,10 @@ struct BigRational < Number
to_f64.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok (and for othet same notes).

hash_normalize.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

@@ -39,6 +39,10 @@ struct BigFloat < Float
to_f64.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

src/float.cr Outdated
hash_normalize.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

src/float.cr Outdated
hash_normalize.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

StdHasher.hashit self
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

src/indexable.cr Outdated
def hash
reduce(31 * size) do |memo, elem|
31 * memo + elem.hash
def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

@@ -4,8 +4,8 @@ struct XML::Namespace
def initialize(@document : Node, @ns : LibXML::NS*)
end

def hash
object_id
def hashme(hasher)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this perhaps be generalized? Like if a type has method object_id, use this for calculating the hash, whether it is a reference or not.

src/float.cr Outdated
@@ -148,8 +148,17 @@ struct Float32
Printer.print(self, io)
end

private def hash_normalize
t = to_i32
t == self ? t : unsafe_as(Int32)
Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bad idea, but I know, that current implementation is also weird.

Will be replaced with #4581 I hope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is neccessary cause of definition of hash function: "if a==b then hash(a)==hash(b)".

Without it Hash(Int32|Float32, V) is broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does #4581 fix hashing of Float32|Float64 ? If so, then it could be better to apply.
Do you want I include it to my patch, or they will be applied in some order?

Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@funny-falcon Float hashes in pyhash manner are identical to integer values if no fraction part exists. Feel free to include this, of course.

p [1, 2, 3, 4, 1.0, 2.0_f64, 3.0_f32, 4.0].map &.hash
[1, 2, 3, 4, 1, 2, 3, 4]

p [1, 2, 3, 4, 1.0, 2.0_f64, 3.0_f32, 4.1].map &.hash
[1, 2, 3, 4, 1, 2, 3, 1932733648]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, there is one exception - -1 value. Will fix now. There is one line fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean more 4.25_f64.hash == 4.25_f32 ? If not, then my variant is simpler. If yes, then it is better to use more correct variant.

(note: I believe 4.1_f64 != 4.1_f32 in any case, but I could be mistaken).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.5_f32.hash == 0.5_f64.hash # => true
0.75_f32.hash == 0.75_f64.hash # => true
4.1_f32.hash == 4.1_f64.hash # => false - because values differ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Ok, I will include that function in next version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be strict:

puts "%.20f" % 4.1
puts "%.20f" % 4.1_f32

4.09999999999999964473
4.09999990463256835938

src/hash.cr Outdated
hash = size
def hashme(hasher)
hasher << size
dgst = hasher.digest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need no shortcuts. digest sounds better than dgst.

@@ -262,11 +262,6 @@ struct JSON::Any
end

# :nodoc:
def hash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beause Struct already provides correct hash(hasher).

src/int.cr Outdated
@@ -317,7 +317,15 @@ struct Int
end

def hash
self
if self.is_a? Primitive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case could Int not be a primitive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BigInt.

@@ -0,0 +1,248 @@
require "secure_random"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SecureRandom is stdlib, not corelib, it shouldn't be injected into prelude (indireclty).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But without "secure_random" no safety for hash table.

What do you suggest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Crystal::System::Random directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ysbaddaden, big thanks!!!! I'll do.

@funny-falcon
Copy link
Contributor Author

btw, it will be ok if I'll push with force this PR? I prefer to keep history clean.
But if it is not acceptable here, I will put "appending" commits.

@akzhan
Copy link
Contributor

akzhan commented Jun 26, 2017

Appending of commits is OK. Result will be squashed and reworded anyway.

src/stdhasher.cr Outdated
end

def <<(v : UInt64)
high = (v>>32).to_u32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, Crystal@master has new value.unsafe_as(UInt32) method. AFAIR it's faster for these cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it? Here is integer to integer conversion. Isn't it compiler primitive already?

Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, No.

Crystal compiler is lower than 1.0, so lot of optimizations doesn't exist yet.

You have i64.i32 conversion here.

Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wrong :) Just found these primitives :)

src/stdhasher.cr Outdated
if high != 0
self << high
end
self << v.to_u32
Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this code isn't compatible with #4581. There is 31/62bit divider.

Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have commented Hashing::Type but you may introduce it as alias StdHasher::Type = Int32.

So please update it, you need these lines for 64bit integers.

  private HASH_BITS     =     31 # sizeof(StdHasher::Type) >= 8 ? 61 : 31

    exp = exp >= 0 ? exp % HASH_BITS : HASH_BITS - 1 - ((-1 - exp) % HASH_BITS)
    x = ((x << exp) & HASH_MODULUS) | x >> (HASH_BITS - exp)

@akzhan
Copy link
Contributor

akzhan commented Jun 27, 2017

I suppose that StdHasher should have something like StdHasher::ValueType alias to declare hash value type.

To protect against Hash DoS, change the way hash value is computed.
Class|Struct should define method `def hashme(hasher)` and call
`hasher << @ivar` inside.

As an option, for speed, and for backward compatibility, `def hash`
still could be implemented. It will be used for Hash of matched type.
`Thread#hash` and `Enum#hash` is implemented as unseeded cause they are
 used before `StdHasher @@seed` is initialized.

But it is better to implement `def hashme(hasher)`.

StdHasher is default hasher that uses `hashme` and it is used as default
seeded hasher. It also implements `fasthash` used for fast hash of
primitive types, and `unseeded` for `Enums`.

Fixes crystal-lang#4578
Prerequisite for crystal-lang#4557
though Enum#hash could be re-introduced later for performance.
doesn't include crystal-lang#4581 yet (number normalization)
@funny-falcon
Copy link
Contributor Author

I pushed intermediate changes.
Biggest remaining issue is inclusion of number normalization.
Also open issue is dependency on SecureRandom and inclusion into prelude.

bytesize = name.bytesize
slice = name.to_slice
hasher.raw(bytesize)
while c + 4 < bytesize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use idiomatic crystal (no impact on performance):

slice = name.to_slice

0.step(to: slice.size, by: 4) do |c|
  # ...
end

slice.each do |c|
  # ...
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slice looks to allocate Array. Am I mistaken?
Computing hash should not do heap allocation.

Where is step defined? I didn't find it in 'int.cr' (while upto were there) and gave up.
Ah, I see, it is in number.cr. My bad. I'll fix it.