
Hash now uses an open addressing algorithm #8017

Merged
merged 21 commits into from Aug 4, 2019

Conversation

@asterite (Member) commented Jul 30, 2019

This improves its performance, both in time (always) and memory (generally).

Fixes #4557

The algorithm

The algorithm is basically the one that Ruby uses.

How to review this

There's just a single commit. My recommendation is to first look at the final code, which is thoroughly documented (the important bits are at the top), then at the diff.

I tried to document this really well because otherwise in a month I won't remember any of this. It also enables anyone to understand how it works and possibly keep optimizing things.

Why?

The old Hash implementation uses closed addressing: buckets with linked lists. This apparently isn't great because of pointer chasing and not having good cache locality. Using open addressing removes pointer chasing and improves cache locality. It could just be a theory but empirically it performs much better.
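To make the difference concrete, here is a toy open-addressing table in Ruby (an illustrative sketch, not the actual stdlib implementation): all keys live in one flat array and a lookup walks consecutive slots, so a collision costs one array step instead of a pointer dereference into a list node.

```ruby
# Toy open-addressing set with linear probing. All keys live in one
# flat array, so probing touches consecutive memory (good locality).
# Assumes distinct keys and a table that never fills up completely.
SIZE = 8 # power of two, so `hash & (SIZE - 1)` fits the index in range

def insert(slots, key)
  i = key.hash & (slots.size - 1)
  i = (i + 1) & (slots.size - 1) until slots[i].nil? # linear probing
  slots[i] = key
end

def member?(slots, key)
  i = key.hash & (slots.size - 1)
  until slots[i].nil?
    return true if slots[i] == key
    i = (i + 1) & (slots.size - 1)
  end
  false
end

slots = Array.new(SIZE)
%w[foo bar baz].each { |k| insert(slots, k) }
```

With closed addressing each bucket is a linked list of heap-allocated nodes, so every collision adds a pointer dereference; here a collision is just the next array slot.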

Give me some benchmarks!

I used this code as a benchmark:

Code for benchmarking old Hash vs. new Hash
require "benchmark"
require "./old_hash"
require "random/secure"

def benchmark_create_empty
  Benchmark.ips do |x|
    x.report("old, create") do
      OldHash(Int32, Int32).new
    end
    x.report("new, create") do
      Hash(Int32, Int32).new # same key/value types as the old hash, for a fair comparison
    end
  end
end

def benchmark_insert_strings(sizes)
  sizes.each do |size|
    values = Array.new(size) { Random::Secure.hex }

    Benchmark.ips do |x|
      x.report("old, insert (type: String, size: #{size})") do
        hash = OldHash(String, String).new
        values.each do |value|
          hash[value] = value
        end
      end
      x.report("new, insert (type: String, size: #{size})") do
        hash = Hash(String, String).new
        values.each do |value|
          hash[value] = value
        end
      end
    end
  end
end

def benchmark_insert_ints(sizes)
  sizes.each do |size|
    values = Array.new(size) { rand(Int32) }

    Benchmark.ips do |x|
      x.report("old, insert (type: Int32, size: #{size})") do
        hash = OldHash(Int32, Int32).new
        values.each do |value|
          hash[value] = value
        end
      end
      x.report("new, insert (type: Int32, size: #{size})") do
        hash = Hash(Int32, Int32).new
        values.each do |value|
          hash[value] = value
        end
      end
    end
  end
end

def benchmark_read_strings(sizes)
  sizes.each do |size|
    values = Array.new(size) { Random::Secure.hex }

    old_hash = OldHash(String, String).new
    new_hash = Hash(String, String).new

    values.each do |value|
      old_hash[value] = value
      new_hash[value] = value
    end

    Benchmark.ips do |x|
      x.report("old, read (type: String, size: #{size})") do
        values.each do |value|
          old_hash[value]
        end
      end
      x.report("new, read (type: String, size: #{size})") do
        values.each do |value|
          new_hash[value]
        end
      end
    end
  end
end

def benchmark_read_ints(sizes)
  sizes.each do |size|
    values = Array.new(size) { rand(100_000) }

    old_hash = OldHash(Int32, Int32).new
    new_hash = Hash(Int32, Int32).new

    values.each do |value|
      old_hash[value] = value
      new_hash[value] = value
    end

    Benchmark.ips do |x|
      x.report("old, read (type: Int32, size: #{size})") do
        values.each do |value|
          old_hash[value]
        end
      end
      x.report("new, read (type: Int32, size: #{size})") do
        values.each do |value|
          new_hash[value]
        end
      end
    end
  end
end

sizes = [5, 10, 15, 20, 30, 50, 100, 200, 500, 1_000, 10_000, 100_000]

benchmark_create_empty()
puts
benchmark_insert_strings(sizes)
puts
benchmark_insert_ints(sizes)
puts
benchmark_read_strings(sizes)
puts
benchmark_read_ints(sizes)

Results:

old, create  18.50M ( 54.06ns) (± 1.82%)   160B/op   2.34× slower
new, create  43.32M ( 23.08ns) (± 0.88%)  64.0B/op        fastest

old, insert (type: String, size: 5)   3.60M (277.58ns) (± 7.01%)  480B/op   1.38× slower
new, insert (type: String, size: 5)   4.97M (201.28ns) (± 4.77%)  384B/op        fastest
old, insert (type: String, size: 10)   1.95M (513.74ns) (± 1.64%)  801B/op   1.33× slower
new, insert (type: String, size: 10)   2.59M (386.02ns) (± 3.01%)  832B/op        fastest
old, insert (type: String, size: 15)   1.27M (786.17ns) (± 3.56%)  1.09kB/op   1.55× slower
new, insert (type: String, size: 15)   1.98M (505.71ns) (± 4.48%)    832B/op        fastest
old, insert (type: String, size: 20) 922.03k (  1.08µs) (± 0.99%)  1.41kB/op   1.37× slower
new, insert (type: String, size: 20)   1.27M (788.95ns) (± 2.86%)  1.67kB/op        fastest
old, insert (type: String, size: 30) 639.38k (  1.56µs) (± 5.63%)  2.03kB/op   1.58× slower
new, insert (type: String, size: 30)   1.01M (992.38ns) (± 6.01%)  1.67kB/op        fastest
old, insert (type: String, size: 50) 355.49k (  2.81µs) (± 5.77%)  3.28kB/op   1.49× slower
new, insert (type: String, size: 50) 530.85k (  1.88µs) (± 3.95%)  3.81kB/op        fastest
old, insert (type: String, size: 100) 154.93k (  6.45µs) (± 4.04%)  7.06kB/op   1.81× slower
new, insert (type: String, size: 100) 280.47k (  3.57µs) (± 3.92%)  7.09kB/op        fastest
old, insert (type: String, size: 200)  71.71k ( 13.94µs) (± 3.27%)  13.3kB/op   1.91× slower
new, insert (type: String, size: 200) 136.71k (  7.31µs) (± 2.21%)  14.5kB/op        fastest
old, insert (type: String, size: 500)  23.46k ( 42.62µs) (± 3.53%)  36.1kB/op   2.46× slower
new, insert (type: String, size: 500)  57.82k ( 17.30µs) (± 4.78%)  28.4kB/op        fastest
old, insert (type: String, size: 1000)  12.78k ( 78.27µs) (± 1.72%)  67.4kB/op   2.20× slower
new, insert (type: String, size: 1000)  28.12k ( 35.56µs) (± 4.20%)  56.4kB/op        fastest
old, insert (type: String, size: 10000)   1.00k (997.78µs) (± 6.41%)  662kB/op   2.12× slower
new, insert (type: String, size: 10000)   2.13k (469.91µs) (± 5.39%)  896kB/op        fastest

old, insert (type: Int32, size: 5)   4.83M (207.19ns) (± 0.80%)  400B/op   1.78× slower
new, insert (type: Int32, size: 5)   8.60M (116.23ns) (± 1.76%)  240B/op        fastest
old, insert (type: Int32, size: 10)   2.78M (359.30ns) (± 0.91%)  640B/op   1.72× slower
new, insert (type: Int32, size: 10)   4.79M (208.73ns) (± 1.11%)  448B/op        fastest
old, insert (type: Int32, size: 15)   1.90M (525.28ns) (± 2.96%)  880B/op   1.75× slower
new, insert (type: Int32, size: 15)   3.33M (300.60ns) (± 5.98%)  448B/op        fastest
old, insert (type: Int32, size: 20)   1.30M (769.02ns) (± 6.52%)  1.09kB/op   1.54× slower
new, insert (type: Int32, size: 20)   2.00M (500.54ns) (± 4.76%)    976B/op        fastest
old, insert (type: Int32, size: 30) 972.78k (  1.03µs) (± 4.94%)  1.56kB/op   1.72× slower
new, insert (type: Int32, size: 30)   1.67M (597.26ns) (± 4.29%)    976B/op        fastest
old, insert (type: Int32, size: 50) 582.06k (  1.72µs) (± 5.23%)   2.5kB/op   1.50× slower
new, insert (type: Int32, size: 50) 872.34k (  1.15µs) (± 7.97%)  1.88kB/op        fastest
old, insert (type: Int32, size: 100) 234.48k (  4.26µs) (± 7.09%)   5.5kB/op   2.08× slower
new, insert (type: Int32, size: 100) 488.20k (  2.05µs) (± 6.79%)  4.14kB/op        fastest
old, insert (type: Int32, size: 200) 126.98k (  7.88µs) (± 4.74%)  10.2kB/op   1.84× slower
new, insert (type: Int32, size: 200) 233.78k (  4.28µs) (± 7.45%)  8.47kB/op        fastest
old, insert (type: Int32, size: 500)  40.86k ( 24.47µs) (± 4.14%)  28.3kB/op   2.55× slower
new, insert (type: Int32, size: 500) 104.34k (  9.58µs) (± 7.55%)  16.5kB/op        fastest
old, insert (type: Int32, size: 1000)  17.87k ( 55.97µs) (± 3.24%)  51.8kB/op   2.57× slower
new, insert (type: Int32, size: 1000)  45.98k ( 21.75µs) (± 7.72%)  32.5kB/op        fastest
old, insert (type: Int32, size: 10000)   1.52k (658.91µs) (± 3.88%)  506kB/op   2.16× slower
new, insert (type: Int32, size: 10000)   3.28k (305.04µs) (± 4.09%)  513kB/op        fastest

old, read (type: String, size: 5)  11.64M ( 85.90ns) (± 6.51%)  0.0B/op   2.25× slower
new, read (type: String, size: 5)  26.20M ( 38.16ns) (± 2.98%)  0.0B/op        fastest
old, read (type: String, size: 10)   5.69M (175.80ns) (± 4.64%)  0.0B/op   1.39× slower
new, read (type: String, size: 10)   7.93M (126.07ns) (± 6.97%)  0.0B/op        fastest
old, read (type: String, size: 15)   4.15M (240.69ns) (± 2.01%)  0.0B/op   1.16× slower
new, read (type: String, size: 15)   4.82M (207.47ns) (± 3.30%)  0.0B/op        fastest
old, read (type: String, size: 20)   2.93M (341.71ns) (± 1.47%)  0.0B/op   1.52× slower
new, read (type: String, size: 20)   4.46M (224.21ns) (± 1.25%)  0.0B/op        fastest
old, read (type: String, size: 30)   1.87M (534.24ns) (± 2.90%)  0.0B/op   1.45× slower
new, read (type: String, size: 30)   2.71M (369.30ns) (± 1.90%)  0.0B/op        fastest
old, read (type: String, size: 50) 992.08k (  1.01µs) (± 4.51%)  0.0B/op   1.71× slower
new, read (type: String, size: 50)   1.70M (589.77ns) (± 1.65%)  0.0B/op        fastest
old, read (type: String, size: 100) 596.10k (  1.68µs) (± 1.60%)  0.0B/op   1.40× slower
new, read (type: String, size: 100) 832.91k (  1.20µs) (± 1.82%)  0.0B/op        fastest
old, read (type: String, size: 200) 260.30k (  3.84µs) (± 1.63%)  0.0B/op   1.53× slower
new, read (type: String, size: 200) 398.84k (  2.51µs) (± 1.84%)  0.0B/op        fastest
old, read (type: String, size: 500) 111.03k (  9.01µs) (± 2.89%)  0.0B/op   1.40× slower
new, read (type: String, size: 500) 155.71k (  6.42µs) (± 3.33%)  0.0B/op        fastest
old, read (type: String, size: 1000)  41.59k ( 24.04µs) (± 3.07%)  0.0B/op   1.71× slower
new, read (type: String, size: 1000)  71.27k ( 14.03µs) (± 3.81%)  0.0B/op        fastest
old, read (type: String, size: 10000)   2.66k (376.32µs) (± 4.55%)  0.0B/op   2.46× slower
new, read (type: String, size: 10000)   6.54k (152.95µs) (± 4.03%)  0.0B/op        fastest

old, read (type: Int32, size: 5)  23.67M ( 42.24ns) (± 1.64%)  0.0B/op   2.30× slower
new, read (type: Int32, size: 5)  54.35M ( 18.40ns) (± 3.14%)  0.0B/op        fastest
old, read (type: Int32, size: 10)  11.66M ( 85.77ns) (± 4.63%)  0.0B/op   1.85× slower
new, read (type: Int32, size: 10)  21.58M ( 46.33ns) (± 3.20%)  0.0B/op        fastest
old, read (type: Int32, size: 15)   7.81M (128.00ns) (± 4.15%)  0.0B/op   1.49× slower
new, read (type: Int32, size: 15)  11.63M ( 86.00ns) (± 3.99%)  0.0B/op        fastest
old, read (type: Int32, size: 20)   5.80M (172.48ns) (± 5.34%)  0.0B/op   1.73× slower
new, read (type: Int32, size: 20)  10.02M ( 99.77ns) (± 4.54%)  0.0B/op        fastest
old, read (type: Int32, size: 30)   3.67M (272.51ns) (± 2.99%)  0.0B/op   1.84× slower
new, read (type: Int32, size: 30)   6.74M (148.30ns) (± 1.97%)  0.0B/op        fastest
old, read (type: Int32, size: 50)   2.05M (488.04ns) (± 5.57%)  0.0B/op   1.89× slower
new, read (type: Int32, size: 50)   3.87M (258.41ns) (± 4.24%)  0.0B/op        fastest
old, read (type: Int32, size: 100)   1.09M (921.59ns) (± 8.72%)  0.0B/op   1.70× slower
new, read (type: Int32, size: 100)   1.84M (542.80ns) (± 5.52%)  0.0B/op        fastest
old, read (type: Int32, size: 200) 535.83k (  1.87µs) (± 5.84%)  0.0B/op   1.66× slower
new, read (type: Int32, size: 200) 891.31k (  1.12µs) (± 5.49%)  0.0B/op        fastest
old, read (type: Int32, size: 500) 236.68k (  4.23µs) (± 3.61%)  0.0B/op   1.52× slower
new, read (type: Int32, size: 500) 360.85k (  2.77µs) (± 4.31%)  0.0B/op        fastest
old, read (type: Int32, size: 1000) 106.13k (  9.42µs) (± 4.88%)  0.0B/op   1.66× slower
new, read (type: Int32, size: 1000) 175.92k (  5.68µs) (± 4.08%)  0.0B/op        fastest
old, read (type: Int32, size: 10000)   4.60k (217.62µs) (± 2.94%)  0.0B/op   3.00× slower
new, read (type: Int32, size: 10000)  13.80k ( 72.47µs) (± 1.67%)  0.0B/op        fastest

As you can see the new implementation is always faster than the old one. Sometimes more memory is used, sometimes less.

I also ran some of @kostya's benchmarks that use Hash in their implementation. Here are the results:

Havlak:

old: 12.49s, 375.1Mb
new: 7.58s, 215.7Mb

Havlak seems to be a benchmark measuring how well a language performs in general algorithmic tasks... the new results look good! 😊

Brainfuck:

old: 5.20s, 1.8Mb
new: 4.22s, 1.8Mb

JSON (when using JSON.parse):

old: 2.20s, 1137.0Mb
new: 2.07s, 961.3Mb

Knucleotide:

old: 1.63s, 26.5Mb
new: 1.01s, 32.4Mb

Then some more benchmarks...

There's HTTP::Request#from_io which I recently optimized:

old: from_io 498.47k (  2.01µs) (± 0.89%)  816B/op  fastest
new: from_io 549.22k (  1.82µs) (± 1.89%)  720B/op  fastest

(using wrk against the sample http server increases the requests/sec from 118355.73 to about 122000)

Also, out of curiosity, I created a Hash with 1_000_000 elements to see how the new Hash compares to Ruby and the old Hash.

Code for creating a Hash with 1_000_000 elements in Ruby and Crystal
size = 1_000_000

h = {0 => 0}

time = Time.now
size.times do |i|
  h[i] = i
end
puts "Insert: #{Time.now - time}"

time = Time.now
size.times do |i|
  h[i]
end
puts "Read: #{Time.now - time}"

Results:

Ruby 2.7-dev:
  Insert: 0.151813
  Read:   0.13749
Crystal old:
  Insert: 0.238662
  Read:   0.129462
Crystal new:
  Insert: 0.070804
  Read:   0.041008

Ruby was faster than Crystal! Ruby is simply amazing ❤️. But now Crystal is even faster!

The compiler uses Hash all over the place!

So compile times now should go down! Right? ... Right?

Well, unfortunately no. I think the main reason is that the times are bound by the number of method instances, not by the performance of the many hashes used.

Memory consumption did seem to go down a bit.

When will I have this?

We could push this to 0.31.0. Or... we could have it in 0.30.0 if we delay the release a bit more (note: I don't manage releases, but if the community doesn't mind waiting a bit to get a huge performance boost in their apps then I think we could relax the release date).

In 0.31.0.

Final thoughts

  1. Thank you Ruby! ❤️ ❤️ ❤️
  2. Thank you Vladimir Makarov! ❤️ ❤️ ❤️
  3. Algorithms are cool!
  4. There's an optimization in the original Ruby algorithm which uses bitmasks instead of the remainder operator (%) to fit a number inside a range... I did that almost last because I didn't believe it would improve performance a lot... and it doubled the performance! 😮
  5. Feel free to target this PR and benchmark your code and post any interesting speedups here!
  6. I hope CI 32 bits passes! 😊
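For point 4 above: when the table size is a power of two, reducing a hash into range with a bitwise AND is equivalent to the remainder, without the cost of a division. A small self-contained illustration (not the PR's actual code):

```ruby
# With a power-of-two size, `h % size` equals `h & (size - 1)`,
# because the low bits of h are exactly the remainder mod 2**k.
size = 16
mask = size - 1

[0, 5, 12, 1_000_003, 0xDEADBEEF].each do |h|
  raise "mismatch" unless (h & mask) == (h % size)
end
```

The AND is a single cheap instruction, while integer division is one of the slowest ALU operations, which is plausibly why this change paid off so much.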
Review comment from a Contributor on src/hash.cr:

Using #as will provide more type information in case of errors compared to #not_nil!

This improves its performance, both in time and memory.
@asterite force-pushed the asterite:open-addressing-hash branch from 368e083 to d4c35de Jul 31, 2019
Review comment from a Contributor:

I seem to recall @funny-falcon ran into some issues with the GC finding false positives while he was working on this. That is not something you have seen?

@lribeiro commented Jul 31, 2019

This gives a nice speedup on our internal app for log processing!

@bcardiff will this be considered for 0.30?

Time in seconds

test      0.29   0.30dev  speedup
forti     7.353  6.722    9.4%
forti_kv  5.462  4.867    12.2%
dns_grt   4.714  4.466    5.6%

Details

~/Work/crystal-1/bin/crystal version

Using compiled compiler at `.build/crystal'
Crystal 0.30.0-dev [ec2a26f9a] (2019-07-31)

LLVM: 8.0.0
Default target: x86_64-apple-macosx
~/Work/crystal-1/bin/crystal build --release -o bin/log_clean src/log_clean.cr 
Using compiled compiler at `.build/crystal'
bin/log_clean -c test/forti/forti.yml >    /dev/null  6.83s user 1.17s system 119% cpu 6.722 total
bin/log_clean -c test/forti/forti_kv.yml > /dev/null  5.11s user 2.02s system 146% cpu 4.867 total
bin/log_clean -c config/dns_grt.yml >      /dev/null  4.50s user 0.74s system 117% cpu 4.466 total

crystal version

Crystal 0.29.0 (2019-06-06)

LLVM: 6.0.1
Default target: x86_64-apple-macosx
bin/log_clean -c test/forti/forti.yml >    /dev/null  7.49s user 1.26s system 119% cpu 7.353 total
bin/log_clean -c test/forti/forti_kv.yml > /dev/null  5.80s user 2.08s system 144% cpu 5.462 total
bin/log_clean -c config/dns_grt.yml >      /dev/null  4.78s user 0.79s system 118% cpu 4.714 total
@yxhuvud (Contributor) commented Jul 31, 2019

I wish GitHub had properly threaded review comments. This is a comment on my own review comment: I note that you get lots of repeated allocation warnings and that one CI run failed due to out of memory.


# Computes the next index in `@indices`, needed when an index is not empty.
private def next_index(index : Int32) : Int32
fit_in_indices(index + 1)

@funny-falcon (Contributor) commented Jul 31, 2019:

"Linear probing" is a bad idea.
Use "quadratic probing" instead. It has the same good cache locality on the first three probes as linear probing.

@asterite (Author, Member) commented Jul 31, 2019:

In Ruby they tried quadratic probing and it turned out to be a bit slower. They use linear probing. That's why I also chose that, and it's simpler.

Ref: https://github.com/ruby/ruby/blob/c94cc6d968a7241a487591a9753b171d8081d335/st.c#L869-L872

But if you want we can try it. I know nothing about quadratic probing: is it just `index, index + 1, index + 4, index + 9, index + 16`, etc.?
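For what it's worth, the offsets index, index + 1, index + 4, index + 9 are the textbook form; a common practical variant instead adds increasing steps 1, 2, 3, ... (triangular-number offsets), which with a power-of-two table is guaranteed to visit every slot. A quick self-contained check of that property (my own sketch, not Ruby's or Crystal's code):

```ruby
# Triangular-number quadratic probing: probe index, index+1, index+3,
# index+6, index+10, ... (mod size). For a power-of-two size this
# sequence visits every slot exactly once before repeating.
size = 16
start = 5
visited = []
pos = start
size.times do |i|
  visited << pos
  pos = (pos + i + 1) & (size - 1) # step grows by 1 each probe
end
raise "not a full cycle" unless visited.uniq.size == size
```

The first few probes still land near the home slot, which is where the "same cache locality on the first probes" claim comes from.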

@konovod (Contributor) commented Jul 31, 2019:

According to the ref, they aren't using linear probing now; they are using double hashing (see "secondary hash" in the comments).

@asterite (Author, Member) commented Jul 31, 2019:

It's true! However I don't understand how that works:

 Still other bits of the hash value are used when the mapping
 results in a collision.  In this case we use a secondary hash
 value which is a result of a function of the collision bin
 index and the original hash value.  The function choice
 guarantees that we can traverse all bins and finally find the
 corresponding bin as after several iterations the function
 becomes a full cycle linear congruential generator because it
 satisfies requirements of the Hull-Dobell theorem.

I don't know about that theorem, so it's harder for me to implement something I don't understand. But I'll definitely keep it in mind for further optimizing this (it'll take me some more time, but PRs are welcome!).

@straight-shoota (Member) commented Jul 31, 2019

The OOM failure on test_linux32 is pretty regular and unrelated to this PR.

# If we have less than 8 elements we avoid computing the hash
# code and directly compare the keys (might be cheaper than
# computing a hash code of a complex structure).
if entries_size <= 8

@funny-falcon (Contributor) commented Jul 31, 2019:

I doubt it is useful. I could be mistaken.
Was it benchmarked?

@asterite (Author, Member) commented Jul 31, 2019:

Yes. I just benchmarked it again, because I introduced that optimization some time ago, before introducing more optimizations and changing other things.

I ran the same benchmarks as in my main post, but only for String. The insert benchmarks show no difference. For read it gives this (old is without the optimization, new is with it):

old, read (type: String, size: 1)  70.65M ( 14.15ns) (± 2.49%)  0.0B/op   2.68× slower
new, read (type: String, size: 1) 189.05M (  5.29ns) (± 3.89%)  0.0B/op        fastest
old, read (type: String, size: 2)  39.07M ( 25.59ns) (± 2.05%)  0.0B/op   2.26× slower
new, read (type: String, size: 2)  88.16M ( 11.34ns) (± 4.01%)  0.0B/op        fastest
old, read (type: String, size: 3)  26.56M ( 37.65ns) (± 1.84%)  0.0B/op   2.00× slower
new, read (type: String, size: 3)  53.24M ( 18.78ns) (± 1.89%)  0.0B/op        fastest
old, read (type: String, size: 4)  19.98M ( 50.06ns) (± 2.17%)  0.0B/op   1.72× slower
new, read (type: String, size: 4)  34.35M ( 29.11ns) (± 3.70%)  0.0B/op        fastest
old, read (type: String, size: 5)  16.00M ( 62.49ns) (± 1.61%)  0.0B/op   1.49× slower
new, read (type: String, size: 5)  23.78M ( 42.05ns) (± 2.05%)  0.0B/op        fastest
old, read (type: String, size: 6)  13.11M ( 76.25ns) (± 1.49%)  0.0B/op   1.39× slower
new, read (type: String, size: 6)  18.17M ( 55.02ns) (± 2.68%)  0.0B/op        fastest
old, read (type: String, size: 7)  11.21M ( 89.24ns) (± 1.37%)  0.0B/op   1.22× slower
new, read (type: String, size: 7)  13.71M ( 72.93ns) (± 3.94%)  0.0B/op        fastest
old, read (type: String, size: 8)   9.57M (104.47ns) (± 3.21%)  0.0B/op   1.16× slower
new, read (type: String, size: 8)  11.14M ( 89.77ns) (± 3.77%)  0.0B/op        fastest

So you can see it's a big optimization. You can also see that checking the hash first pays off more as the number of elements grows; starting from 9 elements the two approaches perform almost the same.

I wonder if this optimization could also be applied in the Ruby code... 🤔

I might give it a try later (in Ruby), now I'm curious!

@asterite (Author, Member) commented Jul 31, 2019:

Actually, I think that if the key type is more complex and == takes more time, then it will be slower. So maybe it should only be applied in a few cases... maybe just do it if the key is a String, because there we know it's true.

@asterite (Author, Member) commented Jul 31, 2019:

I just tried it with keys being Array(String) and the differences are actually much bigger, meaning that it's faster to avoid checking the hash code altogether for some reason... probably because to compute the hash code you need to consider the entire structure, but to compare elements you bail out as soon as you find a difference.
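The effect is easy to see in miniature: hashing an Array key must fold in every element, while == can return at the first mismatch. A hypothetical tiny-hash lookup that skips hashing entirely (names invented for illustration; this is not the stdlib code):

```ruby
# For very small hashes, scan the entries and compare keys directly,
# never computing a hash code. `==` on arrays bails out at the first
# differing element; `hash` must always walk the whole key.
def tiny_fetch(entries, key)
  entries.each { |k, v| return v if k == key }
  nil
end

entries = [
  [["log", "level"], "info"],
  [["log", "output"], "stdout"],
]
tiny_fetch(entries, ["log", "output"]) # => "stdout"
```

Here the first candidate key is rejected after comparing a single element, whereas computing the probe's hash code would have walked both strings.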

@yxhuvud (Contributor) commented Jul 31, 2019

@straight-shoota Are all the warnings about large allocations also old?

  end
end

private module BaseIterator
-  def initialize(@hash, @current)
+  def initialize(@hash)
+    @index = 0

@funny-falcon (Contributor) commented Jul 31, 2019:

I'd rather track the index of the first element (and update it in delete_entry_and_update_counts).
If one calls hash.shift in a loop, they will suffer O(n^2) behavior without that (queue or LRU use cases).

If @indices_size_pow2 is changed to UInt8 and placed near @indices_bytesize, then adding @first : UInt32 will not change the size of the hash structure.
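A sketch of the suggested fix (hypothetical names, not the stdlib code): deletions leave nil holes at the front of the entries array, and remembering where the first live entry sits means shift never rescans them.

```ruby
# Entries array with tombstones: shift marks the slot nil and advances
# @first instead of rescanning from index 0, so N shifts cost O(N)
# total rather than O(N^2).
class ToyOrderedHash
  def initialize
    @entries = [] # [key, value] pairs; nil marks a deleted slot
    @first = 0    # index of the first live entry
  end

  def push(key, value)
    @entries << [key, value]
  end

  def shift
    return nil if @first >= @entries.size
    entry = @entries[@first]
    @entries[@first] = nil # tombstone; no elements are moved
    @first += 1
    entry
  end
end

h = ToyOrderedHash.new
h.push(:a, 1)
h.push(:b, 2)
```

The tombstoned prefix is reclaimed wholesale on the next compaction/resize, which is what keeps the amortized cost low.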

@asterite (Author, Member) commented Jul 31, 2019:

I'm sorry but I don't understand what you mean.

@asterite (Author, Member) commented Jul 31, 2019:

Ah, yes, I see what you mean. I think we should either remove Hash#shift or implement your suggestion. We copied Hash#shift from Ruby, but I don't know what a real use for it would be.

@asterite (Author, Member) commented Jul 31, 2019:

Done in 01e072c !

@funny-falcon I'll be happy if you can review the commit. The change turned out to be pretty simple thanks to the abstractions that were built.

@funny-falcon (Contributor) commented Jul 31, 2019

Rather clean code! I liked to read it.

@asterite (Member, Author) commented Jul 31, 2019

@funny-falcon

Rather clean code! I liked to read it.

Thank you! 😊 And thank you for the thorough review, you found many good things to fix.

@asterite (Member, Author) commented Jul 31, 2019

@yxhuvud

This is a comment on my own review comment: I note that you get lots of repeated allocation warnings and that one CI run failed due to out of memory.

Yes, but I think the errors were there before. I think Hash might require a bit more memory now, because it always doubles its size when a resize is needed. Also, we keep adding tests to the spec suite, and right now the compiler specs leak memory. I'll open another issue to track this problem. The solution for now is to run specs on CI in smaller chunks instead of two big chunks. This only affects CI; regular machines usually have a lot of memory and don't need to compile Crystal's entire spec suite.

@RX14 (Member) commented Aug 1, 2019

@funny-falcon the benefit is that we can increase the load factor, reducing the memory footprint. I'm not sure how much effect that has in practice, since the indices array is probably small in terms of bytes compared to the entries array. Doing actual calculations would be interesting.

Certainly a subject for another PR.

@funny-falcon (Contributor) commented Aug 1, 2019

@funny-falcon the benefit is that we can increase the load factor, reducing the memory footprint.

@RX14 increasing the load factor alone will make this code much more complex. Its current beauty comes from the simplicity of the sizes: always a power of two, with indices twice as large as entries.

In my attempts at this issue I always tried to make the load factor greater (close to 0.6-0.75) and to untangle the indices and entries sizes. But my code was much more complex.

@RX14 (Member) commented Aug 1, 2019

increasing the load factor alone will make this code much more complex

I realise that, but Hash performance is extremely important for the speed of the language. The performance/complexity tradeoff here is complicated. That's why I suggest we discuss this later. The performance benefits of robin hood hashing probably need to be tried in practice to get an idea of whether it's wanted. The gains in this PR are very substantial, and should not be delayed for such an investigation.

@funny-falcon (Contributor) commented Aug 1, 2019

I realise that, but Hash performance is extremely important for the speed of the language.

If hash performance is extremely important, then you don't want to increase the load factor.
Robin Hood has no performance benefits. If you read the paper mentioned by @ysbaddaden carefully, you'll notice:

Actually, the mean of quadratic probing is slightly better than quadratic robin hood hashing (1.3 vs 1.45).

Robin Hood hashing doesn't give better "average" performance. It only gives a guarantee of bounded "worst" performance. In other words, an application will not magically be faster on average with Robin Hood hashing. But there will be a stronger guarantee about the time of each separate operation.

Rust used to have a load factor of 0.9 with Robin Hood hashing, but it gave measurable performance overhead: rust-lang/rust#38003

Now they have moved to another hash implementation that combines a bulk SSE check of a bucket of entries with quadratic probing of buckets.
Yes, it has an impressive 0.875 load factor.
But it does not map easily to an "ordered" hash: bits of the hash values have to be embedded into the index table to use the same trick.

Note that the main benefit of Robin Hood hashing is CPU cache locality. But given that we have to make the journey from the index table to the entries table to check a value, and the order of elements in the index table does not correlate with the order in the entries table, we already have poor cache locality. Embedding bits of the hash value into the index table would help, but would increase memory consumption.

Another trick to shorten the average collision chain in quadratic probing is to use a "collision" bit: on insertion, set "collision" on all collided index entries, and during lookup stop iterating if the current element has no "collision" bit set. It trades some CPU for shorter collision chains. I believe it brings "quadratic probing" much closer to "Robin Hood" hashing in terms of collision chain length distribution.

Anyway, all my words should be taken with a grain of salt. I often make false assumptions.

asterite added 5 commits Aug 1, 2019
Previously we didn't try to compact small hashes when a resize was
needed. However we can do that if we have many deleted elements
to avoid a reallocation. This should improve performance when shifting
or deleting elements and then inserting more elements.
@RX14 (Member) commented Aug 1, 2019

@funny-falcon thank you! The above articles on RH hashing weren't entirely clear on this.

Memory overhead of Hash is important too, and I'd be interested in a graph of the allocation sizes of @indices and @entries in bytes for varying Hash(Reference, Reference) sizes. I suspect that the load factor in @indices is fairly insignificant compared to the size of @entries.
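As a rough back-of-envelope (assumed layout: 3 machine words per entry for hash, key and value; index slots of 1, 2 or 4 bytes chosen by capacity; twice as many index slots as entries; the exact thresholds here are invented for illustration):

```ruby
# Rough size estimate for Hash(Reference, Reference) on 64-bit.
# Entry = hash + key + value = 3 * 8 = 24 bytes. The index width only
# needs to be wide enough to address `capacity` entries.
def footprint(capacity)
  index_width = capacity <= 128 ? 1 : capacity <= 32_768 ? 2 : 4
  indices_bytes = 2 * capacity * index_width
  entries_bytes = capacity * 24
  [indices_bytes, entries_bytes]
end

indices, entries = footprint(1024)
# Under these assumptions @indices is a small fraction of the total.
raise unless indices < entries
```

Under these assumed numbers, the indices account for roughly 8 to 25 percent of the allocation, which would support the suspicion that @entries dominates.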

@asterite (Member, Author) commented Aug 1, 2019

@funny-falcon Thank you! I implemented all your suggestions. Let me know if I got something wrong 😊

Although I think many things can still be optimized (the representation, the probing, the load factor, etc.), this is already a good improvement and we can continue refining things after this PR.

@RX14 (Member) approved these changes Aug 2, 2019, leaving a comment:

Absolutely wonderful, thank you so much for your hard work on this, and all you do for Crystal, Ary.

@asterite (Member, Author) commented Aug 3, 2019

Could someone explain to me, with some ASCII art or graphics, how Robin Hood hashing works?

When a collision occur, compare the two items’ probing count, the one with larger probing number stays and the other continue to probe. Repeat until the probing item finds an empty spot.

I can't understand this. Or... I can understand what it says but not what it means.

Let's suppose we start with empty buckets:

[_, _, _, _, _, _] 

Now we want to insert 10, which hashes to bucket 1:

[_, 10, _, _, _, _] 

Say we want to insert 20, which hashes to bucket 1 too... What do we do? The probe count for 10 is 0 (I guess?) and for 20 we don't know yet but it's definitely higher than 0 because it's occupied. So we swap them...?

[_, 20, 10, _, _, _] 

Next comes 30, again with bucket 1. So 20 has probe count 0, 10 has probe count 1, and 30 will have probe count 2... so we swap everything again?

[_, 30, 20, 10, _, _] 

I can't see the benefit of this, but I'm sure there's something I'm not understanding.

@konovod (Contributor) commented Aug 4, 2019

In Robin Hood hashing, we compare the old element's probe count with the current probe count of our element.
So in your example nothing will be shifted (the current probe count will equal the existing element's probe count) and the hash table will be
[_, 10, 20, 30, _, _]
Now if we insert element 21, which hashes to 2, we will compare it to 20 (0 vs 1) and 30 (1 vs 2) and still insert at 4:
[_, 10, 20, 30, 21, _]
But now if we insert 40, which hashes to 1:
it will skip 10, 20, 30, as their probe counts match 40's at each position, but it will steal the place from 21, since 40 will have probe count 3 and 21 only 2:
[_, 10, 20, 30, 40, _]
and then we continue to search for a place for 21, finding it at the next position:
[_, 10, 20, 30, 40, 21]
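The walkthrough above can be checked mechanically. Below is a minimal Robin Hood insertion routine (an illustrative sketch; the home-bucket mapping is hardcoded to match the example's hashes) that reproduces the final table:

```ruby
# Robin Hood insertion: on a collision, the element that is farther
# from its home bucket keeps the slot; the other keeps probing.
HOME = { 10 => 1, 20 => 1, 30 => 1, 40 => 1, 21 => 2 } # example's hashes

def rh_insert(table, key)
  size = table.size
  pos = HOME[key]
  probe = 0
  carry = key
  until table[pos].nil?
    resident_probe = (pos - HOME[table[pos]]) % size
    if resident_probe < probe # resident is "richer": evict it
      carry, table[pos] = table[pos], carry
      probe = resident_probe
    end
    pos = (pos + 1) % size
    probe += 1
  end
  table[pos] = carry
end

table = Array.new(6)
[10, 20, 30, 21, 40].each { |k| rh_insert(table, k) }
table # => [nil, 10, 20, 30, 40, 21]
```

Running it, 40 indeed evicts 21 (probe count 3 vs 2) and 21 settles in the last slot, matching the table above.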

@asterite (Member, Author) commented Aug 4, 2019

@konovod Thanks, I understand now!

@asterite merged commit 3d8609b into crystal-lang:master Aug 4, 2019
4 of 5 checks passed:

ci/circleci: test_linux32 (failed)
ci/circleci: check_format (passed)
ci/circleci: test_darwin (passed)
ci/circleci: test_linux (passed)
continuous-integration/travis-ci/pr (passed)
@asterite deleted the asterite:open-addressing-hash branch Aug 4, 2019
@asterite (Member, Author) commented Aug 4, 2019

Thank you everyone for the suggestions!

If you find optimizations please send PRs and benchmarks.

@asterite (Member, Author) commented Aug 4, 2019

Some optimizations ideas:

  • Robin Hood hashing
  • Double hashing
  • Quadratic probing
  • Reduce the allocated size of @entries (it could be possible to allocate less than @indices_size / 2)
  • Other ideas...?
@j8r (Contributor) commented Aug 4, 2019

Those are good points, @asterite; it would be better to create an issue to track them. It could even be an RFC.

@asterite (Member, Author) commented Aug 4, 2019

No. If someone finds an optimization, please send a PR with benchmarks. I don't plan to change anything else otherwise, and there's no need to track anything.

@asterite added this to the 0.31.0 milestone Aug 4, 2019
@funny-falcon (Contributor) commented Aug 5, 2019

@j8r (Contributor) commented Aug 5, 2019

I think it would be better to have proper RFCs about the design decisions, instead of PRs, just a thought.
Of course that's up to the core members to organize their project as they wish, and what they consider the best.

@vlazar referenced this pull request Oct 12, 2019