Optimize Hash for repeated removals and insertions #14539

Merged

Conversation

@HertzDevil (Contributor) commented Apr 24, 2024

Hash deletions do not clear @indices, so subsequent insertions with the same key cannot reuse those slots and effectively behave like hash collisions. This PR adds an extra sentinel value for deleted index slots: they can later be filled in by new insertions and, unlike the empty sentinel, do not halt index scans. See https://forum.crystal-lang.org/t/hash-delete-followed-by-insert-performance-issues/6784 for a discussion. Credit goes to @homonoidian for discovering this.
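
For intuition, here is a minimal open-addressing sketch of the deleted-sentinel (tombstone) idea. The ToyIndex class, its sentinel values, and the linear-probing layout are illustrative assumptions only, not the actual src/hash.cr code:

EMPTY   = -1
DELETED = -2

# Toy index table using linear probing; assumes non-negative keys.
class ToyIndex
  def initialize(size : Int32)
    @slots = Array(Int32).new(size, EMPTY)
  end

  def insert(key : Int32)
    i = key % @slots.size
    reusable = nil
    @slots.size.times do
      case @slots[i]
      when EMPTY
        # end of the chain: reuse the first tombstone passed along the way
        @slots[reusable || i] = key
        return
      when DELETED
        reusable ||= i # remember the slot, but keep scanning
      when key
        return # key already present
      end
      i = (i + 1) % @slots.size
    end
  end

  def includes?(key : Int32) : Bool
    i = key % @slots.size
    @slots.size.times do
      return false if @slots[i] == EMPTY # only EMPTY halts the scan
      return true if @slots[i] == key
      # DELETED slots (and other keys): keep scanning
      i = (i + 1) % @slots.size
    end
    false
  end
end

Without the DELETED sentinel, a deleted slot either halts lookups too early (if cleared back to EMPTY) or can never be reused (if left as-is), and the latter is exactly the pre-PR behavior that made reinsertions act like collisions.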

Benchmark:
require "benchmark"

Benchmark.ips do |b|
  (2..10).each do |i|
    capacity = 1 << i
    b.report("capacity = #{capacity}, empty") do
      h = Hash(Int32, Int32).new(initial_capacity: capacity)
      10000.times do
        h.delete(100)
        h[100] = 123
      end
    end
  end
end

puts

Benchmark.ips do |b|
  (2..10).each do |i|
    capacity = 1 << i
    b.report("capacity = #{capacity}, filled") do
      h = Hash(Int32, Int32).new(initial_capacity: capacity)
      (capacity // 4).times { |i| h[i] = i }
      10000.times do
        h.delete(100)
        h[100] = 123
      end
    end
  end
end

Before:

   capacity = 4, empty  14.96k ( 66.83µs) (± 0.53%)    176B/op         fastest
   capacity = 8, empty  14.96k ( 66.83µs) (± 0.69%)    176B/op    1.00× slower
  capacity = 16, empty  13.82k ( 72.38µs) (± 0.94%)    272B/op    1.08× slower
  capacity = 32, empty   1.91k (522.44µs) (± 3.49%)    592B/op    7.82× slower
  capacity = 64, empty   1.07k (938.40µs) (± 0.57%)  0.99kB/op   14.04× slower
 capacity = 128, empty 582.25  (  1.72ms) (± 0.50%)  2.32kB/op   25.70× slower
 capacity = 256, empty 322.94  (  3.10ms) (± 0.50%)  4.38kB/op   46.34× slower
 capacity = 512, empty 169.62  (  5.90ms) (± 0.94%)   8.1kB/op   88.22× slower
capacity = 1024, empty  88.60  ( 11.29ms) (± 1.19%)  16.1kB/op  168.89× slower

   capacity = 4, filled   7.97k (125.51µs) (± 0.63%)    176B/op        fastest
   capacity = 8, filled   6.94k (144.02µs) (± 0.82%)    177B/op   1.15× slower
  capacity = 16, filled   4.55k (219.98µs) (± 1.70%)    272B/op   1.75× slower
  capacity = 32, filled   2.23k (448.43µs) (± 1.06%)    592B/op   3.57× slower
  capacity = 64, filled   1.18k (849.64µs) (± 0.64%)  0.99kB/op   6.77× slower
 capacity = 128, filled 660.20  (  1.51ms) (± 0.65%)  2.32kB/op  12.07× slower
 capacity = 256, filled 365.25  (  2.74ms) (± 0.59%)  4.39kB/op  21.81× slower
 capacity = 512, filled 198.40  (  5.04ms) (± 4.68%)   8.1kB/op  40.16× slower
capacity = 1024, filled 102.61  (  9.75ms) (± 2.87%)  16.1kB/op  77.65× slower

After:

   capacity = 4, empty  15.26k ( 65.52µs) (± 0.39%)    176B/op        fastest
   capacity = 8, empty  15.24k ( 65.60µs) (± 0.57%)    176B/op   1.00× slower
  capacity = 16, empty  14.19k ( 70.48µs) (± 1.32%)    272B/op   1.08× slower
  capacity = 32, empty  10.70k ( 93.47µs) (± 0.79%)    592B/op   1.43× slower
  capacity = 64, empty  10.69k ( 93.54µs) (± 3.13%)  0.99kB/op   1.43× slower
 capacity = 128, empty  10.77k ( 92.86µs) (± 1.04%)  2.32kB/op   1.42× slower
 capacity = 256, empty  11.43k ( 87.45µs) (± 1.32%)  4.39kB/op   1.33× slower
 capacity = 512, empty  11.51k ( 86.85µs) (± 0.67%)   8.1kB/op   1.33× slower
capacity = 1024, empty  11.50k ( 86.95µs) (± 0.60%)  16.1kB/op   1.33× slower

   capacity = 4, filled   8.02k (124.74µs) (± 0.72%)    176B/op   1.18× slower
   capacity = 8, filled   7.04k (141.99µs) (± 0.66%)    177B/op   1.34× slower
  capacity = 16, filled   4.57k (218.80µs) (± 0.54%)    272B/op   2.07× slower
  capacity = 32, filled   9.12k (109.67µs) (± 1.14%)    592B/op   1.04× slower
  capacity = 64, filled   9.45k (105.78µs) (± 0.80%)  0.99kB/op        fastest
 capacity = 128, filled   7.51k (133.13µs) (± 0.95%)  2.32kB/op   1.26× slower
 capacity = 256, filled   7.58k (131.92µs) (± 0.77%)  4.39kB/op   1.25× slower
 capacity = 512, filled   7.61k (131.43µs) (± 0.66%)   8.1kB/op   1.24× slower
capacity = 1024, filled   8.86k (112.91µs) (± 0.68%)  16.1kB/op   1.07× slower

This scenario now runs in O(1) instead of O(n) time. Note, however, that deletion is now the opposite: it runs in O(n) time in the number of hash collisions instead of O(1). That means it is possible to craft other scenarios where running time grows from linear to quadratic:
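
The comment in the benchmark below notes that #delete_index shifts the entire chain of indices. As a rough model of why that makes each deletion linear in the chain length (an illustrative assumption, not the actual #delete_index code), consider removing a slot from a probe chain and shifting the rest of the chain back by one, which is valid here because every key probes from the same home bucket, as in the BadKey benchmark:

EMPTY = -1

# Removes the index at `pos` from a probe chain in which every occupied
# slot belongs to the same collision chain (one shared home bucket).
def delete_at(slots : Array(Int32), pos : Int32)
  n = slots.size
  i = pos
  (n - 1).times do
    j = (i + 1) % n
    break if slots[j] == EMPTY # end of the chain: nothing left to shift
    slots[i] = slots[j]        # move the next index back one slot...
    i = j                      # ...and keep walking down the chain
  end
  slots[i] = EMPTY # the chain is now one slot shorter
end

Deleting the first key of a chain of length k performs k - 1 shifts, so a full pass over N all-colliding keys costs O(N²) overall, which matches the quadratic scaling measured below.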

Benchmark:
require "benchmark"

record BadKey, x : Int32 do
  def hash
    1
  end
end

# all keys have hash collisions; the first key in the collision chain is
# deleted at every step, so `#index_for_entry_index` returns immediately,
# but `#delete_index` shifts the entire chain of indices
Benchmark.ips do |b|
  (6..10).each do |n|
    b.report("N = #{1 << n}") do
      h = Hash(BadKey, Void*).new(initial_capacity: 4096)
      keys = Array.new(1 << n) { |i| BadKey.new(i) }
      keys.each { |k| h[k] = Pointer(Void).null }

      keys.cycle(1000) do |k|
        h.delete(k)
        h[k] = Pointer(Void).null
      end
    end
  end
end

Before:

  N = 64   4.29  (233.37ms) (± 0.37%)  80.6kB/op        fastest
 N = 128   2.13  (469.17ms) (± 0.18%)  81.1kB/op   2.01× slower
 N = 256   1.05  (952.04ms) (± 0.08%)  81.3kB/op   4.08× slower
 N = 512 511.41m (  1.96s ) (± 0.04%)  82.0kB/op   8.38× slower
N = 1024 239.93m (  4.17s ) (± 0.17%)  84.1kB/op  17.86× slower

After:

  N = 64  89.49  ( 11.18ms) (± 0.50%)  80.4kB/op         fastest
 N = 128  22.58  ( 44.30ms) (± 0.75%)  80.8kB/op    3.96× slower
 N = 256   5.66  (176.81ms) (± 0.54%)  81.3kB/op   15.82× slower
 N = 512   1.38  (722.15ms) (± 2.30%)  82.0kB/op   64.62× slower
N = 1024 341.43m (  2.93s ) (± 0.43%)  84.1kB/op  262.09× slower

Note that checking Hash::Entry#deleted? doesn't suffice, because it also returns true for elements in @entries that have never been used. Also, this PR doesn't change how @entries is used: #do_compaction will still be called from time to time, whenever @entries reaches its capacity (similar to alternating #push and #shift calls on an Array).
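
For intuition on the Array analogy, here is a toy model of an append-only entry list where deletion leaves a hole and a periodic compaction step reclaims the holes. The hole representation and the compaction trigger are illustrative assumptions, not the actual @entries / #do_compaction code:

# nil marks a hole left by a deletion; entries only ever grow by appending.
capacity = 8
entries = Array(Int32?).new

1000.times do
  # "delete": leave a hole instead of moving later entries
  if idx = entries.rindex { |v| !v.nil? }
    entries[idx] = nil
  end

  # "insert": when the list is full, drop the holes instead of growing,
  # analogous to #do_compaction running when @entries reaches capacity
  if entries.size == capacity
    entries = entries.compact.map(&.as(Int32?))
  end
  entries << 123
end

puts entries.size # stays bounded by `capacity`; holes are reclaimed periodically

Just as alternating #push and #shift keeps an Array's size constant while still occasionally doing O(n) internal work, the Hash occasionally pays for a compaction pass, keeping the amortized cost of each operation constant.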

@crysbot commented Apr 25, 2024

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/hash-delete-followed-by-insert-performance-issues/6784/11

@HertzDevil HertzDevil marked this pull request as ready for review April 25, 2024 03:12
@straight-shoota straight-shoota added this to the 1.13.0 milestone Jun 10, 2024
@straight-shoota straight-shoota merged commit 504fdb7 into crystal-lang:master Jun 12, 2024
60 checks passed
@HertzDevil HertzDevil deleted the perf/hash-deleted-index branch June 26, 2024 12:31