Optimize `Hash` for repeated removals and insertions #14539

HertzDevil · 2024-04-24T23:11:11Z

`Hash` deletions do not clear the corresponding slots in `@indices`, so subsequent insertions with the same key cannot reuse those slots, which then effectively behave like hash collisions. This PR adds an extra sentinel value for deleted index slots: such slots can later be filled in and, unlike the empty sentinel, do not halt index scans. See https://forum.crystal-lang.org/t/hash-delete-followed-by-insert-performance-issues/6784 for a discussion. Credit goes to @homonoidian for discovering this.
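To make the idea of a deleted-slot sentinel concrete, here is a minimal sketch of a linear-probing table that distinguishes empty slots from deleted ones. It is a toy, not the code in `src/hash.cr`: the class `ToyIndexTable`, the constants `EMPTY` and `DELETED`, and the `probes` counter are all hypothetical names made up for illustration, and keys are assumed to be non-negative `Int32` values that hash to themselves.

```crystal
# Toy open-addressing table with linear probing (illustration only).
# Assumptions: capacity is a power of two, keys are non-negative, and the
# table always keeps at least one EMPTY slot so that scans terminate.
class ToyIndexTable
  EMPTY   = -1 # never used; halts a scan
  DELETED = -2 # previously used; does not halt a scan, may be refilled

  # Number of slots inspected so far, to observe the cost of each call.
  getter probes = 0

  @slots : Array(Int32)
  @mask : Int32

  def initialize(capacity : Int32)
    @slots = Array(Int32).new(capacity, EMPTY)
    @mask = capacity - 1
  end

  def insert(key : Int32) : Nil
    i = key & @mask
    reusable = nil # first DELETED slot seen during this scan, if any
    loop do
      @probes += 1
      slot = @slots[i]
      if slot == key
        return # already present
      elsif slot == DELETED
        reusable ||= i # remember it, but keep scanning for the key
      elsif slot == EMPTY
        @slots[reusable || i] = key # prefer refilling a deleted slot
        return
      end
      i = (i + 1) & @mask
    end
  end

  def delete(key : Int32) : Bool
    i = key & @mask
    loop do
      @probes += 1
      slot = @slots[i]
      return false if slot == EMPTY
      if slot == key
        @slots[i] = DELETED # mark instead of leaving the stale value
        return true
      end
      i = (i + 1) & @mask
    end
  end

  def includes?(key : Int32) : Bool
    i = key & @mask
    loop do
      slot = @slots[i]
      return false if slot == EMPTY
      return true if slot == key
      i = (i + 1) & @mask
    end
  end
end

table = ToyIndexTable.new(8)
1000.times do
  table.delete(5)
  table.insert(5)
end
```

In this sketch every delete/insert round touches a constant number of slots, because the slot marked `DELETED` is refilled on the next insertion instead of lengthening the probe sequence.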

Benchmark:

require "benchmark"

Benchmark.ips do |b|
  (2..10).each do |i|
    capacity = 1 << i
    b.report("capacity = #{capacity}, empty") do
      h = Hash(Int32, Int32).new(initial_capacity: capacity)
      10000.times do
        h.delete(100)
        h[100] = 123
      end
    end
  end
end

puts

Benchmark.ips do |b|
  (2..10).each do |i|
    capacity = 1 << i
    b.report("capacity = #{capacity}, filled") do
      h = Hash(Int32, Int32).new(initial_capacity: capacity)
      (capacity // 4).times { |i| h[i] = i }
      10000.times do
        h.delete(100)
        h[100] = 123
      end
    end
  end
end

Before:

   capacity = 4, empty  14.96k ( 66.83µs) (± 0.53%)    176B/op         fastest
   capacity = 8, empty  14.96k ( 66.83µs) (± 0.69%)    176B/op    1.00× slower
  capacity = 16, empty  13.82k ( 72.38µs) (± 0.94%)    272B/op    1.08× slower
  capacity = 32, empty   1.91k (522.44µs) (± 3.49%)    592B/op    7.82× slower
  capacity = 64, empty   1.07k (938.40µs) (± 0.57%)  0.99kB/op   14.04× slower
 capacity = 128, empty 582.25  (  1.72ms) (± 0.50%)  2.32kB/op   25.70× slower
 capacity = 256, empty 322.94  (  3.10ms) (± 0.50%)  4.38kB/op   46.34× slower
 capacity = 512, empty 169.62  (  5.90ms) (± 0.94%)   8.1kB/op   88.22× slower
capacity = 1024, empty  88.60  ( 11.29ms) (± 1.19%)  16.1kB/op  168.89× slower

   capacity = 4, filled   7.97k (125.51µs) (± 0.63%)    176B/op        fastest
   capacity = 8, filled   6.94k (144.02µs) (± 0.82%)    177B/op   1.15× slower
  capacity = 16, filled   4.55k (219.98µs) (± 1.70%)    272B/op   1.75× slower
  capacity = 32, filled   2.23k (448.43µs) (± 1.06%)    592B/op   3.57× slower
  capacity = 64, filled   1.18k (849.64µs) (± 0.64%)  0.99kB/op   6.77× slower
 capacity = 128, filled 660.20  (  1.51ms) (± 0.65%)  2.32kB/op  12.07× slower
 capacity = 256, filled 365.25  (  2.74ms) (± 0.59%)  4.39kB/op  21.81× slower
 capacity = 512, filled 198.40  (  5.04ms) (± 4.68%)   8.1kB/op  40.16× slower
capacity = 1024, filled 102.61  (  9.75ms) (± 2.87%)  16.1kB/op  77.65× slower

After:

   capacity = 4, empty  15.26k ( 65.52µs) (± 0.39%)    176B/op        fastest
   capacity = 8, empty  15.24k ( 65.60µs) (± 0.57%)    176B/op   1.00× slower
  capacity = 16, empty  14.19k ( 70.48µs) (± 1.32%)    272B/op   1.08× slower
  capacity = 32, empty  10.70k ( 93.47µs) (± 0.79%)    592B/op   1.43× slower
  capacity = 64, empty  10.69k ( 93.54µs) (± 3.13%)  0.99kB/op   1.43× slower
 capacity = 128, empty  10.77k ( 92.86µs) (± 1.04%)  2.32kB/op   1.42× slower
 capacity = 256, empty  11.43k ( 87.45µs) (± 1.32%)  4.39kB/op   1.33× slower
 capacity = 512, empty  11.51k ( 86.85µs) (± 0.67%)   8.1kB/op   1.33× slower
capacity = 1024, empty  11.50k ( 86.95µs) (± 0.60%)  16.1kB/op   1.33× slower

   capacity = 4, filled   8.02k (124.74µs) (± 0.72%)    176B/op   1.18× slower
   capacity = 8, filled   7.04k (141.99µs) (± 0.66%)    177B/op   1.34× slower
  capacity = 16, filled   4.57k (218.80µs) (± 0.54%)    272B/op   2.07× slower
  capacity = 32, filled   9.12k (109.67µs) (± 1.14%)    592B/op   1.04× slower
  capacity = 64, filled   9.45k (105.78µs) (± 0.80%)  0.99kB/op        fastest
 capacity = 128, filled   7.51k (133.13µs) (± 0.95%)  2.32kB/op   1.26× slower
 capacity = 256, filled   7.58k (131.92µs) (± 0.77%)  4.39kB/op   1.25× slower
 capacity = 512, filled   7.61k (131.43µs) (± 0.66%)   8.1kB/op   1.24× slower
capacity = 1024, filled   8.86k (112.91µs) (± 0.68%)  16.1kB/op   1.07× slower

This scenario now runs in O(1) time instead of O(n). Note, however, that deletion now goes the opposite way: it runs in O(n) time in the number of hash collisions, instead of O(1). That means it is possible to craft other scenarios where the running time grows from linear to quadratic:

Benchmark:

require "benchmark"

record BadKey, x : Int32 do
  def hash
    1
  end
end

# all keys have hash collisions; the first key in the collision chain is
# deleted at every step, so `#index_for_entry_index` returns immediately,
# but `#delete_index` shifts the entire chain of indices
Benchmark.ips do |b|
  (6..10).each do |n|
    b.report("N = #{1 << n}") do
      h = Hash(BadKey, Void*).new(initial_capacity: 4096)
      keys = Array.new(1 << n) { |i| BadKey.new(i) }
      keys.each { |k| h[k] = Pointer(Void).null }

      keys.cycle(1000) do |k|
        h.delete(k)
        h[k] = Pointer(Void).null
      end
    end
  end
end

Before:

  N = 64   4.29  (233.37ms) (± 0.37%)  80.6kB/op        fastest
 N = 128   2.13  (469.17ms) (± 0.18%)  81.1kB/op   2.01× slower
 N = 256   1.05  (952.04ms) (± 0.08%)  81.3kB/op   4.08× slower
 N = 512 511.41m (  1.96s ) (± 0.04%)  82.0kB/op   8.38× slower
N = 1024 239.93m (  4.17s ) (± 0.17%)  84.1kB/op  17.86× slower

After:

  N = 64  89.49  ( 11.18ms) (± 0.50%)  80.4kB/op         fastest
 N = 128  22.58  ( 44.30ms) (± 0.75%)  80.8kB/op    3.96× slower
 N = 256   5.66  (176.81ms) (± 0.54%)  81.3kB/op   15.82× slower
 N = 512   1.38  (722.15ms) (± 2.30%)  82.0kB/op   64.62× slower
N = 1024 341.43m (  2.93s ) (± 0.43%)  84.1kB/op  262.09× slower
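The comment in the benchmark above says that `#delete_index` shifts the entire chain of indices. One common technique with that cost profile is backward-shift deletion in a linear-probing table, sketched below. This is an assumption about the general shape of the trade-off, not a copy of `src/hash.cr`: `ShiftingTable`, `EMPTY`, and the `shifts` counter are hypothetical names, and keys are assumed to be non-negative `Int32` values whose home slot is `key & mask`.

```crystal
# Toy linear-probing set where deletion closes the gap by moving later
# chain members back, instead of leaving a marker behind (illustration only).
# Assumptions: capacity is a power of two and the table is never full.
class ShiftingTable
  EMPTY = -1

  # Number of entries moved by deletions so far.
  getter shifts = 0

  @slots : Array(Int32)
  @mask : Int32

  def initialize(capacity : Int32)
    @slots = Array(Int32).new(capacity, EMPTY)
    @mask = capacity - 1
  end

  def insert(key : Int32) : Nil
    i = key & @mask
    until @slots[i] == EMPTY
      return if @slots[i] == key
      i = (i + 1) & @mask
    end
    @slots[i] = key
  end

  def includes?(key : Int32) : Bool
    i = key & @mask
    until @slots[i] == EMPTY
      return true if @slots[i] == key
      i = (i + 1) & @mask
    end
    false
  end

  def delete(key : Int32) : Bool
    # Locate the key; an EMPTY slot means it is absent.
    i = key & @mask
    until @slots[i] == key
      return false if @slots[i] == EMPTY
      i = (i + 1) & @mask
    end

    # Walk the rest of the chain. An entry at j may move into the hole at i
    # only if its home slot is not cyclically inside (i, j].
    j = i
    loop do
      j = (j + 1) & @mask
      candidate = @slots[j]
      break if candidate == EMPTY
      distance = (j - (candidate & @mask)) & @mask # how far candidate sits from home
      if distance >= ((j - i) & @mask)
        @slots[i] = candidate
        i = j
        @shifts += 1
      end
    end
    @slots[i] = EMPTY
    true
  end
end

table = ShiftingTable.new(16)
[0, 16, 32, 48].each { |k| table.insert(k) } # all four share home slot 0
table.delete(0)                              # removes the head of the chain
puts table.shifts                            # => 3
```

Deleting the first key of a chain of four colliding keys moves the other three, so repeatedly deleting and reinserting each key of a fully colliding set costs time proportional to the chain length per deletion, which is the quadratic growth the figures above show.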

Note that checking `Hash::Entry#deleted?` doesn't suffice, because it also returns true for elements in `@entries` that were previously unused. Also, this PR doesn't change how `@entries` is used, and `#do_compaction` will still be called every now and then whenever `@entries` reaches its capacity (similar to alternating `#push` and `#shift` calls on an `Array`).

crysbot · 2024-04-25T00:00:30Z

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/hash-delete-followed-by-insert-performance-issues/6784/11

src/hash.cr

Optimize Hash for repeated removals and insertions

d6821aa

HertzDevil added the performance and topic:stdlib:collection labels Apr 24, 2024

HertzDevil marked this pull request as draft April 24, 2024 23:18

fixup

0a6793d

HertzDevil marked this pull request as ready for review April 25, 2024 03:12

straight-shoota reviewed Apr 25, 2024

keep only #delete_index

e53117f

ysbaddaden approved these changes Jun 10, 2024

straight-shoota added this to the 1.13.0 milestone Jun 10, 2024

straight-shoota merged commit 504fdb7 into crystal-lang:master Jun 12, 2024
60 checks passed

HertzDevil deleted the perf/hash-deleted-index branch June 26, 2024 12:31

BrewTestBot mentioned this pull request Jul 10, 2024

crystal 1.13.0 Homebrew/homebrew-core#176873
