Memory leak with Regex #13712

Closed
jwoertink opened this issue Jul 27, 2023 · 3 comments

@jwoertink
Contributor

jwoertink commented Jul 27, 2023

With PCRE 1, this script's memory use grows slowly and tops out at 3933458 for total_bytes. With PCRE 2, it grows quickly and tops out at 6034353.

crystal build -Duse_pcre regex.cr vs crystal build -Duse_pcre2 regex.cr

1_000.times do
  emotes = [
    ":D",
    ";)",
    ":hugging:",
    ":cheers:",
    ":spicy:",
    ":chef_kiss:",
    ":)",
    ":party_scream:",
    ":100:",
    "<3",
    ":love:",
  ]

  text = "a long message with nothing in it a long message with nothing in it a long message with nothing in it a long message with nothing in it"

  emotes.each do |emote|
    regex = Regex.escape(emote)
    regex += "\\b" if emote.matches?(/\w$/)
    text.matches?(Regex.new(regex))
  end

  puts GC.stats.total_bytes
end

Can someone confirm if I'm reading this correctly?

❯ crystal -v
Crystal 1.9.1 [c355a34e5] (2023-07-17)

LLVM: 15.0.7
Default target: x86_64-unknown-linux-gnu
@jwoertink
Contributor Author

I shrank the example a little and tried Benchmark.memory to see how much memory that block uses. I expected a little variance, but it wavers a lot more than I'd expect.

require "benchmark"

counts = Hash(Int64, Int64).new(0i64)

1_000.times do
  emotes = [
    ":D",
  ]

  text = "a long message with nothing in it a long message with nothing in it a long message with nothing in it a long message with nothing in it"

  total = Benchmark.memory {
    emotes.each do |emote|
      regex = Regex.escape(emote)
      regex += "\\b" if emote.matches?(/\w$/)
      text.matches?(Regex.new(regex))
    end
  }

  counts[total.to_i64] += 1

  {% if flag?(:gc_collect) %}
  GC.collect
  {% end %}
end

Here are the outputs I got from this with different build variants, where the key is the amount of memory used and the value is the number of times that exact amount was observed.

# crystal build -Duse_pcre --release regex.cr
{496 => 1, 512 => 1, 368 => 21, 640 => 1, 3120 => 1, 880 => 1, 304 => 10, 688 => 1, 208 => 4, 4240 => 1, 1488 => 1, 4144 => 1, 64 => 867, 1408 => 1, 4160 => 16, 1344 => 15, 4096 => 21, 2880 => 15, 4064 => 18, 2112 => 1, 8096 => 1, 8257 => 1}

# crystal build -Duse_pcre -Dgc_collect --release regex.cr
{496 => 1, 368 => 23, 1968 => 1, 304 => 11, 4144 => 20, 208 => 5, 4176 => 1, 64 => 864, 1728 => 36, 4000 => 22, 4096 => 15, 5664 => 1}

# crystal build -Duse_pcre regex.cr
{496 => 1, 512 => 1, 368 => 21, 640 => 1, 3120 => 1, 880 => 1, 304 => 10, 688 => 1, 208 => 4, 4240 => 1, 1488 => 1, 4144 => 1, 64 => 867, 1408 => 1, 4160 => 16, 1344 => 15, 4096 => 21, 2880 => 15, 4064 => 18, 2112 => 1, 8096 => 1, 8257 => 1}

# crystal build -Duse_pcre -Dgc_collect regex.cr
{496 => 1, 368 => 23, 1968 => 1, 304 => 11, 4144 => 20, 208 => 5, 4208 => 1, 64 => 864, 1728 => 36, 4000 => 22, 4096 => 15, 5664 => 1}

# crystal build -Duse_pcre2 --release regex.cr
{992 => 1, 704 => 2, 560 => 8, 832 => 1, 4080 => 2, 432 => 16, 1008 => 1, 768 => 1, 320 => 5, 4352 => 1, 224 => 3, 2528 => 1, 4256 => 1, 1536 => 1, 4160 => 15, 80 => 52, 4368 => 1, 4176 => 1, 4112 => 2, 2704 => 1, 2832 => 1, 2112 => 7, 64 => 769, 4096 => 42, 2816 => 26, 4064 => 11, 2752 => 14, 8192 => 1, 10848 => 1, 192 => 1, 8257 => 1, 8160 => 1, 3424 => 1, 4000 => 1, 6848 => 1, 2144 => 1, 4144 => 4, 2944 => 1}

# crystal build -Duse_pcre2 -Dgc_collect --release regex.cr
{992 => 1, 560 => 10, 1904 => 1, 432 => 16, 1840 => 1, 3904 => 1, 320 => 5, 1728 => 1, 4160 => 3, 224 => 5, 4224 => 1, 1488 => 5, 80 => 50, 3664 => 2, 5424 => 1, 3776 => 4, 64 => 740, 4096 => 11, 1472 => 73, 4000 => 19, 3648 => 23, 4064 => 16, 5504 => 2, 5056 => 3, 8096 => 1, 5408 => 2, 5472 => 1, 7584 => 1, 7680 => 1}

# crystal build -Duse_pcre2 regex.cr
{992 => 1, 704 => 2, 560 => 8, 832 => 1, 4080 => 2, 432 => 16, 1008 => 1, 768 => 1, 320 => 5, 4352 => 1, 224 => 3, 2528 => 1, 4256 => 1, 1536 => 1, 4160 => 15, 80 => 52, 4368 => 1, 4176 => 1, 4112 => 2, 2704 => 1, 2832 => 1, 2112 => 8, 64 => 769, 4096 => 42, 2816 => 26, 4064 => 11, 2752 => 14, 8192 => 1, 10848 => 1, 192 => 1, 8257 => 1, 8160 => 1, 3424 => 1, 2944 => 1, 6848 => 1, 4000 => 1, 4144 => 4}

# crystal build -Duse_pcre2 -Dgc_collect regex.cr
{992 => 1, 560 => 10, 1904 => 1, 432 => 16, 1840 => 1, 3904 => 1, 320 => 5, 1728 => 1, 4160 => 3, 224 => 5, 4224 => 1, 1488 => 5, 80 => 50, 3664 => 2, 5424 => 1, 3776 => 4, 64 => 740, 4096 => 11, 1472 => 73, 4000 => 19, 3648 => 23, 4064 => 16, 5504 => 2, 5056 => 3, 8096 => 1, 5408 => 2, 5472 => 1, 7584 => 1, 7680 => 1}

@HertzDevil
Contributor

GC.stats.total_bytes never decreases upon garbage collection; the larger GC.stats.total_bytes simply means PCRE2 consumes more memory on average than PCRE1. A better measure of live memory use is GC.stats.heap_size - GC.stats.free_bytes.
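
For illustration, here is a minimal sketch of the difference between the two counters. The `live_bytes` helper is a hypothetical name, not part of the standard library; it only wraps the `heap_size - free_bytes` subtraction described above. The exact numbers will vary by platform and PCRE version.

```crystal
# Hypothetical helper (not in the standard library): approximate live heap
# usage as heap_size - free_bytes, taken from a single GC.stats snapshot.
def live_bytes(stats : GC::Stats = GC.stats) : UInt64
  stats.heap_size - stats.free_bytes
end

text = "a long message with nothing in it " * 4

1_000.times do |i|
  text.matches?(Regex.new(Regex.escape(":D")))

  # Every 100 iterations, print the cumulative counter next to the live estimate.
  if i % 100 == 0
    stats = GC.stats
    puts "total_bytes=#{stats.total_bytes} live=#{live_bytes(stats)}"
  end
end
```

total_bytes should only ever go up across the printed lines, while the live estimate can move in either direction as the GC reclaims memory.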

If you replace the 1_000.times do with while true, memory use still doesn't grow without bound, so there is no harmful memory leak in this scenario.
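
A bounded variant of that check can be sketched as follows, assuming that sampling heap_size - free_bytes after a forced GC.collect is a fair proxy for live memory. The round and iteration counts are arbitrary illustrative values, and the single ":D" pattern is borrowed from the reduced example earlier in the thread.

```crystal
# Arbitrary illustrative sizes; raise them for a longer soak test.
rounds = 10
iterations_per_round = 1_000

samples = [] of UInt64
text = "a long message with nothing in it " * 4

rounds.times do
  iterations_per_round.times do
    text.matches?(Regex.new(Regex.escape(":D")))
  end

  # Force a collection so the sample reflects live objects, not pending garbage.
  GC.collect
  stats = GC.stats
  samples << stats.heap_size - stats.free_bytes
end

puts samples.join(", ")
```

If the samples level off instead of climbing round after round, that is consistent with bounded memory use rather than a leak.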

@jwoertink
Contributor Author

As discussed in the Discord, it's possible this is just another case of #3997.

In my case, we're running a TON of regexes on web socket connections to parse chat messages, and the switch to PCRE2 doubled our memory use. It seems our app does indeed have a memory leak, but it's most likely not in regex. We only had 4 GB of RAM, and switching to PCRE2 made our leak more prominent.

For now, we can just close this out. I'll open a new issue if I have a more concrete example on what's going on.
