Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --no-gc to crystal build #14223

Open
nobodywasishere opened this issue Jan 11, 2024 · 21 comments
Open

Add --no-gc to crystal build #14223

nobodywasishere opened this issue Jan 11, 2024 · 21 comments

Comments

@nobodywasishere
Copy link
Contributor

Feature Request

Crystal build times can be slow, and the garbage collector can take up a significant portion of this time. This proposal is for a compile-time --no-gc flag that disables the garbage collector during compile-time to speed up compilation time at the cost of memory. This does not break backwards compatibility.

@BlobCodes
Copy link
Contributor

You can already do this using the BDWGC environment variable GC_DONT_GC

Example:

$ GC_DONT_GC=1 time crystal build src/app.cr
1.36user 0.95system 0:01.91elapsed 121%CPU (0avgtext+0avgdata 518428maxresident)k
0inputs+9384outputs (0major+170188minor)pagefaults 0swaps

$ GC_DO_GC=1 time crystal build src/app.cr
2.18user 0.74system 0:02.52elapsed 115%CPU (0avgtext+0avgdata 399432maxresident)k
0inputs+9288outputs (0major+153022minor)pagefaults 0swaps

As you can see, in this test I get about ~30% faster compilation (2.52s -> 1.91s) with ~30% increased max memory (399MB -> 518MB).
I don't know how this scales for bigger projects.

@beta-ziliani
Copy link
Member

beta-ziliani commented Jan 11, 2024

Brilliant idea! I was using GC_INITIAL_HEAP_SIZE to give space GC, and that alone reduced the compilation times ~20% on the compiler. But disabling it entirely never crossed my mind, and it does wonders.

With the compiler GC_DONT_GC=1 makes a huge difference: roughly from 80s to 52s on a cold run and from 50s to 16s on a hot run. And memory doesn't grow beyond 4GB.

@beta-ziliani
Copy link
Member

I wonder if this shouldn't be the default, and have a --gc option instead

@z64
Copy link
Contributor

z64 commented Jan 11, 2024

@beta-ziliani Cool! Yea sometime last year we found that this helps majorly in building Kagi's web server, bringing some of our dev's times down from 5mins -> 2mins (2.5-1min in my personal case).

As a hint, I did some profiling with perf and reviewing in hotspot, and it seems a good chunk of CPU time is specifically on calls stacks bottoming out in GC.add_finalizer. Specifically, this appears to be for def finalize in src/regex/pcre2.cr. This doesn't account for all of the time save from disabling GC completely, but its a significant amount, saving me almost 30s. I will get the flamegraphs again, but I encourage you / others to look deeper.

@MistressRemilia
Copy link
Contributor

I don't know how this scales for bigger projects.

When compiling Benben, I get these numbers on my laptop (Ryzen 3 3200U, I removed ~/.cache/crystal each time):

GC_DO_GC=1 time rake
195.47user 2.24system 3:15.96elapsed 100%CPU (0avgtext+0avgdata 1324580maxresident)k
0inputs+77256outputs (0major+563717minor)pagefaults 0swaps

GC_DONT_GC=1 time rake
199.82user 2.62system 3:20.38elapsed 101%CPU (0avgtext+0avgdata 1635376maxresident)k
0inputs+77288outputs (0major+640754minor)pagefaults 0swaps

@beta-ziliani
Copy link
Member

@MistressRemilia what are the times without it?

@oprypin
Copy link
Member

oprypin commented Jan 11, 2024

@z64 That's an amazing finding 🤩
I might dig into eliminating finalize from pcre2 usages. Actually can someome add to those benchmark results what happens if you just delete def finalize from all regex source files?
Or perhaps someone could profile for the biggest regex usages.

@z64
Copy link
Contributor

z64 commented Jan 11, 2024

Test:

  • Compiler build in debug mode for symbols
  • Building Kagi debug build (~70k SLOC)

Stock v1.11.0: (bottom-up flamegraph)

image

Attribution of 40% cycles from PCRE finalizer: (new:_source:_options points here)

image

With the following patch applied:

--- a/src/regex/pcre2.cr
+++ b/src/regex/pcre2.cr
@@ -254,12 +254,12 @@ module Regex::PCRE2
     end
   end

-  def finalize
-    @match_data.consume_each do |match_data|
-      LibPCRE2.match_data_free(match_data)
-    end
-    LibPCRE2.code_free @re
-  end
+  # def finalize
+  #   @match_data.consume_each do |match_data|
+  #     LibPCRE2.match_data_free(match_data)
+  #   end
+  #   LibPCRE2.code_free @re
+  # end

image

add_finalizer evaporates from the flamegraph

@MistressRemilia
Copy link
Contributor

@beta-ziliani Without...? I have both GC_DO_GC and GC_DONT_GC there. Did I miss something 😅

@BlobCodes
Copy link
Contributor

@MistressRemilia You're executing your whole rakefile without GC, which can decrease performance dramatically.

It probably:

  • Executes the rake process
  • Executes crystal to compile the rakefile
  • Executes the resulting binary
  • Runs shards install
  • Runs the compiler again
  • ...

Try to only show the difference in the final crystal build command

@syeopite
Copy link
Contributor

syeopite commented Jan 11, 2024

Here's the result from Invidious with and without the cache:

Without cache:

$ GC_DONT_GC=1 time crystal build src/invidious.cr
25.07user 6.41system 0:24.19elapsed 130%CPU (0avgtext+0avgdata 2594384maxresident)k
0inputs+201128outputs (0major+748465minor)pagefaults 0swaps

$ GC_DO_GC=1 time crystal build src/invidious.cr
23.71user 3.69system 0:21.76elapsed 125%CPU (0avgtext+0avgdata 1404500maxresident)k
0inputs+169472outputs (0major+811067minor)pagefaults 0swaps

With cache:

$ GC_DONT_GC=1 time crystal build src/invidious.cr
2.73user 1.98system 0:04.28elapsed 110%CPU (0avgtext+0avgdata 1544100maxresident)k
0inputs+36840outputs (0major+316309minor)pagefaults 0swaps

$ GC_DO_GC=1 time crystal build src/invidious.cr
4.35user 2.09system 0:05.96elapsed 108%CPU (0avgtext+0avgdata 1235280maxresident)k
0inputs+36840outputs (0major+421919minor)pagefaults 0swaps

Compiled with an AMD Ryzen 9 7950X

@MistressRemilia
Copy link
Contributor

MistressRemilia commented Jan 11, 2024

@BlobCodes Crystal isn't compiling the Rakefile, that's a Ruby script. The way I have it written, it would first check that the shards.yml and shards.lock weren't updated (which they weren't, and all the shards had been installed), and then do a shards build to make the bin/benben binary.

Here are numbers calling the compiler directly on the same machine using the same command I have in my Rakefile for a release binary, they're pretty much the same numbers:

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
191.92user 2.38system 3:12.39elapsed 100%CPU (0avgtext+0avgdata 1635320maxresident)k
0inputs+61176outputs (0major+632918minor)pagefaults 0swaps

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
201.37user 2.12system 3:21.63elapsed 100%CPU (0avgtext+0avgdata 1324228maxresident)k
0inputs+61176outputs (0major+558645minor)pagefaults 0swaps

And without removing the cache:

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
193.09user 2.45system 3:15.98elapsed 99%CPU (0avgtext+0avgdata 1635816maxresident)k
0inputs+47712outputs (0major+607148minor)pagefaults 0swaps

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
189.86user 2.07system 3:12.29elapsed 99%CPU (0avgtext+0avgdata 1317728maxresident)k
0inputs+47712outputs (0major+528796minor)pagefaults 0swaps

Without the cache on my desktop (Core i9-10850K):

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
91.75user 1.22system 1:32.49elapsed 100%CPU (0avgtext+0avgdata 1324520maxresident)k
384inputs+61176outputs (0major+560001minor)pagefaults 0swaps

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
90.76user 1.48system 1:31.66elapsed 100%CPU (0avgtext+0avgdata 1635224maxresident)k
384inputs+61176outputs (0major+634256minor)pagefaults 0swaps

And with the cache:

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
90.32user 1.18system 1:31.65elapsed 99%CPU (0avgtext+0avgdata 1324388maxresident)k
384inputs+47712outputs (0major+534157minor)pagefaults 0swaps

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
89.63user 1.41system 1:31.15elapsed 99%CPU (0avgtext+0avgdata 1635600maxresident)k
384inputs+47712outputs (0major+608612minor)pagefaults 0swaps

@BlobCodes
Copy link
Contributor

BlobCodes commented Jan 11, 2024

@MistressRemilia

Note that LLVM is of course not garbage collected.
Only the crystal compiler itself will benefit from not garbage collecting.
To evaluate this feature request, it makes no sense to compile using --release (because then 90% of the time is spent in llvm; even if the crystal code would finish instantly, you'd only see small improvements in performance).

@MistressRemilia
Copy link
Contributor

Even without --release I get at best a 1 second difference, cache or not.

@oprypin
Copy link
Member

oprypin commented Jan 12, 2024

I made a patch that avoids using finalizers for libpcre2 and instead wires it up to normal GC.
oprypin@058d581

Unfortunately I am not observing any run time difference between these when using a release compiler to build a non-release compiler. (So disregard the commit message, it was hopeful 😅)

@oprypin
Copy link
Member

oprypin commented Jan 12, 2024

Actually I am also not observing any difference if I just remove the def finalize from libpcre2 like #14223 (comment). Maybe I'm missing something and someone else will have better luck with benchmarking that patch.

@z64
Copy link
Contributor

z64 commented Jan 12, 2024

Actually I am also not observing any difference if I just remove the def finalize from libpcre2 like #14223 (comment). Maybe I'm missing something and someone else will have better luck with benchmarking that patch.

Entirely possible something in our codebase has something the compiler is quite unhappy about WRT regex. I can't tell though from hotspot, can't follow the callers to any specific cause - will probably have to skim through IR. (Perhaps just too many literals, though I don't think we have that many really. We heavily discourage the use over regular string ops that are usually faster)

@beta-ziliani
Copy link
Member

@beta-ziliani Without...? I have both GC_DO_GC and GC_DONT_GC there. Did I miss something 😅

Sorry! misread it. You're compiling in release mode, that explains those huge numbers! 😓

@jwoertink
Copy link
Contributor

Adding this flag has been amazing for development for me. It has sped up the development compilation time enough that my Lucky app now compiles faster than the javascript build! (which takes over a minute itself...)

However, I've noticed that as I save files and Lucky re-compiles the app, over the day I start losing memory. For example, I'll start with 30gb free, save and recompile about 10 times, then I'll have 28gb free. By the end of the day I'm sitting at 2gb free. I guess this is to be expected if garbage collecting isn't happening on each rebuild, but I figured I would at least report my findings.

@oprypin
Copy link
Member

oprypin commented Jan 15, 2024

@jwoertink Hmm if the Crystal process exits, there should be no way that the memory use persists. If this is Linux (or if Mac is similar), make sure you're checking "available" memory rather than "free". Because the "free" memory metric in the free command is merely "wasted" memory that isn't used for any caches that can be dropped any time if necessary.

@crysbot
Copy link

crysbot commented May 30, 2024

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/very-slow-build-speeds-for-hello-world/6881/12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants