Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce threads spawned for codegen #745

Closed
wants to merge 1 commit into from
Closed

Reduce threads spawned for codegen #745

wants to merge 1 commit into from

Conversation

flaviut
Copy link
Contributor

@flaviut flaviut commented Jun 6, 2015

Before:

30.59user 45.65system 0:36.92elapsed 206%CPU (0avgtext+0avgdata 946896maxresident)k
0inputs+532768outputs (0major+824582minor)pagefaults 0swaps

After:

29.47user 3.11system 0:22.75elapsed 143%CPU (0avgtext+0avgdata 945240maxresident)k
0inputs+531824outputs (0major+497122minor)pagefaults 0swaps

Note that the system time is reduced from 45 seconds to 3 seconds: this time is almost pure overhead spent copying pages.

Before:
30.59user 45.65system 0:36.92elapsed 206%CPU (0avgtext+0avgdata 946896maxresident)k
0inputs+532768outputs (0major+824582minor)pagefaults 0swaps

After:
29.47user 3.11system 0:22.75elapsed 143%CPU (0avgtext+0avgdata 945240maxresident)k
0inputs+531824outputs (0major+497122minor)pagefaults 0swaps
@asterite
Copy link
Member

asterite commented Jun 7, 2015

Hi!

I tried your changes in Mac and they result in a slowdown. Here's what I'm doing:

  1. Compile the compiler in release mode, for example bin/crystal build src/compiler/crystal.cr -o crystal-new --release
  2. Copy the compiler to the .build directory so bin/crystal picks it up: mkdir .build; cp crystal-new .build/crystal
  3. Make sure to remove all cached data from previous compilations: rm -rf .crystal
  4. Test the new compiler's performance: time bin/crystal build src/compiler/crystal.cr -o crystal-test -s

You can do steps 1 to 4 for the current compiler and the compiler with your changes applies to see if there's a difference. At least in my case the "Codegen (bc+obj)" takes a few seconds more (about 4 seconds).

Would there be a reason that in Mac forking many times is faster than in linux? If so, we might use an ifdef to choose one strategy or another. I still have to try your changes on linux, but first I want to be sure that you aren't accidentally reusing cached data.

@jhass
Copy link
Member

jhass commented Jun 7, 2015

$ uname -a   
Linux jhass-laptop 4.0.4-2-ARCH #1 SMP PREEMPT Fri May 22 03:05:23 UTC 2015 x86_64 GNU/Linux
$ rm -rf .crystal
$ time make clean crystal FLAGS="--release --stats"
rm -rf .build
rm -rf ./doc
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --release --stats -o .build/crystal src/compiler/crystal.cr
Parse: 00:00:01.0255180
Type inference: 00:00:07.6041620
Codegen (crystal): 00:00:01.5148260
Codegen (bc+obj): 00:03:01.5516630
Codegen (clang): 00:00:03.8879810
make clean crystal FLAGS="--release --stats"  190.57s user 2.04s system 98% cpu 3:16.47 total
$ hub checkout https://github.com/manastech/crystal/pull/745
...
$ rm -rf .crystal/
$ time make clean crystal FLAGS="--release --stats"
rm -rf .build
rm -rf ./doc
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --release --stats -o .build/crystal src/compiler/crystal.cr
Parse: 00:00:00.1183640
Type inference: 00:00:04.6174130
Codegen (crystal): 00:00:01.4481470
Codegen (bc+obj): 00:03:00.8868770
Codegen (clang): 00:00:01.4102650
make clean crystal FLAGS="--release --stats"  189.60s user 1.61s system 101% cpu 3:08.88 total

@flaviut
Copy link
Contributor Author

flaviut commented Jun 7, 2015

This does not seem to provide a performance improvement in release mode as there is only one unit and codegen_many_units is never called:

CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --release --stats -o .build/crystal src/compiler/crystal.cr
Using compiled compiler at .build/crystal
Parse: 00:00:00.1299530
only one
Type inference: 00:00:05.1728430
Codegen (crystal): 00:00:01.6467940
only one
Codegen (bc+obj): 00:03:03.6434620
Codegen (clang): 00:00:01.8393460
      timing(msg) do
        if units.length == 1
          puts "only one"
          first_unit = units.first

          codegen_single_unit(first_unit, target_triple, multithreaded)

          if emit = @emit
            first_unit.emit(emit, original_output_filename || output_filename)
          end
        else
          puts "many"
          codegen_many_units(units, target_triple, multithreaded)
        end
      end

While I'm unable to track down exactly where this happens, I assume it's because LLVM needs access to the whole program at the same time if it's going to perform optimizations like inlining or devirtualization.

My results at the top represent compilation of the compiler in debug mode, which is significantly sped up.

@jhass
Copy link
Member

jhass commented Jun 7, 2015

Right, disregard my above test, it's flawed anyway since not actually using the compiler from this branch.

@asterite
Copy link
Member

asterite commented Jun 7, 2015

Yes, please profile this without --release (yes for creating the first compiler, not for the second one).

@jhass I think you also erase the temporary compiler by doing make clean, maybe you should do bin/crystal build ... instead.

@flaviut
Copy link
Contributor Author

flaviut commented Jun 7, 2015

/h/u/d/c/crystal-lang git:reduce-forking ❯❯❯ rm -rf .crystal; make clean crystal FLAGS="--release"
rm -rf .build
rm -rf ./doc
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --release -o .build/crystal src/compiler/crystal.cr
/h/u/d/c/crystal-lang git:reduce-forking ❯❯❯ rm -rf .crystal; \time make crystal FLAGS="--stats" -B
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --stats -o .build/crystal src/compiler/crystal.cr
Using compiled compiler at .build/crystal
Parse: 00:00:00.1226660
only one
Type inference: 00:00:04.7116780
Codegen (crystal): 00:00:01.6854260
many
Codegen (bc+obj): 00:00:06.1880670
Codegen (clang): 00:00:02.3525730
23.39user 1.83system 0:15.45elapsed 163%CPU (0avgtext+0avgdata 945444maxresident)k
0inputs+532048outputs (0major+292418minor)pagefaults 0swaps

/h/u/d/c/crystal-lang git:reduce-forking ❯❯❯ gco master --force
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
/h/u/d/c/crystal-lang git:master ❯❯❯ rm -rf .crystal; make clean crystal FLAGS="--release"
rm -rf .build
rm -rf ./doc
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --release -o .build/crystal src/compiler/crystal.cr
/h/u/d/c/crystal-lang git:master ❯❯❯ rm -rf .crystal; \time make crystal FLAGS="--stats" -B
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --stats -o .build/crystal src/compiler/crystal.cr
Using compiled compiler at .build/crystal
Parse: 00:00:00.1205280
Type inference: 00:00:04.7666540
Codegen (crystal): 00:00:01.7140080
Codegen (bc+obj): 00:00:11.0089870
Codegen (clang): 00:00:02.3609520
25.07user 22.92system 0:20.35elapsed 235%CPU (0avgtext+0avgdata 947320maxresident)k
0inputs+531784outputs (0major+645582minor)pagefaults 0swaps

@jhass
Copy link
Member

jhass commented Jun 7, 2015

$ rm -rf .crystal
$ time make clean crystal FLAGS="--release --stats"
rm -rf .build
rm -rf ./doc
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --release --stats -o .build/crystal src/compiler/crystal.cr
Parse: 00:00:00.1133590
Type inference: 00:00:08.4715020
Codegen (crystal): 00:00:01.5372070
Codegen (bc+obj): 00:03:28.9384850
Codegen (clang): 00:00:06.4842070
make clean crystal FLAGS="--release --stats"  220.28s user 2.54s system 98% cpu 3:46.21 total
$ touch src/compiler/crystal.cr
$ rm -rf .crystal
$ time make crystal FLAGS="--stats"                         
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --stats -o .build/crystal src/compiler/crystal.cr
Using compiled compiler at .build/crystal
Parse: 00:00:00.4365590
Type inference: 00:00:09.4770200
Codegen (crystal): 00:00:03.9863990
Codegen (bc+obj): 00:00:12.5514720
Codegen (clang): 00:00:01.9006570
make crystal FLAGS="--stats"  37.61s user 26.60s system 222% cpu 28.824 total
$ hub checkout https://github.com/manastech/crystal/pull/745
...
$ rm -rf .crystal
$ time make clean crystal FLAGS="--release --stats"         
rm -rf .build
rm -rf ./doc
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --release --stats -o .build/crystal src/compiler/crystal.cr
Parse: 00:00:00.1392960
Type inference: 00:00:04.9105870
Codegen (crystal): 00:00:01.5402750
Codegen (bc+obj): 00:03:31.4144840
Codegen (clang): 00:00:01.4782640
make clean crystal FLAGS="--release --stats"  220.10s user 1.79s system 100% cpu 3:40.01 total
$ touch src/compiler/crystal.cr
$ rm -rf .crystal
$ time make crystal FLAGS="--stats"
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --stats -o .build/crystal src/compiler/crystal.cr
Using compiled compiler at .build/crystal
Parse: 00:00:00.1106770
Type inference: 00:00:04.9089220
Codegen (crystal): 00:00:01.8094630
Codegen (bc+obj): 00:00:07.0417690
Codegen (clang): 00:00:01.8946120
make crystal FLAGS="--stats"  29.84s user 2.28s system 197% cpu 16.236 total

So ~16s vs ~28s for me.

@miketheman
Copy link
Contributor

Taking a look back on this, running with recent master, on OSX:

$ uname -a
Darwin wing 15.4.0 Darwin Kernel Version 15.4.0: Fri Feb 12 15:56:47 PST 2016; root:xnu-3248.40.183~6/RELEASE_X86_64 x86_64

$ time make clean crystal FLAGS="--release --stats"
time make clean crystal FLAGS="--release --stats"
rm -rf .build
rm -rf ./doc
rm -rf src/llvm/ext/llvm_ext.o
rm -rf src/ext/sigfault.o src/ext/libcrystal.a
c++ -c -o src/llvm/ext/llvm_ext.o src/llvm/ext/llvm_ext.cc `/usr/local/bin/llvm-config-3.5 --cxxflags`
cc -fPIC   -c -o src/ext/sigfault.o src/ext/sigfault.c
ar -rcs src/ext/libcrystal.a src/ext/sigfault.o
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --release --stats -o .build/crystal src/compiler/crystal.cr -D without_openssl -D without_zlib
Parse:                             00:00:00.1382580 (  50.41MB)
Semantic (top level):              00:00:00.0799790 (  66.41MB)
Semantic (abstract def check):     00:00:00.0009700 (  66.41MB)
Semantic (type declarations):      00:00:00.0077000 (  74.41MB)
Semantic (main):                   00:00:07.1716160 ( 562.85MB)
Semantic (cleanup):                00:00:00.0001270 ( 562.85MB)
Semantic (recursive struct check): 00:00:00.0011330 ( 562.85MB)
Codegen (crystal):                 00:00:02.9087470 ( 699.35MB)
Codegen (bc+obj):                  00:04:30.0751160 ( 699.35MB)
Codegen (linking):                 00:00:00.6957770 ( 699.35MB)

real    4m42.225s
user    4m36.965s
sys 0m3.485s

$ touch src/compiler/crystal.cr
$ rm -rf .crystal
$ time make crystal FLAGS="--stats"
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --stats -o .build/crystal src/compiler/crystal.cr -D without_openssl -D without_zlib
Using compiled compiler at .build/crystal
Parse:                             00:00:00.1474360 (  50.28MB)
Semantic (top level):              00:00:00.1068450 (  74.28MB)
Semantic (abstract def check):     00:00:00.0009770 (  74.28MB)
Semantic (type declarations):      00:00:00.0087020 (  74.28MB)
Semantic (initializers):           00:00:00.0207510 (  74.28MB)
Semantic (main):                   00:00:06.2693800 ( 626.71MB)
Semantic (cleanup):                00:00:00.0001230 ( 626.71MB)
Semantic (recursive struct check): 00:00:00.0011770 ( 626.71MB)
Codegen (crystal):                 00:00:03.6329920 ( 676.21MB)
Codegen (bc+obj):                  00:00:15.6289320 ( 676.21MB)
Codegen (linking):                 00:00:01.1216190 ( 676.21MB)

real    0m27.443s
user    1m2.879s
sys 0m7.495s

Manually applying the proposed patch (no longer mergable directly, needed some syntax changes), and running the same test:

$ time make clean crystal FLAGS="--release --stats"
time make clean crystal FLAGS="--release --stats"
rm -rf .build
rm -rf ./doc
rm -rf src/llvm/ext/llvm_ext.o
rm -rf src/ext/sigfault.o src/ext/libcrystal.a
c++ -c -o src/llvm/ext/llvm_ext.o src/llvm/ext/llvm_ext.cc `/usr/local/bin/llvm-config-3.5 --cxxflags`
cc -fPIC   -c -o src/ext/sigfault.o src/ext/sigfault.c
ar -rcs src/ext/libcrystal.a src/ext/sigfault.o
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --release --stats -o .build/crystal src/compiler/crystal.cr -D without_openssl -D without_zlib
Parse:                             00:00:00.1490020 (  50.51MB)
Semantic (top level):              00:00:00.0842270 (  66.51MB)
Semantic (abstract def check):     00:00:00.0008460 (  66.51MB)
Semantic (type declarations):      00:00:00.0073360 (  74.51MB)
Semantic (main):                   00:00:05.0246380 ( 562.95MB)
Semantic (cleanup):                00:00:00.0001560 ( 562.95MB)
Semantic (recursive struct check): 00:00:00.0012300 ( 562.95MB)
Codegen (crystal):                 00:00:02.6866090 ( 691.45MB)
Codegen (bc+obj):                  00:04:32.6508970 ( 691.45MB)
Codegen (linking):                 00:00:01.2189160 ( 691.45MB)

real    4m42.940s
user    4m35.450s
sys 0m3.778s

$ touch src/compiler/crystal.cr
$ rm -rf .crystal
$ time make crystal FLAGS="--stats"
time make crystal FLAGS="--stats"
CRYSTAL_CONFIG_PATH=`pwd`/src ./bin/crystal build --stats -o .build/crystal src/compiler/crystal.cr -D without_openssl -D without_zlib
Using compiled compiler at .build/crystal
Parse:                             00:00:00.3199260 (  50.27MB)
Semantic (top level):              00:00:00.1099910 (  74.27MB)
Semantic (abstract def check):     00:00:00.0009440 (  74.27MB)
Semantic (type declarations):      00:00:00.0089740 (  74.27MB)
Semantic (initializers):           00:00:00.0211150 (  74.27MB)
Semantic (main):                   00:00:07.1027550 ( 594.71MB)
Semantic (cleanup):                00:00:00.0001190 ( 594.71MB)
Semantic (recursive struct check): 00:00:00.0013980 ( 594.71MB)
Codegen (crystal):                 00:00:03.0466980 ( 739.21MB)
Codegen (bc+obj):                  00:00:16.1062540 ( 739.21MB)
Codegen (linking):                 00:00:01.1141030 ( 739.21MB)

real    0m28.352s
user    0m53.466s
sys 0m2.463s

So now it looks pretty comparative in speed.

Note: I never saw the master approach exceed 4 threads (nor saturate CPU use) - and the proposed changes here had no significant speedups observed.

Not certain what the future of this PR should be.

@asterite
Copy link
Member

If someone wants to do this, please open a separate PR that prevents many forks, but only in linux systems. On mac it should keep doing the same.

@asterite asterite closed this Dec 13, 2016
@asterite
Copy link
Member

I finally implemented something similar at c792139 . It's consistently faster on mac, and it should be much faster on linux

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants