This repository has been archived by the owner on Aug 23, 2018. It is now read-only.

Performance reduced when running on multiple cores (which is the default) #159

Closed
AntouanK opened this issue Apr 12, 2017 · 38 comments

@AntouanK

Hi.

Noticed this issue when I was working in parallel on the same project, on my PC and my MacBook Pro.
For some weird reason, a 3-year-old MBP was compiling faster than a brand new 16-core PC.
For example:

MBP:
[screenshot: screen shot 2017-04-12 at 18 08 58]

PC:
[screenshot: screenshot_2017-04-12_18-07-42]

After the tip from @eeue56 on the Elm Slack channel to use "sysconfcpus", I saw a huge boost.
On the Ryzen PC with Linux, with one core (i.e. with "sysconfcpus -n 1"), I can run "make build" in ~10.4 seconds!
(On the Mac, "sysconfcpus -n 1" makes no difference.)

So, how come the same process is ~50% slower when running on 16 cores than on one?
Is there anything I can do to make the compiler take advantage of the multiple cores?

Thanks.

@process-bot

Thanks for the issue! Make sure it satisfies this checklist. My human colleagues will appreciate it!

Here is what to expect next, and if anyone wants to comment, keep these things in mind.

@evancz
Contributor

evancz commented Apr 12, 2017

The compiler should already be taking advantage of the cores you have. Here are some things that may be relevant:

  • Haskell introduced a new IO scheduler with GHC 7.8.1 which was targeted at weird performance when you have lots of cores. Unless you built Elm in some crazy way, you should have this.

  • The issue "Elm compilation is incredibly slow on CI platforms" (elm/compiler#1473) reported slow builds, and the root issue seems to be that Haskell's getNumProcessors reports what is available to the system, which may not be available to the actual process. This appears to slow down Haskell's scheduler, which makes sense. If you can, it'd be great to know what getNumProcessors produces on your Linux machine! (A quick way to check is sketched below.)
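
Here is a minimal sketch for checking (both functions live in GHC.Conc; CheckCores.hs is just a placeholder filename):

import GHC.Conc (getNumCapabilities, getNumProcessors)

main :: IO ()
main = do
  procs <- getNumProcessors   -- what the OS reports to the RTS
  caps  <- getNumCapabilities -- how many capabilities the RTS will actually use
  putStrLn ("getNumProcessors:   " ++ show procs)
  putStrLn ("getNumCapabilities: " ++ show caps)

Compile with ghc -threaded CheckCores.hs and run the resulting binary with and without sysconfcpus.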

That said, I need more information to improve things in a directed way. CPU may not actually be the core issue. Maybe a lot more RAM is used in one case, and GC is causing the CPU usage as a result? Maybe transferring data between all the different cores is costly, so maybe there's some way to get information on that?

So here's the plan. When 0.19 alpha is out, please test this again and see if the issue persists. In the meantime, if you find any additional information, let me know about it in an organized way. (I.e. work through it on slack, and prefer one coherent and concise comment over ten scattered comments as you learn more.)

Again, the compiler is designed to use as many cores as possible already, so something weird must be going on. Ultimately, I want the compiler to be super fast, so thanks for reporting this and helping make sense of this!

@AntouanK
Author

@evancz Thanks for the response.

Unless you built Elm in some crazy way, you should have this.

The elm-make I use comes from the npm install. So that's, I guess, the "normal" binary.

reports what is available to the system, which may not be available to the actual process

When I run elm-make normally, I see all 16 cores go to 99%. So it means that the process can see them all, right?
And still, the total compile time is slower. So there's definitely something going wrong.

I'm not a Haskell programmer, so I don't know where to look.

Let me know what tests you'd like me to run, and I'm happy to help with that.
I'll definitely try 0.19 alpha when it comes out.

@evancz
Contributor

evancz commented Apr 12, 2017

Cool, yeah, the npm one should be good.

If you want to look into it more now, please ask @eeue56 for help on Slack. We can proceed without Haskell stuff. For example, knowing about memory usage during compilation can help. Maybe it uses 10MB on your laptop and 100MB on your PC. That would be interesting and helpful to know. Knowing about cache misses may help as well. That kind of thing. I suspect I'll need to get on a machine like yours to test things out, but exploring these things may be helpful or personally interesting nonetheless!
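
One rough way to get that memory number on Linux (assuming GNU time is installed as the standalone /usr/bin/time; the "Maximum resident set size" line in its output is the interesting one):

$ /usr/bin/time -v elm-make src/App.elm --output dist/elm.js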

@AntouanK
Author

As you can see in the screenshots, there's plenty of memory available.
I'll ping @eeue56 to see if we can maybe get other details out.

In theory, how much would every extra core speed up the compile time?
Since the new Ryzen CPUs are cheap and can run lots of threads (16 on the 7 series), it would be a huge gain for the devs who buy them.

@AntouanK
Author

AntouanK commented Apr 12, 2017

Regarding cache misses

  • with "sysconfcpus"
$ perf stat -e L1-dcache-loads -e L1-dcache-load-misses make build

sysconfcpus -n 1 elm-make src/App.elm --yes --output dist/elm.js
Success! Compiled 106 modules.                                      
Successfully generated dist/elm.js

 Performance counter stats for 'make build':

    25,389,023,513      L1-dcache-loads:u                                           
        24,876,267      L1-dcache-load-misses:u   #    0.10% of all L1-dcache hits  

      10.811931503 seconds time elapsed
  • without
$ perf stat -e L1-dcache-loads -e L1-dcache-load-misses elm-make src/App.elm --output dist/elm.js
Success! Compiled 106 modules.                                      
Successfully generated dist/elm.js

 Performance counter stats for 'elm-make src/App.elm --output dist/elm.js':

    46,763,012,114      L1-dcache-loads:u                                           
       148,834,742      L1-dcache-load-misses:u   #    0.32% of all L1-dcache hits  

      15.228556020 seconds time elapsed

@AntouanK
Author

And you can see it uses all cores when I run it without sysconfcpus -n 1:

[screenshot: screenshot_2017-04-12_23-31-38]

@eeue56

eeue56 commented Apr 16, 2017

We took a look through this on Slack. tl;dr:

  • Same issue as reported in "Elm compilation is incredibly slow on CI platforms" (elm/compiler#1473)
  • Happens on all projects. We used elm-css to test.
  • Seems to affect all Linux machines. Reproducible across the board.
  • OS X seems to get along just fine using multiple cores; performance is comparable to Linux with a single core, however.
  • getNumProcessors is reported correctly. If you use sysconfcpus -n n, it will report n.
  • Memory usage seems about the same.
  • A large number of context switches is triggered on some machines, increasing roughly linearly with core count. I'd say this seems to be related to the issue at a high level.
  • Increased instruction count, scaling the same way as the context switches.

Having discussed with @AntouanK, things are currently in a "liveable" state for them (decent elm-make times), so we will take a look again after 0.19 is released and try to dig in a bit better then.

Some numbers from my chromebook:

With two cores enabled, getNumProcessors = 2

noah@noah-Swanky:~/dev/elm-css$ perf stat elm-make
Success! Compiled 35 modules.                                       

 Performance counter stats for 'elm-make':

      21781.649106      task-clock (msec)         #    1.262 CPUs utilized          
            17,875      context-switches          #    0.821 K/sec                  
                89      cpu-migrations            #    0.004 K/sec                  
            20,060      page-faults               #    0.921 K/sec                  
    23,485,501,602      cycles                    #    1.078 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    19,333,470,309      instructions              #    0.82  insns per cycle        
     3,577,546,896      branches                  #  164.246 M/sec                  
       191,048,855      branch-misses             #    5.34% of all branches        

      17.262226728 seconds time elapsed

With a single core enabled, getNumProcessors = 1

noah@noah-Swanky:~/dev/elm-css$ perf stat sysconfcpus -n 1 elm-make
Success! Compiled 35 modules.                                       

 Performance counter stats for 'sysconfcpus -n 1 elm-make':

      15567.244782      task-clock (msec)         #    1.002 CPUs utilized          
             1,823      context-switches          #    0.117 K/sec                  
                32      cpu-migrations            #    0.002 K/sec                  
            19,005      page-faults               #    0.001 M/sec                  
    16,821,342,129      cycles                    #    1.081 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    15,000,183,543      instructions              #    0.89  insns per cycle        
     2,714,811,983      branches                  #  174.393 M/sec                  
       168,684,092      branch-misses             #    6.21% of all branches        

      15.537166525 seconds time elapsed

@AntouanK
Author

Just for comparison, on a beefier CPU (again on the elm-css project).

With 16 cores enabled (default)

Performance counter stats for 'elm-make':

      50921.822910      task-clock:u (msec)       #    8.289 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            16,039      page-faults:u             #    0.315 K/sec                  
   101,591,919,919      cycles:u                  #    1.995 GHz                      (83.31%)
     5,486,939,866      stalled-cycles-frontend:u #    5.40% frontend cycles idle     (83.46%)
     8,788,128,333      stalled-cycles-backend:u  #    8.65% backend cycles idle      (83.32%)
    43,475,312,872      instructions:u            #    0.43  insn per cycle         
                                                  #    0.20  stalled cycles per insn  (83.41%)
    10,120,790,403      branches:u                #  198.752 M/sec                    (83.17%)
       164,522,472      branch-misses:u           #    1.63% of all branches          (83.36%)

       6.143030403 seconds time elapsed

With a single core enabled, sysconfcpus -n 1

Performance counter stats for 'sysconfcpus -n 1 elm-make':

       2858.615621      task-clock:u (msec)       #    0.975 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            18,096      page-faults:u             #    0.006 M/sec                  
     8,755,915,430      cycles:u                  #    3.063 GHz                      (83.06%)
     2,272,539,762      stalled-cycles-frontend:u #   25.95% frontend cycles idle     (83.65%)
     2,043,003,105      stalled-cycles-backend:u  #   23.33% backend cycles idle      (83.42%)
    14,712,522,371      instructions:u            #    1.68  insn per cycle         
                                                  #    0.15  stalled cycles per insn  (83.47%)
     2,783,266,581      branches:u                #  973.641 M/sec                    (83.11%)
        67,444,081      branch-misses:u           #    2.42% of all branches          (83.34%)

       2.932051381 seconds time elapsed

evancz closed this as completed Apr 17, 2017
evancz reopened this Apr 17, 2017
@evancz
Contributor

evancz commented Apr 17, 2017

Okay, after thinking about it more, I think it makes sense to have a meta issue to try to track the problem. It looks like it's related to some general problem that we haven't been able to pin down yet.

I'm not sure where the meta issue should live, so I'm just going to leave it for now.

@jcberentsen

Could it be related to this many-core Haskell runtime issue? It might be worth trying to run elm-make +RTS -qg to turn parallel GC off? Possibly also with -A50M?
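
For anyone trying this, the invocation would look something like the line below, though it only works if the elm-make binary was linked with -rtsopts; otherwise the RTS rejects the flags:

$ elm-make src/App.elm --output dist/elm.js +RTS -qg -A50M -RTS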

@Augustin82

I'm encountering the same problem.

I'm using Ubuntu 16.04.2 LTS 64-bit on a Dell XPS15 (16GB RAM, i7-7700HQ CPU @ 2.80GHz × 8).

Here's the output of elm-make --yes against the elm-css repo:

$ sudo rm -rf elm-stuff && sudo perf stat elm-make --yes
Starting downloads...

  ● rtfeldman/elm-css-util 1.0.2
  ● rtfeldman/hex 1.0.0
  ● elm-lang/core 5.1.1

Packages configured successfully!
Success! Compiled 35 modules.                                       

 Performance counter stats for 'elm-make --yes':

      18956,377203      task-clock (msec)         #    3,297 CPUs utilized          
           899 431      context-switches          #    0,047 M/sec                  
               961      cpu-migrations            #    0,051 K/sec                  
            24 012      page-faults               #    0,001 M/sec                  
    60 360 847 025      cycles                    #    3,184 GHz                    
    42 589 205 040      instructions              #    0,71  insn per cycle         
     8 429 877 689      branches                  #  444,699 M/sec                  
        59 846 870      branch-misses             #    0,71% of all branches        

       5,749273986 seconds time elapsed
$ sudo rm -rf elm-stuff && sudo sysconfcpus -n 1 perf stat elm-make --yes
Starting downloads...

  ● rtfeldman/hex 1.0.0
  ● rtfeldman/elm-css-util 1.0.2
  ● elm-lang/core 5.1.1

Packages configured successfully!
Success! Compiled 35 modules.                                       

 Performance counter stats for 'elm-make --yes':

       2523,628536      task-clock (msec)         #    0,624 CPUs utilized          
               765      context-switches          #    0,303 K/sec                  
                45      cpu-migrations            #    0,018 K/sec                  
            25 436      page-faults               #    0,010 M/sec                  
     8 034 175 352      cycles                    #    3,184 GHz                    
    16 722 884 588      instructions              #    2,08  insn per cycle         
     3 040 339 979      branches                  # 1204,749 M/sec                  
        53 767 368      branch-misses             #    1,77% of all branches        

       4,047055038 seconds time elapsed

@rtfeldman

rtfeldman commented Jul 28, 2017

I dove deeper into this. It turns out GHC has some, ahem, conservative default garbage collection settings for parallelized workloads. (Example: the nursery gets a whole megabyte!)

Simon Marlow and others have looked into this and have tuned the default settings for the next release of GHC 8. Fortunately, we can tune things right now using flags in elm-compiler.cabal.

tl;dr

Try adding -with-rtsopts="-A16M -qg" to the end of this ghc-options: line in elm-compiler.cabal.
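
In context, the line might end up looking roughly like this (the surrounding flags are illustrative, not the actual elm-compiler.cabal contents; note that the whole option is one quoted token, so cabal hands -A16M and -qg to the RTS together):

ghc-options: -threaded -O2 -W "-with-rtsopts=-A16M -qg"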

What this does

  • -A16M increases the nursery size from the default of 1MB to 16MB. Why 16MB? The highest I saw was Simon Marlow mentioning using -A256M. Someone reported faster and faster speeds as they dialed this up incrementally to 128M, but with a big increase in memory usage along the way. I chose 16MB because the person who tried that reported it seemed like the "sweet spot" - memory usage increased, but wasn't crazy, and build time decreased a lot. The GHC folks seem to be changing GHC defaults to special-case -A values of 16M and up, as well as 32M and up, but they are staying conservative and keeping the default -A at 1M. It is probably worth people trying out different values here, but I wanted to pick one to suggest, so I picked 16M.
  • -qg disables parallel GC. People have reported speed gains from this. As an alternative to -qg, we could try using -n4m -qb0 instead (more on this below), which is a combination of flags that should improve parallel GC performance. However, I suspect it may be better to turn it off altogether; Simon Marlow suggested there may be an as-yet-unidentified bug in the parallel GC's multithreading logic. Since using sysconfcpus to force 1 core speeds things up overall, if a parallel GC bug is the root of the problem here, it stands to reason that disabling it would give better perf than the sysconfcpus trick, because at least the compiler logic itself could still use multiple cores. It's probably worth benchmarking both ways.

Bonus perf boost for Linux

For Linux builds (not sure if this is a no-op or causes problems on non-Linux systems), Simon Marlow suggests adding the --numa flag (e.g. -with-rtsopts="-A16M -qg --numa") for another significant perf boost (according to the docs, a 10% speedup is "typical" but varies greatly by hardware and program).

Further Reading

I have read the documentation for the -H flag about 12 times. I still do not know what it does. It sounds like it could lead to reduced memory consumption. Maybe it's awesome. I'm also not sure if GHC 7.x supports it.

If we remove -qg we should also enable -qb0 (for -A32M and up, and honestly possibly also -A16M and up) which turns on load balancing in the nursery for parallel GC. Simon Marlow recommended this. We should also definitely enable -n4m; the current GHC now defaults to -n4m for -A16M or higher.
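
Spelled out, that alternative (keeping the parallel GC but tuning it, instead of disabling it with -qg) would be a hypothetical line like:

ghc-options: -threaded -O2 -W "-with-rtsopts=-A32M -n4m -qb0"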

Longer Version

I put the long version of what I learned into a gist.

@AntouanK
Author

Thank you @rtfeldman for looking into this.

If I remember correctly, compiling elm-make is not a one-line thing.
If anyone does try it with those flags, can you upload the binaries somewhere so we can try them as well?

@eeue56

eeue56 commented Jul 29, 2017

So, I ran almost all the combinations suggested with different values. I've put the most useful raw numbers into this gist, benchmarking against elm-css on a Thinkpad T470s.

Conclusion: the best option is "-with-rtsopts=-qg". No amount of -An could beat just disabling the parallel GC. While there is some benefit from using -An, it only starts to kick in with a large value, e.g. -A64M. That seems like a large amount of memory to reserve for a single elm-make process, and it still did not perform as well as -qg. Other things such as -n4m likewise saw little or no performance change.

Make sure to use "-with-rtsopts=-A16M -qg" when checking, and not -with-rtsopts="-A16M -qg" -- otherwise -qg will be parsed as a non-RTS option.

tl;dr

"-with-rtsopts=-qg" is the best option for performance on my machine.

@AntouanK
Author

@eeue56 How many cores does that machine have?
Since this is an issue that worsens the more cores you have available, it would be good to test it against a CPU with many.
I can run it on a Ryzen 1700; it has 16 threads.
Let me know if you can make a gist with instructions on what combinations to try, and I'll do it this weekend.

@eeue56

eeue56 commented Jul 29, 2017

It is this CPU: https://ark.intel.com/products/97466/Intel-Core-i7-7600U-Processor-4M-Cache-up-to-3_90-GHz

So, in those terms, 4 logical cores (2 physical cores, 2 more from hyperthreading).

If you look at the gist I made, it has all the combinations that had any form of noticeable impact. To build:

  • Install ghc 7.10
  • Download https://github.com/elm-lang/elm-platform/blob/master/installers/BuildFromSource.hs
  • Run runhaskell BuildFromSource.hs 0.18
  • cd into Elm-Platform/0.18
  • Modify elm-package/elm-package.cabal, elm-make/elm-make.cabal, and elm-compiler/elm-compiler.cabal to have the options. Your ghc-options should look something like -threaded -O2 -W "-with-rtsopts=-A16M -qg". Note where the quotes are.
  • Run cabal install elm-make
  • Run perf stat <dir-with-elm-platform>/Elm-Platform/0.18/.cabal-sandbox/bin/elm-make in the elm-css dir. Make sure to wipe elm-stuff/build-artifacts, but not elm-stuff/packages, before each build. (A condensed sketch of the whole sequence is below.)
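
Condensed into commands, the loop looks roughly like this (a sketch; the paths depend on where you run it):

$ runhaskell BuildFromSource.hs 0.18
$ cd Elm-Platform/0.18
$ # edit the three .cabal files to add the ghc-options shown above
$ cabal install elm-make
$ cd path/to/elm-css
$ rm -rf elm-stuff/build-artifacts
$ perf stat <dir-with-elm-platform>/Elm-Platform/0.18/.cabal-sandbox/bin/elm-make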

If you have other questions, this discussion is better continued in #elm-dev on Slack.

@rtfeldman

@AntouanK if you do build those, can you save the different binaries you make so we can post them somewhere for others to try? A larger sample size will give us more confidence!

@AntouanK
Author

Thanks @eeue56
Here are my results: https://gist.github.com/AntouanK/1ecec02e08b90be54463d4ad6f0efcab
Still, it confuses me how it's possible for 1 core to outperform 16 running in parallel.
Even with the flag "-with-rtsopts=-qg", if I switch to just one core, it runs faster.
It seems to me that the best solution is to restrict it to one core and not let it multi-thread. I don't know how easy that would be.

@AntouanK
Author

AntouanK commented Jul 29, 2017

@rtfeldman Seems like there should be an easy script for this.
I made a Docker container with Haskell 7.10, mounted a local directory of elm-css, and then did 3 things repeatedly:

  • sed to change the flags in the cabal files and runhaskell ... again
  • rm -rf the build artifacts in elm-css
  • run perf stat .../elm-make

So should we just make a script instead?
Or should I just upload the binaries to a repo?

@rtfeldman

@AntouanK uploading the binaries would be fantastic! I think that would definitely be the easiest for others. 😄

Since sysconfcpus still improved things, I'm curious what happens if you try these RTS options:

  • -N1 (tells GHC to only use one core for the program as well as the GC; should be equivalent to sysconfcpus -n 1)
  • -N16 (tells GHC to use 16 cores, instead of the 32 it might be trying to use due to hyperthreading; this comment suggests that might be a thing)
  • -N12 to see if telling it to use many cores, but fewer than 16, is better than telling it to use 1

Among these, I think it'd be most useful to try -N1 -qg, but if you have time, trying all 3 would be great. Even better would be trying 6 combinations - using the different -N flags with -qb0 -n4m instead of -qg. 😄

Saving binaries for anything you do there would be great as well!

@AntouanK
Author

AntouanK commented Jul 29, 2017

@rtfeldman With my poor scripting skills, I got a loop going.
Binaries and perf stat outputs are here : https://github.com/AntouanK/elm-make_perf

So far, it seems like the flags don't make a big difference.
Using one core does.

elm-css

Perf name Time
multicore 5.502 sec
onecore 3.074 sec
-A16M_-qb0.multicore 3.136 sec
-A16M_-qb0.onecore 2.776 sec
-A16M_-qg.multicore 3.230 sec
-A16M_-qg.onecore 2.744 sec
-A32M_-qb0.multicore 3.131 sec
-A32M_-qb0.onecore 2.751 sec
-A32M_-qg.multicore 3.030 sec
-A32M_-qg.onecore 2.744 sec
-A64M_-qb0.multicore 3.327 sec
-A64M_-qb0.onecore 2.754 sec
-A64M_-qg.multicore 3.201 sec
-A64M_-qg.onecore 2.790 sec
-N1_-qb0_-n4m.multicore 3.123 sec
-N1_-qb0_-n4m.onecore 2.748 sec
-N1_-qg.multicore 3.201 sec
-N1_-qg.onecore 2.749 sec
-N12_-qb0_-n4m.multicore 3.192 sec
-N12_-qb0_-n4m.onecore 2.744 sec
-N12_-qg.multicore 3.135 sec
-N12_-qg.onecore 2.767 sec
-N16_-qb0_-n4m.multicore 3.191 sec
-N16_-qb0_-n4m.onecore 2.777 sec
-N16_-qg.multicore 3.214 sec
-N16_-qg.onecore 2.734 sec
-qg.multicore 3.006 sec
-qg.onecore 2.761 sec

@AntouanK
Author

AntouanK commented Jul 30, 2017

Added results for elm-spa-example as well:

Perf name Time % from multicore
multicore 9.234 sec
onecore 6.385 sec 69.15%
-A16M_-qb0.multicore 7.451 sec 80.69%
-A16M_-qb0.onecore 6.033 sec 65.33%
-A16M_-qg.multicore 7.431 sec 80.47%
-A16M_-qg.onecore 6.151 sec 66.61%
-A32M_-qb0.multicore 7.169 sec 77.64%
-A32M_-qb0.onecore 6.106 sec 66.13%
-A32M_-qg.multicore 7.068 sec 76.54%
-A32M_-qg.onecore 6.266 sec 67.86%
-A64M_-qb0.multicore 7.559 sec 81.86%
-A64M_-qb0.onecore 6.219 sec 67.35%
-A64M_-qg.multicore 7.153 sec 77.46%
-A64M_-qg.onecore 6.015 sec 65.14%
-N12_-qb0_-n4m.multicore 7.301 sec 79.07%
-N12_-qb0_-n4m.onecore 6.123 sec 66.31%
-N12_-qg.multicore 7.373 sec 79.85%
-N12_-qg.onecore 6.297 sec 68.19%
-N16_-qb0_-n4m.multicore 7.586 sec 82.15%
-N16_-qb0_-n4m.onecore 6.340 sec 68.66%
-N16_-qg.multicore 7.248 sec 78.49%
-N16_-qg.onecore 6.139 sec 66.48%
-N1_-qb0_-n4m.multicore 7.543 sec 81.69%
-N1_-qb0_-n4m.onecore 6.131 sec 66.40%
-N1_-qg.multicore 7.561 sec 81.88%
-N1_-qg.onecore 6.097 sec 66.03%
-qg.multicore 7.330 sec 79.38%
-qg.onecore 6.386 sec 69.16%

@rtfeldman

Hm, this is very strange. As I recall @eeue56 was seeing like 2-3 second build times on elm-spa-example without sysconfcpus, building on less powerful hardware.

I wonder what the reason is for the discrepancy. 🤔

@AntouanK
Author

AntouanK commented Jul 30, 2017

That's with elm-make src/Main.elm?

I tried it again.
I get 8.6 seconds with the normal elm-make, and 6.2 with one core.

@rtfeldman

rtfeldman commented Sep 15, 2017

I did some more digging around in the docs, and found some more potential answers as to why single-core is outperforming multicore.

-s for Benchmarking Statistics

Before getting into those, it's worth noting that the -s flag (and related flags) print out a wealth of runtime performance statistics after the run. This would be really helpful for these benchmarks!
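
Since the released binary is not built with -rtsopts, the statistics flag would have to be baked in at compile time, e.g. with a hypothetical cabal line like:

ghc-options: -threaded -O2 -W "-with-rtsopts=-s"

On a binary that is built with -rtsopts, plain elm-make +RTS -s -RTS does the same at run time.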

Flags that possibly should always be enabled for elm-make

  • The --numa flag could have major benefits for Linux in particular. (It seems like it shouldn't be enabled on non-Linux builds though.) As far as I can tell, it's just a straight-up upgrade for Linux builds, and could very well explain the discrepancy between Linux and macOS performance characteristics.
  • The -feager-blackholing flag apparently should always be turned on, according to the docs. ("We recommend compiling any code that is intended to be run in parallel with the -feager-blackholing flag.") Note that it only has any effect if -threaded is enabled, so we'll need to keep that enabled in order to see what it does. (It doesn't sound like it should be much - in the neighborhood of 1-2%, according to the docs, but who knows?)
  • The -fllvm flag seems worth enabling in general, and may separately impact this issue. It enables using LLVM to get additional binary performance optimizations. ("Compile via LLVM instead of using the native code generator. This will generally take slightly longer than the native code generator to compile. Produced code is generally the same speed or faster than the other two code generators.") The one caveat is that compiling via LLVM requires LLVM’s opt and llc executables to be in PATH, so this may not be worth the effort in comparison to the other options.
  • The -threaded flag enables a parallel version of the runtime. ("The threaded runtime system is so-called because it manages multiple OS threads, as opposed to the default runtime system which is purely single-threaded.") It turns out that if this is disabled, then -N is silently ignored. The fact that it is disabled by default would explain why -N never had any effect on the previous benchmarks. I don't know why this isn't enabled by default, considering some docs say "In order to make use of multiple CPUs, your program must be linked with the -threaded option" (which empirically seems to be false, since we've seen more CPUs engage when sysconfcpus is not used to limit them), but this very much seems worth trying at any rate. EDIT: It appears elm-make is already using -threaded. So it looks like the reason sysconfcpus has an effect is that elm-make is setting the capability count manually (sketched below). That line of code would become unnecessary if we used -N (no number), which would make GHC infer it on startup, but it appears that would do nothing more than save a line of code; GHC infers the same number as what we get by calling this manually.
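
That manual capability setup presumably amounts to something like this sketch (based on this discussion, not the actual elm-make source; it needs the threaded RTS, i.e. ghc -threaded):

import GHC.Conc (getNumCapabilities, getNumProcessors, setNumCapabilities)

main :: IO ()
main = do
  n <- getNumProcessors  -- every processor the OS reports
  setNumCapabilities n   -- tell the RTS to use them all; redundant if linked with -with-rtsopts=-N
  caps <- getNumCapabilities
  putStrLn ("running on " ++ show caps ++ " capabilities")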

Flags that deserve experimentation

  • The -ki flag controls default stack size for new threads. Thread stacks live on the heap, apparently, and the default is 1K (GHC is once again conservative). They allocate more memory as they grow (in 32K chunks by default, which is also configurable via the -kc flag, if we care), and then GC it when they're no longer using it. This could explain why throwing more cores at the workload degrades performance; if each of these threads is so resource constrained that it ends up doing a ton of allocation and GC to overcome its 1K default stack size, they might be so much slower than the main thread (which doesn't have this problem) that it leads to a net slowdown.
  • The -H flag sets the default heap size for the garbage collector. The default is 0, and I suspect we can experimentally find a better choice than that, but then again -H and -A are related, and tweaking -A didn't seem to do anything. Still, could be worth a shot since the default seems so bad.
  • The -I flag could be relevant; it controls how often GHC collects garbage when idle. By default it runs every 0.3 seconds, and we can disable it with -I0. (A combined example of these knobs follows below.)
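
Combining these knobs into one benchmarkable configuration might look like the hypothetical line below (the values are starting points to try, not recommendations):

ghc-options: -threaded -O2 -W "-with-rtsopts=-A16M -ki4k -I0 -qg"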

Quick Aside

I also learned a bit about GHC's default pre-emption settings.

GHC implements pre-emptive multitasking: the execution of threads are interleaved in a random fashion. More specifically, a thread may be pre-empted whenever it allocates some memory [...]
The rescheduling timer runs on a 20ms granularity by default, but this may be altered using the -i RTS option. [...]

This sounds like a potential contributing factor, but unfortunately, these docs may be out of date. The RTS docs don't list an -i flag. However, there are plenty of other knobs to try!

@eeue56 and @AntouanK - if you could try the above flags out on your benchmarks, it would be so amazing!!! I feel some combination of these may be the answer we've been looking for! 😄

Note: I'm not actually sure how to verify these are enabled. I linked to the docs, but as Noah learned, the exact way you specify them can be tricky, and GHC has a habit of silently ignoring flags it doesn't understand.

@rtfeldman

As an aside, I'm curious if this is why macOS programs are doing better: maybe GHC thinks they all have 1 core.

I ran this on my MacBook Pro (2 physical cores + 2 HT):

module Main where

import Control.Concurrent

main :: IO ()
main = do
  capabilities <- getNumCapabilities
  putStrLn $ show capabilities

It printed 1. 😕

Curious what others see when running this.

@rtfeldman

Nope, that's not it. I also ran it on NixOS with 8 physical cores (Ryzen) and it also printed 1. 😖

@jmitchell

@rtfeldman, I get the same on my 8-core AMD under NixOS. Perhaps it's set to 1 by default everywhere, unless overridden with setNumCapabilities or the +RTS -N runtime flag mentioned in the setNumCapabilities doc string:

Set the number of Haskell threads that can run truly simultaneously (on separate physical processors) at any given time. The number passed to forkOn is interpreted modulo this value. The initial value is given by the +RTS -N runtime flag.

@rtfeldman

@jmitchell yep, I just confirmed that - it prints out whatever -N flag I give it, but defaults to 1.

@rtfeldman

This is even more confusing, then. Why would sysconfcpus have any effect at all if GHC always thinks there is only 1 core available unless you tell it otherwise? 🤔

@jmitchell

Ah, because setNumCapabilities is applied according to GHC.Conc.getNumProcessors, which on my machine returns 8.

@rtfeldman

rtfeldman commented Sep 16, 2017

Ahh, that's why!

I learned from Brian McKenna on Twitter that if you pass -N by itself (no number specified), GHC will infer that for you on startup, making that getNumProcessors line unnecessary. I verified locally that it works; on my MacBook Pro it reported 4, and on my NixOS Ryzen machine it reported 8.

Then I set up a test on Travis to see if maybe -N would lead getNumCapabilities to report a better number than getNumProcessors (which reports 32 cores available, when actually only 2 are available to the virtualized environment on Travis), but sadly the -N approach also reports 32 on Travis. 😞

@andys8

andys8 commented Sep 16, 2017

There is also the chance that the cause of the stack overflow issue #164 is related, if the configured values are relative to the total amount of memory available on a machine, too.

@zwilias
Member

zwilias commented Sep 20, 2017

For future posterity: building elm-make with -rtsopts in the GHC options allows setting the RTS options on execution, rather than at compile time.

https://downloads.haskell.org/~ghc/7.8.4/docs/html/users_guide/runtime-control.html
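
A sketch of the two halves of that: build once with something like

ghc-options: -threaded -rtsopts

and then experiment freely at run time, e.g.:

$ elm-make src/Main.elm +RTS -s -A16M -qg -RTS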

@andys8

andys8 commented Sep 20, 2017

@zwilias Any chance to get this flag in by default? (Any implications?) Otherwise a workaround would have to rely on an elm-make fork.

@andys8

andys8 commented Oct 2, 2017

#179

@evancz
Contributor

evancz commented Mar 7, 2018

I want to make elm/compiler#1473 the canonical issue for this. It probably makes sense to summarize the things that need to happen into a meta issue though. I will coordinate with @eeue56 and @zwilias to get a list of TODO items that should be in that.
