Performance reduced when running on multi-core (which is by default) #159
Comments
process-bot commented Apr 12, 2017

Thanks for the issue! Make sure it satisfies this checklist. My human colleagues will appreciate it!
Here is what to expect next, and if anyone wants to comment, keep these things in mind.
evancz (Contributor) commented Apr 12, 2017

The compiler should already be taking advantage of the cores you have. Here are some things that may be relevant:
- Haskell introduced a new IO scheduler in GHC 7.8.1 that targeted weird performance when you have lots of cores. Unless you built Elm in some crazy way, you should have this.
- elm/compiler#1473 reported slow builds, and the root issue seems to be that Haskell's `getNumProcessors` reports what is available to the system, which may not be available to the actual process. This appears to slow down Haskell's scheduler, which makes sense. If you can, it'd be great to know what `getNumProcessors` produces on your Linux machine!
That said, I need more information to improve things in a directed way. CPU may not actually be the core issue. Maybe a bunch more RAM is used in one case? And maybe GC is causing the CPU usage as a result? Maybe transferring data between all the different cores is costly, so maybe there's some way to get information on that?
So here's the plan. When 0.19 alpha is out, please test this again and see if the issue persists. In the meantime, if you find any additional information, let me know about it in an organized way. (I.e. work through it on slack, and prefer one coherent and concise comment over ten scattered comments as you learn more.)
Again, the compiler is designed to use as many cores as possible already, so something weird must be going on. Ultimately, I want the compiler to be super fast, so thanks for reporting this and helping make sense of this!
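For anyone who wants to report that number, a minimal probe looks something like this (my sketch, not code from this thread; `Probe.hs` is a made-up file name):

```haskell
-- Probe.hs: print the processor count GHC sees, i.e. the value
-- of getNumProcessors discussed above.
import GHC.Conc (getNumProcessors)

main :: IO ()
main = do
  n <- getNumProcessors
  putStrLn ("getNumProcessors = " ++ show n)
```

Compile with `ghc Probe.hs` and run `./Probe`; under `sysconfcpus -n 1 ./Probe` the reported number should drop accordingly.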
AntouanK commented Apr 12, 2017

@evancz Thanks for the response.

> Unless you built Elm in some crazy way, you should have this.

The `elm-make` I use comes from the npm install, so that's, I guess, the "normal" binary.

> reports what is available to the system, which may not be available to the actual process

When I run `elm-make` normally, I see all 16 cores go to 99%. So that means the process can see them all, right?
And still, the total compile time is slower, so there's definitely something going wrong.
I'm not a Haskell programmer, so I don't know where to look.
Let me know what tests you'd like me to run, and I'm happy to help with that.
I'll definitely try the 0.19 alpha when it comes out.
evancz (Contributor) commented Apr 12, 2017

Cool, yeah, the npm one should be good.
If you want to look into it more now, please ask @eeue56 for help on slack. We can proceed without Haskell stuff. For example, knowing about memory usage during compilation can help. Maybe it uses 10mb on your laptop and 100mb on your PC. That would be interesting and helpful to know. Knowing about cache misses may help as well. That kind of thing. I suspect I'll need to get on a machine like yours to test things out, but exploring these things may be helpful or personally interesting nonetheless!
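If anyone wants to pull such numbers out of a Haskell process directly, here is a rough sketch (assuming a GHC with `GHC.Stats.getRTSStats`, i.e. 8.2 or later; the GHC 7.10 used for Elm 0.18 has the older `getGCStats` instead, and either way the binary must be built with `-rtsopts` and run with `+RTS -T` so that statistics are collected):

```haskell
-- MemProbe.hs: print a few GC statistics after running a workload.
-- Build: ghc -rtsopts MemProbe.hs
-- Run:   ./MemProbe +RTS -T
import GHC.Stats (RTSStats (..), getRTSStats)

main :: IO ()
main = do
  -- (the workload being measured would go here)
  stats <- getRTSStats
  putStrLn ("GCs run:          " ++ show (gcs stats))
  putStrLn ("Max live bytes:   " ++ show (max_live_bytes stats))
  putStrLn ("GC CPU time (ns): " ++ show (gc_cpu_ns stats))
```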
AntouanK commented Apr 12, 2017

As you can see in the screenshots, there's plenty of memory available.
I'll ping @eeue56 to see if we can maybe get other details out.
In theory, how much would every extra core speed up the compile time?
Since the new Ryzen CPUs are cheap and can run lots of threads (16 on the 7 series), it would be a huge gain for the devs who buy them.
AntouanK commented Apr 12, 2017

Regarding cache misses:

With `sysconfcpus`:

```
$ perf stat -e L1-dcache-loads -e L1-dcache-load-misses make build
sysconfcpus -n 1 elm-make src/App.elm --yes --output dist/elm.js
Success! Compiled 106 modules.
Successfully generated dist/elm.js

 Performance counter stats for 'make build':

    25,389,023,513      L1-dcache-loads:u
        24,876,267      L1-dcache-load-misses:u   #    0.10% of all L1-dcache hits

      10.811931503 seconds time elapsed
```

Without:

```
$ perf stat -e L1-dcache-loads -e L1-dcache-load-misses elm-make src/App.elm --output dist/elm.js
Success! Compiled 106 modules.
Successfully generated dist/elm.js

 Performance counter stats for 'elm-make src/App.elm --output dist/elm.js':

    46,763,012,114      L1-dcache-loads:u
       148,834,742      L1-dcache-load-misses:u   #    0.32% of all L1-dcache hits

      15.228556020 seconds time elapsed
```
eeue56 commented Apr 16, 2017

We took a look through this on Slack. tl;dr:

- Same issue as reported in elm/compiler#1473.
- Happens on all projects. We used elm-css to test.
- Seems to affect all Linux machines. Reproducible across the board.
- OS X seems to get along just fine using multiple cores; its performance is comparable to Linux with a single core, however.
- `getNumProcessors` is reported correctly. If you use `sysconfcpus -n n`, it will report `n`.
- Memory usage seems about the same.
- A large number of context switches is triggered on some machines, increasing roughly linearly with the number of cores. I'd say this seems related to the issue at a high level.
- An increased number of instructions, scaling the same way as the context switches.

Discussed with @AntouanK: it's currently at a "liveable" state for them (decent `elm-make` times). So we will take a look again after 0.19 is released and try to dig in a bit better then.
Some numbers from my chromebook:

With two cores enabled, `getNumProcessors = 2`:

```
noah@noah-Swanky:~/dev/elm-css$ perf stat elm-make
Success! Compiled 35 modules.

 Performance counter stats for 'elm-make':

      21781.649106      task-clock (msec)         #    1.262 CPUs utilized
            17,875      context-switches          #    0.821 K/sec
                89      cpu-migrations            #    0.004 K/sec
            20,060      page-faults               #    0.921 K/sec
    23,485,501,602      cycles                    #    1.078 GHz
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    19,333,470,309      instructions              #    0.82  insns per cycle
     3,577,546,896      branches                  #  164.246 M/sec
       191,048,855      branch-misses             #    5.34% of all branches

      17.262226728 seconds time elapsed
```

With a single core enabled, `getNumProcessors = 1`:

```
noah@noah-Swanky:~/dev/elm-css$ perf stat sysconfcpus -n 1 elm-make
Success! Compiled 35 modules.

 Performance counter stats for 'sysconfcpus -n 1 elm-make':

      15567.244782      task-clock (msec)         #    1.002 CPUs utilized
             1,823      context-switches          #    0.117 K/sec
                32      cpu-migrations            #    0.002 K/sec
            19,005      page-faults               #    0.001 M/sec
    16,821,342,129      cycles                    #    1.081 GHz
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    15,000,183,543      instructions              #    0.89  insns per cycle
     2,714,811,983      branches                  #  174.393 M/sec
       168,684,092      branch-misses             #    6.21% of all branches

      15.537166525 seconds time elapsed
```
AntouanK commented Apr 16, 2017

Just for comparison, on a beefier CPU (again on the elm-css project).

With 16 cores enabled (default):

```
 Performance counter stats for 'elm-make':

      50921.822910      task-clock:u (msec)       #    8.289 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            16,039      page-faults:u             #    0.315 K/sec
   101,591,919,919      cycles:u                  #    1.995 GHz                      (83.31%)
     5,486,939,866      stalled-cycles-frontend:u #    5.40% frontend cycles idle     (83.46%)
     8,788,128,333      stalled-cycles-backend:u  #    8.65% backend cycles idle      (83.32%)
    43,475,312,872      instructions:u            #    0.43  insn per cycle
                                                  #    0.20  stalled cycles per insn  (83.41%)
    10,120,790,403      branches:u                #  198.752 M/sec                    (83.17%)
       164,522,472      branch-misses:u           #    1.63% of all branches          (83.36%)

       6.143030403 seconds time elapsed
```

With a single core enabled, `sysconfcpus -n 1`:

```
 Performance counter stats for 'sysconfcpus -n 1 elm-make':

       2858.615621      task-clock:u (msec)       #    0.975 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            18,096      page-faults:u             #    0.006 M/sec
     8,755,915,430      cycles:u                  #    3.063 GHz                      (83.06%)
     2,272,539,762      stalled-cycles-frontend:u #   25.95% frontend cycles idle     (83.65%)
     2,043,003,105      stalled-cycles-backend:u  #   23.33% backend cycles idle      (83.42%)
    14,712,522,371      instructions:u            #    1.68  insn per cycle
                                                  #    0.15  stalled cycles per insn  (83.47%)
     2,783,266,581      branches:u                #  973.641 M/sec                    (83.11%)
        67,444,081      branch-misses:u           #    2.42% of all branches          (83.34%)

       2.932051381 seconds time elapsed
```
evancz closed this Apr 17, 2017

evancz reopened this Apr 17, 2017

evancz (Contributor) commented Apr 17, 2017

Okay, after thinking about it more, I think it makes sense to have a meta issue to try to track the problem. It looks like it's related to some general problem that we haven't been able to pin down yet.
I'm not sure where the meta issue should live, so I'm just going to leave it for now.
jcberentsen commented Jun 2, 2017

Could it be related to this many-core Haskell runtime issue? It might be worth trying to run `elm-make +RTS -qg` to turn parallel GC off. Possibly also with `-A50M`? (Note that a binary only accepts such runtime flags if it was linked with `-rtsopts`; see the note from zwilias further down.)
Augustin82 commented Jun 20, 2017

I'm encountering the same problem.
I'm using Ubuntu 16.04.2 LTS 64-bit on a Dell XPS15 (16GB RAM, i7-7700HQ CPU @ 2.80GHz × 8).
Here's the output of `elm-make --yes` against the elm-css repo:

```
$ sudo rm -rf elm-stuff && sudo perf stat elm-make --yes
Starting downloads...

  ● rtfeldman/elm-css-util 1.0.2
  ● rtfeldman/hex 1.0.0
  ● elm-lang/core 5.1.1

Packages configured successfully!
Success! Compiled 35 modules.

 Performance counter stats for 'elm-make --yes':

      18956,377203      task-clock (msec)         #    3,297 CPUs utilized
           899 431      context-switches          #    0,047 M/sec
               961      cpu-migrations            #    0,051 K/sec
            24 012      page-faults               #    0,001 M/sec
    60 360 847 025      cycles                    #    3,184 GHz
    42 589 205 040      instructions              #    0,71  insn per cycle
     8 429 877 689      branches                  #  444,699 M/sec
        59 846 870      branch-misses             #    0,71% of all branches

       5,749273986 seconds time elapsed
```

```
$ sudo rm -rf elm-stuff && sudo sysconfcpus -n 1 perf stat elm-make --yes
Starting downloads...

  ● rtfeldman/hex 1.0.0
  ● rtfeldman/elm-css-util 1.0.2
  ● elm-lang/core 5.1.1

Packages configured successfully!
Success! Compiled 35 modules.

 Performance counter stats for 'elm-make --yes':

       2523,628536      task-clock (msec)         #    0,624 CPUs utilized
               765      context-switches          #    0,303 K/sec
                45      cpu-migrations            #    0,018 K/sec
            25 436      page-faults               #    0,010 M/sec
     8 034 175 352      cycles                    #    3,184 GHz
    16 722 884 588      instructions              #    2,08  insn per cycle
     3 040 339 979      branches                  # 1204,749 M/sec
        53 767 368      branch-misses             #    1,77% of all branches

       4,047055038 seconds time elapsed
```
rtfeldman commented Jul 28, 2017

I dove deeper into this. It turns out GHC has some, ahem, conservative default garbage collection settings for parallelized workloads. (Example: the nursery gets a whole megabyte!)
Simon Marlow and others have looked into this and have tuned the default settings for the next release of GHC 8. Fortunately, we can tune things right now using flags in `elm-compiler.cabal`.
**tl;dr**

Try adding `-with-rtsopts="-A16M -qg"` to the end of this `ghc-options:` line in `elm-compiler.cabal`.
**What this does**

- `-A16M` increases the nursery size from the default of 1MB to 16MB. Why 16MB? The highest I saw was Simon Marlow mentioning using `-A256M`. Someone reported faster and faster speeds as they dialed this up incrementally to `128M`, but with a big increase in memory usage along the way. I chose 16MB because the person who tried that reported it seemed like the "sweet spot": memory usage increased, but wasn't crazy, and build time decreased a lot. The GHC folks seem to be changing GHC defaults to special-case `-A` values of 16M and up, as well as 32M and up, but they are staying conservative and keeping the default `-A` at 1M. It is probably worth people trying out different values here, but I wanted to pick one to suggest, so I picked 16M (see the sizing note just after this list).
- `-qg` disables parallel GC. People have reported speed gains from this. As an alternative to `-qg`, we could try using `-n4m -qb0` instead (more on this below), which is a combination of flags that should improve parallel GC performance. However, I suspect it may be better to turn it off altogether; Simon Marlow suggested there may be an as-yet-unidentified bug in the parallel GC's multithreading logic. Since using `sysconfcpus` to force 1 core speeds things up overall, if a parallel GC bug is the root of the problem here, it stands to reason that disabling it would give better perf than the `sysconfcpus` trick, because at least the compiler logic itself could still use multiple cores. It's probably worth benchmarking both ways.
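One bit of arithmetic worth keeping in mind when picking a value here (my numbers, not from the comment above): the allocation area is per capability, so the reservation scales with core count. On a 16-core machine running 16 capabilities, `-A16M` means roughly 16 × 16 MB = 256 MB of total nursery, versus 16 × 1 MB = 16 MB at the default.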
**Bonus perf boost for Linux**

For Linux builds (not sure if this is a no-op or causes problems on non-Linux systems), Simon Marlow suggests adding the `--numa` flag (e.g. `-with-rtsopts="-A16M -qg --numa"`) for another significant perf boost (according to the docs, a 10% speedup is "typical", but it varies greatly by hardware and program).
**Further Reading**

I have read the documentation for the `-H` flag about 12 times. I still do not know what it does. It sounds like it could lead to reduced memory consumption. Maybe it's awesome. I'm also not sure if GHC 7.x supports it.

If we remove `-qg`, we should also enable `-qb0` (for `-A32M` and up, and honestly possibly also `-A16M` and up), which turns on load balancing in the nursery for parallel GC. Simon Marlow recommended this. We should also definitely enable `-n4m`; the current GHC now defaults to `-n4m` for `-A16M` or higher.
**Longer Version**

I put the long version of what I learned into a gist.
AntouanK commented Jul 28, 2017

Thank you @rtfeldman for looking into this.
If I remember correctly, compiling `elm-make` is not a one-line thing.
If anyone does try it with those flags, can you upload the binaries somewhere so we can try them as well?
eeue56 commented Jul 29, 2017

So, I ran almost all the combinations suggested, with different values. I've put the most useful raw numbers into this gist, benchmarking against elm-css on a Thinkpad T470s.

Conclusion: the best option is `-with-rtsopts=-qg`. No amount of `-A<n>` could beat just disabling the parallel GC. While there is some benefit from using `-A<n>`, it only starts to kick in at a large value, e.g. `-A64M`. That seems like a large amount of memory to reserve for a single `elm-make` process, and it still did not perform as well as `-qg`. Other things such as `-n4m` likewise saw little or no performance change.

Make sure to use `"-with-rtsopts=-A16M -qg"` for checking, and not `-with-rtsopts="-A16M -qg"`, or `-qg` will be parsed as a non-RTS option.

**tl;dr**

`"-with-rtsopts=-qg"` is the best option for performance on my machine.
AntouanK commented Jul 29, 2017

@eeue56 How many cores does that machine have?
Since it's an issue that worsens with the number of cores available, it would be good to test it against a CPU with many.
I can run it on a Ryzen 1700; it has 16 cores.
Let me know if you can make a gist with instructions on what combinations to try, and I'll do it this weekend.
eeue56 commented Jul 29, 2017

It is this CPU: https://ark.intel.com/products/97466/Intel-Core-i7-7600U-Processor-4M-Cache-up-to-3_90-GHz

So, in those terms, 4 cores (2 real cores, 2 more from HT).

If you look at the gist I made, it has all the combinations that had any form of noticeable impact. In order to build:

- Install GHC 7.10
- Download https://github.com/elm-lang/elm-platform/blob/master/installers/BuildFromSource.hs
- Run `runhaskell BuildFromSource.hs 0.18`
- `cd` into `Elm-Platform/0.18`
- Modify `elm-package/elm-package.cabal`, `elm-make/elm-make.cabal`, and `elm-make/elm-compiler` to have the options. Your ghc-options should look something like `-threaded -O2 -W "-with-rtsopts=-A16M -qg"`. Note where the quotes are.
- Run `cabal install elm-make`
- Run `perf stat <dir-with-elm-platform>/Elm-Platform/0.18/.cabal_sandbox/bin/elm-make` in the elm-css dir. Make sure to wipe `elm-stuff/build-artifacts`, but not `elm-stuff/packages`, before each build.

If you have other questions, this discussion is better continued in #elm-dev on Slack.
rtfeldman commented Jul 29, 2017

@AntouanK if you do build those, can you save the different binaries you make so we can post them somewhere for others to try? A higher sample size will give us more confidence!
AntouanK commented Jul 29, 2017

Thanks @eeue56
Here are my results: https://gist.github.com/AntouanK/1ecec02e08b90be54463d4ad6f0efcab
Still, it confuses me how it's possible for 1 core to outperform 16 in parallel.
Even with the `-with-rtsopts=-qg` flag, if I switch to just one core, it runs faster.
It seems to me that the best solution is to restrict it to one core and not let it multi-thread; I don't know how easy that would be.
AntouanK commented Jul 29, 2017

@rtfeldman Seems like there should be an easy script for this.
I made a Docker container with Haskell 7.10, mounted a local directory of elm-css, and then did 3 things repeatedly:

- `sed` to change the flags in the cabal files, and `runhaskell ...` again
- `rm -rf` the build artifacts in elm-css
- run `perf stat .../elm-make`

So should we just make a script instead? Or should I just upload the binaries to a repo?
rtfeldman commented Jul 29, 2017

@AntouanK uploading the binaries would be fantastic! I think that would definitely be the easiest for others.

Since `sysconfcpus` still improved things, I'm curious what happens if you try these RTS options:

- `-N1` (tells GHC to only use one core for the program as well as the GC; should be equivalent to `sysconfcpus -n 1`)
- `-N16` (tells GHC to use 16 cores, instead of the 32 it might be trying to use due to hyperthreading; this comment suggests that might be a thing)
- `-N12` (to see if telling it to use many cores, but fewer than 16, is better than telling it to use 1)

Among these, I think it'd be most useful to try `-N1 -qg`, but if you have time, trying all 3 would be great. Even better would be trying 6 combinations: using the different `-N` flags with `-qb0 -n4m` instead of `-qg`.

Saving binaries for anything you do there would be great as well!
AntouanK commented Jul 29, 2017

@rtfeldman With my poor scripting skills, I got a loop going.
Binaries and perf stat outputs are here: https://github.com/AntouanK/elm-make_perf

So far, it seems like the flags don't make a big difference.
Using one core does.

`elm-css`:
| Perf name | Time |
|---|---|
| multicore | 5.502 sec |
| onecore | 3.074 sec |
| -A16M_-qb0.multicore | 3.136 sec |
| -A16M_-qb0.onecore | 2.776 sec |
| -A16M_-qg.multicore | 3.230 sec |
| -A16M_-qg.onecore | 2.744 sec |
| -A32M_-qb0.multicore | 3.131 sec |
| -A32M_-qb0.onecore | 2.751 sec |
| -A32M_-qg.multicore | 3.030 sec |
| -A32M_-qg.onecore | 2.744 sec |
| -A64M_-qb0.multicore | 3.327 sec |
| -A64M_-qb0.onecore | 2.754 sec |
| -A64M_-qg.multicore | 3.201 sec |
| -A64M_-qg.onecore | 2.790 sec |
| -N1_-qb0_-n4m.multicore | 3.123 sec |
| -N1_-qb0_-n4m.onecore | 2.748 sec |
| -N1_-qg.multicore | 3.201 sec |
| -N1_-qg.onecore | 2.749 sec |
| -N12_-qb0_-n4m.multicore | 3.192 sec |
| -N12_-qb0_-n4m.onecore | 2.744 sec |
| -N12_-qg.multicore | 3.135 sec |
| -N12_-qg.onecore | 2.767 sec |
| -N16_-qb0_-n4m.multicore | 3.191 sec |
| -N16_-qb0_-n4m.onecore | 2.777 sec |
| -N16_-qg.multicore | 3.214 sec |
| -N16_-qg.onecore | 2.734 sec |
| -qg.multicore | 3.006 sec |
| -qg.onecore | 2.761 sec |
AntouanK commented Jul 30, 2017

Added results for `elm-spa-example` as well:
| Perf name | Time | % from multicore |
|---|---|---|
| multicore | 9.234 sec | |
| onecore | 6.385 sec | 69.15% |
| -A16M_-qb0.multicore | 7.451 sec | 80.69% |
| -A16M_-qb0.onecore | 6.033 sec | 65.33% |
| -A16M_-qg.multicore | 7.431 sec | 80.47% |
| -A16M_-qg.onecore | 6.151 sec | 66.61% |
| -A32M_-qb0.multicore | 7.169 sec | 77.64% |
| -A32M_-qb0.onecore | 6.106 sec | 66.13% |
| -A32M_-qg.multicore | 7.068 sec | 76.54% |
| -A32M_-qg.onecore | 6.266 sec | 67.86% |
| -A64M_-qb0.multicore | 7.559 sec | 81.86% |
| -A64M_-qb0.onecore | 6.219 sec | 67.35% |
| -A64M_-qg.multicore | 7.153 sec | 77.46% |
| -A64M_-qg.onecore | 6.015 sec | 65.14% |
| -N12_-qb0_-n4m.multicore | 7.301 sec | 79.07% |
| -N12_-qb0_-n4m.onecore | 6.123 sec | 66.31% |
| -N12_-qg.multicore | 7.373 sec | 79.85% |
| -N12_-qg.onecore | 6.297 sec | 68.19% |
| -N16_-qb0_-n4m.multicore | 7.586 sec | 82.15% |
| -N16_-qb0_-n4m.onecore | 6.340 sec | 68.66% |
| -N16_-qg.multicore | 7.248 sec | 78.49% |
| -N16_-qg.onecore | 6.139 sec | 66.48% |
| -N1_-qb0_-n4m.multicore | 7.543 sec | 81.69% |
| -N1_-qb0_-n4m.onecore | 6.131 sec | 66.40% |
| -N1_-qg.multicore | 7.561 sec | 81.88% |
| -N1_-qg.onecore | 6.097 sec | 66.03% |
| -qg.multicore | 7.330 sec | 79.38% |
| -qg.onecore | 6.386 sec | 69.16% |
rtfeldman commented Jul 30, 2017

Hm, this is very strange. As I recall, @eeue56 was seeing 2-3 second build times on `elm-spa-example` without `sysconfcpus`, building on less powerful hardware.
I wonder what the reason is for the discrepancy.
AntouanK commented Jul 30, 2017

That's with `elm-make src/Main.elm`?
I tried it again.
I get 8.6 seconds with normal `elm-make`, and 6.2 seconds with one core.
rtfeldman commented Sep 15, 2017

I did some more digging around in the docs, and found some more potential answers as to why single-core is outperforming multicore.
**`-s` for Benchmarking Statistics**

Before getting into those, it's worth noting that the `-s` flag (and related flags) prints a wealth of runtime performance statistics after the run. This would be really helpful for these benchmarks!
**Flags that possibly should always be enabled for elm-make**

- The `--numa` flag could have major benefits for Linux in particular. (It seems like it shouldn't be enabled on non-Linux builds though.) As far as I can tell, it's just a straight-up upgrade for Linux builds, and could very well explain the discrepancy between Linux and macOS performance characteristics.
- The `-feager-blackholing` flag apparently should always be turned on, according to the docs. ("We recommend compiling any code that is intended to be run in parallel with the `-feager-blackholing` flag.") Note that it only has any effect if `-threaded` is enabled, so we'll need to keep that enabled in order to see what it does. (Doesn't sound like it should be much, around the neighborhood of 1-2% according to the docs, but who knows?)
- The `-fllvm` flag seems worth enabling in general, and may separately impact this issue. It enables using LLVM to get additional binary performance optimizations. ("Compile via LLVM instead of using the native code generator. This will generally take slightly longer than the native code generator to compile. Produced code is generally the same speed or faster than the other two code generators.") The one caveat is that compiling via LLVM requires LLVM's `opt` and `llc` executables to be in `PATH`, so this may not be worth the effort in comparison to the other options.
- The `-threaded` flag enables a parallel version of the runtime. ("The threaded runtime system is so-called because it manages multiple OS threads, as opposed to the default runtime system which is purely single-threaded.") It turns out that if this is disabled, then `-N` is silently ignored. The fact that it is disabled by default explains why `-N` never had any effect on the previous benchmarks. I don't know why this isn't enabled by default, considering some docs say "In order to make use of multiple CPUs, your program must be linked with the `-threaded` option" (which empirically seems to be false, since we've seen more CPUs engage when `sysconfcpus` is not used to limit them), but this very much seems worth trying at any rate. EDIT: It appears `elm-make` is already using `-threaded`. So it looks like the reason `sysconfcpus` has an effect is that `elm-make` is setting the capability count manually (sketched just after this list). This line of code would become unnecessary if we used `-N` (no number), which would make GHC infer it on startup, but it appears that would do nothing more than saving a line of code; GHC infers the same number as what we get by calling this manually.
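The capability-setting pattern described in that last bullet looks roughly like this (a sketch of the idea only, not `elm-make`'s actual source):

```haskell
import Control.Concurrent (setNumCapabilities)
import GHC.Conc (getNumProcessors)

main :: IO ()
main = do
  -- Ask the OS how many processors exist and use that many
  -- capabilities. sysconfcpus lies about the processor count,
  -- which changes what getNumProcessors returns; that is why
  -- it affects elm-make at all.
  n <- getNumProcessors
  setNumCapabilities n
  putStrLn ("running with " ++ show n ++ " capabilities")
  -- (the compiler's real work would happen here)
```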
**Flags that deserve experimentation**

- The `-ki` flag controls the default stack size for new threads. Thread stacks live on the heap, apparently, and the default is 1K (GHC is once again conservative). Threads allocate more memory as they grow (in 32K chunks by default, which is also configurable via the `-kc` flag, if we care), and then GC it when they're no longer using it. This could explain why throwing more cores at the workload degrades performance; if each of these threads is so resource-constrained that it ends up doing a ton of allocation and GC to overcome its 1K default stack size, they might be so much slower than the main thread (which doesn't have this problem) that it leads to a net slowdown.
- The `-H` flag sets the default heap size for the garbage collector. The default is 0, and I suspect we can experimentally find a better choice than that, but then again `-H` and `-A` are related, and tweaking `-A` didn't seem to do anything. Still, it could be worth a shot since the default seems so bad.
- The `-I` flag could be relevant; it controls how often GHC collects garbage when idle. By default it runs every 0.3 seconds, and we can disable it with `-I0`.
**Quick Aside**

I also learned a bit about GHC's default pre-emption settings.

> GHC implements pre-emptive multitasking: the execution of threads are interleaved in a random fashion. More specifically, a thread may be pre-empted whenever it allocates some memory [...]
>
> The rescheduling timer runs on a 20ms granularity by default, but this may be altered using the -i RTS option. [...]

This sounds like a potential contributing factor, but unfortunately, these docs may be out of date. The RTS docs don't list an `-i` flag. However, there are plenty of other knobs to try!

@eeue56 and @AntouanK - if you could try the above flags out on your benchmarks, it would be so amazing!!! I feel some combination of these may be the answer we've been looking for!

Note: I'm not actually sure how to verify these are enabled. I linked to the docs, but as Noah learned, the exact way you specify them can be tricky, and GHC has a habit of silently ignoring flags it doesn't understand.
rtfeldman commented Sep 15, 2017

As an aside, I'm curious if this is why macOS programs are doing better: maybe GHC thinks they all have 1 core.
I ran this on my MacBook Pro (2 physical cores + 2 HT):
```haskell
module Main where

import Control.Concurrent

main :: IO ()
main = do
  capabilities <- getNumCapabilities
  putStrLn $ show capabilities
```

It printed 1.
Curious what others see when running this.
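(For anyone reproducing this, an invocation sketch: build with `ghc -threaded -rtsopts Main.hs`, then compare `./Main` against `./Main +RTS -N4 -RTS`. As noted in the previous comment, `-N` is ignored without `-threaded`.)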
rtfeldman commented Sep 16, 2017

Nope, that's not it. I also ran it on nixOS with 8 physical cores (Ryzen) and it also printed 1.
jmitchell commented Sep 16, 2017

@rtfeldman, I get the same on my 8-core AMD under nixOS. Perhaps it's set to 1 by default everywhere, unless overridden with `setNumCapabilities` or the `+RTS -N` runtime flag mentioned in the `setNumCapabilities` doc string:

> Set the number of Haskell threads that can run truly simultaneously (on separate physical processors) at any given time. The number passed to forkOn is interpreted modulo this value. The initial value is given by the +RTS -N runtime flag.
rtfeldman commented Sep 16, 2017

@jmitchell yep, I just confirmed that: it prints out whatever `-N` flag I give it, but defaults to 1.
rtfeldman commented Sep 16, 2017

This is even more confusing, then. Why would `sysconfcpus` have any effect at all if GHC always thinks there is only 1 core available unless you tell it otherwise?
jmitchell commented Sep 16, 2017

Ah, because `setNumCapabilities` is applied according to `GHC.Conc.getNumProcessors`, which on my machine returns 8.
rtfeldman commented Sep 16, 2017

Ahh, that's why!
I learned from Brian McKenna on Twitter that if you pass `-N` by itself (no number specified), GHC will infer it for you on startup, making that `getNumProcessors` line unnecessary. I verified locally that it works; on my MacBook Pro it reported 4, and on my nixOS Ryzen machine it reported 8.

Then I set up a test on Travis to see if maybe `-N` would lead `getNumCapabilities` to report a better number than `getNumProcessors` (which reports 32 cores available, when actually only 2 are available to the virtualized environment on Travis), but sadly the `-N` approach also reports 32 on Travis.
andys8 commented Sep 16, 2017

There is also a chance that the cause of the stack-overflow issue #164 is related, if the configured values are relative to the total amount of available memory on a machine, too.
zwilias (Member) commented Sep 20, 2017

For future posterity: building `elm-make` with `-rtsopts` in the GHC options allows setting the RTS options at execution time, rather than at compile time.
https://downloads.haskell.org/~ghc/7.8.4/docs/html/users_guide/runtime-control.html
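For example (a hypothetical invocation, assuming an `elm-make` rebuilt with `-rtsopts`): `elm-make src/Main.elm +RTS -s -N1 -qg -RTS` would pick the statistics output, capability count, and GC mode per run instead of baking them into the binary.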
andys8 commented Sep 20, 2017

@zwilias Any chance to get this flag in by default? (Any implications?) Otherwise a workaround would have to rely on an elm-make fork.
evancz (Contributor) commented Mar 7, 2018

I want to make elm/compiler#1473 the canonical issue for this. It probably makes sense to summarize the things that need to happen into a meta issue though. I will coordinate with @eeue56 and @zwilias to get a list of TODO items that should be in that.
AntouanK commented Apr 12, 2017

Hi.
Noticed this issue when I was working in parallel on the same project, on my PC and my MacBook Pro.
For some weird reason, a 3-year-old MBP was compiling faster than a brand-new 16-core PC.
For example:

MBP: (screenshot)

PC: (screenshot)

After the tip from @eeue56 on the Elm Slack channel to use `sysconfcpus`, I saw a huge boost. On the Ryzen PC with Linux, with one core (so with `sysconfcpus -n 1`), I can run `make build` in ~10.4 seconds!
(On the Mac, `sysconfcpus -n 1` makes no difference.)
So, how come the same process is ~50% slower running on 16 cores than on one?
Is there anything I can do to make the compiler take advantage of the multiple cores?
Thanks.