Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SR-4142] Compilation gets slower when allowed more concurrent jobs #46725

Closed
swift-ci opened this issue Mar 2, 2017 · 23 comments
Closed

[SR-4142] Compilation gets slower when allowed more concurrent jobs #46725

swift-ci opened this issue Mar 2, 2017 · 23 comments

Comments

@swift-ci
Copy link
Collaborator

@swift-ci swift-ci commented Mar 2, 2017

Previous ID SR-4142
Radar rdar://problem/31079426
Original Reporter Suliga (JIRA User)
Type Bug
Status Resolved
Resolution Done

Attachment: Download

Environment

Xcode/Swift versions

  • 8.2.1 (8C1002)

  • Apple Swift version 3.0.2 (swiftlang-800.0.63 clang-800.0.42.1)

  • Code base: Swift 3.0

OS versions tried:

  • El Capitan 10.11.6

  • Sierra 10.12.3

Hardware used:

  • MacBook Pro (Retina, 15-inch, Late 2013)
    4-cores 2.6 GHz Intel Core i7
    16 GB 1600 MHz DDR3
  • Mac Pro (Late 2013)
    2.7 GHz 12-Core Intel Xeon E5
    32 GB 1866 MHz DDR3 ECC
  • Mac mini (Late 2014)
    2-cores 3 GHz Intel Core i7
    16 GB 1600 MHz DDR3
  • iMac (Retina 5K, 27-inch, late 2015)
    4-cores 4 GHz Intel Core i7
    32 GB 1867 MHz DDR3
Additional Detail from JIRA
Votes 8
Component/s Compiler
Labels Bug, Performance
Assignee Suliga (JIRA)
Priority Medium

md5: 705a03b7c838353453626f5302785d52

Issue Description:

We noticed a concerning issue where building our primarily Swift projects would take significantly longer on hardware with more cores available. Building a project on a 4-core MacBookPro would be about 33% faster than building the same project on a powerful 12-core MacPro tower.

Further investigation showed that, counterintuitively, reducing the number of concurrent jobs for xcodebuild on the MacPro machine improved the build times. For the 12-core machine, going from the default 24 jobs setting down to only 5 threads would improve compilation time by 23%.

For most of the hardware we tested with, 4-6 seems to be the optimal number of concurrent threads for the build speed, and allowing more to be used actually hurts the build times. The number of jobs was controlled using both IDEBuildOperationMaxNumberOfConcurrentCompileTasks setting, and -jobs command line option for xcodebuild.

We've observed this pattern both for a sizable project (5K Swift files, containing 540K lines of code plus 2.5K ObjC files, containing 320K lines of code), as well on a smaller one (500 Swift files, containing 36K lines of code and 1,5K ObjC files, containing 150K lines of code).

Initially we thought this could be related to some SSD/filesystem inefficiencies, so we tried building with a ram disk (no difference), and also tried APFS on macSierra with the same results. The issue has been observed both with Whole Module Optimization on and off (for both debug and release builds).

See the chart attached which compares build times for 4 different types of hardware used vs. number of concurrent jobs requested. Notice how for all machines the fastest builds are around 4-6 jobs, after which point build times regress - most prominently for the MacPro. Even a two-core MacMini builds faster than the 12-core MacPro!

Another thing we've observed was that during the builds, the CPUs were heavily under-utilized. On average we've seen the cores being roughly 40% idle, even when building the 0.5M lines of code project.

See the text file attached with build times elapsed showing time consumed by system vs. user for MacPro. Numbers going from 3 to 24 indicate the number of concurrent jobs requested. Notice the times going from 32 minutes (for 3 jobs) up to almost an hour (24 jobs).

Please advise as to why we're seeing such a concerning behavior. Last year (around Swift 1.2-2.0) we performed a similar hardware analysis, and at that time for our projects 12-core MacPro machines would build 2-3x times faster than MacBookPros, and 4-5x faster than the MacMinis - which was expected based on the number of cores used for the builds.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Mar 2, 2017

Comment by Graydon Hoare (JIRA)

Three things would be useful for diagnosing this:

  • The model numbers of the machines in question (to gauge whether variation in straight-line perf between the models makes sense, eg. newer CPUs / faster clocks)

  • Whether the times collected are WMO or not, and built with optimization or not (to understand the way time is likely to be broken down between different subsystems of the compiler that exploit available concurrency differently, and/or oversubscribe to resources)

  • A breakdown of the per-file and per-phase timing, collected by running the build with
    -driver-time-compilation -Xfrontend -debug-time-compilation

Feel free to email such information privately; the last point in particular will reveal filenames and produce quite a lot of output on a large project.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Mar 2, 2017

Comment by Jacek (JIRA)

graydon (JIRA User) Thanks for the response. I did include detailed model information in the Environment section of the ticket. Please check it out.
Also as described in the ticket description, we measured both with WMO on and off for debug (no optimizations), and with WMO on for release (with optimizations), and saw the same pattern. The text file attached for MacPro build times vs. # of jobs is without optimizations (as indicated in the file name).

I will provide more details for the phase/file compilation offline.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Mar 2, 2017

Comment by Graydon Hoare (JIRA)

Also, given that you're testing on swiftlang-800.0.63, you will likely see an improvement in optimized build times if you move to a nightly snapshot toolchain or an 8.3 beta, or something that includes 92440b0 ; especially if you saw a regression relative to Swift 2.x. That change fixes a rather dramatic performance regression in the SILOptimizer phase.

It may also be worth checking to confirm that you're seeing the same overall number of jobs running as you scale the concurrent-job count. This would rule out the possibility of the driver running the same job more than once. To check that, it'd be sufficient to run your workload twice with the instrumentation flags above, once at (say) -j 4 and once at -j 24.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Mar 2, 2017

Comment by Graydon Hoare (JIRA)

Ah, apologies I didn't expand the environment part to see you'd described the model number variation. I think (hope you concur) that the straight-line perf variation is explained by the model + clock variation in the units, and there's only a question of concurrency to address here.

A further question worth probing: given the large quantity of ObjC in these projects, I wonder if you've tried the new -enable-bridging-pch option in 8.3 betas. It's possible that you're spending quite a bit of time (re)parsing your bridging header in each frontend job. This might explain the similarity in behaviour between debug and opt builds (as would a problem in typechecking, or several other things; there are a few different things to check).

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Mar 2, 2017

Comment by Jacek (JIRA)

Our code base is not ready for 8.3 beta, and apparently the migration is not trivial to do in a short time. Once we've done it we'll see if there is any improvement.

Will run tests as you've suggested and share the results.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Mar 2, 2017

Comment by Jacek (JIRA)

To confirm yes, we understand the concurrency issue is what we need to have addressed.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Mar 8, 2017

Comment by Graydon Hoare (JIRA)

I've reproduced this to some extent; running a concurrent swift build with -j 12 or -j 24 does appear to slow down noticeably (say a 20% increase in cumulative user-mode time) relative to -j 4 when run on a 12-way xeon mac pro. Wall time is still faster here, and system time is only going up to 100% of user time rather than 500% and slower wall time as you're seeing, but the phenomenon is occurring and is worrying even if less dramatic than you're seeing. Also confirm the odd observation that the slowdown does not occur on an i7 part like a macbook.

Unclear on cause at present. Will follow up when we know more.

@ematejska
Copy link
Mannequin

@ematejska ematejska mannequin commented Mar 16, 2017

@swift-ci create

@slavapestov
Copy link
Member

@slavapestov slavapestov commented Mar 16, 2017

> Our code base is not ready for 8.3 beta, and apparently the migration is not trivial to do in a short time.

Suliga (JIRA User) Can you file separate bugs about this? 8.3 is not supposed to have any intentional source breaking changes except for a few corner cases where the old compiler allowed unsound code resulting in undefined behavior.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Mar 17, 2017

Comment by Jacek (JIRA)

@slavapestov We did log bugs for issues we found so far:

  • rdar://30834916

  • rdar://30834480

  • rdar://30835142

  • rdar://30871455

  • rdar://30871960

@slavapestov
Copy link
Member

@slavapestov slavapestov commented Mar 17, 2017

Suliga (JIRA User) Thanks! Looks like the 'initialize' removal was intentional, and all but one of the rest have been fixe in swift-3.1-branch. Can you try a recent development snapshot?

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Mar 22, 2017

Comment by Jacek (JIRA)

@slavapestov One more we had logged rdar://30871790 still reproduces with 3/19 snapshot.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Apr 27, 2017

Comment by Graydon Hoare (JIRA)

Just to update here: we've discussed in some detail with Jacek and looked at 3 main categories of mitigation.

  • Disabling some client-side software that was interfering with the build

  • Shipped (in 8.3.2) a change to schedule WMO-mode jobs in parallel

  • Looking into clang's module-loading subsystem to see if we can cause less stat()-contention in non-WMO mode

I believe at this point the "slower with more jobs" angle is .. broadly under control, and the 12-core machine is at least pulling its weight (though is still clock-for-clock slower than a newer 4Ghz iMac). Jacek, can you confirm this impression on your end?

Squeezing more parallelism out of the compiler may entail a bit more-involved changes, so I'm also curious to know what the core-utilization figures on your machines are at present (in whatever configuration is fastest). Are you still seeing much idle time?

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Apr 27, 2017

Comment by Jacek (JIRA)

Indeed, between disabling a kernel extension we had on, and the WMO concurrency fixes made in 8.3.2, we no longer see the issue where using more cores hurts builds performance. It is still a problem with Xcode 8.2.1, so it's a combination of factors, not just one. As you noted, a quad core 4GHz iMac is still faster than the 12 core 2.7GHz MacPro.

I'll let ob (JIRA User) add idle data from our side.

Appreciate profiling on-site, and getting the fixes in so fast! Also, let me know if you'd like us to capture and share profiling data with the improved setup to see what else might be impacting the builds perf.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Apr 28, 2017

Comment by Oscar Bonilla (JIRA)

graydon (JIRA User) What I've been doing is watching the core utilization using htop and noticing that we have some bubbles in the pipeline. For example, one of the swift commands takes about 45s to complete while all the cores sit idle (only one core is at 99% utilization). There are others like that.

What I'm doing is isolating those commands to profile them individually. What kind of information would be useful? Is there some debugging option to swift that can tell you where it's spending its time?

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Apr 28, 2017

Comment by Oscar Bonilla (JIRA)

FWIW, the swift command I'm looking at is one of those where turning on WMO makes it compile faster and with more CPU utilization than with it off.

This is without WMO, I see very little CPU utilization, occasionally one core at 100%:

$ time long.sh >/dev/null 2>&1

real    3m14.846s
user    2m48.390s
sys 0m25.814s

This is with WMO, I see one core at 100% utilization most of the time:

$ time long.sh >/dev/null 2>&1

real    0m46.104s
user    0m58.929s
sys 0m1.905s

Xcode version is 8.3.2 from Apple.

$ swiftc -version
Apple Swift version 3.1 (swiftlang-802.0.53 clang-802.0.42)
Target: x86_64-apple-macosx10.9

long.sh is basically this, I just removed the -whole-module-optimization flag in between runs:

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/swiftc \
-incremental \
-whole-module-optimization \
-module-name \
VoyagerEntities \
-Onone \
-DDEBUG \
-DKIF_EXCEL \
-D \
COCOAPODS \
-D \
DEBUG \
-D \
KIF_EXCEL \
-sdk \
/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator10.3.sdk \
-target \
x86_64-apple-ios9.0 \
-g \
-module-cache-path \
/Users/obonilla/Library/Developer/Xcode/DerivedData/ModuleCache \
-Xfrontend \
-serialize-debugging-options \
-Xfrontend \
-disable-reflection-metadata \
-enable-testing \
-warnings-as-errors \
$(INCLUDES) \ # 1 Include directory with -I
$(FRAMEWORKS) \ # 53 Framework paths with -F
-c \
-num-threads \
24 \
$(SWIFT_FILES) \ # 149 Swift files
-output-file-map \
$(OUTPUT_FILE_MAP) \
-parseable-output \
-serialize-diagnostics \
-emit-dependencies \
-emit-module \
-emit-module-path \
$(EMIT_MODULE_PATH) \
-Xcc \
-I$(SWIFT_OVERRIDES_HMAP) \
-Xcc \
-iquote \
-Xcc \
$(GENERATED_FILES_HMAP) \
-Xcc \
-I$(OWN_TARGET_HEADERS_HMAP) \
-Xcc \
-I$(NON_FRAMEWORK_TARGET_HEADERS) \
-Xcc \
-ivfsoverlay \
-Xcc \
$(ALL_PRODUCT_HEADERS_YAML) \
-Xcc \
-iquote \
-Xcc \
$(PROJECT_HEADERS_HMAP) \
## A bunch more -Xcc -I$(INCLUDE)
-DPOD_CONFIGURATION_DEBUG=1 \
-Xcc \
-DDEBUG=1 \
-Xcc \
-DDEBUG=1 \
-Xcc \
-DKIF_EXCEL=1 \
-Xcc \
-DCOCOAPODS=1 \
-Xcc \
-DKIF_EXCEL=1 \
-emit-objc-header \
-emit-objc-header-path \
$(OBJECT_HEADER_PATH) \
-import-underlying-module \
-Xcc \
-ivfsoverlay \
-Xcc \
$(MODULE_OVERLAY_YAML) \
-Xcc \
-working-directory$(WORKING_DIRECTORY)

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented May 1, 2017

Comment by Oscar Bonilla (JIRA)

This is after adding -Xfrontend -debug-time-compilation

$ sh long.sh 
===-------------------------------------------------------------------------===
                               Swift compilation
===-------------------------------------------------------------------------===
  Total Execution Time: 74.2974 seconds (44.8304 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  40.1621 ( 55.7%)   1.0062 ( 44.8%)  41.1683 ( 55.4%)  41.1721 ( 91.8%)  Type checking / Semantic analysis
  21.8955 ( 30.4%)   0.8569 ( 38.2%)  22.7524 ( 30.6%)   1.9608 (  4.4%)  LLVM output
   0.6463 (  0.9%)   0.0438 (  2.0%)   0.6901 (  0.9%)   0.6901 (  1.5%)  SILGen
   8.8481 ( 12.3%)   0.3238 ( 14.4%)   9.1719 ( 12.3%)   0.4925 (  1.1%)  LLVM optimization
   0.3899 (  0.5%)   0.0039 (  0.2%)   0.3938 (  0.5%)   0.3939 (  0.9%)  SIL optimization
   0.0599 (  0.1%)   0.0043 (  0.2%)   0.0642 (  0.1%)   0.0643 (  0.1%)  Parsing
   0.0304 (  0.0%)   0.0059 (  0.3%)   0.0363 (  0.0%)   0.0363 (  0.1%)  Serialization (swiftmodule)
   0.0201 (  0.0%)   0.0002 (  0.0%)   0.0203 (  0.0%)   0.0203 (  0.0%)  Serialization (swiftdoc)
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)  Name binding
   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)  AST verification
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  SIL verification (pre-optimization)
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  SIL verification (post-optimization)
  72.0521 (100.0%)   2.2453 (100.0%)  74.2974 (100.0%)  44.8304 (100.0%)  Total

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Nov 28, 2017

Comment by Anton Iakimov (JIRA)

It was solved for me in Xcode 9. Does anyone still have this issue?

@jckarter
Copy link
Member

@jckarter jckarter commented Dec 13, 2017

According to graydon (JIRA User) this was fixed.

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Dec 19, 2017

Comment by John Grange (JIRA)

I am still seeing some very bad performance on a Mac Pro w/ 12 core on Xcode 9.0.1. What version of Xcode 9 has this been fixed in?

@jckarter
Copy link
Member

@jckarter jckarter commented Dec 19, 2017

grangej (JIRA User) What OS are you running? The problem was a combination of kernel issues exacerbated by syscall patterns executed from Clang's VFS layer and poorly-behaved antivirus stuff.

The bug here was specifically that more parallelism was causing less throughput due to the aforementioned issues. Do you observe something similar, that you get better build times if you reduce the number of jobs?

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Dec 20, 2017

Comment by John Grange (JIRA)

@jckarter I am on OS X 12.6 , Mac Pro 12 Core, Xcode 9.01.

Running the following command seems to help, but my MacBook Pro still appears to be faster.

defaults write com.apple.dt.Xcode IDEBuildOperationMaxNumberOfConcurrentCompileTasks 5

@swift-ci
Copy link
Collaborator Author

@swift-ci swift-ci commented Dec 20, 2017

Comment by John Grange (JIRA)

@jckarter I also tried 9.2 and am experiencing similar issues. Should I try upgrading to OS X 13?

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants