Add a job that measures build memory consumption and time #315

Merged
paulgessinger merged 16 commits into acts-project:master from build-perf-job on Jul 10, 2020

Conversation

paulgessinger
Member

This runs compilation units individually, based on a compilation database.

Currently, it only prints the output in two tables, one sorted by memory and one sorted by compile time.
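
For illustration, a minimal sketch of how such a per-translation-unit measurement can work, based on polling the resident set size of each compiler invocation listed in `compile_commands.json`. This is not the actual cmakeperf implementation; the use of `psutil`, the polling interval, and the assumption that each entry has a `command` field are mine.

```python
# Sketch: measure peak RSS and wall time per compilation unit from a
# compilation database, then print tables sorted by memory and by time.
import json
import shlex
import time

import psutil  # assumption: pip install psutil


def peak_rss(cmd, cwd):
    """Run one compile command and return its peak RSS in bytes (polled)."""
    proc = psutil.Popen(shlex.split(cmd), cwd=cwd)
    peak = 0
    while proc.poll() is None:
        try:
            rss = proc.memory_info().rss
            rss += sum(c.memory_info().rss for c in proc.children(recursive=True))
            peak = max(peak, rss)
        except psutil.NoSuchProcess:
            pass
        time.sleep(0.05)  # polling can miss very short-lived spikes
    return peak


with open("build/compile_commands.json") as f:
    entries = json.load(f)  # assumption: CMake emits a "command" string per entry

results = []
for entry in entries:
    start = time.monotonic()
    rss = peak_rss(entry["command"], entry["directory"])
    results.append((entry["file"], rss, time.monotonic() - start))

# One table sorted by peak memory, one sorted by compile time.
for key, label in [(1, "peak RSS"), (2, "time")]:
    print(f"--- sorted by {label} ---")
    for file, rss, dt in sorted(results, key=lambda r: r[key], reverse=True):
        print(f"{rss / 2**20:8.1f} MB  {dt:7.1f} s  {file}")
```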

@msmk0, @HadrienG2 what do you think?

@paulgessinger paulgessinger added the 🚧 WIP Work-in-progress label Jul 9, 2020
@acts-issue-bot acts-issue-bot bot added the Triage label Jul 9, 2020

@codecov

codecov bot commented Jul 9, 2020

Codecov Report

Merging #315 into master will increase coverage by 0.13%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #315      +/-   ##
==========================================
+ Coverage   48.32%   48.45%   +0.13%     
==========================================
  Files         323      324       +1     
  Lines       16376    16280      -96     
  Branches     7603     7554      -49     
==========================================
- Hits         7913     7889      -24     
+ Misses       3178     3139      -39     
+ Partials     5285     5252      -33     
Impacted Files Coverage Δ
...ts/EventData/detail/coordinate_transformations.hpp 61.11% <0.00%> (-1.86%) ⬇️
Core/include/Acts/EventData/MultiTrajectory.hpp 71.42% <0.00%> (ø)
...re/include/Acts/Propagator/StraightLineStepper.hpp 68.88% <0.00%> (ø)
.../include/Acts/EventData/MultiTrajectoryHelpers.hpp 50.00% <0.00%> (ø)
...ude/Acts/TrackFinder/CombinatorialKalmanFilter.hpp 26.86% <0.00%> (+0.59%) ⬆️
Core/include/Acts/EventData/MultiTrajectory.ipp 70.90% <0.00%> (+0.73%) ⬆️
Core/include/Acts/Fitter/KalmanFitter.hpp 37.27% <0.00%> (+1.04%) ⬆️
...include/Acts/TrackFinder/CKFSourceLinkSelector.hpp 43.54% <0.00%> (+1.61%) ⬆️
Core/include/Acts/Propagator/EigenStepper.ipp 46.07% <0.00%> (+3.77%) ⬆️
Core/include/Acts/Propagator/AtlasStepper.hpp 72.75% <0.00%> (+4.29%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a52f00...82f9d3d. Read the comment docs.

@HadrienG2
Contributor

HadrienG2 left a comment
I would personally measure a RelWithDebInfo job, because...

  • In one measurement you did on the CKF tests, there was quite a big difference in build RSS between Release builds and RelWithDebInfo builds (generating debug info for all those templates isn't free, I assume).
  • The parts of Acts that have a big build overhead problem are unit tests. These will mostly be built by developers, and developers tend to care about debug symbols for all sorts of reasons (debugging, profiling, tracing, dynamic analysis...).

...but in any case, thanks for adding this, and I'm obviously in favor of it ;)

-DACTS_BUILD_EVERYTHING=ON
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON
- name: Measure
run: cmakeperf collect build/compile_commands.json -o perf.csv -j$(nproc)
@HadrienG2
Contributor

HadrienG2 commented Jul 9, 2020

If you want stable time measurements, parallelism is dangerous because you don't know what's running in parallel with you and it may interact badly through some shared resources (memory bus, storage...).

Personally, I care most about RAM consumption right now, and think that timing measurements on cloud VMs are a lost cause, so I'm okay with this.

@paulgessinger
Member Author

Right. I wouldn't give too much weight to the time measurements anyway. Running in parallel speeds up the job a little bit, and the memory measurement should be robust enough under concurrency.

@paulgessinger
Member Author

On RelWithDebInfo, we now run out of disk space again 😄

@HadrienG2
Contributor

HadrienG2 commented Jul 9, 2020

Hmmm... this kind of relates to something that I was wondering about: wouldn't it make sense to time one of the existing builds instead of adding another build to the CI workflow?

(Also, out of curiosity, how do you know that the build ran out of disk space rather than, say, out of RAM? The failure symptoms do not look super obvious to me...)

@HadrienG2
Contributor

HadrienG2 commented Jul 9, 2020

Also, if disk usage is the issue, it intuitively sounds like you might be able to get away with amending your measurement script so that it deletes the .o file after every monitored compilation. This is obviously an incompatible alternative to the "altering one of the existing builds" strategy.

You would need to filter out linking jobs or anything else which makes use of those files from the compilation database, though.
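
For illustration, a sketch of what that filtering and cleanup could look like. It is not cmakeperf's actual logic; the `-c`/`-o` parsing and the defensive filter are mine, and the plain `subprocess.run` stands in for the monitored compile from the sketch above.

```python
# Sketch: keep only compile entries from compile_commands.json and delete
# each object file right after its compile has been measured, so disk
# usage stays roughly constant over the whole run.
import json
import os
import shlex
import subprocess


def object_file(args):
    """Return the path passed via -o, or None if there is none."""
    for i, arg in enumerate(args[:-1]):
        if arg == "-o":
            return args[i + 1]
    return None


with open("build/compile_commands.json") as f:
    entries = json.load(f)

# CMake normally only lists compile steps in the database, but filter
# defensively for entries that actually invoke the compiler with -c.
compiles = [e for e in entries if "-c" in shlex.split(e["command"])]

for entry in compiles:
    args = shlex.split(entry["command"])
    # Stand-in for the monitored compile (see the measurement sketch above).
    subprocess.run(args, cwd=entry["directory"], check=False)
    obj = object_file(args)
    if obj is not None:
        obj_path = os.path.join(entry["directory"], obj)
        if os.path.exists(obj_path):
            os.remove(obj_path)  # free the disk space immediately
```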

@paulgessinger
Member Author

That's exactly what I tried. Are the linker jobs even part of the compilation database? I didn't see any, at least.

@paulgessinger
Member Author

> Also, out of curiosity, how do you know that the build ran out of disk space rather than, say, out of RAM? The failure symptoms do not look super obvious to me...

It's mostly a guess. If the kernel nukes processes because it's OOM, you'll sometimes get output from the termination signal. If the VM manager kills the whole VM when it goes over disk, it just never prints any output. But yeah, it could be that the VM manager also terminates the whole VM if it goes over some memory limit. I'm not sure.

@paulgessinger
Member Author

> Hmmm... this kind of relates to something that I was wondering about: wouldn't it make sense to time one of the existing builds instead of adding another build to the CI workflow?

I could run the script and then afterwards invoke ninja to finish the job. Another option would be to not use the compilation database as an input, but just query ninja for all targets and run those. I'm not sure that in the latter case we'd be able to correlate memory footprint to compilation units 1:1, which I think we want.
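
For illustration, a rough sketch of that "ask ninja for its targets" alternative. The `.o` suffix filter is a heuristic of mine and is not guaranteed to map 1:1 to compilation units; `ninja -t targets all` is the standard ninja tool for listing targets.

```python
# Sketch: list all ninja targets, keep the object files, and build them
# one at a time so each can be monitored individually.
import subprocess

build_dir = "build"
out = subprocess.run(
    ["ninja", "-C", build_dir, "-t", "targets", "all"],
    check=True, capture_output=True, text=True,
).stdout

object_targets = [
    line.split(":", 1)[0]
    for line in out.splitlines()
    if line.split(":", 1)[0].endswith(".o")  # heuristic filter
]

for target in object_targets:
    # Each invocation would be wrapped in the same memory monitoring as above.
    subprocess.run(["ninja", "-C", build_dir, target], check=False)
```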

@HadrienG2
Contributor

HadrienG2 commented Jul 10, 2020

Since the job still fails after removing the .o files, I suspect that this could be an OOM scenario.

This is consistent with the fact that building with debug info consumes a lot more RAM, and with the fact that GitHub claims to provide 7 GB of RAM per worker, which is too little to build with 2 cores and debug info according to my measurements (our biggest processes are 5 GB in RelWithDebInfo mode here).

It is also consistent with the fact that no output is printed. Given GitHub Actions' low disk space limits, I bet that they are not using swap, and Linux's behavior when running out of RAM without a swap partition is to instantly freeze without any sane recovery option, as I've experienced too many times while playing with ~~fire~~ Acts builds. Yes, you heard that right, the OOM killer does not work without swap, unless you disable overcommit, which in turn is way too pessimistic because of copy-on-write, which in turn is overused because someone thought fork() was a good idea back when Unix was designed...

If this is the issue, running this build at -j1 will work around the problem, and running the coverage build at -j1 may also work around its problem.
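
For illustration, a quick sanity check one could run as a CI step to see whether the runner has any swap and what the overcommit policy is. It only reads standard Linux procfs paths; whether the GitHub runners actually expose them this way is an assumption.

```python
# Sketch: report total RAM, swap, and the kernel overcommit setting.
def read_meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = value.strip()
    return info


meminfo = read_meminfo()
print("MemTotal :", meminfo.get("MemTotal"))
print("SwapTotal:", meminfo.get("SwapTotal"))  # "0 kB" means no safety net

with open("/proc/sys/vm/overcommit_memory") as f:
    # 0 = heuristic overcommit (default), 1 = always overcommit, 2 = strict accounting
    print("overcommit_memory:", f.read().strip())
```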

@paulgessinger
Member Author

Maybe. Yeah. Interesting.

@msmk0 msmk0 added the Infrastructure Changes to build tools, continuous integration, ... label Jul 10, 2020
@acts-issue-bot acts-issue-bot bot removed the Triage label Jul 10, 2020
@msmk0
Contributor

msmk0 commented Jul 10, 2020

This is a cool tool, but I am not sure that it helps us in the CI (at least in the current form). To be a useful part of the CI we would need to have a comparison with a reference to be alerted when there are regressions (similar to what the coverage is doing).

If you want to add it in its current form (as a first step), I would suggest combining the coverage workflow and your new build performance one into a single analysis workflow.
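
For illustration, a sketch of the regression check described above: compare the current perf.csv against a stored reference and fail the job when a translation unit's peak memory grows beyond a tolerance. The column names (`file`, `max_rss`) and the reference file name are assumptions, not cmakeperf's documented format.

```python
# Sketch: flag per-file memory regressions relative to a reference CSV.
import csv
import sys


def load(path):
    with open(path, newline="") as f:
        return {row["file"]: float(row["max_rss"]) for row in csv.DictReader(f)}


reference = load("reference.csv")  # hypothetical stored baseline
current = load("perf.csv")

TOLERANCE = 1.10  # flag anything more than 10% above the reference
regressions = [
    (file, reference[file], rss)
    for file, rss in current.items()
    if file in reference and rss > TOLERANCE * reference[file]
]

for file, ref, rss in regressions:
    print(f"{file}: {ref / 2**20:.0f} MB -> {rss / 2**20:.0f} MB")

sys.exit(1 if regressions else 0)  # non-zero exit fails the CI job
```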

@HadrienG2
Contributor

HadrienG2 commented Jul 10, 2020

@msmk0 I am not sure if coverage can run on a RelWithDebInfo build. But perhaps a Debug build is enough to expose Acts' build RAM consumption problems. It's been a while since I last checked the RAM usage of those builds...

Never mind, I misunderstood your suggestion as merging the two jobs, instead of merely grouping them together in the GitHub UI.

@paulgessinger
Member Author

> This is a cool tool, but I am not sure that it helps us in the CI (at least in the current form). To be a useful part of the CI we would need to have a comparison with a reference to be alerted when there are regressions (similar to what the coverage is doing).

Right, but that is considerably more work. Maybe I'll throw something together, but not anytime soon.

> If you want to add it in its current form (as a first step), I would suggest combining the coverage workflow and your new build performance one into a single analysis workflow.

Not sure why this would matter, honestly.

@msmk0
Contributor

msmk0 commented Jul 10, 2020

> > If you want to add it in its current form (as a first step), I would suggest combining the coverage workflow and your new build performance one into a single analysis workflow.
>
> Not sure why this would matter, honestly.

So that they are logically grouped together, similar to the builds and the checks.

@msmk0
Contributor

msmk0 commented Jul 10, 2020

> Right, but that is considerably more work. Maybe I'll throw something together, but not anytime soon.

Completely understandable. That is why I mentioned that we could still add it as-is if you want to have this as a first prototype in the CI.

@paulgessinger
Member Author

Ok, I joined the workflows and removed the -j$(nproc), so it should run with only one thread now. Let's see.

@HadrienG2
Contributor

HadrienG2 commented Jul 10, 2020

On the positive side, it's running through this time, and we may have found a way to stabilize the coverage build without even having to exclude every optional Acts component from it.
On the negative side... it's even slower than on my laptop :(
Further motivation to fix the build so that it can run in parallel without dying, I guess...

@paulgessinger paulgessinger removed the 🚧 WIP Work-in-progress label Jul 10, 2020
@paulgessinger paulgessinger added this to the v0.29.00 milestone Jul 10, 2020
@paulgessinger
Member Author

Ok. Care to approve @HadrienG2?

@paulgessinger paulgessinger merged commit e6ced89 into acts-project:master Jul 10, 2020
@paulgessinger paulgessinger deleted the build-perf-job branch July 10, 2020 20:00
paulgessinger added a commit to paulgessinger/acts that referenced this pull request Jul 13, 2020
Labels
Infrastructure Changes to build tools, continuous integration, ...

3 participants