Add a job that measures build memory consumption and time #315

Merged
paulgessinger merged 16 commits into acts-project:master from build-perf-job on Jul 10, 2020

Conversation

paulgessinger
Member

This runs compilation units individually, based on a compilation database.

Currently, it only prints the output in two tables, one sorted by memory and one sorted by compile time.
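
For illustration, a minimal sketch of how such a per-translation-unit measurement can work, based on polling the resident set size of each compiler invocation listed in `compile_commands.json`. This is not the actual cmakeperf implementation; the use of `psutil`, the polling interval, and the assumption that each entry has a `command` field are mine.

```python
# Sketch: measure peak RSS and wall time per compilation unit from a
# compilation database, then print tables sorted by memory and by time.
import json
import shlex
import time

import psutil  # assumption: pip install psutil


def peak_rss(cmd, cwd):
    """Run one compile command and return its peak RSS in bytes (polled)."""
    proc = psutil.Popen(shlex.split(cmd), cwd=cwd)
    peak = 0
    while proc.poll() is None:
        try:
            rss = proc.memory_info().rss
            rss += sum(c.memory_info().rss for c in proc.children(recursive=True))
            peak = max(peak, rss)
        except psutil.NoSuchProcess:
            pass
        time.sleep(0.05)  # polling can miss very short-lived spikes
    return peak


with open("build/compile_commands.json") as f:
    entries = json.load(f)  # assumption: CMake emits a "command" string per entry

results = []
for entry in entries:
    start = time.monotonic()
    rss = peak_rss(entry["command"], entry["directory"])
    results.append((entry["file"], rss, time.monotonic() - start))

# One table sorted by peak memory, one sorted by compile time.
for key, label in [(1, "peak RSS"), (2, "time")]:
    print(f"--- sorted by {label} ---")
    for file, rss, dt in sorted(results, key=lambda r: r[key], reverse=True):
        print(f"{rss / 2**20:8.1f} MB  {dt:7.1f} s  {file}")
```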

@msmk0, @HadrienG2 what do you think?

@paulgessinger paulgessinger added the 🚧 WIP Work-in-progress label Jul 9, 2020
@acts-issue-bot acts-issue-bot bot added the Triage label Jul 9, 2020

@codecov

codecov bot commented Jul 9, 2020

Codecov Report

Merging #315 into master will increase coverage by 0.13%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #315      +/-   ##
==========================================
+ Coverage   48.32%   48.45%   +0.13%     
==========================================
  Files         323      324       +1     
  Lines       16376    16280      -96     
  Branches     7603     7554      -49     
==========================================
- Hits         7913     7889      -24     
+ Misses       3178     3139      -39     
+ Partials     5285     5252      -33     
Impacted Files Coverage Δ
...ts/EventData/detail/coordinate_transformations.hpp 61.11% <0.00%> (-1.86%) ⬇️
Core/include/Acts/EventData/MultiTrajectory.hpp 71.42% <0.00%> (ø)
...re/include/Acts/Propagator/StraightLineStepper.hpp 68.88% <0.00%> (ø)
.../include/Acts/EventData/MultiTrajectoryHelpers.hpp 50.00% <0.00%> (ø)
...ude/Acts/TrackFinder/CombinatorialKalmanFilter.hpp 26.86% <0.00%> (+0.59%) ⬆️
Core/include/Acts/EventData/MultiTrajectory.ipp 70.90% <0.00%> (+0.73%) ⬆️
Core/include/Acts/Fitter/KalmanFitter.hpp 37.27% <0.00%> (+1.04%) ⬆️
...include/Acts/TrackFinder/CKFSourceLinkSelector.hpp 43.54% <0.00%> (+1.61%) ⬆️
Core/include/Acts/Propagator/EigenStepper.ipp 46.07% <0.00%> (+3.77%) ⬆️
Core/include/Acts/Propagator/AtlasStepper.hpp 72.75% <0.00%> (+4.29%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a52f00...82f9d3d. Read the comment docs.

@HadrienG2
Contributor

HadrienG2 left a comment
I would personally measure a RelWithDebInfo job, because...

  • In one measurement you did on the CKF tests, there was quite a big difference in build RSS between Release builds and RelWithDebInfo builds (generating debug info for all those templates isn't free, I assume).
  • The parts of Acts that have a big build overhead problem are unit tests. These will mostly be built by developers, and developers tend to care about debug symbols for all sorts of reasons (debugging, profiling, tracing, dynamic analysis...).

...but in any case, thanks for adding this, and I'm obviously in favor of it ;)

-DACTS_BUILD_EVERYTHING=ON
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON
- name: Measure
run: cmakeperf collect build/compile_commands.json -o perf.csv -j$(nproc)
@HadrienG2
Contributor

HadrienG2 commented Jul 9, 2020

If you want stable time measurements, parallelism is dangerous because you don't know what's running in parallel with you and it may interact badly through some shared resources (memory bus, storage...).

Personally, I care most about RAM consumption right now, and think that timing measurements on cloud VMs are a lost cause, so I'm okay with this.

@paulgessinger
Member Author

Right. I wouldn't give too much weight to the time measurements anyway. Running in parallel speeds up the job a little bit, and the memory measurement should be robust enough under concurrency.

@paulgessinger
Member Author

On RelWithDebInfo, we now run out of disk space again 😄

@HadrienG2
Contributor

HadrienG2 commented Jul 9, 2020

Hmmm... this kind of relates to something that I was wondering about: wouldn't it make sense to time one of the existing builds instead of adding another build to the CI workflow?

(Also, out of curiosity, how do you know that the build ran out of disk space rather than, say, out of RAM? The failure symptoms do not look super obvious to me...)

@HadrienG2
Contributor

HadrienG2 commented Jul 9, 2020

Also, if disk usage is the issue, it intuitively sounds like you might be able to get away with amending your measurement script so that it deletes the .o file after every monitored compilation. This is obviously an incompatible alternative to the "altering one of the existing builds" strategy.

You would need to filter out linking jobs or anything else which makes use of those files from the compilation database, though.
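
For illustration, a sketch of what that filtering and cleanup could look like. It is not cmakeperf's actual logic; the `-c`/`-o` parsing and the defensive filter are mine, and the plain `subprocess.run` stands in for the monitored compile from the sketch above.

```python
# Sketch: keep only compile entries from compile_commands.json and delete
# each object file right after its compile has been measured, so disk
# usage stays roughly constant over the whole run.
import json
import os
import shlex
import subprocess


def object_file(args):
    """Return the path passed via -o, or None if there is none."""
    for i, arg in enumerate(args[:-1]):
        if arg == "-o":
            return args[i + 1]
    return None


with open("build/compile_commands.json") as f:
    entries = json.load(f)

# CMake normally only lists compile steps in the database, but filter
# defensively for entries that actually invoke the compiler with -c.
compiles = [e for e in entries if "-c" in shlex.split(e["command"])]

for entry in compiles:
    args = shlex.split(entry["command"])
    # Stand-in for the monitored compile (see the measurement sketch above).
    subprocess.run(args, cwd=entry["directory"], check=False)
    obj = object_file(args)
    if obj is not None:
        obj_path = os.path.join(entry["directory"], obj)
        if os.path.exists(obj_path):
            os.remove(obj_path)  # free the disk space immediately
```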

@paulgessinger
Member Author

That's exactly what I tried. Are the linker jobs even part of the compilation database? I didn't see any, at least.

@paulgessinger
Member Author

> Also, out of curiosity, how do you know that the build ran out of disk space rather than, say, out of RAM? The failure symptoms do not look super obvious to me...

It's mostly a guess. If the kernel nukes processes because it's OOM, you'll sometimes get output from the termination signal. If the VM manager kills the whole VM when it goes over disk, it just never prints any output. But yeah, it could be that the VM manager also terminates the whole VM if it goes over some memory limit. I'm not sure.

@paulgessinger
Member Author

> Hmmm... this kind of relates to something that I was wondering about: wouldn't it make sense to time one of the existing builds instead of adding another build to the CI workflow?

I could run the script and then afterwards invoke ninja to finish the job. Another option would be to not use the compilation database as an input, but just query ninja for all targets and run those. I'm not sure that in the latter case we'd be able to correlate memory footprint to compilation units 1:1, which I think we want.
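
For illustration, a rough sketch of that "ask ninja for its targets" alternative. The `.o` suffix filter is a heuristic of mine and is not guaranteed to map 1:1 to compilation units; `ninja -t targets all` is the standard ninja tool for listing targets.

```python
# Sketch: list all ninja targets, keep the object files, and build them
# one at a time so each can be monitored individually.
import subprocess

build_dir = "build"
out = subprocess.run(
    ["ninja", "-C", build_dir, "-t", "targets", "all"],
    check=True, capture_output=True, text=True,
).stdout

object_targets = [
    line.split(":", 1)[0]
    for line in out.splitlines()
    if line.split(":", 1)[0].endswith(".o")  # heuristic filter
]

for target in object_targets:
    # Each invocation would be wrapped in the same memory monitoring as above.
    subprocess.run(["ninja", "-C", build_dir, target], check=False)
```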

@HadrienG2
Contributor

HadrienG2 commented Jul 10, 2020

Since the job still fails after removing the .o files, I suspect that this could be an OOM scenario.

This is consistent with the fact that building with debug info consumes a lot more RAM, and with the fact that GitHub claims to provide 7 GB of RAM per worker, which is too little to build with 2 cores and debug info according to my measurements (our biggest processes are 5 GB in RelWithDebInfo mode here).

It is also consistent with the fact that no output is printed. Given GitHub Actions' low disk space limits, I bet that they are not using swap, and Linux's behavior when running out of RAM without a swap partition is to instantly freeze without any sane recovery option, as I've experienced too many times while playing with ~~fire~~ Acts builds. Yes, you heard that right, the OOM killer does not work without swap, unless you disable overcommit, which in turn is way too pessimistic because of copy-on-write, which in turn is overused because someone thought fork() was a good idea back when Unix was designed...

If this is the issue, running this build at -j1 will work around the problem, and running the coverage build at -j1 may also work around its problem.
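
For illustration, a quick sanity check one could run as a CI step to see whether the runner has any swap and what the overcommit policy is. It only reads standard Linux procfs paths; whether the GitHub runners actually expose them this way is an assumption.

```python
# Sketch: report total RAM, swap, and the kernel overcommit setting.
def read_meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = value.strip()
    return info


meminfo = read_meminfo()
print("MemTotal :", meminfo.get("MemTotal"))
print("SwapTotal:", meminfo.get("SwapTotal"))  # "0 kB" means no safety net

with open("/proc/sys/vm/overcommit_memory") as f:
    # 0 = heuristic overcommit (default), 1 = always overcommit, 2 = strict accounting
    print("overcommit_memory:", f.read().strip())
```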

@paulgessinger
Member Author

Maybe. Yeah. Interesting.

@msmk0 msmk0 added the Infrastructure Changes to build tools, continuous integration, ... label Jul 10, 2020
@acts-issue-bot acts-issue-bot bot removed the Triage label Jul 10, 2020
@msmk0
Contributor

msmk0 commented Jul 10, 2020

This is a cool tool, but I am not sure that it helps us in the CI (at least in the current form). To be a useful part of the CI we would need to have a comparison with a reference to be alerted when there are regressions (similar to what the coverage is doing).

If you want to add it in its current form (as a first step), I would suggest combining the coverage workflow and your new build performance one into a single analysis workflow.
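
For illustration, a sketch of the regression check described above: compare the current perf.csv against a stored reference and fail the job when a translation unit's peak memory grows beyond a tolerance. The column names (`file`, `max_rss`) and the reference file name are assumptions, not cmakeperf's documented format.

```python
# Sketch: flag per-file memory regressions relative to a reference CSV.
import csv
import sys


def load(path):
    with open(path, newline="") as f:
        return {row["file"]: float(row["max_rss"]) for row in csv.DictReader(f)}


reference = load("reference.csv")  # hypothetical stored baseline
current = load("perf.csv")

TOLERANCE = 1.10  # flag anything more than 10% above the reference
regressions = [
    (file, reference[file], rss)
    for file, rss in current.items()
    if file in reference and rss > TOLERANCE * reference[file]
]

for file, ref, rss in regressions:
    print(f"{file}: {ref / 2**20:.0f} MB -> {rss / 2**20:.0f} MB")

sys.exit(1 if regressions else 0)  # non-zero exit fails the CI job
```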

@HadrienG2
Contributor

HadrienG2 commented Jul 10, 2020

@msmk0 I am not sure if coverage can run on a RelWithDebInfo build. But perhaps a Debug build is enough to expose Acts' build RAM consumption problems. It's been a while since I last checked the RAM usage of those builds...

Never mind, I misunderstood your suggestion as merging the two jobs, instead of merely grouping them together in the GitHub UI.

@paulgessinger
Member Author

> This is a cool tool, but I am not sure that it helps us in the CI (at least in the current form). To be a useful part of the CI we would need to have a comparison with a reference to be alerted when there are regressions (similar to what the coverage is doing).

Right, but that is considerably more work. Maybe I'll throw something together, but not anytime soon.

> If you want to add it in its current form (as a first step), I would suggest combining the coverage workflow and your new build performance one into a single analysis workflow.

Not sure why this would matter, honestly.

@msmk0
Contributor

msmk0 commented Jul 10, 2020

> > If you want to add it in its current form (as a first step), I would suggest combining the coverage workflow and your new build performance one into a single analysis workflow.
>
> Not sure why this would matter, honestly.

So that they are logically grouped together, similar to the builds and the checks.

@msmk0
Contributor

msmk0 commented Jul 10, 2020

> Right, but that is considerably more work. Maybe I'll throw something together, but not anytime soon.

Completely understandable. That is why I mentioned that we could still add it as-is if you want to have this as a first prototype in the CI.

@paulgessinger
Member Author

Ok, I joined the workflows and removed the -j$(nproc), so it should run with only one thread now. Let's see.

@HadrienG2
Contributor

HadrienG2 commented Jul 10, 2020

On the positive side, it's running through this time, and we may have found a way to stabilize the coverage build without even having to exclude every optional Acts component from it.
On the negative side... it's even slower than on my laptop :(
Further motivation to fix the build so that it can run in parallel without dying, I guess...

@paulgessinger paulgessinger removed the 🚧 WIP Work-in-progress label Jul 10, 2020
@paulgessinger paulgessinger added this to the v0.29.00 milestone Jul 10, 2020
@paulgessinger
Member Author

Ok. Care to approve @HadrienG2?

@paulgessinger paulgessinger merged commit e6ced89 into acts-project:master Jul 10, 2020
@paulgessinger paulgessinger deleted the build-perf-job branch July 10, 2020 20:00
paulgessinger added a commit to paulgessinger/acts that referenced this pull request Jul 13, 2020
Labels
Infrastructure Changes to build tools, continuous integration, ...

3 participants