Allow CppLinkAction to inspect input sizes in resource estimation #19203
This PR instruments the CppLinkAction to print a log line for every link operation, reporting the number of inputs, the aggregate size of the inputs, the predicted memory usage, and the actual consumed memory. Calculating the aggregate size of the inputs is a costly operation because we need to expand a nested set and stat all of its files, which explains why this PR is complex. I've modified the local memory estimator to compute the inputs size exactly once per link, as we need this information both to predict memory consumption _and_ to emit the log line after the spawn completes. This is meant to aid in assessing the local resources prediction for linking with the goal of adjusting the model, but that should be done in a separate change. Helps address bazelbuild#17368.
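The "compute the inputs size exactly once per link" idea can be sketched as a small memoizing holder. This is only an illustration under assumptions: `LazyInputsSize` is a hypothetical name, plain `java.nio.file.Path` values stand in for the expanded nested set of artifacts, and `Files.size` stands in for whatever metadata lookup the real action uses.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper (not Bazel's class): stats the link inputs at most once. */
final class LazyInputsSize {
  private final List<Path> inputs;
  private long cachedBytes = -1; // -1 means "not computed yet"
  private int statCalls = 0; // counts stat-like calls, for illustration only

  LazyInputsSize(List<Path> inputs) {
    this.inputs = List.copyOf(inputs);
  }

  /** Returns the aggregate input size in bytes, computing it on first use only. */
  synchronized long totalBytes() throws IOException {
    if (cachedBytes < 0) {
      long sum = 0;
      for (Path input : inputs) {
        sum += Files.size(input); // the costly part: one stat per input
        statCalls++;
      }
      cachedBytes = sum;
    }
    return cachedBytes;
  }

  synchronized int statCalls() {
    return statCalls;
  }
}
```

Both the memory prediction before the spawn and the log line after it would call `totalBytes()` on the same instance, so the second call is served from the cached value.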
I'm interested in / watching #17368 and haven't had a chance to try this out, but I wonder if the constants used to estimate resources should be more configurable rather than hardcoded. Resource consumption likely varies greatly based on the toolchain/linker, the flags used (LTO, in particular, results in much more expensive gcc links), or details about the code being linked. Perhaps not in this MR, but maybe in the future we could consider command line flags to modify these constants, or a configurable bit added to cc_toolchain?
@kjteske Oh yes, definitely. I'd prefer to see the toolchain definition extended via a function that allows the dynamic computation of compiler / linker resources because, as you say, there are too many variables to take into account and a single model isn't going to work. That said, such a function will need to receive details about the action inputs as parameters and it should only be executed if the action is going to run locally. So I think this PR takes us in that direction anyway. Replacing the built-in model with an externally-supplied one should be a separate change. Plus, having this merged would help us maintain our own internal patch of the model. After this change is in, the model tweaks are a few trivial modifications to just a few lines, which is much easier to maintain than this whole thing.
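A rough sketch of what such a pluggable model could look like, assuming nothing about Bazel's real toolchain API: `LinkMemoryEstimate`, `Model`, and `linear` are hypothetical names, the coefficients passed to `linear` are placeholders rather than measured values, and returning 0 for non-local execution is just one possible convention. The point it illustrates is that the model receives input details as parameters and that the costly input-size computation is only triggered when the action runs locally.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of an externally supplied link memory model. */
final class LinkMemoryEstimate {
  /** A model maps details about the action inputs to a memory estimate in MB. */
  interface Model {
    double estimateMb(int inputCount, long inputBytes);
  }

  /** Simple linear model: a fixed base plus a cost per MiB of input. */
  static Model linear(double baseMb, double mbPerInputMib) {
    return (inputCount, inputBytes) ->
        baseMb + mbPerInputMib * (inputBytes / (1024.0 * 1024.0));
  }

  /**
   * Evaluates the model only for local execution. The input size is passed as a
   * supplier so that the costly computation is skipped for non-local actions.
   */
  static double estimateMb(
      Model model, boolean runsLocally, int inputCount, LongSupplier inputBytes) {
    if (!runsLocally) {
      return 0.0; // assumption: no local memory is reserved for non-local actions
    }
    return model.estimateMb(inputCount, inputBytes.getAsLong());
  }
}
```

A toolchain-specific model (for example, one tuned for LTO links) would then be a different `Model` instance rather than a patch to hardcoded constants.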
Would love to see this change too; I have been harassed for the past month about builds being killed :)
a0a4874 to ab31c6b
logger.atWarning().log(
    "Linker metrics: failed to get size of %s: no metadata", input.getExecPath());
}
} catch (Exception e) {
Let's not swallow Exceptions en bloc. Either only swallow IOException (worse) or re-throw it wrapped into ExecException (better).
The reason for swallowing exceptions here was that the input sizes aren't used yet, so it seems too heavy-handed to fail hard if something goes wrong. The idea was to have this in place first to measure linker behavior by looking at the logs, and once we have data and a better model, we can put it in place (and then make this a hard failure once we have observed no failures).
I've changed the code to catch IOException (the broad Exception was a left-over from a previous draft) and added a TODO. What do you think?
I don't have a strong opinion here; I'd prefer failing hard because if stat() doesn't work on an input file, the action will very likely not succeed anyway (every input file must be checksummed so that the action checksum can be computed).
A mitigating factor is that this IOException only ever happens if I/O is needed to get the metadata, which AFAIU only ever happens during action input discovery, which is not something link actions do, so arguably the way this exception is handled is only a theoretical concern.
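The two options discussed in this thread can be contrasted in a small sketch. Everything here is hypothetical: `InputSizePolicy` and `SizeSource` are illustrative names, `EstimationException` is a stand-in for Bazel's ExecException (not available in a standalone snippet), and `java.util.logging` replaces the logger used in the PR. Both variants catch only IOException, so unrelated runtime exceptions are not swallowed.

```java
import java.io.IOException;
import java.util.OptionalLong;
import java.util.logging.Logger;

/** Hypothetical sketch contrasting lenient and strict handling of stat failures. */
final class InputSizePolicy {
  private static final Logger logger = Logger.getLogger(InputSizePolicy.class.getName());

  /** Stand-in for a metadata lookup that may need I/O. */
  interface SizeSource {
    long sizeOf(String execPath) throws IOException;
  }

  /** Stand-in for Bazel's ExecException. */
  static final class EstimationException extends Exception {
    EstimationException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Lenient: log the IOException and report the size as unknown. */
  static OptionalLong sizeOrUnknown(SizeSource source, String execPath) {
    try {
      return OptionalLong.of(source.sizeOf(execPath));
    } catch (IOException e) {
      logger.warning(
          "Linker metrics: failed to get size of " + execPath + ": " + e.getMessage());
      return OptionalLong.empty();
    }
  }

  /** Strict: re-throw the IOException wrapped, so the caller fails hard. */
  static long sizeOrFail(SizeSource source, String execPath) throws EstimationException {
    try {
      return source.sizeOf(execPath);
    } catch (IOException e) {
      throw new EstimationException("failed to get size of " + execPath, e);
    }
  }
}
```

The lenient variant matches the "measure first, fail hard later" plan described above; the strict variant is what the reviewer prefers once the sizes feed into the model.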
My bad. Tests should be fixed now.
FYI: I have to change the log level to
lazyData = doCostlyEstimation();
}

logger.atInfo().log(
Late to the party - do you need this log for every link action or is it ok to get a sample? I am asking because this seems to cause a bit of log spam when looking at large builds.
Oops, sorry for the late reply.
Yes, I kinda needed it for every link action. If I look at our build, there are lots and lots of small link actions, but towards the end of the build, there are only a few large ones, each with a different memory "profile". If I were to sample these, I'd lose valuable information at that stage.
Is this still a problem after @lberki's change to decrease the logging level?
Update: this logging was removed in 9291ed0 (at head). Hopefully Bazel 7.0 gives you enough time to collect the necessary statistics, which can then be used on CppLinkAction in Starlark. Unfortunately I had no easy way to keep supporting the logging at head.
@comius Thanks for the heads up. I wonder though... what's the plan to be able to measure this long-term? Modeling the behavior of the compiler and linker is not a one-time thing. The compiler and linker evolve over time, and Bazel supports many more than just one pair, so there will always be a need to evaluate their resource consumption in order to tweak Bazel's internal model...
One possibility is to add logging for all spawn actions or, if that's too much, implement a filter to log only a given set of mnemonics. cc @lberki wdyt?
I think that's very reasonable. Logging the estimated / consumed resource use has value for everything, not just for C++ linking, and honestly, the way we logged the data for C++ linking was kind of a wart. I could imagine emitting the same data on (e.g.) the BES so that people don't have to pore over the Java logs. In general, moving all this to the SpawnAction level would be great.
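A minimal sketch of the mnemonic filter floated above, assuming nothing about how Bazel would actually implement it: `SpawnResourceLogFilter` is a hypothetical class, the line format is invented for illustration, and treating an empty set as "log every spawn" is a design assumption.

```java
import java.util.Set;

/** Hypothetical filter deciding which spawns get a resource-usage log line. */
final class SpawnResourceLogFilter {
  private final Set<String> mnemonics;

  /** An empty set is taken to mean "log every spawn" (an assumption of this sketch). */
  SpawnResourceLogFilter(Set<String> mnemonics) {
    this.mnemonics = Set.copyOf(mnemonics);
  }

  boolean shouldLog(String mnemonic) {
    return mnemonics.isEmpty() || mnemonics.contains(mnemonic);
  }

  /** Formats estimated vs. consumed memory for one spawn; the format is illustrative. */
  static String format(
      String mnemonic, int inputCount, long inputBytes, long estimatedMb, long consumedMb) {
    return String.format(
        "%s: inputs=%d, inputBytes=%d, estimatedMb=%d, consumedMb=%d",
        mnemonic, inputCount, inputBytes, estimatedMb, consumedMb);
  }
}
```

With a filter built from a single mnemonic such as `CppLink`, large builds would only emit lines for link actions, which addresses the log spam concern while keeping every link action visible.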