Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow CppLinkAction to inspect input sizes in resource estimation #19203

Closed
wants to merge 4 commits into from

Conversation

jmmv
Copy link
Contributor

@jmmv jmmv commented Aug 8, 2023

This PR instrumets the CppLinkAction to print log lines for every link operation and dumps the number of inputs, the aggregate size of the inputs, the predicted memory usage, and the actual consumed memory.

Calculating the aggregate size of the inputs is a costly operation because we need to expand a nested set and stat all of its files, which explains why this PR is complex. I've modified the local memory estimator to compute the inputs size exactly once per link as we need this information both to predict memory consumption and to emit the log line after the spawn completes.

This is meant to aid in assessing the local resources prediction for linking with the goal of adjusting the model, but that should be done in a separate change.

Helps address #17368.

This PR instrumets the CppLinkAction to print log lines for every link
operation and dumps the number of inputs, the aggregate size of the
inputs, the predicted memory usage, and the actual consumed memory.

Calculating the aggregate size of the inputs is a costly operation
because we need to expand a nested set and stat all of its files, which
explains why this PR is complex.  I've modified the local memory
estimator to compute the inputs size exactly once per link as we need
this information both to predict memory consumption _and_ to emit the
log line after the spawn completes.

This is meant to aid in assessing the local resources prediction for
linking with the goal of adjusting the model, but that should be done
in a separate change.

Helps address bazelbuild#17368.
@github-actions github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-Rules-CPP Issues for C++ rules labels Aug 8, 2023
@kjteske
Copy link
Contributor

kjteske commented Aug 9, 2023

I'm interested / watching #17368, and haven't had a chance to try this out, but wonder if the constants used to estimate resources should be more configurable, rather than hardcoded - resource consumption likely varies greatly based on toolchain/linker, flags used (LTO, in particular, results in much more expensive gcc links), or details about the code being linked.

Perhaps not in this MR, but maybe in the future, we could consider command line flags to modify these constants, or a configurable bit added to cc_toolchain?

@jmmv
Copy link
Contributor Author

jmmv commented Aug 9, 2023

@kjteske Oh yes, definitely. I'd prefer to see the toolchain definition extended via a function that allows the dynamic computation of compiler / linker resources because, as you say, there are too many variables to take into account and a single model isn't going to work.

That said, such a function will need to receive details about the action inputs as parameters and it should only be executed if the action is going to run locally. So I think this PR takes us in that direction anyway. Replacing the built-in model with an externally-supplied one should be a separate change.

Plus, having this merged would help us maintain our own internal patch of the model. After this change is in, the model tweaks are a few trivial modifications to just a few lines, which is much easier to maintain than this whole thing.

@nikhilkalige
Copy link
Contributor

Would love to see this change too, I am being harassed for the past month about builds being killed :)

@oquenchil oquenchil requested review from comius and buildbreaker2021 and removed request for oquenchil August 9, 2023 07:58
logger.atWarning().log(
"Linker metrics: failed to get size of %s: no metadata", input.getExecPath());
}
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not swallow Exceptions en bloc. Either only swallow IOException (worse) or re-throw it wrapped into ExecException (better)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for swallowing exceptions here was that the input sizes aren't used yet, so it seems too heavy-handed to fail hard if something goes wrong. The idea was to have this in place first to measure linker behavior by looking at the logs, and once we have data and a better model, we can put it in place (and then make this a hard failure once we have observed no failures).

I've changed the code to catch IOException (the broad Exception was a left-over from a previous draft) and added a TODO.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion here; I'd prefer failing hard because if stat() doesn't work on an input file, the action will very likely not succeed anyway (every input file must be checksummed so that the action checksum can be computed).

A mitigating factor is that this IOException only ever happens if I/O is needed to get the metadata, which AFAIU only ever happens during action input discovery, which is not something link actions do so arguably the way this exception is handled is only a theoretical concern.

@lberki lberki added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Aug 11, 2023
@iancha1992
Copy link
Member

@lberki @jmmv I have retried. But looks like the tests are not passing. Can you please take a look?

cc: @bazelbuild/triage

@jmmv
Copy link
Contributor Author

jmmv commented Aug 11, 2023

My bad. Tests should be fixed now.

@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Aug 17, 2023
@lberki
Copy link
Contributor

lberki commented Sep 11, 2023

FYI: I have to change the log level to FINE because it's spamming our logs quite a bit, and my professional estimate is that there are way fewer people who are interested in this log line than those who aren't so it's better to impose extra work (setting up logging so that these lines get logged) on the former group.

lazyData = doCostlyEstimation();
}

logger.atInfo().log(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late to the party - do you need this log for every link action or is it ok to get a sample? I am asking because this seems to cause a bit of log spam when looking at large builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry for the late reply.

Yes, I kinda needed it for every link action. If I look at our build, there are lots and lots of small link actions, but towards the end of the build, there are only a few large ones, each with a different memory "profile". If I were to sample these, I'd lose valuable information at that stage.

Is this still a problem after @lberki 's change to decrease the logging level?

@sluongng
Copy link
Contributor

sluongng commented Nov 28, 2023

@lberki @jmmv It seems this PR was missed during cherry-pick to 7.0?

Edit: nvm, it did make it in. Sorry for the noise. 🙇

@comius
Copy link
Contributor

comius commented Jan 5, 2024

Update: this logging was removed in 9291ed0 (at head). Hopefully Bazel 7.0 gives you enough time to collect the necessary statistics, that can then be used on CppLinkAction in Starlark.

Unfortunately I had no easy way to keep supporting the loggint at head.

@jmmv
Copy link
Contributor Author

jmmv commented Jan 19, 2024

@comius Thanks for the heads up. I wonder though... what's the plan to be able to measure this long-term? Modeling the behavior of the compiler and linker is not a one-time thing. The compiler and linker evolve over time, and Bazel supports many more than just one pair, so there will always be a need to evaluate their resource consumption in order to tweak Bazel's internal model...

@comius
Copy link
Contributor

comius commented Feb 5, 2024

One possibility is to add logging for all spawn actions or if that’s too much implement a filter to log only a given set of mnemonics. cc @lberki wdyt?

@lberki
Copy link
Contributor

lberki commented Feb 6, 2024

I think that's very reasonable. Logging the estimate / consume resource use has value for everything, not just for C++ linking and honestly, the way we logged the data for C++ linking was kind of a wart. I could imagine emitting the same data on (e.g.) the BES so that people don't have to pore over the Java logs.

In general, moving all this to the SpawnAction level would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Rules-CPP Issues for C++ rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants