
Adding CPU Measurements and Unit Tests #14470

Open · wants to merge 45 commits into base: master

Conversation

@ecngtng (Contributor) commented Dec 23, 2021

This commit reports CPU user time and CPU system time, which are
already available but not printed in any log.

Part one

The solution adds CPU metrics after a build as seen below:

INFO: 2 processes: 1 internal, 1 linux-sandbox.
INFO: Total action wall time 70.05s
INFO: Critical path 70.08s (setup 0.00s, action wall time 70.05s)
INFO: Elapsed time 71.58s (preparation 1.45s, execution 70.13s)
INFO: CPU time 5.31s (user 0.00s, system 0.01s, bazel 5.29s)
INFO: Build completed successfully, 2 total actions

If any of the values is not collected, "???" is printed instead:

INFO: 415 processes: 1 internal, 414 processwrapper-sandbox.
INFO: Total action wall time 127.84s
INFO: Critical path 2.58s (setup 0.00s, action wall time 2.51s)
INFO: Elapsed time 38.37s (preparation 3.98s, execution 34.39s)
INFO: CPU time ???s (user 79.45s, system 22.39s, bazel ???s)
INFO: Build completed successfully, 415 total actions

Note:
The values are only presented if the following flag is set:
--experimental_stats_summary

The following flags must be set to get the measurements collected:
--experimental_collect_local_sandbox_action_metrics
--experimental_profile_cpu_usage (set by default)

Part two

A unit test is added in runtime-tests.
Test command:
bazel test --experimental_stats_summary //src/test/java/com/google/devtools/build/lib:runtime-tests

@ecngtng (Contributor Author) commented Dec 27, 2021

Hello, I couldn't reproduce the issue locally. Does anyone know where to find the logs below?
FAILED: //src/test/java/com/google/devtools/build/lib:runtime-tests (Summary)
/private/var/tmp/_bazel_buildkite/5e1880647af908bfbf43bdcc50066287/execroot/io_bazel/bazel-out/darwin-fastbuild/testlogs/src/test/java/com/google/devtools/build/lib/runtime-tests/test.log
/private/var/tmp/_bazel_buildkite/5e1880647af908bfbf43bdcc50066287/execroot/io_bazel/bazel-out/darwin-fastbuild/testlogs/src/test/java/com/google/devtools/build/lib/runtime-tests/test_attempts/attempt_1.log
/private/var/tmp/_bazel_buildkite/5e1880647af908bfbf43bdcc50066287/execroot/io_bazel/bazel-out/darwin-fastbuild/testlogs/src/test/java/com/google/devtools/build/lib/runtime-tests/test_attempts/attempt_2.log


Note:
Because the Reporter class is declared 'final', it cannot be mocked in the unit test.
We did not find a better way to solve this issue,
so this commit removes the 'final' keyword from Reporter.
@ecngtng (Contributor Author) commented Jan 4, 2022

Here is the answer to my question above:

  1. Choose the failed test suite by clicking the "Details" link in GitHub. That takes you to https://buildkite.com
  2. Choose the failed test suite again on buildkite.com.
  3. Switch the tab from "Log" to "Artifacts" for the suite on buildkite.com to access the log files.

@ecngtng (Contributor Author) commented Jan 4, 2022

The new commit fixes the issue on macOS. It occurs because macOS uses ',' to separate the integer and fractional parts, instead of '.' as on other OSes.

E.g.: on macOS it's '9,00'; on other OSes it's '9.00'.
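This locale behavior is easy to reproduce with plain `String.format` (a minimal sketch, independent of the PR's code; the locale choices are illustrative):

```java
import java.util.Locale;

public class DecimalSeparatorDemo {
    // String.format without an explicit Locale uses Locale.getDefault(),
    // so "%.2f" renders with '.' or ',' depending on the JVM's locale.
    public static String format(Locale locale, double seconds) {
        return String.format(locale, "%.2f", seconds);
    }

    public static void main(String[] args) {
        System.out.println(format(Locale.US, 9.0));      // 9.00
        System.out.println(format(Locale.GERMANY, 9.0)); // 9,00
    }
}
```

This suggests the separator is a property of the locale, not of the OS itself.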

@aiuto aiuto requested a review from meisterT January 19, 2022 04:38
@aiuto aiuto added team-Performance Issues for Performance teams untriaged labels Jan 19, 2022
@larsrc-google (Contributor):

@ecngtng It's entirely possible that the difference in punctuation is not due to being on Mac per se, but due to macOS applying different i18n settings.

@ecngtng (Contributor Author) commented Jan 24, 2022

@larsrc-google So in this case, is it reasonable to select the punctuation based on keywords in the OS name? Or is there a better way?

@ulrfa (Contributor) commented Jan 24, 2022

Distinguishing on OS might cause problems for users in other countries with different default Locale.

One alternative could be to get the JVM's decimal separator for the current Locale.getDefault(), via DecimalFormatSymbols.getInstance().getDecimalSeparator(), and use that as the condition.

Another alternative could be to use String.format also when constructing the expected result in the test case.
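The first alternative could be sketched as follows (class and method names here are mine, not from the PR):

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class SeparatorProbe {
    // Asks the JVM which decimal separator the given locale uses;
    // calling this with Locale.getDefault() matches the suggestion above.
    public static char separatorFor(Locale locale) {
        return DecimalFormatSymbols.getInstance(locale).getDecimalSeparator();
    }

    public static void main(String[] args) {
        System.out.println(separatorFor(Locale.getDefault()));
    }
}
```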

Change distinguishing from OS to getDecimalSeparator().
@ecngtng (Contributor Author) commented Jan 25, 2022

@ulrfa @larsrc-google Thanks for your good comments. The code has been updated accordingly.

@meisterT (Member):

Is this the right place to expose it? Most users likely don't care about CPU time.

The idea behind --experimental_stats_summary is to provide the most useful information in a more concise way.

If a user cares, they can always look it up in the BEP and the trace profile.

@ulrfa (Contributor) commented Jan 26, 2022

Thank you @meisterT for reviewing!

AFAIK, BEP provides CPU time only for:

  • The JVM process.

and the --profile json file contains CPU usage only for:

  • The JVM Process.
  • All processes running on local host (including other builds and other concurrently executing processes).

I assume the metrics above are relevant mostly to developers of the Bazel codebase itself. But this PR provides CPU time for:

  • The JVM process. (“bazel”)
  • All subprocesses spawned in sandboxes for the particular build. (“user” and “system”)

I think the CPU time, for the JVM and subprocesses combined, is one of the most relevant metrics for a user that would like to grasp the overall resource consumption of a build.

In the future, I imagine storing each subprocess's CPU time per action in the profile JSON file, and then computing a summary grouped by mnemonic, to see how much CPU each tool in the build chain consumes. But I think this PR is a good start and makes the information conveniently available.
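The per-mnemonic summary described above could, in principle, be computed like this (a hypothetical sketch; the input layout and all names are mine, not part of the PR):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MnemonicCpuSummary {
    // Sums subprocess CPU milliseconds per action mnemonic.
    // Input arrays are parallel: action i has mnemonics[i] and cpuMillis[i].
    public static Map<String, Long> byMnemonic(String[] mnemonics, long[] cpuMillis) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (int i = 0; i < mnemonics.length; i++) {
            totals.merge(mnemonics[i], cpuMillis[i], Long::sum);
        }
        return totals;
    }
}
```

For example, three actions with mnemonics Javac, Javac, CppCompile and CPU times 100, 50, 30 ms would yield {Javac=150, CppCompile=30}.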

Perhaps --experimental_stats_summary is not suitable for this. @meisterT, do you have ideas about other ways to control verbosity?

@ecngtng (Contributor Author) commented Feb 10, 2022

@meisterT What are your thoughts on @ulrfa's response? :)

@larsrc-google larsrc-google left a comment


I started a review, then realized a general problem: action CPU time is readily available for standalone/sandboxed actions. For singleplex worker actions, it could be found by taking the process's CPU time before and after making a request (I don't think we do that by default, but we could). But multiplex worker actions simply don't have a well-defined notion of per-action CPU time (actually even the singleplex case would paper over some things, like GC between requests). Since even a single action without CPU time makes the whole total unknown, it will probably end up being unknown more often than not.

The way around this would be to measure non-worker actions the way you currently do, but measure worker actions by taking the CPU time at start and end (of build or of worker).
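The before/after measurement suggested here reduces to a simple delta pattern (a sketch under the assumption that some monotonically increasing CPU-time counter for the worker is readable; the supplier is a stand-in, not a real Bazel API):

```java
import java.util.function.LongSupplier;

public class CpuDelta {
    // Samples a CPU-time counter before and after running the request
    // and returns the difference, i.e. the CPU time attributed to it.
    public static long measureNanos(LongSupplier cpuTimeNanos, Runnable request) {
        long before = cpuTimeNanos.getAsLong();
        request.run();
        return cpuTimeNanos.getAsLong() - before;
    }
}
```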

@@ -60,6 +63,10 @@
private long executionEndMillis;
private SpawnStats spawnStats;
private Path profilePath;
private static final long UNKNOWN = -1;
Contributor:

Could you rename this to "UNKNOWN_CPU_TIME" to be a bit clearer?

Optional<Duration> cpuUserTimeForActionsDuration = event.getActionResult().cumulativeCommandExecutionUserTime();

if(cpuUserTimeForActionsDuration.isPresent() && (cpuUserTimeForActions != UNKNOWN)) {
cpuUserTimeForActions = cpuUserTimeForActions + cpuUserTimeForActionsDuration.get().toMillis();
Contributor:

You can use += here and below.
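Applied to the quoted snippet, the suggestion could look like the sketch below. Class and method names are mine, and the empty-to-unknown transition is my assumption about the PR's intent (it matches the "???" output shown in the description):

```java
import java.time.Duration;
import java.util.Optional;

public class CpuAccumulator {
    static final long UNKNOWN_CPU_TIME = -1; // sentinel, renamed per the review

    private long cpuUserTimeForActions = 0;

    // Accumulates with += as suggested; once any sample is missing,
    // the total becomes (and stays) unknown.
    public void add(Optional<Duration> userTime) {
        if (!userTime.isPresent()) {
            cpuUserTimeForActions = UNKNOWN_CPU_TIME;
        } else if (cpuUserTimeForActions != UNKNOWN_CPU_TIME) {
            cpuUserTimeForActions += userTime.get().toMillis();
        }
    }

    public long millis() {
        return cpuUserTimeForActions;
    }
}
```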

}
Optional<Duration> cpuUserTimeForActionsDuration = event.getActionResult().cumulativeCommandExecutionUserTime();

if(cpuUserTimeForActionsDuration.isPresent() && (cpuUserTimeForActions != UNKNOWN)) {
Contributor:

Nit: Unnecessary extra parentheses.

}
}

private static long sumCpuTimes(long a, long b, long c) {
Contributor:

Since you're hardcoding this to three parameters, you might as well name them with what kinds of CPU times they are.
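A sketch of the renamed version; the unknown-propagation mirrors the "???" behavior shown in the PR description, but the exact body is my assumption, not the PR's code:

```java
public class CpuSum {
    static final long UNKNOWN_CPU_TIME = -1;

    // Named parameters instead of a/b/c; if any component is unknown,
    // the total is unknown too (rendered as "???" in the summary line).
    static long sumCpuTimes(long userMillis, long systemMillis, long bazelMillis) {
        if (userMillis == UNKNOWN_CPU_TIME
            || systemMillis == UNKNOWN_CPU_TIME
            || bazelMillis == UNKNOWN_CPU_TIME) {
            return UNKNOWN_CPU_TIME;
        }
        return userMillis + systemMillis + bazelMillis;
    }
}
```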

when(result.cumulativeCommandExecutionUserTime()).thenReturn(Optional.empty());
}
else{
when(result.cumulativeCommandExecutionUserTime()).thenReturn(Optional.of(userTime));
Contributor:

You can simplify these with Optional.ofNullable().
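The simplification works because `Optional.ofNullable` maps null to an empty Optional and non-null to a present one, collapsing the if/else in the mock setup (a minimal illustration, not the test's actual code):

```java
import java.time.Duration;
import java.util.Optional;

public class OfNullableDemo {
    // Replaces: if (userTime == null) Optional.empty() else Optional.of(userTime)
    public static Optional<Duration> wrap(Duration userTime) {
        return Optional.ofNullable(userTime);
    }
}
```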

field1.setLong(buildSummaryStatsModule, 11000);
buildSummaryStatsModule.buildComplete(createBuildEvent());
if(Comma) {
verify(reporterMock).handle(Event.info("CPU time 88,00s (user 55,00s, system 22,00s, bazel 11,00s)"));
Contributor:

I think the comma problem would be easier to understand if you use String.format here instead of having the Comma variable. A note on why it's done that way would be good, though.

Contributor Author:

For this comment, I think you want it to be changed like this:
verify(reporterMock).handle(Event.info(String.format("CPU time %.2fs (user %.2fs, system %.2fs, bazel %.2fs)", 88.00, 55.00, 22.00, 11.00)));
Is that right?

Contributor:

Yes, that should do it. Plus a comment on why it's not just a fixed string.

Contributor Author:

I hope that, this way, the system determines whether to use '.' or ',' to express the floating-point number.

@ckolli5 ckolli5 added the awaiting-review PR is awaiting review from an assigned reviewer label Apr 26, 2022
@ckolli5 commented Apr 27, 2022

Hello @ecngtng, could you please confirm that all the review comments are taken care of? Thanks.

@ckolli5 ckolli5 added awaiting-user-response Awaiting a response from the author and removed awaiting-review PR is awaiting review from an assigned reviewer labels Apr 27, 2022
@ecngtng (Contributor Author) commented Apr 27, 2022

@ckolli5 Yes, all the comments are taken care of.

@ckolli5 ckolli5 added awaiting-review PR is awaiting review from an assigned reviewer and removed awaiting-user-response Awaiting a response from the author labels May 5, 2022
@ecngtng (Contributor Author) commented Feb 22, 2023

@larsrc-google The code has been changed according to your comments. Could you please take a look? Thanks!

@@ -206,6 +229,34 @@ public void buildComplete(BuildCompleteEvent event) {
criticalPathComputer = null;
}
profilePath = null;
cpuUserTimeForActions = Duration.ofMillis(0);
Contributor:

Zeroing them on startup would also make it clearer that they are properly initialized.

@larsrc-google (Contributor):

Thank you. You missed one comment from before, and I added a couple small things.

@ecngtng (Contributor Author) commented Feb 23, 2023

@larsrc-google Thanks for your good comments. All of them have been addressed except the one where I'm not sure I understood your point. Regarding "Zeroing them on startup would also make it clearer that they are properly initialized.": the zeroing is done in both executorInit() and buildComplete(). Is that not OK?

@larsrc-google (Contributor):

I don't see the zeroing happening in executorInit(), only in buildComplete(). If you zeroed them in executorInit(), that would be sufficient.

@ecngtng (Contributor Author) commented Feb 23, 2023

@larsrc-google You're right. After checking, I found they really were not in the last commits, so I have added them back now. (Interestingly, I had checked the code earlier and confirmed they were merged several days ago.)

@ecngtng (Contributor Author) commented Feb 24, 2023

@larsrc-google The code has been changed and all checks have passed. Could you please take a look? Thanks!

@larsrc-google (Contributor):

Looks nice. I'll do some extra checks internally, and then I think we can merge it.

@larsrc-google (Contributor):

Getting bogged down in all the technical bits, I forgot about @meisterT's comments above. Sorry. We want to reduce the terminal output, not increase it. Putting it in the log (as indicated in the original comment) would be fine, though.

@sgowroji (Member):

Hi @wilwell, Since I can see that this PR has been approved, please let me know whether I should proceed with importing it.

@larsrc-google (Contributor):

No, not yet. As per my last comment adding this info to the output from every build is not desirable.

Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Performance Issues for Performance teams

9 participants