build.d: Reduce the default number of jobs in cgroups #15504

Closed
tim-dlang wants to merge 1 commit from the circleci_failures branch

tim-dlang
Contributor

@tim-dlang tim-dlang commented Aug 5, 2023

CircleCI runs the tests on a server with many CPUs, but restricts the
number of CPUs and the memory using Linux cgroups. Using the total
number of CPUs as the default number of jobs can result in many
parallel DMD processes, which can consume too much memory. This can
result in random failures.

This commit tries to detect a reduced number of CPUs, so the number of
jobs can be decreased.
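
A minimal sketch of the idea, not the code from this commit: it assumes the limit is exposed in /sys/fs/cgroup/cpu/cpu.shares (the file named in the review excerpt further down) and that 1024 shares correspond to one CPU. The helper names `jobsFromCpuShares` and `defaultJobs` are hypothetical.

```d
import std.algorithm.comparison : max, min;
import std.conv : ConvException, to;
import std.file : exists, readText;
import std.parallelism : totalCPUs;
import std.string : strip;

/// Hypothetical helper: derive a job count from the text of cpu.shares.
/// Returns `total` unchanged when the text is not an unsigned integer.
uint jobsFromCpuShares(string sharesText, uint total)
{
    enum uint sharesPerCpu = 1024; // assumption: 1024 shares == one CPU
    try
    {
        immutable shares = sharesText.strip.to!uint;
        // Round down, but use at least one job and at most `total` jobs.
        return min(max(shares / sharesPerCpu, 1u), total);
    }
    catch (ConvException)
    {
        return total;
    }
}

/// Hypothetical helper: fall back to totalCPUs when the file is absent.
uint defaultJobs()
{
    enum path = "/sys/fs/cgroup/cpu/cpu.shares";
    if (!path.exists)
        return totalCPUs;
    return jobsFromCpuShares(readText(path), totalCPUs);
}
```

With this sketch, a container limited to 2048 shares on a 36-CPU host would default to 2 jobs instead of 36.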

@dlang-bot
Contributor

Thanks for your pull request and interest in making D better, @tim-dlang! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#15504"

@tim-dlang tim-dlang marked this pull request as draft August 5, 2023 17:54
@tim-dlang
Contributor Author

From the CircleCI documentation:

Java, Erlang and any other languages that introspect the /proc directory for information about CPU count may require additional configuration to prevent them from slowing down when using the CircleCI resource class feature. Programs with this issue may request 32 CPU cores and run slower than they would when requesting one core. Users of languages with this issue should pin their CPU count to their guaranteed CPU resources.

build.d uses totalCPUs to determine the default number of jobs, and the CircleCI log for this pull request shows a value of 36. This could explain random failures such as those seen in dlang/dlang.org#3681, because that many parallel jobs can run out of memory.

@tim-dlang tim-dlang force-pushed the circleci_failures branch 7 times, most recently from 3b9327f to 03e5e5d Compare August 5, 2023 18:48
@tim-dlang tim-dlang changed the title from "WIP: Log the number of jobs" to "build.d: Reduce the default number of jobs in cgroups" Aug 5, 2023
@tim-dlang tim-dlang marked this pull request as ready for review August 5, 2023 20:07
}
catch (ConvException)
{
stderr.writeln("Warning: /sys/fs/cgroup/cpu/cpu.shares contains unknown value:", cpuSharesStr);
Contributor

This is an uninformative warning message. It should tell what it was looking for, what the consequence of the failure is, and suggest how to fix it.
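
A rough sketch of what a warning along those lines might look like. The helper `cpuSharesWarning` and its wording are hypothetical, and the suggested remedy assumes build.d can be given the number of jobs explicitly.

```d
import std.format : format;

/// Hypothetical helper: build a warning that names the file and the expected
/// content, states the consequence, and suggests a possible fix.
string cpuSharesWarning(string path, string content, uint fallbackJobs)
{
    return format(
        "Warning: expected a positive integer in %s, but found '%s'. "
        ~ "Falling back to %d parallel jobs, which may exceed the CPUs and "
        ~ "memory available in this environment. "
        ~ "Set the number of jobs explicitly to avoid this.",
        path, content, fallbackJobs);
}
```

The message could then be written with `stderr.writeln(cpuSharesWarning(path, cpuSharesStr, totalCPUs))` inside the `catch (ConvException)` block.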

@tim-dlang
Contributor Author

I now think using the cpu.shares file is not the best way. The factor 1024 is the default and used by CircleCI for one CPU, but other environments could use different values.
Maybe it would be better to set the number of jobs in the .circleci/run.sh files. The Makefiles would need to forward the parameter to build.d.
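
For illustration only, a sketch of how an explicitly passed job count could take precedence over a detected default. It assumes an option spelled `-j`/`--jobs`, which is an assumption here, and the helper `jobsFromArgs` is hypothetical. It uses `std.getopt`, which leaves the target variable untouched when the option is absent.

```d
import std.getopt : getopt;

/// Hypothetical helper: an explicit -j/--jobs argument wins over the
/// detected default; without it the default is returned unchanged.
/// args[0] is expected to be the program name, as in main().
uint jobsFromArgs(string[] args, uint detectedDefault)
{
    uint jobs = detectedDefault;
    // getopt throws on unrecognized options; this sketch only handles -j/--jobs.
    getopt(args, "j|jobs", &jobs);
    return jobs;
}
```

A CI script could then pin the value, for example by passing `--jobs=4`, while local builds keep the detected default.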

@Imperatorn
Contributor

Any idea why it reports 36 instead of 32?

@dlang-bot dlang-bot removed the stalled label Nov 8, 2023
@tim-dlang
Contributor Author

> Any idea why it reports 36 instead of 32?

I don't know exactly. Maybe they only had servers with 32 CPU cores when the documentation was written, and later added servers with 36 cores. If they run a mix of servers, that could also explain why the tests only sometimes run out of memory.

@tim-dlang
Contributor Author

Closing in favour of #15799 and dlang/dlang.org#3724.

4 participants