[cgroups2] Introduced API to kill the processes inside of cgroup subtree. #550

Closed
DevinLeamy wants to merge 1 commit into apache:master from DevinLeamy:cgroups2-destroy

@DevinLeamy
Contributor

Introduces

	cgroups2::kill(cgroup)

which will recursively kill all of the processes in a cgroup subtree.

We additionally update cgroups2::destroy to use cgroups2::kill such that it now completely destroys a cgroup (i.e. all directories and processes).

Comment on lines +392 to +394
vector<string> sorted(cgroups->begin(), cgroups->end());
sorted.push_back(cgroup);
std::sort(sorted.rbegin(), sorted.rend());
Contributor Author

We sort the cgroups here so they are ordered by decreasing nesting depth.
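
A minimal sketch of that ordering, assuming cgroup paths are plain strings
relative to the hierarchy root (the function name `removalOrder` is
hypothetical, not part of the patch). A parent path is a strict prefix of
each of its descendants, so a reverse lexicographic sort places every
cgroup before any of its ancestors, which is the property `rmdir` needs:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns `root` plus `descendants`, ordered so that every cgroup
// appears before any of its ancestors (children first, root last).
std::vector<std::string> removalOrder(
    const std::string& root,
    const std::vector<std::string>& descendants)
{
  std::vector<std::string> sorted(descendants.begin(), descendants.end());
  sorted.push_back(root);

  // Sorting through reverse iterators yields descending order.
  std::sort(sorted.rbegin(), sorted.rend());
  return sorted;
}
```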

// the most deeply nested directories first.
Try<Nothing> kill = cgroups2::kill(cgroup);
if (kill.isError()) {
return Error("Failed to kill processes in cgroup");
Contributor

should include the error here

@bmahler bmahler closed this in 756b7d7 Apr 5, 2024
@DevinLeamy DevinLeamy deleted the cgroups2-destroy branch April 8, 2024 13:50
andreaspeters added a commit to m3scluster/clusterd that referenced this pull request Jul 25, 2025
* [cgroups2] Introduce build files for the cgroups2 `Controller` abstraction.

* [cgroups2] Introduces a `Controller` abstraction for cgroups v2 controllers.

NOTE: In cgroups v1 we call "controllers" "subsystems". In cgroups v2,
      we exclusively use the term "controller", which is what is used
      in the Linux documentation for cgroups v2.

For cgroups v1, the `Subsystem` abstraction is used to represent a cgroup
controller. `Subsystem`s exist for each of the controllers provided by
the cgroups v1 API.

We do the same for cgroups v2, now introducing the `Controller` abstraction.
The difference between a `Controller` and a `Subsystem` - besides the
name - is that a `Controller` does not have an associated hierarchy. This is
because in cgroups v2, controllers (a.k.a. subsystems) do not need to be
individually mounted. All controllers are "mounted" under the unified
`cgroup2` filesystem that we require be mounted at `/sys/fs/cgroup`.

The `Cgroups2IsolatorProcess` delegates resource isolation requests to
`Controller`s instead of `Subsystem`s.

* [cgroups2] Removed templatized write() method.

`cgroups2::write` does not need to be a template function (unlike
`cgroups2::read`) because standard C++ overloading is sufficient to
handle writing multiple different types, without definition conflicts.

Hence, we make `cgroups2::write` not a template function.
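
A small sketch of the language point, using hypothetical helpers
`serialize` and `parse` rather than the actual `cgroups2::write` and
`cgroups2::read` signatures: functions can be overloaded on their
parameter types, but not on their return type alone, so only the read
side needs a template parameter.

```cpp
#include <cstdint>
#include <string>

// Write side: plain overloads, selected by the argument type.
std::string serialize(const std::string& value) { return value; }
std::string serialize(uint64_t value) { return std::to_string(value); }

// Read side: the result type cannot be deduced from the arguments,
// so the caller has to name it explicitly.
template <typename T>
T parse(const std::string& contents);

template <>
std::string parse<std::string>(const std::string& contents)
{
  return contents;
}

template <>
uint64_t parse<uint64_t>(const std::string& contents)
{
  return std::stoull(contents);
}
```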

* [cgroups2] Add an interface to read and write the CPU bandwidth limit.

In cgroups v2, the CPU bandwidth and bandwidth period (duration over
which the bandwidth can be spent) are set in `cpu.max`. This patch
introduces an interface to read and update these values.

A `BandwidthLimit` object is introduced, which represents a snapshot
of the `cpu.max` control file.

Note: This stands in contrast to cgroups v1 where the period and
bandwidth were set in separate control files, `cpu.cfs_period_us`
and `cpu.cfs_quota_us` respectively.

This closes #541

* [cgroups2] 'cpu.max' parsing fix and introduce a test.

The 'cpu.max' file is terminated by a newline which causes an error
while parsing. Hence, we trim whitespace before we parse its contents.
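
A minimal sketch of the fix, assuming the documented `cpu.max` layout of
`$MAX $PERIOD` where `$MAX` may be the literal `max`. The `CpuMax` struct
and the `parseCpuMax` helper are hypothetical stand-ins, not the Mesos
`BandwidthLimit` type. Without the trim, the strict number parse below
would reject the trailing newline:

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical snapshot of `cpu.max`; not the Mesos type.
struct CpuMax
{
  std::optional<uint64_t> quotaUs; // Empty when the file says "max".
  uint64_t periodUs = 0;
};

std::string trim(const std::string& s)
{
  const char* whitespace = " \t\r\n";
  const size_t begin = s.find_first_not_of(whitespace);
  if (begin == std::string::npos) return "";
  const size_t end = s.find_last_not_of(whitespace);
  return s.substr(begin, end - begin + 1);
}

// Strict parse: only decimal digits are accepted.
std::optional<uint64_t> parseU64(const std::string& s)
{
  const bool digits = !s.empty() && std::all_of(
      s.begin(), s.end(), [](unsigned char c) { return std::isdigit(c); });
  if (!digits) return std::nullopt;
  try {
    return std::stoull(s);
  } catch (...) {
    return std::nullopt; // Out of range.
  }
}

std::optional<CpuMax> parseCpuMax(const std::string& contents)
{
  const std::string line = trim(contents); // Drops the trailing newline.
  const size_t space = line.find(' ');
  if (space == std::string::npos) return std::nullopt;

  const std::optional<uint64_t> period = parseU64(line.substr(space + 1));
  if (!period) return std::nullopt;

  CpuMax result;
  result.periodUs = *period;

  const std::string quota = line.substr(0, space);
  if (quota != "max") {
    result.quotaUs = parseU64(quota);
    if (!result.quotaUs) return std::nullopt;
  }
  return result;
}
```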

A test is also introduced.

This closes #543

* [cgroups2] Introduce the CpuControllerProcess.

Introduces the `CpuControllerProcess`, a cpu isolator that is
implemented using cgroups v2 and indirectly exposed through the
`Cgroups2IsolatorProcess`. Hosts correctly configured for cgroups v2
that provide `cgroups/cpu` in the `isolation` flag and `cpu`
in the `agent_subsystems` flag will use this controller.

This closes #545

* [cgroups2] Introduce interface for reading threads in a cgroup.

- `cgroups2::threads` reads the threads in a cgroup into a `set`,
similar to `cgroups2::processes`.

Moving a process into a cgroup moves all the threads in that
process into the cgroup. A test is introduced to ensure the
threads move, as expected.

This closes #546

* [cgroups2] Adds the `CoreControllerProcess` to the `Cgroups2IsolatorProcess`

The `CoreControllerProcess`, referred to as the "core" controller, is
implemented and enabled by default by the cgroups v2 isolator process.

We enable it by default because all cgroups in cgroups v2 have the
core control files ("cgroup.*"), which the `CoreControllerProcess` manages.

Currently, `CoreControllerProcess` reports the number of processes and
threads in a cgroup, if the `cgroups_cpu_enable_pids_and_tids_count`
flag is provided.

This closes #547

* Removed dead field 'subsystem' from the `LinuxLauncherProcess`.

* [cgroups2] Introduce interface to get cgroups nested inside of a cgroup.

Introduces
```
	cgroups2::get(cgroup)
```
which returns the cgroups inside of the given cgroup.

* [cgroups2] Introduced API to kill the processes inside of cgroup subtree.

Introduces
```
	cgroups2::kill(cgroup)
```
which will recursively kill all of the processes in a cgroup subtree.

We additionally update `cgroups2::destroy` to use `cgroups2::kill` such that
it now completely destroys a cgroup (i.e. all directories and processes).

This closes #550

* [cgroups2] Introduce utility to parse a container id from a cgroup path.

During agent recovery, we parse the directories in the cgroup hierarchy
to determine what containers were previously running in the agent.

Here we implement the cgroup directory parsing for cgroups v2's updated
cgroup directory structure.

This closes #551

* Linux launcher cleanups.

These are cleanups extracted out from https://github.com/apache/mesos/pull/552/
in order to reduce the diff noise in adding cgroups v2 support.

* [cgroups2] Update the LinuxLauncher to support cgroups v2.

Updates the `LinuxLauncher` to use cgroups v2. Like with other
cgroup v2 functionality, the new launcher is used by default if the
host is correctly configured for cgroups v2 and Mesos has been compiled
with the --enable-cgroups-v2 flag.

This closes #552

* [cgroups2] Introduce `memory` controller.

Introduces the cgroups v2 `memory` controller, the `cgroups2::memory::usage`
function to obtain the memory usage of a cgroup and its descendants, and a test.

This closes #553

* [cgroups2] Introduced API to set memory.min for a cgroup.

Introduces
```
  cgroups2::memory::min(cgroup)            // get the minimum
  cgroups2::memory::set_min(cgroup, bytes) // set the minimum
```
to get and set the minimum amount of memory, in bytes, that is guaranteed
not to be reclaimed by the kernel under any conditions.

This closes #554

* [cgroups2] Introduced an interface to set a hard memory limit.

The "memory.max" control contains the hard memory limit that a cgroup
and its descendants must remain below.

We introduce `cgroups2::memory::set_max` and `cgroups2::memory::max`
to set and get this limit.

This closes #557

* Mitigate a case where the agent gets stuck sending TASK_DROPPED.

Per MESOS-7187, there is a case where the master holds a stale resource
UUID for the agent's resources, and all subsequent task launches result
in the agent sending TASK_DROPPED due to "Task assumes outdated resource
state".

While this patch doesn't fix the general issue of MESOS-7187, it does
mitigate a known problematic case due to the introduction of the agent
having its own resource UUID.

* Add a regression test for the mitigation of MESOS-7187.

* Removed trailing spaces.

* [cgroups2] Introduce API to set soft memory protection.

The 'memory.low' control is for soft memory protection. This only
applies when the system is trying to reclaim memory. Soft memory
protection means that the kernel will do its best to not reclaim
memory from the cgroup if its usage is below the value in 'memory.low'.
Before it reclaims any memory below the value in 'memory.low' it will
first reclaim unprotected memory from other cgroups.

We introduce `cgroups2::memory::low` and `cgroups2::memory::set_low`
to get and set this soft memory protection limit.

* [cgroups2] Introduce API to set a soft memory limit.

The "memory.high" control contains the soft memory limit for a cgroup
and its descendants. Exceeding the limit will cause the cgroup's
processes to get throttled and will put the cgroup under memory
pressure.

We introduce `cgroups2::memory::set_high` and `cgroups2::memory::high`
to set and get this soft memory limit.

* [cgroups2] Introduced API to listen for OOM events.

Introduces `cgroups2::memory::events::oom` which returns a future
that resolves when the cgroup reaches its memory limit and an allocation
is about to fail.

In cgroups v1, there was a bespoke notification API.

Cgroups v2 provides the 'memory.events' control which contains key-value
pairs of events and the number of times they took place [1]. For OOMs, we
look at the value of the `oom` field. In `cgroups2::memory::events::oom`
we watch for changes to 'memory.events' (via polling every 100ms for now,
and later via inotify) and resolve a future when `events.oom > 0`.

[1] https://docs.kernel.org/admin-guide/cgroup-v2.html#memory
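
A minimal sketch of the check described above, assuming 'memory.events'
holds one `key value` pair per line. The helper names `parseEvents` and
`oomOccurred` are hypothetical, and the polling loop and future are left
out:

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parses whitespace-separated `key value` pairs, one pair per line.
std::map<std::string, uint64_t> parseEvents(const std::string& contents)
{
  std::map<std::string, uint64_t> events;
  std::istringstream in(contents);
  std::string key;
  uint64_t value = 0;
  while (in >> key >> value) {
    events[key] = value;
  }
  return events;
}

// True once the `oom` counter is greater than zero.
bool oomOccurred(const std::string& contents)
{
  const std::map<std::string, uint64_t> events = parseEvents(contents);
  const auto it = events.find("oom");
  return it != events.end() && it->second > 0;
}
```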

This closes #563

* [cgroups2] Error if `--cgroups_limit_swap` is used when cgroups v2 is used.

Mesos does not support limiting swap memory when using cgroups v2.
This is because the cgroups v2 API allows separate control of swap usage
and careful consideration is needed to figure out how to limit swap
usage.

Therefore, when the `--cgroups_limit_swap` flag is provided and
cgroups v2 is used we error during flag validation.

This closes #565

* [cgroups2] Add a subset of memory usage statistics.

Cgroups v2 exposes memory statistics through the 'memory.stat' control.

Here we introduce `cgroups2::memory::stats` to read a subset of the memory
usage statistics into a new `memory::Stats` object. These statistics will
be used by the `MemoryControllerProcess` to populate a `ResourceStatistics`
object, like is done by the `MemorySubsystemProcess` in cgroups v1.

Additional statistics from the 'memory.stat' control can be included as
they are required.

This closes #564

* [cgroups2] Implement Cgroups 2 isolator w/o nested containers and systemd.

Updates the cgroups v2 isolator to include initialization, cleanup,
update, and recovery logic.

Unlike cgroups v1 we:
- Create a new cgroup namespace during isolation, by introducing a new
  clone namespace flag. This implies that the contained process will
  only have access to cgroups in its cgroup subtree.
- We only need to recover two cgroups (the non-leaf and leaf cgroups [1])
  for each container, rather than having to recover one cgroup for each
  controller the container used.
- We do not yet support nested containers.
- We do not yet have a systemd integration. Since the cgroups v1
  isolator's integration with systemd was largely to extend process
  lifetimes, the cgroups v2 isolator will function on systemd-managed
  machines, despite not having a first-class integration.
  A systemd integration will be added.

Using the cgroups v2 isolator requires Mesos to be compiled with
`--enable-cgroups-v2` and to have the cgroup2 filesystem mounted
at /sys/fs/cgroup. Selecting the correct isolator version (v1 or v2)
is done automatically. v2 is used if the host supports cgroups v2
and is correctly configured.

[1] The "non-leaf" cgroup is the cgroup for a container where resource
    constraints are imposed. The "leaf" cgroup, which has the path
    <non-leaf cgroup>/leaf, is where the container PID is put.
    Container PIDs are only put in leaf cgroups.

This closes #556

* [cgroups2] Fix error message to show the correct path.

The error message for failing to create the leaf cgroup was
printing the non-leaf cgroup, instead of the leaf cgroup.

* [cgroups2] Don't enable controllers in the leaf cgroup.

We cannot enable all the controllers in the leaf cgroup because
we also put the container process in the leaf cgroup, which violates
the no internal process constraint.

For instance, if we enable the memory controller in the leaf cgroup
and then try and move a process into the leaf cgroup the operation
will fail with EBUSY.

If a container wants to manage their own cgroups, they will need to
move their process into a new cgroup _before_ they enable controllers.

* [cgroups2] Made `cgroups2::processes` optionally recursive.

Previously, `cgroups2::processes` could only fetch processes
from inside of the provided cgroup. We can now fetch all of
the processes inside of a cgroup subtree by passing an
(optional) `recursive` flag.

```c++
Try<std::set<pid_t>> processes(
    const std::string& cgroup,
    bool recursive = false);
```

This closes #570

* [cgroups2] Update `destroy` to be async and more robust.

We were running into an inconsistent issue with `cgroups2::destroy`.
`cgroups2::destroy` would fail with EBUSY when removing cgroups with `rmdir`.
The error was being caused because some processes had not been killed
when `rmdir` was called on their cgroup; a cgroup with processes
cannot be destroyed.

After signalling a kill (by writing "1" to 'cgroup.kill') sometimes
processes were staying alive long enough to cause `rmdir` to fail.

Hence, we update `cgroups2::destroy` to wait, after signalling a
SIGKILL, for all the processes to exit before attempting to remove
the cgroups.

Since we wait a maximum of half a second, we don't want to block the
caller. Thus, we update `destroy` to be async.
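
A synchronous sketch of the wait step only, under the assumption that the
caller supplies a predicate that reports whether the cgroup subtree has no
processes left. `waitUntilEmpty` and its parameters are hypothetical; the
patch itself makes `destroy` asynchronous rather than blocking like this:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Polls `isEmpty` until it returns true or `timeout` elapses.
// Returns true if the cgroup emptied in time, false otherwise.
bool waitUntilEmpty(
    const std::function<bool()>& isEmpty,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!isEmpty()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false; // Processes are still alive; rmdir would hit EBUSY.
    }
    std::this_thread::sleep_for(interval);
  }
  return true;
}
```

Only after this returns true would the cgroups be removed, deepest first.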

* Split out cgroups setup / teardown logic in ContainerizerTest.

`StartSlave()` and similar test-setup functions mounted cgroups
v1 hierarchies and initialized controllers. On cgroups v2 machines,
this setup would fail or result in irregular cgroup setups. As a
step towards end-to-end testing for the `MesosContainerizer`, we
update the Agent test fixtures such that they work correctly on
both cgroups v1 and v2 hosts.

This closes #572

* [cgroups2] Add cgroups v2 setup and teardown logic to ContainerizerTest.

`StartSlave()` and similar test-setup functions mounted cgroups v1 hierarchies
and initialized controllers. On cgroups v2 machines, this setup would fail or
result in irregular cgroup setups. As a step towards end-to-end testing for
the `MesosContainerizer`, we update the Agent test fixtures such that they
work correctly on both cgroups v1 and v2 hosts.

This closes #573

* [cgroups2] Report usage statistics for the cgroups v2 isolator process.

Overrides `::usage` for the `Cgroups2IsolatorProcess` so the
MesosContainerizer gets ResourceStatistics reported by the
cgroups v2 controller processes, for example the `CpuControllerProcess`.

* Fix compilation error when cgroups v2 is not being compiled.

This closes #575

* [cgroups2] Handle missing 'kernel' field in 'memory.stat' on linux < 5.18.

The 'kernel' key was introduced to 'memory.stat' in Kernel 5.18 and therefore
isn't present on older kernels. If it is missing, we set `kernel` to be the
sum of the other kernel usage fields provided in 'memory.stat'. This is an
under-accounting since it doesn't include:

 - various kvm allocations (e.g. allocated pages to create vcpus)
 - io_uring
 - tmp_page in pipes during pipe_write()
 - bpf ringbuffers
 - unix sockets

But it's the best measurement we can provide prior to the 'kernel' stat
being added in 5.18 that catches all of these.
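
A sketch of the fallback, with one loud assumption: the list of fields
summed below (`kernel_stack`, `pagetables`, `sock`, `vmalloc`, `slab`) is
chosen for illustration from keys that exist in 'memory.stat' and is not
confirmed to be the exact set the patch uses. `kernelBytes` is a
hypothetical helper name.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Returns the 'kernel' value if present, otherwise an under-counting
// sum of other kernel memory fields.
uint64_t kernelBytes(const std::string& memoryStat)
{
  std::map<std::string, uint64_t> stats;
  std::istringstream in(memoryStat);
  std::string key;
  uint64_t value = 0;
  while (in >> key >> value) {
    stats[key] = value;
  }

  const auto kernel = stats.find("kernel");
  if (kernel != stats.end()) {
    return kernel->second; // Linux >= 5.18.
  }

  // ASSUMPTION: illustrative field list, see the note above.
  uint64_t total = 0;
  for (const char* field :
       {"kernel_stack", "pagetables", "sock", "vmalloc", "slab"}) {
    const auto it = stats.find(field);
    if (it != stats.end()) {
      total += it->second;
    }
  }
  return total;
}
```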

As part of this, we add the 'slab' key (one of the kernel memory usage
fields) to the `memory::Stats` structure.

See kernel patch introducing 'kernel':

https://github.com/torvalds/linux/commit/a8c49af3be5f0b4e105ef678bcf14ef102c270be

This closes #576

* [cgroups2] Watch and respond to container limitations.

Each `ControllerProcess` used by the cgroups v2 isolator
can optionally override `::watch` which is a future that
resolves when a container limitation (e.g. memory limit reached)
is detected.

Here we introduce listening and responding to these
container limitations, like is done in cgroups v1.

* [cgroups2] Introduces the MemoryControllerProcess.

Introduces the `MemoryControllerProcess`, the cgroups v2 memory
isolator, which will be used by the `Cgroups2IsolatorProcess`.

Unlike the `MemorySubsystemProcess`, the cgroups v1 memory isolator, we:

- Don't allow limits on swap memory to be set.
- Don't report memory pressure levels (this facility is no longer part of
  the cgroups memory controller's API)

Future work may include:

- Adding support for swap memory, and
- Reporting the (now available) memory pressure stall information

This patch updates the ROOT_MemUsage so it passes on a cgroups v2
machine using the new MemoryControllerProcess.

This closes #581

* [post-reviews] Replace deprecated distutils LooseVersion with packaging.version.

This also gets rid of the Deprecation Warning we get when running the
post-reviews.py script:

```
DeprecationWarning: distutils Version classes are deprecated.
Use packaging.version instead.
  rbt_version = LooseVersion(rbt_version)
```

Review: https://reviews.apache.org/r/74984/

* [cgroups2] Clarify cgroups2::memory::stats documentation.

After performing some testing, we found that memory.stat contains
information about the cgroup *and its descendants*, but this is
not currently mentioned in our own documentation.

Review: https://reviews.apache.org/r/74980/

* [cgroups2] Add OOM listening to the MemoryControllerProcess.

Introduces OOM listening to the MemoryControllerProcess so that we
detect, report, and respond to OOM events.

Review: https://reviews.apache.org/r/74979/

* [cgroups2] adjust CPU weight values from v1 to v2 default

Modifies the cgroups CPU weights to reflect change
from cpu.shares to cpu.weight.

In v1, cgroups used cpu.shares, which has a default of 1024.
In v2, cgroups use cpu.weight, which has a default of 100.

The range for the cpu.weight is [1,10000], the minimum
weight has been updated to reflect this.

The revocable CPU weight has been scaled down from 10 to 1
to reflect a similar scale to the default.
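
To illustrate the numbers above, here is a sketch that assumes a simple
proportional scaling between the two defaults, clamped to the
`cpu.weight` range. The formula and the `sharesToWeight` helper are
assumptions made for this example; the patch adjusts constants and does
not necessarily use this conversion.

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint64_t V1_DEFAULT_SHARES = 1024;
constexpr uint64_t V2_DEFAULT_WEIGHT = 100;
constexpr uint64_t V2_MIN_WEIGHT = 1;
constexpr uint64_t V2_MAX_WEIGHT = 10000;

// ASSUMPTION: proportional scaling, then clamping to [1, 10000].
uint64_t sharesToWeight(uint64_t shares)
{
  const uint64_t scaled = shares * V2_DEFAULT_WEIGHT / V1_DEFAULT_SHARES;
  return std::max(V2_MIN_WEIGHT, std::min(scaled, V2_MAX_WEIGHT));
}
```

Under this assumption the v1 default of 1024 maps to the v2 default of
100, and a revocable value of 10 rounds down to 0 and is clamped to the
minimum weight of 1.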

Review: https://reviews.apache.org/r/74992/

* [cgroups2] populate unevictable field from memory.stat

Review: https://reviews.apache.org/r/74991/

* [cgroup2] Fix CPU isolator tests on cgroups2 systems

The change involves migrating the isolator tests from
MesosTests to ContainerizerTest, which inherits from MesosTests.
This allows cgroups2 tests to create cgroups in appropriate
directories during tests.

Review: https://reviews.apache.org/r/74989/

* [cgroups2] Add memory usage reporting to the MemoryControllerProcess

Introduces `::usage` to the MemoryControllerProcess to report the total
memory usage of a cgroup as well as memory usage statistics provided
by `cgroups2::memory::stats`.

Review: https://reviews.apache.org/r/74985/

* [cgroups2] Rename constants in cgroups2 isolator.

Specify that the cgroups2 constants are cgroups2 in their names.

This helps avoid redefinition of constants inside test files that
may import constant files from both cgroups v1 and v2.

Review: https://reviews.apache.org/r/74993/

* [cgroups2] Fix cgroups isolator test for RevocableCpu.

This patch fixes the RevocableCpu test for cgroups2
by conditionally skipping the hierarchy check which
is only relevant to cgroups1 systems.

Review: https://reviews.apache.org/r/74994/

* [cgroups2] crash when root folder is not detected when creating cgroups

Based on this ticket (https://issues.apache.org/jira/browse/MESOS-9305)
and the ROOT_CGROUPS_CreateRecursively test in CgroupsIsolatorTest,
there seems to be a possibility that the root folder may be deleted and
new cgroups cannot be properly created.

In v1, this was addressed by enabling recursively creating the groups.

In v2, since we make use of cgroup.subtree_control to determine a cgroup
and its descendants' access to controllers, we cannot recover this effectively
if the root folder is deleted, so we can't just recursively create the folders.

Hence, we elected to crash if the root folder is not found, as it will allow
us to restart and go through the logic that takes care of setting all the
values inside cgroup.subtree_control again.

Review: https://reviews.apache.org/r/74995/

* [cgroups2] Allow cgroups2::enable() to take in a set.

Modifies the cgroups2::controllers::enable function to take in a set of
strings for controllers. This helps eliminate the possibility of duplicate
controllers in the argument, and brings it in line with the
cgroups2::controllers::disable function.

Review: https://reviews.apache.org/r/74981/

* [cgroups2] Introduce the PerfEventControllerProcess.

Introduces the controller process for perf event, which was also present
in cgroups1. The controller is automatically enabled, and should not be
visible inside the cgroup.controllers file in the root cgroup.

As a consequence, we will not be able to manually enable or disable this
controller via writing to the cgroup.subtree_control file.

References:

* perf_event section in https://docs.kernel.org/admin-guide/cgroup-v2.html
* slide 34 in https://man7.org/conf/ndctechtown2021/cgroups-v2-part-1-intro-NDC-TechTown-2021-Kerrisk.pdf

Review: https://reviews.apache.org/r/74997/

* [cgroups2] Ignore manual enabling of perf_event during prepare phase.

In Cgroups2IsolatorProcess::prepare, we may manually enable controllers
by writing to the cgroup.subtree_control file.

For perf_event, since it is automatically turned on, it does not appear
inside the cgroup.controllers file and hence cannot be written to the
cgroup.subtree_control file. For this reason, we skip the enable call for
the perf_event controller.

Review: https://reviews.apache.org/r/74998/

* [contributors] Add Jason Zhou to contributors.yaml.

Review: https://reviews.apache.org/r/75011/

* [agent] Add executor_id / framework_id query parameters in /containers.

We now allow filtering by framework ID and executor ID in addition to
the original functionality of filtering by container ID.

Please note that the /containers endpoint only allows select combinations
of these query parameter fields to be populated at once. We will return a
failure if we see that the combination of query parameters is invalid.

We currently accept:
* no query parameters
* only container id
* only framework id
* only framework id and executor id
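
The accepted combinations can be sketched as a small predicate. The
`ContainersQuery` struct and `isValidCombination` function are
hypothetical names for illustration, not the agent's actual types:

```cpp
#include <optional>
#include <string>

// Hypothetical holder for the three optional query parameters.
struct ContainersQuery
{
  std::optional<std::string> containerId;
  std::optional<std::string> frameworkId;
  std::optional<std::string> executorId;
};

bool isValidCombination(const ContainersQuery& query)
{
  const bool container = query.containerId.has_value();
  const bool framework = query.frameworkId.has_value();
  const bool executor = query.executorId.has_value();

  if (container) {
    return !framework && !executor; // Only container id.
  }
  if (framework) {
    return true; // Framework id alone, or with executor id.
  }
  return !executor; // No parameters; executor id alone is rejected.
}
```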

Review: https://reviews.apache.org/r/75009/

* [agent] Add test for framework_id and executor_id support in /containers.

Review: https://reviews.apache.org/r/75010/

* FIX: add missing cgroups file.

* [mesos-build] Fix python setup in Dockerfiles.

The Dockerfiles themselves were fixed up to account for updated pip
install urls, deprecated deadsnakes PPA for Ubuntu 16.04, and curl
misconfigurations for installing clang.

Review: https://reviews.apache.org/r/75018/

* CHANGE: add default network if no net options was set.

* Revert "CHANGE: add default network if no net options was set."

This reverts commit 5c3b039db544e937617cd63810185cc95fc2b34b.

* [cgroups2] Remove ENABLE_CGROUPS_V2 ifdefs.

This commit removes the sections where ENABLE_CGROUPS_V2 is used
to determine the compiled code. Any need to determine whether or not
cgroups2 is used will be satisfied using the cgroups2::mounted() function.

This guard was only in place temporarily to avoid breaking our CI
while we figured out how to ensure that all of the CI docker images
have the header.

Review: https://reviews.apache.org/r/75021/

* [mesos-build] Fix python setup in docker-build.sh.

Fixes the python 3.6 install inside docker-build.sh
when the detected OS version is Ubuntu v16.04.

Review: https://reviews.apache.org/r/75020/

* [mesos-build] Move mesos-build from ubuntu 16.04 to 20.04

Ubuntu 16.04 docker builds were having issues with the jenkins
pipeline because /usr/include/linux/bpf.h was missing certain fields
that are present in more modern linux kernels and are used inside
the ebpf code.

We will try to address this along with Ubuntu's EOL issue by upgrading
to ubuntu 20.04.

Review: https://reviews.apache.org/r/75023/

* [mesos-build] Update reviewbot / tidybot / docker-build.sh to support ubuntu 20.04.

The reviewbot, tidybot, and our docker-build scripts
have been updated to use or accommodate ubuntu 20.04.

Review: https://reviews.apache.org/r/75027/

* [mesos-build] Add readme to support/mesos-build directory.

Review: https://reviews.apache.org/r/75028/

* [mesos-build] Add correct directory to git safe directory.

Previously, the entrypoint.sh added /SRC/.git as a safe directory
after the git clone, which was not useful. We now add /SRC/ as
a safe directory before the git clone.

Review: https://reviews.apache.org/r/75029/

* Remove trailing whitespace in mesos build entrypoint.

* [mesos-readme] Clarify mesos-build instructions for uploading to dockerhub.

Review: https://reviews.apache.org/r/75030/

* Clarify mesos-build readme instructions.

Provide more specific commands.

* [mesos-build] Add .git directory to safe directory.

Review: https://reviews.apache.org/r/75031/

* Add push instructions to mesos-tidy.

* [mesos-build] Install openjdk 11 on ubuntu 20.04.

Install openjdk 11 on ubuntu 20.04. Our reviewbot is running into issues
where its java 11 installation is missing javac, so configure.ac cannot
run properly. This fixes that issue.

Review: https://reviews.apache.org/r/75032/

* [mesos-build] Address dependency issues in centos-7 / ubuntu-20.04.

In CentOS 7, the linux/amd64 base image is missing ebpf fields such as
BPF_PROG_TYPE_CGROUP_DEVICE, which prevents jenkins from building mesos
now that the ENABLE_CGROUPS_V2 macro has been removed as
/usr/include/linux/bpf.h is missing the fields required by ebpf.h.

We installed dependencies from kernel-ml so that we can have newer
header files in /usr/include/linux, which should help us compile the
mesos code in jenkins.

For ubuntu, we need to install jdk-11 instead of the old jdk-8, which
is preventing some builds that pull the ubuntu image from building.

As part of this change, both the CentOS 7 and ubuntu:20.04 images will
need to be rebuilt and uploaded to dockerhub for jenkins.

Review: https://reviews.apache.org/r/75035/

* [mesos-build] Add /SRC/.git as safe directory for tidybot.

This change allows us to bypass the git directory warnings for tidybot.
As part of this change we will have to rebuild the tidybot image and
push it to dockerhub.

Review: https://reviews.apache.org/r/75037/

* [mesos-build] Remove setting of environment variable in dockerfiles.

Setting an environment variable as PYTHON_VERSION caused an unexpected
behavior in jenkins as the configure.ac script checked for that exact
environment variable.

PYTHON_VERSION has been renamed and set as an ARG in the dockerfile
so that it will not persist after the build.

Review: https://reviews.apache.org/r/75036/

* [port-mapping] cat back ip_local_port_range after updating ephemeral ports.

This ensures that the update was successful and that the port range is what
we expect.

Review: https://reviews.apache.org/r/75038/

* [port-mapping] Fix typo in port_mapping.cpp.

Review: https://reviews.apache.org/r/75042/

* [cgroups2] Fix control reaches end of non-void function.

This change moves the UNREACHABLE macro out of the switch case
to fix the "control reaches end of non-void function" error in
the lambda function for addDevice.

Review: https://reviews.apache.org/r/75043/

* [port_mapping] Fix SmallEgressLimit test.

The test was previously failing as it was timing the echo rather
than ncat. This fix measures the time that ncat takes so that the
elapsed time does not display as 0s and fail the test.

Review: https://reviews.apache.org/r/75044/

* [cgroups2] Fix multi-line comment compilation warning.

This fixes a compilation warning due to the comment line ending with
a backslash character.

* [cgroups2] Fix a compilation error on CentOS 7 due to move operations.

* [route] Use nl_addr_iszero helper when checking for destination IP network.

Previously, when grabbing the destination, we would filter out the default
address at 0.0.0.0/0 by checking that the destination pointer is pointing
at an empty struct.

On newer Linux, it seems to be possible that the destination pointer can
be pointing at a valid struct that corresponds to 0.0.0.0/0. To ensure
that we are able to accurately filter out the default route, we switch to the
libnl function nl_addr_iszero to determine if the nl_addr struct corresponds
to 0.0.0.0/0.

We also apply this change to other areas where nl_addr_get_len is used to
ensure that non-empty nl_addr with only zeroes are accounted for.

Review: https://reviews.apache.org/r/75046/

* [routing] Change link::setMAC to return Try<Nothing>.

We have noticed that our code does not treat the setMAC bool return value
differently based on whether it returns true or false. As such, we are
changing the return type to return Nothing so that we either return Error
or Nothing, rather than Error or True or False.

As a consequence of this we are also removing the special case of returning
False but not Error when we get ENODEV from ioctl.

Review: https://reviews.apache.org/r/75056/

* Support constructing net::MAC objects from sockaddr.sa_data.

When using ioctl, we get a char[14] sa_data array from sockaddr that holds the
information necessary to construct a net::MAC object, this patch adds support
for using the sa_data field to directly create a net::MAC object.

Review: https://reviews.apache.org/r/75058/

* [port mapping isolator] Work around apparent MAC address kernel bug.

It seems that there are scenarios where, when using the port mapping isolator,
mesos containers sometimes cannot communicate with the mesos agent as the MAC
address of the veth interface is set incorrectly, leading to dropped packets
by the kernel. This was discovered with the use of tcpdump (which reveals that
the kernel marks the packets as destined for another host), and the latter of
which reveals that the kernel is indeed dropping the packets due to this. We
then found that when we set the mac address on the veth interface, it sometimes
does not "stick" despite ioctl returning successfully.

Observed scenarios with incorrectly assigned MAC addresses:

1. After setting the mac address: ioctl returns the correct MAC address, but
   net::mac returns an incorrect MAC address (different from the original!)
2. After setting the mac address: both ioctl and net::mac return the same MAC
   address, but are both wrong (and different from the original one!)
3. After setting the mac address: there are no cases where ioctl or net::mac
   come back with the same MAC address as before we set the address.
4. Before we set the mac address: there is a possibility that ioctl and
   net::mac results disagree with each other!
5. There is a possibility that the MAC address we set ends up overwritten by
   a garbage value after setMAC has already completed and checked that the
   mac address was set correctly. Since this error happens after this
   function has finished, we cannot log nor detect it in setMAC because we
   have not yet studied at what point this occurs.

Notes:

1. We have observed this behavior only on CentOS 9 systems at the moment,
   CentOS 7 systems under various kernels do not seem to have the issue
   (which is quite strange if this was purely a kernel bug).
2. We have tried kernels 5.15.147, 5.15.160, 5.15.161, all of these have
   this issue on CentOS 9.

This patch adds a workaround for this bug, which is to check that the MAC
address is set correctly after the ioctl call, and retry the address setting
if necessary. In our testing, this workaround appears to handle scenarios
(1), (2), (3), and (4) above, but it does not address scenario (5).
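
The check-and-retry idea can be sketched as follows. This is a minimal illustration and not the actual setMAC code: `setWithVerification`, the `set` and `get` callbacks, and the attempt limit are hypothetical stand-ins for the ioctl write and the read-back.

```cpp
#include <array>
#include <cstdint>
#include <functional>

using MAC = std::array<uint8_t, 6>;

// Hypothetical helper: writes `desired`, reads the address back, and
// retries until the read-back matches or `maxAttempts` is exhausted.
// `set` and `get` stand in for the real ioctl-based calls.
bool setWithVerification(
    const MAC& desired,
    const std::function<bool(const MAC&)>& set,
    const std::function<MAC()>& get,
    int maxAttempts)
{
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (!set(desired)) {
      continue; // The write itself failed; try again.
    }

    if (get() == desired) {
      return true; // The address "stuck".
    }
  }

  return false; // Never observed the desired address.
}
```

A check like this can only catch a mismatch that is visible at read-back time, which is consistent with scenario (5) remaining unaddressed.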

See MESOS-10243 for additional details, follow-ups.

Review: https://reviews.apache.org/r/75057/

* CHANGE: add default network if no net options were set.

* [build] Fix make distcheck for ubuntu 20.04.

Fixes the makefiles for ubuntu 20.04 so that make distcheck works properly with
its protobuf dependencies generated as part of make distcheck.

The reason that a change was needed in the makefile is because the upgrade from
ubuntu 16.04 to 20.04 also caused the automake version to be updated when
dependencies were being installed during docker build.

The change in the automake version created slight changes in the generated
makefile. Specifically, the distcheck on the new automake-generated makefile now
depends on `BUILT_SOURCES` which causes an error as the CSI protobuf files are
not ready when distcheck is called. So we add the csi build stamps to
`BUILT_SOURCES` to ensure that the protobuf files will be ready when distcheck's
dependencies are made.

The additional chmod change for java is because for some reason when distcheck
attempts to build the mesos-1.12.0.jar from its created distribution, some
folders are missing write permissions, causing the build to terminate.

Review: https://reviews.apache.org/r/75062/

* [build] Use ubuntu:20.04 for verify-reviews.py.

Moving to ubuntu 20.04 so that we can get the ebpf header files for
our build.

Review: https://reviews.apache.org/r/75063/

* [build] Fix docker-build.sh failing to compile with distcheck.

Review: https://reviews.apache.org/r/75066/

* [build] Fix libevent-enabled cmake builds on ubuntu 20.04.

Our current libevent-enabled cmake builds cannot complete on jenkins because
they hit the 'incomplete definition of type 'struct bio_st'' error.

This is because the upgrade to ubuntu 20.04 also upgraded our openSSL
version from 1.0.2 to 1.1.1, which breaks compatibility with libevent 2.1.5
that was previously used.

A compatibility patch for openSSL 1.1+ was released with libevent 2.1.7, but
the closest tarball that includes a CMakeLists.txt file is 2.1.9, which is
what we will upgrade to.

With the new libevent library, builds are able to complete using cmake with
libevent enabled. However, they still hit the same test failures we see on
other builds (such as autotools) and operating systems (such as CentOS 7).

We also need to link libevent_pthread and libevent_openssl with libprocess,
otherwise we will get errors like:

```
ld: 3rdparty/libprocess/src/libprocess.so: undefined reference to bufferevent_openssl_get_ssl
```

Review: https://reviews.apache.org/r/75070/

* [ssl] Remove TLS 1.0 and 1.1 tests.

Currently the SSLProtocolTest with TLS v1.0 and v1.1 do not pass because
those versions were disabled in ubuntu 20.04, see:

https://discourse.ubuntu.com/t/spec-tls-1-0-and-1-1-are-disabled-by-default/41868
https://github.com/SoftEtherVPN/SoftEtherVPN/issues/1358#issuecomment-851427905

Review: https://reviews.apache.org/r/75075/

* [cgroups2] Fix allow deny semantics for device access.

Currently, the EBPF program we generate has the behavior where the deny
list has no effect, as we will allow device access iff the device
matched with an allow entry.

Instead we want to grant access to a device iff it is in a cgroup's
allow list *and not in its deny list.*

This means that we need to change our existing logic, which exits on the
first match. It is not our desired behavior because the current EBPF
program construction logic puts the allow-device checks before the
deny-device checks, meaning that if a device is on both allow and deny
lists for a cgroup, it will be granted access.

This change revamps the EBPF program construction to now check both the
allow and deny list of a cgroup before determining whether access may be
granted. Specifically, if a device is matched with an entry inside the
allow list, we will also be checking if it matches with any entry on
the deny list, and deny the device's access if that is the case.
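
The difference between the two semantics can be sketched outside of EBPF. This is a simplified model under stated assumptions: devices are reduced to (major, minor) pairs with no type, access bits, or wildcards, and both function names are hypothetical.

```cpp
#include <set>
#include <utility>

// A device identified by (major, minor). This is a simplification that
// ignores device type, access bits, and wildcards.
using Device = std::pair<int, int>;

// Old behavior as described above: the first match decides, and allow
// entries are checked first, so a deny entry can never override them.
bool firstMatchWins(
    const std::set<Device>& allow,
    const std::set<Device>& deny,
    const Device& device)
{
  if (allow.count(device) > 0) {
    return true;
  }

  if (deny.count(device) > 0) {
    return false;
  }

  return false; // No match at all: access is not granted.
}

// Desired behavior: grant access iff the device is on the allow list
// and not on the deny list.
bool allowedAndNotDenied(
    const std::set<Device>& allow,
    const std::set<Device>& deny,
    const Device& device)
{
  return allow.count(device) > 0 && deny.count(device) == 0;
}
```

For a device that appears on both lists, `firstMatchWins` grants access while `allowedAndNotDenied` refuses it.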

We also avoid generating specific parts of the EBPF program code to
avoid creating unreachable code, explanations with a diagram are
attached above the cgroups2::devices::DeviceProgram::build function.

Review: https://reviews.apache.org/r/75026/

* Minor edits missed in r/75026.

* [ebpf] Add helper function for getting bpf fd by program id.

Introduces a helper function to help abstract away the logic for
getting bpf program file descriptor by its program id.

Review: https://reviews.apache.org/r/75083/

* [cgroups2] Make cgroups2::path take both absolute and relative paths.

Currently, cgroups2::path assumes the path in the argument is relative.
We want the function to be able to distinguish between absolute and
relative paths, where we only prepend the mounting point on the
relative path.
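
A minimal sketch of that distinction, assuming a path counts as absolute when it starts with `/`. The function name `cgroupPath` and the explicit `mount` parameter are hypothetical and are not the real cgroups2::path signature.

```cpp
#include <string>

// Hypothetical stand-in for cgroups2::path. The mount point is passed
// explicitly here instead of being looked up.
std::string cgroupPath(const std::string& mount, const std::string& cgroup)
{
  if (!cgroup.empty() && cgroup.front() == '/') {
    return cgroup; // Absolute path: use it as-is.
  }

  // Relative path: prepend the mount point.
  return cgroup.empty() ? mount : mount + "/" + cgroup;
}
```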

Review: https://reviews.apache.org/r/75081/

* [cgroups2] Remove accidental tabs.

* [ebpf] Implement atomic replacement of cgroup device programs.

Currently, if we try to attach device ebpf files to the same cgroup
multiple times, they will all be attached, and they will all be run
when a device requests access. This conflicts with our design to have
one ebpf file per cgroup that represents all the files they want to
allow or deny, where that file is updated when the cgroup adds or
removes a device. So we add a patch to atomically replace any existing
ebpf file already attached to our target cgroup using our new ebpf file.

Review: https://reviews.apache.org/r/75080/

* [veth] Add todo to set mac address on create for peer link.

Due to a systemd-induced race condition related to MacAddressPolicy
being set to 'persistent' on versions >= 242, we will have to set the
MAC address of the peer link (eth0) when we create it, so that udev
will not try to overwrite the address when it is notified that the
device was created. Otherwise we and udev race to be the last one to
write the MAC address of eth0.

see: https://issues.apache.org/jira/browse/MESOS-10243

Review: https://reviews.apache.org/r/75087/

* [veth] Avoid udev race condition on systems with systemd version > 242.

In systems with systemd version above 242, there is a potential data
race where udev will try to update the MAC address of the device at the
same time as us if the systemd's MacAddressPolicy is set to 'persistent'.

To prevent udev from trying to set the veth device's MAC address by
itself, we must set the device MAC address on creation so that
addr_assign_type will be set to NET_ADDR_SET, which prevents udev from
attempting to change the MAC address of the veth device.

See: https://github.com/torvalds/linux/commit/2afb9b533423a9b97f84181e773cf9361d98fed6
See: https://lore.kernel.org/netdev/CAHXsExy8LKzocBdBzss_vjOpc_TQmyzM87KC192HpmuhMcqasg@mail.gmail.com/T/
See: https://issues.apache.org/jira/browse/MESOS-10243

Review: https://reviews.apache.org/r/75086/

* [ssl_tests] Correct ubuntu version on comment.

This commit is part of our effort to upgrade jenkins ubuntu version to
22.04.

Review: https://reviews.apache.org/r/75089/

* [veth] Provide the ability to set veth peer link MAC address on creation.

This addresses the previous todo: we want to set the MAC address of the
peer link when we create a veth pair, so that we avoid racing against
udev to be the last one to set the MAC address of the interface.

See: https://reviews.apache.org/r/75087/
See: https://issues.apache.org/jira/browse/MESOS-10243

Review: https://reviews.apache.org/r/75090/

* [jenkins] Create dockerfile compatible with 22.04 build.

For review #75080, we made use of replace_bpf_fd and BPF_F_REPLACE
which were added in kernel 5.6. Our current ubuntu 20.04 base image
uses kernel 5.4.

As such we will be upgrading the ubuntu version used in Jenkins to
22.04, whose base image uses kernel 5.15, so that we can make mesos
on the updated pipeline, enabling reviewbot, tidybot, and coverity.

Review: https://reviews.apache.org/r/75088/

* [build] Use clang-14 for non ubuntu 16.04 targets in docker-build.sh.

As we migrate to ubuntu 22.04, clang-10 is no longer available via
apt-get install. As such, we will move to clang-14, which should allow
us to run the docker-build.sh file with OS equal to ubuntu:22.04.
This should allow coverity bot to compile and return to normal.

Review: https://reviews.apache.org/r/75091/

* [cgroups] Add helper functions for device Entry.

Currently, the Entry class does not have readable helper functions for
determining whether the device accesses represented by one Entry would
be a subset of that of another. In addition, we want more readable ways
to determine if a device has wildcards present and if it has any
accesses specified.

These additions will streamline the logic in the DeviceManager, which
will heavily utilize the Entry class, improving code readability.

Review: https://reviews.apache.org/r/75096/

* [cgroups2] Introduce a device manager.

This change introduces the DeviceManager to help facilitate device
access management in cgroups2 via ebpf program file changes. This
centralization is needed since we no longer have control files to
leverage as persistence for agent recovery, so we need a component that
keeps track of allow/deny device access information and re-configures
the ebpf program for the cgroup.

Device requests can be made to the manager by calling `configure` or
`reconfigure`. Note that `configure` should only be used when setting
up a cgroup's device access, i.e. it has not requested any device to
be allowed/denied before.

In addition, `reconfigure` cannot be used to add deny entries
containing wildcards.

This manager will be made available to all controllers under the
cgroups2 isolator, and the GPU isolator.

Review: https://reviews.apache.org/r/75006/

* [cgroups2] Add ebpf program attachment to the DeviceManager.

Currently, the device manager only keeps track of the state in memory,
and does not commit the changes by attaching an ebpf file to the
corresponding cgroup. We will now generate and attach the ebpf file
when configure and reconfigure are called.

Review: https://reviews.apache.org/r/75102/

* [cgroups2] Fix unsafe Process usage in DeviceManager.

Calls to the DeviceManager wrapper were directly accessing the
state of DeviceManagerProcess. This patch uses the dispatch mechanism
instead, and adjusts the tests accordingly.

* [cgroups] Add helper to find overlapping device access.

Currently, we have to directly compare member variables to see if one
Access object would overlap that of another, which isn't very clear
to people that would be reading the code.

We add a helper to abstract away the logic to see if the accesses
specified in one Access instance would overlap with that of another.

Review: https://reviews.apache.org/r/75107/

* [cgroups] Add Device::Selector::encompasses.

Currently, we have to check via Selector's member variables if the
devices represented by one Selector encompasses those represented by
another.

We add a helper function to simplify the logic, which differs depending on
whether one Selector encompasses the other.
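
A sketch of what such a check could look like. The struct below is illustrative only: it assumes wildcards are modeled as empty optionals and that an `ALL` type matches any device type, which may differ from the real Device::Selector.

```cpp
#include <optional>

// Illustrative selector; field layout and wildcard encoding are assumptions.
struct Selector
{
  enum class Type { ALL, BLOCK, CHARACTER };

  Type type;
  std::optional<unsigned int> major; // nullopt acts as a wildcard.
  std::optional<unsigned int> minor; // nullopt acts as a wildcard.

  // True if every device matched by `other` is also matched by `this`.
  bool encompasses(const Selector& other) const
  {
    bool typeOk = type == Type::ALL || type == other.type;
    bool majorOk = !major.has_value() || major == other.major;
    bool minorOk = !minor.has_value() || minor == other.minor;
    return typeOk && majorOk && minorOk;
  }
};
```

Under this model a wildcard selector encompasses a specific one, but a specific selector never encompasses a wildcard.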

Review: https://reviews.apache.org/r/75106/

* [ebpf] Correct ebpf deny block behavior.

Currently, the deny block matches a device access iff all accesses
match on the deny block. For example, a rw access would not match the
deny block even if the deny block had w access specified.

We would expect that the deny block should deny all accesses if the
type, major, and minor number matches, and if any of the device accesses
overlap with what's specified in the deny block.
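
The rw-versus-w example can be made concrete with a small sketch. The `Access` struct and both function names are hypothetical; the point is only the contrast between "all requested accesses match" and "any requested access overlaps".

```cpp
// Illustrative access bits for a device request or a deny entry.
struct Access
{
  bool read = false;
  bool write = false;
  bool mknod = false;
};

// Previous matching rule: the deny entry matches only if every
// requested access is present on the deny entry.
bool matchesIfAllPresent(const Access& requested, const Access& denied)
{
  return (!requested.read || denied.read) &&
         (!requested.write || denied.write) &&
         (!requested.mknod || denied.mknod);
}

// Corrected rule: the deny entry matches if any requested access
// overlaps with an access on the deny entry.
bool overlaps(const Access& requested, const Access& denied)
{
  return (requested.read && denied.read) ||
         (requested.write && denied.write) ||
         (requested.mknod && denied.mknod);
}
```

With a rw request and a deny entry for w, `matchesIfAllPresent` reports no match, while `overlaps` reports a match and the request is denied.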

Review: https://reviews.apache.org/r/75109/

* [cgroups2] Add allow / deny list normalization validation.

Currently we assume that a device state is normalized before using it
for generating ebpf files. However, we have not been enforcing these
constraints on the device access state.

We enforce some basic validation on cgroups2::configure on the state
to ensure that we are able to generate a correct ebpf program. If the
lists are not normalized, we generate incorrect programs!

An allow or deny list is 'normalized' iff all of the following are true:

  1. No entries have empty accesses specified.
  2. No two entries on the same list can have the same selector
     (type, major & minor numbers).
  3. No two entries on the same list can be encompassed by the other
     entry. See Entry::encompasses.

This patch adds helpers to check if a device state is normalized,
and will only allow users to create new CgroupDeviceAccess instances
using a helper that checks that the allow and deny lists are normalized.

A new helper function is added to check if an entry would be granted
access, and requires the state to be normalized.

Review: https://reviews.apache.org/r/75099/

* [cgroups2] Clarify device documentation.

* Revert "[cgroups2] Clarify device documentation."

This reverts commit fd17efe3402fc859efc63d3cd32658d1ec61a015.

* Revert "[cgroups2] Add allow / deny list normalization validation."

This reverts commit 45d290aeff6912c8e6a4b1a7358c4e9772c447b4.

* [cgroups2] Helper to check device entry normalization.

Currently we assume that a device state is normalized before using it
for generating ebpf files. However, we have not been enforcing these
constraints.

We add a helper to check if a device state is normalized so that we can
enforce these constraints.

An allow or deny list is 'normalized' iff all of the following are true:

1. No Entry has empty accesses specified.
2. No two entries on the same list can have the same selector (type,
   major & minor numbers).
3. No two entries on the same list can be encompassed by the other
   entry (see Entry::encompasses).
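
The three rules can be sketched as a single pass over a list. All types here are illustrative stand-ins, and the semantics of `Entry::encompasses` (selector encompassed and accesses covered) are an assumption made for this sketch.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct Access
{
  bool read = false;
  bool write = false;
  bool mknod = false;

  bool none() const { return !read && !write && !mknod; }

  // True if every access in `other` is also present here.
  bool covers(const Access& other) const
  {
    return (read || !other.read) && (write || !other.write) &&
           (mknod || !other.mknod);
  }
};

struct Selector
{
  char type; // 'a' (all), 'b' (block) or 'c' (character).
  std::optional<unsigned int> major; // nullopt acts as a wildcard.
  std::optional<unsigned int> minor; // nullopt acts as a wildcard.

  bool same(const Selector& other) const
  {
    return type == other.type && major == other.major && minor == other.minor;
  }

  bool encompasses(const Selector& other) const
  {
    return (type == 'a' || type == other.type) &&
           (!major.has_value() || major == other.major) &&
           (!minor.has_value() || minor == other.minor);
  }
};

struct Entry
{
  Selector selector;
  Access access;

  // Assumed semantics, for illustration only.
  bool encompasses(const Entry& other) const
  {
    return selector.encompasses(other.selector) && access.covers(other.access);
  }
};

bool normalized(const std::vector<Entry>& entries)
{
  for (std::size_t i = 0; i < entries.size(); ++i) {
    if (entries[i].access.none()) {
      return false; // Rule 1: empty accesses.
    }

    for (std::size_t j = i + 1; j < entries.size(); ++j) {
      if (entries[i].selector.same(entries[j].selector)) {
        return false; // Rule 2: duplicate selector.
      }

      if (entries[i].encompasses(entries[j]) ||
          entries[j].encompasses(entries[i])) {
        return false; // Rule 3: one entry encompasses the other.
      }
    }
  }

  return true;
}
```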

Review: https://reviews.apache.org/r/75099/

* [cgroups2] Add helper to normalize allow/deny list.

This patch adds a public helper function to abstract away the logic used
to make a list comply with the 'normalized' requirements.

As a reminder, an allow or deny list is 'normalized' iff all of the
following are true:

1. No Entry has empty accesses specified.
2. No two entries on the same list can have the same selector (type,
   major & minor numbers).
3. No two entries on the same list can be encompassed by the other
   entry (see Entry::encompasses).

Review: https://reviews.apache.org/r/75104/

* [cgroups2] Helper to check device access.

A device access is granted if it is encompassed by an allow entry and
does not have access overlaps with any deny entry.

The current process of manually checking if a device access would be
granted given a state is tedious and leads to worse readability.

A new helper function is added to check if an entry would be granted
access in a CgroupDeviceAccess instance, and requires the state to be
normalized.

Review: https://reviews.apache.org/r/75113/

* [cgroups2] Enforce normalization in configure.

We currently do not enforce normalized allow and deny in configure.
However, to ensure that we can generate an ebpf program that behaves
correctly, we have to ensure that allow and deny are normalized.

This patch adds a validation check to ensure that the allow and deny are
normalized before attempting to generate the ebpf program.

Review: https://reviews.apache.org/r/75114/

* [devices] Enforce normalization for DeviceManager configure & reconfigure.

Currently in the configure() and reconfigure() functions in device
manager, we do not ensure that the device access state at the end of the
function call is normalized. So we incorporate normalized() and
normalize() calls to ensure that the allow and deny lists are always
normalized at the end of a configure() or reconfigure() call.

Review: https://reviews.apache.org/r/75115/

* [devices] Add CgroupDeviceAccess::create helper which checks normalization.

Currently, CgroupDeviceAccess instances can be directly constructed
without verifying that its allow and deny lists are normalized.

To codify our normalization constraints, CgroupDeviceAccess can now only
be created with a create() helper.

Review: https://reviews.apache.org/r/75116/

* [devices] Fix DeviceManager tests.

Merging the previous changes caused some DeviceManager test cases to
fail.

This patch updates the tests so that they pass.

Review: https://reviews.apache.org/r/75117/

* [style] Add newlines for readability.

* [reviewbot] Fix reviewbot build error.

Currently, reviewbot is failing from a 'control reaches end of non-void
function' error due to a switch case inside a lambda in the device
manager code. We use an UNREACHABLE macro to stop this error.
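
The pattern looks roughly like this. The sketch defines its own `SKETCH_UNREACHABLE` macro on top of `std::abort()` as a stand-in for the real UNREACHABLE macro; the enum and function are hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Stand-in for the real UNREACHABLE macro: std::abort() never returns,
// so the compiler knows control cannot fall off the end of the lambda.
#define SKETCH_UNREACHABLE() std::abort()

enum class Kind { ALLOW, DENY };

std::string describe(Kind kind)
{
  auto name = [](Kind k) -> std::string {
    switch (k) {
      case Kind::ALLOW: return "allow";
      case Kind::DENY: return "deny";
    }

    // Every enumerator returns above, but some compilers still warn
    // that control may reach the end of a non-void function.
    SKETCH_UNREACHABLE();
  };

  return name(kind);
}
```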

Review: https://reviews.apache.org/r/75128/

* [devices] Add ability to remove cgroup from DeviceManager state.

When destroy() is called on a container, its cgroup and its children
will be cleaned up. We need to remove the cgroup from the device manager
state when this happens to ensure that the state is accurate.

Review: https://reviews.apache.org/r/75120/

* [cgroups2] Pass device manager to controllers & cgroups2 isolator.

Passes the device manager to the cgroups2 isolator on containerizer
startup, and sets up the ability for the manager to be passed to the
device controller and GPU isolator.

Review: https://reviews.apache.org/r/75016/

* [cgroups2] Introduces the DeviceControllerProcess.

Introduces a device controller that supports cgroups v2 and is available
in the Cgroups2IsolatorProcess. Device access control is made through
the DeviceManager.

Review: https://reviews.apache.org/r/75098/

* [cgroups2] create device controller in Cgroups2Isolator.

DeviceController needs to be created in Cgroups2Isolator with the
DeviceManager so that the default whitelist can be properly configured.

Review: https://reviews.apache.org/r/75121/

* [cgroups2] Skip enabling of devices controller.

Similar to the `perf_event` controller, the `devices` controller cannot
be written into cgroup.subtree_control file, so we skip the call to
cgroups2::controllers::enable for the device controller. Otherwise
we will run into an "Invalid argument" error from cgroups.

Review: https://reviews.apache.org/r/75130/

* [cgroups2] Silence incorrect compiler error in the tests.

* [device manager] Let non-wildcards entries check device access.

Currently, we only allow normal Entry instances for checking whether a
device access would be allowed for a cgroup.

We also want to allow NonWildcardEntry instances to do this.

Review: https://reviews.apache.org/r/75135/

* [device manager] Add wildcard conversion helper.

We currently have a wildcard conversion helper but it was only
available for use inside the device manager test file.

This change pulls out the helper and makes it available as a static
function for use outside just tests.

Review: https://reviews.apache.org/r/75137/

* [cgroups2] Support DeviceManager in GPU isolator.

Currently, the GPU isolator assumes we are only using cgroups v1, and
makes use of the cgroups::devices::allow and deny functions to control
GPU access.

In Cgroups2, we need to attach ebpf programs for the specific cgroups,
which is done for us in the DeviceManager. Hence, we need to use the
DeviceManager in the GPU isolator depending on whether cgroups v1 or v2
is currently mounted.

Review: https://reviews.apache.org/r/75074/

* [device manager] Add protobuf for cgroup state checkpointing.

Currently the device manager has no means of recovering its state
after an agent restarts. This patch aims to add a protobuf definition
to let the device manager have a checkpoint file that it can use to
recover each cgroup's device access state.

Review: https://reviews.apache.org/r/75141/

* [device manager] Add device state file path helper.

Currently we do not have a place to keep the checkpoint file for the
device manager.

This patch adds a path helper that gives us a file that lets us
checkpoint the device manager state.

Review: https://reviews.apache.org/r/75142/

* [device manager] Checkpoint state on device manager state change.

Currently we do not checkpoint the device access state of each cgroup
when configure or reconfigure is called, meaning that we have no way
of recovering a cgroup's device access state.

We will checkpoint state of the device manager whenever its state
is being changed to ensure that we can recover the most recent state
when necessary.

Review: https://reviews.apache.org/r/75143/

* [device manager] Add args to customize commit_device_access_changes behavior.

Currently, commit_device_access_changes always checkpoints the
device manager state **and** configures the bpf programs for the cgroup
based on its device access state.

We add an argument to commit_device_access_changes for the caller to
determine whether they want the state to be checkpointed along with
attaching the ebpf program.

Review: https://reviews.apache.org/r/75148/

* [cgroups2] Introduce the IoControllerProcess

Introduces the IoControllerProcess.

This replaces the blkio controller from cgroups v1.
We currently only use it to help us work with the cgroups/all isolation flag.

Review: https://reviews.apache.org/r/75155/

* [hugetlb] Introduce the HugeTLBControllerProcess

Introduces the `HugeTLBControllerProcess`.
Hosts that are correctly configured for cgroups v2 and provide
`cgroups/hugetlb` in the `isolation` flag and `hugetlb` in the
`agent_subsystems` flag will use this controller.

Review: https://reviews.apache.org/r/75152

* [cpuset] Introduce the CpusetControllerProcess

Hosts that are correctly configured for cgroups v2 and provide
cgroups/cpuset in the isolation flag and cpuset in the agent_subsystems
flag will use this controller.

Review: https://reviews.apache.org/r/75153

* [pids] Introduce the PidsControllerProcess

Introduces the PidsControllerProcess.
Hosts that are correctly configured for cgroups v2 and provide
cgroups/pids in the isolation flag and pids in the agent_subsystems
flag will use this controller.

Review: https://reviews.apache.org/r/75154

* [build] Fix compilation error from cherry picks.

Minor fix to allow current builds to progress again.

Review: https://reviews.apache.org/r/75157/

* [cgroups2] Prevent containerId from prepending cgroup root during recovery.

Currently, CgroupsIsolatorTest.ROOT_CGROUPS_PERF_PerfForward fails
during recovery because it cannot create the directory for the recovered
container. This happens because the original containerId, when
recovered, includes the cgroup root. The function that converts a cgroup
to a containerId does not ignore the cgroup root, even though we do not
expect it to be included.

To fix this, we will remove the first token of the cgroup if it matches
the cgroup root. This will prevent attaching an extraneous cgroup root
to the containerId when parsing it from a cgroup.
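
A minimal sketch of stripping the root component, written with plain string operations. `stripCgroupRoot` is a hypothetical helper and not the actual paths::cgroups2::containerId() code.

```cpp
#include <string>

// Hypothetical helper: removes a leading `root` component from `cgroup`
// if present, and otherwise returns `cgroup` unchanged.
std::string stripCgroupRoot(const std::string& root, const std::string& cgroup)
{
  if (cgroup == root) {
    return ""; // Nothing is left once the root is removed.
  }

  const std::string prefix = root + "/";
  if (cgroup.compare(0, prefix.size(), prefix) == 0) {
    return cgroup.substr(prefix.size());
  }

  return cgroup; // The root is not the first component.
}
```

Matching on `root + "/"` instead of `root` alone avoids stripping part of a first component that merely starts with the root name.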

Review: https://reviews.apache.org/r/75156/

* [paths] Simplify paths::cgroups2::containerId() logic.

Currently we are tokenizing the cgroup argument when we could be
directly operating on the string using our strings utility helpers
instead.

This patch replaces the tokenizing logic to using the strings library.

Review: https://reviews.apache.org/r/75160/

* [cgroups2] Collect process & thread from leaf groups.

Currently, our core controller usage() function tries to read the cgroup
files in the argument cgroup. However, in our design for cgroups v2,
processes and threads live in the leaf child of a cgroup. Hence the
usage collection will not find the actual processes and threads for a
cgroup if it's not already specified as a leaf group.

This patch adds a check to see if the argument cgroup is a leaf cgroup,
and will search for processes and threads in the leaf cgroup instead.

Review: https://reviews.apache.org/r/75158/

* [cgroups2] Register core controller in container info.

Currently the core controller is skipped over during prepare(),
and it is not added to the container's registered controllers.

We need to register the core controller so that its functions can be
called by the isolator.

Review: https://reviews.apache.org/r/75159/

* [cgroups2] Fix ROOT_CGROUPS_MemoryForward for cgroups2.

The ROOT_CGROUPS_MemoryForward test is currently failing because it is
looking for the memory hierarchy, which no longer exists in cgroups2 due
to the new unified hierarchy.

We skip this hierarchy check if we detect cgroups2 is mounted on the
system.

Review: https://reviews.apache.org/r/75161/

* [cgroups2] Create memory controller using cgroups/all flag.

Currently, we cannot use the cgroups/all flag to create its corresponding
controllers in cgroups2.

The flag causes us to grab all the values in the cgroup.controllers
file. But we should instead just add all the creators when we see
cgroups/all, as many controllers are no longer present in
cgroup.controllers in cgroups v2.

This fix incidentally fixes the
ROOT_CGROUPS_AgentRecoveryWithNewCgroupSubsystems test as it was unable
to create the memory controller using cgroups/all flag.

Review: https://reviews.apache.org/r/75162/

* [cgroups2] Fix ROOT_CGROUPS_AutoLoadSubsystems test.

Currently, the ROOT_CGROUPS_AutoLoadSubsystems test is failing because
it is checking for hierarchies for subsystems enabled under
'cgroups/all'. In cgroups2 we cannot perform this check because of the
unified hierarchy.

Hence we skip this hierarchy check and instead check that all available
cgroups2 controllers are enabled by reading cgroup.controllers and
cgroup.subtree_control.

Review: https://reviews.apache.org/r/75163/

* [cgroups2] Add device manager recovery support.

We currently do not have any method of recovering the device access
states when the cgroups2 isolator is attempting to recover containers.

We add a recovery state here that makes use of the protobuf checkpoint
files to ensure that the previous device accesses of cgroups can be
restored. It will be used by the cgroups2 isolator.

Review: https://reviews.apache.org/r/75145/

* [cgroups2] Recover device manager with cgroups2 isolator.

This patch lets us call the device manager's recovery function after
the containers from recovery_state have been successfully recovered,
allowing us to begin recovering the cgroup device access state for each
recovered non-orphan container.

Review: https://reviews.apache.org/r/75149/

* [cgroups2] Enable controller in parent cgroups during prepare().

To support nested containers with nested cgroups, we need to enable
controllers in cgroup.subtree_control file for the appropriate nested
cgroup.

To do so, we need to ensure that the parents have the requested
controller in their cgroup.subtree_control file so that the nested
cgroup can have the controller written into subtree_control as well.
Otherwise we will get a 'no such file or directory' error.
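
The ordering requirement can be sketched as a pure path computation: list the `cgroup.subtree_control` file of every ancestor, root first, so that the controller can be enabled top-down before touching the nested cgroup. The function name and the explicit `mount` parameter are hypothetical; only the file name comes from the cgroups v2 interface.

```cpp
#include <string>
#include <vector>

// Returns the cgroup.subtree_control files of all ancestors of `cgroup`
// (a path relative to `mount`), ordered from the root downwards. The
// controller has to be enabled in each of them, in this order.
std::vector<std::string> ancestorSubtreeControlFiles(
    const std::string& mount,
    const std::string& cgroup)
{
  std::vector<std::string> files = {mount + "/cgroup.subtree_control"};

  std::string current = mount;
  std::string::size_type start = 0;
  while (start < cgroup.size()) {
    std::string::size_type end = cgroup.find('/', start);
    if (end == std::string::npos) {
      break; // The last component is the nested cgroup itself.
    }

    current += "/" + cgroup.substr(start, end - start);
    files.push_back(current + "/cgroup.subtree_control");
    start = end + 1;
  }

  return files;
}
```

For a cgroup `a/b/c` this yields the files of the root, `a`, and `a/b`, in that order.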

Review: https://reviews.apache.org/r/75166/

* [cgroups2] Add isolate field for nested containers.

Currently we do not support nested containers. We need to let nested
containers pick whether they want their own resource constraints based
on the LinuxInfo::share_cgroups field in the API.

In cgroups v1, we didn't need to track an additional field for this,
because the isolator does not store nested containers within its `infos`
map.

In cgroups v2, we will *always* create cgroups for nested containers,
and LinuxInfo::share_cgroups instead specifies whether these cgroups
will have resource isolation applied to them. (LinuxInfo::share_cgroups
needs to be renamed accordingly).

In later patches, we will use this to skip the update and isolate
calls on the controllers if isolate == false.

Review: https://reviews.apache.org/r/75167/

* [cgroups2] Enforce use of linux launcher with cg2 isolator.

Currently we are not checking that the cgroups isolator is being used
with the linux launcher. We need to ensure that if the linux launcher
is being used, the cgroups isolator is also being used so that the
cgroups for the containers can be made inside the isolator's prepare().

Review: https://reviews.apache.org/r/75171/

* [cgroups2] Separate responsibility for creating cgroup and assigning pids.

In cgroups2, the current linux launcher does not create cgroups nor does
it move the pids into the container's leaf cgroup during fork().

When we launch a container, we first prepare it via the isolators,
then the launcher will call fork to, among other things, move the pid
into its appropriate cgroup. Once the fork is over, isolate() is called
on the isolators.

As such, we will remove the cgroups2 isolator's current behavior of
assigning pids into leaf cgroups as it is already done by the linux
launcher.

Review: https://reviews.apache.org/r/75170/

* [cgroups2] Do DEBUG container check after creating cgroup for it.

Since we create cgroups for all containers, we will also create one for
a DEBUG container. We expect DEBUG containers not to have their own
resource constraints, so we will return from __prepare early, before
reaching the CHECK.

Review: https://reviews.apache.org/r/75172/

* [cgroups2] Only call isolate() on controllers of containers with isolate == true.

In cgroups1, when handling nested containers that share cgroups, we
skip isolate calls for them. We want to replicate this behavior in
cgroups2.

Review: https://reviews.apache.org/r/75173/

* [cgroups2] Make isolator update() fail for containers that don't need isolation.

In cgroups1, we returned an error when we see that the container is
sharing cgroups, as it would have no cgroups for update() to take.

In cgroups2 we will mimic this behavior for containers that do not
wish to have their own resource constraints, as we do not expect to call
update() on those containers.

Review: https://reviews.apache.org/r/75174/

* [cgroups2] Make isolator status() return parent status for containers w/ !isolate.

In cgroups1, if a container is sharing cgroups with its parents, we will
return the parent's status.

In cgroups2, we want to mimic this behavior even though we always create
cgroups for our containers, since we do not do anything with our
container's cgroup if isolate == false.

Review: https://reviews.apache.org/r/75175/

* [cgroups2] Handle unknown containers in watch().

In cgroups1, we returned a pending future for containers that shared
cgroups with their parents.

In cgroups2, since we always create cgroups for our containers, we no
longer need to consider this special case. So we only return failure
if there is an unknown container.

Review: https://reviews.apache.org/r/75176/

* [cgroups2] Perform chown of cgroup if necessary.

In cgroups1, we chown for nested cgroups so that they can create deeper
layers of cgroups. We want to replicate this behavior in cgroups2.

Review: https://reviews.apache.org/r/75178/

* [cgroups2] Enable support for nested containers in the isolator.

We enable support for nested containers on systems with cgroups v2 with
this patch. This means that nested containers will now have their
cgroups created for them, and that the cgroups2 isolator functions will
be called for nested containers.

Review: https://reviews.apache.org/r/75177/

* [cgroups2] cgroups2::destroy retry rmdir on EBUSY.

Currently we only wait until a cgroup's list of pids (retrieved from
cgroup.procs) is empty. However, even if a cgroup has no pids left, the
rmdir call on it may still return EBUSY, causing us to fail the destroy
operation. We want to retry the rmdir operation even on EBUSY for up to
5 seconds to ensure that we are able to delete the cgroup.
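
A minimal sketch of such a retry loop using POSIX `rmdir`. The function name, the polling interval, and the choice to give up immediately on any error other than EBUSY are assumptions of this sketch; only the 5 second budget comes from the description above.

```cpp
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>

#include <unistd.h>

// Hypothetical helper: retries rmdir() while it fails with EBUSY, until
// `timeout` has elapsed. Any other error is treated as fatal.
bool rmdirWithRetry(
    const std::string& path,
    std::chrono::milliseconds timeout = std::chrono::seconds(5),
    std::chrono::milliseconds interval = std::chrono::milliseconds(10))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;

  while (true) {
    if (::rmdir(path.c_str()) == 0) {
      return true;
    }

    if (errno != EBUSY) {
      return false; // Not a transient failure.
    }

    if (std::chrono::steady_clock::now() >= deadline) {
      return false; // Still busy after the whole budget.
    }

    std::this_thread::sleep_for(interval);
  }
}
```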

This approach is similar to how crun destroys its cgroups.
See:

https://github.com/containers/crun/blob/10b3038c1398b7db20b1826f94e9d4cb444e9568/src/libcrun/cgroup-utils.c#L471

Review: https://reviews.apache.org/r/75181/

* [libprocess] Add io::Watcher for fs notifications.

Adds basic watcher class for filesystem watch notifications. We
currently only support Linux with inotify.

We currently support inotify events for writing, deleting, and renaming
a file. We do not support watching directories.

Review: https://reviews.apache.org/r/75182/
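
To make the inotify usage concrete, the following is a minimal,
Linux-only sketch that watches a single file for write, delete, and rename
events. It does not show the io::Watcher API added by this patch: the
function names `watchFile`, `nextEvent`, and `detectsWrite` are
hypothetical, and the blocking poll stands in for whatever event loop
integration the real class uses.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <string>

#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

// Hypothetical helper: creates an inotify instance that reports writes,
// deletion, and renames of the file at `path`. Returns the inotify file
// descriptor, or -1 with errno set.
int watchFile(const std::string& path)
{
  const int fd = ::inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
  if (fd < 0) {
    return -1;
  }

  const uint32_t mask = IN_MODIFY | IN_DELETE_SELF | IN_MOVE_SELF;
  if (::inotify_add_watch(fd, path.c_str(), mask) < 0) {
    const int error = errno;
    ::close(fd);
    errno = error;
    return -1;
  }

  return fd;
}

// Hypothetical helper: waits up to `timeoutMs` milliseconds for the next
// event and returns its mask, or 0 on timeout or error. Events for a
// watched file (as opposed to a directory) carry no name, so a buffer of
// sizeof(inotify_event) holds exactly one event.
uint32_t nextEvent(int fd, int timeoutMs)
{
  assert(fd >= 0);

  pollfd pfd = {fd, POLLIN, 0};
  if (::poll(&pfd, 1, timeoutMs) <= 0) {
    return 0;
  }

  alignas(inotify_event) char buffer[sizeof(inotify_event)];
  const ssize_t length = ::read(fd, buffer, sizeof(buffer));
  if (length != static_cast<ssize_t>(sizeof(buffer))) {
    return 0;
  }

  return reinterpret_cast<const inotify_event*>(buffer)->mask;
}

// Usage example: returns true if a write to a temporary file is observed.
bool detectsWrite()
{
  char path[] = "/tmp/watcher-XXXXXX";
  const int file = ::mkstemp(path);
  if (file < 0) {
    return false;
  }

  bool observed = false;
  const int fd = watchFile(path);
  if (fd >= 0) {
    observed = ::write(file, "x", 1) == 1 &&
               (nextEvent(fd, 1000) & IN_MODIFY) != 0;
    ::close(fd);
  }

  ::close(file);
  ::unlink(path);
  return observed;
}
```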

* [build] Fix cmake build / tidybot.

The cmake build, and therefore tidybot, is failing because it cannot
find the device manager protobuf files. We run protobuf for
device_manager/state.proto to generate the files.

Review: https://reviews.apache.org/r/75186/

* [cgroups2] Introduce an OomListener.

We add an OomListener process to allow users to listen for oom events
in a cgroup or any of its descendants.

If the OomListener is terminated, any remaining unsatisfied futures will
be failed.

If the listened cgroup or any of its descendants encounters an oom
event, then the returned future from listen() will become ready, and
action can be taken upon the oom event via future onReady handlers.

The caller can also discard a returned future to stop listening for
events.

Review: https://reviews.apache.org/r/75184/
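
For context, cgroups v2 exposes oom counters in a cgroup's memory.events
file, whose fields are hierarchical and so include descendants. The sketch
below shows how an oom event could be detected by comparing two snapshots
of that file. It assumes the listener is driven by memory.events, which
this commit message does not state, and `parseMemoryEvents` and
`oomOccurred` are hypothetical names rather than part of the OomListener
interface.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parses the "key value" lines of a memory.events
// file into a map from event name to counter.
std::map<std::string, uint64_t> parseMemoryEvents(const std::string& contents)
{
  std::map<std::string, uint64_t> events;

  std::istringstream in(contents);
  std::string key;
  uint64_t value = 0;

  while (in >> key >> value) {
    events[key] = value;
  }

  return events;
}

// Hypothetical helper: returns true if the "oom" or "oom_kill" counter
// increased between two snapshots. Missing keys are treated as zero.
bool oomOccurred(
    const std::map<std::string, uint64_t>& before,
    const std::map<std::string, uint64_t>& after)
{
  for (const std::string key : {"oom", "oom_kill"}) {
    const auto previous = before.find(key);
    const auto current = after.find(key);

    const uint64_t old = previous == before.end() ? 0 : previous->second;
    const uint64_t now = current == after.end() ? 0 : current->second;

    if (now > old) {
      return true;
    }
  }

  return false;
}
```

A listener built along these lines would take a snapshot when listen() is
called and satisfy the returned future once a later snapshot shows an
increase.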

* [cgroups2] Use OomListener in MemoryControllerProcess.

The MemoryControllerProcess needs an OomListener so that it does not
have to detect oom events by polling, which causes race conditions
with the oom killer.

We spawn an OomListener in the MemoryControllerProcess and use it
to listen for oom events in any cgroup via oomListen().

Review: https://reviews.apache.org/r/75185/

* [io] Fix warnings during compilation.

Currently we have a warning that occurs on compilation about comparison
between unsigned long and long. We update the tests to suppress this
warning.

Review: https://reviews.apache.org/r/75187/
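
The commit message does not say how the tests were changed. One common way
to avoid a signed/unsigned comparison warning is to check the sign first
and then cast explicitly, sketched below with a hypothetical helper
`sameLength`.

```cpp
#include <cassert>
#include <cstddef>

#include <sys/types.h>

// Hypothetical helper: compares a signed length (for example, the result
// of read(2)) with an unsigned expected size without triggering a
// sign-compare warning. Negative lengths never match.
bool sameLength(ssize_t actual, std::size_t expected)
{
  return actual >= 0 && static_cast<std::size_t>(actual) == expected;
}
```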

* [cgroups2] Remove completed todos.

The todos mention recovery support for the device manager, and using
the device manager recovery in the device controller process.

We have added checkpointing and recovery support in:
https://reviews.apache.org/r/75145

We also only use the device manager recovery in the cgroups2 isolator
instead of in the device controller, as implemented in:
https://reviews.apache.org/r/75149/

As such, these todos are considered completed, and can be removed.

Review: https://reviews.apache.org/r/75189/

* [gpu] Let NvidiaGpuIsolatorProcess support nesting.

The NvidiaGpuIsolatorProcess currently still does not declare itself
to support nesting because nested containers were not supported for
the cgroups2 isolator.

Nested container support was added for cgroups2 isolator in:
https://reviews.apache.org/r/75177/

As such, we declare that NvidiaGpuIsolatorProcess supports nesting.

Review: https://reviews.apache.org/r/75190/

* [docs] Add public docs for Cgroups v2.

Currently there is no official documentation outlining the changes
we have been making to support Cgroups v2.

We add a main document outlining how Mesos interacts with Cgroups v2,
and update some documents on the changes that were made, such as
the device isolator document.

Review: https://reviews.apache.org/r/75191/

* CHANGE: resolving conflicts

---------

Co-authored-by: Devin Leamy <dleamy@twitter.com>
Co-authored-by: Benjamin Mahler <bmahler@apache.org>
Co-authored-by: None <None>
Co-authored-by: bmahler <benjamin.mahler@gmail.com>
Co-authored-by: Jason Zhou <jasonzhou460@gmail.com>
Co-authored-by: Jason Zhou <jasonzhou@twitter.com>