[cgroups2] Introduced API to kill the processes inside of cgroup subtree. by DevinLeamy · Pull Request #550 · apache/mesos

DevinLeamy · 2024-04-05T18:45:57Z

Introduces

	cgroups2::kill(cgroup)

which will recursively kill all of the processes in the cgroups of a subtree.

We additionally update cgroups::destroy to use cgroups::kill such that it now completely destroys a cgroup (i.e. all directories and processes).
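For context, cgroups v2 kernels that provide the `cgroup.kill` control file allow writing "1" to it, which kills every process in that cgroup and in its descendants. Below is a minimal sketch of signalling such a kill. It is not the code in this PR: `signalKill` and its `root` parameter (standing in for the cgroup2 mount point) are hypothetical names used only for illustration.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical helper, not the Mesos API: signals a kill for `cgroup`
// (a path relative to `root`) by writing "1" to its 'cgroup.kill' file.
// Returns false if the file cannot be opened or the write fails.
bool signalKill(const std::string& root, const std::string& cgroup)
{
  const std::filesystem::path control =
    std::filesystem::path(root) / cgroup / "cgroup.kill";

  std::ofstream file(control);
  if (!file.is_open()) {
    return false;
  }

  file << "1";
  file.flush();
  return file.good();
}
```

A caller would pass the mount point and a relative cgroup path, for example `signalKill("/sys/fs/cgroup", "mesos/some-container")`, where the cgroup path is made up for the example. The write only signals the kill; the processes may take a moment to exit.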


DevinLeamy · 2024-04-05T18:46:42Z

src/linux/cgroups2.cpp

+  vector<string> sorted(cgroups->begin(), cgroups->end());
+  sorted.push_back(cgroup);
+  std::sort(sorted.rbegin(), sorted.rend());


We sort the cgroups in reverse lexicographic order here so that every cgroup comes before its ancestors, i.e. the most deeply nested directories are handled first.
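To illustrate why the reverse sort is enough: a path that has another path as a proper prefix always compares greater, so in descending order each cgroup appears before its ancestors. The sketch below isolates that step; `removalOrder` is a hypothetical name, not a function in this PR.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns `root` and its descendants in reverse
// lexicographic order, so that every cgroup precedes its ancestors.
std::vector<std::string> removalOrder(
    const std::string& root,
    const std::vector<std::string>& descendants)
{
  std::vector<std::string> sorted(descendants.begin(), descendants.end());
  sorted.push_back(root);

  // Sorting through reverse iterators yields descending order.
  std::sort(sorted.rbegin(), sorted.rend());
  return sorted;
}
```

Note that siblings are not ordered strictly by depth: for `a`, `a/b`, `a/b/c` and `a/d` the result is `a/d`, `a/b/c`, `a/b`, `a`. What matters for removal is only that children come before their parents.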

bmahler · 2024-04-05T21:15:01Z

src/linux/cgroups2.cpp

+  // the most deeply nested directories first.
+  Try<Nothing> kill = cgroups2::kill(cgroup);
+  if (kill.isError()) {
+    return Error("Failed to kill processes in cgroup");


should include the error here
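A sketch of what this could look like, using a stand-in result type rather than stout's `Try` so that the example is self-contained. `Result`, `killProcesses` and `destroy` are hypothetical names; with `Try<Nothing>` the underlying message would presumably come from `kill.error()`.

```cpp
#include <optional>
#include <string>

// Stand-in for a Try<Nothing>-style result (an assumption for this sketch):
// empty on success, otherwise the error message.
using Result = std::optional<std::string>;

// Hypothetical kill step that fails for an empty cgroup path.
Result killProcesses(const std::string& cgroup)
{
  if (cgroup.empty()) {
    return std::string("cgroup path is empty");
  }
  return std::nullopt;
}

// Propagates the underlying error instead of dropping it.
Result destroy(const std::string& cgroup)
{
  Result kill = killProcesses(cgroup);
  if (kill.has_value()) {
    return std::string("Failed to kill processes in cgroup '") + cgroup +
           "': " + *kill;
  }
  return std::nullopt;
}
```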

* [cgroups2] Introduce build files for the cgroups2 `Controller` abstraction.
* [cgroups2] Introduces a `Controller` abstraction for cgroups v2 controllers. NOTE: In cgroups v1 we call "controllers" "subsystems". In cgroups v2, we exclusively use the term "controller", which is what is used in the Linux documentation for cgroups v2. For cgroups v1, the `Subsystem` abstraction is used to represent a cgroup controller. `Subsystem`s exist for each of the controllers provided by the cgroups v1 API. We do similar for cgroups v2, now introducing the `Controller` abstraction. The difference between a `Controller` and a `Subsystem` - besides the name - is that a `Controller` does not have an associated hierarchy. This is because in cgroups v2, controllers (a.k.a. subsystems) do not need to be individually mounted. All controllers are "mounted" under the unified `cgroup2` filesystem that we require be mounted at `/sys/fs/cgroup`. The `Cgroups2IsolatorProcess` delegates resource isolation requests to `Controller`s instead of `Subsystem`s.
* [cgroups2] Removed templatized write() method. `cgroups2::write` does not need to be a template function (unlike `cgroups2::read`) because standard C++ overloading is sufficient to handle writing multiple different types, without definition conflicts. Hence, we make `cgroups2::write` not a template function.
* [cgroups2] Add an interface to read and write the CPU bandwidth limit. In cgroups v2, the CPU bandwidth and bandwidth period (duration over which the bandwidth can be spent) are set in `cpu.max`. This patch introduces an interface to read and update these values. A `BandwidthLimit` object is introduced, which represents a snapshot of the `cpu.max` control file. Note: This stands in contrast to cgroups v1 where the period and bandwidth were set in separate control files, `cpu.cfs_period_us` and `cpu.cfs_quota_us` respectively. This closes #541
* [cgroups2] 'cpu.max' parsing fix and introduce a test.
The 'cpu.max' file is terminated by a newline which causes an error while parsing. Hence, we trim whitespace before we parse its contents. A test is also introduced. This closes #543
* [cgroups2] Introduce the CpuControllerProcess. Introduces the `CpuControllerProcess`, a cpu isolator that is implemented using cgroups v2 and indirectly exposed through the `Cgroups2IsolatorProcess`. Hosts correctly configured for cgroups v2 that provide `cgroups/cpu` in the `isolation` flag and `cpu` in the `agent_subsystems` flag will use this controller. This closes #545
* [cgroups2] Introduce interface for reading threads in a cgroup. `cgroups2::threads` reads the threads in a cgroup into a `set`, similar to `cgroups2::processes`. Moving a process into a cgroup moves all the threads in that process into the cgroup. A test is introduced to ensure the threads move, as expected. This closes #546
* [cgroups2] Adds the `CoreControllerProcess` to the `Cgroups2IsolatorProcess`. The `CoreControllerProcess`, referred to as the "core" controller, is implemented and enabled by default by the cgroups v2 isolator process. We enable it by default because all cgroups in cgroups v2 have the core control files ("cgroup.*"), which the `CoreControllerProcess` manages. Currently, `CoreControllerProcess` reports the number of processes and threads in a cgroup, if the `cgroups_cpu_enable_pids_and_tids_count` flag is provided. This closes #547
* Removed dead field 'subsystem' from the `LinuxLauncherProcess`.
* [cgroups2] Introduce interface to get cgroups nested inside of a cgroup. Introduces `cgroups2::get(cgroup)` which returns the cgroups inside of the given cgroup.
* [cgroups2] Introduced API to kill the processes inside of cgroup subtree. Introduces `cgroups2::kill(cgroup)` which will recursively kill all of the processes in the cgroups of a subtree. We additionally update `cgroups::destroy` to use `cgroups::kill` such that it now completely destroys a cgroup (i.e. all directories and processes).
This closes #550
* [cgroups2] Introduce utility to parse a container id from a cgroup path. During agent recovery, we parse the directories in the cgroup hierarchy to determine what containers were previously running in the agent. Here we implement the cgroup directory parsing for cgroups v2's updated cgroup directory structure. This closes #551
* Linux launcher cleanups. These are cleanups extracted out from https://github.com/apache/mesos/pull/552/ in order to reduce the diff noise in adding cgroups v2 support.
* [cgroups2] Update the LinuxLauncher to support cgroups v2. Updates the `LinuxLauncher` to use cgroups v2. Like with other cgroup v2 functionality, the new launcher is used by default if the host is correctly configured for cgroups v2 and Mesos has been compiled with the --enable-cgroups-v2 flag. This closes #552
* [cgroups2] Introduce `memory` controller. Introduces the cgroups v2 `memory` controller, the `cgroups2::memory::usage` function to obtain the memory usage of a cgroup and its descendants, and a test. This closes #553
* [cgroups2] Introduced API to set memory.min for a cgroup. Introduces
  ```
  cgroups2::memory::min(cgroup)            // get the minimum
  cgroups2::memory::set_min(cgroup, bytes) // set the minimum
  ```
  to get and set the minimum amount of memory, in bytes, that is guaranteed to not be reclaimed by the kernel under any conditions. This closes #554
* [cgroups2] Introduced an interface to set a hard memory limit. The "memory.max" control contains the hard memory limit that a cgroup and its descendants must remain below. We introduce `cgroups2::memory::set_max` and `cgroups2::memory::max` to set and get this limit. This closes #557
* Mitigate a case where the agent gets stuck sending TASK_DROPPED. Per MESOS-7187, there is a case where the master holds a stale resource UUID for the agent's resources, and all subsequent task launches result in the agent sending TASK_DROPPED due to "Task assumes outdated resource state".
While this patch doesn't fix the general issue of MESOS-7187, it does mitigate a known problematic case due to the introduction of the agent having its own resource UUID.
* Add a regression test for the mitigation of MESOS-7187.
* Removed trailing spaces.
* [cgroups2] Introduce API to set soft memory protection. The 'memory.low' control is for soft memory protection. This only applies when the system is trying to reclaim memory. Soft memory protection means that the kernel will do its best to not reclaim memory from the cgroup if its usage is below the value in 'memory.low'. Before it reclaims any memory below the value in 'memory.low' it will first reclaim unprotected memory from other cgroups. We introduce `cgroups2::memory::low` and `cgroups2::memory::set_low` to get and set this soft memory protection limit.
* [cgroups2] Introduce API to set a soft memory limit. The "memory.high" control contains the soft memory limit for a cgroup and its descendants. Exceeding the limit will cause the cgroup's processes to get throttled and will put the cgroup under memory pressure. We introduce `cgroups2::memory::set_high` and `cgroups2::memory::high` to set and get this soft memory limit.
* [cgroups2] Introduced API to listen for OOM events. Introduces `cgroups2::memory::events::oom` which returns a future that resolves when the cgroup reaches its memory limit and allocation was about to fail. In cgroups v1, there was a bespoke notification API. Cgroups v2 provides the 'memory.events' control which contains key-value pairs of events and the number of times they took place [1]. For OOMs, we look at the value of the `oom` field. In `cgroups2::memory::events::oom` we watch for changes to 'memory.events' (via polling every 100ms for now, and later via inotify) and resolve a future when `events.oom > 0`. [1] https://docs.kernel.org/admin-guide/cgroup-v2.html#memory This closes #563
* [cgroups2] Error if `--cgroups_limit_swap` is used when cgroups v2 is used.
Mesos does not support limiting swap memory when using cgroups v2. This is because the cgroups v2 API allows separate control of swap usage and careful consideration is needed to figure out how to limit swap usage. Therefore, when the `--cgroups_limit_swap` flag is provided and cgroups v2 is used we error during flag validation. This closes #565
* [cgroups2] Add a subset of memory usage statistics. Cgroups v2 exposes memory statistics through the 'memory.stat' control. Here we introduce `cgroups2::memory::stats` to read a subset of the memory usage statistics into a new `memory::Stats` object. These statistics will be used by the `MemoryControllerProcess` to populate a `ResourceStatistics` object, like is done by the `MemorySubsystemProcess` in cgroups v1. Additional statistics from the 'memory.stat' control can be included as they are required. This closes #564
* [cgroups2] Implement Cgroups 2 isolator w/o nested containers and systemd. Updates the cgroups v2 isolator to include initialization, cleanup, update, and recovery logic. Unlike cgroups v1 we:
  - Create a new cgroup namespace during isolation, by introducing a new clone namespace flag. This implies that the contained process will only have access to cgroups in its cgroup subtree.
  - We only need to recover two cgroups (the non-leaf and leaf cgroups [1]) for each container, rather than having to recover one cgroup for each controller the container used.
  - We do not yet support nested containers.
  - We do not yet have a systemd integration. Since the cgroups v1 isolator's integration with systemd was largely to extend process lifetimes, the cgroups v2 isolator will function on systemd-managed machines, despite not having a first-class integration. A systemd integration will be added.

  Using the cgroups v2 isolator requires Mesos to be compiled with `--enable-cgroups-v2` and to have the cgroup2 filesystem mounted at /sys/fs/cgroup. Selecting the correct isolator version (v1 or v2) is done automatically.
v2 is used if the host supports cgroups v2 and is correctly configured. [1] The "non-leaf" cgroup is the cgroup for a container where resource constraints are imposed. The "leaf" cgroup, which has the path <non-leaf cgroup>/leaf, is where the container PID is put. Container PIDs are only put in leaf cgroups. This closes #556
* [cgroups2] Fix error message to show the correct path. The error message for failing to create the leaf cgroup was printing the non-leaf cgroup, instead of the leaf cgroup.
* [cgroups2] Don't enable controllers in the leaf cgroup. We cannot enable all the controllers in the leaf cgroup because we also put the container process in the leaf cgroup, which violates the no internal process constraint. For instance, if we enable the memory controller in the leaf cgroup and then try to move a process into the leaf cgroup, the operation will fail with EBUSY. If a container wants to manage its own cgroups, it will need to move its process into a new cgroup _before_ it enables controllers.
* [cgroups2] Made `cgroups2::processes` optionally recursive. Previously, `cgroups2::processes` could only fetch processes from inside of the provided cgroup. We can now fetch all of the processes inside of a cgroup subtree by passing an (optional) `recursive` flag.
  ```c++
  Try<std::set<pid_t>> processes(
      const std::string& cgroup,
      bool recursive = false);
  ```
  This closes #570
* [cgroups2] Update `destroy` to be async and more robust. We were running into an inconsistent issue with `cgroups2::destroy`. `cgroups2::destroy` would fail with EBUSY when removing cgroups with `rmdir`. The error was being caused because some processes had not been killed when `rmdir` was called on their cgroup; a cgroup with processes cannot be destroyed. After signalling a kill (by writing "1" to 'cgroup.kill') sometimes processes were staying alive long enough to cause `rmdir` to fail.
Hence, we update `cgroups2::destroy` to wait after signalling a SIGKILL for all the processes to drop before attempting to remove the cgroups. Since we wait a maximum of half a second, we don't want to block the caller. Thus, we update `destroy` to be async.
* Split out cgroups setup / teardown logic in ContainerizerTest. `StartSlave()` and similar test-setup functions mounted cgroups v1 hierarchies and initialized controllers. On cgroups v2 machines, this setup would fail or result in irregular cgroup setups. As a step towards end-to-end testing for the `MesosContainerizer`, we update the Agent test fixtures such that they work correctly on both cgroups v1 and v2 hosts. This closes #572
* [cgroups2] Add cgroups v2 setup and teardown logic to ContainerizerTest. `StartSlave()` and similar test-setup functions mounted cgroups v1 hierarchies and initialized controllers. On cgroups v2 machines, this setup would fail or result in irregular cgroup setups. As a step towards end-to-end testing for the `MesosContainerizer`, we update the Agent test fixtures such that they work correctly on both cgroups v1 and v2 hosts. This closes #573
* [cgroups2] Report usage statistics for the cgroups v2 isolator process. Overrides `::usage` for the `Cgroups2IsolatorProcess` so the MesosContainerizer gets ResourceStatistics reported by the cgroups v2 controller processes, for example the `CpuControllerProcess`.
* Fix compilation error when cgroups v2 is not being compiled. This closes #575
* [cgroups2] Handle missing 'kernel' field in 'memory.stat' on linux < 5.18. The 'kernel' key was introduced to 'memory.stat' in Kernel 5.18 and therefore isn't present on older kernels. If it is missing, we set `kernel` to be the sum of the other kernel usage fields provided in 'memory.stat'. This is an under-accounting since it doesn't include: - various kvm allocations (e.g.
allocated pages to create vcpus) - io_uring - tmp_page in pipes during pipe_write() - bpf ringbuffers - unix sockets. But it's the best measurement we can provide prior to the 'kernel' stat being added in 5.18 that catches all of these. As part of this, we add the 'slab' key (one of the kernel memory usage fields) to the `memory::Stats` structure. See kernel patch introducing 'kernel': https://github.com/torvalds/linux/commit/a8c49af3be5f0b4e105ef678bcf14ef102c270be This closes #576
* [cgroups2] Watch and respond to container limitations. Each `ControllerProcess` used by the cgroups v2 isolator can optionally override `::watch`, which returns a future that resolves when a container limitation (e.g. memory limit reached) is detected. Here we introduce listening and responding to these container limitations, like is done in cgroups v1.
* [cgroups2] Introduces the MemoryControllerProcess. Introduces the `MemoryControllerProcess`, the cgroups v2 memory isolator, which will be used by the `Cgroups2IsolatorProcess`. Unlike the `MemorySubsystemProcess`, the cgroups v1 memory isolator, we:
  - Don't allow limits on swap memory to be set.
  - Don't report memory pressure levels (this facility is no longer part of the cgroups memory controller's API)

  Future work may include:
  - Adding support for swap memory, and
  - Reporting the (now available) memory pressure stall information

  This patch updates the ROOT_MemUsage so it passes on a cgroups v2 machine using the new MemoryControllerProcess. This closes #581
* [post-reviews] Replace deprecated distutils LooseVersion with packaging.version. This also gets rid of the Deprecation Warning we get when running the post-reviews.py script:
  ```
  DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  rbt_version = LooseVersion(rbt_version)
  ```
  Review: https://reviews.apache.org/r/74984/
* [cgroups2] Clarify cgroups2::memory::stats documentation.
After performing some testing, we found that memory.stat contains information about the cgroup *and its descendants*, but this is not currently mentioned in our own documentation. Review: https://reviews.apache.org/r/74980/
* [cgroups2] Add OOM listening to the MemoryControllerProcess. Introduces OOM listening to the MemoryControllerProcess so that we detect, report, and respond to OOM events. Review: https://reviews.apache.org/r/74979/
* [cgroups2] adjust CPU weight values from v1 to v2 default. Modifies the cgroups CPU weights to reflect the change from cpu.shares to cpu.weight. In v1, cgroups used cpu.shares which has a default of 1024. In v2, cgroups use cpu.weight which has a default of 100. The range for cpu.weight is [1,10000]; the minimum weight has been updated to reflect this. The revocable CPU weight has been scaled down from 10 to 1 to reflect a similar scale to the default. Review: https://reviews.apache.org/r/74992/
* [cgroups2] populate unevictable field from memory.stat. Review: https://reviews.apache.org/r/74991/
* [cgroup2] Fix CPU isolator tests on cgroups2 systems. The change involves migrating the isolator tests from MesosTests to ContainerizerTest, which inherits from MesosTests. This allows cgroups2 tests to create cgroups in appropriate directories during tests. Review: https://reviews.apache.org/r/74989/
* [cgroups2] Add memory usage reporting to the MemoryControllerProcess. Introduces `::usage` to the MemoryControllerProcess to report the total memory usage of a cgroup as well as memory usage statistics provided by `cgroups2::memory::stats`. Review: https://reviews.apache.org/r/74985/
* [cgroups2] Rename constants in cgroups2 isolator. Specify that the cgroups2 constants are cgroups2 in their names. This helps avoid redefinition of constants inside test files that may import constant files from both cgroups v1 and v2. Review: https://reviews.apache.org/r/74993/
* [cgroups2] Fix cgroups isolator test for RevocableCpu.
This patch fixes the RevocableCpu test for cgroups2 by conditionally skipping the hierarchy check which is only relevant to cgroups1 systems. Review: https://reviews.apache.org/r/74994/
* [cgroups2] crash when root folder is not detected when creating cgroups. Based on this ticket (https://issues.apache.org/jira/browse/MESOS-9305) and the ROOT_CGROUPS_CreateRecursively test in CgroupsIsolatorTest, there seems to be a possibility that the root folder may be deleted and new cgroups cannot be properly created. In v1, this was addressed by enabling recursively creating the groups. In v2, since we make use of cgroup.subtree_control to determine a cgroup and its descendants' access to controllers, we cannot recover this effectively if the root folder is deleted, so we can't just recursively create the folders. Hence, we elected to crash if the root folder is not found, as it will allow us to restart and go through the logic that takes care of setting all the values inside cgroup.subtree_control again. Review: https://reviews.apache.org/r/74995/
* [cgroups2] Allow cgroups2::enable() to take in a set. Modifies the cgroups2::controllers::enable function to take in a set of strings for controllers. This helps eliminate the possibility of duplicate controllers in the argument, and brings it in line with the cgroups2::controllers::disable function. Review: https://reviews.apache.org/r/74981/
* [cgroups2] Introduce the PerfEventControllerProcess. Introduces the controller process for perf event which was also present in cgroups1. The controller is automatically enabled, and should not be visible inside the cgroup.controllers file in the root cgroup. As a consequence, we will not be able to manually enable or disable this controller via writing to the cgroup.subtree_control file.
References:
  * perf_event section in https://docs.kernel.org/admin-guide/cgroup-v2.html
  * slide 34 in https://man7.org/conf/ndctechtown2021/cgroups-v2-part-1-intro-NDC-TechTown-2021-Kerrisk.pdf

  Review: https://reviews.apache.org/r/74997/
* [cgroups2] Ignore manual enabling of perf_event during prepare phase. Cgroups2IsolatorProcess::prepare may manually enable a controller by writing to the cgroup.subtree_control file. For perf_event, since it is automatically turned on, it does not appear inside the cgroup.controllers file and hence cannot be written to the cgroup.subtree_control file. For this reason, we skip the enable call for the perf_event controller. Review: https://reviews.apache.org/r/74998/
* [contributors] Add Jason Zhou to contributors.yaml. Review: https://reviews.apache.org/r/75011/
* [agent] Add executor_id / framework_id query parameters in /containers. We now allow filtering by framework ID and executor ID in addition to the original functionality of filtering by container ID. Please note that the /containers endpoint only allows select combinations of these query parameter fields to be populated at once. We will return a failure if we see that the combination of query parameters is invalid. We currently accept:
  * no query parameters
  * only container id
  * only framework id
  * only framework id and executor id

  Review: https://reviews.apache.org/r/75009/
* [agent] Add test for framework_id and executor_id support in /containers. Review: https://reviews.apache.org/r/75010/
* FIX: add missing cgroups file.
* [mesos-build] Fix python setup in Dockerfiles. The Dockerfiles themselves were fixed up to account for updated pip install urls, deprecated deadsnakes PPA for Ubuntu 16.04, and curl misconfigurations for installing clang. Review: https://reviews.apache.org/r/75018/
* CHANGE: add default network if no net options was set.
* Revert "CHANGE: add default network if no net options was set."
This reverts commit 5c3b039db544e937617cd63810185cc95fc2b34b.
* [cgroups2] Remove ENABLE_CGROUPS_V2 ifdefs. This commit removes the sections where ENABLE_CGROUPS_V2 is used to determine the compiled code. Any need to determine whether or not cgroups2 is used will be satisfied using the cgroups2::mounted() function. This guard was only in place temporarily to avoid breaking our CI while we figured out how to ensure that all of the CI docker images have the header. Review: https://reviews.apache.org/r/75021/
* [mesos-build] Fix python setup in docker-build.sh. Fixes the python 3.6 install inside docker-build.sh when the detected OS version is Ubuntu v16.04. Review: https://reviews.apache.org/r/75020/
* [mesos-build] Move mesos-build from ubuntu 16.04 to 20.04. Ubuntu 16.04 docker builds were having issues with the jenkins pipeline as it was missing certain fields in /usr/include/linux/bpf.h that are present in more modern linux kernels, which were used inside the ebpf code. We will try to address this along with Ubuntu's EOL issue by upgrading to ubuntu 20.04. Review: https://reviews.apache.org/r/75023/
* [mesos-build] Update reviewbot / tidybot / docker-build.sh to support ubuntu 20.04. The reviewbot, tidybot, and our docker-build scripts have been updated to use or accommodate ubuntu 20.04. Review: https://reviews.apache.org/r/75027/
* [mesos-build] Add readme to support/mesos-build directory. Review: https://reviews.apache.org/r/75028/
* [mesos-build] Add correct directory to git safe directory. Previously, the entrypoint.sh added /SRC/.git as a safe directory after the git clone, which was not useful. We now add /SRC/ as a safe directory before the git clone. Review: https://reviews.apache.org/r/75029/
* Remove trailing whitespace in mesos build entrypoint.
* [mesos-readme] Clarify mesos-build instructions for uploading to dockerhub. Review: https://reviews.apache.org/r/75030/
* Clarify mesos-build readme instructions. Provide more specific commands.
* [mesos-build] Add .git directory to safe directory. Review: https://reviews.apache.org/r/75031/
* Add push instructions to mesos-tidy.
* [mesos-build] Install openjdk 11 on ubuntu 20.04. Install openjdk 11 on ubuntu 20.04. Our reviewbot is running into issues where its java 11 installation is missing javac and configure.ac cannot run properly; this fixes that issue. Review: https://reviews.apache.org/r/75032/
* [mesos-build] Address dependency issues in centos-7 / ubuntu-20.04. In CentOS 7, the linux/amd64 base image is missing ebpf fields such as BPF_PROG_TYPE_CGROUP_DEVICE, which prevents jenkins from building mesos now that the ENABLE_CGROUPS_V2 macro has been removed, as /usr/include/linux/bpf.h is missing the fields required by ebpf.h. We installed dependencies from kernel-ml so that we can have newer header files in /usr/include/linux, which should help us compile the mesos code in jenkins. For ubuntu, there is a dependency on installing jdk-11 instead of the old jdk-8, which is preventing some builds that pull the ubuntu image. As part of this change, both the CentOS 7 and ubuntu:20.04 images will need to be rebuilt and uploaded to dockerhub for jenkins. Review: https://reviews.apache.org/r/75035/
* [mesos-build] Add /SRC/.git as safe directory for tidybot. This change allows us to bypass the git directory warnings for tidybot. As part of this change we will have to rebuild the tidybot image and push it to dockerhub. Review: https://reviews.apache.org/r/75037/
* [mesos-build] Remove setting of environment variable in dockerfiles. Setting an environment variable as PYTHON_VERSION caused unexpected behavior in jenkins as the configure.ac script checked for that exact environment variable. PYTHON_VERSION has been renamed and set as an ARG in the dockerfile so that it will not persist after the build. Review: https://reviews.apache.org/r/75036/
* [port-mapping] cat back ip_local_port_range after updating ephemeral ports.
This ensures that the update was successful and that the port range is what we expect. Review: https://reviews.apache.org/r/75038/
* [port-mapping] Fix typo port_mapping.cpp. Review: https://reviews.apache.org/r/75042/
* [cgroups2] Fix control reaches end of non-void function. This change moves the UNREACHABLE macro out of the switch case to fix the "control reaches end of non-void function" error in the lambda function for addDevice. Review: https://reviews.apache.org/r/75043/
* [port_mapping] Fix SmallEgressLimit test. The test was previously failing as it was timing the echo rather than ncat. This fix measures the time that ncat takes so that the elapsed time does not display as 0s and fail the test. Review: https://reviews.apache.org/r/75044/
* [cgroups2] Fix multi-line comment compilation warning. This fixes a compilation warning due to the comment line ending with a backslash character.
* [cgroups2] Fix a compilation error on CentOS 7 due to move operations.
* [route] Use nl_addr_iszero helper when checking for destination IP network. Previously, when grabbing the destination, we would filter out the default address at 0.0.0.0/0 by checking that the destination pointer is pointing at an empty struct. On newer Linux, it seems to be possible that the destination pointer can be pointing at a valid struct that corresponds to 0.0.0.0/0. To ensure that we are able to accurately filter out the default route, we switch to the libnl function nl_addr_iszero to determine if the nl_addr struct corresponds to 0.0.0.0/0. We also apply this change to other areas where nl_addr_get_len is used to ensure that non-empty nl_addr with only zeroes are accounted for. Review: https://reviews.apache.org/r/75046/
* [routing] Change link::setMAC to return Try<Nothing>. We have noticed that our code does not treat the setMAC bool return value differently based on whether it returns true or false.
As such, we are changing the return type to return Nothing so that we either return Error or Nothing, rather than Error or True or False. As a consequence of this we are also removing the special case of returning False but not Error when we get ENODEV from ioctl. Review: https://reviews.apache.org/r/75056/
* Support constructing net::MAC objects from sockaddr.sa_data. When using ioctl, we get a char[14] sa_data array from sockaddr that holds the information necessary to construct a net::MAC object; this patch adds support for using the sa_data field to directly create a net::MAC object. Review: https://reviews.apache.org/r/75058/
* [port mapping isolator] Work around apparent MAC address kernel bug. It seems that there are scenarios where, when using the port mapping isolator, mesos containers sometimes cannot communicate with the mesos agent as the MAC address of the veth interface is set incorrectly, leading to dropped packets by the kernel. This was discovered with the use of tcpdump (which reveals that the kernel marks the packets as destined for another host), and the latter of which reveals that the kernel is indeed dropping the packets due to this. We then found that when we set the mac address on the veth interface, it sometimes does not "stick" despite ioctl returning successfully. Observed scenarios with incorrectly assigned MAC addresses:
  1. After setting the mac address: ioctl returns the correct MAC address, but net::mac returns an incorrect MAC address (different from the original!)
  2. After setting the mac address: both ioctl and net::mac return the same MAC address, but are both wrong (and different from the original one!)
  3. After setting the mac address: there are no cases where ioctl or net::mac come back with the same MAC address as before we set the address.
  4. Before we set the mac address: there is a possibility that ioctl and net::mac results disagree with each other!
  5.
There is a possibility that the MAC address we set ends up overwritten by a garbage value after setMAC has already completed and checked that the mac address was set correctly. Since this error happens after this function has finished, we cannot log nor detect it in setMAC because we have not yet studied at what point this occurs.

  Notes:
  1. We have observed this behavior only on CentOS 9 systems at the moment; CentOS 7 systems under various kernels do not seem to have the issue (which is quite strange if this was purely a kernel bug).
  2. We have tried kernels 5.15.147, 5.15.160, 5.15.161; all of these have this issue on CentOS 9.

  This patch adds a workaround for this bug, which is to check that the MAC address is set correctly after the ioctl call, and retry the address setting if necessary. In our testing, this workaround appears to work around scenarios (1), (2), (3), and (4) above, but it does not address scenario (5). See MESOS-10243 for additional details, follow-ups. Review: https://reviews.apache.org/r/75057/
* CHANGE: add default network if no net options was set.
* [build] Fix make distcheck for ubuntu 20.04. Fixes the makefiles for ubuntu 20.04 so that make distcheck works properly with its protobuf dependencies generated as part of make distcheck. The reason that a change was needed in the makefile is that the upgrade from ubuntu 16.04 to 20.04 also caused the automake version to be updated when dependencies were being installed during docker build. The change in the automake version created slight changes in the generated makefile. Specifically, the distcheck on the new automake-generated makefile now depends on `BUILT_SOURCES`, which causes an error as the CSI protobuf files are not ready when distcheck is called. So we add the csi build stamps to `BUILT_SOURCES` to ensure that the protobuf files will be ready when distcheck's dependencies are made.
The additional chmod change for java is needed because, for some reason, when distcheck attempts to build the mesos-1.12.0.jar from its created distribution, some folders are missing write permissions, causing the build to terminate. Review: https://reviews.apache.org/r/75062/ * [build] Use ubuntu:20.04 for verify-reviews.py. Moving to ubuntu 20.04 so that we can get the ebpf header files for our build. Review: https://reviews.apache.org/r/75063/ * [build] Fix docker-build.sh failing to compile with distcheck. Review: https://reviews.apache.org/r/75066/ * [build] Fix libevent-enabled cmake builds on ubuntu 20.04. Our current libevent-enabled cmake builds cannot complete on jenkins as they hit the 'incomplete definition of type 'struct bio_st'' error. This is because the upgrade to ubuntu 20.04 also upgraded our openSSL version from 1.0.2 to 1.1.1, which breaks compatibility with libevent 2.1.5 that was previously used. A compatibility patch for openSSL 1.1+ was released with libevent 2.1.7, but the closest tarball that includes a CMakeLists.txt file is 2.1.9, which is what we will upgrade to. With the new libevent library, builds are able to complete using cmake and with libevent enabled. But they still see the test failures we see on other builds (such as autotools) and operating systems (CentOS 7). We also need to link libevent_pthread and libevent_openssl with libprocess, otherwise we will get errors like: ``` ld: 3rdparty/libprocess/src/libprocess.so: undefined reference to bufferevent_openssl_get_ssl ``` Review: https://reviews.apache.org/r/75070/ * [ssl] Remove TLS 1.0 and 1.1 tests. Currently the SSLProtocolTest with TLS v1.0 and v1.1 do not pass because those versions were disabled in ubuntu 20.04, see: https://discourse.ubuntu.com/t/spec-tls-1-0-and-1-1-are-disabled-by-default/41868 https://github.com/SoftEtherVPN/SoftEtherVPN/issues/1358#issuecomment-851427905 Review: https://reviews.apache.org/r/75075/ * [cgroups2] Fix allow deny semantics for device access. 
Currently, the EBPF program we generate has the behavior where the deny list has no effect, as we will allow device access iff the device matched with an allow entry. Instead we want to grant access to a device iff it is in a cgroup's allow list *and not in its deny list.* This means that we need to change our existing logic, which exits on the first match. It is not our desired behavior because the current EBPF program construction logic puts the allow-device checks before the deny-device checks, meaning that if a device is on both allow and deny lists for a cgroup, it will be granted access. This change revamps the EBPF program construction to now check both the allow and deny list of a cgroup before determining whether access may be granted. Specifically, if a device is matched with an entry inside the allow list, we will also be checking if it matches with any entry on the deny list, and deny the device's access if that is the case. We also avoid generating specific parts of the EBPF program code to avoid creating unreachable code, explanations with a diagram are attached above the cgroups2::devices::DeviceProgram::build function. Review: https://reviews.apache.org/r/75026/ * Minor edits missed in r/75026. * [ebpf] Add helper function for getting bpf fd by program id. Introduces a helper function to help abstract away the logic for getting bpf program file descriptor by its program id. Review: https://reviews.apache.org/r/75083/ * [cgroups2] Make cgroups2::path take both absolute and relative paths. Currently, cgroups2::path assumes the path in the argument is relative. We want the function to be able to distinguish between absolute and relative paths, where we only prepend the mounting point on the relative path. Review: https://reviews.apache.org/r/75081/ * [cgroups2] Remove accidental tabs. * [ebpf] Implement atomic replacement of cgroup device programs. 
Currently, if we try to attach device ebpf files to the same cgroup multiple times, they will all be attached, and they will all be run when a device requests access. This conflicts with our design to have one ebpf file per cgroup that represents all the files they want to allow or deny, where that file is updated when the cgroup adds or removes a device. So we add a patch to atomically replace any existing ebpf file already attached to our target cgroup using our new ebpf file. Review: https://reviews.apache.org/r/75080/ * [veth] Add todo to set mac address on create for peer link. Due to a systemd-induced race-condition related to the MacAddressPolicy being set to 'persistent' on versions >= 242, we will have to set the peer link MAC address of the peer link (eth0) when we create the eth0 peer link so that the udev will not try to overwrite it when it is notified that this device was created, which would lead to a race condition here where us and udev are racing to see who is the last one to write our MAC address to eth0. see: https://issues.apache.org/jira/browse/MESOS-10243 Review: https://reviews.apache.org/r/75087/ * [veth] Avoid udev race condtion on systems with systemd version > 242. In systems with systemd version above 242, there is a potential data race where udev will try to update the MAC address of the device at the same time as us if the systemd's MacAddressPolicy is set to 'persistent'. To prevent udev from trying to set the veth device's MAC address by itself, we must set the device MAC address on creation so that addr_assign_type will be set to NET_ADDR_SET, which prevents udev from attempting to change the MAC address of the veth device. 
See: https://github.com/torvalds/linux/commit/2afb9b533423a9b97f84181e773cf9361d98fed6 See: https://lore.kernel.org/netdev/CAHXsExy8LKzocBdBzss_vjOpc_TQmyzM87KC192HpmuhMcqasg@mail.gmail.com/T/ See: https://issues.apache.org/jira/browse/MESOS-10243 Review: https://reviews.apache.org/r/75086/ * [ssl_tests] Correct ubuntu version on comment. This commit is part of our effort to upgrade jenkins ubuntu version to 22.04. Review: https://reviews.apache.org/r/75089/ * [veth] Provide the ability to set veth peer link MAC address on creation. This addresses the previous todo where we want to set the MAC address of the peer link when we are creating a veth pair so that we can avoid the race condition we are racing against udev to see who will set the MAC address of the interface last. See: https://reviews.apache.org/r/75087/ See: https://issues.apache.org/jira/browse/MESOS-10243 Review: https://reviews.apache.org/r/75090/ * [jenkins] Create dockerfile compatible with 22.04 build. For review #75080, we made use of replace_bpf_fd and BPF_F_REPLACE which were added in kernel 5.6. Our current ubuntu 20.04 base image uses kernel 5.4. As such we will be upgrading the ubuntu version used in Jenkins to 22.04, whose base image uses kernel 5.15, so that we can make mesos on the updated pipeline, enabling reviewbot, tidybot, and coverity. Review: https://reviews.apache.org/r/75088/ * [build] Use clang-14 for non ubuntu 16.04 targets in docker-build.sh. As we migrate to ubuntu 22.04, clang-10 is no longer available via apt-get install. As such, we will move to clang-14, which should allow us to run the docker-build.sh file with OS equal to ubuntu:22.04. This should allow coverity bot to compile and return to normal. Review: https://reviews.apache.org/r/75091/ * [cgroups] Add helper functions for device Entry. Currently, the Entry class does not have readable helper functions for determining whether the device accesses represented by one Entry would be a subset of that of another. 
In addition, we want more readable ways to determine if a device has wildcards present and if it has any accesses specified. These additions will streamline the logic in the DeviceManager, which will heavily utilize the Entry class, improving code readability. Review: https://reviews.apache.org/r/75096/ * [cgroups2] Introduce a device manager. This change introduces the DeviceManager to help facilitate device access management in cgroups2 via ebpf program file changes. This centralization is needed since we no longer have control files to leverage as persistence for agent recovery, so we need a component that keeps track of allow/deny device access information and re-configures the ebpf program for the cgroup. Device requests can be made to the manager by calling `configure` or `reconfigure`. Note that `configure` should only be used when setting up a cgroup's device access, i.e. it has not requested any device to be allowed/denied before. In addition, `reconfigure` cannot be used to add deny entries containing wildcards. This manager will be made available to all controllers under the cgroups2 isolator, and the GPU isolator. Review: https://reviews.apache.org/r/75006/ * [cgroups2] Add ebpf program attachment to the DeviceManager. Currently, the device manager only keeps track of the state in memory, and does not commit the changes by attaching an ebpf file to the corresponding cgroup. We will now generate and attach the ebpf file when configure and reconfigure are called. Review: https://reviews.apache.org/r/75102/ * [cgroups2] Fix unsafe Process usage in DeviceManager. Calls to the DeviceManager wrapper were directly accessing the state of DeviceManagerProcess. This patch uses the dispatch mechanism instead, and adjusts the tests accordingly. * [cgroups] Add helper to find overlapping device access. 
Currently, we have to directly compare member variables to see if one Access object would overlap that of another, which isn't very clear to people that would be reading the code. We add a helper to abstract away the logic to see if the accesses specified in one Access instance would overlap with that of another. Review: https://reviews.apache.org/r/75107/ * [cgroups] Add Device::Selector::encompasses. Currently, we have to check via Selector's member variables if the devices represented by one Selector encompasses those represented by another. We add a helper function to simplify the logic which differ depending on whether one Selector encompasses the other. Review: https://reviews.apache.org/r/75106/ * [ebpf] Correct ebpf deny block behavior. Currently, the deny block matches a device access iff all accesses match on the deny block. For example, a rw access would not match the deny block even if the deny block had w access specified. We would expect that the deny block should deny all accesses if the type, major, and minor number matches, and if any of the device accesses overlap with what's specified in the deny block. Review: https://reviews.apache.org/r/75109/ * [cgroups2] Add allow / deny list normalization validation. Currently we assume that a device state is normalized before using it for generating ebpf files. However, we have not been enforcing these constraints on the device access state. We enforce some basic validation on cgroups2::configure on the state to ensure that we are able to generate a correct ebpf program. If the lists are not normalized, we generate incorrect programs! An allow or deny list is 'normalized' iff everything below are true: 1. No entries have empty accesses specified. 2. No two entries on the same list can have the same selector (type, major & minor numbers). 3. No two entries on the same list can be encompassed by the other entry. See Entry::encompassed. 
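The normalization rules and the deny-overlap semantics described above can be sketched in a few lines of C++. This is a minimal illustration only: the `Access`, `Selector`, and `Entry` structs, the `majorNum`/`minorNum` field names, and the free functions `normalized` and `allowed` are hypothetical stand-ins, not the Mesos types or signatures. It also assumes the queried entry contains no wildcards.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-ins for the Mesos device types; the real
// classes and signatures may differ.
struct Access {
  bool read, write, mknod;

  bool none() const { return !read && !write && !mknod; }

  // True if at least one access bit is set in both.
  bool overlaps(const Access& o) const {
    return (read && o.read) || (write && o.write) || (mknod && o.mknod);
  }

  // True if every access bit set in `o` is also set here.
  bool encompasses(const Access& o) const {
    return (read || !o.read) && (write || !o.write) && (mknod || !o.mknod);
  }
};

struct Selector {
  enum class Type { ALL, BLOCK, CHARACTER };
  Type type;
  std::optional<unsigned> majorNum;  // nullopt acts as a wildcard.
  std::optional<unsigned> minorNum;  // nullopt acts as a wildcard.

  bool operator==(const Selector& o) const {
    return type == o.type && majorNum == o.majorNum && minorNum == o.minorNum;
  }

  // True if every device selected by `o` is also selected here.
  bool encompasses(const Selector& o) const {
    return (type == Type::ALL || type == o.type) &&
           (!majorNum || majorNum == o.majorNum) &&
           (!minorNum || minorNum == o.minorNum);
  }
};

struct Entry {
  Selector selector;
  Access access;

  bool encompasses(const Entry& o) const {
    return selector.encompasses(o.selector) && access.encompasses(o.access);
  }
};

// Checks the three rules: no empty accesses, no duplicate selectors,
// and no entry encompassed by another entry on the same list.
bool normalized(const std::vector<Entry>& list) {
  for (std::size_t i = 0; i < list.size(); ++i) {
    if (list[i].access.none()) return false;
    for (std::size_t j = 0; j < list.size(); ++j) {
      if (i == j) continue;
      if (list[i].selector == list[j].selector) return false;
      if (list[i].encompasses(list[j])) return false;
    }
  }
  return true;
}

// Grants access iff some allow entry encompasses the query and no deny
// entry matches the device with any overlapping access.
// Assumes `query` has no wildcards.
bool allowed(const std::vector<Entry>& allow,
             const std::vector<Entry>& deny,
             const Entry& query) {
  assert(normalized(allow) && normalized(deny));
  bool matched = false;
  for (const Entry& a : allow) {
    if (a.encompasses(query)) matched = true;
  }
  if (!matched) return false;
  for (const Entry& d : deny) {
    if (d.selector.encompasses(query.selector) &&
        d.access.overlaps(query.access)) {
      return false;
    }
  }
  return true;
}
```

With an allow entry granting `rw` on all character devices and a deny entry for `w` on character device 1:3, a `rw` request for 1:3 is rejected in this sketch because its write bit overlaps the deny entry, while a read-only request for the same device is granted.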
This patch adds helpers to check if a device state is normalized, and will only allow users to create new CgroupDeviceAccess instances using a helper that checks that the allow and deny lists are normalized. A new helper function is added to check if an entry would be granted access, and requires the state to be normalized. Review: https://reviews.apache.org/r/75099/ * [cgroups2] Clarify device documentation. * Revert "[cgroups2] Clarify device documentation." This reverts commit fd17efe3402fc859efc63d3cd32658d1ec61a015. * Revert "[cgroups2] Add allow / deny list normalization validation." This reverts commit 45d290aeff6912c8e6a4b1a7358c4e9772c447b4. * [cgroups2] Helper to check device entry normalization. Currently we assume that a device state is normalized before using it for generating ebpf files. However, we have not been enforcing these constraints. We add a helper to check if a device state is normalized so that we can enforce these constraints. An allow or deny list is 'normalized' iff everything below are true: 1. No Entry has empty accesses specified. 2. No two entries on the same list can have the same selector (type, major & minor numbers). 3. No two entries on the same list can be encompassed by the other entry (see Entry::encompasses). Review: https://reviews.apache.org/r/75099/ * [cgroups2] Add helper to normalize allow/deny list. This patch adds a public helper function to abstract away the logic used to make a list comply with the 'normalized' requirements. As a reminder, an allow or deny list is 'normalized' iff everything below are true: 1. No Entry has empty accesses specified. 2. No two entries on the same list can have the same selector (type, major & minor numbers). 3. No two entries on the same list can be encompassed by the other entry (see Entry::encompasses). Review: https://reviews.apache.org/r/75104/ * [cgroups2] Helper to check device access. 
A device access is granted if it is encompassed by an allow entry and does not have access overlaps with any deny entry. The current process of manually checking if a device access would be granted given a state is tedious and leads to worse readability. A new helper function is added to check if an entry would be granted access in a CgroupDeviceAccess instance, and requires the state to be normalized. Review: https://reviews.apache.org/r/75113/ * [cgroups2] Enforce normalization in configure. We currently do not enforce normalized allow and deny in configure. However, to ensure that we can generate an ebpf program that behaves correctly, we have to ensure that allow and deny are normalized. This patch adds a validation check to ensure that the allow and deny are normalized before attempting to generate the ebpf program. Review: https://reviews.apache.org/r/75114/ * [devices] Enforce normalization for DeviceManager configure & reconfigure. Currently in the configure() and reconfigure() functions in device manager, we do not ensure that the device access state at the end of the function call is normalized. So we incorporate normalized() and normalize() calls to ensure that the allow and deny lists are always normalized at the end of a configure() or reconfigure() call. Review: https://reviews.apache.org/r/75115/ * [devices] Add CgroupDeviceAccess::create helper which checks normalization. Currently, CgroupDeviceAccess instances can be directly constructed without verifying that its allow and deny lists are normalized. To codify our normalization constraints, CgroupDeviceAccess can now only be created with a create() helper. Review: https://reviews.apache.org/r/75116/ * [devices] Fix DeviceManager tests. Changes when merging previous changes caused some DeviceManager testcases to fail. This patch updates the tests to pass it. Review: https://reviews.apache.org/r/75117/ * [style] Add newlines for readability. * [reviewbot] Fix reviewbot build error. 
Currently, reviewbot is failing from a 'control reaches end of non-void function' error due to a switch case inside a lambda in the device manager code. We use an UNREACHABLE macro to stop this error. Review: https://reviews.apache.org/r/75128/ * [devices] Add ability to remove cgroup from DeviceManager state. When destroy() is called on a container, its cgroup and its children will be cleaned up. We need to remove the cgroup from the device manager state when this happens to ensure that the state is accurate. Review: https://reviews.apache.org/r/75120/ * [cgroups2] Pass device manager to controllers & cgroups2 isolator. Passes the device manager to the cgroups2 isolator on containerizer startup, and sets up the ability for the manager to be passed to the device controller and GPU isolator. Review: https://reviews.apache.org/r/75016/ * [cgroups2] Introduces the DeviceControllerProcess. Introduces a device controller that supports cgroups v2 and is available in the Cgroups2IsolatorProcess. Device access control is made through the DeviceManager. Review: https://reviews.apache.org/r/75098/ * [cgroups2] create device controller in Cgroups2Isolator. DeviceController needs to be created in Cgroups2Isolator with the DeviceManager so that the default whitelist can be properly configured. Review: https://reviews.apache.org/r/75121/ * [cgroups2] Skip enabling of devices controller. Similar to the `perf_event` controller, the `devices` controller cannot be written into cgroup.subtree_control file, so we skip the call to cgroups2::controllers::enable for the device controller. Otherwise we will run into an "Invalid argument" error from cgroups. Review: https://reviews.apache.org/r/75130/ * [cgroups2] Silence incorrect compiler error in the tests. * [device manager] Let non-wildcards entries check device access. Currently, we only allow normal Entry instances for checking whether a device access would be allowed for a cgroup. 
We want to also allow NonWildcardEntry instances to do this as well. Review: https://reviews.apache.org/r/75135/ * [device manager] Add wildcard conversion helper. We currently have a wildcard conversion helper but it was only available for use inside the device manager test file. This change pulls out the helper and makes it available as a static function for use outside just tests. Review: https://reviews.apache.org/r/75137/ * [cgroups2] Support DeviceManager in GPU isolator. Currently, the GPU isolator assumes we are only using cgroups v1, and makes use of the cgroups::devices::allow and deny functions to control GPU access. In Cgroups2, we need to attach ebpf programs for the specific cgroups, which is done for us in the DeviceManager. Hence, we need to use the DeviceManager in the GPU isolator depending on whether cgroups v1 or v2 is currently mounted. Review: https://reviews.apache.org/r/75074/ * [device manager] Add protobuf for cgroup state checkpointing. Currently the device manager has no means of recovering its state after an agent restarts. This patch aims to add a protobuf definition to let the device manager have a checkpoint file that it can use to recover each cgroup's device access state. Review: https://reviews.apache.org/r/75141/ * [device manager] Add device state file path helper. Currently we do not have a place to keep the checkpoint file for the device manager. This patch adds a path helper that gives us a file that lets us checkpoint the device manager state. Review: https://reviews.apache.org/r/75142/ * [device manager] Checkpoint state on device manager state change. Currently we do not checkpoint the device access state of each cgroup when the configure or reconfigure is called. Meaning that we have no way of recovering a cgroup's device access state. We will checkpoint state of the device manager whenever its state is being changed to ensure that we can recover the most recent state when necessary. 
Review: https://reviews.apache.org/r/75143/ * [device manager] Add args to customize commit_device_access_changes behavior. Currently, commit_device_access_changes always checkpoints the device manager state **and** configures the bpf programs for the cgroup based on its device access state. We add an argument to commit_device_access_changes for the caller to determine whether they want the state to be checkpointed along with attaching the ebpf program. Review: https://reviews.apache.org/r/75148/ * [cgroups2] Introduce the IoControllerProcess Introduces the IoControllerProcess. This replaces the blkio controller from cgroups v1. We currently only use it to help us work with the cgroups/all isolation flag. Review: https://reviews.apache.org/r/75155/ * [hugetlb] Introduce the HugeTLBControllerProcess Introduces the `HugeTLBControllerProcess`. Hosts that are correctly configured for cgroups v2 and provide `cgroups/hugetlb` in the `isolation` flag and `hugetlb` in the `agent_subsystems` flag will use this controller. Review: https://reviews.apache.org/r/75152 * [cpuset] Introduce the CpusetControllerProcess Hosts that are correctly configured for cgroups v2 and provide cgroups/cpuset in the isolation flag and cpuset in the agent_subsystems flag will use this controller. Review: https://reviews.apache.org/r/75153 * [pids] Introduce the PidsControllerProcess Introduces the PidsControllerProcess. Hosts that are correctly configured for cgroups v2 and provide cgroups/pids in the isolation flag and pids in the agent_subsystems flag will use this controller. Review: https://reviews.apache.org/r/75154 * [build] Fix compilation error from cherry picks. Minor fix to allow current builds to progress again. Review: https://reviews.apache.org/r/75157/ * [cgroups2] Prevent containerId from prepending cgroup root during recovery. Currently, CgroupsIsolatorTest.ROOT_CGROUPS_PERF_PerfForward fails during recovery because it cannot create the directory for the recovered container. 
This happens because the original containerId, when recovered, includes the cgroup root. The function that converts a cgroup to a containerId does not ignore the cgroup root, even though we do not expect it to be included. To fix this, we will remove the first token of the cgroup if it matches the cgroup root. This will prevent attaching an extraneous cgroup root to the containerId when parsing it from a cgroup. Review: https://reviews.apache.org/r/75156/ * [paths] Simplify paths::cgroups2::containerId() logic. Currently we are tokenizing the cgroup argument when we could be directly operating on the string using our strings utility helpers instead. This patch replaces the tokenizing logic with the strings library. Review: https://reviews.apache.org/r/75160/ * [cgroups2] Collect process & thread from leaf groups. Currently, our core controller usage() function tries to read the cgroup files in the argument cgroup. However, in our design for cgroups v2, processes and threads live in the leaf child of a cgroup. Hence the usage collection will not find the actual processes and threads for a cgroup if it's not already specified as a leaf group. This patch adds a check to see if the argument cgroup is a leaf cgroup, and will search for processes and threads in the leaf cgroup instead. Review: https://reviews.apache.org/r/75158/ * [cgroups2] Register core controller in container info. Currently the core controller is skipped over during prepare(), and it is not added to the container's registered controllers. We need to register the core controller so that its functions can be called by the isolator. Review: https://reviews.apache.org/r/75159/ * [cgroups2] Fix ROOT_CGROUPS_MemoryForward for cgroups2. The ROOT_CGROUPS_MemoryForward test is currently failing because it is looking for the memory hierarchy, which no longer exists in cgroups2 due to the new unified hierarchy. We skip this hierarchy check if we detect cgroups2 is mounted on the system. 
Review: https://reviews.apache.org/r/75161/ * [cgroups2] Create memory controller using cgroups/all flag. Currently, we cannot use the cgroups/all flag to create its corresponding controllers in cgroups2. The flag causes us to grab all the values in the cgroup.controllers file. But we should instead just add all the creators when we see cgroups/all, as many controllers are no longer present in cgroup.controllers in cgroups v2. This fix incidentally fixes the ROOT_CGROUPS_AgentRecoveryWithNewCgroupSubsystems test as it was unable to create the memory controller using cgroups/all flag. Review: https://reviews.apache.org/r/75162/ * [cgroups2] Fix ROOT_CGROUPS_AutoLoadSubsystems test. Currently, the ROOT_CGROUPS_AutoLoadSubsystems test is failing because it is checking for hierarchies for subsystems enabled under 'cgroups/all'. In cgroups2 we cannot perform this check because of the unified hierarchy. Hence we skip this hierarchy check and instead check that all available cgroups2 controllers are enabled by reading cgroup.controllers and cgroup.subtree_control. Review: https://reviews.apache.org/r/75163/ * [cgroups2] Add device manager recovery support. We currently do not have any method of recovering the device access states when the cgroups2 isolator is attempting to recover containers. We add a recovery state here that makes use of the protobuf checkpoint files to ensure that the previous device accesses of cgroups can be restored. It will be used by the cgroups2 isolator. Review: https://reviews.apache.org/r/75145/ * [cgroups2] Recover device manager with cgroups2 isolator. This patch lets us call the device manager's recovery function after the containers from recovery_state have been successfully recovered, allowing us to begin recovering the cgroup device access state for each recovered non-orphan container. Review: https://reviews.apache.org/r/75149/ * [cgroups2] Enable controller in parent cgroups during prepare(). 
To support nested containers with nested cgroups, we need to enable controllers in cgroup.subtree_control file for the appropriate nested cgroup. To do so, we need to ensure that the parents have the requested controller in their cgroup.subtree_control file so that the nested cgroup can have the controller written into subtree_control as well. Otherwise we will get a 'no such file or directory' error. Review: https://reviews.apache.org/r/75166/ * [cgroups2] Add isolate field for nested containers. Currently we do not support nested containers. We need to let nested containers pick whether they want their own resource constraints based on the LinuxInfo::share_cgroups field in the API. In cgroups v1, we didn't need to track an additional field for this, because the isolator does not store nested containers within its `infos` map. In cgroups v2, we will *always* create cgroups for nested containers, and LinuxInfo::share_cgroups instead specifies whether these cgroups will have resource isolation applied to them. (LinuxInfo::share_cgroups needs to be renamed accordingly). In later patches, we will use this to skip the update and isolate calls on the controllers if isolate == false. Review: https://reviews.apache.org/r/75167/ * [cgroups2] Enforce use of linux launcher with cg2 isolator. Currently we are not checking that the cgroups isolator is being used with the linux launcher. We need to ensure that if the linux launcher is being used, the cgroups isolator is also being used so that the cgroups for the containers can be made inside the isolator's prepare(). Review: https://reviews.apache.org/r/75171/ * [cgroups2] Separate responsibility for creating cgroup and assigning pids. In cgroups2, the current linux launcher does not create cgroups nor does it move the pids into the container's leaf cgroup during fork(). 
When we launch a container, we first prepare it via the isolators, then the launcher will call fork to, among other things, move the pid into its appropriate cgroup. Once the fork is over, isolate() is called on the isolators. As such, we will remove the cgroups2 isolator's current behavior of assigning pids into leaf cgroups as it is already done by the linux launcher. Review: https://reviews.apache.org/r/75170/ * [cgroups2] Do DEBUG container check after creating cgroup for it. Since we create cgroups for all containers, we will create one for the DEBUG container. We expect DEBUG containers not to have their own resource constraints, so we will return from __prepare early before reaching the CHECK. Review: https://reviews.apache.org/r/75172/ * [cgroups2] Only call isolate() on controllers of containers with isolate == true. In cgroups1, when handling nested containers that share cgroups, we skip isolate calls for them. We want to replicate this behavior in cgroups2. Review: https://reviews.apache.org/r/75173/ * [cgroups2] Make isolator update() fail for containers that don't need isolation. In cgroups1, we returned an error when we see that the container is sharing cgroups, as it would have no cgroups for update() to take. In cgroups2 we will mimic this behavior for containers that do not wish to have their own resource constraints, as we do not expect to call update() on those containers. Review: https://reviews.apache.org/r/75174/ * [cgroups2] Make isolator status() return parent status for containers w/ !isolate. In cgroups1, if a container is sharing cgroups with its parents, we will return the parent's status. In cgroups2, we want to mimic this behavior even though we always create cgroups for our containers, since we do not do anything with a container's cgroup if isolate == false. Review: https://reviews.apache.org/r/75175/ * [cgroups2] Handle unknown containers in watch(). 
In cgroups1, we returned a pending future for containers that shared cgroups with their parents. In cgroups2, since we always create cgroups for our containers, we no longer need to consider this special case. So we only return failure if there is an unknown container. Review: https://reviews.apache.org/r/75176/ * [cgroups2] Perform chown of cgroup if necessary. In cgroups1, we chown for nested cgroups so that they can create deeper layers of cgroups. We want to replicate this behavior in cgroups2. Review: https://reviews.apache.org/r/75178/ * [cgroups2] Enable support for nested containers in the isolator. We enable support for nested containers on systems with cgroups v2 with this patch. This means that nested containers will now have their cgroups created for them, and that the cgroups2 isolator functions will be called for nested containers. Review: https://reviews.apache.org/r/75177/ * [cgroups2] cgroups2::destroy retry rmdir on EBUSY. Currently we only wait until a cgroup's pids (retrieved from cgroup.procs) is empty. However even if a cgroup's pids are empty the rmdir call on it may still return EBUSY, causing us to fail the destroy operation. We want to retry the rmdir operation even on EBUSY for up to 5 seconds to ensure that we are able to delete the cgroup. This approach is similar to how crun is destroying its cgroups. see: https://github.com/containers/crun/blob/10b3038c1398b7db20b1826f94e9d4cb444e9568/src/libcrun/cgroup-utils.c#L471 Review: https://reviews.apache.org/r/75181/ * [libprocess] Add io::Watcher for fs notifications. Adds basic watcher class for filesystem watch notifications. We currently only support Linux with inotify. We currently support inotify events for writing, deleting, and renaming a file. We do not support watching directories. Review: https://reviews.apache.org/r/75182/ * [build] Fix cmake build / tidybot. 
The cmake build and therefore tidybot are failing because they cannot find the device manager protobuf files, we will run protobuf for device_manager/state.proto to generate the files. Review: https://reviews.apache.org/r/75186/ * [cgroups2] Introduce an OomListener. We add an OomListener process to allow users to listen for oom events in a cgroup or any of its descendants. If the OomListener is terminated, any remaining unsatisfied futures will be failed. If the listened cgroup or any of its descendants encounters an oom event, then the returned future from listen() will become ready, and action can be taken upon the oom event via future onReady handlers. The caller can also discard a returned future to stop listening for events. Review: https://reviews.apache.org/r/75184/ * [cgroups2] Use OomListener in MemoryControllerProcess. The MemoryControllerProcess needs an OomListener to ensure that it does not need to listen for oom events by polling, which causes race conditions with the oom killer. We spawn an OomListener in the MemoryControllerProcess and use it to listen for oom events in any cgroup via oomListen(). Review: https://reviews.apache.org/r/75185/ * [io] Fix warnings during compilation. Currently we have a warning that occurs on compilation about comparison between unsigned long and long. We update the tests to suppress this warning. Review: https://reviews.apache.org/r/75187/ * [cgroups2] Remove completed todos. The Todos mention recovery support for the device manager, and to use the device manager recover in the device controller process. We have added checkpointing and recovery support in: https://reviews.apache.org/r/75145 We also only use the device manager recovery in the cgroups2 isolator instead of in the device controller, as implemented in: https://reviews.apache.org/r/75149/ As such, these todos are considered completed, and can be removed. Review: https://reviews.apache.org/r/75189/ * [gpu] Let NvidiaGpuIsolatorProcess support nesting. 
The NvidiaGpuIsolatorProcess currently does not declare that it supports nesting because nested containers were not supported by the cgroups2 isolator. Nested container support was added to the cgroups2 isolator in: https://reviews.apache.org/r/75177/ As such, we declare that NvidiaGpuIsolatorProcess supports nesting. Review: https://reviews.apache.org/r/75190/

* [docs] Add public docs for Cgroups v2. Currently there is no official documentation outlining the changes we have been making to support Cgroups v2. We add a main document outlining how Mesos interacts with Cgroups v2, and update other documents, such as the device isolator document, to reflect the changes that were made. Review: https://reviews.apache.org/r/75191/

* CHANGE: resolving conflicts

---------

Co-authored-by: Devin Leamy <dleamy@twitter.com>
Co-authored-by: Benjamin Mahler <bmahler@apache.org>
Co-authored-by: None <None>
Co-authored-by: bmahler <benjamin.mahler@gmail.com>
Co-authored-by: Jason Zhou <jasonzhou460@gmail.com>
Co-authored-by: Jason Zhou <jasonzhou@twitter.com>
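
For readers following the `cgroups2::destroy retry rmdir on EBUSY` commit listed above, the retry idea can be sketched roughly as follows. This is a minimal blocking sketch under stated assumptions, not the Mesos implementation: the helper name `rmdirWithRetry`, its signature, and the 10 ms retry interval are hypothetical, and only the 5 second budget and the EBUSY condition come from the commit message. The actual change may well use libprocess futures rather than a blocking loop.

```cpp
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>

#include <unistd.h>

// Hypothetical helper (not part of Mesos): tries to remove a cgroup
// directory, retrying while rmdir() fails with EBUSY, until `timeout`
// has elapsed. Returns 0 on success, otherwise the errno of the last
// failed rmdir() call.
int rmdirWithRetry(
    const std::string& path,
    std::chrono::milliseconds timeout = std::chrono::seconds(5),
    std::chrono::milliseconds interval = std::chrono::milliseconds(10))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;

  while (true) {
    if (::rmdir(path.c_str()) == 0) {
      return 0;
    }

    // Capture errno right away, before any other call can overwrite it.
    const int error = errno;

    // Only EBUSY is treated as transient. Any other error is returned
    // immediately, as is EBUSY once the deadline has passed.
    if (error != EBUSY || std::chrono::steady_clock::now() >= deadline) {
      return error;
    }

    // Assumed back-off interval; the commit message does not specify one.
    std::this_thread::sleep_for(interval);
  }
}
```

A caller would invoke this only after the cgroup's processes have been killed and `cgroup.procs` is empty, for example `rmdirWithRetry("/sys/fs/cgroup/some/cgroup")`, where the path is purely illustrative. Using a steady clock for the deadline keeps the 5 second budget unaffected by wall-clock adjustments.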

DevinLeamy commented Apr 5, 2024

bmahler reviewed Apr 5, 2024

bmahler closed this in 756b7d7 Apr 5, 2024

DevinLeamy deleted the cgroups2-destroy branch April 8, 2024 13:50

DevinLeamy wants to merge 1 commit into apache:master from DevinLeamy:cgroups2-destroy
