The document describes a runner for executing asynchronous job piles. The jobs are run by the daemon, known as piled, while the management tasks of starting/stopping/monitoring are carried out using a command-line tool, pilec.
piled is a daemon that manages and supervises the jobs.
The daemon can be started as a systemd unit:
[Unit]
Description=piled daemon
[Service]
ExecStart=/usr/bin/piled --address localhost:9999 --config /etc/piled/daemon.yaml
Delegate=true
Or through systemd-run:
$ sudo systemd-run -p Delegate=true -t --wait -- ./piled --config $PWD/server.yaml
The daemon supports the fundamental operations of starting a job, stopping it, observing its output, and querying its status.
Each job is assigned an ID by the daemon, in the format pile-<uuid>, which is returned to the client in the start response. The job ID must be used in all subsequent requests for that job.
The daemon uses a simple config file:
# server's certificate and key; either absolute paths or paths relative to
# this config file
cert: certs/server-cert.pem
key: certs/server-key.pem
# CA certificate used, in addition to the system pool, to verify client certificates
ca-cert: certs/ca-cert.pem
# address to listen on, in a format accepted by URL parsing; the hostname is ignored
listen-address: tcp://localhost:9999
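For illustration, here is a minimal sketch of how such a config could be loaded, assuming a Go implementation and gopkg.in/yaml.v3; the struct and field names are assumptions, not piled's actual types:

package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// serverConfig mirrors the example config above; names are assumed for illustration.
type serverConfig struct {
	Cert          string `yaml:"cert"`           // server certificate (absolute or relative to the config file)
	Key           string `yaml:"key"`            // server private key
	CACert        string `yaml:"ca-cert"`        // additional CA certificate
	ListenAddress string `yaml:"listen-address"` // e.g. tcp://localhost:9999
}

func loadServerConfig(path string) (*serverConfig, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg serverConfig
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := loadServerConfig("/etc/piled/daemon.yaml")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(cfg.ListenAddress)
}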
Jobs run as pid 1 in their own PID namespace. Note that a new proc filesystem is mounted on top of /proc, hence other processes are not trivially discoverable.
The jobs are run in a separate network namespace. However, no means of network configuration are provided by piled. It is up to the job to set the network up as desired.
The jobs are run in a separate mount namespace. For the purpose of mounting a new /proc, the propagation of the / mount is changed recursively to private. Thus, changes to mounts are not propagated from the job's mount namespace to the parent, nor in the reverse direction.
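To make the isolation above concrete, here is a rough Go sketch under the same assumptions (new PID, mount and network namespaces, / made recursively private, a fresh /proc mounted before exec'ing the job); it is not the actual piled code:

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "shim" {
		// child: pid 1 of the new PID namespace; isolate the mount table,
		// mount a fresh /proc and exec into the job
		if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
			log.Fatal(err)
		}
		if err := syscall.Mount("proc", "/proc", "proc", 0, ""); err != nil {
			log.Fatal(err)
		}
		log.Fatal(syscall.Exec(os.Args[2], os.Args[2:], os.Environ()))
	}

	// parent: re-exec ourselves as the shim inside fresh PID, mount and network namespaces
	cmd := exec.Command("/proc/self/exe", append([]string{"shim"}, os.Args[1:]...)...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWNS | syscall.CLONE_NEWNET,
	}
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}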
The piled runner manages cgroups by itself. Because of this, it is assumed that piled is executed under a systemd scope/service with delegation turned on (e.g. Delegate=true in the unit properties). This is mandatory so that piled does not interfere with systemd's cgroup management and vice versa. See https://systemd.io/CGROUP_DELEGATION/ for details.
The daemon sets up a hierarchy as follows, where piled.service is a delegation point from systemd.
-- piled.service -- runner       -- 1234 (piled)
                 \- pile-<uuid>  - 9999 (job1)
                 \- pile-<uuid>  - 12341 (job2)
                 \- pile-<uuid>  - 3333 (job3)
The leaf groups are prepared and their resource limits configured by the runner. Processes are placed into their cgroups through a shim whose job is to move the process into the leaf group and then exec into the provided command.
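A hypothetical sketch of that shim step, assuming the leaf group path is passed as the first argument; the real shim may look different:

package main

import (
	"log"
	"os"
	"path/filepath"
	"strconv"
	"syscall"
)

func main() {
	// usage (assumed): shim <leaf-cgroup-dir> <command> [args...]
	leaf, command := os.Args[1], os.Args[2:]

	// move ourselves into the leaf group prepared by the runner
	pid := []byte(strconv.Itoa(os.Getpid()))
	if err := os.WriteFile(filepath.Join(leaf, "cgroup.procs"), pid, 0644); err != nil {
		log.Fatal(err)
	}

	// replace the shim with the job; cgroup membership is inherited across exec
	log.Fatal(syscall.Exec(command[0], command, os.Environ()))
}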
Communication with piled is done only over TCP INET sockets. Other means of communication are not supported.
Connections are authenticated using mTLS, with TLS pinned to version 1.3. Upon connection, the server verifies the client's certificate using the system CA pool and any additional CA certificates provided in the configuration. See the limitations section for further discussion.
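A sketch of what the server-side TLS setup could look like under these requirements, using Go's crypto/tls (file names taken from the example config, everything else assumed):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
)

func serverTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	pool, err := x509.SystemCertPool()
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool.AppendCertsFromPEM(caPEM)
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,                           // system pool plus the configured CA
		ClientAuth:   tls.RequireAndVerifyClientCert, // mTLS: clients must present a valid certificate
		MinVersion:   tls.VersionTLS13,
		MaxVersion:   tls.VersionTLS13,
	}, nil
}

func main() {
	cfg, err := serverTLSConfig("certs/server-cert.pem", "certs/server-key.pem", "certs/ca-cert.pem")
	if err != nil {
		log.Fatal(err)
	}
	_ = cfg // would typically be wrapped in gRPC transport credentials
}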
Authorization is based on simple tokens. The token could be a JWT with scopes that identify which operations are allowed. For simplicity, the code uses hardcoded strings, with two known tokens: one for read-only access (output, status) and one allowing read-write access (start, stop, output, status). The token is included in each request. The demo code does not provide any means of 'generating' the tokens.
In theory, if Unix socket communication were supported, the authorization scheme could be extended to observe peercred and allow read-write access for all root-owned processes.
Jobs are assumed to complete at some point. When a job is requested to stop, for simplicity its whole cgroup is killed through cgroup.kill, so that there are no runaway processes (with limitations as described at the end of the document).
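As a sketch of the stop path under the hierarchy shown earlier (paths assumed), killing the whole group boils down to a single write:

package main

import (
	"log"
	"os"
	"path/filepath"
)

// killJobCgroup writes "1" into the group's cgroup.kill, which sends SIGKILL
// to every process in that cgroup (cgroup v2).
func killJobCgroup(delegatedRoot, jobCgroup string) error {
	return os.WriteFile(filepath.Join(delegatedRoot, jobCgroup, "cgroup.kill"), []byte("1"), 0644)
}

func main() {
	// the delegated root and group name are assumptions based on the hierarchy above
	if err := killJobCgroup("/sys/fs/cgroup/piled.service", "pile-<uuid>"); err != nil {
		log.Fatal(err)
	}
}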
When a job is started, its output is redirected to a file derived from the job's name. piled maintains a storage location under /tmp/piled/output/. There is a single output file for each job, collecting both stdout and stderr (similarly to syslog, the journal, etc.). Stdin is connected to /dev/null. When streaming, the job's output is delivered in chunks of arbitrary size; the client is expected to print the chunks directly to stdout (similarly to what journalctl would do) without any formatting or line separation.
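A rough sketch of that output wiring, with the file naming and helper names assumed for illustration:

package main

import (
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

// startWithOutput wires a job's stdio as described above: stdout and stderr
// share one file under /tmp/piled/output/ and stdin is /dev/null. The exact
// file name derivation is an assumption.
func startWithOutput(jobID string, argv []string) (*exec.Cmd, error) {
	if err := os.MkdirAll("/tmp/piled/output", 0700); err != nil {
		return nil, err
	}
	out, err := os.OpenFile(filepath.Join("/tmp/piled/output", jobID),
		os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0600)
	if err != nil {
		return nil, err
	}
	devnull, err := os.Open(os.DevNull)
	if err != nil {
		return nil, err
	}
	cmd := exec.Command(argv[0], argv[1:]...)
	cmd.Stdin = devnull
	cmd.Stdout, cmd.Stderr = out, out // a single file collects both streams
	return cmd, cmd.Start()
}

func main() {
	if _, err := startWithOutput("pile-example", []string{"/usr/bin/echo", "hello from job"}); err != nil {
		log.Fatal(err)
	}
}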
No assumptions should be made about PATH or any other environment variable. Jobs are assumed to be specified using full command paths. Should there be a need to set specific environment bits, the command needs to be wrapped in a shell, e.g. `sh -c 'export FOO=bar; exec myprogram'`
Example usage. Start a trivial job:
$ ./pilec --config client.yaml \
start -- /usr/bin/bash -c 'echo Hello from job'
pile.c74cb008-9b63-439c-a158-12cedcf41733
Start a long running job:
$ ./pilec --config client.yaml \
start -- /usr/bin/bash -c 'while true; do echo "$(date) -- log"; sleep 5; done'
pile.7954ce77-4b7f-4fed-b6c6-0868174bf6dd
Query the status of a job:
$ ./pilec --config client.yaml \
status pile.7954ce77-4b7f-4fed-b6c6-0868174bf6dd
active
$ ./pilec --config client.yaml \
status pile.4fc399e9-d813-4df2-b33d-24f876aa407e
exited
$ ./pilec --config client.yaml \
status pile.15343990-15aa-4d80-a82b-8896ab1e1de9
failed (status=3)
See the output of a job:
$ ./pilec --config client.yaml \
output pile.7954ce77-4b7f-4fed-b6c6-0868174bf6dd
Fri Apr 8 20:38:07 CEST 2022 -- log
Fri Apr 8 20:38:12 CEST 2022 -- log
Fri Apr 8 20:38:17 CEST 2022 -- log
...<output continues>
Stop a job:
$ ./pilec --config client.yaml stop pile.7954ce77-4b7f-4fed-b6c6-0868174bf6dd
stopped (status=0)
pilec uses a simple config file:
# client's certificate and key
cert: certs/client-cert.pem
key: certs/client-key.pem
# CA certificate
ca-cert: certs/ca-cert.pem
# server address, in a format accepted by the gRPC address parser
address: dns:localhost:9999
# authorization token
token: admin-token
Overrides for the address and token can be passed on the command line as the --address and --token parameters.
Resource control is implemented on top of the cgroup v2 controllers. For simplicity, the resource constraints are expressed in a format applicable to the *.max knob of a given controller; thus there is cpu.max, io.max and memory.max for each of the supported controllers.
While there are more knobs, implementing support for each and every one is outside the scope of piled.
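As an illustration of the *.max approach, a minimal sketch of applying such limits to a job's leaf group (paths and example values assumed):

package main

import (
	"log"
	"os"
	"path/filepath"
)

// applyLimits writes raw values into the *.max knobs of a job's leaf group;
// the path layout and the concrete values are assumptions.
func applyLimits(leaf string, limits map[string]string) error {
	for knob, value := range limits {
		if err := os.WriteFile(filepath.Join(leaf, knob), []byte(value), 0644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	err := applyLimits("/sys/fs/cgroup/piled.service/pile-<uuid>", map[string]string{
		"memory.max": "128M",             // cap memory at 128 MiB
		"cpu.max":    "50000 100000",     // 50 ms of CPU time per 100 ms period
		"io.max":     "8:0 rbps=1048576", // limit reads on device 8:0 to 1 MiB/s
	})
	if err != nil {
		log.Fatal(err)
	}
}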
The jobs are run inside the same cgroup namespace as the piled daemon, although in a per-job leaf group. Since the daemon is within the system hierarchy, the complete hierarchy is also visible to the jobs, making it possible for rogue jobs to escape their original cgroup.
No effort is made to observe cgroup event notifications. Any status updates happen when requested by the client.
A new clean tmpfs is not mounted over the job's /tmp, /run or /var/run locations in its mount namespace. This allows content to be shared with the jobs, but has the potential of triggering an information leak, e.g. /tmp/.X11-unix is trivially accessible.
The same applies to /sys or /dev.
No effort is made to sandbox the processes using LSMs, seccomp or capabilities. Everything runs with the same security context as piled.
No effort is made to preserve state across piled restarts.
Normally either the CN or the SANs would need to match the address on which the server listens. However, for simplicity of starting the server on an arbitrary host, verification of the server's certificate does not include the name in the certificate; instead, a certificate that is valid (given the system CA pool or the supplementary CA certificate) and carries the well-known CN=piled is considered good enough.
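A sketch of what such relaxed client-side verification could look like in Go; this is an assumption about the approach, not necessarily the exact piled code:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"log"
	"os"
)

// verifyPiledServer validates the presented chain against the given pool and
// then checks for the well-known CN instead of matching the listen address.
func verifyPiledServer(pool *x509.CertPool) func([][]byte, [][]*x509.Certificate) error {
	return func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
		certs := make([]*x509.Certificate, 0, len(rawCerts))
		for _, raw := range rawCerts {
			c, err := x509.ParseCertificate(raw)
			if err != nil {
				return err
			}
			certs = append(certs, c)
		}
		if len(certs) == 0 {
			return errors.New("no server certificate presented")
		}
		opts := x509.VerifyOptions{Roots: pool, Intermediates: x509.NewCertPool()}
		for _, c := range certs[1:] {
			opts.Intermediates.AddCert(c)
		}
		if _, err := certs[0].Verify(opts); err != nil {
			return err
		}
		if certs[0].Subject.CommonName != "piled" {
			return errors.New("unexpected server certificate CN")
		}
		return nil
	}
}

func main() {
	pool, err := x509.SystemCertPool()
	if err != nil {
		log.Fatal(err)
	}
	if ca, err := os.ReadFile("certs/ca-cert.pem"); err == nil {
		pool.AppendCertsFromPEM(ca)
	}
	cfg := &tls.Config{
		MinVersion:            tls.VersionTLS13,
		InsecureSkipVerify:    true, // name checks skipped; the chain is verified manually above
		VerifyPeerCertificate: verifyPiledServer(pool),
	}
	_ = cfg // would be handed to the client's gRPC transport credentials
}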