This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

docs: benchmark stage1 systemd resource usage #1788

Closed
jonboulle opened this issue Nov 24, 2015 · 28 comments

Comments

@jonboulle
Contributor

Inside the two default stage1 backends (the standard systemd-nspawn-based one and the LKVM one), rkt uses systemd to manage the process lifecycle and provide some other features like logging (via systemd-journald). This means there is some resource usage overhead associated with each pod, beyond just the resource consumption of the constituent apps. This is by design and a known consequence of rkt's daemonless execution model.

For the most part, this overhead should be minimal - on the order of a few megabytes of memory and negligible CPU. But it's important to have a solid understanding of what exactly this overhead is, and to track it over time (e.g. to ensure we don't have any regressions which dramatically increase this overhead), particularly when looking to integrate rkt with cluster manager systems like Kubernetes.

@jonboulle
Contributor Author

/cc @yifan-gu @dchen1107

@vishh

vishh commented Dec 2, 2015

cc @vishh

@cgonyeo
Member

cgonyeo commented Dec 8, 2015

I wrote a small Go program that runs rkt and monitors the CPU and memory usage of rkt and its children via /proc.

For lack of a better place, I put it in a new repo: https://github.com/dgonyeo/rkt-monitor. If others like it perhaps a subdirectory in the rkt repo would be a better home for it.

Right now some of the logic is kinda hacky (panics everywhere), but it could be cleaned up fairly easily.

It has access to a lot of information, but right now only prints a small subset of it. That could be easily changed, if more detailed reports would be useful.

I imagine some CI could be set up to build/run this against rkt with a handful of ACIs, and flag any PRs that result in systemd, systemd-journal, or rkt using an abnormally high amount of resources.

Sample output:

systemd-journal(6741): seconds alive: 9  avg CPU: 0%  avg Mem: 29 kB  peak Mem: 45 kB
worker(6746): seconds alive: 9  avg CPU: 23%  avg Mem: 238 kB  peak Mem: 360 kB
rkt(6688): seconds alive: 10  avg CPU: 0%  avg Mem: 26 kB  peak Mem: 27 kB
systemd(6740): seconds alive: 9  avg CPU: 0%  avg Mem: 35 kB  peak Mem: 35 kB
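For readers curious what such a monitor does under the hood: the per-process numbers come from parsing /proc/<pid>/stat. A minimal sketch of that parsing (not the actual rkt-monitor code; field offsets per stat(5), shown here against a canned stat line):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ProcStat holds the /proc/<pid>/stat fields that matter for this kind of
// monitoring: CPU time in clock ticks and resident set size in pages.
type ProcStat struct {
	Comm  string
	Utime uint64 // user-mode CPU time, in clock ticks (field 14 of stat(5))
	Stime uint64 // kernel-mode CPU time, in clock ticks (field 15)
	RSS   int64  // resident set size, in pages (field 24)
}

// parseStat parses one /proc/<pid>/stat line. The comm field is wrapped in
// parentheses and may itself contain spaces, so it is delimited by the
// first '(' and the last ')'.
func parseStat(line string) (ProcStat, error) {
	open := strings.IndexByte(line, '(')
	end := strings.LastIndexByte(line, ')')
	if open < 0 || end < open {
		return ProcStat{}, fmt.Errorf("malformed stat line")
	}
	// After the ')' the remaining fields start at field 3 (state), so
	// field N of stat(5) sits at index N-3 here.
	fields := strings.Fields(line[end+1:])
	if len(fields) < 22 {
		return ProcStat{}, fmt.Errorf("too few fields in stat line")
	}
	utime, err := strconv.ParseUint(fields[11], 10, 64)
	if err != nil {
		return ProcStat{}, err
	}
	stime, err := strconv.ParseUint(fields[12], 10, 64)
	if err != nil {
		return ProcStat{}, err
	}
	rss, err := strconv.ParseInt(fields[21], 10, 64)
	if err != nil {
		return ProcStat{}, err
	}
	return ProcStat{Comm: line[open+1 : end], Utime: utime, Stime: stime, RSS: rss}, nil
}

func main() {
	// A canned stat line; on a live system this would come from
	// reading /proc/<pid>/stat for each child process.
	line := "6741 (systemd-journal) S 1 6741 6741 0 -1 4194624 500 0 0 0 120 35 0 0 20 0 1 0 12345 10485760 182"
	st, err := parseStat(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: utime=%d stime=%d rss=%d pages\n", st.Comm, st.Utime, st.Stime, st.RSS)
	// → systemd-journal: utime=120 stime=35 rss=182 pages
}
```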

@alban
Member

alban commented Dec 8, 2015

systemd-cgtop can also provide information. I have a fairly old systemd on my Fedora 22 host, so I am running systemd-cgtop from a debootstrap tree in order to use the latest version:

$ sudo systemd-nspawn -D /home/alban/distro-trees/debian/ \
        --share-system \
        --bind /run --bind /var/run \
        systemd-cgtop --depth=4 --order=memory --iterations=1 --batch

Control Group                                          Procs   %CPU   Memory  Input/s Output/s
/                                                        194      -     5.9G        -        -
/machine.slice                                             8      -    19.7M        -        -
/machine.slice/ma...7f\x2d84cd\x2df562edf4c99d.scope       4      -     7.1M        -        -
/machine.slice/ma...7f\x2d9dc4\x2d3c82ca318a9d.scope       4      -     4.8M        -        -
/machine.slice/ma...df562edf4c99d.scope/system.slice       3      -     2.1M        -        -
/machine.slice/ma....scope/system.slice/etcd.service       1      -     1.4M        -        -
/machine.slice/ma...scope/system.slice/redis.service       1      -   768.0K        -        -
/machine.slice/ma...d3c82ca318a9d.scope/system.slice       3      -   108.0K        -        -
/machine.slice/ma...ope/system.slice/busybox.service       2      -   108.0K        -        -
/machine.slice/ma...x2d3c82ca318a9d.scope/init.scope       1      -        -        -        -
/machine.slice/ma...m.slice/systemd-journald.service       1      -        -        -        -
/machine.slice/ma...x2df562edf4c99d.scope/init.scope       1      -        -        -        -
/machine.slice/ma...m.slice/systemd-journald.service       1      -        -        -        -

In this example, I am running etcd and redis in the first pod and busybox in the second pod.

The first pod takes 7.1M of memory in total, but the applications etcd and redis only take 1.4M and 768.0K respectively. That's roughly 4.9M of overhead for this pod.

When several pods are running with the same stage1, they are all running the same version of systemd, systemd-journald, etc. It would be nice to check that these are loaded into memory only once, thanks to shared libraries and overlay fs.

When checking /proc/$SYSTEMD_PID/maps for each pod, I see that the same inodes are used for e.g. /lib64/libselinux.so.1 across pods when overlay fs is enabled, and different inodes when it is disabled. Is there a way to check that the memory pages are actually shared?
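One way to answer the shared-pages question is the Pss field in /proc/<pid>/smaps: Pss charges each shared page proportionally to the number of processes mapping it, so a mapping whose Pss is far below its Rss is genuinely shared. A sketch of summing those fields (sumSmapsField is a hypothetical helper, shown here against a canned smaps excerpt):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sumSmapsField sums one field (e.g. "Rss", "Pss", "Shared_Clean") across
// all mappings in the text of a /proc/<pid>/smaps file, returning kilobytes.
func sumSmapsField(smaps, field string) int {
	total := 0
	prefix := field + ":"
	for _, line := range strings.Split(smaps, "\n") {
		if strings.HasPrefix(line, prefix) {
			f := strings.Fields(line)
			if len(f) >= 2 {
				kb, _ := strconv.Atoi(f[1])
				total += kb
			}
		}
	}
	return total
}

func main() {
	// Canned excerpt of an smaps file; on a live system this would come
	// from reading /proc/<pid>/smaps for each pod's systemd.
	sample := `7f2a00000000-7f2a00022000 r-xp 00000000 08:01 393232 /lib64/libselinux.so.1
Rss:                 148 kB
Pss:                  24 kB
Shared_Clean:        148 kB
7f2a00223000-7f2a00224000 rw-p 00022000 08:01 393232 /lib64/libselinux.so.1
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
`
	rss := sumSmapsField(sample, "Rss")
	pss := sumSmapsField(sample, "Pss")
	fmt.Printf("Rss=%d kB Pss=%d kB\n", rss, pss)
	// → Rss=152 kB Pss=28 kB  (Pss well below Rss: the text pages are shared)
}
```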

The host flavor copies the systemd, systemd-journald, etc. binaries into stage1 on each rkt run instead of relying on bind mounts or overlay fs. This is something that could be fixed with bind mounts.

@jellonek
Contributor

jellonek commented Dec 8, 2015

Informational - in the next two weeks I will be working on adding Clear Containers bits into the kvm flavor (a shared base image with the systemd/sshd/init/enter/gc binaries, and then DAX), which should greatly reduce memory overhead in this flavor.

@alban
Member

alban commented Dec 14, 2015

About systemd-cgtop: it prints more info when the accounting options are enabled in /etc/systemd/system.conf (+reboot):

DefaultMemoryAccounting=yes

systemd/systemd#2163
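For completeness, the full set of global accounting defaults discussed in this thread would look something like this in /etc/systemd/system.conf (option names as documented in systemd-system.conf(5); a reboot is needed, as noted above):

```ini
# /etc/systemd/system.conf — enable resource accounting for all units
# by default, instead of per-unit CPUAccounting=/MemoryAccounting= settings.
DefaultCPUAccounting=yes
DefaultMemoryAccounting=yes
DefaultBlockIOAccounting=yes
```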

@jonboulle
Contributor Author

@dgonyeo bump

@cgonyeo
Member

cgonyeo commented Dec 16, 2015

I'm looking at how to use systemd-cgtop now, and the man page for it says that in order to get resource monitoring for a unit, to "add the CPUAccounting=1, MemoryAccounting=1 and BlockIOAccounting=1 settings in the unit files in question". I was planning on launching rkt via systemd-run, and I see that it has a flag, -p, for setting a unit property. This "takes an assignment in the same format as systemctl(1)'s set-property command", which another man page provides the following example for: systemctl set-property foobar.service CPUShares=777. In this example, does anyone know what foobar.service should be in my systemd-run call?

I did try to omit foobar.service, but the following resulted in systemd-cgtop not showing me resource usage:

sudo systemd-run -p CPUAccounting=1 -p MemoryAccounting=1 -p BlockIOAccounting=1 rkt run --insecure-options=image /home/derek/git/rkt-monitor/mem-stresser.aci

@alban
Member

alban commented Dec 16, 2015

@dgonyeo I think it is easier to enable the accounting options in /etc/systemd/system.conf on the host, and then it is no longer necessary to add it per unit.

@cgonyeo
Member

cgonyeo commented Dec 16, 2015

That does have the caveat that you'll only be able to use this tool on properly configured systems. That's easier for me though, so unless someone knows how to properly configure systemd-run for this I'll just go with that.

@cgonyeo
Member

cgonyeo commented Dec 18, 2015

I've got systemd-cgtop reporting memory/cpu usage now, but some things don't seem quite right to me:

The labels for the columns on the right are: Procs, %CPU, Memory, Input/s, Output/s

/system.slice/run-r558090defc7c407aba72f27d057a8156.service                                             4   94.0    84.4M        -        -
/system.slice/run-r558090defc7c407aba72f27d057a8156.service/system.slice                                2   22.2  1020.0K        -        -
/system.slice/run-r558090defc7c407aba72f27d057a8156.service/system.slice/rkt-log-stresser.service       1   22.2  1020.0K        -        -
/system.slice/run-r558090defc7c407aba72f27d057a8156.service/system.slice/systemd-journald.service       1      -        -        -        -

This is with the log-stresser ACI, that just spits out timestamps in an infinite loop to generate log messages.

On my machine systemd-journald always shows "-" for memory, I can't see systemd's usage in here, and the outer service is taking up way more memory than the pod's slices are. Should I just move forward with recording all the numbers systemd-cgtop gives me, or is there anything off here?

@alban alban modified the milestones: v0.15.0, v0.14.0 Dec 18, 2015
@cgonyeo
Member

cgonyeo commented Dec 22, 2015

Well, until I figure out the systemd-cgtop thing (sorry, I've been slow on that), I just poked the gopsutil implementation again. Since I switched to using gopsutil instead of reading /proc myself, I'm a little more confident in these numbers.

I was also unsure whether to display memory usage based on the VMS or the RSS of a given process, but based on this SO post I changed it to RSS.

CPU reporting is also currently broken. I know how to fix it, but it's not worth the time if I'm ditching this implementation soon anyway.

derek@rokot ~/git/rkt-monitor> sudo ./rkt-monitor log-stresser.aci
worker(7005): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 5 mB  peak Mem: 6 mB
rkt(6978): seconds alive: 10  avg CPU: 0.000000%  avg Mem: 2 mB  peak Mem: 2 mB
systemd(7000): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 4 mB  peak Mem: 4 mB
systemd-journal(7001): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 7 mB  peak Mem: 7 mB
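The VMS-vs-RSS distinction above maps directly onto /proc/<pid>/statm: the first field is total virtual size and the second is resident pages, both in pages. A minimal sketch converting those to bytes (canned input; a 4 KiB page size is assumed for the example):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vmsRSSFromStatm parses the first two fields of /proc/<pid>/statm — total
// program size (VMS) and resident set size (RSS), both in pages — and
// converts them to bytes with the given page size.
func vmsRSSFromStatm(statm string, pageSize int64) (vms, rss int64, err error) {
	f := strings.Fields(statm)
	if len(f) < 2 {
		return 0, 0, fmt.Errorf("malformed statm: %q", statm)
	}
	sizePages, err := strconv.ParseInt(f[0], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	resPages, err := strconv.ParseInt(f[1], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	return sizePages * pageSize, resPages * pageSize, nil
}

func main() {
	// Canned statm content; live data would come from /proc/<pid>/statm,
	// and the real page size from os.Getpagesize().
	vms, rss, err := vmsRSSFromStatm("2048 512 300 10 0 200 0", 4096)
	if err != nil {
		panic(err)
	}
	fmt.Printf("VMS=%d bytes RSS=%d bytes\n", vms, rss)
	// → VMS=8388608 bytes RSS=2097152 bytes
}
```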

@jonboulle
Contributor Author

CPU usage should be an easy fix, right? IMHO let's get that going so we have a basic sample set of the whole picture.

@alban alban modified the milestones: v0.16.0, v0.15.0 Jan 7, 2016
@alban alban modified the milestones: v1.0.0, v0.16.0 Jan 20, 2016
@jonboulle
Contributor Author

ping @dgonyeo

@cgonyeo
Member

cgonyeo commented Jan 25, 2016

Ok, I finally had time to revisit this. I think the numbers are accurate, and I got the CPU reporting fixed:

derek@rokot ~/go/src/github.com/dgonyeo/rkt-monitor> go build && sudo ./rkt-monitor mem-stresser.aci
worker(14111): seconds alive: 9  avg CPU: 96.989163%  avg Mem: 35 mB  peak Mem: 35 mB
rkt(14086): seconds alive: 10  avg CPU: 2.990006%  avg Mem: 364 kB  peak Mem: 364 kB
systemd(14106): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 4 mB  peak Mem: 4 mB
systemd-journal(14107): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 6 mB  peak Mem: 6 mB

What else do you guys want to see? Maybe 95th percentile CPU and memory usage? More configuration over what it can print, or how rkt is called?

I'd like to add this to rkt's benchmark tests once we're happy with the state of this, ideally ASAP.

@vishh

vishh commented Jan 25, 2016

Can you describe what the workload does? Ideally, we will need the base usage, and also usage under stress (heavy logging, or continual restarts, etc.).


@iaguis iaguis modified the milestones: v1.1.0, v1.0.0 Feb 2, 2016
@cgonyeo
Member

cgonyeo commented Feb 10, 2016

I was imagining something along the lines of a make benchmark, so it'd be easy to see performance metrics on any given setup.

@alban alban modified the milestones: v1.2.0, v1.1.0 Feb 24, 2016
@iaguis iaguis modified the milestones: v1+, v1.2.0 Mar 18, 2016
@jonboulle jonboulle modified the milestones: v1.3.0, v1+ Mar 20, 2016
@jonboulle
Contributor Author

@dgonyeo can you pick this up for the next release?

@cgonyeo
Member

cgonyeo commented Mar 21, 2016

I'll be working on this this week. I'm going to switch over to using cadvisor (which, until a PR is merged, will be a specific branch in @sjpotter's fork).

Scenarios I'm going to get benchmarks for:

  • An ACI that logs as much as possible
  • A pod with an absurd number of apps in it
  • Many pods at once

And I imagine it'll run these benchmarks against each available stage1 by default. If anyone has additional scenarios I should be looking at, please speak up.

@jonboulle
Contributor Author

If the rkt-monitor stuff is already working can we just land that as-is and consider moving to cadvisor later after things have settled over there?

@cgonyeo
Member

cgonyeo commented Mar 24, 2016

Made a PR for it: #2324

@alban alban modified the milestones: v1.4.0, v1.3.0 Mar 31, 2016
@iaguis iaguis modified the milestones: v1.5.0, v1.4.0 Apr 14, 2016
@jonboulle
Contributor Author

@dgonyeo before closing this out - could you produce some example stats please? e.g. against the last release and against master?

@cgonyeo
Member

cgonyeo commented Apr 26, 2016

Certainly.

derek@proton ~> cat /proc/cpuinfo | grep "model name"
model name  : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
model name  : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
model name  : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
model name  : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
derek@proton ~> uname -a                             
Linux proton 4.4.6 #1-NixOS SMP Wed Mar 16 15:43:17 UTC 2016 x86_64 GNU/Linux

v1.4.0

log-stresser.aci

derek@proton ~/go/src/github.com/coreos/rkt/tests/rkt-monitor> sudo ./rkt-monitor log-stresser.aci
rkt(18493): seconds alive: 10  avg CPU: 28.314541%  avg Mem: 2 mB  peak Mem: 2 mB
systemd(18515): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 4 mB  peak Mem: 4 mB
systemd-journal(18517): seconds alive: 9  avg CPU: 88.397098%  avg Mem: 7 mB  peak Mem: 7 mB
worker(18521): seconds alive: 9  avg CPU: 7.330367%  avg Mem: 5 mB  peak Mem: 6 mB

mem-stresser.aci

derek@proton ~/go/src/github.com/coreos/rkt/tests/rkt-monitor> sudo ./rkt-monitor mem-stresser.aci 
worker(18634): seconds alive: 9  avg CPU: 98.550401%  avg Mem: 318 mB  peak Mem: 555 mB
rkt(18599): seconds alive: 10  avg CPU: 3.583814%  avg Mem: 2 mB  peak Mem: 2 mB
systemd(18628): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 4 mB  peak Mem: 4 mB
systemd-journal(18630): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 6 mB  peak Mem: 6 mB

cpu-stresser.aci

derek@proton ~/go/src/github.com/coreos/rkt/tests/rkt-monitor> sudo ./rkt-monitor cpu-stresser.aci 
rkt(18706): seconds alive: 10  avg CPU: 3.587050%  avg Mem: 2 mB  peak Mem: 2 mB
systemd(18736): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 4 mB  peak Mem: 4 mB
systemd-journal(18740): seconds alive: 9  avg CPU: 0.000000%  avg Mem: 6 mB  peak Mem: 6 mB
worker(18744): seconds alive: 9  avg CPU: 88.937493%  avg Mem: 808 kB  peak Mem: 808 kB

too-many-apps.podmanifest

derek@proton ~/go/src/github.com/coreos/rkt/tests/rkt-monitor> sudo ./rkt-monitor too-many-apps.podmanifest -d 30s
# Identical (aside from PID) worker-binary lines removed
worker-binary(24029): seconds alive: 18  avg CPU: 0.000000%  avg Mem: 840 kB  peak Mem: 840 kB
worker-binary(23691): seconds alive: 19  avg CPU: 0.000000%  avg Mem: 840 kB  peak Mem: 840 kB
rkt(23316): seconds alive: 23  avg CPU: 9.484321%  avg Mem: 5 mB  peak Mem: 20 mB
rkt(23378): seconds alive: 1  avg CPU: 0.000000%  avg Mem: 15 mB  peak Mem: 15 mB
systemd(23408): seconds alive: 19  avg CPU: 1.614304%  avg Mem: 16 mB  peak Mem: 16 mB
systemd-journal(23411): seconds alive: 19  avg CPU: 0.000000%  avg Mem: 6 mB  peak Mem: 6 mB

Master (98da3e5)

log-stresser.aci

derek@proton ~/go/src/github.com/coreos/rkt/tests/rkt-monitor> sudo ./rkt-monitor log-stresser.aci 
rkt(31364): seconds alive: 10  avg CPU: 36.442197%  avg Mem: 4 mB  peak Mem: 19 mB
systemd(31392): seconds alive: 8  avg CPU: 0.000000%  avg Mem: 4 mB  peak Mem: 4 mB
systemd-journal(31394): seconds alive: 8  avg CPU: 87.003957%  avg Mem: 7 mB  peak Mem: 7 mB
worker(31398): seconds alive: 8  avg CPU: 7.628103%  avg Mem: 5 mB  peak Mem: 6 mB

mem-stresser.aci

derek@proton ~/go/src/github.com/coreos/rkt/tests/rkt-monitor> sudo ./rkt-monitor mem-stresser.aci 
systemd(31624): seconds alive: 8  avg CPU: 0.000000%  avg Mem: 4 mB  peak Mem: 4 mB
systemd-journal(31626): seconds alive: 8  avg CPU: 0.000000%  avg Mem: 6 mB  peak Mem: 6 mB
worker(31630): seconds alive: 8  avg CPU: 96.679234%  avg Mem: 238 mB  peak Mem: 426 mB
rkt(31595): seconds alive: 10  avg CPU: 17.035268%  avg Mem: 4 mB  peak Mem: 19 mB

cpu-stresser.aci

derek@proton ~/go/src/github.com/coreos/rkt/tests/rkt-monitor> sudo ./rkt-monitor cpu-stresser.aci 
systemd(31771): seconds alive: 8  avg CPU: 0.000000%  avg Mem: 4 mB  peak Mem: 4 mB
systemd-journal(31773): seconds alive: 8  avg CPU: 0.000000%  avg Mem: 6 mB  peak Mem: 6 mB
worker(31777): seconds alive: 8  avg CPU: 87.497889%  avg Mem: 808 kB  peak Mem: 808 kB
rkt(31745): seconds alive: 10  avg CPU: 16.936711%  avg Mem: 4 mB  peak Mem: 19 mB

too-many-apps.podmanifest

derek@proton ~/go/src/github.com/coreos/rkt/tests/rkt-monitor> sudo ./rkt-monitor too-many-apps.podmanifest -d 30s
# Identical (aside from PID) prepare-app and worker-binary lines removed
rkt(31853): seconds alive: 22  avg CPU: 14.676260%  avg Mem: 5 mB  peak Mem: 20 mB
systemd(31886): seconds alive: 18  avg CPU: 4.109397%  avg Mem: 15 mB  peak Mem: 16 mB
systemd-journal(31889): seconds alive: 18  avg CPU: 0.000000%  avg Mem: 6 mB  peak Mem: 6 mB
prepare-app(31971): seconds alive: 1  avg CPU: 0.000000%  avg Mem: 0 B  peak Mem: 0 B
worker-binary(32147): seconds alive: 17  avg CPU: 0.000000%  avg Mem: 840 kB  peak Mem: 840 kB

@cgonyeo
Member

cgonyeo commented Apr 26, 2016

Want those in the docs anywhere, or is this issue good enough?

@jonboulle
Contributor Author

Somewhere, please. Open to suggestions for what it looks like exactly. Some prior art for example: https://github.com/coreos/etcd/tree/master/Documentation/benchmarks


@vishh

vishh commented Jun 1, 2016

cc @dchen1107


7 participants