
gomaxprocs: Use a ticker (30s) to set gomaxprocs #8278

Merged
merged 9 commits into elastic:main from f/adjust-maxprocs-with-30s-ticker on Jun 8, 2022

Conversation

marclop
Contributor

@marclop marclop commented Jun 1, 2022

Motivation/summary

Adds a new goroutine which tries to refresh the GOMAXPROCS setting every
30s. This accounts for cases where the CFS quota may be updated
without the APM Server process being restarted (as in ESS / ECE).
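
Conceptually, the change amounts to re-applying the automaxprocs adjustment on a fixed ticker. A minimal sketch of that approach, assuming the go.uber.org/automaxprocs/maxprocs package (an illustration only, not the actual apm-server code; the function name is made up):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.uber.org/automaxprocs/maxprocs"
)

// adjustMaxProcs applies the CFS-quota-derived GOMAXPROCS immediately and
// then re-applies it on every tick until the context is cancelled.
func adjustMaxProcs(ctx context.Context, refresh time.Duration) error {
	set := func() {
		// maxprocs.Set reads the cgroup CPU quota and updates
		// runtime.GOMAXPROCS accordingly; the returned undo func is ignored
		// because the setting is re-applied on every tick anyway.
		if _, err := maxprocs.Set(maxprocs.Logger(log.Printf)); err != nil {
			log.Printf("failed to set GOMAXPROCS: %v", err)
		}
	}
	set()
	ticker := time.NewTicker(refresh)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			set()
		}
	}
}

func main() {
	// Run the refresh loop for a couple of minutes as a demonstration.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	_ = adjustMaxProcs(ctx, 30*time.Second)
}
```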

Checklist

- [ ] Update CHANGELOG.asciidoc
- [ ] Update package changelog.yml (only if changes to apmpackage have been made)
- [ ] Documentation has been updated

How to test these changes

  1. docker-compose up -d
  2. cd systemtest/cmd/runapm && go run main.go, copy the image that is used.
  3. Inspect the created container and obtain the FLEET_ENROLLMENT_TOKEN from the runapm container (you can stop the runapm container afterwards).
  4. Create a docker-compose.override.yml in the root of the APM Server with these contents:
services:
  apm-server:
    image: USE_BUILDAPM_RESULTING_IMAGE
    cpus: 2.7
    ports:
      - 8200:8200
    environment:
      - "FLEET_URL=https://fleet-server:8220"
      - "FLEET_CA=/etc/pki/tls/certs/fleet-ca.pem"
      - "FLEET_ENROLL=1"
      - "FLEET_ENROLLMENT_TOKEN=USE_RUNAPM_ENROLLMENT_TOKEN"
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:8200 | grep -q 'true'"]
      retries: 300
      interval: 1s
    depends_on:
      elasticsearch: { condition: service_healthy }
      kibana: { condition: service_healthy }
      fleet-server: { condition: service_healthy }
    volumes:
      - "./testing/docker/fleet-server/certificate.pem:/etc/pki/tls/certs/fleet-ca.pem"
  5. docker-compose up -d. Verify that the apm-server container is reachable.
  6. Run: docker update --cpus=1 <CONTAINER ID>
  7. Observe a log message (in the Observability Logs UI) indicating that the CPU quota has been updated to 1 (an example is shown below).
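
For reference, the expected log line (as confirmed later in this thread) looks like:

[elastic_agent.apm_server][info] maxprocs: Updating GOMAXPROCS=1: determined from CPU quota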

Related issues

Part of #7967

Adds a new goroutine which tries to refresh the gomaxprocs setting every
`30s`. This accounts for cases where the CFS quotas may be refreshed
without the APM Server process being restarted (like in ESS / ECE).

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
@marclop marclop added enhancement backport-8.3 Automated backport with mergify labels Jun 1, 2022
@apmmachine
Collaborator

apmmachine commented Jun 1, 2022

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

Build stats

  • Start Time: 2022-06-07T09:41:36.168+0000

  • Duration: 123 min 15 sec

Test stats 🧪

Test Results

| Failed | Passed | Skipped | Total |
|--------|--------|---------|-------|
| 0      | 2742   | 9       | 2751  |

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /hey-apm : Run the hey-apm benchmark.

  • /package : Generate and publish the docker images.

  • /test windows : Build & tests on Windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@apmmachine
Collaborator

apmmachine commented Jun 1, 2022

🌐 Coverage report

| Name         | Metrics % (covered/total) | Diff     |
|--------------|---------------------------|----------|
| Packages     | 100.0% (42/42)            | 💚       |
| Files        | 91.96% (183/199)          | 👍       |
| Classes      | 93.435% (427/457)         | 👍       |
| Methods      | 89.218% (1084/1215)       | 👍 0.018 |
| Lines        | 76.812% (13191/17173)     | 👍 0.014 |
| Conditionals | 100.0% (0/0)              | 💚       |

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
Member

@axw axw left a comment

still in draft, but this looks good to go :)

@marclop
Contributor Author

marclop commented Jun 8, 2022

Benchmark results

I benchmarked this PR's changes against 8.2.2, since that's the last version that didn't include the gomaxprocs
change. In summary, using the automaxprocs library to limit the Go runtime's GOMAXPROCS utilizes
the granted CPU quota more efficiently, results in significantly less throttling, and translates into better
APM Server performance across all sizes, but particularly for the smaller APM Server sizes (1-4GB of RAM).

In ESS, containers with less than 16GB of RAM accrue CPU credits during the periods when they consume less
than their granted vCPU quota (see the table below). These credits allow the containers to temporarily consume
more CPU time than they would otherwise be allowed to; this "boosted" quota is equal to the vCPU quota a 16GB
instance would have.

Since the APM Server's maximum instance size is 8GB of RAM, boosting will always be a factor. The benchmark
warm-up mostly depleted the CPU credits before the actual benchmarks began, but that was harder to achieve for
the 8GB APM Server size since the GOMAXPROCS value is set to 1 (1.35 vCPU floors to 1).

Containers smaller than 4GB are granted the vCPU quota of a container one size (1GB) larger: a 1GB container
gets the vCPU of a 2GB container, a 2GB container gets the vCPU of a 3GB container, while a 4GB container keeps
the vCPU of a 4GB container. From 4GB to 8GB the quota doubles, as expected.

| Size | vCPU quota (CPU cores) | Performance delta vs 8.2.2 | Agents |
|------|------------------------|----------------------------|--------|
| 1gb  | 0.3385                 | ~75%                       | 64     |
| 2gb  | 0.5577                 | >100%                      | 64     |
| 4gb  | 0.6769                 | >100%                      | 64     |
| 8gb  | 1.3539                 | >36-45%                    | 384    |
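
For context on why a 1.35 vCPU quota maps to GOMAXPROCS=1: automaxprocs derives the value from the cgroup CFS quota and period and, by default, rounds the ratio down with a floor of 1. A rough sketch of that calculation (an illustration of the library's documented default behaviour, not its actual code):

```go
package main

import (
	"fmt"
	"runtime"
)

// gomaxprocsFromQuota approximates the default automaxprocs behaviour:
// floor(cpu.cfs_quota_us / cpu.cfs_period_us), but never below 1.
func gomaxprocsFromQuota(quotaUs, periodUs int64) int {
	if quotaUs <= 0 || periodUs <= 0 {
		// No quota configured: fall back to the number of host CPUs.
		return runtime.NumCPU()
	}
	n := int(quotaUs / periodUs) // integer division floors the ratio
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	// An 8GB APM Server gets ~1.35 vCPU (135,390us per 100,000us period),
	// which floors to GOMAXPROCS=1.
	fmt.Println(gomaxprocsFromQuota(135390, 100000)) // prints 1
}
```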

Detailed benchmark results

1GB

Quota: 33.85 (ms) (0.33 vCPU)
Comment: The throttled CPU time is significantly lower when automaxprocs is used.

$ export SIZE=1g; benchstat -alpha 0.34 benchresults/8.2.2/$SIZE.txt benchresults/automaxprocs/$SIZE.txt
name                 old time/op              new time/op              delta
1000Transactions-64               242ms ±57%               115ms ±20%     ~     (p=0.667 n=2+2)
AgentAll-64                       21.6s ± 0%               12.3s ± 1%  -43.08%  (p=0.333 n=2+2)
AgentGo-64                        4.80s ± 1%               2.63s ± 1%  -45.16%  (p=0.333 n=2+2)
AgentNodeJS-64                    2.43s ± 0%               1.51s ± 1%  -37.81%  (p=0.333 n=2+2)
AgentPython-64                    8.74s ± 1%               4.96s ± 0%  -43.19%  (p=0.333 n=2+2)
AgentRuby-64                      4.24s ± 1%               2.54s ± 3%  -39.97%  (p=0.333 n=2+2)
OTLPTraces-64                     270µs ± 0%               138µs ± 0%  -48.91%  (p=0.333 n=2+2)

name                 old events/sec           new events/sec           delta
1000Transactions-64               6.16k ±57%               9.06k ±20%     ~     (p=0.667 n=2+2)
AgentAll-64                         840 ± 0%                1472 ± 1%  +75.29%  (p=0.333 n=2+2)
AgentGo-64                        1.09k ± 1%               1.99k ± 0%  +81.74%  (p=0.333 n=2+2)
AgentNodeJS-64                      830 ± 0%                1334 ± 1%  +60.80%  (p=0.333 n=2+2)
AgentPython-64                      805 ± 1%                1417 ± 0%  +75.97%  (p=0.333 n=2+2)
AgentRuby-64                        888 ± 1%                1480 ± 3%  +66.55%  (p=0.333 n=2+2)
OTLPTraces-64                     3.71k ± 0%               7.26k ± 0%  +95.78%  (p=0.333 n=2+2)

name                 old alloc/op             new alloc/op             delta
1000Transactions-64              17.6MB ± 1%              17.9MB ± 0%   +1.56%  (p=0.333 n=2+2)
AgentAll-64                       751MB ± 0%               712MB ± 0%   -5.21%  (p=0.333 n=2+2)
AgentGo-64                        172MB ± 0%               167MB ± 0%   -3.03%  (p=0.333 n=2+2)
AgentNodeJS-64                   88.6MB ± 0%              86.8MB ± 0%   -2.01%  (p=0.333 n=2+2)
AgentPython-64                    313MB ± 0%               299MB ± 0%   -4.33%  (p=0.333 n=2+2)
AgentRuby-64                      152MB ± 0%               148MB ± 0%   -2.86%  (p=0.333 n=2+2)
OTLPTraces-64                    10.8kB ± 0%              10.8kB ± 0%   -0.32%  (p=0.333 n=2+2)

name                 old allocs/op            new allocs/op            delta
1000Transactions-64                131k ± 1%                129k ± 1%   -1.24%  (p=0.333 n=2+2)
AgentAll-64                       8.65M ± 0%               8.38M ± 0%   -3.12%  (p=0.333 n=2+2)
AgentGo-64                        1.82M ± 0%               1.77M ± 0%   -2.98%  (p=0.333 n=2+2)
AgentNodeJS-64                    1.07M ± 0%               1.06M ± 0%   -1.06%  (p=0.333 n=2+2)
AgentPython-64                    3.84M ± 0%               3.74M ± 0%   -2.44%  (p=0.333 n=2+2)
AgentRuby-64                      1.78M ± 0%               1.75M ± 0%   -1.72%  (p=0.333 n=2+2)
OTLPTraces-64                      79.5 ± 1%                76.0 ± 0%   -4.40%  (p=0.333 n=2+2)

name                 old error_responses/sec  new error_responses/sec  delta
OTLPTraces-64                      0.03 ±33%                0.01 ±31%  -66.19%  (p=0.333 n=2+2)

8.2.2

349,657 (ms) <- Max observed throttled CPU time without automaxprocs.

8.3.0-SNAPSHOT (+ticker change)

57,422 (ms) <- Max observed throttled CPU time with automaxprocs.

2GB

Quota: 50.77 (ms) (0.5 vCPU)
Comment: Again, automaxprocs seems to boost performance considerably by reducing the amount of
throttled CPU time.

$ export SIZE=2g; benchstat -alpha 0.34 benchresults/8.2.2/$SIZE.txt benchresults/automaxprocs/$SIZE.txt
name                 old time/op     new time/op     delta
1000Transactions-64      194ms ±50%      107ms ±31%      ~     (p=0.667 n=2+2)
AgentAll-64              16.5s ± 0%       7.1s ± 2%   -57.08%  (p=0.333 n=2+2)
AgentGo-64               3.71s ± 0%      1.94s ± 8%   -47.68%  (p=0.333 n=2+2)
AgentNodeJS-64           1.86s ± 2%      0.90s ± 3%   -51.53%  (p=0.333 n=2+2)
AgentPython-64           6.51s ± 2%      3.14s ± 1%   -51.81%  (p=0.333 n=2+2)
AgentRuby-64             3.30s ± 0%      1.70s ± 3%   -48.38%  (p=0.333 n=2+2)
OTLPTraces-64            217µs ± 1%       87µs ± 1%   -59.77%  (p=0.333 n=2+2)

name                 old events/sec  new events/sec  delta
1000Transactions-64      6.91k ±50%     10.34k ±31%      ~     (p=0.667 n=2+2)
AgentAll-64              1.10k ± 1%      2.56k ± 2%  +132.41%  (p=0.333 n=2+2)
AgentGo-64               1.41k ± 0%      2.72k ± 8%   +92.04%  (p=0.333 n=2+2)
AgentNodeJS-64           1.08k ± 2%      2.23k ± 3%  +106.23%  (p=0.333 n=2+2)
AgentPython-64           1.08k ± 2%      2.24k ± 1%  +107.36%  (p=0.333 n=2+2)
AgentRuby-64             1.14k ± 0%      2.21k ± 3%   +93.85%  (p=0.333 n=2+2)
OTLPTraces-64            4.62k ± 1%     11.48k ± 1%  +148.59%  (p=0.333 n=2+2)

name                 old alloc/op    new alloc/op    delta
1000Transactions-64     17.6MB ± 1%     17.6MB ± 1%      ~     (p=1.000 n=2+2)
AgentAll-64              738MB ± 0%      700MB ± 0%    -5.14%  (p=0.333 n=2+2)
AgentGo-64               171MB ± 0%      167MB ± 0%    -2.71%  (p=0.333 n=2+2)
AgentNodeJS-64          89.1MB ± 0%     85.6MB ± 0%    -3.82%  (p=0.333 n=2+2)
AgentPython-64           310MB ± 0%      296MB ± 0%    -4.57%  (p=0.333 n=2+2)
AgentRuby-64             153MB ± 0%      147MB ± 0%    -3.65%  (p=0.333 n=2+2)
OTLPTraces-64           10.8kB ± 0%     10.8kB ± 0%    -0.26%  (p=0.333 n=2+2)

name                 old allocs/op   new allocs/op   delta
1000Transactions-64       130k ± 1%       126k ± 1%    -3.21%  (p=0.333 n=2+2)
AgentAll-64              8.55M ± 0%      8.30M ± 0%    -2.89%  (p=0.333 n=2+2)
AgentGo-64               1.82M ± 0%      1.77M ± 0%    -2.75%  (p=0.333 n=2+2)
AgentNodeJS-64           1.08M ± 0%      1.06M ± 0%    -2.01%  (p=0.333 n=2+2)
AgentPython-64           3.82M ± 0%      3.72M ± 0%    -2.54%  (p=0.333 n=2+2)
AgentRuby-64             1.79M ± 0%      1.75M ± 0%    -2.19%  (p=0.333 n=2+2)
OTLPTraces-64             79.0 ± 0%       75.0 ± 0%    -5.06%  (p=0.333 n=2+2)

8.2.2

402,422 (ms) <- Max observed throttled CPU time.

8.3.0-SNAPSHOT (+ticker change)

210,886 (ms) <- Max observed throttled CPU time.

4GB

Quota: 67.69 (ms) (0.67 vCPU)
Comment: Again, automaxprocs seems to boost performance considerably by reducing the amount of
throttled CPU time.

$ export SIZE=4g; benchstat -alpha 0.34 benchresults/8.2.2/$SIZE.txt benchresults/automaxprocs/$SIZE.txt
name                 old time/op     new time/op     delta
1000Transactions-64      212ms ± 0%       85ms ±24%   -59.87%  (p=0.200 n=2+3)
AgentAll-64              9.46s ± 0%      4.66s ± 1%   -50.77%  (p=0.200 n=2+3)
AgentGo-64               2.09s ± 0%      1.03s ± 1%   -50.73%  (p=0.200 n=2+3)
AgentNodeJS-64           1.15s ± 0%      0.59s ± 1%   -48.60%  (p=0.200 n=2+3)
AgentPython-64           3.88s ± 1%      1.85s ± 7%   -52.25%  (p=0.200 n=2+3)
AgentRuby-64             1.59s ±19%      0.95s ± 1%   -40.22%  (p=0.200 n=2+3)
OTLPTraces-64            123µs ± 0%       55µs ± 0%   -55.37%  (p=0.200 n=2+3)

name                 old events/sec  new events/sec  delta
1000Transactions-64      4.72k ± 0%     12.15k ±28%  +157.61%  (p=0.200 n=2+3)
AgentAll-64              1.91k ± 0%      3.88k ± 1%  +102.81%  (p=0.200 n=2+3)
AgentGo-64               2.51k ± 0%      5.10k ± 1%  +102.92%  (p=0.200 n=2+3)
AgentNodeJS-64           1.76k ± 0%      3.42k ± 1%   +94.35%  (p=0.200 n=2+3)
AgentPython-64           1.81k ± 1%      3.80k ± 7%  +109.81%  (p=0.200 n=2+3)
AgentRuby-64             2.44k ±19%      3.94k ± 1%   +61.31%  (p=0.200 n=2+3)
OTLPTraces-64            8.16k ± 0%     18.28k ± 0%  +124.06%  (p=0.200 n=2+3)

name                 old alloc/op    new alloc/op    delta
1000Transactions-64     17.4MB ± 0%     17.5MB ± 1%    +0.56%  (p=0.200 n=2+3)
AgentAll-64              713MB ± 0%      694MB ± 0%    -2.56%  (p=0.200 n=2+3)
AgentGo-64               167MB ± 0%      165MB ± 0%    -1.17%  (p=0.200 n=2+3)
AgentNodeJS-64          87.7MB ± 0%     85.3MB ± 0%    -2.71%  (p=0.200 n=2+3)
AgentPython-64           302MB ± 0%      295MB ± 0%    -2.47%  (p=0.200 n=2+3)
AgentRuby-64             148MB ± 0%      145MB ± 0%    -1.78%  (p=0.200 n=2+3)
OTLPTraces-64           10.5kB ± 0%     10.7kB ± 0%    +1.68%  (p=0.200 n=2+3)

name                 old allocs/op   new allocs/op   delta
1000Transactions-64       129k ± 0%       125k ± 1%    -2.96%  (p=0.200 n=2+3)
AgentAll-64              8.38M ± 0%      8.27M ± 0%    -1.39%  (p=0.200 n=2+3)
AgentGo-64               1.79M ± 0%      1.76M ± 0%    -1.62%  (p=0.200 n=2+3)
AgentNodeJS-64           1.07M ± 0%      1.05M ± 0%    -1.54%  (p=0.200 n=2+3)
AgentPython-64           3.76M ± 0%      3.71M ± 0%    -1.38%  (p=0.200 n=2+3)
AgentRuby-64             1.76M ± 0%      1.74M ± 0%    -1.17%  (p=0.200 n=2+3)
OTLPTraces-64             77.0 ± 0%       75.0 ± 0%      ~     (p=0.600 n=2+3)

8.2.2

357,734 (ms) <- Max observed throttled CPU time.

8.3.0-SNAPSHOT (+ticker change)

113,050 (ms) <- Max observed throttled CPU time.

8GB

Quota: 135.39 (ms) (1.35 vCPU)
Comment: There was barely any throttled CPU time, and often the load wasn't high enough with 64 agents,
so I re-ran the benchmark with 384 agents for both commits; those results are much more consistent.

$ export SIZE=8g; benchstat -alpha 0.34 benchresults/8.2.2/$SIZE.txt benchresults/automaxprocs/$SIZE.txt
name                 old time/op     new time/op     delta
1000Transactions-64     67.0ms ± 1%     64.1ms ± 4%     ~     (p=0.400 n=2+3)
AgentAll-64              3.82s ± 2%      3.15s ±17%  -17.48%  (p=0.200 n=2+3)
AgentGo-64               849ms ± 2%      704ms ±14%  -17.07%  (p=0.200 n=2+3)
AgentNodeJS-64           470ms ± 0%      411ms ± 9%  -12.54%  (p=0.200 n=2+3)
AgentPython-64           1.59s ± 0%      1.32s ± 7%  -17.14%  (p=0.200 n=2+3)
AgentRuby-64             658ms ±17%      696ms ± 6%     ~     (p=1.000 n=2+3)
OTLPTraces-64           54.1µs ± 0%     44.3µs ±10%  -18.09%  (p=0.200 n=2+3)

name                 old events/sec  new events/sec  delta
1000Transactions-64      14.9k ± 1%      15.6k ± 4%     ~     (p=0.400 n=2+3)
AgentAll-64              4.73k ± 2%      5.82k ±18%  +23.11%  (p=0.200 n=2+3)
AgentGo-64               6.17k ± 2%      7.51k ±13%  +21.68%  (p=0.200 n=2+3)
AgentNodeJS-64           4.29k ± 0%      4.92k ± 9%  +14.85%  (p=0.200 n=2+3)
AgentPython-64           4.42k ± 0%      5.35k ± 7%  +20.96%  (p=0.200 n=2+3)
AgentRuby-64             5.87k ±17%      5.40k ± 6%     ~     (p=1.000 n=2+3)
OTLPTraces-64            18.5k ± 0%      22.7k ±10%  +22.68%  (p=0.200 n=2+3)

name                 old alloc/op    new alloc/op    delta
1000Transactions-64     17.3MB ± 0%     17.8MB ± 0%   +2.93%  (p=0.200 n=2+3)
AgentAll-64              692MB ± 0%      696MB ± 0%     ~     (p=0.400 n=2+3)
AgentGo-64               163MB ± 0%      166MB ± 0%   +1.98%  (p=0.200 n=2+3)
AgentNodeJS-64          85.1MB ± 0%     85.5MB ± 1%     ~     (p=0.800 n=2+3)
AgentPython-64           296MB ± 0%      297MB ± 0%     ~     (p=0.400 n=2+3)
AgentRuby-64             145MB ± 0%      146MB ± 1%     ~     (p=0.400 n=2+3)
OTLPTraces-64           10.4kB ± 0%     10.9kB ± 0%   +4.54%  (p=0.200 n=2+3)

name                 old allocs/op   new allocs/op   delta
1000Transactions-64       128k ± 0%       128k ± 0%   -0.24%  (p=0.200 n=2+3)
AgentAll-64              8.26M ± 0%      8.30M ± 0%   +0.46%  (p=0.200 n=2+3)
AgentGo-64               1.75M ± 0%      1.77M ± 0%   +0.94%  (p=0.200 n=2+3)
AgentNodeJS-64           1.05M ± 0%      1.06M ± 0%     ~     (p=0.800 n=2+3)
AgentPython-64           3.72M ± 0%      3.73M ± 0%   +0.49%  (p=0.200 n=2+3)
AgentRuby-64             1.73M ± 0%      1.74M ± 1%   +0.59%  (p=0.200 n=2+3)
OTLPTraces-64             76.0 ± 0%       77.0 ± 0%   +1.32%  (p=0.200 n=2+3)
$ export SIZE=8g-384; benchstat -alpha 0.21 benchresults/8.2.2/$SIZE.txt benchresults/automaxprocs/$SIZE.txt
name                  old time/op              new time/op              delta
1000Transactions-384               102ms ± 3%                59ms ± 3%   -42.12%  (p=0.100 n=3+3)
AgentAll-384                       3.43s ± 6%               2.34s ± 2%   -31.96%  (p=0.100 n=3+3)
AgentGo-384                        927ms ± 1%               451ms ± 7%   -51.36%  (p=0.100 n=3+3)
AgentNodeJS-384                    526ms ± 1%               314ms ± 3%   -40.35%  (p=0.100 n=3+3)
AgentPython-384                    1.54s ±15%               1.03s ± 7%   -32.97%  (p=0.100 n=3+3)
AgentRuby-384                      844ms ± 1%               520ms ±13%   -38.34%  (p=0.100 n=3+3)
OTLPTraces-384                    59.8µs ± 5%              44.4µs ±13%   -25.64%  (p=0.100 n=3+3)

name                  old events/sec           new events/sec           delta
1000Transactions-384               9.86k ± 3%              17.02k ± 3%   +72.66%  (p=0.100 n=3+3)
AgentAll-384                       5.26k ± 6%               7.72k ± 2%   +46.69%  (p=0.100 n=3+3)
AgentGo-384                        5.65k ± 1%              11.64k ± 7%  +106.05%  (p=0.100 n=3+3)
AgentNodeJS-384                    3.83k ± 1%               6.42k ± 3%   +67.72%  (p=0.100 n=3+3)
AgentPython-384                    4.61k ±14%               6.82k ± 7%   +47.94%  (p=0.100 n=3+3)
AgentRuby-384                      4.45k ± 1%               7.29k ±13%   +63.92%  (p=0.100 n=3+3)
OTLPTraces-384                     16.6k ± 4%               22.7k ±12%   +36.62%  (p=0.100 n=3+3)

name                  old alloc/op             new alloc/op             delta
1000Transactions-384              17.6MB ± 0%              18.1MB ± 0%    +3.10%  (p=0.100 n=3+3)
AgentAll-384                       695MB ± 0%               701MB ± 0%    +0.85%  (p=0.100 n=3+3)
AgentGo-384                        164MB ± 0%               167MB ± 0%    +2.04%  (p=0.100 n=3+3)
AgentNodeJS-384                   86.2MB ± 0%              85.9MB ± 0%    -0.34%  (p=0.100 n=3+3)
AgentPython-384                    296MB ± 0%               296MB ± 0%      ~     (p=0.700 n=3+3)
AgentRuby-384                      146MB ± 0%               146MB ± 0%      ~     (p=0.700 n=3+3)
OTLPTraces-384                    10.4kB ± 0%              10.9kB ± 0%    +5.48%  (p=0.100 n=3+3)

name                  old allocs/op            new allocs/op            delta
1000Transactions-384                130k ± 0%                131k ± 0%    +0.36%  (p=0.100 n=3+3)
AgentAll-384                       8.27M ± 0%               8.33M ± 0%    +0.69%  (p=0.100 n=3+3)
AgentGo-384                        1.76M ± 0%               1.78M ± 0%    +1.05%  (p=0.100 n=3+3)
AgentNodeJS-384                    1.06M ± 0%               1.06M ± 0%    -0.21%  (p=0.200 n=3+3)
AgentPython-384                    3.72M ± 0%               3.72M ± 0%      ~     (p=0.700 n=3+3)
AgentRuby-384                      1.74M ± 0%               1.75M ± 0%    +0.33%  (p=0.200 n=3+3)
OTLPTraces-384                      76.0 ± 0%                77.0 ± 0%    +1.32%  (p=0.100 n=3+3)

name                  old error_responses/sec  new error_responses/sec  delta
OTLPTraces-384                     1.44 ±147%                0.05 ±64%   -96.23%  (p=0.100 n=3+3)

8.2.2

341,896 (ms) <- Max observed throttled CPU time. (64 agents)
419,190 (ms) <- Max observed throttled CPU time. (384 agents)

8.3.0-SNAPSHOT (+ticker change)

14,856 (ms) <- Max observed throttled CPU time (64). Minimal
21,419 (ms) <- Max observed throttled CPU time (384). Minimal

16GB (2x 8GB)

The CPU quota per instance is the same as for the 8GB size, since the APM Server has a hard size limit of 8GB.
Comment: For this topology, the number of agents was increased to 196 to raise the CPU load on each APM Server
and to account for the fact that there are two APM Servers. I ran the benchmarks with both 64 and 196 agents,
but there wasn't any throttled CPU time during the automaxprocs benchmarks.

$ export SIZE=16g; benchstat -alpha 0.34 benchresults/8.2.2/$SIZE.txt benchresults/automaxprocs/$SIZE.txt
name                 old time/op     new time/op     delta
1000Transactions-64     68.9ms ± 4%     64.9ms ± 9%     ~     (p=0.400 n=3+3)
AgentAll-64              1.77s ± 9%      1.73s ±16%     ~     (p=0.700 n=3+3)
AgentGo-64               472ms ± 7%      417ms ± 4%  -11.77%  (p=0.100 n=3+3)
AgentNodeJS-64           291ms ±19%      211ms ±13%  -27.43%  (p=0.100 n=3+3)
AgentPython-64           1.02s ±12%      0.60s ± 6%  -41.18%  (p=0.100 n=3+3)
AgentRuby-64             508ms ± 8%      302ms ± 3%  -40.47%  (p=0.100 n=3+3)
OTLPTraces-64           44.9µs ± 4%     44.5µs ± 4%     ~     (p=1.000 n=3+3)

name                 old events/sec  new events/sec  delta
1000Transactions-64      14.5k ± 4%      15.5k ± 9%     ~     (p=0.400 n=3+3)
AgentAll-64              10.2k ± 9%      10.6k ±17%     ~     (p=0.700 n=3+3)
AgentGo-64               11.1k ± 7%      12.6k ± 4%  +13.19%  (p=0.100 n=3+3)
AgentNodeJS-64           7.07k ±21%      9.64k ±14%  +36.32%  (p=0.100 n=3+3)
AgentPython-64           6.96k ±13%     11.75k ± 5%  +68.87%  (p=0.100 n=3+3)
AgentRuby-64             7.42k ± 8%     12.42k ± 3%  +67.43%  (p=0.100 n=3+3)
OTLPTraces-64            22.3k ± 4%      22.5k ± 4%     ~     (p=1.000 n=3+3)

name                 old alloc/op    new alloc/op    delta
1000Transactions-64     17.4MB ± 0%     17.8MB ± 0%   +2.85%  (p=0.100 n=3+3)
AgentAll-64              696MB ± 0%      702MB ± 0%   +0.76%  (p=0.100 n=3+3)
AgentGo-64               164MB ± 0%      167MB ± 0%   +1.53%  (p=0.100 n=3+3)
AgentNodeJS-64          85.2MB ± 0%     86.4MB ± 0%   +1.37%  (p=0.100 n=3+3)
AgentPython-64           295MB ± 0%      297MB ± 0%   +0.46%  (p=0.100 n=3+3)
AgentRuby-64             145MB ± 0%      146MB ± 0%   +0.75%  (p=0.100 n=3+3)
OTLPTraces-64           10.5kB ± 0%     11.0kB ± 0%   +4.26%  (p=0.100 n=3+3)

name                 old allocs/op   new allocs/op   delta
1000Transactions-64       128k ± 0%       128k ± 1%     ~     (p=0.700 n=3+3)
AgentAll-64              8.26M ± 0%      8.31M ± 0%   +0.52%  (p=0.100 n=3+3)
AgentGo-64               1.76M ± 0%      1.77M ± 0%   +0.43%  (p=0.100 n=3+3)
AgentNodeJS-64           1.05M ± 0%      1.06M ± 0%   +0.69%  (p=0.100 n=3+3)
AgentPython-64           3.71M ± 0%      3.72M ± 0%   +0.35%  (p=0.100 n=3+3)
AgentRuby-64             1.73M ± 0%      1.74M ± 0%   +0.39%  (p=0.100 n=3+3)
OTLPTraces-64             77.0 ± 0%       77.0 ± 0%     ~     (all equal)
$ export SIZE=16g-196; benchstat -alpha 0.41 benchresults/8.2.2/$SIZE.txt benchresults/automaxprocs/$SIZE.txt
name                  old time/op              new time/op              delta
1000Transactions-196              74.4ms ± 4%              59.7ms ± 2%  -19.73%  (p=0.100 n=3+3)
AgentAll-196                       2.05s ±10%               1.68s ± 7%  -18.20%  (p=0.100 n=3+3)
AgentGo-196                        500ms ± 7%               393ms ± 3%  -21.36%  (p=0.100 n=3+3)
AgentNodeJS-196                    235ms ± 6%               238ms ± 5%     ~     (p=0.700 n=3+3)
AgentPython-196                    754ms ±10%               637ms ±13%  -15.50%  (p=0.200 n=3+3)
AgentRuby-196                      379ms ± 4%               351ms ± 8%   -7.37%  (p=0.400 n=3+3)
OTLPTraces-196                    50.3µs ±11%              47.7µs ±11%     ~     (p=0.700 n=3+3)

name                  old events/sec           new events/sec           delta
1000Transactions-196               13.5k ± 4%               16.8k ± 2%  +24.48%  (p=0.100 n=3+3)
AgentAll-196                       8.87k ±11%              10.80k ± 7%  +21.74%  (p=0.100 n=3+3)
AgentGo-196                        10.4k ± 8%               13.3k ± 3%  +27.58%  (p=0.100 n=3+3)
AgentNodeJS-196                    8.58k ± 6%               8.46k ± 5%     ~     (p=0.700 n=3+3)
AgentPython-196                    9.37k ±10%              11.14k ±13%  +18.84%  (p=0.200 n=3+3)
AgentRuby-196                      9.93k ± 4%              10.76k ± 9%   +8.36%  (p=0.400 n=3+3)
OTLPTraces-196                     19.9k ±13%               20.9k ±13%     ~     (p=0.700 n=3+3)

name                  old alloc/op             new alloc/op             delta
1000Transactions-196              17.5MB ± 0%              17.9MB ± 0%   +2.27%  (p=0.100 n=3+3)
AgentAll-196                       700MB ± 0%               705MB ± 0%   +0.75%  (p=0.100 n=3+3)
AgentGo-196                        164MB ± 1%               167MB ± 0%   +2.41%  (p=0.100 n=3+3)
AgentNodeJS-196                   86.3MB ± 0%              88.3MB ± 0%   +2.31%  (p=0.100 n=3+3)
AgentPython-196                    297MB ± 1%               302MB ± 1%   +1.58%  (p=0.100 n=3+3)
AgentRuby-196                      147MB ± 0%               149MB ± 0%   +1.61%  (p=0.100 n=3+3)
OTLPTraces-196                    10.5kB ± 0%              11.0kB ± 1%   +4.64%  (p=0.100 n=3+3)

name                  old allocs/op            new allocs/op            delta
1000Transactions-196                129k ± 0%                128k ± 0%   -1.26%  (p=0.100 n=3+3)
AgentAll-196                       8.28M ± 0%               8.30M ± 0%   +0.26%  (p=0.100 n=3+3)
AgentGo-196                        1.75M ± 1%               1.77M ± 0%   +0.85%  (p=0.100 n=3+3)
AgentNodeJS-196                    1.06M ± 0%               1.06M ± 0%   +0.80%  (p=0.100 n=3+3)
AgentPython-196                    3.72M ± 0%               3.73M ± 0%   +0.45%  (p=0.200 n=3+3)
AgentRuby-196                      1.74M ± 0%               1.75M ± 0%   +0.60%  (p=0.100 n=3+3)
OTLPTraces-196                      76.7 ± 1%                77.0 ± 0%     ~     (p=1.000 n=3+3)

name                  old error_responses/sec  new error_responses/sec  delta
OTLPTraces-196                     0.56 ±124%               0.58 ±115%     ~     (p=1.000 n=3+3)

8.2.2

254,214 (ms) <- Max observed throttled CPU time. APM 1
132,636 (ms) <- Max observed throttled CPU time. APM 2

8.3.0-SNAPSHOT (+ticker change)

0 (ms) <- No observed throttled CPU time.

@marclop marclop marked this pull request as ready for review June 8, 2022 08:40
Contributor

@lahsivjar lahsivjar left a comment

We don't have any such case currently, but something to note for the future: with dynamic boosting and vCPUs > 1, a 30-second refresh interval might lead to unexpected throttling. When the vCPUs are reset after the boost but GOMAXPROCS is still using the higher value, a 30-second delay corresponds to 300 CFS periods (assuming a CFS period of 100ms).

@marclop marclop enabled auto-merge (squash) June 8, 2022 10:20
@apmmachine
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

Build stats

  • Start Time: 2022-06-08T08:40:39.555+0000

  • Duration: 103 min 44 sec

Test stats 🧪

Test Results

| Failed | Passed | Skipped | Total |
|--------|--------|---------|-------|
| 0      | 2742   | 9       | 2751  |

@apmmachine
Collaborator

apmmachine commented Jun 8, 2022

🌐 Coverage report

| Name         | Metrics % (covered/total) | Diff     |
|--------------|---------------------------|----------|
| Packages     | 100.0% (42/42)            | 💚       |
| Files        | 91.96% (183/199)          | 👍       |
| Classes      | 93.435% (427/457)         | 👍       |
| Methods      | 89.3% (1085/1215)         | 👍 0.1   |
| Lines        | 76.789% (13187/17173)     | 👍 0.008 |
| Conditionals | 100.0% (0/0)              | 💚       |

@apmmachine
Collaborator

apmmachine commented Jun 8, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-06-08T20:35:59.289+0000

  • Duration: 31 min 40 sec

Test stats 🧪

Test Results

| Failed | Passed | Skipped | Total |
|--------|--------|---------|-------|
| 0      | 4045   | 13      | 4058  |

@marclop
Contributor Author

marclop commented Jun 8, 2022

/test

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
@marclop marclop merged commit c61dd3a into elastic:main Jun 8, 2022
@marclop marclop deleted the f/adjust-maxprocs-with-30s-ticker branch June 8, 2022 21:06
mergify bot pushed a commit that referenced this pull request Jun 8, 2022
Adds a new goroutine which tries to refresh the gomaxprocs setting every
`30s`. This accounts for cases where the CFS quotas may be refreshed
without the APM Server process being restarted (like in ESS / ECE).

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
(cherry picked from commit c61dd3a)
axw pushed a commit that referenced this pull request Jun 9, 2022
Adds a new goroutine which tries to refresh the gomaxprocs setting every
`30s`. This accounts for cases where the CFS quotas may be refreshed
without the APM Server process being restarted (like in ESS / ECE).

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
(cherry picked from commit c61dd3a)

Co-authored-by: Marc Lopez Rubio <marc5.12@outlook.com>
@stuartnelson3 stuartnelson3 self-assigned this Jun 13, 2022
@stuartnelson3
Contributor

confirmed:

terminal:

~/w/apm-server (8.3=)
# docker update --cpus=1 22ae530c2295
22ae530c2295
~/w/apm-server (8.3=)
# docker update --cpus=3 22ae530c2295
22ae530c2295

Observability Logs UI:

08:51:09.080
elastic_agent.apm_server
[elastic_agent.apm_server][info] maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
08:51:39.080
elastic_agent.apm_server
[elastic_agent.apm_server][info] maxprocs: Updating GOMAXPROCS=3: determined from CPU quota

@axw
Member

axw commented Jun 15, 2022

I've also tested this in a QA cloud environment and it seems to be working, though I'm a bit unsure about the specific quotas being set.

I started up a 1GB Integrations Server: it started out with GOMAXPROCS=1, and then switched to GOMAXPROCS=8 after a short time. I then changed the server to 8GB, and it started with GOMAXPROCS=4 and then switched to GOMAXPROCS=8. It does line up with the reported quota/period, though.

@marclop
Contributor Author

marclop commented Jun 15, 2022

@axw thanks for testing it in QA too. The QA quotas will be bigger since the VMs where the APM Server containers run are smaller, resulting in a higher share of the CPU cores for each workload. The boost to 8 indicates that CPU credits are available for the container to use, so it seems to be working as expected.
