This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

fix: blocking behaviour under load #144

Merged 22 commits on Apr 5, 2021

Conversation

trentm (Member) commented Mar 26, 2021

This makes a few changes to fix issues with the APM agent under heavy
load and with a slow or non-responsive APM server.

  1. A new `maxQueueSize` config option is added (default 1024 for now) to
    control how many events (transactions, spans, errors, metricsets)
    will be queued before being dropped, if events are coming in faster
    than they can be sent to the APM server.

    Dev Note: This is a FIFO, so we will tend to drop new events and keep
    older ones, whose relevance fades. That isn't ideal. The current
    architecture of using node.js streams to buffer doesn't make it easy
    to drop older events. Dropping newer events under overload is deemed
    good enough for now.

  2. JSON encoding of events (when uncorking) is done in limited-size
    batches to control the amount of single-chunk CPU event-loop blocking
    time. (See MAX_WRITE_BATCH_SIZE in Client._writev; a rough sketch of
    this batching idea is included below.) Internal stats are collected
    to watch for long(est) batch processing times.

    Dev Note: Eventually an investigation into putting JSON encoding into
    a separate worker thread might be interesting.

  3. The handling of individual requests to the APM Server intake API has
    been rewritten to handle some error cases -- especially from a
    non-responsive APM server -- and to ensure that only one intake
    request is being performed at a time.

  4. Support for backoff on intake API requests has been implemented per
    https://github.com/elastic/apm/blob/master/specs/agents/transport.md#transport-errors

Fixes: #136
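To make points 1 and 2 above concrete, here is a minimal, hypothetical sketch of the two mechanisms: dropping incoming events once the queue is full, and encoding queued events in bounded batches while yielding to the event loop between batches. This is not the actual Client code; maxQueueSize and MAX_WRITE_BATCH_SIZE mirror the names used above, while the function names, callbacks, and values are purely illustrative.

// Hypothetical sketch of the maxQueueSize check from point 1: drop the
// incoming event when the queue is already full.
function enqueueEvent (queue, maxQueueSize, event, stats) {
  if (queue.length >= maxQueueSize) {
    stats.numEventsDropped++
    return false
  }
  queue.push(event)
  return true
}

// Hypothetical sketch of the batched encoding from point 2 -- not the real
// Client._writev. Queued events are encoded as ndjson in bounded batches so
// that any single synchronous chunk of JSON.stringify work stays small.
const MAX_WRITE_BATCH_SIZE = 32 // illustrative value

function encodeInBatches (events, onChunk, onDone) {
  let i = 0
  function encodeNextBatch () {
    const batch = events.slice(i, i + MAX_WRITE_BATCH_SIZE)
    i += batch.length
    // Encode one bounded batch synchronously.
    onChunk(batch.map(ev => JSON.stringify(ev)).join('\n') + '\n')
    if (i < events.length) {
      // Yield to the event loop before encoding the next batch.
      setImmediate(encodeNextBatch)
    } else {
      onDone()
    }
  }
  encodeNextBatch()
}

The slowWriteBatch numbers in the perf results further down in this conversation are the internal stats mentioned in point 2.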


This commit is a first carryover from https://github.com/trentm/apm-nodejs-http-client/tree/play;
there is still work to be done before this is complete and reviewable.

@trentm trentm self-assigned this Mar 26, 2021
@trentm trentm added this to Planned in APM-Agents (OLD) via automation Mar 26, 2021
@github-actions github-actions bot added the agent-nodejs (Make available for APM Agents project planning.) label Mar 26, 2021
apmmachine commented Mar 26, 2021

💔 Build Failed


Build stats

  • Build Cause: astorm commented: jenkins run the tests please

  • Start Time: 2021-04-01T21:50:54.368+0000

  • Duration: 6 min 58 sec

  • Commit: 0dac0b1

Test stats 🧪

Test Results
Failed 0
Passed 14478
Skipped 0
Total 14478

Trends 🧪

Image of Build Times

Image of Tests

Steps errors (2)

Run Tests
  • Took 1 min 48 sec
  • Description: .ci/scripts/test.sh '15'
Error signal
  • Took 0 min 0 sec
  • Description: hudson.AbortException: script returned exit code 1

Log output

Last 100 lines of log output:

[2021-04-01T21:56:46.783Z] | |___| (_) | (_| |  __/ (_| (_) \ V /
[2021-04-01T21:56:46.783Z]  \_____\___/ \__,_|\___|\___\___/ \_/
[2021-04-01T21:56:46.783Z]                               Bash-20210309-2b87ace
[2021-04-01T21:56:46.783Z] 
[2021-04-01T21:56:46.783Z] 
[2021-04-01T21:56:46.783Z] ==> git version 2.17.1 found
[2021-04-01T21:56:46.783Z] ==> curl 7.58.0 (x86_64-pc-linux-gnu) libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
[2021-04-01T21:56:46.783Z] Release-Date: 2018-01-24
[2021-04-01T21:56:46.783Z] Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smb smbs smtp smtps telnet tftp 
[2021-04-01T21:56:46.783Z] Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL 
[2021-04-01T21:56:46.783Z] ==> Jenkins CI detected.
[2021-04-01T21:56:46.784Z]     project root: .
[2021-04-01T21:56:46.784Z] --> token set from env
[2021-04-01T21:56:46.784Z]     Yaml found at: .codecov.yml
[2021-04-01T21:56:46.784Z] ==> Running gcov in . (disable via -X gcov)
[2021-04-01T21:56:46.784Z] ==> Python coveragepy not found
[2021-04-01T21:56:46.784Z] ==> Searching for coverage reports in:
[2021-04-01T21:56:46.784Z]     + .
[2021-04-01T21:56:46.784Z]     -> Found 1 reports
[2021-04-01T21:56:46.784Z] ==> Detecting git/mercurial file structure
[2021-04-01T21:56:46.784Z] ==> Reading reports
[2021-04-01T21:56:46.784Z]     + ./coverage.lcov bytes=19546
[2021-04-01T21:56:46.784Z] ==> Appending adjustments
[2021-04-01T21:56:46.784Z]     https://docs.codecov.io/docs/fixing-reports
[2021-04-01T21:56:47.046Z]     -> No adjustments found
[2021-04-01T21:56:47.047Z] ==> Gzipping contents
[2021-04-01T21:56:47.047Z]         8.0K	/tmp/codecov.l6ZOBC.gz
[2021-04-01T21:56:47.047Z] ==> Uploading reports
[2021-04-01T21:56:47.047Z]     url: https://codecov.io
[2021-04-01T21:56:47.047Z]     query: branch=elastic%2Ftrentm%2Fblocking-behavior&commit=0dac0b1582f7b34130e06894ca383f1624578ef3&build=20&build_url=https%3A%2F%2Fapm-ci.elastic.co%2Fjob%2Fapm-agent-nodejs%2Fjob%2Fapm-nodejs-http-client-mbp%2Fjob%2FPR-144%2F20%2F&name=&tag=&slug=elastic%2Fapm-nodejs-http-client&service=jenkins&flags=&pr=144&job=&cmd_args=
[2021-04-01T21:56:47.047Z] ->  Pinging Codecov
[2021-04-01T21:56:47.047Z] https://codecov.io/upload/v4?package=bash-20210309-2b87ace&token=secret&branch=elastic%2Ftrentm%2Fblocking-behavior&commit=0dac0b1582f7b34130e06894ca383f1624578ef3&build=20&build_url=https%3A%2F%2Fapm-ci.elastic.co%2Fjob%2Fapm-agent-nodejs%2Fjob%2Fapm-nodejs-http-client-mbp%2Fjob%2FPR-144%2F20%2F&name=&tag=&slug=elastic%2Fapm-nodejs-http-client&service=jenkins&flags=&pr=144&job=&cmd_args=
[2021-04-01T21:56:47.309Z] ->  Uploading to
[2021-04-01T21:56:47.309Z] https://storage.googleapis.com/codecov/v4/raw/2021-04-01/5447F42A29F360B6E08CC683F6CEFF6D/0dac0b1582f7b34130e06894ca383f1624578ef3/07fca202-16d2-493b-a335-5a29d5f7fef8.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=GOOG1EQX6OZVJGHKK3633AAFGLBUCOOATRACRQRQF6HMSMLYUP6EAD6XSWAAY%2F20210401%2FUS%2Fs3%2Faws4_request&X-Amz-Date=20210401T215647Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=89e9407ad6b9f2a98bb49cf4b9c6ba327b193a368bcce1904ca7fa200183f809
[2021-04-01T21:56:47.309Z]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
[2021-04-01T21:56:47.309Z]                                  Dload  Upload   Total   Spent    Left  Speed
[2021-04-01T21:56:47.309Z] 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  5752    0     0  100  5752      0  26264 --:--:-- --:--:-- --:--:-- 26264
[2021-04-01T21:56:47.309Z]     -> Reports have been successfully queued for processing at https://codecov.io/github/elastic/apm-nodejs-http-client/commit/0dac0b1582f7b34130e06894ca383f1624578ef3
[2021-04-01T21:56:47.426Z] /tmp/.npm-global/bin/tap-xunit -> /tmp/.npm-global/lib/node_modules/tap-xunit/bin/tap-xunit
[2021-04-01T21:56:47.426Z] /tmp/.npm-global/bin/txunit -> /tmp/.npm-global/lib/node_modules/tap-xunit/bin/tap-xunit
[2021-04-01T21:56:47.426Z] + tap-xunit@2.4.1
[2021-04-01T21:56:47.426Z] added 21 packages from 21 contributors in 1.904s
[2021-04-01T21:56:47.443Z] Recording test results
[2021-04-01T21:56:48.277Z] [Checks API] No suitable checks publisher found.
[2021-04-01T21:56:48.306Z] Running in /var/lib/jenkins/workspace/pm-nodejs-http-client-mbp_PR-144/src/github.com/elastic/apm-nodejs-http-client
[2021-04-01T21:56:48.318Z] [INFO] Codecov: Getting branch ref...
[2021-04-01T21:56:48.419Z] Masking supported pattern matches of $GITHUB_TOKEN
[2021-04-01T21:56:48.473Z] [INFO] Codecov: Sending data...
[2021-04-01T21:56:48.874Z] + curl -sSLo codecov.sh https://codecov.io/bash
[2021-04-01T21:56:49.206Z] + bash codecov.sh
[2021-04-01T21:56:49.206Z] 
[2021-04-01T21:56:49.206Z]   _____          _
[2021-04-01T21:56:49.206Z]  / ____|        | |
[2021-04-01T21:56:49.206Z] | |     ___   __| | ___  ___ _____   __
[2021-04-01T21:56:49.206Z] | |    / _ \ / _` |/ _ \/ __/ _ \ \ / /
[2021-04-01T21:56:49.206Z] | |___| (_) | (_| |  __/ (_| (_) \ V /
[2021-04-01T21:56:49.206Z]  \_____\___/ \__,_|\___|\___\___/ \_/
[2021-04-01T21:56:49.206Z]                               Bash-20210309-2b87ace
[2021-04-01T21:56:49.206Z] 
[2021-04-01T21:56:49.206Z] 
[2021-04-01T21:56:49.206Z] ==> git version 2.17.1 found
[2021-04-01T21:56:49.206Z] ==> curl 7.58.0 (x86_64-pc-linux-gnu) libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
[2021-04-01T21:56:49.206Z] Release-Date: 2018-01-24
[2021-04-01T21:56:49.206Z] Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smb smbs smtp smtps telnet tftp 
[2021-04-01T21:56:49.206Z] Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL 
[2021-04-01T21:56:49.206Z] ==> Jenkins CI detected.
[2021-04-01T21:56:49.207Z]     project root: .
[2021-04-01T21:56:49.207Z] --> token set from env
[2021-04-01T21:56:49.207Z]     Yaml found at: .codecov.yml
[2021-04-01T21:56:49.207Z] ==> Running gcov in . (disable via -X gcov)
[2021-04-01T21:56:49.207Z] ==> Python coveragepy not found
[2021-04-01T21:56:49.207Z] ==> Searching for coverage reports in:
[2021-04-01T21:56:49.207Z]     + .
[2021-04-01T21:56:49.207Z]     -> Found 1 reports
[2021-04-01T21:56:49.207Z] ==> Detecting git/mercurial file structure
[2021-04-01T21:56:49.207Z] ==> Reading reports
[2021-04-01T21:56:49.207Z]     + ./coverage.lcov bytes=19549
[2021-04-01T21:56:49.207Z] ==> Appending adjustments
[2021-04-01T21:56:49.207Z]     https://docs.codecov.io/docs/fixing-reports
[2021-04-01T21:56:49.470Z]     -> No adjustments found
[2021-04-01T21:56:49.470Z] ==> Gzipping contents
[2021-04-01T21:56:49.470Z]         8.0K	/tmp/codecov.jq7rrc.gz
[2021-04-01T21:56:49.470Z] ==> Uploading reports
[2021-04-01T21:56:49.470Z]     url: https://codecov.io
[2021-04-01T21:56:49.470Z]     query: branch=elastic%2Ftrentm%2Fblocking-behavior&commit=0dac0b1582f7b34130e06894ca383f1624578ef3&build=20&build_url=https%3A%2F%2Fapm-ci.elastic.co%2Fjob%2Fapm-agent-nodejs%2Fjob%2Fapm-nodejs-http-client-mbp%2Fjob%2FPR-144%2F20%2F&name=&tag=&slug=elastic%2Fapm-nodejs-http-client&service=jenkins&flags=&pr=144&job=&cmd_args=
[2021-04-01T21:56:49.470Z] ->  Pinging Codecov
[2021-04-01T21:56:49.470Z] https://codecov.io/upload/v4?package=bash-20210309-2b87ace&token=secret&branch=elastic%2Ftrentm%2Fblocking-behavior&commit=0dac0b1582f7b34130e06894ca383f1624578ef3&build=20&build_url=https%3A%2F%2Fapm-ci.elastic.co%2Fjob%2Fapm-agent-nodejs%2Fjob%2Fapm-nodejs-http-client-mbp%2Fjob%2FPR-144%2F20%2F&name=&tag=&slug=elastic%2Fapm-nodejs-http-client&service=jenkins&flags=&pr=144&job=&cmd_args=
[2021-04-01T21:56:49.732Z] ->  Uploading to
[2021-04-01T21:56:49.733Z] https://storage.googleapis.com/codecov/v4/raw/2021-04-01/5447F42A29F360B6E08CC683F6CEFF6D/0dac0b1582f7b34130e06894ca383f1624578ef3/cb0584d0-bb04-44f7-98bb-9a29bc1d1156.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=GOOG1EQX6OZVJGHKK3633AAFGLBUCOOATRACRQRQF6HMSMLYUP6EAD6XSWAAY%2F20210401%2FUS%2Fs3%2Faws4_request&X-Amz-Date=20210401T215649Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=e186c62942ebdb57eea7da6e70247474dbf15b5a43b0e7944cbff07625555e0a
[2021-04-01T21:56:49.733Z]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
[2021-04-01T21:56:49.733Z]                                  Dload  Upload   Total   Spent    Left  Speed
[2021-04-01T21:56:49.733Z] 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  5747    0     0  100  5747      0  42257 --:--:-- --:--:-- --:--:-- 42257
[2021-04-01T21:56:49.733Z]     -> Reports have been successfully queued for processing at https://codecov.io/github/elastic/apm-nodejs-http-client/commit/0dac0b1582f7b34130e06894ca383f1624578ef3
[2021-04-01T21:56:51.523Z] Running on Jenkins in /var/lib/jenkins/workspace/pm-nodejs-http-client-mbp_PR-144
[2021-04-01T21:56:51.555Z] [INFO] getVaultSecret: Getting secrets
[2021-04-01T21:56:51.801Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2021-04-01T21:56:52.402Z] + chmod 755 generate-build-data.sh
[2021-04-01T21:56:52.402Z] + ./generate-build-data.sh https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-nodejs/apm-nodejs-http-client-mbp/PR-144/ https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-nodejs/apm-nodejs-http-client-mbp/PR-144/runs/20 FAILURE 357773
[2021-04-01T21:56:52.652Z] INFO: curl https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-nodejs/apm-nodejs-http-client-mbp/PR-144/runs/20/steps/?limit=10000 -o steps-info.json
[2021-04-01T21:56:53.203Z] INFO: curl https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-nodejs/apm-nodejs-http-client-mbp/PR-144/runs/20/tests/?status=FAILED -o tests-errors.json

This now does *not* unzip the payload. Is that a compat issue?
…0s: unref it *and* clean it up for that particular case in the conclusion handling
Also, allow any 2xx from APM server's intake API to be a "success".
Really it should be only 202, but previous behaviour was any 2xx
which was relied upon by some tests and some mock APM servers.
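A minimal sketch of the acceptance check described in that commit message (illustrative only, not the actual client code): any 2xx from the intake API is treated as success, even though 202 is the expected response.

// Illustrative only: keep the previous behaviour of accepting any 2xx
// intake response as success, though APM server normally replies with 202.
function isIntakeSuccess (statusCode) {
  return statusCode >= 200 && statusCode < 300
}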
trentm (Member, Author) commented Mar 29, 2021

Some perf-ish results from testing example-blocking-behavior.js with and without APM and talking to an APM server with different behaviours.

Typically (unless otherwise noted) we are hitting it with load via:

npx autocannon -d 90 -c 97 -p 10 localhost:3000/

which is: make requests as fast as possible for 90 seconds, using 97 concurrent connections, and 10 "pipelined requests". This is a higher level of concurrency in load than the autocannon
default of -c 10 -p 1.

I also observed the memory usage ("RES") using htop while running each load test.

tl;dr

Some observations from the data below:

  1. In all load cases, the resident memory usage of the app stayed less than 150MB. The equivalent cases with an APM agent without these fixes would rise up to 1300MB.
  2. This translated to much better latency figures for higher percentiles:
    │ Stat     │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev     │ Max     │
    ├──────────┼────────┼────────┼────────┼────────┼───────────┼───────────┼─────────┤
    │ no fix   │ 138 ms │ 155 ms │ 458 ms │ 665 ms │ 187.46 ms │ 237.45 ms │ 4876 ms │
    │ with fix │ 126 ms │ 145 ms │ 198 ms │ 208 ms │ 150.13 ms │  21.12 ms │  410 ms │
    
  3. The "no APM server" case below demonstrates backoff working to not pepper the APM server with new requests when it is failing.
  4. In all of the "using APM" cases, the internal APM agent stats show zero or one APM event-encoding event-loop bubble over 10ms, compared to frequent bubbles under load -- sometimes over a full second -- without these fixes.

baseline, no APM

htop observed: RES ~80MB, CPU ~104%

│ Stat    │ 2.5%  │ 50%   │ 97.5% │ 99%   │ Avg      │ Stdev   │ Max    │
├─────────┼───────┼───────┼───────┼───────┼──────────┼─────────┼────────┤
│ Latency │ 11 ms │ 17 ms │ 31 ms │ 36 ms │ 19.94 ms │ 7.39 ms │ 157 ms │

with APM, local mockapmserver.js (accidentally with payloadLogFile)

│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev    │ Max    │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼────────┤
│ Latency │ 125 ms │ 138 ms │ 173 ms │ 180 ms │ 142.28 ms │ 18.25 ms │ 584 ms │

APM Agent stats: { apmclient:
   { numEvents: 611271,
     numEventsDropped: 597907,
     numEventsEnqueued: 13364,
     numEventsSent: 13364,
     slowWriteBatch:
      { numOver10Ms: 0,
        encodeTimeMs: 2.441588,
        fullTimeMs: 5.245072,
        numEvents: 32,
        numBytes: 25086 },
     backoffReconnectCount: 0 } }

with APM, local mockapmserver.js that is accepting, reading, but not responding (accidentally with payloadLogFile)

htop: RES around 130-140MB

│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev    │ Max    │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼────────┤
│ Latency │ 126 ms │ 142 ms │ 181 ms │ 195 ms │ 145.81 ms │ 19.02 ms │ 584 ms │

APM Agent stats: { apmclient:
   { numEvents: 572840,
     numEventsDropped: 564596,
     numEventsEnqueued: 7316,
     numEventsSent: 5492,
     slowWriteBatch:
      { numOver10Ms: 0,
        encodeTimeMs: 2.647948,
        fullTimeMs: 5.601515,
        numEvents: 32,
        numBytes: 25085 },
     backoffReconnectCount: 3 } }
{"log.level":"error","@timestamp":"2021-03-29T21:38:23.224Z","log":{"logger":"elastic-apm-node"},"ecs":{"version":"1.6.0"},"message":"APM Server transport error: intake response timeout: APM server did not respond within 10s of gzip stream finish"}

with APM agent, but no APM server (accidentally with payloadLogFile)

htop: RES around 130-140MB

│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev    │ Max    │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼────────┤
│ Latency │ 125 ms │ 137 ms │ 171 ms │ 178 ms │ 141.42 ms │ 17.01 ms │ 541 ms │

We see backoff delays, which is good: 4s, 9s, 16s, 25s

[2021-03-29T21:41:14.406Z] ERROR (elastic-apm-node): APM Server transport error (ECONNREFUSED): connect ECONNREFUSED 127.0.0.1:8200
[2021-03-29T21:41:14.761Z] ERROR (elastic-apm-node): APM Server transport error (ECONNREFUSED): connect ECONNREFUSED 127.0.0.1:8200
[2021-03-29T21:41:16.152Z] ERROR (elastic-apm-node): APM Server transport error (ECONNREFUSED): connect ECONNREFUSED 127.0.0.1:8200
[2021-03-29T21:41:20.260Z] ERROR (elastic-apm-node): APM Server transport error (ECONNREFUSED): connect ECONNREFUSED 127.0.0.1:8200
[2021-03-29T21:41:30.339Z] ERROR (elastic-apm-node): APM Server transport error (ECONNREFUSED): connect ECONNREFUSED 127.0.0.1:8200
[2021-03-29T21:41:47.043Z] ERROR (elastic-apm-node): APM Server transport error (ECONNREFUSED): connect ECONNREFUSED 127.0.0.1:8200
[2021-03-29T21:42:11.879Z] ERROR (elastic-apm-node): APM Server transport error (ECONNREFUSED): connect ECONNREFUSED 127.0.0.1:8200
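For reference, the backoff grace period in the transport spec linked from the PR description is the square of the reconnect count (capped at 6), in seconds, with some jitter. A rough sketch of that calculation, which lines up with the roughly 4s/9s/16s/25s gaps between the errors above (the exact jitter amount here is an assumption):

// Sketch of the backoff delay per the transport spec referenced in the PR
// description: delay (seconds) = min(reconnectCount, 6) ** 2, plus jitter.
// reconnectCount 0..5 yields roughly 0, 1, 4, 9, 16, 25 seconds.
function backoffDelaySeconds (reconnectCount) {
  const base = Math.pow(Math.min(reconnectCount, 6), 2)
  const jitter = base * (Math.random() * 0.2 - 0.1) // assumption: +/-10%
  return base + jitter
}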

with APM agent, against APM server v7.12.0 in Google Cloud (us-west1)

htop: RES ~150MB

│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev    │ Max    │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼────────┤
│ Latency │ 126 ms │ 145 ms │ 198 ms │ 208 ms │ 150.13 ms │ 21.12 ms │ 410 ms │

APM Agent stats: { apmclient:
   { numEvents: 569401,
     numEventsDropped: 552964,
     numEventsEnqueued: 16085,
     numEventsSent: 15541,
     slowWriteBatch:
      { numOver10Ms: 0,
        encodeTimeMs: 5.624694,
        fullTimeMs: 6.186725,
        numEvents: 32,
        numBytes: 25084 },
     backoffReconnectCount: 0 } }

comparing to client without these fixes

│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev     │ Max     │
├─────────┼────────┼────────┼────────┼────────┼───────────┼───────────┼─────────┤
│ Latency │ 138 ms │ 155 ms │ 458 ms │ 665 ms │ 187.46 ms │ 237.45 ms │ 4876 ms │

htop: RES up to 1600MB

with APM talking to cloud APM server, with default autocannon load

% npx autocannon -c 10 -d 90 -p 1 localhost:3000/
...
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max   │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼───────┤
│ Latency │ 1 ms │ 1 ms │ 5 ms  │ 5 ms │ 1.73 ms │ 1.04 ms │ 25 ms │


APM Agent stats: { apmclient:
   { numEvents: 392594,
     numEventsDropped: 54,
     numEventsEnqueued: 392540,
     numEventsSent: 387102,
     slowWriteBatch:
      { numOver10Ms: 1,
        encodeTimeMs: 2.460121,
        fullTimeMs: 12.261069,
        numEvents: 32,
        numBytes: 25088 },
     backoffReconnectCount: 0 } }

trentm (Member, Author) left a comment:

add some notes for reviewers.

@@ -12,58 +12,6 @@ const assertMetadata = utils.assertMetadata
const assertEvent = utils.assertEvent
const validOpts = utils.validOpts

test('Event: close - if ndjson stream ends', function (t) {
trentm (Member, Author):

note to reviewer: These two tests were identical to the following two. I'm guessing they date from before stream-chopper was in use, but I'm not sure.

this._chopper.destroy()
this._agent.destroy()
cb(err)
}

function onStream (client, onerror) {
trentm (Member, Author):

note to reviewer: A total rewrite of onStream -- now called getChoppedStreamHandler and makeIntakeRequest -- is below (after the _getBackoffDelay() function). There is a large block comment that should mostly explain it.

@trentm trentm marked this pull request as ready for review March 30, 2021 00:16
@trentm trentm requested a review from astorm March 30, 2021 00:16
astorm (Contributor) left a comment:

First review pass done -- nits and questions below.

I'm going to take a second pass after I've given the client a try with a few toy applications, to make sure I understand the flow of things and to look for any unexpected/surprising changes in behavior we might need to address/document.

It's worth saying out loud that overall the new handler returned by getChoppedStreamHandler is more locally complex than the previous onStream handler. However, all we've really done here is surface the complexity that was hidden away under the pump library and give ourselves the necessary control to address the performance issues and edge cases. It's complex code, but it's also no more complex than it needs to be. 👍

.ci/.jenkins_nodejs.yml (resolved)
test/lib/unref-client.js (resolved)
index.js (resolved)
index.js (outdated, resolved)
const intakeResTimeoutOnEnd = client._conf.intakeResTimeoutOnEnd

// `_active` is used to coordinate the callback to `client.flush(cb)`.
client._active = true
astorm (Contributor):

Proforma question -- are we sure there will only be one intake request active at a time? (If not, does the use of client._active introduce the possibility of a race-ish condition if there's any overlap between intake requests?)

trentm (Member, Author):

Yup, the code is designed to only have one intake request active at a time. Otherwise there is a bug. The completePart() function is written to handle all the (documented in its top comment) ways the 3 streams can conclude, and only call StreamChopper's next() when those 3 streams are all cleanly finished or destroyed.

Note that this client._active handling was in the earlier onStream code. However, because the earlier code didn't wait on the completion of the HTTP response from APM server it was possible to have multiple intake requests going at the same time.
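A loose sketch of that coordination pattern (not the actual completePart() code; the helper and part names below are hypothetical): each of the streams involved in one intake request reports its conclusion once, and only when all of them have concluded does the client move on.

// Hypothetical helper illustrating the "wait for all parts to conclude"
// pattern described above -- not the real completePart() implementation.
function makePartTracker (partNames, onAllConcluded) {
  const remaining = new Set(partNames)
  return function completePart (name) {
    remaining.delete(name)
    if (remaining.size === 0) {
      onAllConcluded() // e.g. only now let StreamChopper start the next stream
    }
  }
}

// Usage sketch with three illustrative parts.
const completePart = makePartTracker(
  ['gzipStream', 'intakeReq', 'intakeRes'],
  function () { /* safe to begin the next intake request */ }
)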

index.js (resolved)
index.js (resolved)
index.js (resolved)
index.js (resolved)
@trentm trentm requested a review from astorm March 31, 2021 21:10
trentm (Member, Author) commented Mar 31, 2021

Testing notes:

astorm (Contributor) left a comment:

One last question before stamping this -- do you have any thoughts on whether we want to log a periodic message if/when the client ends up dropping spans?

I suspect there are going to be some span-heavy applications/services that, with the previous version of the agent, didn't run into significant performance issues, but that will now end up hitting the span limit and dropping spans. Do you think we need logging in the agent for this, or will the changelog(s) and documentation for maxQueueSize be enough?

index.js (resolved)
astorm (Contributor) left a comment:

👍 thanks for talking those through. GHA tests are passing, and the jenkins failures (per previous comments) appear to be related to the code coverage tooling. Approving.

Development

Successfully merging this pull request may close these issues.

Blocking Behavior under Benchmarking Load
3 participants