Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new "run context" management to fix numerous span hierarchy issues #2181

Merged
merged 97 commits into from Oct 27, 2021

Conversation

trentm
Copy link
Member

@trentm trentm commented Aug 6, 2021

This is a big change. Thanks for reviewing it. If you are just here to see what behaviour has changed, scroll down to the "Behaviour changes" section.

Suggestions for reviewing

  • Consider looking at https://github.com/elastic/apm-agent-nodejs/pull/2181/files?w=1 which ignores whitespace diffs. At least the change in s3.js will be much smaller.

  • Consider doing a first pass that checks [ ] Viewed for all test files whose only changes are:

    1. this API change:
      -  var trans = ins.currentTransaction
      +  var trans = ins.currTransaction()
      
    2. or this:
      -  agent._instrumentation.currentTransaction = null
      +  agent._instrumentation.testReset()
      

    .testReset() is the explicit mechanism for tests to reset run context tracking and still (mostly) be able to re-use the Instrumentation in the same test process.

  • I've tried to add a "REVIEW NOTE: ..." comment in a number of places in the diff to given rationale for changes.

  • The instrumentation changes I've made to "lib/instrumentation/modules/aws-sdk/s3.js" is the best example of using the new run context handling. I plan to make some similar improvements to some other instrumentations in subsequent changes.

Terminology

I've use the term "run context" for what we've been tending to call "context" or "async context". A bare "context" is ambiguous -- we have (a) the "context" object on span and transaction objects sent to APM server, (b) trace-"context" for the W3C trace-context headers/data, (c) lambda's "context" object. I avoided using "async context" because we also track context through synchronous execution.

Overview

This PR makes a number of changes to the Instrumentation API that is used by the instrumentations under "lib/instrumentation/modules/". Tracking the "current transaction" and "current span" is now entirely encapsulated in tracking the current "run context" (see RunContext.js) via a "run context manager" singleton (agent._instrumentation._runCtxMgr). This is the thing that uses async_hooks (AsyncHooksRunContextManager) or not (BasicRunContextManager) to track the run context for the currently running JS. They use an interface (and some implementation) very similar to OTel's ContextManager.

The primary reason our context management can't be closer to OTel's is because our apm.startTransaction(...), apm.startSpan(...), span.end(), et al APIs change the current run context. IIUC, this is pretty odd as current APM APIs go. It is also reponsible for some confusion and bugs around parent/child relationships. More typical is an API that wraps a block to run in the context of a new span. Something like the following (much like the withSpan util Dario added to Kibana):

apm.withSpan({ name: 'mySpan' }, (span) => {
  // ... user code here.
  span.end()
})

Eventually we should add an API like this.

Here is brief summary of the parts of the Instrumentation API used for run context tracking:

  • Part of run context tracking is handled by an async-hook (or eventually other mechanisms) tracking new async tasks. The rest of run context tracking is in explicit calls to "bind" a function call to a particular run context.

    ins.bindEmitter(ee)   // (unchanged) bind added event handlers to the curr run context
    ins.bindFunction(fn)  // (unchanged) bind fn to the curr run context
    ins.bindFunctionToRunContext(rc, fn)  // bind fn to a specific run context
    ins.bindFunctionToEmptyRunContext(fn) // an odd ball used to explicitly break run context
    ins.withRunContext(rc, fn, thisArg, ...args)
        // Equivalent to binding `fn` then calling it, but with less overhead.
  • Creating and ending transactions and spans:

    ins.startTransaction(...) -> trans  // (unchanged)
    ins.startSpan(...) -> span          // (unchanged)
    ins.createSpan(...) -> span         // Create span, but don't change the current run context.
    ins.addEndedTransaction(trans)      // (unchanged)
    ins.addEndedSpan(trans)             // (unchanged)
  • Getting and working with run contexts:

    ins.currRunContext() -> rc          // Get the current run context.
    // The following are mostly used internally by above methods.
    ins.supersedeWithTransRunContext(trans)
    ins.supersedeWithSpanRunContext(span)
    ins.supersedeWithEmptyRunContext()

Behaviour changes

First, this PR introduces no changes to the public API.
There are, however, the following changes in behaviour.

  1. If user code creates span A, then creates span B in the same async task: B will be a child of A.

    Before this change this was not the case. See startSpan('A'); startSpan('B') in the same async op: B should be a child of A, but isn't #1889 for discussion. This is why the HTTP span(s) under an ES span is not a child of the ES span; ditto for aws-sdk instrumentation. The "nested spans" case in "test/instrumentation/index.test.js" tests this case.

                                // BEFORE               AFTER
    apm.startTransaction('t0')  // transaction 't0'     transaction 't0'
    apm.startSpan('s1')         // |- span 's1'         `- span 's1'
    apm.startSpan('s2')         // `- span 's2'           `- span 's2'

    Fixing this, also lays the ground work for excluding the HTTP span under ES and aws-sdk spans (S3, SQS); also for exit span handling.

  2. Before this PR, an ended transaction would linger as apm.currentTransaction. Not any more. The 'errors should have a transaction id - ended transaction' test case was removed from "test/instrumentation/index.test.js" for this. This test was added in Add transaction.id to all errors #147. My read of that is that there is no need to associate an error with a transaction if it is captured after the transaction has ended. In other words, I don't think this is a breaking change.

Not a behaviour change, but a possibly significant bug fix is the new implementation of ins.bindEmitter. It fixes an issue where removeListener would fail to actually remove if the same handler function was added to multiple events. The 'removes listeners properly' test case in "test/instrumentation/index.test.js" tests this.

Checklist

  • Implement code
  • Add tests
  • Try this with Kibana (WFM with a local dev build of 7.15), perhaps engage Dario who has played with that much more than I.
  • Do some sanity performance/load testing
  • Add CHANGELOG.asciidoc entry (I'm not commiting this yet to avoid merge conflicts while this PR is being reviewed):
* A significant change was made to internal run context tracking (a.k.a. async
  context tracking). There are no configuration changes or API changes for
  custom instrumentation. ({pull}2181[#2181])
+ 
One behavior change is that multiple spans created synchronously (in the same
async task) will form parent/child relationships; before this change they would
all be siblings. This fixes HTTP child spans of Elasticsearch and aws-sdk
automatic spans to properly be children. ({issues}1889[#1889])

@trentm trentm self-assigned this Aug 6, 2021
@trentm trentm added this to Planned in APM-Agents (OLD) via automation Aug 6, 2021
@github-actions github-actions bot added the agent-nodejs Make available for APM Agents project planning. label Aug 6, 2021
@apmmachine
Copy link
Collaborator

apmmachine commented Aug 6, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-10-26T22:47:02.196+0000

  • Duration: 94 min 17 sec

  • Commit: 2406627

Test stats 🧪

Test Results
Failed 0
Passed 2
Skipped 8
Total 10

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • run module tests for <modules> : Run TAV tests for one or more modules, where <modules> can be either a comma separated list of modules (e.g. memcached,redis) or the string literal ALL to test all modules

  • run benchmark tests : Run the benchmark test only.

The other half is this in elastic-apm-http-client:
    --- apm-nodejs-http-client/index.js	2021-06-09 15:57:41.000000000 -0700
    +++ node_modules/elastic-apm-http-client/index.js	2021-08-09 15:15:37.000000000 -0700
    @@ -205,9 +205,11 @@
       switch (this._conf.serverUrl.protocol) {
         case 'http:':
           this._transport = http
    +      this._uninstrumentedRequest = http.request[this._conf.originalSym]
           break
         case 'https:':
           this._transport = https
    +      this._uninstrumentedRequest = https.request[this._conf.originalSym]
           break
         default:
           throw new Error('Unknown protocol ' + this._conf.serverUrl.protocol)
    @@ -794,7 +796,8 @@
         }

         // Start the request and set its timeout.
    -    const intakeReq = client._transport.request(client._conf.requestIntake)
    +    console.warn('XXX making intake request', client._conf.requestIntake.path)
    +    const intakeReq = client._uninstrumentedRequest(client._conf.requestIntake)
         if (Number.isFinite(client._conf.serverTimeout)) {
           intakeReq.setTimeout(client._conf.serverTimeout)
         }
…orks (see examples/parent-child.js case)

We will have to do the same for other aws-sdk, for ES, and evaluate for
other instrumentations. Note this is an improvement over current
behaviour which doesn't handle cases like the parent/child reln above
anyway.
@trentm trentm moved this from Planned to In Progress in APM-Agents (OLD) Aug 18, 2021
need to use setTimeout to avoid the race, though it *is* slower
running tests. Just a ~15s test run, whcih is fine for one test
file.
…ripped me up on test/sanitize-field-names tests which assume that done() call is immediate
@trentm trentm changed the title feat: new "run context" management to fix numerous span hierarchy issues new "run context" management to fix numerous span hierarchy issues Sep 30, 2021
Copy link
Contributor

@astorm astorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a few passes at this and left some comments/questions, but overall it scans, is well thought out, and looks like it'll work. 👍 There's a few additional "real world" tests I intend to perform for an extra level of certainty, but I want to get this review out now instead of later :)

if (!this._started) {
return null
}
return this._runCtxMgr.active().currTransaction() || null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Deep calls here (_runCtxMgr.active()) should be safe. _runCtxMgr is always set if the instrumentation is started, and the base class active method of a context manager will always return an object.

// interface.
//
// (This class has largerly the same API as @opentelemetry/api `ContextManager`.
// The implementation is adapted from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: oTel's license is Apache-2 and we're BSD-2. Do we want/need to include a copy of the Apache license along with these files, or is the code unique enough that we don't need to do this?

Copy link
Member Author

@trentm trentm Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add an entry to NOTICE.md following the pattern there. I did not include the full Apache-2 license text, however, because that struck me as redundant with the link to it, and the fact that the apache license text is very large and would grokking the notice file more difficult. I'd be fine including the full text, however.
https://github.com/elastic/apm-agent-nodejs/pull/2181/files?w=1#diff-53337bf7ea91d782c573ac8653c18f95b7b23c38da9de08288639bcec9993406

// A run context manager that uses an async hook to automatically track
// run context across async tasks.
//
// (Adapted from https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-context-async-hooks/src/AsyncHooksContextManager.ts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: oTel's license is Apache-2 and we're BSD-2. Do we want/need to include a copy of the Apache license along with these files, or is the code unique enough that we don't need to do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See NOTICE.md

// "patch-async.js" it does an adequate job of context tracking for much of the
// core Node.js API.
//
// (Adapted from https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-context-async-hooks/src/AsyncHooksContextManager.ts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: oTel's license is Apache-2 and we're BSD-2. Do we want/need to include a copy of the Apache license along with these files, or is the code unique enough that we don't need to do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See NOTICE.md

// anything for other code bound to the original RunContext (e.g. via
// `ins.bindFunction` or `ins.bindEmitter`).
//
// RunContext is roughly equivalent to OTel's `Context` interface in concept.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: oTel's license is Apache-2 and we're BSD-2. Do we want/need to include a copy of the Apache license along with these files, or is the code unique enough that we don't need to do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See NOTICE.md

@@ -110,7 +109,3 @@ function unwrap (nodule, name) {
return nodule[name][symbols.unwrap]()
}
}

function isWrapped (wrapped) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that isWrapped is an exported function from lib/instrumentation/shimmer.js, and removing it will cause things to break if someone's directly using our shimmer. The shimmer isn't listed as a formal API, but its importance in custom instrumentation means people may be using it.

Question: Lacking clear evidence either way -- do you think this is safe to remove, or should we leave it here out of an abundance of caution?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really want an abundance of caution to dictate not being able to remove undocumented internal APIs. The same could be said for all the API changes I'm making here on the Instrumentation object, no?

This isWrapped method was added in #429 solely for a fix to a single version of mimic-response (version 1.0.0). The issue was fixed in mimic-response@1.0.1, which was released in 2018.

I'd like to say that if someone is doing a custom instrumentation and then adding it via apm.addPatch(), that they should bring along their own shimmer.js and not use the private one.

How strongly do you stand on this? :)

} else {
this._runCtxMgr = new BasicRunContextManager(this._log)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOP Question: Should loading the patch-async module (require('./patch-async')(this)) be the responsibility of the context manager object, or is there value to having this happen seperately? (I'm thinking a PatchAsyncRunContextManager class that extends the BasicRunContextManager class and calls require('./patch-async')(this) in its constructor (after calling super())

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a reasonable thing to add, yah. I'd prefer to do it in a separate change, or at least not unless there is other work we want to do this PR. Note that I do intend to come back to this area later to add a RunContextManager based on Async Local Storage.

if (typeof target !== 'function') {
return target
}
// this._log.trace('bind %s to fn "%s"', runContext, target.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: stray comment or deliberate debugging code left in place?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deliberate. When debugging context propagation issues, this debug print is sometimes useful to understand some of the mechanics. However, it is waay too noisy to leave in by default, even at 'trace' level. Also the use of target.name isn't obvious to reproduce from memory later.

If you prefer, I could move this out to this comment in DEVELOPMENT.md.

@astorm
Copy link
Contributor

astorm commented Oct 12, 2021

@trentm I've finished the last of the real world smoke testing I wanted to do. One possible thing of interest -- I noticed was that with a handlebars template we went from having the spans listed async to blocking in the UI.

Before:

before

After:

after

I believe this behavior is within expectations and more accurately reflects the behavior of the wrapped compile method, but is something customers may ask about.

still a 👍 /approval from here.

@trentm
Copy link
Member Author

trentm commented Oct 12, 2021

I noticed was that with a handlebars template we went from having the spans listed async to blocking in the UI.

@astorm Could this possibly have been due to #2326 (which is merged to master, but not yet in a release)? I.e. could your "before" code have been the latest release rather than the latest master?

@trentm trentm added this to the 7.16 milestone Oct 15, 2021
@trentm
Copy link
Member Author

trentm commented Oct 19, 2021

@dgieselaar If/when you have a chance, I would love to get your feedback on whether this branch of the APM agent seems to work for you for self-APM-tracing of Kibana. The following are my notes from playing with using this branch to trace a dev build of the current 7.x branch of Kibana. No problem if you don't have time. My hope is to merge this later this week.


Some Kibana self-APM playing with this:

  1. checkout the 7.x branch of kibana.git and make this diff:
diff --git a/package.json b/package.json
index 76177b90191..22d1ba1c0e7 100644
--- a/package.json
+++ b/package.json
@@ -221,7 +221,7 @@
     "deep-freeze-strict": "^1.1.1",
     "deepmerge": "^4.2.2",
     "del": "^5.1.0",
-    "elastic-apm-node": "^3.21.1",
+    "elastic-apm-node": "git://github.com/elastic/apm-agent-nodejs.git#trentm/run-ctx-mgr",
     "execa": "^4.0.2",
     "exit-hook": "^2.2.0",
     "expiry-js": "0.1.7",

and create a "config/apm.dev.js" with content like this:

module.exports = {
  active: true,
  centralConfig: false,
  serverUrl: 'https://trentm-play5.apm.us-west1.gcp.cloud.es.io',
  breakdownMetrics: false,
  transactionSampleRate: 1,  // Note I'm going for no sampling here. Often this is reduced to 0.1 or something for fear APM perf impact may be large.
  metricsInterval: '120s',
  captureHeaders: false,
  captureBody: 'off',
  logUncaughtExceptions: true,
  captureSpanStackTraces: false,
  secretToken: '[REDACTED]'
}
  1. Run a 7.16 ES and APM server for our Kibana-under-test to use via apm-integration-testing:
./scripts/compose.py start 7.16 --no-kibana \
    --apm-server-opt="apm-server.kibana.enabled=true" \
    --apm-server-opt="apm-server.kibana.host=172.17.0.1:5601"
  1. Build and run:
nvm use
yarn kbn clean
yarn kbn bootstrap
ELASTIC_APM_LOG_LEVEL=debug yarn start --no-base-path --no-watch | ecslog
  1. I started a "esapp.js" play app that makes an Elasticsearch query when you call its "GET /" endpoint. I loaded that up via hey -c 10 -q 100 -z 60s http://localhost:3000/ to get some APM data. Then I clicked around in APM UI in the Kibana under test. For example I got this error:
    Screen Shot 2021-10-18 at 5 06 23 PM

  2. Then I poked around in the Kibana in my "trentm-play5" cloud.elastic.co deployment to look at the Kibana-under-test. For example the trace with the above error is:

Screen Shot 2021-10-18 at 5 16 06 PM

The only real thing to notice here is that the parent/child relationship for the HTTP spans under ES spans is handled (this is #1963) Separate work, i.e. a different PR, will remove those HTTP spans.

@dgieselaar
Copy link
Member

@trentm Love to see the progress here, will try to have a look tomorrow. Thanks for doing this!

@dgieselaar
Copy link
Member

@trentm some quick observations:

  • ES/HTTP child relationships are fixed!
  • I no longer need to run the callback in a new async context to prevent other spans that are created becoming siblings of the span that wraps the callback
  • I do still have to await an empty promise before creating a new span, otherwise ended spans will end up being the parent of spans that are created immediately after

@trentm
Copy link
Member Author

trentm commented Oct 25, 2021

@dgieselaar Thanks for looking!

  • ES/HTTP child relationships are fixed!

  • I no longer need to run the callback in a new async context to prevent other spans that are created becoming siblings of the span that wraps the callback

Good. That is expected, assuming I understand you correctly.

  • I do still have to await an empty promise before creating a new span, otherwise ended spans will end up being the parent of spans that are created immediately after

My understanding is this is the same issue as you described earlier in #1964
My expectation is that this is fixed by this PR, but there is a subtlety.

I expect that: ending span A, then starting span B in the same async task, should result in span B not being a child of A.

spanA.end()
apm.startSpan('B')   // B is *not* a child of A

However, the possible subtlety is: if span B is started in an async task that was created when span A was still active, then span B will be a child of A, even if A has ended by the time that async task runs.

var spanA = apm.startSpan('A')
setImmediate(function () {
  var spanB = apm.startSpan('B')   // B *is* a child of A, even though A ended before this async task ran.
})
spanA.end()

@dgieselaar
Copy link
Member

dgieselaar commented Oct 26, 2021

@trentm I think it happens in this scenario:

async function getData ( ) {
	await firstFetch();

	await secondFetch();
}

function firstFetch ( ) {
	return withApmSpan('first_fetch', Promise.all([
		esClient.search(),
		esClient.search()
	]);
}

function secondFetch ( ) {
	return esClient.search();
}

E.g., for this route, we end up with get_latency_charts as a child of setup_request, even though the latter is a blocking request that happens before the former.

Is that expected?

@trentm
Copy link
Member Author

trentm commented Oct 26, 2021

@dgieselaar Thanks very much for the scenario. I believe I've reproduced and understand roughly what is happening.

tl;dr: Yes, this (still requiring the await Promise.resolve() before agent.startSpan() in withSpan()) is expected with this PR. I plan to have a follow-up PR that adds a withSpan API to the APM agent that will do the "right" thing using internal run-context APIs. If that works as I hope, then Kibana's withSpan() implementation should be able to change to a small shim over the APM agent withSpan().

My sample code is at https://gist.github.com/trentm/6f118e2abb8188b8fe3d92cd1d7cbd8f if you care. The beginnings of the apm.withSpan() implementation is https://gist.github.com/trentm/6f118e2abb8188b8fe3d92cd1d7cbd8f#file-dario-js-L16-L52

Running the "dario.js" sample code using trentWithSpan results in a trace with parent/child relationships as follows:

    transaction 7741b0 "t"
    `- span 2d65cd "first_fetch"
      `- span c21999 "ES1"
      `- span a78358 "ES2"
    `- span f97746 "second_fetch"
      `- span 9c3cac "ES3"

@dgieselaar
Copy link
Member

@trentm: perfect! looks like your version will work with sync functions too (ie a function that does not return a promise), is that correct?

@trentm
Copy link
Member Author

trentm commented Oct 27, 2021

looks like your version will work with sync functions too (ie a function that does not return a promise), is that correct?

@dgieselaar That is the plan. I haven't tested that yet, though.

@trentm trentm merged commit cf0708c into master Oct 27, 2021
@trentm trentm deleted the trentm/run-ctx-mgr branch October 27, 2021 16:35
APM-Agents (OLD) automation moved this from In Progress to Done Oct 27, 2021
@apmmachine
Copy link
Collaborator

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-10-27T16:21:40.525+0000

  • Duration: 21 min 43 sec

  • Commit: 7972658

Test stats 🧪

Test Results
Failed 0
Passed 22
Skipped 0
Total 22

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • run module tests for <modules> : Run TAV tests for one or more modules, where <modules> can be either a comma separated list of modules (e.g. memcached,redis) or the string literal ALL to test all modules

  • run benchmark tests : Run the benchmark test only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agent-nodejs Make available for APM Agents project planning.
Projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants