-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tracing Context Preservation (async/await and various of libraries) #60
Comments
We're having a similar conversation in opentracing/opentracing-javascript#112. Since the purpose and the concerns are the same, I'm thinking maybe you would be interested to join the conversation and maybe this can be made as a library that is not specific to either OpenTracing or AWS X-Ray. Thoughts? P.S.: there is a know memory leak in |
I've created an experimental branch to test out cls-hooks. The branch is currently published to npm using the |
hey @chrisradek - do you expect this experimental release to work? I can't seen to pinpoint the issue but subsegments are missing, sometimes they seem to show up a minute or so after the x-ray trace, but other times never. |
Edit: rewriting this comment for brevity: It looks like the last subsegment is not transmitted until the next trace. Consider the following code: 'use strict';
const AWSXRay = require('aws-xray-sdk');
AWSXRay.capturePromise();
module.exports.handler = async () => {
for (let i = 0; i < 10; i++) {
console.log('loop', i);
await new Promise(resolve => {
AWSXRay.captureAsyncFunc(`loop ${i}`, async subsegment => {
await new Promise(resolve => {
setTimeout(resolve, 100);
});
subsegment.close();
resolve();
});
});
}
return 'Done';
}; Logs look like this: First invoke:
Second invoke (you can see the first thing it does is send the last subsegment for trace
|
@dougmoscrop By chance, have you seen this issue happening outside of Lambda functions? |
Sorry, I don't have any non-Lambda based use of X-Ray. I think if anything this can make it appear like the experimental branch isn't working although It seems a separate issue from actual async/await support. |
Any update on this? AWS Lambda has had a Node.js 8 runtime since 4/2/18. When is this SDK going to have a stable release that supports async/await? |
I haven't had time to really dive in to it, but after deploying the experimental branch, our traces still lack hierarchy too. It looked like it was working but wasn't. I realize this is a crappy/unhelpfully vague comment, but yeah this incompatibility between AWS services is being woefully under-addressed :( |
@dougmoscrop |
Sure, here's an example: This is a Lambda processing a Kinesis stream. It looks up a "pointer" from Dynamo to the current S3 key then it updates that object with the new records, saves it as a new S3 object, then moves the Dynamo pointer to the new S3 key (key is a uuid), then deletes the old one. So it's We open a custom segment for this workflow called Context View. But most of the S3/Dynamo operations do not appear as children of it. You can see that they 'belong' to the Context View because it spans the entire time of all the operations. However, the records in our stream are 'grouped' in to separate Context Views depending on some property; so while the attached image shows only one, there could be a dozen or a hundred depending on what's in the Kinesis stream. When this happens, it's much less clear which sub-segments correlate to which job. What I would expect is that simply all the S3 and Dynamo calls you see in the above screenshot show up as children of Context View. Edit: Oh and then there are times where there's a parent-child hierarchy that is incorrect, such as: PutObject happens entirely after GetObject, not within it as a call. The gap you see is our "business logic" as I mentioned above, so its like this:
But it shows up as if s3.get was calling s3.put Stuff like this makes it really hard to know what to believe, like seeing S3 taking 100+ms to get an object that is probably a kilobyte at most; we checked with AWS support and S3 logs show response times more like 13ms. It could totally be a bug in the SDK or the way we're calling it, but it makes us wonder if it's just doing accounting wrong. For that issue specifically, we can start instrumenting our code ourselves, but that kinda compromises the X-Ray value prop. |
Hello 👋 I want to be able to log the time it takes for an async function to return. I tried the latest version of
Here's my code const AWSXRay = require('aws-xray-sdk');
AWSXRay.captureAWS(require('aws-sdk'));
AWSXRay.middleware.enableDynamicNaming();
// This sleep function is to demonstrate an async function
function sleep(ms) {
return new Promise((resolve, reject) => {
setTimeout(resolve, ms);
});
}
function captureAsync(func, args) {
return new Promise((resolve, reject) => {
AWSXRay.captureAsyncFunc(func.name, (seg) => {
// The seg is `undefined` here
func.apply(this, args)
.then((result) => {
seg.close();
resolve(result);
})
.catch((error) => {
seg.close(error);
reject(error);
});
});
});
}
const trace = (handler) => async (req, res) => {
const open = AWSXRay.express.openSegment('some-name');
if (!res.req) {
// This was necessary for another error
res.req = req;
}
open(req, res);
// Throws in `captureAsync`
await captureAsync(sleep, [1000]);
const handlerResponse = await handler(req, res);
const close = AWSXRay.express.closeSegment();
close(req, res);
return handlerResponse;
}
module.exports = { trace }; Am I using this API incorrectly or is this the same bug others are experiencing? |
@styfle |
@chrisradek No lambda. This is running on my own machine in a Docker container. |
@styfle I roughly converted your example to the following: const trace = require('./trace').trace;
const AWSXRay = require('aws-xray-sdk');
const express = require('express');
const app = express();
const handle = trace(console.log);
const open = AWSXRay.express.openSegment('some-name');
const close = AWSXRay.express.closeSegment();
app.use(open);
app.get('/', handle);
app.use(close);
app.listen(3001); I also removed the open/close segment logic from your trace handler: const trace = (handler) => async (req, res) => {
if (!res.req) {
// This was necessary for another error
res.req = req;
}
await captureAsync(sleep, [1000]);
const handlerResponse = await handler(req, res);
return handlerResponse;
} It looks like we currently require the This issue might be specific to our express middleware if it doesn't handle multiple async middlewares in the chain, so we should open another issue to track that if this is the case. |
@chrisradek Thanks for the response! 😃 I'm not using express, I'm using micro which looks a lot like the Lambda API. Maybe the missing |
@styfle You can also instrument your code manually. Here's an example of what that could look like, though I haven't used micro so not sure if there are any problems with this approach: const AWSXRay = require('aws-xray-sdk');
async function example(foo, bar) {
const namespace = AWSXRay.getNamespace();
const segment = new AWSXRay.Segment('some-name');
return await namespace.runAndReturn(async () => {
AWSXRay.setSegment(segment);
await captureAsync(sleep, [1000]);
await sleep(100);
segment.close();
return 'done';
});
}
// This sleep function is to demonstrate an async function
function sleep(ms) {
return new Promise((resolve, reject) => {
setTimeout(resolve, ms);
});
}
function captureAsync(func, args) {
return new Promise((resolve, reject) => {
AWSXRay.captureAsyncFunc(func.name, (seg) => {
// The seg is `undefined` here
func.apply(this, args)
.then((result) => {
seg.close();
resolve(result);
})
.catch((error) => {
seg.close(error);
reject(error);
});
});
});
}
example('a', 'b'); This example creates a segment everytime |
@chrisradek You are amazing!!!!! 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 I spent hours trying to figure this one out and your solution (with a minor modification) worked 😄 Now the only thing missing is the URL disappeared from the console 🤔 |
@chrisradek On second thought, there are still problems when calling capture more than once. You can use this repository to reproduce the issue https://github.com/styfle/aws-xray-bug Would you like me to create a new issue here or is this thread fine? |
@styfle |
Hey @chrisradek - would it be helpful if we opened an issue with our enterprise support? Or is this thread enough? |
Also interested in |
@okonon Agreed. Proper context propagation is still a difficult problem to solve in the JS world. The discussion is also present on the OpenTracing github page, and it seems zone.js will not fix the async/await issue. A experimental branch is available, however, cls-hooked still has known issues, and does not appear to be in active development at this time. |
I have the same issue as @dougmoscrop that the last subsegment is not transmitted until the next trace. I tried both the ordinary and the experimental releases, and both of them have the same issue. My code is deployed as a lambda function, using the serverless framework for packaging and deployment. |
Sorry for the issues, this is a known issue with the Lambda Async handler. We are working internally to have it fixed. |
Now that DataDog, Honeycomb, NewRelic, Stackdriver, and even Bluebird are all leveraging |
@awssandra any update on the async problem with not sending the last segment? :-( x-ray is amazing but is not useable this way in production environments |
Also interested in the update regarding this issue |
I wonder what's the current status of this. I'm very interested in this issue #4 (Koa support) I see that some PRs were merged but I'm not sure if everything needed to support koa is already done |
This issue supporting tracing of async/await calls has been resolved using the
This issue will be closed when 3.0.0 is released, which is tracked in issue #157. As for explicitly supporting the Koa framework, that has not been prioritized yet on our roadmap. |
Any updates here? |
For Koa support, please post on #4. At this time we don't have the bandwidth to add support for the Koa framework ourselves, but are always open to PRs. |
Closing this issue since the use of |
Goal
The goal of this issue is to track all tracing context preservation related problems, dive deep on why some of them work and some of them don't, discuss and gather feedback for a way going forward to provide much better customer experience with X-Ray node tracing and further contribute/shape the node context propagation in general.
Problem Statement
The Node.js event loop model has its downsides. One of those is the context preservation through the call chain within a single http request. It is hard to bind arbitrary data (user name, transaction id, tracing context etc) to a full request/response cycle and all relevent context stops at event loop. It's hard to know what is the context for the next callback being executed by event loop.
There is a library continuation-local-storage (CLS for short) that solved most of the problem by binding request context to callbacks. However, this library has two limitations:
Challenges for X-Ray SDK for Node.js
With all the context preservation issues listed above, they are solvable for most of the node application owners who need to preserve contextual data within requests.
cls-*
for explicitly binding context to callbacks in a particular module. Here is an example fork fornode-redis
that patches Redis driver for context preservation.These workarounds are not ideal but work for each service owner as each service only has limited dependency and the owner knows what dependencies the service depends on. It's inevitable and necessary work for service owners to understand this problem space and make necessary code changes since it is business data (user name, transaction id etc) through a request/response cycle.
But for X-Ray SDK one of the core functionalities is to preserve the tracing context through the async call chain and the SDK should abstract this technical challenge away from users who just want all traced functions work once they import the SDK. This issue #28 has some detailed discussion regarding this concern. There are also limitations and known issues on
cls-hooked
one of which is it doesn't work with some certain libraries (example) while the X-Ray SDK should try not to have the SDK users to deal with these caveats.Going Forward
(This section will likely be updated more frequently based on both internal and external discussions and feedback.)
The SDK should re-evaluate CLS as its core dependency and leverage the new tools (cls-hooked or zone.js) for preserving context for most of the major use cases today as well as being flexible enough to be forward compatible with new coming features.
A short-term solution could be to start with an experimental branch that replaces CLS with
cls-hooked
orZone.js
and thorough testing with existing supported libraries and new use cases we intend to cover. Or an alternative approach is to have something in user code withas suggested by @nakedible-p but the issue is that it makes the bundle size more bloated when the SDK is a dependency. Additionally, these libraries are not owned by AWS X-Ray and some critical fixes/supports for the library from X-Ray's prospective might not be as critical for those libraries' owners.
Long Term: We will share this portion with broader audience after internal discussions. Any suggestion is welcome.
The text was updated successfully, but these errors were encountered: