
Spec that agents in Lambda should *not* do back-off #613

Closed
trentm wants to merge 1 commit

Conversation

@trentm (Member) commented Mar 9, 2022

This tweaks the Lambda and transport specs to say that APM agents in Lambda should not implement back-off for repeated failing intake requests.

Motivation

If a user configures a Lambda function with one of the agents and sets up the envvars for the APM Lambda extension, but does not include the extension layer in their Lambda function, then the APM agent will get errors attempting to send to the local extension. If the agent implements back-off on repeated intake request errors, it will get into a state where it is delaying intake requests. That could interfere with its `?flushed=true` signalling to the extension at the end of each invocation.
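
For illustration, a minimal sketch of that signalling and how back-off could interfere with it (the endpoint, names, and wiring here are hypothetical, not the actual agent API):

```ts
// Hypothetical sketch of end-of-invocation signalling to the local APM
// Lambda extension. The endpoint, names, and defaults are illustrative,
// not the actual agent API.
const EXTENSION_INTAKE = "http://localhost:8200/intake/v2/events";

let backoffUntil = 0; // timestamp (ms) until which intake requests are delayed

async function signalInvocationDone(ndjsonEvents: string[]): Promise<void> {
  const delay = backoffUntil - Date.now();
  if (delay > 0) {
    // If the agent honours its usual back-off here, the ?flushed=true
    // request is postponed -- possibly past the end of the invocation,
    // which is exactly the interference described above.
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  await fetch(`${EXTENSION_INTAKE}?flushed=true`, {
    method: "POST",
    headers: { "Content-Type": "application/x-ndjson" },
    body: ndjsonEvents.join("\n"),
  });
}
```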

This should be relatively minor, because (a) it is a configuration error and (b) if the extension is missing, it obviously won't be forwarding APM data anyway. However, at least with the Node.js APM agent it can lead to the user's Lambda function returning null instead of its actual response: elastic/apm-agent-nodejs#2598. IOW, the broken APM agent is causing harm.

In general, if the extension is missing or is frequently erroring on its intake API endpoint, it isn't the responsibility of the APM agent to back off. The point of APM agent back-off (if I understand correctly) is to avoid overloading the APM Server, especially when it is responding with "queue is full" -- i.e. backpressure. However, because the extension forwards APM data asynchronously, the APM agent doesn't get the actual responses from the APM Server, so it can't meaningfully handle backpressure. It is, or should be, the responsibility of the extension to handle buffering and back-off.
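
To make the proposed behaviour concrete, here is a hedged sketch of what "no back-off in Lambda" could look like, loosely based on the grace-period formula in the transport spec (the function name and signature are illustrative):

```ts
// Sketch of the behaviour this spec change proposes (illustrative only):
// skip back-off entirely when the agent is talking to the local Lambda
// extension, because the extension owns buffering and back-off.
function nextIntakeDelayMs(reconnectCount: number, inLambda: boolean): number {
  if (inLambda) {
    return 0; // no back-off: the extension handles backpressure
  }
  // Otherwise, the usual grace period against a real APM Server:
  // min(reconnectCount, 6)^2 seconds with +/-10% jitter.
  const graceSeconds = Math.pow(Math.min(reconnectCount, 6), 2);
  const jitter = 1 + (Math.random() * 0.2 - 0.1);
  return graceSeconds * jitter * 1000;
}
```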

Checklist for a small enhancement

  • Create PR as draft
  • Approval by at least one other agent
  • Mark as Ready for Review (automatically requests reviews from all agents and PM via CODEOWNERS)
    • Remove PM from reviewers if impact on product is negligible
    • Remove agents from reviewers if the change is not relevant for them
  • Merge after 2 business days passed without objections
    To auto-merge the PR, add /schedule YYYY-MM-DD to the PR description.

@trentm self-assigned this Mar 9, 2022
@apmmachine (Collaborator) commented Mar 9, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-03-21T05:58:00.840+0000

  • Duration: 3 min 49 sec

@AlexanderWert (Member) left a comment


Great clarification of the transport behavior!

@felixbarny (Member)

I was originally worried that a backoff would imply that the flush blocks until the end of the backoff or until the flush times out.
I now remember what the Java agent already does to avoid this situation: if there's a backoff, the flush call returns immediately. Also, there's a configurable timeout for how long a call to flush can block at most, which defaults to 1s.

A downside of completely disabling the backoff is that if all attempts to connect to the extension fail (for example because the extension layer is not installed), each individual event will cause the agent to attempt creating a new connection. This will lead to more errors (thus, potentially more verbose logging) and overhead due to the repeated attempts to establish a connection.

Therefore, we may rather want to standardize the flush semantics (return immediately on backoff, configurable timeout).
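
A minimal sketch of those standardized flush semantics, loosely modelled on the Java agent behaviour described above (class and method names are hypothetical):

```ts
// Hypothetical sketch of standardized flush semantics: return immediately
// while in a back-off/grace period, otherwise block for at most a
// configurable timeout (1s default, as in the Java agent per the above).
class Transport {
  private backoffUntil = 0;
  private flushTimeoutMs = 1000;

  async flush(): Promise<boolean> {
    if (Date.now() < this.backoffUntil) {
      // In back-off: don't block the Lambda invocation until the
      // back-off elapses; report that nothing was flushed.
      return false;
    }
    // Wait for in-flight data, but never longer than the configured timeout.
    return this.waitForDrain(this.flushTimeoutMs);
  }

  private async waitForDrain(timeoutMs: number): Promise<boolean> {
    // Stub: a real transport would send buffered events here and
    // resolve false if timeoutMs elapses first.
    return true;
  }
}
```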

As an additional benefit, this keeps the APM Server Sender logic within agents simple by not having two different backoff strategies for lambda/non-lambda. Having said that, looking at elastic/apm-nodejs-http-client#180, it's not really complex.

@trentm (Member, Author) commented Mar 10, 2022

A downside of completely disabling the backoff is that if all attempts to connect to the extension fail (for example because the extension layer is not installed), each individual event will cause the agent to attempt creating a new connection.

Fair. I haven't tried with the Node.js agent, but I think it'll do something similar (with a 20ms bufferWindowTime that attempts to batch events arriving close together).
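
Roughly, that batching looks like the following (a simplified sketch; the real client's buffering is more involved):

```ts
// Simplified sketch of the batching described: events arriving within a
// short window (bufferWindowTime, 20ms here) are coalesced into a single
// intake request instead of opening a connection per event.
class EventBuffer {
  private pending: object[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private windowMs: number,
    private send: (events: object[]) => void,
  ) {}

  add(event: object): void {
    this.pending.push(event);
    if (this.timer === null) {
      this.timer = setTimeout(() => {
        this.timer = null;
        const batch = this.pending;
        this.pending = [];
        this.send(batch); // one request for everything in the window
      }, this.windowMs);
    }
  }
}

// e.g. const buffer = new EventBuffer(20, sendIntakeRequest);
```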

this keeps the APM Server Sender logic within agents simple

It is true that the APM server client for the Node.js agent has a number of subtle tweaks to deal with the differences of the Lambda runtime. That is, this logic is currently not simple in the Node.js agent. (I don't know whether that complexity is somewhat justified by the indirect use of the "beforeExit" event, which effectively watches for an empty event loop to guess when the function invocation is "done".)
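
For context, Node.js emits "beforeExit" when the event loop is about to drain; a sketch of using it as a "done" heuristic (hypothetical wiring; the agent's actual use is indirect and more subtle):

```ts
// 'beforeExit' fires when the event loop has drained, which can serve as
// a heuristic that the Lambda handler's work is finished and it is time
// to flush buffered APM data.
process.on("beforeExit", () => {
  flushApmEvents(); // hypothetical helper, not the real agent API
});

function flushApmEvents(): void {
  // ... send buffered events to the extension before the invocation ends ...
}
```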

@trentm (Member, Author) commented Mar 10, 2022

From the discussion above, and a little bit of chat on the apm-agent-nodejs call, I'll take a look at the Node.js APM agent again to see whether it can reasonably do back-off without impacting the user's Lambda function responses.

Likely the end result is that I'll retract this PR -- Felix gave the argument for why back-off in the agent is still worthwhile -- and either the Node.js agent will figure out a way to safely do back-off in a Lambda environment, or it'll go off spec for this case.

trentm added a commit to elastic/apm-nodejs-http-client that referenced this pull request Mar 10, 2022
This ensures that the APM agent instrumentation in a Lambda
function will not cause a `null` response from the user's handler
if the agent cannot talk to the extension.

Note that the spec at elastic/apm#613 related to this might
yet change.

Refs: elastic/apm-agent-nodejs#1831
Refs: elastic/apm#613
@felixbarny (Member)

@basepi Is the Python agent doing an exponential backoff when the Lambda extension returns errors/is not available? If yes, how is the Python agent handling flush requests while it's in a backoff/grace period? Does it short-circuit the flush while in backoff?

@basepi (Contributor) commented Mar 16, 2022

If we're backing off, the Python agent will drop the data on an explicit flush and return immediately.

So we won't hang, we'll just drop the data and end.

Our backoff interval is not configurable.

@estolfo (Contributor) commented Mar 24, 2022

@trentm, @felixbarny, @basepi is there a conclusion to this? The backoff implementation has been merged into the Lambda extension, and it was mentioned in this discussion that this PR might be retracted, as backoff in the agent is still useful.

Do we want to define an explicit behavior for when the agent is flushed and it's in a backoff/grace period?

@felixbarny (Member) commented Mar 24, 2022

I've created a proposal for specifying the behavior of flush during backoff: #623

@trentm (Member, Author) commented Mar 24, 2022

is there a conclusion to this?

My status was still what #613 (comment) says: "Likely the end result is that I'll retract this PR". But otherwise this had moved to a low priority for me. I haven't read Felix's new #623 yet.

@felixbarny (Member)

I don't think the priority has changed. I've just created #623 as an alternative to this spec that we can discuss at a later point in time.

@trentm (Member, Author) commented Mar 24, 2022

Closing in favour of #623.

@trentm closed this Mar 24, 2022