Spec that agents in Lambda should *not* do back-off #613

Closed
26 changes: 22 additions & 4 deletions specs/agents/tracing-instrumentation-aws-lambda.md
@@ -198,10 +198,28 @@ Field | Value | Description | Source
`context.cloud.origin.region` | e.g. `us-east-1` | S3 bucket region. | `record.awsRegion`
`context.cloud.origin.provider` | `aws` | Use `aws` as fix value. | -

## Data Flushing
Lambda functions are immediately frozen as soon as the handler method ends. In case APM data is sent in an asyncronous way (as most of the agents do by default) data can get lost if not sent before the lambda function ends.
## Transport

Therefore, the Lambda instrumentation has to ensure that data is flushed in a blocking way before the execution of the handler function ends.
Typically, Lambda functions using an APM agent will include the [APM Lambda
Extension](https://github.com/elastic/apm-aws-lambda/tree/main/apm-lambda-extension)
to which the APM agent sends data locally. There are some changes to the APM
agents' [transport behavior](./transport.md) to APM Server in this environment.

Some Lambda functions will use the custom-built Lambda extension that allows the agent to send its data locally. The extension asynchronously forwards the data it receives from the agent to the APM server so the Lambda function can return its result with minimal delay. In order for the extension to know when it can flush its data, it must receive a signal indicating that the lambda function has completed. There are two possible signals: one is via a subscription to the AWS Lambda Logs API and the other is an agent intake request with the query param `flushed=true`. A signal from the agent is preferrable because there is an inherent delay with the sending of the Logs API signal.

### Data Flushing

Lambda function VMs are frozen as soon as the handler method ends and any extensions signal completion. If APM data is sent asynchronously (as most agents do by default), it can be lost if it is not sent before the Lambda function ends. Therefore, the Lambda instrumentation has to ensure that data is flushed in a blocking way before the execution of the handler function ends.

The extension asynchronously forwards the data it receives from the agent to the APM server so the Lambda function can return its result with minimal delay. In order for the extension to know when it can flush its data, it must receive a signal indicating that the Lambda function has completed. There are two possible signals: one is via a subscription to the AWS Lambda Logs API and the other is an agent intake request with the query param `flushed=true`. A signal from the agent is preferable because there is an inherent delay in the sending of the Logs API signal.
Therefore, the agent must send its final intake request at the end of the function invocation with the query param `flushed=true`. In case there is no more data to send at the end of the function invocation, the agent must send an empty intake request with this query param.
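
As a rough illustration of the final request described above, here is a minimal Python sketch. It is not taken from any agent: the function names, the `http://localhost:8200` extension address, the `/intake/v2/events` path default, and the timeout are assumptions made for the example. It shows only two points: the `flushed=true` query param is appended to the intake URL, and the request is sent (blocking) even when the body is empty.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request, urlopen


def final_intake_url(server_url: str, intake_path: str = "/intake/v2/events") -> str:
    """Build the intake URL for the last request of an invocation.

    `server_url` and `intake_path` are illustrative assumptions; the only
    part taken from the spec text is the `flushed=true` query param.
    """
    parts = urlsplit(server_url)
    query = dict(parse_qsl(parts.query))
    query["flushed"] = "true"  # signals the extension that the invocation is done
    path = parts.path.rstrip("/") + intake_path
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))


def build_final_request(server_url: str, buffered: bytes = b"") -> Request:
    """Prepare the final intake request; `buffered` may be empty."""
    return Request(
        final_intake_url(server_url),
        data=buffered,  # empty body when there is nothing left to send
        method="POST",
        headers={"Content-Type": "application/x-ndjson"},
    )


def flush_blocking(server_url: str, buffered: bytes = b"", timeout: float = 1.0) -> None:
    """Send the final request and wait for it before the handler returns."""
    with urlopen(build_final_request(server_url, buffered), timeout=timeout):
        pass
```

A hypothetical handler wrapper would call `flush_blocking("http://localhost:8200", remaining_bytes)` as its last step, passing `b""` when the agent's buffer is already empty.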

### Transport errors

APM agents in a Lambda VM that send to the local extension SHOULD NOT implement
the back-off / grace period after failed intake requests that is described in
[the transport spec](./transport.md#transport-errors). It is the responsibility
of the extension to handle back-off and buffering, if at all. Because the
extension *asynchronously* passes APM data on to APM server, it does not return
APM server responses to the agent; therefore the agent cannot meaningfully
handle backpressure.

2 changes: 1 addition & 1 deletion specs/agents/transport.md
@@ -49,7 +49,7 @@ When a request fails, the agent has no way of knowing exactly what data was succ

The agent should therefore drop the entire compressed buffer: both the internal zlib buffer, and potentially the already compressed data if such data is also buffered. Data subsequently written to the compression library can be directed to a new HTTP request.

The new HTTP request should not necessarily be started immediately after the previous HTTP request fails, as the reason for the failure might not have been resolved up-stream. Instead an incremental back-off algorithm SHOULD be used to delay new requests. The grace period should be calculated in seconds using the algorithm `min(reconnectCount++, 6) ** 2 ± 10%`, where `reconnectCount` starts at zero. So the delay after the first error is 0 seconds, then circa 1, 4, 9, 16, 25 and finally 36 seconds. We add ±10% jitter to the calculated grace period in case multiple agents entered the grace period simultaneously. This way they will not all try to reconnect at the same time.
The new HTTP request should not necessarily be started immediately after the previous HTTP request fails, as the reason for the failure might not have been resolved up-stream. Instead an incremental back-off algorithm SHOULD be used to delay new requests. The grace period should be calculated in seconds using the algorithm `min(reconnectCount++, 6) ** 2 ± 10%`, where `reconnectCount` starts at zero. So the delay after the first error is 0 seconds, then circa 1, 4, 9, 16, 25 and finally 36 seconds. We add ±10% jitter to the calculated grace period in case multiple agents entered the grace period simultaneously. This way they will not all try to reconnect at the same time. (APM agents in an AWS Lambda function that are sending to the APM Lambda Extension SHOULD NOT implement back-off. See [the Lambda instrumentation section on transport errors](tracing-instrumentation-aws-lambda.md#transport-errors).)
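
The grace-period formula above can be sketched in a few lines of Python. This is an illustrative sketch, not code from any agent: the function name, the `in_lambda` flag, and the uniform distribution for the jitter are assumptions. The spec text only gives `min(reconnectCount++, 6) ** 2 ± 10%` and states that agents sending to the Lambda extension should skip back-off.

```python
import random


def grace_period(reconnect_count: int, in_lambda: bool = False, jitter: float = 0.10) -> float:
    """Seconds to wait before the next intake request after a failure.

    `reconnect_count` is the number of previous consecutive failures
    (starting at zero); the caller increments it after each call.
    """
    if in_lambda:
        # Assumption for this sketch: agents sending to the local Lambda
        # extension do not back off at all.
        return 0.0
    base = min(reconnect_count, 6) ** 2
    # Jitter spreads out agents that entered the grace period together.
    return base * (1 + random.uniform(-jitter, jitter))


# Base delays without jitter for the first eight consecutive failures.
delays = [grace_period(n, jitter=0.0) for n in range(8)]
```

With `jitter=0.0` the sequence is 0, 1, 4, 9, 16, 25, 36 seconds and stays at 36 thereafter, matching the values listed in the paragraph above.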

Agents should support specifying multiple server URLs. When a transport error occurs, the agent should switch to another server URL at the same time as backing off.
