On shutdown wait for lambda logs API to report the final platform report metrics #347

Merged
merged 5 commits into from Nov 28, 2022

Conversation

lahsivjar
Contributor

@lahsivjar lahsivjar commented Nov 23, 2022

Motivation

platform.report metrics for a Lambda invocation are reported during a later invocation (in most cases the next one). As a result, we will not have the report metrics for the last invocation until shutdown and, given the behavior of the extension prior to this PR, we end up dropping that last metric. For functions invoked periodically (hourly/daily/weekly), this leads to no platform.report metrics at all.

Solution

During shutdown the extension gets a deadline of 2 seconds. This PR uses some of those 2 seconds to wait for the logs API to send us the platform.report metric for the last seen invocation. This wait is executed once for each execution env/instance that served an invocation (NOT for each invocation). Per our initial benchmarks, the wait lasts for a maximum of 40ms (avg: ~5ms).

While the PR fixes the platform.report metric for successful invocations, for function crashes (timeouts/OOMs) it is still possible to miss the last platform.report.

Note that the platform.report metric for the last invocation can take as much as 45 minutes to be reported, since it is only collected when the Lambda execution env shuts down.

How to test?

  1. Create a lambda function with the latest version of the extension and configure it to send load to APM-Server.
  2. Invoke the lambda function a specific number of times.
  3. Observe the number of platform.report metrics in Kibana (can be filtered with the KQL query faas.billed_duration : * on the metrics data stream) and assert that it is the same as the number of function invocations. (Note that it can take up to 45 minutes for all the platform.report metrics to be indexed.)

Steps 1 & 2 can be performed by running cd testing && LAMBDA_RUNTIME=go1.x EC_API_KEY=<ec_api_key> make bench. By default, this will make 500 requests (the count is shown in the summary generated by the above command). After confirming the test, please run LAMBDA_RUNTIME=go1.x EC_API_KEY=<ec_api_key> make destroy to delete the infrastructure.

Related Issues

Related to #334

@github-actions github-actions bot added the aws-λ-extension AWS Lambda Extension label Nov 23, 2022
@elastic-apm-tech elastic-apm-tech added this to In Progress in APM-Agents (OLD) Nov 23, 2022
@apmmachine

apmmachine commented Nov 23, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-11-28T09:32:32.195+0000

  • Duration: 9 min 36 sec

Test stats 🧪

Test Results
Failed 0
Passed 202
Skipped 2
Total 204

@lahsivjar lahsivjar marked this pull request as ready for review November 24, 2022 12:09
@AlexanderWert
Member

AlexanderWert commented Nov 24, 2022

@lahsivjar One question: In case that the metrics get reported much later (e.g. with the shutdown, 45 mins after the last invocation), which timestamp is reported for the corresponding metric? Is it the timestamp of the previous invocation or the timestamp of when the platform.report event is processed (which is in that case 45min later)?

In the latter case, would there be an easy way to use the invocation timestamp?

@lahsivjar lahsivjar requested review from a team and AlexanderWert November 24, 2022 12:18
@AlexanderWert
Member

I think I found the answer to my question :-)

@lahsivjar
Contributor Author

In case that the metrics get reported much later (e.g. with the shutdown, 45 mins after the last invocation), which timestamp is reported for the corresponding metric?

@AlexanderWert The timestamp reported will be the timestamp in the platform.report log event. I am not sure whether that time represents the timestamp of the previous invocation or the time at which the log event was generated, but it is definitely not the processing time. Also, in my tests I have observed the shutdown time to be in the range of 5 to 15 minutes (attaching an example of 250 function invocations plotted with the @timestamp field as well as with ingestion time to give a rough idea).

[Screenshot: plotted with @timestamp]

[Screenshot: plotted with ingestion-time]

@AlexanderWert
Member

Looking forward to testing this with my functions that run periodically once a day (right now I don't see any metrics at all, hope this will change that) :-)

@lahsivjar lahsivjar enabled auto-merge (squash) November 28, 2022 09:32
@lahsivjar lahsivjar merged commit 5ed72b5 into elastic:main Nov 28, 2022
APM-Agents (OLD) automation moved this from In Progress to Done Nov 28, 2022
@lahsivjar lahsivjar deleted the 334-fix-pf-report branch November 28, 2022 09:47
@kruskall kruskall self-assigned this Dec 5, 2022
@kruskall
Member

kruskall commented Dec 5, 2022

Somehow the assignee only got applied to the other extension PR.

I tested this on 8.6.0 and it worked fine. I discovered an issue with how we were compiling the test function and opened a PR to fix that (#350).
I followed the "How to test?" section and used LAMBDA_RUNTIME=go1.x EC_API_KEY=<ec_api_key> STACK_VERSION=8.6.0 make bench. The number of platform.report metrics matches the result from the make task output, and the metrics are reported correctly.

@kruskall kruskall removed their assignment Dec 5, 2022