
Feature request: Buffer debug logs and only emit on exception #4432

Open
hiselitelordship opened this issue May 31, 2024 · 7 comments
Labels
feature-request · help wanted · need-customer-feedback

Comments

@hiselitelordship

Use case

CloudWatch Logs are vital, but expensive. To control costs you switch the log level from DEBUG to INFO. When an exception happens you have very little context to debug what happened. You switch back to DEBUG level and hope(!) the error happens again.

Solution/User Experience

By configuring a buffer, debug logs would not be written to CloudWatch Logs unless an exception is encountered. In this way, CWL is low cost and minimalist. But, like a stack trace, when an exception is encountered the buffer is flushed to CWL and all the context of the preceding events is captured.

The buffer should not create memory pressure and itself become the cause of exceptions. Using something like a ring buffer, the oldest entries would be discarded in favour of the freshest. There is also likely a point at which flushing to CWL takes too long as an execution approaches the 900s execution limit, so the memory allocation or number of buffered lines should be configurable.
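
A rough sketch of the kind of bounded buffer I'm imagining, built on Python's standard logging machinery; the class and method names here are made up for illustration and are not an existing Powertools API:

```python
# Illustration only: a bounded DEBUG buffer on top of the stdlib logging module.
# Class and method names are hypothetical, not part of Powertools today.
import logging
from collections import deque


class BufferedDebugHandler(logging.Handler):
    """Parks DEBUG records in a ring buffer; everything else passes straight through."""

    def __init__(self, target: logging.Handler, capacity: int = 1000):
        super().__init__(level=logging.DEBUG)
        self.target = target                   # handler that actually writes to CWL/stdout
        self.buffer = deque(maxlen=capacity)   # oldest entries silently dropped when full

    def emit(self, record: logging.LogRecord) -> None:
        if record.levelno <= logging.DEBUG:
            self.buffer.append(record)          # no CloudWatch cost until flushed
        else:
            self.target.emit(record)            # INFO and above are emitted immediately

    def flush_buffer(self) -> None:
        """Replay the buffered DEBUG records, e.g. when an exception is encountered."""
        while self.buffer:
            self.target.emit(self.buffer.popleft())
```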

Alternative solutions

No response

@hiselitelordship hiselitelordship added feature-request feature request triage Pending triage from maintainers labels May 31, 2024

boring-cyborg bot commented May 31, 2024

Thanks for opening your first issue here! We'll come back to you as soon as we can.
In the meantime, check out the #python channel on our Powertools for AWS Lambda Discord: Invite link

@heitorlessa
Contributor

(haven't forgotten, adjusting to jet lag from ANZ as I'm back today) - this is a great idea and we were waiting for customer demand to consider it.

I'll share thoughts and questions tomorrow

@heitorlessa heitorlessa removed the triage Pending triage from maintainers label Jun 3, 2024
@heitorlessa
Contributor

unless an exception is encountered

This is the tricky part. As of now, I can think of these scenarios:

  • Handled exceptions. Code is within a try/except block and customers could do one of the following to flush the buffer (see the usage sketch below):
    • [explicit] call logger.hypothetical_flush_debug_buffer() as part of the except block
    • [implicit] call logger.exception("something went wrong"), which by default captures and formats a stack trace, but we'd transparently call logger.hypothetical_flush_debug_buffer(). Here be dragons, as some customers could love or hate this transparent behaviour (e.g., how do I disable this?).
  • Unhandled exceptions. The function crashes due to an unexpected, unhandled exception. Lambda doesn't provide an unhandled-exception hook (Bug: log_uncaught_exceptions only works locally and not on lambda #1798), but we could work around it in two ways:
    • If the customer sets Logger(log_uncaught_exceptions=True) and uses @logger.inject_lambda_context, we could catch all exceptions, log and flush the debug buffer, and re-raise.
    • We could register a no-op internal extension (separate thread), giving us 500ms for any cleanup upon timeout and such, including background jobs. That however would come at a cost (latency to be measured).

Did I miss anything?

I'd love to hear your thoughts on both.
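
To make the handled-exception options a bit more concrete, here's a rough usage sketch; hypothetical_flush_debug_buffer() is the placeholder name from above and doesn't exist today:

```python
# Usage sketch for the two handled-exception options above.
# hypothetical_flush_debug_buffer() is a placeholder, not a real Powertools method.
from aws_lambda_powertools import Logger

logger = Logger()


def process(event):
    raise ValueError("simulated failure so the except path runs")


def handler(event, context):
    try:
        process(event)
    except Exception:
        # [explicit] the customer flushes the buffered DEBUG records themselves
        logger.hypothetical_flush_debug_buffer()
        # [implicit] logger.exception() already logs a stack trace; it could also
        # flush the buffer transparently (which would need an opt-out)
        logger.exception("something went wrong")
        raise
```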

@hiselitelordship
Author

I imagine the new behaviour would be Logger(debug_on_exception=True). Explicit and backward compatible. Existing code would therefore not buffer and every debug line is published as expected. No surprises for anyone already using Powertools.

If I set this new option, I expect the logger.exception to flush out the buffer like a stack trace – that is why I set the option. If I didn’t set the option above, then I don’t want this buffering feature and there will be no stack trace or buffering.

But your point remains. If I want to selectively stop publishing the “stack trace” I can’t.
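
In other words, something along these lines (debug_on_exception is the proposed flag, not an existing parameter):

```python
# Sketch of the proposed opt-in; debug_on_exception does not exist today.
from aws_lambda_powertools import Logger

# Existing behaviour: no flag, no buffering, every DEBUG line is published as it is now.
logger = Logger(level="DEBUG")

# Proposed behaviour: DEBUG lines are buffered and only published when
# logger.exception() flushes them, like a stack trace of the preceding events.
buffered_logger = Logger(level="DEBUG", debug_on_exception=True)
```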

@heitorlessa
Contributor

heitorlessa commented Jun 7, 2024

I expect the logger.exception to flush out the buffer like a stack trace
Explicit and backward compatible

Yeah, that's the direction we'd go given our tenets. On Monday (17th), I'll summarize actionable points and tests we've made while we wait for customer +1s.

NOTE: we'll implement this across all languages once we have sufficient demand.

@heitorlessa heitorlessa added help wanted Could use a second pair of eyes/hands need-customer-feedback Requires more customers feedback before making or revisiting a decision labels Jun 9, 2024
@xkortex

xkortex commented Jun 11, 2024

First, I love the idea of this feature, it's definitely a "have your observability cake and eat it too" moment.

Logger(debug_on_exception=True)

+1, that seems like the best way to implement this to start. An env var which also sets it would be additionally welcome, as users could then turn on debug tracing via click-ops in response to an unforeseen issue, without a code deploy.
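
For instance (the variable name here is made up purely to illustrate the click-ops idea):

```python
import os

# Hypothetical env var name; it would let operators enable the behaviour without a deploy.
debug_on_exception = os.getenv("POWERTOOLS_DEBUG_ON_EXCEPTION", "false").lower() in ("1", "true")
```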

If customer sets Logger(log_uncaught_exceptions=True) and uses @logger.inject_lambda_context, we could catch all exceptions, log and flush the debug buffer, and re-raise.

This seems to me like the least surprising behavior, given a function wrapped with @logger.inject_lambda_context. That seems the most logical hardpoint to catch broad exceptions.
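
Roughly what I'd picture happening inside the decorator; this is a simplified stand-in, not how @logger.inject_lambda_context is actually implemented, and the flush method name is still the placeholder from earlier:

```python
# Simplified stand-in for the proposed catch/flush/re-raise behaviour.
# Not the real inject_lambda_context implementation; the flush method name is hypothetical.
import functools


def inject_lambda_context_sketch(logger):
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            try:
                return handler(event, context)
            except Exception:
                logger.hypothetical_flush_debug_buffer()  # placeholder name
                logger.exception("Uncaught exception in Lambda handler")
                raise  # re-raise so Lambda still marks the invocation as failed
        return wrapper
    return decorator
```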

As for in-line vs separate thread, whew, that's a challenging one. I definitely see the appeal of the Lambda returning as fast as possible on error, but you really don't want your error-handling code to itself get swallowed by the void, so you'd especially want to make sure any thread you kick off has a very high chance of actually emitting the error. Unless extension threads can keep the Lambda alive long enough to finish that work, I'd be concerned about the execution environment being frozen or scaled in mid-flush. That said, I'd expect the error packaging to be fast enough that drops would be unlikely (I'm just not knowledgeable enough about Lambda internals to know what keeps one alive).

Perhaps you could define a handler interface with a default stderr impl and let users inject their own function to run such that they could override the error rollup behavior, use threads or not, customize the debug sink (maybe they want to dump to S3 or datadog or whatever).
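
Something like this is what I'd imagine for that injectable sink (all names hypothetical; Powertools has no such interface today):

```python
# Sketch of a pluggable flush sink with a default stderr implementation.
import logging
import sys
from collections import deque
from typing import Callable, Iterable

# A flush sink is just a callable that receives the buffered records.
FlushSink = Callable[[Iterable[logging.LogRecord]], None]


def stderr_sink(records: Iterable[logging.LogRecord]) -> None:
    """Default behaviour: dump the buffered records to stderr."""
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for record in records:
        print(formatter.format(record), file=sys.stderr)


class DebugBuffer:
    """Holds recent DEBUG records; a user-supplied sink decides where they go on flush."""

    def __init__(self, sink: FlushSink = stderr_sink, capacity: int = 1000):
        self.sink = sink
        self.records: deque = deque(maxlen=capacity)

    def flush(self) -> None:
        self.sink(list(self.records))   # could instead write to S3, Datadog, a thread, etc.
        self.records.clear()
```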

@heitorlessa
Contributor

@sthulb when you can, would you mind dropping the memory test you did for verbose log records if kept in a circular buffer for this feature?

We're handling some critical operational bits this week, hence my extended delay in responding to it. We're still committed to doing this feature and plan to deprecate log sampling (as this is a much better alternative).

Projects: Status: Backlog
Development: No branches or pull requests
3 participants