Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the algorithm used for byte-based log rate limiting #727

Closed
ctlong opened this issue Apr 5, 2023 · 14 comments
Closed

Change the algorithm used for byte-based log rate limiting #727

ctlong opened this issue Apr 5, 2023 · 14 comments

Comments

@ctlong
Copy link
Member

ctlong commented Apr 5, 2023

Enter an issue title

Change the algorithm used for byte-based log rate limiting.

Summary

Context

Diego v2.66.0 added support for a new byte-based log rate limiting mechanism with per-LRP limits (see: proposal). That mechanism currently implements a "token bucket" of size r, initially full and refilled at rate r tokens per second, where r is the log rate limit for the LRP in question. Whenever the log rate limit for an LRP is exceeded an error message (i.e. app instance exceeded log rate limit) is inserted into the LRPs log stream.

Problem

Hypothetical scenario where many/all log lines are dropped
No log line can be emitted unless it's byte size is below the log rate limit for the LRP. This means that if the log rate limit for an LRP is set below the max log line size, 61440B (~64K), then potentially many (or even all) log lines will be dropped for being too large. We could consider this as either a feature or a bug.

Too many log rate limit exceeded error messages
The error messages are emitted every time the log rate limit is exceeded. If that's happening consistently then the error messages – which are not counted against the log rate limit – may significantly increase the actual log rate of the app. This definitely seems like a bug that we should fix.

Proposal

  1. Make sure documentation makes the hypothetical scenario where many/all log lines are dropped very clear so that developers can adjust their app's logging or their log rate limits accordingly.
  2. If the log rate limit is exceeded then drop all logs for 1 second (i.e. introduce a penalty box).
  3. Update the generic error message to include more meaningful information.

Diego repo

  • executor

Describe alternatives you've considered (optional)

  • We could make a change in CAPI to not allow log rate limits that are below the max log line size.
  • Only emit an error message every second rather than after the log rate limit has been exceeded.
  • Declare the hypothetical scenario where many/no log lines are allowed as a bug rather than a feature. Change implementation s.t. If the log rate limit is set below the max log line size then set the burst capacity to the max log line size (i.e. the min bucket size is the max log line size).

Additional Text Output, Screenshots, or contextual information (optional)

@ctlong
Copy link
Member Author

ctlong commented Apr 5, 2023

Started a discussion in CF Slack.

@ctlong
Copy link
Member Author

ctlong commented Apr 13, 2023

Added a new step to the proposal based off Jochen's suggestion in Slack: Update the generic error message to include more meaningful information.

@reneighbor
Copy link
Member

reneighbor commented Apr 14, 2023

For Hypothetical scenario 1 where no log lines are allowed - I can imagine customers being confused that they're missing log lines entirely. If 1 log line is larger in bytes than the max bytes per second, can we truncate the line so they know the line exceeds the limit?

@Benjamintf1
Copy link
Member

We've I think been explicitly avoiding truncating logs to meet the rate limit. Maybe it's something we should have considered more, fair.

TY for bringing it up!!

@winkingturtle-vmw
Copy link
Contributor

We could make a change in CAPI to not allow log rate limits that are below the max log line size.

This makes sense to me. Having some kind of API level failures where the minimum has to be met.

Too many log rate limit exceeded error messages

I think this comment makes sense as the goal of this feature. I think that instead of emitting every second, if we can figure out when it starts and stops, it makes sense.

Update the generic error message to include more meaningful information.

Can you elaborate what type of meaningful information this could be?

@Benjamintf1
Copy link
Member

Benjamintf1 commented May 1, 2023

I'm not sure I agree with a minimum. People should feel free to set their limit low, or entirely off if need be. I'm not sure we have a good argument as to why that shouldn't be allowed. Perhaps a "hey, you sure about that" on the cli? I don't feel like a "never".

Can you elaborate what type of meaningful information this could be?

An example we've talked about is, if we want to go to a timeout type limit, including that information("You are being timed out for 1 second") in the outage warning log.

@ctlong
Copy link
Member Author

ctlong commented May 1, 2023

PSA: planning to discuss this proposal at the working group meeting on Wednesday, May 5th.

@Benjamintf1
Copy link
Member

Benjamintf1 commented May 1, 2023

Just for the record: here's a rephrasing of the state we're in

As part of implementing app level rate limiting with quotas, some changes to the algorithm of log rate limiting were made.
One of those dramatically increased the amount of "log rate exceeded" messages a user could see.
What we're looking to do is possibly:

  1. reduce the amount of log-rate-exceeded messages, reducing the load on the platform
  2. increase the amount of contiguous blocks of logs the user sees, possibly making it easier to interpret the logs that are emitted
  3. keep the amount of clarity around which logs are being elluded due to log rate limits.

There's some future thoughts as well on:

  1. trying to make sure applications can utilize more of their log-rate-limit when we can allow it, but that's not implemented in the pr we made this time. An exmaple would be truncating oversized logs rather then triggering the timeout immediately without logging.

@ctlong
Copy link
Member Author

ctlong commented May 1, 2023

I think that's a good rephrase, thanks Ben! Having a clear idea of (what we the think are) the benefits of the proposal definitely helps with evaluating it.

@Benjamintf1
Copy link
Member

I think that instead of emitting every second, if we can figure out when it starts and stops, it makes sense.

That's why we're talking about doing a timeout box. It both allows us to reduce the number of messages saying when the logs are being emitted, and allows us to demarcate clearer when logs are being dropped.

@ctlong
Copy link
Member Author

ctlong commented May 15, 2023

The outcome of the proposal discussion at the working group was general agreement on the proposal (by the stakeholders present at the meeting).

@ctlong
Copy link
Member Author

ctlong commented May 16, 2023

cloudfoundry/executor#73 was merged, which implements (2).

@geofffranks
Copy link
Contributor

cloudfoundry/executor#73 has now been released in v2.77.0

@winkingturtle-vmw
Copy link
Contributor

I am going to close this issue since all related PRs have been merged. Please re-open if that's not the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants