Change the algorithm used for byte-based log rate limiting #727

ctlong · 2023-04-05T19:19:55Z

Enter an issue title

Change the algorithm used for byte-based log rate limiting.

Summary

Context

Diego v2.66.0 added support for a new byte-based log rate limiting mechanism with per-LRP limits (see: proposal). That mechanism currently implements a "token bucket" of size r, initially full and refilled at rate r tokens per second, where r is the log rate limit for the LRP in question. Whenever the log rate limit for an LRP is exceeded an error message (i.e. app instance exceeded log rate limit) is inserted into the LRPs log stream.

Problem

Hypothetical scenario where many/all log lines are dropped
No log line can be emitted unless it's byte size is below the log rate limit for the LRP. This means that if the log rate limit for an LRP is set below the max log line size, 61440B (~64K), then potentially many (or even all) log lines will be dropped for being too large. We could consider this as either a feature or a bug.

Too many log rate limit exceeded error messages
The error messages are emitted every time the log rate limit is exceeded. If that's happening consistently then the error messages – which are not counted against the log rate limit – may significantly increase the actual log rate of the app. This definitely seems like a bug that we should fix.

Proposal

Make sure documentation makes the hypothetical scenario where many/all log lines are dropped very clear so that developers can adjust their app's logging or their log rate limits accordingly.
If the log rate limit is exceeded then drop all logs for 1 second (i.e. introduce a penalty box).
Update the generic error message to include more meaningful information.

Diego repo

executor

Describe alternatives you've considered (optional)

We could make a change in CAPI to not allow log rate limits that are below the max log line size.
Only emit an error message every second rather than after the log rate limit has been exceeded.
Declare the hypothetical scenario where many/no log lines are allowed as a bug rather than a feature. Change implementation s.t. If the log rate limit is set below the max log line size then set the burst capacity to the max log line size (i.e. the min bucket size is the max log line size).

Additional Text Output, Screenshots, or contextual information (optional)

The text was updated successfully, but these errors were encountered:

ctlong · 2023-04-05T21:53:31Z

Started a discussion in CF Slack.

ctlong · 2023-04-13T16:36:54Z

Added a new step to the proposal based off Jochen's suggestion in Slack: Update the generic error message to include more meaningful information.

reneighbor · 2023-04-14T19:17:18Z

For Hypothetical scenario 1 where no log lines are allowed - I can imagine customers being confused that they're missing log lines entirely. If 1 log line is larger in bytes than the max bytes per second, can we truncate the line so they know the line exceeds the limit?

Benjamintf1 · 2023-04-14T21:07:53Z

We've I think been explicitly avoiding truncating logs to meet the rate limit. Maybe it's something we should have considered more, fair.

TY for bringing it up!!

winkingturtle-vmw · 2023-05-01T19:30:10Z

We could make a change in CAPI to not allow log rate limits that are below the max log line size.

This makes sense to me. Having some kind of API level failures where the minimum has to be met.

Too many log rate limit exceeded error messages

I think this comment makes sense as the goal of this feature. I think that instead of emitting every second, if we can figure out when it starts and stops, it makes sense.

Update the generic error message to include more meaningful information.

Can you elaborate what type of meaningful information this could be?

Benjamintf1 · 2023-05-01T20:08:56Z

I'm not sure I agree with a minimum. People should feel free to set their limit low, or entirely off if need be. I'm not sure we have a good argument as to why that shouldn't be allowed. Perhaps a "hey, you sure about that" on the cli? I don't feel like a "never".

Can you elaborate what type of meaningful information this could be?

An example we've talked about is, if we want to go to a timeout type limit, including that information("You are being timed out for 1 second") in the outage warning log.

ctlong · 2023-05-01T20:32:26Z

PSA: planning to discuss this proposal at the working group meeting on Wednesday, May 5th.

Benjamintf1 · 2023-05-01T20:45:24Z

Just for the record: here's a rephrasing of the state we're in

As part of implementing app level rate limiting with quotas, some changes to the algorithm of log rate limiting were made.
One of those dramatically increased the amount of "log rate exceeded" messages a user could see.
What we're looking to do is possibly:

reduce the amount of log-rate-exceeded messages, reducing the load on the platform
increase the amount of contiguous blocks of logs the user sees, possibly making it easier to interpret the logs that are emitted
keep the amount of clarity around which logs are being elluded due to log rate limits.

There's some future thoughts as well on:

trying to make sure applications can utilize more of their log-rate-limit when we can allow it, but that's not implemented in the pr we made this time. An exmaple would be truncating oversized logs rather then triggering the timeout immediately without logging.

ctlong · 2023-05-01T20:57:57Z

I think that's a good rephrase, thanks Ben! Having a clear idea of (what we the think are) the benefits of the proposal definitely helps with evaluating it.

Benjamintf1 · 2023-05-01T20:58:13Z

I think that instead of emitting every second, if we can figure out when it starts and stops, it makes sense.

That's why we're talking about doing a timeout box. It both allows us to reduce the number of messages saying when the logs are being emitted, and allows us to demarcate clearer when logs are being dropped.

ctlong · 2023-05-15T17:55:33Z

The outcome of the proposal discussion at the working group was general agreement on the proposal (by the stakeholders present at the meeting).

ctlong · 2023-05-16T17:13:27Z

cloudfoundry/executor#73 was merged, which implements (2).

geofffranks · 2023-05-31T14:02:30Z

cloudfoundry/executor#73 has now been released in v2.77.0

winkingturtle-vmw · 2024-01-29T17:25:41Z

I am going to close this issue since all related PRs have been merged. Please re-open if that's not the case.

ctlong added the enhancement label Apr 5, 2023

ctlong mentioned this issue Apr 5, 2023

Timeout implementation of reducing overage logs. cloudfoundry/executor#73

Merged

winkingturtle-vmw closed this as completed Jan 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change the algorithm used for byte-based log rate limiting #727

Change the algorithm used for byte-based log rate limiting #727

ctlong commented Apr 5, 2023 •

edited

ctlong commented Apr 5, 2023

ctlong commented Apr 13, 2023

reneighbor commented Apr 14, 2023 •

edited

Benjamintf1 commented Apr 14, 2023

winkingturtle-vmw commented May 1, 2023

Benjamintf1 commented May 1, 2023 •

edited

ctlong commented May 1, 2023

Benjamintf1 commented May 1, 2023 •

edited

ctlong commented May 1, 2023 •

edited

Benjamintf1 commented May 1, 2023

ctlong commented May 15, 2023 •

edited

ctlong commented May 16, 2023

geofffranks commented May 31, 2023

winkingturtle-vmw commented Jan 29, 2024

Change the algorithm used for byte-based log rate limiting #727

Change the algorithm used for byte-based log rate limiting #727

Comments

ctlong commented Apr 5, 2023 • edited

Enter an issue title

Summary

Context

Problem

Proposal

Diego repo

Describe alternatives you've considered (optional)

Additional Text Output, Screenshots, or contextual information (optional)

ctlong commented Apr 5, 2023

ctlong commented Apr 13, 2023

reneighbor commented Apr 14, 2023 • edited

Benjamintf1 commented Apr 14, 2023

winkingturtle-vmw commented May 1, 2023

Benjamintf1 commented May 1, 2023 • edited

ctlong commented May 1, 2023

Benjamintf1 commented May 1, 2023 • edited

ctlong commented May 1, 2023 • edited

Benjamintf1 commented May 1, 2023

ctlong commented May 15, 2023 • edited

ctlong commented May 16, 2023

geofffranks commented May 31, 2023

winkingturtle-vmw commented Jan 29, 2024

ctlong commented Apr 5, 2023 •

edited

reneighbor commented Apr 14, 2023 •

edited

Benjamintf1 commented May 1, 2023 •

edited

Benjamintf1 commented May 1, 2023 •

edited

ctlong commented May 1, 2023 •

edited

ctlong commented May 15, 2023 •

edited