Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RLQS: Refactor traffic processing in the RLQS filter & fix a broken expiration check #35723

Merged
merged 4 commits into from
Aug 22, 2024

Conversation

bsurber
Copy link
Contributor

@bsurber bsurber commented Aug 16, 2024

Commit Message: Refactor the RLQS filter's traffic processing & fix the broken action-assignment expiration check.
Additional Description:
Current behavior:

  • Every request that hits the RLQS filter results in the cache thinking that its entry has expired due to the initial / default action of a newly created BucketCache index not having a TTL.
  • If the TTL was fixed, it would still start failing all expiration checks shortly after assignments were received, even if those assignments were sent routinely (not going stale). This is because the expiration time was calculated off of a private field set on BucketCache index creation, and never updated when receiving assignments.
  • Note: Integration testing was previously passing as it was incorrectly expecting multiple RLQS usage reports.
    Risk Level: minor - filter is in-development and currently in a broken state
    Testing:
    Docs Changes:
    Release Notes:
    Platform Specific Features:

…ed false-positive) expiration check

Signed-off-by: Brian Surber <bsurber@google.com>
@bsurber bsurber changed the title Refactor traffic processing in the RLQS filter & fix a broken expiration check RLQS: Refactor traffic processing in the RLQS filter & fix a broken expiration check Aug 16, 2024
Copy link
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Left some comments from a quick first pass.

If possible, I would suggest having this PR only for fixing the expiration check and leaving the refactor into separate PR.

@tyxia tyxia self-assigned this Aug 16, 2024
@tyxia
Copy link
Member

tyxia commented Aug 16, 2024

Also please fix the format so that CI can run. You can use command bazel run //tools/code_format:check_format -- fix
Thanks!

/wait

Signed-off-by: Brian Surber <bsurber@google.com>
Copy link
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM modulo the open threads and code coverage

Thanks!

}
if (ret_status == Envoy::Http::FilterHeadersStatus::StopIteration) {
sendDenyResponse();
quota_buckets_[bucket_id]->quota_usage.num_requests_denied += 1;
Copy link
Member

@tyxia tyxia Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I feel updating the allow/deny stats based on ret_status is not robust. The reason it works here is because our RLQS filter is in non-blocking fashion that it will not wait for response from RLQS server.

StopIteration in Envoy generally could represent that filter is waiting for the remote server response before continuing, rather than tightly tied with localReply/reject.

Thus, can we update the quota_usage inline inside of the functions above when the request is determined to be ALLOW or Throttled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've addressed this internally without too much to say. It'll do for now as that would be a bit of future-proofing for an unknown problem at the cost of additional refactoring.

Signed-off-by: Brian Surber <bsurber@google.com>
@tyxia
Copy link
Member

tyxia commented Aug 21, 2024

/wait

…to the default-action

Signed-off-by: Brian Surber <bsurber@google.com>
Copy link
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@tyxia tyxia merged commit 3896048 into envoyproxy:main Aug 22, 2024
47 checks passed
@bsurber bsurber deleted the fix-rlqs-expiration branch August 22, 2024 16:49
@phlax
Copy link
Member

phlax commented Aug 26, 2024

@tyxia @bsurber i think this is causing a very frequent flake - we may need to revert while the issue is resolved

@tyxia
Copy link
Member

tyxia commented Aug 26, 2024

@phlax Thanks for notification!

Could you please highlight which test is flaky? so that we can resolve/revert it accordingly

@phlax
Copy link
Member

phlax commented Aug 26, 2024

see eg here https://dev.azure.com/cncf/envoy/_build/results?buildId=178487&view=logs&j=1439b9f7-a348-5b50-b5fe-ea612ea91241&t=1002ac43-da84-5fae-70b2-98833b702d09&l=386

same test timed out twice on that run - but its happening a lot (if sporadically) elsewhere

@bsurber
Copy link
Contributor Author

bsurber commented Aug 26, 2024

Hm, I had seen ASSERT_TRUE(response_->waitForEndStream()); deadlock during testing but had thought I'd found the cause. I guess that was just occluded by flakiness and that search will have to resume.

tyxia pushed a commit that referenced this pull request Aug 26, 2024
…broken expiration check" (#35847)

Reverts #35723

Signed-off-by: Brian Surber <bsurber@google.com>
tyxia pushed a commit that referenced this pull request Sep 5, 2024
…xpiration check v2 (#35973)

Commit Message: Refactor the RLQS filter's traffic processing & fix the
broken action-assignment expiration check.
Additional Description: Addresses flakiness from [past
PR](#35723) by adding a missing
`sendDenyResponse` & improving test consistency in asynchronous steps.
Risk Level: Minor
Testing: Integration and unit testing run 1k times to check flakiness

---------

Signed-off-by: Brian Surber <bsurber@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants