Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adds rate limiting to pull queries #4951

Merged

Conversation

AlanConfluent
Copy link
Member

@AlanConfluent AlanConfluent commented Mar 31, 2020

Description

Adds the use of a rate limiter for pull queries, to allow for capping QPS. Fails immediately if limit is reached.

Fixes #4445

Testing done

mvn package

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@AlanConfluent AlanConfluent requested a review from a team as a code owner March 31, 2020 22:50
@@ -168,6 +172,11 @@ public TableRowsEntity execute(
+ "this feature.");
}

if (!rateLimiter.tryAcquire()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this interact with host forwarding? If I forwarded to a host that is overwhelmed will I then try it's standby? It might make sense to check the error message and not forward to another host in this scenario (not sure if that's what happens or not)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair question. At the moment, it checks the limit at both places, which seems like a reasonable method generally since both have to take part in responding.

It doesn't handle this failure in a special manner at the forwarder and it would just try the standby, as you're hinting at. This could be a way to quickly shift load from one overwhelmed host to another, though in the current scheme, next time, we'll just try the overwhelmed host again, so it's not perfect at the moment.

To me, being overwhelmed with queries is not unlike the host being down temporarily, and the solution we have is to fail over to a standby. What do you think @vinothchandar ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlanConfluent I don't understand from the code how it will try the standby if the active has exceeded the rate limit. The check does not happen in the forwarding loop so if the active fails, the query will fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the rate limit check fails at the forwarding node, that it just fails. That means that the node accessed to answer pull queries is overwhelmed. I could move this down to the local pull query section, but it wouldn't limit someone from forwarding lots of pull queries through a single forwarding node, which seems bad.

If it fails the rate limit at the actual active node, then the forwarding loop will go on to the standby.

for (KsqlNode node : filteredAndOrderedNodes) {
      try {
        return routeQuery(node, statement, executionContext, serviceContext, pullQueryContext);
      } catch (Exception t) {
        LOG.debug("Error routing query {} to host {} at timestamp {}",
                 statement.getStatementText(), node, System.currentTimeMillis());
      }
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By forwarding node do you mean a router? The way I understand how this change would work is this: If there is a router and the router exceeds the rate limit, it will fail the query. If the router is not overwhelmed, it will go to the forwarding loop and will try the active and standbys. If there is no router, then the active will fail the query if is has exceeded the rate limit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fly by comment (feel free to ignore) :): I feel we can just enforce at the router and over time, things will settle down to that rate when all routers enforce the limit.. This is a simpler model to understand..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I agree it's a little hard to reason about if you're trying to figure out total QPS available.

I have it now check the rate limit so long as it's not been forwarded yet. It should be noted that this doesn't prevent people from deliberately trying to circumvent the limit by always setting the "forwarded" flag, though it may not find the desired data in that case (if it lives elsewhere). Any scheme (without internal, trusted rpcs) that tries to only check a limit for some requests has this issue since flags can be spoofed. I think that's not an issue for this feature though since it's not meant for security.

@agavra agavra requested a review from a team March 31, 2020 23:28
.withConfigs(ImmutableMap.of(KsqlConfig.KSQL_QUERY_PULL_MAX_QPS_CONFIG, 2));

@Test
public void shouldRateLimit() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this fail here? There are no queries issued?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It checks the limit before the request is issued. It effectively asks for permission, and if it's at the limit, it's not given permission.

Copy link
Member

@vpapavas vpapavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @AlanConfluent! LGTM. Maybe try the code with the wrk tool to verify it works?

@AlanConfluent
Copy link
Member Author

Thank you @AlanConfluent! LGTM. Maybe try the code with the wrk tool to verify it works?

Did exactly that and verified that at least on a local host, I get the QPS I set in the config.

@AlanConfluent AlanConfluent merged commit 6284111 into confluentinc:master Apr 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pull Queries: Add rate limiting to pull queries
4 participants