fix: Uses pull query metrics for all paths, not just /query #5983

Merged

Conversation

AlanConfluent
Member

Description

Fixes an issue where pull query metrics were not being used in every pull query code path. They are now used not only on the /query endpoint but also on the WebSocket code path.

Testing done

Ran the normal tests. Also verified locally that the metrics are now in use.

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

Contributor

@apurvam apurvam left a comment


Thanks for the patch. Should we be adding tests which check that metrics are being computed correctly and through every API path?

@AlanConfluent
Member Author

Thanks for the patch. Should we be adding tests which check that metrics are being computed correctly and through every API path?

Yes, we probably should. The main difficulty is that PullQueryExecutor doesn't currently have its own unit test. I can try to figure out the best way to do this, though.

statement,
executionContext,
serviceContext,
pullQueryContext,
routingOptions
);

Member

There was a reason for these measurements to be in StreamedQueryResource: they measure a more realistic latency that also accounts for query parsing etc., basically everything that happens before the request reaches the PullQueryExecutor.

Contributor

+1

Member Author

@AlanConfluent AlanConfluent Aug 12, 2020

My original thinking was that CPU-bound work like parsing should be an order of magnitude or two faster than IO like reading from disk or the network. I do recall that in earlier benchmarking we didn't seem to be IO bound, but that's probably a separate issue.

If what we really want is latency for the whole endpoint, which I agree is preferable, it seems like it would be good to have a more generic latency metric for all resources -- that way we don't special-case this. Not sure if that's possible with Vert.x, but I imagine it is. To keep consistency with existing metric naming and to make it work with the WebSocket path (which is probably harder to make generic), I'll stick to the pull query use case for now. OK, I'll change this part back and add in the WebSocket.

Contributor

That's a good point - might be worth asking @purplefox and/or googling about Vert.x to see if there is a generic way to get metrics on all requests. I know this is possible in Jetty, but I don't know about Vert.x.
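For illustration only (not code from this PR), a generic per-request latency hook in Vert.x Web could look roughly like the sketch below. The class and recordEndpointLatency are made-up names, and wiring into an actual metrics registry is left out.

// Hypothetical sketch: a catch-all Vert.x Web handler that times every request.
// Assumes Vert.x Web (io.vertx.ext.web); recordEndpointLatency(...) is a placeholder.
import io.vertx.ext.web.Router;
import io.vertx.ext.web.RoutingContext;

public final class RequestLatencyHandler {

  public static void install(final Router router) {
    // A route with no path matches all requests; register it first so it runs before others.
    router.route().handler(RequestLatencyHandler::time);
  }

  private static void time(final RoutingContext ctx) {
    final long startNs = System.nanoTime();
    // Runs after the response body has been written, i.e. end-to-end for the endpoint.
    ctx.addBodyEndHandler(v ->
        recordEndpointLatency(ctx.request().path(), System.nanoTime() - startNs));
    ctx.next();
  }

  private static void recordEndpointLatency(final String path, final long elapsedNs) {
    // Placeholder: report to the metrics registry of your choice.
  }
}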

Contributor

@big-andy-coates big-andy-coates left a comment

It's a bit of a PITA that we now have two ways of handling pull queries. However, be that as it may, we obviously need to capture metrics from both, in some semblance of a standard way. As @vpapavas says, moving the latency capture completely into the executor means we're no longer including the time taken to do things like auth checks in the latency. This is obviously bad!

The best solution would be to remove the old code path for pull queries. But this isn't possible short term (though we should at least start this process, e.g. get the UI team to move to the new HTTP/2 endpoint and mark the old WebSocket endpoint as deprecated).

Given the two code paths, the challenge is therefore to change the code in a way that encapsulates the collection of the metrics and has the minimum duplicate code in the two paths.

Looking at the two code paths, they are drastically different in the way they are architected. This leads me to believe the best approach is for each path to capture the start time of the request (in nanoseconds), and pass this down the call stack. This start time can be passed down into PullQueryExecutor so it can log the latency. I don't actually like this pattern, but I don't think capturing the metrics at a higher level is feasible. I think this only works as there isn't any substantial processing after the executor has done its thing, but this is worth double checking. Ideally, we should be capturing any JSON serialization costs, etc.
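As a rough sketch of that pattern (with illustrative names and signatures, not the actual ksqlDB classes), each endpoint would capture the start time up front and thread it through to the executor:

import java.util.Optional;
import java.util.function.LongSupplier;

// Illustrative pattern only: the endpoint captures the request start time and passes it
// down the call stack so the executor can record end-to-end latency.
final class PullLatencyPatternSketch {

  interface PullMetrics {
    void recordLatency(long elapsedNanos);
  }

  static final class Executor {
    private final Optional<PullMetrics> metrics;
    private final LongSupplier clock;  // injectable clock, e.g. System::nanoTime

    Executor(final Optional<PullMetrics> metrics, final LongSupplier clock) {
      this.metrics = metrics;
      this.clock = clock;
    }

    String execute(final String statement, final long startTimeNanos) {
      final String rows = "rows for " + statement;  // stand-in for running the pull query
      metrics.ifPresent(m -> m.recordLatency(clock.getAsLong() - startTimeNanos));
      return rows;
    }
  }

  // The HTTP resource and the WebSocket handler both start the clock before any
  // auth checks or parsing, so that work is included in the measured latency.
  static String handleRequest(final Executor executor, final String statement) {
    final long startTimeNanos = System.nanoTime();
    // ... auth checks, parsing, validation ...
    return executor.execute(statement, startTimeNanos);
  }
}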

statement,
executionContext,
serviceContext,
pullQueryContext,
routingOptions
);

Contributor

+1

@AlanConfluent
Member Author

@big-andy-coates

It's a bit of a PITA that we now have two ways of handling pull queries. However, be that as it may, we obviously need to capture metrics from both, in some semblance of a standard way. As @vpapavas says, moving the latency capture completely into the executor means we're no longer including the time taken to do things like auth checks in the latency. This is obviously bad!

I agree with the general sentiment, though I'm surprised we don't have framework-level latency measurement for WebSocket and endpoint requests.

I'd also be surprised if this made much of a difference, since in theory we should be IO bound, at least for large tables.

Looking at the two code paths, they are drastically different in the way they are architected. This leads me to believe the best approach is for each path to capture the start time of the request (in nanoseconds), and pass this down the call stack. This start time can be passed down into PullQueryExecutor so it can log the latency. I don't actually like this pattern, but I don't think capturing the metrics at a higher level is feasible. I think this only works as there isn't any substantial processing after the executor has done its thing, but this is worth double checking. Ideally, we should be capturing any JSON serialization costs, etc.

I tried a version where I separated out the timing and did it after serialization, and in some sense I prefer that, since the timing isn't encapsulated anyway. But per all of your other suggestions about encapsulating the metrics object in PullQueryExecutor, I just did that. From what I can tell, there should be very little work done in serialization.

Member

@vpapavas vpapavas left a comment

LGTM

@big-andy-coates
Contributor

I'd also be surprised if this made much of a difference, since in theory we should be IO bound, at least for large tables.

The auth check is potentially a cross machine call! ;)

Contributor

@big-andy-coates big-andy-coates left a comment

LGTM with a few suggestions.

final TableRows tableRows = result.getTableRows();

- return new PullQueryPublisher(
+ final QueryPublisher queryPublisher = new PullQueryPublisher(
Contributor

nit: inline queryPublisher again.

Member Author

Done


public PullQueryExecutor(
-     final KsqlExecutionContext executionContext,
+     final KsqlEngine ksqlEngine,
Contributor

Probably better to keep this as KsqlExecutionContext and pass in the service Id... less coupling.

Member Author

Done

Comment on lines 241 to 242
pullQueryMetrics.ifPresent(metrics ->
    metrics.recordLatency(Time.SYSTEM.nanoseconds() - startTimeNanos));
Contributor

Why not just have:

Suggested change
- pullQueryMetrics.ifPresent(metrics ->
-     metrics.recordLatency(Time.SYSTEM.nanoseconds() - startTimeNanos));
+ pullQueryMetrics.ifPresent(metrics ->
+     metrics.recordLatency(startTimeNanos));

And change recordLatency to take the end time in ns, encapsulating the Time.SYSTEM.nanoseconds() into a single place...?
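For illustration, the encapsulated form might look roughly like this fragment of the metrics class (assuming Kafka's Time utility, as in the snippet above; not necessarily the exact code that landed):

// Sketch of the encapsulated form: callers pass only the start time and the metrics
// object grabs "now" itself. Time is org.apache.kafka.common.utils.Time; latencySensor
// is the existing sensor field.
public void recordLatency(final long startTimeNanos) {
  latencySensor.record(Time.SYSTEM.nanoseconds() - startTimeNanos);
}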

Member Author

Done


- public void recordLatency(final double value) {
-   this.latencySensor.record(value);
+ public void recordLatency(final long nanoSeconds) {
Contributor

As suggested above, change to taking the end time, rather than the elapsed time?

Member Author

Done

@@ -80,10 +81,11 @@ public void setUp() {
    publisher = new PullQueryPublisher(
        serviceContext,
        statement,
-       pullQueryExecutor);
+       pullQueryExecutor,
+       0L);
Contributor

Can we go with a non-zero start time please? Stick it in a constant, e.g. START_TIME = 1234L. Update tests to use START_TIME.

Member Author

Done

@@ -75,8 +74,7 @@ public void shouldThrowExceptionIfConfigDisabled() {
    // When:
    final Exception e = assertThrows(
        KsqlException.class,
-       () -> pullQueryExecutor.execute(query, engine.getServiceContext(), Optional.empty(),
-           Optional.empty())
+       () -> pullQueryExecutor.execute(query, engine.getServiceContext(), Optional.empty(), 0L)
Contributor

It would be good to test that it's passing an appropriate start time here. Can we inject a clock?
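One hedged sketch of how an injectable clock could make that assertable in a unit test (the names here are illustrative, not ksqlDB's):

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Sketch only: give the code under test a controllable time source so the start time
// it passes around, and the resulting latency, are deterministic.
public final class FixedClockTestSketch {
  public static void main(final String[] args) {
    final AtomicLong nowNanos = new AtomicLong(1_000_000L);
    final LongSupplier testClock = nowNanos::get;

    // "Start" the request, then advance the fake clock by 5ms of simulated work.
    final long startTimeNanos = testClock.getAsLong();
    nowNanos.addAndGet(5_000_000L);

    // A unit test can now assert on an exact, known latency value.
    final long elapsedNanos = testClock.getAsLong() - startTimeNanos;
    System.out.println("elapsed ns = " + elapsedNanos); // prints 5000000
  }
}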

Member Author

Done.

Comment on lines 115 to 125
@Test
public void shouldRecordRequestRate() {
  // Given:
  pullMetrics.recordRate(3);

  // When:
  final double rate = getMetricValue("-rate");

  // Then:
  assertThat(rate, closeTo(0.03, 0.001));
}
Contributor

Rather than deleting this test, change it to:

@Test
public void shouldRecordRequestRate() {
  // Given:
  pullMetrics.recordLatency(3000);

  // When:
  final double rate = getMetricValue("-rate");

  // Then:
  assertThat(rate, closeTo(0.03, 0.001));  // <-- numbers need changing.
}

Otherwise there's no test testing that the rate is updated when recordLatency is called!
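For context, this works because a single Kafka metrics Sensor can back both the rate and the latency stats, so one record() call updates both. A minimal sketch using Kafka's Metrics API, with assumed metric and group names:

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Rate;

// Sketch (not the exact ksqlDB wiring): one sensor backing both the "-rate" and
// latency metrics, so a single recordLatency() sample updates both.
public final class PullMetricsSketch {
  public static void main(final String[] args) {
    final Metrics metrics = new Metrics();

    final Sensor latencySensor = metrics.sensor("pull-query-latency");
    latencySensor.add(metrics.metricName("pull-query-requests-rate", "ksql-pull-queries"), new Rate());
    latencySensor.add(metrics.metricName("pull-query-latency-avg", "ksql-pull-queries"), new Avg());

    // Each latency sample also counts toward the request rate.
    latencySensor.record(3000);
  }
}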

Member Author

I had removed the rate variable since it didn't appear to be read at all in practice, and since you had recommended removing it, but I'm now thinking you just meant the method recordRate. Alright, I'll add this variable to the recordLatency call and add back this test.

@AlanConfluent AlanConfluent merged commit 143849c into confluentinc:master Aug 21, 2020
sarwarbhuiyan pushed a commit to sarwarbhuiyan/ksql that referenced this pull request Sep 4, 2020:

fix: Uses pull query metrics for all paths, not just /query (…tinc#5983)