fix: Uses pull query metrics for all paths, not just /query #5983

Merged

Conversation

AlanConfluent
Member

Description

Fixes an issue where pull query metrics were not being used in every pull query code path. They are now used not only on the /query endpoint but also on the WebSocket code path.

Testing done

Ran the normal tests. Also verified locally that the metrics are now in use.

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

Contributor

@apurvam apurvam left a comment


Thanks for the patch. Should we be adding tests which check that metrics are being computed correctly and through every API path?

@AlanConfluent
Member Author

Thanks for the patch. Should we be adding tests which check that metrics are being computed correctly and through every API path?

Yes, we probably should. The main difficulty is that PullQueryExecutor doesn't currently have its own unit test. I can try to figure out the best way to do this, though.

statement,
executionContext,
serviceContext,
pullQueryContext,
routingOptions
);

Member

There was a reason for these measurements to be in StreamedQueryResource: they measure a more realistic latency that also accounts for query parsing etc., basically everything that happens before the request reaches the PullQueryExecutor.

Contributor

+1

Member Author

@AlanConfluent AlanConfluent Aug 12, 2020

My original thinking was that CPU-bound work like parsing should be an order of magnitude or two faster than IO like reading from disk or the network. I do recall that in earlier benchmarking we didn't seem to be IO bound, but that's probably a separate issue.

If what we really want is latency for the whole endpoint, which I agree is preferable, it seems like it would be good to have a more generic latency metric for all resources -- that way we don't special-case this. Not sure if that's possible with Vert.x, but I imagine it is. To keep consistency with existing metric naming and to make it work with the WebSocket path (which is probably harder to make generic), I'll stick to the pull query use case for now. OK, I'll change this part back and add in the WebSocket.

Contributor

That's a good point - might be worth asking @purplefox and/or googling about Vert.x to see if there is a generic way to get metrics on all requests. I know this is possible in Jetty, but I don't know about Vert.x.
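For illustration only (not code from this PR), a generic per-request latency hook in Vert.x Web could look roughly like the sketch below. The class and recordEndpointLatency are made-up names, and wiring into an actual metrics registry is left out.

// Hypothetical sketch: a catch-all Vert.x Web handler that times every request.
// Assumes Vert.x Web (io.vertx.ext.web); recordEndpointLatency(...) is a placeholder.
import io.vertx.ext.web.Router;
import io.vertx.ext.web.RoutingContext;

public final class RequestLatencyHandler {

  public static void install(final Router router) {
    // A route with no path matches all requests; register it first so it runs before others.
    router.route().handler(RequestLatencyHandler::time);
  }

  private static void time(final RoutingContext ctx) {
    final long startNs = System.nanoTime();
    // Runs after the response body has been written, i.e. end-to-end for the endpoint.
    ctx.addBodyEndHandler(v ->
        recordEndpointLatency(ctx.request().path(), System.nanoTime() - startNs));
    ctx.next();
  }

  private static void recordEndpointLatency(final String path, final long elapsedNs) {
    // Placeholder: report to the metrics registry of your choice.
  }
}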

Contributor

@big-andy-coates big-andy-coates left a comment

It's a bit of a PITA that we now have two ways of handling pull queries. However, be that as it may, we obviously need to capture metrics from both, in some semblance of a standard way. As @vpapavas says, moving the latency capture completely into the executor means we're no longer including the time taken to do things like auth checks in the latency. This is obviously bad!

The best solution would be to remove the old code path for pull queries. But this isn't possible short term (though we should at least start this process, e.g. get the UI team to move to the new HTTP/2 endpoint and mark the old WebSocket endpoint as deprecated).

Given the two code paths, the challenge is therefore to change the code in a way that encapsulates the collection of the metrics and has the minimum duplicate code in the two paths.

Looking at the two code paths, they are drastically different in the way they are architected. This leads me to believe the best approach is for each path to capture the start time of the request (in nanoseconds), and pass this down the call stack. This start time can be passed down into PullQueryExecutor so it can log the latency. I don't actually like this pattern, but I don't think capturing the metrics at a higher level is feasible. I think this only works as there isn't any substantial processing after the executor has done its thing, but this is worth double checking. Ideally, we should be capturing any JSON serialization costs, etc.
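As a rough sketch of that pattern (with illustrative names and signatures, not the actual ksqlDB classes), each endpoint would capture the start time up front and thread it through to the executor:

import java.util.Optional;
import java.util.function.LongSupplier;

// Illustrative pattern only: the endpoint captures the request start time and passes it
// down the call stack so the executor can record end-to-end latency.
final class PullLatencyPatternSketch {

  interface PullMetrics {
    void recordLatency(long elapsedNanos);
  }

  static final class Executor {
    private final Optional<PullMetrics> metrics;
    private final LongSupplier clock;  // injectable clock, e.g. System::nanoTime

    Executor(final Optional<PullMetrics> metrics, final LongSupplier clock) {
      this.metrics = metrics;
      this.clock = clock;
    }

    String execute(final String statement, final long startTimeNanos) {
      final String rows = "rows for " + statement;  // stand-in for running the pull query
      metrics.ifPresent(m -> m.recordLatency(clock.getAsLong() - startTimeNanos));
      return rows;
    }
  }

  // The HTTP resource and the WebSocket handler both start the clock before any
  // auth checks or parsing, so that work is included in the measured latency.
  static String handleRequest(final Executor executor, final String statement) {
    final long startTimeNanos = System.nanoTime();
    // ... auth checks, parsing, validation ...
    return executor.execute(statement, startTimeNanos);
  }
}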

statement,
executionContext,
serviceContext,
pullQueryContext,
routingOptions
);

Contributor

+1

@AlanConfluent
Member Author

@big-andy-coates

It's a bit of a PITA that we now have two ways of handling pull queries. However, be that as it may, we obviously need to capture metrics from both, in some semblance of a standard way. As @vpapavas says, moving the latency capture completely into the executor means we're no longer including the time taken to do things like auth checks in the latency. This is obviously bad!

I agree with the general sentiment, though I'm surprised we don't have framework-level latency measurement for WebSocket and endpoint requests.

I'd also be surprised if this made much of a difference, since in theory we should be IO bound, at least for large tables.

Looking at the two code paths, they are drastically different in the way they are architected. This leads me to believe the best approach is for each path to capture the start time of the request (in nanoseconds), and pass this down the call stack. This start time can be passed down into PullQueryExecutor so it can log the latency. I don't actually like this pattern, but I don't think capturing the metrics at a higher level is feasible. I think this only works as there isn't any substantial processing after the executor has done its thing, but this is worth double checking. Ideally, we should be capturing any JSON serialization costs, etc.

I tried a version where I separated out the timing and did it after serialization, and in some sense I prefer that, since the timing isn't encapsulated anyway. But per all of your other suggestions about encapsulating the metrics object in PullQueryExecutor, I just did that. From what I can tell, there should be very little work done in serialization.

Member

@vpapavas vpapavas left a comment

LGTM

@big-andy-coates
Contributor

I'd also be surprised if this made much of a difference, since in theory we should be IO bound, at least for large tables.

The auth check is potentially a cross machine call! ;)

Contributor

@big-andy-coates big-andy-coates left a comment

LGTM with a few suggestions.

final TableRows tableRows = result.getTableRows();

- return new PullQueryPublisher(
+ final QueryPublisher queryPublisher = new PullQueryPublisher(
Contributor

nit: inline queryPublisher again.

Member Author

Done


public PullQueryExecutor(
-     final KsqlExecutionContext executionContext,
+     final KsqlEngine ksqlEngine,
Contributor

Probably better to keep this as KsqlExecutionContext and pass in the service Id... less coupling.

Member Author

Done

Comment on lines 241 to 242
pullQueryMetrics.ifPresent(metrics ->
    metrics.recordLatency(Time.SYSTEM.nanoseconds() - startTimeNanos));
Contributor

Why not just have:

Suggested change
- pullQueryMetrics.ifPresent(metrics ->
-     metrics.recordLatency(Time.SYSTEM.nanoseconds() - startTimeNanos));
+ pullQueryMetrics.ifPresent(metrics ->
+     metrics.recordLatency(startTimeNanos));

And change recordLatency to take the end time in ns, encapsulating the Time.SYSTEM.nanoseconds() into a single place...?
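For illustration, the encapsulated form might look roughly like this fragment of the metrics class (assuming Kafka's Time utility, as in the snippet above; not necessarily the exact code that landed):

// Sketch of the encapsulated form: callers pass only the start time and the metrics
// object grabs "now" itself. Time is org.apache.kafka.common.utils.Time; latencySensor
// is the existing sensor field.
public void recordLatency(final long startTimeNanos) {
  latencySensor.record(Time.SYSTEM.nanoseconds() - startTimeNanos);
}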

Member Author

Done


- public void recordLatency(final double value) {
-   this.latencySensor.record(value);
+ public void recordLatency(final long nanoSeconds) {
Contributor

As suggested above, change to taking the end time, rather than the elapsed time?

Member Author

Done

@@ -80,10 +81,11 @@ public void setUp() {
    publisher = new PullQueryPublisher(
        serviceContext,
        statement,
-       pullQueryExecutor);
+       pullQueryExecutor,
+       0L);
Contributor

Can we go with a non-zero start time please? Stick it in a constant, e.g. START_TIME = 1234L. Update tests to use START_TIME.

Member Author

Done

@@ -75,8 +74,7 @@ public void shouldThrowExceptionIfConfigDisabled() {
    // When:
    final Exception e = assertThrows(
        KsqlException.class,
-       () -> pullQueryExecutor.execute(query, engine.getServiceContext(), Optional.empty(),
-           Optional.empty())
+       () -> pullQueryExecutor.execute(query, engine.getServiceContext(), Optional.empty(), 0L)
Contributor

It would be good to test that it's passing an appropriate start time here. Can we inject a clock?
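One hedged sketch of how an injectable clock could make that assertable in a unit test (the names here are illustrative, not ksqlDB's):

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Sketch only: give the code under test a controllable time source so the start time
// it passes around, and the resulting latency, are deterministic.
public final class FixedClockTestSketch {
  public static void main(final String[] args) {
    final AtomicLong nowNanos = new AtomicLong(1_000_000L);
    final LongSupplier testClock = nowNanos::get;

    // "Start" the request, then advance the fake clock by 5ms of simulated work.
    final long startTimeNanos = testClock.getAsLong();
    nowNanos.addAndGet(5_000_000L);

    // A unit test can now assert on an exact, known latency value.
    final long elapsedNanos = testClock.getAsLong() - startTimeNanos;
    System.out.println("elapsed ns = " + elapsedNanos); // prints 5000000
  }
}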

Member Author

Done.

Comment on lines 115 to 125
@Test
public void shouldRecordRequestRate() {
  // Given:
  pullMetrics.recordRate(3);

  // When:
  final double rate = getMetricValue("-rate");

  // Then:
  assertThat(rate, closeTo(0.03, 0.001));
}
Contributor

Rather than deleting this test, change it to:

@Test
public void shouldRecordRequestRate() {
  // Given:
  pullMetrics.recordLatency(3000);

  // When:
  final double rate = getMetricValue("-rate");

  // Then:
  assertThat(rate, closeTo(0.03, 0.001));  // <-- numbers need changing.
}

Otherwise there's no test testing that the rate is updated when recordLatency is called!
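For context, this works because a single Kafka metrics Sensor can back both the rate and the latency stats, so one record() call updates both. A minimal sketch using Kafka's Metrics API, with assumed metric and group names:

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Rate;

// Sketch (not the exact ksqlDB wiring): one sensor backing both the "-rate" and
// latency metrics, so a single recordLatency() sample updates both.
public final class PullMetricsSketch {
  public static void main(final String[] args) {
    final Metrics metrics = new Metrics();

    final Sensor latencySensor = metrics.sensor("pull-query-latency");
    latencySensor.add(metrics.metricName("pull-query-requests-rate", "ksql-pull-queries"), new Rate());
    latencySensor.add(metrics.metricName("pull-query-latency-avg", "ksql-pull-queries"), new Avg());

    // Each latency sample also counts toward the request rate.
    latencySensor.record(3000);
  }
}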

Member Author

I had removed the rate variable since it didn't appear to be read at all in practice, and since you had recommended removing it, but I'm now thinking you just meant the method recordRate. Alright, I'll add this variable to the recordLatency call and add back this test.

@AlanConfluent AlanConfluent merged commit 143849c into confluentinc:master Aug 21, 2020
sarwarbhuiyan pushed a commit to sarwarbhuiyan/ksql that referenced this pull request Sep 4, 2020:

fix: Uses pull query metrics for all paths, not just /query (…tinc#5983)