Add query/time metric for SQL queries from router #12867

Merged
rohangarg merged 11 commits into apache:master from rohangarg:router_sql_query_time
Sep 7, 2022

@rohangarg
Member

This change adds the query/time metric for SQL queries from the router. Currently, native queries report that metric whereas SQL queries don't. The biggest problem in supporting SQL query metrics is that, in the router, the SQL query doesn't have a native query plan that can be used to send metrics.
So, instead, we extract sqlQueryId from the query response header and set only that dimension on the query/time metric for SQL queries. Because there is no translated native query, we use a dummy native query to interact with the QueryMetrics interface but ensure that no dummy dimensions are set in the metric.

The reasons for not deserializing the SQL query requests are:

  1. Deserialization takes time on the router, which can add to query latency.
  2. JDBC SQL queries are not easy to deserialize (doing so would require the same structure as the Avatica handlers).
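
The header-based approach described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class and method names are hypothetical, the header name `X-Druid-SQL-Query-Id` is an assumption, and a plain map stands in for the real response headers and for the QueryMetrics dimensions.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class RouterSqlMetricSketch
{
  // Assumed header name; the real constant lives in Druid's SQL resource code.
  static final String SQL_QUERY_ID_HEADER = "X-Druid-SQL-Query-Id";

  /**
   * Builds the dimensions for a router-side query/time event from the response headers.
   * Only sqlQueryId is set, because the router has no native query plan to describe.
   * Note: this sketch does an exact-case lookup; real HTTP headers are case-insensitive.
   */
  static Map<String, String> dimensionsFor(Map<String, String> responseHeaders)
  {
    Map<String, String> dims = new LinkedHashMap<>();
    String sqlQueryId = responseHeaders.get(SQL_QUERY_ID_HEADER);
    if (sqlQueryId != null) {
      dims.put("sqlQueryId", sqlQueryId);
    }
    return dims;
  }

  public static void main(String[] args)
  {
    Map<String, String> headers = new HashMap<>();
    headers.put(SQL_QUERY_ID_HEADER, "abc-123");
    headers.put("Content-Type", "application/json");
    System.out.println(dimensionsFor(headers));
  }
}
```

If the header is missing, the sketch returns no dimensions at all rather than inventing a placeholder id, which matches the stated goal of not setting dummy dimensions.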

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Thanks @rohangarg for fixing this. I have a few comments on this PR.

{
return null;
}
QueryMetrics<Query<?>> makeMetrics();
Contributor

Do we not need a default implementation?

Member Author

I removed the default implementation since, per the PublicApi documentation, it is OK to add methods to interfaces in a major release. Further, the existing method's semantics are kept the same.

Contributor

Just because it is "okay" to add a thing, forcing anybody that happens to implement an interface to implement a new method adds friction to moving forward. It's "okay" to provide for some way to make changes and move things forward, not to say that it's okay to force people to make simple random changes to keep up with versions. Always make it as simple as possible for someone to bring their extension forward a version, if it's relatively easy to make it so that nobody has to change anything and they get good behavior, then that should be done.

Now, if the default implementation doesn't provide good behavior and instead causes bad things to happen, then people should be forced to implement it. In this case, it seems like a default is good?

Member Author

@rohangarg rohangarg Aug 18, 2022

I'm not sure a default would be a good thing to have here. A default method in the interface that returns null would mean that every developer who has written a custom GenericQueryMetricsFactory would need to implement this method anyway to get router SQL query metrics. The need to extend the factory would also have to be communicated to them.
Further, since the method's return value would be nullable, all callers of the method would need to handle null explicitly from now on.
The benefit of a default implementation is that developers who don't care about router metrics for SQL queries don't need to make any code changes to their custom metrics factory.
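
To make the trade-off concrete, here is a small sketch of the two options. The interfaces below are simplified stand-ins with hypothetical names, not Druid's GenericQueryMetricsFactory or QueryMetrics, and a String stands in for the query.

```java
public class DefaultMethodSketch
{
  /** Stand-in for a metrics object; not Druid's QueryMetrics. */
  interface Metrics
  {
    String describe();
  }

  /** Option A: a default that returns null keeps old implementations compiling. */
  interface FactoryWithDefault
  {
    Metrics makeMetrics(String query);

    default Metrics makeMetrics()
    {
      return null; // every caller must now null-check the result
    }
  }

  /** Option B: an abstract method forces every implementation to make a decision. */
  interface FactoryWithoutDefault
  {
    Metrics makeMetrics(String query);

    Metrics makeMetrics();
  }

  /** A pre-existing custom factory: compiles unchanged against A, but not against B. */
  static class LegacyFactory implements FactoryWithDefault
  {
    @Override
    public Metrics makeMetrics(String query)
    {
      return () -> "metrics for " + query;
    }
  }

  /** With option A, a legacy factory silently yields no router SQL metrics. */
  static boolean emitsRouterSqlMetrics(FactoryWithDefault factory)
  {
    return factory.makeMetrics() != null;
  }

  public static void main(String[] args)
  {
    System.out.println(emitsRouterSqlMetrics(new LegacyFactory()));
  }
}
```

Under option A the legacy factory keeps compiling but gets no router metrics until someone notices and overrides the method; under option B the compiler surfaces the decision at upgrade time.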

Contributor

Then the answer isn't "I don't have a default implementation because it's okay to add one" it's "I don't have one because there isn't a good implementation for them and anybody who implements this interface really does need to think about what the correct implementation is". In which case, that's the reason to add a new method, so great :).

Anyway, sorry to be pedantic, but for anything that impacts compatibility, it's important to show the work that we've thought about the plight of the developer who is updating their cluster.

Member Author

Yes, I totally agree with your sentiment regarding SPI compat being more of a judgement call rather than a technicality. I started out with a default implementation and then removed it due to the above rationale. But I missed adding the full explanation in the previous comment.
After this discussion I also considered whether we could change makeMetrics(Query) to makeMetrics(@Nullable Query) and then have the default implementation of makeMetrics() call makeMetrics(null). But that could change the semantics of makeMetrics(Query), which currently expects a non-null Query.
I'll think again about whether there is another way to avoid the incompatibility and update if I find one.
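
A rough sketch of why delegating to makeMetrics(null) is risky, using hypothetical simplified types (a String stands in for Query, and the factory below is not Druid's interface): an existing implementation written against the non-null contract can fail as soon as the new no-argument method is called.

```java
import java.util.Objects;

public class NullableDelegationSketch
{
  interface Factory
  {
    // Existing contract: callers always pass a non-null query.
    String makeMetrics(String query);

    // The idea under discussion: reuse the existing method with a null query.
    default String makeMetrics()
    {
      return makeMetrics(null);
    }
  }

  /** An existing implementation that relies on the non-null contract. */
  static class StrictFactory implements Factory
  {
    @Override
    public String makeMetrics(String query)
    {
      Objects.requireNonNull(query, "query");
      return "metrics:" + query;
    }
  }

  /** Returns true if the no-argument call blows up for the given factory. */
  static boolean noArgCallFails(Factory factory)
  {
    try {
      factory.makeMetrics();
      return false;
    }
    catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args)
  {
    System.out.println(noArgCallFails(new StrictFactory()));
  }
}
```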

Contributor

I believe this is resolved. It is the way it is in this PR because that's the way it needs to be.

2. Add router metric tests for JDBC SQL query using Avatica JSON
3. Add request log line for native SQL queries
@rohangarg
Member Author

@cheddar and I had a discussion about alternative ways to get the queryId and sqlQueryId for user queries. The current approach fetches the ids from the response header, which is a last resort rather than a principled way. The response header method doesn't scale well in case we want to add more metrics from the router, since all of them would have to wait for the full request to complete and would then have to be emitted from the response listeners.

The challenges with other approaches to fetching the ids are:

  1. We need to generate query ids in case the user hasn't provided them. This keeps the metrics from routers and other services consistent for the same user query. To check for and generate the ids, we need to deserialize JDBC and normal SQL queries to a state where we can see their context map.
  2. JDBC is a stateful protocol that relies on session state managed by the broker. So not every JDBC request carries the ids (config params are only allowed in the connection-open request). Thus, if a user sets an id via JDBC, we won't know about it while executing the actual SQL queries.
  3. The native query id for a SQL query can't be set from the router, since the SQL might break into multiple native queries (union queries), and all of them would then have the same native id.

As a result, we've decided to do the following:

  1. For normal SQL queries, we use a deserialized version of the request and inject a SQL query id into it if the user hasn't already set one.
  2. We don't emit a native query id for any SQL queries, to avoid problems with union queries.
  3. For JDBC queries, we extract the SQL query id from the response header, since that is the best we can do as of now.

The above solution lets us emit the query/time metric for all SQL queries, while leaving room to add more metrics, at least for normal SQL queries.

Contributor

@imply-cheddar imply-cheddar left a comment

A few questions to work out; once they are worked out, I'll approve.

.build();
queryMetrics.query(query);
queryMetrics.reportQueryTime(0).emit(serviceEmitter);
queryMetrics.sqlQueryId("dummy"); // done just to pacify the code coverage tool
Contributor

What's the point of this comment? 6 months from now, when someone reads this code and sees that comment, how did the comment enrich their life? Fwiw, I'm not asking this sarcastically, I'm asking because I want whatever your answer is to be bundled into the comment :).

That or maybe the test can validate that something is done with the sqlQueryId and then it can actually be testing it or something?

Member Author

The code was added just to make the code coverage tool pass. The actual verification can't be done because we have a no-op implementation for sqlQueryId(String) in the default metrics. I've updated the comment to be more clear.

public HttpFields getHeaders()
{
HttpFields httpFields = new HttpFields();
httpFields.add(new HttpField(QueryResource.QUERY_ID_RESPONSE_HEADER, "dummy"));
Contributor

I tend to frown on re-using constants like this in a test. The test is validating the consistency of the API. If you use a constant like this for this part, then someone could come along and change the header that the queryId is returned on, the tests would pass because they are also being changed because they are using the same object, but the production deployment could fail as you've broken the API: anything that depended on the older header name will be broken.

It's better for tests to actually be brittle in these cases: hard-code the header name so that if anything accidentally changes it in the future, it will be caught by the tests.
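
As a rough illustration of the point, the sketch below contrasts a check that reuses the production constant with one that hard-codes the header name. The class, the method names, and the header value `X-Druid-Query-Id` are assumptions made for the example, not taken from this PR.

```java
public class HeaderNameContractSketch
{
  // Hard-coded on purpose: this literal is the API contract the test protects (assumed value).
  static final String EXPECTED_HEADER = "X-Druid-Query-Id";

  /**
   * Weak check: compares the emitted header with the production constant.
   * If someone renames the constant, both sides change together and the check still passes.
   */
  static boolean weakCheck(String emittedHeader, String productionConstant)
  {
    return emittedHeader.equals(productionConstant);
  }

  /** Strict check: compares the emitted header with the literal that clients depend on. */
  static boolean strictCheck(String emittedHeader)
  {
    return EXPECTED_HEADER.equals(emittedHeader);
  }

  public static void main(String[] args)
  {
    // Simulate an accidental rename of the production constant.
    String renamedConstant = "X-Druid-Id";
    String emittedHeader = renamedConstant;

    System.out.println(weakCheck(emittedHeader, renamedConstant));
    System.out.println(strictCheck(emittedHeader));
  }
}
```

The weak check keeps passing after the rename, while the strict check fails and flags the broken contract.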

Member Author

Yes, that makes sense to me! 👍 I have updated the test to use hardcoded values to protect against silent failures.

private SqlQuery buildSqlQueryWithId(SqlQuery sqlQuery)
{
Map<String, Object> context = new HashMap<>(sqlQuery.getContext());
context.putIfAbsent(BaseQuery.SQL_QUERY_ID, UUID.randomUUID().toString());
Contributor

Just double checking, but this will end up setting the native queryId as well if that was null, right? I.e. when I'm comparing my query/time metrics filtering on a single native queryId, I'll also get the query/time from the router, right?

Member Author

Discussed offline; I have added BaseQuery.QUERY_ID to the context, so the router's query/time metric event will also have the id dimension filled in.
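
One plausible shape for filling both ids is sketched below. It is an assumption-laden illustration, not the merged code: the context keys `sqlQueryId` and `queryId` are assumed to mirror BaseQuery.SQL_QUERY_ID and BaseQuery.QUERY_ID, the helper name is hypothetical, and defaulting the native id to the SQL id is a guess, since the thread does not say how the value is chosen.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class QueryIdContextSketch
{
  // Assumed context keys, mirroring BaseQuery.SQL_QUERY_ID and BaseQuery.QUERY_ID.
  static final String SQL_QUERY_ID = "sqlQueryId";
  static final String QUERY_ID = "queryId";

  /**
   * Returns a copy of the user's context with both ids present.
   * User-supplied ids are never overwritten; the input map is left untouched.
   */
  static Map<String, Object> withIds(Map<String, Object> userContext)
  {
    Map<String, Object> context = new HashMap<>(userContext);
    context.putIfAbsent(SQL_QUERY_ID, UUID.randomUUID().toString());
    // Assumption: when no native id is given, fall back to the SQL id.
    context.putIfAbsent(QUERY_ID, context.get(SQL_QUERY_ID));
    return context;
  }

  public static void main(String[] args)
  {
    Map<String, Object> userContext = new HashMap<>();
    userContext.put(SQL_QUERY_ID, "user-supplied-id");
    System.out.println(withIds(userContext));
  }
}
```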

@rohangarg rohangarg merged commit 7aa8d7f into apache:master Sep 7, 2022
@kfaraz kfaraz added this to the 25.0 milestone Nov 22, 2022