
Refactor ResponseContext #11828

Merged 4 commits into apache:master on Dec 7, 2021

Conversation

@paul-rogers (Contributor) commented Oct 22, 2021

Description

Fixes a number of annoying issues in preparation for request trailers and the query profile.

  • Converts keys from an enum to classes for smaller code.
  • Wraps stored values in functions for easier capture for other uses.
  • Reworks the "header squeezer" to handle types other than arrays.
  • Uses metadata for visibility and compressibility to replace ad-hoc code.
  • Cleans up JSON serialization for the response context.
  • Other miscellaneous cleanup.

Based on a prototype by Gian.

Preparing for response trailers and query profile

The query profile work needs a way to return profile information at the completion of a query. It turns out Gian had prototyped just such a mechanism. To do that, he modified the response context to make it clear which fields are returned in the trailer, which in the header, and which not at all.

The query profile will use the response context as a "back door" to sneak metrics information out of each QueryRunner.run() call. This PR contains just the refactoring done by those projects without the additional complexity of the actual trailer or profile code.

Key metadata

ResponseContext provides a set of keys that define the values allowed in the context. It seems those keys were originally for a small set of response header values, but they grew over time to include internal-only values. With the addition of the response trailer (coming in a later PR), we'll also have trailer values.

We have code that knows which names should be in the header and picks those out. However, extensions can add keys, which makes a hard-coded approach brittle. This PR adds metadata to each key to identify its visibility. (Based on work originally done by Gian.)
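To make the idea concrete, here is a minimal sketch, using hypothetical names rather than Druid's actual API, of a key that carries visibility metadata so that generic header-building code needs no hard-coded name list:

```java
// Hypothetical sketch (not Druid's actual API): each key carries
// visibility metadata so generic code can decide whether its value
// belongs in the response header.
public class KeyMetadataSketch {
  public enum Visibility { HEADER, TRAILER, NONE }

  public static class Key {
    public final String name;
    public final Visibility visibility;

    public Key(String name, Visibility visibility) {
      this.name = name;
      this.visibility = visibility;
    }
  }

  // No hard-coded list of key names: the metadata decides.
  public static boolean belongsInHeader(Key key) {
    return key.visibility == Visibility.HEADER;
  }
}
```

With this shape, an extension-defined key gets exactly the same header handling as a core key, since the picking code consults only the metadata.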

Expanded header truncation support

The code can drop (or truncate) header fields, but it hard-codes the fields that can be truncated and assumes they are arrays. This PR adds the "droppable" state as metadata in the key, and handles non-array fields when dropping. This ensures that extensions get the same "droppable" behavior as core code.
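A rough sketch of the generalized truncation behavior described above (the method name and the null-means-drop convention are illustrative assumptions, not the actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: array values are shortened element-by-element; a
// non-array value cannot be partially kept, so it is dropped outright
// (returning null here means "drop the entry from the header").
public class HeaderSqueezerSketch {
  public static Object truncate(Object value, int maxElements) {
    if (value instanceof List) {
      List<?> list = (List<?>) value;
      return new ArrayList<>(list.subList(0, Math.min(list.size(), maxElements)));
    }
    return null; // non-array: drop entirely
  }
}
```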

Keys as classes, not enums

Keys are defined as an enum. Enums define a fixed set of values, yet we allow extensions to "extend" the set of keys. In this case, it is easier to define keys as plain objects rather than enum constants.

With this move, we can simplify key definitions. Enum values can't be instances of an arbitrary class, and with the additional responsibilities placed on keys, the enum structure forced quite a bit of redundant code. Changing keys to objects lets us use the usual code reuse techniques to avoid redundancy.
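As a hedged illustration of the code-reuse point (class names here are assumptions): shared behavior can live in an abstract base class, and each key is simply an instance of a subclass, something an enum cannot express without repeating code per constant.

```java
// Illustrative sketch: class-based keys share behavior through a base
// class; extensions create new instances rather than new enum values.
public class KeyAsClassSketch {
  public static abstract class AbstractKey {
    public final String name;
    AbstractKey(String name) { this.name = name; }
    // Shared responsibility, implemented once per key *kind*.
    public abstract Object mergeValues(Object oldValue, Object newValue);
  }

  public static class LongSumKey extends AbstractKey {
    public LongSumKey(String name) { super(name); }
    @Override public Object mergeValues(Object oldValue, Object newValue) {
      return (Long) oldValue + (Long) newValue;
    }
  }

  // A key is just an object; no enum needs to change to add one.
  public static final AbstractKey CPU_NANOS = new LongSumKey("cpuConsumed");
}
```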

Explicit accessors

The code to use the response context usually looks like this:

responseContext.add(ResponseContext.Key.CPU_CONSUMED_NANOS, cpuTimeNs);

That is rather verbose. And, if we wanted to capture the value for use in, say, metrics, we'd have to hunt down all the places that update the response context and add code to update metrics. This PR provides accessors instead:

responseContext.addCpuNanos(cpuTimeNs);

Release notes

This PR changes the ResponseContext and its keys in a breaking way. Any extension that used the prior way to add keys to the response context will need to change its code a bit. However, such changes should be simple.

The prior version of the response context suggested that keys be defined in an enum, then registered. This version suggests that keys be defined as objects, then registered. See the ResponseContext class itself for the details.

Note that this change does not impact any extension unless that extension adds custom response context keys.
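For extension authors, a registration flow along the lines the PR suggests might look like this sketch (the registry shape and method names are assumptions for illustration, not the exact ResponseContext API; see the ResponseContext class for the real details):

```java
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of object-based key registration: an extension
// builds a key instance, then registers it by name.
public class KeyRegistrySketch {
  public interface Key { String getName(); }

  private final ConcurrentMap<String, Key> registeredKeys = new ConcurrentSkipListMap<>();

  public void registerKey(Key key) {
    if (registeredKeys.putIfAbsent(key.getName(), key) != null) {
      throw new IllegalStateException("Duplicate key: " + key.getName());
    }
  }

  public Key keyOf(String name) {
    return registeredKeys.get(name); // null for unknown keys
  }
}
```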

Summary

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a local Druid cluster.
  • included a release note explanation (listed above).

@paul-rogers (Contributor, Author)

Build is complaining about lack of code coverage in QueryResource.java, where I replaced one get method with another. This seems to be a case where the code changed, and we're enforcing new rules that perhaps weren't in place when this code was first added. How do we handle this case?

@clintropolis (Member) left a comment

had a first pass; I think the changes look good overall, but I'm going to try to scan through again soon since I forgot how complicated the response context stuff was until I started this review and would like to double check 😅

@PublicApi
public abstract class ResponseContext
{
  /**
   * The base interface of a response context key.
   * Should be implemented by every context key.
   */
-  public interface BaseKey
+  public interface Key
Member

I tagged this PR with design review and release notes labels because of this change.

ResponseContext is marked @PublicApi, which I guess means this part of the code is extensible, so any custom extensions depending on BaseKey will need to be modified to now implement Key (which does not have default implementations, so it would need this label even if not renamed).

This is probably fine assuming that it is still possible for extensions to define custom response context information, just calling it out

Contributor (Author)

@clintropolis, you are right. As it turned out, none of the extensions in the Druid code repo itself actually had this issue, which suggested that, while this is an extension point, it may not actually have many (or any) extensions. So, the benefit of the simpler code (especially as we add the response trailer) seemed worth the risk.

I wonder, how do we normally communicate such potentially breaking changes?

Contributor

Hmm, I think this interface should be marked @ExtensionPoint. @PublicApi doesn't seem enough, because @PublicApi is not meant to be subclassed; it only means the API is stable for callers. @paul-rogers can you add the @ExtensionPoint annotation for this interface? Even though it's not marked, I agree that this change should be called out in the release notes.

Contributor (Author)

Added @ExtensionPoint. How do I record this info so it is available when we prepare release notes?

Contributor

The one you added in the PR description seems good enough to me 🙂

Contributor

What do you all think about tagging this with @UnstableApi too? I think it's a good idea, because:

  • This is not a common extension point
  • We may be changing it more in the future as part of additional query trailer work

So, thinking about compatibility in each of those changes will be a hassle without much benefit.

Contributor

It sounds good to me 👍

@jihoonson (Contributor) left a comment

This change LGTM overall. I left some comments about javadoc and test besides trivial comments. @paul-rogers can you please resolve the conflict as well?

Comment on lines 149 to 144
* Keys that are present in both the "X-Druid-Response-Context" header *and* the response context trailer.
*/
HEADER,
Contributor

What sort of keys will be in both the header and the trailer? Also, should it be something like HEADER_AND_TRAILER?

Contributor (Author)

That's a good question for the next PR. Here, it seems that, in picking out this set of changes, I picked the wrong comment. Fixed that.

Contributor

Thanks. I'm still curious about what keys will be in both the header and the trailer 🙂 Will the next PR need to modify this enum, such as adding a new Visibility type?

Contributor (Author)

In Gian's query trailer prototype, uncoveredIntervals, uncoveredIntervalsOverflowed, missingSegments, cpuConsumed, and a new metrics item are trailer entries. In the query profile work, the profile is also in the trailer only. The final set of values will be HEADER, TRAILER, HEADER_AND_TRAILER, and NONE.

Having just written this, it may make more sense to have two Booleans rather than an all-possible-combination enum. What do you think?

Contributor

Like isVisibleInHeader and isVisibleInTrailer? That sounds reasonable to me. You can do it in your follow-up PR if you want.
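For illustration only, the two-boolean shape under discussion might look like this sketch (names follow the comment above but are otherwise hypothetical):

```java
// Illustrative sketch of replacing the all-combinations Visibility
// enum with two independent flags on each key.
public class BooleanVisibilityKeySketch {
  public final String name;
  public final boolean isVisibleInHeader;
  public final boolean isVisibleInTrailer;

  public BooleanVisibilityKeySketch(String name, boolean inHeader, boolean inTrailer) {
    this.name = name;
    this.isVisibleInHeader = inHeader;
    this.isVisibleInTrailer = inTrailer;
  }
}
```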

Contributor (Author)

@jihoonson, I went ahead and converted Visibility to a Boolean, since I had to make another fix as well. For now, only the header flag appears; the next PR will add the trailer flag.

@paul-rogers (Contributor, Author)

Failing in 34294.18 (openjdk8) server module test due to previously-untested code in QueryResource.

Original line:

      if (prevEtag != null && prevEtag.equals(responseContext.get(ResponseContext.Key.ETAG))) {

Revised line:

      if (prevEtag != null && prevEtag.equals(responseContext.getEntityTag())) {

All we're doing is calling a method to get the ETAG value rather than grabbing it directly.

Error:

Diff coverage statistics:
------------------------------------------------------------------------------
|     lines      |    branches    |   functions    |   path
------------------------------------------------------------------------------
| 100% (2/2)     |  25% (1/4)     |  50% (1/2)     | org/apache/druid/server/QueryResource.java
------------------------------------------------------------------------------

ERROR: Insufficient branch coverage of 25% (1/4). Required 50%.

This is odd because QueryResource cannot be tested (AFAIK) in a unit test since we cannot run a Druid server in a unit test. The ETAG, I'm told, has to do with query caching. Presumably, we do test query caching somewhere - integration tests?

It seems this is a case where the original author was not obligated to ensure sufficient tests exist, but those of us who come along later are required to rectify that oversight? In code that, seemingly, is not designed to allow unit testing?

Suggestions?

@jihoonson (Contributor) left a comment

@paul-rogers thanks for addressing comments. You may not have pushed your last change. Please check #11828 (comment).

For the coverage check failure, I think we have 3 options similar to what I suggested in #11822 (comment).

  • Using your hack to work around the check.
  • Ignoring the test coverage check failure. This makes sense only when all tests in the phase 2 pass. I can trigger the phase 2 tests if we decide to go down this path.
  • Adding a new test. I prefer this option most since it doesn't seem very hard to do. I'm thinking of adding a new test in QueryResourceTest that verifies the behavior when the etag matches the previous one. Since this is a unit test, I would say you don't have to test the e2e behavior of the result-level cache here. Instead, you can just verify what is returned when the etag matches. For easy testing, you may want to mock QueryLifecycleFactory and QueryLifecycle, so that you don't have to worry about query processing at all.
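The etag short-circuit being tested here reduces to a tiny dependency-free sketch (class and method names are hypothetical, not QueryResource's actual signatures):

```java
// Hedged sketch of the behavior under test: if the client's
// If-None-Match etag equals the etag in the response context, return
// 304 Not Modified without running the query.
public class EtagCheckSketch {
  public static final int SC_NOT_MODIFIED = 304;
  public static final int SC_OK = 200;

  public static int respond(String prevEtag, String contextEtag) {
    if (prevEtag != null && prevEtag.equals(contextEtag)) {
      return SC_NOT_MODIFIED; // skip query processing entirely
    }
    return SC_OK;
  }
}
```

A unit test then only needs to drive both branches of this comparison, which is what the coverage checker is asking for.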

@jihoonson (Contributor) commented Nov 30, 2021

@paul-rogers nice test 🙂 I restarted the failed test which seems flaky. Besides the test failure, please check my comment in #11828 (comment). Other changes LGTM.

@clintropolis do you have more comments?

@paul-rogers (Contributor, Author)

Squashed commits and rebased on master to resolve merge conflicts.

@gianm (Contributor) left a comment

The refactoring looks good to me. It looks like the intent is not to actually change behavior right now, except for dropping certain keys from the response context header that didn't really make sense there anyway. (Some of them are clearly for internal state tracking; some of them couldn't possibly have been populated properly by the time the HTTP header needs to be generated.)

I had a few comments on individual pieces; please LMK what you think.


ResponseContext.Keys keys = ResponseContext.Keys.instance();
while (jp.currentToken() == JsonToken.FIELD_NAME) {
final ResponseContext.Key key = keys.keyOf(jp.getText());
Contributor

I think we'll want some handling here for the case where a data server generates a key that the Broker hasn't heard of. It may happen when we add new keys. Skipping them seems reasonable.

Contributor (Author)

@gianm, excellent point. This can occur during a rolling upgrade when, say, the historical is newer than the broker.

To handle this, I added code to "freewheel" over the JSON value, whether it be a scalar or structured, and added tests to match.

This seems like a generic operation, might we have something in one of the utilities? A quick search didn't find anything.
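The "freewheel" idea amounts to consuming one whole JSON value, scalar or structured. With Jackson, one would typically call jp.skipChildren() after advancing past the field name; the depth-counting equivalent can be sketched without any Jackson dependency (illustrative only, over a plain token list):

```java
import java.util.List;

// Sketch of skipping one complete JSON value: a scalar is one token; a
// structured value is consumed by tracking open/close nesting depth.
public class SkipValueSketch {
  /** Returns the index just past the value that starts at pos. */
  public static int skipValue(List<String> tokens, int pos) {
    String t = tokens.get(pos);
    if (!t.equals("{") && !t.equals("[")) {
      return pos + 1; // scalar: consume one token
    }
    int depth = 0;
    do {
      String cur = tokens.get(pos++);
      if (cur.equals("{") || cur.equals("[")) depth++;
      else if (cur.equals("}") || cur.equals("]")) depth--;
    } while (depth > 0);
    return pos;
  }
}
```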

Contributor

I don't see anything for this in JacksonUtils, which is probably where it would be if we had it. I guess it's because we usually deserialize using annotations rather than writing Deserializers directly, and the annotation-based deserialization mechanism has built-in logic to skip unknown properties.

@paul-rogers (Contributor, Author)

Tests got to the second stage (finally!), but failed in the Kafka test with an apparently unrelated issue:

org.apache.druid.java.util.common.ISE: Max number of retries[50] exceeded for Task[waiting for autoScaling task numbers from 1 to 2]. Failing.
	at org.apache.druid.testing.utils.ITRetryUtil.retryUntil(ITRetryUtil.java:94) ~[druid-integration-tests-0.23.0-SNAPSHOT.jar:0.23.0-SNAPSHOT]

@jihoonson (Contributor) left a comment

LGTM, but please rename the variable as in my comment.

 * would be called only from class static blocks.
 */
-  private static final ConcurrentMap<String, BaseKey> REGISTERED_KEYS = new ConcurrentSkipListMap<>();
+  private final ConcurrentMap<String, Key> registered_keys = new ConcurrentSkipListMap<>();
Contributor

Suggested change
-  private final ConcurrentMap<String, Key> registered_keys = new ConcurrentSkipListMap<>();
+  private final ConcurrentMap<String, Key> registeredKeys = new ConcurrentSkipListMap<>();

@abhishekagarwal87 (Contributor)

@jihoonson - is this PR good to merge now?

@jihoonson jihoonson merged commit 34a3d45 into apache:master Dec 7, 2021
@jihoonson (Contributor)

Merged. Thanks @paul-rogers!
