-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thread-safe QueryMetrics #6402
Thread-safe QueryMetrics #6402
Conversation
@@ -114,10 +113,10 @@ public Void call() | |||
{ | |||
try { | |||
if (bySegment) { | |||
input.run(threadSafeQueryPlus, responseContext) | |||
input.run(queryPlus.withQueryMetricsCopied(), responseContext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this allow silent metric skipping bugs? They could be recorded in the copied queryMetrics, but not finally reported in the "root" queryMetrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it could happen if the client of QueryMetrics forgot to emit. But I think the previous approach has the same problem. And this kind of bugs could be easily found and fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously there were no QueryMetrics object in threadSafeQueryPlus, it will throw NPE if somebody tries to emit some metrics by mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll revert changes relating to QueryPlus class. However, since QueryMetrics is now a thread-safe class, the old name withoutThreadUnsafeState() is out of date. I will use withoutQueryMetrics() instead.
builder.setDimension(dimension, value); | ||
synchronized (lock) { | ||
String oldValue = singleValueDims.put(dimension, value); | ||
if (oldValue != null && !oldValue.equals(value)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Equal values should not be tolerated. Any reassignment of already existing dimension or metric should be illegal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ServerManager.buildAndDecorateQueryRunner reassigns segment id in each MetricsEmittingQueryRunner. Do you think we should fix that or just tolerate it? I think allowing for equal values make no harm and is convenient in such cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing equal metrics could easily shallow mistakes. Metrics or dimensions could be equal because they happen to collide (especially in tests/staging environment), but logically they are different, that will hit in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that allowing equal metrics could shallow mistakes like multiple threads trying to report the same metrics, will prohibit such usage. But I think we should allow equal dimension values because there are valid use cases such as nesting MetricsEmittingQueryRunner. Do you think so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible to set dimensions from multiple threads. For example, MetricsEmittingQueryRunner adds custom dimension in one thread #L91 and other dimension in possibility another thread #L112.
It's also possible for one thread to set the same dimensions more than once. For instance, the query runner for historical segment processing contains two MetricsEmittingQueryRunner, which would set "segment" and "status" dimension twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there are cases for setting the same dimensions from multiple threads. But the overhead to prohibit such usages is a bit high for me. You needs to maintain the owner thread for each dimension. And I think the checks in reportMetric and emit are sufficient to catch most bugs.
processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java
Show resolved
Hide resolved
private void handleIllegalModification(String queryId, String entryName) | ||
{ | ||
Exception e = new Exception("stack trace"); | ||
log.makeAlert(e, "\"%s\" in QueryMetrics got modified to another value", entryName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should throw an IllegalStateException
, not just log it. Errors of this kind should remain very verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't throw exception here because I think bugs in metrics handling shouldn't fail the whole query. It would allow users continue to query with only "broken" metrics, which may be acceptable in some use cases. I admit that this may be controversial and would like to hear other committer's opinions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because metrics are "non-critical" it's doubly important to propagate errors related to them - otherwise problems with metrics will remain unnoticed for years.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will throw IllegalStateException instead
processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/QueryMetrics.java
Outdated
Show resolved
Hide resolved
Thanks @leventov for reviewing! I replied to some of the review comments, please check that. |
@leventov addressed most of your comments. Could you review it again? |
I think the failed test case |
processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/QueryMetrics.java
Outdated
Show resolved
Hide resolved
sql/src/test/java/org/apache/druid/sql/calcite/util/SpecificSegmentsQuerySegmentWalker.java
Show resolved
Hide resolved
@GuardedBy("lock") private final Map<String, String> singleValueDims = new HashMap<>(); | ||
@GuardedBy("lock") private final Map<String, String[]> multiValueDims = new HashMap<>(); | ||
@GuardedBy("lock") private final Map<String, Number> metrics = new HashMap<>(); | ||
@GuardedBy("lock") private Thread ownerThread; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should remain protected
for subclasses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subclasses are supposed to use setDimension/setDimensions/reportMetric methods (with proper synchronization and check) instead of directly manipulating on these fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean specifically the ownerThread
field.
@@ -92,7 +94,7 @@ | |||
* 100% guarantee of compatibility, because methods could not only be added to QueryMetrics, existing methods could also | |||
* be changed or removed. | |||
* | |||
* <p>QueryMetrics is designed for use from a single thread, implementations shouldn't care about thread-safety. | |||
* <p><b>All implementations of QueryMetrics should be Thread-safe.</b> | |||
* | |||
* | |||
* Adding new methods to QueryMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the requirements for all methods has changed, probably the name of this class should be changed to "ConcurrentQueryMetrics" to intentionally break proprietary implementations and force them to overhaul.
"DefaultQueryMetrics" OTOH I think could retain the name because it implemented thread-safety itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If needed, we can do the renaming in a separate PR. I want to keep this PR small and easy to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change integrity is more important. I would be understandable if the PR was of thousands of lines already, but it's much smaller and will still be relatively small after the rename.
.../src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByMergingQueryRunnerV2.java
Outdated
Show resolved
Hide resolved
@leventov sorry for the late reply. I have addressed previous round of review comments, could you check it again? |
if (ownerThread == null) { | ||
ownerThread = currThread; | ||
} else if (!ownerThread.equals(currThread)) { | ||
throw new ISE("emit() from multiple threads is not allowed. If it is needed to emit metrics from " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not formatted properly. Concatenated string arg should start at it's own line and in the end );
be on the separate line as well.
@@ -114,10 +113,10 @@ public Void call() | |||
{ | |||
try { | |||
if (bySegment) { | |||
input.run(threadSafeQueryPlus, responseContext) | |||
input.run(queryPlus.withoutQueryMetrics(), responseContext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to be explained now why queryPlus.withoutQueryMetrics()
is passed down. Maybe involves some research why queryPlus.withoutThreadUnsafeState()
used to be called in the first place.
* | ||
* 1. For all methods that add dimension: if the value of one dimension is updated to another value, throws | ||
* IllegalStateException. | ||
* 2. For all methods that add metrics: if one metric is regitered more than once, throws IllegalStateException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"registered"
/** | ||
* Returns the same QueryPlus object, if it doesn't have {@link QueryMetrics} ({@link #getQueryMetrics()} returns | ||
* null), or returns a new QueryPlus object with {@link Query} from this QueryPlus and null as QueryMetrics. | ||
* | ||
* When you pass QueryPlus to multiple threads, it is usually needed to strip off QueryMetrics component first so that | ||
* each thread will create its own QueryMetrics. See {@link ChainedExecutionQueryRunner} for an example. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QueryMetrics should be a javadoc link
@@ -116,7 +115,7 @@ public ChainedExecutionQueryRunner( | |||
public Iterable<T> call() | |||
{ | |||
try { | |||
Sequence<T> result = input.run(threadSafeQueryPlus, responseContext); | |||
Sequence<T> result = input.run(queryPlus.withoutQueryMetrics(), responseContext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see, there is at least one reason for doing this, is to not emit queryCpuTime metrics twice, see #1696 (comment). It works relying on a crazy combination of side effects, without a single comment. It requires a refactoring.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Fixes #4826 and #5128
It should be easy to use QueryMetrics from different threads for scenario like JDBC. The doc says "DefaultQueryMetrics's ownerThread should be reassigned explicitly", but it is not clear how best to do that.
I propose to make QueryMetrics thread safe so that it could be used in different threads with minimum effort. One case directly benefiting from the design is that the yielderOpenCloseExecutor hack could be removed. It also solves the above issues instantly.
leventov has pointed out one drawback of making QueryMetrics thread-safe
But I think it's also possible to catch bugs earlier with thread-safe QueryMetrics. For example, DefaultQueryMetrics can log an error if different values are added to the same dimension or metric.
@leventov @gianm @jon-wei what do you think?