Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread-safe QueryMetrics #6402

Closed
wants to merge 7 commits into from

Conversation

gaodayue
Copy link
Contributor

@gaodayue gaodayue commented Sep 29, 2018

Fixes #4826 and #5128

It should be easy to use QueryMetrics from different threads for scenario like JDBC. The doc says "DefaultQueryMetrics's ownerThread should be reassigned explicitly", but it is not clear how best to do that.

I propose to make QueryMetrics thread safe so that it could be used in different threads with minimum effort. One case directly benefiting from the design is that the yielderOpenCloseExecutor hack could be removed. It also solves the above issues instantly.

leventov has pointed out one drawback of making QueryMetrics thread-safe

making QueryMetrics thread-safe is inherently wrong thing to do. Instead of failing with ConcurrentModificationException, multiple threads will override metrics of each other. Nobody would understand what is going on and if the metrics are trustworthy or not.

But I think it's also possible to catch bugs earlier with thread-safe QueryMetrics. For example, DefaultQueryMetrics can log an error if different values are added to the same dimension or metric.

@leventov @gianm @jon-wei what do you think?

@@ -114,10 +113,10 @@ public Void call()
{
try {
if (bySegment) {
input.run(threadSafeQueryPlus, responseContext)
input.run(queryPlus.withQueryMetricsCopied(), responseContext)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this allow silent metric skipping bugs? They could be recorded in the copied queryMetrics, but not finally reported in the "root" queryMetrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it could happen if the client of QueryMetrics forgot to emit. But I think the previous approach has the same problem. And this kind of bugs could be easily found and fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously there were no QueryMetrics object in threadSafeQueryPlus, it will throw NPE if somebody tries to emit some metrics by mistake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll revert changes relating to QueryPlus class. However, since QueryMetrics is now a thread-safe class, the old name withoutThreadUnsafeState() is out of date. I will use withoutQueryMetrics() instead.

builder.setDimension(dimension, value);
synchronized (lock) {
String oldValue = singleValueDims.put(dimension, value);
if (oldValue != null && !oldValue.equals(value)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Equal values should not be tolerated. Any reassignment of already existing dimension or metric should be illegal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ServerManager.buildAndDecorateQueryRunner reassigns segment id in each MetricsEmittingQueryRunner. Do you think we should fix that or just tolerate it? I think allowing for equal values make no harm and is convenient in such cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing equal metrics could easily shallow mistakes. Metrics or dimensions could be equal because they happen to collide (especially in tests/staging environment), but logically they are different, that will hit in production.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that allowing equal metrics could shallow mistakes like multiple threads trying to report the same metrics, will prohibit such usage. But I think we should allow equal dimension values because there are valid use cases such as nesting MetricsEmittingQueryRunner. Do you think so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give a specific example where setting the same dimensions from multiple threads is needed? Both #4826 and #5128 are due to concurrent metrics reporting, not dimensions setting.

I don't want to allow setting dimensions from multiple threads until it proves to be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible to set dimensions from multiple threads. For example, MetricsEmittingQueryRunner adds custom dimension in one thread #L91 and other dimension in possibility another thread #L112.

It's also possible for one thread to set the same dimensions more than once. For instance, the query runner for historical segment processing contains two MetricsEmittingQueryRunner, which would set "segment" and "status" dimension twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there are cases for setting the same dimensions from multiple threads. But the overhead to prohibit such usages is a bit high for me. You needs to maintain the owner thread for each dimension. And I think the checks in reportMetric and emit are sufficient to catch most bugs.

private void handleIllegalModification(String queryId, String entryName)
{
Exception e = new Exception("stack trace");
log.makeAlert(e, "\"%s\" in QueryMetrics got modified to another value", entryName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should throw an IllegalStateException, not just log it. Errors of this kind should remain very verbose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't throw exception here because I think bugs in metrics handling shouldn't fail the whole query. It would allow users continue to query with only "broken" metrics, which may be acceptable in some use cases. I admit that this may be controversial and would like to hear other committer's opinions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because metrics are "non-critical" it's doubly important to propagate errors related to them - otherwise problems with metrics will remain unnoticed for years.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will throw IllegalStateException instead

@gaodayue
Copy link
Contributor Author

gaodayue commented Oct 1, 2018

Thanks @leventov for reviewing! I replied to some of the review comments, please check that.

@gaodayue
Copy link
Contributor Author

gaodayue commented Oct 8, 2018

@leventov addressed most of your comments. Could you review it again?

@gaodayue
Copy link
Contributor Author

gaodayue commented Oct 8, 2018

I think the failed test case RemoteTaskRunnerTest is irrelevant

@GuardedBy("lock") private final Map<String, String> singleValueDims = new HashMap<>();
@GuardedBy("lock") private final Map<String, String[]> multiValueDims = new HashMap<>();
@GuardedBy("lock") private final Map<String, Number> metrics = new HashMap<>();
@GuardedBy("lock") private Thread ownerThread;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should remain protected for subclasses

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subclasses are supposed to use setDimension/setDimensions/reportMetric methods (with proper synchronization and check) instead of directly manipulating on these fields

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean specifically the ownerThread field.

@@ -92,7 +94,7 @@
* 100% guarantee of compatibility, because methods could not only be added to QueryMetrics, existing methods could also
* be changed or removed.
*
* <p>QueryMetrics is designed for use from a single thread, implementations shouldn't care about thread-safety.
* <p><b>All implementations of QueryMetrics should be Thread-safe.</b>
*
*
* Adding new methods to QueryMetrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the requirements for all methods has changed, probably the name of this class should be changed to "ConcurrentQueryMetrics" to intentionally break proprietary implementations and force them to overhaul.

"DefaultQueryMetrics" OTOH I think could retain the name because it implemented thread-safety itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If needed, we can do the renaming in a separate PR. I want to keep this PR small and easy to review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change integrity is more important. I would be understandable if the PR was of thousands of lines already, but it's much smaller and will still be relatively small after the rename.

@gaodayue
Copy link
Contributor Author

@leventov sorry for the late reply. I have addressed previous round of review comments, could you check it again?

if (ownerThread == null) {
ownerThread = currThread;
} else if (!ownerThread.equals(currThread)) {
throw new ISE("emit() from multiple threads is not allowed. If it is needed to emit metrics from "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not formatted properly. Concatenated string arg should start at it's own line and in the end ); be on the separate line as well.

@@ -114,10 +113,10 @@ public Void call()
{
try {
if (bySegment) {
input.run(threadSafeQueryPlus, responseContext)
input.run(queryPlus.withoutQueryMetrics(), responseContext)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be explained now why queryPlus.withoutQueryMetrics() is passed down. Maybe involves some research why queryPlus.withoutThreadUnsafeState() used to be called in the first place.

*
* 1. For all methods that add dimension: if the value of one dimension is updated to another value, throws
* IllegalStateException.
* 2. For all methods that add metrics: if one metric is regitered more than once, throws IllegalStateException
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"registered"

/**
* Returns the same QueryPlus object, if it doesn't have {@link QueryMetrics} ({@link #getQueryMetrics()} returns
* null), or returns a new QueryPlus object with {@link Query} from this QueryPlus and null as QueryMetrics.
*
* When you pass QueryPlus to multiple threads, it is usually needed to strip off QueryMetrics component first so that
* each thread will create its own QueryMetrics. See {@link ChainedExecutionQueryRunner} for an example.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QueryMetrics should be a javadoc link

@@ -116,7 +115,7 @@ public ChainedExecutionQueryRunner(
public Iterable<T> call()
{
try {
Sequence<T> result = input.run(threadSafeQueryPlus, responseContext);
Sequence<T> result = input.run(queryPlus.withoutQueryMetrics(), responseContext);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see, there is at least one reason for doing this, is to not emit queryCpuTime metrics twice, see #1696 (comment). It works relying on a crazy combination of side effects, without a single comment. It requires a refactoring.

@stale
Copy link

stale bot commented Feb 28, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Feb 28, 2019
@stale
Copy link

stale bot commented Mar 7, 2019

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Mar 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants