Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timeout to shutdown request to middle manager for indexing service #1862

Merged
merged 2 commits into from Oct 27, 2015

Conversation

drcrallen
Copy link
Contributor

If the middle manager has a catastrophic failure, the overlord can hang here while holding TaskQueue giant.lock(), causing other indexing service TaskQueue to lock up waiting for it to free..

final StatusResponseHolder response = httpClient.go(
new Request(HttpMethod.POST, url),
RESPONSE_HANDLER
).get();
).get(config.getTaskCleanupTimeout().getMillis(), TimeUnit.MILLISECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be taskShutdownTimeout ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@drcrallen drcrallen added this to the 0.9.0 milestone Oct 26, 2015
@@ -81,6 +81,7 @@ The following configs only apply if the overlord is running in remote mode:
|`druid.indexer.runner.compressZnodes`|Indicates whether or not the overlord should expect middle managers to compress Znodes.|true|
|`druid.indexer.runner.maxZnodeBytes`|The maximum size Znode in bytes that can be created in Zookeeper.|524288|
|`druid.indexer.runner.taskCleanupTimeout`|How long to wait before failing a task after a middle manager is disconnected from Zookeeper.|PT15M|
'`druid.indexer.runner.taskShutdownLinkTimeout`|How long to wait on a shutdown request to a middle manager before timing out|PT1M|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing | at the beginning of the line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha! the ' and the | look very similar on my intellij. Fixed

@@ -405,8 +406,12 @@ public void shutdown(final String taskId)
log.error("Shutdown failed for %s! Are you sure the task was running?", taskId);
}
}
catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new RE(e, "Timed out posting shutdown to [%s] for task [%s]", url, taskId);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing this. incorrect message

@nishantmonu51
Copy link
Member

👍

@xvrl xvrl modified the milestones: 0.8.2, 0.9.0 Oct 26, 2015
@drcrallen
Copy link
Contributor Author

@xvrl I don't think this is a regression. So it doesn't have to block 0.8.2 (even though we'll probably roll it out internally very quickly)

@gianm
Copy link
Contributor

gianm commented Oct 26, 2015

@drcrallen is setting the readTimeout on the http client not good enough? That's supposed to help with this kind of thing, in a nicer way- the connections get closed rather than left dangling when the future times out.

@gianm
Copy link
Contributor

gianm commented Oct 26, 2015

The readTimeout should also be on by default, but I think the default is highish (5 or 15 minutes maybe)

@drcrallen
Copy link
Contributor Author

@gianm It would be nicer, but this uses a global http client, so setting that value would require setting the global timeout. There is no hook in the HttpClient interface to create a clone with a custom timeout. And since it uses connection pools underneath, is it possible to change timeouts for just one type of call while using the same client?

@drcrallen
Copy link
Contributor Author

@gianm I can cancel/interrupt the future if that would make it cleaner.

@gianm
Copy link
Contributor

gianm commented Oct 26, 2015

@drcrallen I don't think the NettyHttpClient does anything special if the future is canceled.

How do you feel about using per request timeouts for this patch? this change in http-client should make it possible: metamx/http-client#24

It should also work to set the global readTimeout lower, I think the overlord doesn't do anything over http that really warrants a long timeout. The long default timeout is mostly for the broker's benefit.

@drcrallen
Copy link
Contributor Author

@gianm That sounds good to me.

final StatusResponseHolder response = httpClient.go(
new Request(HttpMethod.POST, url),
RESPONSE_HANDLER
).get();
).get(config.getTaskShutdownLinkTimeout().getMillis(), TimeUnit.MILLISECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from metamx/http-client#24 , duration is supposed to go inside "go(...)" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will change when release of http-client is finished. (it is progressing right now)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@drcrallen
Copy link
Contributor Author

@xvrl / @gianm / @himanshug should be updated now

@@ -42,6 +42,10 @@
@Min(10 * 1024)
private long maxZnodeBytes = 512 * 1024;

@JsonProperty
@Min(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this Min thingy work on Periods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gooood question, probably not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed min annotation

final StatusResponseHolder response = httpClient.go(
new Request(HttpMethod.POST, url),
RESPONSE_HANDLER
RESPONSE_HANDLER,
config.getTaskShutdownLinkTimeout().toStandardDuration()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to have this happen on instantiation of the Config object, because not all Periods can become Durations, and fail-fast is good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually since this is a rare operation, it really should happen on instantiation, as otherwise users might not notice a misconfiguration for a long time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@gianm
Copy link
Contributor

gianm commented Oct 27, 2015

👍

@drcrallen
Copy link
Contributor Author

dumb http-client broke. fixing

gianm added a commit that referenced this pull request Oct 27, 2015
Add timeout to shutdown request to middle manager for indexing service
@gianm gianm merged commit 7df7370 into apache:master Oct 27, 2015
@drcrallen drcrallen deleted the indexingServiceMMGone branch October 27, 2015 21:39
xvrl added a commit that referenced this pull request Oct 27, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants