New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add timeout to shutdown request to middle manager for indexing service #1862
Conversation
final StatusResponseHolder response = httpClient.go( | ||
new Request(HttpMethod.POST, url), | ||
RESPONSE_HANDLER | ||
).get(); | ||
).get(config.getTaskCleanupTimeout().getMillis(), TimeUnit.MILLISECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be taskShutdownTimeout ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
6342cc3
to
173a074
Compare
@@ -81,6 +81,7 @@ The following configs only apply if the overlord is running in remote mode: | |||
|`druid.indexer.runner.compressZnodes`|Indicates whether or not the overlord should expect middle managers to compress Znodes.|true| | |||
|`druid.indexer.runner.maxZnodeBytes`|The maximum size Znode in bytes that can be created in Zookeeper.|524288| | |||
|`druid.indexer.runner.taskCleanupTimeout`|How long to wait before failing a task after a middle manager is disconnected from Zookeeper.|PT15M| | |||
'`druid.indexer.runner.taskShutdownLinkTimeout`|How long to wait on a shutdown request to a middle manager before timing out|PT1M| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing |
at the beginning of the line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ha! the '
and the |
look very similar on my intellij. Fixed
173a074
to
033712e
Compare
@@ -405,8 +406,12 @@ public void shutdown(final String taskId) | |||
log.error("Shutdown failed for %s! Are you sure the task was running?", taskId); | |||
} | |||
} | |||
catch (InterruptedException e) { | |||
Thread.currentThread().interrupt(); | |||
throw new RE(e, "Timed out posting shutdown to [%s] for task [%s]", url, taskId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing this. incorrect message
👍 |
033712e
to
2e3d5dd
Compare
@xvrl I don't think this is a regression. So it doesn't have to block 0.8.2 (even though we'll probably roll it out internally very quickly) |
@drcrallen is setting the readTimeout on the http client not good enough? That's supposed to help with this kind of thing, in a nicer way- the connections get closed rather than left dangling when the future times out. |
The readTimeout should also be on by default, but I think the default is highish (5 or 15 minutes maybe) |
@gianm It would be nicer, but this uses a global http client, so setting that value would require setting the global timeout. There is no hook in the |
@gianm I can cancel/interrupt the future if that would make it cleaner. |
@drcrallen I don't think the NettyHttpClient does anything special if the future is canceled. How do you feel about using per request timeouts for this patch? this change in http-client should make it possible: metamx/http-client#24 It should also work to set the global readTimeout lower, I think the overlord doesn't do anything over http that really warrants a long timeout. The long default timeout is mostly for the broker's benefit. |
@gianm That sounds good to me. |
final StatusResponseHolder response = httpClient.go( | ||
new Request(HttpMethod.POST, url), | ||
RESPONSE_HANDLER | ||
).get(); | ||
).get(config.getTaskShutdownLinkTimeout().getMillis(), TimeUnit.MILLISECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from metamx/http-client#24 , duration is supposed to go inside "go(...)" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will change when release of http-client is finished. (it is progressing right now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
2e3d5dd
to
8f73736
Compare
@xvrl / @gianm / @himanshug should be updated now |
@@ -42,6 +42,10 @@ | |||
@Min(10 * 1024) | |||
private long maxZnodeBytes = 512 * 1024; | |||
|
|||
@JsonProperty | |||
@Min(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this Min thingy work on Periods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gooood question, probably not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed min annotation
8f73736
to
060dc8d
Compare
final StatusResponseHolder response = httpClient.go( | ||
new Request(HttpMethod.POST, url), | ||
RESPONSE_HANDLER | ||
RESPONSE_HANDLER, | ||
config.getTaskShutdownLinkTimeout().toStandardDuration() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to have this happen on instantiation of the Config object, because not all Periods can become Durations, and fail-fast is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually since this is a rare operation, it really should happen on instantiation, as otherwise users might not notice a misconfiguration for a long time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
060dc8d
to
44a2b20
Compare
👍 |
dumb http-client broke. fixing |
Add timeout to shutdown request to middle manager for indexing service
If the middle manager has a catastrophic failure, the overlord can hang here while holding
TaskQueue
giant.lock()
, causing other indexing serviceTaskQueue
to lock up waiting for it to free..