[STORM-3587] Allow Scheduler futureTask to gracefully exit before TimeoutException. #3212
Ethanlm merged 3 commits into apache:master
…er message on timeout.
      () -> finalRasStrategy.schedule(toSchedule, td));
  try {
-     result = schedulingFuture.get(schedulingTimeoutSeconds, TimeUnit.SECONDS);
+     result = schedulingFuture.get(schedulingTimeoutSeconds + 1, TimeUnit.SECONDS);
The strategy checks the timeout to determine when to terminate. However, if the FutureTask is killed at or around the same time, the task gets a TimeoutException and the result is not propagated back to the caller. The +1 gives an additional second before the FutureTask is rudely terminated, which allows the result to be returned and examined for the actual message it carries.
I think we should discuss whether we want to do this or not. Essentially this applies to every timeout. If "timeout = x seconds" in Storm means things will fail/time out at x+1 seconds, then everywhere we have timeout configs we need the +1 to keep the semantics consistent. I don't think this is really necessary.
This should not apply to every timeout. If the scheduled task is a cooperating task that uses the same timeout to decide when to stop, then the scheduler ends up interrupting the FutureTask before the task is allowed to exit gracefully and return a result.
If the scheduled task is a non-cooperating task (i.e., not using the timeout), then it is fine to use the specified number.
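To make the cooperating case concrete, here is a minimal, self-contained sketch. It is not Storm code: `CooperativeTimeoutDemo`, `cooperatingTask`, and `runWithMargin` are hypothetical names, and the millisecond values are arbitrary. The task polices its own time limit and returns a message, and the caller waits for the limit plus a margin so that the message is not lost to a `TimeoutException`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CooperativeTimeoutDemo {

    /** Hypothetical stand-in for a strategy that checks its own time limit. */
    static String cooperatingTask(long limitMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(limitMs);
        while (System.nanoTime() - deadline < 0) {
            Thread.sleep(1); // stand-in for one unit of search work
        }
        // The task exits on its own and reports why it stopped.
        return "no solution found within " + limitMs + " ms";
    }

    /** Waits limitMs + marginMs so the task can return its own result. */
    static String runWithMargin(long limitMs, long marginMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = executor.submit(() -> cooperatingTask(limitMs));
            try {
                return future.get(limitMs + marginMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Only expected if the task overruns its own limit (i.e. a bug).
                future.cancel(true);
                return "TimeoutException: result lost";
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runWithMargin(100, 1000));
    }
}
```

With `marginMs` set to 0, the caller and the task race for the same deadline, and either outcome is possible. That race is the situation described above.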
I probably misunderstood. Could you point out where else we are using SCHEDULING_TIMEOUT_SECONDS_PER_TOPOLOGY? It looks to me like this is the only place schedulingTimeoutSeconds is used. I don't see any cooperation.
I see your point.
What I mean by a cooperating process is that the strategy is a time-bound task, part of the same code base and running in the same JVM, so there should never be a need to kill a FutureTask except as a precaution against a bug introduced inadvertently.
The current ConstraintSolver uses a different (but redundant) config variable for its time limit, which happens to be set to the same default value.
In light of this, it may be better to explicitly pass a "max" time limit to the constraint solver, determine how large the margin needs to be, and then add that margin to the FutureTask timeout. Note that this extra margin (and the TimeoutException) should only come into play in the exceptional case where there is a bug in ConstraintSolver. Normally it will/should exit by the timeout duration.
And the result should be available.
It makes sense to me. Thanks.
Can you please add a brief comment about the purpose of the +1, so that future me will not be surprised when I come back to this? Thanks.
Something like "Allow the Scheduler futureTask to gracefully exit" is good enough for me.
Pushed the change into ConstraintSolverStrategy, where there is millisecond granularity, so that it avoids hitting the ceiling. Removed the +1 from ResourceAwareScheduler.
  if (cluster != null) {
      cluster.setStatus(topo.getId(), "Scheduling Attempted but topology is invalid");
      if (msg == null) {
          msg = "Scheduling Attempted but topology is invalid";
Failing to schedule does not necessarily mean the topology is invalid.
That message is a generic default, the same as the prior default. I believe there is one other caller of this method.
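As a rough illustration of the fallback being discussed, the sketch below keeps a caller-supplied message when there is one and falls back to the generic default only when the message is null. `FailureMessage` and `statusOrDefault` are hypothetical names for illustration, not part of the Storm API. Only the default string is taken from the diff.

```java
public class FailureMessage {

    // Default text copied from the diff above.
    static final String DEFAULT_MESSAGE = "Scheduling Attempted but topology is invalid";

    /** Returns the caller's message, or the generic default when none was given. */
    static String statusOrDefault(String msg) {
        return msg == null ? DEFAULT_MESSAGE : msg;
    }

    public static void main(String[] args) {
        System.out.println(statusOrDefault(null));
        System.out.println(statusOrDefault("Scheduling timed out"));
    }
}
```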
  } else { //Any other failure result
      //The assumption is that the strategy set the status...
      topologySubmitter.markTopoUnsuccess(td, cluster);
      String msg = "";
This can be replaced by result.toString()
…t DaemonConfig.SCHEDULING_TIMEOUT_SECONDS_PER_TOPOLOGY seconds and set its own maximum time to be at most 200 ms before.
ResourceAwareScheduler creates a FutureTask with the timeout specified in DaemonConfig.
ConstraintSolverStrategy uses another configuration variable to determine when to stop its effort. Limit this value so that the strategy terminates slightly before the TimeoutException would fire. This graceful exit allows the result (and its error) to be available in ResourceAwareScheduler.
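The clamping described in these commit messages could look roughly like the sketch below. It is an assumption-laden illustration, not the merged code: `SolverTimeLimit`, `effectiveSolverLimitMs`, and `EXIT_MARGIN_MS` are hypothetical names, and the 200 ms margin is taken from the commit message above. The idea is to convert both limits to milliseconds and cap the solver's own limit slightly below the scheduler's FutureTask timeout.

```java
import java.util.concurrent.TimeUnit;

public class SolverTimeLimit {

    // Assumed margin, based on the "200 ms" mentioned in the commit message.
    static final long EXIT_MARGIN_MS = 200;

    /**
     * Caps the solver's configured limit so it ends at least EXIT_MARGIN_MS
     * before the scheduler's timeout. Both inputs are in seconds.
     */
    static long effectiveSolverLimitMs(int solverLimitSecs, int schedulingTimeoutSecs) {
        long configuredMs = TimeUnit.SECONDS.toMillis(solverLimitSecs);
        long ceilingMs = TimeUnit.SECONDS.toMillis(schedulingTimeoutSecs) - EXIT_MARGIN_MS;
        // Never return a negative limit, even for very small scheduler timeouts.
        return Math.max(0, Math.min(configuredMs, ceilingMs));
    }

    public static void main(String[] args) {
        System.out.println(effectiveSolverLimitMs(60, 60)); // prints 59800
        System.out.println(effectiveSolverLimitMs(10, 60)); // prints 10000
    }
}
```

Working in milliseconds is what lets the margin be smaller than the one-second step that the earlier +1 approach required.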