Skip to content

[STORM-3587] Allow Scheduler futureTask to gracefully exit before TimeoutException.#3212

Merged
Ethanlm merged 3 commits intoapache:masterfrom
bipinprasad:storm3587
Mar 13, 2020
Merged

[STORM-3587] Allow Scheduler futureTask to gracefully exit before TimeoutException.#3212
Ethanlm merged 3 commits intoapache:masterfrom
bipinprasad:storm3587

Conversation

@bipinprasad
Copy link
Contributor

@bipinprasad bipinprasad commented Feb 21, 2020

ResourceAwareScheduler creates a FutureTask with timeout specified in DaemonConfig.

ConstraintSolverStrategy uses the the another configuration variable to determine when to terminate its effort. Limit this value so that it terminates at most slightly before TimeoutException. This graceful exit allows result (and its error) to be available in ResourceAwareScheduler.

Copy link
Contributor

@kishorvpatil kishorvpatil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

() -> finalRasStrategy.schedule(toSchedule, td));
try {
result = schedulingFuture.get(schedulingTimeoutSeconds, TimeUnit.SECONDS);
result = schedulingFuture.get(schedulingTimeoutSeconds + 1, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need + 1

Copy link
Contributor Author

@bipinprasad bipinprasad Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout is checked in strategy to determine when to terminate. However, if the future task is killed at or around the same time - this results in a TimeOut exception on the task and the result is not propagated back to the caller. +1 gives an additional second before the FutureTask is rudely terminated and allows the result to be returned and examined for the actual message in the result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should discuss about whether we want to do this or not. Essentially this applies to every timeout. If our "timeout=x second" in Storm means things will fail/time out at x+1 seconds, then everywhere with timeout configs, we need +1 to make the semantic consistent. I think this is not very necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not apply to every timeout. If the scheduled task is a cooperating task that is also using the same timeout to determine when to stop, then we have this situation where scheduler is interrupting the FutureTask before the task is allowed to gracefully exit and return a result.

If the scheduled task is a non-cooperating task (i.e. .not using the timeout), then it is fine to use the specified number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably I misunderstood. Could you point out to me where we are using SCHEDULING_TIMEOUT_SECONDS_PER_TOPOLOGY too? It looks to me here is the only place the schedulingTimeoutSeconds is being used. I don't see cooperations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point.

When I think about cooperating process, is that the strategy is a time-bound task and part of the same code base running in the same JVM - so there never should be a a need to kill a FutureTask except as a precaution against bug introduced inadvertently.

Current ConstraintSolver uses a different (but redundant) config variable for time limit - which is accidentally set to the same default value.

In light of this - it may be better to explicitly pass "max" time limit to the constraint solver. And then determine how much the margin needs to be, and then add the margin to the FutureTask timeout. Note that this extra margin (and the timeout exception should only happen in exceptional case when there is bug in ConstraintSolver. Normally it will/should exit by the timeout duration.
And the result should be available.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me. Thanks.

Copy link
Contributor

@Ethanlm Ethanlm Feb 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add some brief comments about the purpose of +1 so future me will not be surprised when I come back to this. Thanks.
Something like Allow the Scheduler futureTask to gracefully exit is good enough for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed the change into ConstrainstSolverStrategy where there is millisecond granularity and avoid hitting the ceiling. Removed +1 from ResourceAwareStrategy.

if (cluster != null) {
cluster.setStatus(topo.getId(), "Scheduling Attempted but topology is invalid");
if (msg == null) {
msg = "Scheduling Attempted but topology is invalid";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failed to schedule does not necessarily the topology is invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That message is a generic default - same as prior default. I believe there is one other caller to this method.

} else { //Any other failure result
//The assumption is that the strategy set the status...
topologySubmitter.markTopoUnsuccess(td, cluster);
String msg = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be replaced by result.toString()

} else { //Any other failure result
//The assumption is that the strategy set the status...
topologySubmitter.markTopoUnsuccess(td, cluster);
String msg = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be replaced by result.toString()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

Bipin Prasad added 2 commits February 27, 2020 14:17
…t DaemonConfig.SCHEDULING_TIMEOUT_SECONDS_PER_TOPOLOGY seconds and set it own maximum time to be at most 200 ms before.
@bipinprasad bipinprasad changed the title [STORM-3587] Allow Scheduler futureTask to gracefully exit with higher timeout. [STORM-3587] Allow Scheduler futureTask to gracefully exit before TimeoutException. Mar 3, 2020
Copy link
Contributor

@Ethanlm Ethanlm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
Thanks for your patience

@Ethanlm Ethanlm merged commit 4625770 into apache:master Mar 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants