[STORM-3587] Allow Scheduler futureTask to gracefully exit before TimeoutException. by bipinprasad · Pull Request #3212 · apache/storm

bipinprasad · 2020-02-21T20:36:46Z

ResourceAwareScheduler creates a FutureTask with timeout specified in DaemonConfig.

ConstraintSolverStrategy uses another configuration variable to determine when to terminate its effort. Limit this value so that it terminates slightly before the TimeoutException would be thrown. This graceful exit allows the result (and its error) to be available in ResourceAwareScheduler.

kishorvpatil

👍

Ethanlm · 2020-02-21T21:41:35Z

storm-server/src/main/java/org/apache/storm/scheduler/resource/ResourceAwareScheduler.java

                        () -> finalRasStrategy.schedule(toSchedule, td));
                    try {
-                        result = schedulingFuture.get(schedulingTimeoutSeconds, TimeUnit.SECONDS);
+                        result = schedulingFuture.get(schedulingTimeoutSeconds + 1, TimeUnit.SECONDS);


Why do we need the + 1?

The strategy checks the timeout to determine when to terminate. However, if the FutureTask is killed at or around the same time, this results in a TimeoutException and the result is not propagated back to the caller. The +1 gives an additional second before the FutureTask is rudely terminated, which allows the result to be returned and examined for the actual message.
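The situation described above can be sketched in isolation. This is a minimal illustration, not the Storm code: `cooperativeTask` and `runWithTimeout` are hypothetical names, and the task only pretends to work until its own deadline. It shows that when the caller's wait expires before the task's own limit, the caller sees only a timeout, while a slightly longer wait lets the task's own result come back.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class GracefulTimeoutSketch {

    /** Hypothetical stand-in for a strategy that polls its own deadline, then reports why it stopped. */
    static String cooperativeTask(long selfLimitMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(selfLimitMs);
        while (System.nanoTime() < deadline) {
            Thread.sleep(5); // pretend to search for a solution
        }
        return "gave up after " + selfLimitMs + " ms";
    }

    /** Waits at most waitMs for the task; returns its result, or "TIMEOUT" if the wait expired first. */
    static String runWithTimeout(long selfLimitMs, long waitMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(() -> cooperativeTask(selfLimitMs));
            return future.get(waitMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "TIMEOUT"; // the task's own message never reaches the caller
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow(); // interrupts the worker if it is still running
        }
    }

    public static void main(String[] args) {
        // Caller's wait is shorter than the task's own limit: the result is lost.
        System.out.println(runWithTimeout(300, 100));
        // Caller's wait leaves headroom: the task exits on its own and its message is returned.
        System.out.println(runWithTimeout(100, 1000));
    }
}
```

When both limits are equal, which of the two outcomes occurs depends on thread timing, which is the race this comment is describing.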

I think we should discuss whether we want to do this or not. Essentially this applies to every timeout. If "timeout=x seconds" in Storm means things will fail/time out at x+1 seconds, then everywhere with timeout configs we need +1 to keep the semantics consistent. I don't think this is really necessary.

It should not apply to every timeout. If the scheduled task is a cooperating task that also uses the same timeout to determine when to stop, then we have this situation where the scheduler interrupts the FutureTask before the task is allowed to exit gracefully and return a result.

If the scheduled task is a non-cooperating task (i.e. not using the timeout), then it is fine to use the specified number.

Probably I misunderstood. Could you point out to me where else we are using SCHEDULING_TIMEOUT_SECONDS_PER_TOPOLOGY? It looks to me like this is the only place schedulingTimeoutSeconds is being used. I don't see any cooperation.

I see your point.

When I think about a cooperating process, I mean that the strategy is a time-bound task, part of the same code base and running in the same JVM, so there should never be a need to kill a FutureTask except as a precaution against an inadvertently introduced bug.

The current ConstraintSolver uses a different (but redundant) config variable for its time limit, which happens to be set to the same default value.

In light of this, it may be better to explicitly pass a "max" time limit to the constraint solver, then determine how large the margin needs to be and add that margin to the FutureTask timeout. Note that this extra margin (and the timeout exception) should only come into play in the exceptional case when there is a bug in ConstraintSolver. Normally it will/should exit within the timeout duration, and the result should be available.

It makes sense to me. Thanks.

Can you please add a brief comment about the purpose of the +1 so future me will not be surprised when I come back to this? Thanks.
Something like "Allow the Scheduler futureTask to gracefully exit" is good enough for me.

Pushed the change into ConstraintSolverStrategy, where there is millisecond granularity, to avoid hitting the ceiling. Removed the +1 from ResourceAwareScheduler.
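As a rough sketch of the approach that was pushed, the solver's own limit can be capped in milliseconds so that it always ends before the scheduler's FutureTask timeout. The class and method names below are hypothetical and this is not the actual ConstraintSolverStrategy code; the 200 ms margin is taken from the later commit message in this PR ("at most 200 ms before").

```java
class SolverTimeLimitSketch {

    /** Assumed safety margin, based on the "200 ms" wording of the commit message in this PR. */
    static final long MARGIN_MS = 200L;

    /**
     * Returns the time limit (in ms) the solver should give itself:
     * its configured limit, but never later than the scheduler timeout minus the margin.
     */
    static long effectiveLimitMs(long solverLimitSecs, long schedulingTimeoutSecs) {
        long solverLimitMs = solverLimitSecs * 1000L;
        // Never go below zero if the scheduler timeout is shorter than the margin.
        long ceilingMs = Math.max(0L, schedulingTimeoutSecs * 1000L - MARGIN_MS);
        return Math.min(solverLimitMs, ceilingMs);
    }

    public static void main(String[] args) {
        // Both configs at the same value: the solver stops 200 ms early.
        System.out.println(effectiveLimitMs(60, 60)); // prints 59800
        // Solver limit already well below the scheduler timeout: unchanged.
        System.out.println(effectiveLimitMs(10, 60)); // prints 10000
    }
}
```

Working in milliseconds keeps the visible timeout semantics at x seconds for the scheduler, while the cooperating task gives up slightly earlier.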

Ethanlm · 2020-02-21T21:41:58Z

storm-server/src/main/java/org/apache/storm/scheduler/resource/User.java

        if (cluster != null) {
-            cluster.setStatus(topo.getId(), "Scheduling Attempted but topology is invalid");
+            if (msg == null) {
+                msg = "Scheduling Attempted but topology is invalid";


Failing to schedule does not necessarily mean the topology is invalid.

That message is a generic default, the same as the prior default. I believe there is one other caller of this method.
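The fallback behaviour discussed here can be sketched as follows. This is an illustration only: `markUnsuccess` and the map-backed status store are hypothetical stand-ins, not the real `User` or cluster API. A caller that has a specific failure message passes it in, and a caller that passes `null` gets the previous generic text.

```java
import java.util.HashMap;
import java.util.Map;

class StatusMessageSketch {

    /** The generic default text shown in the diff above. */
    static final String DEFAULT_MSG = "Scheduling Attempted but topology is invalid";

    /** Hypothetical stand-in for the cluster's per-topology status storage. */
    static final Map<String, String> statusByTopoId = new HashMap<>();

    /** Records a failure status, falling back to the generic default when no message is given. */
    static void markUnsuccess(String topoId, String msg) {
        if (msg == null) {
            msg = DEFAULT_MSG;
        }
        statusByTopoId.put(topoId, msg);
    }

    public static void main(String[] args) {
        markUnsuccess("topo-1", null);
        markUnsuccess("topo-2", "Scheduling timed out");
        System.out.println(statusByTopoId.get("topo-1"));
        System.out.println(statusByTopoId.get("topo-2"));
    }
}
```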

Ethanlm · 2020-02-27T15:32:38Z

storm-server/src/main/java/org/apache/storm/scheduler/resource/ResourceAwareScheduler.java

                    } else { //Any other failure result
                        //The assumption is that the strategy set the status...
-                        topologySubmitter.markTopoUnsuccess(td, cluster);
+                        String msg = "";


This can be replaced by result.toString()

Ethanlm

+1
Thanks for your patience

[STORM-3587] Allow Scheduler futureTask to gracefully exit and regist…

e5ff537

…er message on timeout.

kishorvpatil approved these changes Feb 21, 2020

Ethanlm reviewed Feb 21, 2020

Ethanlm reviewed Feb 27, 2020

Bipin Prasad added 2 commits February 27, 2020 14:17

[STORM-3587] Use result.toString() instead of reconstructing error me…

f282df8

…ssage.

[STORM-3587] Change ConstraintSolverStrategy to expect to be killed a…

3a00cd4

…t DaemonConfig.SCHEDULING_TIMEOUT_SECONDS_PER_TOPOLOGY seconds and set it own maximum time to be at most 200 ms before.

bipinprasad changed the title ~~[STORM-3587] Allow Scheduler futureTask to gracefully exit with higher timeout.~~ [STORM-3587] Allow Scheduler futureTask to gracefully exit before TimeoutException. Mar 3, 2020

Ethanlm approved these changes Mar 3, 2020

Ethanlm merged commit 4625770 into apache:master Mar 13, 2020
