Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No recursion in dispatch error handling #1922

Merged

Conversation

juhoautio
Copy link
Collaborator

@juhoautio juhoautio commented Aug 14, 2018

The code had a recursion loop like this:

selectExecutorAndDispatchFlow
  -> handleDispatchExceptionCase
    -> selectExecutorAndDispatchFlow
      -> ..

..with max depth of activeExecutors.size()

Iterative approach is easier to reason about, and is totally safe against StackOverflowError.

This is also preliminary refactoring to eventually allow configuring higher max for dispatching errors. Currently max dispatch attempts are capped by the total number of active executors (because each executor is only tried once at most). In the future I'd like to remove that limit and rely solely on the MAX_DISPATCHING_ERRORS_PERMITTED property.

@codecov
Copy link

codecov bot commented Aug 14, 2018

Codecov Report

Merging #1922 into master will decrease coverage by 0.01%.
The diff coverage is 78.12%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1922      +/-   ##
============================================
- Coverage     32.95%   32.94%   -0.02%     
+ Complexity     2656     2655       -1     
============================================
  Files           409      409              
  Lines         29257    29258       +1     
  Branches       3721     3721              
============================================
- Hits           9643     9640       -3     
- Misses        18731    18734       +3     
- Partials        883      884       +1
Impacted Files Coverage Δ Complexity Δ
...rc/main/java/azkaban/executor/ExecutorManager.java 44.97% <78.12%> (-0.05%) 53 <0> (ø)
...erver/src/main/java/azkaban/execapp/JobRunner.java 69.76% <0%> (-0.87%) 85% <0%> (-2%)
.../main/java/azkaban/utils/ExecutorServiceUtils.java 100% <0%> (+10%) 5% <0%> (+1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b20d5d0...21c596a. Read the comment docs.

@juhoautio juhoautio force-pushed the no_recursive_dispatch_handling branch from 82c9a19 to e7f8b3f Compare August 14, 2018 10:08
@coveralls
Copy link

coveralls commented Aug 14, 2018

Pull Request Test Coverage Report for Build 4046

  • 28 of 35 (80.0%) changed or added relevant lines in 1 file are covered.
  • 7 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.03%) to 36.063%

Changes Missing Coverage Covered Lines Changed/Added Lines %
azkaban-common/src/main/java/azkaban/executor/ExecutorManager.java 28 35 80.0%
Files with Coverage Reduction New Missed Lines %
azkaban-common/src/main/java/azkaban/jobExecutor/AbstractProcessJob.java 2 60.71%
azkaban-exec-server/src/main/java/azkaban/execapp/JobRunner.java 5 76.03%
Totals Coverage Status
Change from base Build 4040: -0.03%
Covered Lines: 10500
Relevant Lines: 29116

💛 - Coveralls

@juhoautio juhoautio force-pushed the no_recursive_dispatch_handling branch from e7f8b3f to d7339da Compare August 16, 2018 06:03
@juhoautio
Copy link
Collaborator Author

@chengren311 thanks for checking in the unit tests. How about this one now?

@juhoautio
Copy link
Collaborator Author

@chengren311 @kunkun-tang @HappyRay could you consider this, please?

@juhoautio
Copy link
Collaborator Author

Or @jamiesjc ?

logger.error("Failed to dispatch queued execution " + exflow.getId() + " because "
+ giveUpReason);
finalizeFlows(exflow);
// FAILED TOO MANY TIMES - exit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FAILED TOO MANY TIMES or no more executors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True that. I thought this comment kind of covers both.. but not as clearly as it could. I'd like to change the comment to: "GIVE UP DISPATCHING - exit". Ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks.

Set<Executor> remainingExecutors) {
if (reference.getNumErrors() >= this.maxDispatchingErrors) {
return "reached " + ConfigurationKeys.MAX_DISPATCHING_ERRORS_PERMITTED
+ " (tried " + reference.getNumErrors() + " executors)";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried XX attempts instead of executors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Executors" is correct currently, because each available executor is only tried once at most – if all have been tried, give up. Also, this loop is about dispatching a single execution.

logger.error("Failed to dispatch queued execution " + exflow.getId() + " because "
+ giveUpReason);
finalizeFlows(exflow);
// GIVE UP DISPATCHING - exit
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chengren311 comment better now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

selectedExecutor, exflow.getExecutionId()), e);
handleDispatchExceptionCase(reference, exflow, selectedExecutor,
availableExecutors);
while (true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use for loop here? my concern for while(true) is it's exposed to the risk of infinite loop if some exit condition is not well defined somehow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chengren311 I changed while loop to for.

The code had a recursion loop like this:

selectExecutorAndDispatchFlow
  -> handleDispatchExceptionCase
    -> selectExecutorAndDispatchFlow
      -> ..

..with max depth of activeExecutors.size()

Iterative approach is easier to reason about, and is totally safe against StackOverflowError.

This is also preliminary refactoring to eventually allow configuring higher max for dispatching errors. Currently, however, max dispatch attempts is capped by the total number of active executors (because each executor is only tried once). In the future I'd like to remove that limit and rely solely on the MAX_DISPATCHING_ERRORS_PERMITTED property.
@burgerkingeater burgerkingeater merged commit 82fa690 into azkaban:master Sep 13, 2018
@burgerkingeater
Copy link
Contributor

thanks! @juhoautio

@juhoautio
Copy link
Collaborator Author

Thank you!!

@juhoautio juhoautio deleted the no_recursive_dispatch_handling branch September 13, 2018 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants