Add missing trigger for failed-to-start nodes #13802
Merged
SUMMARY
Connect #2766
For the same scenario, the numbers with this patch are:
Started 4/5/2023, 8:40:01 AM
Finished 4/5/2023, 8:40:02 AM
So this is 1 second, compared to 50 seconds before the patch.
Looking at the code, I tried to pin down exactly which scenario was missing the trigger. I believe it's the case where the start checks fail. In that scenario we never end up waiting for the job to finish (the job never starts in the first place), so unless we re-schedule right away, we run out the workflow manager scheduler's timer. That is the source of the 50 seconds we were hitting before.
This is a very simple patch, and I don't see any risk of over-scheduling. Spawning a node and failing to start it is a processing action that corresponds to the completion of a node. In the general sense, we do need to worry about infinite scheduling loops, but as long as every schedule call corresponds to a tangible and finite form of progress for processing jobs, that can't happen.
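To illustrate the pattern (this is a hedged sketch, not the actual AWX code — the class, queue, and status fields here are hypothetical stand-ins): when a node fails its start checks, no job-completion event will ever arrive, so the spawn path itself must signal the scheduler rather than leaving it to the periodic timer.

```python
import queue


class WorkflowManager:
    """Hypothetical sketch of the scheduling trigger described above."""

    def __init__(self):
        # Pending requests for the scheduler to run another pass.
        self.schedule_requests = queue.Queue()

    def schedule(self, reason):
        # Wake the scheduler immediately instead of waiting for its timer.
        self.schedule_requests.put(reason)

    def spawn_node(self, node, start_checks_pass):
        if not start_checks_pass:
            # The node never starts, so no job-completion event will fire.
            # Without this trigger, the scheduler would only wake up again
            # when its periodic timer runs out (the ~50s observed above).
            node["status"] = "failed"
            self.schedule("node_failed_to_start")  # the added trigger
            return False
        node["status"] = "running"
        return True


mgr = WorkflowManager()
node = {"name": "demo", "status": "pending"}
mgr.spawn_node(node, start_checks_pass=False)
print(node["status"])                      # failed
print(mgr.schedule_requests.get_nowait())  # node_failed_to_start
```

The key property is that each `schedule()` call is tied to a node reaching a terminal state, so the number of extra scheduler passes is bounded by the number of nodes — there is no path to an infinite scheduling loop.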
ISSUE TYPE
COMPONENT NAME