Closed
Description
Seen in the wild: an event handler that does not appear to respect its timeout.
- Workflow stalled.
- The "workflow stalled" event handler was run and timed out after 10 minutes.
- Workflow hit the stall timeout.
- The "abort on stall timeout" event handler was run; it did not time out and had to be killed manually.
CRITICAL - Workflow stalled
WARNING - PT3H stall timer starts NOW
ERROR - [('workflow-event-handler-00', 'stall') cmd] suite_report.py 'stall' 'r7091_physics_test/run11' 'workflow stalled'
[('workflow-event-handler-00', 'stall') ret_code] -9
[('workflow-event-handler-00', 'stall') err] killed on timeout (PT10M)
ERROR - stall EVENT HANDLER FAILED
WARNING - stall timer timed out after PT3H
ERROR - Workflow shutting down - "abort on stall timeout" is set
INFO - platform: xce - remote tidy (on xcel01)
ERROR - [('workflow-event-handler-00', 'shutdown') cmd] suite_report.py 'shutdown' 'r7091_physics_test/run11' '"abort on stall timeout" is set'
[('workflow-event-handler-00', 'shutdown') ret_code] -15
ERROR - shutdown EVENT HANDLER FAILED
INFO - DONE
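A note on reading the `ret_code` values in the log above. This assumes they are Python `subprocess` return codes, which the log itself does not state. Under that assumption, a negative value `-N` means the process was terminated by signal `N`, so `-9` would be SIGKILL (consistent with "killed on timeout") and `-15` would be SIGTERM (consistent with a manual `kill`). The helper name `describe_ret_code` is made up for illustration:

```python
import signal


def describe_ret_code(ret_code):
    """Describe a subprocess-style return code (hypothetical helper).

    Assumes the Python `subprocess` convention on POSIX: a negative
    return code -N means the child was terminated by signal N.
    """
    if ret_code < 0:
        try:
            name = signal.Signals(-ret_code).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {-ret_code}"
        return f"killed by {name}"
    return f"exited with status {ret_code}"
```

If the assumption holds, the two handlers were stopped in different ways: the first by the timeout mechanism, the second by an external signal.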
Why did the "workflow stalled" event handler abide by the timeout and the "abort on stall timeout" event handler ignore it?
This causes issues with the auto-restart functionality: if an event handler hangs for any reason, no timeout stops it, and the workflow remains in the shutting-down state indefinitely.
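For reference, the behaviour expected of every handler, including ones run during shutdown, can be sketched as follows. This is a minimal, POSIX-only illustration and not the scheduler's actual code; `run_handler` and its parameters are hypothetical names. The command is started in its own session so that a hung handler and any children it spawned can be killed together once the timeout expires:

```python
import os
import signal
import subprocess


def run_handler(cmd, timeout):
    """Run `cmd`, killing its process group after `timeout` seconds.

    Hypothetical sketch. Returns (ret_code, timed_out).
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # child becomes leader of a new process group
    )
    try:
        proc.communicate(timeout=timeout)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        try:
            # The child's process group ID equals its PID here.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited just before we tried to kill it
        proc.communicate()  # reap the child and drain its pipes
        return proc.returncode, True
```

With something like this applied uniformly, a hung shutdown handler would be killed on timeout in the same way the stall handler was, rather than blocking shutdown.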
Labels
bug: Something is wrong :(