shutdown event handler not killed on timeout #6615

@oliver-sanders

Seen in the wild: an event handler that apparently did not respect its timeout.

  • Workflow stalled.
  • The "workflow stalled" event handler was run and timed out after 10 mins.
  • Workflow hit the stall timeout.
  • The "abort on stall timeout" event handler was run; it did not time out and had to be killed manually.

CRITICAL - Workflow stalled
WARNING - PT3H stall timer starts NOW
ERROR - [('workflow-event-handler-00', 'stall') cmd] suite_report.py 'stall' 'r7091_physics_test/run11' 'workflow stalled'
    [('workflow-event-handler-00', 'stall') ret_code] -9
    [('workflow-event-handler-00', 'stall') err] killed on timeout (PT10M)
ERROR - stall EVENT HANDLER FAILED
WARNING - stall timer timed out after PT3H
ERROR - Workflow shutting down - "abort on stall timeout" is set
INFO - platform: xce - remote tidy (on xcel01)
ERROR - [('workflow-event-handler-00', 'shutdown') cmd] suite_report.py 'shutdown' 'r7091_physics_test/run11' '"abort on stall timeout" is set'
    [('workflow-event-handler-00', 'shutdown') ret_code] -15
ERROR - shutdown EVENT HANDLER FAILED
INFO - DONE
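
For anyone reading the two ret_code values in the log: assuming they follow Python's subprocess convention (a negative value means the process was terminated by that signal number on POSIX), they can be decoded as below. `describe_ret_code` is a hypothetical helper for illustration, not part of Cylc.

```python
import signal


def describe_ret_code(ret_code: int) -> str:
    """Describe a subprocess return code in readable form.

    Assumes the POSIX subprocess convention: a negative value -N means
    the child was terminated by signal number N.
    """
    if ret_code < 0:
        return f"killed by {signal.Signals(-ret_code).name}"
    return f"exited with status {ret_code}"


# The two values from the log above.
print(describe_ret_code(-9))   # stall handler
print(describe_ret_code(-15))  # shutdown handler
```

Under that assumption, -9 is SIGKILL, which fits the "killed on timeout (PT10M)" message for the stall handler, and -15 is SIGTERM, which fits the shutdown handler being killed manually.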

Why did the "workflow stalled" event handler abide by the timeout and the "abort on stall timeout" event handler ignore it?

This causes issues for auto-restart: if an event handler hangs for whatever reason, there is no timeout to stop it, so the workflow remains in the shutting-down state indefinitely.
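
To illustrate the behaviour expected here, below is a minimal sketch of enforcing a timeout on a handler subprocess. It is not Cylc's implementation; `run_with_timeout` and the hanging command are hypothetical stand-ins, and the only assumption is that the handler runs as a child process.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd, killing it if it runs longer than timeout_s seconds.

    Returns (ret_code, timed_out).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # The child is still running: kill it (SIGKILL on POSIX) and
        # reap it so that it cannot block the caller indefinitely.
        proc.kill()
        return proc.wait(), True


# Hypothetical stand-in for a hanging event handler.
hanging_handler = [sys.executable, "-c", "import time; time.sleep(60)"]
ret_code, timed_out = run_with_timeout(hanging_handler, timeout_s=0.5)
print(ret_code, timed_out)
```

If the same kind of guard applied to the shutdown handler, a hung handler would be killed once the timeout elapsed instead of holding the workflow in the shutting-down state.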

Labels: bug
