Skip to content

Deadlock on workers reaching memory.pause threshold  #5235

@gerrymanoim

Description

@gerrymanoim

Apologies - I'm not sure exactly how to provide a code example that triggers this condition, but here's what I observed:

Sometimes at the end of long runs with 1000s of tasks, I've found that there are straggler tasks that seem to be stuck on workers. These workers seem to be at the memory.pause 0.8 mark and the amount of stuck tasks is equal to the threads available to the dask worker. The workers are heart beating just fine, but don't seem to be actually doing anything with the tasks they're processing (callstacks for each task are blank). Other workers aren't stealing these tasks. When I go kill the workers, the scheduler will go reassign those tasks and everything will complete as normal.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions