When running on a heterogeneous cluster, a task sometimes fails not because of a fundamental error in the function, but because of node-specific problems such as low memory.
For such tasks, it would be useful to be able to pass the executor a "retry_count=n" kwarg, meaning that if the task fails, the scheduler will attempt to rerun it up to n times (preferably on a different worker). Optionally, this could be combined with a list of exceptions that are "expected" in such a case, so that only those trigger a retry.
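
To make the request concrete, here is a rough sketch of the behaviour I have in mind, written against the standard library's `concurrent.futures` rather than the real scheduler. The helper name `submit_with_retries` and the `retry_on` parameter are hypothetical, made up for illustration. Note that this version simply reruns the function in the same worker, whereas the actual feature would ideally reschedule the task on a different one.

```python
from concurrent.futures import ThreadPoolExecutor


def submit_with_retries(executor, func, *args, retry_count=0,
                        retry_on=(Exception,), **kwargs):
    """Hypothetical helper: submit func, rerunning it up to retry_count
    times if it raises one of the "expected" exceptions in retry_on."""

    def attempt():
        # One initial attempt plus retry_count retries.
        for remaining in range(retry_count, -1, -1):
            try:
                return func(*args, **kwargs)
            except retry_on:
                if remaining == 0:
                    raise  # retries exhausted: surface the last error
            # Exceptions not listed in retry_on propagate immediately.

    return executor.submit(attempt)


# Simulated node-specific failure: the first two calls run out of memory.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise MemoryError("simulated low memory on this node")
    return "ok"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = submit_with_retries(pool, flaky, retry_count=2,
                                 retry_on=(MemoryError,))
    result = future.result()
```

Here the task succeeds on its third attempt. An exception that is not in `retry_on` would fail the task on the first attempt, which is the point of the "expected exceptions" list.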