I'm opening an anecdotal reporting thread in the hope that it informs usability improvements and bug fixes if/when any are found. I'll get more specific as I uncover more and will continue to add information.
In data-heavy dynamic task-of-task workflows, where tasks may return pandas DataFrames on the order of a few GB, users will sometimes create task graphs in which one worker-client task tries to gather the results of many tasks from other worker clients (having seceded from the thread pool to gather), with the total data footprint of the resulting gather well into the tens of GBs. Including overhead, this can overwhelm the receiving worker and start cascading failures due to memory exhaustion.
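For concreteness, the fan-in pattern described above looks roughly like the following. This is a minimal sketch using `distributed.worker_client`; the `make_frame`/`fan_in` names are hypothetical, and the real DataFrames are multi-GB rather than toy-sized.

```python
import pandas as pd
from distributed import worker_client

def make_frame(i):
    # Hypothetical stand-in for a task that returns a multi-GB DataFrame.
    return pd.DataFrame({"x": range(10)})

def fan_in(n):
    # Runs on a worker. worker_client() secedes from the worker's thread
    # pool while waiting, then gathers all child results into this one
    # worker's memory at once: the step whose total footprint can reach
    # tens of GB and overwhelm the receiving worker.
    with worker_client() as client:
        futures = client.map(make_frame, range(n))
        frames = client.gather(futures)
    return sum(len(df) for df in frames)
```

In the problematic workloads, `fan_in` itself is submitted as a task, so the gather lands on whichever worker happens to run it.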
Ignoring for the moment any design flaws in the program formulation, there are some strange observable behaviors worth reporting:
1 - the clearest and most benign thing that happens is a KilledWorker error, identifying the task that OOM'd the worker N times (for N = allowed-failures + 1).
2 - using a LocalCluster without a memory limit, I think the nanny dies too, because it's co-located with the worker process when the OS (Linux in my case) kills it? Anecdotally, what SOMETIMES happens is that tasks are seen as cancelled, with none found in an error state. Almost always, distributed reports some sort of FATAL-level logging about key-not-found errors.
3 - in the worst case, a gather appears to succeed, but at least one of the gathered results is garbled: it comes back as either a string instead of a DataFrame, or a malformed DataFrame. To see what was happening, I set up one of these degenerate programs to pickle the gathered DataFrame to disk whenever I identified it as garbled (by checking a property I know should be True, specifically whether the index is_monotonic_increasing, which showed up as False). Trying to read back the dumped pickle, however, raises "UnpicklingError: pickle exhausted before end of frame", further indicating data corruption. This is quite hard to reproduce even in my "real" example.
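The detection step in case 3 can be sketched as below. This is a minimal illustration, not the actual program: the `check_and_dump` name is hypothetical, and the monotonic-index check is a workload-specific invariant that happens to hold for my data.

```python
import pickle
import pandas as pd

def check_and_dump(result, path="suspect.pkl"):
    """Sanity-check a gathered result; dump it to disk for post-mortem if it
    looks garbled. Returns True if the result passes the check.

    The invariant checked here (a monotonic increasing index) is specific to
    my workload; any cheap, known-true property of the data would do.
    """
    ok = isinstance(result, pd.DataFrame) and result.index.is_monotonic_increasing
    if not ok:
        # Persist the suspect object so it can be inspected later. In my
        # case, re-reading this pickle raised UnpicklingError, suggesting
        # the bytes were already corrupt before they hit disk.
        with open(path, "wb") as f:
            pickle.dump(result, f)
    return ok
```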
I have yet to produce an example I can share, but I wanted to start a conversation around these observations.