EDIT4:
The 'edit2' fix is definitely wrong. Currently thinking that this occurs when a worker dies in the middle of sending data; really hard to test though...
EDIT3:
The 'edit2' fix doesn't seem to work all the time. It seems to be related to when I scatter large files...?
EDIT2:
A possible fix (not reproducible here (yet)) is to use this patch on distributed. It doesn't really make sense though...
EDIT:
I made tests (Python 3.7 only) for everything and got everything to pass on my end, suggesting that I'm not reproducing my failures properly. I will continue to try to figure out what is causing my failures.
This is partly an issue report; I don't expect this to get merged, but it might be useful. In advance, I realize this is mainly a distributed bug/fault, but PID tracking/killing is easier in jobqueue than it is in distributed.
Purpose:
I want to use dask on a backfill queue, i.e. workers can be killed with signal 15 (SIGTERM) at any moment. This appears to work fine with gather, but never with as_completed. Since workers can be killed at any moment, I want to save my result files back at the scheduler using as_completed so things progress smoothly. Not sure if it's relevant, but there is a large amount of data that has to be passed from the workers to the scheduler (on the order of 10+ GB): relatively small per job, but a very large number of jobs, and I think holding them all for a single gather call will be too expensive memory-wise.
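To make the pattern concrete, here is a minimal sketch of the incremental-save idea, using the stdlib `concurrent.futures` as a stand-in for distributed's `Client`/`as_completed` (the `save_result` helper and file layout are invented for illustration):

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def job(i):
    # Stand-in for the real per-job computation done on a worker.
    return {"job": i, "value": i * i}

def save_result(outdir, result):
    # Hypothetical helper: persist each result as soon as it arrives,
    # instead of holding everything in memory for one big gather().
    path = Path(outdir) / f"result-{result['job']}.json"
    path.write_text(json.dumps(result))
    return path

outdir = tempfile.mkdtemp()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(job, i) for i in range(10)]
    # as_completed yields futures as they finish, so memory stays
    # bounded even when the total result volume is 10+ GB.
    saved = [save_result(outdir, f.result()) for f in as_completed(futures)]

print(len(saved))  # one saved file per job
```

With distributed, the loop body would be the same; only the executor and `as_completed` import change.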
Problem:
I consistently get errors similar to log1 or log2, or my jobs seem to stall and not repopulate.
Question:
I think that while waiting in as_completed, distributed needs to check for lost workers and reschedule those workers' tasks as needed; I'm not exactly sure how that should be done, though. Additionally, when you cancel the futures they should somehow be replaced or re-run. At least that's what I surmise from my error logs.
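Roughly the behavior I have in mind: when a worker dies mid-task, the future should come back errored and get resubmitted rather than staying cancelled. A sketch of that loop, using the stdlib `concurrent.futures` as a stand-in for distributed's API (in distributed the error would surface as something like `KilledWorker`; the `flaky_job` failure here is invented to simulate a worker killed by SIGTERM):

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

random.seed(0)

def flaky_job(i):
    # Simulate a worker killed mid-task: fail on some fraction of runs.
    if random.random() < 0.3:
        raise RuntimeError(f"worker running job {i} was killed")
    return i * 2

results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    pending = {pool.submit(flaky_job, i): i for i in range(8)}
    while pending:
        retries = {}
        for fut in as_completed(pending):
            i = pending[fut]
            try:
                results[i] = fut.result()
            except RuntimeError:
                # The future is dead; resubmit the job instead of
                # leaving it cancelled (this is the step I think
                # as_completed in distributed needs to handle for us).
                retries[pool.submit(flaky_job, i)] = i
        pending = retries

print(sorted(results.items()))
```

Every job eventually completes because failed futures are replaced with fresh submissions each round.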
Important: if you want to try this PR, you have to add this to your distributed installation:
Related:
#122
I tried some techniques from this PR; however, I have still seen errors in my production runs even with try/except blocks...
Attempts to solve:
I also tried my own as_completed (asyncio only), which appears to work for my test case, but I'm still getting errors that look like log1 or log2 in production.
A big problem is that I'm using asyncio, and it doesn't seem to be a very popular option...
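For reference, the general shape of a hand-rolled as_completed on top of `asyncio.wait` with `FIRST_COMPLETED`; this is a self-contained illustrative sketch (the `work` coroutine and timings are made up), not the exact code I run against distributed:

```python
import asyncio

async def my_as_completed(tasks):
    # Yield tasks as they finish, like distributed.as_completed, so
    # failed tasks can be inspected one at a time instead of failing
    # a whole gather() call.
    pending = set(tasks)
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            yield task

async def work(i):
    await asyncio.sleep(0.01 * (5 - i))  # later jobs finish sooner
    return i

async def main():
    tasks = [asyncio.create_task(work(i)) for i in range(5)]
    out = []
    async for t in my_as_completed(tasks):
        out.append(t.result())
    return out

completed = asyncio.run(main())
print(completed)  # completion order, not submission order
```

Passing `asyncio.Task` objects (not bare coroutines) to `asyncio.wait` keeps this compatible with current Python versions.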
error log1
error log2
Sorry for being lengthy...