Avoid KeyError on worker restart #138

Merged
merged 2 commits into dask:master on Oct 2, 2018

Conversation

@guillaumeeb
Member

guillaumeeb commented Aug 24, 2018

Closes #117.

Should help diagnose the problem in #122.

I chose not to raise any error, even if the job_id is not found in the Scheduler plugin's finished_jobs. A user may want to manually add jobs and workers to the cluster, though this is not good practice.
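For reference, a rough sketch of the logic (simplified; the class layout and job-id extraction below are illustrative, not the actual dask-jobqueue code):

# Simplified sketch of the scheduler-plugin behaviour this PR describes.
# The dict layout and job-id extraction are illustrative only.
class JobQueuePluginSketch:
    def __init__(self):
        self.running_jobs = {}   # job_id -> set of worker addresses
        self.finished_jobs = {}  # job_id -> set of worker addresses

    def add_worker(self, scheduler, worker=None, name=None, **kwargs):
        job_id = str(name).split("-")[0]  # hypothetical job-id extraction
        if job_id not in self.running_jobs:
            if job_id in self.finished_jobs:
                # A worker restarted after its job was marked finished:
                # move the job back instead of raising a KeyError.
                self.running_jobs[job_id] = self.finished_jobs.pop(job_id)
            else:
                # Unknown job, e.g. a worker added to the cluster manually:
                # create a fresh entry rather than raising.
                self.running_jobs[job_id] = set()
        self.running_jobs[job_id].add(worker)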

@guillaumeeb guillaumeeb changed the title from Search for job_id in finished_jobs, or even create new one to Avoid KeyError on worker restart on Aug 24, 2018

@guillaumeeb

Member Author

guillaumeeb commented Aug 24, 2018

@raybellwaves do you still have your memory use case somewhere?

@wgustafson do you want to give this a go?

I don't see how to write a simple unit test for this right now, but I could probably build a Dask workflow that makes a worker die from running out of memory on my system if needed.
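Something along these lines might reproduce it (an untested sketch; the cluster parameters are illustrative):

import dask.array as da
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Untested sketch: a small per-worker memory limit plus a persist that
# far exceeds it should get workers killed and restarted by their nanny.
cluster = PBSCluster(cores=1, memory="1GB")  # illustrative parameters
cluster.scale(2)
client = Client(cluster)

x = da.random.random((50000, 50000), chunks=(1000, 1000))  # ~20 GB of float64
x = x.persist()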

@wgustafson

Contributor

wgustafson commented Aug 25, 2018

I've been traveling, so apologies for the slow response over the last week. I have a script that was throwing KeyErrors, but I need to clean things up and see if it still reacts the same way. I will try to do this early this coming week.

@raybellwaves

Contributor

raybellwaves commented Aug 25, 2018

Yeah, my memory use case is https://gist.github.com/raybellwaves/f28777bf840cc40f4c76d88beca528c5. I've added the output to the file.

What's the best way to test your branch (this version)? I'm not familiar with cloning branches.

@guillaumeeb

Member Author

guillaumeeb commented Aug 25, 2018

No worries, @wgustafson! I was just browsing issues and updating things. I understand perfectly that you've got other things to do; no hurry.

@wgustafson

Contributor

wgustafson commented Aug 27, 2018

Taking a look at this again this morning. My reproducible KeyError issue happens with the code I posted in #122 on August 14, 2018, so I just put the log output into that issue to keep things together. In my case, I can be certain the problem is not from a lack of memory, so I don't know whether my problem is the same as @raybellwaves's or not.

@lesteve

Collaborator

lesteve commented Aug 29, 2018

What's the best way to test your branch (this version)? I'm not familiar with cloning branches.

@raybellwaves here is something that should work for checking out this PR code:

# creates a branch pr/138 with the code from PR #138
# run this from your dask-jobqueue git checkout
git fetch -fu upstream refs/pull/138/head:pr/138 && git checkout pr/138

Personally I have a git alias for this (there are other options for this as well if you google for it). Here is an excerpt from my .gitconfig:

[alias]
        pr  = "!f() { git checkout master && git fetch -fu ${2:-upstream} refs/pull/$1/head:pr/$1 && git checkout pr/$1; }; f"

Basically I can do this to get the code of this PR (it also works for getting the latest version of the PR):

git pr 138

@raybellwaves

Contributor

raybellwaves commented Aug 30, 2018

@lesteve thanks. The alias looks very handy.

@guillaumeeb sorry, I'll be on vacation until mid-September and won't have a chance to try this until then.

@raybellwaves

Contributor

raybellwaves commented Sep 18, 2018

I can confirm the KeyError has gone away. It now says:

Job 17752343 has got a restarting worker tcp://10.10.103.1:34877
Job 17752337 has got a restarting worker tcp://10.10.103.9:58938
...

and looking at the dask dashboard I can see some of the tasks erred:
[screenshot: Dask dashboard showing erred tasks, 2018-09-18]

My output is here: https://gist.github.com/raybellwaves/3033fe1679510ea4d5d64e0658225fc2

@guillaumeeb

Member Author

guillaumeeb commented Sep 19, 2018

Thanks @raybellwaves. Looking at your code, since you're trying to persist the whole dataframe in memory, I assume you got some memory errors on your workers, which seems to be confirmed by your dashboard screenshot and the orange bars we see there.
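As a rough sanity check before persisting, you could compare the estimated dataframe size with the cluster's total worker memory, something like this (illustrative, assuming the df and client objects from your script):

# Illustrative check: estimated dataframe size vs. total worker memory.
total_bytes = df.memory_usage(deep=True).sum().compute()
workers = client.scheduler_info()["workers"]
cluster_bytes = sum(w["memory_limit"] for w in workers.values())
print("data: %.1f GB, cluster: %.1f GB" % (total_bytes / 1e9, cluster_bytes / 1e9))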

It would be nice to get some code reviews here.

@mrocklin

Member

mrocklin commented Sep 19, 2018

It would be nice to get some code reviews here.

According to git blame it looks like @jhamman probably has experience with this part of the codebase. It might be worth pinging him directly.

@raybellwaves

Contributor

raybellwaves commented Sep 19, 2018

Thanks @raybellwaves. Looking at your code, since you're trying to persist the whole dataframe in memory, I assume you got some memory errors on your workers, which seems to be confirmed by your dashboard screenshot and the orange bars we see there.

Yep. A one-terabyte dataframe is a bit too much for our HPC, as it's often hard to get multiple jobs running. The Job N has got a restarting worker output is now an improvement on the KeyError.

Is there a way to provide information about memory issues in the output? If I didn't have the dashboard up and just saw Job N has got a restarting worker, as a novice user I wouldn't know what that means. I wonder if it could say something like Warning: potential memory issue. Job N has got a restarting worker? This may be an upstream issue though: dask.distributed? dask?

@guillaumeeb

Member Author

guillaumeeb commented Sep 20, 2018

I wonder if it could say something like Warning: potential memory issue. Job N has got a restarting worker? This may be an upstream issue though: dask.distributed? dask?

This message is currently in dask-jobqueue; we can improve it there and give a hint that this may be due to a memory problem.

At the same time, there may be something to do upstream to get information about why the worker was restarted...
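For instance, the message could carry a hint like this (a sketch, not the actual dask-jobqueue code; job_id and worker_address stand in for the plugin's real variables):

import logging

logger = logging.getLogger(__name__)

# Placeholder values for illustration, taken from the log output above.
job_id = "17752343"
worker_address = "tcp://10.10.103.1:34877"

# Sketch: extend the existing restart message with a hint about likely causes.
logger.warning(
    "Job %s has got a restarting worker %s. This often means the worker "
    "was killed, for instance after exceeding its memory limit or walltime.",
    job_id, worker_address,
)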

@guillaumeeb guillaumeeb merged commit 58dffc5 into dask:master Oct 2, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed