LSF: Accept use_stdin in the constructor #360

Merged
merged 1 commit into from Oct 30, 2019

Conversation

@stuarteberg
Contributor

stuarteberg commented Oct 24, 2019

Right now, all LSF options can be specified either in the config OR in the constructor arguments, except the new use-stdin option (introduced in #347). Unlike all the others, that option may only be specified in the config (not the constructor).

I don't see why we'd want use-stdin to be different from all the other LSF options, so this PR allows the user to pass use_stdin to the LSFCluster constructor if she wants to. (As usual, the config is used if no value was passed in the constructor.)

cluster = LSFCluster(cores=15, memory='25GB',
                     use_stdin=True) # <-- now allowed
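
The fallback described above (constructor argument first, config second) can be sketched as follows. This is only an illustration, assuming that None means "not passed"; CONFIG and resolve_option are hypothetical names, not dask-jobqueue's actual code.

```python
# Hypothetical stand-in for the LSF section of the dask-jobqueue config.
CONFIG = {"use-stdin": None, "cores": 8}


def resolve_option(value, key, config=CONFIG):
    """Return the constructor value if one was passed, else the config value."""
    # None means "not passed"; an explicit False must still override the config.
    return config[key] if value is None else value


print(resolve_option(True, "use-stdin"))  # constructor value wins
print(resolve_option(None, "cores"))      # falls back to the config
```

Testing against None rather than truthiness matters here, because use_stdin=False is a legitimate explicit choice.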

Side note: I suspect this new setting will be needed by many, if not most, LSF users, so I added some verbose documentation for it.

FWIW, I tested these changes on my LSF cluster, and they work as expected.

@stuarteberg stuarteberg force-pushed the stuarteberg:lsf-use_stdin-arg branch 4 times, most recently from 7f948a4 to 03d3d67 Oct 24, 2019
@stuarteberg

Contributor Author

stuarteberg commented Oct 25, 2019

OK, I got the tests passing, but I'm seeing intermittent failures from Travis (unrelated to this PR).

FWIW, here's the error:

PREFIX=/opt/anaconda

Unpacking payload ...

  0%|          | 0/35 [00:00<?, ?it/s]

No output has been received in the last 10m0s, this potentially
indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on:
https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

The build has been terminated

At first this error appeared in the JOBQUEUE=pbs build, so I triggered a new build. Now pbs passes, but I see the same error in the JOBQUEUE=slurm build.

I'll force-push one more time to see if I can get lucky with a successful build.

@stuarteberg stuarteberg force-pushed the stuarteberg:lsf-use_stdin-arg branch from 03d3d67 to b524b3e Oct 25, 2019
Member

guillaumeeb left a comment

Thanks very much @stuarteberg, that looks very good!

Just one little thing: could you add a test that shows that lsf_job.use_stdin can be modified according to the constructor argument you added?

@lesteve

Member

lesteve commented Oct 29, 2019

Side note: I suspect this new setting will be needed by many, if not most, LSF users, so I added some verbose documentation for it.

@stuarteberg do you think we should default to use_stdin=True? As mentioned in #328 (comment) it feels like we may have made the wrong choice mostly because it was based on @mrocklin's experience on Summit which may not be representative of other LSF clusters.

Also for further reference could you answer these two questions:

  • what is the output of lsid | head -n1
  • does bsub < job_script (rather than bsub job_script) work for you?
@stuarteberg stuarteberg force-pushed the stuarteberg:lsf-use_stdin-arg branch from b524b3e to ee6fb77 Oct 30, 2019
@stuarteberg

Contributor Author

stuarteberg commented Oct 30, 2019

@guillaumeeb

could you add a test that shows that lsf_job.use_stdin can be modified according to the constructor argument you added?

OK, done.

@stuarteberg

Contributor Author

stuarteberg commented Oct 30, 2019

@lesteve

Disclaimer: I am not an LSF expert, and I only have experience with one LSF cluster, of which I am merely a user, not an administrator.

do you think we should default to use_stdin=True?

Yes, I think we should. If use_stdin=False, then LSFCluster will write the jobscript to /tmp (or whichever directory distributed.utils.tmpfile() chooses), and then launch the command as follows:

bsub /tmp/jobscript.sh

Unless tmpfile() chooses a location on the shared file system (which seems unlikely), that command is going to fail, because the LSF host that actually executes the job has a completely different /tmp directory.

When using bsub < /tmp/jobscript.sh (i.e. use_stdin=True), jobscript.sh is fed to the LSF scheduler and then passed to the LSF host when the job is executed. There is no need for the execution host to have access to the original jobscript file, so it doesn't matter where it was originally written.
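
The two submission forms differ only in how the script reaches bsub. Here is a minimal sketch of that difference; submit_command is a hypothetical helper that only builds the shell command string, and it is not how LSFCluster actually submits jobs.

```python
import shlex


def submit_command(script_path, use_stdin, submit="bsub"):
    """Build the shell command for either submission style."""
    quoted = shlex.quote(script_path)
    if use_stdin:
        # The submitting shell opens the file and feeds it to bsub on stdin,
        # so the execution host never needs to see script_path.
        return f"{submit} < {quoted}"
    # bsub receives only the path, so script_path must be readable
    # on the execution host as well.
    return f"{submit} {quoted}"


print(submit_command("/tmp/jobscript.sh", use_stdin=True))
print(submit_command("/tmp/jobscript.sh", use_stdin=False))
```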

Again, I'm not an expert, but if bsub < /tmp/jobscript.sh didn't work when @mrocklin tried it at Summit, then my hunch is that there's something weird about Summit's configuration. It's even stranger that bsub /tmp/jobscript.sh apparently worked.

In any case, I don't think our current default heuristic is correct, because bsub < is still supported in LSF 10 (as documented in the LSF 10 manual).


Also for further reference could you answer these two questions:

  • what is the output of lsid | head -n1

$ lsid | head -n1
IBM Spectrum LSF Standard 10.1.0.8, May 10 2019

  • does bsub < job_script (rather than bsub job_script) work for you?

Yes, both of those work for me, as long as job_script is located in a shared location, such as my home directory. But (as explained above) if job_script is located in /tmp/, then only bsub < /tmp/job_script works.
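
For reference, the version number can be pulled out of that first lsid line with a small parser. This is a sketch only; parse_lsf_version is a hypothetical helper and not necessarily how dask-jobqueue's lsf_version() works.

```python
import re


def parse_lsf_version(lsid_first_line):
    """Extract the dotted version from the first line of `lsid` output."""
    # Take the first run of dot-separated integers, e.g. "10.1.0.8".
    match = re.search(r"\d+(?:\.\d+)+", lsid_first_line)
    if match is None:
        raise ValueError("no version found in %r" % lsid_first_line)
    return tuple(int(part) for part in match.group(0).split("."))


version = parse_lsf_version("IBM Spectrum LSF Standard 10.1.0.8, May 10 2019")
print(version[0])  # major version
```

Comparing integer tuples avoids the pitfalls of comparing version strings lexically.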

@lesteve

Member

lesteve commented Oct 30, 2019

Disclaimer: I am not an LSF expert, and I only have experience with one LSF cluster, of which I am merely a user, not an administrator.

None of us are LSF experts, even less LSF administrators... As someone who has access to an LSF cluster, and judging from your earlier comments, I think you qualify as the dask-jobqueue LSF expert ;-).

Thanks a lot for your feedback, it is extremely useful! It also aligns very much with my understanding of the problem, so I think we should switch to use_stdin=True by default.

@stuarteberg

Contributor Author

stuarteberg commented Oct 30, 2019

I think we should switch to use_stdin=True by default.

OK, assuming @guillaumeeb agrees, would you like me to open a separate PR for that, or simply change it now, as part of this PR?

@@ -27,3 +29,18 @@ def pytest_runtest_setup(item):
     if envnames:
         if item.config.getoption("-E") not in envnames:
             pytest.skip("test requires env in %r" % envnames)


@pytest.fixture(autouse=True)

@lesteve

lesteve Oct 30, 2019

Member

Would it be possible to not use autouse here, so that the fixture is explicitly used in test_use_stdin? My preference would be to avoid pytest magic if possible.

@stuarteberg

stuarteberg Oct 30, 2019

Author Contributor

Since lsf_version() is called by default (unless use-stdin is specified in the config), this monkey-patch is needed by all tests that instantiate LSFCluster(). That includes every test in test_lsf.py and also half of the tests in test_jobqueue_core.py.

I'm not a pytest expert, but IIUC, we need to use autouse=True or we need to add this fixture to every test that needs it in those two files. Is there some better mechanism I'm missing?

@stuarteberg

stuarteberg Oct 30, 2019

Author Contributor

BTW, in the future, if we simply use use-stdin: true by default, then we can forbid use-stdin: null. At that point, there will be no need for lsf_version() anyway. We can delete it, along with this test fixture.

In other words, it's probably not worth debating the technical details of this fixture if we're going to delete it soon, anyway.

@lesteve

lesteve Oct 30, 2019

Member

Right, I missed that. I think we can keep it like this for this PR.

When we switch to use_stdin=True, we should remove the lsid logic (and so this autouse fixture). Basically, we thought the behaviour had changed in LSF 10; my current understanding is that this is not the case and that the difference comes from some quirks on Summit.

@lesteve

lesteve Oct 30, 2019

Member

Looks like our messages crossed, oh well ... looks like we agree anyway.

@lesteve

Member

lesteve commented Oct 30, 2019

OK, assuming @guillaumeeb agrees, would you like me to open a separate PR for that, or simply change it now, as part of this PR?

If you don't mind, a separate PR would be preferable. This PR is adding the use_stdin parameter, which was an oversight of #307.

Changing to use_stdin=True is a behaviour change that we may want to understand again in six months' time. As such, I feel it deserves its own PR with its own dedicated discussion (rather than plenty of comments in separate issues).

@stuarteberg

Contributor Author

stuarteberg commented Oct 30, 2019

If you don't mind, a separate PR would be preferable.

I'll open a PR once this one is merged. (It will touch the same files as this one.)

@lesteve

Member

lesteve commented Oct 30, 2019

Thanks, I'll merge this one, since it seems perfectly reasonable to me. @guillaumeeb don't hesitate to comment if you think we missed something.

@lesteve lesteve merged commit be60856 into dask:master Oct 30, 2019
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
@lesteve

Member

lesteve commented Oct 30, 2019

My current thinking is that there are a few fixes that would be nice to include in a release in the near future (say 1-2 weeks) and use_stdin=True by default is one of them.

3 participants