problems related to PBS clusters #18
On Wed, Feb 24, 2016 at 1:37 PM, Gian Matharu gian@ualberta.ca wrote:
No, I don't believe we've had that problem. Perhaps some sort of 'epilogue'? http://docs.adaptivecomputing.com/torque/3-0-5/a.gprologueepilogue.php Or perhaps just unreliable behavior of PBS.
Update: By replacing PBS environment variables with MPI directives, Gian's system.mpi class removes the need to distinguish between PBS Pro and Torque in some cases; pbs_sm should work for both variants, I believe. To my knowledge, pbs_lg has been tested on PBS Pro clusters but still needs testing on Torque clusters (while it may work on the latter with small modifications, it is not likely to work right out of the box).
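To illustrate the idea, here is a minimal sketch (not the actual system.mpi implementation): each task derives its identity from its MPI rank rather than from a scheduler-specific variable such as PBS_VNODENUM, which is what removes the PBS Pro / Torque distinction.

```python
# Minimal sketch only; illustrative, not the actual system.mpi code.
# Each task gets its id from its MPI rank instead of a PBS variable.
from mpi4py import MPI

comm = MPI.COMM_WORLD
task_id = comm.Get_rank()   # same call regardless of PBS Pro, Torque, SLURM, ...
ntasks = comm.Get_size()
print("running task %d of %d" % (task_id, ntasks))
```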
I ran into another topic that may warrant some consideration. For the PBS cluster I'm currently using, certain software needs to be loaded prior to running codes, e.g. "module load intel" loads the Intel compilers. Admittedly, I'm not sure if this is standard for PBS clusters in general, but would adding the option to specify such directives from within the seisflows system.submit be beneficial? There are easy alternatives (using bash profiles, or in some cases a submission script that submits sfrun), so it's not a pressing issue and may not be necessary.
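For concreteness, a hedged sketch of what such an option might look like; the parameter name `modules` and the PBS header lines below are made up for illustration and are not an existing SeisFlows interface.

```python
# Hypothetical sketch: prepend user-specified "module load" lines to the
# generated PBS submission script. The module names and PBS directives
# shown here are placeholders.
modules = ["intel", "openmpi"]

lines = ["#!/bin/bash",
         "#PBS -l nodes=1:ppn=16",
         "#PBS -l walltime=01:00:00"]
lines += ["module load " + m for m in modules]
lines += ["sfrun"]

with open("submit.sh", "w") as f:
    f.write("\n".join(lines) + "\n")
```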
For controlling the user environment, I think the module utility is standard. A better approach, I think, would be to include environment assertions in ...

On Thu, Mar 10, 2016 at 11:26 PM, gianmatharu notifications@github.com wrote:
Hi Ryan, when I get to 'bin/xspecfem2D', it hits a bug. EDIT: revised for clarity [rmodrak]
It seems like ... As a workaround, try using ... I'll go ahead and edit your post for clarity if that's alright.
The MPI (mpi4py) system class may not be an ideal solution for task parallelism on clusters. It is an approach that aims to provide flexibility across cluster resource managers. While I haven't encountered the particular issue above, I have encountered issues with the class in other forms. As Ryan has suggested, it is probably best to use the utilities provided by the resource manager (e.g., pbsdsh).
Thanks everyone for the useful comments. I've opened a new issue #40 for discussing shot parallelism problems of the kind Chao is experiencing. In this new issue, I've posted some previous emails between Gian and myself that seem relevant to the problem at hand. |
It does appear to be hanging, but it is not the issue I saw in issue #40. What workflow are you attempting to run? I'd suggest checking any instances ...

On Mon, Sep 26, 2016 at 1:07 AM, CHAO ZHANG notifications@github.com wrote:
Perhaps using pbsdsh to invoke a simple "hello world" script might be useful as a debugging step. The integration test located in seisflows/tests/test_system could be employed for this purpose. That said, my understanding from speaking to our local cluster administrator is that the pbsdsh utility itself can be unreliable. This may well be the explanation for what you are seeing, but it may be worth troubleshooting a bit to make sure there's not some alternate explanation.
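A rough sketch of that check (assumes it is run from inside a PBS job; "/path/to/hello.sh" is a placeholder for any small executable script on a shared filesystem):

```python
# Debugging sketch: ask pbsdsh to run a trivial script on every allocated
# execution slot. If this hangs or returns nonzero, suspect pbsdsh itself
# rather than seisflows.
import subprocess

ret = subprocess.call(["pbsdsh", "/path/to/hello.sh"])
print("pbsdsh exit status:", ret)
```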
If pbsdsh is in fact the problem, a workaround might be to create your own dsh script by calling ssh within a for loop. You would need to (1) manually specify the compute nodes to run on using the list of allocated nodes made available by PBS, (2) use the SendEnv option or something similar to assign each ssh process a unique identifier, and (3) wait until all the ssh processes complete before moving on. A disadvantage of this approach is that the child ssh processes might continue running in the event the parent process fails.
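A rough Python sketch of that idea, assuming a Torque-style job where $PBS_NODEFILE lists the allocated nodes; the task script path and the TASK_ID variable name are placeholders:

```python
# "Do-it-yourself dsh" sketch: one ssh process per allocated node.
import os
import subprocess

# (1) read the node list provided by PBS
with open(os.environ["PBS_NODEFILE"]) as f:
    nodes = sorted(set(line.strip() for line in f if line.strip()))

procs = []
for itask, node in enumerate(nodes):
    env = os.environ.copy()
    env["TASK_ID"] = str(itask)   # (2) unique identifier per task
    # SendEnv asks ssh to forward TASK_ID; the nodes' sshd must list
    # TASK_ID under AcceptEnv for this to work.
    procs.append(subprocess.Popen(
        ["ssh", "-o", "SendEnv=TASK_ID", node, "/path/to/task.sh"],
        env=env))

# (3) wait for all remote tasks before moving on; note that if this parent
# process dies, the child ssh processes may keep running on the nodes.
for p in procs:
    p.wait()
```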
Yet another workaround would be to write an entirely new PBS system class based on the following package: https://radicalpilot.readthedocs.io/en/latest/ Through an NSF grant, we are currently collaborating with the developers of the radical pilot package, so such an interface might eventually be added to seisflows, but not for a while yet. If you get something working in the meantime, please feel free to submit a pull request.
To reflect the expanded scope of the discussion, I'll go ahead and change the issue title to something more general. Also, it's worth noting that none of the problems mentioned above have been encountered on SLURM clusters. So if the possibility ever arises to switch to SLURM, I'd highly recommend it...
Hi Ryan,

#!/bin/bash
...

Then it can run successfully, but compared with pbsdsh, it runs much slower.
This is definitely in the right direction. If you were implementing it in bash, it would be something like what you have written. However, since you would be overloading the 'run' method from the PBS system class, it actually needs to be implemented in Python. Come to think of it, rather than a subprocess call to ssh, in Python it would probably be better to use paramiko (http://stackoverflow.com/questions/3586106/perform-commands-over-ssh-with-python).
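For reference, a bare-bones paramiko sketch; the hostname, environment variable, and remote command are placeholders, and it assumes key-based ssh between nodes (which most clusters provide):

```python
# Sketch of running one remote task over ssh with paramiko.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("node01")   # placeholder hostname from the PBS node list

# pass the task id inline since exec_command starts a fresh shell
stdin, stdout, stderr = client.exec_command("export TASK_ID=0; /path/to/task.sh")
exit_status = stdout.channel.recv_exit_status()   # block until the command finishes
print("remote exit status:", exit_status)
client.close()
```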
One issue: PBS Pro and Torque are in many ways different and may require separate modules in seisflows/system.