
problems related to PBS clusters #18

Closed
rmodrak opened this issue Feb 19, 2016 · 17 comments


@rmodrak
Collaborator

rmodrak commented Feb 19, 2016

One issue: PBS Pro and Torque are in many ways different and may require separate modules in seisflows/system.

@rmodrak
Collaborator Author

rmodrak commented Feb 24, 2016

On Wed, Feb 24, 2016 at 1:37 PM, Gian Matharu gian@ualberta.ca wrote:
Hey Ryan,

I was wondering if you'd noticed any lag between seisflows executing commands on clusters. I seem to notice lags between finishing one call and proceeding to the next, even though the prior task appears to be complete.

No, I don't believe we've had that problem. Perhaps some sort of 'epilogue'? http://docs.adaptivecomputing.com/torque/3-0-5/a.gprologueepilogue.php Or perhaps just unreliable behavior of PBS.

@rmodrak
Collaborator Author

rmodrak commented Mar 10, 2016

Update: By replacing PBS environment variables with MPI directives, Gian's system.mpi class removes the need to distinguish between PBS Pro and Torque in some cases; pbs_sm should work for both variants, I believe. To my knowledge, pbs_lg has been tested on PBS Pro clusters but still needs testing on Torque clusters (while it may work on the latter with small modifications, it is not likely to work right out of the box).

@gianmatharu
Collaborator

I ran into another topic that may warrant some consideration. For the PBS cluster I'm currently using, certain software needs to be loaded prior to running codes, e.g. "module load intel" loads the Intel compilers and so on. Admittedly, I'm not sure if this is standard for PBS/clusters in general, but would adding the option to specify such directives from within the seisflows system.submit be beneficial?

There are easy alternatives (using bash profiles or in some cases one could use a submission script to submit sfrun), so it's not a pressing issue and may not be necessary.

@rmodrak
Collaborator Author

rmodrak commented Mar 11, 2016

For controlling the user environment, I think the module utility is standard for all types of clusters, not just PBS; I can't think of a cluster I've used that hasn't had it. However, the list of available modules (module avail) is highly system dependent, so trying to script module load ... may not be a good approach.

A better approach, I think, would be to include environment assertions in check methods. For example, if mpi4py is required but not on the PYTHONPATH, an environment exception should be raised.
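
To make that concrete, a minimal sketch of such an assertion is given below. The check() method name follows the convention mentioned above, but the body is an illustration only, not actual SeisFlows code:

    # Hypothetical environment assertion inside a system class's check()
    # method -- an illustration, not actual SeisFlows code.
    def check(self):
        try:
            import mpi4py  # only verifying that the module can be imported
        except ImportError:
            raise EnvironmentError(
                'mpi4py is required by this system class but was not found '
                'on the PYTHONPATH')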


@dkzhangchao

dkzhangchao commented Sep 23, 2016

Hi Ryan,
Sorry for bothering you again. I am also running into a bug when using system='pbs_sm'; I am not sure if it is caused by mpi4py.

When I get to 'bin/xspecfem2D', I hit the following error:
[gpc-f106n012-ib0:30348] [[7631,1],2] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30353] [[7631,1],7] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30349] [[7631,1],3] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30351] [[7631,1],4] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30350] [[7631,1],1] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30352] [[7631,1],6] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30355] [[7631,1],5] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30355] [[7631,1],0] routed:binomial: Connection to lifeline [[7631,0],0] lost
Do you know what this means?
Thanks

EDIT: revised for clarity [rmodrak]

@rmodrak
Collaborator Author

rmodrak commented Sep 23, 2016

It seems like pbs_sm is not working on your cluster. Similar to the problem described here, mpi4py seems to fail when encountering Python subprocess calls. Strangely, on other PBS clusters, this is not always an issue.

As a workaround, try using pbs_torque_sm instead, which uses pbsdsh under the hood rather than mpi4py.

I'll go ahead and edit your post for clarity if that's alright.
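
For reference, switching system classes should be a one-line change in the parameter file. The snippet below follows the system='...' notation used elsewhere in this thread; the exact syntax is an assumption rather than confirmed SeisFlows usage:

    # hypothetical excerpt from the parameter file -- exact key name assumed
    system = 'pbs_torque_sm'    # uses pbsdsh under the hood instead of mpi4py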

@gianmatharu
Collaborator

The MPI (mpi4py) system class may not be an ideal solution for task parallelism on clusters. It is an approach that aims to provide flexibility across cluster resource managers. While I haven't encountered the particular issue above, I have encountered issues with the class in other forms. As Ryan has suggested, it is probably best to use the utilities provided by the resource manager itself (e.g. pbsdsh).

@rmodrak
Collaborator Author

rmodrak commented Sep 23, 2016

Thanks everyone for the useful comments. I've opened a new issue #40 for discussing shot parallelism problems of the kind Chao is experiencing. In this new issue, I've posted some previous emails between Gian and myself that seem relevant to the problem at hand.

@dkzhangchao

Hi Ryan and Gian,

I have tried system='pbsdsh', and I run into the same problem when it executes the following:

    system.run('solver', 'setup', hosts='all')

[screenshot: https://cloud.githubusercontent.com/assets/8068058/18825181/0f90753a-8395-11e6-95c5-0ee7f81da1ab.png]

Actually, the observed data are generated after this step, which means pbsdsh can be invoked here.
However, it seems to get stuck inside system.run (the subprocess.call to pbsdsh). The error is shown below:
[screenshot: https://cloud.githubusercontent.com/assets/8068058/18825372/0fc398c4-8396-11e6-8328-36b0511bf6a0.png]

Can you give me any advice? By the way, I do get the forward result (observed data) after system.run.

@gianmatharu
Collaborator

It does appear to be hanging, but it is not the issue I saw in issue #40 (where mpi4py was used). It seems to be due to pbsdsh hanging, which I have encountered. The issue is inconsistent for me; it can occur on one attempt and not the next. I don't observe it that frequently any more.

What workflow are you attempting to run? I'd suggest checking any instances of system.run (within the workflow) to make sure none of the calls to pbsdsh are problematic.


@dkzhangchao

I am running the inversion workflow. This is the first invocation of system.run (that is, the forward run to generate the observed data). I am confused: if system.run has a problem, why does it still generate the data after executing the code below?

    system.run('solver', 'setup', hosts='all')
By the way, do you think my system.run call is correct?
[screenshot omitted]

@rmodrak
Collaborator Author

rmodrak commented Sep 26, 2016

Perhaps using pbsdsh to invoke a simple "hello world" script might be useful as a debugging step. The integration test located in seisflows/tests/test_system could be employed for this purpose.

That said, my understanding from speaking to our local cluster administrator is that the pbsdsh utility itself can be unreliable. This may well be the explanation for what you are seeing, but it may be worth troubleshooting a bit to make sure there's not some alternate explanation.
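
For reference, a minimal version of such a check might look like the sketch below, assuming a hypothetical hello.py that is made executable and invoked by its absolute path (pbsdsh typically requires both); it is not part of the SeisFlows test suite:

    #!/usr/bin/env python
    # hello.py -- hypothetical pbsdsh sanity check, not part of SeisFlows.
    # From inside a PBS job, run something like:  pbsdsh /full/path/to/hello.py
    # One line should print per allocated slot; if nothing appears or the call
    # hangs, pbsdsh itself is likely at fault.
    import os
    import socket
    print('hello from %s, PBS_VNODENUM=%s'
          % (socket.gethostname(), os.getenv('PBS_VNODENUM')))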

@rmodrak
Collaborator Author

rmodrak commented Sep 26, 2016

If pbsdsh is in fact the problem, a workaround might be to create your own dsh script by calling ssh within a for loop. You would need to (1) manually specify the compute nodes to run on, using the list of allocated nodes made available by PBS, (2) use the SendEnv option or something similar to assign each ssh process a unique identifier, and (3) wait until all the ssh processes complete before moving on. A disadvantage of this approach is that the child ssh processes might continue running in the event the parent process fails.
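
A rough Python sketch of that idea follows; the function name and the assumption of passwordless ssh between compute nodes are illustrative, not SeisFlows code. Here the unique identifier is exported inside the remote command, mirroring pbsdsh's PBS_VNODENUM, rather than passed via SendEnv:

    # Sketch of a pbsdsh replacement: launch one ssh process per allocated
    # node, tag each with a unique task number, then wait for all of them.
    import os
    import subprocess

    def run_on_all_nodes(command):
        # (1) read the list of allocated nodes provided by PBS
        with open(os.environ['PBS_NODEFILE']) as f:
            nodes = f.read().split()
        processes = []
        for taskid, node in enumerate(nodes):
            # (2) give each remote process a unique identifier
            remote = 'export PBS_VNODENUM=%d; %s' % (taskid, command)
            processes.append(subprocess.Popen(['ssh', node, remote]))
        # (3) wait until all ssh processes complete before moving on
        for p in processes:
            p.wait()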

@rmodrak
Collaborator Author

rmodrak commented Sep 26, 2016

Yet another workaround would be to write an entirely new PBS system class based on the following package:

https://radicalpilot.readthedocs.io/en/latest/

Through an NSF grant, we are currently collaborating with the developers of the RADICAL-Pilot package, so such an interface might eventually be added to seisflows, but not for a while yet. If you get something working in the meantime, please feel free to submit a pull request.

@rmodrak
Collaborator Author

rmodrak commented Sep 27, 2016

To reflect the expanded scope of the discussion, I'll go ahead and change the issue title to something more general.

Also, it's worth noting that none of the problems mentioned above have been encountered on SLURM clusters. So if the possibility ever arises to switch to SLURM I'd highly recommend it...

@rmodrak changed the title from "need to distinguish between PBS PRO and TORQUE system interfaces" to "problems related to PBS clusters" on Sep 27, 2016
@dkzhangchao

dkzhangchao commented Sep 27, 2016

Hi Ryan,
Yes, I agree with you. To be honest, there are more issues with PBS systems than with SLURM. The cluster staff tell me there seems to be a problem with pbsdsh, so I wrote a script that emulates the behavior of pbsdsh. Does it match your suggestion?

#!/bin/bash
# temporary workaround for pbsdsh
# cat ${PBS_NODEFILE}
k=0
for i in $(cat $PBS_NODEFILE); do
    ssh $i "export PBS_VNODENUM=$k; $@"
    k=$(($k + 1))
done

With this it runs successfully, but compared with pbsdsh it is much slower.

@rmodrak
Collaborator Author

rmodrak commented Sep 27, 2016

This is definitely in the right direction. If you were implementing it in bash, it would be something like what you have written. However, since you would be overloading the 'run' method of the PBS system class, it actually needs to be implemented in Python. Come to think of it, rather than a subprocess call to ssh, in Python it would probably be better to use paramiko (http://stackoverflow.com/questions/3586106/perform-commands-over-ssh-with-python).
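
For illustration, a single-task version using paramiko might look roughly like the following; the function name, the exported PBS_VNODENUM, and the assumption of passwordless ssh between compute nodes are mine, not SeisFlows code:

    # Rough paramiko-based equivalent of 'ssh node command' -- an
    # illustration only, not SeisFlows code.
    import paramiko

    def run_on_node(node, command, taskid):
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(node)  # assumes passwordless ssh between compute nodes
        _, stdout, _ = client.exec_command(
            'export PBS_VNODENUM=%d; %s' % (taskid, command))
        status = stdout.channel.recv_exit_status()  # block until it finishes
        client.close()
        return status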
