
problems related to PBS clusters #18

Closed
rmodrak opened this issue Feb 19, 2016 · 17 comments


@rmodrak
Collaborator

rmodrak commented Feb 19, 2016

One issue: PBS Pro and Torque are in many ways different and may require separate modules in seisflows/system.

@rmodrak
Collaborator Author

rmodrak commented Feb 24, 2016

On Wed, Feb 24, 2016 at 1:37 PM, Gian Matharu gian@ualberta.ca wrote:
Hey Ryan,

I was wondering if you'd noticed any lag between seisflows executing commands on clusters. I seem to notice lags between finishing one call and proceeding to the next, even though the prior task appears to be complete.

No, I don't believe we've had that problem. Perhaps some sort of 'epilogue'? http://docs.adaptivecomputing.com/torque/3-0-5/a.gprologueepilogue.php Or perhaps just unreliable behavior of PBS.

@rmodrak
Collaborator Author

rmodrak commented Mar 10, 2016

Update: By replacing PBS environment variables with MPI directives, Gian's system.mpi class removes the need to distinguish between PBS Pro and Torque in some cases; pbs_sm should work for both variants, I believe. To my knowledge, pbs_lg has been tested on PBS Pro clusters but still needs testing on Torque clusters (while it may work on the latter with small modifications, it is not likely to work right out of the box).

@gianmatharu
Collaborator

I ran into another topic that may warrant some consideration. For the PBS cluster I'm currently using, certain software needs to be loaded prior to running codes, e.g. "module load intel" loads the Intel compilers and so on. Admittedly, I'm not sure if this is standard for PBS/clusters in general, but would adding the option to specify such directives from within the seisflows system.submit be beneficial?

There are easy alternatives (using bash profiles or in some cases one could use a submission script to submit sfrun), so it's not a pressing issue and may not be necessary.

@rmodrak
Collaborator Author

rmodrak commented Mar 11, 2016

For controlling the user environment, I think the module utility is standard for all types of clusters, not just PBS; I can't think of a cluster I've used that hasn't had it. However, the list of available modules (module avail) is highly system dependent, so trying to script module load ... may not be a good approach.

A better approach, I think, would be to include environment assertions in check methods. For example, if mpi4py is required but not on the PYTHONPATH, an environment exception should be raised.
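
To make that concrete, a minimal sketch of such an assertion is given below. The check() method name follows the convention mentioned above, but the body is an illustration only, not actual SeisFlows code:

    # Hypothetical environment assertion inside a system class's check()
    # method -- an illustration, not actual SeisFlows code.
    def check(self):
        try:
            import mpi4py  # only verifying that the module can be imported
        except ImportError:
            raise EnvironmentError(
                'mpi4py is required by this system class but was not found '
                'on the PYTHONPATH')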


@dkzhangchao

dkzhangchao commented Sep 23, 2016

Hi Ryan,
Sorry for bothering you again. I am also running into a bug when using system='pbs_sm'; I am not sure if it is caused by mpi4py.

When I get to 'bin/xspecfem2D', I hit the following error:
[gpc-f106n012-ib0:30348] [[7631,1],2] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30353] [[7631,1],7] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30349] [[7631,1],3] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30351] [[7631,1],4] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30350] [[7631,1],1] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30352] [[7631,1],6] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30355] [[7631,1],5] routed:binomial: Connection to lifeline [[7631,0],0] lost
[gpc-f106n012-ib0:30355] [[7631,1],0] routed:binomial: Connection to lifeline [[7631,0],0] lost
Do you know what this means?
Thanks

EDIT: revised for clarity [rmodrak]

@rmodrak
Collaborator Author

rmodrak commented Sep 23, 2016

It seems like pbs_sm is not working on your cluster. Similar to the problem described here, mpi4py seems to fail when encountering Python subprocess calls. Strangely, on other PBS clusters, this is not always an issue.

As a workaround, try using pbs_torque_sm instead, which uses pbsdsh under the hood rather than mpi4py.

I'll go ahead and edit your post for clarity if that's alright.
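
For reference, switching system classes should be a one-line change in the parameter file. The snippet below follows the system='...' notation used elsewhere in this thread; the exact syntax is an assumption rather than confirmed SeisFlows usage:

    # hypothetical excerpt from the parameter file -- exact key name assumed
    system = 'pbs_torque_sm'    # uses pbsdsh under the hood instead of mpi4py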

@gianmatharu
Collaborator

The MPI (mpi4py) system class may not be an ideal solution for task parallelism on clusters. It is an approach that aims to provide flexibility across cluster resource managers. While I haven't encountered the particular issue above, I have encountered issues with the class in other forms. As Ryan has suggested, it is probably best to use the utilities provided by the resource manager itself (e.g. pbsdsh).

@rmodrak
Collaborator Author

rmodrak commented Sep 23, 2016

Thanks everyone for the useful comments. I've opened a new issue #40 for discussing shot parallelism problems of the kind Chao is experiencing. In this new issue, I've posted some previous emails between Gian and myself that seem relevant to the problem at hand.

@dkzhangchao

Hi Ryan and Gian,

I have tried system='pbsdsh', and I run into the same problem when it executes the following:

    system.run('solver', 'setup', hosts='all')

[screenshot: https://cloud.githubusercontent.com/assets/8068058/18825181/0f90753a-8395-11e6-95c5-0ee7f81da1ab.png]

Actually, the observed data are generated after this step, which means pbsdsh can be invoked here.
However, it seems to get stuck inside system.run (the subprocess.call to pbsdsh). The error is shown below:
[screenshot: https://cloud.githubusercontent.com/assets/8068058/18825372/0fc398c4-8396-11e6-8328-36b0511bf6a0.png]

Can you give me any advice? By the way, I do get the forward result (observed data) after system.run.

@gianmatharu
Collaborator

It does appear to be hanging, but it is not the issue I saw in issue #40 (where mpi4py was used). It seems to be due to pbsdsh hanging, which I have encountered. The issue is inconsistent for me; it can occur on one attempt and not the next. I don't observe it that frequently any more.

What workflow are you attempting to run? I'd suggest checking any instances of system.run (within the workflow) to make sure none of the calls to pbsdsh are problematic.


@dkzhangchao

I am running the inversion workflow. This is the first invocation of system.run (that is, the forward run to generate the observed data). I am confused: if system.run has a problem, why does it still generate the data after executing the code below?

    system.run('solver', 'setup', hosts='all')
By the way, do you think my system.run call is correct?
[screenshot omitted]

@rmodrak
Collaborator Author

rmodrak commented Sep 26, 2016

Perhaps using pbsdsh to invoke a simple "hello world" script might be useful as a debugging step. The integration test located in seisflows/tests/test_system could be employed for this purpose.

That said, my understanding from speaking to our local cluster administrator is that the pbsdsh utility itself can be unreliable. This may well be the explanation for what you are seeing, but it may be worth troubleshooting a bit to make sure there's not some alternate explanation.
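
For reference, a minimal version of such a check might look like the sketch below, assuming a hypothetical hello.py that is made executable and invoked by its absolute path (pbsdsh typically requires both); it is not part of the SeisFlows test suite:

    #!/usr/bin/env python
    # hello.py -- hypothetical pbsdsh sanity check, not part of SeisFlows.
    # From inside a PBS job, run something like:  pbsdsh /full/path/to/hello.py
    # One line should print per allocated slot; if nothing appears or the call
    # hangs, pbsdsh itself is likely at fault.
    import os
    import socket
    print('hello from %s, PBS_VNODENUM=%s'
          % (socket.gethostname(), os.getenv('PBS_VNODENUM')))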

@rmodrak
Collaborator Author

rmodrak commented Sep 26, 2016

If pbsdsh is in fact the problem, a workaround might be to create your own dsh script by calling ssh within a for loop. You would need to (1) manually specify the compute nodes to run on, using the list of allocated nodes made available by PBS, (2) use the SendEnv option or something similar to assign each ssh process a unique identifier, and (3) wait until all the ssh processes complete before moving on. A disadvantage of this approach is that the child ssh processes might continue running in the event the parent process fails.
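
A rough Python sketch of that idea follows; the function name and the assumption of passwordless ssh between compute nodes are illustrative, not SeisFlows code. Here the unique identifier is exported inside the remote command, mirroring pbsdsh's PBS_VNODENUM, rather than passed via SendEnv:

    # Sketch of a pbsdsh replacement: launch one ssh process per allocated
    # node, tag each with a unique task number, then wait for all of them.
    import os
    import subprocess

    def run_on_all_nodes(command):
        # (1) read the list of allocated nodes provided by PBS
        with open(os.environ['PBS_NODEFILE']) as f:
            nodes = f.read().split()
        processes = []
        for taskid, node in enumerate(nodes):
            # (2) give each remote process a unique identifier
            remote = 'export PBS_VNODENUM=%d; %s' % (taskid, command)
            processes.append(subprocess.Popen(['ssh', node, remote]))
        # (3) wait until all ssh processes complete before moving on
        for p in processes:
            p.wait()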

@rmodrak
Collaborator Author

rmodrak commented Sep 26, 2016

Yet another workaround would be to write an entirely new PBS system class based on the following package:

https://radicalpilot.readthedocs.io/en/latest/

Through an NSF grant, we are currently collaborating with the developers of the RADICAL-Pilot package, so such an interface might eventually be added to seisflows, but not for a while yet. If you get something working in the meantime, please feel free to submit a pull request.

@rmodrak
Collaborator Author

rmodrak commented Sep 27, 2016

To reflect the expanded scope of the discussion, I'll go ahead and change the issue title to something more general.

Also, it's worth noting that none of the problems mentioned above have been encountered on SLURM clusters. So if the possibility ever arises to switch to SLURM I'd highly recommend it...

@rmodrak changed the title from "need to distinguish between PBS PRO and TORQUE system interfaces" to "problems related to PBS clusters" on Sep 27, 2016
@dkzhangchao

dkzhangchao commented Sep 27, 2016

Hi Ryan,
Yes, I agree with you. To be honest, there are more issues with PBS systems than with SLURM. The cluster staff tell me there seems to be a problem with pbsdsh, so I wrote a script that emulates the behavior of pbsdsh. Does it match your suggestion?

#!/bin/bash
# temporary workaround for pbsdsh
# cat ${PBS_NODEFILE}
k=0
for i in $(cat $PBS_NODEFILE); do
    ssh $i "export PBS_VNODENUM=$k; $@"
    k=$(($k + 1))
done

With this it runs successfully, but compared with pbsdsh it is much slower.

@rmodrak
Collaborator Author

rmodrak commented Sep 27, 2016

This is definitely in the right direction. If you were implementing it in bash, it would be something like what you have written. However, since you would be overloading the 'run' method of the PBS system class, it actually needs to be implemented in Python. Come to think of it, rather than a subprocess call to ssh, in Python it would probably be better to use paramiko (http://stackoverflow.com/questions/3586106/perform-commands-over-ssh-with-python).
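
For illustration, a single-task version using paramiko might look roughly like the following; the function name, the exported PBS_VNODENUM, and the assumption of passwordless ssh between compute nodes are mine, not SeisFlows code:

    # Rough paramiko-based equivalent of 'ssh node command' -- an
    # illustration only, not SeisFlows code.
    import paramiko

    def run_on_node(node, command, taskid):
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(node)  # assumes passwordless ssh between compute nodes
        _, stdout, _ = client.exec_command(
            'export PBS_VNODENUM=%d; %s' % (taskid, command))
        status = stdout.channel.recv_exit_status()  # block until it finishes
        client.close()
        return status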
