shot parallelism discussion #40

Closed
rmodrak opened this issue Sep 23, 2016 · 4 comments

Comments

@rmodrak
Collaborator

rmodrak commented Sep 23, 2016

On Mon, Jul 18, 2016 at 3:23 PM, Gian Matharu gian@ualberta.ca wrote:

Hi Ryan,

So I was wondering if you've run into any issues using the MPI system class. It works fine on my local PC, but when I try to run on HPC systems I begin running into peculiar issues. Occasionally one of the subprocess calls seems to simply hang (e.g. on a forward/adjoint solve). I was curious if you'd experienced anything similar at any point.

Regards,

Gian

@rmodrak
Collaborator Author

rmodrak commented Sep 23, 2016

On Mon, Jul 18, 2016 at 3:19 PM, Ryan Modrak rmodrak@princeton.edu wrote:

The only similar issue I've heard about is that on some LSF and PBS clusters, the workflow hangs after a long time or a large number of iterations. I'm not aware of any cases where a subprocess call is specifically to blame. Perhaps it is worth asking the cluster admins if they've experienced anything similar. If it happens often enough to be disruptive, there may be some way of detecting when a subprocess stalls (the link below might be relevant) and then rerunning the stalled process (or rerunning all processes, if that makes things simpler).

http://stackoverflow.com/questions/1191374/using-module-subprocess-with-timeout
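
A minimal sketch of the stall detection and rerun suggested above, assuming Python 3.5+ (for `subprocess.run` and its `timeout` argument). The function name `call_with_retry` and its parameters are hypothetical and not part of the system class discussed here.

```python
import subprocess
import sys


def call_with_retry(cmd, timeout_s, max_attempts=3):
    """Run cmd; if it exceeds timeout_s seconds, kill it and try again.

    Hypothetical helper for illustration only.
    Returns the exit code of the first attempt that finishes in time.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # On timeout, subprocess.run kills the child, waits for it,
            # and then raises subprocess.TimeoutExpired.
            completed = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print("attempt %d stalled after %s s, rerunning"
                  % (attempt, timeout_s), file=sys.stderr)
            continue
        return completed.returncode
    raise RuntimeError("command stalled on all %d attempts" % max_attempts)
```

A call such as `call_with_retry(["mpiexec", "-n", "4", "./solver"], timeout_s=3600)` (placeholder command) would replace a bare `subprocess.call`. One caveat: only the direct child is killed on timeout, so processes spawned by a launcher may need separate cleanup.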

@rmodrak
Collaborator Author

rmodrak commented Sep 23, 2016

On Mon, Aug 29, 2016 at 11:52 AM, Gian Matharu gian@ualberta.ca wrote:

From what I remember, certain sections seemed to hang. I have a feeling it was somehow related to issues with IO. It's possible that there could have been some hidden race condition but I couldn't find it and the subprocess.call command is supposed to be a blocking call.

I think the system class itself is mostly OK, but it's clearly dependent on the solver implementation. Employing parallel shots via a subprocess MPI call to a Python wrapper does obfuscate things, though, so a simpler solution might be preferable.

@rmodrak
Collaborator Author

rmodrak commented Sep 23, 2016

On Mon, Aug 29, 2016 12:11 PM Ryan Modrak wrote:

I agree, employing parallel shots via a subprocess MPI call to a Python wrapper is quite obfuscated. Why isn't there some simple, readily available solution? Since the shots themselves are embarrassingly parallel, as a basic principle, race conditions shouldn't be an issue, right? I don't see why there isn't some readily available dsh script that does the trick. For Slurm clusters a solution does exist: srun. For PBS clusters there is pbsdsh, but that utility itself seems very halfhearted and unstable. Why there isn't some cross-platform solution, I'm not sure. Please let me know if I am missing anything.
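
To illustrate the embarrassingly parallel principle, here is a rough single-node sketch, assuming Python 3.2+ and that each shot can run as a separate serial command. Each shot gets its own working directory, so no files are shared between shots. The names `run_shot` and `run_all_shots` and the `shot_%04d` directory layout are hypothetical. This does not distribute work across cluster nodes the way srun or pbsdsh do.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_shot(shot, cmd, base_dir):
    """Run one shot in its own directory so shots share no files."""
    workdir = os.path.join(base_dir, "shot_%04d" % shot)
    os.makedirs(workdir, exist_ok=True)
    # subprocess.call blocks until this shot's process exits.
    return subprocess.call(cmd, cwd=workdir)


def run_all_shots(nshots, cmd, base_dir, max_workers=4):
    """Run shots 0..nshots-1 concurrently; return exit codes in shot order."""
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: run_shot(s, cmd, base_dir),
                             range(nshots)))
```

Because every shot writes only inside its own directory, the order in which shots start or finish does not matter.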

@rmodrak
Collaborator Author

rmodrak commented Sep 23, 2016

On Mon, Aug 29, 2016 at 12:28 PM, Gian Matharu gian@ualberta.ca wrote:

Race condition might be the wrong term in this case. It seemed to be a lack of synchronization between the processes, possibly related to reading/writing.

I ran into issues with pbsdsh, which is why I left it originally. I briefly looked into a utility called GNU Parallel. It basically launches serial programs in an embarrassingly parallel fashion. I think the issue I had with it was finding a unique identifier (something like my_rank) to distinguish the shots.
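
On the unique identifier: common launchers export a per-task index as an environment variable, which a Python wrapper could read in place of my_rank. The sketch below assumes `SLURM_PROCID` (srun), `PBS_VNODENUM` (pbsdsh), `OMPI_COMM_WORLD_RANK` (Open MPI) and `PARALLEL_SEQ` (GNU Parallel, which counts from 1). Availability should be checked on the cluster in question. The function name `shot_index` is hypothetical.

```python
import os

# (variable name, offset subtracted so the result is zero-based)
_RANK_VARIABLES = [
    ("SLURM_PROCID", 0),          # assumed set by srun
    ("PBS_VNODENUM", 0),          # assumed set by pbsdsh
    ("OMPI_COMM_WORLD_RANK", 0),  # assumed set by Open MPI's mpirun
    ("PARALLEL_SEQ", 1),          # assumed set by GNU Parallel, starts at 1
]


def shot_index(environ=None):
    """Return a zero-based task index taken from the launcher's environment."""
    environ = os.environ if environ is None else environ
    for name, offset in _RANK_VARIABLES:
        value = environ.get(name)
        if value is not None:
            return int(value) - offset
    raise KeyError("no launcher rank variable found in the environment")
```

With GNU Parallel, the same sequence number is also available on the command line through the `{#}` replacement string.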

@bch0w closed this as not planned on May 31, 2022