[not urgent] MPI issue with mpirun #274

Open
jchodera opened this issue Jun 21, 2015 · 10 comments

Comments

@jchodera
Member

I'm having trouble with an MPI job (3671529).

The error on mpirun launch is:

[proxy:0:2@gpu-2-4.local] HYDU_create_process (./utils/launch/launch.c:69): execvp error on file 0 (No such file or directory)

Here is the request block:

# specify queue
#PBS -q gpu
#
# nodes: number of nodes
#PBS -l procs=16,gpus=1:shared

Just documenting this for now while I debug in case someone else has a similar issue.
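
For context, that request block sits in an ordinary Torque/PBS submission script. A minimal sketch of the rest of such a script is below; the working-directory handling and the executable name are assumptions for illustration, not taken from the failing job.

#!/bin/bash
# specify queue
#PBS -q gpu
#
# nodes: number of nodes
#PBS -l procs=16,gpus=1:shared

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Launch with an absolute path so every node resolves the same file;
# "my_mpi_app" is a placeholder, not the actual executable from the job.
mpirun -np 16 "$PBS_O_WORKDIR/my_mpi_app"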

@jchodera
Member Author

Using a request of 8 processes/GPUs (3671530) worked fine:

# specify queue
#PBS -q gpu
#
# nodes: number of nodes
#PBS -l procs=8,gpus=1:shared

I wonder if there is some sort of node misconfiguration or hang.

@jchodera
Member Author

From what I can find, this looks like the executable may not be visible on the filesystem on all nodes. I wonder if this suggests a GPFS synchronization hiccup.
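
One quick way to test that (a sketch only; the executable path below is a placeholder, and this assumes it is run from inside the job where $PBS_NODEFILE is set) is to list the binary from every node in the allocation:

#!/bin/bash
# Check that every node assigned to the job can see the executable.
# /path/to/mpi_executable is a placeholder path.
for host in $(sort -u "$PBS_NODEFILE"); do
    ssh "$host" "ls -l /path/to/mpi_executable" || echo "not visible on $host"
done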

@tatarsky
Contributor

What level of importance is this? It's Sunday. It's Father's Day. Are you asking me to look at this right now, @jchodera?

@jchodera
Member Author

Sorry, no! Absolutely not!

@jchodera
Member Author

This was just "I am debugging, putting info online in case it's useful to others." Will mark as non-urgent.

@jchodera jchodera changed the title MPI issue [NOT urgent] MPI issue with mpirun Jun 21, 2015
@tatarsky
Contributor

Thank you. While I do not see anything obvious at the moment, I will investigate with you on Monday.

@jchodera jchodera changed the title [NOT urgent] MPI issue with mpirun [not urgent] MPI issue with mpirun Jun 21, 2015
@tatarsky
Contributor

When is a good time to try to debug this?

@jchodera
Member Author

Any chance we could look into this Tuesday at 12:00 PM EST? This isn't urgent, but I've noticed some other issues that seem to be due to minor environment differences among nodes. I'm still trying to do more debugging on that (and am slammed with meetings today).

@tatarsky
Contributor

Sounds good.

@tatarsky tatarsky self-assigned this Jul 2, 2015
@tatarsky
Contributor

tatarsky commented Jul 6, 2015

Was this handled? I believe you said we would review it when discussing #275, which is indeed confirmed as a bug; I suspect it will be some time before Adaptive issues a fix (unless we declare it critical).
