dmtcp with PBS scheduler #439
Comments
Could you try to attach gdb to the process that's stuck and share the backtrace? Is this an MPI application?
Thank you for the response. It is not an MPI application, but one of the simplest dmtcp examples, dmtcp1.c. Here is the backtrace: (gdb) bt From what I can see, when I checkpoint and then restart without using the PBS scheduler, the sockets opened by dmtcp_coordinator match those opened by the restarted application. However, when PBS is used, checkpointing runs fine, but the restarted application opens one more socket than dmtcp_coordinator does, so they don't match. Do you know what the reason might be? Sockets opened by the restarted application: Sockets opened by dmtcp_coordinator:
The application can open any number of sockets; that number has no relation to the number of sockets that the coordinator has. However, it sounds like the application process has a socket that it inherits from the PBS launcher process. It should be easy to recognize this socket and "blacklist" it to make sure that DMTCP doesn't try to drain it. In the ideal case, there should be just one socket (in the connected state) from the dmtcp1 application to the coordinator. The coordinator should have 2 sockets: 1 in the connected state (for the dmtcp1 process), and 1 in the listen state. So, if you see that the application has 2 sockets open prior to checkpointing, my guess is that it inherits the extra socket from the PBS launcher process. Could you confirm this? Also, the backtrace that you shared is incomplete and doesn't indicate any errors. Could you try running the backtrace command on all the threads (thread apply all bt)?
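One way to check which sockets a process holds is to read its file-descriptor table under /proc. The following is only a minimal sketch, assuming a Linux system where /proc/<pid>/fd is readable; the helper name `list_socket_fds` and its `proc_root` parameter are hypothetical and not part of DMTCP or PBS.

```python
import os


def list_socket_fds(pid, proc_root="/proc"):
    """Return {fd: socket inode} for every socket fd held by `pid`.

    Assumes a Linux-style /proc layout, where each entry in
    <proc_root>/<pid>/fd is a symlink and sockets appear as 'socket:[inode]'.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    sockets = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd may have been closed between listdir() and readlink().
            continue
        if target.startswith("socket:[") and target.endswith("]"):
            sockets[int(name)] = target[len("socket:["):-1]
    return sockets


if __name__ == "__main__":
    # Example: inspect the current process; pass the pid of dmtcp1 instead.
    for fd, inode in sorted(list_socket_fds(os.getpid()).items()):
        print(f"fd {fd}: socket inode {inode}")
```

Running this against the application's pid before checkpointing would show whether it holds more than the single coordinator socket described above. The inode numbers can then be matched against the output of a tool such as `ss` or `lsof` to find out which peer the extra socket belongs to.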
Thanks again for your help, Rohan. I agree with your analysis. I can recognize the socket that the application process inherits from the PBS launcher process while the process is running, but do you have any idea how to get information about the additional socket before the application is restarted? Should I talk to the PBS people about this? And can you suggest how to blacklist it? Below are the results of the backtrace command on all the threads: Thread 2 (Thread 0x7fcbfe94b700 (LWP 17654)): Thread 1 (Thread 0x7fcc02167740 (LWP 17637)):
In the first PBS job, which runs dmtcp1 for the first time, there is only one socket both before checkpointing and after checkpointing starts, so dmtcp1 runs well in this job. However, in the second PBS job, which restarts dmtcp1 from the last checkpoint saved in the previous job, the application has 2 sockets open prior to checkpointing. I think this confirms that the application process inherits one extra socket from the PBS launcher process?
This fixes issue dmtcp#439.
Fixed via PR #510.
Hi,
I'm wondering if anybody has gotten dmtcp to work with PBS? I understand it is not a supported resource manager, but I want to know if anybody has had any luck. I have made a setup similar to the one for Slurm. The launch part seems to be OK, but resuming does not work with the scheduler, the generated dmtcp_restart_script.sh, or the .dmtcp file. strace shows the restarted process is waiting for something:
futex(0x7fb31b83a600, FUTEX_WAIT_PRIVATE, 0, NULL
I'm using dmtcp 3.0.0.
Thanks!