dmtcp with PBS scheduler #439

Closed
angel-smile opened this issue Oct 18, 2016 · 6 comments

@angel-smile

Hi,

I'm wondering if anybody has gotten DMTCP to work with PBS. I understand it is not a supported resource manager, but I'd like to know if anybody has had any luck. I have made a setup similar to the one for Slurm. The launch part seems to be OK, but resume does not work with the scheduler, the generated dmtcp_restart_script.sh, or the .dmtcp file. strace shows the restarted process is waiting for something:

futex(0x7fb31b83a600, FUTEX_WAIT_PRIVATE, 0, NULL

I'm using dmtcp 3.0.0.

Thanks!

@rohgarg
Contributor

rohgarg commented Nov 1, 2016

Could you try to attach gdb to the process that's stuck and share the backtrace? Is this an MPI application?

@angel-smile
Author

Thank you for the response. It is not an MPI application, but one of the simplest dmtcp examples, dmtcp1.c. Here is the backtrace:

(gdb) bt
#0 0x00007f77309b3a00 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007f7731e25258 in dmtcp::ThreadList::waitForAllRestored(Thread*) () at threadlist.cpp:643
#2 0x00007f7731e273c3 in stopthisthread(int) () at threadlist.cpp:598
#3 <signal handler called>
#4 0x00007f7731458bdd in nanosleep () from /lib64/libc.so.6
#5 0x00007f7731458a50 in sleep () from /lib64/libc.so.6
#6 0x0000000000400c11 in main (argc=-409088528, argv=0x7ffce79dcdf0) at dmtcp1.c:12

From what I can see, when I checkpoint and then restart without using the PBS scheduler, the sockets opened by dmtcp_coordinator match those opened by the restarted application. However, when PBS is used, checkpointing runs fine, but the restarted application opens one more socket than dmtcp_coordinator does, so they don't match. Do you know what the reason might be?

Sockets opened by the restarted application:
$ ls -l /proc/9836/fd
821 -> socket:[58290]
834 -> socket:[58298]

Sockets opened by dmtcp_coordinator:
$ ls -l /proc/9833/fd
4 -> socket:[58261]
6 -> socket:[58291]

@rohgarg
Contributor

rohgarg commented Nov 3, 2016

The application can open any number of sockets; it has no relation to the number of sockets that the coordinator has. However, it sounds like the application process has a socket that it inherits from the PBS launcher process. It should be easy to recognize this socket and "blacklist" it to make sure that DMTCP doesn't try to drain it.

In the ideal case, there should be just one socket (in connected state) from the dmtcp1 application to the coordinator. The coordinator should have 2 sockets: 1 in connected state (for the dmtcp1 process), and 1 in listen state. So, if you see that the application has 2 sockets open prior to checkpointing, my guess is that it inherits the extra socket from the PBS launcher process. Could you confirm this?
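
One way to check which sockets a process holds is to read the symlinks under /proc/<pid>/fd, which is what `ls -l /proc/<pid>/fd` displays. The following is only a rough sketch, assuming a Linux host and permission to read the target's /proc entry; the helper names `parse_socket_links` and `read_fd_links` are made up for illustration and are not part of DMTCP or PBS.

```python
import os
import re
import sys

# Link targets for sockets look like "socket:[58290]" (the number is the inode).
SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def parse_socket_links(links):
    """Return {fd: inode} for every fd whose link target is a socket."""
    sockets = {}
    for fd, target in links.items():
        match = SOCKET_LINK.match(target)
        if match:
            sockets[fd] = int(match.group(1))
    return sockets


def read_fd_links(pid):
    """Read the symlink targets under /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    links = {}
    for name in os.listdir(fd_dir):
        try:
            links[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink(); skip it.
            continue
    return links


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])
    for fd, inode in sorted(parse_socket_links(read_fd_links(pid)).items()):
        print(f"{fd} -> socket:[{inode}]")
```

Running it against the application's PID before the checkpoint and again after the restart should show whether an extra socket fd appears only in the restarted process.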

Also, the backtrace that you shared is incomplete and doesn't indicate any errors. Could you try running the backtrace command on all the threads (thread apply all bt)?
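
If attaching interactively is awkward inside a batch job, the same command can be run non-interactively. This is a small sketch, assuming gdb is installed on the compute node and you are allowed to attach to the process; the function names are hypothetical, while `-p`, `-batch`, and `-ex` are standard gdb options.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb invocation that attaches to pid, prints every thread's backtrace, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def all_thread_backtraces(pid):
    """Run gdb in batch mode and return whatever it printed on stdout."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero if attaching is not permitted
    )
    return result.stdout
```

Calling `all_thread_backtraces(pid)` with the PID of the stuck process returns the text to paste into the issue.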

@angel-smile
Author

Thanks again for your help, Rohan. I agree with your analysis. I can recognize the socket that the application process inherits from the PBS launcher process while the process is running, but do you have any idea how to get information about the additional socket before the application is restarted? Should I talk to the PBS people about this? And can you suggest how to blacklist it?

Below are the results of the backtrace command on all the threads:

Thread 2 (Thread 0x7fcbfe94b700 (LWP 17654)):
#0 0x00007fcc004c0283 in poll () from /lib64/libc.so.6
#1 0x00007fcc014f3f23 in dmtcp::ConnectionList::sendReceiveMissingFds() () at ipc/connectionlist.cpp:506
#2 0x00007fcc00e613dd in dmtcp::PluginInfo::processBarrier(dmtcp::BarrierInfo*) () at plugininfo.cpp:122
#3 0x00007fcc00e61690 in dmtcp::PluginInfo::processBarriers() () at plugininfo.cpp:102
#4 0x00007fcc00e634ea in dmtcp::PluginManager::processRestartBarriers() () at pluginmanager.cpp:159
#5 0x00007fcc00e4155f in dmtcp::DmtcpWorker::postRestart() () at dmtcpworker.cpp:529
#6 0x00007fcc00e5a294 in dmtcp::ThreadList::waitForAllRestored(Thread*) () at threadlist.cpp:621
#7 0x00007fcc00e5bb9e in checkpointhread(void*) () at threadlist.cpp:377
#8 0x00007fcc00e52f9c in pthread_start(void*) () at threadwrappers.cpp:154
#9 0x00007fcbff9e2aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fcc00e52dfd in clone_start(void*) () at threadwrappers.cpp:65
#11 0x00007fcc004c9aad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fcc02167740 (LWP 17637)):
#0 0x00007fcbff9e8a00 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007fcc00e5a258 in dmtcp::ThreadList::waitForAllRestored(Thread*) () at threadlist.cpp:643
#2 0x00007fcc00e5c3c3 in stopthisthread(int) () at threadlist.cpp:598
#3 <signal handler called>
#4 0x00007fcc0048dbdd in nanosleep () from /lib64/libc.so.6
#5 0x00007fcc0048da50 in sleep () from /lib64/libc.so.6
#6 0x0000000000400c11 in main (argc=1467939472, argv=0x7fff577efa90) at dmtcp1.c:12

@angel-smile
Author

> In the ideal case, there should be just one socket (in connected state) from the dmtcp1 application to the coordinator. The coordinator should have 2 sockets: 1 in connected state (for the dmtcp1 process), and 1 in listen state. So, if you see that the application has 2 sockets open prior to checkpointing, my guess is that it inherits the extra socket from the PBS launcher process. Could you confirm this?

In the first PBS job, which runs dmtcp1 for the first time, there is only one socket both before and after checkpointing starts, so dmtcp1 runs well in this job. However, in the second PBS job, which restarts dmtcp1 from the last checkpoint saved in the previous job, the application has 2 sockets open prior to checkpointing. I think this confirms that the application process inherits one extra socket from the PBS launcher process?

rohgarg added a commit to rohgarg/dmtcp-1 that referenced this issue Jan 9, 2017
rohgarg added a commit to rohgarg/dmtcp-1 that referenced this issue Jan 11, 2017
rohgarg added a commit that referenced this issue Jan 11, 2017
@rohgarg
Contributor

rohgarg commented Jan 11, 2017

Fixed via PR #510.

@rohgarg rohgarg closed this as completed Jan 11, 2017