dmtcp with PBS scheduler #439

angel-smile · 2016-10-18T12:21:00Z

Hi,

I'm wondering if anybody has gotten dmtcp to work with PBS. I understand it is not a supported resource manager, but I'd like to know if anybody has had any luck. I have made a setup similar to the one for Slurm. The launch part seems to be OK, but resume is not working with the scheduler, the generated dmtcp_restart_script.sh, or the .dmtcp file. strace shows the restarted process waiting on a futex:

futex(0x7fb31b83a600, FUTEX_WAIT_PRIVATE, 0, NULL

I'm using dmtcp 3.0.0.

Thanks!

rohgarg · 2016-11-01T20:11:46Z

Could you try to attach gdb to the process that's stuck and share the backtrace? Is this an MPI application?

angel-smile · 2016-11-02T01:28:43Z

Thank you for the response. It is not an MPI application, but one of the simplest dmtcp examples, dmtcp1.c. Here is the backtrace:

(gdb) bt
#0 0x00007f77309b3a00 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007f7731e25258 in dmtcp::ThreadList::waitForAllRestored(Thread*) () at threadlist.cpp:643
#2 0x00007f7731e273c3 in stopthisthread(int) () at threadlist.cpp:598
#3 <signal handler called>
#4 0x00007f7731458bdd in nanosleep () from /lib64/libc.so.6
#5 0x00007f7731458a50 in sleep () from /lib64/libc.so.6
#6 0x0000000000400c11 in main (argc=-409088528, argv=0x7ffce79dcdf0) at dmtcp1.c:12

From what I can see, when I checkpoint and then restart without using the PBS scheduler, the sockets opened by dmtcp_coordinator match those opened by the restarted application. However, when PBS is used, checkpointing runs fine, but the restarted application opens one more socket than dmtcp_coordinator does, so they don't match. Do you know what the reason might be?

Sockets opened by the restarted application:
$ ls -l /proc/9836/fd
821 -> socket:[58290]
834 -> socket:[58298]

Sockets opened by dmtcp_coordinator:
$ ls -l /proc/9833/fd
4 -> socket:[58261]
6 -> socket:[58291]
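
The same listing can be collected programmatically, which makes it easier to compare runs with and without PBS. This is a minimal sketch, assuming a Linux /proc layout; the function name `socket_fds` and the `proc_root` parameter are hypothetical and are not part of dmtcp.

```python
import os
import re

# Link targets of socket descriptors look like "socket:[58290]".
SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def socket_fds(pid, proc_root="/proc"):
    """Return {fd_number: socket_inode} for every socket descriptor of `pid`."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were scanning
        match = SOCKET_LINK.match(target)
        if match:
            result[int(name)] = int(match.group(1))
    return result
```

Calling `socket_fds(9836)` and `socket_fds(9833)` would return the two listings above as dictionaries, provided the caller has permission to read those processes' fd directories.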

rohgarg · 2016-11-03T19:23:51Z

The application can open any number of sockets; it has no relation to the number of sockets that the coordinator has. However, it sounds like the application process has a socket that it inherits from the PBS launcher process. It should be easy to recognize this socket and "blacklist" it to make sure that DMTCP doesn't try to drain it.

In the ideal case, there should be just one socket (in connected state) from the dmtcp1 application to the coordinator. The coordinator should have 2 sockets: 1 in connected state (for the dmtcp1 process), and 1 in listen state. So, if you see that the application has 2 sockets open prior to checkpointing, my guess is that it inherits the extra socket from the PBS launcher process. Could you confirm this?
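
One way to check the connected and listen states is to look up each socket inode in /proc/net/tcp. The sketch below assumes the sockets are IPv4 TCP sockets and that the file uses the usual Linux column layout (state in the fourth column, inode in the tenth); `socket_states` is a hypothetical helper, not a dmtcp function.

```python
# Kernel TCP state codes as printed in /proc/net/tcp (only the two of interest).
TCP_STATES = {"01": "ESTABLISHED", "0A": "LISTEN"}


def socket_states(proc_net_tcp_text):
    """Map socket inode -> TCP state name, given the text of /proc/net/tcp."""
    states = {}
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10:
            continue  # blank or truncated line
        state_code = fields[3]
        inode = int(fields[9])
        # Unknown codes are passed through unchanged rather than guessed.
        states[inode] = TCP_STATES.get(state_code, state_code)
    return states
```

Reading the file with `open("/proc/net/tcp").read()` and passing the text to `socket_states` gives a lookup table for the inodes shown by `ls -l /proc/<pid>/fd`. Inodes that do not appear are not IPv4 TCP sockets (for example, Unix domain sockets).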

Also, the backtrace that you shared is incomplete and doesn't indicate any errors. Could you try running the backtrace command on all the threads (thread apply all bt)?
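
For a process that is already stuck, the all-threads backtrace can be captured non-interactively. This is a rough sketch that assumes gdb is installed and that the caller is allowed to attach to the process; the helper names are hypothetical.

```python
import subprocess


def gdb_command(pid):
    """Build a gdb invocation that attaches to `pid`, dumps all threads, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def all_thread_backtraces(pid):
    """Run gdb in batch mode and return whatever it printed on stdout."""
    completed = subprocess.run(
        gdb_command(pid), capture_output=True, text=True, check=False
    )
    return completed.stdout
```

Batch mode detaches from the process when gdb exits, so the stuck process is left as it was.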

angel-smile · 2016-11-04T03:39:38Z

Thanks again for your help, Rohan. I agree with your analysis. I can recognize the socket that the application process inherits from the PBS launcher process while the process is running, but do you have any idea how to get information about the additional socket before the application is restarted? Should I talk to the PBS people about this? And can you suggest how to blacklist it?

Below are the results of the backtrace command on all the threads:

Thread 2 (Thread 0x7fcbfe94b700 (LWP 17654)):
#0 0x00007fcc004c0283 in poll () from /lib64/libc.so.6
#1 0x00007fcc014f3f23 in dmtcp::ConnectionList::sendReceiveMissingFds() () at ipc/connectionlist.cpp:506
#2 0x00007fcc00e613dd in dmtcp::PluginInfo::processBarrier(dmtcp::BarrierInfo*) () at plugininfo.cpp:122
#3 0x00007fcc00e61690 in dmtcp::PluginInfo::processBarriers() () at plugininfo.cpp:102
#4 0x00007fcc00e634ea in dmtcp::PluginManager::processRestartBarriers() () at pluginmanager.cpp:159
#5 0x00007fcc00e4155f in dmtcp::DmtcpWorker::postRestart() () at dmtcpworker.cpp:529
#6 0x00007fcc00e5a294 in dmtcp::ThreadList::waitForAllRestored(Thread*) () at threadlist.cpp:621
#7 0x00007fcc00e5bb9e in checkpointhread(void*) () at threadlist.cpp:377
#8 0x00007fcc00e52f9c in pthread_start(void*) () at threadwrappers.cpp:154
#9 0x00007fcbff9e2aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fcc00e52dfd in clone_start(void*) () at threadwrappers.cpp:65
#11 0x00007fcc004c9aad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fcc02167740 (LWP 17637)):
#0 0x00007fcbff9e8a00 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007fcc00e5a258 in dmtcp::ThreadList::waitForAllRestored(Thread*) () at threadlist.cpp:643
#2 0x00007fcc00e5c3c3 in stopthisthread(int) () at threadlist.cpp:598
#3 <signal handler called>
#4 0x00007fcc0048dbdd in nanosleep () from /lib64/libc.so.6
#5 0x00007fcc0048da50 in sleep () from /lib64/libc.so.6
#6 0x0000000000400c11 in main (argc=1467939472, argv=0x7fff577efa90) at dmtcp1.c:12

angel-smile · 2016-11-04T05:14:04Z

> In the ideal case, there should be just one socket (in connected state) from the dmtcp1 application to the coordinator. The coordinator should have 2 sockets: 1 in connected state (for the dmtcp1 process), and 1 in listen state. So, if you see that the application has 2 sockets open prior to checkpointing, my guess is that it inherits the extra socket from the PBS launcher process. Could you confirm this?

In the first PBS job, which runs dmtcp1 for the first time, there is only one socket both before checkpointing and after checkpointing starts, so dmtcp1 runs well in this job. However, in the second PBS job, which restarts dmtcp1 from the last checkpoint saved in the previous job, the application has 2 sockets open prior to checkpointing. I think this confirms that the application process inherits one extra socket from the PBS launcher process?
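
To pin down which descriptor is the extra one, the fd tables of the two jobs can be compared by descriptor number. This is a minimal sketch that assumes restored descriptors keep the numbers they had in the first job, so any number that only appears in the second job is a candidate for the inherited descriptor; `new_descriptors` and its inputs are hypothetical.

```python
def new_descriptors(before, after):
    """Return {fd: target} for descriptors in `after` whose number is absent from `before`.

    Both arguments map a descriptor number to its /proc/<pid>/fd link target,
    for example {821: "socket:[58290]"}.
    """
    return {fd: target for fd, target in after.items() if fd not in before}
```

Socket inode numbers change across a restart, so the comparison deliberately ignores the link targets and looks only at the descriptor numbers.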

rohgarg · 2017-01-11T18:07:23Z

Fixed via PR #510.

rohgarg added a commit to rohgarg/dmtcp-1 that referenced this issue Jan 9, 2017

fileconnlist: Mark /proc/pid/environ as pre-existing fd

5c41cae

This fixes issue dmtcp#439.
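
The commit title points at an open descriptor on /proc/pid/environ rather than at a socket. As an illustration only (this is not the code from the commit, and `environ_fds` is a hypothetical helper), such descriptors can be spotted in a process's fd table like this:

```python
import re

# Matches /proc/<pid>/environ and /proc/self/environ, and nothing longer.
ENVIRON_PATH = re.compile(r"^/proc/(\d+|self)/environ$")


def environ_fds(fd_targets):
    """Return the sorted descriptor numbers whose link target is a /proc environ file.

    `fd_targets` maps a descriptor number to its /proc/<pid>/fd link target.
    """
    return sorted(
        fd for fd, target in fd_targets.items() if ENVIRON_PATH.match(target)
    )
```

A descriptor reported here in the restart job but not opened by the application itself would be consistent with one inherited from the launcher.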

rohgarg mentioned this issue Jan 9, 2017

fileconnlist: Mark /proc/pid/environ as pre-existing fd #510

Merged

rohgarg added a commit to rohgarg/dmtcp-1 that referenced this issue Jan 11, 2017

fileconnlist: Mark /proc/pid/environ as pre-existing fd

98756ac

This fixes issue dmtcp#439.

rohgarg added a commit that referenced this issue Jan 11, 2017

fileconnlist: Mark /proc/pid/environ as pre-existing fd

8506484

This fixes issue #439.

rohgarg closed this as completed Jan 11, 2017
