cylc task-to-suite messaging in HPC facilities #67
One immediate solution (about to be used at MPI, Luis?) is to allow network routing from the HPC to one or more dedicated cylc servers, without opening up the routing more generally. This clearly requires some significant action to be taken by the HPCF sysadmins, however.
One-way communication (polling)? Perhaps we could have cylc spawn light-weight local (suite host) processes alongside each remote task, that poll the remote task host(s) to determine remote task progress (remote tasks might have to write progress updates to a standard file or something). Then the local processes, on detecting a change in the status of their associated remote tasks, could invoke the required cylc messaging locally (on the suite host) to update the running suite. Not a very elegant solution, but maybe something like this would be necessary at sites where it is not possible (or rather not allowed) to route out of the compute nodes, even just to specific dedicated suite servers as above.
NIWA wizard Chris has suggested that some kind of remote port forwarding solution might be possible, using (or emulating what can be done with?) ssh. I'm no networking expert, but I think the gist of it was something like this: the ssh process that submits the remote task configures (how?) a local port on the task host (localhost:N?) that forwards traffic across the ssh tunnel to the right port on the suite host. The initial ssh connection would have to be kept open while the task runs. This (if feasible) would have the advantage of not requiring any action to be taken by your friendly but change-resistant HPC sysadmins (I think?).
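For the record, the idea sketched above corresponds to OpenSSH's `-R` (remote port forwarding) option. The following is a sketch only: the port number and host name are hypothetical, the suite's Pyro server port would have to be known in advance, and the snippet just constructs and prints the command rather than running it.

```shell
# Sketch of remote port forwarding (hypothetical port and host name).
SUITE_PORT=7766      # the suite's Pyro server port on the suite host
TASK_HOST=hpc-login  # where the remote task job is submitted from

# -R asks the remote sshd to listen on SUITE_PORT and tunnel any
# connection made to it back to the same port on the suite host; the
# ssh connection must be kept open while the task runs.
FWD_CMD="ssh -R ${SUITE_PORT}:localhost:${SUITE_PORT} ${TASK_HOST}"
echo "${FWD_CMD}"
```

With this in place, a task on the HPC side could (in principle) talk to `localhost:7766` and reach the suite host without any outbound routing from the compute environment beyond the original ssh connection.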
We have been using port-forwarding for connecting from the HPC compute nodes to an external machine. The problem with this is that pyro, the high-level message transfer component of cylc, uses a rather wide range of ports for communicating. And while the server port is fixed, the port used by the client is random. The tool we use for port-forwarding is http://www.accordata.de/downloads/port-proxy/index.html.
It came to my mind that in the long term we might get rid of pyro and write a server ourselves. Cylc does not really exploit the capabilities of pyro. It is convenient to use, but as the protocol is very simple we might get by with something simpler, where we have more control over the ports used. I will try to discuss that with the network people in our computing centre. I think we may even get a CS student to implement something, but first it makes sense to get a bit more understanding. We have by now cracked all the protocol levels used on the client side, as we are reimplementing those in C for various reasons not to be outlined here.
We also have this problem - communications to the outside world can only be done via the login node. It is likely that we'll go down the dedicated server route in our new environment. (In our old environment, we used to have a poor man's way of doing "port forwarding".) |
As we will do for the time being as well. Maybe in the end it is not a really big problem for operational or dedicated computing centres? But it will be for more 'lightweight' users ...
Luis (m214089) and Matt, thanks for the info. Yes, I guess this is a bigger problem for individuals trying to use cylc than for institutions in complete charge of their own computing facility. But even so, it would be nice to be able to run cylc "out of the box" on most existing systems. If either of you understand how the port forwarding solutions work could you write a quick "how to" that we could include in the User Guide?
(issue closure above was accidental!) Luis, you are right that Pyro is used very minimally by cylc, and we probably could quite easily replace it with a simple custom solution. In the early days cylc also used the Pyro Nameserver so that a dedicated server port was not required for each suite, but it was generally thought that this was a bad idea because it meant suites were not entirely independent (e.g. in principle a research suite could bring down an operational one by messing with the Pyro nameserver). Update: see also Issue #72; for the moment we plan to go with Pyro4 when it gets connection authentication (probably in the next-plus-one release). |
For tasks running on hosts with no communication back, it should be possible to configure cylc to poll hosts for updates to the task status and to prevent the cylc task commands from attempting to communicate back to the server. See #86. |
Note my original polling suggestion above (#67 comment). A local co-process (for want of a better word) could be launched alongside every remote task that requires polling, to do the polling and then run the right cylc messaging commands locally, in effect masquerading as the remote task. In this way we could avoid complicating cylc itself with polling code (it might be time consuming to poll for hundreds of tasks...). The local co-processes would be easy for cylc to monitor by the normal pyro-based method.
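A minimal sketch of such a co-process, assuming the remote task appends its latest state to a known status file. Everything here is hypothetical: the status file path, the fetch command (in reality something like `ssh task-host cat /path/to/status`), and the final cylc messaging call, which is just echoed for illustration.

```shell
#!/bin/sh
# Hypothetical local co-process: poll a remote task's status file and
# re-issue the corresponding cylc message locally on the suite host.

# How the latest status is fetched; defaults to a local file so the
# logic can be followed without a remote host (path is illustrative).
READ_STATUS="${READ_STATUS:-cat /tmp/task.status}"

last=""
poll_once() {
    # Read the latest reported state, e.g. "started" or "succeeded".
    state=$($READ_STATUS 2>/dev/null | tail -n 1)
    if [ -n "$state" ] && [ "$state" != "$last" ]; then
        last=$state
        # Masquerade as the remote task (command shown is illustrative):
        echo "cylc message $state"
    fi
}
```

In a real co-process, `poll_once` would be called in a loop with a sleep between iterations, and the `echo` replaced by the actual cylc messaging command for the task in question.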
I have been thinking a bit about polling and it turns out to be much worse than having communication enabled back to some kind of external server, or port-forwarding, because it increases the network traffic inside the HPC machine. The problem is not the amount of data transferred, but the latency of the messages. So I think for the time being an introduction to port-forwarding is the best solution. A very stable tool, which we use for database connections out of our HPC machine, can be found here: http://www.accordata.net/downloads/port-proxy/index.html It can be used on a per-user basis.
@m214089 the kind of polling I was thinking of, at least in the first instance, was just using ssh to check remote files that report task progress. Does this result in the network traffic problem you're talking about? |
Yes, already this is too much ;-)
Luis, do you think it would be wrong to give cylc a task polling capability in spite of the above problem - it would allow users to test cylc on facilities that currently have no easy means of routing back out of the HPC. We can give warnings that it is not a good long term solution, and sysadmins can complain if it does cause a problem! |
For testing it is very useful, so it does make sense. And giving the tip to the users is good too.
I wonder if it is possible to keep open a background process for a pseudo interactive ssh (bash) session to each remote host. Every now and then, the suite can send a polling command to the host via the same ssh session. It should return an output just like an interactive session. This should keep traffic to the minimum. It would be no different from a user keeping a terminal open to a remote host. |
Does opening a new connection result in significantly more network chatter than maintaining and using an already-open connection? And are interactive ssh sessions better in this respect than non-interactive ssh? |
The main advantage of a single ssh session per host is that it is less likely for the host to block the next ssh session because there are already too many sessions open. There are probably other smaller advantages, e.g. it probably generates slightly less network traffic, as it does not have to re-authenticate and re-run all the start-up stuff.
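For what it's worth, OpenSSH already supports this kind of shared, long-lived connection via connection multiplexing: subsequent ssh calls to the same host reuse an existing master connection instead of re-authenticating. A sketch of the client-side configuration (host name, socket path, and timeout are illustrative values only):

```
# ~/.ssh/config (illustrative values)
Host hpc-login
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```

With this in place, repeated polling commands like `ssh hpc-login cat /path/to/status` would all travel over the one established connection.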
Task polling is now complete and merged to master. I've copied a few remaining issues from above to #517. |
Cylc tasks need to report progress (task started, and succeeded or failed, and possibly internal outputs and other messages) back to their parent suite. In an HPC environment you would ideally not run cylc itself on the HPC nodes, but rather on some Linux server or even your own desktop (the "suite host") with the suite submitting jobs to the remote HPCF. Then the running suite does not use any valuable HPC resource, and cylc suite visualization and the GUI tools do not have to be installed or ported (if that's even possible) to the HPC environment. But, remote tasks must be able to communicate, by network socket or by passwordless ssh, back to the suite host ... and it has recently come to my attention that some (many?) HPC facilities do not allow any network routing back out of the compute nodes, for security reasons and/or to avoid extraneous "network chatter" that could have an impact on compute performance. This would seem to imply that cylc (or indeed any general scheduling tool that tracks the progress of its tasks) has to be run on the HPC host itself, which may be problematic for other reasons (e.g. no long-running jobs allowed?) in addition to the possible inconvenience of not having suite visualization and GUI tools available for cylc users.
This ticket can be used to record any ideas for getting around this problem.